profiling vector3d

shaddim
Posts: 11
Joined: Sat Sep 06, 2008 6:00 pm


Post by shaddim » Mon Sep 29, 2008 1:26 pm

Hello,
First, let me introduce myself: I have been a big TA fan for years, and I also think a worthy successor is still missing. So thank you for this well-advanced work! :)
So why not help this project? So far my game programming knowledge is limited, but in my professional work I have at times written high-performance C/assembler code. Maybe that is an area where I can contribute. ;)

So lately I have been playing around with profiling to learn what is going on inside TA3D.
zuzuf pointed me to the matrix and vector operations, which take a significant share of the computation time. Here is the profile of a single-player game:

Code: Select all

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 37.55     95.70    95.70   158452     0.00     0.00  TA3D::MAP::draw(TA3D::Camera*, unsigned char, bool, float, float, float, bool, bool, bool)
 10.43    122.28    26.58   465746     0.00     0.00  TA3D::OBJECT::draw_shadow(Vector3D, float, TA3D::SCRIPT_DATA*, bool, bool)
  6.01    137.61    15.33 2725950308     0.00     0.00  Vector3D::operator=(Vector3D const&)
  4.84    149.94    12.33 2537142537     0.00     0.00  Vector3D::Vector3D(Vector3D const&)
  3.61    159.13     9.19 773944233     0.00     0.00  Vector3D::operator+=(Vector3D const&)
  3.37    167.72     8.59   316036     0.00     0.00  TA3D::function_has_unit(lua_State*)
  3.19    175.85     8.13 738929072     0.00     0.00  operator+(Vector3D const&, Vector3D const&)
  2.70    182.74     6.89                             $memcpyEntry2
  2.36    188.75     6.01    79986     0.00     0.00  TA3D::OBJECT::hit(Vector3D, Vector3D, TA3D::SCRIPT_DATA*, Vector3D*, MATRIX_4x4)
  1.81    193.36     4.61 496807512     0.00     0.00  operator%(Vector3D const&, Vector3D const&)
  1.37    196.86     3.50 741558566     0.00     0.00  Vector3D::Vector3D()
  0.97    199.34     2.48 293888929     0.00     0.00  Vector3D::operator-=(Vector3D const&)
  0.72    201.17     1.83        1     1.83   224.68  TA3D::Battle::execute()
  0.71    202.99     1.82                             _Unwind_SjLj_Register
  0.68    204.73     1.74     4235     0.00     0.00  TA3D::UTILS::HPI::cHPIHandler::LZ77Decompress(unsigned char*, unsigned char*, TA3D::UTILS::HPI::cHPIHandler::HPICHUNK*)
  0.67    206.44     1.71                             _Unwind_SjLj_Unregister
  0.48    207.67     1.23   556260     0.00     0.00  TA3D::MAP::hit(Vector3D, Vector3D, bool, float, bool)
  0.48    208.90     1.23    79226     0.00     0.00  TA3D::PARTICLE_ENGINE::draw(TA3D::Camera*, int, int, int, int, unsigned char**)
  0.46    210.06     1.16 289536584     0.00     0.00  operator-(Vector3D const&, Vector3D const&)
  0.46    211.22     1.16     6824     0.00     0.00  TA3D::UTILS::HPI::cHPIHandler::Decompress(unsigned char*, unsigned char*, TA3D::UTILS::HPI::cHPIHandler::HPICHUNK*)
  0.37    212.16     0.94 127471133     0.00     0.00  Vector3D::operator*=(float)
  0.35    213.04     0.89    79226     0.00     0.00  TA3D::MAP::draw_mini(int, int, int, int, TA3D::Camera*, unsigned char)
  0.31    213.85     0.80 13266564     0.00     0.00  TA3D::UNIT::draw_shadow(Vector3D const&, TA3D::MAP*)
  0.31    214.63     0.78 25817060     0.00     0.00  TA3D::MAP::get_max_h(int, int)
  0.29    215.37     0.74   158452     0.00     0.00  TA3D::FXManager::draw(TA3D::Camera&, TA3D::MAP*, float, bool)
  0.28    216.07     0.71 127471133     0.00     0.00  operator*(float const&, Vector3D const&)
  0.25    216.72     0.64 12182987     0.00     0.00  TA3D::MAP::get_unit_h(float, float)
  0.24    217.34     0.62   141331     0.00     0.00  TA3D::OBJECT::draw(float, TA3D::SCRIPT_DATA*, bool, bool, bool, int, bool, bool)
  0.23    217.91     0.58  7494209     0.00     0.00  TA3D::FX::doDrawAnimWave(int)
  0.22    218.47     0.56  7510462     0.00     0.00  TA3D::FX::doCanDrawAnim(TA3D::MAP*) const
  0.20    218.99     0.51 122641848     0.00     0.00  __gnu_cxx::__normal_iterator<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> >*, std::vector<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> >, std::allocator<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> > > > >::operator+(int const&) const
  0.20    219.50     0.51 122641848     0.00     0.00  std::vector<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> >, std::allocator<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> > > >::operator[](unsigned int)
  0.20    220.01     0.51 80987831     0.00     0.00  float TA3D::Math::Max<float>(float, float)
  0.19    220.50     0.49 64882324     0.00     0.00  Vector3D::operator*=(Vector3D)
<snip>

Vector3D operations take five of the top ten entries (seven of the top twelve, together about 22% of the run time), so some optimization here would be helpful.

After some attempts to introduce other ideas without changing the interfaces, I gave up and tried some ideas with an "array" interface instead.

vector3d is defined as a single vector consisting of x, y, z; arrays of it are then formed and accessed explicitly.
I generalized vector3d to a vector3d_array, with the special case array length = 1 being the original vector3d.

At the moment, the interface for a vector3d add looks like this:

Code: Select all

//create
vector3d g[20000],d[20000];
//add with loop
for (int i=0;i<20000;i++) g[i]+=d[i];
With the array interface it would look like this:

Code: Select all

//create
g = vector3da(20000); //pointer
d = vector3da(20000); //pointer
//add
vector3da(g,d,20000);
The advantage should be the ability to use 'vectorized' operations such as the SSE instruction 'movups', which moves 4 floats at once. If we allocate the memory explicitly, 'movaps' can be used instead; it requires 16-byte aligned memory but is faster.
Here you can also see the problem: a vector3d consists of 3 floats, not 4, so SSE vectorization cannot be applied to a single vector (or only with some padding). But an array of vector3d holds more than 4 floats, so the vectorized operations can be used.
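To make the idea concrete, here is a minimal sketch of an array add written with SSE intrinsics instead of inline assembly. It is not the code from the test below: the function name add_arrays and the packed x,y,z float layout are assumptions for illustration. _mm_loadu_ps and _mm_storeu_ps correspond to 'movups'.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Adds d to g element-wise. Both point to n_vectors * 3 packed floats
// (x0,y0,z0,x1,y1,z1,...). The name add_arrays is made up for this sketch.
void add_arrays(float* g, const float* d, std::size_t n_vectors)
{
    assert(g != nullptr && d != nullptr);
    const std::size_t n = n_vectors * 3;
    std::size_t i = 0;
    // Main part: 4 floats per step with unaligned loads/stores (movups).
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(g + i);
        __m128 b = _mm_loadu_ps(d + i);
        _mm_storeu_ps(g + i, _mm_add_ps(a, b));
    }
    // Tail: the 0 to 3 floats that do not fill a whole SSE register.
    for (; i < n; ++i) g[i] += d[i];
}
```

Because the floats of neighbouring vectors sit back to back, the 3-float size of one vector only matters for the short tail loop.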

here are some tests:

Code: Select all

 /home/Sources/ta3d/src
$ mingw32-gcc -g -o3 test.cpp -otest
 /home/Sources/ta3d/src
$ test.exe
Create Vector3d(native) 2.235750e-004 s
Create Vector3d(malloc) 6.693068e-007 s
Create Vector3d(calloc) 4.019108e-005 s
Assign Vector3d to Vector3d(native)  1.132356e-007 s
Assign Vector3d to Vector3d(array)  1.319796e-008 s
Assign Vector3d to Vector3d(asm_array) 3.291458e-009 s
Assign Vector3d to Vector3d(asm_array_align) 3.119521e-009 s
SQ Vector3d (native)  1.718706e-008 s
SQ Vector3d(array)  1.474200e-008 s
SQ Vector3d(asm_array) 3.319333e-009 s
SQ Vector3d(asm_array_align) 8.830333e-009 s
SCALE Vector3d (native)  2.169829e-008 s
SCALE Vector3d(array)  1.514652e-008 s
SCALE Vector3d(asm_array) 3.683542e-009 s
SCALE Vector3d(asm_array_align) 3.391396e-009 s
Here I tried to measure the performance of the currently implemented C++ class-style vector operations against a C-array approach, especially for big arrays (throughput).
20000 Vector3d are used for these tests; the numbers show the time for one vector3d operation.
Each test is run twice: the first run is a warm-up (it fills the caches so that the different approaches start from the same situation), the second is the one reported here.
"native" is the C++ class code, "array" is a plain C array, "asm_array" is SSE vector assembly, and "asm_array_align" is the SSE vector assembly with 16-byte aligned memory accesses.
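The warm-up-then-measure scheme can be sketched as a small helper. This is only an illustration: the test code below reads the CPU cycle counter with RDTSC, while this sketch swaps in std::chrono::steady_clock, and time_per_element is a hypothetical name.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs op() once untimed to warm the caches, then times a second run and
// returns seconds per element. time_per_element is a hypothetical helper.
template <typename Op>
double time_per_element(Op op, std::size_t n_elements)
{
    assert(n_elements > 0);
    op();  // warm-up run, not measured
    const auto start = std::chrono::steady_clock::now();
    op();  // measured run
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(n_elements);
}
```

A single measured run is noisy; repeating the measured run and taking the minimum would probably give more stable numbers.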



Here is the test code (used under Windows):

Code: Select all

/*  TA3D, a remake of Total Annihilation
    Copyright (C) 2005  Roland BROCHARD

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA*/

/*
 **  File: main.cpp
 ** Notes: The applications main entry point. 

 */
//#include "stdafx.h"					// standard pch inheritance.
#include <math.h>
#include <windows.h>
//#include <string.h>
#include "stdio.h"	
#include "vector.h"	


void assign_asm(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
        //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movups xmm0,[ebx+ecx]\n\t" 
		"movups [eax+ecx],xmm0\n\t"
		"jnz loop\n\t"               // SSE moves leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); 
		
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"
		"add ebx,ecx\n\t"
		
		"rest:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[ebx+edx]\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0"); }
        return;
 }
 
 void assign_asm_align(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
         //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop2:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movaps xmm0,[ebx+ecx]\n\t" 
		"movaps [eax+ecx],xmm0\n\t"
		"jnz loop2\n\t"              // SSE moves leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0"); 
        
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"
		"add ebx,ecx\n\t"
		
		"rest2:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[ebx+edx]\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest2\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); }
        return;
 }
 
 void sq_asm_align(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
         //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop3:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movaps xmm0,[ebx+ecx]\n\t" 
		"mulps xmm0,xmm0\n\t"
		"movaps [eax+ecx],xmm0\n\t"
		"jnz loop3\n\t"              // SSE instructions leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0"); 
        
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"
		"add ebx,ecx\n\t"
		
		"rest3:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[ebx+edx]\n\t"
		"mulss xmm0,xmm0\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest3\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); }
        return;
 }
 
 void sq_asm(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
         //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop4:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movups xmm0,[ebx+ecx]\n\t" 
		"mulps xmm0,xmm0\n\t"
		"movups [eax+ecx],xmm0\n\t"
		"jnz loop4\n\t"              // SSE instructions leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0"); 
        
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"
		"add ebx,ecx\n\t"
		
		"rest4:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[ebx+edx]\n\t"
		"mulss xmm0,xmm0\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest4\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); }
        return;
 }
 
 void mul_asm_align(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
         //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop5:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movaps xmm0,[ebx+ecx]\n\t" 
		"movaps xmm1,[eax+ecx]\n\t" 
		"mulps xmm0,xmm1\n\t"
		"movaps [eax+ecx],xmm0\n\t"
		"jnz loop5\n\t"              // SSE instructions leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); 
        
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"
		"add ebx,ecx\n\t"
		
		"rest5:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[ebx+edx]\n\t"
		"movss xmm1,[eax+edx]\n\t"
		"mulss xmm0,xmm1\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest5\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); }
        return;
 }
 
 void mul_asm(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
         //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop6:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movups xmm0,[ebx+ecx]\n\t" 
		"movups xmm1,[eax+ecx]\n\t" 
		"mulps xmm0,xmm1\n\t"
		"movups [eax+ecx],xmm0\n\t"
		"jnz loop6\n\t"              // SSE instructions leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); 
        
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"
		"add ebx,ecx\n\t"
		
		"rest6:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[ebx+edx]\n\t"
		"movss xmm1,[eax+edx]\n\t"
		"mulss xmm0,xmm1\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest6\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); }
        return;
 }
 
 void scale_asm_align(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
         //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop7:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movaps xmm0,[eax+ecx]\n\t" 
		"movss xmm1,[ebx]\n\t"
		"shufps xmm1,xmm1,0\n\t" 
		"mulps xmm0,xmm1\n\t"
		"movaps [eax+ecx],xmm0\n\t"
		"jnz loop7\n\t"              // SSE instructions leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); 
        
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"            // ebx stays put: it points at the single scale factor
		
		"rest7:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[eax+edx]\n\t" 
		"movss xmm1,[ebx]\n\t"
		"mulss xmm0,xmm1\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest7\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); }
        return;
 }
 
 void scale_asm(float* out, float* in, unsigned int leng)
{	 unsigned int count, rest;
        rest= (leng*3*4)%16;
        //printf("%i\n",rest);
        count = (leng*3*4)-rest;
         //printf("%i\n",count);
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"loop8:\n\t"
		"sub ecx,16\n\t"             // step first so the offsets run count-16 ... 0
		"movups xmm0,[eax+ecx]\n\t" 
		"movss xmm1,[ebx]\n\t"
		"shufps xmm1,xmm1,0\n\t" 
		"mulps xmm0,xmm1\n\t"
		"movups [eax+ecx],xmm0\n\t"
		"jnz loop8\n\t"              // SSE instructions leave EFLAGS alone, so this tests the sub
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); 
        
		if (rest!=0)
		{
 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"add eax,ecx\n\t"            // ebx stays put: it points at the single scale factor
		
		"rest8:\n\t"
		"sub edx,4\n\t"              // rest is a byte count; one float is 4 bytes
		"movss xmm0,[eax+edx]\n\t" 
		"movss xmm1,[ebx]\n\t"
		"mulss xmm0,xmm1\n\t"
		"movss [eax+edx],xmm0\n\t"
		"jnz rest8\n\t"
		".att_syntax prefix\n\t"
        : : "a" (out), "b" (in), "c"(count), "d"(rest): "xmm0","xmm1"); }
        return;
 }


//////////////////////////////////////////////////////////
 double tic()
    { double time=0;
	unsigned int high,low;
  //  void* ebx= &x; 
  //  const void* eax= &rhs.x;
  /*__asm __volatile__ ("pushl %eax\n\t"				\
			    "pushl %ecx\n\t"				\
		 	    "pushl %edx"); */

 __asm __volatile__  (".intel_syntax noprefix\n\t"
		"RDTSC\n\t"
		".att_syntax prefix\n\t"
             //"movl %%eax,%%eax\n\t"
		:   "=d" (high), "=a" (low): :); 
//printf("%u\n",low);
//printf("%u\n",high);
time = high;
time = (time*4294967296.0) + low;   // high word counts units of 2^32 cycles
return time; 
}

////////////////////////////////////////////////////////////////////////////////////////////////////////////
// __TA3D_XX_MISC_VECTOR_H__

/*
 ** Function: main
 **    Notes: Whats this for anyhow? :)  Just kidding, this is where it all begin baby!
 */
int main(int argc, char *argv[])
{
    double timings[10];
    void *vptr;
    float *ptr,*ptr2,*ptr3, *ptr3_save,*ptr4, *ptr4_save;
	
	#define NumberOfVectors 20000
	#define CPU_MHZ 2400000000.0   // despite the name, this is the clock rate in Hz (2.4 GHz)
      
    //create Vector3d Array
    timings[1] = tic();
    for (int i=1;i<NumberOfVectors;i++) {Vector3D gg[NumberOfVectors];}
    timings[2] = tic();
    printf("Create Vector3d(native) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
    
    //malloc
    timings[1] = tic();
    for (int i=1;i<NumberOfVectors;i++) {ptr= (float*) malloc(sizeof(float)*NumberOfVectors*3); free(ptr);}
    timings[2] = tic();
    printf("Create Vector3d(malloc) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 

    //memory zero initalized
    timings[1] = tic();
    for (int i=1;i<NumberOfVectors;i++) {ptr2=(float*) calloc(3,sizeof(float)*NumberOfVectors); free(ptr2);}
    timings[2] = tic();
    printf("Create Vector3d(calloc) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
    
    
    //for assign test
    Vector3D d[NumberOfVectors];  
    Vector3D g[NumberOfVectors];
    ptr= (float*) malloc(sizeof(float)*NumberOfVectors*3);
    ptr2 = (float*) calloc(3,sizeof(float)*NumberOfVectors); 
    
    //memory zero initialized (ALIGNED 16)
    timings[1] = tic();
    vptr = calloc(3,15+sizeof(float)*NumberOfVectors); 
    if ((unsigned int) vptr%16!=0)
     {ptr3_save = (float*) vptr;   // keep the original pointer for free()
     ptr3 = (float*) ((unsigned int) vptr+(16-(unsigned int) vptr%16));
     printf("%u\n",(unsigned int) ptr3_save%16);
     }
    else {ptr3_save = (float*) vptr; ptr3 = (float*) vptr;}
    //printf("%p\n",(unsigned int) ptr3_save%16);
    //printf("%p\n",(unsigned int) ptr3%16);
    //printf("%p\n",ptr3_save);
    //printf("%p\n",ptr3);
    timings[2] = tic();
    
    //memory zero initialized (ALIGNED 16)
    timings[1] = tic();
    vptr = calloc(3,15+sizeof(float)*NumberOfVectors); 
    if ((unsigned int) vptr%16!=0)
     {ptr4_save = (float*) vptr;   // keep the original pointer for free()
     ptr4 = (float*) ((unsigned int) vptr+(16-(unsigned int) vptr%16));
     printf("%u\n",(unsigned int) ptr4_save%16);
     }
    else {ptr4_save = (float*) vptr; ptr4 = (float*) vptr;}
   
    printf("%u\n",(unsigned int) ptr4%16);
    //printf("%p\n",ptr4_save);
    //printf("%p\n",ptr4);
    timings[2] = tic();

    //////////test  assign

       for (int i=1;i<NumberOfVectors;i++) g[i]=d[i]; //COLD

	timings[1] = tic();
    for (int i=1;i<NumberOfVectors;i++) g[i]=d[i];
	timings[2] = tic();
	printf("Assign Vector3d to Vector3d(native)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 

	timings[1] = tic();
    for (int i=0;i<NumberOfVectors*3;i=i+3) 
    {
    ptr[i]=ptr2[i];  ptr[i+1]=ptr2[i+1];  ptr[i+2]=ptr2[i+2];
    }
    timings[2] = tic();
	//printf("Assign Vector3d to Vector3d(array)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	  
	timings[1] = tic();
    for (int i=0;i<NumberOfVectors*3;i=i+3) 
    {
    ptr[i]=ptr2[i];  ptr[i+1]=ptr2[i+1];  ptr[i+2]=ptr2[i+2];
    }
    timings[2] = tic();
	printf("Assign Vector3d to Vector3d(array)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    assign_asm(ptr,ptr2,NumberOfVectors);
    timings[2] = tic();
	//printf("Assign Vector3d to Vector3d(asm_array) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	 
	timings[1] = tic();
    assign_asm(ptr,ptr2,NumberOfVectors);
    timings[2] = tic();
	printf("Assign Vector3d to Vector3d(asm_array) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
	
	timings[1] = tic();
    assign_asm_align(ptr4,ptr3,NumberOfVectors);
    timings[2] = tic();
	//printf("Assign Vector3d to Vector3d(asm_array_align) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    assign_asm_align(ptr4,ptr3,NumberOfVectors);
    timings[2] = tic();
	printf("Assign Vector3d to Vector3d(asm_array_align) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
		 
	//////////test  SQ

    for (int i=1;i<NumberOfVectors;i++) d[i].sq(); //COLD
	timings[1] = tic();
    for (int i=1;i<NumberOfVectors;i++) d[i].sq();
	timings[2] = tic();
	printf("SQ Vector3d (native)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 

	timings[1] = tic();
    for (int i=0;i<NumberOfVectors*3;i=i+3) 
    {
    ptr2[i]=ptr[i]*ptr[i]; ptr2[i+1]=ptr[i+1]*ptr[i+1];  ptr2[i+2]=ptr[i+2]*ptr[i+2];
    }
    timings[2] = tic();
	//printf("SQ Vector3d(array)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    for (int i=0;i<NumberOfVectors*3;i=i+3) 
    {
    ptr2[i]=ptr[i]*ptr[i]; ptr2[i+1]=ptr[i+1]*ptr[i+1];  ptr2[i+2]=ptr[i+2]*ptr[i+2];
    }
    timings[2] = tic();
	printf("SQ Vector3d(array)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	  
	timings[1] = tic();
    sq_asm(ptr,ptr2,NumberOfVectors);
    timings[2] = tic();
	//printf("SQ Vector3d(asm_array) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    sq_asm(ptr,ptr2,NumberOfVectors);
    timings[2] = tic();
	printf("SQ Vector3d(asm_array) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    sq_asm_align(ptr4,ptr3,NumberOfVectors);
    timings[2] = tic();
	//printf("SQ Vector3d(asm_array_align) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();	
    sq_asm_align(ptr4,ptr3,NumberOfVectors);
    timings[2] = tic();
	printf("SQ Vector3d(asm_array_align) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
		
	//////test SCALE
	float scale_f = 5.5;  

	for (int i=1;i<NumberOfVectors;i++) {g[i]*=scale_f;} //COLD
	timings[1] = tic();
    for (int i=1;i<NumberOfVectors;i++) {g[i]*=scale_f;}
	timings[2] = tic();
	printf("SCALE Vector3d (native)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 

	timings[1] = tic();
    for (int i=0;i<NumberOfVectors*3;i=i+3) 
    {
    ptr2[i]=ptr[i]*scale_f; ptr2[i+1]=ptr[i+1]*scale_f;  ptr2[i+2]=ptr[i+2]*scale_f;
    }
    timings[2] = tic();
	//printf("SCALE Vector3d(array)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    for (int i=0;i<NumberOfVectors*3;i=i+3) 
    {
    ptr2[i]=ptr[i]*scale_f; ptr2[i+1]=ptr[i+1]*scale_f;  ptr2[i+2]=ptr[i+2]*scale_f;
    }
    timings[2] = tic();
	printf("SCALE Vector3d(array)  %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	  
	timings[1] = tic();
    scale_asm(ptr,&scale_f,NumberOfVectors);
    timings[2] = tic();
	//printf("SCALE Vector3d(asm_array) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    scale_asm(ptr,&scale_f,NumberOfVectors);
    timings[2] = tic();
	printf("SCALE Vector3d(asm_array) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();
    scale_asm_align(ptr4,&scale_f,NumberOfVectors);
    timings[2] = tic();
	//printf("SCALE Vector3d(asm_array_align) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
	timings[1] = tic();	
    scale_asm_align(ptr4,&scale_f,NumberOfVectors);
    timings[2] = tic();
	printf("SCALE Vector3d(asm_array_align) %e s\n",  ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors); 
	
    return 0; 		// thats it folks.
}

So some speedup is possible, but it would require changing the interfaces... not so nice.
The disadvantage of the asm approach is reduced portability across hardware and software platforms, but a C fallback could easily be provided as well.
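A fallback of that kind could look roughly like the sketch below: the SSE path is compiled in only when the compiler reports SSE support, and a plain loop does the work otherwise. It uses intrinsics rather than the inline assembly from the test code, and scale_array and SKETCH_HAVE_SSE are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define SKETCH_HAVE_SSE 1
#else
#  define SKETCH_HAVE_SSE 0
#endif

// Multiplies n floats in place by factor. scale_array and SKETCH_HAVE_SSE are
// names invented for this sketch.
void scale_array(float* data, std::size_t n, float factor)
{
    assert(data != nullptr || n == 0);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    const __m128 f = _mm_set1_ps(factor);  // factor copied into all 4 lanes
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), f));
#endif
    // Portable path: handles everything without SSE, and the tail with it.
    for (; i < n; ++i) data[i] *= factor;
}
```

The callers see one function either way, so the platform-specific part stays in a single place.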

So... what do you think? :)

greetings
Last edited by shaddim on Tue Sep 30, 2008 5:06 pm, edited 1 time in total.

milipili
Posts: 545
Joined: Thu Nov 02, 2006 8:52 am
Location: Paris (France)
Contact:

Post by milipili » Mon Sep 29, 2008 2:27 pm

Woa your performance is impressive :D

However, I don't really think the problem should be solved like this (for now). A lot of code must be optimized before playing with asm code, for example by reducing the use of Vector3D in complex computations. The whole code structure must be cleaned up before that, and we are not there yet.

The main loop is being restructured and we should have a better visibility of what to improve in the next release.

For now, I don't think we should mix asm into our code. Clean, well-used C++ code will be fast enough. After that we can look at asm and all the problems it will bring.

(but keep your code somewhere ^^)
Damien Gerard
Ta3d & Yuni Developer

shaddim
Posts: 11
Joined: Sat Sep 06, 2008 6:00 pm

Post by shaddim » Tue Sep 30, 2008 3:15 pm

milipili wrote:Woa your performance is impressive :D

However, I don't really think the problem should be solved like this (for now). A lot of code must be optimized before playing with asm code, for example by reducing the use of Vector3D in complex computations. The whole code structure must be cleaned up before that, and we are not there yet.

The main loop is being restructured and we should have a better visibility of what to improve in the next release.

For now, I don't think we should mix asm into our code. Clean, well-used C++ code will be fast enough. After that we can look at asm and all the problems it will bring.

(but keep your code somewhere ^^)
Hello milipili, thanks for your fitting comments, which I must agree with... :) Assembler parts might be risky at this stage (but you have not even seen what is possible on the more complex parts: square roots, angle calculations...).
But "fast enough" is a term I can't accept (sorry!) ;c)
Faster means a broader audience of players with heterogeneous (older) hardware. Faster also means more units, more complex physics, better path-finding, maybe bigger maps... overall a better and more realistic experience. One of the things I loved about TA is its technical perfection (something that can otherwise only be said about some id Software titles): enormous maps with a memory footprint that is tiny by today's standards, an enormous number of units, realistic physics, all with the limited resources of that time (the nineties!). A real masterpiece of code optimization... :)
So from my point of view, it was performance optimization that let us enjoy TA not as another "C&C clone" but as the great game we all love... ;)

... Sorry for this speech, back to the topic... *hmmm* One thought that has been going around in my mind concerns the general data structure: is defining a single "vector3d" and then building arrays of it in the form

Code: Select all

vector3d g[20000]
an optimal approach for the use cases of TA3D?
If single vector3d are used most of the time, the chances for optimization might be limited... but if the array form is the most used one...
For example, for an array size of 20000, the factor of about 5.6 in time (2.24e-4 s versus 4.02e-5 s in the tests above) between a calloc of that size and creation in the style shown comes from the compiler being unable, with that definition, to optimize the way big chunks of data would allow or need.
I'm not completely sure, but I think code like this

Code: Select all

 g[20000]+=d[20000];
is translated by the compiler into 20000 single function calls (or, even when inlined, 20000 single computations), which wastes the chances for 'burst' and 'vectorized' operations and adds overhead on top.
Therefore a definition of vector3d as an array would help the compiler significantly, and would also make it possible to implement algorithms more efficiently in lower-level C or even asm... So maybe someone with better C++ knowledge than me can help here: what is the C++ point of view on this topic?
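As a starting point for that discussion, one possible shape of such an array-style definition is sketched here. Vec3Array is a hypothetical class, not something that exists in TA3D; the assumption is that all components live in one contiguous float buffer so that a whole-array operation becomes a single call with one tight loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical array-style container: all components in one contiguous buffer
// (x0,y0,z0,x1,...). Vec3Array is not an existing TA3D class.
struct Vec3Array {
    std::vector<float> data;
    explicit Vec3Array(std::size_t count) : data(count * 3, 0.0f) {}
    std::size_t size() const { return data.size() / 3; }
    float* at(std::size_t i) { return &data[i * 3]; }

    // One call, one tight loop over plain floats.
    Vec3Array& operator+=(const Vec3Array& rhs)
    {
        assert(data.size() == rhs.data.size());
        float* a = data.data();
        const float* b = rhs.data.data();
        const std::size_t n = data.size();
        for (std::size_t i = 0; i < n; ++i) a[i] += b[i];
        return *this;
    }
};
```

With this layout "g += d;" covers all vectors at once, and the inner loop is simple enough that an optimizing compiler may vectorize it by itself; whether it does depends on the compiler and its flags.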
Greetings, and thank you for your attention...

zuzuf
Administrateur - Site Admin
Posts: 3281
Joined: Mon Oct 30, 2006 8:49 pm
Location: Toulouse, France
Contact:

Post by zuzuf » Tue Sep 30, 2008 4:29 pm

That's impressive :)

I can already see some parts of the code that could be optimized using that kind of parallelism. Shadows, for example, need that kind of operation, and it's badly implemented at the moment. Particle physics could be greatly improved too.

But we can also optimize code that does heavy calculations by optimizing the math formulas and the way we write them. Lots of calculations make useless copies of vectors and other objects; it should be possible to remove those.
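A tiny illustration of the kind of copy meant here, as a sketch only: Vec3 is a stand-in for Vector3D with a copy counter added, and both function names are made up. Several functions in the profile above take Vector3D by value, which is where copy-constructor calls come from.

```cpp
#include <cassert>

// Small stand-in for Vector3D that counts how often it is copied.
// Vec3, length_sq_by_value and length_sq_by_ref are illustrative names.
struct Vec3 {
    float x, y, z;
    static int copies;
    Vec3(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
    Vec3(const Vec3& o) : x(o.x), y(o.y), z(o.z) { ++copies; }
};
int Vec3::copies = 0;

// By value: every call runs the copy constructor.
float length_sq_by_value(Vec3 v) { return v.x * v.x + v.y * v.y + v.z * v.z; }

// By const reference: no copy at all.
float length_sq_by_ref(const Vec3& v) { return v.x * v.x + v.y * v.y + v.z * v.z; }
```

Calling the by-value version with an existing vector bumps the counter once per call, while the by-reference version leaves it unchanged.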

I'll look deeper into this as soon as I have time (not before Friday :( )
=>;-D Penguin Powered

milipili
Posts: 545
Joined: Thu Nov 02, 2006 8:52 am
Location: Paris (France)
Contact:

Post by milipili » Tue Sep 30, 2008 7:26 pm

"fast enough" is term i can't accept (sorry!) ;c)
Believe me, there is a lot of work to do before that :) Development must be done in the right order.
Damien Gerard
Ta3d & Yuni Developer

zuzuf
Administrateur - Site Admin
Posts: 3281
Joined: Mon Oct 30, 2006 8:49 pm
Location: Toulouse, France
Contact:

Post by zuzuf » Wed Oct 01, 2008 11:11 am

Of course, but this doesn't prevent someone from working on this specific part while we're working on the rest :) (and we can still use generic code if you are afraid of bugs :D)
=>;-D Penguin Powered

milipili
Posts: 545
Joined: Thu Nov 02, 2006 8:52 am
Location: Paris (France)
Contact:

Post by milipili » Wed Oct 01, 2008 3:56 pm

As you wish :)
Damien Gerard
Ta3d & Yuni Developer
