First, let me introduce myself: I've been a big TA fan for years, and I also think a worthy successor is still missing. So thank you for this work, which is already well advanced!
So why not help this project? So far my game programming knowledge is limited, but in my professional work I have at times written high-performance C/assembler code. Maybe that's an area where I can contribute.
So lately I have been playing around with profiling and learning what is going on inside TA3D.
zuzuf pointed me to the matrix and vector operations, which take a significant share of the calculation time. Here is the profile of a single-player game:
Code:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
37.55 95.70 95.70 158452 0.00 0.00 TA3D::MAP::draw(TA3D::Camera*, unsigned char, bool, float, float, float, bool, bool, bool)
10.43 122.28 26.58 465746 0.00 0.00 TA3D::OBJECT::draw_shadow(Vector3D, float, TA3D::SCRIPT_DATA*, bool, bool)
6.01 137.61 15.33 2725950308 0.00 0.00 Vector3D::operator=(Vector3D const&)
4.84 149.94 12.33 2537142537 0.00 0.00 Vector3D::Vector3D(Vector3D const&)
3.61 159.13 9.19 773944233 0.00 0.00 Vector3D::operator+=(Vector3D const&)
3.37 167.72 8.59 316036 0.00 0.00 TA3D::function_has_unit(lua_State*)
3.19 175.85 8.13 738929072 0.00 0.00 operator+(Vector3D const&, Vector3D const&)
2.70 182.74 6.89 $memcpyEntry2
2.36 188.75 6.01 79986 0.00 0.00 TA3D::OBJECT::hit(Vector3D, Vector3D, TA3D::SCRIPT_DATA*, Vector3D*, MATRIX_4x4)
1.81 193.36 4.61 496807512 0.00 0.00 operator%(Vector3D const&, Vector3D const&)
1.37 196.86 3.50 741558566 0.00 0.00 Vector3D::Vector3D()
0.97 199.34 2.48 293888929 0.00 0.00 Vector3D::operator-=(Vector3D const&)
0.72 201.17 1.83 1 1.83 224.68 TA3D::Battle::execute()
0.71 202.99 1.82 _Unwind_SjLj_Register
0.68 204.73 1.74 4235 0.00 0.00 TA3D::UTILS::HPI::cHPIHandler::LZ77Decompress(unsigned char*, unsigned char*, TA3D::UTILS::HPI::cHPIHandler::HPICHUNK*)
0.67 206.44 1.71 _Unwind_SjLj_Unregister
0.48 207.67 1.23 556260 0.00 0.00 TA3D::MAP::hit(Vector3D, Vector3D, bool, float, bool)
0.48 208.90 1.23 79226 0.00 0.00 TA3D::PARTICLE_ENGINE::draw(TA3D::Camera*, int, int, int, int, unsigned char**)
0.46 210.06 1.16 289536584 0.00 0.00 operator-(Vector3D const&, Vector3D const&)
0.46 211.22 1.16 6824 0.00 0.00 TA3D::UTILS::HPI::cHPIHandler::Decompress(unsigned char*, unsigned char*, TA3D::UTILS::HPI::cHPIHandler::HPICHUNK*)
0.37 212.16 0.94 127471133 0.00 0.00 Vector3D::operator*=(float)
0.35 213.04 0.89 79226 0.00 0.00 TA3D::MAP::draw_mini(int, int, int, int, TA3D::Camera*, unsigned char)
0.31 213.85 0.80 13266564 0.00 0.00 TA3D::UNIT::draw_shadow(Vector3D const&, TA3D::MAP*)
0.31 214.63 0.78 25817060 0.00 0.00 TA3D::MAP::get_max_h(int, int)
0.29 215.37 0.74 158452 0.00 0.00 TA3D::FXManager::draw(TA3D::Camera&, TA3D::MAP*, float, bool)
0.28 216.07 0.71 127471133 0.00 0.00 operator*(float const&, Vector3D const&)
0.25 216.72 0.64 12182987 0.00 0.00 TA3D::MAP::get_unit_h(float, float)
0.24 217.34 0.62 141331 0.00 0.00 TA3D::OBJECT::draw(float, TA3D::SCRIPT_DATA*, bool, bool, bool, int, bool, bool)
0.23 217.91 0.58 7494209 0.00 0.00 TA3D::FX::doDrawAnimWave(int)
0.22 218.47 0.56 7510462 0.00 0.00 TA3D::FX::doCanDrawAnim(TA3D::MAP*) const
0.20 218.99 0.51 122641848 0.00 0.00 __gnu_cxx::__normal_iterator<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> >*, std::vector<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> >, std::allocator<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> > > > >::operator+(int const&) const
0.20 219.50 0.51 122641848 0.00 0.00 std::vector<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> >, std::allocator<std::list<TA3D::QUAD_QUEUE*, std::allocator<TA3D::QUAD_QUEUE*> > > >::operator[](unsigned int)
0.20 220.01 0.51 80987831 0.00 0.00 float TA3D::Math::Max<float>(float, float)
0.19 220.50 0.49 64882324 0.00 0.00 Vector3D::operator*=(Vector3D)
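Vector3D operations take five of the top ten entries (seven of the top twelve), and summed over the listing above they account for about 23% of the run time, so some optimization here would be helpful.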
After several attempts at other ideas that left the interfaces unchanged, I gave up and tried an "array" form of the interface instead.
Currently Vector3D is defined as a single vector consisting of x, y, z, and arrays of it are formed and accessed explicitly.
I generalized Vector3D to a vector3d_array, with the special case of array length 1 being the original Vector3D.
For a Vector3D add, the interfaces currently look like this (first the existing loop, then the array form):
Code:
//create
Vector3D g[20000],d[20000];
//add with loop
for (int i=0;i<20000;i++) g[i]+=d[i];
Code:
//create
g = vector3da(20000); //pointer
d = vector3da(20000); //pointer
//add
vector3da(g,d,20000);
Here you can also see the problem: a Vector3D consists of 3 floats, not 4, so the SSE vectorization cannot be used on a single vector (or only with some padding). But an array of Vector3Ds is one contiguous run of many more than 4 floats, so the vectorized operations can be used on it.
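To make the layout point concrete, here is a rough sketch in plain C++ (not TA3D code). It assumes the vectors are stored as one flat float array x0,y0,z0,x1,y1,z1,... and walks it in blocks of four floats with a scalar tail; the name `add_flat` is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dst += src, where both hold vectors as x0,y0,z0,x1,y1,z1,...
// Returns how many full 4-float blocks were processed.
std::size_t add_flat(std::vector<float>& dst, const std::vector<float>& src)
{
    assert(dst.size() == src.size());
    const std::size_t total  = dst.size();   // 3 floats per vector
    const std::size_t blocks = total / 4;    // groups that fill one SSE register
    for (std::size_t b = 0; b < blocks; ++b) {
        float* o = &dst[4 * b];
        const float* s = &src[4 * b];
        // one 4-wide step; this is the part a single addps could replace
        o[0] += s[0]; o[1] += s[1]; o[2] += s[2]; o[3] += s[3];
    }
    for (std::size_t i = blocks * 4; i < total; ++i)   // 0 to 3 leftover floats
        dst[i] += src[i];
    return blocks;
}
```

With 5 vectors (15 floats) this gives three full blocks and three leftover floats. The block boundaries do not line up with vector boundaries, which is fine for element-wise operations like add or scale, but would not work for per-vector operations such as a dot product.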
Here are some tests:
Code:
/home/Sources/ta3d/src
$ mingw32-gcc -g -o3 test.cpp -otest
/home/Sources/ta3d/src
$ test.exe
Create Vector3d(native) 2.235750e-004 s
Create Vector3d(malloc) 6.693068e-007 s
Create Vector3d(calloc) 4.019108e-005 s
Assign Vector3d to Vector3d(native) 1.132356e-007 s
Assign Vector3d to Vector3d(array) 1.319796e-008 s
Assign Vector3d to Vector3d(asm_array) 3.291458e-009 s
Assign Vector3d to Vector3d(asm_array_align) 3.119521e-009 s
SQ Vector3d (native) 1.718706e-008 s
SQ Vector3d(array) 1.474200e-008 s
SQ Vector3d(asm_array) 3.319333e-009 s
SQ Vector3d(asm_array_align) 8.830333e-009 s
SCALE Vector3d (native) 2.169829e-008 s
SCALE Vector3d(array) 1.514652e-008 s
SCALE Vector3d(asm_array) 3.683542e-009 s
SCALE Vector3d(asm_array_align) 3.391396e-009 s
20000 Vector3Ds are used for these tests; the numbers show the time for one Vector3D operation.
Each test is run twice: the first run is a warm-up (it fills the caches, so that the different approaches start from the same situation), and the second run is the one reported here.
native is the C++ code, array is a C array, asm_array is SSE vector assembly, and asm_array_align is the SSE vector assembly with 16-byte-aligned memory accesses.
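The warm-up-then-measure scheme can be sketched as a small helper. This is only an illustration: it swaps in `std::chrono::steady_clock` for the RDTSC counter used in the test program, and the name `seconds_per_item` is made up.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `op` once untimed (warm-up, fills the caches), then once timed,
// and returns the measured seconds divided by `n_items`.
template <typename Op>
double seconds_per_item(Op op, std::size_t n_items)
{
    assert(n_items > 0);
    op();                                              // warm-up run, not measured
    const auto t0 = std::chrono::steady_clock::now();
    op();                                              // measured run
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double> dt = t1 - t0;  // elapsed time in seconds
    return dt.count() / static_cast<double>(n_items);
}
```

It would be called with a lambda wrapping one of the loops and with 20000 as `n_items`. Note that `op` runs twice, so in-place operations such as a scale are applied twice to the data. A single measured run is also still noisy, so repeating it and taking the minimum would be more robust.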
Here is the test code (used under Windows):
Code:
/* TA3D, a remake of Total Annihilation
Copyright (C) 2005 Roland BROCHARD
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA*/
/*
** File: main.cpp
** Notes: The applications main entry point.
*/
//#include "stdafx.h" // standard pch inheritance.
#include <math.h>
#include <stdlib.h> // malloc, calloc, free
#include <windows.h>
//#include <string.h>
#include <stdio.h>
#include "vector.h"
void assign_asm(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
__asm __volatile__ (".intel_syntax noprefix\n\t"
"loop:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movups xmm0,[ebx+ecx]\n\t"
"movups [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]=in[i];
return;
}
void assign_asm_align(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
// movaps requires out and in to be 16-byte aligned
__asm __volatile__ (".intel_syntax noprefix\n\t"
"loop2:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movaps xmm0,[ebx+ecx]\n\t"
"movaps [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop2\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]=in[i];
return;
}
void sq_asm_align(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
// movaps requires out and in to be 16-byte aligned
__asm __volatile__ (".intel_syntax noprefix\n\t"
"loop3:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movaps xmm0,[ebx+ecx]\n\t"
"mulps xmm0,xmm0\n\t"
"movaps [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop3\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]=in[i]*in[i];
return;
}
void sq_asm(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
__asm __volatile__ (".intel_syntax noprefix\n\t"
"loop4:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movups xmm0,[ebx+ecx]\n\t"
"mulps xmm0,xmm0\n\t"
"movups [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop4\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]=in[i]*in[i];
return;
}
void mul_asm_align(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
// movaps requires out and in to be 16-byte aligned
__asm __volatile__ (".intel_syntax noprefix\n\t"
"loop5:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movaps xmm0,[ebx+ecx]\n\t"
"movaps xmm1,[eax+ecx]\n\t"
"mulps xmm0,xmm1\n\t"
"movaps [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop5\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","xmm1","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]*=in[i];
return;
}
void mul_asm(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
__asm __volatile__ (".intel_syntax noprefix\n\t"
"loop6:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movups xmm0,[ebx+ecx]\n\t"
"movups xmm1,[eax+ecx]\n\t"
"mulps xmm0,xmm1\n\t"
"movups [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop6\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","xmm1","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]*=in[i];
return;
}
void scale_asm_align(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
// here "in" points to a single float, the scale factor
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
// movaps requires out to be 16-byte aligned
__asm __volatile__ (".intel_syntax noprefix\n\t"
"movss xmm1,[ebx]\n\t"
"shufps xmm1,xmm1,0\n\t" // broadcast the scale factor to all four lanes, once
"loop7:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movaps xmm0,[eax+ecx]\n\t"
"mulps xmm0,xmm1\n\t"
"movaps [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop7\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","xmm1","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]*=*in;
return;
}
void scale_asm(float* out, float* in, unsigned int leng)
{ unsigned int count, rest, n;
// here "in" points to a single float, the scale factor
rest= (leng*3*4)%16; // tail bytes that do not fill a whole 16-byte block
count = (leng*3*4)-rest; // bytes handled by the SSE loop
n = count; // the loop counts ecx down to 0, so it is passed as a read-write operand
if (count!=0)
{
__asm __volatile__ (".intel_syntax noprefix\n\t"
"movss xmm1,[ebx]\n\t"
"shufps xmm1,xmm1,0\n\t" // broadcast the scale factor to all four lanes, once
"loop8:\n\t"
"sub ecx,16\n\t" // step down first, so byte offsets count-16 ... 0 are processed
"movups xmm0,[eax+ecx]\n\t"
"mulps xmm0,xmm1\n\t"
"movups [eax+ecx],xmm0\n\t"
"test ecx,ecx\n\t"
"jnz loop8\n\t"
".att_syntax prefix\n\t"
: "+c"(n) : "a" (out), "b" (in) : "xmm0","xmm1","memory","cc"); }
// tail in plain C: the last rest/4 floats
for (unsigned int i=count/4; i<leng*3; i++) out[i]*=*in;
return;
}
//////////////////////////////////////////////////////////
double tic()
{ double time=0;
unsigned int high,low;
// void* ebx= &x;
// const void* eax= &rhs.x;
/*__asm __volatile__ ("pushl %eax\n\t" \
"pushl %ecx\n\t" \
"pushl %edx"); */
__asm __volatile__ (".intel_syntax noprefix\n\t"
"RDTSC\n\t"
".att_syntax prefix\n\t"
//"movl %%eax,%%eax\n\t"
: "=d" (high), "=a" (low): :);
//printf("%u\n",low);
//printf("%u\n",high);
time = high;
time = (time*4294967296.0) + low; // the high word counts units of 2^32 ticks
return time;
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////
// __TA3D_XX_MISC_VECTOR_H__
/*
** Function: main
** Notes: Whats this for anyhow? :) Just kidding, this is where it all begin baby!
*/
int main(int argc, char *argv[])
{
double timings[10];
void *vptr;
float *ptr,*ptr2,*ptr3, *ptr3_save,*ptr4, *ptr4_save;
#define NumberOfVectors 20000
#define CPU_MHZ 2400000000.0 // despite the name this is the CPU clock in Hz (2.4 GHz); adjust to the test machine
//create Vector3d Array
timings[1] = tic();
for (int i=1;i<NumberOfVectors;i++) {Vector3D gg[NumberOfVectors];}
timings[2] = tic();
printf("Create Vector3d(native) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
//malloc
timings[1] = tic();
for (int i=1;i<NumberOfVectors;i++) {ptr= (float*) malloc(sizeof(float)*NumberOfVectors*3); free(ptr);}
timings[2] = tic();
printf("Create Vector3d(malloc) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
//memory zero initalized
timings[1] = tic();
for (int i=1;i<NumberOfVectors;i++) {ptr2=(float*) calloc(3,sizeof(float)*NumberOfVectors); free(ptr2);}
timings[2] = tic();
printf("Create Vector3d(calloc) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
//for assign test
Vector3D d[NumberOfVectors];
Vector3D g[NumberOfVectors];
ptr= (float*) malloc(sizeof(float)*NumberOfVectors*3);
ptr2 = (float*) calloc(3,sizeof(float)*NumberOfVectors);
//memory zero initalized (ALIGNED 16)
timings[1] = tic();
vptr = calloc(3,15+sizeof(float)*NumberOfVectors);
ptr3_save = (float*) vptr; // original pointer, needed for free()
if ((unsigned int) vptr%16!=0)
{ptr3 = (float*) ((unsigned int) vptr+(16-(unsigned int) vptr%16));
printf("%u\n",(unsigned int) ptr3%16);
}
else {ptr3 = (float*) vptr;}
//printf("%p\n",(unsigned int) ptr3_save%16);
//printf("%p\n",(unsigned int) ptr3%16);
//printf("%p\n",ptr3_save);
//printf("%p\n",ptr3);
timings[2] = tic();
//memory zero initalized (ALIGNED 16)
timings[1] = tic();
vptr = calloc(3,15+sizeof(float)*NumberOfVectors);
ptr4_save = (float*) vptr; // original pointer, needed for free()
if ((unsigned int) vptr%16!=0)
{ptr4 = (float*) ((unsigned int) vptr+(16-(unsigned int) vptr%16));}
else {ptr4 = (float*) vptr;}
printf("%u\n",(unsigned int) ptr4%16);
//printf("%p\n",ptr4_save);
//printf("%p\n",ptr4);
timings[2] = tic();
//////////test assign
for (int i=1;i<NumberOfVectors;i++) g[i]=d[i]; //COLD
timings[1] = tic();
for (int i=1;i<NumberOfVectors;i++) g[i]=d[i];
timings[2] = tic();
printf("Assign Vector3d to Vector3d(native) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
for (int i=0;i<NumberOfVectors*3;i=i+3)
{
ptr[i]=ptr2[i]; ptr[i+1]=ptr2[i+1]; ptr[i+2]=ptr2[i+2];
}
timings[2] = tic();
//printf("Assign Vector3d to Vector3d(array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
for (int i=0;i<NumberOfVectors*3;i=i+3)
{
ptr[i]=ptr2[i]; ptr[i+1]=ptr2[i+1]; ptr[i+2]=ptr2[i+2];
}
timings[2] = tic();
printf("Assign Vector3d to Vector3d(array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
assign_asm(ptr,ptr2,NumberOfVectors);
timings[2] = tic();
//printf("Assign Vector3d to Vector3d(asm_array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
assign_asm(ptr,ptr2,NumberOfVectors);
timings[2] = tic();
printf("Assign Vector3d to Vector3d(asm_array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
assign_asm_align(ptr4,ptr3,NumberOfVectors);
timings[2] = tic();
//printf("Assign Vector3d to Vector3d(asm_array_align) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
assign_asm_align(ptr4,ptr3,NumberOfVectors);
timings[2] = tic();
printf("Assign Vector3d to Vector3d(asm_array_align) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
//////////test SQ
for (int i=1;i<NumberOfVectors;i++) d[i].sq(); //COLD
timings[1] = tic();
for (int i=1;i<NumberOfVectors;i++) d[i].sq();
timings[2] = tic();
printf("SQ Vector3d (native) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
for (int i=0;i<NumberOfVectors*3;i=i+3)
{
ptr2[i]=ptr[i]*ptr[i]; ptr2[i+1]=ptr[i+1]*ptr[i+1]; ptr2[i+2]=ptr[i+2]*ptr[i+2];
}
timings[2] = tic();
//printf("SQ Vector3d(array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
for (int i=0;i<NumberOfVectors*3;i=i+3)
{
ptr2[i]=ptr[i]*ptr[i]; ptr2[i+1]=ptr[i+1]*ptr[i+1]; ptr2[i+2]=ptr[i+2]*ptr[i+2];
}
timings[2] = tic();
printf("SQ Vector3d(array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
sq_asm(ptr,ptr2,NumberOfVectors);
timings[2] = tic();
//printf("SQ Vector3d(asm_array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
sq_asm(ptr,ptr2,NumberOfVectors);
timings[2] = tic();
printf("SQ Vector3d(asm_array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
sq_asm_align(ptr4,ptr3,NumberOfVectors);
timings[2] = tic();
//printf("SQ Vector3d(asm_array_align) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
sq_asm_align(ptr4,ptr3,NumberOfVectors);
timings[2] = tic();
printf("SQ Vector3d(asm_array_align) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
//////test SCALE
float scale_f = 5.5;
for (int i=1;i<NumberOfVectors;i++) {g[i]*=scale_f;} //COLD
timings[1] = tic();
for (int i=1;i<NumberOfVectors;i++) {g[i]*=scale_f;}
timings[2] = tic();
printf("SCALE Vector3d (native) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
for (int i=0;i<NumberOfVectors*3;i=i+3)
{
ptr2[i]=ptr[i]*scale_f; ptr2[i+1]=ptr[i+1]*scale_f; ptr2[i+2]=ptr[i+2]*scale_f;
}
timings[2] = tic();
//printf("SCALE Vector3d(array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
for (int i=0;i<NumberOfVectors*3;i=i+3)
{
ptr2[i]=ptr[i]*scale_f; ptr2[i+1]=ptr[i+1]*scale_f; ptr2[i+2]=ptr[i+2]*scale_f;
}
timings[2] = tic();
printf("SCALE Vector3d(array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
scale_asm(ptr,&scale_f,NumberOfVectors);
timings[2] = tic();
//printf("SCALE Vector3d(asm_array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
scale_asm(ptr,&scale_f,NumberOfVectors);
timings[2] = tic();
printf("SCALE Vector3d(asm_array) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
scale_asm_align(ptr4,&scale_f,NumberOfVectors);
timings[2] = tic();
//printf("SCALE Vector3d(asm_array_align) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
timings[1] = tic();
scale_asm_align(ptr4,&scale_f,NumberOfVectors);
timings[2] = tic();
printf("SCALE Vector3d(asm_array_align) %e s\n", ((timings[2]- timings[1])/CPU_MHZ)/NumberOfVectors);
return 0; // thats it folks.
}
The disadvantage of the asm approach is the reduced portability across hardware and software platforms, but a C fallback could easily be provided as well.
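One possible shape for such a fallback, as a sketch only: it uses SSE intrinsics from `<xmmintrin.h>` instead of inline asm, guarded by the compiler's SSE macros, and drops to a plain loop when they are not defined. The names `scale_floats` and `SKETCH_HAVE_SSE` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Multiplies `count` floats in place by `factor`.
void scale_floats(float* data, std::size_t count, float factor)
{
    assert(data != nullptr || count == 0);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    const __m128 f = _mm_set1_ps(factor);            // broadcast factor to 4 lanes
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);           // unaligned load, like movups
        _mm_storeu_ps(data + i, _mm_mul_ps(v, f));   // mulps, then unaligned store
    }
#endif
    for (; i < count; ++i)                           // tail, or everything without SSE
        data[i] *= factor;
}
```

For 20000 vectors the call would be `scale_floats(ptr, 20000 * 3, scale_f)`. Intrinsics also leave register allocation to the compiler, which avoids the hand-written constraint and clobber lists.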
So... what do you think?
Greetings