The published documents on how to generate SSE instructions via gcc's vector arithmetic support are generally poor and contain conflicting information. I've been experimenting with different ways to generate SSE, SSE2, and MMX code on the x86_64. The x86_64 is notable in that it has 16 SSE registers (rather than 8) and that by default it does all floating point math via those registers. It seems to do a halfway decent job of scheduling non-vectorized code, but I wanted to find a way to exercise more of the instruction set.

The case I decided to play with first was cubic interpolation in Linuxsampler. I am NOT trying to optimize a specific case for any particular reason, just trying to get to where I understand SSE well enough to apply it when there *is* a reason. Rather than let the knowledge rot, I thought I'd stick it in here in the hope it might be useful to others, especially those who are hand optimizing something where getting the compiler to do some of the heavy lifting is helpful.

Here's some C++ code that at least compiles (and may not otherwise be correct) and only generates 37 SSE instructions, vs. the original stereo code which generated over 60. My note on this is that a stereo version would not be much more complex (basically just ignore the upper half of the SSE registers) and would still result in fewer instructions than the default stereo cubic interpolation code. It would probably still be outperformed by the original stereo code, however, and perhaps the 3DNow code would be the only thing that could outperform that.

Note that the v4sf declaration comes second in the union. Declared that way, brace initialization goes to the struct (the first member), so you can still do structure assignment of const variables and just specify the vector part of the union during the math. Also, constants have to be declared as below in order to retain the expressiveness of vector = vector + vector. Telling the compiler it's a v2sf (for stereo) is hell on SSE code generation, but does nice stuff when 3DNow is specified.
Compilation line (on gcc4):

g++ -O3 -msse -msse2 -fverbose-asm -S t.cpp

#include <iostream>
#include <stdio.h>

typedef struct {
    float left;
    float right;
    float front;
    float rear;
} quad_samples;

typedef float v4sf __attribute__ ((vector_size (16)));

// The struct comes first so that brace initialization fills it;
// the vector member is the one used in the math.
typedef union {
    quad_samples s;
    v4sf v;
} quad_sample_t;

const quad_sample_t c3      = {3.0f,3.0f,3.0f,3.0f};
const quad_sample_t cpoint5 = {.5f,.5f,.5f,.5f};
const quad_sample_t c2      = {2.0f,2.0f,2.0f,2.0f};
const quad_sample_t c5      = {5.0f,5.0f,5.0f,5.0f};

static quad_sample_t pos_fract; // never set in this test, only here to generate code

typedef float sample_t;

static quad_sample_t Interpolate1StepQuadCPP(quad_sample_t* pSrc, double* Pos, float& Pitch) {
    int pos_int = (long) *Pos; // integer position
    // float pos_fract = *Pos - pos_int; // fractional part of position
    pos_int <<= 1;
    quad_sample_t samplePoint;
    quad_sample_t xm1 = pSrc[pos_int];
    quad_sample_t x0  = pSrc[pos_int+1];
    quad_sample_t x1  = pSrc[pos_int+2];
    quad_sample_t x2  = pSrc[pos_int+3];
    quad_sample_t a;
    a.v = (c3.v * (x0.v - x1.v) - xm1.v + x2.v) * cpoint5.v;
    quad_sample_t b;
    b.v = c2.v * x1.v + xm1.v - (c5.v * x0.v + x2.v) * cpoint5.v;
    quad_sample_t c;
    c.v = (x1.v - xm1.v) * cpoint5.v;
    samplePoint.v = (((a.v * pos_fract.v) + b.v) * pos_fract.v + c.v) * pos_fract.v + x0.v;
    *Pos += Pitch;
    return samplePoint;
}

static double t = 1.0;
static float t1 = .5;

// Defined nowhere; it only keeps the result alive so the call is not optimized away (-S never links).
extern void b (quad_sample_t a);

int main(int argc, char *argv[]) {
    quad_sample_t result;
    quad_sample_t samples[1024]; // uninitialized, the test only looks at the generated code
    result = Interpolate1StepQuadCPP(samples,&t,t1);
    b (result);
    //quad_sample_t results = (quad_sample_t) result;
    // printf("%f:,%f:",results.t.left,results.t.right);
}
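Since the test program above is only ever compiled with -S, it never checks the numbers. A minimal sketch of a check for the same coefficient formula follows, assuming the same union layout; `quad_splat` and `cubic_quad` are hypothetical helper names, not anything in Linuxsampler. On a straight ramp (0, 1, 2, 3) the cubic should collapse to a line, so interpolating halfway between 1 and 2 should give 1.5 in every lane.

```cpp
typedef struct { float left, right, front, rear; } quad_samples;
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union { quad_samples s; v4sf v; } quad_sample_t;

static const quad_sample_t c3      = {{3.0f, 3.0f, 3.0f, 3.0f}};
static const quad_sample_t cpoint5 = {{0.5f, 0.5f, 0.5f, 0.5f}};
static const quad_sample_t c2      = {{2.0f, 2.0f, 2.0f, 2.0f}};
static const quad_sample_t c5      = {{5.0f, 5.0f, 5.0f, 5.0f}};

// Hypothetical helper: put the same value into all four channels.
// Writing .s and reading .v later relies on GCC's union type punning.
static quad_sample_t quad_splat(float x) {
    quad_sample_t q = {{x, x, x, x}};
    return q;
}

// Hypothetical helper: p points at four consecutive frames (xm1, x0, x1, x2),
// fract is the fractional position between x0 and x1.
static quad_sample_t cubic_quad(const quad_sample_t* p, quad_sample_t fract) {
    const quad_sample_t xm1 = p[0], x0 = p[1], x1 = p[2], x2 = p[3];
    quad_sample_t a, b, c, out;
    a.v = (c3.v * (x0.v - x1.v) - xm1.v + x2.v) * cpoint5.v;
    b.v = c2.v * x1.v + xm1.v - (c5.v * x0.v + x2.v) * cpoint5.v;
    c.v = (x1.v - xm1.v) * cpoint5.v;
    out.v = ((a.v * fract.v + b.v) * fract.v + c.v) * fract.v + x0.v;
    return out;
}
```

For the ramp, a and b both come out as 0 and c as 1, so the result is simply x0 + fract.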
I'm still looking for the best/standard way to declare vector specific variables that you can manipulate, other than this union thing. The only coverage of how to use them is at http://gcc.gnu.org/onlinedocs/gcc-4.0.1/gcc/Vector-Extensions.html#Vector-Extensions, which doesn't go into that. The data in the test program IS 16 byte aligned in the generated code, but perhaps it's better to request that explicitly with an __attribute__ as well.
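One alternative to the union, as a sketch only: GCC accepts a brace-enclosed initializer directly on a vector type, so constants can be declared without the struct, and lanes can be copied out with memcpy when plain floats are needed. `coeff_a` and `lanes_out` are hypothetical names made up for this example.

```cpp
#include <string.h>

typedef float v4sf __attribute__ ((vector_size (16)));

// Vector constants initialized directly, no union involved.
static const v4sf c3     = { 3.0f, 3.0f, 3.0f, 3.0f };
static const v4sf c_half = { 0.5f, 0.5f, 0.5f, 0.5f };

// Hypothetical helper: the "a" coefficient of the cubic from the test program,
// a = (3 * (x0 - x1) - xm1 + x2) * 0.5, computed on all four lanes at once.
static v4sf coeff_a(v4sf xm1, v4sf x0, v4sf x1, v4sf x2) {
    return (c3 * (x0 - x1) - xm1 + x2) * c_half;
}

// Hypothetical helper: copy the four lanes into a plain float array.
static void lanes_out(float out[4], v4sf v) {
    memcpy(out, &v, sizeof v);
}
```

This keeps the vector = vector + vector style of the math; whether it generates the same code as the union version is something I have not checked.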
Vladimir and I also played around with GCC's vector extensions before we decided to go with hand crafted assembly code. The problem was that the vector extension implementation was still quite incomplete (we used gcc 3.4 at that point, I think). Some important operations like accessing a single cell of a vector were missing (as already pointed out by you). Of course for simple algorithms like interpolation / resampling this is not a problem, but for feedback control systems like the filter we are using in the gig Engine, where every calculated sample point depends on the result of the previous one, accessing single vector cells is mandatory. Also IIRC g++ 3.4 did not support vector extensions at all, only gcc 3.4 (that is, the C compiler part). This seems to have changed with gcc/g++ 4.0, fortunately.

Another problem (as already pointed out by Vladimir on the list) was that by ABI definition all floating point arguments of a function / method are transferred via the 387 FPU stack on x86 machines. For float->int conversions though we need to use MMX instructions, and you cannot mix 387 FPU and MMX instructions without exiting the MMX mode (by using the EMMS instruction), which takes a lot of CPU cycles. But this problem could be solved with the current CVS version of LS, since the main loop is now placed in just one method ATM (at least if the Filter::apply() method is compiled inline).

But of course the vector extensions will be the way to go in future, because the hand crafted assembly is a lot of work to maintain and causes other problems, like register shortage on -O1 optimization level for example. So IMO the main question currently is when the following operations are implemented in gcc:

* accessing single cells of a vector
* rotating the cells of a vector / bitshifting

Maybe the last one is already present with gcc 4.0, not tested yet. For the first one we might ask on the GCC list?
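To illustrate why single cell access matters for a feedback system, here is a rough sketch that uses a union with a float array as the workaround. The one-pole low-pass y[n] = g * x[n] + (1 - g) * y[n-1] is only a stand-in, NOT the gig Engine filter, and `v4sf_cells`, `one_pole4` and `rotate_left1` are hypothetical names. Later GCC releases document subscripting of vectors and a `__builtin_shuffle` builtin, which may make the union unnecessary; that is not used here.

```cpp
typedef float v4sf __attribute__ ((vector_size (16)));

// Union giving per-cell access to a vector (relies on GCC's union type punning).
typedef union { v4sf v; float f[4]; } v4sf_cells;

// Stand-in feedback filter: each output depends on the previous output,
// so the four cells have to be processed one after another.
static v4sf one_pole4(v4sf in, float g, float* state) {
    v4sf_cells x, y;
    x.v = in;
    float prev = *state;
    for (int i = 0; i < 4; ++i) {
        prev = g * x.f[i] + (1.0f - g) * prev;
        y.f[i] = prev;
    }
    *state = prev; // carry y[n-1] over to the next block of four
    return y.v;
}

// Rotate the cells left by one position through the same union.
static v4sf rotate_left1(v4sf in) {
    v4sf_cells x, r;
    x.v = in;
    for (int i = 0; i < 4; ++i)
        r.f[i] = x.f[(i + 1) & 3];
    return r.v;
}
```

Going through memory like this is exactly the kind of thing the compiler may or may not turn into decent SSE code, so treat it as a correctness sketch rather than an optimization.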
Ah... and of course you are right about the "aligned" GCC extension attribute. It is mandatory to let GCC know that it's operating on SSE safe data, since most SSE instructions require 16 byte aligned memory addresses. GCC would of course not be able to figure that out by itself if you are using memalign() for example, and would then most probably fall back to scalar (maybe even 387 FPU) instructions instead. So anyway, without that attribute it would be slower than it could be.
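A small sketch of both cases, under the assumption of a POSIX system: a static buffer declared with the aligned attribute, and a heap buffer obtained from posix_memalign() (used here in place of memalign()). `mix_buffer`, `is_aligned16` and `alloc_aligned_floats` are hypothetical names for this example only.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Static buffer with the alignment stated explicitly for the compiler.
static float mix_buffer[256] __attribute__ ((aligned (16)));

// Hypothetical helper: true if p sits on a 16 byte boundary.
static bool is_aligned16(const void* p) {
    return (reinterpret_cast<uintptr_t>(p) & 15) == 0;
}

// Hypothetical helper: heap buffer with 16 byte alignment, or 0 on failure.
// The compiler cannot see this alignment from the returned float* alone.
static float* alloc_aligned_floats(size_t count) {
    void* p = 0;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return 0;
    return static_cast<float*>(p);
}
```

A buffer from alloc_aligned_floats() is released with free() as usual.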
Closing this report now. GCC vector extensions have been added to the audio mix down functions of AudioChannel.cpp. For interpolation and other more complex tasks the current GCC vector extensions do not seem to be sufficient yet. Feel free to reopen this report in case vector extension support improves in GCC.
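For readers wondering what a mix down with vector extensions might look like, here is a hedged sketch. It is NOT the actual code from AudioChannel.cpp; `mix_to` and its parameters are made up for illustration, and it assumes both buffers are 16 byte aligned. The may_alias attribute on the typedef is there so that accessing float data through the vector pointer is safe under strict aliasing.

```cpp
#include <stddef.h>

typedef float v4sf __attribute__ ((vector_size (16), may_alias));

// Hypothetical mix down: dst[i] += src[i] * level for i in [0, samples).
// Assumption: src and dst are both 16 byte aligned.
static void mix_to(const float* src, float* dst, float level, size_t samples) {
    const v4sf* s = reinterpret_cast<const v4sf*>(src);
    v4sf* d = reinterpret_cast<v4sf*>(dst);
    const v4sf l = { level, level, level, level };
    const size_t blocks = samples / 4;

    // Four samples per iteration via the vector type.
    for (size_t i = 0; i < blocks; ++i)
        d[i] += s[i] * l;

    // Scalar tail for sample counts that are not a multiple of four.
    for (size_t i = blocks * 4; i < samples; ++i)
        dst[i] += src[i] * level;
}
```

This kind of loop has no dependency between samples, which is why it is a comfortable fit for the vector extensions while the filter is not.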