Bug 25 - Just some notes towards convincing gcc 4.0 to emit sse instructions
Status: CLOSED FIXED
Alias: None
Product: LinuxSampler
Classification: Unclassified
Component: other
Version: SVN Trunk
Hardware: PC Linux
Importance: P5 enhancement
Assignee: Christian Schoenebeck
Reported: 2005-09-22 02:33 CEST by Mike Taht
Modified: 2013-05-31 19:00 CEST
Description Mike Taht 2005-09-22 02:33:59 CEST
The published documents on how to generate sse instructions via gcc's vector
arithmetic support are generally poor and contain conflicting information. I've
been experimenting with different ways to generate sse, sse2, and mmx code
on the x86_64. The x86_64 is notable in that it has 16 sse registers (rather
than 8) and that it does all floating-point math via those registers by default. It
seems to do a halfway decent job of scheduling non-vectorized code but I wanted
to find a way to exercise more of the instruction set.

The case I decided to play with first was cubic interpolation in LinuxSampler. I
am NOT trying to optimize for a specific case for any particular reason, just
trying to get to where I understand SSE well enough to apply it when there *is*
a reason.  Rather than let the knowledge rot, I thought I'd stick it in here in
the hope it might be useful to others, especially those that are hand optimizing
something where getting the compiler to do some of the heavy lifting is helpful. 


Here's some C code that at least compiles - and may not otherwise be correct -
and only generates 37 sse instructions (vs the original stereo code which
generated over 60). My note on this is that a stereo version would not be much
more complex (basically just ignore the upper half of the sse registers) and
still result in fewer instructions than the default stereo cubic interpolation
code. It would probably still be outperformed by the original stereo code,
however, and perhaps the 3dnow code would be the only thing that could
outperform that.

Note the v4sf declaration in the union is second. Declared that way, you can
still do structure assignment of const variables, and just specify the vector
part of the union during the math.

Also, constants have to be declared as per below in order to retain the
expressiveness of vector = vector + vector.

Telling the compiler it's a v2sf (for stereo) is hell on sse code generation,
but does nice stuff when 3dnow is specified....

Compilation line (on gcc4): g++ -O3 -msse -msse2  -fverbose-asm -S t.cpp

#include <iostream>
#include <stdio.h>

typedef struct {
        float left;
        float right;
        float front;
        float rear;
} quad_samples;

typedef float v4sf __attribute__ ((vector_size (16)));

typedef union {
        quad_samples s;
        v4sf v;
} quad_sample_t;

const quad_sample_t c3 = {3.0f,3.0f,3.0f,3.0f};
const quad_sample_t cpoint5 = {.5f,.5f,.5f,.5f};
const quad_sample_t c2 = { 2.0f,2.0f,2.0f,2.0f};
const quad_sample_t c5 = { 5.0f,5.0f,5.0f,5.0f};
static quad_sample_t pos_fract;

typedef float sample_t;

static quad_sample_t Interpolate1StepQuadCPP(quad_sample_t* pSrc, double* Pos, float& Pitch) {
     int   pos_int   = (long) *Pos;  // integer position
//     float pos_fract = *Pos - pos_int;     // fractional part of position
     // note: with the line above commented out, the static pos_fract vector
     // is never assigned, so it stays zero-initialized in this test
     pos_int <<= 1;
     quad_sample_t samplePoint;
     quad_sample_t xm1 = pSrc[pos_int];
     quad_sample_t x0  = pSrc[pos_int+1];
     quad_sample_t x1  = pSrc[pos_int+2];
     quad_sample_t x2  = pSrc[pos_int+3];
     quad_sample_t a;
     a.v  = (c3.v * (x0.v - x1.v) - xm1.v + x2.v) * cpoint5.v;
     quad_sample_t b;
     b.v  = c2.v * x1.v + xm1.v - (c5.v * x0.v + x2.v) * cpoint5.v;
     quad_sample_t c;
     c.v  = (x1.v - xm1.v) * cpoint5.v;
     samplePoint.v = (((a.v * pos_fract.v) + b.v) * pos_fract.v + c.v) * pos_fract.v + x0.v;

     *Pos += Pitch;
     return samplePoint;
}

static double t = 1.0;
static float t1 = .5;
extern void b (quad_sample_t a) ;

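int main(int argc, char *argv[]) {
        quad_sample_t result;
        quad_sample_t samples[1024];  // left uninitialized, only the generated asm matters here
        result = Interpolate1StepQuadCPP(samples,&t,t1);
        b (result);  // external function, keeps the result from being optimized away
//quad_sample_t results = (quad_sample_t) result;
//      printf("%f:,%f:",results.s.left,results.s.right);
        return 0;
}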
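
Since the code above "may not otherwise be correct", here is a minimal sketch of one way to check the vector math against a scalar version of the same cubic formula. The names quad_t, splat, cubic_scalar, cubic_vector and max_lane_error are made up for this example and are not part of LinuxSampler. It assumes GCC's documented behaviour of allowing a union member to be read after a different member was written.

```cpp
typedef float v4sf __attribute__ ((vector_size (16)));

// Union with a plain array so single lanes can be read back
// (hypothetical helper type, relies on GCC's union type punning).
typedef union {
        float f[4];
        v4sf  v;
} quad_t;

static quad_t splat(float x) {
        quad_t q = {{x, x, x, x}};
        return q;
}

// Scalar cubic interpolation with the same coefficients as the vector code.
static float cubic_scalar(float xm1, float x0, float x1, float x2, float t) {
        float a = (3.0f * (x0 - x1) - xm1 + x2) * 0.5f;
        float b = 2.0f * x1 + xm1 - (5.0f * x0 + x2) * 0.5f;
        float c = (x1 - xm1) * 0.5f;
        return (((a * t) + b) * t + c) * t + x0;
}

// Same formula on four lanes at once, written as vector = vector op vector.
static quad_t cubic_vector(const quad_t& xm1, const quad_t& x0,
                           const quad_t& x1, const quad_t& x2,
                           const quad_t& t) {
        const quad_t c3 = splat(3.0f), half = splat(0.5f);
        const quad_t c2 = splat(2.0f), c5 = splat(5.0f);
        quad_t a, b, c, r;
        a.v = (c3.v * (x0.v - x1.v) - xm1.v + x2.v) * half.v;
        b.v = c2.v * x1.v + xm1.v - (c5.v * x0.v + x2.v) * half.v;
        c.v = (x1.v - xm1.v) * half.v;
        r.v = (((a.v * t.v) + b.v) * t.v + c.v) * t.v + x0.v;
        return r;
}

// Largest per-lane difference between the vector and scalar results
// for one arbitrary set of test values.
static float max_lane_error() {
        quad_t xm1 = {{0.0f, 1.0f, -2.0f, 0.5f}};
        quad_t x0  = {{1.0f, 2.0f, -1.0f, 0.25f}};
        quad_t x1  = {{4.0f, 3.0f, 0.5f, -0.75f}};
        quad_t x2  = {{9.0f, 4.0f, 2.0f, 1.5f}};
        quad_t t   = {{0.0f, 0.25f, 0.5f, 0.75f}};
        quad_t r = cubic_vector(xm1, x0, x1, x2, t);
        float worst = 0.0f;
        for (int i = 0; i < 4; ++i) {
                float ref  = cubic_scalar(xm1.f[i], x0.f[i], x1.f[i], x2.f[i], t.f[i]);
                float diff = r.f[i] - ref;
                if (diff < 0.0f) diff = -diff;
                if (diff > worst) worst = diff;
        }
        return worst;
}
```

A small tolerance is safer than an exact comparison here, since the compiler is free to schedule (and possibly contract) the scalar and vector versions differently.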
Comment 1 Mike Taht 2005-09-22 03:07:00 CEST
I'm still looking for the best/std way to declare vector specific variables that
you can manipulate other than this union thing. The total coverage of how to use
them is in:

http://gcc.gnu.org/onlinedocs/gcc-4.0.1/gcc/Vector-Extensions.html#Vector-Extensions

Which doesn't go into that.

The code generated IS 16 byte aligned in the test program, but perhaps it's
better to make that also an __attribute__.
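
A minimal sketch of what making the alignment an explicit __attribute__ could look like, using GCC's aligned attribute on a type and on a variable. The names quad_frame, frames, scratch and is_aligned16 are invented for this example.

```cpp
#include <stdint.h>

// Four floats per frame, with the 16 byte alignment requested explicitly
// instead of relying on where the compiler happens to place the data.
struct quad_frame {
        float ch[4];
} __attribute__ ((aligned (16)));

// The attribute on the type carries over to every array element.
static quad_frame frames[8];

// The attribute can also be put on a single variable.
static float scratch[16] __attribute__ ((aligned (16)));

// Run-time check that an address sits on a 16 byte boundary.
static bool is_aligned16(const void* p) {
        return (reinterpret_cast<uintptr_t>(p) & 15u) == 0;
}
```

Putting the attribute on the type rather than on each variable means stack, static and array instances all get the same guarantee.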


Comment 2 Christian Schoenebeck 2005-09-22 15:45:36 CEST
Vladimir and I also played around with GCC's vector extensions before we   
decided to go with hand crafted assembly code. The problem was that the vector   
extension implementation was still quite incomplete (we used gcc 3.4 at that   
point I think). Some important operations like accessing a single cell of a   
vector were missing (as already pointed out by you). Of course for simple   
algorithms like interpolation / resampling this is not a problem, but for
feedback control systems like the filter we are using in the gig Engine, where   
every calculated sample point depends on the result of the previous one,   
accessing single vector cells is mandatory.   
   
Also IIRC g++ 3.4 did not support vector extensions at all, only gcc 3.4 (that   
is the C compiler part). This seems to have changed with gcc/g++ 4.0   
fortunately.   
   
Another problem (as already pointed out by Vladimir on the list) was that by   
ABI definition all floating point arguments of a function / method are   
transferred via the 387 FPU stack on x86 machines. For float->int conversions   
though we need to use MMX instructions, and you cannot mix 387 FPU and MMX   
instructions without exiting the MMX mode (by using the EMMS instruction) which   
takes a lot of CPU cycles. But this problem could be solved with the current CVS
version of LS, since the main loop is now placed in just one method ATM (at
least if the Filter::apply() method gets inlined).
   
But of course the vector extensions will be the way to go in future, because
the hand crafted assembly is a lot of work to maintain and causes other problems
like register shortage on -O1 optimization level for example.
   
So IMO the main question currently is when the following operations are   
implemented in gcc:
  
    * accessing single cells of a vector 
    * rotating the cells of a vector / bitshifting 
 
Maybe the last one is already present with gcc 4.0, not tested yet. For the 
first one we might ask on the GCC list? 
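
Until the compiler offers both operations directly, the union trick from the description can stand in for them. What follows is only a sketch with made-up names (lanes_t, get_lane, set_lane, rotate_left), and it assumes GCC's documented union type punning. The rotation is written element by element, so there is no guarantee the compiler turns it into a single shuffle instruction. Later GCC releases also document subscripting a vector like an array, which would make the union unnecessary for cell access.

```cpp
typedef float v4sf __attribute__ ((vector_size (16)));

// Union overlaying the vector with a plain array (hypothetical helper).
typedef union {
        float f[4];
        v4sf  v;
} lanes_t;

// Read a single cell of a vector by going through the union.
static float get_lane(v4sf v, int i) {
        lanes_t u;
        u.v = v;
        return u.f[i];
}

// Return a copy of v with cell i replaced by x.
static v4sf set_lane(v4sf v, int i, float x) {
        lanes_t u;
        u.v = v;
        u.f[i] = x;
        return u.v;
}

// Rotate the cells left by one: {a,b,c,d} becomes {b,c,d,a}.
static v4sf rotate_left(v4sf v) {
        lanes_t in, out;
        in.v = v;
        for (int i = 0; i < 4; ++i)
                out.f[i] = in.f[(i + 1) & 3];
        return out.v;
}
```

For a feedback filter this is likely to cost a round trip through memory per access, so it is a correctness workaround rather than a fast path.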
Comment 3 Christian Schoenebeck 2005-09-22 16:00:47 CEST
Ah... and of course you are right about the "aligned" GCC extension attribute, 
this is mandatory to let GCC know that it's operating on SSE safe data, since 
most SSE instructions require 16 byte aligned memory addresses. GCC would of 
course not be able to figure that out if you are using memalign() for example 
and most probably use scalar (maybe even 387 FPU) instructions then instead. So 
anyway without that attribute it would be slower than it could be. 
Comment 4 Christian Schoenebeck 2013-05-31 19:00:17 CEST
Closing this report now.

GCC vector extensions have been added to the audio mix down functions of AudioChannel.cpp. For interpolation and other more complex tasks, the current GCC vector extensions do not seem sufficient yet.

Feel free to reopen this report if vector extension support improves in GCC.
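
To give an idea of the kind of mix down loop that suits the vector extensions, here is a rough sketch. It is not the actual AudioChannel.cpp code; mix_add and quad_t are hypothetical names, and both buffers are assumed to be 16 byte aligned and to hold a multiple of four floats.

```cpp
#include <cstddef>

typedef float v4sf __attribute__ ((vector_size (16)));

typedef union {
        float f[4];
        v4sf  v;
} quad_t;  // hypothetical helper for building vector constants

// dst[i] += src[i] * gain, four floats per step.
// `count` is the number of 4-float vectors, not the number of samples.
static void mix_add(v4sf* dst, const v4sf* src, float gain, size_t count) {
        quad_t g = {{gain, gain, gain, gain}};
        for (size_t i = 0; i < count; ++i)
                dst[i] = dst[i] + src[i] * g.v;
}
```

Every output cell depends only on the matching input cells, so no single cell access or shuffling is needed, which is presumably why this sort of loop is an easier fit than interpolation or filters.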