Faster Floating Point to Integer Conversions.

(c) 2001 Erik de Castro Lopo
Version 1.1
2001/11/02

Introduction    Motivation    Analysis    Ugly Macro Hack    A C Solution

Benchmarking    The Unix Solution    The Win32 Solution

Request for Feedback    Change Log    Acknowledgements

Introduction

In many applications such as audio, video and graphics processing, calculations are done with floating point values and the final results converted to integer. There are a number of ways of converting from floating point to integer with the C cast mechanism being the most commonly used. Unfortunately, casting from float or double to int on i386 can cause large performance hits when this operation is used frequently.

If the programmer considers the speed of the operation more important than the type of conversion between float and int (truncation or rounding), speed improvements of 6 to 10 times can be achieved on Pentium III and Athlon CPUs. It is highly likely this is also the case on other processors such as PowerPC.

This paper investigates the reasons why the C cast operation which is so important for many applications is so slow and provides an alternative which aims to be as portable as possible.

Motivation

The author of this paper is the main developer of libsndfile, a cross platform Free Software library for reading and writing audio files. One of the features of this library is the ability to read a file consisting of integers and convert them on the fly to floating point values, which can be manipulated by the user program before being written to a second file, again converting the floating point values to integers on the fly.

In August of 2001 the author was contacted by David Viens, a developer from Canada. Viens supplied a patch for Win32 consisting of a function with inline assembler to replace the standard C float to int cast. He stated that the inline assembler came from the Music-Dsp mailing list archive and that he would understand if his patch did not go into the official distribution of libsndfile. His aim was to raise awareness of the issue. The supplied patch did not go into the official distribution, but it did inspire this investigation and the modifications to libsndfile which achieved the same result as his original patch in a more portable and cross platform manner.

Analysis

Consider the following C function (pulled from the libsndfile project with nothing more than minor modifications) which converts an array of floats to an array of ints using the standard C casting mechanism.

        void f2i_array (float *fptr, int *buffer, int count)
        {   while (count)
            {   count -- ;
                buffer [count] = fptr [count] ;
                } ;
        }

As will be shown below in the benchmarking section, the standard C cast from float (or double) to int is slow in comparison to a number of other conversion methods. The root cause of this problem becomes obvious when the assembler output of the GNU C compiler (obtained using gcc -S) is viewed. Neglecting the stack handling code at the start and end of the function, the while loop is as follows (comments added):

         1: .L363:
         2:       decl      %edx                   ; decrement the count variable
         3:       flds      (%ebx,%edx,4)          ; load a float from input array
		 
         4:       fnstcw    -2(%ebp)               ; store FPU control word 
         5:       movw      -2(%ebp),%si           ; move FPU control word to si register
         6:       orw       $3072,%si              ; set rounding control bits (0x0C00): truncate
         7:       movw      %si,-4(%ebp)           ; move si to the stack
         8:       fldcw     -4(%ebp)               ; load modified value from stack into FPU control word
		 
         9:       fistpl    -8(%ebp)               ; store floating point value as an integer on the stack
		 
        10:       movl      -8(%ebp),%eax          ; move the integer value from stack to eax
        11:       fldcw     -2(%ebp)               ; restore FPU control word
        12:       movl      %eax,(%ecx,%edx,4)     ; move eax to output array

        13:       testl     %edx,%edx              ; test if count is zero
        14:       jne      .L363                   ; jump to label if not zero

The instruction which causes the real damage in this block is fldcw (FPU load control word) on lines 8 and 11. Whenever the FPU encounters this instruction it flushes its pipeline and loads the control word before continuing operation. The FPUs of modern CPUs like the Pentium III, Pentium IV and AMD Athlons rely on deep pipelines to achieve higher peak performance. Unfortunately, certain pieces of C code can reduce the floating point performance of the CPU to the level of a non-pipelined FPU.

So why is the fldcw instruction used? Unfortunately, it is required to make the calculation meet the ISO C Standard, which specifies that casting from floating point to integer is a truncation operation. If the fistpl instruction were executed without changing the mode of the FPU, the value would be rounded to the nearest integer instead of truncated. The standard rounding mode is required for all normal operations like addition, subtraction and multiplication, while truncation mode is required for the float to int cast. Hence if a block of code contains a float to int cast, the FPU will spend a large amount of its time switching between the two modes.

Removing the instructions dealing with changing the FPU mode results in a loop that looks like this:

        .L363:
              decl      %edx                   ; decrement the count variable
              flds      (%ebx,%edx,4)          ; load a float from input array
              fistpl    (%ecx,%edx,4)          ; store as an int in the output array
              testl     %edx,%edx              ; is count zero?
              jne      .L363                   ; if not, jump to label

Instead of using truncation, the above loop performs a rounding operation and doesn't adversely affect the FPU pipeline. In addition, since this loop contains far fewer instructions than the previous one, it executes more quickly. The programmer must decide whether the rounding operation is an acceptable substitute for truncation. In most audio applications it would be.

Ugly Macro Hack

Rewriting the original C code as assembler is not a suitable solution. However, the minimal assembler version can be obtained from C code compiled with the GNU C compiler (gcc) by the use of the following inline assembler macro:


        #define FLOAT_TO_INT(in,out)  \
                    __asm__ __volatile__ ("fistpl %0" : "=m" (out) : "t" (in) : "st") ;

The problem with the above macro is that it is extremely non-portable. It will only work with gcc targeting the i386 family of processors. Another solution must be found.

A C Solution

Fortunately, the 1999 ISO C Standard defines two functions which were not a part of earlier versions of the standard. These functions round doubles and floats to long ints and have the following function prototypes:

        long int   lrint    (double x) ;
        long int   lrintf   (float x) ;

These functions are defined in <math.h> but are only usable with the GNU C compiler if C99 extensions have been enabled before <math.h> is included. This is done as follows:


        #define	_ISOC9X_SOURCE	1
        #define _ISOC99_SOURCE	1

        #include  <math.h>

Two versions of the defines ensure that the required functions are picked up with older header files. In the GLIBC (the standard version of the C library on Linux) header files, these functions are defined as inline functions and are in fact inlined by gcc (the standard C compiler on Linux) when optimisation is switched on. If optimisation is switched off, the functions are not inlined and an executable calling these functions will need to be linked with the maths library.

The original C code can now be modified to use one of these functions:

        void f2i_array (float *fptr, int *buffer, int count)
        {   while (count)
            {   count -- ;
                buffer [count] = lrintf (fptr [count]) ;
                } ;
        }

which generates the following assembler (again, just looking at the assembler within the while loop):

        .L363:
              decl      %edx                   ; decrement the count variable
              flds      (%ebx,%edx,4)          ; load a float from the input array
        #APP
              fistpl    -4(%ebp)               ; convert float to int and store on stack
        #NO_APP
              movl      -4(%ebp),%eax          ; load value from the stack to eax
              movl      %eax,(%ecx,%edx,4)     ; store eax in output array
              testl     %edx,%edx              ; is count zero?
              jne       .L363                  ; if not, jump to label

The new assembler function does contain a bit more data shuffling than the optimal assembler version but is guaranteed to be portable across CPUs and compilers which meet the C99 standards.

Benchmarking

Benchmarking was performed throughout this investigation to ensure that the gains obtained by replacing float to int casts with something else were worthwhile. All benchmarking times were measured relative to a base value of the time to cast from float to int. One of the most surprising facts uncovered during this experiment was that using the integer pipeline to pull apart the floating point value and construct an int was faster than the float to int cast. However, when one remembers how much the float to int cast operation interferes with the FPU pipeline, the surprise decreases.

The testing was carried out on a dual 450MHz Pentium III machine running Linux 2.4.10 and versions 2.95.4 and 3.0.2 of the GNU C compiler. The timing was measured using the Pentium's rdtsc timer instruction. Typical test results are as follows:

        GCC version : 2.95
        
        Testing float -> int cast.
            cast time                 =   1.000
            cast time / int_pipe time =   3.581
            cast time / lrintf time   =   6.982
            cast time / macro time    =  10.005
        
        
        Testing double -> int cast.
            cast time                 =   1.000
            cast time / lrint time    =   6.655
            cast time / macro time    =  11.982

The above results show the time to convert a set of floats to ints relative to the cast operation. In all cases, the investigated operation was able to perform its task significantly quicker than the C cast mechanism. In the float to int cast tests, the int_pipe time is for the C code which uses integer pipeline operations to construct an integer from the raw bits of a float.
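The int_pipe code itself is not reproduced in this paper. The following is only a sketch of how such a conversion could be written, not the code that was benchmarked. It assumes a 32 bit IEEE 754 float and a 32 bit int, the function name float_to_int_bits is hypothetical, and saturating out of range inputs is a choice made for this sketch. Like the C cast, it truncates toward zero.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: truncate a float to int using integer operations only. */
static int float_to_int_bits (float f)
{
    uint32_t bits, magnitude ;
    int exponent, negative ;

    memcpy (&bits, &f, sizeof (bits)) ;              /* raw bit pattern of the float */
    negative  = (int) (bits >> 31) ;                 /* sign bit */
    exponent  = (int) ((bits >> 23) & 0xFF) - 127 ;  /* remove the exponent bias */
    magnitude = (bits & 0x7FFFFF) | 0x800000 ;       /* mantissa plus implicit leading 1 */

    if (exponent < 0)
        return 0 ;                                   /* |f| < 1 truncates to zero */
    if (exponent > 30)
        return negative ? INT_MIN : INT_MAX ;        /* too large, infinity or NaN: saturate */

    /* The mantissa holds 23 fraction bits, so shift the binary point into place. */
    if (exponent >= 23)
        magnitude <<= (exponent - 23) ;
    else
        magnitude >>= (23 - exponent) ;

    return negative ? - (int) magnitude : (int) magnitude ;
}
```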

Testing was also carried out on a number of other systems with Pentium II, Pentium III and AMD Athlon CPUs of various clock speeds. The results across these different processors were very similar to the results above.

The source code for this benchmark (fp_cast_test.c) is available so that these results can be independently verified. This code makes use of gcc and i386 assembler macros for the timing functions and is therefore not as portable as the header file supplied below for addressing this problem. However, it should not be too difficult to modify this program to allow benchmarking on other platforms.

The Unix Solution

The Unix solution to this problem consists of a number of pieces:

An autoconf m4 macro for detecting lrint : lrint.m4
An autoconf m4 macro for detecting lrintf : lrintf.m4
The float cast header file : float_cast.h

The autoconf macros define two new feature detection functions: AC_C99_FUNC_LRINT and AC_C99_FUNC_LRINTF. These functions, when run as part of a configure script, set HAVE_LRINT and HAVE_LRINTF in the Autoconf generated file config.h.

Using these autoconf macros in a project that already uses autoconf is simple. The m4 files themselves should be placed in a location where the aclocal program can find them, for example an m4 directory within the top level directory of the project's source distribution. The macros themselves can be invoked in the configure.in file. An appropriate place to do this is where the configure script checks for the existence of other functions or libraries. Once configure.in has been modified, the aclocal program should be run, followed by the usual autoconf, autoheader, automake combination.

The float_cast.h header file uses lrint and lrintf if found and defaults to a standard C cast if they are not. If required, solutions to specific compiler and CPU combinations could be added to this header file. Such a solution has already been implemented for Win32 and the Visual C++ compiler combination.
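The selection logic described above can be sketched as follows. This is an illustration of the idea only, not the contents of float_cast.h. The macro names FLOAT_TO_LONG and DOUBLE_TO_LONG are hypothetical, and HAVE_CONFIG_H is assumed to be defined by the build system when a config.h file is present.

```c
#ifdef HAVE_CONFIG_H
#include "config.h"     /* may define HAVE_LRINT and HAVE_LRINTF */
#endif

#if defined (HAVE_LRINT) && defined (HAVE_LRINTF)

/* The C99 functions were detected: use them (rounding). */
#define _ISOC9X_SOURCE  1
#define _ISOC99_SOURCE  1
#include <math.h>

#define FLOAT_TO_LONG(x)    lrintf (x)
#define DOUBLE_TO_LONG(x)   lrint (x)

#else

/* Fall back to the standard C cast (truncation). */
#define FLOAT_TO_LONG(x)    ((long) (x))
#define DOUBLE_TO_LONG(x)   ((long) (x))

#endif
```

Code using such a header should keep in mind that the two branches differ for non-integral inputs: the first rounds while the fallback truncates.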

The Win32 Solution

Win32 and the Visual C++ compiler combination do not implement the lrint or lrintf functions. The float_cast.h header file therefore contains inline implementations of these functions.

Request for Feedback

I believe that the problem this solution attempts to fix also exists for other CPU and compiler combinations. I would be interested in confirming this and receiving patches for processors and/or compilers which do not define lrint and lrintf. Feedback should be sent to erikd@mega-nerd.com.

Change Log

Version 1.0 (2001/10/31) : Initial paper.

Version 1.1 (2001/11/02) : Was notified by Phil Frisbie Jr of Hawk Software that Win32 did allow assembler in inline functions. Modified float_cast.h to reflect this. Added proper source attribution for original Win32 assembler optimisation.

Acknowledgements

David Viens : A Canadian developer who first brought this problem to the author's attention. He also ran a number of tests on Win32 while the author was attempting to work out a portable solution to this problem.

Andrew Bennetts, Bart Bunting, Simon Rumble and Jeff Waugh : Members of the Sydney Linux User Group (SLUG) mailing list who were kind enough to run benchmarks on their AMD Athlon based machines.

Sydney Linux User Group (SLUG) : Allowed the presentation of preliminary findings at one of their meetings before the matter had been fully investigated. Unfortunately there were some minor mistakes in the details of that presentation which have been corrected in this paper.

(c) 2001 Erik de Castro Lopo.

This document may not be reproduced in any form without the written permission of the author.

The files fp_cast_test.c, lrint.m4, lrintf.m4 and float_cast.h may be used as per the license agreement in the files themselves.