Sun, 29 Dec 2013
Haskell : The Problem with Integer.
Haskellers may or not be aware that there are two libraries in the GHC sources for implementing the Integer data type.
The first, integer-gmp links to the GNU Multiple Precision Arithmetic Library which is licensed under the GNU LGPL. On most systems, libgmp is dynamically linked and all is fine. However, if you want to create statically linked binaries from Haskell source code you end up with your executable statically linking libgmp which means your binary needs to be under an LGPL compatible license if you want to release it. This is especially a problem on iOS which doesn't allow dynamic linking anyway.
The second Integer implementation is integer-simple which is implemented purely in Haskell (using a number of GHC extension) and is BSD licensed.
So why doesn't everyone just the the BSD licensed one? Well, integer-simple has a reputation for being slow. Even more intriguingly, I seem to remember Duncan Coutts telling me a couple of years ago that integer-simple was a little faster than integer-gmp when the Integer was small enough to fit in a single machine Word, but much slower when that was not the case. At the time I heard this, I decided to look into it at some time. That time has come.
A couple of weeks ago I put together some scripts and code to allow me to compile the two Integer implementations into a single program and benchmark them against each other. My initial results looked like this:
That confirmed the slowness for multiplication and division if nothing else.
Taking a look at the code to integer-simple I found that it was storing Word#s (unboxed machine sized words) in a Haskell list. As convenient as lists are they are not an optimal data structure for a something like the Integer library.
I have already started work on replacement for both versions of the current Integer library with the following properties:
- BSD licensed.
- Implemented in Haskell (with GHC extensions) so there are no issues with linking to external C libraries.
- Fast. I'm aiming to outperform both integer-simple and integer-gmp on as many benchmarks as possible.
- Few dependencies so it can easily replace the existing versions. Currently my code only depends on ghc-prim and primitive.
So far the results are looking encouraging. For Integer values smaller than a machine word, addition with my prototype code is faster than both existing libraries and for adding large integers its currently half the speed of integer-gmp, but I have an idea which will likely make the new library match the speed of integer-gmp.