m3ga blog http://www.mega-nerd.com/erikd/Blog An occasional rant

Objects vs Modules. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/objects_vs_modules.html Although I've been using Ocaml for several years now, I've not yet been in a situation where I've needed to write an Ocaml class to define a C++/Java/Python/Smalltalk/OO style object. I've found that most of the problems I encountered could be easily solved using functional code and that Ocaml's objects didn't provide an obviously better solution. Until now (or so I thought).

The problem was one of moving around the filesystem keeping track of the old directories so they were easy to return to. The obvious model for this was the pushd and popd built-ins in command shells like GNU Bash. This functionality can be easily wrapped up in an Ocaml object as in the following example and demo code (which needs to be linked to the Unix module):


  class dirstack = object
      val mutable stack = []

      method push dirname =
          (* Find the current working directory. *)
          let cwd = Unix.getcwd () in
          (* Change to the new directory. *)
          Unix.chdir dirname ;
          (* If successful, push old cwd onto the stack. *)
          stack <- cwd :: stack

      method pop () =
          match stack with
          |    [] -> failwith "Directory stack is empty."
          |    head :: tail ->
                  (* Drop the entry so the next pop goes one level further back. *)
                  stack <- tail ;
                  Unix.chdir head

  end

  let () =
      print_endline (Unix.getcwd ()) ;
      let dstack = new dirstack in
      dstack#push "/tmp" ;
      print_endline (Unix.getcwd ()) ;
      dstack#push "/bin" ;
      print_endline (Unix.getcwd ()) ;
      dstack#pop () ;
      print_endline (Unix.getcwd ()) ;
      dstack#pop () ;
      print_endline (Unix.getcwd ())


However, there are some problems with the above code. Firstly, if the push and pop methods need to be used throughout the program, the dstack object needs to be made more widely accessible using one of the following three methods:

  1. Being placed in the global scope.
  2. Being made into a Singleton object.
  3. Being passed around as a parameter to whatever function may need it.

Yuck! Yuck! Double yuck! Suddenly, this object oriented solution didn't look like such a great idea.

Then it struck me. This object can be easily transformed into an Ocaml module like this:


  module Dirstack = struct
      let stack = ref []

      let push dirname =
          (* Find the current working directory. *)
          let cwd = Unix.getcwd () in
          (* Change to the new directory. *)
          Unix.chdir dirname ;
          (* If successful, push old cwd onto the stack. *)
          stack := cwd :: !stack

      let pop () =
          match !stack with
          |    [] -> failwith "Directory stack is empty."
          |    head :: tail ->
                  stack := tail ;
                  Unix.chdir head

  end

  let () =
      print_endline (Unix.getcwd ()) ;
      Dirstack.push "/tmp" ;
      print_endline (Unix.getcwd ()) ;
      Dirstack.push "/bin" ;
      print_endline (Unix.getcwd ()) ;
      Dirstack.pop () ;
      print_endline (Unix.getcwd ()) ;
      Dirstack.pop () ;
      print_endline (Unix.getcwd ())

This solution using a module is much better than the one using an object. The Dirstack module itself is globally accessible and is already a singleton while the stack used to hold past directories is implemented as a list whose scope is limited to the module itself. (Furthermore, if Dirstack is implemented in its own file instead of using a module defined within a larger file, then the stack variable can be hidden completely by not listing it in the Dirstack interface file.)
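
The hiding mentioned in the parenthesis can also be sketched within a single file by constraining the module with a signature. This is only a minimal sketch under a few assumptions: the signature below is hypothetical (it is not taken from the original code), and Sys.getcwd and Sys.chdir from the standard library are swapped in for the Unix versions so that the example needs no extra linking.

```ocaml
(* Hypothetical signature: only push and pop are exported, so the
   stack reference cannot be reached from outside the module. *)
module Dirstack : sig
  val push : string -> unit
  val pop : unit -> unit
end = struct
  let stack = ref []

  let push dirname =
    let cwd = Sys.getcwd () in
    (* Sys.chdir raises Sys_error on failure, so the stack is only
       updated if the directory change succeeded. *)
    Sys.chdir dirname ;
    stack := cwd :: !stack

  let pop () =
    match !stack with
    | [] -> failwith "Directory stack is empty."
    | head :: tail ->
        stack := tail ;
        Sys.chdir head
end

let () =
  print_endline (Sys.getcwd ()) ;
  (* The temporary directory is used here only as a directory that
     is likely to exist on most systems. *)
  Dirstack.push (Filename.get_temp_dir_name ()) ;
  print_endline (Sys.getcwd ()) ;
  Dirstack.pop () ;
  print_endline (Sys.getcwd ())
```

With the signature in place, an expression like Dirstack.stack is rejected by the compiler, which is the same effect as leaving stack out of a separate interface file.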

So while I'm pleased with this solution, it does mean that I'll have to continue my hunt for a problem where an object provides a better solution than any other feature of the Ocaml language. This is particularly ironic because when choosing between two strictly and statically typed languages, Haskell and Ocaml, I chose Ocaml because I thought I needed objects. However, I stuck with Ocaml because of its pragmatism.

Cross Compiling for Legacy Win32 Systems (Part 2). http://www.mega-nerd.com/erikd/Blog/CodeHacking/MinGWCross/cross_compiling_2.html Cross compiling from Linux to Windows requires the installation of a couple of packages. On a Debian or Ubuntu system this can be done using:


  sudo apt-get install build-essential
  sudo apt-get install mingw32 mingw32-binutils mingw32-runtime wine

I'm running Ubuntu's Hardy Heron pre-release and the following is known to work with these versions:


  mingw32               4.2.1.dfsg-1ubuntu1
  mingw32-binutils      2.17.50-20070129.1-1
  mingw32-runtime       3.13-1
  wine                  0.9.59-0ubuntu5

For an example of a project which can be successfully cross-compiled, I have chosen libogg which is one of the two libraries required to encode and decode Ogg/Vorbis files. I also happen to know that the current libogg sources in the Xiph Foundation's SVN repository cross-compile from Linux to Windows correctly because I committed the patch to make it possible.

However, we need to look ahead a little. After we have cross compiled libogg we will also want to cross compile the associated libvorbis library which relies on libogg. We therefore need to configure libogg so that when we install it, it can be found by the libvorbis configure script.

For me that meant creating a MinGW32 directory in my home directory:


  mkdir $HOME/MinGW32

The next step is to grab the libogg source code from the Xiph SVN server. This can be achieved using the command:


  svn co http://svn.xiph.org/trunk/ogg libogg

Changing into the libogg directory, we are now ready to configure, test and install the library. That can be done using:


  ./autogen.sh
  ./configure --host=i586-mingw32msvc --target=i586-mingw32msvc \
      --build=i586-linux --prefix=$HOME/MinGW32
  make
  make check
  make install

The first command above runs the autotools to generate the configure script. The second command, configure, is broken across two lines. It sets up the generated Makefiles to compile Windows binaries from a Linux host, with the install directory we set up before. The third line builds the Windows version of libogg, the fourth line runs the test suite, with the Windows executables being run under WINE, and the final line installs everything in the MinGW32 directory created earlier.

All of the above commands should pass without errors. If they don't, check your versions of the mingw cross compiler tools and/or WINE.

Cross Compiling for Legacy Win32 Systems (Part 1). http://www.mega-nerd.com/erikd/Blog/CodeHacking/MinGWCross/cross_compiling.html My main two FOSS projects, libsndfile and libsamplerate have significant numbers of users that are tied to that particularly odious legacy system, Microsoft Windows. Since I don't normally use Windows myself, maintaining support for that OS has always been a huge pain in the neck.

Originally I shipped Microsoft project files for libsndfile, but that became unworkable because the different versions of the Microsoft tools (Visual C++ 5, Visual C++ 6, Visual Studio 2003, Visual Studio 2005 etc) used different and incompatible project file formats. I solved this by shipping a simple Makefile that used Microsoft's nmake and the command line compilers to build libsndfile. However, by about 2004, the Microsoft compiler's complete lack of support for the 1999 ISO C Standard made maintaining support too much trouble, so it was dropped.

Instead, I started using Cygwin and MinGW to compile libsndfile on Windows. Both of these tool-sets use a version of the GNU GCC compiler just like Linux and building libsndfile using these two tool-sets was trivial:


  ./configure
  make
  make check

Of course there were howls of protest from Windows users, but since they (with a small number of exceptions) had contributed so little, I didn't feel like I owed them anything. I also started releasing pre-compiled Windows binaries at the same time as the source code tarballs were released.

However, while the MinGW compiler was a huge improvement over the Microsoft one it was still a huge pain in the neck. I had to keep a Windows machine and keep it updated and patched against vulnerabilities. Furthermore, installing and updating MinGW was a painful manual process. Oh how I longed for a Debian/Ubuntu style apt-get command to look for and install updates. Finally, copying source code back and forth between Linux and Windows while debugging Windows issues was another pain point because version control systems like GNU Arch and bzr simply didn't work very well on Windows.

In about 2004, I tried the MinGW Linux to Windows cross compiler, a compiler that runs on Linux but generates binaries for Windows. This compiler worked, but left one rather large problem; how do I run libsndfile's rather large and comprehensive test suite? Compiling libsndfile without running the test suite is a waste of time. I did try to run the tests under WINE (the Windows emulator), but at the time tests were failing under WINE that didn't fail on Windows.

From that time on, I would try running the cross-compiled test suite under WINE once or twice a year. Then, some time in the last year or so, the number of problems with the test suite dropped to one, which was only a FIXME message. A little hacking on the WINE sources resulted in a patch that was sent to the WINE mailing list and has since been applied to the main WINE source tree.

With that bug fixed, I can now cross compile from Linux to Windows and run the full libsndfile test suite under WINE. That means that Windows has just become that little bit less relevant than it was before.

A future post will explain how to set up the cross compiler and WINE and walk through compiling and testing of a standard FOSS project.

You Stupid Git! http://www.mega-nerd.com/erikd/Blog/CodeHacking/stupid_git.html As far as I can tell, the absolute, canonical, go-to-first documentation for the git distributed version control system (DVCS) can be found here:

http://www.kernel.org/pub/software/scm/git/docs/user-manual.html

This documentation seems comprehensive and well laid out. It explains commits, manipulating-branches, merging, collaborative development and the pretty damn interesting rebase and bisect commands. This documentation is called a user manual but it contains sufficient examples to make it a pretty damn fine tutorial.

Normally something like "here's a link to the documentation" would not be worthy of a blog post. However, failure to find the canonical user manual could lead a person (ie me) to post messages to mailing lists saying things like:

"I'm sure git is very clever and all, but its UI and documentation is probably the most user hateful thing I have seen [since] sendmail's cf files."

or, on finding a one hour long video screen-cast tutorial (apparently aimed at all those Ruby on Rails writing Mac OSX users):

"This makes me wonder, how fscked up does a DVCS have to be that you need tens of megabytes of video to show how it works when Bzr and many others can do it with less than ten kilobytes of html text?"

So while I was wrong about the documentation I still have huge reservations about git's user interface and stand by this statement:

"I am currently trying to learn git and I can see very clearly that git is designed by kernel programmers whose normal approach to a user interface is something like a Unix system call."

I'm sure git is a powerful tool and the rebase feature is something I've been wishing for in other systems for some time, but git's UI is already starting to grate.

Ocaml : Exception Back Traces in Native Code. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/native_backtraces.html Some time ago I wrote a blog post about exception back traces which at the time of that post only existed for the Ocaml byte code compiler.

However, version 3.10 of the Ocaml compiler, which was released about a year ago, included exception back traces for native code as well as byte code. With the imminent release of Ubuntu's Hardy Heron, version 3.10 of the compiler is about to become much more widely available.

Enabling exception back traces is as simple as adding the "-g" option to the ocamlopt command line and then setting a single environment variable as follows.


  export OCAMLRUNPARAM="b1"
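
For completeness, back trace recording can also be switched on from inside the program. The following is only a sketch and rests on an assumption beyond this post: it uses Printexc.record_backtrace and Printexc.print_backtrace, which as far as I know only appeared in compiler versions after 3.10, so it will not build with 3.10 itself. The lookup function is a hypothetical stand-in for real code that raises an exception.

```ocaml
(* Hypothetical failing function, used only to trigger an exception. *)
let lookup key table = List.assoc key table

let () =
  (* Ask the runtime to record back traces, much like setting the
     environment variable before the program starts. *)
  Printexc.record_backtrace true ;
  try ignore (lookup "missing" [ ("present", 1) ])
  with Not_found ->
    prerr_endline "Caught Not_found, back trace follows:" ;
    (* Only useful if the program was compiled with -g. *)
    Printexc.print_backtrace stderr
```

The environment variable remains the less intrusive option because it needs no change to the source code.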

libsamplerate 0.1.3. http://www.mega-nerd.com/erikd/Blog/CodeHacking/SecretRabbitCode/rel_0_1_3.html About a week ago I released a new version of SecretRabbitCode (aka libsamplerate).

The major change was that the new improved SINC based converters I blogged about here are now the default. There were also a couple of minor bug fixes.

The fine people at Infinitewave have now updated their test results to include the new converter and it shows Secret Rabbit Code comes very close to the best of the commercial converters in terms of quality.

Cross Compiling with pkg-config. http://www.mega-nerd.com/erikd/Blog/CodeHacking/MinGWCross/pkg-config.html I'm currently playing with the MinGW cross compiler versions of the GNU C and C++ compilers available via apt-get on Debian and Ubuntu systems. These cross compilers generate Windows binaries from a Linux host system, which is potentially a much less painful way of turning FOSS code into binaries for that particularly odious legacy platform.

Most of the software I'm compiling uses the GNU tools (autoconf, automake, libtool and pkg-config) for configuring the software before compiling. Autoconf already has good support for cross compiling and automake and libtool just do what autoconf tells them to do. Pkg-config however is the odd one out.

Pkg-config's job is to retrieve information about installed libraries so that the compiler can find the required header files for inclusion and libraries for linking. For instance, if you wanted to compile a program that uses the gconf-2.0 library you could find out the required CFLAGS to be passed to the C compiler and required libraries for linking, by doing something like the following in the Makefile.


  GCONF_CFLAGS = $(shell pkg-config --cflags gconf-2.0)
  GCONF_LIBS = $(shell pkg-config --libs gconf-2.0)

In the above example, when pkg-config is run, it looks in the directory /usr/lib/pkgconfig/ and reads information from the file gconf-2.0.pc (each installed library should have one or more of these pkg-config files) which then gets printed out. While the information given by pkg-config would be correct for a native build, it is unlikely to be correct for the cross compiling case.

This issue came up as early as 2003 and there is even a wiki page which suggests some quite extensive changes to pkg-config. Unfortunately I think these suggestions are somewhat fragile and pkg-config itself (I'm using version 0.22) already has features for a better solution.

Like many Unix programs, pkg-config's behaviour can be modified by manipulating certain environment variables. The pkg-config man page explains these variables very well. The first one is PKG_CONFIG_LIBDIR which modifies the default location where pkg-config looks for its per installed library config file. Secondly, the PKG_CONFIG_PATH variable can be set to allow additional pkg-config search paths.

Overriding these two variables results in a MinGW cross pkg-config bash script which I have named i586-mingw32msvc-pkg-config and which looks like this:


  #!/bin/bash

  # This file has no copyright assigned and is placed in the Public Domain.
  # No warranty is given.

  # When using the mingw32msvc cross compiler tools, the native Linux
  # pkg-config executable works fine as long as the default PKG_CONFIG_LIBDIR
  # is overridden.
  export PKG_CONFIG_LIBDIR=/usr/i586-mingw32msvc/lib/pkgconfig

  # Also want to override the standard user defined PKG_CONFIG_PATH with
  # a mingw32msvc specific one.
  export PKG_CONFIG_PATH=$PKG_CONFIG_PATH_MINGW32MSVC

  # Now just execute pkg-config with the given command line args. Quoting
  # "$@" keeps arguments containing spaces intact.
  pkg-config "$@"

Now autoconf generated configure scripts that realise that the i586-mingw32msvc-gcc cross compiler is being used will run the above script and get suitable information for the cross compiler rather than the native compiler.

The only downside to this solution is that a separate script is required for each cross compiler which uses pkg-config. This however is a minor price to pay and it is unlikely that people will end up with huge numbers of XXXX-pkg-config scripts, as was common with XXXX-config scripts before the widespread use of pkg-config.

Until a better solution becomes available, this is what I will be using.

Progress on the Rabbit. http://www.mega-nerd.com/erikd/Blog/CodeHacking/SecretRabbitCode/progress.html For over three years now, I have been working on (on and off, but mostly off) a new algorithm for doing audio sample rate conversion in Secret Rabbit Code. The idea for the new algorithm has been rattling around in my head for most of that time, but the problem was always the implementation. While I am making progress it has been slow.

However, a public comparison between a large collection of converters showed that while the conversion quality of Secret Rabbit Code was good, it was nowhere near state of the art.

In order to see if I could get Secret Rabbit Code closer to state of the art quickly, I decided to revisit the existing converter during the xmas/new-year break.

The existing converter had a set of digital filters whose coefficients were generated by a small program written in GNU Octave. My first task was to convert that program to Ocaml which has become my favourite language for technical computing. I then spent quite a bit of time finding and analyzing where the filter design program was losing precision and finding workarounds. Finally, I spent even more time looking at how the different filter design parameters interact with one another and with the conversion algorithm itself.


Functional Programming and Testing. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/testing_ocaml.html I read quite a lot of programming related blogs, but it's rare for me to find one as muddle-headed as this one titled "Quality Begs for Object-Orientation" on the O'Reilly network.

The author, Michael Feathers, starts the post by mentioning that he is dabbling in Ocaml and then makes the assertion that:

"I think that most functional programming languages are fundamentally broken with respect to the software lifecycle."

Now I'm not too sure why he brings up software lifecycle, because all he talks about is testing. However, he does give an example in Java involving testing and wraps up his post by saying that his Java solution is difficult to do in Ocaml, Haskell and Erlang.

Feathers gets two things wrong. Firstly he seems to be writing Java code using Ocaml's syntax and then complains that Ocaml is not enough like Java. His conclusion is hardly surprising. Ocaml is simply not designed for writing Java-like object oriented code.

The second problem is his claim that testing in functional languages is more difficult than with Java. While this may be true when writing Java code with Ocaml's syntax, it is not true for the more general case of writing idiomatic Ocaml or functional code.

So let's look at the testing of Object Oriented code in comparison to Functional code.

With the object oriented approach, a bunch of data fields are bundled up together in an object and methods are defined, some of which may mutate the state of the object's data fields. When testing objects with mutable fields, it's important to test that the state transitions are correct under mutation.

By way of contrast, when doing functional programming, one attempts to write pure functions; functions which have no internal state and where outputs depend only on inputs and constants.

The really nice thing about pure functions is that they are so easy to test. The absence of internal state means that there are no state transitions to test. The only testing left is to collect a bunch of inputs that test for all the boundary conditions, pass each through the function under test and validate the output.
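
As a small illustration of that last point, here is a sketch of a table driven test for a pure function. The clamp function and the chosen test cases are hypothetical and not taken from any real project; the point is only that each test is a pair of input and expected output, with no state to set up or tear down.

```ocaml
(* Hypothetical pure function: limits x to the range lo .. hi. *)
let clamp ~lo ~hi x = max lo (min hi x)

(* Inputs chosen around the boundaries, paired with expected outputs. *)
let cases =
  [ (-1, 0) ;   (* just below the lower bound *)
    (0, 0) ;    (* exactly on the lower bound *)
    (5, 5) ;    (* comfortably inside the range *)
    (10, 10) ;  (* exactly on the upper bound *)
    (11, 10) ]  (* just above the upper bound *)

let () =
  (* Pass each input through the function and validate the output. *)
  List.iter
    (fun (input, expected) -> assert (clamp ~lo:0 ~hi:10 input = expected))
    cases ;
  print_endline "All clamp tests passed."
```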

Since testing pure functions is easier than testing objects with mutable state, I would suggest that assuring quality using automated testing is easier for functional code than for object oriented code. This conclusion directly contradicts the title of Feathers' blog post: "Quality Begs for Object-Orientation".

The lesson to be learned here is that if anyone with a purely Java background wants to learn Ocaml or any other functional language, they have to be prepared for a rather large paradigm shift. Old habits and ways of thinking need to be discarded. For Ocaml, that means ignoring Ocaml's object oriented and imperative programming features for as long as possible and attempting to write nothing but pure stateless functions.

Update : 2008-02-26 17:04

Conrad Parker posted this to reddit and the ensuing discussion was quite interesting.

Ocaml Snippet : Sqlite3. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/snip_sqlite.html One of the really nice things about using Ocaml on Debian and Ubuntu is the large number of really well packaged third party libraries.

Most of these libraries are also well documented, with doc strings extracted from the source code files using ocamldoc. However, the documentation for most Ocaml libraries is purely reference documentation and it's not always obvious how to use the library simply from reading the reference docs. What's really needed is example code to be read in conjunction with the reference docs.

I'm working on a program where I needed a small, fast, easy to administer database. With those requirements, Sqlite is really hard to beat and best of all, someone has already written Ocaml bindings. On Debian or Ubuntu, the Ocaml Sqlite bindings can be installed using:


  sudo apt-get install libsqlite3-ocaml-dev

In order to get a feel for using it and take my first steps into the world of SQL (which I'd had very minimal exposure to before now), I wrote a small program to test out the features provided by the library.

The following stand alone program should be taken as an example of how to access a Sqlite database from Ocaml. Since I am not an SQL expert, the actual SQL usage should be taken with a grain of salt.


  exception E of string

  let create_tables db =
      (* Create two tables in the database. *)
      let tables =
      [    "people", "pkey INTEGER PRIMARY KEY, first TEXT, last TEXT, age INTEGER" ;
          "cars", "pkey INTEGER PRIMARY KEY, make TEXT, model TEXT" ;
          ]
      in
      let make_table (name, layout) =
          let stmt = Printf.sprintf "CREATE TABLE %s (%s);" name layout in
          match Sqlite3.exec db stmt with
          |    Sqlite3.Rc.OK -> Printf.printf "Table '%s' created.\n" name
          |    x -> raise (E (Sqlite3.Rc.to_string x))
      in
      List.iter make_table tables


  let insert_data db =
      (* Insert data in both the tables. *)
      let people_data =
      [    "John", "Smith", 23;
          "Helen", "Jones", 29 ;
          "Adam", "Von Schmitt", 32 ;
          ]
      in
      let car_data =
      [    "bugatti", "veyron" ;
          "porsche", "911" ;
          ]
      in
      let insert_people (first, last, age) =
          (* Use NULL for primary key and Sqlite will generate a unique key. *)
          let stmt = Printf.sprintf "INSERT INTO people values (NULL, '%s', '%s', %d);"
                                     first last age
          in
          match Sqlite3.exec db stmt with
          |    Sqlite3.Rc.OK -> ()
          |    x -> raise (E (Sqlite3.Rc.to_string x))
      in
      let insert_car (make, model) =
          let stmt = Printf.sprintf "INSERT INTO cars values (NULL, '%s', '%s');"
                                     make model
          in
          match Sqlite3.exec db stmt with
          |    Sqlite3.Rc.OK -> ()
          |    x -> raise (E (Sqlite3.Rc.to_string x))
      in
      List.iter insert_people people_data ;
      List.iter insert_car car_data ;
      print_endline "Data inserted."


  let list_tables db =
      (* List the table names of the given database. *)
      let lister row headers =
          Printf.printf "    %s : '%s'\n" headers.(0) row.(0)
      in
      print_endline "Tables :" ;
      let code = Sqlite3.exec_not_null db ~cb:lister
                          "SELECT name FROM sqlite_master;"
      in
      (    match code with
          |    Sqlite3.Rc.OK -> ()
          |    x -> raise (E (Sqlite3.Rc.to_string x))
          ) ;
      print_endline "------------------------------------------------"


  let search_callback db =
      (* Perform a simple search using a callback. *)
      let print_headers = ref true in
      let lister row headers =
          if !print_headers then
          (    Array.iter (fun s -> Printf.printf "  %-12s" s) headers ;
              print_newline () ;
              print_headers := false
              ) ;
          Array.iter (Printf.printf "  %-12s") row ;
          print_newline ()
      in
      print_endline "People under 30 years of age :" ;
      let code = Sqlite3.exec_not_null db ~cb:lister
                                 "SELECT * FROM people WHERE age < 30;"
      in
      match code with
      |    Sqlite3.Rc.OK -> ()
      |    x -> raise (E (Sqlite3.Rc.to_string x))



  let search_iterator db =
      (* Perform a simple search. *)
      let str_of_rc rc =
          match rc with
          |    Sqlite3.Data.NONE -> "none"
          |    Sqlite3.Data.NULL -> "null"
          |    Sqlite3.Data.INT i -> Int64.to_string i
          |    Sqlite3.Data.FLOAT f -> string_of_float f
          |    Sqlite3.Data.TEXT s -> s
          |    Sqlite3.Data.BLOB _ -> "blob"
      in
      let dump_output s =
          Printf.printf "  Row   Col   ColName    Type       Value\n%!"  ;
          let row = ref 0 in
          while Sqlite3.step s = Sqlite3.Rc.ROW do
              for col = 0 to Sqlite3.data_count s - 1 do
                  let type_name = Sqlite3.column_decltype s col in
                  let val_str = str_of_rc (Sqlite3.column s col) in
                  let col_name = Sqlite3.column_name s col in
                  Printf.printf "  %2d  %4d    %-10s %-8s   %s\n%!"
                                 !row col col_name type_name val_str ;
                  done ;
              row := succ !row ;
              done
      in
      print_endline "People over 25 years of age :" ;
      let stmt = Sqlite3.prepare db "SELECT * FROM people WHERE age > 25;" in
      dump_output stmt    ;
      match Sqlite3.finalize stmt with
      |    Sqlite3.Rc.OK -> ()
      |    x -> raise (E (Sqlite3.Rc.to_string x))


  let update db =
      print_endline "Helen Jones has just turned 30, so update table." ;
      print_endline "Should now only be one person under 30." ;
      let stmt = "UPDATE people SET age = 30 WHERE " ^
                      "first = 'Helen' AND last = 'Jones';"
      in
      (    match Sqlite3.exec db stmt with
          |    Sqlite3.Rc.OK -> ()
          |    x -> raise (E (Sqlite3.Rc.to_string x))
          ) ;
      search_callback db


  let delete_from db =
      print_endline "Bugattis are too expensive, so drop that entry." ;
      let stmt = "DELETE FROM cars WHERE make = 'bugatti';" in
      match Sqlite3.exec db stmt with
      |    Sqlite3.Rc.OK -> ()
      |    x -> raise (E (Sqlite3.Rc.to_string x))


  let play_with_database db =
      print_endline "" ;
      create_tables db ;
      print_endline "------------------------------------------------" ;
      list_tables db ;
      insert_data db ;
      print_endline "------------------------------------------------" ;
      search_callback db ;
      print_endline "------------------------------------------------" ;
      search_iterator db ;
      print_endline "------------------------------------------------" ;
      update db ;
      print_endline "------------------------------------------------" ;
      delete_from db ;
      print_endline "------------------------------------------------"


  (* Program main. *)

  let () =
      (* The database is called test.db. Delete it if it already exists. *)
      let db_filename = "test.db" in
      (    try Unix.unlink db_filename
          with _ -> ()
          ) ;

      (* Create a new database. *)
      let db = Sqlite3.db_open db_filename in

      play_with_database db ;

      (* Close database when done. *)
      if Sqlite3.db_close db then print_endline "All done.\n"
      else print_endline "Cannot close database.\n"

The above code can be run as a script using:


  ocaml -I +sqlite3 sqlite3.cma unix.cma sqlite_test.ml

or compiled to a native binary using:


  ocamlopt -I +sqlite3 sqlite3.cmxa unix.cmxa sqlite_test.ml -o sqlite_test

When run, the output should look like this:


  Table 'people' created.
  Table 'cars' created.
  ------------------------------------------------
  Tables :
      name : 'people'
      name : 'cars'
  ------------------------------------------------
  Data inserted.
  ------------------------------------------------
  People under 30 years of age :
    pkey          first         last          age
    1             John          Smith         23
    2             Helen         Jones         29
  ------------------------------------------------
  People over 25 years of age :
    Row   Col   ColName    Type       Value
     0     0    pkey       INTEGER    2
     0     1    first      TEXT       Helen
     0     2    last       TEXT       Jones
     0     3    age        INTEGER    29
     1     0    pkey       INTEGER    3
     1     1    first      TEXT       Adam
     1     2    last       TEXT       Von Schmitt
     1     3    age        INTEGER    32
  ------------------------------------------------
  Helen Jones has just turned 30, so update table.
  Should now only be one person under 30.
  People under 30 years of age :
    pkey          first         last          age
    1             John          Smith         23
  ------------------------------------------------
  Bugattis are too expensive, so drop that entry.
  ------------------------------------------------
  All done.

GNU gcc and -Wmissing-prototypes. http://www.mega-nerd.com/erikd/Blog/CodeHacking/gcc_missing_prototypes.html Many people who code in C consider warning messages optional or, if they do enable warnings, use gcc's -Wall warning flag and leave it at that. However, there are a number of problems that gcc can warn about but doesn't unless it is specifically told to do so.

For example, consider a rather trivial example consisting of a main program file (main.c) like this:


  #include <stdio.h>
  
  #include "other.h"
  
  int
  main (void)
  {
      printf ("two cubed : %f\n", int_power (2.0, 3)) ;
      return 0 ;
  }

a second C file (other.c) like this:

  double
  int_power (int pow, double value)
  {
      double output = value ;
  
      for ( ; pow > 1 ; pow --)
          output *= value ;
  
      return output ;
  }

and the header file for the above C file (other.h) like this:

  double int_power (double value, int pow) ;

Simple.

Compiling this code at the command line can be done like this:


  gcc -Wall -Wextra main.c other.c -o program

which gives no warnings. However, when the resulting executable is run, it gives an obviously wrong result:


  two cubed : 0.000000

What the ..... ?

Looking at the code to this rather trivial example, it's pretty easy to figure out that the error is caused by the main program and the implementation of the function int_power disagreeing on the order of the two parameters.

In a more complicated real world situation, this can lead to seriously difficult to debug problems. The solution of course is to add the -Wmissing-prototypes flag to the gcc command line:


  gcc -Wall -Wextra -Wmissing-prototypes main.c other.c -o program

Now the compiler gives us a warning message:


  other.c:3: warning: no previous prototype for 'int_power'

To get rid of this warning, the file other.c should include other.h. When we do that, we get a compile error telling us that there is a conflict between the function implementation in other.c and the function prototype in other.h:


  other.c:6: error: conflicting types for 'int_power'
  other.h:1: error: previous declaration of 'int_power' was here

The fix of course is to make the implementation of int_power in other.c match the function prototype. Once that is done, the program compiles and even gives the correct result.

But we're not quite done yet. The behavior of the original broken code is slightly different when compiled with a C++ compiler. Compiling with g++:


  g++ -Wall -Wextra main.c other.c -o program

results in an error message:


  /tmp/cccTLc2H.o: In function `main':
  main.c:(.text+0x23): undefined reference to `int_power(double, int)'
  collect2: ld returned 1 exit status

So how does the C++ compiler know that something is wrong here when the C compiler didn't?

The most important thing to notice is that the error is produced by the linker. Secondly, one needs to remember that C++ (unlike C) allows function name overloading; that is, two (or more) functions can have the same name as long as they all have a unique (ordered) set of function argument types.

In the case above, the C++ linker (which may be the same as the C linker but behaves differently when linking C++ object files) knows the function called from main.c takes two parameters, a double followed by an int. However, the file other.c has a function of the same name but with the order of the parameters reversed, so it can't be used. Since there is no other function of that name, the linker gives an error.

Interestingly, the C++ compiler does not accept the -Wmissing-prototypes warning flag. Personally, I think it should, because obvious warnings from the parser stage of the compiler are an order of magnitude better than obscure error messages from the linker.

Finally, some C++ fan-boys might give this as an example of why C++ is a safer language than C. The question I would ask of those people is, "if you are so concerned with programming safety, why are you using C++ instead of Ocaml or Haskell?". I would also suggest that using a good C compiler like GNU gcc with every warning message you can find turned on is just as safe as running the same code through a C++ compiler.

]]>
A Simple Introduction to Parsing with Flex and Bison. http://www.mega-nerd.com/erikd/Blog/CodeHacking/flex_bison.html

On Friday night I gave a presentation at SLUG with the title above. Unfortunately the SLUG video recording people weren't there on the night so no video was captured. I am however making the slides and code available for download here. The code examples demonstrate a simple email date header parser written in both C and Ocaml. The C code is in five different stages so people can see how the parser was developed.

If anyone has any questions about the code, or more generally about the techniques of parsing, I'd be happy to discuss them on the SLUG coders mailing list.

]]>
Horses For Courses. http://www.mega-nerd.com/erikd/Blog/CodeHacking/horses_for_courses.html

In my day job, I work with a hardware engineer named Joe. A couple of months ago, he had to do some C coding to talk to a serial port and came to me for pointers. He was basically on the right track, but was using too many global variables, not checking the return values of system calls etc. These weren't horrible problems, but I explained why practices like this can lead to problems later, showed him better solutions to the same problems and introduced him to the gcc warning flags and Valgrind. He was very grateful for my help and was quick to pick up all the tips I'd given him.

More recently Joe came to me with another programming problem; he had to parse some numerical data out of a plain text log file. He already had about 60 lines of C code that opened a file based on a hard coded filename and he was starting to fiddle around with the fgetc function but was a little stuck on how to go further.

I had a look at his code and since Joe's a nice Irish chap, his predicament brought to mind a joke I once heard:

It is said that there was once an English motorist in Ireland who stopped his car to ask the way to Kilkenny. "Sure and to goodness," replied the Irishman, "If I wanted to go to Kilkenny, I wouldn't be starting from here."

The problem is that C is not a very good language for parsing text data. I told Joe that writing his log file parser in C could certainly be done, but that it would be painful, time consuming and error prone in comparison to other programming languages. So I told him that for this particular task Python would be a much better fit and asked him if he'd like me to teach him the basics of Python. Joe's no dummy; he agreed without hesitation.

First up I showed him the Python Tutorial, the Python Module index and how to use Google Groups advanced search to find Python specific answers from the comp.lang.python Usenet group. I then showed him the basic hello world program in Python:


  #!/usr/bin/python

  print ("Hello world!")

Over the next hour we built up a good portion of his program. We used the sys module to get the file name of the log file to parse, the built-in Python file handling functions and the regular expression module. We even used a list comprehension to remove outliers from his data set.
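Joe's actual program isn't shown here, but a minimal sketch of the same ingredients (sys, file handling, a regular expression and a list comprehension) might look something like the following. The log line format, the READING_RE pattern, the function names and the outlier rule are all assumptions made up for illustration, not Joe's code.

```python
import re
import sys

# Hypothetical log line format: "<timestamp> reading=<number>".
READING_RE = re.compile(r"reading=(-?\d+(?:\.\d+)?)")


def parse_readings(lines):
    """Return every numeric reading found in an iterable of text lines."""
    readings = []
    for line in lines:
        match = READING_RE.search(line)
        if match:
            readings.append(float(match.group(1)))
    return readings


def remove_outliers(values, max_dev):
    """Keep only the values within max_dev of the median."""
    if not values:
        return []
    median = sorted(values)[len(values) // 2]
    # The list comprehension does the actual filtering.
    return [v for v in values if abs(v - median) <= max_dev]


def main(argv=None):
    """Parse the log file named on the command line and print the readings."""
    argv = sys.argv if argv is None else argv
    if len(argv) != 2:
        print("Usage : %s <logfile>" % argv[0])
        return 1
    with open(argv[1]) as logfile:
        readings = parse_readings(logfile)
    print(remove_outliers(readings, max_dev=10.0))
    return 0
```

Adding a call to main () at the bottom of the file turns this into a script that takes the log file name as its only argument.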

In the end we had about 30 lines of Python code that was very much closer to Joe's end goal than his original 60 lines of C code. Joe was really, really impressed with how easy Python was in comparison to C. It was at this point that I warned Joe about the Blub Paradox. He was well aware that when he only knew C, C was his first choice for this programming task. However, now that he knows Python as well, he'll be able to pick between C and Python depending on the task. I also told him that many Python programmers see Python as the ultimate programming language and are really Blub programmers even if they don't know it.

In my own programming I'm currently using:

  • C : The first language I really became proficient in. It's great for low level hacking, good for libraries and wherever speed of execution is an important aspect. I still love C even if it does seem rather archaic in comparison to the others.
  • C++ : I'm not real keen on C++ even though I do use it for most of the coding I do at my day job. As a language it's incredibly complex and unforgiving. There are a million ways to shoot yourself in the foot and no one lives long enough to experience them all. It's a language that can harbor the most obscure bugs that can be exceedingly difficult to track down. This is a language that as far as I am concerned is long past its use-by date.
  • Python : Python is a great language for teaching and a great language for small scripts. Some people also use it for bigger projects like Bzr but for me personally, I find its dynamic type system too problematic for use on larger projects.
  • Ocaml : This is my current favorite for general purpose programming. It can be run as a script just like Python, but can also be compiled to a really fast native binary. It uses strict static type checking to find coding errors at compile time and run time array bounds checking. It has variant types and pattern matching which are powerful constructs that programmers who have only used mainstream languages like C, C++, Java, Perl, Python etc could only dream about.
  • Erlang : Lately I've been learning Erlang, both for a project of my own and for a project at work. I'm learning it because it does parallel, concurrent and distributed programming better than any of the above, probably better than any other language in existence. It's also been pretty easy to pick up because, like Ocaml, it's a functional programming language. Like Python, it uses dynamic type checking, but also has quite a lot of static checking in the compiler.

So, with the above languages at my disposal, I can match a programming language to the task at hand. For numerical and mathematical programming I use Ocaml, for low level programming I use C and from now on, for multi-threaded and concurrent programming I will choose Erlang.

More importantly, correctly matching the language to the problem should make the task of developing a solution to the problem far easier than using an inappropriate language.

]]>
Learning Erlang. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Erlang/learning_erlang.html

The decision has been made: I'm going to learn the Erlang programming language. The main reason for this decision is that Erlang does one thing better than any other programming language I am aware of: parallel, concurrent and distributed processing.

The big problem with parallel and concurrent processing in other languages is that the standard method of communication between threads is shared data protected by mutexes or semaphores, which are difficult to get right when there are a lot of threads or a lot of data to be protected. This standard solution to the problems of dealing with parallelism simply doesn't scale well.

Erlang excels at parallel processing because it forgoes the use of semaphores, mutexes and other synchronisation primitives. It replaces these shared data synchronisation methods with message passing; a much simpler mechanism which is much easier to reason about and much harder, maybe even impossible, to get wrong.
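To give a rough feel for the message passing style in a language used elsewhere on this blog, here is a small Python sketch. It is only an analogy: Python threads with queue.Queue mailboxes stand in for Erlang processes, the queue hides its own locking internally, and the names worker and square_all are made up for this example.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive numbers as messages and reply with their squares."""
    while True:
        message = inbox.get()
        if message is None:      # None is used here as the "stop" message
            break
        outbox.put(message * message)


def square_all(numbers):
    """Send each number to a worker thread and collect the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(inbox, outbox))
    thread.start()
    for n in numbers:
        inbox.put(n)
    inbox.put(None)
    thread.join()
    # A single worker answers in the order the messages arrived.
    return [outbox.get() for _ in numbers]
```

The point of the design is that the two threads never touch a shared variable directly; all they share are the two mailboxes.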

When learning a new language, my usual approach is to write lots of small demo programs, with each one demonstrating a different feature. These programs come in really useful later as an easy to reference catalogue of language features.

Here's my first complete Erlang program, which takes any number of integer parameters on the command line and prints the factorial of each one:


  #!/usr/bin/env escript

  -export ([main/1]).

  % Naive factorial function.
  fac (0) -> 1 ;
  fac (N) when N > 0 -> N * fac (N - 1).

  % Function to print the factorial of each list element.
  print_fact_list ([]) -> ok ;

  print_fact_list ([Head | Tail]) ->
      % Convert the Head from a string to an int.
      Int = list_to_integer (Head),
      % Calculate the factorial.
      Fact = fac (Int),
      % Print the result.
      io:format ("fac ~w : ~w~n", [Int, Fact]),
      % Call the function recursively with the tail of the list.
      print_fact_list (Tail).

  % Main function, accepts a list of strings containing argv [1], argv [2] etc.
  main (List) ->
      case length (List) of
          0 -> io:format ("Usage : factorial.erl <number>\n") ;
          _ -> print_fact_list (List)
      end.

To me, this Erlang code looks a little like Ocaml and a little like Prolog which I used briefly at university over a decade ago. A couple of things to note:

  • Comments begin with the percent character.
  • All variable names have a leading upper case letter.
  • Functions can be defined multiple times and pattern matching is used to decide which function variant is called.
  • Strings are stored as lists of characters and converted to integers using the list_to_integer function.

Running this program requires Erlang, which on Debian and Ubuntu means the packages erlang, erlang-base, erlang-dev and erlang-manpages. It also uses escript, which comes standard with Erlang R11b4 and can be obtained here for earlier versions (Ubuntu Feisty has R11b2). Escript allows Erlang code to be run as a script, just like Python or Ruby.

When passed the numbers 10, 20 and 30, the program produces the following output:


  fac 10 : 3628800
  fac 20 : 2432902008176640000
  fac 30 : 265252859812191058636308480000000

Yep, Erlang uses arbitrary precision integers by default. That's pretty cool.

]]>
Tridge Was Right. http://www.mega-nerd.com/erikd/Blog/CodeHacking/tridge_was_right.html

At Linux.conf.au 2005, Tridge gave a keynote talk about some of the issues the Samba team had run into when designing Samba4. While discussing the problems of writing a complex server which has to serve multiple simultaneous requests, he put up a series of three slides. The first said:


Threads suck!

Having used OS level threads in the past, I was in complete agreement with this. Sharing data across threads, and locking and unlocking that data to make sure the accesses are safe, is simply too difficult for mere mortals to get right in anything other than trivial cases.

Tridge's second slide said:


Processes suck!

Splitting multi-threaded code into multiple processes fixes the locking problems by removing the ability of the processes to share data (ignoring IPC shared memory of course). Obviously, for a server program like Samba, this is not a solution.

The third slide in the series said:


State machines suck!

At the time of Tridge's keynote, I didn't really appreciate what he was saying about state machines.

The idea behind the state machine approach is really quite simple; everything is done in a single process so no locking is required. All I/O is multiplexed using the Unix select system call and a state machine keeps track of the state of all of the I/O channels.

The problem with this is that any blocking I/O operation must be replaced with a non-blocking operation. Failure to do this will mean that a single I/O call that blocks will prevent the servicing of all other I/O operations until the blocked operation decides to complete and return control to the state machine.

However, the state machine model does work relatively well for simple examples. Unfortunately, non-blocking I/O leads to a second problem; writing code to do non-blocking I/O is significantly more difficult than for regular blocking I/O.
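The shape of such a single process, select driven server can be sketched in a few lines of Python using the standard selectors module. This is a toy under stated assumptions, not the C++ code discussed in this post: the "protocol" (one newline terminated request per connection, answered in upper case) and the name serve_requests are invented for illustration, and the per channel state is nothing more than the bytes received so far.

```python
import selectors
import socket


def serve_requests(connections):
    """Serve one request per connection from a single thread."""
    selector = selectors.DefaultSelector()
    buffers = {}
    for conn in connections:
        conn.setblocking(False)          # no call below may block
        selector.register(conn, selectors.EVENT_READ)
        buffers[conn] = b""

    pending = len(connections)
    while pending:
        # Sleep until at least one channel has data waiting.
        for key, _events in selector.select():
            conn = key.fileobj
            chunk = conn.recv(4096)
            buffers[conn] += chunk
            complete = buffers[conn].endswith(b"\n")
            if complete:
                # Simplification: a real server would also have to queue
                # writes and wait for EVENT_WRITE before sending.
                conn.sendall(buffers[conn].upper())
            if complete or not chunk:    # request done, or peer closed early
                selector.unregister(conn)
                pending -= 1
    selector.close()


if __name__ == "__main__":
    client, server = socket.socketpair()
    client.sendall(b"hello\n")
    serve_requests([server])
    print(client.recv(4096))
    client.close()
    server.close()
```

Even in this toy, the write side is glossed over; doing it properly means more state per channel, which is exactly where the extra difficulty comes from.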

In my day job I've been working on some C++ classes which talk to a web server using HTTP POST operations over a keep-alive connection. This code had a couple of requirements:

  • Must be non-blocking to fit in with the rest of the code.
  • Must be capable of HTTPS connections using OpenSSL (which is particularly nasty to get working in non-blocking mode).
  • Must be able to connect via an HTTP proxy in both HTTP and HTTPS modes.
  • Must be able to detect a connection that gets broken and gracefully re-establish it.

I now have code that fits these requirements and a pretty comprehensive test suite. With this experience behind me I have to say that getting this working was a royal pain in the neck. I also agree with Tridge; state machines suck almost as much as threads.

Maybe it's time for me to learn Erlang.

]]>
Lazy Lists. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/lazy_lists.html

Lazy evaluation is a default feature of the Haskell programming language and an optional feature of Ocaml. Most programming languages (Ocaml, C, C++, Perl, Python, Java etc) use eager evaluation, where a result specified by a line of code is calculated as soon as the program gets to that line. Lazy evaluation, on the other hand, defers the calculation of a result until that result is needed.

The real beauty of lazy evaluation is that a result that is never used is never evaluated. Lazy evaluation also allows the specification of lists which are effectively infinite, as long as the programmer doesn't actually try to access every element in the list. Obviously, attempting to do so would take infinite time and require infinite memory to actually hold the list :-).

While searching for information on Ocaml's lazy programming features I came across a post at the enchanted mind blog. That post is ok, but the code is just snippets and when put together as it is, doesn't actually work.

After a bit of fiddling around, I managed to get it working. However, once I understood it, I didn't think the example was as good as it could be. Firstly, the input to the lazy list is just a standard finite length Ocaml list, but more importantly it doesn't give any idea of how to do a potentially infinite list which is a much more interesting case.

That left the field open for a nice blog post demonstrating lazy lists in Ocaml. Read on.

Anybody who has done high school or higher mathematics would probably have come across recurrence relations, the best known of which is the Fibonacci sequence.

The Fibonacci sequence is often used as an example for teaching the concept of recursion in computer science (even if some people think there are better examples). The Fibonacci sequence can be expressed recursively in Ocaml like this:


  let rec fibonacci n =
      match n with
      |    1 -> 1
      |    2 -> 1
      |    x -> (fibonacci (n - 1)) + (fibonacci (n - 2))

If one wanted to generate a list containing say the first 20 Fibonacci numbers using the above recursive function, the same values would be calculated over and over. Even a single call to fibonacci 20 calculates the 18th number twice, the 17th number three times, the 16th number five times and so on. It's simply not efficient.
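The amount of repeated work is easy to measure. The following Python sketch mirrors the recursive function above but also counts how often each value of n is evaluated; the calls counter and the count_calls helper are additions made purely for this illustration.

```python
from collections import Counter


def fibonacci(n, calls):
    """Naive recursive Fibonacci that records every evaluation of n."""
    calls[n] += 1
    if n <= 2:
        return 1
    return fibonacci(n - 1, calls) + fibonacci(n - 2, calls)


def count_calls(n):
    """Return (result, per-argument call counts) for one top level call."""
    calls = Counter()
    result = fibonacci(n, calls)
    return result, calls


if __name__ == "__main__":
    result, calls = count_calls(20)
    print("fibonacci 20 =", result)
    print("total calls  =", sum(calls.values()))
    for n in (19, 18, 17, 16):
        print("fibonacci %d evaluated %d times" % (n, calls[n]))
```

These counts are for one call only; generating a whole list by calling the function once per element repeats the work again for every element.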

A better solution is to use a lazy list, which calculates new values of the sequence as they are needed, based on entries already in the list. Here's an example that creates a lazy list of the Fibonacci numbers:


  type lazy_fib_t =
      Node of int * lazy_fib_t Lazy.t

  let create_fib_list () =
      let rec fib_n minus_2 minus_1 =
          let n = minus_1 + minus_2 in
          Printf.printf "fib_n %d %d -> %d\n" minus_2 minus_1 n ;
          Node (n, lazy (fib_n minus_1 n))
      in
      lazy (Node (1, lazy (Node (1, lazy (fib_n 1 1)))))

  let print_fib_list depth lst =
      let rec sub_print current remaining =
          if current > depth then ()
          else
          match Lazy.force remaining with
          |    Node (head, tail) ->
                  Printf.printf "%3d : %d\n" current head ;
                  sub_print (current + 1) tail
      in
      sub_print 0 lst

  let _ =
      let fib_list = create_fib_list () in
      print_fib_list 4 fib_list ;
      print_endline "------------" ;
      print_fib_list 6 fib_list ;

This is a complete working Ocaml program. To run it, just save the text to a file, say "lazy_fib.ml" and then do:


  ocaml lazy_fib.ml

We'll look at the output in detail later. First let's break it down; looking at the program, from top to bottom we have:


  type lazy_fib_t =
      Node of int * lazy_fib_t Lazy.t

The above two lines define a recursive type called lazy_fib_t, which has a single variant called Node which contains a tuple of an integer and the head of a lazy list.


  let create_fib_list () =
      let rec fib_n minus_2 minus_1 =
          let n = minus_1 + minus_2 in
          Printf.printf "fib_n %d %d -> %d\n" minus_2 minus_1 n ;
          Node (n, lazy (fib_n minus_1 n))
      in
      lazy (Node (1, lazy (Node (1, lazy (fib_n 1 1)))))

The function above, create_fib_list, creates a lazy list. It also contains an internal function, fib_n, which we'll look at later. The last line of the function is where all the magic is; it creates three nodes of a lazy list, the first two containing the first two integers of the Fibonacci sequence and a third node which is a closure, containing a call to the internal function fib_n with the correct parameters to generate the next number in the sequence.

The internal function fib_n takes two parameters, the values of the sequence for n - 2 and n - 1. From these two values, it generates the value for n, prints a message and then constructs a new Node containing the value for n and a lazy evaluation for the next value.

The next function is the function which prints the elements of a lazy list from index zero up to and including a given depth. It looks like this:


  let print_fib_list depth lst =
      let rec sub_print current remaining =
          if current > depth then ()
          else
          match Lazy.force remaining with
          |    Node (head, tail) ->
                  Printf.printf "%3d : %d\n" current head ;
                  sub_print (current + 1) tail
      in
      sub_print 0 lst

The print_fib_list function contains an internal function sub_print which is called with a current depth of zero and the head of the lazy list to be printed. The internal function recursively moves down the list until current is greater than depth, which causes the recursion to complete and unwind.

At each node of the lazy list where current is less than or equal to depth, the function forces the evaluation of the node. The forcing will only evaluate a node if it hasn't already been evaluated. Once the node has been force evaluated, the value is printed and the function is called recursively.

Finally, the main function of the program is this:


  let _ =
      let fib_list = create_fib_list () in
      print_fib_list 4 fib_list ;
      print_endline "------------" ;
      print_fib_list 6 fib_list ;

All it does is call the function create_fib_list, print the elements of the list at indices 0 to 4 (the first five Fibonacci numbers), print a dashed line and then print the elements at indices 0 to 6 (the first seven). It's important to note that the print function is called with the same list on both occasions.

When the program is run, the output should look like this:


    0 : 1
    1 : 1
  fib_n 1 1 -> 2
    2 : 2
  fib_n 1 2 -> 3
    3 : 3
  fib_n 2 3 -> 5
    4 : 5
  ------------
    0 : 1
    1 : 1
    2 : 2
    3 : 3
    4 : 5
  fib_n 3 5 -> 8
    5 : 8
  fib_n 5 8 -> 13
    6 : 13

As can be seen above, the first time the print function is called, the fib_n closure is called for all values of n greater than one. Each time fib_n is called a new node is generated in the list. When the print function is called the second time, fib_n is only called for values that weren't evaluated on the first call to the print function, just as expected.

One of the few problems with the above implementation is that it uses Ocaml's native integers, which on 32 bit CPU platforms are only 31 bits wide. It would however be relatively easy to use Ocaml's Big_int module which provides arbitrary length integers.

]]>
Xtreme Numerical Accuracy. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/xtreme_numerical_accuracy.html

I'm working on a digital filter design program in Ocaml which was suffering from some numerical issues with Ocaml's native 64 bit floats. The problem was that the algorithm operates on both large floating point numbers and small floating point numbers. These numbers eventually end up in a matrix, and I then use Gaussian elimination to solve a set of simultaneous equations.

Anyone who has done any numerical computation will know that adding large floating point numbers to small floating point numbers is a recipe for numerical inaccuracy. For me, the numerical issues were screwing things up badly.

When faced with a problem like this there are two possible solutions:

  • Do all the computations symbolically, only substitute numbers at the very last stage and then be careful to process the numerical parts in a way that minimizes rounding and truncation error.
  • Replace the floating point operations with operations on a number type which can provide higher (and maybe even arbitrary) precision.

The first option, doing all the computations symbolically was not practical due to the complexity of the computation. That left only the second option.

Looking around for what was available for Ocaml, I found the contfrac project on Sourceforge. As all the math geeks (hi Mark) have probably guessed by now, contfrac expresses numbers in terms of a really cool mathematical concept called continued fractions.

The idea is that any number can be represented by a (potentially infinite) list of integers [ a0 ; a1, a2, a3, ...]. Given the list of integers, the number itself can be calculated using:

  $$x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cdots}}}$$

All rational numbers have a finite length continued fraction expansion. For example, the rational number 75/99 is expressed as [ 0 ; 1, 3, 8 ].
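The expansion of a rational number is easy to compute by repeatedly splitting off the integer part and inverting the remainder. The Python sketch below does this with exact fractions; it is not how the contfrac module works internally, and the function names cf_expand and cf_evaluate are made up for this example.

```python
from fractions import Fraction


def cf_expand(x):
    """Return the continued fraction terms [a0, a1, ...] of a rational x."""
    terms = []
    while True:
        whole = x.numerator // x.denominator   # integer part (floor) of x
        terms.append(whole)
        x -= whole                             # what is left is in [0, 1)
        if x == 0:
            return terms
        x = 1 / x                              # invert the remainder


def cf_evaluate(terms):
    """Rebuild the rational number from its continued fraction terms."""
    value = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        value = a + 1 / value
    return value


if __name__ == "__main__":
    print(cf_expand(Fraction(75, 99)))
    print(cf_evaluate([0, 1, 3, 8]))
```

Because the remainder of a rational number eventually reaches zero, the loop in cf_expand always terminates, which is the finite length property mentioned above.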

Not surprisingly, all the irrational numbers have infinite length continued fraction expansions. The surprising thing (for me at least) is that many of the irrational numbers have CF expansions that are remarkably regular. The square root of two is expressed as [ 1 ; 2, 2, 2, ...] with an infinitely repeating list of 2s. The base of the natural logarithm, e, is expressed as [ 2 ; 1, 2, 1, 1, 4, 1, 1, 6, ...] which again has a regular pattern, as does the golden ratio, [ 1 ; 1, 1, 1, ...]. While all the previous CF expansions have a degree of regularity, the expansion of pi is [ 3 ; 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1, 15, 3, 13,...], which looks completely random.

With numbers expressed as continued fractions, the Ocaml contfrac module then implements addition, subtraction, multiplication and division. Once the four arithmetic operations are defined, contfrac then implements a number of trigonometric and transcendental functions in terms of the same continued fractions.

Unfortunately, the module doesn't implement everything I need so I'm going to have to hack on some extra functionality. The actual Ocaml implementation uses Ocaml's lazy lists which is an aspect of Ocaml I hadn't played with yet. Time for some fiddling with lazy lists.

]]>
GNU gcc Stack Protection. http://www.mega-nerd.com/erikd/Blog/CodeHacking/gcc_stack_protect.html

Wow, this is new. Version 4.1 of the GNU gcc compiler shipped with Ubuntu Feisty includes stack smashing protection by default!

Consider the following code containing a buffer overflow of a stack based buffer:


    #include <stdio.h>

    static void
    kill_my_stack (void)
    {
        char buffer [10] ;
        int k ;

        for (k = 0 ; k < 20 ; k++)
            buffer [k] = 'a' + k ;
    } /* kill_my_stack */

    int
    main (void)
    {
        kill_my_stack () ;
        return 0 ;
    } /* main */

Compiling this with the default gcc compiler in Feisty produces an executable which when run gives the following error:


    *** stack smashing detected ***: /home/erikd/stack-protect-demo terminated
    Aborted

Obviously, for an error as simple as this even basic static analysis should find it, but we know that the vast majority of people don't use static analysis. In fact many don't even compile with a sensible set of compiler flags turned on. Well now, those people are protected from themselves.

]]>
Spectrogram Fun! http://www.mega-nerd.com/erikd/Blog/CodeHacking/spectrogram_fun.html

Inspired by the spectrograms used in the SRC Comparison, I decided to write a program that generates similar spectrograms from any given sound file. The program is now basically working and when run over a full song (the song "Vehicle" by the band "Golden Section") it produced this, which I think is quite beautiful:


[Secret Rabbit Code sweep test]

The program is written in C and uses libsndfile (of course) for reading the sound file, FFTW for generating the spectrum data and the wonderful Cairo library for the image generation back-end.

I intend to release the code for this under the GPL as soon as I can clean it up a bit, add handling for multi-channel files and improve the command line option handling.

]]>
SRC Comparison. http://www.mega-nerd.com/erikd/Blog/CodeHacking/SecretRabbitCode/src_compare.html

One of my Free Software projects is Secret Rabbit Code, aka libsamplerate, aka the Rabbit, a library for performing sample rate conversion (Wikipedia) on audio signals. Recently, a company in Canada did a comparison of a number of sample rate converters in professional audio software and also included the Rabbit in that test.

The tests were carried out by generating an input signal at a sampling rate of 96 kHz, configuring each sample rate converter to do a conversion from 96 kHz input sample rate to 44.1 kHz output sample rate, passing the input signal through each converter and capturing each converter's output. The input test signal was a sine wave which sweeps from a low frequency of about 100 Hz at the start to a frequency of 44.1 kHz at the end. Finally, a spectrogram was generated from each output signal.

The spectrogram of the output of Secret Rabbit Code's Best Sinc converter looks like this:


[Secret Rabbit Code sweep test] [Color key]

The spectrogram shows time in seconds along the x-axis and frequency in Hertz along the y-axis. The colour indicates the signal strength at each point in time and frequency, with white being the strongest signal (0 decibels) and black being the weakest signal (-180 decibels).

The tricky thing about the sample rate conversion process is that for any given sample rate fs, the highest frequency signal that can be correctly represented is at fs/2. When sample rate converting from 96 kHz to 44.1 kHz, all frequencies above half of the destination sample rate must be removed during the conversion process. Failure to do so will result in audio distortion and noise in the output signal.
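To put numbers on this, the short Python sketch below computes the fs/2 limit and the frequency at which a tone would reappear (alias) if it were sampled without being filtered out first. The aliasing formula is the standard folding rule for an ideal sampler and the function names are made up for this example; it says nothing about how any particular converter is implemented.

```python
def nyquist_limit(sample_rate):
    """Highest frequency (in Hz) representable at the given sample rate."""
    return sample_rate / 2.0


def alias_frequency(freq, sample_rate):
    """Frequency (in Hz) at which an unfiltered tone shows up after sampling.

    Frequencies at or below sample_rate / 2 come back unchanged; anything
    higher is folded back into the range 0 .. sample_rate / 2.
    """
    return abs(freq - sample_rate * round(freq / sample_rate))


if __name__ == "__main__":
    print(nyquist_limit(44100))
    print(alias_frequency(10000, 44100))
    print(alias_frequency(30000, 44100))
```

A 30 kHz tone left in the signal would therefore land in the middle of the audible band of the 44.1 kHz output, which is why the converter has to remove it.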

Looking at the spectrogram of the Rabbit's output, it's easy to see that the main sweep (in bright white) clearly goes from some low frequency at the start to 22.05 kHz (half of the output sample rate) at 5 seconds. After about 5 seconds, the input signal's sine wave frequency goes above half the destination sample rate and the Rabbit does the correct thing and almost completely removes it.

The rest of the colour in the spectrogram is an artifact of the conversion process but by referencing the colour scale, it's possible to confirm that all of these artifacts are 100 decibels below the level of the main signal. Ideally they shouldn't be there at all, but if they are, they should be as low as possible.

Anyone who has read this far can now go to the comparison page, pick any two converters and compare them. They can also confirm for themselves that although the Rabbit (Best Sinc) wasn't the best converter among the ones tested (that award would have to go to r8brain and iZotope), it certainly didn't disgrace itself either. A number of the commercial converters in expensive software packages (like Sony Vegas and Digital Performer) didn't perform all that well in comparison.

The good news is that the existence of commercial closed source converters that are better than the Rabbit gives me some incentive to come up with a better converter for inclusion in the Rabbit.

]]>
The Size of 'cp' (Update). http://www.mega-nerd.com/erikd/Blog/CodeHacking/size_of_cp_update.html

André Pang read my blog post about the size of the compiled Haskell 'cp' executable and suggested that something was wrong. So, I looked at it again.

My laptop is running Ubuntu Edgy and for some reason Edgy installs version 6.4.2 of the Glasgow Haskell Compiler. I also have a desktop machine running Debian Testing which has version 6.6 of ghc.

Sure enough, ghc 6.6 generates a 255 kilobyte executable which is a huge improvement over the 1.5 megabyte executable produced by version 6.4.2.

]]>
The Size of 'cp'. http://www.mega-nerd.com/erikd/Blog/CodeHacking/size_of_cp.html

Conrad Parker blogged recently showing some simple examples in Haskell. I've been wanting to learn Haskell for a while so I took special interest in Conrad's post. For instance, the program implementing the basic functionality of the Unix cp in Haskell is small and extremely elegant:


  import System.Environment

  main = do
      [infile, outfile] <- getArgs
      s <- readFile infile 
      writeFile outfile s

However, on my machine (i686 laptop running Ubuntu Edgy), the generated executable is 1.5 megabytes in size even after being stripped. By way of contrast, the /bin/cp executable written in C is 56 kilobytes. WTF?

So let's look at the Ocaml version:


  let _ =
      let srcfile = open_in Sys.argv.(1) in
      let destfile = open_out Sys.argv.(2) in
      let maxlen = 8192 in
      let str = Bytes.create maxlen in
      let count = ref 1 in
      while !count > 0 do
          count := input srcfile str 0 maxlen ;
          output destfile str 0 !count ;
          done

This is pure imperative code and doesn't use any of the functional language features of Ocaml, but it compiles to a 79 kilobyte stripped executable. Compared to the C executable, the Ocaml executable is 40% bigger and the Haskell one is 2500% bigger.

Obviously, the size of the executable is not the only determining factor in choice of programming language, but Haskell's executables do seem unreasonably large.

Update here.

]]>
Non-recursive Automake. http://www.mega-nerd.com/erikd/Blog/CodeHacking/nonrecursive_automake.html A lot of people (yeah, you know who you are) bitch about Automake and the associated tools like autoconf and libtool. While I do agree that these tools have problems and limitations, they are also a better solution to the problem than any of the alternatives I have looked at.

The thing I really like about automake is that it does automatic dependency checking so that if the file foo.cc includes foo.h which includes bar.h which includes baz.h and baz.h changes, automake knows that foo.cc needs to be recompiled. Manually keeping track of dependencies like these is a royal pain in the neck and getting it wrong can lead to really obscure Heisenbugs; for example, two C++ object files disagreeing on the parameter list of a method of a class.

I've had a number of projects that have used automake for years. However, all of these projects used the traditional recursive make scheme where there is a Makefile.am in each directory of the source tree. I continued to do it this way with automake even after reading Peter Miller's excellent paper Recursive Make Considered Harmful, but with hand written Makefiles, I usually took Peter's advice.

My first test of a non-recursive automake solution was for a project I'm doing at work. The project started out with a standard single top level non-recursive Makefile which handled the compiling of about 150 C++ source files which compiled to a couple of static convenience libraries, a main executable and a couple of test programs.

The big problem with the existing standard Makefile was that it didn't properly encode dependencies and hence I often had to do a "make clean" followed by a make to get the thing built correctly. Fixing this issue was the prime motivator for moving to automake.

One slightly unusual aspect of the project was the way project specific internal include files were referenced within the project. As a result of the project having been developed with a single top level Makefile from the beginning, all hash includes within the project are of the form:


  #include <path/to/header.h>

with "path" being a directory in the same top level directory as the Makefile. What this means is that no source file which includes a header from within the project should need any extra project internal include path other than "-I.". This means that the resulting compile lines produced by automake (and libtool if it is in the picture) are considerably shorter than they would have been otherwise.

So, what does the Makefile.am look like?

Here's an example non-recursive Makefile.am which is basically a stripped down version of the one I'm using on my project but with some extra comments. Anyone who has hacked a Makefile.am before should be able to understand what is going on.


  # Tell automake to put the object file for apple/apple.c in dir apple/
  AUTOMAKE_OPTIONS := subdir-objects

  # The installable executable.
  bin_PROGRAMS = apple/apple

  # Couple of python scripts used during the build.
  EXTRA_DIST = apple/version_create.py apple/tests/test_wrapper.py

  # Convenience libraries required during build.
  noinst_LTLIBRARIES = lib/libcore.la apple/pip/libpip.la apple/libapple.la

  # All the project related headers required for building.
  noinst_HEADERS = $(lib_libcore_includes) $(apple_pip_libpip_includes) \
      $(libapple_includes)

  # Test programs that will not be installed.
  noinst_PROGRAMS = apple/tests/skin_test lib/pip/test/pip_test

  # A couple of autogenerated header files.
  nodist_include_HEADERS = apple/version.h

  DISTCLEANFILES = apple/version.h

  #=========================================================
  # libcore : The core library routines.

  lib_libcore_includes = \
      lib/red.h lib/green.h lib/blue.h

  lib_libcore_la_SOURCES = $(lib_libcore_includes) \
      lib/red.cc lib/green.cc lib/blue.cc

  #=========================================================
  # libpip : All the pips.

  apple_pip_libpip_includes = \
      apple/pip/cat.h apple/pip/dog.h apple/pip/mouse.h

  apple_pip_libpip_la_SOURCES = $(apple_pip_libpip_includes) \
      apple/pip/cat.cc apple/pip/dog.cc apple/pip/mouse.cc

  #=========================================================
  # libapple : Everything in the application except main.cc.

  libapple_includes = \
      apple/granny.h apple/smith.h apple/johnathon.h

  apple_libapple_la_SOURCES = $(libapple_includes) \
      apple/granny.cc apple/smith.cc apple/johnathon.cc

  #=========================================================
  # apple : The application.

  apple_apple_SOURCES = apple/main.cc apple/version.h

  apple_apple_LDADD = apple/libapple.la lib/libcore.la \
      apple/pip/libpip.la $(EXT_A_LIBS) $(EXT_B_LIBS)

  #=========================================================
  # Test programs.

  apple_tests_skin_test_SOURCES = apple/tests/skin_test.cc
  apple_tests_skin_test_LDADD = lib/libcore.la apple/libapple.la

  lib_pip_test_pip_test_SOURCES = lib/pip/test/pip_test.cc
  lib_pip_test_pip_test_LDADD = lib/libcore.la apple/pip/libpip.la \
      $(EXT_A_LIBS) $(EXT_B_LIBS)

  check : $(noinst_PROGRAMS)
      ./apple/tests/skin_test
      ./lib/pip/test/pip_test
      $(top_srcdir)/apple/tests/test_wrapper.py
      @echo
      @echo "All tests passed."
      @echo

  #=========================================================
  # Autogenerated files and their dependencies.

  apple/main.o : apple/version.h
  apple/version.h : this_file_does_not_exist
      $(top_srcdir)/apple/version_create.py $@

  .PHONY : this_file_does_not_exist

The thing that surprised me most about converting this project to automake was how easy it was and how well it worked. I also immediately noticed that the autogenerated non-recursive make seemed to run a lot faster than I was used to with recursive make, but that is one of the benefits mentioned in Peter's paper.

Since this was such a success I'm going to look into applying this to some of my other projects.

]]>
Autoconf and #ifdef Considered Harmful. http://www.mega-nerd.com/erikd/Blog/CodeHacking/autoconf_ifdef.html I recently got an email from someone suggesting that my example for detecting the presence of libsamplerate was somewhat problematic. The crux of the complaint is that my configure.ac snippet ends up setting the HAVE_SAMPLERATE variable to 0 or 1 and that most people tend to use something like this in their C or C++ code:


    #ifdef  HAVE_SAMPLERATE
         /* Some code which uses libsamplerate here. */
    #endif

Obviously with HAVE_SAMPLERATE defined to 0 or 1, this code is not going to work as the developer expected. Instead, they should be using something like:


    #if  HAVE_SAMPLERATE
         /* Some code which uses libsamplerate here. */
    #endif

I know that the former idiom is more common, but I chose the second method for a reason; I believe that it is more robust. The problem of course is that autoconf and its related tools are fragile, but the standard idiom certainly doesn't help.

Consider the following code (from a bug I found in XMMS):


    #ifndef  WORDS_BIGENDIAN
         /* Some little endian machine specific code. */
    #endif

The bug resulted from an error in configure.ac where the author had forgotten to invoke the autoconf macro which sets the WORDS_BIGENDIAN variable. The result was that this code compiled perfectly and ran perfectly on little endian machines. On big endian machines, it compiled perfectly and under certain circumstances failed badly causing really horrible noises to come out of my headphones.

Now consider my version where WORDS_BIGENDIAN gets set to either 0 or 1. In this case, if the author forgot to invoke the autoconf macro, WORDS_BIGENDIAN would have been undefined and this code:


    #if  (WORDS_BIGENDIAN == 0)
         /* Some little endian machine specific code. */
    #endif

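would have failed to compile on both big and little endian machines, provided the compiler is told to treat undefined identifiers in #if as errors (with GCC, "-Wundef" together with "-Werror"). Without that, the preprocessor quietly treats an undefined identifier as 0.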

It also doesn't take a genius to see that this is a symptom of a larger and much more common problem. In fact my solution is a classic example of moving something up Rusty's spectrum of interface simplicity. The original code was at position 11 on the scale (follow common convention and you'll get it wrong) and my alternative is at position 1 on the scale (compiler/linker won't let you get it wrong).

]]>
Monads Made Easy (and Hard). http://www.mega-nerd.com/erikd/Blog/CodeHacking/monads_made_easy.html For some time I've been rather keen on learning the Haskell programming language. The big problem for me was that when I started out trying to solve particular problems I quickly ran into a strange catch-22. Haskell uses a concept called Monads but it seemed to me that in order to understand Haskell one needs to understand Monads and in order to understand Monads, one needs to understand Haskell.

There are however numerous tutorials and explanations on Monads.

I've looked at many of these tutorials but never managed to get Monads. It's not that I can't understand difficult concepts or that I can't handle weird ass programming languages. The problem was that all these tutorials explained Monads from the point of view of people who already understood Monads.

Anyway, my difficulty with Monads is at an end. I've just found an explanation of Monads called Of monads and spacesuits written by Eric Kow. It explains Monads using astronauts, space stations and space suits. I finally get it!

And now that I do, I can play with Monads in Ocaml and also go ahead and learn Haskell.

However, I will not be using Monads in C++ because the C++ implementation is just too damn weird. It makes a concept that can already be difficult even harder as well as making it unreadable. A pox on C++.

]]>
Ocaml : Exception Backtraces. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/exception_backtraces.html There's a paper dated December 2002 by Kevin Murphy where he explains why he was looking at Ocaml. That article was recently linked on programming.reddit.com and there was a comment complaining that Ocaml couldn't print out backtraces on exceptions. Someone posted later that this was not right, but I've heard this complaint often enough that I thought I should blog about how to do it.

First off, Ocaml has two compilers, one which produces bytecode and one which produces native binaries. The native code compiler is not currently able to produce exception backtraces and this is where the Reddit commenter got the idea. However, there is a patch in the Ocaml bug tracker which adds backtrace capabilities. I'm hoping that this goes into the compiler proper in the next release or two.

For a project that is currently compiling with ocamlopt (the native code compiler), changing to the bytecode compiler is as simple as editing the Makefile and replacing all invocations of "ocamlopt" with "ocamlc -g" where the "-g" turns on exception backtraces. You can then rebuild the application. The final step is to turn on backtraces in the bytecode run time environment which is done by setting an environment variable:


  export OCAMLRUNPARAM="b1"

Once compiled to bytecode and with the environment variable set, the application can be run and should produce the required backtrace. The following is an example of a backtrace from something I'm working on at the moment (I hacked the code to make sure I could get one).


  Fatal error: exception Invalid_argument("index out of bounds")
  Raised by primitive operation at unknown location
  Called from file "meyers_diff.ml", line 93, characters 1-31
  Called from file "meyers_diff.ml", line 200, characters 10-52
  Called from file "meyers_diff.ml", line 221, characters 16-60
  Called from file "meyers_diff.ml", line 264, characters 11-148
  Called from file "meyers_diff.ml", line 305, characters 17-50
  Called from file "array.ml", line 130, characters 31-51
  Called from file "meyers_diff.ml", line 323, characters 1-316

Obviously it would be nicer if function names were included here, but this is more than sufficient for debugging purposes.
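Later Ocaml releases (3.11 onwards) also let a program switch on backtrace recording itself through the standard Printexc module, instead of relying on the environment variable. The following is a minimal sketch and not code from the project above; nth_item and run are made-up names, and the program still needs to be compiled with "-g" for the trace to contain anything useful.

```ocaml
(* Hypothetical helper : an array access that can go out of bounds. *)
let nth_item (arr : int array) (i : int) : int = arr.(i)

(* Returns the exception message if the lookup fails, None otherwise. *)
let run () : string option =
    (* Ask the run time to record backtraces, as OCAMLRUNPARAM does. *)
    Printexc.record_backtrace true ;
    try
        ignore (nth_item [| 1 ; 2 ; 3 |] 10) ;
        None
    with Invalid_argument msg ->
        (* Print whatever backtrace was recorded for this exception. *)
        prerr_string (Printexc.get_backtrace ()) ;
        Some msg

let () =
    match run () with
    |   Some msg -> Printf.printf "caught : %s\n" msg
    |   None -> print_endline "no exception"
```

Catching the exception and printing the trace by hand like this is only one option; leaving the exception uncaught gives the same kind of "Fatal error" report shown above.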

]]>
On Design Patterns. http://www.mega-nerd.com/erikd/Blog/CodeHacking/design_patterns.html

The book "Design Patterns: Elements of Reusable Object-Oriented Software" by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides is one of the most well known books about the concept of design patterns; the idea of codifying generic solutions to recurring programming problems. The book is so well known that the authors came to be dubbed the Gang Of Four. When programmers talk of the Gang of Four they are usually talking about this book and these authors.

Unfortunately, even though I do own a copy I simply haven't managed to find the time to actually read it. As a coder whose two favorite languages are Ocaml and C this book is not directly applicable to my coding in those languages, especially since I rarely feel the need to use Ocaml's OO features. However, I do code in C++ and Python, both in my own time and at work and I suspect that reading and understanding this book would help my coding in both those languages.

So while I may not have read the actual book I do like to read articles and blog entries about the book. One that recently came to my notice was an article by Mark Dominus which spawned a response from one of the Gang of Four authors which in turn spawned another article from Mark Dominus.

Mark Dominus' original article and his follow up suggest that design patterns arise to make up for weaknesses in the languages they are implemented in. In particular, patterns that exist in one language may disappear completely in a more advanced language which replaces the pattern with something far simpler and not worthy of being called a "pattern".

I really do hope that I find time to read the book because it will be interesting to make up my own mind on this issue. It will be especially interesting to see how many of the patterns in this book apply in Ocaml.

]]>
N Squared Problems. http://www.mega-nerd.com/erikd/Blog/CodeHacking/n_squared_problems.html Important note to self : When working on an algorithm which displays N squared (or cubed or worse) behavior in Big Oh Notation, fiddling with the algorithm is useless if N is large. The only ways to improve performance are to reduce the complexity to O(N log N) or O(N) or alternatively, find a way to drastically reduce the size of N.

This problem came up while working on my current main non-day-job application which will be released under the GPL when it's ready. Even on reasonable input data sets, the value of N was well into the tens of thousands. If any significant processing needed to be done, the total time blew out really badly. This was simply not acceptable.

With many problems, processing can be traded off against memory usage. Unfortunately, for this problem, memory usage was also O(N^2) and I didn't consider it acceptable for my application to chew up hundreds of megabytes of memory for reasonably small data sets.

What did finally work for this problem was a divide and conquer trick that allowed me to reduce N by two orders of magnitude and do more work on the smaller chunks of data instead.
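To get a feel for why cutting N down works so well, here is a back of the envelope sketch. It assumes an algorithm whose cost is exactly N squared units of work and a data set that splits cleanly into equal sized, independent chunks. The function names and the figures (an N of 20,000 and 100 chunks) are invented for illustration and are not taken from the application described above.

```ocaml
(* Units of work for an algorithm that is quadratic in its input size. *)
let quadratic_work n = n * n

(* The same algorithm run separately on each of a number of equal sized chunks. *)
let chunked_work n chunks =
    let chunk_size = n / chunks in
    chunks * quadratic_work chunk_size

let () =
    (* Assumed figures, chosen only to make the arithmetic easy to follow. *)
    let n = 20_000 in
    let chunks = 100 in
    Printf.printf "whole data set : %d units\n" (quadratic_work n) ;
    Printf.printf "in %d chunks  : %d units\n" chunks (chunked_work n chunks)
```

Under these assumptions, reducing N by two orders of magnitude cuts the total work by a factor of 100 even though every item still gets processed. If the chunks are handled one at a time, the quadratic memory needed at any moment shrinks by a factor of 10,000.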

I'll elaborate on this issue and its solution in the paper I'm writing for OSDC 2006. I'll also be submitting a paper proposal for Linux.conf.au 2007.

]]>
Ocaml : Code for Variant Types and Pattern Matching. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/variant_types_code.html Since my blog post on Ocaml's Variant Types and Pattern Matching I've had two requests for the code, so here it is:


type expr_t =
    |  Var of string
    |  Plus of expr_t * expr_t
    |  Times of expr_t * expr_t
    |  Divide of expr_t * expr_t


let rec simplify expr =
    match expr with
        |   Times (a, Plus (b, c)) ->
                Plus (simplify (Times (a, b)), simplify (Times (a, c)))
        |   Divide (Divide (a, b), Divide (c, d)) ->
                Divide (simplify (Times (a, d)), simplify (Times (b, c)))
        |   anything -> anything (* Comment : default case *)


let rec str_of_expr expr =
    match expr with
        |   Var v -> v
        |   Plus (a, b) ->
                "(" ^ (str_of_expr a) ^ "+" ^ (str_of_expr b) ^ ")"
        |   Times (a, b) ->
                (str_of_expr a) ^ "*" ^ (str_of_expr b)
        |   Divide (a, b) ->
                (str_of_expr a) ^ " / " ^ (str_of_expr b)


let _ =
    let expr = Times (Var "x", Plus (Var "y", Var "z")) in
    Printf.printf "  orig : %s\n" (str_of_expr expr) ;
    let expr = simplify expr in
    Printf.printf "  new  : %s\n" (str_of_expr expr)

The above code is a single self contained program; to run that program, save it to a file named, say, "cas.ml" then run it (assuming you have ocaml installed) using the command "ocaml cas.ml" which should result in the following output:


  orig : x*(y+z)
  new  : (x*y+x*z)

Obviously, this is just a demo, but it should be pretty clear that this code could easily be extended to something more useful.

]]>
Ocaml : Variant Types and Pattern Matching. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/variant_types.html At last month's SLUG meeting, Mark Greenaway asked if anybody knew of any good Computer Algebra Systems (CAS) available under Linux. I spoke up and told him that I looked around for the same thing some time ago, couldn't find anything that I liked so I ended up writing something that fit my particular needs from scratch.

Later that night I was talking to Robert Collins and Andrew Cowie about stretch languages; languages that differ radically from the languages a programmer already knows so that learning the new language teaches them new programming concepts and paradigms.

For me, my last stretch language was Ocaml which I started using around mid 2004. I discovered Ocaml when I went looking for a language to implement my Computer Algebra System (CAS) in. I did do a trial implementation in C, but that was simply too much of a pain in the neck. I also knew that C++ was not a sufficiently big step away from C to be useful for this project. My other main language at the time was Python, but I knew Python's dynamic typing would make my life difficult.

It was at about this time that I asked my friend André Pang to suggest a language. André had recently given a talk at SLUG titled "Beyond C, C++, Python and Perl" and seemed to know about a whole bunch of different languages. I told him that I was looking for something that was strongly typed, statically typed, had garbage collection for memory management and had Object Oriented (OO) features.

One of André's suggestions was Java which I was already familiar with. However, I disliked the fact that Java does not allow one to write code outside of a class and Java was also a little too verbose for my tastes. He also tried to push Haskell, but Haskell doesn't have traditional OO features. In retrospect this wouldn't have been a problem, but at the time I rejected Haskell for this reason. However, his final suggestion was Ocaml which seemed to fit all of my requirements. While investigating Ocaml I found a small example on the Ocaml Tutorial that implemented a bare bones CAS.

The two things that make Ocaml really great for CAS are variant types and pattern matching on these variants. I'll look at these separately.

Variant Types in Ocaml.

Ocaml's variant types are a little like a type safe, bullet proof version of unions in C and C++. In Ocaml one defines a variant type like this:

type expr_t =
    |  Var of string
    |  Plus of expr_t * expr_t
    |  Times of expr_t * expr_t
    |  Divide of expr_t * expr_t

So, here we have a type named expr_t (a mathematical expression) that can hold one of four things:

  • A variable name which is just a string.
  • A plus node with a left sub expression and a right sub expression.
  • A times node with a left sub expression and a right sub expression.
  • A divide node with a numerator sub expression and a denominator sub expression.

All of the sub expressions are of the same type, expr_t, which makes this a recursive type. Using this recursive variant type, an expression like "x * (y + z)" can be built like this:


let expr = Times (Var "x", Plus (Var "y", Var "z")) ;;

which results in a tree structure with each operator and our variables x, y and z being held in a node of type expr_t and represented by a circle in this diagram:

[Diagram: expression tree]

with the variable expr being the Times node at the top of the diagram.

The really nice thing about variants is that each instance knows which variant it is. That means that it's not possible (by mistake or on purpose) to access a node of one variant as another variant. The Ocaml compiler simply won't let that happen.

Compare this to a C version using unions where the programmer has to make sure he/she accesses each instance correctly, or the acres of code required to do the same thing with objects in C++ or Java.

Pattern Matching on Variants.

So once we can construct a mathematical expression we would also want to print it out. That's where Ocaml's pattern matching comes in. Here's a function to convert any expression tree into a string representation that can be printed.

let rec str_of_expr expr =
    match expr with
        |   Var v -> v
        |   Plus (a, b) ->
                "(" ^ (str_of_expr a) ^ "+" ^ (str_of_expr b) ^ ")"
        |   Times (a, b) ->
                (str_of_expr a) ^ "*" ^ (str_of_expr b)
        |   Divide (a, b) ->
                (str_of_expr a) ^ " / " ^ (str_of_expr b)

The function is called str_of_expr and the "rec" just before the function name means that the function is recursive. The function takes a single parameter of type expr_t and returns a string.

The "match expr with" on the next line is a bit like a switch statement in C, C++, Java and other languages. On the lines following the match there are four options, one for each of the variants of the expr_t type. So for instance, if the expr is a Var variant the function just returns the string that is held by Var and if it's a Plus node with two sub expressions, a and b, then the function is called on each of the sub expressions and a string is built using Ocaml's string concatenation operator "^".
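Another small function in the same style shows the compiler's checking at work. This is only a sketch and the name vars_of_expr is made up for the example. The string held by a Var node can only be reached by matching on the Var constructor, and if one of the four variants were left out of the match the compiler would warn that the match is not exhaustive.

```ocaml
type expr_t =
    |  Var of string
    |  Plus of expr_t * expr_t
    |  Times of expr_t * expr_t
    |  Divide of expr_t * expr_t

(* Hypothetical example : list the variable names, left to right. *)
let rec vars_of_expr expr =
    match expr with
        |   Var v -> [v]
        |   Plus (a, b)
        |   Times (a, b)
        |   Divide (a, b) ->
                (* All three binary nodes are handled the same way. *)
                vars_of_expr a @ vars_of_expr b

let () =
    let expr = Times (Var "x", Plus (Var "y", Var "z")) in
    print_endline (String.concat ", " (vars_of_expr expr))
```

Grouping the three binary variants under one right hand side works because each of them binds the same names, a and b, with the same types.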

The above usage of pattern matching is pretty simple and can be done almost as easily in other languages. So let's look at something a little more complicated.

More Advanced Pattern Matching.

One of the many things one might want to do in a CAS is apply mathematical transforms to an expression. For instance, we might want to be able to expand out our expression above, "x * (y + z)", to give "(x * y + x * z)". Fortunately, using Ocaml's advanced pattern matching this is really easy. Here's an example:

let rec simplify expr =
    match expr with
        |   Times (a, Plus (b, c)) ->
                Plus (simplify (Times (a, b)), simplify (Times (a, c)))
        |   Divide (Divide (a, b), Divide (c, d)) ->
                Divide (simplify (Times (a, d)), simplify (Times (b, c)))
        |   anything -> anything (* Comment : default case *)

The function simplify has two transformations and a default case which does nothing. Again, the function is recursive, but the first two match cases match on much more complex expression trees. In fact, the first match case will exactly match our expression and generate the expression we're after, "(x * y + x * z)".
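One thing to be aware of is that simplify only looks at the top node of the tree it is given. For an expression such as "w + x * (y + z)" the top node is a Plus, so the default case fires and nothing changes. A possible way around that, sketched below under the assumption that only the expansion rule is wanted, is to rewrite the sub expressions first and then the node itself. The name expand_all is made up for this sketch.

```ocaml
type expr_t =
    |  Var of string
    |  Plus of expr_t * expr_t
    |  Times of expr_t * expr_t
    |  Divide of expr_t * expr_t

(* Hypothetical helper : apply the expansion rule at every node in the tree. *)
let rec expand_all expr =
    (* First rewrite the children of this node. *)
    let expr =
        match expr with
            |   Var _ -> expr
            |   Plus (a, b) -> Plus (expand_all a, expand_all b)
            |   Times (a, b) -> Times (expand_all a, expand_all b)
            |   Divide (a, b) -> Divide (expand_all a, expand_all b)
    in
    (* Then try the rule on the node itself. *)
    match expr with
        |   Times (a, Plus (b, c)) ->
                Plus (expand_all (Times (a, b)), expand_all (Times (a, c)))
        |   other -> other

let () =
    let expr = Plus (Var "w", Times (Var "x", Plus (Var "y", Var "z"))) in
    let expected =
        Plus (Var "w", Plus (Times (Var "x", Var "y"), Times (Var "x", Var "z")))
    in
    Printf.printf "expanded as expected : %b\n" (expand_all expr = expected)
```

The price is that every node gets visited at least once, which is fine for small trees like these.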

Obviously, to make a real Computer Algebra System requires quite a bit more than what I have here. However, as you can see, Ocaml's variant types and pattern matching are a perfect fit for the problems a programmer writing a CAS would face. In fact, few other languages, with the possible exception of Haskell, would have fit this problem as well.

]]>
libsndfile 1.0.17. http://www.mega-nerd.com/erikd/Blog/CodeHacking/libsndfile/rel_17.html I've just released a new version of libsndfile. The changes are:

  • Add a C++ header file which acts as a wrapper around libsndfile's C API.
  • Fix the documentation which describes using the precompiled windows DLL with evil ware compilers.
  • Minor bug fixes and cleanups.
]]>
Why I Don't Like Dynamic Typing. http://www.mega-nerd.com/erikd/Blog/CodeHacking/dynamic_typing.html I've just spent the last week or so coding in Python, a language I first picked up about eight years ago. For a number of years, I used Python quite extensively but have found in the last year or so that I'm tending to favor Ocaml for many of the tasks where I used to use Python.

A large part of the reason for this switch has to do with type checking; the process where the compiler or run time environment checks whether operations on data held in variables are valid for the type of that data. For instance, the concatenation operation is probably valid for strings and lists, but simply doesn't make sense for integers.

The big difference between Ocaml and Python is that Ocaml uses static type checking while Python uses dynamic type checking. For those too lazy to follow the links, static typing means that type and related correctness checking of the program is done by the compiler at compile time while languages that use dynamic typing do a large proportion of this checking at run time.

Here's a Python example that bit me just today:


  try:
      data = my_obj.read (1024)
  except:
      print "Read on '%s' failed" & my_obj.name ()

The error is that I had an ampersand (the '&' character) in the last line where I should have had a percent symbol. The above code has a fatal bug but ran perfectly well for hours until the first time that my object's read method threw an exception. It was then that the print statement was type checked and that's when the program exited with the following error message:


  TypeError: unsupported operand type(s) for &: 'str' and 'str'

This particular error is typical of a whole class of errors that can exist in dynamically typed programs [0] but may never show up until the program is in the hands of a user. Personally, I think programs blowing up like this in the hands of users is unacceptable. Unfortunately, it's also extremely common; so common that most regular computer users would have experienced things like this at least once. To me, this is a failure of the discipline of software engineering.

Many defenders of dynamic typing say that typing errors can be picked up in a test suite. While I am a huge advocate of rigorous testing and test driven development, I'm also way too lazy to write tests for every single code path including the exception handler for every single try/except block. Especially when there is a better way.

My work with Python coincided with the publishing of an article on the Register's Developer site titled Mathematical Approaches to Managing Defects. This article also included a section on Formal Methods where the programming environment and process uses mathematical proofs to prove that a piece of code conforms to its specification.

A question asked in the article was "is proof more effective than testing for industrial scale programs?". The answer according to a company specializing in high integrity/reliability software was:

"... that 'proof appears to be substantially more efficient at finding faults than the most efficient testing phase'. This implies, of course, that you use both proof and testing on the project, where each technique is appropriate (even though proof is more cost-effective at finding some errors than testing is at finding other errors, proof may not be able to find all errors)."

It then struck me that compile time type checking built into the programming language is, in effect, a certain level of proof-like correctness testing. Every error that is found at compile time is one less error that can occur at run time. What's more, the stronger the type system, the higher the level of correctness testing applied.

Unfortunately, not all statically typed languages are equal. Languages like C and C++ are statically typed, but both have loopholes (like pointers, casting and automatic type conversion) which allow the programmer to bypass the type checking system and introduce bugs. Languages like Ocaml, Haskell and Ada are statically typed but with fewer loopholes. For instance, in Ocaml, there are no pointers, no implicit type conversions and the only typing loopholes are the Marshal module and the low-level Obj module. If the programmer avoids both, Ocaml's type system is bullet proof.
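For comparison, here is a sketch of the failing Python line from earlier written in Ocaml. The function name read_failure_message is invented for the example. Ocaml's Printf functions have their format strings checked by the compiler, so passing the wrong number or type of arguments is rejected before the program ever runs. An operator typo like the one above fails to compile as well, because the integer "and" operator land cannot be applied to strings.

```ocaml
(* Hypothetical example : build the error message for a failed read. *)
let read_failure_message (name : string) : string =
    Printf.sprintf "Read on '%s' failed" name

(* Neither of the following lines gets past the compiler :
       Printf.sprintf "Read on '%s' failed" 42     (an int where a string is wanted)
       "Read on '%s' failed" land name             (land only works on ints)
*)

let () =
    print_endline (read_failure_message "foo.wav")
```

The point is not that this code path never needs testing, only that this particular class of mistake is caught without a test having to reach it.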

As programmers we have to decide whether the current situation where our programming languages allow us to shoot ourselves in the foot is OK or whether we need to aim higher. A first obvious step towards better software for users is choosing better programming languages; languages that protect us as programmers from our own weaknesses. Using such languages means that the programmer can spend less time manually checking for bugs (something computers are better at anyway) and more time thinking about algorithms, design and implementation issues; the stuff that computers can't do.



[0] Obviously, Python is not the only language in common usage that uses dynamic typing. Others include Actionscript (one of the most evil languages I've ever had the misfortune to use), Javascript, Perl, Tcl, Ruby and a host of others.
Java on the other hand is statically typed with some loopholes and a run time which does some run time checking for things like casting operations. It is therefore more type safe than C and C++, but not as type safe as Ocaml, Haskell and Ada.
]]>
C++ Wrapper for libsndfile, Part 4. http://www.mega-nerd.com/erikd/Blog/CodeHacking/libsndfile/cpp_wrapper4.html I got some email from Jaq who said:

"You don't actually give a good reason for not wanting a close() method on SndfileHandle; your reasons for not wanting open() are good, but surely a handle needs a way of closing itself? Or does the API overload object deletion with the closing of the handle too?"

So first of all, I should have stated that my current candidate header file is available here. The SndfileHandle object is designed to be used with the Resource Acquisition Is Initialization pattern. In particular, the object will close the file and release all allocations when it goes out of scope. For instance:

  {
      SndfileHandle file ("foo.wav") ;

      // Do something with file which gets closed automatically
      // when file goes out of scope.
  }

So what's the problem with the close() method? Well, it's very similar to the problems with the open() method. Let's look at an example:

  SndfileHandle file1 ("foo.wav") ;

  // Make file2 == file1
  SndfileHandle file2 = file1 ;

  // Close file1
  file1.close () ;

Obviously, the handle associated with file1 should be closed, but what about file2? Should that be closed or remain open?

The fact that it's not obvious means that it's best left out. If anyone really wants to make sure that a SndfileHandle is closed they can do:

  SndfileHandle file1 ("foo.wav") ;

  // Make file2 == file1
  SndfileHandle file2 = file1 ;

  // Close file1
  file1 = SndfileHandle () ;

In this case it's a little more obvious that file1 and file2 now refer to different handles.

]]>
C++ Wrapper for libsndfile, Part 3. http://www.mega-nerd.com/erikd/Blog/CodeHacking/libsndfile/cpp_wrapper3.html Thinking about the C++ wrapper continues. In the current candidate version, a SndfileHandle contains a pointer to a private reference counted struct which contains the actual data. The SndfileHandle class also has a copy constructor and an assignment operator.

In addition to the above, a number of people on the mailing list have asked for the SndfileHandle class to have open() and close() methods. This seems reasonable on the face of it, but Daniel Schmitt points out that the combination of copy/assign and open/close results in a rather strange ambiguity.

Daniel gives the following example using copy/assign (ie no open/close methods):

  SndfileHandle file1 ("foo.wav") ;

  // Make file2 == file1
  SndfileHandle file2 = file1 ;

  // Now reuse file1
  file1 = SndfileHandle("bar.wav");

At the end of that block of code we now have file1 and file2 operating on different handles, which is exactly what any reasonable programmer would expect.

Now look at what happens if we have open/close methods:

  SndfileHandle file1 ("foo.wav") ;

  // Make file2 == file1
  SndfileHandle file2 = file1 ;

  // Now reuse file1
  file1.open ("bar.wav") ;

The open method can be implemented in one of the following two ways :

  • Decrease the reference count on what file1 used to operate on and then initialize a new one.
  • Modify the data that both SndfileHandles point to.

After the block of code above, the two different implementations would result in the following state :

  • file1 and file2 refer to different handles
  • file2 now refers to "bar.wav" which is even worse
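
The two options can be sketched side by side. This is only an illustration under assumed names: Handle, open_rebind and open_in_place are hypothetical and do not exist in the wrapper, and the "file" is reduced to a path string so the effect on the second handle is easy to see.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical handle, for illustration only; not the real SndfileHandle.
class Handle
{
    struct Data
    {   const char *path ;
        int ref ;
    } ;

    Data *p ;

    // Assignment is not needed for this sketch, so it is declared private
    // and left unimplemented.
    Handle & operator = (const Handle &) ;

    void release ()
    {   if (p != NULL && --p->ref == 0)
            delete p ;
        p = NULL ;
    }

    static Data * make (const char *path)
    {   Data *d = new Data ;
        d->path = path ;
        d->ref = 1 ;
        return d ;
    }

public :
    explicit Handle (const char *path) : p (make (path)) {}
    Handle (const Handle & other) : p (other.p) { p->ref ++ ; }
    ~Handle () { release () ; }

    // Option 1: drop our reference and start a fresh, unshared one.
    void open_rebind (const char *path)
    {   release () ;
        p = make (path) ;
    }

    // Option 2: overwrite the shared data, silently changing every copy.
    void open_in_place (const char *path)
    {   p->path = path ;
    }

    const char * path () const { return p->path ; }
} ;

// Path seen through file2 after file1 is reopened using the chosen option.
static const char * file2_after_reopen (bool in_place)
{   Handle file1 ("foo.wav") ;
    Handle file2 = file1 ;

    if (in_place)
        file1.open_in_place ("bar.wav") ;
    else
        file1.open_rebind ("bar.wav") ;

    return file2.path () ;
}
```

With the first option file2 still reports "foo.wav"; with the second it reports "bar.wav" even though nobody touched file2.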

Obviously, the second implementation is completely wrong, but the first implementation is at least questionable. In terms of providing something which balances utility and consistency I'm tending to favor the idea of keeping the copy/assign operations and not providing open/close methods.

C++ Wrapper for libsndfile, Part 2. http://www.mega-nerd.com/erikd/Blog/CodeHacking/libsndfile/cpp_wrapper2.html After yesterday's blog post a guy in Germany, Daniel Schmitt, piped up on the libsndfile mailing lists and insisted I reconsider the C++ wrapper class' copy/assign issue. My big problem with copy/assign was that they would not behave the way people might reasonably expect them to.

Daniel's major contribution was renaming the class from Sndfile to SndfileHandle. Once that is done, having a copy constructor and an assignment operator using reference counting makes sense. With the class name containing the word "Handle", the name now fits the behavior. This is such a minor and seemingly trivial change but I simply didn't see it.

Thanks Daniel. Brilliant!

A C++ Wrapper for libsndfile. http://www.mega-nerd.com/erikd/Blog/CodeHacking/libsndfile/cpp_wrapper.html Over the years I've received a bunch of emails saying stuff like "why did you write libsndfile in that old fashioned C language instead of nice modern shiny C++?". Obviously anyone who even thinks something like this is too ignorant of C to be a good C++ programmer. A competent C++ programmer needs to know and be comfortable with the whole of the C language as well as the whole of C++.

At the time I started work on libsndfile in 1998 I was writing far more C++ code than C code. However, back then, the GNU C++ compiler was nowhere near as good as it is today and I thought a C library interface was a safer bet than C++. In retrospect, I believe the decision of using C was spot on, for the following reasons:

  • The GNU C++ ABI (Application Binary Interface) has changed at least two times since 1998. The C ABI hasn't changed once in the same time frame.
  • The level of C++ support by the GNU C++ compiler has been changing so much that using C++ for libsndfile would have been a huge pain. For instance, code that compiles without warnings with g++-3.3 can spit out a huge number of warnings for any later version of the compiler.
  • Since 1999, the use of other high level languages has blossomed. Writing wrappers for these high level languages around C libraries is usually significantly easier than writing wrappers for C++ libraries.

However, some people do prefer C++ to C and many of those would probably be writing their own C++ wrapper. Since the vast majority of these wrappers would be largely the same, it makes sense for me to distribute a C++ wrapper with libsndfile.

I decided on the following set of criteria for the wrapper:

  • It must be lightweight. The wrapper should be a single header file with all methods being inlined.
  • It should not depend on iostream.
  • It should avoid the Standard Template Library.

It does however use templates for the read/write/readf/writef methods:

  template <typename T> sf_count_t read   (T *ptr, sf_count_t items) ;
  template <typename T> sf_count_t readf  (T *ptr, sf_count_t frames) ;
  template <typename T> sf_count_t write  (const T *ptr, sf_count_t items) ;
  template <typename T> sf_count_t writef (const T *ptr, sf_count_t frames) ;

with explicit specializations for types short, int, float and double.
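
As a rough sketch of how a member template with explicit specializations can dispatch on the sample type: the Reader class, the count_t typedef and the stub_read_* functions below are all hypothetical stand-ins (for the wrapper class, sf_count_t and the per-type C read functions respectively), and only two of the four types are shown.

```cpp
#include <cassert>

typedef long count_t ;  // Hypothetical stand-in for sf_count_t.

// Hypothetical stubs standing in for the per-type C functions. Each one
// fills the buffer with a recognisable value and reports items "read".
static count_t stub_read_short (short *ptr, count_t items)
{   for (count_t k = 0 ; k < items ; k++)
        ptr [k] = 1 ;
    return items ;
}

static count_t stub_read_float (float *ptr, count_t items)
{   for (count_t k = 0 ; k < items ; k++)
        ptr [k] = 0.5f ;
    return items ;
}

class Reader
{
public :
    // Declared for every T but only defined for the specialized types, so
    // calling it with an unsupported type such as char fails at link time.
    template <typename T> count_t read (T *ptr, count_t items) ;
} ;

// Explicit specializations must live at namespace scope; marking them
// inline lets them sit in a header.
template <> inline count_t Reader::read (short *ptr, count_t items)
{   return stub_read_short (ptr, items) ;
}

template <> inline count_t Reader::read (float *ptr, count_t items)
{   return stub_read_float (ptr, items) ;
}

// Returns true if each call reached the stub for its own sample type.
static bool dispatch_ok ()
{   Reader r ;
    short sbuf [4] ;
    float fbuf [2] ;

    return r.read (sbuf, 4) == 4 && sbuf [3] == 1
        && r.read (fbuf, 2) == 2 && fbuf [0] == 0.5f ;
}
```

The caller just writes read (buffer, count) and the compiler picks the specialization from the pointer type, with no iostream or Standard Template Library involved.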

It also explicitly makes the copy constructor and assignment operator private. The problem with these two is that two objects wrapping the same SNDFILE* pointer will not give the expected behavior. With the C version:

  SNDFILE *file1 = sf_open (filename, ...);
  SNDFILE *file2 = file1 ;

anyone reading the code can see that file1 and file2 are two pointers pointing to the same object. The code reader knows what behavior to expect here.

Now contrast this with the C++ version:

  Sndfile file1 (filename, ...) ;
  Sndfile file2 (file1) ;

The objects file1 and file2 look like two independent objects and should behave like two independent objects, but instead they behave like they do in the C version above. I believe that this is inconsistent.

The only solution that would maintain consistency would be to make the copy constructor do a deep copy but that is simply too much of a pain in the neck to implement.

The current version of the wrapper is available here or in version 1.0.17 or later of libsndfile.

C : sprintf vs snprintf. http://www.mega-nerd.com/erikd/Blog/CodeHacking/snprintf.html Recently on the JACK Audio Connection Kit (JACK) developers mailing list someone posted a piece of C code using the function sprintf. As someone interested in raising the awareness of good software engineering practices, I piped up and suggested that people should use snprintf instead.

The problem with sprintf is the potential for buffer overflows. For instance the code:

    char buffer [5] ;
    sprintf (buffer, "%d", x) ;

will write past the end of buffer for any value of x outside the range [-999,9999]. For values needing five characters, the terminating NUL lands one byte beyond the array; for values outside the range [-9999,99999], digits are written beyond it as well, overwriting parts of the stack. Malicious computer crackers exploit buffer overflows like this to take control of people's machines, infect them with computer viruses, steal personal data and so on.

Anyway, on the mailing list, a third person piped up and attempted to defend the use of sprintf and stated:

"Oh, come on. Saying "don't use sprintf anywhere" is just as stupid as saying "never use goto" or "never use scanf". We are not first year computer science students, you don't have to write statements like that."
"sprintf is a cleaner command than snprintf, so code with sprintf is easier to read than snprintf."

The first statement can be taken as opinion, but the second one is simply wrong, and here's why.

As software engineers working on projects with other people, we will be required to read other people's code. Given the example above, where the declaration of buffer is immediately above its use, it's easy to see that the buffer is too short. However, what about this?

    void fill_buffer (struct s_type *s, int x)
    {
        sprintf (s->buffer, "%d", x) ;
    }

In this case the reader has to find the definition of struct s_type to figure out if there is any possibility of a buffer overflow. Now replace sprintf with snprintf:

    void fill_buffer (struct s_type *s, int x)
    {
        snprintf (s->buffer, sizeof (s->buffer), "%d", x) ;
    }

Here the code reader can immediately see that there is no possibility for a buffer overflow. If the reader is not the person who wrote the code, snprintf is the clear winner here and when working in groups, code must be reader friendly.
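
For anyone who wants to see the truncation in action, here is a small sketch. The definition of struct s_type is an assumption made for this example (the post does not show one), and the sizeof trick only works because buffer is an array member rather than a pointer. The bool return value is also an addition for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical definition; the original struct s_type is not shown.
struct s_type
{   char buffer [5] ;
} ;

// Returns true if x fitted completely, false if the output was truncated.
static bool fill_buffer (struct s_type *s, int x)
{   int needed = std::snprintf (s->buffer, sizeof (s->buffer), "%d", x) ;

    // snprintf returns the length the full output would have had (not
    // counting the terminator), or a negative value on error. With a
    // non-zero size the buffer is always NUL terminated.
    return needed >= 0 && static_cast <std::size_t> (needed) < sizeof (s->buffer) ;
}

// Returns true if a small value fits and a large one is cut off safely.
static bool truncation_demo ()
{   struct s_type s ;

    bool small_ok = fill_buffer (&s, 9999) && std::strcmp (s.buffer, "9999") == 0 ;
    bool big_cut = ! fill_buffer (&s, 123456) && std::strcmp (s.buffer, "1234") == 0 ;

    return small_ok && big_cut ;
}
```

The value 123456 ends up as the string "1234", which is wrong but harmless, and the return value tells the caller that something was lost.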

Programming is hard. It's especially hard in languages like C and C++ where it's so easy to shoot oneself in the foot. Programmers should grab every opportunity they can to make their software better and the decision between sprintf and snprintf should be a no-brainer.

Licenses. http://www.mega-nerd.com/erikd/Blog/CodeHacking/licenses.html Here's a really useful link to a large list of software licenses. The nice thing here is that each license also has a link to the full text of the license itself and states whether the license is compatible with the GNU GPL or not.

Ocaml : Fold. http://www.mega-nerd.com/erikd/Blog/CodeHacking/Ocaml/fold.html In a previous post, I blogged about Ocaml's iter and map functions and how they can be applied to arrays and lists. In some circumstances, these functions can be used as a replacement for a for loop. However, there are some other situations where iter and map can only provide a non-optimal solution. For example, here's a small program which uses Ocaml's imperative features to calculate the sum of the elements of an integer array:

let _ =
    let a = [| 1 ; 2 ; 5 ; 7 ; 11 |] in
    let sum = ref 0 in
    for i = 0 to Array.length a - 1 do
        sum := !sum + a.(i)
    done ;
    Printf.printf "Sum : %d\n" !sum

The value sum is a reference to an integer which is initialized to zero and the referenced sum is then updated inside the for loop. Operating on references is a little different to operating on values; it requires the use of the de-reference operator "!" to access the referenced value and requires the use of the de-reference assignment operator ":=" to update the referenced value.

Like the previous iter and map examples, there are a number of places this can go wrong, even though it's only a very small chunk of code. Here's how to use iter to achieve the same result:

let _ =
    let a = [| 1 ; 2 ; 5 ; 7 ; 11 |] in
    let sum = ref 0 in
    Array.iter (fun x -> sum := !sum + x) a ;
    Printf.printf "Sum : %d\n" !sum

but that's only a small win.

Fortunately, there is a significantly better Higher Order Function solution to this problem, a concept called fold and implemented as functions fold_left and fold_right. The following example program uses both and reduces the for loop in the first example to a single line, including the initialization of the accumulator used to calculate the sum:

let _ =
    let a = [| 1 ; 2 ; 5 ; 7 ; 11 |] in
    let fold_left_sum = Array.fold_left (fun x y -> x + y) 0 a in
    Printf.printf "Fold_left sum  : %d\n" fold_left_sum ;
    let fold_right_sum = Array.fold_right (fun x y -> x + y) a 0 in
    Printf.printf "Fold_right sum : %d\n" fold_right_sum

So let's have a look at a single fold_left:

Array.fold_left (fun x y -> x + y) 0 a

Obviously, the first parameter passed to fold_left is an anonymous function which takes two parameters x and y and returns their sum and the last parameter is simply the array the fold is being applied to. The second parameter, 0 in this case, is where all the magic happens. The value 0 is the value that will be passed to the anonymous function as the x parameter, the first time it is called. For subsequent calls, the value of the x parameter will be the return value of the previous call of the anonymous function.

Obviously the easiest way to visualize this is with an example that prints out the values. Here it is:

let _ =
    let a = [| 1 ; 2 ; 5 ; 7 ; 11 |] in
    let fold_left_sum = Array.fold_left
    (   fun x y ->
            Printf.printf "%4d %4d\n" x y ;
            x + y
            )
        0 a
    in
    Printf.printf "\nFold_left sum  : %d\n" fold_left_sum

For those of you too lazy to try this yourselves :-), here is the output:

   0    1
   1    2
   3    5
   8    7
  15   11

Fold_left sum  : 26

There, just as I explained it. So what about fold_right? Well there are two differences and they are a little subtle so here's the program:

let _ =
    let a = [| 1 ; 2 ; 5 ; 7 ; 11 |] in
    let fold_right_sum = Array.fold_right
    (   fun x y ->
            Printf.printf "%4d %4d\n" x y ;
            x + y
            )
        a 0