SWIG me gently from Mr. C to Mrs. P

I prepared a presentation on SWIG for our internal institute discussion group recently and ran into a couple of unexpected problems. One of the big advantages of SWIG is that it’s well-documented and there are many tutorials but still I did not find the solution to my problems instantly (i.e. using a simple Google search). That’s why I’m wrapping up my results in this blog entry for those that might have had similar problems.

Disclaimer: SWIG is only one of many interface generators to integrate your C/C++ code in a Python environment. There is a good comparison of different wrapper tools in Python Wrapper Tools: A Performance Study. SWIG seems to be somewhat slower than native Python/C wrapping, SIP or Boost.Python, but still I find it much more intuitive to get a working solution quickly (if you read the complete documentation including tutorials which I obviously didn’t at that point).

In the following, I will give two examples of what can be done with SWIG. First, I will show how to integrate a C implementation of suffix arrays within Python. Second, I will give an example of how to swig a C++ implementation of edit distance (aka Levenshtein distance) into Python.

Swigging a suffix array from C to Python

Above you will find a link to an efficient implementation of suffix arrays in C. Looking at the interface sarray.h, we see two interesting functions that we might want to swig:

int sarray(int *a, int n);
int bsarray(const uchar *b, int *a, int n);

The first function takes an array of ints a and returns the correspoding suffix array in place, whereas the second function takes a string *b, its length n and returns the suffix array in *a. Note that „int *a“ plays a different role in the two functions. Thus, we’ll have to define different typemaps for each of them. I won’t introduce typemaps here, so please read the documentation on typemaps if you’re unfamiliar with the concept.

So first, let’s create a file called sarray.i which will contain the interface code. Let’s start by filling in the uchar typedef via the %internal directive and define the module sarray that will be accessible within Python:

%inline %{
typedef unsigned char uchar;
%}

%module sarray %{ extern int bsarray(const uchar *b, int *a, int n); extern int sarray(int *a, int n); %}

In the following, we have to specify which arguments are defining input and which output variables. SWIG will not try guessing (actually, it does implement default wrappers for primitives, but here it can’t know whether we only return one value behind the pointer to a or a whole array), so that’s where we have to define our magic interfacing of C and Python data structures. First, let’s take a closer look at the function bsarray(). Array b holds a string (const uchar*) of length n. Array int* a needs to be initialized with length n+1 (this is a precondition which I failed to realize at the beginning, see sarray.c, so be sure to read the documentation of the code you’re wrapping). This part is quite straight-forward:

%typemap(in) (const uchar b, int *outval, int n) {
   if (!PyString_Check($input)) {
       PyErr_SetString(PyExc_ValueError, "Expecting a string");
       return NULL;
   }
   $1 = PyString_AsString($input);
   $3 = PyString_Size($input);
   $2 = (int) malloc($3*sizeof(int)+1); // +1 required, see sarray.c
}

There are several things to comment on:

  • note the usage of int *outval instead of int *a: lateron, we need the initial int *a in another context, namely as input and output variable, whereas here, it’s only an output variable which needs to be initialized properly
  • we check whether the incoming parameter $input is a string, since we want to have something like sa = sarray.bsarray("abracadabra") in Python; the outval array will be a return value within Python and the n will be set automatically, so we don’t need either one as input parameter
  • $1, $2 and $3 are the local input variables that will be matched by the header of the function lateron, i.e. they are input parameters b, outval and n used in the actual C implementation
  • $input is a PyObject*, so we can use all kinds of useful Python API functions
  • the malloc on $2 is essential, since we have to make sure that the memory is initialized (which is not done in sarray.c, another error occurring to me due to not reading all documentation at first)
  • +1 in the malloc is required since the output array will hold the index of the empty suffix as well

Since we malloced memory, we better free it in order to not create memory leaks in our applications, so let’s add:

%typemap(freearg) int * {
    free($1);
}

This is a more general matching rule than used above for int *outval. This will be clearer as soon as we will turn to swig the function sarray() of the suffix array interface. But let’s first finish bsarray(). So far, we converted the Python input structures (a simple Python string) to the C variants. After computing the suffix array, we have to convert the contents in outval to a Python structure, i.e. a list of ints. For this, we can specify an argout typemap (above was an in typemap):

%typemap(argout) (int *outval, int n) {
    PyObject *sa_list = PyList_New($2);
    int i;
    for (i = 0; i < $2; i++) {
      PyList_SetItem(sa_list, i, PyInt_FromLong($1[i]));
    }
    $result = sa_list;
}

As can be seen, we first create a new Python object, namely a list of length n. Then we iterate through the first argument (holding the values of the computed suffix array) and set the corresponding elements of the Python list. In the end, we set the list to be the $return value. We can now add the following line to sarray.i in order to be able to use the bsarray() function in Python:

int bsarray(const uchar *b, int *outval, int n);

Great, before I give you a Makefile that will do all the swigging, compiling, linking and testing, let’s first finish the tutorial by defining the SWIG interface for the other function, sarray(). Here, we give it a sequence of ints and the corresponding suffix array is returned in-place. What is it good for? Imagine you have a large text collection, i.e. a stream of tokens. You could index the occuring words using a mapping from string representation to integers (saves a lot of memory), give this list of ints as input to calculate the suffix array and use the result to find a sequence of words (which are mapped to their indices) in the original text. Since the suffix array allows us to use binary search, we are able to find matching suffixes in O(logn) time instead of a linear time complexity, i.e. O(n). On large amounts of texts, this really makes a difference! True, we have also to consider the time needed for creating the suffix array, which happens to be around O(n2logn), but on huge data collections, the logn factor beats a linear search by far. Here is what needs to be specified for sarray() to work in Python:

%typemap(in) (int inandoutval, int n) {
   if (!PySequence_Check($input)) {
       PyErr_SetString(PyExc_ValueError, "Expecting a sequence");
       return NULL;
   }
   $2 = PySequence_Size($input)+1;
   $1 = (int) malloc($2*sizeof(int));
   int i;
   for (i = 0; i < PyList_Size($input); i++) {
       PyObject *pobj = PySequence_GetItem($input,i);
       if (PyNumber_Check(pobj)) {
          $1[i] = (int) PyInt_AsLong(pobj);
       } else {
          PyErr_SetString(PyExc_ValueError,"Sequence elements must be numbers");
          free($1);
          return NULL;
       }
   }
   $1[PyList_Size($input)] = 0;
}

The logic of parsing a Python list and setting the values to a malloced array of ints looks somewhat more complex but is self-explanatory. We check whether the input is a sequence (could also be a tuple), allocate memory for the variable inandoutval, fill it with the ints from the input sequence including checks whether the sequence really contains ints and set the final cell to zero which is required by sarray.c. The last missing bit is to specify that the argout typemap for this input type is the same as with the bsarray() case and to include the wrapper function:

%apply (int *outval, int n) { (int *inandoutval, int n) };
int sarray(int *inandoutval, int n);

That should be it. Here are the files at your disposal:

If you’re getting an error message like „Python.h: no such file or directory“ while compiling the generated wrapper code, please also install the python-dev package for the header files. If you compile the whole thing using the Makefile, you should see something like this:

$ make
swig -python sarray.i
gcc -c sarray_wrap.c sarray.c -I/usr/include/python2.5
ld -shared sarray_wrap.o sarray.o -o _sarray.so

SWIG will first generate the wrapper code sarray_wrap.c which is compiled and linked to a shared library _sarray.so. Now start Python and test your freshly brewed sarray module:

$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42) 
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import sarray sarray.bsarray("abracadabra") [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9]

Note that the first value of the list is the position of the empty prefix at the end of the input string. Strangely enough, this also happens to be the exact length of the string. 😉 If you want a nicer output of this, run make test which will invoke satest.py and give some prettified output of all suffixes of the string including their occurrence in the input.

Last but not least, satest-search.py provides example code including binary search for matching suffix prefixes. Run make test2 to see some example output for the call „python satest-search.py abracadabra ca dab bra r foo bar a“ which searches for several suffix prefixes both using the built-in (linear) find for strings and the swig’n’tasty suffix arrays which we just hooked into Python from the efficient C implementation.

Well, thanks for your attention, I hope this is somewhat useful. Drop a comment if you have further questions or remarks on this topic. Cheers!

Swigging edit distance from C++ to Python

Stay tuned, I promise to blog on this section later. Concise writing of tutorials seems to take more of my time than expected. 8) (And I first have to fix the wp-syntax plugin for WordPress, my code portions do not seem to look syntax highlighted at all… 10 minutes later: okay, fixed by using the Google Syntax Highlighter)

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert

Diese Website verwendet Akismet, um Spam zu reduzieren. Erfahre mehr darüber, wie deine Kommentardaten verarbeitet werden.