Dave Beazley's mondo computer blog.

Sunday, August 09, 2009


Python Binary I/O Handling

As a followup to my last post about the Essential Reference, I thought I'd talk about the one topic that I wish I had addressed in more detail in my book--and that's the subject of binary data and I/O handling. Let me elaborate.

One of the things that interests me a lot right now is the subject of concurrent programming. In the early 1990's, I spent a lot of time writing big physics simulation codes for Connection Machines and Crays. All of those programs had massive parallelism (e.g., 1000s of processors) and were based largely on message-passing. In fact, my first use of Python was to control a large massively parallel C program that used MPI. Now, we're starting to see message passing concepts incorporated into the Python standard library. For example, I think the inclusion of the multiprocessing library is probably one of the most significant additions to the Python core that has occurred in the past 10 years.

A major aspect of message passing concerns the problem of quickly getting data from point A to point B. Obviously, you want to do it as fast as possible. A high speed connection helps. However, it also helps to eliminate as much processing overhead as possible. Such overhead can come from many places--decoding data, copying memory buffers, and so forth.

Python makes it pretty easy to pass data around between processes. For example, you can use the pickle module, json, XML-RPC, or some other similar mechanism. However, all of these approaches involve a significant amount of overhead to encode and decode data. You probably wouldn't want to use them for any kind of bulk data transfer (e.g., if you wanted to send a large array of floats between processes). Nor would you really want to use this for some kind of high-performance networking on a big cluster.

However, lurking within the Python standard library is another way to deal with data in messaging and interprocess communication. The catch is that it's all spread out in a way that's not entirely obvious unless you're looking for it (and even then it's still pretty subtle). Let's start with the ctypes library. I always assumed that ctypes was all about accessing C libraries from Python (an alternative approach to Swig). However, that's only part of the story. For instance, using ctypes, you can define binary data structures:

from ctypes import *
class Point(Structure):
    _fields_ = [ ('x',c_double), ('y',c_double), ('z',c_double) ]

This defines an object representing a C data structure. You can even create and manipulate such objects just like an ordinary Python class:

>>> p = Point(2,3.5,6)
>>> p.x
2.0
>>> p.y
3.5
>>> p.z = 7

However, keep in mind that under the covers, this is manipulating a C structure represented in a contiguous block of memory.
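
You can see that contiguous block for yourself. Here's a quick sketch (Python 3 syntax; the post above uses Python 2) that checks the structure's size and compares its raw bytes against what the struct module would pack:

```python
import struct
from ctypes import Structure, c_double, sizeof

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

p = Point(2, 3.5, 6)

# Three packed doubles: 24 bytes, no Python object overhead in the layout
assert sizeof(Point) == 3 * struct.calcsize('d')

# The raw memory is byte-for-byte what struct.pack() would produce
assert bytes(p) == struct.pack('ddd', 2, 3.5, 6)
```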

Now this is where things start to get interesting. I wonder how many Python programmers know that they can directly write a ctypes data structure onto a file opened in binary mode. For example, you can take the point above and do this:

>>> f = open("foo","wb")
>>> f.write(p)       
>>> f.close()

Not only that, you can read the file directly back into a ctypes structure if you use the poorly documented readinto() method of files.

>>> g = open("foo","rb")
>>> q = Point()
>>> g.readinto(q)
>>> q.x
2.0
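
Put together as a self-contained script (Python 3 syntax, using a temporary file so it cleans up after itself), the whole round trip looks like this:

```python
import os
import tempfile
from ctypes import Structure, c_double

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

p = Point(2, 3.5, 6)

fd, fname = tempfile.mkstemp()
os.close(fd)

with open(fname, 'wb') as f:
    f.write(p)                # writes the raw 24-byte block

q = Point()
with open(fname, 'rb') as g:
    g.readinto(q)             # reads it straight back into q's memory

os.remove(fname)
print(q.x, q.y, q.z)          # 2.0 3.5 6.0
```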

The mechanism that makes all of this work is Python's so-called "buffer protocol." Since ctypes structures are contiguous in memory, I/O operations can be performed directly on that memory without making copies or first converting such structures into strings as you might do with something like the struct module. The buffer protocol simply exposes the underlying memory buffers for use in I/O.
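
One way to see the buffer protocol in action (Python 3 syntax; in 2.x the buffer() built-in plays a similar role) is to wrap a structure in a memoryview, which exposes the structure's own memory without copying it:

```python
from ctypes import Structure, c_double, sizeof

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

p = Point(2, 3.5, 6)
m = memoryview(p)              # a view of p's memory -- no copy is made

assert m.nbytes == sizeof(p)   # 24 bytes: three packed doubles
snap = bytes(m)                # snapshot of the current contents
p.x = 99.0                     # mutate the structure...
assert bytes(m) != snap        # ...and the view sees it immediately
```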

Direct binary I/O like this is not limited to files. If s is a socket, you can perform similar operations like this:

p = Point(2,3,4)           #  Create a point
s.send(p)                  #  Send across a socket

q = Point()
s.recv_into(q)             #  Receive directly into q
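
Here's a self-contained sketch using socket.socketpair() (a Unix feature at the time of writing) so you can try it without setting up a client and server:

```python
import socket
from ctypes import Structure, c_double

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

a, b = socket.socketpair()

p = Point(2, 3, 4)
a.send(p)                  # raw struct bytes go straight onto the socket

q = Point()
b.recv_into(q)             # and straight back into q's memory
print(q.x, q.y, q.z)       # 2.0 3.0 4.0

a.close(); b.close()
```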

If that wasn't enough to make your brain explode, similar functionality is provided by the multiprocessing library as well. For example, Connection objects (as created by the multiprocessing.Pipe() function) have send_bytes() and recv_bytes_into() methods that also work directly with ctypes objects. Here's an experiment to try. Start two different Python interpreters and define the Point structure above. Now, try sending a point through a multiprocessing connection object:

>>> p = Point(2,3,4)
>>> from multiprocessing.connection import Listener
>>> serv = Listener(("",25000),authkey="12345")
>>> c = serv.accept()
>>> c.send_bytes(p)

In the other Python process, do this:

>>> q = Point()
>>> from multiprocessing.connection import Client
>>> c = Client(("",25000),authkey="12345")
>>> c.recv_bytes_into(q)
>>> q.x
2.0
>>> q.y
3.0

As you can see, the point defined in one process has been directly transferred to the other.

If you put all of the pieces of this together, you find that there is this whole binary handling layer lurking under the covers of Python. If you combine it with something like ctypes, you'll find that you can directly pass binary data structures such as C structures and arrays around between different interpreters. Moreover, if you combine this with C extensions, it seems to be possible to pass data around without a lot of extra overhead. Finally, if that wasn't enough, it turns out that some popular extensions such as numpy also play in this arena. For instance, in certain cases you can perform similar direct I/O operations with numpy arrays (e.g., directly passing arrays through multiprocessing connections).
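
For bulk data, the same trick works with plain ctypes arrays. Here's a sketch that pushes a block of 1000 doubles through a pipe as raw bytes with no per-element encoding (numpy arrays can be handled similarly via their buffer interface):

```python
from multiprocessing import Pipe
from ctypes import c_double, sizeof

N = 1000
data = (c_double * N)(*range(N))     # 1000 doubles in one contiguous block

left, right = Pipe()
left.send_bytes(data)                # 8000 raw bytes, no pickling involved

result = (c_double * N)()
right.recv_bytes_into(result)        # received straight into the array

assert sizeof(result) == 8 * N
assert list(result) == [float(i) for i in range(N)]
```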

I think that this functionality is pretty interesting--and highly relevant to anyone who is thinking about parallel processing and messaging. However, all of this is also somewhat unsettling. For one, much of this functionality is very poorly documented in the Python documentation (and in my book for that matter). If you look at the documentation for methods such as the readinto() method of files, it simply says "undocumented, don't use it." The buffer interface, which makes much of this work, has always been rather obscure and poorly understood--although it got a redesign in Python 3.0 (see Travis Oliphant's talk from PyCon). And if that wasn't complicated enough already, much of this functionality gets tied into the bytes/Unicode handling part of Python--a hairy subject on its own.

To wrap up, I think much of what I've described here represents a part of Python that probably deserves more investigation (and at the very least, more documentation). Unfortunately, I only started playing around with this recently--too late for inclusion in the Essential Reference (which was already typeset and out the door). However, I'm thinking it might be a good topic for a PyCon tutorial. Stay tuned.

Note: If anyone has links to articles or presentations about this, let me know and I'll add them here.

It seems to me that whether or not these mechanisms can, in fact, be useful (in either a shared memory or a cluster multiprocessing model) is going to depend heavily on the quality and details of their implementation. Out of curiosity, does the undocumented nature of these features imply anything about their quality (by which I mean reliability and performance) or up-to-date-ness?
The "quality of the implementation" and "performance" aspects of this are really big questions. To be honest, I really don't have an answer. I know that the buffer interface has been floating around in Python for quite a long time (maybe even as far back as 1.5.2). However, if you look at things like the recv_into() method of sockets, that first showed up in Python 2.6. Then you have things such as mutable bytearrays in Python 2.6/3.0, which are also related.

My impression is that a lot of this functionality is new, but obscure. Couple that with all of the possible corner cases that arise and you end up with a lot of unknowns.
My concern is that ctypes will encode the structure in native format (as it must), which is not portable when shared across a network socket or with a file. Why not simply use the struct module to unpack/pack data? Or use something like Google's protocol buffers. The Python version uses struct, and it is portable to many other language implementations.
The struct module is already well-known in the Python world. The whole point of my post was to discuss a very specific facet of binary data handling that is not well known by programmers---specifically, the fact that you can perform direct I/O from ctypes objects, arrays, and similar datatypes.

No, it's not portable. Heck, it might not even be a good idea. However, if you're messing around with multiprocessing or some kind of low-level parallel computing, you're usually going for as much speed as possible. If so, then this is a facet of Python worthy of further exploration.
Sure, it would be interesting to test struct vs ctypes to see which one is faster. (Of course, use struct.Struct objects to precompile the format.) Just casually looking at the ctypes code, and extrapolating from what I know of the Jython implementation (which is a simple port of CPython's), I would not expect too much of a difference, but even 10-20% is important in the use cases you mention.
Some benchmarks would definitely be interesting. I think one of the more interesting aspects of this is where it's all going in Python 3.0. Most of the I/O system has been redesigned in Python 3. Plus, you're seeing new kinds of objects (mutable byte arrays) and expanded use of the buffer interface for I/O in the library. It would be interesting to know if these features offer any performance benefit. If so, can they be used reliably? A lot of open questions here.
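
As a starting point for such a benchmark, here's a sketch (Python 3 syntax) that verifies the two approaches produce identical bytes and times them with timeit; absolute numbers will of course vary by machine, and the ctypes timing here includes object construction:

```python
import struct
import timeit
from ctypes import Structure, c_double

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

fmt = struct.Struct('ddd')        # precompiled format, as suggested above

# Both routes serialize to the same 24 bytes
assert bytes(Point(2, 3.5, 6)) == fmt.pack(2, 3.5, 6)

t_struct = timeit.timeit(lambda: fmt.pack(2, 3.5, 6), number=100000)
t_ctypes = timeit.timeit(lambda: bytes(Point(2, 3.5, 6)), number=100000)
print('struct: %.3fs  ctypes: %.3fs' % (t_struct, t_ctypes))
```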
Wouldn't it be possible to use BigEndianStructure to enforce network order encoding of the structure? I'm a Python newbie so I may be oversimplifying, but I'm looking to port some C and/or Java code, and being able to read/write ctypes structures from/to files and sockets would be a massive benefit to us!
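
For a flat structure like Point, a quick check suggests the answer is yes: BigEndianStructure produces network byte order, byte-for-byte identical to struct's '>' packing (a sketch in Python 3 syntax; nesting big-endian structures is a separate problem):

```python
import struct
from ctypes import BigEndianStructure, c_double

class PointBE(BigEndianStructure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

p = PointBE(2, 3.5, 6)

# Identical to struct's big-endian (network order) packing
assert bytes(p) == struct.pack('>ddd', 2, 3.5, 6)
```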
Excellent blog. I recently started doing exactly what you detail for a network server. I have run into a couple of issues, which I continue to work through. Most are limitations of the ctypes library. Having previously implemented this type of thing using structs, I can absolutely say ctypes is a vastly superior solution for the general, non-complex use case. The issues I've encountered come from representing more complex structures while still allowing buffer-protocol stream I/O.

The following patch to ctypes is required to really make it useful for anything but the most trivial of networking solutions.

I found someone else had already created the same patch after discovering the same ctypes bug. I have no idea why this patch has not already been adopted. This very minor patch makes the difference between ctypes being nearly useless for protocol implementation and it being capable of addressing a multitude of more complex structures for networking services.

Once you've fixed the ctypes bug, you can create big endian structures and embed them within other user-defined big endian structures, making network byte ordered I/O a piece of cake.

Patch ctypes/_endian.py in your python's lib directory.
*************** def _other_endian(typ):
*** 17,22 ****
--- 18,25 ----
      except AttributeError:
          if type(typ) == _array_type:
              return _other_endian(typ._type_) * typ._length_
+         elif issubclass(typ, Structure):
+             return typ
          raise TypeError("This type does not support other endian: %s" % typ)

To summarize, this patch is direly needed in mainline ctypes and should be immediately backported to all Python releases where ctypes is part of the standard library. IMHO, ctypes is fundamentally broken without this basic patch.

Make sure this bug gets reported to the Python issue tracker on python.org. That's probably the most direct way to make sure that this gets addressed in a future release.

While I was searching, hoping to find better solutions to some of my more obscure issues with using ctypes in my project, I found the author of ctypes was directly provided with a patch which only slightly differs from my own (elif vs. if). Functionally the two patches are no different. The patch was provided in November of 2008 to TH and python.org.

As I said, I have no idea why the patch has not already been absorbed. It's very frustrating to find out I'm re-inventing the wheel with a more or less identical solution almost a year after the problem was identified and a patch provided. Even more frustrating is that I've found no stated reason for the patch's rejection, nor any reason not to use it. In my own use, I've not identified a single downside to its inclusion.
I'll quickly follow up that using ctypes for variable-length data structures, or even repeated fixed-length data structures with a variable repetition count, is extremely problematic.

It looks like many have discovered this same issue and it appears some effort is underway to improve ctypes in this area. But in the mean time, using ctypes for anything but the simplest of protocols and/or I/O will continue to be rather problematic in the near future.

Do you have any links that point to some of this? I think it would be useful for me to add them here.

is the patch on bugs.python.org?

The only one that I know of is http://bugs.python.org/issue4376, but it contains an open question that no one has answered.
Is there any way to take into account padding bytes during reading?
I'm trying to read a format which is a little... odd
(Looks fine, until you notice the offsets)