Dabeaz

This blog has moved

noreply@blogger.com (Dave) — Thu, 22 Apr 2010 00:38:00 +0000

This blog is now located at http://dabeaz.blogspot.com/. For feed subscribers, please update your feed subscriptions to http://dabeaz.blogspot.com/feeds/posts/default.

Upcoming Python Training Classes

noreply@blogger.com (Dave) — Fri, 26 Feb 2010 20:56:00 +0000

Please forgive the brief commercial interruption. I'd just like to plug a few of my upcoming Python training classes--yes, if you must know, this is how I pay the bills so that I can spend the rest of my time thinking about the GIL and other diabolical Python-related topics.

New! Python Mastery Bootcamp, April 12-16, 2010 (Atlanta)

First, I'm pleased to announce a brand-new Python course that I'm offering for the first time at Big Nerd Ranch in Atlanta. The Python Mastery Bootcamp might be the ultimate Python tutorial for programmers who already know the basics of Python, but who want to take their understanding of the language to a whole new level. Over the past few years, I have given a number of well-reviewed PyCON tutorials on advanced topics such as Generator Tricks for Systems Programmers, A Curious Course on Coroutines and Concurrency, or most recently Mastering Python 3 I/O. Well, the Mastery Bootcamp is sort of similar except that it lasts 5 days, it covers far more material (network programming, threads, multiprocessing, asynchronous I/O, functional programming, metaprogramming, distributed computing, C extensions, etc.), and it has more hands-on projects that allow the material to be explored in greater depth than at a conference.

The experience at Big Nerd Ranch is quite unique--for 5 days, you will be completely immersed in Python programming without the annoyance of outside distractions. This makes it the perfect environment to interact with other class participants and to really focus on the course material. There's really nothing quite like it in the training world--you won't be disappointed.

March 12, 2010 update! The Mastery Bootcamp is confirmed to run and there are still a few slots available. It's going to be a great experience for anyone who wants to learn enough about Python to be dangerous.

Introduction to Python Programming, March 16-18, 2010 (Chicago)

If you're relatively new to Python and want to master the fundamentals, consider coming to my Introduction to Python Programming class in Chicago. This course is aimed at programmers, system administrators, scientists, and engineers who want to apply Python to everyday tasks such as analyzing data files, automating system tasks, scraping web pages, using databases, and more. Through practical examples, you will learn all of the major features of Python including data handling, functions, modules, classes, generators, testing, and more. This is a highly refined class that has been taught for numerous corporate and government clients over the past three years. The class features a 300 page fully indexed course guide and more than 50 hands-on exercises.

My Chicago classes are also taught in a rather unique format. Unlike a typical corporate training course, I conduct the course in a round-table format that is strictly limited to 6 attendees--a size that encourages interaction and allows course topics to be easily customized to your interests. The course is located in Chicago's distinctive Andersonville neighborhood where just steps away, you will find dozens of unique restaurants, bakeries, coffee houses, pubs, and more. You're definitely going to like it!

March 12, 2010 update! The Chicago class is now sold out. However, be on the lookout for its return in a few months.

Revisiting thread priorities and the new GIL

noreply@blogger.com (Dave) — Mon, 22 Feb 2010 21:47:00 +0000

Well, PyCon is over and it's time to get back to work. First, I'd just like to thank everyone who came to my GIL Talk and participated in all of the discussion that followed. It was almost as if part of PyCon had turned into a mini operating systems conference!

This post is a followup to the GIL open space at PyCon where we looked at the new GIL and explored the possibility of introducing thread priorities. For those of you not at PyCon, the open space was attended by about 30-40 people and included Guido, Antoine Pitrou, and a large number of systems hackers, some of whom had previously worked on thread library implementations and operating system kernels.

First, a little background. As you might know, Antoine Pitrou implemented a new Python GIL that is currently only available in the Python 3.2 development branch (you can obtain it via subversion). This new GIL is described in his original mailing list post as well as the slides for my PyCon talk. You should read those first if you haven't already.

Right before PyCON, I discovered an I/O performance problem with the new GIL that is related to CPU-bound threads stalling the progress of I/O-bound threads, which in turn leads to a severe degradation of I/O bandwidth and response time. This is described in Issue 7946: Convoy effect with I/O bound threads and New GIL.

In the bug report, I submitted a very simple test case that illustrated the problem. However, here is a more refined experiment that you can try. The following program, iotest.py, contains both CPU-bound threads and an I/O server thread that echoes UDP packets. It is meant to study the case in which CPU processing and I/O processing are overlapped.

# iotest.py

import time
import threading
from socket import *
import itertools

def task_pidigits():
    """Pi calculation (Python)"""
    _map = map
    _count = itertools.count
    _islice = itertools.islice

    def calc_ndigits(n):
        # From http://shootout.alioth.debian.org/
        def gen_x():
            return _map(lambda k: (k, 4*k + 2, 0, 2*k + 1), _count(1))

        def compose(a, b):
            aq, ar, as_, at = a
            bq, br, bs, bt = b
            return (aq * bq,
                    aq * br + ar * bt,
                    as_ * bq + at * bs,
                    as_ * br + at * bt)

        def extract(z, j):
            q, r, s, t = z
            return (q*j + r) // (s*j + t)

        def pi_digits():
            z = (1, 0, 0, 1)
            x = gen_x()
            while 1:
                y = extract(z, 3)
                while y != extract(z, 4):
                    z = compose(z, next(x))
                    y = extract(z, 3)
                z = compose((10, -10*y, 0, 1), z)
                yield y

        return list(_islice(pi_digits(), n))

    return calc_ndigits, (50, )

def spin():
    task, args = task_pidigits()
    while True:
        r = task(*args)

def echo_server():
    s = socket(AF_INET, SOCK_DGRAM)
    s.setsockopt(SOL_SOCKET, SO_REUSEADDR,1)
    s.bind(("",16000))
    while True:
        msg, addr = s.recvfrom(16384)
        s.sendto(msg,addr)  

# Launch threads (adjust the number to see different results)
NUMTHREADS = 1
for n in range(NUMTHREADS):
    t = threading.Thread(target=spin)
    t.daemon = True
    t.start()

# Launch a background echo server
echo_server()

Next, here is a client program ioclient.py that simply measures the time it takes to echo 10MB of data to the server in the iotest.py program.

# ioclient.py
from socket import *
import time

CHUNKSIZE = 8192
NUMMESSAGES = 1280     # Total of 10MB

# Dummy message
msg = b"x"*CHUNKSIZE

# Connect and send messages
s = socket(AF_INET,SOCK_DGRAM)
start = time.time()
for n in range(NUMMESSAGES):
    s.sendto(msg,("",16000))
    msg, addr = s.recvfrom(65536)
end = time.time()
print("%0.3f seconds (%0.3f bytes/sec)" % (end-start, (CHUNKSIZE*NUMMESSAGES)/(end-start)))

If you run iotest.py on a dual-core Macbook with only 1 spin() thread, you get the following result when you run ioclient.py:

Python 3.2 (New GIL) : 9.166 seconds (1143998.140 bytes/sec)

It works, but it's hardly impressive (just barely over 1MB/sec transfer rate between two processes?). However, if you make the server have two spin() threads, the performance gets much worse:

Python 3.2 (New GIL) : 28.064 seconds (373642.858 bytes/sec)

Now to further complicate matters, if you disable all but one of the CPU cores, you get this inexplicable result:

Python 3.2 (New GIL, 1 CPU) : 0.297 seconds (35326299.028 bytes/sec)

Needless to say, there are many bizarre things going on here. The most significant effect is that on multiple cores, it is very easy for CPU-bound threads to reacquire the GIL whenever an I/O-bound thread performs I/O. This means that CPU-bound threads have a greater tendency to hog the GIL.

At PyCON, I did some experiments with thread priorities and a modified GIL that adjusted priorities in a manner similar to what you find with multilevel feedback queues in operating systems. Namely:

If a thread is forced to give up the GIL due to a timeout, it is penalized with lower priority.
If a thread voluntarily gives up the GIL because it performed I/O, it is rewarded with higher priority.
High-priority threads always preempt low-priority threads.
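
The three rules above can be sketched as a toy model in Python. This is not the patch itself (the real changes live in the C source); the class name, the method names, and the priority range of 0 to 4 are all hypothetical and exist only to illustrate the bookkeeping.

```python
# Toy model of the priority rules. All names and limits are hypothetical.
MIN_PRIORITY = 0
MAX_PRIORITY = 4

class ThreadState:
    def __init__(self, name, priority=2):
        self.name = name
        self.priority = priority

    def on_timeout(self):
        # Forced to give up the GIL: penalize with lower priority
        self.priority = max(MIN_PRIORITY, self.priority - 1)

    def on_io(self):
        # Voluntarily gave up the GIL for I/O: reward with higher priority
        self.priority = min(MAX_PRIORITY, self.priority + 1)

def pick_next(waiting):
    # The highest-priority waiting thread always gets the GIL next
    return max(waiting, key=lambda t: t.priority)

cpu = ThreadState("cpu")
io = ThreadState("io")
cpu.on_timeout()      # CPU-bound thread used up its whole time slice
io.on_io()            # I/O thread released the GIL to wait on a socket
winner = pick_next([cpu, io])
print(winner.name)
```

In this model the I/O thread ends up ahead of the CPU-bound thread after a single round, which is the behavior the rules are meant to produce. It says nothing about starvation, which a real implementation would still have to handle.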

The results of this approach were impressive. If you run the same tests with priorities on 2 CPU cores, you get this result:

Python 3.2 (New GIL with priorities), 0.298 seconds (35156921.564 bytes/sec)

The prioritized GIL also gives good performance for Antoine's own ccbench.py benchmark.

The first listing is the new GIL; the second is the new GIL with priorities.

== CPython 3.2a0.0 (py3k:78250) ==
== i386 Darwin on 'i386' ==

--- Throughput ---

Pi calculation (Python)

threads=1: 873 iterations/s.
threads=2: 845 ( 96 %)
threads=3: 837 ( 95 %)
threads=4: 820 ( 93 %)

regular expression (C)

threads=1: 348 iterations/s.
threads=2: 339 ( 97 %)
threads=3: 328 ( 94 %)
threads=4: 317 ( 91 %)

bz2 compression (C)

threads=1: 367 iterations/s.
threads=2: 655 ( 178 %)
threads=3: 642 ( 174 %)
threads=4: 646 ( 175 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 5 ms. (std dev: 0 ms.)
CPU threads=2: 2 ms. (std dev: 2 ms.)
CPU threads=3: 138 ms. (std dev: 100 ms.)
CPU threads=4: 132 ms. (std dev: 99 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 6 ms. (std dev: 1 ms.)
CPU threads=2: 6 ms. (std dev: 6 ms.)
CPU threads=3: 6 ms. (std dev: 4 ms.)
CPU threads=4: 10 ms. (std dev: 8 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

== CPython 3.2a0.0 (py3k:78215M) ==
== i386 Darwin on 'i386' ==

--- Throughput ---

Pi calculation (Python)

threads=1: 885 iterations/s.
threads=2: 860 ( 97 %)
threads=3: 869 ( 98 %)
threads=4: 859 ( 97 %)

regular expression (C)

threads=1: 362 iterations/s.
threads=2: 358 ( 98 %)
threads=3: 349 ( 96 %)
threads=4: 354 ( 97 %)

bz2 compression (C)

threads=1: 373 iterations/s.
threads=2: 654 ( 175 %)
threads=3: 649 ( 173 %)
threads=4: 638 ( 170 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 2 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 2 ms. (std dev: 1 ms.)
CPU threads=2: 3 ms. (std dev: 3 ms.)
CPU threads=3: 2 ms. (std dev: 1 ms.)
CPU threads=4: 2 ms. (std dev: 2 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 1 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)

The overall outcome of the GIL open space was that having a priority mechanism was probably a good idea. However, a lot of people wanted to study the problem in more detail and to think about different possible implementations. I am posting the following tar file that has my own modifications to the GIL used for the above benchmarks:

prioritygil.tar.gz

Note: This tar file has all of the modified files in the Python 3.2 source (pystate.h, pystate.c, and ceval_gil.h) along with the io testing benchmark. Be advised that this patch is only intended for further study by others---it's kind of hacked together and really only a proof of concept implementation of one possible priority scheme. A real implementation would still need to address some issues not covered in my patch (e.g., starvation effects).

Due to other time commitments, I'm not going to be able to do much followup with this patch at this moment. However, I do want to encourage others to at least consider the benefit of introducing thread priorities and to explore different possible implementations. Initial results seem to indicate that this can fix the GIL for both CPU-bound threads and for I/O performance.

A function that works as a context manager and a decorator

noreply@blogger.com (Dave) — Wed, 03 Feb 2010 02:12:00 +0000

As a followup to my last blog post on timings, I present the following function which works as both a decorator and a context manager.

# timethis.py
import time
from contextlib import contextmanager

def timethis(what):
    @contextmanager
    def benchmark():
        start = time.time()
        yield
        end = time.time()
        print("%s : %0.3f seconds" % (what, end-start))
    if hasattr(what,"__call__"):
        def timed(*args,**kwargs):
            with benchmark():
                return what(*args,**kwargs)
        return timed
    else:
        return benchmark()

Here is a short demonstration of how it works:

# Usage as a context manager
with timethis("iterate by lines (UTF-8)"):
     for line in open("biglog.txt",encoding='utf-8'):
          pass

# Usage as a decorator
@timethis
def iterate_by_lines_latin_1():
    for line in open("biglog.txt",encoding='latin-1'):
        pass

iterate_by_lines_latin_1()

If you run it, you'll get output like this:

bash % python3 timethis.py
iterate by lines (UTF-8) : 3.762 seconds
<function iterate_by_lines_latin_1 at 0x100537958> : 3.513 seconds

Naturally, this bit of code would be a good thing to bring into your next code review just to make sure people are actually paying attention.
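
For anyone who does want to keep the idea around, here is a slightly tidier sketch of the same dual-purpose function. It assumes a Python version where the built-in callable() is available, uses functools.wraps so the decorated function keeps its name, labels decorator output with the function's __name__ instead of its repr, and prints the timing even if the block raises. The count_to() function is a made-up example.

```python
import time
import functools
from contextlib import contextmanager

def timethis(what):
    @contextmanager
    def benchmark(label):
        start = time.time()
        try:
            yield
        finally:
            # Report the elapsed time even if the timed block raised
            end = time.time()
            print("%s : %0.3f seconds" % (label, end - start))

    if callable(what):
        # Used as a decorator: wrap the function and label by its name
        @functools.wraps(what)
        def timed(*args, **kwargs):
            with benchmark(what.__name__):
                return what(*args, **kwargs)
        return timed
    # Used as a context manager: 'what' is the label string
    return benchmark(what)

@timethis
def count_to(n):          # hypothetical function to time
    total = 0
    for i in range(n):
        total += i
    return total

result = count_to(1000)

with timethis("sum of a range"):
    sum(range(1000))
```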

A Context Manager for Timing Benchmarks

noreply@blogger.com (Dave) — Tue, 02 Feb 2010 13:01:00 +0000

I spend a lot of time studying different aspects of Python, different implementation techniques, and so forth. As part of that, I often carry out little performance benchmarks. For small things, I'll often use the timeit module. For example:

>>> from timeit import timeit
>>> timeit("math.sin(2)","import math")
0.29826998710632324
>>> timeit("sin(2)","from math import sin")
0.21983098983764648
>>>

However, for larger blocks of code, I tend to just use the time module directly like this:

import time
start = time.time()
# ...
# ... some big calculation
# ...
end = time.time()
print("Whatever : %0.3f seconds" % (end-start))

Having typed that particular code template more often than I care to admit, it occurred to me that I really ought to just make a context manager for doing it. Something like this:

# benchmark.py
import time
class benchmark(object):
    def __init__(self,name):
        self.name = name
    def __enter__(self):
        self.start = time.time()
    def __exit__(self,ty,val,tb):
        end = time.time()
        print("%s : %0.3f seconds" % (self.name, end-self.start))
        return False

Now, I can just use that context manager whenever I want to do that kind of timing benchmark. For example:

# fileitertest.py
from benchmark import benchmark

with benchmark("iterate by lines (UTF-8)"):
     for line in open("biglog.txt",encoding='utf-8'):
          pass

with benchmark("iterate by lines (Latin-1)"):
     for line in open("biglog.txt",encoding='latin-1'):
         pass

with benchmark("iterate by lines (Binary)"):
     for line in open("biglog.txt","rb"):
         pass

If you run it, you might get output like this:

bash % python3 fileitertest.py
iterate by lines (UTF-8) : 3.903 seconds
iterate by lines (Latin-1) : 3.615 seconds
iterate by lines (Binary) : 1.886 seconds

Nice. I like it already!

A few useful bytearray tricks

noreply@blogger.com (Dave) — Fri, 29 Jan 2010 03:02:00 +0000

When I first saw the new Python 3 bytearray object (also back-ported to Python 2.6), I wasn't exactly sure what to make of it. On the surface, it seemed like a kind of mutable 8-bit string (a feature sometimes requested by users of Python 2). For example:

>>> s = bytearray(b"Hello World")
>>> s[:5] = b"Cruel"
>>> s
bytearray(b'Cruel World')
>>>

On the other hand, there are aspects of bytearray objects that are completely unlike a string. For example, if you iterate over a bytearray, you get integer byte values:

>>> s = bytearray(b"Hello World")
>>> for c in s: print(c)
...
72
101
108
108
111
32
87
111
114
108
100
>>>

Similarly, indexing operations are tied to integers:

>>> s[1]
101
>>> s[1] = 97
>>> s[1] = b'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: an integer is required
>>>

Finally, there's the fact that bytearray instances have most of the methods associated with strings as well as methods associated with lists. For example:

>>> s = bytearray(b"Hello World")
>>> s.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> s.append(33)
>>> s
bytearray(b'Hello World!')
>>>

Although documentation on bytearrays describes these features, it is a little light on meaningful use cases. Needless to say, if you have too much spare time (sic) on your hands, this is the kind of thing that you start to think about. So, I thought I'd share three practical uses of bytearrays.

Example 1: Assembling a message from fragments

Suppose you're writing some network code that is receiving a large message on a socket connection. If you know about sockets, you know that the recv() operation doesn't wait for all of the data to arrive. Instead, it merely returns what's currently available in the system buffers. Therefore, to get all of the data, you might write code that looks like this:

# remaining = number of bytes being received (determined already)
msg = b""
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg += chunk                 # Add it to the message
    remaining -= len(chunk)

The only problem with this code is that concatenation (+=) has horrible performance. Therefore, a common performance optimization in Python 2 is to collect all of the chunks in a list and perform a join when you're done. Like this:

# remaining = number of bytes being received (determined already)
msgparts = []
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msgparts.append(chunk)       # Add it to list of chunks
    remaining -= len(chunk)  
msg = b"".join(msgparts)          # Make the final message

Now, here's a third solution using a bytearray:

# remaining = number of bytes being received (determined already)
msg = bytearray()
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg.extend(chunk)            # Add to message
    remaining -= len(chunk)

Notice how the bytearray version is really clean. You don't collect parts in a list and you don't perform that cryptic join at the end. Nice.

Of course, the big question is whether or not it performs. To test this out, I first made a list of small byte fragments like this:

chunks = [b"x"*16]*512

I then used the timeit module to compare the following two code fragments:

# Version 1
msgparts = []
for chunk in chunks:
    msgparts.append(chunk)
msg = b"".join(msgparts)

# Version 2
msg = bytearray()
for chunk in chunks:
    msg.extend(chunk)

When tested, version 1 of the code ran in 99.8s whereas version 2 ran in 116.6s (a version using += concatenation takes 230.3s by comparison). So while performing a join operation is still faster, it's only faster by about 16%. Personally, I think the cleaner programming of the bytearray version might make up for it.
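
If you want to repeat the comparison yourself, a minimal harness might look like the sketch below. The function names and the choice of number=1000 are assumptions made to keep the run short; they are not the settings behind the figures quoted above, so expect different absolute numbers.

```python
from timeit import timeit

chunks = [b"x" * 16] * 512

def join_version():
    msgparts = []
    for chunk in chunks:
        msgparts.append(chunk)
    return b"".join(msgparts)

def bytearray_version():
    msg = bytearray()
    for chunk in chunks:
        msg.extend(chunk)
    return msg

def concat_version():
    msg = b""
    for chunk in chunks:
        msg += chunk
    return msg

# number=1000 is an arbitrary repeat count, not the one used in the post
for func in (join_version, bytearray_version, concat_version):
    elapsed = timeit(func, number=1000)
    print("%-18s : %0.3f seconds" % (func.__name__, elapsed))
```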

Example 2: Binary record packing

This example is a slight twist on the last example. Suppose you had a large Python list of integer (x,y) coordinates. Something like this:

points = [(1,2),(3,4),(9,10),(23,14),(50,90),...]

Now, suppose you need to write that data out as a binary encoded file consisting of a 32-bit integer length followed by each point packed into a pair of 32-bit integers. One way to do it would be to use the struct module like this:

import struct
f = open("points.bin","wb")
f.write(struct.pack("I",len(points)))
for x,y in points:
    f.write(struct.pack("II",x,y))
f.close()

The only problem with this code is that it performs a large number of small write() operations. An alternative approach is to pack everything into a bytearray and only perform one write at the end. For example:

import struct
f = open("points.bin","wb")
msg = bytearray()
msg.extend(struct.pack("I",len(points)))
for x,y in points:
    msg.extend(struct.pack("II",x,y))
f.write(msg)
f.close()

Sure enough, the version that uses bytearray runs much faster. In a simple timing test involving a list of 100000 points, it runs in about half the time of the version that makes a lot of small writes.
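
As a sanity check on the record layout, here is a small round-trip sketch that packs a handful of points into a bytearray and decodes them again with struct.unpack_from(). The five sample points are made up, and the format codes use native byte order and sizes exactly as in the code above, so the result is only portable between machines that agree on both.

```python
import struct

points = [(1, 2), (3, 4), (9, 10), (23, 14), (50, 90)]   # sample data

# Pack: an unsigned int count followed by one (x, y) pair per point
msg = bytearray()
msg.extend(struct.pack("I", len(points)))
for x, y in points:
    msg.extend(struct.pack("II", x, y))

# Unpack: read the count, then walk the buffer one record at a time
(count,) = struct.unpack_from("I", msg, 0)
offset = struct.calcsize("I")
decoded = []
for _ in range(count):
    decoded.append(struct.unpack_from("II", msg, offset))
    offset += struct.calcsize("II")

print(decoded)
```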

Example 3: Mathematical processing of byte values

The fact that bytearrays present themselves as arrays of integers makes it easier to perform certain kinds of calculations. In a recent embedded systems project, I was using Python to communicate with a device over a serial port. As part of the communications protocol, all messages had to be signed with a Longitudinal Redundancy Check (LRC) byte. An LRC is computed by taking an XOR across all of the byte values.

Bytearrays make such calculations easy. Here's one version:

message = bytearray(...)     # Message already created
lrc = 0
for b in message:
    lrc ^= b
message.append(lrc)          # Add to the end of the message

Here's a version that increases your job security:

import functools
message.append(functools.reduce(lambda x,y:x^y,message))

And here's the same calculation in Python 2 without bytearrays:

message = "..."       # Message already created
lrc = 0
for b in message:
    lrc ^= ord(b)
message += chr(lrc)        # Add the LRC byte

Personally, I like the bytearray version. There's no need to use ord() and you can just append the result at the end of the message instead of using concatenation.
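
A handy property of an XOR-based check byte is that XOR-ing every byte of a framed message, check byte included, gives zero. The sketch below uses that to validate a message on the receiving side. The add_lrc() and check_lrc() helpers are hypothetical names, and a real device protocol may define its LRC differently.

```python
from functools import reduce
from operator import xor

def add_lrc(payload):
    # Return a new bytearray with the XOR of all payload bytes appended
    message = bytearray(payload)
    message.append(reduce(xor, message, 0))
    return message

def check_lrc(message):
    # A correctly framed message XORs to zero, LRC byte included
    return reduce(xor, message, 0) == 0

framed = add_lrc(b"Hello World")
ok_before = check_lrc(framed)
framed[0] ^= 0x01                 # flip one bit to simulate line noise
ok_after = check_lrc(framed)
print(ok_before, ok_after)
```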

Here's another cute example. Suppose you wanted to run a bytearray through a simple XOR-cipher. Here's a one-liner to do it:

>>> key = 37
>>> message = bytearray(b"Hello World")
>>> s = bytearray(x ^ key for x in message)
>>> s
bytearray(b'm@IIJ\x05rJWIA')
>>> bytearray(x ^ key for x in s)
bytearray(b'Hello World')
>>>

Final Comments

Although some programmers might focus on bytearrays as a kind of mutable string, I find their use as an efficient means for assembling messages from fragments to be much more interesting. That's because this kind of problem comes up a lot in the context of interprocess communication, networking, distributed computing, and other related areas. Thus, if you know about bytearrays, it might lead to code that has good performance and is easy to understand.

That's it for this installment. In case you're wondering, this topic is also related to my upcoming PyCON'2010 tutorial "Mastering Python 3 I/O."

Reexamining Python 3 Text I/O

noreply@blogger.com (Dave) — Wed, 27 Jan 2010 04:36:00 +0000

Note: Since I first posted this, I added a performance test using the Python 2.6.4 codecs module. This addition is highlighted in red.

When Python 3.0 was first released, I tried it out on a few things and walked away unimpressed. By far, the big negative was the horrible I/O performance. For instance, scripts to perform simple data analysis tasks like processing a web server log file were running more than 30 times slower than Python 2. Even though there were many new features of Python 3 to be excited about, the I/O performance alone was enough to make me not want to use it---or recommend it to anyone else for that matter.

Some time has passed since then. For example, Python-3.1.1 is out and many improvements have been made. To force myself to better understand the new Python 3 I/O system, I've been working on a tutorial Mastering Python 3 I/O for the upcoming PyCON'2010 conference in Atlanta. Overall, I have to say that I'm pretty impressed with what I've found--and not just in terms of improved performance.

Due to space constraints, I can't talk about everything in my tutorial here. However, I thought I would share some thoughts about text-based I/O in Python 3.1 and discuss a few examples. Just as a disclaimer, I show a few benchmarks, but my intent is not to do a full study of every possible aspect of text I/O handling. I would strongly advise you to download Python 3.1.1 and perform your own tests to get a better feel for it.

Like many people, one of my main uses of Python is data processing and parsing. For example, consider the contents of a typical Apache web server log:

75.54.118.139 - - [24/Feb/2008:00:15:42 -0600] "GET /favicon.ico HTTP/1.1" 404 133
75.54.118.139 - - [24/Feb/2008:00:15:49 -0600] "GET /software.html HTTP/1.1" 200 3163
75.54.118.139 - - [24/Feb/2008:00:16:10 -0600] "GET /ply/index.html HTTP/1.1" 200 8018
213.145.165.82 - - [24/Feb/2008:00:16:19 -0600] "GET /ply/ HTTP/1.1" 200 8018
...

Let's look at a simple script that processes this file. For example, suppose you wanted to produce a list of all URLs that have generated a 404 error. Here's a really simple (albeit hacky) script that does that:

error_404_urls = set()
for line in open("access-log"):
    fields = line.split()
    if fields[-2] == '404':
        error_404_urls.add(fields[-4])

for name in error_404_urls:
    print(name)

On my machine, I have a 325MB log file consisting of 3649000 lines--a perfect candidate for performing a few benchmarks. Here are the numbers that you get running the above script with different Python versions. UCS-2 refers to Python compiled with 16-bit Unicode characters. UCS-4 refers to Python compiled with 32-bit Unicode characters (the --with-wide-unicode configuration option). Also, in the interest of full disclosure, these tests were performed with a warm disk cache on a 2 GHz Intel Core 2 Duo Apple Macbook with 4GB of memory under OS-X 10.6.2 (Snow Leopard).

Python Version	Time (seconds)
2.6.4	7.91s
3.0	125.42s
3.1.1 (UCS-2)	14.11s
3.1.1 (UCS-4)	17.32s

As you can see, Python 3.0 performance was an anomaly--the performance of Python 3.1.1 is substantially better. To better understand the I/O component of this script, I ran a modified test with the following code:

for line in open("access-log"):
    pass

Here are the performance results for iterating over the file by lines:

Python Version	Time (seconds)
2.6.4	1.50s
2.6.4 (codecs, UTF-8)	52.22s
3.0	105.87s
3.1.1 (UCS-2)	4.35s
3.1.1 (UCS-4)	6.11s

If you look at these numbers, you will see that the I/O performance of Python 3.1 has improved substantially. It is also substantially faster than using the codecs module in Python 2.6. However, you'll also observe that the performance is still quite a bit worse than the native Python 2.6 file object. For example, in the table, iterating over lines is about 3x slower in Python 3.1.1 (UCS-2). How can that be good? That's 300% slower!

Let's talk about the numbers in more detail. The decreased performance in Python 3 is almost solely due to the overhead of the underlying Unicode conversion applied to text input. That conversion process involves two distinct steps:

Input data (bytes) has to be scanned and characters decoded according to some encoding (UTF-8 by default).
The decoded character data has to be stored as an array of multibyte integers that represent the associated string result.
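
The two steps above can be seen in miniature by decoding a byte string by hand, which is roughly what a text-mode file does for every chunk it reads. The sample bytes below are made up for illustration.

```python
# UTF-8 encoded bytes, as they would arrive from disk
raw = b"caf\xc3\xa9 au lait\n"

# Step 1: scan the bytes and decode them according to the encoding
# Step 2: the result is stored as a new string of Unicode characters
text = raw.decode("utf-8")

# The two-byte sequence \xc3\xa9 becomes one character, so the
# lengths differ
print(len(raw), len(text))
print(repr(text))
```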

The overhead of decoding is a direct function of how complicated the underlying codec is. Although UTF-8 is relatively simple, it's still more complex than an encoding such as Latin-1. Let's see what happens if we try reading the file with "latin-1" encoding instead. Here's the modified test code:

for line in open("access-log",encoding='latin-1'):
    pass

Here are the modified performance results that show an improvement:

Python Version	Time (seconds)
3.1.1 (UCS-2)	3.64s (was 4.35s)
3.1.1 (UCS-4)	5.33s (was 6.11s)

Lesson learned: the encoding matters. So, if you're working purely with ASCII text, specifying an encoding such as 'latin-1' will speed everything up. Just so you know, if you specify 'ascii' encoding, you get no improvement over UTF-8. This is because 'ascii' requires more work to decode than 'latin-1' (due to an extra check for bytes outside the range 0-127 in the decoding process).
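
The difference in strictness between the two codecs is easy to see directly. This short sketch shows behavior only, not speed: 'latin-1' maps every byte value straight to the code point with the same number, so it never fails, whereas 'ascii' has to reject anything above 127.

```python
data = bytes(range(256))          # every possible byte value

# latin-1: byte n always becomes code point n, so decoding cannot fail
text = data.decode("latin-1")
codes = [ord(ch) for ch in text]

# ascii: each byte must be checked against the 0-127 range
failed_at = None
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    failed_at = exc.start         # offset of the first offending byte

print("latin-1 decoded", len(text), "characters")
print("ascii rejected the byte at offset", failed_at)
```

The catch is that 'latin-1' will also silently accept data that is really UTF-8 and produce the wrong characters, so the trick only makes sense when you know the file is plain ASCII.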

At this point, you're still saying that it's slower. Yes, even with a faster encoding, Python 3.1.1 is still about 2.5x slower than Python 2.6.4 on this simple I/O test. Is there anything that can be done about that?

The short answer is "not really." Since Python 3 strings are Unicode, the process of reading a simple 8-bit text file is always going to involve an extra process of converting and copying the byte-oriented data into the multibyte Unicode representation. Just to give you an idea, let's drop into C programming and consider the following program:

#include <stdio.h>

int main() {
  FILE *f;
  char  bytes[256];

  f = fopen("access-log","r");
  while (fgets(bytes,256,f)) {  // Yes, hacky 
  }
}

This program does nothing more than iterate over lines of a file--think of it as the ultimate stripped down version of our Python-2.6.4 test. If you run it, it takes 1.13s on the same log file used for our earlier Python tests.

When you go to Python 3, there is always extra conversion. It's like modifying the C program as follows:

#include <stdio.h>

int main() {
  FILE *f;
  char  bytes[256], *c;
  short  unicode[256], *u;

  f = fopen("access-log","r");
  while (fgets(bytes,256,f)) {
    c = bytes;
    u = unicode;
    while (*c) {    /* Convert to Unicode */
      *(u++) = (short) *(c++);
    }
  }
}

Sure enough, if you run this modified C program, it takes about 1.7 seconds--a nearly 50% performance hit just from that extra copying and conversion step. Minimally, Python 3 has to do the same conversion. However, it's also performing dynamic memory allocation, reference counting, and other low-level operations. So, if you factor all of that in, the performance numbers start to make a little more sense. You also start to understand why it might be really hard to do much better.
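If you'd rather poke at this without leaving Python, the following sketch makes a similar comparison. To be clear, this is not the benchmark used for the numbers above: the log lines are made up, they live in memory instead of on disk, and the names count_lines and timed are just helpers I invented for the example. It iterates over the same data once as raw bytes and once through io.TextIOWrapper with 'latin-1' decoding.

```python
import io
import time

# Hypothetical sample data: 200,000 short ASCII "log" lines held in memory.
data = b'127.0.0.1 - - "GET /index.html HTTP/1.1" 200 1024\n' * 200000

def count_lines(stream):
    """Iterate over a file-like object line by line and count the lines."""
    n = 0
    for line in stream:
        n += 1
    return n

def timed(make_stream):
    """Return (line count, elapsed seconds) for one pass over a new stream."""
    start = time.time()
    n = count_lines(make_stream())
    return n, time.time() - start

# Binary mode: lines come back as bytes, no decoding step.
binary_lines, binary_time = timed(lambda: io.BytesIO(data))

# Text mode: every line is decoded into a Unicode string.
text_lines, text_time = timed(
    lambda: io.TextIOWrapper(io.BytesIO(data), encoding='latin-1'))

print("bytes: %d lines in %.3fs" % (binary_lines, binary_time))
print("text:  %d lines in %.3fs" % (text_lines, text_time))
```

The absolute timings (and the size of the gap) will depend on your machine and Python version, so treat the output as a rough indication only.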

Now, should you care about all of this? Truthfully, most programs are probably not going to be affected by degraded text I/O performance as much as you think. That's because most interesting programs do far more than just I/O. Go back and consider the original script that I presented. On Python-2.6.4, it took 7.91s to execute. If I go back and tune the script to use the more efficient 'latin-1' encoding, it takes 13.8s with Python-3.1.1. Yes, that's about 1.75x slower than before. However, the key point is that it's not 2.5x slower as our earlier I/O tests would suggest. The performance impact will become less and less as the script performs more non-IO related work.

Finally, let's say that you still can't live with the performance degradation. If you're just working with simple ASCII data files, you might solve this problem by turning to binary I/O instead. For example, the following script variant uses binary I/O and bytes for most of its processing--only converting text to Unicode when absolutely necessary for printing.

error_404_urls = set()
for line in open("access-log","rb"):
    fields = line.split()
    if fields[-2] == b'404':
        error_404_urls.add(fields[-4])

for name in error_404_urls:
    print(name.decode('latin-1'))

If you run this final script, you find that it takes 8.22s in Python 3.1.1--which is only about 4% slower than Python-2.6.4. How about that!
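One thing to watch for when working this way is that bytes and strings never compare equal in Python 3, so the status code has to be written as b'404' and not '404'. Here is a self-contained sketch of the same idea that runs without a log file. The sample lines are invented, io.BytesIO stands in for open("access-log","rb"), and find_404_urls is a hypothetical helper name.

```python
import io

# Made-up sample log lines (hypothetical data in common log format).
sample = (
    b'10.0.0.1 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ply.html HTTP/1.1" 200 97238\n'
    b'10.0.0.2 - - [24/Feb/2008:00:09:03 -0600] "GET /missing.html HTTP/1.1" 404 133\n'
    b'10.0.0.3 - - [24/Feb/2008:00:09:10 -0600] "GET /old/page.html HTTP/1.1" 404 133\n'
    b'10.0.0.2 - - [24/Feb/2008:00:09:12 -0600] "GET /missing.html HTTP/1.1" 404 133\n'
)

def find_404_urls(lines):
    """Collect the URLs of 404 responses, working entirely in bytes."""
    urls = set()
    for line in lines:
        fields = line.split()
        # Skip short or malformed lines; compare against a bytes literal.
        if len(fields) >= 4 and fields[-2] == b'404':
            urls.add(fields[-4])
    return urls

# io.BytesIO stands in for open("access-log", "rb") here.
error_404_urls = find_404_urls(io.BytesIO(sample))

# Decode to text only at the point of output.
decoded = sorted(name.decode('latin-1') for name in error_404_urls)
for name in decoded:
    print(name)
```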

The bottom line is that Python-3.1 is definitely worth a second look--especially if you tried the earlier Python 3.0 release and were disappointed with its performance. Although text-based I/O is always going to be slower in Python 3 due to extra Unicode processing, it might not matter as much in practice. Plus, binary I/O in Python 3 is still quite fast which means that you can turn to it as a last resort.

If you want to know more, attend my Mastering Python 3 I/O at PyCON'2010 or sign up for the Special Preview in Chicago.

Final Notes:

All versions of Python were compiled from source using the exact same configuration, compiler, and environment settings.
Python timing tests were performed using the time module and enclosing code with these statements:
```
import time
start = time.time()
... statements ...
end = time.time()
print(end-start)
```

Slashdot, Pronouns, and the Python Essential Reference

noreply@blogger.com (Dave) — Thu, 21 Jan 2010 18:00:00 +0000

Yesterday, I was ecstatic to see a positive review of my Python Essential Reference book on Slashdot. I've never had a book reviewed on Slashdot before. However, I also know that with Slashdot, one never really knows what direction the subsequent discussion is going to take. For instance, will someone jump in and say something like "in Soviet Russia, Python indents you" or will the conversation devolve into something about how Python programmers will never have a girlfriend? That's not true by the way. I once had a girlfriend who went to hear me talk for 90 minutes about LALR(1) parser generators at a Chipy meeting despite the fact that she didn't know the first thing about programming. That's surely a sign of true love or insanity if there ever was one. Needless to say, I married her. However, I digress.

No, this time around, the Slashdot discussion decided it was going to focus on the use of pronouns--namely in response to a comment that included the sentence "... there is a lot of what a developer needs and very little of what she doesn't need." Now, I am by no means any fan of political correctness, but I had to chuckle at the irony. Of all of the things to discuss about the Python Essential Reference, pronouns would have to rank at about the bottom of the list. This is because the entire book is virtually devoid of personal pronouns. With the exception of the word "you" (e.g., "you type this..."), you won't find "he", "she", "him", "her", "we", or anything like that used anywhere in the text. This was an intentional choice, but it wasn't related to any kind of political influence (in fact, editors of the Essential Reference have often tried to add pronouns like "he" and "she" to the text only to have me take them out again).

First published in 1999, the Essential Reference was actually my second major writing project--the first being my Ph.D. dissertation which had been completed the year before. As you know, writing a dissertation is a pretty major affair. Not only do you have to do original research and defend it, you also have to write a major document describing the results. For a typical graduate student, the dissertation is the most technically demanding document you will ever write. It might even be the first document that you will ever submit to a real-world copy editor--an editor who will very likely tear your precious document to shreds in front of your eyes.

In my case, the final stage of my dissertation involved a somewhat prolonged battle with the dissertation editor at the University of Utah. Upon submitting the document, she would immediately put it under the microscope to see if it met the required "technical specifications." This meant measuring margins, line spacing, tables, figures, and other details with a ruler. Any deviation whatsoever meant instant rejection of the entire document--please play again.

Assuming one could pass the basic technical requirements, the next stage involved a review to see if you were strictly adhering to the required writing "style guide." When submitting a dissertation, you actually had to indicate a specific writing style guide. For example, I said that I was writing the document according to the "Chicago Manual of Style." What this meant in practical terms is that upon submitting the dissertation to the editor, she would read it and return it to you a few days later dripping in a sea of red ink. Every sentence of the document that did not precisely adhere to that style guide would be torn apart. I have to say that in my entire academic and professional career (grade school, high school, college, etc.), I have never had any paper reviewed like that.

Just to give you an example of the agony, if I wrote something like "the data is plotted" (something that sounded perfectly reasonable to me as a programmer) the editor would reject it because "data" is a plural (of datum) and you can't use "is" with a plural (e.g., you would never say "the points is plotted."). The other major source of agony was in the use of pronouns. The editor would instantly punish you for any use of a personal pronoun. So, a sentence like "we took the points and processed them with a script" would be rejected.

Essentially the editor wanted the entire document to be written in what I would roughly describe as "academic passive voice." It's a style of writing where you never identify who is actually carrying out various actions. So, instead of saying "we took the points and processed them with a script" you had to write "the points were processed with a script." As you can see, a major feature of this writing style is that it is very direct and precise. Not only is the second sentence more compact, it doesn't muddle the discussion with unimportant details about who is actually carrying out the action. Obviously, you also avoid the whole issue of "he" versus "she" with such a writing style.

Anyways, work on the Python Essential Reference started just 6 months after finishing my dissertation. Having fought all of those editor battles, I wrote it in the exact same style. So far as I can remember, I don't think any pronoun other than "it" or "its" appeared in the text. It must have blown the copy editor's mind. What kind of deranged lunatic would write a 300-page impersonal document like that? Especially since writing in the passive voice is something so actively discouraged.

Over the last ten years, various copy-editors have worked on the Essential Reference, but much of that original academic writing style remains. At some point, use of the word "you" was introduced in the book. I was somewhat lukewarm about it at the time, but as an author you also learn to pick and choose your battles--and that wasn't one that seemed worth fighting (unlike the battle to convince my publisher that putting out a Python 2.6 book hot on the heels of Python 3.0 was going to make any sense).

So there you have it. A review of a book virtually devoid of personal pronouns spawns a big discussion on the use of he/she on Slashdot. Who would have thought?

Naturally, I disavow any grammatical mistakes in this blog post---after all, I don't have a editor.

Presentation on the new Python GIL

noreply@blogger.com (Dave) — Sun, 17 Jan 2010 18:09:00 +0000

For anyone who missed it, I gave a presentation on the new Python GIL, implemented by Antoine Pitrou, at the January 14, 2010 meeting of Chipy. The presentation slides can be found at http://www.dabeaz.com/python/NewGIL.pdf. I don't have any followup comments to put here at this time. However, I think this is an exciting new development for Python 3.

The Python GIL Visualized

noreply@blogger.com (Dave) — Tue, 05 Jan 2010 14:18:00 +0000

In preparation for my upcoming PyCON'2010 talk on "Understanding the Python GIL", I've been working on a variety of new material--including some graphical visualization of the GIL behavior described in my earlier talk. I'm still experimenting, but check it out.

In these graphs, Python interpreter ticks are shown along the X-axis. The two bars indicate two different threads that are executing. White regions indicate times at which a thread is completely idle. Green regions indicate when a thread holds the GIL and is running. Red regions indicate when a thread has been scheduled by the operating system only to awake and find that the GIL is not available (e.g., the infamous "GIL Battle"). For those who don't want to read, here is the legend again in pictures:

Okay, now let's look at some threads. First, here is the behavior of running two CPU-bound threads on a single CPU system. As you will observe, the threads nicely alternate with each other after long periods of computation.

Now, let's go fire up the code on your fancy new dual-core laptop. Yow! Look at all of that GIL contention. Again, all of those red regions indicate times where the operating system has scheduled a Python thread on one of the cores, but it can't run because the thread on the other core is holding the GIL.

Here's an interesting case that involves an I/O bound thread competing with a CPU-bound thread. In this example, the I/O thread merely echoes UDP packets. Here is the code for that thread.

from socket import socket, AF_INET, SOCK_DGRAM

def thread_1(port):
    s = socket(AF_INET,SOCK_DGRAM)
    s.bind(("",port))
    while True:
        msg, addr = s.recvfrom(1024)
        s.sendto(msg,addr)

The other thread (thread 2) is just mindlessly spinning. This graph shows what happens when you send a UDP message to thread 1.

As you would expect, most of the time is spent running the CPU-bound thread. However, when I/O is received, there is a flurry of activity that takes place in the I/O thread. Let's zoom in on that region and see what's happening.

In this graph, you're seeing how difficult it is for the I/O bound thread to get the GIL in order to perform its small amount of processing. For instance, approximately 17000 interpreter ticks pass between the arrival of the UDP message and successful return of the s.recvfrom() call (and notice all of the GIL contention). More than 34000 ticks pass between the execution of s.sendto() and looping back to the next s.recvfrom() call. Needless to say, this is not the behavior you usually want for I/O bound processing.

Anyways, that is all for now. Come to my PyCON talk to see much more. Also check out Antoine Pitrou's work on a new GIL.

Note: It is not too late to sign up for my Concurrency Workshop next week (Jan 14-15).

Python Concurrency Workshop (Reprise)

noreply@blogger.com (Dave) — Mon, 14 Dec 2009 11:50:00 +0000

Well, the winter months are now upon us--making it a perfect time to come to Chicago in the middle of January and have your brain exploded by the second edition of my Python Concurrency Workshop (January 14-15, 2010). Over the last few months, I've been working on numerous refinements to the previous workshop and adding some new material related to distributed computing (Actors, REST, distributed objects, etc.). I think I'm even more excited by this version than the last.

So what is this concurrency workshop you ask? Well, first of all, you may have already encountered a small portion of it if you saw my presentation on the Python GIL---that was only a small part of the workshop's thread programming section. The rest of the workshop aims to explore a variety of other topics at a similar technical depth: thread synchronization, thread debugging, message passing, data serialization, interprocess communication, multiprocessing, distributed computing, and advanced I/O handling. In a nutshell, it's an opportunity to learn more about what makes Python tick and to go beyond what you normally find in the user manual. The workshop is also a kind of proving ground for some of my future book projects and PyCON tutorials--I have made every effort to keep it cutting edge.

So, you might ask, who is the target audience of the workshop? Although a lot of advanced material is covered, I think the workshop is best suited for intermediate Python programmers who want to learn more. For instance, the workshop utilizes numerous Python features such as context managers, decorators, generators, and coroutines. If you've heard of such topics before, but aren't quite sure what they're all about, the workshop will fill in details. Second, the workshop has a very strong focus on networking and distributed systems. If you've been doing work in web services, cloud computing, parallel computing, or any related topic, the workshop aims to fill in a variety of essential technical details that will help you write more efficient code. Finally, if you simply want to escape the office and hang out with other Python hackers, the workshop won't disappoint.

Finally, although there is a small chance the workshop will be held in the middle of a wind-whipped Chicago blizzard, other amenities will more than make up for it. Some of Chicago's finest bakeries and coffee shops surround the workshop venue--ensuring a proper balance of sugar and caffeine required for a workshop of this nature. You won't be disappointed.

In any case, hopefully I'll see you at the workshop. It's going to be great!

Fun with block towers

noreply@blogger.com (Dave) — Fri, 27 Nov 2009 15:50:00 +0000

Lately, I've been having a lot of fun playing with wooden blocks--a great toy for toddlers and grown-ups alike.

There's a certain primal simplicity to blocks. Sure, you can stack them up in simple towers or piles. However, my inner geek makes me want to build more tricky structures. For example, this diamond structure:

Or maybe a diamond with a huge spire:

Or flip the whole thing upside down if you're inclined:

A more interesting challenge is to build an arch.

And if you can keep that stable, to find out how much you can stack on top of it

Lately, I've been experimenting with expanding the number of dimensions. For example, this interesting structure:

Or this more complex extension of the idea

Somewhere in all of this, there's probably some kind of software development analogy. Maybe it's the fact that even with simple components, you can make some pretty cool things. Or maybe it's somehow related to the same inner urge that drives a programmer to build their entire application out of closures, generators, coroutines, actors, tasklets, or something similarly "simple."

Then again, maybe it's more of a warning. After all, there are those pesky end-users who are going to put their dirty hands on everything when you're done (observe their look of terror).

... and well, we all know what happens next.

Anyways, that is all for now. Hope everyone is enjoying the holiday!

Python Thread Deadlock Avoidance

noreply@blogger.com (Dave) — Fri, 20 Nov 2009 22:50:00 +0000

One danger of writing programs based on threads is the potential for deadlock--a problem that almost invariably shows up if you happen to write thread code that tries to acquire more than one mutex lock at once. For example:

import threading

a_lock = threading.Lock()
b_lock = threading.Lock()

def foo():
    with a_lock:
         ...
         with b_lock:
              # Do something
              ...

t1 = threading.Thread(target=foo)
t1.start()

Code like that looks innocent enough until you realize that some other thread in the system also has a similar idea about locking--but acquires the locks in a slightly different order:

def bar():
    with b_lock:
         ...
         with a_lock:
              # Do something (maybe)
              ...

Sure, the code might be lucky enough to work "most" of the time. However, you will suffer a thousand sorrows if both threads try to acquire those locks at about the same time and you have to figure out why your program is mysteriously nonresponsive.

Computer scientists love to spend time thinking about such problems--especially if it means they can make up some diabolical problem about philosophers that they can put on an operating systems exam. However, I'll spare you the details of that.

The problem of deadlock is not something that I would normally spend much time thinking about, but I recently saw some material talking about improved thread support in C++0x. For example, this article has some details. In particular, it seems that C++0x offers a new locking operation std::lock() that can acquire multiple mutex locks all at once while avoiding deadlock. For example:

std::unique_lock<std::mutex> lock_a(a.m,std::defer_lock);
std::unique_lock<std::mutex> lock_b(b.m,std::defer_lock);
std::lock(lock_a,lock_b);      // Lock both locks
...
... do something involving data protected by both locks
...

I don't actually know how C++0x implements its lock() operation, but I do know that one way to avoid deadlock is to put some kind of ordering on all of the locks in a program. If you then strictly enforce a policy that all locks have to be acquired in increasing order, you can avoid deadlock. Just as an example, if you had two locks A and B, you could assign a unique number to each lock such as A=1 and B=2. Then, in any part of the program that wanted to acquire both lock A and B, you just make a rule that A always has to be acquired first (because its number is lower). In such a scheme, the thread bar() shown earlier would simply be illegal. That lock() operation in C++ is almost certainly doing something similar to this--that is, it knows enough about the locks so that they can be acquired without deadlock.

All of this got me thinking--I wonder how hard it would be to implement the lock() operation in Python? Not hard as it turns out. First step is to change the name--given that acquire() is the typical method used to acquire a lock, let's just call the operation acquire() to make it more clear. You can define acquire() as a context-manager and simply order locks according to their id() value like this:

class acquire(object):
    def __init__(self,*locks):
        self.locks = sorted(locks, key=lambda x: id(x))
    def __enter__(self):
        for lock in self.locks:
            lock.acquire()
    def __exit__(self,ty,val,tb):
        for lock in reversed(self.locks):
            lock.release()
        return False

Okay, that was easy enough to do, but does it work? Let's try it on the classic dining philosophers problem (look it up if you need a refresher):

import threading

# The philosopher thread
def philosopher(left, right):
    while True:
        with acquire(left,right):
             print threading.currentThread(), "eating"

# The chopsticks
NSTICKS = 5
chopsticks = [threading.Lock() 
              for n in xrange(NSTICKS)]

# Create all of the philosophers
phils = [threading.Thread(target=philosopher,
                          args=(chopsticks[n],chopsticks[(n+1) % NSTICKS]))
         for n in xrange(NSTICKS)]

# Run all of the philosophers
for p in phils:
    p.start()

If you try this code, you'll find that the philosophers run all day with no deadlock. Just as an experiment, you can try changing the philosopher() implementation to one that acquires the locks separately:

def philosopher(left, right):
    while True:
        with left:
             with right:
                 print threading.currentThread(), "eating"

Yep, almost instantaneous deadlock. So, as you can see, our acquire() operation seems to be working.
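Since the philosophers above eat forever, it's hard to use that code as an automated check. Here is a bounded variant you could run as a quick test. It is only a sketch and it is written for Python 3 (range and is_alive() instead of xrange and the print statement). The ROUNDS limit, the meals counters, and the 30-second join timeout are assumptions I added so that the program terminates and reports something.

```python
import threading

class acquire(object):
    """Context manager from above: acquire the locks in id() order."""
    def __init__(self, *locks):
        self.locks = sorted(locks, key=id)
    def __enter__(self):
        for lock in self.locks:
            lock.acquire()
    def __exit__(self, ty, val, tb):
        for lock in reversed(self.locks):
            lock.release()
        return False

NSTICKS = 5
ROUNDS = 1000            # hypothetical bound so that the run terminates
meals = [0] * NSTICKS    # number of meals eaten by each philosopher

def philosopher(n, left, right):
    for _ in range(ROUNDS):
        with acquire(left, right):
            meals[n] += 1    # only philosopher n ever writes meals[n]

chopsticks = [threading.Lock() for n in range(NSTICKS)]
phils = [threading.Thread(target=philosopher,
                          args=(n, chopsticks[n], chopsticks[(n + 1) % NSTICKS]))
         for n in range(NSTICKS)]

for p in phils:
    p.daemon = True      # don't hang the interpreter if a deadlock does occur
    p.start()
for p in phils:
    p.join(timeout=30)

# If any thread is still alive after the timeout, something got stuck.
finished = not any(p.is_alive() for p in phils)
print(finished, meals)
```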

There's still one last aspect of this experiment that needs to be addressed. One potential problem with our acquire() operation is that it doesn't prevent a user from using it in a nested manner as before. For example, someone might write code like this:

with acquire(a_lock,b_lock):
     ...
     with acquire(c_lock, d_lock):
          ...

Catching such cases at the time of definition would be difficult (if not impossible). However, we could make the acquire() context manager keep a record of all previously acquired locks using a list placed in thread local storage. Here's a new implementation--and just for kicks, I'm going to switch it over to a context manager defined by a generator (mainly because I can and generators are cool):

import threading
from contextlib import contextmanager

local = threading.local()
@contextmanager
def acquire(*locks):
    locks = sorted(locks, key=lambda x: id(x))   
    acquired = getattr(local,"acquired",[])
    # Check to make sure we're not violating the order of locks already acquired   
    if acquired:
        if max(id(lock) for lock in acquired) >= id(locks[0]):
            raise RuntimeError("Lock Order Violation")
    acquired.extend(locks)
    local.acquired = acquired
    try:
        for lock in locks:
            lock.acquire()
        yield
    finally:
        for lock in reversed(locks):
            lock.release()
        del acquired[-len(locks):]

If you use this version, you'll find that the philosophers work just fine as before. However, now consider this slightly modified version with the nested acquires:

# The philosopher thread                                                                                             
def philosopher(left, right):
    while True:
        with acquire(left):
            with acquire(right):
                print threading.currentThread(), "eating"

Unlike the previous version that had nested with statements and deadlocked, this one runs. However, one of the philosophers crashes with a nasty traceback:

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 522, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 477, in run
    self.__target(*self.__args, **self.__kwargs)
  File "hier4.py", line 53, in philosopher
    with acquire(right):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "hier4.py", line 35, in acquire
    raise RuntimeError("Lock Order Violation")
RuntimeError: Lock Order Violation

Very good. That's exactly what we wanted.
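You don't actually need five threads to see the check fire. The following single-threaded sketch (Python 3) repeats the generator-based acquire() and then nests two locks in both orders. The names low and high are just labels I made up for the two locks after sorting them by id().

```python
import threading
from contextlib import contextmanager

local = threading.local()

@contextmanager
def acquire(*locks):
    # Same logic as the generator-based version above.
    locks = sorted(locks, key=id)
    acquired = getattr(local, "acquired", [])
    # Refuse to take a lock that sorts below one this thread already holds.
    if acquired:
        if max(id(lock) for lock in acquired) >= id(locks[0]):
            raise RuntimeError("Lock Order Violation")
    acquired.extend(locks)
    local.acquired = acquired
    try:
        for lock in locks:
            lock.acquire()
        yield
    finally:
        for lock in reversed(locks):
            lock.release()
        del acquired[-len(locks):]

# Two locks, labeled by their position in id() order.
low, high = sorted([threading.Lock(), threading.Lock()], key=id)

# Nesting in increasing id() order is allowed.
with acquire(low):
    with acquire(high):
        in_order_ok = True

# Nesting in decreasing order is rejected before the inner lock is touched.
try:
    with acquire(high):
        with acquire(low):
            pass
    violation_caught = False
except RuntimeError as e:
    violation_caught = True
    message = str(e)

print(in_order_ok, violation_caught, local.acquired)
```

Note that the outer with statement still releases its lock when the exception passes through it, so the thread-local list ends up empty again.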

So, what's the moral of this story? First of all, I don't think you should use this as a license to go off and write a bunch of multithreaded code that relies on nested lock acquisitions. Sure, the context manager might catch some potential problems, but it won't change the fact that you'll still want to blow your head off after debugging some other horrible problem that comes up with your overly clever and/or complicated design.

I think the main take-away is an appreciation for Python's context-manager feature. There's so much more you can do with a context manager than simply closing a file or releasing an individual lock.

Disclaimer: I didn't do a hugely exhaustive internet search to see if anyone else had implemented anything similar to this in Python. If you know of some links to related work, tell me. I'll add them here.

Ultimate Python Quickstart Guide

noreply@blogger.com (Dave) — Sat, 31 Oct 2009 16:19:00 +0000

As the father of a toddler and a newborn, I've been getting my fair share of practice putting together various sorts of baby accessories (strollers, bassinets, cribs, etc.). It has inspired me to write this ultimate quick start guide to getting started with the Python programming language. I hope that you find it to be as incredibly useful as I have.

Congratulations!

Congratulations on your wise decision to use Python! Follow this quick and easy guide to get started.

(a) Get

(b) Click

(d) Code

Enjoy your new Python interpreter!

Python Thread Synchronization Primitives : Not Entirely What You Think

noreply@blogger.com (Dave) — Mon, 14 Sep 2009 00:43:00 +0000

If you have done any kind of programming with Python threads, you are probably familiar with the basic synchronization primitives provided by the threading module. Specifically, you get the following kinds of synchronization objects to work with:

Lock. Mutual exclusion lock that's commonly used to protect shared data structures.
RLock. Reentrant mutual exclusion lock that is useful for code-based locking on functions or methods or to implement monitors.
Event. An object that allows one or more threads to wait for some "event" to occur. Used to implement barriers or to signal the completion of some task.
Condition. Condition variable. Used to send signals between threads. For example in producer-consumer problems, the producer will use a condition variable to send a signal to the consumer that data is available.
Semaphore. A high-level synchronization primitive based on an integer counter. Acquiring the semaphore decreases the counter and releasing the semaphore increases the counter. If the counter is 0 and a thread tries to acquire, it will block until a different thread releases the semaphore.

Knowing how and when to use the various synchronization primitives is often a non-trivial exercise. However, the point of this post is not about that--so if you're here looking for a gentle tutorial, you're in the wrong place.

Instead, I'd like to look at the inner workings of Python's thread synchronization primitives. In part, this is motivated by a general interest in knowing how Python works on multicore machines. However, it's also related to something that I noticed when putting my GIL talk together. So, we'll take a little tour under the covers, do a few experiments, and think about how this might fit into the "big picture."

A Curious Fact

If you write threaded programs, you should know that Python uses real system-level threads to carry out its work. That is, threads are implemented using pthreads or some other native threading mechanism provided by the operating system. However, the same can not be said of Python's basic synchronization primitives such as Lock, Condition, Semaphore and so forth. That is, even though low-level libraries such as pthreads provide various kinds of basic locks and synchronization objects, the threading library doesn't make direct use of them (so, when you're using something like a Lock object in your program, you're not manipulating a pthreads mutex).

This fact may surprise experienced programmers. Many of Python's core library modules provide a direct interface to low-level functionality written in C (e.g., think about the os or socket modules). However, thread synchronization objects are an exception to that rule.

Some History

Python has included support for threads for most of its history. In fact, if Guido ever gets around to updating his History of Python blog, he will eventually tell you that threads were first added to Python in 1992 after a contribution by one of his coworkers Sjoerd Mullender (disclaimer: I don't have a time machine, but I have seen the entire "History of Python" article that Guido is using as the basis for his history blog). This early work is where you find the introduction of the global interpreter lock (GIL) as well as the low-level thread library module.

Part of the problem faced by early versions of Python was the fact that thread programming interfaces weren't always available or standardized across systems. Thus, threads were only supported on certain machines such as SGI Irix and Sun Solaris. The pthreads interface wasn't standardized until a little later (~1995). The modern threading library that virtually all Python programmers now use first appeared in Python-1.5.1 (1998).

A consequence of this chaos was that Python's support for threads was intentionally designed to have a minimal set of basic requirements. The thread library module simply provided a function for launching a Python callable in its own execution thread. A single function, allocate_lock() could be used to allocate a "lock" object. This object provided the usual acquire() and release() operations, but not much else.

If you dig into the C implementation of the interpreter, you'll find that all support for locking is reduced to just four C functions.

PyThread_allocate_lock()
PyThread_free_lock()
PyThread_acquire_lock()
PyThread_release_lock()

You can find these functions in a series of files such as thread_pthread.h, thread_nt.h, thread_solaris.h, and so forth in the Python/ directory of the Python interpreter source. Each file simply contains a platform specific implementation of a basic lock. This lock then becomes the basis for all other synchronization primitives as we'll see in a minute. It should also be noted that these functions are also used to implement the infamous global interpreter lock (GIL).

What is a lock exactly?

If you have worked with thread locking in C, you might think that the above C functions are simply a wrapper around something like a pthreads mutex lock. However, this is not the case. Instead, the lock is minimally implemented as a binary semaphore. Here is a simplified example of the lock implementation that's used on many Unix systems:

#include <stdlib.h>
#include <pthread.h>
#include <string.h>

typedef struct {
  char           locked;
  pthread_cond_t lock_released;
  pthread_mutex_t mut;
} lock_t;

lock_t *
allocate_lock(void) {
  lock_t *lock;
  lock = (lock_t *) malloc(sizeof(lock_t));
  memset((void *)lock, '\0', sizeof(lock_t));
  pthread_mutex_init(&lock->mut,NULL);
  pthread_cond_init(&lock->lock_released, NULL);
  return lock;
}

void 
free_lock(lock_t *lock) {
  pthread_mutex_destroy( &lock->mut );
  pthread_cond_destroy( &lock->lock_released );
  free((void *)lock);
}

int 
acquire_lock(lock_t *lock, int waitflag) {
  int success;
  pthread_mutex_lock( &lock->mut );
  success = lock->locked == 0;

  if ( !success && waitflag ) {
    while ( lock->locked ) {
      pthread_cond_wait(&lock->lock_released,&lock->mut);
    }
    success = 1;
  }
  if (success) lock->locked = 1;
  pthread_mutex_unlock( &lock->mut );
  return success;
}

void 
release_lock(lock_t *lock) {
  pthread_mutex_lock( &lock->mut );
  lock->locked = 0;
  pthread_mutex_unlock( &lock->mut );
  pthread_cond_signal( &lock->lock_released );
}

Understanding this code requires some careful study. However, the key part of it is that Python lock objects manually keep track of their internal state (locked or unlocked). This is the locked attribute of the lock structure. The pthreads mutex lock is simply being used to synchronize access to the locked attribute in the acquire() and release() operations (note: this mutex lock is not actually the lock). Finally, the condition variable is being used to perform a kind of thread signaling that's used to wake up any sleeping threads when the lock gets released.
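To make the structure of the C code easier to see, here is a rough Python model of the same idea. It is only a sketch: the class name SketchLock is made up for illustration, it is written for Python 3 rather than the Python 2 used elsewhere in this post, and it leans on threading.Condition to stand in for the pthreads mutex and condition variable pair. It is not how the interpreter itself implements locks.

```python
import threading

class SketchLock:
    """Toy model of the C lock: a flag guarded by a mutex plus a condition variable."""

    def __init__(self):
        self._locked = False                      # plays the role of lock->locked
        # A Condition bundles a mutex with a wait/notify mechanism, standing in
        # for lock->mut and lock->lock_released in the C code.
        self._released = threading.Condition(threading.Lock())

    def acquire(self, waitflag=True):
        with self._released:                      # like pthread_mutex_lock(&lock->mut)
            if self._locked and not waitflag:
                return False                      # non-blocking attempt failed
            while self._locked:
                self._released.wait()             # like pthread_cond_wait(...)
            self._locked = True
            return True

    def release(self):
        with self._released:
            self._locked = False
            # Unlike the C code, Python requires the mutex to be held
            # while signaling the condition.
            self._released.notify()

lock = SketchLock()
first = lock.acquire()                   # lock is free, so this succeeds
second = lock.acquire(waitflag=False)    # already held, fails without blocking

# Any thread may release the lock; nothing records which thread acquired it.
t = threading.Thread(target=lock.release)
t.start()
t.join()
third = lock.acquire(waitflag=False)     # free again after the other thread's release
```

Notice that release() never checks who is calling it. That absence of ownership is what makes the object behave like a binary semaphore rather than a mutex.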

What about Native Semaphores?

As just mentioned, the Python lock is minimally implemented as a binary semaphore. If you've done thread programming in C, you probably know that many systems optionally include a native semaphore object. On such systems, Python may be built in a way so that it simply uses the native semaphore object for the lock. For example, this is what Python uses for synchronization on Windows.

I don't intend to say any more about this here except to emphasize that using some kind of semaphore is actually a requirement for other parts of Python's threading to work correctly. For instance, the high-level threading library won't work if the lock isn't implemented in this manner.

Semaphores vs. Mutex Locks

The differences between a semaphore and mutex lock are subtle. However, the most obvious one pertains to the issue of ownership. When you use a mutex lock, there is almost always a strong sense of ownership. Specifically, if a thread acquires a mutex, it is the only thread that is allowed to release it. Semaphores don't have this restriction. In fact, once a semaphore has been acquired, any thread can later release it. This allows for more varied forms of thread signaling and synchronization. Here is one such experiment you can try in Python:

>>> import threading, time
>>> done = threading.Lock()
>>> def foo():
...      print "I'm foo and I'm running"
...      time.sleep(30)
...      done.release()       # Signal completion by releasing the lock
...
>>> done.acquire()
>>> threading.Thread(target=foo).start()
I'm foo and I'm running
>>> done.acquire(); print "Foo done"
Foo done                        (note: prints after 30 seconds)
>>>

In this example, a lock is being used to signal the completion of some task. The main thread acquires the lock to clear it and then launches a thread to carry out some work. Immediately after launching this thread, the main thread attempts to acquire the lock again. Since the lock was already in use, this operation blocks. However, when the worker thread finishes, it releases the lock--notifying the main thread that it has finished. It is critical to emphasize that the lock is being acquired and released by two different threads. This is the essential property provided by using a semaphore. If a traditional mutex lock were used, the program would deadlock or crash with an error.

Just as an aside, I would not recommend writing Python code that uses Lock objects in this way. Most programmers are going to associate Lock with a mutex-lock. You definitely don't use mutex-locks in the manner shown.
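The ownership difference described above can be observed by comparing a Lock with an RLock, which does keep track of an owning thread. The following is a small sketch written for Python 3; the helper name try_release is hypothetical and exists only for this illustration.

```python
import threading

def try_release(lock):
    """Release `lock` from a freshly started thread and report what happened."""
    outcome = []

    def worker():
        try:
            lock.release()
            outcome.append("released")
        except RuntimeError:
            outcome.append("RuntimeError")

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return outcome[0]

plain = threading.Lock()
plain.acquire()                       # acquired by the main thread
plain_result = try_release(plain)     # released by a different thread

owned = threading.RLock()
owned.acquire()                       # the main thread becomes the owner
owned_result = try_release(owned)     # a non-owner attempts the release
owned.release()                       # only the owner can actually release it

print(plain_result, owned_result)
```

The plain Lock accepts a release from any thread, while the RLock refuses a release from a thread that does not own it.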

Other differences between mutex locks and semaphores tend to be more subtle. There are a number of well-known problems concerning mutex locks that typically get addressed by thread libraries and the operating system. For example, the system may implement policies to prevent thread starvation or provide some sense of fairness when many threads are competing for the same lock. If threads have different scheduling priorities, the system may also try to work around problems related to priority inversion (a problem where a low-priority thread holds a lock needed by a high-priority thread). Semaphores aren't necessarily treated in the same manner, which means that a multithreaded program using semaphores may execute in a manner that is slightly different than one that uses mutex locks. For now, however, let's skip those details.

The threading Library

Now that we've talked about the low-level locking mechanism used by the interpreter, let's talk about the synchronization primitives defined in the threading library. With the exception of Lock objects, which are identical to the lock described in the above section, all of the other synchronization primitives are written entirely in Python. For example, consider the RLock implementation. Here is a cleaned up version of how it is implemented:

class RLock:
    def __init__(self):
        self._block = _allocate_lock()
        self._owner = None
        self._count = 0

    def acquire(self, blocking=1):
        me = current_thread()
        if self._owner is me:
            self._count = self._count + 1
            return 1
        rc = self._block.acquire(blocking)
        if rc:
            self._owner = me
            self._count = 1
        return rc

    def release(self):
        if self._owner is not current_thread():
            raise RuntimeError("cannot release un-aquired lock")
        self._count = count = self._count - 1
        if not count:
            self._owner = None
            self._block.release()
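The effect of the _owner and _count bookkeeping is easiest to see from the outside. In this short sketch (Python 3 syntax; the variable names are chosen only for illustration), a second non-blocking acquire() from the same thread fails on a plain Lock but succeeds on an RLock.

```python
import threading

plain = threading.Lock()
reentrant = threading.RLock()

plain.acquire()
second_plain = plain.acquire(False)       # non-blocking; the lock is already held
plain.release()

reentrant.acquire()
second_rlock = reentrant.acquire(False)   # same thread, so the count just goes up
reentrant.release()                       # one release per acquire
reentrant.release()

print(second_plain, second_rlock)
```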

The fact that an RLock is implemented entirely as a Python layer over a regular lock object significantly impacts its performance. For example:

>>> from timeit import timeit
>>> timeit("lock.acquire();lock.release()","from threading import Lock; lock = Lock()")
0.50123405456542969
>>> timeit("lock.acquire();lock.release()","from threading import RLock; lock = RLock()")
5.2153160572052002
>>>

Here, you see that acquiring and releasing an RLock object is about ten times slower than using a Lock. The performance impact is worse for more advanced synchronization primitives. For example, here is the result of using a Semaphore object (which is also implemented entirely in Python):

>>> timeit("lock.acquire();lock.release()","from threading import Semaphore; lock = Semaphore(1)")
6.5345189571380615
>>>
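If you want to repeat these measurements yourself, the following sketch bundles them into one script. It assumes Python 3, uses a hypothetical helper named time_lock, and runs far fewer iterations than timeit's default so that it finishes quickly. Absolute numbers depend on the machine and interpreter. Also be aware that recent CPython 3 releases normally supply RLock from a C implementation, so the gap between Lock and RLock may be much smaller than the figures shown above.

```python
from timeit import timeit

def time_lock(kind, number=100000):
    """Time `number` uncontended acquire/release pairs for a threading primitive."""
    setup = "from threading import Lock, RLock, Semaphore; lock = %s" % kind
    return timeit("lock.acquire(); lock.release()", setup, number=number)

results = {
    "Lock": time_lock("Lock()"),
    "RLock": time_lock("RLock()"),
    "Semaphore": time_lock("Semaphore(1)"),
}

for name, seconds in results.items():
    print("%-10s %.4f seconds" % (name, seconds))
```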

Condition and Event objects are also implemented entirely in Python. However, their implementation is also rather interesting. Keep in mind that the primary purpose of a Condition object is to perform signaling between threads. Here is a very common scenario that you see with producer-consumer problems such as in the implementation of a queue.

from threading import Lock, Condition
from collections import deque

items      = deque()
items_cv   = Condition()

def producer():
    while True:
         # produce some item
         items_cv.acquire()
         items.append(item)
         items_cv.notify()
         items_cv.release()

def consumer():
    while True:
         items_cv.acquire()
         while not items:
               items_cv.wait()
         item = items.popleft()
         items_cv.release()
         # Do something with item

Of particular interest here are the wait() and notify() operations that perform the thread signaling. This signaling is actually carried out using a Lock object. When you wait on a condition variable, a new Lock object is created and acquired. The lock is then acquired again to force the thread to block. If you look at the implementation of Condition you find code like this:

class Condition:
    ...
    def wait(self, timeout=None):
        ...
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        ...
        waiter.acquire()       # Block
    ...

The notify() operation that wakes up a thread is carried out by simply releasing the waiter lock created above:

class Condition:
    ...
    def notify(self, n=1):
        waiters = self._waiters[:n]
        for waiter in waiters:
            waiter.release()
    ...

Needless to say, a lot of processing is going on underneath the covers when you use something like a Condition object in Python. Every wait() operation involves creating an entirely new lock object. Signaling is carried out with acquire() and release() operations on that lock. Moreover, there are additional locking operations carried out on the lock object associated with the condition variable itself. Plus, consider that all of this high-level locking actually involves more locks and condition variables in C.
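To see how little is needed to get this behavior, here is a stripped-down condition variable built only from plain locks, along the lines just described. This is a sketch, not the threading module's code: the class name MiniCondition is made up, it is written for Python 3, and it leaves out timeouts, notify_all(), and error checking.

```python
import threading
from collections import deque

class MiniCondition:
    """Toy condition variable built from plain locks (no timeouts, no error checks)."""

    def __init__(self):
        self._lock = threading.Lock()     # the lock associated with the condition
        self._waiters = deque()           # one private lock per sleeping thread

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, *exc_info):
        self._lock.release()

    def wait(self):
        # Must be called with self._lock held.
        waiter = threading.Lock()
        waiter.acquire()                  # first acquire always succeeds
        self._waiters.append(waiter)
        self._lock.release()              # let other threads change the shared state
        waiter.acquire()                  # second acquire blocks until notify()
        self._lock.acquire()              # re-take the condition's lock before returning

    def notify(self, n=1):
        # Must be called with self._lock held.
        for _ in range(min(n, len(self._waiters))):
            self._waiters.popleft().release()

items = deque()
items_cv = MiniCondition()
received = []

def consumer():
    with items_cv:
        while not items:                  # re-check the state after every wakeup
            items_cv.wait()
        received.append(items.popleft())

t = threading.Thread(target=consumer)
t.start()
with items_cv:
    items.append("hello")
    items_cv.notify()
t.join()
print(received)
```

Every call to wait() allocates a fresh lock and performs four lock operations of its own, which gives a feel for where the overhead comes from.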

Who Cares?

At this point, you might be asking yourself "who cares? This is all a bunch of low-level esoteric details." However, I think that anyone who is serious about using threads in Python should take an interest in how the synchronization primitives are actually put together.

For one, a common rule of thumb with thread programming is to try and avoid the use of locks and synchronization primitives as much as possible. This is certainly true in C, but even more so in Python. The fact that almost all of the synchronization primitives are implemented in Python means that they are substantially slower than any comparable operations in a C/C++ threading library. So, if you care about performance, using a lot of locks is something you'll definitely want to avoid.

The other reason to care about this concerns the Queue module. It is commonly advised that the Queue module be used as a means for exchanging data between threads because it already deals with all of the underlying synchronization. This is all well and good except for the fact that Queue objects add even more layers to all of the synchronization primitives that we've talked about. In particular, the locking performed by a queue is done using a combination of locks and condition variables from the threading module.

This means that if you're using queues, you're not really avoiding all of the overhead of locking. Instead, you're just moving it to a different location where it's out of view.

One might wonder just how much overhead gets added by all of this. For instance, a Queue object is really just a wrapper around a collections.deque with the added locking. You can try a few performance experiments. For instance, inserting items:

>>> timeit("q.append(1)","from collections import deque; q = deque()")
0.17505884170532227
>>> timeit("q.put(1)","from Queue import Queue; q = Queue()")
4.4164938926696777
>>>

Here, you find that inserting into a Queue is about 25 times slower than inserting into a deque. You get similar figures for removing items. Keep in mind that these simple benchmarks don't even cover the case of working with multiple threads where even more overhead would be added.
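The same comparison can be run as a script. This sketch assumes Python 3, where the module is named queue rather than Queue, and uses a reduced iteration count (the NUMBER constant is an arbitrary choice). The ratio you get will vary with the interpreter version and machine, so no particular figure should be expected.

```python
from timeit import timeit

NUMBER = 100000  # far fewer iterations than timeit's default, to keep the run short

deque_time = timeit("q.append(1)",
                    "from collections import deque; q = deque()",
                    number=NUMBER)
queue_time = timeit("q.put(1)",
                    "from queue import Queue; q = Queue()",   # Python 3 module name
                    number=NUMBER)

print("deque.append: %.4f seconds" % deque_time)
print("Queue.put:    %.4f seconds" % queue_time)
print("ratio:        %.1fx" % (queue_time / deque_time))
```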

Some Final Thoughts

There surely seems to be an opportunity for some experimentation with better implementations of Python's thread synchronization primitives. For example, condition variables are a core component of Python's Semaphore, Event, and Queue objects, yet Python makes no effort to use any kind of native implementation (e.g., pthreads condition variables). Moreover, why is Python using custom implementations of synchronization objects already provided by the operating system and thread libraries (e.g., semaphores)? Given that much of Python's thread implementation was worked out more than ten years ago, it would be interesting to perform some experiments and revisit the threading implementation on modern systems--especially in light of the increased interest in concurrency, multiple CPU cores, and other matters.

Anyways, that's it for now. I'd love to hear your comments. Also, if you are aware of prior work related to optimizing the threading library, benchmarks, or anything else that might be related, I'd be interested in links so that I can post them here.

Inside the "Inside the Python GIL" Presentation

noreply@blogger.com (Dave) — Thu, 27 Aug 2009 12:39:00 +0000

On June 11, 2009 I gave a presentation about the inner workings of the Python GIL at the Chicago Python user group meeting. To be honest, I always expected the event to be a pretty low-key affair involving some local Python hackers and some beers. However, the presentation went a little viral and I've received a number of requests to get the code modifications I made to investigate thread behavior--especially the traces that show thread switching and other details.

In this post, I'll briefly outline the code changes I made to generate the traces. Before going any further, you should probably first view the original presentation. Also, as a disclaimer, none of these changes are easily packaged into a neat "patch" that one can simply download and install into any Python distribution. So, to start, you should first go download a Python source distribution for the version of Python you want to experiment with. For my talk, I was using Python 2.6.

First, let's talk about a major issue--any investigation of threads at a low-level (especially thread scheduling) tends to be a rather tricky affair involving some kind of computer science variant of the uncertainty principle. That is, once you start trying to observe thread behavior, you run the risk of changing the very thing you're trying to observe. The problem gets worse if you add a lot of extra complexity--especially if there are extra system calls or I/O. So, a major underlying concern was to try and devise a technique for recording thread behavior in a minimally invasive manner (as an aside, I considered the idea of trying to use dtrace for this, but decided that it would take longer for me to learn dtrace than it would to simply make a few minor modifications to the interpreter).

Step 1: Defining time

Everything that happens inside the Python interpreter is focused around the concept of "ticks." Each tick loosely corresponds to a single instruction in the virtual machine. Locate the file Python/ceval.c in the Python source code. In this file, you will find a global variable _Py_Ticker holding the tick counter. Here's what the code looks like:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
...

Add a new variable declaration _Py_Ticker_Count to this code so that it looks like this:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
volatile int _Py_Ticker_Count = 0;
...

Later in the same file, you will find code that decrements the value of _Py_Ticker. Modify this code so that each time _Py_Ticker drops below 0 and is reset, the value of _Py_Ticker_Count is incremented. Here's what it looks like:

/* ceval.c */
...
  if (--_Py_Ticker < 0) {
   if (*next_instr == SETUP_FINALLY) {
    /* Make the last opcode before
       a try: finally: block uninterruptable. */
    goto fast_next_opcode;
   }
   _Py_Ticker = _Py_CheckInterval;
   _Py_Ticker_Count++; 
   tstate->tick_counter++;
...

The _Py_Ticker_Count and _Py_Ticker variables together define a kind of internal clock. _Py_Ticker is a countdown to the next time the interpreter might thread-switch. The _Py_Ticker_Count keeps track of how many times the interpreter has actually signaled the operating system to schedule waiting threads (if any). In the traces that follow, these two values are used together to record the sequence of events that occur in terms of interpreter ticks.
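The countdown logic can be modeled in a few lines of Python. This is only an illustrative simulation of the two counters, not interpreter code: the function name simulate_ticks is made up, and the model ignores the SETUP_FINALLY special case as well as anything else that resets the ticker.

```python
def simulate_ticks(n_instructions, check_interval=100):
    """Model _Py_Ticker and _Py_Ticker_Count over a run of VM instructions."""
    ticker = 0          # _Py_Ticker starts at 0 so the first instruction "ticks"
    ticker_count = 0    # _Py_Ticker_Count
    for _ in range(n_instructions):
        ticker -= 1
        if ticker < 0:                   # mirrors: if (--_Py_Ticker < 0)
            ticker = check_interval      # restart the countdown
            ticker_count += 1            # one more periodic check has happened
    return ticker, ticker_count

after_first = simulate_ticks(1)
after_many = simulate_ticks(1000)
print(after_first, after_many)
```

In this model the ticker reads 100 immediately after each check, which is why a value of 100 in the second column of a trace marks the start of a new check interval.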

Step 2: Recording Trace Data

Python defines a general purpose lock object that is used for both the GIL and locking primitives in the threading modules. On Unix systems using pthreads, the implementation of the lock can be found in the file Python/thread_pthread.h. In that file, there are two functions that we are going to modify: PyThread_acquire_lock() and PyThread_release_lock().

Here's the general idea: the lock/unlock functions are instrumented to record a large in-memory trace of lock-related events. These include lock entry (when a thread first tries to acquire a lock), busy (when the lock is busy), retry (a repeated failed attempt to acquire a lock), acquire (lock successfully acquired), and release (lock released). In addition to events, the trace records current values of the _Py_Ticker and _Py_Ticker_Count variables as well as the pointer to the currently executing thread.

All trace data is stored entirely in memory as programs execute. The size of the history can be controlled with a macro in the code. To dump the trace, a function print_history() is registered to execute on interpreter exit using the atexit() call. It is important to emphasize that no I/O occurs as programs are executing--traces are only dumped on program exit.

Here is a copy of the modified code. Be aware that thread_pthread.h is a bit of a mess and that there are a few different implementations of locks. This code is meant to go in the non-semaphore implementation of locks. Further discussion appears afterwards.

/* thread_pthread.h */
...
/* Thread lock monitoring modifications (beazley) */

#include <sys/resource.h>
#include <sched.h>

#define MAXHISTORY 5000000
static int           thread_history[MAXHISTORY];
static unsigned char tick_history[MAXHISTORY];
static int           tick_count_history[MAXHISTORY];
static unsigned char tick_acquire[MAXHISTORY];
static double        time_history[MAXHISTORY];
static unsigned int  history_count = 0;

#define EVENT_ENTRY   0
#define EVENT_BUSY    1
#define EVENT_RETRY   2
#define EVENT_ACQUIRE 3
#define EVENT_RELEASE 4

static char *_codes[] = {"ENTRY","BUSY","RETRY","ACQUIRE","RELEASE" };

static void print_history(void) {
 int i;
 FILE *f;

 f = fopen("tickhistory.txt","w");
 for (i = 0; i < history_count; i++) {
   fprintf(f,"%x %d %d %s %0.6f\n",thread_history[i],tick_history[i],tick_count_history[i],_codes[tick_acquire[i]],time_history[i]);
 }
 fclose(f);
}

/* External variables recorded in the history */
extern volatile int _Py_Ticker;
extern volatile int _Py_Ticker_Count;


int
PyThread_acquire_lock(PyThread_type_lock lock, int waitflag)
{
 int success;
 pthread_lock *thelock = (pthread_lock *)lock;
 int status, error = 0;
 int start_thread = 0;

 if (history_count == 0) {
   atexit(print_history);
 }

 dprintf(("PyThread_acquire_lock(%p, %d) called\n", lock, waitflag));

 status = pthread_mutex_lock( &thelock->mut );

 /* Record information in the log */
 start_thread = (int) pthread_self(); 
 if (history_count < MAXHISTORY) {
   thread_history[history_count] = start_thread;
   tick_history[history_count] = _Py_Ticker;
   tick_count_history[history_count] = _Py_Ticker_Count;
   time_history[history_count] = 0.0;
   tick_acquire[history_count++] = EVENT_ENTRY;
 }

 CHECK_STATUS("pthread_mutex_lock[1]");
 success = thelock->locked == 0;

 if ( !success && waitflag ) {

   int ntries = 0;
  /* continue trying until we get the lock */

  /* mut must be locked by me -- part of the condition
   * protocol */

  while ( thelock->locked ) {
    if (ntries == 0) {
      if (history_count < MAXHISTORY) {
        thread_history[history_count] = start_thread;
        tick_history[history_count] = _Py_Ticker;
        tick_count_history[history_count] = _Py_Ticker_Count;
        time_history[history_count] = 0.0;
        tick_acquire[history_count++] = EVENT_BUSY;
      }
    }

   status = pthread_cond_wait(&thelock->lock_released,
         &thelock->mut);
   CHECK_STATUS("pthread_cond_wait");
   if (thelock->locked) {
     if (history_count < MAXHISTORY) {
       thread_history[history_count] = start_thread;
       tick_history[history_count] = _Py_Ticker;
       tick_count_history[history_count] = _Py_Ticker_Count;
       time_history[history_count] = 0.0;
       tick_acquire[history_count++] = EVENT_RETRY;
       ntries += 1;
     }
   } else {
     if (history_count < MAXHISTORY) {
       thread_history[history_count] = start_thread;
       tick_history[history_count] = _Py_Ticker;
       tick_count_history[history_count] = _Py_Ticker_Count;
       {
         struct timeval t;
#ifdef GETTIMEOFDAY_NO_TZ
         if (gettimeofday(&t) == 0)
    time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#else /* !GETTIMEOFDAY_NO_TZ */
         if (gettimeofday(&t, (struct timezone *)NULL) == 0)
    time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#endif /* !GETTIMEOFDAY_NO_TZ */
       }
       tick_acquire[history_count++] = EVENT_ACQUIRE;
     }
   }

  }
  success = 1;
 } else if (success) {  /* log ACQUIRE only if the lock was actually free */
   if (history_count < MAXHISTORY) {
     thread_history[history_count] = start_thread;
     tick_history[history_count] = _Py_Ticker;
     tick_count_history[history_count] = _Py_Ticker_Count;
     time_history[history_count] = 0.0;
     tick_acquire[history_count++] = EVENT_ACQUIRE;
   }
 }
 if (success) thelock->locked = 1;
 status = pthread_mutex_unlock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_unlock[1]");

 if (error) success = 0;
 dprintf(("PyThread_acquire_lock(%p, %d) -> %d\n", lock, waitflag, success));
 return success;
}

void
PyThread_release_lock(PyThread_type_lock lock)
{
 pthread_lock *thelock = (pthread_lock *)lock;
 int status, error = 0;

 dprintf(("PyThread_release_lock(%p) called\n", lock));

 status = pthread_mutex_lock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_lock[3]");
 
 if (history_count < MAXHISTORY) {
   thread_history[history_count] = (int) pthread_self();
   tick_history[history_count] = _Py_Ticker;
   tick_count_history[history_count] = _Py_Ticker_Count;
   tick_acquire[history_count++] = EVENT_RELEASE;
 }

 thelock->locked = 0;

 status = pthread_mutex_unlock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_unlock[3]");

 /* wake up someone (anyone, if any) waiting on the lock */
 status = pthread_cond_signal( &thelock->lock_released );
 CHECK_STATUS("pthread_cond_signal");
}

Step 3: Rebuilding and Running Python

Once you have made the above changes, rebuild the Python interpreter and run it on some sample code. The code should run the same as before, but on program exit, you will get a huge data file tickhistory.txt dumped into the current working directory. The contents of this file are going to look something like this:

a0811720 8 1299 RELEASE 0.000000
a0811720 15 1302 ENTRY 0.000000
a0811720 15 1302 ACQUIRE 0.000000
a0811720 10 1302 ENTRY 0.000000
a0811720 10 1302 ACQUIRE 0.000000
a0811720 10 1302 RELEASE 0.000000
a0811720 7 1302 ENTRY 0.000000
a0811720 7 1302 ACQUIRE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 BUSY 0.000000
a0811720 1 1302 RELEASE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 100 1303 RELEASE 0.000000
a0811720 100 1303 ENTRY 0.000000
a0811720 100 1303 ACQUIRE 0.000000
a0811720 92 1303 RELEASE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
...

Be forewarned--the size of this file can be substantial. Running a threaded program for even 10-20 seconds might generate a trace file that contains 3-4 million events. To do any kind of analysis on it, you'll probably want to do what everyone normally does and write a Python script.
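As a starting point for such a script, here is a minimal sketch that parses trace lines in the five-field format written by print_history() and counts events. The names TraceEvent, parse_line, and count_events are invented for this example, and the code is written for Python 3. The first field is kept as a string, exactly as it appears in the file.

```python
from collections import Counter, namedtuple

TraceEvent = namedtuple("TraceEvent", "thread ticker tick_count event timestamp")

def parse_line(line):
    """Parse one line written by print_history(): id, ticker, count, event, time."""
    thread, ticker, tick_count, event, timestamp = line.split()
    return TraceEvent(thread, int(ticker), int(tick_count), event, float(timestamp))

def count_events(lines):
    """Count how many times each (first column, event) pair occurs."""
    counts = Counter()
    for line in lines:
        if len(line.split()) != 5:       # skip blank or truncated lines
            continue
        ev = parse_line(line)
        counts[ev.thread, ev.event] += 1
    return counts

# A few lines copied from the sample trace above
sample = """\
a0811720 7 1302 ENTRY 0.000000
a0811720 7 1302 ACQUIRE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 BUSY 0.000000
a0811720 1 1302 RELEASE 0.000000
""".splitlines()

counts = count_events(sample)
for (thread, event), n in sorted(counts.items()):
    print(thread, event, n)
```

For a real trace, pass an open file instead of the sample list, for example count_events(open("tickhistory.txt")). Because the function reads one line at a time, it does not need to hold a multi-million line file in memory.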

Discussion

Interpreting the contents of the trace file is left as an exercise for the reader. However, here are a few tips. First, the normal sequence of lock acquisition and release on the GIL with a CPU-bound thread looks something like this (notice that the _Py_Ticker value in the 2nd column is always 100 and that the lock goes through a repeated ENTRY->ACQUIRE->RELEASE cycle):

a000d000 100 3570 ENTRY 0.000000
a000d000 100 3570 ACQUIRE 0.000000
a000d000 100 3571 RELEASE 0.000000
a000d000 100 3571 ENTRY 0.000000
a000d000 100 3571 ACQUIRE 0.000000
a000d000 100 3572 RELEASE 0.000000
a000d000 100 3572 ENTRY 0.000000
a000d000 100 3572 ACQUIRE 0.000000
a000d000 100 3573 RELEASE 0.000000
...

If you're looking at thread contention, you're going to see a trace that has an event series of ENTRY->BUSY->RETRY->...->RETRY->ACQUIRE->RELEASE like this:

a000d000 48 4794 ENTRY 0.000000
a000d000 48 4794 BUSY 0.000000
7091800 32 4794 RELEASE 0.000000
7069a00 32 4794 ACQUIRE 1251397338.473370
7091800 32 4794 ENTRY 0.000000
7091800 32 4794 BUSY 0.000000
a000d000 32 4794 RETRY 0.000000
7069a00 100 4795 RELEASE 0.000000
7069a00 100 4795 ENTRY 0.000000
7069a00 100 4795 ACQUIRE 0.000000
a000d000 66 4795 RETRY 0.000000
7069a00 100 4796 RELEASE 0.000000
7069a00 100 4796 ENTRY 0.000000
7069a00 100 4796 ACQUIRE 0.000000
a000d000 95 4796 RETRY 0.000000
7069a00 100 4797 RELEASE 0.000000
7069a00 100 4797 ENTRY 0.000000
7069a00 100 4797 ACQUIRE 0.000000
...
a000d000 100 5083 ACQUIRE 1251397338.478188
...

Here are some other notes concerning its analysis:

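The first column is the identifier of the thread performing the lock operation (the value returned by pthread_self(), printed in hex). The lock itself is not recorded, so if you run the modified interpreter on a threaded program that is using many different locks, you will be tracing not only the GIL, but every lock in the program, all mixed together. You might be able to use this to investigate lock contention.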
The GIL is not specifically identified in the trace file. However, it will be one of the first locks used.
The last column of the trace file is a system timer that is only recorded when locks are acquired after repeated failed acquisition attempts. At some point, I was using this to investigate some issues related to response times, but to be honest, I didn't spend much time exploring that angle. It might be useful if you want to get an idea for how long each thread runs before giving up control. Of course, you may just want to comment that code out.
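The contention pattern shown earlier (ENTRY, BUSY, a run of RETRY events, then ACQUIRE) can be measured with a few lines of Python. The sketch below counts how many RETRY events each wait accumulated before the lock was finally acquired. The function name retries_before_acquire is made up, the code is written for Python 3, and it assumes that entries sharing the same first column belong to the same thread and that a thread waits on only one lock at a time. The sample data is an abbreviated copy of the contention trace above.

```python
from collections import defaultdict

def retries_before_acquire(lines):
    """For each thread, list how many RETRY events followed each BUSY before ACQUIRE."""
    pending = {}                      # thread id -> RETRY count since its BUSY event
    finished = defaultdict(list)      # thread id -> retry counts of completed waits
    for line in lines:
        fields = line.split()
        if len(fields) != 5:          # skip "..." and blank lines
            continue
        thread, event = fields[0], fields[3]
        if event == "BUSY":
            pending[thread] = 0
        elif event == "RETRY" and thread in pending:
            pending[thread] += 1
        elif event == "ACQUIRE" and thread in pending:
            finished[thread].append(pending.pop(thread))
    return dict(finished)

sample = """\
a000d000 48 4794 ENTRY 0.000000
a000d000 48 4794 BUSY 0.000000
7091800 32 4794 RELEASE 0.000000
7069a00 32 4794 ACQUIRE 1251397338.473370
a000d000 32 4794 RETRY 0.000000
7069a00 100 4795 RELEASE 0.000000
a000d000 66 4795 RETRY 0.000000
7069a00 100 4796 RELEASE 0.000000
a000d000 95 4796 RETRY 0.000000
a000d000 100 5083 ACQUIRE 1251397338.478188
""".splitlines()

waits = retries_before_acquire(sample)
print(waits)
```

An ACQUIRE that was not preceded by a BUSY in the data (such as the one for 7069a00 here) is ignored, since the start of that wait falls outside the excerpt.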

Other Comments

Since giving the presentation, I've received a few comments through email offering suggestions for a GIL fix. I stand by my earlier assertion that there is no easy fix for the problem described in the presentation. Here are some specific suggestions followed by my response:

"Perhaps the GIL could be fixed by adding some kind of scheduling queue." If you were to add a scheduling queue to the GIL, you would effectively turn it into a kind of poorly implemented mutex lock. Mutex locks are already implemented (by pthreads and the OS) using queues in order to avoid thread starvation. More details can be found in an operating system textbook. You might also look at the Bakery Algorithm.
"Perhaps the GIL could be fixed by simply using a mutex lock." As just mentioned, mutex locks are generally implemented using a queuing mechanism. If you do this, runnable threads will always context switch every 100 interpreter ticks (you'll see the threads cycling in a round-robin manner). This will definitely eliminate the multicore contention problem, but now your programs will perform a tremendous amount of context switching. Also, you might lose the high scheduling priority of I/O bound threads. Needless to say, there are some downsides that need to be considered (just for the record, I think the use of a condition variable in the current implementation is probably the best overall solution for running on a single CPU).
"Could you fix the problem by telling the operating system to schedule all threads on the same core?" Short answer: No. C extensions to Python (and even significant parts of Python itself) often release the GIL by design so that they can run concurrently while carrying out work that doesn't directly involve the Python interpreter. If you force everything to one core, you will most likely make these programs run worse, not better.

Final Words

As mentioned in the presentation, deep exploration of the Python GIL is not a project I'm actively working on. In fact, all of this was really just an exploration to find out how the GIL works and to see if I could track down pathological performance for a certain test case on my Mac. Feel free to take this code and hack it in any way that you wish. If it proves to be useful, just give me an acknowledgment when you give your PyCon presentation. Have fun!

Python Binary I/O Handling

noreply@blogger.com (Dave) — Sun, 09 Aug 2009 22:32:00 +0000

As a followup to my last post about the Essential Reference, I thought I'd talk about the one topic that I wish I had addressed in more detail in my book--and that's the subject of binary data and I/O handling. Let me elaborate.

One of the things that interests me a lot right now is the subject of concurrent programming. In the early 1990's, I spent a lot of time writing big physics simulation codes for Connection Machines and Crays. All of those programs had massive parallelism (e.g., 1000s of processors) and were based largely on message-passing. In fact, my first use of Python was to control a large massively parallel C program that used MPI. Now, we're starting to see message passing concepts incorporated into the Python standard library. For example, I think the inclusion of the multiprocessing library is probably one of the most significant additions to the Python core that has occurred in the past 10 years.

A major aspect of message passing concerns the problem of quickly getting data from point A to point B. Obviously, you want to do it as fast as possible. A high speed connection helps. However, it also helps to eliminate as much processing overhead as possible. Such overhead can come from many places--decoding data, copying memory buffers, and so forth.

Python makes it pretty easy to pass data around between processes. For example, you can use the pickle module, json, XML-RPC, or some other similar mechanism. However, all of these approaches involve a significant amount of overhead to encode and decode data. You probably wouldn't want to use them for any kind of bulk data transfer (e.g., if you wanted to send a large array of floats between processes). Nor would you really want to use this for some kind of high-performance networking on a big cluster.

Lurking within the Python standard library, however, is another way to deal with data in messaging and interprocess communication. It's all spread out in a way that's not entirely obvious unless you're looking for it (and even then it's still pretty subtle). Let's start with the ctypes library. I always assumed that ctypes was all about accessing C libraries from Python (an alternative approach to Swig). However, that's only part of the story. For instance, using ctypes, you can define binary data structures:

from ctypes import *
class Point(Structure):
     _fields_ = [ ('x',c_double), ('y',c_double), ('z',c_double) ]

This defines an object representing a C data structure. You can even create and manipulate such objects just like an ordinary Python class:

>>> p = Point(2,3.5,6)
>>> p.x
2.0
>>> p.y
3.5
>>> p.z = 7
>>>

However, keep in mind that under the covers, this is manipulating a C structure represented in a contiguous block of memory.
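One way to convince yourself of this is to look at the raw bytes. The following sketch (Python 3, with explicit ctypes names instead of a star import) copies the structure's memory into a bytes object and decodes it with the struct module. It assumes the usual 8-byte C double, which holds on mainstream platforms.

```python
import ctypes
import struct

class Point(ctypes.Structure):
    _fields_ = [('x', ctypes.c_double), ('y', ctypes.c_double), ('z', ctypes.c_double)]

p = Point(2, 3.5, 6)

size = ctypes.sizeof(Point)            # three 8-byte doubles
raw = bytes(p)                         # copy of the structure's underlying memory
unpacked = struct.unpack("ddd", raw)   # decode the same bytes with the struct module

print(size, unpacked)
```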

Now this is where things start to get interesting. I wonder how many Python programmers know that they can directly write a ctypes data structure onto a file opened in binary mode. For example, you can take the point above and do this:

>>> f = open("foo","wb")
>>> f.write(p)       
>>> f.close()

Not only that, you can read the file directly back into a ctypes structure if you use the poorly documented readinto() method of files.

>>> g = open("foo","rb")
>>> q = Point()
>>> g.readinto(q)
24
>>> q.x
2.0
>>>

The mechanism that makes all of this work is Python's so-called "buffer protocol." Since ctypes structures are contiguous in memory, I/O operations can be performed directly with that memory without making copies or first converting such structures into strings as you might do with something like the struct module. The buffer protocol simply exposes the underlying memory buffers for use in I/O.
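Here is the same file round trip as a self-contained script, as a sketch for Python 3. It uses a temporary file rather than a file named "foo" so that nothing is left behind. The use of tempfile is just a convenience for this example.

```python
import ctypes
import os
import tempfile

class Point(ctypes.Structure):
    _fields_ = [('x', ctypes.c_double), ('y', ctypes.c_double), ('z', ctypes.c_double)]

p = Point(2, 3.5, 6)
q = Point()

fd, path = tempfile.mkstemp()          # temporary file; the name is chosen by the OS
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(p)                     # write the structure's memory directly
    with open(path, "rb") as g:
        nbytes = g.readinto(q)         # fill q's memory directly from the file
finally:
    os.remove(path)

print(nbytes, q.x, q.y, q.z)
```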

Direct binary I/O like this is not limited to files. If s is a socket, you can perform similar operations like this:

p = Point(2,3,4)           #  Create a point
s.send(p)                  #  Send across a socket

q = Point()
s.recv_into(q)               # Receive directly into q
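
Here is a minimal Python 3 sketch of the socket case that runs in a single process, using socket.socketpair() to stand in for a real network connection (an assumption made only to keep the example self-contained). Since a stream socket may deliver fewer bytes than requested, the sketch loops with recv_into() until the whole structure has arrived.

```python
import ctypes
import socket

class Point(ctypes.Structure):
    _fields_ = [('x', ctypes.c_double),
                ('y', ctypes.c_double),
                ('z', ctypes.c_double)]

a, b = socket.socketpair()       # two connected sockets in one process

p = Point(2, 3, 4)
a.sendall(p)                     # send the structure's memory as-is

q = Point()
view = memoryview(q).cast('B')   # flat byte view of q's memory
received = 0
while received < len(view):
    # Fill the remaining part of q directly from the socket
    n = b.recv_into(view[received:])
    if n == 0:
        raise EOFError("connection closed before a full Point arrived")
    received += n

a.close()
b.close()
result = (q.x, q.y, q.z)
```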

If that wasn't enough to make your brain explode, similar functionality is provided by the multiprocessing library as well. For example, Connection objects (as created by the multiprocessing.Pipe() function) have send_bytes() and recv_bytes_into() methods that also work directly with ctypes objects. Here's an experiment to try. Start two different Python interpreters and define the Point structure above. Now, try sending a point through a multiprocessing connection object:

>>> p = Point(2,3,4)
>>> from multiprocessing.connection import Listener
>>> serv = Listener(("",25000),authkey="12345")
>>> c = serv.accept()
>>> c.send_bytes(p)
>>>

In the other Python process, do this:

>>> q = Point()
>>> from multiprocessing.connection import Client
>>> c = Client(("",25000),authkey="12345")
>>> c.recv_bytes_into(q)
24
>>> q.x
2.0
>>> q.y
3.0
>>>

As you can see, the point defined in one process has been directly transferred to the other.
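
The same experiment can be sketched in a single Python 3 process using multiprocessing.Pipe(). This is only an illustration under a couple of assumptions: both ends live in one interpreter, and the structure is wrapped in a flat byte view with memoryview(...).cast('B') before being handed to the connection, to avoid depending on how a given release treats the structure's own buffer.

```python
import ctypes
from multiprocessing import Pipe

class Point(ctypes.Structure):
    _fields_ = [('x', ctypes.c_double),
                ('y', ctypes.c_double),
                ('z', ctypes.c_double)]

sender, receiver = Pipe()

p = Point(2, 3, 4)
# Send the structure's memory as one message
sender.send_bytes(memoryview(p).cast('B'))

q = Point()
# Receive the message directly into q's memory
nbytes = receiver.recv_bytes_into(memoryview(q).cast('B'))

sender.close()
receiver.close()
result = (q.x, q.y, q.z)
```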

If you put all of the pieces together, you find that there is a whole binary handling layer lurking under the covers of Python. If you combine it with something like ctypes, you can directly pass binary data structures such as C structures and arrays between different interpreters. Moreover, if you combine this with C extensions, it seems to be possible to pass data around without a lot of extra overhead. Finally, if that wasn't enough, it turns out that some popular extensions such as numpy also play in this arena. For instance, in certain cases you can perform similar direct I/O operations with numpy arrays (e.g., directly passing arrays through multiprocessing connections).
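
As a rough illustration of passing whole arrays, the sketch below uses the standard library's array module as a stand-in for numpy (an assumption made so the example has no third-party dependency). It follows the pattern shown in the multiprocessing documentation for send_bytes() and recv_bytes_into().

```python
import array
from multiprocessing import Pipe

sender, receiver = Pipe()

# Source array of doubles and a larger, zero-filled destination
src = array.array('d', [1.0, 2.5, 4.0, 8.0])
dst = array.array('d', [0.0] * 6)

sender.send_bytes(src)                 # ship the raw array memory
count = receiver.recv_bytes_into(dst)  # fill the front of dst in place

sender.close()
receiver.close()
filled = dst.tolist()
```

Only the first four slots of dst are overwritten. The remaining two keep their original zeros.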

I think that this functionality is pretty interesting--and highly relevant to anyone who is thinking about parallel processing and messaging. However, all of this is also somewhat unsettling. For one, much of this functionality is very poorly documented in the Python documentation (and in my book for that matter). If you look at the documentation for methods such as the readinto() method of files, it simply says "undocumented, don't use it." The buffer interface, which makes much of this work, has always been rather obscure and poorly understood--although it got a redesign in Python 3.0 (see Travis Oliphant's talk from PyCon). And if it wasn't complicated enough already, much of this functionality gets tied into the bytes/Unicode handling part of Python--a hairy subject on its own.

To wrap up, I think much of what I've described here represents a part of Python that probably deserves more investigation (and at the very least, more documentation). Unfortunately, I only started playing around with this recently--too late for inclusion in the Essential Reference (which was already typeset and out the door). However, I'm thinking it might be a good topic for a PyCon tutorial. Stay tuned.

Note: If anyone has links to articles or presentations about this, let me know and I'll add them here.

Essential Misconceptions

noreply@blogger.com (Dave) — Sun, 09 Aug 2009 15:26:00 +0000

A few days ago, Mike Riley posted a great review of the new "Python Essential Reference, 4th Edition" on Dr. Dobb's CodeTalk. In that review, he writes:

"While the author could have taken the easy path of regurgitating the online documentation, he has instead reworked the explanation for each class and function call in the Python core library with commendable clarity, frequently accompanying these detailed examinations with extremely useful and meaningful code examples. The book is also very well designed and organized, making it a snap to find information within a matter of seconds."

This is a reviewer who really gets what this book is about. However, for every great review like this, I also encounter comments that simply dismiss the book out-of-hand saying it "offers nothing" over Python's online documentation. With all due respect to Python's fine documentation, I beg to differ.

First and foremost, I've always viewed the Python Essential Reference as a serious programming reference for myself (yes, I always have a copy next to my desk and I use it regularly). Although I will admit that Python certainly has a lot of online documentation, it's also missing a lot of essential details. For example, I can't count the number of times I've looked at the online documentation for something only to have to go out and do some kind of extended Google search to fill in a missing detail (or worse, having to load the source code for some module and look through it).

Let's look at an example. Suppose you're writing some networking code with the socket module and you want to use the recv(bufsize [, flags]) method of a socket. If you head off to the online documentation you will certainly find some information.

"Receive data from the socket. The return value is a string representing the data received. The maximum amount of data to be received at once is specified by bufsize. See the Unix manual page recv(2) for the meaning of the optional argument flags; it defaults to zero."

Yes, this is all very useful. Especially that part about having to refer to a Unix man page. I'm sure the Windows programmers find that especially useful. If you turn to the Essential Reference p. 483, you'll not only find a description, but you will also get a complete table showing you exactly what can be given for flags along with a brief description of each option. This approach is found throughout the book--only rarely are readers simply referred to other documentation. As another example, I would challenge anyone to effectively use something like the setsockopt() or getsockopt() methods of a socket using nothing but Python's online docs.
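
To give a flavor of what such details look like in practice, here is a small Python 3 sketch. It is not taken from the book. It uses one recv() flag, MSG_PEEK, to look at pending data without consuming it, and one socket option, SO_REUSEADDR, set and read back with setsockopt() and getsockopt(). The choice of flag and option is just for illustration.

```python
import socket

# --- recv() with a flag ---------------------------------------------
a, b = socket.socketpair()
a.sendall(b"hello")

peeked = b.recv(5, socket.MSG_PEEK)   # look at the data, leave it queued
consumed = b.recv(5)                  # the same bytes are still there

a.close()
b.close()

# --- setsockopt() / getsockopt() ------------------------------------
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
reuse = s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)  # nonzero when enabled
s.close()
```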

The other thing that I've tried to do in the book is answer all sorts of questions about tricky interactions between different parts of Python. Take, for example, this question: Can a separate execution thread safely close a generator/coroutine function by invoking the generator's close() method? Sure, that's not the kind of question that comes up every day, but if you know a thing or two about generators and coroutines, you'll know that they are often used in the context of concurrent programming, just like threads. Not only that, threads and generators might be used together (for example, using threads to carry out blocking operations). Thus, it is reasonable to assume that programmers working with both threads and generators in the same program might start to wonder about their possible interaction. I know I did.

If you try to find an answer to this question using the online documentation, you will be searching for some time and probably come up with nothing. Although there is plenty of discussion about generators, the yield statement, and other matters, you really don't find much about generators and threads mixed together. Even PEP 342, the official specification that introduced the generator close() method says nothing on this matter.

Now, let's look at the Essential Reference. First, if you turn to the index and look up "Threads", you will find about a half-page of subentries. In fact, there is even an entry labeled "Threads: close() method of generators, p. 104." If you turn to p. 104, you will find a sentence "if a program is currently iterating on a generator, you should not call close() asynchronously on that generator from a separate thread of execution or from a signal handler."
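
A rough Python 3 sketch of the situation that sentence warns about follows. It is an illustration, not an example from the book. One thread is in the middle of running the generator's body when another thread calls close(). In CPython, as far as I know, the call fails with a ValueError rather than closing the generator. The names ticker, started, and release are made up for the example.

```python
import threading

started = threading.Event()
release = threading.Event()

def ticker():
    started.set()      # tell the main thread the generator body is running
    release.wait()     # stay mid-execution until the main thread is done
    yield 1

gen = ticker()

# The worker thread advances the generator and blocks inside its body
worker = threading.Thread(target=next, args=(gen,))
worker.start()
started.wait()

# Meanwhile, try to close the generator from the main thread
try:
    gen.close()
    outcome = "closed"
except ValueError as exc:
    outcome = str(exc)
finally:
    release.set()      # let the generator reach its yield
    worker.join()
```

The events are there only to make the timing deterministic. In real code the same collision would depend on scheduling, which is what makes it hard to spot.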

This is certainly not the only example; there are a wide variety of similar questions that I try to address. For example, can you use a decorator with a recursive function? (p. 113). Or what is the interaction between the __slots__ feature of a class and inheritance? (p. 133). Or, does the name mangling of private attributes (e.g., __foo) in a class introduce a runtime performance penalty? (p. 128). All of these questions fall into a general category of issues related to the "side-effects" of using various Python features. Although you can find some of this in the online docs, it is often scattered and incomplete. I've tried to fix that.

Finally, I've really tried to make the Essential Reference a programming "cookbook" of sorts. Although its primary goal is to be a reference, I have also incorporated a wide variety of practical examples from the Python training courses that I run. For instance, if you know about the Generators or Coroutines tutorials I presented at PyCON, you'll find similar information. I also include examples that explore tricky interactions and customization features of certain library modules. For example, how do I customize an XML-RPC server to only accept connections from known IP addresses? (p. 494). Or how do I use the ssl module to implement a secure server? (p. 489). Many of these examples are related to things that I've had to figure out once before, but can never quite remember on a day-to-day basis. Putting them in the book helps me remember how to do a variety of tricky things.

So, that's about it. I hope people find the book to be useful. If so, tell your friends. If not, feel free to use it for propping up some uneven furniture. Just don't say that it's the same as the online docs.

First post

noreply@blogger.com (Dave) — Sun, 09 Aug 2009 15:01:00 +0000

Well, this is my blog. Welcome! I have to admit that I've never been much of a blogger in the past--preferring to focus my energy on writing books and giving conference presentations. However, I'll probably use this space to post occasional technical articles about projects that I'm working on as well as followups to my presentations. Enjoy!