Dabeaz

Dave Beazley's mondo computer blog.

Thursday, August 27, 2009

 

Inside the "Inside the Python GIL" Presentation

On June 11, 2009 I gave a presentation about the inner workings of the Python GIL at the Chicago Python user group meeting. To be honest, I always expected the event to be a pretty low-key affair involving some local Python hackers and some beers. However, the presentation went a little viral and I've received a number of requests for the code modifications I made to investigate thread behavior--especially the traces that show thread switching and other details.

In this post, I'll briefly outline the code changes I made to generate the traces. Before going any further, you should probably first view the original presentation. Also, as a disclaimer, none of these changes are easily packaged into a neat "patch" that you can simply download and install into any Python distribution. So, to start, download a Python source distribution for the version of Python you want to experiment with. For my talk, I was using Python 2.6.

First, let's talk about a major issue--any investigation of threads at a low-level (especially thread scheduling) tends to be a rather tricky affair involving some kind of computer science variant of the uncertainty principle. That is, once you start trying to observe thread behavior, you run the risk of changing the very thing you're trying to observe. The problem gets worse if you add a lot of extra complexity--especially if there are extra system calls or I/O. So, a major underlying concern was to try and devise a technique for recording thread behavior in a minimally invasive manner (as an aside, I considered the idea of trying to use dtrace for this, but decided that it would take longer for me to learn dtrace than it would to simply make a few minor modifications to the interpreter).

Step 1: Defining Time

Everything that happens inside the Python interpreter is focused around the concept of "ticks." Each tick loosely corresponds to a single instruction in the virtual machine. Locate the file Python/ceval.c in the Python source code. In this file, you will find a global variable _Py_Ticker holding the tick counter. Here's what the code looks like:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
...

Add a new variable declaration _Py_Ticker_Count to this code so that it looks like this:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
volatile int _Py_Ticker_Count = 0;
...

Later in the same file, you will find code that decrements the value of _Py_Ticker. Modify this code so that each time _Py_Ticker reaches 0, the value of _Py_Ticker_Count is incremented. Here's what it looks like:

/* ceval.c */
...
    if (--_Py_Ticker < 0) {
        if (*next_instr == SETUP_FINALLY) {
            /* Make the last opcode before
               a try: finally: block uninterruptable. */
            goto fast_next_opcode;
        }
        _Py_Ticker = _Py_CheckInterval;
        _Py_Ticker_Count++;
        tstate->tick_counter++;
...

The _Py_Ticker_Count and _Py_Ticker variables together define a kind of internal clock. _Py_Ticker is a countdown to the next time the interpreter might thread-switch. _Py_Ticker_Count keeps track of how many of those periodic checks have occurred--that is, how many times the interpreter has reached the point where it may signal the operating system to schedule waiting threads (if any). In the traces that follow, these two values are used together to record the sequence of events that occur in terms of interpreter ticks.
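
As an aside, the check interval that feeds this clock is also adjustable from pure Python via sys.setcheckinterval(), which is the run-time knob corresponding to _Py_CheckInterval. The short sketch below is only an illustration of that knob--it is not part of the instrumentation, and the workload is a made-up example:

# checkinterval_demo.py -- illustration only (Python 2.6 syntax).
# sys.setcheckinterval() adjusts the same interval stored in _Py_CheckInterval,
# so it changes how many ticks elapse between the periodic checks described above.
import sys
import threading

def count(n):
    # CPU-bound loop; each pass through the loop consumes interpreter ticks.
    while n > 0:
        n -= 1

print "default check interval:", sys.getcheckinterval()   # normally 100

# Check for thread switches much less frequently.
sys.setcheckinterval(10000)

t1 = threading.Thread(target=count, args=(5000000,))
t2 = threading.Thread(target=count, args=(5000000,))
t1.start(); t2.start()
t1.join(); t2.join()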

Step 2: Recording Trace Data

Python defines a general purpose lock object that is used both for the GIL and for the locking primitives in the thread and threading modules. On Unix systems using pthreads, the implementation of the lock can be found in the file Python/thread_pthread.h. In that file, there are two functions that we are going to modify: PyThread_acquire_lock() and PyThread_release_lock().

Here's the general idea: the lock/unlock functions are instrumented to record a large in-memory trace of lock-related events. These include lock entry (when a thread first tries to acquire a lock), busy (when the lock is found to be busy), retry (a repeated failed attempt to acquire a lock), acquire (lock successfully acquired), and release (lock released). In addition to the event type, the trace records the current values of the _Py_Ticker and _Py_Ticker_Count variables as well as an identifier for the currently executing thread.

All trace data is stored entirely in memory as the program executes. The size of the history can be controlled with the MAXHISTORY macro in the code. To dump the trace, a function print_history() is registered to execute on interpreter exit using the atexit() call. It is important to emphasize that no I/O occurs while the program is executing--the trace is only dumped on exit.

Here's a copy of the modified code. Be aware that thread_pthread.h is a bit of a mess and that there are a few different implementations of locks. This code is meant to go in the non-semaphore implementation of locks. Further discussion appears afterward.

/* thread_pthread.h */
...
/* Thread lock monitoring modifications (beazley) */

#include <sys/resource.h>
#include <sched.h>

#define MAXHISTORY 5000000
static int           thread_history[MAXHISTORY];
static unsigned char tick_history[MAXHISTORY];
static int           tick_count_history[MAXHISTORY];
static unsigned char tick_acquire[MAXHISTORY];
static double        time_history[MAXHISTORY];
static unsigned int  history_count = 0;

#define EVENT_ENTRY   0
#define EVENT_BUSY    1
#define EVENT_RETRY   2
#define EVENT_ACQUIRE 3
#define EVENT_RELEASE 4

static char *_codes[] = {"ENTRY","BUSY","RETRY","ACQUIRE","RELEASE" };

static void print_history(void) {
    int i;
    FILE *f;

    f = fopen("tickhistory.txt","w");
    for (i = 0; i < history_count; i++) {
        fprintf(f,"%x %d %d %s %0.6f\n",thread_history[i],tick_history[i],tick_count_history[i],_codes[tick_acquire[i]],time_history[i]);
    }
    fclose(f);
}

/* External variables recorded in the history */
extern volatile int _Py_Ticker;
extern volatile int _Py_Ticker_Count;


int
PyThread_acquire_lock(PyThread_type_lock lock, int waitflag)
{
    int success;
    pthread_lock *thelock = (pthread_lock *)lock;
    int status, error = 0;
    int start_thread = 0;

    if (history_count == 0) {
        atexit(print_history);
    }

    dprintf(("PyThread_acquire_lock(%p, %d) called\n", lock, waitflag));

    status = pthread_mutex_lock( &thelock->mut );

    /* Record information in the log */
    start_thread = (int) pthread_self();
    if (history_count < MAXHISTORY) {
        thread_history[history_count] = start_thread;
        tick_history[history_count] = _Py_Ticker;
        tick_count_history[history_count] = _Py_Ticker_Count;
        time_history[history_count] = 0.0;
        tick_acquire[history_count++] = EVENT_ENTRY;
    }

    CHECK_STATUS("pthread_mutex_lock[1]");
    success = thelock->locked == 0;

    if ( !success && waitflag ) {
        int ntries = 0;
        /* continue trying until we get the lock */

        /* mut must be locked by me -- part of the condition
         * protocol */

        while ( thelock->locked ) {
            if (ntries == 0) {
                if (history_count < MAXHISTORY) {
                    thread_history[history_count] = start_thread;
                    tick_history[history_count] = _Py_Ticker;
                    tick_count_history[history_count] = _Py_Ticker_Count;
                    time_history[history_count] = 0.0;
                    tick_acquire[history_count++] = EVENT_BUSY;
                }
            }

            status = pthread_cond_wait(&thelock->lock_released,
                                       &thelock->mut);
            CHECK_STATUS("pthread_cond_wait");
            if (thelock->locked) {
                if (history_count < MAXHISTORY) {
                    thread_history[history_count] = start_thread;
                    tick_history[history_count] = _Py_Ticker;
                    tick_count_history[history_count] = _Py_Ticker_Count;
                    time_history[history_count] = 0.0;
                    tick_acquire[history_count++] = EVENT_RETRY;
                    ntries += 1;
                }
            } else {
                if (history_count < MAXHISTORY) {
                    thread_history[history_count] = start_thread;
                    tick_history[history_count] = _Py_Ticker;
                    tick_count_history[history_count] = _Py_Ticker_Count;
                    {
                        struct timeval t;
#ifdef GETTIMEOFDAY_NO_TZ
                        if (gettimeofday(&t) == 0)
                            time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#else /* !GETTIMEOFDAY_NO_TZ */
                        if (gettimeofday(&t, (struct timezone *)NULL) == 0)
                            time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#endif /* !GETTIMEOFDAY_NO_TZ */
                    }
                    tick_acquire[history_count++] = EVENT_ACQUIRE;
                }
            }
        }
        success = 1;
    } else {
        if (history_count < MAXHISTORY) {
            thread_history[history_count] = start_thread;
            tick_history[history_count] = _Py_Ticker;
            tick_count_history[history_count] = _Py_Ticker_Count;
            time_history[history_count] = 0.0;
            tick_acquire[history_count++] = EVENT_ACQUIRE;
        }
    }
    if (success) thelock->locked = 1;
    status = pthread_mutex_unlock( &thelock->mut );
    CHECK_STATUS("pthread_mutex_unlock[1]");

    if (error) success = 0;
    dprintf(("PyThread_acquire_lock(%p, %d) -> %d\n", lock, waitflag, success));
    return success;
}

void
PyThread_release_lock(PyThread_type_lock lock)
{
    pthread_lock *thelock = (pthread_lock *)lock;
    int status, error = 0;

    dprintf(("PyThread_release_lock(%p) called\n", lock));

    status = pthread_mutex_lock( &thelock->mut );
    CHECK_STATUS("pthread_mutex_lock[3]");

    if (history_count < MAXHISTORY) {
        thread_history[history_count] = (int) pthread_self();
        tick_history[history_count] = _Py_Ticker;
        tick_count_history[history_count] = _Py_Ticker_Count;
        tick_acquire[history_count++] = EVENT_RELEASE;
    }

    thelock->locked = 0;

    status = pthread_mutex_unlock( &thelock->mut );
    CHECK_STATUS("pthread_mutex_unlock[3]");

    /* wake up someone (anyone, if any) waiting on the lock */
    status = pthread_cond_signal( &thelock->lock_released );
    CHECK_STATUS("pthread_cond_signal");
}

Step 3: Rebuilding and Running Python

Once you have made the above changes, rebuild the Python interpreter and run it on some sample code (a simple CPU-bound threaded workload is sketched after the listing below). The code should run the same as before, but on program exit, you will get a huge data file tickhistory.txt dumped into the current working directory. The contents of this file are going to look something like this:

a0811720 8 1299 RELEASE 0.000000
a0811720 15 1302 ENTRY 0.000000
a0811720 15 1302 ACQUIRE 0.000000
a0811720 10 1302 ENTRY 0.000000
a0811720 10 1302 ACQUIRE 0.000000
a0811720 10 1302 RELEASE 0.000000
a0811720 7 1302 ENTRY 0.000000
a0811720 7 1302 ACQUIRE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 BUSY 0.000000
a0811720 1 1302 RELEASE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 100 1303 RELEASE 0.000000
a0811720 100 1303 ENTRY 0.000000
a0811720 100 1303 ACQUIRE 0.000000
a0811720 92 1303 RELEASE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
...
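
As for sample code to run, just about any threaded workload will do. Here's a minimal CPU-bound example along the lines of what was used in the presentation--this particular script is only an illustrative stand-in, not the exact benchmark from the talk:

# sample.py -- a simple CPU-bound threaded workload for generating a trace.
import threading

def count(n):
    # Pure interpreter work: no I/O and no C extensions that release the GIL.
    while n > 0:
        n -= 1

t1 = threading.Thread(target=count, args=(10000000,))
t2 = threading.Thread(target=count, args=(10000000,))
t1.start()
t2.start()
t1.join()
t2.join()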

Be forewarned--the size of this file can be substantial. Running a threaded program for even 10-20 seconds might generate a trace file that contains 3-4 million events. To do any kind of analysis on it, you'll probably want to do what everyone normally does and write a Python script.
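
For example, here is a minimal sketch of such a script. It assumes the five-column format written by print_history() above (thread id in hex, _Py_Ticker, _Py_Ticker_Count, event name, timestamp) and simply tallies events per thread; anything fancier is up to you:

# analyze_ticks.py -- minimal trace analyzer sketch (Python 2.6 syntax).
# Assumes lines of the form: <thread-id hex> <ticker> <tick_count> <EVENT> <time>
import sys
from collections import defaultdict

filename = sys.argv[1] if len(sys.argv) > 1 else "tickhistory.txt"
events_by_thread = defaultdict(lambda: defaultdict(int))

f = open(filename)
for line in f:
    fields = line.split()
    if len(fields) != 5:
        continue                      # skip anything malformed
    thread_id, ticker, tick_count, event, timestamp = fields
    events_by_thread[thread_id][event] += 1
f.close()

# Per-thread event counts.
for thread_id, counts in sorted(events_by_thread.items()):
    print thread_id, dict(counts)

# BUSY events mark a thread finding the GIL held; RETRY events mark a thread
# waking up only to find the GIL still held (the contention pattern discussed below).
totals = defaultdict(int)
for counts in events_by_thread.values():
    for event, n in counts.items():
        totals[event] += n
print "totals:", dict(totals)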

Discussion

Interpreting the contents of the trace file is left as an exercise for the reader. However, here are a few tips. First, the normal sequence of lock acquisition and release on the GIL with a CPU-bound thread looks something like this (notice that the _Py_Ticker value in the second column is always 100 and that the lock goes through a repeated ENTRY->ACQUIRE->RELEASE cycle):

a000d000 100 3570 ENTRY 0.000000
a000d000 100 3570 ACQUIRE 0.000000
a000d000 100 3571 RELEASE 0.000000
a000d000 100 3571 ENTRY 0.000000
a000d000 100 3571 ACQUIRE 0.000000
a000d000 100 3572 RELEASE 0.000000
a000d000 100 3572 ENTRY 0.000000
a000d000 100 3572 ACQUIRE 0.000000
a000d000 100 3573 RELEASE 0.000000
...

If you're looking at thread contention, you're going to see a trace that has an event series of ENTRY->BUSY->RETRY->...->RETRY->ACQUIRE->RELEASE like this:

a000d000 48 4794 ENTRY 0.000000
a000d000 48 4794 BUSY 0.000000
7091800 32 4794 RELEASE 0.000000
7069a00 32 4794 ACQUIRE 1251397338.473370
7091800 32 4794 ENTRY 0.000000
7091800 32 4794 BUSY 0.000000
a000d000 32 4794 RETRY 0.000000
7069a00 100 4795 RELEASE 0.000000
7069a00 100 4795 ENTRY 0.000000
7069a00 100 4795 ACQUIRE 0.000000
a000d000 66 4795 RETRY 0.000000
7069a00 100 4796 RELEASE 0.000000
7069a00 100 4796 ENTRY 0.000000
7069a00 100 4796 ACQUIRE 0.000000
a000d000 95 4796 RETRY 0.000000
7069a00 100 4797 RELEASE 0.000000
7069a00 100 4797 ENTRY 0.000000
7069a00 100 4797 ACQUIRE 0.000000
...
a000d000 100 5083 ACQUIRE 1251397338.478188
...

Here are some other notes concerning its analysis:

Other Comments

Since giving the presentation, I've received a few comments through email offering suggestions for a GIL fix. I stand by my earlier assertion that there is no easy fix for the problem described in the presentation. Here are some specific suggestions followed by my response:

Final Words

As mentioned in the presentation, deep exploration of the Python GIL is not a project I'm actively working on. In fact, all of this was really just an exploration to find out how the GIL works and to see if I could track down pathological performance for a certain test case on my Mac. Feel free to take this code and hack it in any way that you wish. If it proves to be useful, just give me an acknowledgment when you give your PyCon presentation. Have fun!



Comments:
To the blogger gurus--is there some better way to post computer code in a blog entry so that the formatting doesn't get all funky?
 
> ...blogger gurus...

Do what Guido does, and use Google's own google-code-prettify Javascript:

http://manuelmoeg.blogspot.com/2009/03/def-hello-pass.html

(this is not as full featured as SyntaxHighlighter http://alexgorbatchev.com/wiki/SyntaxHighlighter but bone simple to use)

but in general, Blogger's support for code snippets is embarrassing.
 
oh yeah, thanks for doing some of the very best writing and research on Python's GIL. It removed much of my own stupidity on the subject.

What seems to be the answer:

*** I/O bound threaded code:

For CPython's implementation, implementing the GIL using both a condition variable and a pthreads/OS mutex lock is the way to go, so that code can be developed on a single core machine, and not blow up on a multiple core machine (and vice-versa, to a degree).

*** CPU intensive threaded code:

1) event-driven and deferred objects (using Twisted or rolling your own) - being careful to avoid loops and other long running code

2) "green" threads, implemented in pure Python - maintain your own stack of tuples of ("functionname", arg0, arg1) and ("continuationname", statevar0, statevar1), and dispatch on name from that stack - being careful to avoid loops and other long running code - if you have some code with a loop, break it up into more than one "continuationname0" "continuationname1" etc.

3) message passing between Python processes

4) multiprocessing

5) combination of the above four

The biggest missing piece: in a long running high availability application (a candidate for multi-threaded code), code reloading on the fly. Right now, a terrible solution is using Erlang as a thin layer of supervisor code, where the real work is farmed out to Python. The only advantage of this approach is avoiding predictable failure modes. Armin Ronacher blogged about this problem: http://lucumr.pocoo.org/2009/7/24/singletons-and-their-problems-in-python
 
I'd be interested to see the results on other platforms. On Mac OS X, Python uses the mutex + condition variable, but on Linux it uses pthreads semaphores. I imagine it would be completely different there, not to mention any platform that doesn't use pthreads.
 
A note on semaphores: On many systems, semaphores are implemented using condition variables and a counter. So, the underlying mechanism is very similar.

Both Linux and Windows systems exhibit the same performance problems as OS X (although the effect isn't as pronounced on Linux). We did some tests across a variety of platforms in the concurrency workshop I ran a few months ago and observed the behavior described in the presentation on all of the multicore systems.
 
Thanks for doing this research. Back when you gave the original presentation, it got posted to the python concurrency sig mailing list, and I wrote the following about why this is a tricky problem to fix:

We have two cases with very different optimal behavior for the GIL: single core and multicore machines.

* On a multicore machine, a waiting thread has to do some amount of work to wake up. A reasonable ballpark is ~1us. It makes sense to let the foreground thread continue making progress while the background thread is waking up, especially since the OS may not choose to wake up a Python thread first. So we let the foreground thread re-acquire the GIL immediately after releasing it, in the hope that it can get a couple more checks in before the background thread actually wakes up. BUT, we don't really want to let it continue running after the waiting thread does wake up, so perhaps we should have the waiting thread set a flag when it does wake up which forces the foreground thread to sleep asap. Then the waiting thread has to wait for the GIL again, but we DON'T want it to hand control back to the OS or we would have wasted that waking-up time. So maybe we have it spin-wait. But what happens if the OS has actually swapped out the foreground thread for another process? Then we waste lots of time. I don't know of any OSes that give us a way to do something when a thread gets swapped out. They don't even let another thread check whether a given thread is currently running.

* On a single core, any time the foreground thread spends executing after signaling a waiting thread is time the waiting thread can't use to wake up. So it makes sense to force a context switch to a particular waiting thread. This is actually pretty easy: instead of a GIL, we have a binary semaphore per thread that gets upped to instruct a particular thread to run, and then the previously-running thread immediately waits on its own semaphore. The issue here is just the time it takes to switch threads: ~1us. The GIL checks are currently every 100 ticks (every couple opcodes), which means that in arithmetic-heavy code those checks occur on the order of every microsecond too. You don't want to spend half of your time switching threads. On the other hand, as Dave pointed out, sometimes even 100 ticks isn't soon enough. I think we could solve this by checking the elapsed time on each "check" rather than unconditionally switching threads, but we might want to do something to give I/O-bound threads higher priority.
 
To Dave Beazley and the Unladen-Swallow implementers: is it possible to handle I/O bound multi-threaded code and CPU intensive multi-threaded code using the same techniques?

I know shared mutable state threading like in Java and C# can do it all, but only by giving up *ANY* guarantees about low-latency and performance and avoiding-race-conditions with *ANY* change to the code - a shared mutable state threading program is expected to be brittle and non-deterministic in performance with many small changes made during the life-cycle of the code.

It seems to me: any techniques adequate to handle CPU intensive multi-threaded code would be overkill for I/O bound multi-threaded code. So best to deal with the cases separately.
 
Not an expert on threading or parallel programming, but I loved your presentation. Very enlightening and interesting!

PS: Bought your book as a result :-p
 