Dabeaz

Dave Beazley's mondo computer blog.

Thursday, August 27, 2009

Inside the "Inside the Python GIL" Presentation

On June 11, 2009 I gave a presentation about the inner workings of the Python GIL at the Chicago Python user group meeting. To be honest, I always expected the event to be a pretty low-key affair involving some local Python hackers and some beers. However, the presentation went a little viral and I've received a number of requests to get the code modifications I made to investigate thread behavior--especially the traces that show thread switching and other details.

In this post, I'll briefly outline the code changes I made to generate the traces. Before going any further, you should probably first view the original presentation. Also, as a disclaimer, none of these changes are easily packaged into a neat "patch" that one can simply download and install into any Python distribution. So, to start, you should first go download a Python source distribution for the version of Python you want to experiment with. For my talk, I was using Python 2.6.

First, let's talk about a major issue--any investigation of threads at a low level (especially thread scheduling) tends to be a rather tricky affair involving some kind of computer science variant of the uncertainty principle. That is, once you start trying to observe thread behavior, you run the risk of changing the very thing you're trying to observe. The problem gets worse if you add a lot of extra complexity--especially if there are extra system calls or I/O. So, a major underlying concern was to devise a technique for recording thread behavior in a minimally invasive manner. (As an aside, I considered using dtrace for this, but decided that it would take longer for me to learn dtrace than it would to simply make a few minor modifications to the interpreter.)

Step 1: Defining time

Everything that happens inside the Python interpreter is focused around the concept of "ticks." Each tick loosely corresponds to a single instruction in the virtual machine. Locate the file Python/ceval.c in the Python source code. In this file, you will find a global variable _Py_Ticker holding the tick counter. Here's what the code looks like:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
...

Add a new variable declaration _Py_Ticker_Count to this code so that it looks like this:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
volatile int _Py_Ticker_Count = 0;
...

Later in the same file, you will find code that decrements the value of _Py_Ticker. Modify this code so that each time _Py_Ticker drops below 0 and is reset to _Py_CheckInterval, the value of _Py_Ticker_Count is incremented. Here's what it looks like:

/* ceval.c */
...
  if (--_Py_Ticker < 0) {
   if (*next_instr == SETUP_FINALLY) {
    /* Make the last opcode before
       a try: finally: block uninterruptable. */
    goto fast_next_opcode;
   }
   _Py_Ticker = _Py_CheckInterval;
   _Py_Ticker_Count++; 
   tstate->tick_counter++;
...

The _Py_Ticker_Count and _Py_Ticker variables together define a kind of internal clock. _Py_Ticker is a countdown to the next time the interpreter might thread-switch. The _Py_Ticker_Count keeps track of how many times the interpreter has actually signaled the operating system to schedule waiting threads (if any). In the traces that follow, these two values are used together to record the sequence of events that occur in terms of interpreter ticks.
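To make the two-part clock concrete, here is a toy Python model of the countdown. It is only a sketch: simulate_ticks is a hypothetical helper, not interpreter code, and the real loop in ceval.c skips the check on some opcodes, so ticks only loosely match instructions.

```python
def simulate_ticks(n_instructions, check_interval=100):
    """Toy model of the _Py_Ticker countdown (an illustration, not ceval.c)."""
    ticker = 0        # mirrors _Py_Ticker; starts at 0 so the first tick fires
    ticker_count = 0  # mirrors _Py_Ticker_Count
    for _ in range(n_instructions):
        ticker -= 1
        if ticker < 0:
            # Same shape as the patched C code: reset the countdown and
            # bump the counter that records how many checks have happened.
            ticker = check_interval
            ticker_count += 1
    return ticker, ticker_count
```

Under this model, simulate_ticks(102) returns (100, 2): one check on the very first instruction and a second one 101 instructions later. Within a single value of the count, the ticker only goes down, so the pair (count, ticker) orders events in a trace.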

Step 2: Recording Trace Data

Python defines a general purpose lock object that is used for both the GIL and locking primitives in the threading modules. On Unix systems using pthreads, the implementation of the lock can be found in the file Python/thread_pthread.h. In that file, there are two functions that we are going to modify: PyThread_acquire_lock() and PyThread_release_lock().

Here's the general idea: the lock/unlock functions are instrumented to record a large in-memory trace of lock-related events. These include lock entry (when a thread first tries to acquire a lock), busy (when the lock is busy), retry (a repeated failed attempt to acquire a lock), acquire (lock successfully acquired), and release (lock released). In addition to events, the trace records the current values of the _Py_Ticker and _Py_Ticker_Count variables as well as the identifier of the currently executing thread (the value returned by pthread_self()).

All trace data is stored entirely in memory as programs execute. The size of the history can be controlled with a macro in the code. To dump the trace, a function print_history() is registered to execute on interpreter exit using the atexit() call. It is important to emphasize that no I/O occurs as programs are executing--traces are only dumped on program exit.

Here is a copy of the modified code. Be aware that thread_pthread.h is a bit of a mess and that there are a few different implementations of locks. This code is meant to go in the non-semaphore implementation of locks. Further discussion appears afterwards.

/* thread_pthread.h */
...
/* Thread lock monitoring modifications (beazley) */

#include <sys/resource.h>
#include <sys/time.h>   /* gettimeofday(), used for the ACQUIRE timestamps */
#include <sched.h>

#define MAXHISTORY 5000000
static int           thread_history[MAXHISTORY];
static unsigned char tick_history[MAXHISTORY];
static int           tick_count_history[MAXHISTORY];
static unsigned char tick_acquire[MAXHISTORY];
static double        time_history[MAXHISTORY];
static unsigned int  history_count = 0;

#define EVENT_ENTRY   0
#define EVENT_BUSY    1
#define EVENT_RETRY   2
#define EVENT_ACQUIRE 3
#define EVENT_RELEASE 4

static char *_codes[] = {"ENTRY","BUSY","RETRY","ACQUIRE","RELEASE" };

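static void print_history(void) {
 unsigned int i;
 FILE *f;

 /* One line per event: thread id, ticker, tick count, event name, time */
 f = fopen("tickhistory.txt","w");
 if (f == NULL) {
   return;   /* could not create the trace file; nothing to dump */
 }
 for (i = 0; i < history_count; i++) {
   fprintf(f,"%x %d %d %s %0.6f\n",thread_history[i],tick_history[i],tick_count_history[i],_codes[tick_acquire[i]],time_history[i]);
 }
 fclose(f);
}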

/* External variables recorded in the history */
extern volatile int _Py_Ticker;
extern volatile int _Py_Ticker_Count;


int
PyThread_acquire_lock(PyThread_type_lock lock, int waitflag)
{
 int success;
 pthread_lock *thelock = (pthread_lock *)lock;
 int status, error = 0;
 int start_thread = 0;

 if (history_count == 0) {
   atexit(print_history);
 }

 dprintf(("PyThread_acquire_lock(%p, %d) called\n", lock, waitflag));

 status = pthread_mutex_lock( &thelock->mut );

 /* Record information in the log */
 start_thread = (int) pthread_self(); 
 if (history_count < MAXHISTORY) {
   thread_history[history_count] = start_thread;
   tick_history[history_count] = _Py_Ticker;
   tick_count_history[history_count] = _Py_Ticker_Count;
   time_history[history_count] = 0.0;
   tick_acquire[history_count++] = EVENT_ENTRY;
 }

 CHECK_STATUS("pthread_mutex_lock[1]");
 success = thelock->locked == 0;

 if ( !success && waitflag ) {

   int ntries = 0;
  /* continue trying until we get the lock */

  /* mut must be locked by me -- part of the condition
   * protocol */

  while ( thelock->locked ) {
    if (ntries == 0) {
      if (history_count < MAXHISTORY) {
        thread_history[history_count] = start_thread;
        tick_history[history_count] = _Py_Ticker;
        tick_count_history[history_count] = _Py_Ticker_Count;
        time_history[history_count] = 0.0;
        tick_acquire[history_count++] = EVENT_BUSY;
      }
    }

   status = pthread_cond_wait(&thelock->lock_released,
         &thelock->mut);
   CHECK_STATUS("pthread_cond_wait");
   if (thelock->locked) {
     if (history_count < MAXHISTORY) {
       thread_history[history_count] = start_thread;
       tick_history[history_count] = _Py_Ticker;
       tick_count_history[history_count] = _Py_Ticker_Count;
       time_history[history_count] = 0.0;
       tick_acquire[history_count++] = EVENT_RETRY;
       ntries += 1;
     }
   } else {
     if (history_count < MAXHISTORY) {
       thread_history[history_count] = start_thread;
       tick_history[history_count] = _Py_Ticker;
       tick_count_history[history_count] = _Py_Ticker_Count;
       {
         struct timeval t;
#ifdef GETTIMEOFDAY_NO_TZ
         if (gettimeofday(&t) == 0)
    time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#else /* !GETTIMEOFDAY_NO_TZ */
         if (gettimeofday(&t, (struct timezone *)NULL) == 0)
    time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#endif /* !GETTIMEOFDAY_NO_TZ */
       }
       tick_acquire[history_count++] = EVENT_ACQUIRE;
     }
   }

  }
  success = 1;
 } else {
   if (history_count < MAXHISTORY) {
     thread_history[history_count] = start_thread;
     tick_history[history_count] = _Py_Ticker;
     tick_count_history[history_count] = _Py_Ticker_Count;
     time_history[history_count] = 0.0;
     tick_acquire[history_count++] = EVENT_ACQUIRE;
   }
 }
 if (success) thelock->locked = 1;
 status = pthread_mutex_unlock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_unlock[1]");

 if (error) success = 0;
 dprintf(("PyThread_acquire_lock(%p, %d) -> %d\n", lock, waitflag, success));
 return success;
}

void
PyThread_release_lock(PyThread_type_lock lock)
{
 pthread_lock *thelock = (pthread_lock *)lock;
 int status, error = 0;

 dprintf(("PyThread_release_lock(%p) called\n", lock));

 status = pthread_mutex_lock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_lock[3]");
 
 if (history_count < MAXHISTORY) {
   thread_history[history_count] = (int) pthread_self();
   tick_history[history_count] = _Py_Ticker;
   tick_count_history[history_count] = _Py_Ticker_Count;
   tick_acquire[history_count++] = EVENT_RELEASE;
 }

 thelock->locked = 0;

 status = pthread_mutex_unlock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_unlock[3]");

 /* wake up someone (anyone, if any) waiting on the lock */
 status = pthread_cond_signal( &thelock->lock_released );
 CHECK_STATUS("pthread_cond_signal");
}

Step 3: Rebuilding and Running Python

Once you have made the above changes, rebuild the Python interpreter and run it on some sample code. The code should run the same as before, but on program exit, you will get a huge data file tickhistory.txt dumped into the current working directory. The contents of this file are going to look something like this:

a0811720 8 1299 RELEASE 0.000000
a0811720 15 1302 ENTRY 0.000000
a0811720 15 1302 ACQUIRE 0.000000
a0811720 10 1302 ENTRY 0.000000
a0811720 10 1302 ACQUIRE 0.000000
a0811720 10 1302 RELEASE 0.000000
a0811720 7 1302 ENTRY 0.000000
a0811720 7 1302 ACQUIRE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 BUSY 0.000000
a0811720 1 1302 RELEASE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 100 1303 RELEASE 0.000000
a0811720 100 1303 ENTRY 0.000000
a0811720 100 1303 ACQUIRE 0.000000
a0811720 92 1303 RELEASE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
...

Be forewarned--the size of this file can be substantial. Running a threaded program for even 10-20 seconds might generate a trace file that contains 3-4 million events. To do any kind of analysis on it, you'll probably want to do what everyone normally does and write a Python script.
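As a starting point for such a script, here is a minimal sketch that parses the five-column format shown above and counts events per thread. The names Event, parse_trace, and events_per_thread are hypothetical, and the parser assumes every data line has exactly the five whitespace-separated fields that print_history() writes.

```python
from collections import Counter, namedtuple

# One record per trace line, in the column order written by print_history().
Event = namedtuple("Event", "thread ticker tick_count kind timestamp")

def parse_trace(lines):
    """Yield Event records, skipping anything that is not a 5-field line."""
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # e.g. blank lines or a "..." placeholder
        thread, ticker, tick_count, kind, timestamp = fields
        yield Event(int(thread, 16), int(ticker), int(tick_count),
                    kind, float(timestamp))

def events_per_thread(lines):
    """Count how often each (thread, event kind) pair occurs."""
    counts = Counter()
    for event in parse_trace(lines):
        counts[(event.thread, event.kind)] += 1
    return counts
```

Because parse_trace() is a generator, you can pass it an open file object for tickhistory.txt and it will process the file one line at a time instead of loading millions of events into memory.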

Discussion

Interpreting the contents of the trace file is left as an exercise for the reader. However, here are a few tips. First, the normal sequence of lock acquisition and release on the GIL with a CPU-bound thread looks something like this (notice that the _Py_Ticker value in the 2nd column is always 100 and that the lock goes through a repeated ENTRY->ACQUIRE->RELEASE cycle):

a000d000 100 3570 ENTRY 0.000000
a000d000 100 3570 ACQUIRE 0.000000
a000d000 100 3571 RELEASE 0.000000
a000d000 100 3571 ENTRY 0.000000
a000d000 100 3571 ACQUIRE 0.000000
a000d000 100 3572 RELEASE 0.000000
a000d000 100 3572 ENTRY 0.000000
a000d000 100 3572 ACQUIRE 0.000000
a000d000 100 3573 RELEASE 0.000000
...

If you're looking at thread contention, you're going to see a trace that has an event series of ENTRY->BUSY->RETRY->...->RETRY->ACQUIRE->RELEASE like this:

a000d000 48 4794 ENTRY 0.000000
a000d000 48 4794 BUSY 0.000000
7091800 32 4794 RELEASE 0.000000
7069a00 32 4794 ACQUIRE 1251397338.473370
7091800 32 4794 ENTRY 0.000000
7091800 32 4794 BUSY 0.000000
a000d000 32 4794 RETRY 0.000000
7069a00 100 4795 RELEASE 0.000000
7069a00 100 4795 ENTRY 0.000000
7069a00 100 4795 ACQUIRE 0.000000
a000d000 66 4795 RETRY 0.000000
7069a00 100 4796 RELEASE 0.000000
7069a00 100 4796 ENTRY 0.000000
7069a00 100 4796 ACQUIRE 0.000000
a000d000 95 4796 RETRY 0.000000
7069a00 100 4797 RELEASE 0.000000
7069a00 100 4797 ENTRY 0.000000
7069a00 100 4797 ACQUIRE 0.000000
...
a000d000 100 5083 ACQUIRE 1251397338.478188
...
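One way to quantify a contention pattern like this is to measure how long each thread sits between its BUSY event and its eventual ACQUIRE. The sketch below does that in units of the tick count (third column). contended_waits is a hypothetical helper, and it assumes the events in the trace mostly belong to one lock, since the trace does not record which lock an event came from.

```python
def contended_waits(lines):
    """Return (thread, retries, check_intervals_waited) per BUSY...ACQUIRE span."""
    pending = {}  # thread id -> [tick count at BUSY, retries seen so far]
    waits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip "..." placeholders and malformed lines
        thread, _ticker, tick_count, kind, _timestamp = fields
        tick_count = int(tick_count)
        if kind == "BUSY":
            pending[thread] = [tick_count, 0]
        elif kind == "RETRY" and thread in pending:
            pending[thread][1] += 1
        elif kind == "ACQUIRE" and thread in pending:
            start, retries = pending.pop(thread)
            waits.append((thread, retries, tick_count - start))
    return waits
```

Applied to the excerpt above, thread a000d000 goes BUSY at tick count 4794 and finally acquires the lock at 5083, a wait of 289 check intervals. The retry count only reflects the RETRY lines actually present in the input, so an excerpt with elided lines will undercount.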

Here are some other notes concerning its analysis:

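The first column is the thread identifier (the value of pthread_self() cast to an int and printed in hex), not the address of a lock object. The trace does not record which lock an event belongs to, so if you trace a threaded program that uses many different locks, you will be recording events not only for the GIL, but for every lock in the program. You might be able to use this to investigate lock contention.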
The GIL is not specifically identified in the trace file. However, it will be one of the first locks used.
The last column of the trace file is a system timer that is only recorded when locks are acquired after repeated failed acquisition attempts. At some point, I was using this to investigate some issues related to response times, but to be honest, I didn't spend much time exploring that angle. It might be useful if you want to get an idea for how long each thread runs before giving up control. Of course, you may just want to comment that code out.
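If you do keep the timer, a short sketch like the following pulls out the timestamped ACQUIRE events and the gaps between them. Both function names are hypothetical. The gaps are simply the time between consecutive contended acquisitions in the trace; what that means for a given program is left to your own analysis.

```python
def contended_acquire_times(lines):
    """Return (thread, timestamp) for ACQUIRE events with a nonzero timestamp."""
    times = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5 or fields[3] != "ACQUIRE":
            continue
        timestamp = float(fields[4])
        if timestamp > 0.0:  # 0.0 means the lock was acquired without waiting
            times.append((fields[0], timestamp))
    return times

def gaps_between(times):
    """Seconds elapsed between consecutive timestamped acquisitions."""
    stamps = [stamp for _thread, stamp in times]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```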

Other Comments

Since giving the presentation, I've received a few comments through email offering suggestions for a GIL fix. I stand by my earlier assertion that there is no easy fix for the problem described in the presentation. Here are some specific suggestions followed by my response:

"Perhaps the GIL could be fixed by adding some kind of scheduling queue." If you were to add a scheduling queue to the GIL, you would effectively turn it into a kind of poorly implemented mutex lock. Mutex locks are already implemented (by pthreads and the OS) using queues in order to avoid thread starvation. More details can be found in an operating system textbook. You might also look at the Bakery Algorithm.
"Perhaps the GIL could be fixed by simply using a mutex lock." As just mentioned, mutex locks are generally implemented using a queuing mechanism. If you do this, runnable threads will always context switch every 100 interpreter ticks (you'll see the threads cycling in a round-robin manner). This will definitely eliminate the multicore contention problem, but now your programs will perform a tremendous amount of context switching. Also, you might lose the high scheduling priority of I/O bound threads. Needless to say, there are some downsides that need to be considered (just for the record, I think the use of a condition variable in the current implementation is probably the best overall solution for running on a single CPU).
"Could you fix the problem by telling the operating system to schedule all threads on the same core?" Short answer: No. C extensions to Python (and even significant parts of Python itself) often release the GIL by design so that they can run concurrently while carrying out work that doesn't directly involve the Python interpreter. If you force everything to one core, you will most likely make these programs run worse, not better.

Final Words

As mentioned in the presentation, deep exploration of the Python GIL is not a project I'm actively working on. In fact, all of this was really just an exploration to find out how the GIL works and to see if I could track down pathological performance for a certain test case on my Mac. Feel free to take this code and hack it in any way that you wish. If it proves to be useful, just give me an acknowledgment when you give your PyCon presentation. Have fun!

Labels: gil, python, threads

posted by Dave at 5:39 AM

Sunday, August 09, 2009

Essential Misconceptions

A few days ago, Mike Riley posted a great review of the new "Python Essential Reference, 4th Edition" on Dr. Dobb's CodeTalk. In that review, he writes:

"While the author could have taken the easy path of regurgitating the online documentation, he has instead reworked the explanation for each class and function call in the Python core library with commendable clarity, frequently accompanying these detailed examinations with extremely useful and meaningful code examples. The book is also very well designed and organized, making it a snap to find information within a matter of seconds."

This is a reviewer who really gets what this book is about. However, for every great review like this, I also encounter comments that simply dismiss the book out-of-hand saying it "offers nothing" over Python's online documentation. With all due respect to Python's fine documentation, I beg to differ.

First and foremost, I've always viewed the Python Essential Reference as a serious programming reference for myself (yes, I always have a copy next to my desk and I use it regularly). Although I will admit that Python certainly has a lot of online documentation, it's also missing a lot of essential details. For example, I can't count the number of times I've looked at the online documentation for something only to have to go out and do some kind of extended Google search to fill in a missing detail (or worse, having to load the source code for some module and look through it).

Let's look at an example. Suppose you're writing some networking code with the socket module and you want to use the recv(bufsize [, flags]) method of a socket. If you head off to the online documentation you will certainly find some information.

"Receive data from the socket. The return value is a string representing the data received. The maximum amount of data to be received at once is specified by bufsize. See the Unix manual page recv(2) for the meaning of the optional argument flags; it defaults to zero."

Yes, this is all very useful. Especially that part about having to refer to a Unix man page. I'm sure the Windows programmers find that especially useful. If you turn to the Essential Reference p. 483, you'll not only find a description, but you will also get a complete table showing you exactly what can be given for flags along with a brief description of each option. This approach is found throughout the book--only rarely are readers simply referred to other documentation. As another example, I would challenge anyone to effectively use something like the setsockopt() or getsockopt() methods of a socket using nothing but Python's online docs.
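To give a flavor of what the flags argument and the socket option calls actually do, here is a small sketch. It is not taken from the book: peek_then_read and reuseaddr_enabled are hypothetical helpers, and the code is written for Python 3, where recv() returns bytes rather than the string described in the 2009 documentation.

```python
import socket

def peek_then_read(payload=b"hello"):
    """Show one recv() flag: MSG_PEEK returns data but leaves it queued."""
    left, right = socket.socketpair()  # two connected sockets, no network needed
    try:
        left.sendall(payload)
        peeked = right.recv(1024, socket.MSG_PEEK)  # look without consuming
        consumed = right.recv(1024)                 # flags default to zero
        return peeked, consumed
    finally:
        left.close()
        right.close()

def reuseaddr_enabled():
    """Set SO_REUSEADDR with setsockopt() and read it back with getsockopt()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # The value read back is platform dependent; only nonzero matters.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
    finally:
        sock.close()
```

Both calls to recv() in peek_then_read() see the same bytes, because the peek does not remove anything from the receive buffer.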

The other thing that I've tried to do in the book is answer all sorts of questions about tricky interactions between different parts of Python. Take, for example, this question: Can a separate execution thread safely close a generator/coroutine function by invoking the generator's close() method? Sure, that's not the kind of question that comes up every day, but if you know a thing or two about generators and coroutines, you'll know that they are often used in the context of concurrent programming, just like threads. Not only that, threads and generators might be used together (for example, using threads to carry out blocking operations). Thus, it is reasonable to assume that programmers working with both threads and generators in the same program might start to wonder about their possible interaction. I know I did.

If you try to find an answer to this question using the online documentation, you will be searching for some time and probably come up with nothing. Although there is plenty of discussion about generators, the yield statement, and other matters, you really don't find much about generators and threads mixed together. Even PEP 342, the official specification that introduced the generator close() method says nothing on this matter.

Now, let's look at the Essential Reference. First, if you turn to the index and look up "Threads", you will find about a half-page of subentries. In fact, there is even an entry labeled "Threads: close() method of generators, p. 104." If you turn to p. 104, you will find a sentence "if a program is currently iterating on a generator, you should not call close() asynchronously on that generator from a separate thread of execution or from a signal handler."
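To see why that warning matters, here is a small experiment, written as a sketch for Python 3 on CPython (other implementations and versions may behave differently). One thread is parked inside a generator while the main thread calls close() on it. The name demo_close_while_running and the use of events to line up the timing are illustrative assumptions, not code from the book.

```python
import threading

def demo_close_while_running():
    """Call close() on a generator while another thread is executing it."""
    entered = threading.Event()   # set once the generator body is running
    release = threading.Event()   # lets the generator continue

    def gen():
        entered.set()
        release.wait(timeout=5)   # stay "running" until the main thread is done
        yield "value"

    g = gen()
    results = {}

    def consumer():
        results["item"] = next(g)

    worker = threading.Thread(target=consumer)
    worker.start()
    entered.wait(timeout=5)       # the generator is now executing in the worker
    try:
        g.close()                 # asynchronous close from a different thread
    except ValueError as exc:
        results["error"] = exc    # CPython refuses to close a running generator
    finally:
        release.set()
        worker.join()
    return results
```

In this controlled case CPython rejects the call with a ValueError and the consumer still receives its value. A real program has no such synchronization, so whether close() lands while the generator is running or suspended is a race.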

This is certainly not the only example; there are a wide variety of similar questions that I try to address. For example, can you use a decorator with a recursive function? (p. 113). Or what is the interaction between the __slots__ feature of a class and inheritance? (p. 133). Or, does the name mangling of private attributes (e.g., __foo) in a class introduce a runtime performance penalty? (p. 128). All of these questions fall into a general category of issues related to the "side-effects" of using various Python features. Although you can find some of this in the online docs, it is often scattered and incomplete. I've tried to fix that.
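You can poke at all three of these questions directly in the interpreter. The classes and the counted decorator below are hypothetical examples written for Python 3, not material from the book, and they only demonstrate observable behavior rather than answer the performance question.

```python
import functools

def counted(func):
    """Decorator that counts calls, including recursive ones."""
    @functools.wraps(func)
    def wrapper(*args):
        wrapper.calls += 1
        return func(*args)
    wrapper.calls = 0
    return wrapper

@counted
def factorial(n):
    # The recursive call looks up the global name "factorial", which is
    # now the wrapper, so every level of recursion is counted.
    return 1 if n <= 1 else n * factorial(n - 1)

class Base:
    __slots__ = ("x",)

class WithDict(Base):        # no __slots__ here, so instances get a __dict__
    pass

class StillSlotted(Base):    # repeating __slots__ keeps instances dict-free
    __slots__ = ("y",)

class Account:
    def __init__(self):
        self.__balance = 0   # stored under the mangled name _Account__balance
```

After calling factorial(5), factorial.calls is 5, which shows that the decorator wraps each recursive call. Name mangling is applied when the class body is compiled, which you can confirm by looking for _Account__balance in vars(Account()).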

Finally, I've really tried to make the Essential Reference a kind of programming "cookbook" of sorts. Although its primary goal is to be a reference, I have also incorporated a wide variety of practical examples from the Python training courses that I run. For instance, if you know about the Generators or Coroutines tutorials I presented at PyCon, you'll find similar information. I also include examples that explore tricky interactions and customization features of certain library modules. For example, how do I customize an XML-RPC server to only accept connections from known IP addresses? (p. 494). Or how do I use the ssl module to implement a secure server? (p. 489). Many of these examples are related to things that I've had to figure out once before, but can never quite remember on a day-to-day basis. By putting them in the book, it helps me remember how to do a variety of tricky things.
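For the XML-RPC question, one possible approach (a sketch, not the example from p. 494) is to override the verify_request() hook that the standard library's socket servers call before handling each request. The class name RestrictedXMLRPCServer and the allowed_hosts set are assumptions made for illustration, and the import uses the Python 3 module name xmlrpc.server.

```python
from xmlrpc.server import SimpleXMLRPCServer

class RestrictedXMLRPCServer(SimpleXMLRPCServer):
    """XML-RPC server that only talks to an allow-list of client addresses."""

    # Hypothetical allow-list; replace with the addresses you trust.
    allowed_hosts = frozenset({"127.0.0.1"})

    def verify_request(self, request, client_address):
        # socketserver calls this hook for every incoming connection;
        # returning False makes the server drop it without processing.
        host = client_address[0]
        return host in self.allowed_hosts
```

You would then create the server as usual, for example RestrictedXMLRPCServer(("", 8000)), register functions with register_function(), and call serve_forever(). Filtering on the peer address is only a coarse check and is no substitute for authentication.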

So, that's about it. I hope people find the book to be useful. If so, tell your friends. If not, feel free to use it for propping up some uneven furniture. Just don't say that it's the same as the online docs.

Labels: essential reference, python

posted by Dave at 8:26 AM
