Well, PyCon is over and it's time to get back to work. First, I'd just like to thank everyone who came to my GIL Talk and participated in all of the discussion that followed. It was almost as if part of PyCon had turned into a mini operating systems conference!
This post is a follow-up to the GIL open space at PyCon where we looked at the new GIL and explored the possibility of introducing thread priorities. For those of you not at PyCon, the open space was attended by about 30-40 people and included Guido, Antoine Pitrou, and a large number of systems hackers, some of whom had previously worked on thread library implementations and operating system kernels.
First, a little background. As you might know, Antoine Pitrou implemented a new Python GIL that is currently only available in the Python 3.2 development branch (you can obtain it via Subversion). This new GIL is described in his original mailing list post as well as the slides for my PyCon talk. You should read those first if you haven't already.
Right before PyCon, I discovered an I/O performance problem with the new GIL: CPU-bound threads stall the progress of I/O-bound threads, which in turn leads to a severe degradation of I/O bandwidth and response time. This is described in Issue 7946: Convoy effect with I/O bound threads and New GIL.
In the bug report, I submitted a very simple test case that illustrated the problem. However, here is a more refined experiment that you can try. The following program, iotest.py, contains both CPU-bound threads and an I/O server thread that echoes UDP packets. It is meant to study the case in which CPU processing and I/O processing are overlapped.
    # iotest.py
    import time
    import threading
    from socket import *
    import itertools

    def task_pidigits():
        """Pi calculation (Python)"""
        _map = map
        _count = itertools.count
        _islice = itertools.islice

        def calc_ndigits(n):
            # From http://shootout.alioth.debian.org/
            def gen_x():
                return _map(lambda k: (k, 4*k + 2, 0, 2*k + 1), _count(1))

            def compose(a, b):
                aq, ar, as_, at = a
                bq, br, bs, bt = b
                return (aq * bq,
                        aq * br + ar * bt,
                        as_ * bq + at * bs,
                        as_ * br + at * bt)

            def extract(z, j):
                q, r, s, t = z
                return (q*j + r) // (s*j + t)

            def pi_digits():
                z = (1, 0, 0, 1)
                x = gen_x()
                while 1:
                    y = extract(z, 3)
                    while y != extract(z, 4):
                        z = compose(z, next(x))
                        y = extract(z, 3)
                    z = compose((10, -10*y, 0, 1), z)
                    yield y

            return list(_islice(pi_digits(), n))

        return calc_ndigits, (50, )

    def spin():
        task, args = task_pidigits()
        while True:
            r = task(*args)

    def echo_server():
        s = socket(AF_INET, SOCK_DGRAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("", 16000))
        while True:
            msg, addr = s.recvfrom(16384)
            s.sendto(msg, addr)

    # Launch threads (adjust the number to see different results)
    NUMTHREADS = 1
    for n in range(NUMTHREADS):
        t = threading.Thread(target=spin)
        t.daemon = True
        t.start()

    # Launch a background echo server
    echo_server()
Next, here is a client program ioclient.py that simply measures the time it takes to echo 10MB of data to the server in the iotest.py program.
    # ioclient.py
    from socket import *
    import time

    CHUNKSIZE = 8192
    NUMMESSAGES = 1280     # Total of 10MB

    # Dummy message
    msg = b"x"*CHUNKSIZE

    # Connect and send messages
    s = socket(AF_INET, SOCK_DGRAM)
    start = time.time()
    for n in range(NUMMESSAGES):
        s.sendto(msg, ("", 16000))
        msg, addr = s.recvfrom(65536)
    end = time.time()
    print("%0.3f seconds (%0.3f bytes/sec)" % (end-start, (CHUNKSIZE*NUMMESSAGES)/(end-start)))
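Since the problem shows up in response time as well as bandwidth, it can also be useful to look at individual round-trip times rather than one total. The following is only a minimal sketch and not one of the original test programs: it runs the echo loop, a simple busy-loop stand-in for spin(), and the client all inside a single process on 127.0.0.1 with an OS-assigned port. The names echo_loop, busy_loop, and measure_round_trips are made up for illustration. Because the client shares the interpreter (and therefore the GIL) with the server here, the absolute numbers will not match the two-process setup above.

```python
# Illustrative sketch only: single-process variant of the echo test that
# records per-message round-trip times. All names here are hypothetical.
import socket
import statistics
import threading
import time


def echo_loop(sock):
    """Echo UDP datagrams back to the sender; an empty datagram stops the loop."""
    while True:
        msg, addr = sock.recvfrom(16384)
        if not msg:
            break
        sock.sendto(msg, addr)


def busy_loop(stop):
    """CPU-bound stand-in for spin(); runs until the stop event is set."""
    n = 0
    while not stop.is_set():
        n += 1


def measure_round_trips(num_cpu_threads=1, num_messages=50, chunksize=8192):
    """Return a list of round-trip times (seconds) for UDP echoes on localhost."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    addr = server.getsockname()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)                 # fail instead of hanging on a lost packet

    stop = threading.Event()
    echo_thread = threading.Thread(target=echo_loop, args=(server,), daemon=True)
    echo_thread.start()
    for _ in range(num_cpu_threads):
        threading.Thread(target=busy_loop, args=(stop,), daemon=True).start()

    payload = b"x" * chunksize
    times = []
    try:
        for _ in range(num_messages):
            start = time.perf_counter()
            client.sendto(payload, addr)
            client.recvfrom(65536)
            times.append(time.perf_counter() - start)
    finally:
        stop.set()                         # stop the CPU-bound threads
        client.sendto(b"", addr)           # ask the echo loop to exit
        echo_thread.join(timeout=5.0)
        client.close()
        server.close()
    return times


if __name__ == "__main__":
    for ncpu in (0, 1, 2):
        rtts = measure_round_trips(num_cpu_threads=ncpu)
        print("CPU threads=%d: mean %.3f ms, max %.3f ms"
              % (ncpu, statistics.mean(rtts) * 1e3, max(rtts) * 1e3))
```

Comparing the mean and maximum round-trip times for 0, 1, and 2 CPU-bound threads gives a quick feel for how much the background threads delay each individual echo.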
If you run iotest.py on a dual-core MacBook with only 1 spin() thread, you get the following result when you run ioclient.py:
It works, but it's hardly impressive (just barely over 1MB/sec transfer rate between two processes?). However, if you make the server have two spin() threads, the performance gets much worse:
Now to further complicate matters, if you disable all but one of the CPU cores, you get this inexplicable result:
Needless to say, there are many bizarre things going on here. The most significant effect is that on multiple cores, it is very easy for CPU-bound threads to reacquire the GIL whenever an I/O-bound thread performs I/O. This means that CPU-bound threads have a greater tendency to hog the GIL.
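As a rough back-of-the-envelope check on the one-thread result, suppose the I/O thread has to wait out one full GIL switch interval every time it wants the GIL back from a CPU-bound thread. The new GIL's default switch interval is 5 ms (the value reported by sys.getswitchinterval() in Python 3.2). The number of waits per message is purely an assumption on my part, and estimated_throughput is just an illustrative helper, not something taken from the test programs.

```python
# Back-of-the-envelope estimate only. It assumes the I/O thread pays one
# full switch interval each time it must reacquire the GIL.
def estimated_throughput(chunksize=8192, switch_interval=0.005, waits_per_message=2):
    """Estimated bytes/sec when each echoed message costs a fixed number of waits."""
    return chunksize / (switch_interval * waits_per_message)


if __name__ == "__main__":
    for waits in (1, 2):
        print("waits per message=%d: about %.0f bytes/sec"
              % (waits, estimated_throughput(waits_per_message=waits)))
```

With one or two waits per 8192-byte message, this lands between roughly 0.8 and 1.6 MB/sec, which is in the same ballpark as the roughly 1MB/sec transfer rate seen with a single spin() thread.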
At PyCon, I did some experiments with thread priorities and a modified GIL that adjusted priorities in a manner similar to what you find with multilevel feedback queues in operating systems. Namely:
The results of this approach were impressive. If you run the same tests with priorities on 2 CPU cores, you get this result:
The prioritized GIL also gives good performance for Antoine's own ccbench.py benchmark.
New GIL:

    == CPython 3.2a0.0 (py3k:78250) ==
    == i386 Darwin on 'i386' ==

    --- Throughput ---

    Pi calculation (Python)
    threads=1: 873 iterations/s.
    threads=2: 845 ( 96 %)
    threads=3: 837 ( 95 %)
    threads=4: 820 ( 93 %)

    regular expression (C)
    threads=1: 348 iterations/s.
    threads=2: 339 ( 97 %)
    threads=3: 328 ( 94 %)
    threads=4: 317 ( 91 %)

    bz2 compression (C)
    threads=1: 367 iterations/s.
    threads=2: 655 ( 178 %)
    threads=3: 642 ( 174 %)
    threads=4: 646 ( 175 %)

    --- Latency ---

    Background CPU task: Pi calculation (Python)
    CPU threads=0: 0 ms. (std dev: 0 ms.)
    CPU threads=1: 5 ms. (std dev: 0 ms.)
    CPU threads=2: 2 ms. (std dev: 2 ms.)
    CPU threads=3: 138 ms. (std dev: 100 ms.)
    CPU threads=4: 132 ms. (std dev: 99 ms.)

    Background CPU task: regular expression (C)
    CPU threads=0: 0 ms. (std dev: 0 ms.)
    CPU threads=1: 6 ms. (std dev: 1 ms.)
    CPU threads=2: 6 ms. (std dev: 6 ms.)
    CPU threads=3: 6 ms. (std dev: 4 ms.)
    CPU threads=4: 10 ms. (std dev: 8 ms.)

    Background CPU task: bz2 compression (C)
    CPU threads=0: 0 ms. (std dev: 0 ms.)
    CPU threads=1: 0 ms. (std dev: 1 ms.)
    CPU threads=2: 0 ms. (std dev: 0 ms.)
    CPU threads=3: 0 ms. (std dev: 0 ms.)
    CPU threads=4: 0 ms. (std dev: 0 ms.)

New GIL with priorities:

    == CPython 3.2a0.0 (py3k:78215M) ==
    == i386 Darwin on 'i386' ==

    --- Throughput ---

    Pi calculation (Python)
    threads=1: 885 iterations/s.
    threads=2: 860 ( 97 %)
    threads=3: 869 ( 98 %)
    threads=4: 859 ( 97 %)

    regular expression (C)
    threads=1: 362 iterations/s.
    threads=2: 358 ( 98 %)
    threads=3: 349 ( 96 %)
    threads=4: 354 ( 97 %)

    bz2 compression (C)
    threads=1: 373 iterations/s.
    threads=2: 654 ( 175 %)
    threads=3: 649 ( 173 %)
    threads=4: 638 ( 170 %)

    --- Latency ---

    Background CPU task: Pi calculation (Python)
    CPU threads=0: 0 ms. (std dev: 0 ms.)
    CPU threads=1: 0 ms. (std dev: 0 ms.)
    CPU threads=2: 0 ms. (std dev: 2 ms.)
    CPU threads=3: 0 ms. (std dev: 1 ms.)
    CPU threads=4: 0 ms. (std dev: 1 ms.)

    Background CPU task: regular expression (C)
    CPU threads=0: 0 ms. (std dev: 0 ms.)
    CPU threads=1: 2 ms. (std dev: 1 ms.)
    CPU threads=2: 3 ms. (std dev: 3 ms.)
    CPU threads=3: 2 ms. (std dev: 1 ms.)
    CPU threads=4: 2 ms. (std dev: 2 ms.)

    Background CPU task: bz2 compression (C)
    CPU threads=0: 0 ms. (std dev: 0 ms.)
    CPU threads=1: 0 ms. (std dev: 1 ms.)
    CPU threads=2: 0 ms. (std dev: 1 ms.)
    CPU threads=3: 0 ms. (std dev: 1 ms.)
    CPU threads=4: 0 ms. (std dev: 1 ms.)
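For readers who haven't run into multilevel feedback queues before, here is a toy, tick-based simulation of the general textbook idea: a thread that gives up the "GIL" voluntarily (because it blocks on I/O) has its priority raised, a thread that uses up its whole timeslice has its priority lowered, and the highest-priority runnable thread goes next. This is not the code from my patch. The names (SimThread, simulate, make_threads), the 5-tick timeslice, and the priority bounds are all assumptions chosen for illustration.

```python
# Toy model only: a generic multilevel-feedback-style priority rule,
# not the code from the patch. All names and constants are hypothetical.
from collections import deque
from dataclasses import dataclass

MIN_PRIORITY, MAX_PRIORITY = -5, 5


@dataclass(eq=False)
class SimThread:
    name: str
    burst: int         # ticks of GIL time needed before the next blocking call (>= 1)
    io_wait: int = 0   # ticks spent blocked after a burst completes
    priority: int = 0
    left: int = 0      # ticks remaining in the current burst
    wake_at: int = 0   # tick at which the thread becomes runnable again
    io_ops: int = 0    # number of completed bursts (finished I/O operations)


def simulate(threads, total_ticks, timeslice=5, use_priorities=True):
    """Run the simulation and return {thread name: completed io_ops}."""
    queue = deque(threads)                 # FIFO order breaks priority ties
    for t in queue:
        t.left = t.burst
    now = 0
    while now < total_ticks:
        ready = [t for t in queue if t.wake_at <= now]
        if not ready:                      # everyone is blocked: skip ahead
            now = min(t.wake_at for t in queue)
            continue
        if use_priorities:
            best = max(t.priority for t in ready)
            ready = [t for t in ready if t.priority == best]
        current = ready[0]
        queue.remove(current)
        run = min(current.left, timeslice)
        now += run
        current.left -= run
        if current.left == 0:
            # Gave up the GIL voluntarily (blocked on I/O): raise priority.
            current.io_ops += 1
            current.priority = min(current.priority + 1, MAX_PRIORITY)
            current.wake_at = now + current.io_wait
            current.left = current.burst
        else:
            # Used the whole timeslice and was forced off: lower priority.
            current.priority = max(current.priority - 1, MIN_PRIORITY)
        queue.append(current)
    return {t.name: t.io_ops for t in threads}


def make_threads():
    """Two CPU-bound threads and one thread doing short bursts between I/O waits."""
    return [SimThread("cpu1", burst=10**9),
            SimThread("cpu2", burst=10**9),
            SimThread("io", burst=1, io_wait=1)]


if __name__ == "__main__":
    print("round-robin:", simulate(make_threads(), 1000, use_priorities=False))
    print("priorities: ", simulate(make_threads(), 1000, use_priorities=True))
```

In this toy model the I/O-style thread completes more operations with priorities enabled, because it no longer has to sit behind every CPU-bound thread in the queue each time it wakes up. Note that nothing here prevents the CPU-bound threads from being starved if enough I/O-bound threads are active.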
The overall outcome of the GIL open space was that having a priority mechanism was probably a good idea. However, a lot of people wanted to study the problem in more detail and to think about different possible implementations. I am posting the following tar file that has my own modifications to the GIL used for the above benchmarks:
Note: This tar file has all of the modified files in the Python 3.2 source (pystate.h, pystate.c, and ceval_gil.h) along with the I/O testing benchmark. Be advised that this patch is only intended for further study by others. It's kind of hacked together and really only a proof-of-concept implementation of one possible priority scheme. A real implementation would still need to address some issues not covered in my patch (e.g., starvation effects).
Due to other time commitments, I'm not going to be able to do much follow-up with this patch at the moment. However, I do want to encourage others to at least consider the benefit of introducing thread priorities and to explore different possible implementations. Initial results seem to indicate that this can fix the GIL for both CPU-bound threads and for I/O-bound threads.