Python Generator Hacking

Copyright (C) 2009
David M. Beazley
http://www.dabeaz.com

Presented at USENIX Tech, June 15, 2009, San Diego, California.

Introduction

This tutorial discusses techniques for using generator functions, generator expressions, and coroutines to solve a variety of practical problems, mostly related to systems programming.

Support Data Files

The following file contains supporting data files used by the code samples. Download it to your machine to work through the examples. The download also includes all of the code samples listed below.

Code Samples

Here are the code samples used in the course. You can cut and paste them to your own machine to try them out. They are listed in the order of the course outline. The examples are written to run inside the "generators" directory that is created when you unzip the support data file above.

Part 2 : Processing Data Files

  • nongenlog.py. Calculate the number of bytes transferred in an Apache server log using a simple for-loop. Does not use generators.

  • genlog.py. Calculate the number of bytes transferred in an Apache server log using a series of generator expressions.

  • makebig.py. Make a large access-log file for performance testing. This will create a file "big-access-log". For the numbers used in the presentation, I used python makebig.py 2000.
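
As a rough illustration of the generator-expression approach that genlog.py describes, here is a minimal sketch that sums the last column of Apache-style log lines. It is not the file from the download: the function name total_bytes and the sample lines are made up for illustration, and it assumes Python 3 and that the byte count is the final whitespace-separated field, with "-" meaning no bytes were sent.

```python
def total_bytes(lines):
    """Sum the final column (bytes sent) over an iterable of log lines."""
    # Each stage is lazy: nothing is processed until sum() pulls values through.
    columns = (line.rsplit(None, 1)[1] for line in lines if line.strip())
    counts = (int(col) for col in columns if col != '-')
    return sum(counts)


# Hypothetical sample data standing in for an access log file.
sample_lines = [
    '81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ HTTP/1.1" 200 7587',
    '81.107.39.38 - - [24/Feb/2008:00:09:00 -0600] "GET /favicon.ico HTTP/1.1" 404 133',
    '66.249.65.37 - - [24/Feb/2008:00:09:02 -0600] "GET /ply/ply.html HTTP/1.1" 304 -',
]
```

Passing an open file instead of the list works the same way, since a file object is also an iterable of lines.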

Part 3 : Fun with Files and Directories

  • genfind.py. A generator function that yields filenames matching a given filename pattern.

  • genopen.py. A generator function that yields open file objects from a sequence of filenames.

  • gencat.py. A generator function that concatenates a sequence of generators into a single sequence.

  • gengrep.py. A generator that greps a series of lines for those that match a regex pattern.

  • bytesgen.py. Example that finds out how many bytes were transferred for a specific file in a whole directory of log files.
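
The pieces listed in this part are meant to be stacked into a pipeline. The sketch below shows one way that could look in Python 3. The function bodies are my own approximation rather than the code in the download, and matching_lines is a hypothetical helper added only to show the stages connected.

```python
import fnmatch
import os
import re


def gen_find(filepat, top):
    """Yield paths under `top` whose filename matches a shell-style pattern."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, filepat):
            yield os.path.join(dirpath, name)


def gen_open(paths):
    """Yield an open text file per path; each closes when the next is requested."""
    for path in paths:
        with open(path, 'r') as f:
            yield f


def gen_cat(sources):
    """Chain a sequence of iterables into a single stream of items."""
    for source in sources:
        yield from source


def gen_grep(pattern, lines):
    """Keep only the lines in which the regular expression is found."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))


def matching_lines(pattern, filepat, top):
    """Hypothetical helper: grep every file under `top` matching `filepat`."""
    return gen_grep(pattern, gen_cat(gen_open(gen_find(filepat, top))))
```

Because every stage is a generator, only one file is open and one line is in memory at a time, no matter how many log files the directory holds.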

Part 4 : Parsing and Processing Data

  • retuple.py. Parse a sequence of lines into a sequence of tuples using regular expressions.

  • redict.py. Parse a sequence of lines into a sequence of dictionaries with named fields.

  • fieldmap.py. Remap fields in a sequence of dictionaries.

  • linesdir.py. Generate lines from files in a directory.

  • apachelog.py. Parse an Apache log file.

  • query404.py. Find the set of all documents that are broken (404).

  • largefiles.py. Find all requests that transferred over a megabyte.

  • largest.py. Find the largest document.

  • hosts.py. Find unique host IP addresses.

  • downloads.py. Find number of downloads of a specific file.

  • robots.py. Find out who has been hitting robots.txt.
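
To make the parsing steps in this part concrete, here is a small Python 3 sketch that turns log lines into dictionaries with named fields and then remaps two of those fields to integers. The regular expression, the field names, and the sample lines are assumptions of mine based on the Apache common log format. They are not copied from apachelog.py.

```python
import re

# Assumed layout: host ident user [datetime] "method request proto" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<datetime>.*?)\] '
    r'"(?P<method>\S+) (?P<request>\S+) (?P<proto>\S+)" '
    r'(?P<status>\S+) (?P<bytes>\S+)'
)


def field_map(records, name, func):
    """Replace record[name] with func(record[name]) in each dictionary."""
    for record in records:
        record[name] = func(record[name])
        yield record


def parse_log(lines):
    """Turn log lines into dictionaries, skipping lines that do not match."""
    matches = (LOG_PATTERN.match(line) for line in lines)
    records = (m.groupdict() for m in matches if m)
    records = field_map(records, 'status', int)
    records = field_map(records, 'bytes', lambda s: int(s) if s != '-' else 0)
    return records


# Hypothetical sample data standing in for an access log file.
sample_lines = [
    '81.107.39.38 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ HTTP/1.1" 200 7587',
    '81.107.39.38 - - [24/Feb/2008:00:09:00 -0600] "GET /favicon.ico HTTP/1.1" 404 133',
    '66.249.65.37 - - [24/Feb/2008:00:09:02 -0600] "GET /ply/ply.html HTTP/1.1" 304 -',
]
```

With records in this shape, the queries listed above become one-liners. For example, the set of broken documents is {r['request'] for r in parse_log(sample_lines) if r['status'] == 404}.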

Part 5 : Processing Infinite Data

  • follow.py. Follow a log-file in real-time like tail -f in Unix. To run this program, you need to have a log-file to work with. Run the program runserver.py to start a simulated web-server. This will write a series of log lines for you to follow in the run/access-log file.

  • realtime404.py. Print all 404 requests as they happen in real-time on a log file.
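
The tail -f idea can be sketched in a few lines of Python 3. This is my approximation of the technique, not the contents of follow.py: the poll_interval parameter and the only_404 helper are hypothetical, and the filter assumes the status code is the second-to-last field on each line.

```python
import os
import time


def follow(thefile, poll_interval=0.1):
    """Yield lines appended to an open file, sleeping when nothing is new."""
    thefile.seek(0, os.SEEK_END)       # skip whatever is already in the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(poll_interval)  # no new data yet; wait and retry
            continue
        yield line


def only_404(lines):
    """Keep the lines whose second-to-last field is the status code 404."""
    return (line for line in lines if line.split()[-2:-1] == ['404'])
```

The generator never finishes on its own, so a consumer such as "for line in only_404(follow(open('run/access-log'))): print(line, end='')" runs until it is interrupted.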

Part 7 : Flipping Everything Around

  • cogrep.py. A first example of a coroutine function. This function receives lines and prints out those that contain a substring.

  • coroutine.py. A decorator function that eliminates the need to call .next() when starting a coroutine.

  • grepclose.py. An example of a coroutine that catches the close() operation.
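
For readers on Python 3, the three ideas in this part fit in one short sketch. It is an approximation, not the code from the download: it calls next() where the Python 2 samples call .next(), and instead of printing it appends to a caller-supplied list (the hypothetical matches parameter) so the behavior is easy to inspect.

```python
import functools


def coroutine(func):
    """Decorator that advances a new coroutine to its first yield."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)              # Python 3 spelling of cr.next()
        return cr
    return start


@coroutine
def grep(pattern, matches):
    """Receive lines via send() and record those that contain `pattern`."""
    try:
        while True:
            line = yield      # wait here until a caller sends a line
            if pattern in line:
                matches.append(line)
    except GeneratorExit:
        # close() raises GeneratorExit at the paused yield.
        matches.append('<closed>')
```

Without the decorator, the first send() on a freshly created coroutine would raise TypeError, because the function body has not yet reached a yield.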

Part 8 : Coroutines, Pipelines, and Dataflow

  • cofollow.py. A simple example of feeding data from a data source into a coroutine. This mirrors the 'tail -f' example from earlier.

  • copipe.py. An example of setting up a processing pipeline with coroutines.

  • cobroadcast.py. An example of a coroutine broadcaster. This fans a data stream out to multiple targets.

  • cobroadcast2.py. An example of broadcasting with a slightly different data handling pattern.
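
A pipeline with a broadcast stage might be wired up roughly as follows in Python 3. The names collect and feed are hypothetical stand-ins for a sink and a data source, and the whole block is a sketch of the pattern, not the code in copipe.py or cobroadcast.py.

```python
import functools


def coroutine(func):
    """Decorator that advances a new coroutine to its first yield."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start


@coroutine
def grep(pattern, target):
    """Filter stage: forward only the lines that contain `pattern`."""
    while True:
        line = yield
        if pattern in line:
            target.send(line)


@coroutine
def broadcast(targets):
    """Fan-out stage: send every received item to each target."""
    while True:
        item = yield
        for target in targets:
            target.send(item)


@coroutine
def collect(bucket):
    """Sink stage (hypothetical): append every received item to a list."""
    while True:
        bucket.append((yield))


def feed(lines, target):
    """Source (hypothetical): push each line into the pipeline."""
    for line in lines:
        target.send(line)
```

Note the direction of flow: with generators the consumer pulls items through the pipeline, while here the source pushes them, which is what makes fanning out to several targets straightforward.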

Part 9 : Coroutines and Event Dispatching

  • basicsax.py. A very basic example of the SAX API for parsing XML documents (does not involve coroutines).

  • cosax.py. An example that pushes SAX events into a coroutine.

  • buses.py. An example of parsing and filtering XML data with a series of connected coroutines.

  • coexpat.py. An XML parser that sends events generated by the expat XML library into coroutines. Compare with cosax.py above.
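
Pushing SAX events into a coroutine could look something like the Python 3 sketch below. The event tuple layout, the EventHandler class, and the collect and parse_events helpers are illustrative assumptions, not the contents of cosax.py. Only xml.sax.ContentHandler and xml.sax.parseString come from the standard library.

```python
import functools
import xml.sax


def coroutine(func):
    """Decorator that advances a new coroutine to its first yield."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start


class EventHandler(xml.sax.ContentHandler):
    """Forward each SAX callback to a coroutine as an (event, value) tuple."""

    def __init__(self, target):
        super().__init__()
        self.target = target

    def startElement(self, name, attrs):
        self.target.send(('start', (name, dict(attrs.items()))))

    def characters(self, text):
        self.target.send(('text', text))

    def endElement(self, name):
        self.target.send(('end', name))


@coroutine
def collect(bucket):
    """Sink (hypothetical): append every received event to a list."""
    while True:
        bucket.append((yield))


def parse_events(xml_bytes):
    """Parse a bytes document and return the list of events it produced."""
    events = []
    xml.sax.parseString(xml_bytes, EventHandler(collect(events)))
    return events
```

In a fuller example the collect sink would be replaced by a chain of filtering coroutines, which is the arrangement buses.py describes.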

Part 10 : From Data Processing to Concurrent Programming

  • genpipe.py. A pair of functions for bridging a generator across a pipe.

  • netprod.py. An example of a producer process. Launches netcons.py as a subprocess and sends data to it.
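
One way to bridge a generator across a pipe is to pickle each item on the sending side and unpickle it on the receiving side. The pair of functions below is a Python 3 sketch under that assumption. The names sendto and recvfrom are my own choice, and the code is not copied from genpipe.py.

```python
import pickle


def sendto(source, outfile):
    """Pickle each item from `source` onto a binary file-like object."""
    for item in source:
        pickle.dump(item, outfile)
    outfile.flush()


def recvfrom(infile):
    """Yield unpickled items from a binary file-like object until EOF."""
    # pickle.load should only be used on data from a trusted producer.
    while True:
        try:
            yield pickle.load(infile)
        except EOFError:
            return
```

Any pair of binary file objects will do as the channel, for example the two ends of os.pipe() wrapped with os.fdopen(), or the stdin of a child process started with subprocess.Popen.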