Purely Functional Programming for Python

For the past couple of years I have been playing around with Functional Programming (FP) but I still do a lot of my "real work" in imperative OOP in Python. There are pros and cons to each approach to solving problems.

Since some people might not be familiar with FP so I will try to explain it briefly. Basically, in a purely functional language, functions are not allowed to produce side-effects. Printing, modifying global state, modifying variables that have already been assigned a value and many, many techniques that define imperative programming are expressly banned in FP. The end result of this is that functions are referentially transparent, which is to say that you will get the same result from the function every time that you pass it the same parameters. Also, the parameters themselves can never be modified. This is the exact opposite of imperative programming. If you pass a variable by reference or if you pass a function or method an object it can be modified therein. For a more thorough introduction to FP see defmacro's excellent article Functional Programming for the Rest of Us.

Imperative OOP code is easy to understand and maintain. It jives with our understanding of reality: you say Printer.print(document) and it does exactly that. It can also run faster and use less memory than FP, since you can change values in-place. When you want to make a modification in a pure function you have to copy everything and change only the piece of data that needs to be updated. That can be slow. Many FP compiler implementers have been working on optimizations to work around this so things have gotten a lot better.

Testing in OOP can sometimes be a pain since you may have to potentially set up a massive amount of state. Objects can rely on the state of other objects, etc. Since pure functions are referentially transparent testing them just requires that you pass in some parameters and check that the return value is what you expect. There are other benefits to be had from pure functions as well.

The following was basically plagiarized from Wikipedia's article about FP:

If the result of a pure function isn't used it doesn't need to be computed.
If a pure function is called with the same parameters the compiler can implement caching optimizations such as memoization.
If there is no data dependency between two pure expressions (the result of one isn't required as an argument to the other), their order can be reversed or they can be run in parallel.
If there are no side-effects, any evaluation strategy can be used.

You might've noticed that CPUs have basically stopped getting faster recently. When we buy a computer now we just get more CPUs: dual-core, quad-core and so on according to Moore's law. This trend is going to continue to produce more and more cores.

This is bad news for the standard imperative approach to parallelization, threading, which relies on shared state. In OOP, objects are widely used. Calling a method on an object most likely modifies its state. Other threads might access the same object so it they have to check if something else is modifying the object that they want to update. If it is being updated the thread requesting the update has to block. This doesn't scale well. Pure functions share nothing and therefore don't block. Because of this more and more massively parallel systems are being built with functional languages or principles: Google uses map/reduce for searching its massive database. Facebook uses Erlang for its chat system. Amazon uses Erlang for their distributed database SimpleDB.

The other day I was explaining some of the benefits of functional programming to a coworker and it dawned on me that I might be able to explain to Python 1) how to check if a function is pure and 2) how to optimize the function according to the aforementioned rules about pure functions by manipulating the function's Abstract Syntax Tree using Python's ast module.

The result of a couple night's work is "Purely Functional Programming for Python" (pfpp), a very nascent library that checks if a function is pure and then tries optimization 3 from the above list.

I tried to make functional a decorator, so it could appear above functions, but this made it hard to get the AST, since the hackish way that I am getting the function source code isn't available when the decorator is called. I hope to solve this soon, but for now, if you want to declare something a pure function you have do something like this:

def a():
    return 10
a = functional(a)

This will check to make sure that the function is pure and then perform parallelization optimizations. Currently it checks to see that no no printing is done, variables are not assigned to more than once and no methods of objects are called. If any of these are violated the program will stop at the parse step. This is really cool. It's like being able to make your own language with Python as a starting point. And Python is a great starting point IMHO. However, for this project a lot of work remains here. There are probably many ways to get around these checks and introduce impure code.

Once it gets past the purity check, function calls within the function are parallelized. Here is an example:

from time import sleep

def x():
    sleep(2) # to simulate a long process
    return 10

def y():
    sleep(2) # to simulate a long process
    return 20

def z():
    x_result = x()
    y_result = y()
    return x_result + y_result

print z()

Given the above program, before parallelization:

$ time python demo.py
30

real    0m4.094s
user    0m0.020s
sys 0m0.052s

After adding one line, the z function's AST is transformed so that x and y are run in parallel:

z = functional(z)

$ time python demo.py 
results manager invoked
30

real    0m2.169s
user    0m0.044s
sys 0m0.088s

What's really cool about this is that, much like Haskell, if Python does end up getting a reliable way of identifying pure functions we can implement most of our programs in pure functions, automatically gaining benefits from them, and still use a few impure functions to communicate with the "real" world. Since Python lacks the insane type system of Haskell it might be a lot harder to enforce this separation but it's worth a shot.

Under the hood, pfpp requires Python 2.6 for the latest and greatest ast helpers for metaprogramming and the multiprocessing library for parallelization. Please feel free to contribute via github or shoot me an email if you are interested in this project. Memoization could be the next low hanging fruit. A lot of work remains.

UPDATE

Memoization has been added. Things like computing the Fibonacci series recursively can be made a lot faster now:

def fib(x):
    if x == 0 or x == 1:
        return 1
    return fib(x - 1) + fib(x - 2)

print fib(30)

As it is written above:

$ time python demo.py 
1346269

real    0m0.943s
user    0m0.512s
sys 0m0.092s

After setting fib = functional(fib):

$ time python demo.py 
1346269

real    0m0.188s
user    0m0.060s
sys 0m0.104s

5 times faster through memoization! While performing these benchmarks I also removed a bottleneck concerning the ResultManager, which stores the results of parallelized function calls. It is basically just a wrapper around multiprocessing.pool that turns results into thunks so they can be evaluated in parallel. Anyway, a ResultsManager was being created with each function call which was a lot of overhead. It is now only created once and placed in the func_globals of the function. Should speed things up a lot.