Friday, December 7, 2012

PyPy related internship at NCAR

Hello everyone

I would like to advertise a PyPy-related summer internship at the National Center for Atmospheric Research, which is located in lovely Boulder, Colorado. As last year, the mentor will be Davide del Vento, with possible support from me on the PyPy side.

The full details of the application can be found in the internship description; make sure you read the requirements first. Important requirements:

  • Must currently be enrolled in a United States university.
  • Only students authorized to work for any employer in the United States will be considered for the SIParCS program.
  • Must be a graduate or undergraduate student who has completed their sophomore year.

If you happen to fulfill the requirements, to me this sounds like a great opportunity to spend a summer at NCAR in Boulder hacking on atmospheric models using PyPy.


Tuesday, December 4, 2012

Py3k status update #8

This is the eighth status update about our work on the py3k branch, which
we can work on thanks to all of the people who donated to the py3k

Just a short update on November's work: we're now passing about 194 of
approximately 355 modules of CPython's regression test suite, up from passing
160 last month. Many test modules only fail a small number of individual tests

We'd like to thank Amaury Forgeot d'Arc for his contributions; in particular, he
has made significant progress on updating CPyExt for Python 3 this month.

Some other highlights:

  • test_marshal now passes, and there's been significant progress on
    pickling (thanks Kenny Levinsen and Amaury for implementing
  • We now have a _posixsubprocess module
  • More encoding related fixes, which affect many failing tests
  • _sre was updated and now test_re almost passes
  • Exception behavior is almost complete per the Python 3 specs, what's mostly
    missing now are the new __context__ and __traceback__ attributes (PEP
  • Fixed some crashes and deadlocks occurring during the regression tests
  • We merged the unicode-strategies branch both to default and to py3k: now we
    have versions of lists, dictionaries and sets specialized for unicode
    elements, as we already had for strings.
  • However, string-specialized containers are still faster in some cases
    because there are shortcuts which have not been implemented for unicode yet
    (e.g., constructing a set of strings from a list of strings). The plan is to
    completely kill the shortcuts and improve the JIT to produce the fast
    version automatically for both the string and unicode versions, to have a
    more maintainable codebase without sacrificing the speed. The autoreds
    branch (already merged) was a first step in this direction.


Tuesday, November 27, 2012

PyPy San Francisco Sprint Dec 1st - Dec 2nd 2012

The next PyPy sprint will be in San Francisco, California. It is a
public sprint, suitable for newcomers. It will run on Saturday December 1st and
Sunday December 2nd. The goals for the sprint are continued work towards the
2.0 release as well as code cleanup; we of course welcome any topic which
contributors are interested in working on.

Some other possible topics are:

  • running your software on PyPy
  • work on PyPy's numpy (status)
  • work on STM (status)
  • JIT improvements
  • any exciting stuff you can think of

If there are newcomers, we'll run the usual introduction to hacking on


The sprint will be held at the Rackspace Office:

620 Folsom St, Ste 100
San Francisco

The doors will open at 10AM and the sprint will run until 6PM on both days.

Thanks to David Reid for helping get everything set up!

Thursday, November 22, 2012

PyPy 2.0 beta 1

We're pleased to announce the 2.0 beta 1 release of PyPy. This release is not a typical beta, in the sense that its stability is the same as or better than that of 1.9 and it can be used in production. It does however include a few performance regressions, documented below, that don't allow us to label it as 2.0 final. (It also contains many performance improvements.)

The main features of this release are support for ARM processors and compatibility with CFFI. It also includes numerous improvements to the numpy in pypy effort, cpyext and performance.

You can download the PyPy 2.0 beta 1 release here:

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.3. It's fast (pypy 2.0 beta 1 and cpython 2.7.3 performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. It also supports ARM machines running Linux. Windows 64 work is still stalled; we would welcome a volunteer to handle that.

How to use PyPy?

We suggest using PyPy from a virtualenv. Once you have a virtualenv installed, you can follow instructions from pypy documentation on how to proceed. This document also covers other installation schemes.


Reasons why this is not PyPy 2.0:

  • the ctypes fast path is now slower than it used to be. In PyPy 1.9 ctypes was either dramatically faster or slower than CPython, depending on whether you hit the fast path or not. Right now it's usually simply slower. We're probably going to rewrite ctypes using cffi, which will make it universally faster.
  • cffi (an alternative way of interfacing with C code) is very fast, but it is missing one optimization that will make it as fast as a native call from C.
  • numpypy lazy computation was disabled for the sake of simplicity. We should reenable this for the final 2.0 release.


  • cffi is officially supported by PyPy. You can install it normally by using pip install cffi once you have installed PyPy and pip. The corresponding 0.4 version of cffi has been released.
  • ARM is now an officially supported processor architecture. PyPy now works on soft-float ARM/Linux builds. Currently ARM processors supporting the ARMv7 and later ISA that include a floating-point unit are supported.
  • This release contains the latest Python standard library 2.7.3 and is fully compatible with Python 2.7.3.
  • It does not however contain hash randomization, since the solution present in CPython is not solving the problem anyway. The reason can be found on the CPython issue tracker.
  • gc.get_referrers() is now faster.
  • Various numpy improvements. The list includes:
    • axis argument support in many places
    • full support for fancy indexing
    • complex128 and complex64 dtypes
  • JIT hooks are now a powerful tool to introspect the JITting process that PyPy performs.
  • **kwds usage is much faster in the typical scenario
  • operations on long objects are now as fast as in CPython (from roughly 2x slower)
  • We now have special strategies for dict/set/list which contain unicode strings, which means that now such collections will be both faster and more compact.

Things we're working on

There are a few things that did not make it to the 2.0 beta 1, which are being actively worked on. Greenlets support in the JIT is one that we would like to have before 2.0 final. Two important items that will not make it to 2.0, but are being actively worked on, are:

  • Faster JIT warmup time.
  • Software Transactional Memory.

Maciej Fijalkowski, Armin Rigo and the PyPy team

Friday, November 2, 2012

Py3k status update #7

This is the seventh status update about our work on the py3k branch, which
we can work on thanks to all of the people who donated to the py3k

The biggest news is that this month Philip started to work on py3k in parallel
to Antonio. As such, there was an increased amount of activity.

The py3k buildbots now fully translate the branch every night and run the
Python standard library tests.

We currently pass 160 out of approximately 355 modules of CPython's standard
test suite, fail 144 and skip approximately 51.

Some highlights:

  • dictviews (the objects returned by dict.keys/values/items) have been greatly
    improved, and now they fully support set operators
  • a lot of tests have been fixed wrt complex numbers (and in particular the
    __complex__ method)
  • _csv has been fixed and now it correctly handles unicode instead of bytes
  • more parser fixes, py3k list comprehension semantics; now you can no longer
    access the list comprehension variable after it finishes
  • 2to3'd most of the lib_pypy modules (pypy's custom standard lib
  • py3-enabled pyrepl: this means that finally readline works at the command
    prompt, as well as builtins.input(). pdb seems to work, as well as
    fancycompleter to get colorful TAB completions :-)
  • py3 round
  • further tightening/cleanup of the unicode handling (more usage of
    surrogateescape, surrogatepass among other things)
  • as well as keeping up with some big changes happening on the default branch
    and of course various other fixes.
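
Two of the Python 3 behaviours mentioned in the list above can be checked with a few lines of plain Python 3. This is only an illustrative sketch (the variable and function names are made up for the example), not code taken from the branch:

```python
# Dict views behave like sets in Python 3.
first = {"a": 1, "b": 2}
second = {"b": 3, "c": 4}
shared_keys = first.keys() & second.keys()     # intersection of two key views
only_in_first = first.keys() - second.keys()   # difference of two key views


def comprehension_variable_leaks():
    """Return True if the loop variable of a list comprehension is still
    visible after the comprehension has finished (the Python 2 behaviour)."""
    squares = [item * item for item in range(3)]
    try:
        item  # on Python 3 the name is local to the comprehension
    except NameError:
        return False
    return True
```

On a Python 3 interpreter the two set expressions yield ordinary sets and the helper reports that the comprehension variable did not leak.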

Finally, we would like to thank Amaury Forgeot d'Arc for his significant


Thursday, November 1, 2012

NumPy status update #5


I'm quite excited to report that work on NumPy in PyPy has been restarted and there has been quite a bit of progress in the past two months. Things that happened:

  • complex dtype support - thanks to Matti Picus, NumPy on PyPy now supports complex dtype (only complex128 so far, there is work on the other part)
  • big refactoring - probably the biggest thing we did was finishing a big refactoring that disabled some speedups (notably lazy computation of arrays), but lowered the barrier to implementing cool new features.
  • fancy indexing support - all fancy indexing tricks should now work, including a[b] where b is an array of integers.
  • newaxis support - now you can use newaxis features
  • improvements to intp, uintp, void, string and record dtypes

Features that have active branches, but haven't been merged:

  • float16 dtype support
  • missing ndarray attributes - this is a branch to finish all attributes on ndarray, hence ending one chapter.
  • pickling support for numarray - hasn't started yet, but next on the list

More importantly, we're getting very close to being able to import the Python part of the original numpy with only import modifications and run its tests. Most tests will fail at this point, however it'll be a good start for another chapter :-)


Tuesday, October 23, 2012

Cape Town 2012 sprint report


We're about to finish a PyPy sprint in Cape Town, South Africa. It was one of the smallest so far, with only Armin Rigo and Maciej Fijalkowski attending and Alex Gaynor joining briefly at the beginning, but also one of the longest, lasting almost 3 weeks. The sprint theme seems to be predominantly "no new features" and "spring cleaning". Overall, we removed about 20k lines of code from the PyPy source tree. The breakdown of things done and worked on:

  • We killed SomeObject support in annotation and rtyper. This is a modest code saving, however, it reduces the complexity of RPython and also, hopefully, improves compile errors from RPython. We're far from done on the path to have comprehensible compile-time errors, but the first step is always the hardest :)

  • We killed some magic in specifying the interface between builtin functions and Python code. It used to be possible to write builtin functions like this:

    def f(space, w_x='xyz'):

    which will magically wrap 'xyz' into a W_StringObject. Right now, instead, you have to write:

    def f(space, w_x):

    which is more verbose, but less magical.

  • We killed the CExtModuleBuilder, which is the last remaining part of the infamous extension compiler that could in theory build C extensions for CPython in RPython. This never worked very well and the main part was killed long ago.

  • We killed various code duplications in the C backend.

  • We killed microbench and a bunch of other small-to-medium unused directories.

  • We killed the llgraph JIT backend and rewrote it from scratch. Now the llgraph backend is not translatable, but this feature was rarely used and caused a great deal of complexity.

  • We progressed on the continulet-jit-3 branch, up to the point of merging it into the result-in-resops branch, which has also seen a bit of progress.

    Purpose of those two branches:

    • continulet-jit-3: enable stackless to interact with the JIT by killing global state while resuming from the JIT into the interpreter. This has multiple benefits. For example it's one of the stones on the path to enable STM for PyPy. It also opens new possibilities for other optimizations including Python-Python calls and generators.
    • result-in-resops: the main goal is to speed up the tracing time of PyPy. We found out the majority of time is spent in the optimizer chain, which faces an almost complete rewrite. It also simplifies the storage of the operations as well as the number of implicit invariants that have to be kept in mind while developing.
  • We finished and merged the excellent work by Ronan Lamy which makes the flow object space (used for abstract interpretation during RPython compilation) independent from the Python interpreter. This means we've achieved an important milestone on the path of separating the RPython translation toolchain from the PyPy Python interpreter.

fijal & armin

Wednesday, September 26, 2012

Py3k status update #6

This is the sixth status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.

The coolest news is not about what we did in the past weeks, but what we will
do in the next: I am pleased to announce that Philip Jenvey has been
selected by the PyPy community to be funded for his upcoming work on py3k,
thanks to your generous donations. He will start to work on it shortly, and he
will surely help the branch to make faster progress. I am also particularly
happy about this because Philip is the first non-core developer who is getting
paid with donations: he demonstrated over the past months to be able to work
effectively on PyPy, and so we were happy to approve his application for the
job. This means that anyone can potentially be selected in the future, the
only strict requirement is to have a deep interest in working on PyPy and to
prove to be able to do so by contributing to the project.

Back to the status of the branch. Most of the work since the last status
update has been done in the area of, guess what? Unicode strings. As usual,
this is one of the most important changes between Python 2 and Python 3, so
it's not surprising. The biggest news is that now PyPy internally supports
unicode identifiers (such as names of variables, functions, attributes, etc.),
whereas earlier it supported only ASCII byte strings. The change is still
barely visible from the outside, because the parser still rejects non-ASCII
identifiers, however you can see it with a bit of creativity:

>>>> def foo(x): pass
>>>> foo(**{'àèìòù': 42})
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: foo() got an unexpected keyword argument 'àèìòù'

Before the latest changes, you used to get question marks instead of the
proper name for the keyword argument. Although this might seem like a small
detail, it is a big step towards a proper working Python 3 interpreter and it
required a couple of days of headaches. A spin-off of this work is that now
RPython has better built-in support for unicode (also in the default branch):
for example, it now supports unicode string formatting (using the percent
operator) and the methods .encode/.decode('utf-8').

Other than that there is the usual list of smaller issues and bugs that got
fixed, including (but not limited to):

  • teach the compiler when to emit the new opcode DELETE_DEREF (and
    implement it!)
  • detect when we use spaces and TABs inconsistently in the source code, as
    CPython does
  • fix yet another bug related to the new lexically scoped exceptions (this
    is the last one, hopefully)
  • port some of the changes that we did to the standard CPython 2.7 tests to
    3.2, to mark those which are implementation details and should not be run on

Finally, I would like to thank Amaury Forgeot d'Arc and Ariel Ben-Yehuda for
their work on the branch; among other things, Amaury recently worked on
cpyext and on the PyPy _cffi_backend, while Ariel submitted a patch to
implement PEP 3138.
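
Three of the Python 3 semantics touched on above (lexically scoped exception names, deleting a variable that a nested function closes over, and the PEP 3138 repr) can be observed from plain Python 3. The following is a small sketch for illustration only; the function names are invented for this example and it does not show how the branch implements any of this:

```python
def exception_name_is_unbound_after_block():
    """The name bound by 'except ... as' is deleted when the block ends."""
    try:
        raise ValueError("boom")
    except ValueError as exc:
        message = str(exc)
    try:
        exc  # the name was removed at the end of the except block
    except NameError:  # UnboundLocalError is a subclass of NameError
        return message, True
    return message, False


def delete_closure_variable():
    """'del' on a variable used by a nested function is legal in Python 3."""
    value = 1

    def inner():
        return value

    first = inner()
    del value  # 'value' is shared with inner(), so this is a DELETE_DEREF
    try:
        inner()
    except NameError:
        return first, True
    return first, False


# PEP 3138: repr() keeps printable non-ASCII characters, ascii() escapes them.
printable_repr = repr("\xe9")
escaped_repr = ascii("\xe9")
```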

Wednesday, September 5, 2012

PyPy Cape Town Sprint Oct 7th - Oct 21st 2012

Hello everyone!

The next PyPy sprint will be in Cape Town, South Africa. It is a public sprint, suitable for newcomers. It starts a couple of days after PyCon South Africa, which is on the 4th and 5th of October. This is a relatively unusual sprint in that it is hosted halfway across the world from where most contributors live, so we plan to spend some time during those two weeks doing sprinting and some time doing touristy stuff. The goals for the sprint are general progress and whatever people are interested in.

Possible topics:

  • PyPy release 2.0
  • running your software on PyPy
  • work on PyPy's numpy (status)
  • work on STM (status)
  • JIT improvements
  • any exciting stuff you can think of

If there are newcomers, we'll run the usual introduction to hacking on PyPy.


The sprint will be held either in the apartment of fijal, which is in Tamboerskloof, Cape Town, or in the offices of the Praekelt Foundation, located in Woodstock, Cape Town. The Praekelt Foundation has offered to host us, if needed.

Cape Town, as a very touristy place, has tons of accommodation ranging in quality from good to amazing. Depending on the sprint location you might need a car.

Good to Know

You probably don't need a visa for South Africa -- consult Wikipedia. South Africa is a lovely place with lots of stuff to do. You can see penguins, elephants, lions and sharks all in one day (or better yet, on multiple days).

There is a wide selection of good restaurants within a reasonable distance of the sprint venue (depending on the venue, either walking or driving).

The power plug is some weird derivative of an old English standard, but adapters are easily acquired.

Who's Coming?

If you'd like to come, please let us know when you will be arriving and leaving, as well as what your interests are. We'll keep a list of people which we'll update (or you can do so yourself if you have bitbucket pypy commit rights).


Tuesday, September 4, 2012

NumPy on PyPy status update

Hello everyone.

It's been a while since we posted a numpy work update, but I'm pleased to inform you that work on it has been restarted. A lot of the work has been done by Matti Picus, who is one of the newest contributors to the PyPy project. None of the work below has been merged so far; it's work in progress:

  • Complex dtype support.
  • Fixing incompatibilities between numpy and pypy's version.
  • Refactoring numpypy to simplify the code and make it easier for new contributors.
  • Reusing most of numpy's pure Python code without modifications.

Finishing this is also the plan for the next month.


Monday, August 13, 2012

CFFI release 0.3

Hi everybody,

We released CFFI 0.3. This is the first release that supports more than CPython 2.x :-)

  • CPython 2.6, 2.7, and 3.x are supported (3.3 definitely, but maybe 3.2 or earlier too)
  • PyPy trunk is supported.

In more detail, the main news is:

  • support for PyPy. You need to get a trunk version of PyPy, which comes with the built-in module _cffi_backend to use with the CFFI release. For testing, you can download the Linux 32/64 versions of PyPy trunk. The OS/X and Windows versions of _cffi_backend are not tested at all so far, so probably don't work yet.
  • support for Python 3. It is unknown which exact version is required; probably 3.2 or even earlier, but we need 3.3 to run the tests. The 3.x version is not a separate source; it runs out of the same sources. Thanks Amaury for starting this port.
  • the main change in the API is that you need to use ffi.string(cdata) instead of str(cdata) or unicode(cdata). The motivation for this change was the Python 3 compatibility. If your Python 2 code used to contain str(<cdata 'char *'>), it would interpret the memory content as a null-terminated string; but on Python 3 it would just return a different string, namely "<cdata 'char *'>", and proceed without even a crash, which is bad. So ffi.string() solves it by always returning the memory content as an 8-bit string (which is a str in Python 2 and a bytes in Python 3).
  • other minor API changes are documented at (grep for version 0.3).

Upcoming work, to be done before release 1.0:

  • expose to the user the module cffi.model in a possibly refactored way, for people that don't like (or for some reason can't easily use) strings containing snippets of C declarations. We are thinking about refactoring it in such a way that it has a ctypes-compatible interface, to ease porting existing code from ctypes to cffi. Note that this would concern only the C type and function declarations, not all the rest of ctypes.
  • CFFI 1.0 will also have a corresponding PyPy release. We are thinking about calling it PyPy 2.0 and including the whole of CFFI (instead of just the _cffi_backend module like now). In other words it will support CFFI out of the box --- we want to push forward usage of CFFI in PyPy :-)


Armin Rigo and Maciej Fijałkowski

C++ objects in cppyy, part 1: Data Members

The cppyy module makes it possible to call into C++ from PyPy through the Reflex package. Documentation and setup instructions are available here. Recent work has focused on STL, low-level buffers, and code quality, but also a lot on pythonizations for the CINT backend, which is mostly for High Energy Physics (HEP) use only. A previous posting walked through the high-level structure and organization of the module, where it was argued why it is necessary to write cppyy in RPython and generate bindings at run-time for the best performance. This posting details how access to C++ data structures is provided and is part of a series of 3 postings on C++ object representation in Python: the second posting will be about method dispatching, the third will tie up several odds and ends by showing how the choices presented here and in part 2 work together to make features such as auto-casting possible.

Wrapping Choices

Say we have a plain old data type (POD), which is the simplest possible data structure in C++. For example:

    struct A {
        int    m_i;
        double m_d;
    };

What should such a POD look like when represented in Python? Let's start by looking at a Python data structure that is functionally similar, in that it also carries two public data members of the desired types. Something like this:

    class A(object):
        def __init__(self):
            self.m_i = 0
            self.m_d = 0.

Alright, now how to go about connecting this Python class with the former C++ POD? Or rather, how to connect instances of either. The exact memory layout of a Python A instance is up to Python, and likewise the layout of a C++ A instance is up to C++. Both layouts are implementation details of the underlying language, language implementation, language version, and the platform used. It should be no surprise then, that for example an int in C++ looks nothing like a PyIntObject, even though it is perfectly possible, in both cases, to point out in memory where the integer value is. The two representations can thus not make use of the same block of memory internally. However, the requirement is that the access to C++ from Python looks and feels natural in its use, not that the mapping is exact. Another requirement is that we want access to the actual object from both Python and C++. In practice, it is easier to provide natural access to C++ from Python than the other way around, because the choices of memory layout in C++ are far more restrictive: the memory layout defines the access, as the actual class definition is gone at run-time. The best choice then, is that the Python object will act as a proxy to the C++ object, with the actual data always being in C++.

From here it follows that if the m_i data member lives in C++, then Python needs some kind of helper to access it. Conveniently, since version 2.2, Python has a property construct that can take a getter and setter function that are called when the property is used in Python code, and present it to the programmer as if it were a data member. So we arrive at this (note how the property instance is a variable at the class level):

    class A(object):
        def __init__(self):
            self._cppthis = construct_new_A()
        m_i = property(get_m_i, set_m_i)
        m_d = property(get_m_d, set_m_d)

The construct_new_A helper is not very interesting (the reflection layer can provide for it directly), and methods are a subject for part 2 of this posting, so focus on get_m_i and set_m_i. In order for the getter to work, the method needs to have access to the C++ instance for which the Python object is a proxy. On access, Python will call the getter function with the proxy instance for which it is called. The proxy has a _cppthis data member from which the C++ instance can be accessed (think of it as a pointer) and all is good, at least for m_i. The second data member m_d, however, requires some more work: it is located at some offset into _cppthis. This offset can be obtained from the reflection information, which lets the C++ compiler calculate it, so details such as byte padding are fully accounted for. Since the setter also needs the offset, and since both share some more details such as the containing class and type information of the data member, it is natural to create a custom property class. The getter and setter methods then become bound methods of an instance of that custom property, CPPDataMember, and there is one such instance per data member. Think of something along these lines:

    def make_datamember(cppclass, name):
        cppdm = cppyy.CPPDataMember(cppclass, name)
        return property(cppdm.get, cppdm.set)

where the make_datamember function replaces the call to property in the class definition above.
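
To make the proxy scheme concrete, here is a minimal runnable sketch in plain Python. It is not cppyy code: a ctypes.Structure stands in for the C++ POD (so that ctypes, rather than a reflection layer, supplies the member offsets including padding), and DataMember is a hypothetical analogue of CPPDataMember written only for this illustration.

```python
import ctypes


class A_c(ctypes.Structure):
    """Stand-in for the C++ struct A; ctypes computes the member offsets."""
    _fields_ = [("m_i", ctypes.c_int), ("m_d", ctypes.c_double)]


class DataMember(object):
    """Hypothetical analogue of CPPDataMember: one instance per data member,
    remembering the member's offset and C type."""

    def __init__(self, cstruct, name):
        self.offset = getattr(cstruct, name).offset
        self.ctype = dict(cstruct._fields_)[name]

    def get(self, proxy):
        # Read the value at 'address of the object + offset of the member'.
        return self.ctype.from_address(proxy._cppthis + self.offset).value

    def set(self, proxy, value):
        self.ctype.from_address(proxy._cppthis + self.offset).value = value


def make_datamember(cstruct, name):
    member = DataMember(cstruct, name)
    return property(member.get, member.set)


class A(object):
    def __init__(self):
        self._storage = A_c()  # keeps the "C++" memory alive
        self._cppthis = ctypes.addressof(self._storage)

    m_i = make_datamember(A_c, "m_i")
    m_d = make_datamember(A_c, "m_d")


a = A()
a.m_i = 42
a.m_d = 3.5
```

The values written through the proxy end up in the underlying structure, and m_d is found at a non-zero offset that already accounts for any padding after m_i.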

Now hold on a minute! Before it was argued that Python and C++ can not share the same underlying memory structure, because of choices internal to the language. But if on the Python side choices are being made by the developer of the language bindings, that is no longer a limitation. In other words, why not go through e.g. the Python extension API, and do this:

    struct A_pyproxy {
        PyObject_HEAD
        int    m_i;
        double m_d;
    };

Doing so would save on malloc overhead and remove a pointer indirection. There are some technical issues specific to PyPy for such a choice: there is no such thing as PyPyObject_HEAD and the layout of objects is not a given as that is decided only at translation time. But assume that those issues can be solved, and also accept that there is no problem in creating structure definitions like this at run-time, since the reflection layer can provide both the required size and access to the placement new operator (compare e.g. CPython's struct module). There is then still a more fundamental problem: it must be possible to take over ownership in Python from instances created in C++ and vice-versa. With a proxy scheme, that is trivial: just pass the pointer and do the necessary bookkeeping. With an embedded object, however, not every use case can be implemented: e.g. if an object is created in Python, passed to C++, and deleted in C++, it must have been allocated independently. The proxy approach is therefore still the best choice, although embedding objects may provide for optimizations in some use cases.


The next step is to take a more complicated C++ class, one with inheritance (I'm leaving out details such as constructors etc., for brevity):

    class A {
    public:
        virtual ~A() {}
        int    m_i;
        double m_d;
    };

    class B : public A {
    public:
        virtual ~B() {}
        int    m_j;
    };

From the previous discussion, it should already be clear what this will look like in Python:

    class A(object):
        def __init__(self):
            self._cppthis = construct_new_A()
        m_i = make_datamember('A', 'm_i')
        m_d = make_datamember('A', 'm_d')

    class B(A):
        def __init__(self):
            self._cppthis = construct_new_B()
        m_j = make_datamember('B', 'm_j')

There are some minor adjustments needed, however. For one, the offset of the m_i data member may be no longer zero: it is possible that a virtual function dispatch table (vtable) pointer is added at the beginning of A (an alternative is to have the vtable pointer at the end of the object). But if m_i is handled the same way as m_d, with the offset provided by the compiler, then the compiler will add the bits, if any, for the vtable pointer and all is still fine. A real problem could come in however, with a call of the m_i property on an instance of B: in that case, the _cppthis points to a B instance, whereas the getter/setter pair expect an A instance. In practice, this is usually not a problem: compilers will align A and B and calculate an offset for m_j from the start of A. Still, that is an implementation detail (even though it is one that can be determined at run-time and thus taken advantage of by the JIT), so it can not be relied upon. The m_i getter thus needs to take into account that it can be called with a derived type, and so it needs to add an additional offset. With that modification, the code looks something like this (as you would have guessed, this is getting more and more into pseudo-code territory, although it is conceptually close to the actual implementation in cppyy):

    def get_m_i(self):
        return int(self._cppthis + offset(A, m_i) + offset(self.__class__, A))

Which is a shame, really, because the offset between B and A is going to be zero most of the time in practice, and the JIT can not completely elide the offset calculation (as we will see later; it is easy enough to elide if self.__class__ is A, though). One possible solution is to repeat the properties for each derived class, i.e. to have a get_B_m_i etc., but that looks ugly on the Python side and anyway does not work in all cases: e.g. with multiple inheritance where there are data members with the same name in both bases, or if B itself has a public data member called m_i that shadows the one from A. The optimization then, is achieved by making B in charge of the offset calculations, by making offset a method of B, like so:

    def get_m_i(self):
        return int(self._cppthis + offset(A, m_i) + self.offset(A))

The insight is that by scanning the inheritance hierarchy of a derived class like B, you can know statically whether it may sometimes need offsets, or whether the offsets are always going to be zero. Hence, if the offsets are always zero, the method offset on B will simply return the literal 0 as its implementation, with the JIT taking care of the rest through inlining and constant folding. If the offset could be non-zero, then the method will perform an actual calculation, and it will let the JIT elide the call only if possible.
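
The idea of deciding statically which offset method a class gets can be sketched in plain Python as follows. Everything here is hypothetical and for illustration only: BASE_OFFSETS plays the role of the reflection information, the class names B and E and the byte offsets are made up, and the real cppyy implementation is written in RPython and differs in detail.

```python
# Hypothetical reflection data: byte offset of each base inside a derived class.
BASE_OFFSETS = {
    "B": {"A": 0},            # single inheritance: the base sits at the start
    "E": {"A": 0, "X": 16},   # multiple inheritance: second base is displaced
}


def zero_offset(self, base_name):
    """Used when scanning the hierarchy shows offsets are always zero; a
    tracing JIT can inline this and fold the constant away."""
    return 0


def make_offset_method(class_name):
    offsets = BASE_OFFSETS.get(class_name, {})
    if all(value == 0 for value in offsets.values()):
        return zero_offset

    def offset(self, base_name):
        # A real calculation is needed for this class.
        return offsets[base_name]

    return offset


class B(object):
    offset = make_offset_method("B")


class E(object):
    offset = make_offset_method("E")
```

Here B receives the constant method, while E receives one that performs a lookup, mirroring the split between the common zero-offset case and the general case.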

Multiple Virtual Inheritance

Next up would be multiple inheritance, but that is not very interesting: we already have the offset calculation between the actual and base class, which is all that is needed to resolve any multiple inheritance hierarchy. So, skip that and move on to multiple virtual inheritance. That this is going to be a tad more complicated will be clear if you show the following code snippet to any old C++ hand and see how they respond. Most likely you will be told: "Don't ever do that." But if code can be written, it will be written, and so for the sake of the argument, what would this look like in Python:

    class A {
    public:
        virtual ~A() {}
        int m_a;
    };

    class B : public virtual A {
    public:
        virtual ~B() {}
        int m_b;
    };

    class C : public virtual A {
    public:
        virtual ~C() {}
        int m_c;
    };

    class D : public virtual B, public virtual C {
    public:
        virtual ~D() {}
        int m_d;
    };

Actually, nothing changes from what we have seen so far: the scheme as laid out above is fully sufficient. For example, D would simply look like:

    class D(B, C):
        def __init__(self):
            self._cppthis = construct_new_D()
        m_d = make_datamember('D', 'm_d')

The point is that the only complication added by multiple virtual inheritance is that navigation of the C++ instance happens with pointers internal to the instance rather than with offsets. However, it is still a fixed offset from any location to any other location within the instance, as its parts are laid out consecutively in memory (this is not a requirement, but it is the most efficient, so it is what is used in practice). What you cannot do is determine the offset statically: you need a live (i.e. constructed) object for any offset calculations. In Python, everything is always done dynamically, so that is of itself not a limitation. Furthermore, self is already passed to the offset calculation (remember that this was done to put the calculation in the derived class, to optimize the common case of zero offset), thus a live C++ instance is there precisely when it is needed. The call to the offset calculation is hard to elide, since the instance will be passed to a C++ helper and so the most the JIT can do is guard on the instance's memory address, which is likely to change between traces. Instead, explicit caching is needed on the base and derived types, allowing the JIT to elide the lookup in the explicit cache.

Static Data Members and Global Variables

That, so far, covers all access to instance data members. Next up are static data members and global variables. A complication here is that a Python property needs to live on the class in order to work its magic. Otherwise, if you get the property, it will simply return the getter function, and if you set it, it will disappear. The logical conclusion, then, is that a property representing a static or global variable needs to live on the class of the class, or the metaclass. If done directly though, that would mean that every static data member is available from every class, since all Python classes have the same metaclass, which is class type (and which is its own metaclass). To prevent that from happening, and because type is actually immutable, each proxy class needs to have its own custom metaclass. Furthermore, since static data can also be accessed on the instance, the class, too, gets a property object for each static data member. Expressed in code, for a basic C++ class, this looks as follows:

    class A {
    public:
        static int s_i;
    };

Paired with some Python code such as this, needed to expose the static variable both on the class and the instance level:

    meta_A = type(CppClassMeta, 'meta_A', [CPPMetaBase], {})
    meta_A.s_i = make_datamember('A', 's_i')

    class A(object):
        __metaclass__ = meta_A
        s_i = make_datamember('A', 's_i')

Inheritance adds no complications for the access of static data per se, but there is the issue that the metaclasses must follow the same hierarchy as the proxy classes, for the Python method resolution order (MRO) to work. In other words, there are two complete, parallel class hierarchies that map one-to-one: a hierarchy for the proxy classes and one for their metaclasses.
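
The two parallel hierarchies can be tried out in plain Python. The following is only a sketch under stated assumptions: STATIC_STORAGE is a stand-in for the fixed global addresses of the C++ statics, make_static_datamember is a hypothetical simplification of make_datamember, the derived class B is invented for the example, and the Python 3 metaclass=... syntax is used where the snippet above uses the Python 2 __metaclass__ attribute.

```python
STATIC_STORAGE = {('A', 's_i'): 0}   # stand-in for fixed, global C++ addresses

def make_static_datamember(cls_name, name):
    key = (cls_name, name)
    def getter(obj):                 # obj is an instance or a class; it is not needed
        return STATIC_STORAGE[key]
    def setter(obj, value):
        STATIC_STORAGE[key] = value
    return property(getter, setter)

class CPPMetaBase(type):
    pass

# Metaclass hierarchy ...
class meta_A(CPPMetaBase):
    s_i = make_static_datamember('A', 's_i')   # serves access through the class

class meta_B(meta_A):
    pass

# ... mirrored one-to-one by the proxy class hierarchy.
class A(object, metaclass=meta_A):
    s_i = make_static_datamember('A', 's_i')   # serves access through instances

class B(A, metaclass=meta_B):
    pass

a = A()
a.s_i = 42       # goes through the property on class A
B.s_i = 13       # goes through the property on meta_A, found via meta_B's MRO
```

Because the property on the metaclass is a data descriptor, assigning to B.s_i calls its setter instead of replacing the attribute on the class, so the shared storage is updated and every class and instance sees the new value.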

A parallel class hierarchy is also used in other highly dynamic, object-oriented environments, such as Smalltalk. In Smalltalk as well, class-level constructs, such as class methods and data members, are defined for the class in the metaclass. A metaclass hierarchy has further uses, such as lazy loading of nested classes and member templates (this would be coded up in the base class of all metaclasses: CPPMetaBase), and makes it possible to distribute these over different reflection libraries. With this in place, you can write Python code like so:

    >>>> from cppyy.gbl import A
    >>>> a = A()
    >>>> a.s_i = 42
    >>>> print A.s_i == a.s_i
    >>>> # etc.

The implementation of the getter for s_i is a lot easier than for instance data: the static data lives at a fixed, global, address, so no offset calculations are needed. The same is done for global data or global data living in namespaces: namespaces are represented as Python classes, and global data are implemented as properties on them. The need for a metaclass is one of the reasons why it is easier for namespaces to be classes: module objects are too restrictive. And even though namespaces are not modules, you still can, with some limitations, import from them anyway.

It is common that global objects themselves are pointers, and therefore it is allowed that the stored _cppthis is not a pointer to a C++ object, but rather a pointer to a pointer to a C++ object. A double pointer, as it were. This way, if the C++ code updates the global pointer, it will automatically reflect on the Python side in the proxy. Likewise, if on the Python side the pointer gets set to a different variable, it is the pointer that gets updated, and this will be visible on the C++ side. In general, however, the same caveat as for normal Python code applies: in order to set a global object, it needs to be set within the scope of that global object. As an example, consider the following code for a C++ namespace NS with global variable g_a, which behaves the same as pure Python code with respect to the visibility of changes to the global variable:

    >>>> from cppyy.gbl import NS, A
    >>>> from NS import g_a
    >>>> g_a = A(42)                     # does NOT update C++ side
    >>>> print NS.g_a.m_i
    13                                   # the old value happens to be 13
    >>>> NS.g_a = A(42)                  # does update C++ side
    >>>> print NS.g_a.m_i
    >>>> # etc.
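
For comparison, the same caveat can be reproduced without any C++ involved. This is a small illustration only: the SimpleNamespace object is a stand-in for the namespace NS and plain integers stand in for the A instances.

```python
import types

NS = types.SimpleNamespace(g_a=13)   # stand-in for a namespace holding a global

g_a = NS.g_a             # what "from NS import g_a" does: bind a local name
g_a = 42                 # rebinds the local name only
after_local = NS.g_a     # the namespace still holds the old value

NS.g_a = 42              # assignment within the scope of the namespace
after_scoped = NS.g_a    # now the namespace holds the new value
```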


That covers all there is to know about data member access of C++ classes in Python through a reflection layer! A few final notes: RPython does not support metaclasses, and so the construction of proxy classes (code like make_datamember above) happens in Python code instead. There is an overhead penalty of about 2x over pure RPython code associated with that, due to extra guards that get inserted by the JIT. A factor of 2 sounds like a lot, but the overhead is tiny to begin with, and 2x of tiny is still tiny and it's not easy to measure. The class definition of the custom property, CPPDataMember, is in RPython code, to be transparent to the JIT. The actual offset calculations are in the reflection layer. Having the proxy class creation in Python, with structural code in RPython, complicates matters if proxy classes need to be constructed on-demand. For example, if an instance of an as-of-yet unseen type is returned by a method. Explaining how that is solved is a topic of part 2, method calls, so stay tuned.

This posting laid out the reasoning behind the object representation of C++ objects in Python by cppyy for the purpose of data member access. It explained how the chosen representation of offsets gives rise to a very pythonic representation, which allows Python introspection tools to work as expected. It also explained some of the optimizations done for the benefit of the JIT. Next up are method calls, which will be described in part 2.

Thursday, August 9, 2012

Multicore Programming in PyPy and CPython

Hi all,

This is a short "position paper" kind of post about my view (Armin Rigo's) on the future of multicore programming in high-level languages. It is a summary of the keynote presentation at EuroPython. As I learned by talking with people afterwards, I am not a good enough speaker to manage to convey a deeper message in a 20-minute talk. I will try instead to convey it in a 250-line post...

This is about three points:

  1. We often hear about people wanting a version of Python running without the Global Interpreter Lock (GIL): a "GIL-less Python". But what we programmers really need is not just a GIL-less Python --- we need a higher-level way to write multithreaded programs than using threads and locks directly. One way is Automatic Mutual Exclusion (AME), which would give us an "AME Python".
  2. A good enough Software Transactional Memory (STM) system can be used as an internal tool to do that. This is what we are building into an "AME PyPy".
  3. The picture is darker for CPython, though there is a way too. The problem is that when we say STM, we think about either GCC 4.7's STM support, or Hardware Transactional Memory (HTM). However, both solutions are enough for a "GIL-less CPython", but not for "AME CPython", due to capacity limitations. For the latter, we need somehow to add some large-scale STM into the compiler.

Let me explain these points in more detail.

GIL-less versus AME

The first point is in favor of the so-called Automatic Mutual Exclusion approach. The issue with using threads (in any language with or without a GIL) is that threads are fundamentally non-deterministic. In other words, the programs' behaviors are not reproducible at all, and worse, we cannot even reason about them --- it quickly becomes messy. We would have to consider all possible combinations of code paths and timings, and we cannot hope to write tests that cover all combinations. This fact is often documented as one of the main blockers towards writing successful multithreaded applications.

We need to solve this issue with a higher-level solution. Such solutions exist theoretically, and Automatic Mutual Exclusion (AME) is one of them. The idea of AME is that we divide the execution of each thread into a number of "atomic blocks". Each block is well-delimited and typically large. Each block runs atomically, as if it acquired a GIL for its whole duration. The trick is that internally we use Transactional Memory, which is a technique that lets the system run the atomic blocks from each thread in parallel, while giving the programmer the illusion that the blocks have been run in some global serialized order.

This doesn't magically solve all possible issues, but it helps a lot: it is far easier to reason in terms of a random ordering of large atomic blocks than in terms of a random ordering of lines of code --- not to mention the mess that multithreaded C is, where even a random ordering of instructions is not a sufficient model any more.

What do such atomic blocks look like? For example, a program might contain a loop over all keys of a dictionary, performing some "mostly-independent" work on each value. This is a typical example: each atomic block is one iteration through the loop. By using the technique described here, we can run the iterations in parallel (e.g. using a thread pool) but using AME to ensure that they appear to run serially.

In Python, we don't care about the order in which the loop iterations are done, because we are anyway iterating over the keys of a dictionary. So we get exactly the same effect as before: the iterations still run in some random order, but --- and that's the important point --- they appear to run in a global serialized order. In other words, we introduced parallelism, but only under the hood: from the programmer's point of view, his program still appears to run completely serially. Parallelisation as a theoretically invisible optimization... more about the "theoretically" in the next paragraph.

Note that randomness of order is not fundamental: there are techniques building on top of AME that can be used to force the order of the atomic blocks, if needed.

PyPy and STM/AME

Talking more precisely about PyPy: the current prototype pypy-stm is doing precisely this. In pypy-stm, the length of the atomic blocks is selected in one of two ways: either explicitly or automatically.

The automatic selection gives blocks corresponding to some small number of bytecodes, in which case we have merely a GIL-less Python: multiple threads will appear to run serially, with the execution randomly switching from one thread to another at bytecode boundaries, just like in CPython.

The explicit selection is closer to what was described in the previous section: someone --- the programmer or the author of some library that the programmer uses --- will explicitly put with thread.atomic: in the source, which delimits an atomic block. For example, we can use it to build a library that can be used to iterate over the keys of a dictionary: instead of iterating over the dictionary directly, we would use some custom utility which gives the elements "in parallel". It would give them by using internally a pool of threads, but enclosing every handling of an element into such a with thread.atomic block.

This gives the nice illusion of a global serialized order, and thus gives us a well-behaving model of the program's behavior.
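
Such a utility could look roughly like the sketch below. It is not code from pypy-stm: parallel_for_each is a hypothetical name, and because thread.atomic only exists in pypy-stm, the sketch assumes a plain lock as a stand-in. The lock gives the same serialized semantics but, unlike a Transactional Memory implementation, no actual parallelism inside the blocks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for pypy-stm's thread.atomic (assumption): a global lock has the
# same "one block at a time" semantics, without the parallel execution.
atomic = threading.RLock()

def parallel_for_each(d, handle, workers=4):
    """Call handle(key, value) for every item of d, one atomic block per item."""
    def run_one(key):
        with atomic:                 # the whole handling of one element is atomic
            handle(key, d[key])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run_one, list(d)))   # drain the iterator to wait for all items

totals = {}

def double(key, value):
    totals[key] = value * 2

parallel_for_each({'a': 1, 'b': 2, 'c': 3}, double)
```

The caller never sees threads or locks: from the outside, it is a loop whose iterations happen in some unspecified serial order.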

Restating this differently, the only semantic difference between pypy-stm and a regular PyPy or CPython is that it has thread.atomic, which is a context manager that gives the illusion of forcing the GIL to not be released during the execution of the corresponding block of code. Apart from this addition, they are apparently identical.

Of course they are only semantically identical if we ignore performance: pypy-stm uses multiple threads and can potentially benefit from that on multicore machines. The drawback is: when does it benefit, and how much? The answer to this question is not immediate. The programmer will usually have to detect and locate places that cause too many "conflicts" in the Transactional Memory sense. A conflict occurs when two atomic blocks write to the same location, or when A reads it, B writes it, but B finishes first and commits. A conflict causes the execution of one atomic block to be aborted and restarted, due to another block committing. Although the process is transparent, if it occurs more than occasionally, then it has a negative impact on performance.
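
To make the notion of a conflict concrete, here is a toy, single-threaded model of optimistic transactions. It is only an illustration and makes no claim about how pypy-stm is implemented: the Memory, Transaction and Conflict names are invented, writes are buffered until commit, and a per-location version counter is assumed as the way to notice that another block committed first.

```python
class Conflict(Exception):
    pass

class Memory(object):
    def __init__(self):
        self.values = {}
        self.versions = {}     # location -> number of commits that wrote it

class Transaction(object):
    def __init__(self, memory):
        self.memory = memory
        self.seen = {}         # location -> version when first touched
        self.writes = {}       # location -> buffered new value

    def _touch(self, loc):
        self.seen.setdefault(loc, self.memory.versions.get(loc, 0))

    def read(self, loc):
        self._touch(loc)
        if loc in self.writes:
            return self.writes[loc]
        return self.memory.values.get(loc)

    def write(self, loc, value):
        self._touch(loc)
        self.writes[loc] = value

    def commit(self):
        for loc, version in self.seen.items():
            if self.memory.versions.get(loc, 0) != version:
                raise Conflict(loc)        # another block committed first
        for loc, value in self.writes.items():
            self.memory.values[loc] = value
            self.memory.versions[loc] = self.memory.versions.get(loc, 0) + 1

memory = Memory()
a, b = Transaction(memory), Transaction(memory)
a.read('x')            # A reads x ...
b.write('x', 1)        # ... B writes it ...
b.commit()             # ... and B finishes first and commits
try:
    a.commit()
    a_aborted = False
except Conflict:
    a_aborted = True   # A has to be aborted and restarted
```

In a real system the restart is transparent to the program; the cost only shows up as lost time, which is why frequent conflicts hurt performance.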

There is no out-of-the-box perfect solution for solving all conflicts. What we will need is more tools to detect them and deal with them, data structures that are made aware of the risks of "internal" conflicts when externally there shouldn't be one, and so on. There is some work ahead.

The point here is that from the point of view of the final programmer, we get conflicts that we should resolve --- but at any point, our program is correct, even if it may not be yet as efficient as it could be. This is the opposite of regular multithreading, where programs are efficient but not as correct as they could be. In other words, as we all know, we only have resources to do the easy 80% of the work and not the remaining hard 20%. So in this model we get a program that has 80% of the theoretical maximum of performance and it's fine. In the regular multithreading model we would instead only manage to remove 80% of the bugs, and we are left with obscure rare crashes.

CPython and HTM

Couldn't we do the same for CPython? The problem here is that pypy-stm is implemented as a transformation step during translation, which is not directly possible in CPython. Here are our options:

  • We could review and change the C code everywhere in CPython.
  • We use GCC 4.7, which supports some form of STM.
  • We wait until Intel's next generation of CPUs comes out ("Haswell") and use HTM.
  • We write our own C code transformation within a compiler (e.g. LLVM).

I will personally file the first solution in the "thanks but no thanks" category. If anything, it will give us another fork of CPython that will painfully struggle to keep not more than 3-4 versions behind, and then eventually die. It is very unlikely to be ever merged into the CPython trunk, because it would need changes everywhere. Not to mention that these changes would be very experimental: tomorrow we might figure out that different changes would have been better, and have to start from scratch again.

Let us turn instead to the next two solutions. Both of these solutions are geared toward small-scale transactions, but not long-running ones. For example, I have no clue how to give GCC rules about performing I/O in a transaction --- this seems not supported at all; and moreover looking at the STM library that is available so far to be linked with the compiled program, it assumes short transactions only. By contrast, when I say "long transaction" I mean transactions that can run for 0.1 seconds or more. To give you an idea, in 0.1 seconds a PyPy program allocates and frees on the order of ~50MB of memory.

Intel's Hardware Transactional Memory solution is both more flexible and comes with a stricter limit. In short, the transaction boundaries are given by a pair of special CPU instructions that make the CPU enter or leave "transactional" mode. If the transaction aborts, the CPU cancels any change, rolls back to the "enter" instruction and causes this instruction to return an error code instead of re-entering transactional mode (a bit like a fork()). The software then detects the error code. Typically, if transactions are rarely cancelled, it is fine to fall back to a GIL-like solution just to redo these cancelled transactions.

About the implementation: this is done by recording all the changes that a transaction wants to do to the main memory, and keeping them invisible to other CPUs. This is "easily" achieved by keeping them inside this CPU's local cache; rolling back is then just a matter of discarding a part of this cache without committing it to memory. From this point of view, there is a lot to bet that we are actually talking about the regular per-core Level 1 and Level 2 caches --- so any transaction that cannot fully store its read and written data in the 64+256KB of the L1+L2 caches will abort.

So what does it mean? A Python interpreter overflows the L1 cache of the CPU very quickly: just creating new Python function frames takes a lot of memory (on the order of magnitude of 1/100 of the whole L1 cache). Adding a 256KB L2 cache into the picture helps, particularly because it is highly associative and thus avoids a lot of fake conflicts. However, as long as the HTM support is limited to L1+L2 caches, it is not going to be enough to run an "AME Python" with any sort of medium-to-long transaction. It can run a "GIL-less Python", though: just running a few hundred or even thousand bytecodes at a time should fit in the L1+L2 caches, for most bytecodes.
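
A back-of-the-envelope comparison, using only the figures quoted in this post, shows the size of the gap. The allocation volume is used here merely as a rough proxy for how much memory a long transaction touches; the real read and write sets may be smaller.

```python
KB = 1024
MB = 1024 * KB

htm_capacity = (64 + 256) * KB      # L1 + L2, as guessed above
long_transaction = 50 * MB          # allocated and freed by PyPy in ~0.1 seconds

# How many times larger than the caches the long transaction's traffic is.
ratio = long_transaction / float(htm_capacity)
```

With these numbers the ratio comes out at 160, i.e. roughly two orders of magnitude more data than the L1+L2 caches can hold.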

I would vaguely guess that it will take on the order of 10 years until CPU cache sizes grow enough for a CPU in HTM mode to actually be able to run 0.1-second transactions. (Of course in 10 years' time a lot of other things may occur too, including the whole Transactional Memory model being displaced by something else.)

Write your own STM for C

Let's now discuss the last option: if neither GCC 4.7 nor HTM is sufficient for an "AME CPython", then we might want to write our own C compiler patch (as either extra work on GCC 4.7, or an extra pass to LLVM, for example).

We would have to deal with the fact that we get low-level information, and somehow need to preserve interesting high-level bits through the compiler up to the point at which our pass runs: for example, whether the field we read is immutable or not. (This is important because some common objects are immutable, e.g. PyIntObject. Immutable reads don't need to be recorded, whereas reads of mutable data must be protected against other threads modifying them.) We can also have custom code to handle the reference counters: e.g. not consider it a conflict if multiple transactions have changed the same reference counter, but just resolve it automatically at commit time. We are also free to handle I/O in the way we want.
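
The reference counter idea can be illustrated with a toy model. This is only a sketch of one possible approach, not a design from any existing compiler: the RefcountTransaction class is invented, and it assumes that each transaction records relative changes (deltas) to reference counts and that these are simply summed into the shared counts at commit time.

```python
class RefcountTransaction(object):
    def __init__(self, refcounts):
        self.refcounts = refcounts     # shared dict: object name -> committed count
        self.deltas = {}               # private to this transaction

    def incref(self, name):
        self.deltas[name] = self.deltas.get(name, 0) + 1

    def decref(self, name):
        self.deltas[name] = self.deltas.get(name, 0) - 1

    def commit(self):
        # Deltas commute, so two transactions touching the same counter
        # never need to be treated as conflicting.
        for name, delta in self.deltas.items():
            self.refcounts[name] = self.refcounts.get(name, 0) + delta
        self.deltas = {}

refcounts = {'obj': 1}
t1 = RefcountTransaction(refcounts)
t2 = RefcountTransaction(refcounts)
t1.incref('obj')
t2.incref('obj')
t2.incref('obj')
t2.decref('obj')
t2.commit()
t1.commit()        # no abort, even though both changed the same counter
```

The result is the same whichever transaction commits first, which is what makes automatic resolution at commit time possible for this particular kind of write.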

More generally, the advantage of this approach over both the current GCC 4.7 and over HTM is that we control the whole process. While this still looks like a lot of work, it looks doable. It would be possible to come up with a minimal patch of CPython that can be accepted into core without too much trouble (e.g. to mark immutable fields and tweak the refcounting macros), and keep all the cleverness inside the compiler extension.


I would assume that a programming model specific to PyPy and not applicable to CPython has little chance of catching on, as long as PyPy is not the main Python interpreter (which looks unlikely to change anytime soon). Thus as long as only PyPy has AME, it looks like it will not become the main model of multicore usage in Python. However, I can conclude with a more positive note than during the EuroPython conference: it is a lot of work, but there is a more-or-less reasonable way forward to have an AME version of CPython too.

In the meantime, pypy-stm is around the corner, and together with tools developed on top of it, it might become really useful and used. I hope that in the next few years this work will trigger enough motivation for CPython to follow the ideas.

Tuesday, August 7, 2012

NumPyPy non-progress report

Hello everyone.

Not much has happened in the past few months with numpypy development. Part of the reason was that I was doing other stuff, part was various unexpected visa-related admin, part was EuroPython and part was a long-awaited holiday.

The thing that's maybe worth mentioning is that it does not mean the donations disappeared in the mist. PyPy developers are being paid to work on NumPyPy on an hourly basis - that means if I decide to take holidays or work on something else, the money is simply staying in the account until later.

Thanks again for all the donations, I hope to get back to this topic soon!


Thursday, July 26, 2012

CFFI release 0.2.1

Hi everybody,

We released CFFI 0.2.1 (expected to be 1.0 soon). CFFI is a way to call C from Python.

EDIT: Win32 was broken in 0.2. Fixed.

This release is only for CPython 2.6 or 2.7. PyPy support is coming in
the ffi-backend branch, but not finished yet. CPython 3.x would be
easy but requires the help of someone.

The package is available on bitbucket as well as documented. You
can also install it straight from the python package index: pip install cffi

  • Contains numerous small changes and support for more C-isms.
  • The biggest news is the support for installing packages that use
    ffi.verify() on machines without a C compiler. Arguably, this
    lifts the last serious restriction for people to use CFFI.
  • Partial list of smaller changes:
    • mappings between 'wchar_t' and Python unicodes
    • the introduction of ffi.NULL
    • a possibly clearer API for ffi.new(): e.g. to allocate a single int and obtain a pointer to it, use ffi.new("int *") instead of the old ffi.new("int")
    • and of course a plethora of smaller bug fixes
  • CFFI uses pkg-config to install itself if available. This helps
    locate libffi on modern Linuxes. Mac OS/X support is available too
    (see the detailed installation instructions). Win32 should work out
    of the box. Win64 has not been really tested yet.

Armin Rigo and Maciej Fijałkowski

Friday, July 13, 2012

Prototype PHP interpreter using the PyPy toolchain - Hippy VM

Hello everyone.

I'm proud to release the result of a Facebook-sponsored study on the feasibility of using the RPython toolchain to produce a PHP interpreter. The rules were simple: two months; one person; get as close to PHP as possible, implementing enough warts and corner cases to be reasonably sure that it answers hard problems in the PHP language. The outcome is called Hippy VM and implements most of the PHP 1.0 language (functions, arrays, ints, floats and strings). This should be considered an alpha release.

The resulting interpreter is obviously incomplete – it does not support all modern PHP constructs (classes are completely unimplemented), builtin functions, grammar productions, web server integration, builtin libraries, etc. It's just complete enough for me to reasonably be able to say that – given some engineering effort – it's possible to provide a rock-solid and fast PHP VM using PyPy technologies.

The result is available in a Bitbucket repo and is released under the MIT license.


The table below shows a few benchmarks comparing Hippy VM to Zend (a standard PHP interpreter available in Linux distributions) and HipHop VM (a PHP-to-C++ optimizing compiler developed by Facebook). The versions used were Zend 5.3.2 (Zend Engine v2.3.0) and HipHop VM heads/vm-0-ga4fbb08028493df0f5e44f2bf7c042e859e245ab (note that you need to check out the vm branch to get the newest version).

The run was performed on 64-bit Linux running on a Xeon W3580 with 8M of L3 cache, which was otherwise unoccupied.

Unfortunately, I was not able to run it on the JITted version of HHVM, the new effort by Facebook, but people involved with the project told me it's usually slower or comparable with the compiled HipHop. Their JITted VM is still alpha software, so I'll update it as soon as I have the info.

benchmark       Zend      HipHop VM    Hippy VM     Hippy / Zend   Hippy / HipHop
arr             2.771     0.508+-0%    0.274+-0%    10.1x          1.8x
fannkuch        21.239    7.248+-0%    1.377+-0%    15.4x          5.3x
heapsort        1.739     0.507+-0%    0.192+-0%    9.1x           2.6x
binary_trees    3.223     0.641+-0%    0.460+-0%    7.0x           1.4x
cache_get_scb   3.350     0.614+-0%    0.267+-2%    12.6x          2.3x
fib             2.357     0.497+-0%    0.021+-0%    111.6x         23.5x
fasta           1.499     0.233+-4%    0.177+-0%    8.5x           1.3x

The PyPy compiler toolchain provides a way to implement a dynamic language interpreter in a high-level language called RPython. This is a language which is lower-level than Python, but still higher-level than C or C++: for example, RPython is a garbage-collected language. The killer feature is that the toolchain will generate a JIT for your interpreter which will be able to leverage most of the work that has been done on speeding up Python in the PyPy project. The resulting JIT is generated for your interpreter, and is not Python-specific. This was one of the toolchain's original design decisions – in contrast to e.g. the JVM, which was initially only used to interpret Java and later adjusted to serve as a platform for dynamic languages.

Another important difference is that there is no common bytecode to which you compile both your language and Python, so you don't inherit problems presented when implementing language X on top of, say, Parrot VM or the JVM. The PyPy toolchain does not impose constraints on the semantics of your language, whereas the benefits of the JVM only apply to languages that map well onto Java concepts.

To read more about creating your own interpreters using the PyPy toolchain, read more blog posts or an excellent article by Laurence Tratt.

PHP deviations

The project's biggest deviation from the PHP specification is probably that GC is no longer reference counting. That means that the object finalizer, when implemented, will not be called directly at the moment of object death, but at some later point. There are possible future developments to alleviate that problem, by providing "refcounted" objects when leaving the current scope. Research has to be done in order to achieve that.


The RPython toolchain seems to be a cost-effective choice for writing dynamic language VMs. It both provides a fast JIT and gives you access to low-level primitives when you need them. A good example is in the directory hippy/rpython which contains the implementation of an ordered dictionary. An ordered dictionary is not a primitive that RPython provides – it's not necessary for the goal of implementing Python. Now, implementing it on top of a normal dictionary is possible, but inefficient. RPython provides a way to work directly at a lower level, if you desire to do so.

Things that require improvements in RPython:

  • Lack of mutable strings on the RPython level ended up being a problem. I ended up using lists of characters, which are efficient, but inconvenient, since they don't support any string methods.
  • Frame handling is too conservative and too Python-specific, especially around the calls. It's possible to write a less general, but simpler and faster, frame handling implementation in RPython.

Status of the implementation

Don't use it! It's a research prototype intended to assess the feasibility of using RPython to create dynamic language VMs. The most notable feature that's missing is reasonable error reporting. That said, I'm confident it implements enough of the PHP language to prove that the full implementation will present the same performance characteristics.


The benchmarks are a selection of computer language shootout benchmarks, as well as cache_get_scb, which is a part of old Facebook code. All benchmarks other than this one (which is not open source, but definitely the most interesting :( ) are available in the bench directory. The Python program to run them is called and is in the same directory. It runs them 10 times, cutting off the first 3 runs (to ignore the JIT warm-up time) and averaging the rest. As you can see the standard deviation is fairly minimal for all interpreters and runs; if it's omitted it means it's below 0.5%.
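
The measurement procedure described above can be sketched as follows. This is not the actual runner script from the repository: the function names are invented, and the use of the population standard deviation relative to the mean is an assumption about how the "+-" percentages were computed.

```python
import statistics
import time

def summarize(timings, warmup=3):
    """Drop the warm-up runs, then return (mean, relative standard deviation)."""
    kept = timings[warmup:]
    mean = statistics.mean(kept)
    spread = statistics.pstdev(kept) / mean if mean else 0.0
    return mean, spread

def benchmark(run_once, runs=10, warmup=3):
    """Time run_once() `runs` times and summarize all but the first `warmup`."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return summarize(timings, warmup)

# Three slow warm-up runs followed by seven steady ones.
mean, spread = summarize([9.0, 9.0, 9.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
```

Cutting off the first runs matters for a JIT-based interpreter in particular, since the early iterations include tracing and compilation time rather than steady-state speed.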

The benchmarks were not selected for their ease of optimization – the optimizations in the interpreter were written specifically for this set of benchmarks. No special JIT optimizations were added, and barring what's mentioned below a vanilla PyPy 1.9 checkout was used for compilation.

So, how fast will my website run if this is completed?

The truth is that I lack the benchmarks to be able to answer that right now. The core of the PHP language is implemented up to the point where I'm confident that the performance will not change as we get more of the PHP going.

How do I run it?

Get a PyPy checkout, apply the diff if you want to squeeze out the last bits of performance and run pypy-checkout/pypy/bin/rpython to get an executable that resembles a PHP interpreter. You can also directly run python file.php, but this will be about 2000x slower.

RPython modifications

There was a modification that I did to the PyPy source code; the diff is available. It's trivial, and should simply be made optional in the RPython JIT generator, but it was easier just to do it, given the very constrained time frame.

  • gen_store_back_in_virtualizable was disabled. This feature is necessary for Python frames but not for PHP frames. PHP frames do not have to be kept alive after we exit a function.


Hippy is a cool prototype that presents a very interesting path towards a fast PHP VM. However, at the moment I have too many other open source commitments to take on the task of completing it in my spare time. I do think that this project has a lot of potential, but I will not commit to any further development at this time. If you send pull requests I'll try to review them. I'm also open to having further development on this project funded, so if you're interested in this project and the potential of a fast PHP interpreter, please get in touch.


EDIT: Fixed the path to the rpython binary

Tuesday, July 10, 2012

Py3k status update #5

This is the fifth status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.

Apart from the usual "fix shallow py3k-related bugs" part, most of my work in
this iteration has been to fix the bootstrap logic of the interpreter, in
particular to set up the initial sys.path.

Until a few weeks ago, the logic to determine sys.path was written entirely
at app-level in pypy/translator/goal/, which is automatically
included inside the executable during translation. The algorithm is more or
less like this:

  1. find the absolute path of the executable by looking at sys.argv[0]
    and cycling through all the directories in PATH
  2. starting from there, go up in the directory hierarchy until we find a
    directory which contains lib-python and lib_pypy
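
The two steps above can be sketched as follows. This is a simplified illustration, not the code that ships with PyPy: the function names find_executable and find_prefix are invented for the example, and details such as symlink resolution are left out.

```python
import os

def find_executable(argv0, path_env):
    """Step 1: turn sys.argv[0] into an absolute path, searching PATH if needed."""
    if os.sep in argv0:
        return os.path.abspath(argv0)
    for directory in path_env.split(os.pathsep):
        candidate = os.path.join(directory, argv0)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None

def find_prefix(executable):
    """Step 2: walk up until a directory contains both lib-python and lib_pypy."""
    directory = os.path.dirname(executable)
    while True:
        if (os.path.isdir(os.path.join(directory, 'lib-python')) and
                os.path.isdir(os.path.join(directory, 'lib_pypy'))):
            return directory
        parent = os.path.dirname(directory)
        if parent == directory:      # reached the filesystem root
            return None
        directory = parent
```

A typical call would be find_prefix(find_executable(sys.argv[0], os.environ['PATH'])). Note that everything here manipulates paths as strings, which is exactly where the choice between 8-bit strings and unicode comes into play.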

This works fine for Python 2 where the paths and filenames are represented as
8-bit strings, but it is a problem for Python 3 where we want to use unicode
instead. In particular, whenever we try to decode an 8-bit string into a
unicode string, PyPy asks the _codecs built-in module to find the suitable
codec. Then, _codecs tries to import the encodings package, to list
all the available encodings. encodings is a package of the standard
library written in pure Python, so it is located inside
lib-python/3.2. But at this point in time we have yet to add
lib-python/3.2 to sys.path, so the import fails. Bootstrap problem!
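
The same dependency can be observed on a regular CPython 3 interpreter. The snippet below is a small illustration, not PyPy code; the choice of the cp1250 codec is arbitrary, picked only because it is unlikely to have been loaded already.

```python
import codecs
import sys

# Looking up a codec goes through the search function registered by the
# pure-Python "encodings" package of the standard library.
info = codecs.lookup("cp1250")

# The codec was found by importing a submodule of that package, which is
# only possible once the standard library is reachable through sys.path.
codec_module_loaded = "encodings.cp1250" in sys.modules
print(info.name, codec_module_loaded)
```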

The hard part was to find the problem: since it is an error which happens so
early, the interpreter is not even able to display a traceback, because it
cannot yet import The only way to debug it was through some
carefully placed print statements and the help of gdb. Once the problem was
found, the solution was as easy as moving part of the logic to RPython,
where we don't have bootstrap problems.

Once the problem was fixed, I was finally able to run all the CPython tests
against the compiled PyPy. As expected there are lots of failures, and fixing
them will be the focus of my next few months.

Thursday, June 28, 2012

EuroPython sprint

Hi all,

EuroPython is next week. We will actually be giving a presentation on Monday, in one of the plenary talks: PyPy: current status and GIL-less future. This is the first international PyPy keynote we have given, as far as I know, but not the first keynote about PyPy [David Beazley's video] :-)

The other talks are PyPy JIT under the hood and to some extent Performance analysis tools for JITted VMs. This year we are also trying out a help desk. Finally, we will have the usual sprint after EuroPython on Saturday and Sunday.

See you soon!


Monday, June 25, 2012

Architecture of Cppyy

The cppyy module makes it possible to call into C++ from PyPy through the Reflex package. Work started about two years ago, with a follow-up sprint a year later. The module has now reached an acceptable level of maturity and initial documentation with setup instructions, as well as a list of the currently supported language features, are now available here. There is a sizable (non-PyPy) set of unit and application tests that is still being worked through, not all of them of general applicability, so development continues its current somewhat random walk towards full language coverage. However, if you find that cppyy by and large works for you except for certain specific features, feel free to ask for them to be given higher priority.

Cppyy handles bindings differently than what is typically found in other tools with a similar objective, so this update walks through some of these differences, and explains why choices were made as they are.

The most visible difference is from the viewpoint of the Python programmer interacting with the module. The two canonical ways of making Python part of a larger environment are to either embed or extend it. The latter is done with so-called extension modules, which are explicitly constructed to be very similar in their presentation to the Python programmer as normal Python modules. In cppyy, however, the external C++ world is presented from a single entrance point, the global C++ namespace (in the form of the variable cppyy.gbl). Thus, instead of importing a package that contains your C++ classes, usage looks like this (assuming class MyClass in the global namespace):

>>>> import cppyy
>>>> m = cppyy.gbl.MyClass()
>>>> # etc.

This is more natural than it appears at first: C++ classes and functions are, once compiled, represented by unique linker symbols, so it makes sense to give them their own unique place on the Python side as well. This organization allows pythonizations of C++ classes to propagate from one code to another, ensures that all normal Python introspection (such as issubclass and isinstance) works as expected in all cases, and that it is possible to represent C++ constructs such as typedefs simply by Python references. Achieving this unified presentation would clearly require a lot of internal administration to track all C++ entities if they each lived in their own, pre-built extension modules. So instead, cppyy generates the C++ bindings at run-time, which brings us to the next difference.

Then again, that is not really a difference: when writing or generating a Python extension module, the result is some C code that consists of calls into Python, which then gets compiled. However, it is not the bindings themselves that are compiled; it is the code that creates the bindings that gets compiled. In other words, any generated or hand-written extension module does exactly what cppyy does, except that they are much more specific in that the bound code is hard-wired with e.g. fixed strings and external function calls. The upshot is that in Python, where all objects are first-class and run-time constructs, there is no difference whatsoever between bindings generated at run-time, and bindings generated at ... well, run-time really. There is a difference in organization, though, which goes back to the first point of structuring the C++ class proxies in Python: given that a class will settle in a unique place once bound, instead of inside a module that has no meaning in the C++ world, it follows that it can also be uniquely located in the first place. In other words, cppyy can, and does, make use of a class loader to auto-load classes on-demand.
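
To make the idea of a single entry point with on-demand loading concrete, here is a minimal pure-Python sketch. It does not use cppyy or show how cppyy is implemented; LazyNamespace and demo_loader are hypothetical names, and the loader merely fabricates empty Python classes where a real one would consult a back-end.

```python
class LazyNamespace(object):
    """Hypothetical stand-in for a single entry point such as cppyy.gbl."""

    def __init__(self, loader):
        self._loader = loader  # callable mapping a name to a class proxy

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. on first access.
        if name.startswith("_"):
            raise AttributeError(name)
        proxy = self._loader(name)
        # Cache the proxy so the class settles in one unique place.
        setattr(self, name, proxy)
        return proxy


loaded = []


def demo_loader(name):
    """Pretend class loader: records the request and builds an empty class."""
    loaded.append(name)
    return type(name, (object,), {})


gbl = LazyNamespace(demo_loader)
first = gbl.MyClass          # triggers the loader
second = gbl.MyClass         # served from the cache
gbl.MyAlias = gbl.MyClass    # a typedef is just another Python reference
```

Because each class is created once and cached, identity checks and isinstance behave the same no matter which piece of code first asked for the class.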

If, at this point, this all reminds you a bit of ctypes, just with some extra bells and whistles, you would be quite right. In fact, internally cppyy makes heavy use of the RPython modules that form the guts of ctypes. The difficult part of ctypes, however, is the requirement to annotate functions and structures. That is not very pleasant in C, but in C++ there is a whole other level of complexity in that the C++ standard specifies many low-level details that are required for dispatching calls and understanding object layout as "implementation defined." Of course, in the case of Open Source compilers, getting at those details is doable, but having to reverse engineer closed-source compilers gets old rather quickly in more ways than one. More generally, these implementation-defined details prevent a clean interface, i.e. without a further dependency on the compiler, into C++ like the one that the CFFI module provides for C. Still, once internal pointers have been followed, offsets have been calculated, this objects have been provided, etc., etc., the final dispatch into binary C++ is no different than that into C, and cppyy will therefore be able to make use of CFFI internally, like it does with ctypes today. This is especially relevant in the CLang/LLVM world, where stub functions are done away with. To get the required low-level details, then, cppyy relies on a back-end rather than getting them from the programmer, and this is where Reflex (together with the relevant C++ compiler) comes in, largely automating this tedious process.
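
For readers who have not used ctypes, the annotation burden mentioned above looks roughly like this for plain C. The sketch assumes a POSIX system, where ctypes.CDLL(None) exposes the symbols already loaded into the process (including the C library); the Point structure is made up for illustration.

```python
import ctypes

# Assumption: POSIX. CDLL(None) gives access to the symbols of the running
# process, which include the C library.
libc = ctypes.CDLL(None)

# The programmer has to spell out the signature by hand.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"hello")


# Structures need the same treatment: every field and its type is declared.
class Point(ctypes.Structure):  # hypothetical struct { int x; int y; }
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_int)]


p = Point(3, 4)
```

For C++ the equivalent information (vtable layout, base class offsets, name mangling) is compiler specific, which is why cppyy obtains it from a back-end instead.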

There is nothing special about Reflex per se, other than that it is relatively lightweight, available, and has proven to be able to handle huge code bases. It was a known quantity when work on cppyy started, and given the number of moving parts in learning PyPy, that was a welcome relief. Reflex is based on gccxml, and can therefore handle pretty much any C or C++ code that you care to throw at it. It is also technically speaking obsolete as it will not support C++11, since gccxml won't, but its expected replacement, based on CLang/LLVM, is not quite there yet (we are looking at Q3 of this year). In cppyy, access to Reflex, or any back-end for that matter, is through a thin C API (see the schematic below): cppyy asks the back-end high-level questions, and receives low-level results, some of which are in the form of opaque handles. This ensures that cppyy is not tied to any specific back-end. In fact, currently it already supports another, CINT, but that back-end is of little interest outside of High Energy Physics (HEP). The Python side is always the same, however, so any Python code based on cppyy does not have to change if the back-end changes. To use the system, a back-end specific tool (genreflex for Reflex) is first run on a set of header files with a selection file for choosing the required classes. This produces a C++ file that must be compiled into a shared library, and a corresponding map file for the class loader. These shared libraries, with their map files alongside, can be put anywhere as long as they can be located through the standard paths for the dynamic loader. With that in place, the setup is ready, and the C++ classes are available to be used from cppyy.

So far, nothing that has been described is specific to PyPy. In fact, most of the technologies described have been used for a long time on CPython already, so why the need for a new, PyPy-specific, module? To get to that, it is important to first understand how a call is mediated between Python and C++. In Python, there is the concept of a PyObject, which has a reference count, a pointer to a type object, and some payload. There are APIs to extract the low-level information from the payload for use in the C++ call, and to repackage any results from the call. This marshalling is where the bulk of the time is spent when dispatching. To be absolutely precise, most C++ extension module generators produce slow dispatches because they don't handle overloads efficiently, but even then, they still spend most of their time in the marshalling code, albeit in calls that fail before trying the next overload. In PyPy, speed is gained by having the JIT unbox objects into the payload only, allowing it to become part of compiled traces. If the same marshalling APIs were used, the JIT would be forced to rebox the payload and hand it over through the API, only to have it unboxed again by the binding. Doing so is dreadfully inefficient. The objective of cppyy, then, is to keep all code transparent to the JIT until the absolute last possible moment, i.e. the call into C++ itself, therefore allowing it to (more or less) directly pass the payload it already has, with an absolute minimal amount of extra work. In the extreme case when the binding is not to a call, but to a data member of an object (or to a global variable), the memory address is delivered to the JIT and this results in direct access with no overhead. Note the interplay: cppyy in PyPy does not work like a binding in the CPython sense, which is a back-and-forth between the interpreter and the extension. Instead, it does its work by being transparent to the JIT, allowing the JIT to dissolve the binding.
And with that, we have come full circle: if working well with the JIT, and thereby achieving the best performance, means that you cannot have marshalling or any other API-based driving, then the concept of compiled extension modules is out, and the better solution is run-time generated bindings.

That leaves one final point. What if you do want to present an extension module-like interface to programmers that use your code? But of course, this is Python: everything consists of first-class objects, whose behavior can be changed on the fly. In CPython, you might hesitate to make such changes, as every overlay or indirection results in quite a bit of overhead. With PyPy, however, these layers are all optimized out of existence, making that a non-issue.
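
One possible way to build such a module-like facade at run-time is sketched below. It is a sketch under stated assumptions rather than a recommended recipe: the class _FakeGlobalNamespace stands in for cppyy.gbl so that the example runs without cppyy, and the module name mylib is invented.

```python
import sys
import types


class _FakeGlobalNamespace(object):
    """Hypothetical stand-in for cppyy.gbl, so the sketch runs without cppyy."""

    class MyClass(object):
        def value(self):
            return 42


gbl = _FakeGlobalNamespace()

# Create a module object on the fly and fill it with the bound classes.
mylib = types.ModuleType("mylib")
mylib.MyClass = gbl.MyClass

# Registering it makes a regular import statement find it.
sys.modules["mylib"] = mylib

# User code can now look exactly as if mylib were an extension module.
from mylib import MyClass

obj = MyClass()
```

The imported name is the very same class object as the one in the global namespace, so the extra layer is only a matter of presentation.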

This posting laid out the reasoning behind the organization of cppyy. A follow-up is planned, to explain how C++ objects are handled and represented internally.

Wim Lavrijsen

Monday, June 18, 2012

Release 0.1 of CFFI


We're pleased to announce the first public release, 0.1, of CFFI, a way to call C from Python.
(This release does not support PyPy yet --- but we announce it here as it is planned for the
next release :-)

The package is available on bitbucket as well as documented. You can also install it
straight from the Python Package Index (via pip).

The aim of this project is to provide a convenient and reliable way of calling C code from Python.
The interface is based on LuaJIT's FFI and follows a few principles:

  • The goal is to call C code from Python. You should be able to do so
    without learning a 3rd language: every alternative requires you to learn
    its own language (Cython, SWIG) or API (ctypes). So we tried to
    assume that you know Python and C and minimize the extra bits of API that
    you need to learn.
  • Keep all the Python-related logic in Python so that you don't need to
    write much C code (unlike CPython native C extensions).
  • Work either at the level of the ABI (Application Binary Interface)
    or the API (Application Programming Interface). Usually, C
    libraries have a specified C API but often not an ABI (e.g. they may
    document a "struct" as having at least these fields, but maybe more).
    (ctypes works at the ABI level, whereas Cython or native C extensions
    work at the API level.)
  • We try to be complete. For now some C99 constructs are not supported,
    but all C89 should be, including macros (and including macro "abuses",
    which you can manually wrap in saner-looking C functions).
  • We attempt to support both PyPy and CPython (although PyPy support is not
    complete yet) with a reasonable path for other Python implementations like
    IronPython and Jython.
  • Note that this project is not about embedding executable C code in
    Python, unlike Weave. This is about calling existing C libraries
    from Python.

Status of the project

Consider this as a beta release. Creating CPython extensions is fully supported and the API should
be relatively stable; however, minor adjustments of the API are possible.

PyPy support is not yet done and this is a goal for the next release. There are vague plans to make this the
preferred way to call C from Python that can reliably work between PyPy and CPython.

Right now CFFI's verify() requires a C compiler and header files to be available at run-time.
This limitation will be lifted in the near future and it'll contain a way to cache the resulting binary.


Armin Rigo and Maciej Fijałkowski