656x Faster JSON Parsing in Python with ijson

I profiled my team’s Python code and identified a performance bottleneck in JSON parsing. At two points, the code used the ijson package in a naive way that slowed down terribly for larger JSON files. It’s possible to achieve much faster JSON parsing without changing any code.

Data Scientist Workbench

My team builds Data Scientist Workbench, which is a free set of tools for doing data science.  It includes Jupyter and Zeppelin interactive notebooks as well as R Studio IDE all pre-configured to work with the Spark parallel data processing framework.

Behind the scenes, Data Scientist Workbench is composed of microservices. Some of them are built in Ruby, some in Node, and some in Python.

Faster JSON Parsing

Faster JSON parsing is possible in PythonJSON is a convenient format for serializing data. It originates from a subset of JavaScript Object Notation. Most languages have several libraries for reading and writing JSON.

ijson is a great library for working with JSON files in Python. Unfortunately, by default it uses a pure Python JSON parser as its backend. Much higher performance can be achieved by using a C backend.

These are the available backends:

  • yajl2_cffi: wrapper around YAJL 2.x using CFFI, this is the fastest.
  • yajl2: wrapper around YAJL 2.x using ctypes, for when you can’t use CFFI for some reason.
  • yajl: deprecated YAJL 1.x + ctypes wrapper, for even older systems.
  • python: pure Python parser, good to use with PyPy

Assuming you have yajl2 installed, switching from the slow, pure Python parser to a faster JSON parser written in C is a matter of changing this line:

To this line:

All other code is the same.

Installation of yajl2

Before you can use yajl2 as a faster JSON parsing backend for ijson, you have to install it.

On Ubuntu, you can install it as follows:

On the Mac, you can install it as follows:

Performance Micro-benchmark

Other people have benchmarked ijson before me.

I did see a huge performance improvement with a specific 4MB JSON file (a Jupyter notebook), so it makes sense to measure that specifically.

Here’s the very simple code that I will use to measure the performance of parsing JSON with ijson:

The result of the first run:

After changing to yajl2 as the parser:

That’s 656x or 65600% faster!

I should mention that the JSON I’m parsing contains 4MB of escaped JSON represented as a string within actual JSON, so it may be an unusually bad case for the pure Python parser.