656x Faster JSON Parsing in Python with ijson

I profiled my team’s Python code and identified a performance bottleneck in JSON parsing. At two points, the code used the ijson package with its default backend, which slowed down terribly for larger JSON files. It’s possible to achieve much faster JSON parsing by changing a single import line.

Data Scientist Workbench

My team builds Data Scientist Workbench, which is a free set of tools for doing data science. It includes Jupyter and Zeppelin interactive notebooks as well as the RStudio IDE, all pre-configured to work with the Spark parallel data processing framework.

Behind the scenes, Data Scientist Workbench is composed of microservices. Some of them are built in Ruby, some in Node, and some in Python.

Faster JSON Parsing

JSON is a convenient format for serializing data. The name stands for JavaScript Object Notation, and the format originated as a subset of JavaScript. Most languages have several libraries for reading and writing JSON.

ijson is a great library for working with JSON files in Python. Unfortunately, by default it uses a pure Python JSON parser as its backend. Much higher performance can be achieved by using a C backend.

These are the available backends:

yajl2_cffi: wrapper around YAJL 2.x using CFFI, this is the fastest.
yajl2: wrapper around YAJL 2.x using ctypes, for when you can’t use CFFI for some reason.
yajl: deprecated YAJL 1.x + ctypes wrapper, for even older systems.
python: pure Python parser, good to use with PyPy

Assuming you have yajl2 installed, switching from the slow, pure Python parser to a faster JSON parser written in C is a matter of changing this line:

import ijson

To this line:

import ijson.backends.yajl2 as ijson

All other code is the same.
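If the code has to run on machines where yajl may be missing, the import can fall back to a slower backend instead of failing. The following is a minimal sketch, not part of ijson itself: `first_importable` and `PREFERRED_BACKENDS` are hypothetical names, and the sketch assumes that a backend whose C library is unavailable raises `ImportError` when imported.

```python
import importlib


def first_importable(module_names):
    """Return the first module in module_names that imports cleanly."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # module or its C library is missing; try the next one
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(module_names))


# Fastest first, pure Python last, following the backend list above.
PREFERRED_BACKENDS = [
    "ijson.backends.yajl2_cffi",
    "ijson.backends.yajl2",
    "ijson.backends.python",
]
```

With ijson installed, `ijson = first_importable(PREFERRED_BACKENDS)` would then stand in for the import line, and the rest of the code stays the same.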

Installation of yajl2

Before you can use yajl2 as a faster JSON parsing backend for ijson, you have to install it.

On Ubuntu, you can install it as follows:

apt-get -qq update
apt-get -y install libyajl2 libyajl-dev
pip install yajl-py==2.0.2

On the Mac, you can install it as follows:

brew install yajl
pip install yajl-py==2.0.2
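After installing, it can be worth checking that Python is able to locate the shared library, since the ctypes-based backends load it at import time. This is a small sketch using only the standard library; it assumes the library is registered under the name `yajl`, which may differ on some systems.

```python
import ctypes.util

# find_library returns a library name or path, or None when nothing is found.
yajl_location = ctypes.util.find_library("yajl")

if yajl_location is None:
    print("libyajl not found; the ctypes-based backend will likely fail to import")
else:
    print("libyajl found: %s" % yajl_location)
```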

Performance Micro-benchmark

Other people have benchmarked ijson before me.

I did see a huge performance improvement with a specific 4MB JSON file (a Jupyter notebook), so it makes sense to measure that specifically.

Here’s the very simple code that I will use to measure the performance of parsing JSON with ijson:

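import ijson

# Parse the same file 10 times
for i in range(10):
    print("Starting parse #%i" % i)
    # Binary mode; the with-block closes the file after each pass
    with open('4MB.ipynb', 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            pass  # discard the events; only the parse time matters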

The result of the first run:

$ time python test.py
Starting parse #0
Starting parse #1
Starting parse #2
Starting parse #3
Starting parse #4
Starting parse #5
Starting parse #6
Starting parse #7
Starting parse #8
Starting parse #9

real    20m52.592s
user    14m37.860s
sys    6m6.768s

After changing to yajl2 as the parser:

$ time python test.py
Starting parse #0
Starting parse #1
Starting parse #2
Starting parse #3
Starting parse #4
Starting parse #5
Starting parse #6
Starting parse #7
Starting parse #8
Starting parse #9

real    0m1.910s
user    0m1.784s
sys    0m0.085s

That’s about 656x faster: 1252.6 seconds of wall-clock time versus 1.91 seconds!
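The ratio comes from the two `real` (wall-clock) times reported above. As a quick check of the arithmetic, here is a short sketch; `time_to_seconds` is a hypothetical helper written for this example.

```python
import re


def time_to_seconds(text):
    """Convert a `time`-style duration such as '20m52.592s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError("unrecognized duration: %r" % text)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


pure_python = time_to_seconds("20m52.592s")  # real time, default backend
yajl2 = time_to_seconds("0m1.910s")          # real time, yajl2 backend
speedup = pure_python / yajl2

print("%.1f s vs %.2f s: about %.0fx faster" % (pure_python, yajl2, speedup))
```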

I should mention that the JSON I’m parsing contains 4MB of escaped JSON represented as a string within actual JSON, so it may be an unusually bad case for the pure Python parser.
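To make "escaped JSON represented as a string within actual JSON" concrete, here is a tiny illustration using the standard `json` module. The `inner` document and the `payload` key are made up for the example and are not taken from the actual notebook file.

```python
import json

# Hypothetical inner document standing in for the embedded payload
inner = {"cells": [{"cell_type": "code", "source": 'print("hi")'}]}

# Serializing the inner document to a string and storing it as a value
# forces every quote and backslash inside it to be escaped.
outer = {"payload": json.dumps(inner)}
encoded = json.dumps(outer)

print(encoded)

# Decoding takes two passes: one for the outer JSON, one for the string.
decoded = json.loads(json.loads(encoded)["payload"])
```

A parser reading `encoded` sees one very long string value full of escape sequences, which is the kind of input described above.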

Comments

One response to “656x Faster JSON Parsing in Python with ijson”

May 30, 2016

Leon Katsnelson

I can’t wait till we have this in production on http://DataScientistWorkbench.com. Bigger faster loading IPython notebooks has been a consistent ask from our users. Awesome job Leons!
