I profiled my team’s Python code and identified a performance bottleneck in JSON parsing. At two points, the code used the ijson package in a naive way that slowed down terribly for larger JSON files. It’s possible to achieve much faster JSON parsing by changing only a single import line.
Data Scientist Workbench
My team builds Data Scientist Workbench, which is a free set of tools for doing data science. It includes Jupyter and Zeppelin interactive notebooks as well as the RStudio IDE, all pre-configured to work with the Spark parallel data processing framework.
Behind the scenes, Data Scientist Workbench is composed of microservices. Some of them are built in Ruby, some in Node, and some in Python.
Faster JSON Parsing
ijson is a great library for working with JSON files in Python. Unfortunately, by default it uses a pure Python JSON parser as its backend. Much higher performance can be achieved by using a C backend.
These are the available backends:
- yajl2_cffi: wrapper around YAJL 2.x using CFFI, this is the fastest.
- yajl2: wrapper around YAJL 2.x using ctypes, for when you can’t use CFFI for some reason.
- yajl: deprecated YAJL 1.x + ctypes wrapper, for even older systems.
- python: pure Python parser, good to use with PyPy
Assuming you have yajl2 installed, switching from the slow, pure Python parser to a faster JSON parser written in C is a matter of changing this line:
import ijson
To this line:
import ijson.backends.yajl2 as ijson
All other code is the same.
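If the C library might be missing on some machines, one option is to try the faster backends first and fall back to the pure Python parser. The following is only a minimal sketch and is not taken from the code described in this post: `load_first_available` and `PREFERRED_BACKENDS` are hypothetical names, and it assumes that a backend whose dependencies are missing fails with an `ImportError` at import time.

```python
import importlib

# Backend module names in order of preference, following the list above.
# The last entry is the pure Python parser, which needs no C library.
PREFERRED_BACKENDS = (
    "ijson.backends.yajl2_cffi",
    "ijson.backends.yajl2",
    "ijson.backends.python",
)


def load_first_available(module_names=PREFERRED_BACKENDS):
    """Return the first module in module_names that imports cleanly.

    Assumption: an unusable backend raises ImportError when imported.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This backend is not usable here; try the next candidate.
            continue
    raise ImportError(
        "none of these modules could be imported: %s" % ", ".join(module_names)
    )
```

With a helper like this, `ijson = load_first_available()` would stand in for the import line, and the rest of the code would stay the same.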
Installation of yajl2
Before you can use yajl2 as a faster JSON parsing backend for ijson, you have to install it.
On Ubuntu, you can install it as follows:
apt-get -qq update
apt-get -y install libyajl2 libyajl-dev
pip install yajl-py==2.0.2
On the Mac, you can install it as follows:
brew install yajl
pip install yajl-py==2.0.2
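After installing, it can be useful to check that the YAJL shared library is visible on the system at all. This is a rough sketch using only the standard library; `yajl_library_name` is a hypothetical helper, and the check only tells you whether Python's library search can see something named `yajl`, which is not necessarily the exact lookup ijson performs.

```python
from ctypes.util import find_library


def yajl_library_name():
    """Return the name or path of the YAJL shared library, or None if not found."""
    return find_library("yajl")


name = yajl_library_name()
if name is None:
    print("YAJL shared library not found; the C backends will probably fail to import")
else:
    print("Found YAJL shared library: %s" % name)
```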
Other people have benchmarked ijson before me.
I did see a huge performance improvement with a specific 4MB JSON file (a Jupyter notebook), so it makes sense to measure that specifically.
Here’s the very simple code that I will use to measure the performance of parsing JSON with ijson:
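import ijson

# Do this 10 times
for i in range(10):
    print("Starting parse #%i" % i)
    # Open in binary mode and close the file after each pass
    with open('4MB.ipynb', 'rb') as f:
        # Consume every (prefix, event, value) triple without using it
        for prefix, event, value in ijson.parse(f):
            pass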
The result of the first run:
$ time python test.py
Starting parse #0
Starting parse #1
Starting parse #2
Starting parse #3
Starting parse #4
Starting parse #5
Starting parse #6
Starting parse #7
Starting parse #8
Starting parse #9
real 20m52.592s
user 14m37.860s
sys 6m6.768s
After changing to yajl2 as the parser:
$ time python test.py
Starting parse #0
Starting parse #1
Starting parse #2
Starting parse #3
Starting parse #4
Starting parse #5
Starting parse #6
Starting parse #7
Starting parse #8
Starting parse #9
real 0m1.910s
user 0m1.784s
sys 0m0.085s
That’s roughly 656 times faster!
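The speedup comes from dividing the two wall-clock ("real") times reported above. As a small sketch of that arithmetic, with `to_seconds` as a hypothetical helper for the `XmY.YYYs` format that `time` prints:

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style value such as '20m52.592s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError("unrecognised time value: %r" % stamp)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


slow = to_seconds("20m52.592s")  # real time with the pure Python backend
fast = to_seconds("0m1.910s")    # real time with the yajl2 backend
speedup = slow / fast

print("%.3fs vs %.3fs: about %.0fx faster" % (slow, fast, speedup))
```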
I should mention that the JSON I’m parsing contains 4MB of escaped JSON represented as a string within actual JSON, so it may be an unusually bad case for the pure Python parser.
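To make "escaped JSON represented as a string within actual JSON" concrete, here is a tiny illustration using the standard `json` module. The document contents and the `payload` key are made up for the example; the real notebook's structure is not shown here.

```python
import json

# Hypothetical inner document, standing in for the real notebook contents.
inner = {"cells": [{"source": 'print("hello")'}]}

# Serialising once gives ordinary JSON. Embedding that text as a string value
# and serialising again forces every quote and backslash in it to be escaped.
inner_text = json.dumps(inner)
outer_text = json.dumps({"payload": inner_text})

print(inner_text)
print(outer_text)

# Recovering the inner document takes two decoding passes.
assert json.loads(json.loads(outer_text)["payload"]) == inner
```

The outer document is a single huge string value full of backslash escapes, so a parser spends nearly all of its time inside one string token.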