Python Library of the Day: retrying

I’ve learned through extensive experience that Bash is the wrong choice for anything longer than a few lines. I needed to write a command line app, so I put one together in Python (Python 3, of course, since Python 2 is going away by 2020). In the process I discovered a Python library that was new to me, called retrying.

If you want to learn Python, check out the Python for Data Science course on Cognitive Class.

retrying

I needed my Python code to repeat a bunch of operations until they succeeded. It’s easy to write a naive loop for that, but the retry logic gets convoluted and makes the actual operation ugly to look at. By the time you’ve done something three times over, you should automate it.
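To make the complaint concrete, here is a sketch of the kind of naive loop I mean (the function and parameter names are my own, for illustration):

```python
import time

def fetch_with_retries(fetch, max_attempts=3, delay_seconds=1.0):
    """Naive retry loop: note how the retry plumbing crowds out
    the one line that does the actual work."""
    for attempt in range(max_attempts):
        try:
            return fetch()  # the actual operation
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; give up
            time.sleep(delay_seconds)  # fixed delay, no backoff
```

Every operation that needs retries ends up wrapped in this same boilerplate, which is exactly what a decorator can factor out.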

You can of course write an abstraction yourself, but for this sort of common problem it is best to use an existing library.

XKCD comic on automation

The benefit to using an existing library is not just that someone else maintains it, but also that you benefit from the collective wisdom and experience of everyone else using the library. Computing is full of strange edge cases and unexpected security holes. These are harder to avoid when rolling your own abstraction.

For my purpose, I found a Python library called retrying. It provides a simple decorator called @retry that you can apply to any function or method. The decorator also takes additional parameters so you can configure all the timeouts, intervals, exponential backoff, and smart exception handling that you want.
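The real library’s @retry accepts keyword arguments such as stop_max_attempt_number and wait_exponential_multiplier. Since retrying is a third-party package, here is a minimal hand-rolled equivalent that sketches what the decorator does; the parameter names below are simplified stand-ins, not retrying’s actual API:

```python
import functools
import time

def retry(max_attempts=3, base_delay=0.5):
    """Minimal stand-in for retrying's @retry: re-run the wrapped
    function on any exception, with exponential backoff between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: let the error propagate
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```

With a decorator like this, the retry policy lives in one place and the decorated function body stays focused on the operation itself.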

Kudos to everyone working on the library. It’s a great little tool.

656x Faster JSON Parsing in Python with ijson

I profiled my team’s Python code and identified a performance bottleneck in JSON parsing. At two points, the code used the ijson package in a naive way that slowed down terribly for larger JSON files. It turned out that much faster JSON parsing was possible with a one-line change.

Data Scientist Workbench

My team builds Data Scientist Workbench, which is a free set of tools for doing data science. It includes the Jupyter and Zeppelin interactive notebooks, as well as the RStudio IDE, all pre-configured to work with the Spark parallel data processing framework.

Behind the scenes, Data Scientist Workbench is composed of microservices. Some of them are built in Ruby, some in Node, and some in Python.

Faster JSON Parsing

JSON is a convenient format for serializing data. It originated as a subset of JavaScript’s object literal notation. Most languages have several libraries for reading and writing JSON.
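Python, for instance, ships a json module in the standard library for exactly this:

```python
import json

record = {"name": "ijson", "fast": True, "versions": [1, 2, 3]}

text = json.dumps(record)            # serialize to a JSON string
roundtrip = json.loads(text)         # parse it back into Python objects
assert roundtrip == record           # lossless round trip
```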

ijson is a great library for working with JSON files in Python. Unfortunately, by default it uses a pure Python JSON parser as its backend. Much higher performance can be achieved by using a C backend.

These are the available backends:

  • yajl2_cffi: wrapper around YAJL 2.x using CFFI; the fastest option.
  • yajl2: wrapper around YAJL 2.x using ctypes, for when you can’t use CFFI.
  • yajl: deprecated wrapper around YAJL 1.x using ctypes, for even older systems.
  • python: pure Python parser; a good choice with PyPy.

Assuming you have yajl2 installed, switching from the slow, pure Python parser to a faster JSON parser written in C is a matter of changing a single import line. All other code stays the same.
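As a sketch of that one-line change: ijson selects its backend by which module you import, so the switch looks roughly like this (assuming the yajl2 shared library is installed):

```python
# Before: the default import uses ijson's pure-Python backend.
import ijson

# After: explicitly select the C-accelerated yajl2 backend instead.
import ijson.backends.yajl2 as ijson
```

Because the backend module is bound to the same name, every later call such as ijson.items(...) works unchanged.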

Installation of yajl2

Before you can use yajl2 as a faster JSON parsing backend for ijson, you have to install it.

On Ubuntu, you can install it as follows:
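Something along these lines (the package name may vary between releases; `apt-cache search yajl` will confirm it):

```shell
# Install the YAJL 2.x shared library that the yajl2 backend loads
sudo apt-get install libyajl2
```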

On the Mac, you can install it as follows:
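With Homebrew, the formula is simply called yajl:

```shell
# Install YAJL via Homebrew
brew install yajl
```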

Performance Micro-benchmark

Other people have benchmarked ijson before me.

But I saw a huge performance improvement with one particular 4MB JSON file (a Jupyter notebook), so that file is what I measured.

Here’s the very simple code that I will use to measure the performance of parsing JSON with ijson:
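The original snippet timed ijson parsing the 4MB notebook file. As a self-contained illustration of the measurement approach, here is the same idea using the standard library’s json module and a generated document in place of ijson and the real file:

```python
import json
import time

# Stand-in payload; the original benchmark parsed a real 4MB Jupyter notebook.
document = json.dumps({"cells": [{"source": "x = %d" % i} for i in range(50_000)]})

start = time.perf_counter()          # high-resolution timer
parsed = json.loads(document)
elapsed = time.perf_counter() - start

print("parsed %d cells in %.4f seconds" % (len(parsed["cells"]), elapsed))
```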

Comparing the first run, using the pure Python parser, against a second run with yajl2 as the parser: 656x, or 65600%, faster!

I should mention that the JSON I’m parsing contains 4MB of escaped JSON represented as a string within actual JSON, so it may be an unusually bad case for the pure Python parser.
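Schematically, that shape looks like this: a JSON document whose value is itself a JSON-encoded string, so the payload has to be escaped on the way in and parsed twice on the way out (the field names here are made up for illustration):

```python
import json

notebook = {"cells": [{"source": 'print("hi")'}]}

# The inner document is serialized to a string, then embedded in an outer one:
wrapped = json.dumps({"payload": json.dumps(notebook)})

# Reading it back takes two parses: the outer JSON, then the escaped inner string.
outer = json.loads(wrapped)
inner = json.loads(outer["payload"])
assert inner == notebook
```

All that string escaping and un-escaping is heavy work for a pure Python parser, which is why a C backend pays off so dramatically here.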