Fast queries against large JSON documents using pysimdjson

A very common task when working with large (and small) JSON files is to extract or "query" a small subset of the document. Imagine you're reading a 1GB catalog of products, but you only really care about the products that are in stock. Materializing JSON as Python objects is a very expensive process, so we want to avoid it as much as possible and push the bulk of the work down to a lower layer written in another language such as C/C++.

pysimdjson is a (very young) library that wraps the amazing simdjson JSON parser, which is capable of extremely fast parsing (2 GB/s+) using the AVX2 instructions available on most modern x86-64 CPUs.

pysimdjson is normally comparable to ujson and other high-performance Python JSON parsers when reading an entire document, but pysimdjson also builds on top of simdjson to provide a simple API for filtering the document without materializing it into Python objects at all.
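
When you do want the whole document, pysimdjson can be used much like the built-in module. A minimal sketch (assuming simdjson.loads mirrors json.loads and takes bytes, as in the library's README; 'products.json' is just a placeholder name):

import simdjson

# Fully materializes the document as Python objects, like json.loads,
# but note that the input must be bytes rather than str.
with open('products.json', 'rb') as fin:
    doc = simdjson.loads(fin.read())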

Caveats

There are a few things to keep in mind before trying pysimdjson:

  • simdjson does not yet have a streaming API, so memory usage will be quite high compared to a streaming parser and will depend on the size of the JSON document being read.
  • simdjson is optimized for large documents that can take full advantage of SIMD instructions. Compared to other parsers like ujson, it may only tie or even be slower on very small documents.
  • pysimdjson is not fully compatible with the built-in json interface. For example, you cannot specify an encoding, as the underlying simdjson library always assumes UTF-8. pysimdjson always works on bytes, never str (see the short example after this list).
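
To make that last caveat concrete, a short sketch (again assuming simdjson.loads):

import simdjson

simdjson.loads(b'{"hello": "world"}')  # fine: bytes in, dict out
simdjson.loads('{"hello": "world"}')   # expected to fail: str is not accepted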

Microbenchmark

As we mentioned in the caveats, simdjson just isn't optimized for very small documents. For an example document (166 bytes) such as:

{
    "hello": "world",
    "list": [
        1, 2, 3, 4, 5, 6, 7, 8, 9,
        1, 2, 3, 4, 5, 6, 7, 8, 9,
        1, 2, 3, 4, 5, 6, 7, 8, 9
    ],
    "xyz": 123
}
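
For reference, the numbers below came from a simple harness along these lines (a sketch, not the exact code used; it assumes simdjson.loads mirrors json.loads and that constructing ParsedJson from raw bytes parses without building any Python objects):

import timeit

import json
import simdjson
import ujson

with open('small.json', 'rb') as fin:  # placeholder name for the 166-byte document
    data = fin.read()

RUNS = 100
for name, func in [
    ('built-in json', lambda: json.loads(data)),
    ('ujson', lambda: ujson.loads(data)),
    ('pysimdjson (fully materialized)', lambda: simdjson.loads(data)),
    ('pysimdjson (no materialization)', lambda: simdjson.ParsedJson(data)),
]:
    # timeit returns the total time for `number` runs; divide for the average.
    print(name, timeit.timeit(func, number=RUNS) / RUNS)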

We get the times:

parser                           avg time (s) over 100 runs
built-in json (py3.7.2)          0.000902
ujson                            0.000194
pysimdjson (fully materialized)  0.000379
pysimdjson (no materialization)  0.000193

pysimdjson without materializing the parsed JSON into Python objects only ties with ujson, which always fully materializes. However, it is still more than four times faster than the built-in parser.

Let's try a much larger document (3.2 MB) with many strings, gsoc-2018.json. This document is a subset of the Google Summer of Code candidates in 2018, and each record within it looks like:

{
    "1": {
        "@context": "http://schema.org",
        "@type": "SoftwareSourceCode",
        "name": "Instructor Interface for Plagiarism Detection",
        "description": "Plagiarism Detection is among the significant ...",
        "sponsor": {
            "@type": "Organization",
            "name": "Submitty",
            "disambiguatingDescription": "Programming assignment ...",
            "description": "Submitty ...",
            "url": "http://submitty.org",
            "logo": "//lh3.googleusercontent.com/..."
        },
        "author": {
            "@type": "Person",
            "name": "Tushar Gurjar"
        }
    },
    ...
}

Running our benchmark against this document, we get:

parser                           avg time (s) over 100 runs
built-in json (py3.7.2)          1.452656
ujson                            1.289953
pysimdjson (fully materialized)  0.871341
pysimdjson (no materialization)  0.176018

simdjson shines in this case, taking an average of only 0.176s to parse the entire document. Converting simdjson's internal representation of the parsed document into Python objects took far longer (0.701s) than actually parsing it! Still, in both cases pysimdjson was significantly faster than the alternatives.

For certain problems we can find a middle ground that's still slower than the raw parsing speed of simdjson, but much faster than any other Python parser. Let's assume we were only parsing this document to get a list of all the authors. We don't care about any other part of the document, so materializing all of it is a huge waste of time. Given the same gsoc-2018.json sample we used above, we'll use the query .[].author.name to get a list of all the names that appear in the file. Our code looks like this:

import simdjson
with open('gsoc-2018.json', 'rb') as fin:
    # Parse the document once into simdjson's internal representation;
    # no Python objects are built for the document itself.
    pj = simdjson.ParsedJson(fin)
    # Evaluate the query, materializing only the matching values.
    authors = pj.items('.[].author.name')

This is significantly faster, taking just 0.315s to parse the document, iterate over it, and materialize the results of the query as a native Python list.
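
For comparison, doing the same with the built-in module means materializing the entire 3.2 MB document before we can walk it (a sketch, assuming every record has an author as in the sample above):

import json

with open('gsoc-2018.json', 'rb') as fin:
    doc = json.load(fin)

# Walk every record ("1", "2", ...) and pull out the author's name.
authors = [record['author']['name'] for record in doc.values()]

The query API lets simdjson skip all of that intermediate work, which is where the time savings come from.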