Flexible JSON Parsing in Python

JSON is easily the most common data serialization format used on the web today. It's relatively (and deceptively) simple, human-readable, and parsable by most languages. Python has a built-in json module that, for most people, is Good Enough™.

The Problems

The built-in json module is great for simple tasks, but it has a few issues that might make it the wrong choice.

  • It's really slow, even with a C extension.
  • It's fairly strict. It parses JSON, not the "JSON-ish" documents you often find in the wild, which aren't strictly valid but are still accepted by some parsers. VSCode, a popular IDE, uses a lax JSON parser so that its configuration files can contain comments. That's not valid JSON, but it's still useful to be able to parse those files (see the example after this list).
  • To parse any JSON document, we have to load the entire document into Python, even if we only care about a small part of it. Creating Python objects is always the slowest part of any Python JSON library. Always.
  • It can read and write non-standard JSON, since it handles numbers of arbitrary precision that many other parsers will reject. This is great for some use cases, but it's not always what you want, and there's no way to turn it off when serializing.
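To make the strictness and number-handling points concrete, here's a minimal example using only the standard library:

import json

# Comments and trailing commas are rejected outright.
try:
    json.loads('{"a": 1, // a comment\n "b": 2}')
except json.JSONDecodeError as exc:
    print(exc)

try:
    json.loads('{"a": 1,}')
except json.JSONDecodeError as exc:
    print(exc)

# ...but integers of arbitrary precision are written without complaint,
# even though many other parsers will choke on them.
print(json.dumps({'big': 10**40}))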

But...

On the other hand, it's part of the standard library, so it's always available, and it's very easy to use. For most projects without strict performance requirements or a need to parse non-standard JSON, it's probably the right choice.

Possible Solutions

There are quite a few alternative Python JSON libraries that offer performance improvements over the built-in json module. Some of the more popular ones are:

  • orjson - A fast JSON parser written in Rust.
  • simdjson - A fast JSON parser written in C++ that uses SIMD instructions to achieve extremely high parsing speeds.
  • ujson - A fast JSON parser written in C.
  • rapidjson - A fast JSON parser written in C++.

You'll notice none of these are written in Python. They're all written in lower-level languages with a binding layer for use from Python. This can make them more annoying to install, since you may need a compiler for the respective language if binary wheels aren't available for your combination of platform and Python version. For example, simdjson requires a modern C++ compiler to build, and specific CPU features to get the best performance.

Each of these has tradeoffs and benefits:

  • orjson can't parse comments, but it has very good all-around performance and excels at serializing certain Python objects, like UUIDs and dataclasses (see the sketch after this list).
  • simdjson is the fastest JSON parser in existence, but it only supports strict parsing and has no support for serialization.
  • rapidjson supports extensions like NaN/Infinity and parsing comments, but it can be even slower than the standard json module in certain cases.
  • Both ujson and rapidjson fail several of the JSON minefield tests.
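As a quick sketch of the orjson point above (not a benchmark, just the API): orjson serializes dataclasses and UUIDs natively, with no custom encoder required.

import dataclasses
import uuid

import orjson


@dataclasses.dataclass
class User:
    id: uuid.UUID
    name: str


# orjson.dumps returns bytes and handles the dataclass and UUID directly.
payload = orjson.dumps(User(id=uuid.uuid4(), name='Ada'))
print(payload)
print(orjson.loads(payload))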

A contender

yyjson is an excellent JSON parser written in standard C89 that is strict by default, but can be configured to be more flexible. It's also very fast, competing directly with simdjson and even beating it in many cases without using SIMD instructions. Its compiled size is very small, it supports custom memory allocators out of the box, and it has a mutable API for modifying & creating documents. It passes the JSON minefield tests perfectly.

This checks off just about all the boxes I could hope for. Being standard C89, we can compile it for just about any architecture, any OS, and any set of CPU features from the last 30 years. It's fast, it's small, it's flexible, and it's correct. Since it supports custom allocators, we can plug in CPython's memory allocator, which lets users properly track memory usage with standard Python tools.

So, let's make a Python binding for it.

Introducing the poorly named py_yyjson

py_yyjson is a Python binding for yyjson that makes it easy to use and hides most of the C implementation details. It provides binary wheels for most platforms and has no dependencies, so installation is as simple as pip install yyjson. It's written in C, it's pretty fast, and it's licensed under the MIT license.

Internally, yyjson has two different APIs for working with mutable and immutable documents. py_yyjson does its best to hide these details from you behind a single Document object.

Reading & Writing

We can create a Document in many different ways:

from pathlib import Path
from yyjson import Document

Document({'a': 1, 'b': 2}) # A Python object
Document(b'{"a": 1, "b": 2}') # A binary JSON string
Document('{"a": 1, "b": 2}') # A text JSON string
Document(Path('path/to/file.json')) # A file on disk
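Going the other way, dumps() serializes the document back to a JSON string:

from yyjson import Document

doc = Document({'a': 1, 'b': 2})
print(doc.dumps())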

We can use ReaderFlags and WriterFlags to control optional behaviour, such as ignoring comments:

from yyjson import Document, ReaderFlags

Document('''{
    // Comments in JSON!?!?
    "a": 1
}''', flags=ReaderFlags.ALLOW_COMMENTS)

We can also allow serializing and parsing Infinity and NaN:

from yyjson import Document, ReaderFlags

Document('{"a": Infinity}', flags=ReaderFlags.ALLOW_INF_AND_NAN)
Document({'a': float('inf')})

We can ignore trailing commas:

from yyjson import Document, ReaderFlags

Document('{"a": 1,}', flags=ReaderFlags.ALLOW_TRAILING_COMMAS)

And of course we can combine any of these options:

from yyjson import Document, ReaderFlags

Document(
  '{"a": Infinity,}',
  flags=ReaderFlags.ALLOW_INF_AND_NAN | ReaderFlags.ALLOW_TRAILING_COMMAS
)
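WriterFlags work the same way when serializing. As a sketch (I'm assuming here that dumps() takes a flags argument and that WriterFlags exposes a PRETTY flag mirroring yyjson's writer flags; the exact names may differ):

from yyjson import Document, WriterFlags

doc = Document({'a': 1, 'b': [2, 3]})
# Assumed: pretty-printed output via a PRETTY writer flag.
print(doc.dumps(flags=WriterFlags.PRETTY))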

There are several other feature flags, so check them out in the documentation.

JSON Pointers

We can use JSON pointers to access just part of a document. When we do this, we avoid creating Python objects entirely, except for just the part that matches. This can be a huge performance win when you only care about a small part of a large document.

from pathlib import Path
from yyjson import Document

doc = Document(Path("canada.json"))
features = doc.get_pointer("/features")

We can also use JSON pointers to pluck out just part of a document for serialization:

from yyjson import Document

doc = Document({'results': {'count': 3, 'rows': [55, 66, 77]}})
print(doc.dumps(at_pointer='/results/rows'))

This would result in a serialized JSON string containing just [55, 66, 77].

JSON Patch (RFC 6902) and JSON Merge-Patch (RFC 7386)

We aren't limited to just reading part of a document; we can modify just a small part of it as well. For example, we can take our hypothetical large canada.json document and add a new GeoJSON entry to the features list:

from pathlib import Path
from yyjson import Document

doc = Document(Path("canada.json"))
patch = Document([
    {'op': 'add', 'path': '/features/-', 'value': {
        'type': 'Feature',
        'geometry': {
            'type': 'Point',
            'coordinates': [1, 2],
        },
        'properties': {
            'name': 'New Feature',
        },
    }},
])
modified = doc.patch(patch)
print(modified.dumps())

This mutates the document entirely in its lower-level representation, without creating any Python objects. If we wanted to use a merge-patch (RFC 7386), we could do that too, for example to replace the entire features list with a single entry:

from pathlib import Path
from yyjson import Document

doc = Document(Path("canada.json"))
patch = Document({
    'features': [{
        'type': 'Feature',
        'geometry': {
            'type': 'Point',
            'coordinates': [1, 2],
        },
        'properties': {
            'name': 'New Feature',
        },
    }],
})
modified = doc.patch(patch, use_merge_patch=True)
print(modified.dumps())

Both the JSON Patch and JSON Merge-Patch implementations are the fastest of any Python library I could find.

Conclusion

py_yyjson isn't the fastest JSON parser in every situation, but it tends to tie or beat the fastest parsers in most situations while supporting non-standard JSON. It's lightweight, has no dependencies, and is highly portable. Unlike most (all?) other Python JSON libraries, it also lets you manipulate JSON documents without ever creating Python objects, which can be a huge performance win.

If you're looking for a fast, flexible, and correct JSON parser for Python, give py_yyjson a try. It's brand new and reasonably well tested, but if you manage to break it please open an issue.