Encoding & Decoding MUTF-8 in Python

CPython rejected adding a MUTF-8 parser to the standard library. This is unfortunate since there are many times you might want to use MUTF-8 in Python. You might be trying to:

  • Read NBT, the file format used by Minecraft
  • Read a JVM ClassFile
  • Read a DEX file, the binary format used by Android
  • Query SAP HANA (although this is really CESU-8)
  • Read any object saved with Java Object Serialization
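
What these formats share is an encoding that Python's built-in utf-8 codec will not produce: MUTF-8 works on UTF-16 code units, writes U+0000 as the two bytes C0 80, and writes each half of a surrogate pair as its own three-byte sequence (CESU-8 does the latter but leaves NUL as a single zero byte). A minimal stdlib-only sketch of those rules follows. The helper names encode_mutf8 and decode_mutf8 are hypothetical, this is not the mutf8 package's implementation, and it assumes well-formed input rather than validating it.

```python
def encode_mutf8(text: str) -> bytes:
    """Encode a str as MUTF-8 (illustrative sketch, no validation)."""
    # MUTF-8 is defined over UTF-16 code units, so split the string that way.
    units = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        if unit == 0:
            # NUL is never written as a raw zero byte.
            out += b"\xc0\x80"
        elif unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out += bytes((0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)))
        else:
            # Covers the rest of the BMP, including each surrogate half.
            out += bytes((
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ))
    return bytes(out)


def decode_mutf8(data: bytes) -> str:
    """Decode well-formed MUTF-8 back to a str (illustrative sketch)."""
    # In valid MUTF-8 the byte 0xC0 only ever starts the NUL encoding.
    text = data.replace(b"\xc0\x80", b"\x00").decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 to join surrogate halves into one code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")
```

With these helpers, encode_mutf8("\x00") gives b"\xc0\x80", and an emoji such as U+1F600 becomes six bytes (ED A0 BD ED B8 80) where standard UTF-8 uses four (F0 9F 98 80). A per-code-unit loop like this is easy to write but slow, which is why decoding speed matters for string-heavy files.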

The lack of a standard-library codec has led to significant duplication of work, with many projects re-implementing MUTF-8 parsing.

So I've plucked the MUTF-8 decoder/encoder out of my Lawu project as the mutf8 package and added an optional C version. Hopefully, this version will not only be correct, but fast as well. The fallback Python version is a tad bit faster than others, and the C version is significantly faster than existing implementations:

  Name                  Min (ns)
  ✨ cmutf8              100.9999
  mutf8               1,199.9800
  androguard-mutf8    1,799.9846

Using our C version in the popular tool androguard reduced the total time to parse the Facebook.apk by 20%. Due to the sheer number of strings in a DEX or JVM ClassFile, any improvement in decoding can have a large impact.

With precompiled binaries for most platforms and a pure-Python fallback, handling MUTF-8 is now a simple pip install mutf8.