Encoding & Decoding MUTF-8 in Python
CPython rejected adding a MUTF-8 parser to the standard library. This is unfortunate since there are many times you might want to use MUTF-8 in Python. You might be trying to:
- Read NBT, the file format used by Minecraft
- Read a JVM ClassFile
- Read a DEX file, the binary format used by Android
- Query SAP HANA (although this is really CESU-8)
- Read any object saved with Java Object Serialization
This has resulted in significant duplication of work, with many projects re-implementing MUTF-8 parsing.
So I've plucked the MUTF-8 decoder/encoder out of my Lawu project as the mutf8 package and added an optional C version. Hopefully, this version will not only be correct, but fast as well. The fallback Python version is a tad bit faster than others, and the C version is significantly faster than existing implementations:
Name | Min (ns) |
---|---|
✨ cmutf8 | 100.9999 |
mutf8 | 1,199.9800 |
androguard-mutf8 | 1,799.9846 |
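The table does not say how the timings were collected. As a rough sketch of one common approach, the standard library's `timeit` can report the minimum time per call in nanoseconds. The helper name `min_ns_per_call` and the sample payload are illustrative assumptions, not part of the mutf8 package, and the built-in UTF-8 decoder stands in for whichever decoder you want to measure.

```python
import timeit


def min_ns_per_call(func, arg, number=10_000, repeat=5):
    """Return the best observed time per call of func(arg), in nanoseconds.

    Hypothetical helper for illustration only. Taking the minimum over
    several repeats reduces noise from other activity on the machine.
    """
    timings = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(timings) / number * 1e9


# Stand-in decoder: Python's built-in UTF-8 codec. Swap in the decoder
# you actually want to compare.
sample = "hello, world".encode("utf-8")
best = min_ns_per_call(lambda data: data.decode("utf-8"), sample)
print(f"best: {best:.1f} ns per call")
```

Numbers from a harness like this vary by machine and input, so they are only comparable when every decoder is measured on the same payload in the same run.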
Using our C version in the popular tool androguard reduced the total time to
parse the Facebook.apk by 20%. Due to the sheer number of strings in a DEX
or JVM ClassFile, any improvement in decoding can have a large impact.
With precompiled binaries for most platforms, and a pure-Python fallback,
handling MUTF-8 is now a simple pip install mutf8.
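For readers wondering why the built-in `utf-8` codec is not enough, here is a minimal sketch of the two ways MUTF-8 differs from standard UTF-8: NUL is written as the two bytes `C0 80`, and characters above U+FFFF are written as a UTF-16 surrogate pair, with each surrogate encoded as three bytes. The function names `encode_mutf8_sketch` and `decode_mutf8_sketch` are made up for this illustration. This is not the mutf8 package's implementation and it does only basic validation.

```python
def _three_bytes(unit: int) -> bytes:
    """Encode one 16-bit value using the 3-byte UTF-8 pattern."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_mutf8_sketch(text: str) -> bytes:
    """Illustrative MUTF-8 encoder (hypothetical name, not the mutf8 API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"  # NUL never appears as a raw 0x00 byte
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out += _three_bytes(cp)
        else:
            # Split into a UTF-16 surrogate pair, 3 bytes per surrogate.
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)


def decode_mutf8_sketch(data: bytes) -> str:
    """Illustrative MUTF-8 decoder (hypothetical name, not the mutf8 API)."""
    units = []  # decoded UTF-16 code units
    i = 0
    while i < len(data):
        lead = data[i]
        if lead == 0x00:
            raise ValueError("raw NUL byte is not valid MUTF-8")
        if lead < 0x80:
            size, unit = 1, lead
        elif lead & 0xE0 == 0xC0:
            size, unit = 2, lead & 0x1F
        elif lead & 0xF0 == 0xE0:
            size, unit = 3, lead & 0x0F
        else:
            raise ValueError("invalid lead byte for MUTF-8")
        tail = data[i + 1:i + size]
        if len(tail) != size - 1 or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence")
        for b in tail:
            unit = (unit << 6) | (b & 0x3F)
        units.append(unit)
        i += size
    # Let the UTF-16 codec join surrogate pairs back into single characters.
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le", "surrogatepass")


print(encode_mutf8_sketch("\x00").hex())        # NUL as two bytes
print(encode_mutf8_sketch("\U0001F600").hex())  # six bytes, not four
```

A pure-Python loop like this is easy to follow but slow, which is the motivation for the package's optional C version. For real files, prefer the mutf8 package over a hand-rolled decoder.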