MUTF-8 for Python, or "Bah, Java!"

Jan 21, 2021

Bah, Java!

Does this sound familiar?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0:
invalid continuation byte

You might have just tried to decode some random data as a Unicode string, but more likely you’ve encountered a string encoded as Modified UTF-8, or MUTF-8. This seemingly esoteric encoding is actually very widely used, present in every Java (or JVM) and Android application. By using MUTF-8 you can guarantee that no NULL (0x00) bytes occur within a string, as any NULL is instead encoded as 0xC0 0x80. This means it’s “safe” to use traditional methods that operate on NULL-terminated strings, such as strlen() or strcpy(), on a string that with regular UTF-8 encoding might contain a NULL.
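To make the NULL rule concrete, here is a minimal sketch using only the standard library. The helper names nul_safe and c_strlen are made up for this example, and nul_safe covers only the NULL rule, not supplementary characters. The byte replace works because in UTF-8 a 0x00 byte can only ever come from U+0000.

```python
def nul_safe(text: str) -> bytes:
    """Encode as UTF-8, then swap every 0x00 byte for 0xC0 0x80.

    Hypothetical helper: handles the NULL rule only.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def c_strlen(buf: bytes) -> int:
    """Mimic C's strlen(): count bytes up to the first 0x00."""
    return buf.index(b"\x00")


plain = "a\x00b".encode("utf-8") + b"\x00"  # regular UTF-8 plus C terminator
safe = nul_safe("a\x00b") + b"\x00"         # NULL-free body plus C terminator

print(plain)            # b'a\x00b\x00'
print(safe)             # b'a\xc0\x80b\x00'
print(c_strlen(plain))  # 1 (stops at the embedded NULL)
print(c_strlen(safe))   # 4 (sees the whole string)
```

With regular UTF-8 the C-style length stops at the embedded NULL; with the 0xC0 0x80 form the only 0x00 left is the terminator.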

So, why does encountering 0xED when decoding a string mean you might have run into a MUTF-8 string? Let’s take a look at how we would encode a supplementary character (that is, a Unicode character in the range U+10000-U+10FFFF) in MUTF-8:

# c is the code point of a supplementary character (U+10000-U+10FFFF).
bytearray([
    0xED,
    0xA0 | (((c >> 0x10) - 1) & 0x0F),
    0x80 | ((c >> 0x0A) & 0x3F),
    0xED,
    0xB0 | ((c >> 0x06) & 0x0F),
    0x80 | (c & 0x3F)
])

So, whenever a MUTF-8 encoder encounters a codepoint in the range of U+10000 to U+10FFFF, it’ll encode it as a surrogate pair, with each surrogate starting with, you guessed it, 0xED. Let’s see how we’d encode the common smiley emoji:

>>> import mutf8
>>> mutf8.encode_modified_utf8('😀')
b'\xed\xa0\xbd\xed\xb8\x80'

And what happens if we try to decode it with the UTF-8 decoder?

>>> b'\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

Shazam!
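If you’re stuck with only the standard library, both rules (NULL becomes 0xC0 0x80, supplementary characters become a UTF-16 surrogate pair with each half written as three bytes) can be sketched with the surrogatepass error handler. This is an illustration rather than the mutf8 package’s implementation: the names encode_mutf8_sketch and decode_mutf8_sketch are made up for this example, and the decoder assumes well-formed input.

```python
def encode_mutf8_sketch(text: str) -> bytes:
    """Illustrative MUTF-8 encoder built on the stdlib UTF-8 codec."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NULL is written as the two-byte form 0xC0 0x80.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each
            # surrogate as an ordinary three-byte sequence.
            v = cp - 0x10000
            hi = 0xD800 | (v >> 10)
            lo = 0xDC00 | (v & 0x3FF)
            out += chr(hi).encode("utf-8", "surrogatepass")
            out += chr(lo).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def decode_mutf8_sketch(data: bytes) -> str:
    """Illustrative MUTF-8 decoder; assumes well-formed input."""
    # 0xC0 never appears in well-formed input except as the NULL lead byte.
    raw = data.replace(b"\xc0\x80", b"\x00")
    # surrogatepass lets the UTF-8 codec accept the encoded surrogates...
    halves = raw.decode("utf-8", "surrogatepass")
    # ...and a round trip through UTF-16 joins each pair into one character.
    return halves.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")


print(encode_mutf8_sketch("😀"))                        # b'\xed\xa0\xbd\xed\xb8\x80'
print(decode_mutf8_sketch(b"\xed\xa0\xbd\xed\xb8\x80"))  # 😀
```

The round trip through UTF-16 is convenient but does extra work on every string, which is one reason a dedicated decoder is worth having.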

The State of MUTF-8/CESU-8 in Python

CPython rejected adding a MUTF-8 parser to the standard library. This is unfortunate since there are many times you might want to use MUTF-8 in Python. You might be trying to:

This has resulted in some significant replication of work, with many projects re-implementing MUTF-8/CESU-8 parsing:

So I’ve plucked the MUTF-8 decoder/encoder out of my Lawu project as the mutf8 package and added an optional C version. Hopefully, this version will not only be correct, but fast as well. Both the Python and the C version are faster than existing implementations:

Name               Min (ns)
✨ cmutf8          100.9999
mutf8              1,199.9800
androguard-mutf8   1,799.9846

Using our C version in the popular tool androguard reduced the time to parse the Facebook.apk by 20%! Due to the sheer number of strings in a DEX or JVM ClassFile, any improvement in decoding can have a large impact.
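If you want to compare decoders yourself, a best-of-N figure like the "Min (ns)" column can be approximated with timeit. This is only a sketch of one way to measure it, not the harness used for the table above; min_ns_per_call and stdlib_decode are hypothetical names, and the stand-in decoder is there so the example runs without any third-party packages.

```python
import timeit


def min_ns_per_call(func, data, number=10_000, repeat=5):
    """Best-of-`repeat` average time for one func(data) call, in nanoseconds.

    The figure includes the small overhead of the wrapping lambda.
    """
    totals = timeit.repeat(lambda: func(data), number=number, repeat=repeat)
    return min(totals) / number * 1e9


def stdlib_decode(data: bytes) -> str:
    # Stand-in decoder; swap in whichever decoder you want to measure.
    return data.decode("utf-8")


sample = b"Ljava/lang/String;"  # a short string of the kind found in class files
print(f"stdlib_decode: {min_ns_per_call(stdlib_decode, sample):,.4f} ns")
```

Taking the minimum rather than the mean filters out runs that were disturbed by other activity on the machine.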

With precompiled binaries for most platforms, and a pure-Python fallback, handling MUTF-8 is now as simple as pip install mutf8.