Bah, Java!
Does this sound familiar?
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
You might have just tried to decode some random data as a Unicode string, but
more likely you’ve encountered a string encoded as Modified UTF-8, or MUTF-8.
This seemingly esoteric encoding is actually very widely used, present in
every Java (or JVM) and Android application. By using MUTF-8 you can guarantee
that no NULL (0x00) bytes occur within a string, as any NULL is instead
encoded as 0xC0 0x80. This means it’s “safe” to use traditional methods that
operate on NULL-terminated strings, such as strlen() or strcpy(), on a string
that with regular UTF-8 encoding might contain a NULL.
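To make the NULL handling concrete, here is a minimal sketch using only the standard library. It assumes the text contains no characters above U+FFFF, in which case MUTF-8 differs from regular UTF-8 only in how U+0000 is written. The name `encode_bmp_mutf8` is a hypothetical helper for illustration, not part of any package.

```python
def encode_bmp_mutf8(text: str) -> bytes:
    """Sketch: encode BMP-only text with U+0000 written as 0xC0 0x80."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("supplementary characters need surrogate handling")
    # In UTF-8 a 0x00 byte can only come from U+0000, so a byte-level
    # replace is enough for this restricted case.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = encode_bmp_mutf8("a\x00b")
print(encoded)             # b'a\xc0\x80b'
print(b"\x00" in encoded)  # False
```

Because the result never contains a 0x00 byte, it can be handed to code that expects a NULL-terminated buffer without being cut short.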
So, why does encountering 0xED when decoding a string mean you might have run
into a MUTF-8 string? Let’s take a look at how we would encode a supplementary
character (that is, a Unicode character in the range U+10000-U+10FFFF) in
MUTF-8:
bytearray([
    0xED,
    # Bits 20-16 minus one, matching the UTF-16 high surrogate.
    0xA0 | (((c >> 0x10) - 1) & 0x0F),
    0x80 | ((c >> 0x0A) & 0x3F),
    0xED,
    0xB0 | ((c >> 0x06) & 0x0F),
    0x80 | (c & 0x3F)
])
So, whenever a MUTF-8 encoder encounters a codepoint in the range of
U+10000 to U+10FFFF, it’ll encode it as a surrogate pair, with each surrogate
starting with, you guessed it, 0xED. Let’s see how we’d encode the common
smiley emoji (U+1F600, whose UTF-16 surrogate pair is 0xD83D 0xDE00):
>>> import mutf8
>>> mutf8.encode_modified_utf8('😀')
b'\xed\xa0\xbd\xed\xb8\x80'
And what happens if we try to decode it with the UTF-8 decoder?
>>> b'\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
Shazam!
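As described above, a supplementary codepoint is written as its UTF-16 surrogate pair, with each surrogate taking three bytes. The following is a rough standard-library sketch of that idea and of reversing it. `encode_supplementary` and `decode_mutf8_sketch` are hypothetical names for illustration, and this is not the implementation used by the mutf8 package.

```python
def encode_supplementary(codepoint: int) -> bytes:
    """Sketch: encode one codepoint in U+10000..U+10FFFF as six bytes."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a supplementary codepoint")
    offset = codepoint - 0x10000
    high = 0xD800 | (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 | (offset & 0x3FF)   # low (trailing) surrogate
    # 'surrogatepass' lets the UTF-8 codec emit the three-byte form of a
    # surrogate, which it would otherwise refuse to produce.
    return b"".join(chr(s).encode("utf-8", "surrogatepass") for s in (high, low))


def decode_mutf8_sketch(data: bytes) -> str:
    """Sketch: decode MUTF-8 by round-tripping through UTF-16."""
    # In valid MUTF-8, 0xC0 only appears as the first byte of the NULL pair.
    data = data.replace(b"\xc0\x80", b"\x00")
    # Accept the surrogate halves as lone surrogates...
    halves = data.decode("utf-8", "surrogatepass")
    # ...then let the UTF-16 codec join each pair into one character.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


six = encode_supplementary(0x1F600)
print(len(six), hex(six[0]), hex(six[3]))        # 6 0xed 0xed
print(decode_mutf8_sketch(six) == "\U0001F600")  # True
```

The decoder here is deliberately lenient: it also accepts ordinary four-byte UTF-8 sequences, which a strict MUTF-8 parser would reject.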
The State of MUTF-8/CESU-8 in Python
CPython rejected adding a MUTF-8 parser to the standard library. This is unfortunate since there are many times you might want to use MUTF-8 in Python. You might be trying to:
- Read NBT, the file format used by Minecraft
- Read a JVM ClassFile
- Read a DEX file, the binary format used by Android
- Query SAP HANA (although this is really just CESU-8)
- Read any object saved with Java Object Serialization
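Several of these formats (NBT strings, ClassFile UTF-8 constants, and anything written with Java's DataOutputStream.writeUTF) store a string as a two-byte big-endian length followed by that many MUTF-8 bytes. Here is a hedged sketch of reading one such string from a stream. `read_java_utf` is a hypothetical helper, and the decoding step is a standard-library stand-in rather than the mutf8 package.

```python
import io
import struct


def read_java_utf(stream) -> str:
    """Sketch: read a u2 length prefix, then that many MUTF-8 bytes."""
    (length,) = struct.unpack(">H", stream.read(2))
    raw = stream.read(length)
    if len(raw) != length:
        raise EOFError("stream ended inside a string")
    # Stand-in decoder: undo the two-byte NULL, then join surrogate halves
    # by round-tripping through UTF-16.
    raw = raw.replace(b"\xc0\x80", b"\x00")
    halves = raw.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


payload = io.BytesIO(b"\x00\x05hi\xc0\x80!")
print(repr(read_java_utf(payload)))   # 'hi\x00!'
```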
The lack of a standard-library codec has resulted in some significant replication of work, with many projects re-implementing MUTF-8/CESU-8 parsing.
So I’ve plucked the MUTF-8 decoder/encoder out of my Lawu project as the mutf8 package and added an optional C version. Hopefully, this version will not only be correct, but fast as well. Both the Python and the C version are faster than existing implementations:
Name | Min (ns) |
---|---|
✨ cmutf8 | 100.9999 |
mutf8 | 1,199.9800 |
androguard-mutf8 | 1,799.9846 |
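A minimum-time-per-call figure like the ones in this table can be collected with the standard library's timeit module. This is only a sketch: the helper name `min_ns` and the sample payload are assumptions for illustration, and the built-in UTF-8 decoder stands in for whichever decoder is being measured.

```python
import timeit


def min_ns(func, arg, number=10_000, repeat=5) -> float:
    """Sketch: best-of-`repeat` time per call to func(arg), in nanoseconds."""
    timer = timeit.Timer(lambda: func(arg))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


sample = "hello, world".encode("utf-8")
print(f"{min_ns(bytes.decode, sample):,.4f} ns")
```

Taking the minimum over several repeats filters out runs that were slowed down by other activity on the machine.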
Using our C version in the popular tool androguard reduced the time to
parse the Facebook.apk by 20%! Due to the sheer number of strings in a DEX
or JVM ClassFile, any improvement in decoding can have a large impact.
With precompiled binaries for most platforms, and a pure-python fallback,
handling MUTF-8 is now a simple pip install mutf8.