Does this sound familiar?
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
You might have just tried to decode some random data as a Unicode string, but
more likely you’ve encountered a string encoded as Modified UTF-8, or MUTF-8.
This seemingly esoteric encoding is actually very widely used, present in every
Java (or JVM) and Android application. By using MUTF-8 you can guarantee that no
NULL (0x00) bytes occur within a string, as NULL is instead encoded as
0xC0 0x80. This means it’s “safe” to use traditional methods that operate on
NULL-terminated strings, such as strcpy(), on a string that with regular UTF-8
encoding might have a NULL in it.
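To make the NULL guarantee concrete, here is a minimal sketch using only the standard library. The helper name encode_nul_safe is hypothetical and not part of any package, and it applies only the U+0000 rule; supplementary characters are left alone here.

```python
def encode_nul_safe(text: str) -> bytes:
    """Hypothetical helper: UTF-8, but with U+0000 written as 0xC0 0x80.

    Only the NULL rule of MUTF-8 is applied; characters above U+FFFF
    keep their regular four-byte UTF-8 form.
    """
    # In UTF-8 a 0x00 byte can only come from U+0000, so a plain
    # byte-level replace is enough.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = encode_nul_safe("key\x00value")
print(encoded)             # b'key\xc0\x80value'
print(b"\x00" in encoded)  # False

# A strict UTF-8 decoder refuses the two-byte NULL form.
try:
    encoded.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict UTF-8 rejects it:", exc.reason)
```

Because no 0x00 byte survives, the result can be handed to code that treats 0x00 as a terminator without the string being cut short.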
So, why does encountering 0xED when decoding a string mean you might have run
into a MUTF-8 string? Let’s take a look at how we would encode a supplementary
character (that is, a Unicode character in the range U+10000-U+10FFFF) in
MUTF-8, where c is the code point:
    bytearray([
        0xED,
        # High surrogate: the JVM spec stores (bits 20-16) minus one here.
        0xA0 | (((c >> 0x10) - 1) & 0x0F),
        0x80 | ((c >> 0x0A) & 0x3F),
        0xED,
        # Low surrogate.
        0xB0 | ((c >> 0x06) & 0x0F),
        0x80 | (c & 0x3F),
    ])

So, whenever a MUTF-8 encoder encounters a code point in the range of
U+10000 to U+10FFFF, it’ll encode it as a surrogate pair, with each surrogate
starting with, you guessed it, 0xED. Let’s see how we’d encode the common
😀 emoji (U+1F600):

    >>> import mutf8
    >>> mutf8.encode_modified_utf8('😀')
    b'\xed\xa0\xbd\xed\xb8\x80'

And what happens if we try to decode it with the UTF-8 decoder?

    >>> b'\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
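The surrogate-pair form can be cross-checked with nothing but the standard library. The sketch below is an illustration, not the mutf8 package's implementation: to_surrogate_bytes is a hypothetical helper that splits a string into UTF-16 code units and encodes each unit as if it were its own character, relying on Python's surrogatepass error handler.

```python
def to_surrogate_bytes(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit on its own (CESU-8 style).

    The NULL rule of MUTF-8 is not applied here.
    """
    utf16 = text.encode("utf-16-be")
    # Rebuild the string one 16-bit code unit at a time, so that a
    # supplementary character becomes two lone surrogates.
    units = "".join(
        chr(int.from_bytes(utf16[i:i + 2], "big"))
        for i in range(0, len(utf16), 2)
    )
    # "surrogatepass" lets the UTF-8 codec emit surrogates it would
    # normally refuse to encode.
    return units.encode("utf-8", "surrogatepass")


data = to_surrogate_bytes("\U0001F600")
print(len(data))                   # 6
print(data[0] == data[3] == 0xED)  # True

# The strict decoder fails on the very first byte of the pair.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.start, hex(exc.object[exc.start]))
```

Both three-byte halves begin with 0xED, which is the byte the strict decoder trips over.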
The State of MUTF-8/CESU-8 in Python
CPython rejected adding a MUTF-8 parser to the standard library. This is unfortunate since there are many times you might want to use MUTF-8 in Python. You might be trying to:
- Read NBT, the file format used by Minecraft
- Read a JVM ClassFile
- Read a DEX file, the binary format used by Android
- Query SAP HANA (although this is really just CESU-8)
- Read any object saved with Java Object Serialization
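Several of these formats store a string as a two-byte big-endian length followed by that many MUTF-8 bytes; this is the layout of a ClassFile CONSTANT_Utf8 entry after its tag byte, and of Java's DataOutput.writeUTF. A rough, stdlib-only sketch of reading one such string follows. Both function names are hypothetical, and the decoder assumes well-formed input and does no validation, so treat it as an illustration rather than a replacement for a real MUTF-8 codec.

```python
import struct


def decode_mutf8_stdlib(data: bytes) -> str:
    """Hypothetical decoder for well-formed MUTF-8 (no validation)."""
    # 0xC0 only ever appears as the lead byte of the two-byte NULL form.
    data = data.replace(b"\xc0\x80", b"\x00")
    # Let lone surrogates through, then merge each pair via UTF-16.
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


def read_prefixed_string(buf: bytes, offset: int = 0):
    """Read a u2 big-endian length, then that many MUTF-8 bytes.

    Returns the decoded string and the offset just past it.
    """
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    raw = buf[start:start + length]
    return decode_mutf8_stdlib(raw), start + length


text, end = read_prefixed_string(b"\x00\x03a\xc0\x80")
print(repr(text), end)  # 'a\x00' 5
```

Returning the next offset alongside the string makes it easy to keep walking a buffer that holds several such strings back to back.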
The lack of a standard-library codec has resulted in some significant duplication of work, with many projects re-implementing MUTF-8/CESU-8 parsing.
So I’ve plucked the MUTF-8 decoder/encoder out of my Lawu project as the mutf8 package and added an optional C version. Hopefully, this version will be not only correct, but fast as well. Both the Python and the C versions are faster than existing implementations.
Using our C version in the popular tool androguard reduced the time to parse
the Facebook.apk by 20%! Due to the sheer number of strings in a DEX or JVM
ClassFile, any improvement in decoding can have a large impact.
With precompiled binaries for most platforms, and a pure-Python fallback,
MUTF-8 support is now a simple pip install mutf8 away.