Python info-stealing malware uses Unicode to evade detection

A malicious Python package on PyPI uses Unicode as an obfuscation technique to evade detection while stealing and exfiltrating developers’ account credentials and other sensitive data from compromised devices.

The malicious package, named “onyxproxy,” uses a combination of different Unicode fonts in the source code to help it bypass automated scans and defenses that identify potentially malicious functions based on string matching.

The discovery of onyxproxy comes from cybersecurity specialists at , who published a report explaining the technique.

The package is no longer available on PyPI, having been removed from the platform yesterday. However, since its publication on the platform on March 15, the malicious package has amassed .

Unicode abused in Python

Unicode is a comprehensive character encoding standard encompassing a wide range of scripts and languages, unifying various sets/schemes under a common standard covering over 100,000 characters.

It was created to help maintain interoperability and consistent text representation across different languages and platforms and eliminate encoding conflicts and data corruption issues.

The “onyxproxy” package contains a “” package with thousands of suspicious code strings that use a mix of Unicode characters.

Using a mix of Unicode characters
Using a mix of Unicode characters (Phylum)

While the text in these strings, besides the different fonts, looks almost normal in visual inspections, it makes a massive difference for Python interpreters that will parse and recognize these characters as fundamentally different.

For example, Phylum explains that Unicode has five variants for the letter “n” and 19 for the letter “s” for use in different languages, mathematics, etc. For example, the “self” identifier has 122,740 (19x19x20x17) ways to be represented in Unicode.

Python’s support for using Unicode characters for identifiers, i.e., code variables, functions, classes, modules, and other objects, allows coders to create identifiers that appear identical yet point to different functions.

In the case of onyxproxy, the authors used the identifiers “__import__,” “subprocess,” and “CryptUnprotectData,” which are larger and have a vast number of variants, easily beating string-matching-based defenses.

Variants count for the malicious identifiers
Variants count for the used identifiers

Python’s Unicode support can be easily abused to hide malicious string matches, making code appear innocuous while still performing malicious behavior. In this case, the stealing of sensitive data and authentication tokens from developers.

While this obfuscation method isn’t particularly sophisticated, it is worrying to see it employed in the wild and might be a sign of broader abuse of Unicode for Python obfuscation.

“But, whomever this author copied this obfuscated code from is clever enough to know how to use the internals of the Python interpreter to generate a novel kind of obfuscated code, a kind that is somewhat readable without divulging too much of exactly what the code is trying to steal,” concludes Phylum.

The risks of Unicode in Python have been  in the Python development community in the past.

Other researchers and developers have also  that Unicode support in Python will make the programming language vulnerable to a new class of security exploits, making submitted patches and code harder to inspect.

In November 2021, academic researchers presented a theoretical attack called “” that used Unicode control characters to inject vulnerabilities into source code while making it harder for human reviewers to detect those malicious injections.

In conclusion, these attacks are now confirmed, and defenders must implement more robust detection mechanisms against these emerging threats.