The Latest in IT Security

Spammers Take Advantage of Unicode Normalization to Hide URLs


by Francisco Pardo and Nick Johnston

Spammers are never idle when it comes to finding new ways to bypass mail filters; after all, this is crucial to a spammer's success. Recently, we've seen a low but steady number of spam messages in which spammers replace certain characters in URLs (which point to spam sites) with Unicode characters that look similar or identical. This is yet another way of obfuscating URLs in an attempt to make them more difficult to analyze.

To understand how this technique works, a bit of knowledge of the Unicode standard is helpful. As well as specifying a large repertoire of characters, Unicode also provides normalization rules for converting equivalent or compatible characters to a single form. For example, under the compatibility normalization forms (NFKC and NFKD), an encircled number is considered equivalent to the corresponding ordinary number. This latest spammer-led URL obfuscation technique relies on the HTML-rendering engine in mail clients (or Web browsers for Web-based email) to apply the appropriate Unicode normalization to URLs.
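The encircled-number example can be checked with a minimal Python sketch using the standard library's `unicodedata` module. The helper name `normalize_all` is our own, introduced only for illustration.

```python
import unicodedata

def normalize_all(text):
    """Return the text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

circled_one = "\u2460"  # CIRCLED DIGIT ONE
forms = normalize_all(circled_one)

# Show the code points that each form produces.
for form, result in forms.items():
    print(form, [f"U+{ord(ch):04X}" for ch in result])
```

Only the compatibility forms (NFKC and NFKD) fold the circled digit to an ordinary "1"; the canonical forms (NFC and NFD) leave it unchanged.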

For example, a spam message could contain a URL in which two characters have been swapped for look-alikes. At first glance, the period or dot may look like a normal dot character, but it has actually been replaced with Unicode character U+2024, "ONE DOT LEADER". The "l" in the top-level domain also appears to be a normal Latin letter "l", but is actually Unicode character U+217C, "SMALL ROMAN NUMERAL FIFTY". A filter that compares the raw text against known spam domains sees a different string and may not recognize it. When a Web browser or mail client HTML-rendering engine processes this URL, however, it typically applies Unicode normalization to it, replacing the "ONE DOT LEADER" character with a normal dot and the "SMALL ROMAN NUMERAL FIFTY" with a normal "l" character, allowing the user to visit the spam site.
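The substitution and its reversal can be sketched in Python as follows. The URL here is hypothetical (it uses the reserved `.invalid` top-level domain and is not the one from the spam sample), and plain NFKC normalization stands in for whatever processing a particular rendering engine actually performs, which we assume rather than know.

```python
import unicodedata

# Hypothetical obfuscated URL, not taken from a real spam message:
# U+2024 replaces the dot and U+217C replaces the "l" in the TLD.
obfuscated = "http://example\u2024inva\u217cid/"

def describe_non_ascii(text):
    """List (code point, Unicode name) for every non-ASCII character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ord(ch) > 127]

# NFKC is an assumption standing in for the engine's own normalization.
normalized = unicodedata.normalize("NFKC", obfuscated)

print(describe_non_ascii(obfuscated))
print(normalized)
```

The first line of output names the two look-alike characters, and the second shows the plain ASCII URL that the user would end up visiting.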

In a sense, this is similar to internationalized domain name (IDN) homograph attacks, in which similar-looking Unicode characters are used to lead users to fake sites, often for phishing purposes. However, this technique differs because it uses similar Unicode characters to obfuscate the URL of a site rather than to fake or spoof a site.
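One way to see the difference is that the obfuscating characters collapse to plain ASCII under normalization, whereas a typical homograph character does not. The sketch below illustrates this under our own assumptions: the function name `folds_to_ascii` and both hostnames are hypothetical, and NFKC is used as an approximation of client behavior rather than a description of any specific filter or browser.

```python
import unicodedata

def folds_to_ascii(host):
    """True if host contains non-ASCII characters that NFKC turns into pure ASCII."""
    normalized = unicodedata.normalize("NFKC", host)
    return (not host.isascii()) and normalized.isascii()

# Hypothetical hostnames for illustration only.
obfuscated_host = "example\u2024inva\u217cid"  # ONE DOT LEADER, SMALL ROMAN NUMERAL FIFTY
homograph_host = "ex\u0430mple.invalid"        # CYRILLIC SMALL LETTER A
plain_host = "example.invalid"

for host in (obfuscated_host, homograph_host, plain_host):
    print(repr(host), folds_to_ascii(host))
```

The obfuscated hostname normalizes to an ordinary ASCII name, so it leads to the same site as the plain spelling. The Cyrillic look-alike survives normalization and therefore names a different domain, which is what a homograph attack depends on.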
