Unicode Gatekeepers
S. Gilles
2017-06-01

The problem

There has been a lot[0][1] of thought put into the security implications of Unicode. A popular concern is that harmless string A might be confused with evil string B by a human observer, who might then be tricked into giving their login credentials to an attacker, or into running code signed by an attacker.

That's not the part I'm interested in right now, however. There's another concern: a system which accepts Unicode gives spammers a far wider playing field. It might not be a security risk for a phishing attempt to include a link to paypa🄻.com, but it's annoying to receive an email offering “DISCOUNT GENERIC 🄰 🅂 🄿 🄸 🅁 🄸 🄽 ” because your spam filter was only trained on ASCII. In the language of UTS #39, this is “fooling mechanical gatekeepers”.

UTS #39 is designed to handle the first concern. It includes (or references, or describes, whatever) a file[2] named confusables.txt, which defines whatever equivalence classes of Unicode glyphs the Consortium feels are relevant. There are some other bits like multi-glyph decomposition, but those aren't critical.

The problem is that there is no equivalent of confusables.txt for gatekeepers. But suppose we really want one. Suppose we're building software for a mailing list, or a BBS, or something, and we want to give administrators the ability to detect Unicode-obfuscated spam by providing their own blacklist of keywords. How hard would that be?

Initial reaction: “Don't do that”

One obvious and completely foolproof method is to not support all of Unicode. Messages must be in ASCII, or Latin-1, or some other small range. The problem with this is that people tend to expect Unicode support these days. Even Slashcode was updated[3] to handle UTF-8 in 2014, after all. For the sake of argument, let's assume that our biggest customer has a habit of learning new languages and will be upset if the software has to be reconfigured for each new alphabet.

Best attempt

A second approach might be to map everything through the confusables data before applying our check. We might perform something along the lines of the following (untested):

    typedef struct {
        Rune from;
        Rune *to;
    } replacement_t;

    /* Kept sorted by `from', so bsearch() works. */
    replacement_t reps[] = {
        ...
    };
    #define NUM_REPLACEMENTS ((sizeof reps) / (sizeof reps[0]))

    static int
    rcomp(const void *a, const void *b)
    {
        const Rune *k = a;
        const replacement_t *rep = b;

        if (*k < rep->from) {
            return -1;
        } else if (*k > rep->from) {
            return 1;
        }
        return 0;
    }

    /* Map each rune through reps, passing unmapped runes straight
     * through.  append_to_string() (grow a heap-allocated rune
     * string) and rune_strstr() below are assumed helpers. */
    static Rune *
    normalize(const Rune *s)
    {
        Rune *out = 0;
        Rune single[2] = { 0, 0 };
        const Rune *p;
        replacement_t *v;

        for (p = s; *p; p++) {
            if ((v = bsearch(p, reps, NUM_REPLACEMENTS, sizeof *reps, rcomp))) {
                append_to_string(&out, v->to);
            } else {
                single[0] = *p;
                append_to_string(&out, single);
            }
        }
        return out;
    }

    int
    check_spam(const Rune *message, const Rune *badword)
    {
        Rune *normalized_message = normalize(message);
        Rune *normalized_badword = normalize(badword);
        int ret = 0;

        if (rune_strstr(normalized_message, normalized_badword)) {
            ret = -1;
        }
        free(normalized_message);
        free(normalized_badword);
        return ret;
    }

Aside from knowing the ‘...’ of reps, this isn't impossible, and is even readable. If the algorithm flows this way, we can use sed or awk to turn confusables.txt into the contents of ‘...’.
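For instance, the generated contents of ‘...’ might look like the following. This is only a sketch: the two mappings shown are real confusables.txt entries (Cyrillic а to “a”, and “m” to “rn”), but the array names are invented, and the important constraint is the one noted above, that the table stays sorted by `from`.

    /* Replacement strings are zero-terminated Rune arrays. */
    static Rune to_a[]  = { 0x0061, 0 };          /* "a"  */
    static Rune to_rn[] = { 0x0072, 0x006E, 0 };  /* "rn" */

    replacement_t reps[] = {
        { 0x006D, to_rn }, /* LATIN SMALL LETTER M    -> "rn" */
        { 0x0430, to_a  }, /* CYRILLIC SMALL LETTER A -> "a"  */
        /* ...more entries generated from confusables.txt... */
    };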
Issues

A few problems show up immediately.

o Extraneous characters. U+0224 LATIN CAPITAL LETTER Z WITH HOOK should probably be replaced with a simple Z, but it is listed as confusable with the sequence U+005A, U+0326 (a Z followed by COMBINING COMMA BELOW).

o Overzealous replacing. “m” is confusable with “rn”, so it becomes impossible to distinguish “yarn” from “yam”, in case we need to censor vegetables or something. This could be a problem.

o Missing replacements. Codepoints like U+1F130 SQUARED LATIN CAPITAL LETTER A should probably map to good old U+0041 LATIN CAPITAL LETTER A. However, since the two glyphs are visually quite distinct, they are not considered confusable.

The last point is the most concerning. As newer releases of Unicode offer more characters, confusables.txt may be updated, but since there is no “gatekeeper-confusables.txt”, the missing replacements will have to be patched by hand, release after release.

Finally

The solution I'm using is the following (general) algorithm for filling in the ‘...’ above. Depending on the real-world testing it gets, it may need tweaking. (A sketch of the final fixed-point step appears after the references.)

    Parse confusables.txt into a map T: codepoint -> string.
    Parse UnicodeData.txt[4], and
        For every codepoint C
            If the general category is M, C, P, or Sk
                Make T delete that codepoint, by T(C) := ''
            End
            T(U+0020) := ' ' (it was mistakenly deleted above)
            If there's no mapping
                If UnicodeData.txt indicates a decomposition D1, D2, ...
                    T(C) := D1, D2, ...
                End
            End
        End
    (Now T deletes the ‘extra’ characters.)
    Delete mappings for '0', '1', 'I', 'm', 'w', '|'.
    Add mappings
        T(U+0460) := 'w'
        T(U+FF29) := 'I'
        T(U+1F170) := 'A' (and so on through 'Z')
        T(U+00A9) := 'C'
        T(U+00AE) := 'R'
        T(U+00DF) := 'B'
        T(U+01AB) := 't'
        T(U+0272) := 'n'
        T(U+0274) := 'N'
        T(U+0291) := 'z'
        T(U+0298) := 'O'
        T(U+029F) := 'L'
        T(U+02B3) := 'r'
        T(U+0629) := 'o'
        T(U+0644) := 'J'
        T(U+1472) := 'b'
        T(U+1473) := 'b'
        T(U+1D07) := 'E'
        T(U+2117) := 'P'
        T(U+2365) := 'O'
        T(U+A793) := 'e'
        T(U+2776) := '(1)' (and so on through '(10)')
        T(U+2780) := '(1)' (and so on through '(10)')
        T(U+278A) := '(1)' (and so on through '(10)')
    Until T is fixed
        T := T ∘ T
    Output T in the form needed for rune transformation.

It would be nice for the Unicode Consortium to address this more concretely.

[0] http://www.unicode.org/reports/tr36
[1] http://www.unicode.org/reports/tr39
[2] http://www.unicode.org/Public/security/latest/confusables.txt
[3] https://raw.githubusercontent.com/SoylentNews/rehash/fe00cc16d95c4b8bef921f9757a6af29712c7501/README.utf8
[4] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
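To make the last two steps concrete, here is a minimal C sketch of the “Until T is fixed, T := T ∘ T” composition. Everything in it is an assumption for illustration: lookup(), set_mapping(), num_mappings(), and nth_source() are hypothetical accessors for the map T, the Rune typedef stands in for a libutf-style rune, and the fixed-size buffer is only for brevity.

    #include <stdlib.h>
    #include <string.h>

    typedef unsigned int Rune;  /* stand-in for the libutf-style Rune above */

    /* Hypothetical accessors for the map T. */
    Rune *lookup(Rune c);               /* T(c), or NULL if unmapped */
    void set_mapping(Rune c, Rune *to); /* T(c) := to (takes ownership) */
    size_t num_mappings(void);          /* size of T's domain */
    Rune nth_source(size_t i);          /* i-th codepoint in T's domain */

    /* Rewrite T(c) by mapping each rune of its replacement string
     * through T again; returns 1 if anything changed. */
    static int
    expand_once(Rune c)
    {
        Rune buf[64];
        const Rune *p;
        size_t n = 0;
        int changed = 0;

        for (p = lookup(c); *p && n < 63; p++) {
            Rune *sub = lookup(*p);
            if (sub) {
                changed = 1;
                while (*sub && n < 63)
                    buf[n++] = *sub++;
            } else {
                buf[n++] = *p;
            }
        }
        buf[n] = 0;
        if (changed) {
            Rune *copy = malloc((n + 1) * sizeof *copy);
            memcpy(copy, buf, (n + 1) * sizeof *copy);
            set_mapping(c, copy);
        }
        return changed;
    }

    /* T := T ∘ T until nothing changes.  This terminates only
     * because the mappings are acyclic: no replacement string
     * eventually maps back to its own source codepoint. */
    void
    compose_until_fixed(void)
    {
        size_t i;
        int changed;

        do {
            changed = 0;
            for (i = 0; i < num_mappings(); i++)
                changed |= expand_once(nth_source(i));
        } while (changed);
    }

Once the composition reaches its fixed point, every entry of T maps directly to its final form, so the reps table from the “Best attempt” section needs only a single pass over the message.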