Maybe you’re building internationalized code and wondering how to build a whitelist filter that will support all the different character sets you’re planning to support. If you support more than ten, especially some of the larger East Asian sets, this might seem like an unwieldy or tricky process.
Luckily, it’s easier than most people would think. Building a good input validation filter can be simplified with .NET’s GetUnicodeCategory. Use the CharUnicodeInfo method from the System.Globalization namespace, though, as the other one on System.Char looks like it may become the subordinate of the two.
With GetUnicodeCategory you can simply build a whitelist of the character categories you want to allow. So get away from thinking you have to write a regex filter and list out all the character ranges you want to allow in each character set. It’s much simpler than that!
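To get a feel for what the method returns, here is a minimal sketch that reports the category of a few characters from different scripts. The class name CategoryDemo, its two methods, and the sample characters are made up for illustration; only CharUnicodeInfo.GetUnicodeCategory and the UnicodeCategory enum come from the framework.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of the .NET Framework.
public static class CategoryDemo
{
    // Formats one UTF-16 code unit as "U+XXXX CategoryName".
    public static string Describe(char c)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
        return string.Format("U+{0:X4} {1}", (int)c, category);
    }

    // Prints the category of a handful of sample characters.
    public static void PrintSamples()
    {
        // Latin letters, a digit, e-acute, a CJK ideograph,
        // an Arabic-Indic digit, a hyphen, and a less-than sign.
        char[] samples = { 'a', 'Z', '7', '\u00E9', '\u4E2D', '\u0663', '-', '<' };
        foreach (char c in samples)
        {
            Console.WriteLine(Describe(c));
        }
    }
}
```

Calling CategoryDemo.PrintSamples() from your own Main should show that the CJK ideograph lands in a letter category without you ever naming its code point range, while the less-than sign lands in a symbol category.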
The Unicode standard assigns every character to one of 30 general categories. They make sense too, for example Control (Cc), Lowercase Letter (Ll), Uppercase Letter (Lu), and Math Symbol (Sm). So, for example, you might want to allow only letters, decimal digits, non-spacing marks, and dash and connector punctuation in your whitelist. This could be achieved with the following snippet:
using System.Globalization;

char cUntrustedInput = 'a'; // placeholder value; in practice, the untrusted user input
UnicodeCategory cTestCategory = CharUnicodeInfo.GetUnicodeCategory(cUntrustedInput);
if (cTestCategory == UnicodeCategory.LowercaseLetter ||
    cTestCategory == UnicodeCategory.UppercaseLetter ||
    cTestCategory == UnicodeCategory.DecimalDigitNumber ||
    cTestCategory == UnicodeCategory.TitlecaseLetter ||
    cTestCategory == UnicodeCategory.OtherLetter ||
    cTestCategory == UnicodeCategory.NonSpacingMark ||
    cTestCategory == UnicodeCategory.DashPunctuation ||
    cTestCategory == UnicodeCategory.ConnectorPunctuation)
{
    // character is in an allowed category, continue
}
else
{
    // character is not allowed, fail
}
Not too bad, eh?
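Real input usually arrives as a string rather than a single char, so the same check can be wrapped in a loop. The following is only a sketch under a few assumptions: the InputWhitelist class and its IsAllowed method are hypothetical names, the allowed set is the same eight categories as the snippet above, and treating null as a failure and an empty string as a pass is an arbitrary policy choice you may want to change.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration; not part of the .NET Framework.
public static class InputWhitelist
{
    // The same eight categories allowed by the snippet above.
    private static readonly HashSet<UnicodeCategory> Allowed = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.UppercaseLetter,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.OtherLetter,
        UnicodeCategory.DecimalDigitNumber,
        UnicodeCategory.NonSpacingMark,
        UnicodeCategory.DashPunctuation,
        UnicodeCategory.ConnectorPunctuation
    };

    // Returns true only if every UTF-16 code unit falls in an allowed category.
    public static bool IsAllowed(string untrustedInput)
    {
        if (untrustedInput == null)
        {
            return false; // assumption: missing input is treated as a failure
        }

        foreach (char c in untrustedInput)
        {
            // Characters outside the Basic Multilingual Plane show up here as
            // surrogate code units (category Surrogate), so this sketch rejects them.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            if (!Allowed.Contains(category))
            {
                return false;
            }
        }
        return true;
    }
}
```

With this in place, InputWhitelist.IsAllowed("hello_world-42") passes, while a value containing a space or a less-than sign fails because those characters fall in the SpaceSeparator and MathSymbol categories.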
Brilliant, nice to see how easy this can be. Great work, man, you rule!
Comment by Brian — April 24, 2007 @ 9:53 am
[...] chrisweber blogged about a good topic today on chrisweber.wordpress.com. Here’s a summary…. [...]
Pingback by I18N input validation whitelist filter with System.Globalization and GetUnicodeCategory — October 12, 2007 @ 6:39 pm
I would like to see a continuation of the topic.
Comment by Maximus — December 20, 2007 @ 1:47 am