Regular Expressions, Unicode, PHP and pain
You have all kinds of people saying we should do this and that to make our sites more accessible and secure, whether it's high-quality SSL certificates or features that make our sites easier to use for people with disabilities. But some of these are so tricky to do.
Take internationalisation (or i18n for short). Sounds good, right? Make your website translatable into other languages. One of the results of this? My input fields now need to validate all Unicode characters, not just a-z.
I have learned this is a pain.
JavaScript regexes have historically had almost no Unicode support: without the newer u flag and \p{...} property escapes, you can only match specific Unicode characters. That is shocking in this day and age but hey, what can you do? What I had to do was disable client-side validation for any text field regexes on my site.
PHP is slightly weird. You have to specify u at the end of the regex to make it Unicode. The reason, apparently, is that Unicode can make things slower so it is opt-in but that's not too bad since it is documented.
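To see what the u modifier changes, here is a minimal sketch (assuming PHP 7 or later for the "\u{...}" string escape, and variable names that are mine). It counts how many times the dot matches a single accented letter, with and without the modifier:

```php
<?php
// "é" (U+00E9) is stored as two bytes in UTF-8: 0xC3 0xA9.
$subject = "\u{00E9}";

// Without the u modifier PCRE works on bytes, so the dot matches twice.
$byteMatches = preg_match_all('/./s', $subject);

// With the u modifier PCRE works on code points, so the dot matches once.
$codePointMatches = preg_match_all('/./su', $subject);

echo "without u: $byteMatches, with u: $codePointMatches\n";
// prints "without u: 2, with u: 1"
```

So without the modifier the regex is not even looking at characters, only at the bytes that encode them.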
Now how do I do something simple like match a set of Unicode characters?
You would think it would be easy but no. The dot does not match Unicode characters the way you might expect: with the u modifier it matches one code point, not one visible character. This is not that surprising because a Unicode character, or rather a grapheme (what is displayed on the screen), is not a byte and is not necessarily even a single group of bytes: it may be several code points (each encoded as 1 or more bytes) that together produce a single grapheme. For instance, the two code points U+0061 (a) and U+0300 (combining grave accent) combine to produce a single grapheme that looks like this: à. To add to the confusion, there is also a single precomposed code point, U+00E0, inherited from the Latin-1 character set (not ASCII, which has no accented letters), that displays as this same character (for historical reasons). The dot would match the precomposed character once but it would match the two-code-point version twice: once for the a and once for the accent - one of the many "combining marks" that get added to the preceding letter. This is unlikely to be what you want.
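The two forms of à can be compared directly. This is just an illustrative sketch (PHP 7 or later assumed, variable names are mine) that counts dot matches for each form:

```php
<?php
$precomposed = "\u{00E0}";   // à as the single code point U+00E0
$decomposed  = "a\u{0300}";  // a followed by COMBINING GRAVE ACCENT

// Both render as one grapheme, but they are different byte sequences.
$sameBytes = ($precomposed === $decomposed);

// With the u modifier the dot matches once per code point.
$dotOnPrecomposed = preg_match_all('/./u', $precomposed);
$dotOnDecomposed  = preg_match_all('/./u', $decomposed);

var_dump($sameBytes);                                  // bool(false)
echo "$dotOnPrecomposed vs $dotOnDecomposed\n";        // prints "1 vs 2"
```

Same thing on screen, two different answers from the dot.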
So what? In my case, I just want a sequence of Unicode characters - I won't be counting them - so why not just use the dot anyway? Because if the user typed in a load of characters that were not actually proper letters but were just combining marks, the regex should fail but with the dot it would pass.
So I could just use \X, right? That is supposed to work in PHP. Well, except that it didn't seem to work at all for me, even though it is specifically what I wanted to use. No idea why, it just didn't work.
What I ended up having to do was use the long-hand version of \X (or close enough), which is (?>\P{M}\p{M}*) and which basically says: find any Unicode code point that is not a combining mark, followed by zero or more combining marks.
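As a rough sketch of how that pattern might be used for validation, here it is anchored and repeated inside a small helper. The function name isGraphemeSequence and the sample strings are my own inventions for illustration, and I am assuming PHP 7 or later:

```php
<?php
// Hypothetical helper: true if $input is one or more "base code point
// plus optional combining marks" groups, and nothing else.
function isGraphemeSequence(string $input): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error
    // (for example invalid UTF-8), so compare strictly against 1.
    return preg_match('/^(?>\P{M}\p{M}*)+$/u', $input) === 1;
}

$marksOnly = "\u{0300}\u{0301}"; // two combining marks, no base letter

$results = [
    'abc'            => isGraphemeSequence('abc'),            // true
    'decomposed à'   => isGraphemeSequence("a\u{0300}bc"),    // true
    'marks only'     => isGraphemeSequence($marksOnly),       // false
    'empty string'   => isGraphemeSequence(''),               // false
];

// For comparison, the plain dot happily accepts the marks-only string.
$dotAcceptsMarks = preg_match('/^.+$/u', $marksOnly) === 1;   // true

var_dump($results, $dotAcceptsMarks);
```

Note that this only checks the shape of the input. It still accepts digits, punctuation and whitespace, since none of those are combining marks, so you would tighten \P{M} to something narrower if you only want letters.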
As with any regex work, test it and test it some more. Think of the types of data you want to and don't want to accept. Also remember that most regexes are not perfect so if you need them to be, either be prepared to do a LOT of testing or have another system that can check the data after it has been passed by the regex to perform other automated tests.