URL Encoding, Percent Encoding and what to do with spaces, %20 and +
We constantly stumble over weirdness in the world of IT, partly because of so many competing standards or ideas and partly because of legacy problems that often cannot be fixed because of chicken/egg problems between web servers and clients/browsers.
This one had me confused and stumped for a while, and it is related to the various flavours of URL encoding. The basic idea is that if you want to pass a URL inside another URL, the repeated occurrences of things like http:// in the full string could confuse the web server and make the URL unparseable. The solution is that any "reserved" characters that have special meaning in a normal URL can be encoded in the form %HH, where HH is the character's byte value written as two hex digits. For instance, you might have seen http:// replaced with http%3A%2F%2F; at the web server end, the reverse is carried out to recover the original text.
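As a rough illustration of the idea, here is a minimal Python sketch using the standard library's urllib.parse module. The example.com and example.org addresses and the "target" parameter name are made up for the example.

```python
from urllib.parse import quote, unquote

# Hypothetical inner URL that we want to pass as a parameter of another URL.
inner = "http://example.com/page?id=1"

# safe="" asks quote() to escape "/" too, which it leaves alone by default.
encoded = quote(inner, safe="")
outer = "http://example.org/redirect?target=" + encoded

print(encoded)                    # http%3A%2F%2Fexample.com%2Fpage%3Fid%3D1
print(unquote(encoded) == inner)  # True
```

After encoding, the outer URL contains only one literal "://", so there is nothing left in the parameter for a parser to trip over.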
The same is true of any data that contains reserved characters and is sent via a URL, not just other URLs. For instance, if you generate or send some kind of random code to the web server that could include reserved characters, you would need to do the same thing to avoid confusing the web server. In the case of non-URLs (such as random codes), you might instead choose to encode the data with something like Base64 before passing it on the URL to make things much neater (ideally the URL-safe variant, since standard Base64 output can itself contain +, / and =). In fact, you could do that when passing URLs too, but Base64 has a size overhead of about 33%, which would make a long URL noticeably longer, so URL encoding (or percent encoding, as it is also called) is usually the weapon of choice.
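A small sketch of the Base64 option in Python, assuming the "random code" is just a few raw bytes. The byte values are picked deliberately so that the standard alphabet produces characters that would themselves need percent encoding.

```python
import base64
from urllib.parse import quote

# Three bytes chosen to hit the awkward characters in the standard alphabet.
code = bytes([0xFB, 0xFF, 0xFE])

standard = base64.b64encode(code).decode("ascii")
url_safe = base64.urlsafe_b64encode(code).decode("ascii")

print(standard)                  # +//+
print(url_safe)                  # -__-
print(quote(standard, safe=""))  # %2B%2F%2F%2B

# 30 raw bytes become 40 Base64 characters, which is the 33% overhead.
overhead_sample = len(base64.urlsafe_b64encode(bytes(30)))
print(overhead_sample)           # 40
```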
So far, so good, but there are some questions. What happens if I want to pass a URL that already contains something like %3A in it? Well, encoding it would, as you might expect, replace the percent symbol with %25, while the 3A, which is made up of unreserved characters, stays as it is, so you would end up with %253A. What this means is that you must be really careful not to encode a string more than once: every time you encode, every percent symbol is escaped again, and you would need to decode the same number of times at the other end. You should only ever do it once.
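The double-encoding trap is easy to reproduce. A short Python sketch, again using urllib.parse:

```python
from urllib.parse import quote, unquote

original = "%3A"

once = quote(original, safe="")
twice = quote(once, safe="")

print(once)   # %253A
print(twice)  # %25253A

# One decode undoes one encode; an extra decode changes the data.
print(unquote(once))           # %3A
print(unquote(unquote(once)))  # :
```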
Another question: why are there two ways of encoding spaces? Well, back in the bad old days, when the MIME type application/x-www-form-urlencoded was specified for passing form data between browser and server, someone decided that it would use normal percent encoding EXCEPT that spaces wouldn't become %20 but + symbols instead (and newlines are normalised). It works, but it is confusing and it smells of a shortcut that should never have been taken. The real problem is not that it doesn't work but that it can cause all manner of compatibility problems. For instance, if you use an encoder that produces output in line with these HTML specs, it will encode spaces as pluses, and any literal pluses become %2B. In other words, encoding "Hello There+" would produce "Hello+There%2B", which looks strange since + is supposed to be reserved. If you then tried to decode this with a decoder that wasn't designed for application/x-www-form-urlencoded, you would incorrectly get "Hello+There+".
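In Python, for instance, the form-style functions in urllib.parse carry a _plus suffix, which makes the mismatch described above easy to demonstrate:

```python
from urllib.parse import quote_plus, unquote, unquote_plus

# Form-style (application/x-www-form-urlencoded) encoding of the sample text.
form_encoded = quote_plus("Hello There+")
print(form_encoded)  # Hello+There%2B

# The matching form-style decoder restores the original text.
print(unquote_plus(form_encoded))  # Hello There+

# A plain percent decoder leaves the + alone, so the space is lost.
print(unquote(form_encoded))       # Hello+There+
```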
Any encoder that is NOT specifically for application/x-www-form-urlencoded will replace spaces with %20, which is far more consistent. "Hello There+" => "Hello%20There%2B"
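Some libraries let you choose which flavour you get. As one example, Python's urlencode() builds query strings in the form style by default but accepts a quote_via argument to switch to plain percent encoding. The "q" parameter name here is only an assumption for the sketch.

```python
from urllib.parse import quote, urlencode

params = {"q": "Hello There+"}  # hypothetical query parameter

# Default behaviour: form-style encoding, so the space becomes a plus.
form_style = urlencode(params)
print(form_style)     # q=Hello+There%2B

# quote_via=quote switches to plain percent encoding, so the space becomes %20.
percent_style = urlencode(params, quote_via=quote)
print(percent_style)  # q=Hello%20There%2B
```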
The moral here is to test exactly what your encoders and decoders are doing, especially when your data may or may not contain spaces or pluses, because otherwise you might find that something works one day and fails the next. The simplest way is to write a short test that encodes something like "Hello There+" and see what it produces. If it replaces the space with a plus, test that your decoder replaces the + with a space. If your data contains pluses and is NOT encoded at all, make sure your web server/service/application is not automatically decoding it and replacing the + with a space. If it is, you might have to encode the data even though a literal + is strictly safe to send in a URL.
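Such a test only takes a few lines. The sketch below uses Python's urllib.parse functions as stand-ins for whatever encoder and decoder you actually have; round_trips is a helper name invented for the example.

```python
from urllib.parse import parse_qs, quote, quote_plus, unquote, unquote_plus

SAMPLE = "Hello There+"

def round_trips(encode, decode, text=SAMPLE):
    """Return True if decode(encode(text)) gives back the original text."""
    return decode(encode(text)) == text

print(round_trips(quote_plus, unquote_plus))  # True
print(round_trips(quote, unquote))            # True
print(round_trips(quote_plus, unquote))       # False: the + stays a plus

# An unencoded plus in a query string is silently turned into a space
# by a form-style parser such as parse_qs.
raw = parse_qs("code=a+b")
print(raw)   # {'code': ['a b']}

# Encoding the plus first keeps it intact.
kept = parse_qs("code=" + quote("a+b", safe=""))
print(kept)  # {'code': ['a+b']}
```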