Well this was a journey.

TLDR; It happens when you make a non-CORS request that the browser caches. A subsequent CORS request for the same object finds the cached item and tries to use it - it fails the CORS check because the cached response (from that first, non-CORS request) had neither CORS headers nor a Vary: Origin header. This currently affects Chrome but not other browsers, for largely accidental reasons.

So this is the story: we have a web app making a cross-domain call to a CDN that holds static assets. With me so far?

That CDN is CloudFront on AWS, which uses S3 as its origin. Cheap and relatively easy to set up (I still find AWS a bit like learning a new programming language, but I digress...).

I set up CORS, which the browser requires for the kind of cross-origin calls we were making, and it worked. Well, most of the time.

Sometimes, very occasionally, we would get CORS errors in the console and the resource wouldn't load. Kind of a deal-breaker when you are loading site.css from the CDN! The error message? "Access to CSS stylesheet at blah from origin blah has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource."

This is another classic example of an error that is completely meaningful, but only once you have learned what it means. As far as I understood, the file definitely had the correct headers because it worked most of the time.

Google showed me lots of people with similar random problems and no clue about what was causing them. Some people suggested setting up CloudFront correctly, which is not useful when the errors are random. Others came out with the usual snake oil and "this worked for me", but in this case they were mostly wrong. The OP would make the change, it would seem to work because the error was random, and then some time later: "The errors came back, what else could it be?"

Fortunately there are some people who do know what is happening (Thank you Michael - Sqlbot).

First you need to understand a bit about CORS - one of the most misunderstood controls in HTTP. Like many mechanisms, it was designed not to break existing systems, which led to some of its complexity. It is also a security mechanism that doesn't actually provide any security except where clients opt in to the functionality!

Presumably you know that a web page can include links to resources from anywhere. Many of us include jQuery or Bootstrap etc. in our pages, and often we get these from CDNs that live on a different origin from the page. This is a cross-origin request and is normal. However, in some cases people expose resources on an origin that they want to use on their own website, but they don't want all the haters embedding those resources in their sites, making the victim pay for bandwidth and hosting, or lifting valuable assets (e.g. a news feed or photo list) into someone else's site. This is where CORS comes in.

When using CORS, the client sends the origin of the page requesting the resource in the "Origin" header to, say, the CDN. In other words, shared.cdn.com/something is being requested from a page at origin www.stealingotherpeoplesstuff.com. Now this is where it gets weird. CORS doesn't say that the shared resource should not be returned; it says the server should return headers confirming or denying that the client is allowed to proceed. Why? Because they didn't want to break existing browsers that wouldn't be sending these new headers and were already using a setup that worked OK. I personally think the W3C should set deadlines for new functionality to be implemented, otherwise we live with backwards compatibility forever, but maybe they are nicer than me.

So the client (let's assume a browser) then looks at the returned headers, specifically at "Access-Control-Allow-Origin", and checks that it contains either a * (all origins are welcome!) or a specific origin matching the one sent in the request, meaning "you are specifically allowed". Note that even if the server allows multiple origins, only the one from the request is sent back in the header, unless all origins are allowed, in which case * is sent. If the header is present, the browser carries on as usual; if not, the browser blocks the use of the content and puts an error in the console explaining why.
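
To make that concrete, here is a minimal sketch of the server side of that exchange. It is not CloudFront's or S3's actual logic; the allow-list and origins are hypothetical, and it just shows the allowed origin being echoed back (or * when everyone is allowed):

```typescript
// Minimal sketch (hypothetical allow-list): the server returns the resource either
// way and simply adds the CORS header when the requesting Origin is allowed.
import * as http from "http";

const allowedOrigins = ["https://www.example.com", "https://app.example.com"];
const allowAll = false; // flip to true to answer every origin with "*"

const server = http.createServer((req, res) => {
  const origin = req.headers["origin"];

  if (allowAll) {
    res.setHeader("Access-Control-Allow-Origin", "*");
  } else if (typeof origin === "string" && allowedOrigins.includes(origin)) {
    // Several origins may be allowed, but only the one that asked is echoed back...
    res.setHeader("Access-Control-Allow-Origin", origin);
    // ...which is exactly why the response should also say "Vary: Origin".
    res.setHeader("Vary", "Origin");
  }

  // The content is returned regardless; it is the *browser* that decides whether
  // the page may read it, based on the header above.
  res.setHeader("Content-Type", "text/css");
  res.end("/* site.css contents */");
});

server.listen(8080);
```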

So this all makes sense. It is slightly more complicated if the request is not "simple", in other words if it could have an effect on the endpoint you are calling, such as a DELETE or a POST with a non-trivial content type. For these unsafe requests, a "preflight" OPTIONS request is made first, to basically ask the question before making the real call. The response contains the same CORS headers (or not), but this time the browser will not make the call to the unsafe method if the preflight fails, because if it did, something bad might already have happened.
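
A quick, purely illustrative sketch of the difference (the URLs are made up for the example):

```typescript
// Illustrative only: which requests are sent directly and which get a preflight.
async function demo(): Promise<void> {
  // "Simple" request: sent straight away; the CORS headers on the response
  // decide whether the page may read the result.
  await fetch("https://shared.cdn.com/site.css");

  // Not simple (DELETE): the browser first sends a preflight by itself:
  //   OPTIONS /items/42
  //   Origin: https://www.example.com
  //   Access-Control-Request-Method: DELETE
  // and only performs the DELETE if the preflight response allows it.
  await fetch("https://api.example.com/items/42", { method: "DELETE" });
}
```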

All of this can be bypassed without CORS. If you type the URL of a protected resource into the URL bar and press enter, it simply loads. Why? It is not a cross-origin request! The URL you typed becomes the "page" you are in, so no Origin header is sent and the resource is returned. A bit weird, but like I said, this was a deliberate decision and it is where the original "bug" comes from.

OK, what happens when we introduce the browser cache? You have to decide when to use the cached item and when the cached item isn't suitable, meaning another call is made to the origin server. The most obvious part of a cache key is the host and path of the resource. In general, if a second call is made to the same item and it is in cache, you get the cached item.

Now, there are many caches and they don't all live inside the browser. What happens, for instance, if you cached something that was gzip encoded and it is then requested by a client that doesn't support (or doesn't want, for some reason) gzip? Using the path alone would fail, because the second request would get the cached gzip copy and it wouldn't work. So another piece of data is added to the cache key: the contents of the Vary header. The Vary header says, "you can use this item from cache as long as the values of these request headers match the original request". For example, if the Vary header contained "Accept-Encoding", the cache would store not only the URL but also the fact that the original request's Accept-Encoding was gzip. If another request for the same item came in asking for, say, deflate, the cache would know the cached item is not valid and would make another request. It is up to the origin to send this header based on anything in the request that could cause a different item to be returned. Often the web server adds Vary: Accept-Encoding automatically when it compresses content.
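
As a rough sketch (not any real browser's implementation), a Vary-aware cache key might be built something like this:

```typescript
// Rough sketch of a Vary-aware cache key: the URL plus the request's values
// for every header named in the stored response's Vary header.
type HeaderMap = Record<string, string>;

function cacheKey(url: string, requestHeaders: HeaderMap, vary: string): string {
  const varied = vary
    .split(",")
    .map((h) => h.trim().toLowerCase())
    .filter((h) => h.length > 0)
    .map((h) => `${h}=${requestHeaders[h] ?? ""}`);
  return [url, ...varied].join("|");
}

console.log(cacheKey("https://shared.cdn.com/site.css", { "accept-encoding": "gzip" }, "Accept-Encoding"));
// -> "https://shared.cdn.com/site.css|accept-encoding=gzip"

// A client that only accepts deflate produces a different key, so the gzip
// copy is not reused and the cache goes back to the origin instead.
console.log(cacheKey("https://shared.cdn.com/site.css", { "accept-encoding": "deflate" }, "Accept-Encoding"));
// -> "https://shared.cdn.com/site.css|accept-encoding=deflate"
```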

All good so far? The bug is actually an edge case, but one that is valid according to the specification. It occurs when the first request for a CORS-protected item is made without CORS, e.g. you enter the URL into the URL bar and press enter. Since it is not a CORS request, the Origin header is not sent. Since no Origin is sent, the CDN does not return any CORS response headers, and S3 doesn't reply with Vary: Origin either. So what gets cached? A plain item keyed on the path, with a Vary header that probably covers only Accept-Encoding, and no CORS headers. This is all valid, although arguably it is what causes the whole problem.

Next, you make a CORS request. Does it match the path of the cached item? Yes. Does it match the Accept-Encoding recorded via the Vary header? Yes. Great, we can use this. The CORS check is then made against an object that has no CORS headers, so of course it fails.
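
Putting the pieces together, here is a self-contained sketch of that sequence (all values are hypothetical):

```typescript
// Models a cache entry the way the Vary rules describe: the URL, the header
// names from the stored response's Vary header, and the request values that
// were recorded for them when the entry was stored.
interface CacheEntry {
  url: string;
  vary: string[];
  storedRequestHeaders: Record<string, string>;
  responseHeaders: Record<string, string>;
}

// 1. Address-bar visit: no Origin was sent, so the stored response has no CORS
//    headers and its Vary header never mentions Origin.
const entry: CacheEntry = {
  url: "https://shared.cdn.com/site.css",
  vary: ["accept-encoding"],
  storedRequestHeaders: { "accept-encoding": "gzip" },
  responseHeaders: { "content-type": "text/css" }, // no access-control-allow-origin
};

// 2. The page later makes a CORS request for the same file.
const request: { url: string; headers: Record<string, string> } = {
  url: "https://shared.cdn.com/site.css",
  headers: { "accept-encoding": "gzip", origin: "https://www.example.com" },
};

// Cache lookup: the URL matches and every header listed in Vary matches too.
// Origin is not listed in Vary, so it is never compared - cache hit.
const cacheHit =
  request.url === entry.url &&
  entry.vary.every((h) => request.headers[h] === entry.storedRequestHeaders[h]);

// 3. The CORS check then runs against the cached response, which has no
//    Access-Control-Allow-Origin header, so the browser blocks it.
const corsOk = "access-control-allow-origin" in entry.responseHeaders;

console.log({ cacheHit, corsOk }); // { cacheHit: true, corsOk: false }
```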

There are some workarounds, and while no single party is necessarily doing anything wrong, any of them could do something to make it work:

The browser probably can't do much, although it could reason: "there is an Origin header on this request and the cached item doesn't list Origin in its Vary header, so this isn't the right object, let's hit the network again".

S3 doesn't add a Vary: Origin response header, even though it is acting as a CDN origin here and that role really calls for one. Even a switch to enable it would make the problem go away.

CloudFront is probably not doing anything wrong. It passes the headers to S3 that it should and returns what it gets back. Ironically, this is the place suggested for the workaround, because it supports edge scripts, and the script to add (in the StackExchange link) says: if there is no Vary header in the response from S3, add one so that Chrome will cache the object correctly. A sketch of that idea follows below.
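
Something along these lines, assuming a Node.js Lambda@Edge function attached to CloudFront's origin-response trigger (the exact script in the linked answer may differ):

```typescript
// Minimal sketch of the workaround: if S3 didn't send a Vary header, add
// "Vary: Origin" before CloudFront caches the response, so a copy stored
// without CORS headers is never reused for a later CORS request.
type CloudFrontHeaders = Record<string, { key?: string; value: string }[]>;

interface OriginResponseEvent {
  Records: { cf: { response: { status: string; headers: CloudFrontHeaders } } }[];
}

export const handler = async (event: OriginResponseEvent) => {
  const response = event.Records[0].cf.response;

  // CloudFront exposes headers keyed by their lowercase name.
  if (!response.headers["vary"]) {
    response.headers["vary"] = [{ key: "Vary", value: "Origin" }];
  }

  return response; // CloudFront caches and serves the amended response
};
```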

The other browsers haven't fixed it as such; it's just that their caches work slightly differently, so it accidentally works for them!

aarrrgggh