We have recently changed the Cache-Control and Pragma response headers on our origin service, WebEngine.
What follows is a detailed explanation of how caching has worked within our architecture, why it resulted in publishing bugs on our latest infrastructure, and how we arrived at a solution.
Diagram of caching layers and their default caching times
Every origin HTTP(S) request flows into our system along the same path and then flows back out with a response, resulting in served content. That content can take the form of a web page, JSON, etc.
When a cache layer is checked, it determines whether it holds a valid cached response or whether the request should continue on to the origin (Zesty.io).
Browser
The very first cache that makes a decision in this process is the user's browser. When a browser receives a "Cache-Control: max-age=300, public" header, the "public" directive tells the browser it is allowed to cache the response, and the "max-age" directive tells it how long to use the cached response before returning to the origin for the latest version. The value we have historically sent was 300 seconds (5 minutes). This meant that if someone had loaded a page within the past 5 minutes and then published that same page, they would continue to see the now-stale content until the 5-minute cache expired. This was one source of confusion for people wondering why, after publishing a web page, they did not see their latest change. How much time had passed varied for each person depending on when they last loaded the page, making the observed effect different for everyone.
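The browser's freshness decision can be sketched roughly as follows. This is a simplified model (real HTTP caching, per RFC 9111, also considers the Age, Date, and Expires headers and revalidation), but it captures why a publish inside the max-age window is invisible to a browser:

```python
import re
import time

def parse_max_age(cache_control):
    """Extract the max-age value (in seconds) from a Cache-Control header."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None

def is_fresh(stored_at, cache_control, now=None):
    """Return True while a cached response is inside its max-age window."""
    max_age = parse_max_age(cache_control)
    if max_age is None:
        return False  # no freshness lifetime given; go back to the origin
    age = (now if now is not None else time.time()) - stored_at
    return age < max_age

# With our historical header, a page cached 4 minutes ago is still "fresh",
# so a publish during that window is not seen by this browser:
header = "max-age=300, public"
print(is_fresh(stored_at=0, cache_control=header, now=240))  # True
print(is_fresh(stored_at=0, cache_control=header, now=301))  # False
```

The function names here are illustrative; the point is that the browser alone decides, from the header we sent, whether our origin ever sees the request.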
Content Delivery Network (CDN)
If a web page has not been cached, or the cached copy has surpassed its max-age, the browser sends the request for the URL, starting the web page fulfillment process. At that point, Zesty.io Cloud customers have the request serviced by our CDN. The CDN has a pre-configured cache time, which we control, and which supersedes the Cache-Control header. By default, Zesty.io requests are cached for 24 hours, which means that unless a publish event occurs on that URL's underlying content, that content will continue to be cached and served from the CDN, never returning to the Zesty.io origin services to fetch the latest version. Only when the cache for a URL becomes invalidated, either by expiring (exceeding the 24 hours) or by the content being published, does the CDN pass the request on to the Zesty.io origin services.
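In rough terms, the CDN layer behaves like the sketch below. The class and method names are illustrative, not our vendor's implementation; the two paths that matter are the TTL expiry and the publish-triggered purge:

```python
import time

CDN_TTL = 24 * 60 * 60  # pre-configured 24-hour default, superseding Cache-Control

class CdnLayerSketch:
    """Illustrative model of the CDN's cache-or-forward decision."""

    def __init__(self, origin):
        self._origin = origin  # callable that fetches from the Zesty.io origin
        self._store = {}       # url -> (body, cached_at)

    def fetch(self, url):
        entry = self._store.get(url)
        if entry is not None and time.time() - entry[1] < CDN_TTL:
            return entry[0]           # cache hit: the origin never sees this request
        body = self._origin(url)      # miss or expired: continue on to the origin
        self._store[url] = (body, time.time())
        return body

    def purge(self, url):
        """Invoked when the content behind `url` is published."""
        self._store.pop(url, None)
```

Note that whatever the CDN caches on a miss is what it will keep serving for up to 24 hours, which is why a stale response arriving from further upstream is so costly.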
Google App Engine (GAE)
Before the request makes it to WebEngine (which we build and operate), it is routed through the GAE infrastructure our services run on. Within this infrastructure there is a global load balancer which balances all requests to GAE across the Google Cloud Platform (GCP) customer base. As such, we do not get access to, or control over, this load balancer. We are now aware that this GAE load balancer will cache responses when they contain a "Cache-Control" header. Just as with the browser, if the load balancer has a cached record for the requested URL, it returns the cached content instead of continuing the request to our WebEngine service. In this scenario the load balancer served a stale cached object back to the CDN that made the request, and the CDN in turn would cache that stale response for 24 hours. Alternatively, if the load balancer did not hold a cached response for the requested URL, or the cached response had expired, it would continue the request on to the WebEngine service.
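The lever an origin has over an intermediary like this is the Cache-Control directives themselves. The sketch below is an illustration of the directives involved, not necessarily the exact headers we now send: per RFC 9111, "private" permits only the end user's browser cache to store a response, so a directive-honouring shared cache should pass the request through, and "Pragma: no-cache" is the legacy HTTP/1.0-era hint for older intermediaries.

```python
def origin_response_headers(browser_max_age):
    """Build headers that opt a response out of shared-cache storage.

    Illustrative sketch: "private" permits only the end user's private
    (browser) cache to store the response; shared caches such as a
    platform load balancer must not. "no-store" forbids caching anywhere.
    """
    if browser_max_age is None:
        cache_control = "no-store"  # nothing may cache this response
    else:
        cache_control = "private, max-age=%d" % browser_max_age
    return {
        "Cache-Control": cache_control,
        "Pragma": "no-cache",  # HTTP/1.0 legacy hint for old intermediaries
    }

print(origin_response_headers(300)["Cache-Control"])  # private, max-age=300
```

The design trade-off is that a shared-cache opt-out pushes more traffic to the origin, which is exactly why the WebEngine-level cache described below matters.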
Flowchart showing caching decision
WebEngine and Redis
When WebEngine services a request, it goes through a process of routing and resolving the necessary content for that URL. The Zesty.io platform allows multiple complex relationships between data, which can result in expensive (time-wise) operations when looking up related data. As a web origin there are typically many requests for the same data, so we cache "hot" data for 1 minute. Using a Redis cache, we hold onto the result of the initial lookup operation for quicker responses on future requests. This architecture requires the service to manage that cache and invalidate records when specific events occur, e.g. publishing. The way users reference data in our templating language, Parsley, results in two separate types of references within our resolution service: dynamic and static. Specifically, "each" loops on models generate a static reference. Because these static references did not change between updates to items within a model, the cache was not being invalidated, causing a race condition at this level. If none of the previous caches returned a result, a request that made it to the origin within the 1-minute window would receive a response with stale related data.
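A cache-aside layer of this shape looks roughly like the sketch below, with an in-memory dict standing in for Redis (redis-py's `setex` and `delete` would play the same roles). The names are illustrative, not WebEngine's actual code; the `invalidate` hook is the piece the static references were missing:

```python
import time

HOT_TTL = 60  # hold "hot" lookup results for 1 minute

class HotDataCache:
    """Cache-aside sketch; a dict stands in for Redis."""

    def __init__(self, lookup):
        self._lookup = lookup  # expensive related-data resolution
        self._store = {}       # cache key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]            # hit: skip the expensive lookup
        value = self._lookup(key)      # miss or expired: resolve fresh
        self._store[key] = (value, time.time() + HOT_TTL)
        return value

    def invalidate(self, key):
        """Must run on every publish event that can change `key`'s data.
        Static references (e.g. from "each" loops) produced keys that no
        publish event invalidated, so stale data lived out the full TTL."""
        self._store.pop(key, None)
```

When the invalidation hook is wired up correctly, a publish makes the very next request resolve fresh data; when it is not, stale data is served until the TTL runs out.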
How did we get here
Our historical origin service (SiteEngine) runs within a managed Kubernetes cluster. In the second half of 2020 we began migrating customer traffic to our latest service, WebEngine, which runs on GAE infrastructure. This provides our engineering team with many benefits that ultimately lead to stronger operations and more stability at scale.
With this change, a caching layer we were unaware of was introduced.
GAE has a global load balancer which services all Google Cloud Platform customers. Load balancers do not generally include caching functionality; they typically act as a way to direct traffic to multiple backends. The GAE global load balancer, however, does cache when your backend services send caching directive headers, which in our case we were doing due to the nature of our product. This resulted in additional caching which changed the behaviour of our services in inconsistent and unpredictable ways.
I think it's important to note that while the GAE documentation on this behaviour is lacking and we would like to see this improved, as a whole the service GCP provides is excellent. Their products are performant, powerful, and a productivity amplifier for our team. If we were starting the Zesty.io platform today we would still choose GCP. Simply put the nature of our product meant this type of bug had an outsized impact on our customers.
Towards the end of last year we observed issues with content not being served correctly after a publish. In response, we put a large amount of effort towards a solution in the first quarter of 2021, shipping fixes along the way while we fenced in the problem and searched for the root cause. Initially we thought it was an issue within our core business logic for resolving published version records. We explored many options, wrote tests to validate the outcomes, and shipped updates during this time. Just as we thought we had achieved a solution, we started to see reports of issues which seemed related.
Once we were confident that our services were resolving records as expected, we began investigating upstream, where we observed inconsistencies between our CDN vendor and our origin. Working with our vendor, we wrote up reproducible test cases and continued pushing towards a root cause. While we did discover issues, none ultimately explained the inconsistent serving of content we were observing.
A customer who had previously reported the issue ran into it again, starting an internal escalation process. By this point we were unhappy with the timeframe and results we had provided to our customers. The only option left was to start a "war room" and sit down until we understood the root cause.
We began by laying out all the prior issues discovered, points of caching and then ideated on causes. Starting from our WebEngine service we began the process of verifying the request responses at each layer in the request pipeline. There were two major insights along the way:
A conversation with our customer revealed that the issue only seemed to affect related content within template loops.
We grasped that our origin service responses, which include a "Cache-Control" header (and have for years), were being cached by the new infrastructure's load balancer. We had introduced a new cache layer that had not been accounted for.
Connecting the dots on why this issue had evaded us for so long.
Ultimately, the multiple layers of caching with differing behaviours created compounding confusion about where and why request responses were not returning the data we expected.