For most of my career I thought of cloud infrastructure as somebody else's job. There was a platform team. They handled the AWS account. I built the React app, pushed to main, and a deploy happened. If something was on fire in the CDN layer, that was a Slack thread I read but did not contribute to.
I no longer think this is a good way to be a senior frontend engineer.
The "I just write the frontend" instinct is fine when you're three years in. It stops being fine somewhere around five or six, and the way you can tell is that the bugs you can't solve, and the architecture decisions you can't push back on, are increasingly the ones at the seam between your bundle output and whatever's serving it. You don't have to become a cloud engineer. You do have to learn enough about the layer below your build artifacts to argue intelligently about it. That layer, for a huge percentage of frontend work, happens to be AWS.
This post is a tour of the parts I've found genuinely worth knowing. None of it goes deep. I'm not going to teach you CloudFront. I'm going to tell you which CloudFront concepts have caused me actual pain so you can be the one who notices when they're configured wrong.
CloudFront, and the two-hour index.html
I'll start with the incident, because it's the kind of thing that converts you to caring.
A few years ago I was on a team that did standard CRA-style deploys to S3 with CloudFront in front. Our deploy was nice and clean: build, upload the new bundle to S3, invalidate /index.html in CloudFront, and the next request would pick up the new HTML, which referenced the new hashed JS bundle. Standard playbook. We'd done it a thousand times.
One afternoon we shipped a fix for a payment bug. Deploy went green. We tested production. Bug still there. Refreshed. Bug still there. Hard-refreshed. Gone. So we knew the build was right and the new code was in S3, but for some reason the world was getting served the old HTML.
For a humbling stretch of time we assumed it was a browser cache issue and asked customers to refresh. It was not a browser cache issue. It was a CloudFront cache issue. Specifically: somebody on the platform team had, weeks earlier, updated the CloudFront behavior for index.html to have a minimum TTL of two hours, with the reasonable-seeming logic that the TTL was only a minimum and our invalidation would override it. But invalidations in CloudFront are eventually consistent. They're supposed to complete in about 60 seconds; in practice they can take much longer when the path you're invalidating is held at many edge locations. We were watching prod serve stale HTML for nearly two hours after the deploy.
Three things from this incident that have stuck with me.
One: cache invalidation is not free, and it's not instant. It's a request to AWS to propagate a change across hundreds of edge locations. It usually feels instant. It doesn't have to be. If your deploy pipeline assumes the invalidation completes before users get the new bundle, your pipeline is wrong.
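If you want the pipeline to actually know when the invalidation has landed, the CLI ships a waiter for exactly this. A minimal sketch, assuming $DIST_ID holds your distribution ID:

```bash
# Create the invalidation and capture its ID.
INVALIDATION_ID=$(aws cloudfront create-invalidation \
  --distribution-id "$DIST_ID" \
  --paths "/index.html" \
  --query 'Invalidation.Id' --output text)

# Block until CloudFront reports the invalidation complete. This is the
# step most deploy scripts skip, and it's why "deploy went green" and
# "users have the new HTML" are two different events.
aws cloudfront wait invalidation-completed \
  --distribution-id "$DIST_ID" \
  --id "$INVALIDATION_ID"
```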
Two: never wildcard-invalidate. I've seen aws cloudfront create-invalidation --paths "/*" in deploy scripts and it's always a bad smell. Wildcard invalidations are slow, they're rate-limited, you get charged for them past the free tier, and they almost always indicate that someone gave up trying to be precise about which paths actually changed. The right pattern for an SPA is: hash all your JS/CSS assets, set them to immutable (Cache-Control: public, max-age=31536000, immutable), and only invalidate the unhashed HTML files that point to them.
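In script form, that pattern is only a few lines. A sketch, with $BUCKET and $DIST_ID as placeholders and ./build standing in for wherever your artifacts land:

```bash
# Hashed assets: cache forever, never invalidate.
aws s3 sync ./build "s3://$BUCKET" \
  --exclude "*.html" \
  --cache-control "public, max-age=31536000, immutable"

# Unhashed HTML: force revalidation on every request.
# (--exclude "*" --include "*.html" narrows the sync to HTML only.)
aws s3 sync ./build "s3://$BUCKET" \
  --exclude "*" --include "*.html" \
  --cache-control "no-cache"

# The only paths that ever need invalidating are the HTML entry points.
aws cloudfront create-invalidation \
  --distribution-id "$DIST_ID" \
  --paths "/index.html"
```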
Three: the cache behaviors panel is somebody else's UI, and you need to know what's in it. Every CloudFront distribution has a list of behaviors — patterns like *.js or /api/* — and each behavior has its own caching rules, headers it forwards, headers it strips, things it does to query strings. I have seen behaviors that strip the Authorization header (causing auth to silently break on cached responses), behaviors that forward all query strings (destroying your cache hit ratio), behaviors that whitelist a specific cookie nobody remembers adding (still there from a 2019 A/B test). You don't need write access. You do need to be able to read this panel and say "wait, why does the /* behavior have a 24-hour TTL?"
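You don't need console access to read it, either. Something like the following (assuming jq and the same $DIST_ID placeholder) dumps every behavior's pattern and TTLs; on newer distributions that use cache policies you'll see nulls here plus a CachePolicyId to go look up instead:

```bash
# Print each cache behavior's path pattern and TTL settings. The default
# behavior lives separately under .DistributionConfig.DefaultCacheBehavior.
aws cloudfront get-distribution-config --id "$DIST_ID" \
  | jq '.DistributionConfig.CacheBehaviors.Items[]?
        | {PathPattern, MinTTL, DefaultTTL, MaxTTL, CachePolicyId}'
```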
The Lambda@Edge nobody remembers writing
This is the second AWS thing that has cost me real time, and the reason I now ask, on day one of any new team, "are there any Lambda@Edge or CloudFront Functions in the request path?"
Lambda@Edge is a feature where you can attach a small piece of JavaScript (or Python) to a CloudFront request or response, running at the edge, modifying headers or rewriting URLs or doing auth checks. It's powerful. It's also a place where load-bearing logic goes to be invisible.
On one team I joined, our Cache-Control headers from S3 were being mysteriously rewritten. The bundle I uploaded with max-age=31536000 was being served with max-age=300. I spent half a day in the S3 object metadata thinking I had something misconfigured before someone mentioned, casually, that there was a Lambda@Edge function attached to the origin response that overrode Cache-Control for everything in a certain path. It had been written four years earlier, by a backend engineer who'd since left, to fix a problem from a different era that no longer existed.
That function had been sitting there, silently neutering caching, for years. The frontend team didn't know it existed. The infra team had inherited it from a predecessor and assumed it must be doing something important. So nobody touched it.
The lesson here is not really about Lambda@Edge in particular. It's that the request and response are touched by more things than the diagram on the wall shows, and if you're a senior frontend engineer working in a real organization, part of your job is to be the person who's curious about what those things are. Read the CloudFront config. Ask about edge functions. Ask whether there's a CDN-level WAF that might be stripping headers. The list of components between your npm run build output and the user's browser is longer than you think it is.
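That day-one question about edge functions, incidentally, has a command-line answer. A sketch under the same assumptions as before:

```bash
# List Lambda@Edge and CloudFront Function attachments on the default
# behavior. Repeat over .CacheBehaviors.Items[] to cover the rest.
aws cloudfront get-distribution-config --id "$DIST_ID" \
  | jq '.DistributionConfig.DefaultCacheBehavior
        | {LambdaFunctionAssociations, FunctionAssociations}'
```

If the Quantity fields come back nonzero, go read those functions before you trust a single header.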
S3, briefly, but not as briefly as you'd think
The thing about S3 that I wish someone had told me earlier is that the bucket policy and the object ACL are different things and they interact in ways that will confuse you.
You can have a bucket that's "public" but objects that aren't, or objects that are public in a bucket that isn't, and the layered permission model means a bug at one layer can mask correctness at another. When you upload a file with aws s3 cp, the object gets a private ACL by default, however "public" the bucket looks, and you have to explicitly set --acl public-read or, better, rely on the bucket policy to grant access. I have personally shipped a deploy where the new bundle uploaded fine but was 403ing for users because I'd changed the upload command in a way that no longer set the ACL, and the bucket policy was missing the path I'd added. Embarrassing.
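When you're staring at one of those 403s, interrogate the two layers separately rather than guessing. A sketch; the key name is hypothetical and $BUCKET is yours:

```bash
# Layer one: the object's own ACL.
aws s3api get-object-acl --bucket "$BUCKET" --key assets/main.abc123.js

# Layer two: the bucket-wide policy (this errors if no policy exists,
# which is itself an answer).
aws s3api get-bucket-policy --bucket "$BUCKET" --output text
```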
The modern recommendation, which I now use everywhere, is to make the S3 bucket completely private and put CloudFront in front of it with Origin Access Control (OAC). The bucket policy then trusts only the CloudFront distribution, and CloudFront is the only thing that can read the bucket. No object ACLs. No "public bucket" worries. No anxiety every time AWS sends you an email about public buckets in your account.
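The bucket policy for the OAC setup is short enough to read in one sitting, which is part of its appeal. A sketch, using AWS's documented example account and distribution IDs as placeholders:

```bash
# Grant read access to exactly one CloudFront distribution and nobody else.
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowCloudFrontOAC",
    "Effect": "Allow",
    "Principal": { "Service": "cloudfront.amazonaws.com" },
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-frontend-bucket/*",
    "Condition": {
      "StringEquals": {
        "AWS:SourceArn": "arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE"
      }
    }
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-frontend-bucket --policy file://policy.json
```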
The other S3 thing worth knowing is signed URLs for private assets. If you have user-uploaded files — avatars, attachments, generated PDFs — you almost certainly do not want them in a public bucket. You want them in a private bucket and you want the backend to mint short-lived signed URLs that grant temporary access. The implementation lives on the backend. But as the frontend developer, you should know what to ask for. "Can I get a signed URL with a 5-minute TTL?" is the right question. "Can you make the bucket public?" is the wrong one and you need to be the one pushing back when someone proposes it.
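The mechanism is easy to poke at from the CLI before you go ask the backend team for it. A sketch with a made-up object key:

```bash
# Mint a GET URL for a private object, valid for five minutes.
aws s3 presign "s3://$BUCKET/uploads/report.pdf" --expires-in 300
```

In production the backend mints these with the SDK per request, but the CLI version is enough to convince yourself the bucket can stay private.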
IAM enough to not be dangerous
I do not love IAM. Nobody loves IAM. But you need a working mental model of it, because IAM is the language in which "why can't my Lambda upload to that bucket" arguments are conducted.
The two-sentence version: IAM has principals (users, roles, services) and resources (S3 buckets, Lambdas, CloudFront distributions), and policies attach to either, granting permissions in the form "principal P can do action A on resource R." Deployments and CI run as IAM roles. When something can't write to something, the issue is almost always a missing policy on the role doing the writing. When you see the dreaded AccessDenied, the question is "which role was acting, and what policies were attached to it?"
The thing that took me too long to learn is that IAM is deny by default and you need explicit allow statements for everything. There is no "obviously this role should be able to do this." There is only "is there a policy that allows this exact action on this exact resource." Get comfortable reading those policies. They look intimidating but they're really just JSON with a small vocabulary.
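For a feel of that vocabulary, here's roughly what the policy on a frontend deploy role looks like, written to match the deploy sketch from earlier (the ARNs are placeholders):

```bash
# Allow exactly the actions the pipeline performs, on exactly the
# resources it touches: upload objects, create and poll invalidations.
cat > deploy-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-frontend-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["cloudfront:CreateInvalidation", "cloudfront:GetInvalidation"],
      "Resource": "arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE"
    }
  ]
}
EOF
```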
The other thing worth knowing is that you almost certainly do not want long-lived IAM access keys in your environment for anything. CI deploys should use OIDC federation (GitHub Actions has first-class support for this — you create a role that trusts GitHub's OIDC provider, and your workflow assumes that role with no static keys anywhere). If you see access keys in your CI config, that's a thing worth fixing.
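The role side of that federation is a single trust policy. A sketch, with the account ID, org, and repo names as placeholders:

```bash
# Let GitHub's OIDC provider assume this role, but only for workflows
# running in one specific repository.
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::111122223333:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": { "token.actions.githubusercontent.com:aud": "sts.amazonaws.com" },
      "StringLike": { "token.actions.githubusercontent.com:sub": "repo:my-org/my-frontend:*" }
    }
  }]
}
EOF
aws iam create-role --role-name frontend-deploy \
  --assume-role-policy-document file://trust.json
```

Attach the deploy policy from the previous sketch to this role and there are no static keys left to leak.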
Route 53 (a sentence)
I'm not going to write much about Route 53 because there isn't much to say. It's DNS. The thing worth knowing is that Route 53 supports routing policies more interesting than "round-robin between these IPs" — failover routing, weighted routing, latency-based routing — and that knowing these exist will at some point let you reach for the right tool when a backend service is being deployed across regions. That's basically it. If you understand DNS, you understand 90% of Route 53.
The actual point
Here's what I think the meta-point is, after typing all this out.
There is a category of frontend engineer who's very good at the framework layer — components, state, animations, the design system, the build tooling — and stops there. They are valuable people. There is also a category of frontend engineer who knows their build output well enough to argue about how it's being served. They understand the cache, the CDN, the origin, the headers, the IAM role doing the upload. When something is wrong in production, they don't wait for the platform team to triage. They show up with hypotheses.
The second category is where senior frontend work tends to live, in my experience. It's also harder to recruit for. And the only difference between the two, most of the time, is whether someone spent a couple of afternoons getting curious about the layer below their build output.
You don't need to become a cloud engineer. You do need to be able to read the CloudFront config and notice when it's wrong. That's the threshold. It's not far away, and crossing it changes what problems you're allowed to own.