debugging · observability · sentry · production

Most of My Frontend Debugging Doesn't Happen in DevTools

September 14, 2025 · 8 min read

I was thinking about this earlier today because I closed a bug ticket that took me almost a full day to chase down, and the actual DevTools session at the end of it was maybe four minutes. Four minutes! And I'd been on it since the morning. So I want to write about where the rest of that time goes, because I don't think I've ever seen a blog post that's honest about it.

The mythology of frontend debugging is that you open DevTools, you spot the red line in the console, you set a breakpoint, you find the bug. And sure, that happens. For local development, that's basically it. But the thing nobody tells you about debugging in production is that DevTools is the last 20% of the job, after you've already done the hard part somewhere else.

The hard part is figuring out which bug you're actually looking at.

The screenshot in Slack

So here is the thing that triggered this post. A PM posts a screenshot in a channel — just the screenshot, no description, maybe a "hey is this normal?" if you're lucky. You look at it. It's a dropdown that's clipped on the right side. Could be a hundred things. Could be a viewport you've never tested. Could be a CSS regression from the design system bump last sprint. Could be a translation that's longer than the original and overflowing. Could be Safari being Safari. Could be a Chrome extension on the PM's machine doing something weird. You don't know.

What I want at that moment is everything around the screenshot. What page were they on, what was the URL, what was the user doing, what was the network like, what's their viewport, what's their user agent, were they zoomed in. I want the breadcrumbs leading up to the screenshot more than I want the screenshot itself. The screenshot tells me there's a bug. The breadcrumbs tell me which bug.

And this is where I've come around on Sentry, and tools like it. (I've used Sentry, Datadog RUM, and Bugsnag at various points. They're more similar than they are different at this level.) The thing those tools get right is breadcrumbs and release tracking. You see a stack trace, you see what release it came from, you see the last twenty user actions before the error, you see the URL and the viewport. That's 90% of triage solved. I genuinely think the breadcrumb feature is more valuable than the actual error tracking, and I'm only slightly exaggerating.
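
For reference, wiring that context up with the Sentry browser SDK looks roughly like this. It's a minimal sketch, not our actual config: the DSN and release string are placeholders, and the tags are just one way to capture the viewport data I keep wishing for.

```typescript
// Minimal sketch with @sentry/browser. DSN and release are placeholders;
// the release string needs to match whatever your sourcemaps were uploaded for.
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder
  release: "my-app@1.4.2",
  // The default integrations already record breadcrumbs for clicks,
  // navigations, fetch/XHR, and console output -- the "last twenty user
  // actions" trail this post keeps going on about.
});

// Anything you know you'll want at triage time, attach up front.
Sentry.setTag("viewport", `${window.innerWidth}x${window.innerHeight}`);
Sentry.setTag("pixel_ratio", String(window.devicePixelRatio));

// And you can drop your own breadcrumbs at moments the defaults miss.
Sentry.addBreadcrumb({
  category: "ui",
  message: "opened shipping-address dropdown",
  level: "info",
});
```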

What they get wrong, and I'll just say it, is treating frontend errors like server errors. A server error is a categorical thing — the code is broken or it isn't, on this version or that version, and once you fix it you fix it for everyone. Frontend errors are more like weather. The same code is running in fifty different browsers across thirty different OS versions on hundreds of viewport sizes with extensions and adblockers and translation tools all mucking around with the DOM, and you can fix the bug for 99% of users and still have a long tail of 1% of users where it manifests differently and you'll never reproduce it in a million years. Treating that the same as "endpoint X returned 500" gives you the wrong instincts about urgency. (Spending three days chasing a 0.01% error rate on a niche Android browser when there's a real bug affecting paying users on Chrome — I've done this. It's bad.)

The sourcemap incident

Okay so one specific war story because this stuff is too abstract otherwise.

We had a bug in production that was being reported to Sentry with stack traces like e.t.j is not a function at o.r (a.js:1:48292). Minified. Useless. We had sourcemaps configured to upload during the build, so my first instinct was that something was wrong on Sentry's side — maybe the file fingerprints weren't matching. They weren't. But the reason they weren't matching took me embarrassingly long to figure out: our CI had been updated to bump the build version after the sourcemap upload step ran, so the sourcemaps were being associated with the wrong release.
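
The underlying requirement is just that the release string used at sourcemap upload time and the one the app reports at runtime are the same value. Here's roughly what that looks like as a small build script, a sketch assuming sentry-cli and a dist/ output folder, not our actual pipeline.

```typescript
// Sketch of a build step that keeps the sourcemap upload and the runtime
// release in sync. Assumes sentry-cli is installed and sourcemaps end up
// in dist/ -- just the shape of the idea, adjust for your own setup.
import { execSync } from "node:child_process";

const sh = (cmd: string) => execSync(cmd, { stdio: "inherit" });

// One source of truth for the release string -- here, the git SHA.
const release = execSync("git rev-parse --short HEAD").toString().trim();

// Bake the same string into the app bundle (so Sentry.init({ release })
// reports against it), e.g. via an env var the bundler inlines.
sh(`RELEASE=${release} npm run build`);

// Then upload the sourcemaps against that exact release -- after the build,
// and never before anything that bumps the version.
sh(`sentry-cli releases new ${release}`);
sh(`sentry-cli releases files ${release} upload-sourcemaps ./dist`);
sh(`sentry-cli releases finalize ${release}`);
```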

This kind of thing is not a fun bug to find. It doesn't look like a bug. The system is working — sourcemaps are uploading, errors are being captured, the dashboard looks fine. There's just this one little plumbing problem where the sourcemap-to-release association is off by one, and the symptom is that your production stack traces have been silently useless for two months and nobody noticed, because nobody actually clicks into them unless there's a real incident.

The lesson I took from this — and it's the kind of lesson that's annoying because it's not really actionable in the moment — is that the value of observability isn't in the dashboard, it's in the moment three months from now when you need it. And by then it's too late to discover that it was misconfigured. So every time I set up a new project, I now make a point of deliberately throwing an error in production right after launch and then walking through the full trail: does the error show up in Sentry, does the stack trace deminify, does the release version match, are the source files linked correctly. It's the kind of thing that takes ten minutes and saves you a future Tuesday.
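
The ten-minute version of that check can be as dumb as a throwaway error you can trigger on demand right after a deploy. Something like this (the query param name is made up; use whatever won't collide with real traffic):

```typescript
// Deliberately dumb post-launch smoke test: load the site with
// ?sentry-smoke-test in the URL (hypothetical param name) and it throws a
// uniquely named error. Then walk the trail in Sentry: does it show up,
// does the stack deminify, does the release match, are the files linked.
if (new URLSearchParams(window.location.search).has("sentry-smoke-test")) {
  throw new Error(`sentry-smoke-test ${new Date().toISOString()}`);
}
```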

The Safari 16.4 thing

I have to mention this one because it broke me a little.

We had a report — one report — of a user who couldn't get past our login page. They'd enter credentials, hit submit, and the page would just sit there. No error toast, no console output (they'd been kind enough to screenshot the console for us), no network request even. Just nothing.

I couldn't repro it. Nobody on the team could repro it. We had a Safari user on the team, working fine. We had iOS testers, working fine. The user sent us a screen recording. From the recording, it was clear that nothing was happening when they hit submit. We thought maybe it was a content blocker. They turned content blockers off. Still nothing.

Eventually — and this took several back-and-forth emails — we figured out they were on iOS Safari 16.4 on cellular. Not on wifi. On wifi the bug didn't happen. On cellular it did. And it turned out, after enough digging, that one of our analytics scripts was loading from a CDN that was being weirdly throttled or blocked on certain carriers, and our login form's submit handler was waiting on a promise from that script that would never resolve. (We had a Promise.all somewhere we shouldn't have.)
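
The defensive version I'd write today is to never let a third-party promise gate a user action without a timeout. A sketch, with made-up names standing in for the analytics call and the real login request:

```typescript
// Never let a third-party promise block a user action indefinitely.
// trackLoginAttempt and submitLogin are hypothetical stand-ins for the
// analytics call and the real login request.
declare function trackLoginAttempt(email: string): Promise<void>;
declare function submitLogin(creds: { email: string; password: string }): Promise<void>;

// Resolve with undefined if the promise hasn't settled within `ms`.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | undefined> {
  const timeout = new Promise<undefined>((resolve) => {
    setTimeout(() => resolve(undefined), ms);
  });
  return Promise.race([p, timeout]);
}

async function onSubmit(creds: { email: string; password: string }) {
  // Give analytics half a second, then move on with or without it.
  // The login request itself is never blocked on the third-party script.
  await withTimeout(trackLoginAttempt(creds.email), 500);
  await submitLogin(creds);
}
```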

The bug is not interesting. The repro condition is. iOS 16.4. Cellular. Specific carrier. I would never have found this from DevTools. I never would have found this from Sentry, because the script never errored — it just never resolved, so nothing ever got reported. The only way I found it was the user's patience and a lot of asking-around about their setup.

Which is sort of the point. The bugs that are actually hard are the ones that don't generate errors at all. And no amount of error tracking helps with those, because there's nothing to track. The only signal you have is "user said it doesn't work." Which is, depressingly, why "could not reproduce, closing" tickets are the most dangerous tickets in any frontend bug tracker. Every ticket I've marked "could not reproduce" and closed is a ticket where I might have given up too early. I've gone back through old closed tickets and roughly half of them, in retrospect, were real bugs that someone else hit later and we fixed.

What I actually do now

I don't know if any of this is generalizable advice but here's roughly what my workflow looks like now, after enough scars.

If something comes in as a screenshot or a "this looks weird" — the first thing I ask for is the URL and ideally a Sentry/RUM session replay if we have it. The screenshot is the symptom; the URL plus session is the diagnosis. I try really hard not to start in DevTools before I have those, mostly because I find that if I open DevTools first, I end up confirming whatever the most plausible-looking hypothesis is rather than finding the actual cause.

If it's an error with a stack trace — I look at the breadcrumbs before I look at the stack. The stack tells me where the error happened. The breadcrumbs tell me what the user did to get there. The second one is almost always more useful.

If I can't repro — I do not close the ticket. I move it to a "needs more info" state and I keep it open. The instinct to clean up the tracker by closing things you can't repro is the wrong instinct. Closed tickets are pretend-fixed; open tickets at least keep the problem visible.

And anyway it's late and I should stop typing. The summary, if I had to write one, is just: tools are great, DevTools is great, observability is great, none of it substitutes for asking better questions about which bug you're actually looking at. The answer's almost never in the stack trace. It's in the metadata around it.