The 3 AM Bug That Changed How We Debug AI-Generated Code

David Liu

8 min read

It was 2:47 AM when I finally found it.

A single line of code—generated by an AI agent, reviewed by two engineers, and deployed to production three weeks earlier—was silently corrupting user data. Not crashing. Not throwing errors. Just quietly writing the wrong values to our database, one request at a time.

By the time we noticed, 12% of our users had been affected.

That night changed how I think about AI-generated code. Not because AI is bad at coding—it’s often remarkably good. But because the way it fails is fundamentally different from how humans fail. And debugging it requires a different mental model.

The False Confidence Problem

Human-written bugs tend to cluster around complexity. We make mistakes when logic gets tangled, when there are too many edge cases to track, when we’re tired or rushed.

AI-generated bugs are different. They often appear in code that looks perfectly reasonable. The syntax is clean. The logic flows well. The variable names make sense. Everything about the code signals competence.

That’s the trap.

The bug I found that night was in a data transformation function. The AI had written something like this:

function transformUserData(input: RawUserData): ProcessedUser {
  return {
    id: input.id,
    email: input.email.toLowerCase(),
    createdAt: new Date(input.created_at),
    preferences: input.preferences || {},
    lastLogin: input.last_login ? new Date(input.last_login) : null
  };
}

See the bug? Neither did we. For three weeks.

The problem was input.preferences || {}. When preferences was an empty object {}, this worked fine. When it was null, this worked fine. But when it was undefined—which happened for users created before we added the preferences field—the fallback worked.

Except it didn’t. Not when preferences was 0 or false or an empty string, all of which were falsy and got replaced with {}. And in our system, preferences: 0 meant “user explicitly disabled all preferences.” The AI’s code was silently re-enabling preferences for users who had turned them off.

The fix was simple:

preferences: input.preferences ?? {}

Nullish coalescing instead of logical OR. A two-character difference.
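The difference is easy to demonstrate in isolation. Here's a minimal sketch, with `preferences` simplified to a bare value rather than a field on a user record:

```typescript
type Prefs = Record<string, unknown> | number | boolean | string | null | undefined;

// With ||, every falsy value — 0, false, "", null, undefined — triggers the fallback.
function withOr(preferences: Prefs) {
  return preferences || {};
}

// With ??, only null and undefined do, so meaningful falsy values survive.
function withNullish(preferences: Prefs) {
  return preferences ?? {};
}

withOr(0);       // → {} — the user's explicit "disable everything" choice is erased
withNullish(0);  // → 0  — the explicit value is preserved

// Both behave identically for the genuinely missing cases:
withOr(undefined);      // → {}
withNullish(undefined); // → {}
```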

But the lesson wasn’t about nullish coalescing. It was about how AI code passes visual inspection because it looks right. Human reviewers pattern-match against what they expect to see, and AI-generated code matches those patterns almost perfectly.

The Five Categories of AI Code Bugs

After that incident, we spent a month cataloging every bug we’d found in AI-generated code across our projects. Clear patterns emerged.

1. Plausible Defaults

AI loves filling in defaults. It sees a parameter that might be undefined and helpfully provides a fallback. The problem is that these defaults are based on what’s common in training data, not what’s correct for your specific context.

We found default timeout values that were too short for our infrastructure. Default retry counts that were too aggressive. Default error messages that were too generic. Each one plausible. Each one wrong.

How we catch it now: We explicitly search generated code for ||, ??, default parameter values, and fallback patterns. Every default gets a comment explaining why that specific value was chosen—if we can’t write the comment, the default is probably wrong.

2. Confident Hallucinations

Sometimes AI invents APIs that don’t exist. Not obviously fake APIs—subtle variations of real ones. A method called findOneOrCreate when the ORM actually uses findOrCreate. A config option called retryAttempts when the library uses maxRetries.

These bugs are insidious because they often almost work. JavaScript won’t complain about reading a property that doesn’t exist—it just returns undefined, and the failure only surfaces later, when something tries to call or use that value. TypeScript catches some of these, but not if the AI also generated plausible type definitions.

How we catch it now: We never trust AI-generated import statements or external API calls without verification. If the code references a library method, someone manually confirms that method exists with that exact signature.
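One cheap complement to manual verification is a startup-time assertion that the methods a generated call site relies on actually exist. A minimal sketch—the `orm` object and its method names here are hypothetical stand-ins for whatever library the generated code targets:

```typescript
// Hypothetical stand-in for a real ORM import.
const orm = {
  findOrCreate: (query: object) => ({ query }),
};

// Fail fast if generated code references a method the library doesn't export.
function assertMethodExists(obj: Record<string, unknown>, name: string): void {
  if (typeof obj[name] !== "function") {
    throw new Error(`Expected method "${name}" to exist — possible hallucinated API`);
  }
}

assertMethodExists(orm, "findOrCreate");      // passes: the method is real
// assertMethodExists(orm, "findOneOrCreate"); // would throw: the method was invented
```

This doesn’t verify the signature or semantics—only existence—but it turns a silent undefined into a loud failure at boot instead of in production.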

3. Context Window Amnesia

AI agents process code in chunks. When a function spans multiple context windows, or when the AI is working across several files, it can lose track of details established earlier.

We found cases where a variable was declared as a string in one file and used as a number in another. Cases where an async function was awaited in some call sites but not others. Cases where error handling was comprehensive at the top of a file and completely absent at the bottom.

How we catch it now: We run consistency checks across files. Same variable, same type, everywhere. Same function, same error handling, at every call site. We also break up large generation tasks into smaller, self-contained units.

4. Training Data Ghosts

AI models are trained on code from specific eras. They’ve seen more jQuery than modern React, more callbacks than async/await, more CommonJS than ES modules. When you ask them to write modern code, they usually do—but sometimes old patterns leak through.

We found code that used var instead of const. Code that manually bound this in React class components (when we use only function components). Code that used deprecated APIs that technically still work but are marked for removal.

How we catch it now: We maintain an explicit blocklist of patterns we don’t use. Our linter is configured to error on anything deprecated or outdated. If the AI generates it, CI catches it.
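The blocklist idea can be sketched as a list of regexes run over generated output before it reaches review. The patterns below are illustrative; a real setup would delegate to lint rules (for example, ESLint’s `no-var`) rather than hand-rolled regexes:

```typescript
// Illustrative blocklist of outdated patterns that leak from training data.
const blocklist: Array<{ name: string; pattern: RegExp }> = [
  { name: "var declaration", pattern: /\bvar\s+\w/ },
  { name: "CommonJS require", pattern: /\brequire\s*\(/ },
  { name: "manual this binding", pattern: /\.bind\(this\)/ },
];

// Return the names of every banned pattern found in a generated source string.
function findBannedPatterns(source: string): string[] {
  return blocklist
    .filter(({ pattern }) => pattern.test(source))
    .map(({ name }) => name);
}

findBannedPatterns("var x = 1;");   // → ["var declaration"]
findBannedPatterns("const x = 1;"); // → []
```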

5. Optimization Blindness

AI-generated code often works correctly but performs terribly. It’ll build an array with repeated concat calls instead of a single spread. It’ll query a database inside a loop. It’ll re-render entire component trees when a single value changes.

The code is correct. The code is slow. And the AI has no way to know, because performance is an emergent property of how code runs in production, not something visible in the source.

How we catch it now: Every generated code path gets a performance review. We look specifically for loops containing async operations, repeated allocations, and patterns that scale linearly when they should be constant-time.
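The loop-containing-async-operations pattern is the most common one we flag. A minimal sketch of both shapes—`fetchUser` here is a hypothetical async lookup standing in for a real database call:

```typescript
// Hypothetical async lookup standing in for a real database query.
async function fetchUser(id: number): Promise<{ id: number }> {
  return { id };
}

// Shape we flag: one awaited call per iteration — N sequential round trips.
async function loadUsersSlow(ids: number[]): Promise<Array<{ id: number }>> {
  const users: Array<{ id: number }> = [];
  for (const id of ids) {
    users.push(await fetchUser(id)); // each await blocks the next request
  }
  return users;
}

// Preferred shape: fire all requests, await once — round trips overlap.
async function loadUsersFast(ids: number[]): Promise<Array<{ id: number }>> {
  return Promise.all(ids.map((id) => fetchUser(id)));
}
```

Both functions return the same data; the difference only shows up as latency under load, which is exactly why the AI can’t see it.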

Building a Debugging Culture

None of this means we’ve stopped using AI-generated code. We use it constantly. It’s made us dramatically faster. But we’ve built specific practices around it.

The Generation Log

Every piece of AI-generated code gets logged with the prompt that created it, the model version that generated it, and the review status. When bugs surface weeks or months later, we can trace them back to their origin.

This has revealed patterns we wouldn’t have seen otherwise. Certain types of prompts consistently produce buggier code. Certain model versions have specific blind spots. The log turns debugging from archaeology into science.

The Skepticism Review

Regular code review asks: “Does this code do what it’s supposed to do?”

Our AI code review asks: “What would I expect this code to do if I’d never seen the requirements?”

The distinction matters. AI code often does exactly what you asked for, interpreted literally. The skepticism review catches cases where what you asked for isn’t what you actually needed.

The Boundary Tests

For every generated function, we write tests for the boundaries: null inputs, empty inputs, maximum-size inputs, malformed inputs, inputs from different sources. AI code tends to handle the happy path well. The edges are where it breaks.

We also write tests for the implicit contracts. If a function is supposed to be idempotent, we call it twice and verify the results match. If it’s supposed to be pure, we verify it has no side effects. If it’s supposed to be atomic, we interrupt it mid-execution and check for corruption.
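In test form, the implicit-contract checks are short. A sketch of the idempotency check—`normalizeEmail` is a hypothetical example of a function that should be idempotent, not code from our system:

```typescript
// Hypothetical function that should be idempotent: applying it twice
// must give the same result as applying it once.
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Generic idempotency check: run the function on its own output
// and compare the results.
function checkIdempotent<T>(fn: (x: T) => T, input: T): boolean {
  const once = fn(input);
  const twice = fn(once);
  return JSON.stringify(once) === JSON.stringify(twice);
}

checkIdempotent(normalizeEmail, "  Alice@Example.COM "); // → true
```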

The Uncomfortable Truth

Here’s what I’ve realized after a year of debugging AI-generated code: we’re not debugging code. We’re debugging communication.

When an AI agent misunderstands a requirement and produces wrong code, the bug isn’t in the code. The bug is in the prompt, or the context, or the implicit assumptions we failed to make explicit.

The 3 AM bug wasn’t really about || versus ??. It was about the fact that nobody told the AI that preferences: 0 was a meaningful value in our system. The AI did exactly what a reasonable developer would do with the information it had. We just hadn’t given it all the information.

This reframes debugging entirely. Instead of asking “What did the AI get wrong?” we ask “What did we fail to communicate?” Instead of blaming the tool, we improve our process.

That’s not to say AI code is perfect when properly prompted. It’s not. The five bug categories I described are real, and they require real vigilance. But the majority of serious bugs we’ve encountered trace back to gaps in our prompts, not flaws in the generation.

Looking Forward

The tooling around AI-generated code is improving fast. Better static analysis that understands generative patterns. Better test generation that targets likely failure modes. Better prompting frameworks that reduce ambiguity.

But the fundamental challenge will remain: AI code fails differently than human code, and debugging it requires different intuitions. The engineers who thrive in the AI-native era won’t be the ones who trust AI blindly or reject it entirely. They’ll be the ones who understand how it thinks—and where it’s likely to be wrong.

That 3 AM bug cost us three days of cleanup and some uncomfortable conversations with affected users. It also taught us more about working with AI than any success ever could.

Some lessons you have to learn the hard way.


ProductOS is built with AI agents that generate, review, and deploy code—and we’ve learned these debugging lessons firsthand. If you’re building with AI and want a platform designed for this new reality, try ProductOS free at productos.dev.

Photo by Daniil Komov on Unsplash