The Good, the Bad, and the Ugly: Lessons from an AI-Assisted Infrastructure Failure #
Artificial Intelligence is no longer an abstract concept from a conference keynote slide. It’s an open terminal. It’s a tool I reach for constantly. Claude Code, Codex, Copilot, ChatGPT—these haven’t just changed how I learn, troubleshoot, and build infrastructure; they have completely rewritten the timeline of what a single engineer can achieve. Projects that used to take weeks now take hours. But acceleration without control is just a crash waiting to happen. For anyone who noticed my page crawling or completely down recently—this is why. This isn’t a story about AI making a mistake. This is a story about a tired engineer who stopped paying attention. It’s a raw look at what happens when human fatigue meets machine-speed execution, and it is a conversation our industry desperately needs to have before more organizations learn this lesson the hard way.
The Good: From Summit Vision to Hands-On Execution #
Fresh off attending the 2026 Red Hat Summit, I came back with a fire to build. Listening to the keynotes and talking with peers made one thing crystal clear: the future of modern application platforms is increasingly centered around containerization and Kubernetes. I didn’t just want to understand this shift conceptually; I wanted a rigorous, hands-on approach to the technology defining the modern enterprise. Right then, I made the decision to overhaul my entire homelab infrastructure, migrating completely from virtual machines to containers and moving my core architecture away from Proxmox and straight into OKD (the community distribution of OpenShift). It was the ultimate playground to double down on my philosophy: #LongLiveOpensource.
But moving from traditional hypervisors to bare-metal Kubernetes and distributed, software-defined storage is a massive paradigm shift. The learning curve can be brutal. That is where AI stepped in to flatten the curve. Instead of spending hours digging through documentation and forum threads, I used AI as an on-demand principal architect. I could validate complex topologies, explore edge cases, and receive explanations tailored directly to my specific environment.
More importantly, it catalyzed the transition from inspiration to implementation. I was actively tearing down legacy VM workloads and rebuilding them. Applications were being redesigned, containerized, and deployed onto OKD. Databases were being shifted. Identity providers and authentication services were being integrated. Distributed storage systems were being provisioned from scratch, and enterprise monitoring stacks modernized. The velocity was incredible. Projects that traditionally would have stalled out over weeks of planning were wrapped up in days. Complex, multi-layered troubleshooting that used to require a day of research was solved in minutes.
The AI wasn’t replacing my engineering judgment. It was amplifying it, allowing me to build out a cutting-edge open platform at unprecedented speed. I was winning, the platform was scaling, and the momentum felt unstoppable. Which is exactly why I didn’t see the wall I was running toward.
The Bad: When Success Becomes Routine #
That brings us to 2:00 AM. It’s the classic engineering trap: you are running on caffeine and momentum, pushing past the point of exhaustion because the deployment is going incredibly well. One more manifest, one more config change, then sleep. I was pairing with Claude Code on some late-night adjustments to the OKD cluster. Because the tool had earned my trust over weeks of successful execution, I let my guard down. I did the most dangerous thing an infrastructure operator can do: I stopped actively auditing the code. The automation plans generated by the LLM grew increasingly complex and verbose. Instead of taking the time to parse the terminal output, I fell into a pattern of blind approval, skimming the text and hitting Accept over and over. The critical mistake wasn’t made by the machine. It was made by a tired engineer who stopped treating automated actions with a zero-trust mindset.
The Ugly: When Execution Outpaced Review #
The impact was heaviest on the storage plane. I was reconfiguring my Rook-Ceph environment after a minor misstep during a storage expansion exercise. I had accidentally provisioned an older 250GB SSD instead of the intended 500GB drive. My mistake started with a classic systems engineering shortcut: I was identifying target drives by volatile Linux device names like /dev/sdb and /dev/sdc instead of by the persistent identifiers under /dev/disk/by-id/. During the expansion, the kernel mapped the drives differently than I expected, and the 250GB disk was swallowed into the cluster. The new objective was simple: remove the incorrect disk and scale up the cluster using three new 500GB SSDs. The fatal flaw wasn’t the objective; it was the sequence of operations. Because of how Ceph had initialized the storage, that 250GB disk had immediately become active, hosting critical Ceph metadata and Object Storage Daemon (OSD) mappings that anchored a massive portion of the cluster’s state. The correct, resilient engineering path required draining the OSD, letting the placement groups rebalance, and confirming the data was safe elsewhere before pulling the drive.
Instead, the late-night sequence I fed to Claude effectively read:
- Purge the 250GB disk immediately.
- Provision the three new 500GB drives.
Let’s be entirely clear:
The AI did not misunderstand the task. It took my flawed logic and executed it with machine-speed precision.
Within minutes, alerts began appearing across the environment. Persistent Volume Claims (PVCs) went dark. Mission-critical databases lost state. Centralized authentication services stopped functioning. Apps crashed. Anything tied to that distributed storage plane became unavailable — PostgreSQL, Percona, Authentik, and every stateful service beneath them. Yet, amid the wreckage, an interesting architectural lesson emerged. The outage drew a definitive line through my platform, instantly highlighting which parts of the architecture were truly resilient. My stateless workloads survived untouched. My decoupled Hugo site stayed online. Monitoring systems built outside the affected storage boundary continued reporting data without a hitch. This wasn’t an AI failure. It was an infrastructure incident initiated by a human operator who stopped auditing his own code. The AI simply moved down the execution path faster than any human typing manually ever could.
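The device-naming trap that started this chain can be checked before any disk is handed to the cluster. What follows is a minimal Python sketch, not the tooling used in this incident: it resolves a volatile name such as /dev/sdb to the persistent links under /dev/disk/by-id that currently point at it. The function name `persistent_ids` and its `by_id_dir` parameter are assumptions introduced purely for illustration.

```python
import os
from pathlib import Path


def persistent_ids(device: str, by_id_dir: str = "/dev/disk/by-id") -> list[str]:
    """Return the by-id symlinks that currently resolve to `device`.

    `device` is a volatile name such as /dev/sdb. The result is sorted so
    repeated runs produce the same order.
    """
    target = os.path.realpath(device)
    matches = []
    for entry in sorted(Path(by_id_dir).iterdir()):
        # Each entry is a symlink that resolves to a kernel device node.
        if os.path.realpath(entry) == target:
            matches.append(str(entry))
    return matches
```

Calling `persistent_ids("/dev/sdb")` on a Linux host lists the stable names for whatever drive currently holds that letter. Those link names typically embed the drive's model and serial number, so they can be compared against the intended disk before the device goes anywhere near a storage manifest.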
The Recovery: 48 Hours in the Trenches #
The moment the alerts fired and the scope of the data plane failure became clear, I killed the process. But the sequence had already finished; the damage was done. Recovery began immediately. Ironically, the very same AI tools that had carried out the destructive sequence became my most valuable assets during the restoration. The paradigm shift was immediate: I went from treating the AI as an execution engine to treating it as an analytical peer. Claude helped map out the corrupted cluster states. Codex and Copilot validated alternative recovery approaches, and ChatGPT helped brainstorm architectural fallback positions. Instead of blindly running generated scripts, I used the LLM stack to double-check my logic, sanity-check raw configuration files, and bounce recovery paths off a virtual infrastructure team.
Over the next two sleepless nights, it was pure, hands-on systems engineering. I pulled backups, rebuilt core services, recovered distributed databases, and re-established broken application connections across the OKD cluster. Some data was lost, specifically the changes made since the last automated backup window. Because this occurred within my homelab, the blast radius was safely contained inside my own sandbox. Had this sequence been approved by an engineer targeting a production enterprise region, the operational and financial impact would have been severe. But the true value of an infrastructure failure isn’t just getting the lights back on. It’s the hard-won operational maturity forged during the rebuild.
The Lessons: Governance in the Age of Velocity #
1. Accountability Cannot Be Outsourced #
This is the foundational takeaway of the modern engineering landscape: the AI did exactly what I instructed it to do. The failure occurred because I approved a destructive sequence of operations without proper verification. The accountability belongs entirely to the person behind the keyboard. You cannot fire, discipline, or audit an LLM. In an enterprise setting, leadership must explicitly reinforce that utilizing AI assistants does not absolve an engineer from full code ownership.
2. Speed is a Risk Vector #
The core business value of AI is raw execution speed. It allows us to learn, build, and deploy at a staggering pace. But that velocity introduces a subtle, psychological risk: when repeated success becomes routine, rigorous peer review begins to feel like a bottleneck. When review becomes optional, systemic downtime becomes inevitable. The true danger isn’t that machines move too fast—it’s that humans become comfortable operating at machine speed without machine precision.
3. Elevate the “Audit-to-Execution” Ratio #
It sounds basic: read the plan. Yet, this is becoming one of the greatest operational hurdles in modern DevOps. AI-generated infrastructure manifests are dense, verbose, and incredibly detailed. Thoroughly auditing a 300-line declarative plan takes real cognitive effort. But that time is never wasted; it’s where risk is actively managed. If an automated tool saves your team twenty hours of development time, mandating that they spend one focused hour peer-reviewing the output is still an extraordinary return on investment.
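Part of that audit can be mechanical. As a rough sketch, assuming a plan can be represented as an ordered list of steps, the check below flags any destructive step that is not preceded by its prerequisite on the same target. The action names (`drain_osd`, `purge_osd`, `add_osd`), the `PREREQUISITES` table, and the target labels are hypothetical and are not Ceph or Rook commands.

```python
from dataclasses import dataclass

# Hypothetical rule set: destructive action -> action that must come first.
PREREQUISITES = {"purge_osd": "drain_osd"}


@dataclass(frozen=True)
class Step:
    action: str
    target: str


def unsafe_steps(plan: list[Step]) -> list[Step]:
    """Return destructive steps whose prerequisite has not yet appeared
    earlier in the plan for the same target."""
    seen: set[tuple[str, str]] = set()
    flagged = []
    for step in plan:
        required = PREREQUISITES.get(step.action)
        if required is not None and (required, step.target) not in seen:
            flagged.append(step)
        seen.add((step.action, step.target))
    return flagged


# The late-night plan from this incident, expressed in the hypothetical model.
late_night_plan = [
    Step("purge_osd", "disk-250gb"),
    Step("add_osd", "disk-500gb-a"),
]
flagged = unsafe_steps(late_night_plan)  # contains the purge step
```

A check like this only verifies ordering inside the plan. It cannot confirm that a drain actually finished or that the placement groups rebalanced, so it narrows what a human has to read carefully without replacing the read itself.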
4. Data Has Priorities #
This incident was a brutal reminder of an essential infrastructure principle: not all data is created equal. Your identity providers (like Authentik), core stateful databases, and distributed storage metadata require isolation and tier-one placement strategies. Where your critical data lives and the underlying dependencies beneath it must be driven strictly by architectural impact and resilience, never by convenience or automated defaults.
5. Backups Are Protection Against Ourselves #
Hardware, networks, and software will always fail. But human error remains the highest-probability catalyst for a catastrophic outage. Backups aren’t just an insurance policy against hardware degradation; they are a safety net for our own tired decisions. The only reason my OKD cluster recovered is that robust, decoupled backups existed outside the blast radius. Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs) are not compliance checkmarks. They are operational survival. And a backup that has never been successfully restored is nothing more than a comforting theory.
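An RPO only means something if it is measured. As a small illustration, assuming the timestamp of the last successful backup is available from whatever backup tool is in use, the check below reports whether the age of that backup exceeds the stated RPO. The function name `rpo_breached` and the 24-hour figure in the usage note are illustrative assumptions, not values from my environment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rpo_breached(
    last_backup: datetime,
    rpo: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the newest backup is older than the RPO allows.

    All datetimes are expected to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_backup > rpo
```

With a 24-hour RPO, a backup taken 26 hours ago is a breach, and those 26 hours are exactly the window of changes that would be lost if the storage plane disappeared at that moment. Pairing a check like this with a periodic test restore covers both halves of the lesson: the backup is recent, and it actually restores.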
Final Thoughts: Human Judgment at Machine Speed #
Organizations everywhere are aggressively adopting AI. They are granting automated tools deep, programmatic access to source code repositories, cloud environments, CI/CD pipelines, Kubernetes clusters, and core infrastructure platforms. This trend is not going away. Nor should it. The productivity gains and engineering acceleration are simply too massive to ignore. But as we enthusiastically embrace these tools, our operational discipline and architectural choices must evolve at the exact same rate. The takeaway here is not to avoid AI, nor is it to fear innovation. The takeaway is to remember that absolute engineering responsibility does not evaporate simply because a machine is helping us type the commands. AI did not cause my outage. Fatigue did. Complacency did. Blind trust did. The AI simply executed a flawed human plan faster than I ever could have on my own. The real danger wasn’t artificial intelligence. It was artificial confidence. One of the advantages of operating an open platform is visibility. During recovery, I had access to logs, manifests, configuration, and the underlying systems that allowed me to understand what happened and rebuild quickly. When things break—and eventually they will—visibility becomes one of the most valuable tools an engineer can have. As engineers and technology leaders, we must learn to operate in a reality where execution is nearly instantaneous, but architectural judgment remains entirely, fundamentally human. True innovation isn’t just about how fast we can build with AI—it’s about building open, resilient systems that can survive our own human mistakes.
What makes this story uncomfortable is that nothing about it was unique to a homelab. The same AI tools are increasingly being granted access to Git repositories, CI/CD pipelines, cloud accounts, Kubernetes clusters, and production infrastructure. The only thing that saved me was the fact that this happened in a lab environment. Had this occurred against a production region, the outcome would have looked very different.
And sometimes, the sharpest engineering decision you can make is to close the terminal, get some sleep, and run the deployment in the morning light.