It started with a single message in our Slack alerts channel at 3:14 a.m. The checkout service was returning 502 errors. I woke up, bleary-eyed, and squinted at my phone. No other context. No stack traces. Just a generic HTTP error from the load balancer. I opened my laptop, logged into the server, and prepared to find the culprit. What followed was six hours of pure debugging hell, most of which I spent staring at log files that told me absolutely nothing. By the time I finally found the bug, a single misconfigured line in our logging setup had cost me half a night’s sleep and a significant piece of my sanity. That night taught me more about logging than years of reading documentation ever could. This is the story of that session, the mistakes that made it a nightmare, and the logging philosophy I have followed ever since.
The Setup: A Production Service That Should Have Been Fine
The checkout service was a Node.js application I had built six months earlier. It handled payment processing for our small e-commerce platform, communicating with Stripe’s API, recording orders in a PostgreSQL database, and publishing events to a RabbitMQ queue for order fulfillment. It ran on two AWS EC2 instances behind a load balancer, with a health check endpoint that returned 200 OK as long as the process was alive. We had logging: Winston, a popular Node.js logging library, was configured to write to files on each server and also send errors to a small Elasticsearch cluster we used for log aggregation. We had alerting: a CloudWatch alarm was set to fire if the number of 5XX errors exceeded a threshold. What we did not have was useful logs. I would learn this the hard way.
The First Symptom and the Empty Logs
I SSHed into the first server and tailed the application log file. It was full of information-level messages about successful payment intents, database queries, and session refreshes. Normal traffic. I scrolled back to the time of the alert and found nothing. No errors. No warnings. Just a gap in the logs, as if the application had simply stopped writing. Then the logs resumed, with an innocent-looking message about a new Stripe webhook being received. I checked the second server. Same pattern. The health check had not failed, so the instances stayed in the load balancer, but actual requests were failing with 502 errors during that gap. Something was killing the application’s ability to respond, but it was recovering quickly and leaving no trace.
I checked the Elasticsearch logs. The error index was empty for that time window. That was the first big clue: Winston was not shipping error logs either. I checked the Winston configuration, which I had not touched since writing it months ago. It looked fine. A file transport, an Elasticsearch transport, and a console transport for development. The transports were configured with a log level of “info” and above. Errors should have been caught. But they were not. I was flying blind.
The RabbitMQ Wild Goose Chase
My first theory was RabbitMQ. The checkout service published events after successful payments, and the queue had been a source of trouble before. I checked the RabbitMQ management console and saw a backlog of unacked messages on the order fulfillment queue. The consumers were alive but slow. Maybe the checkout service was hanging while trying to publish, waiting for a broker acknowledgment that never came. I spent an hour digging through RabbitMQ logs, network metrics, and connection timeouts. I added temporary debug logs to the publisher code and redeployed. The debug logs showed that publishing was fine. The backpressure was on the consumer side, not the publisher. The 502 errors had nothing to do with the queue. I had lost an hour and gained nothing.
The Database Connection Pool Suspicion
Next, I suspected the database. The checkout service used a connection pool to PostgreSQL. If the pool was exhausted, incoming requests would hang waiting for a connection, eventually timing out and returning 502 errors. I checked the PostgreSQL logs. No connection errors. I checked the pool metrics. The pool size was 20, and active connections rarely exceeded 5. The pool was not the bottleneck. I added logging to the pool acquire and release events anyway, and redeployed. The logs showed connections being acquired and released normally. The 502 errors occurred even when the pool had plenty of capacity. Another theory dead.
The Moment of Discovery: A Logging Backpressure Loop
By 6 a.m., I was exhausted and considering restarting the servers just to see if the problem went away. But I decided to look at the Winston configuration one more time, this time reading the library’s documentation for the Elasticsearch transport. That transport was a community plugin, and its README mentioned a “bufferLimit” option. By default, it buffered log messages in memory and flushed them to Elasticsearch in batches. If the buffer filled up, for example if Elasticsearch was slow or unreachable, the transport would stop accepting new log messages. And critically, the default behavior was to simply drop them silently. No error callback, no event emission. Just a buffer overflow and a quiet failure.
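The failure mode is easier to see in miniature. The sketch below is not the plugin's code: `BoundedLogBuffer`, its `limit` argument, and the `dropped` event are names invented for illustration. It shows a bounded buffer that still drops entries when full, but counts the loss and announces it instead of failing quietly.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical bounded buffer; not the API of any real Winston transport.
class BoundedLogBuffer extends EventEmitter {
  constructor(limit) {
    super();
    this.limit = limit;
    this.entries = [];
    this.droppedCount = 0;
  }

  // Returns true if the entry was buffered, false if it was dropped.
  push(entry) {
    if (this.entries.length >= this.limit) {
      this.droppedCount += 1;
      // Surface the loss instead of discarding the entry quietly.
      this.emit("dropped", { entry, droppedCount: this.droppedCount });
      return false;
    }
    this.entries.push(entry);
    return true;
  }

  // Hands the buffered entries to the caller and empties the buffer.
  drain() {
    const batch = this.entries;
    this.entries = [];
    return batch;
  }
}

const buffer = new BoundedLogBuffer(2);
buffer.on("dropped", ({ droppedCount }) => {
  // Write to stderr so the warning does not depend on the failing pipeline.
  process.stderr.write(`log buffer full, ${droppedCount} entries dropped so far\n`);
});

buffer.push({ level: "info", message: "payment intent created" });
buffer.push({ level: "info", message: "order recorded" });
// The buffer is full, so this entry is dropped, but the drop is reported.
buffer.push({ level: "error", message: "signature verification failed" });
```

A drop counter like this can also be exported as a metric, so that an alert fires on the first lost entry rather than hours later.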
I checked the Elasticsearch cluster. It was up, but the disk on one of the two nodes was 92 percent full. Elasticsearch had entered read-only mode for indices on that node. The Winston transport was trying to write to an index that was now read-only, receiving 403 errors from Elasticsearch, and silently buffering the errors. As the buffer filled, the transport stopped flushing. The file transport was also configured, but Winston’s internal buffering applied to all transports because the Elasticsearch transport was blocking the event loop. I was not a Winston internals expert, but I could see the chain: Elasticsearch disk pressure, read-only index, transport buffer overflow, log silence. The application was still running, but the logging layer had become a black hole that swallowed everything, including the error that would have told me what was actually causing the 502 errors.
I freed disk space on Elasticsearch, and the indices recovered. Immediately, Winston began flushing its buffer, and a torrent of delayed logs appeared. Among them was the error I had been searching for: a Stripe webhook signature verification failure. A misconfiguration in our webhook secret rotation had caused Stripe to send events signed with a key the application did not yet trust. The signature verification middleware threw an exception, which my code caught and attempted to log as an error. But because the logger was broken, the error never surfaced. The middleware then returned a 500 status, which the load balancer translated into a 502. The error was intermittent because only some webhooks used the new key. When the signature matched, the application worked normally. When it did not, the request failed silently, and the logging layer hid the crime.
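For readers unfamiliar with this class of bug, here is a minimal sketch of HMAC-based webhook verification during a secret rotation. It follows the scheme Stripe documents for its `Stripe-Signature` header (an HMAC-SHA256 over the timestamp, a dot, and the raw payload), but it is neither Stripe's library nor the checkout service's code. The function names, the made-up secrets, and the idea of trusting a list of secrets are all assumptions for illustration. It also skips the timestamp tolerance check and only reads one `v1` value. In production, the official `stripe` package's `webhooks.constructEvent` is the better tool.

```javascript
const crypto = require("node:crypto");

// Computes the hex HMAC-SHA256 of "timestamp.payload" with one secret.
function sign(secret, timestamp, payload) {
  return crypto
    .createHmac("sha256", secret)
    .update(`${timestamp}.${payload}`)
    .digest("hex");
}

// Parses a header of the form "t=<timestamp>,v1=<signature>".
// Simplification: if a key repeats, the last value wins.
function parseHeader(header) {
  const parts = {};
  for (const pair of header.split(",")) {
    const idx = pair.indexOf("=");
    if (idx > 0) parts[pair.slice(0, idx)] = pair.slice(idx + 1);
  }
  return parts;
}

// Returns true if the signature matches any currently trusted secret.
// Trusting both the old and the new secret keeps verification working
// while a rotation is in progress.
function verifyWebhook(payload, header, trustedSecrets) {
  const { t, v1 } = parseHeader(header);
  if (!t || !v1) return false;
  const received = Buffer.from(v1, "hex");
  return trustedSecrets.some((secret) => {
    const expected = Buffer.from(sign(secret, t, payload), "hex");
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return (
      expected.length === received.length &&
      crypto.timingSafeEqual(expected, received)
    );
  });
}

// Example with made-up secrets: the event is signed with the new secret.
const payload = JSON.stringify({ type: "payment_intent.succeeded" });
const header = `t=1700000000,v1=${sign("whsec_new_example", "1700000000", payload)}`;

console.log(verifyWebhook(payload, header, ["whsec_old_example"])); // false
console.log(verifyWebhook(payload, header, ["whsec_old_example", "whsec_new_example"])); // true
```

The first call mirrors the incident: the application trusts only the old secret, so every event signed with the new one is rejected.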
The Fix and the Uncomfortable Aftermath
Once I saw the root cause, the fix was trivial. I updated the webhook secret to the correct value, and the errors stopped. But the aftermath was humbling. The entire ordeal, from alert to resolution, took six hours. The actual bug took five minutes to fix. The other five hours and fifty-five minutes were spent debugging my logging system, which had been designed to help me but had instead actively obstructed me. I had built a logging pipeline that was brittle, silent on failure, and coupled to an external service without any resilience. When Elasticsearch sneezed, my entire observability stack caught a cold and died.
I also realized I had no alert on the logging pipeline itself. My application could log or not log, and I would not know until I needed the logs and found emptiness. That was a meta-failure. Logging is infrastructure, and infrastructure needs monitoring. I had monitored my application, but not the thing that monitored my application. The irony was thick enough to spread on toast.
What I Learned About Logging That Night
That single debugging session reshaped how I think about logging. Before that night, I treated logging as a passive output stream. After, I treated it as a critical subsystem with its own failure modes, backpressure, and need for resilience. I learned that logging transports can block the event loop, that buffering can silently drop data, and that “write to file and forget” is not a strategy. I also learned that logs are worthless if they cannot be trusted to exist when you need them most.
The most important lesson was about log levels. I had set the global log level to “info” for all transports. That filled the logs with noise: successful queries, routine events, and debug-level details promoted to info. When I needed to find an error, it was buried in thousands of irrelevant lines. After the incident, I raised the default threshold to “warn” and reserved “info” for truly useful operational events. Errors became impossible to miss because they were the loudest things in the file.
What I Changed Immediately
Within a day of the incident, I rewrote the logging configuration for every service we ran. I removed the community Elasticsearch transport and replaced it with a sidecar approach: the application writes structured JSON logs to stdout, and a separate agent (Filebeat) ships them to Elasticsearch. The application’s responsibility ends at stdout. It has no knowledge of Elasticsearch, no buffering, and no chance of being blocked by a downstream system. If Filebeat fails, the application keeps running and logging. That decoupling was the single most impactful change I made.
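A minimal sketch of what "the application's responsibility ends at stdout" can look like, using only Node.js built-ins. The function names and the example field values are illustrative assumptions, not the configuration I actually deployed.

```javascript
// Builds one JSON log line. Field names are illustrative.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  const { error, ...rest } = fields;
  const record = {
    timestamp: now.toISOString(),
    level,
    message,
    ...rest,
  };
  if (error instanceof Error) {
    // Flatten the error so the stack survives JSON serialization.
    record.error = {
      name: error.name,
      message: error.message,
      stack: error.stack,
    };
  }
  return JSON.stringify(record);
}

// The application's only job: write one line to stdout and move on.
// Shipping the line anywhere else is the agent's problem.
function log(level, message, fields) {
  process.stdout.write(formatLogLine(level, message, fields) + "\n");
}

log("error", "webhook signature verification failed", {
  service: "checkout",
  traceId: "example-trace-id",
  error: new Error("signature mismatch"),
});
```

Because each entry is a single line of JSON, a multiline stack trace stays inside one record, and the shipping agent needs no special parsing rules.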
I also added a check that verifies the logging pipeline itself. The application writes a small synthetic log message every minute, and a cron job checks that it appears in Elasticsearch. If it does not, an alert fires. That simple heartbeat has caught two incidents in the year since, both before they became night-long debugging sessions. Finally, I adopted structured logging everywhere. Instead of string interpolation, my logs are now JSON objects with consistent fields: timestamp, level, message, service, traceId, and error stack when applicable. This makes searching Elasticsearch fast and precise. No more grepping through multiline stack traces by hand.
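The pipeline heartbeat described above comes down to two small pieces, sketched here with hypothetical names. The Elasticsearch lookup is deliberately left out: `lastSeenIso` stands for whatever timestamp the checker finds on the newest heartbeat entry, or `null` if it finds none. The three-minute threshold is an assumption chosen to tolerate a couple of missed beats.

```javascript
// Emitted by the application once a minute (scheduling not shown).
// In practice this line is written straight to stdout so that a log-level
// threshold cannot filter it out.
function heartbeatLine(service, now = new Date()) {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level: "info",
    message: "logging-heartbeat",
    service,
  });
}

// Run by the external checker after it has looked up the newest heartbeat
// in the log store. Returns true when an alert should fire.
function isPipelineStale(lastSeenIso, now, maxAgeMs) {
  if (lastSeenIso === null) return true; // no heartbeat found at all
  const lastSeen = Date.parse(lastSeenIso);
  if (Number.isNaN(lastSeen)) return true; // unparseable timestamp
  return now.getTime() - lastSeen > maxAgeMs;
}

const checkTime = new Date("2024-01-01T03:20:00Z");
const threeMinutes = 3 * 60 * 1000;

console.log(isPipelineStale("2024-01-01T03:19:10Z", checkTime, threeMinutes)); // false
console.log(isPipelineStale("2024-01-01T03:14:00Z", checkTime, threeMinutes)); // true
console.log(isPipelineStale(null, checkTime, threeMinutes)); // true
```

Treating a missing or unparseable heartbeat as stale matters: the check has to fail loudly in exactly the situation where the pipeline has gone quiet.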
What I’d Do Differently Now
If I could go back and build the checkout service again, I would make three changes to the logging setup from day one. I would log to stdout only, and let the infrastructure handle aggregation. I would set the default log level to “warn” and use environment variables to temporarily increase verbosity for debugging, rather than drowning production in info logs. And I would add monitoring for the logging pipeline itself, treating it as a first-class service with uptime requirements. Logging is not a feature. It is a dependency, and dependencies need to be reliable.
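The level policy can be sketched in a few lines. `LOG_LEVEL` is a hypothetical environment variable name, and the four-level list is a simplification of what real logging libraries offer.

```javascript
// Severity order, lowest to highest (a simplified set of levels).
const LEVELS = ["debug", "info", "warn", "error"];

// Reads the threshold from LOG_LEVEL (a hypothetical variable name),
// falling back to "warn" when it is unset or unrecognized.
function resolveLevel(env = process.env) {
  const requested = (env.LOG_LEVEL || "").toLowerCase();
  return LEVELS.includes(requested) ? requested : "warn";
}

// True when a message at `level` should be emitted under `threshold`.
function shouldLog(level, threshold) {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(threshold);
}

const defaultThreshold = resolveLevel({}); // no override, so "warn"
console.log(shouldLog("info", defaultThreshold)); // false
console.log(shouldLog("error", defaultThreshold)); // true

// Temporarily raising verbosity for a debugging session.
console.log(shouldLog("info", resolveLevel({ LOG_LEVEL: "info" }))); // true
```

Falling back to "warn" on an unrecognized value means a typo in the variable leaves production quiet rather than flooding it.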
I would also avoid community logging transports that add buffering without clear failure modes. The convenience of piping logs directly to Elasticsearch was not worth the brittleness. Simplicity beats integration when the cost of failure is high, and for production debugging, the cost of failure is always high. A simple stdout pipeline may be boring, but it works, and that is exactly what I need at 3 a.m.
Why This Matters Beyond One Night
That night taught me that debugging skill is not just about finding bugs quickly. It is about building systems that make bugs findable. Logging is not something you add after the code works. It is part of the code’s functionality. A service with broken logging is a service with a blind spot, and blind spots in production eventually become incidents. The checkout service incident could have been a five-minute fix if I had seen the signature verification error immediately. Instead, it became a war story I tell junior developers when they ask why I am so particular about log format and transport resilience.
Now, whenever I review a pull request, one of the first things I look at is how errors are logged. Are they logged at the error level? Do they include enough context to debug without reproducing the issue? Is the logging path decoupled from external services? These questions are scars from that night, and they have made every system I have built since more debuggable. The nightmare was avoidable, but the lessons it burned into me are permanent. I am a better engineer because I once stayed up all night chasing a bug that was hiding in plain sight, silenced by my own logging code.
