Monitoring: how you know it broke

AI asks: How will you know when it broke?

There is a famous pattern in software, observed across decades and dozens of companies, that goes like this. A new team ships a new product. Things go well for a while. Then something quietly breaks at 3 a.m. on a Saturday. The team finds out about it on Monday morning, from an angry customer email. The customer is angry not because the thing broke (things break, that is fine) but because they found out about it before the team did.

Monitoring is the discipline of making sure you find out first.

Image: the classic 'this is fine' dog sitting in a burning room. The room is labeled 'production'. The dog is labeled 'team without monitoring'. Caption: 'no alerts have fired so far'.

No news is rarely good news.

The three layers of monitoring

Monitoring as a topic has gotten elaborate over the last twenty years. For your purposes, it collapses into three layers, each answering a different question.

Logs. "What is happening, in detail, right now and historically?" Every request your server handles, every error it encounters, every action a user takes can be written to a log, which is just a stream of timestamped lines of text. You read logs when something has already gone wrong and you want to figure out why.
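To make "a stream of timestamped lines of text" concrete, here is a minimal sketch using Python's standard logging module. The logger name app, the handle_signup function, and the password rule are all hypothetical, chosen only for illustration. The lines go to an in-memory buffer so the example runs on its own; a real server would typically write to stdout or a file, where the hosting provider picks them up.

```python
import io
import logging

# Collect log lines in memory so the example is self-contained.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("app")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # do not also send these lines to the root logger


def handle_signup(user_id: str, password: str) -> bool:
    """Hypothetical request handler that logs each step it takes."""
    logger.info("signup started user_id=%s", user_id)
    if len(password) < 8:  # assumed rule, for illustration only
        logger.error("signup failed reason=password_too_short user_id=%s", user_id)
        return False
    logger.info("signup succeeded user_id=%s", user_id)
    return True


handle_signup("u_1", "correct-horse-battery")
handle_signup("u_2", "short")

log_lines = log_buffer.getvalue().splitlines()
print("\n".join(log_lines))
```

Each line carries a timestamp, a level (INFO or ERROR), and a message. When something goes wrong, searching these lines for ERROR or for one user's ID is how you reconstruct what happened.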

Metrics. "What is the health of the system, at a glance?" Numbers over time: request count per minute, average response time, error rate, CPU usage. Dashboards full of graphs. You look at metrics to spot trends, and you set alerts on them so you find out automatically when something is off.
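As a rough illustration of "numbers over time," the sketch below turns a handful of made-up request records into three of the metrics named above, one set per minute. The Request type, the metrics_per_minute function, and the sample data are assumptions for this example; in practice a metrics tool does this aggregation for you.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    minute: int         # minutes since the start of the sample window
    duration_ms: float  # how long the request took
    ok: bool            # True if the request succeeded


def metrics_per_minute(requests: list[Request]) -> dict[int, dict[str, float]]:
    """Group requests by minute and compute count, average time, and error rate."""
    buckets: dict[int, list[Request]] = defaultdict(list)
    for request in requests:
        buckets[request.minute].append(request)

    result: dict[int, dict[str, float]] = {}
    for minute, group in sorted(buckets.items()):
        failures = sum(1 for request in group if not request.ok)
        result[minute] = {
            "request_count": len(group),
            "avg_response_ms": sum(r.duration_ms for r in group) / len(group),
            "error_rate": failures / len(group),
        }
    return result


# Made-up sample data: a healthy minute followed by a bad one.
sample = [
    Request(0, 100.0, True),
    Request(0, 140.0, True),
    Request(1, 120.0, True),
    Request(1, 900.0, False),
    Request(1, 180.0, True),
    Request(1, 800.0, False),
]

metrics = metrics_per_minute(sample)
for minute, values in metrics.items():
    print(minute, values)
```

In this sample, minute 0 looks healthy and minute 1 shows both a jump in average response time and an error rate of one half. That kind of change is what a graph makes visible at a glance.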

Error tracking. "What specifically broke, and where, and how often?" When an error happens, it gets captured, grouped with other instances of the same error, and surfaced in a tool that ranks them by frequency and severity. You triage errors the way a clinic triages patients.
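The grouping-and-ranking idea can be sketched in a few lines. This is a simplified stand-in, not how any particular tool works: it assumes an error is identified only by its type and the place it was raised, and it ranks by frequency alone. Real error trackers typically use richer signals, such as the stack trace, and also weigh severity. The captured list and the triage function are hypothetical.

```python
from collections import Counter

# Hypothetical captured errors: (exception type, where it was raised).
captured = [
    ("TimeoutError", "payments.charge"),
    ("KeyError", "signup.create_user"),
    ("TimeoutError", "payments.charge"),
    ("TimeoutError", "payments.charge"),
    ("KeyError", "signup.create_user"),
    ("ValueError", "profile.update"),
]


def triage(errors: list[tuple[str, str]]) -> list[tuple[tuple[str, str], int]]:
    """Group identical errors together and rank the groups by how often they occur."""
    return Counter(errors).most_common()


ranked = triage(captured)
for (error_type, location), count in ranked:
    print(f"{count}x {error_type} in {location}")
```

Six raw errors collapse into three groups, with the most frequent at the top. That ranked list is what you work through when deciding what to fix first.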

Each of these layers has well-established tools you can pick up in an afternoon.

Logs. Built into your hosting provider (Vercel logs, Cloudflare logs, AWS CloudWatch), plus standalone services like Better Stack, Datadog, or Axiom.
Metrics. Datadog, Grafana Cloud, your hosting provider's dashboard.
Error tracking. Sentry is the default. Rollbar and Bugsnag are alternatives.

For Week 3 builds: turn on your hosting provider's built-in logs, and add Sentry. That is enough for v1. Day 19 will come back to this in the guardrail checklist.

How to answer "how will you know when it broke"

The format that works is to name three things.

What is the worst silent failure? The thing that, if it stopped working, your users would feel before you did. (For most apps: the sign-up flow.)

What signal would tell me, automatically, that it is happening? ("If sign-up success rate drops below X for Y minutes, page me.")

Where does that page go? Your phone, your email, a Slack channel. Not nowhere. If the alert fires into a room with nobody in it, you do not have monitoring, you have decorative noise.

This is enough specification for any AI tool to wire up real monitoring for your build. Day 18 will return to this.
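The three answers map onto a small amount of logic. The sketch below is one way to express "if sign-up success rate drops below X for Y minutes, page me," assuming you already have one success-rate reading per minute. The function names, the 80% threshold, the three-minute window, and the notify callback (a stand-in for your phone, email, or Slack) are all assumptions for illustration.

```python
from typing import Callable


def should_page(success_rates: list[float], threshold: float, minutes: int) -> bool:
    """True if each of the last `minutes` readings is below `threshold`."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    if len(success_rates) < minutes:
        return False  # not enough data yet to judge
    return all(rate < threshold for rate in success_rates[-minutes:])


def check_signup_alert(
    success_rates: list[float],
    threshold: float,
    minutes: int,
    notify: Callable[[str], None],
) -> bool:
    """Send a page through `notify` when the rule is met; return whether it fired."""
    if should_page(success_rates, threshold, minutes):
        notify(f"Sign-up success rate below {threshold:.0%} for {minutes} minutes")
        return True
    return False


# Hypothetical per-minute sign-up success rates, oldest first.
rates = [0.97, 0.96, 0.60, 0.55, 0.50]

pages_sent: list[str] = []  # stands in for a real destination
fired = check_signup_alert(rates, threshold=0.80, minutes=3, notify=pages_sent.append)
print(fired, pages_sent)
```

Requiring the rate to stay low for several minutes in a row keeps a single bad minute from waking anyone up. The notify argument is the "where does that page go" answer: if nothing real is plugged in there, the rule fires into an empty room.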

A small vocabulary sweep

Alert. A notification that fires when a metric crosses a threshold, usually for a sustained period. ("Error rate above 5% for ten minutes.")
SLO (service level objective). A specific target for system health ("99.9% of requests should succeed"). Useful for grown-up products, overkill for most v1 builds.
Tracing. A more advanced relative of logging: each request gets a unique ID, so you can follow it through every system it touches. Useful at scale.
Observability. The umbrella term for "having enough signal to figure out what is happening in production." Logs, metrics, and traces add up to observability.
On-call. The arrangement where one engineer at a time is responsible for responding to alerts. Healthy teams rotate this. Unhealthy teams have one person silently doing it forever.
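Of the terms above, tracing is the one that is easiest to picture in code. The sketch below assumes three hypothetical systems (web, payments, email) and uses a plain list in place of a real tracing backend. The point is only that every event carries the same request ID, so one request can be followed from start to finish.

```python
import uuid

# Stands in for a tracing backend; each entry is one recorded event.
trace_log: list[dict[str, str]] = []


def record(request_id: str, system: str, event: str) -> None:
    """Record one event, tagged with the ID of the request that caused it."""
    trace_log.append({"request_id": request_id, "system": system, "event": event})


def charge_card(request_id: str) -> None:
    record(request_id, "payments", "card charged")


def send_receipt(request_id: str) -> None:
    record(request_id, "email", "receipt queued")


def handle_checkout() -> str:
    """Hypothetical request handler: assign an ID once, then pass it along."""
    request_id = uuid.uuid4().hex
    record(request_id, "web", "checkout received")
    charge_card(request_id)
    send_receipt(request_id)
    return request_id


def trace(request_id: str) -> list[str]:
    """Return every event that belongs to one request, in order."""
    return [
        f"{entry['system']}: {entry['event']}"
        for entry in trace_log
        if entry["request_id"] == request_id
    ]


first = handle_checkout()
second = handle_checkout()
print(trace(first))
```

Both requests write to the same log, but filtering by ID pulls out the path of just one of them.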

Why this is the most-skipped layer in beginner builds

Almost every beginner build ships without monitoring, and almost every beginner build pays for it. The reason is that monitoring feels like overhead. It does not add a feature, it does not speed anything up, it does not even make the product look better. It just sits there, quietly, until the day it saves your weekend.

The cheapest peace of mind in software is wiring up a free Sentry account and turning on your hosting provider's alerts before you go live. It takes an hour, and on the day something breaks at 3 a.m. on a Saturday, you will be grateful.

Forward references

Day 15 begins Week 3, where we shift from "the map" to "your build." Day 19 returns to monitoring as one of the five guardrails. Day 21 closes the course by asking you to commit, on paper, to setting these up before you ship.

Day 14 wrap

The thing you can now say plainly. Monitoring is how you find out something is broken before your users tell you. It has three layers: logs (detail), metrics (health), and error tracking (what specifically broke).

The thing you can now do. When AI asks "how will you know when it broke," name the worst silent failure, the signal that would tell you, and where the page goes.

The guardrail to remember. The day before you go live is the day to wire up monitoring. The day after, it is already too late for the first incident.

You finished Week 2. See you in Your Build.