Happy Path Is Not Production
Your automation works. That is the problem.
It ran today. It ran yesterday. It will run tomorrow, on the input you imagined while you were building it. Watching it work feels like proof. It is not proof. It is a demo that has not been contradicted yet.
The question almost everyone asks is “does it work?” You run it, the output is right, you ship it. That question is easy to answer and it answers the wrong thing. It tells you the automation can succeed. It tells you nothing about what happens the first time it cannot.
Here is what that looks like.
An order-to-fulfilment automation ran clean for nineteen days. On day twenty a customer requested a refund four seconds before the fulfilment step read the order record. There was no branch for “the record changed underneath us.” The automation did exactly what it was built to do: it read an order, it shipped it. It shipped a cancelled order, told no one, and recorded a success. Nobody looked, because it had reported success nineteen times and there was no reason to think the twentieth was different. The cost was not the postage. The cost was the eleven days before anyone noticed the pattern.
Nothing about that failure was exotic. The data was valid. The tool did not break. The automation was not wrong. It was incomplete. It had a success path and no answer for the moment reality moved faster than it did.
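Here is a minimal sketch of what the missing branch could look like, in Python. Every name in it is hypothetical: the Order record, its status and version fields, and the fetch_order and ship callables stand in for whatever your system actually uses. It assumes the order store bumps a version number on every change. The idea is to re-read the record immediately before the irreversible step and refuse to ship if it changed or is no longer in a shippable state.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Order:
    order_id: str
    status: str   # hypothetical values such as "paid" or "cancelled"
    version: int  # assumed to be bumped by the store on every change


class RecordChanged(Exception):
    """The order changed, or became unshippable, before the ship step."""


def fulfil(order_id: str,
           fetch_order: Callable[[str], Order],
           ship: Callable[[Order], None]) -> str:
    seen = fetch_order(order_id)
    # ... picking, packing, and label creation would happen here ...
    current = fetch_order(order_id)  # re-read right before the irreversible step
    if current.version != seen.version or current.status != "paid":
        # Say so out loud instead of shipping and recording a success.
        raise RecordChanged(
            f"order {order_id}: version {seen.version} -> {current.version}, "
            f"status {current.status!r}"
        )
    ship(current)
    return "shipped"
```

A re-read narrows the window; it does not close it. Closing it fully would need the store itself to reject the write when the version has moved, which depends on what your store supports.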
So the question is not “does it work?” The question is “what does it do the first time it cannot?”
That reframing changes what you are building. A production automation is not a success path with some error handling bolted on. It is mostly failure handling, with a success path running through the middle of it. The happy path is the part that is easy. It is also the smallest part. The part that determines whether the automation survives a year instead of a week is everything around it: what it does when an input is malformed, when a service is down, when a credential expires, when the same event arrives twice, when the record changes mid-run, when the answer is uncertain and a human should decide.
Those are not one thing. “It failed” is not a state you can handle. A malformed input and an expired credential and a duplicate event are three different problems that need three different responses, and a workflow that routes them all into one “something went wrong” branch has not handled failure. It has hidden it. The work is not adding a catch. The work is deciding, for each way this can fail, what should happen (stop, retry, escalate, or roll back), and making the automation say so out loud when it does.
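One way to make that decision explicit, sketched in Python. The exception names, the Action values, and the mapping between them are illustrative assumptions, not a prescribed taxonomy; your failures and your responses will differ. What matters is the shape: each failure has its own named response, the response is logged when it fires, and anything unclassified goes to a human instead of disappearing into a generic branch.

```python
import logging
from enum import Enum

log = logging.getLogger("automation")


class Action(Enum):
    STOP = "stop"
    RETRY = "retry"
    ESCALATE = "escalate"
    ROLL_BACK = "roll_back"


# Hypothetical failure types, one per way the run can fail.
class MalformedInput(Exception): ...
class ServiceDown(Exception): ...
class CredentialExpired(Exception): ...
class DuplicateEvent(Exception): ...
class RecordChanged(Exception): ...


# The policy is decided up front, per failure, not inside a catch-all.
# These particular pairings are assumptions made for illustration.
POLICY = {
    MalformedInput: Action.STOP,
    ServiceDown: Action.RETRY,
    CredentialExpired: Action.ESCALATE,
    DuplicateEvent: Action.STOP,
    RecordChanged: Action.ROLL_BACK,
}


def decide(error: Exception) -> Action:
    # Anything unclassified is escalated rather than swallowed.
    action = POLICY.get(type(error), Action.ESCALATE)
    log.error("failure=%s action=%s detail=%s",
              type(error).__name__, action.value, error)
    return action
```

Used inside a workflow, the caller catches the exception, asks decide what to do, and acts on the answer. The table is short enough to review in a meeting, which is the point: the failure behaviour is something you can read, not something you discover.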
Designing that after the happy path works is the mistake. By then the happy path feels finished, the failure paths feel like cleanup, and cleanup is what gets cut when the week runs short. So invert it. Decide how this is allowed to fail before you build how it succeeds. The success path is the part you will get right anyway. The failure paths are the part you will only build if you build them first.
This is also why “it works in the demo” is not a milestone. The demo is the happy path by definition. You control the input. Production is the part you do not control. The gap between the two is not polish. It is the system.
An automation is not done when the happy path works. It is done when failure is detected, contained, and recoverable.
Does it work?
What does it do the first time it can't?