EXPERIMENTS
The threshold gate: how we decide what gets to scale.
Internally we have called what follows a "threshold gate" for a while, and over the last year we have realized the label is slightly wrong. It implies bureaucracy. It implies a checklist. What it actually is, when it works, is a single conversation that happens at a specific moment, and the quality of that conversation is the whole thing. The label makes it sound like governance. It is really closer to editing.
The moment
The conversation happens when an experiment has cleared statistical significance and the owner is ready to scale it. By "scale" I mean at least 5x the current spend or audience. It happens before the budget moves. Afterwards is too late.
It has three questions in it. None of them are about significance. We assume significance is already there, because if it is not, there is nothing to talk about yet.
Question one: do you know why it worked
This is the question that kills the most scale-ups. The owner usually has a theory, but when asked to state it in one sentence, the theory collapses into "users responded well." That is not a theory, it is a restatement of the result. A theory is a mechanism. Why did this specific creative work. Which cohort responded, and what belief or behavior is the experiment exploiting. If the owner cannot articulate the mechanism, scaling is a bet on a confound nobody has identified yet. Scaling buys noise.
Question two: where does it saturate
Nobody knows exactly, and that is fine. The honest version of the question is: at what spend or audience do you expect the returns to stop being linear, and what will you do when that happens. "I will run it up until CAC starts to move and then hold" is a real answer, because it implies guardrails and an owner who is watching daily. Silence is not a real answer. Silence means the scale-up is being decided on optimism, and optimism about saturation is expensive to test.
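The "run it up until CAC starts to move and then hold" answer implies a concrete daily check. A minimal sketch of what that guardrail might look like, where every name and the 15% tolerance are illustrative assumptions rather than our actual tooling:

```python
# Hypothetical sketch of the "run until CAC moves, then hold" guardrail.
# Function names and the tolerance value are illustrative assumptions.

def cac(spend: float, acquisitions: int) -> float:
    """Customer acquisition cost: spend divided by customers acquired."""
    if acquisitions == 0:
        return float("inf")
    return spend / acquisitions

def guardrail_decision(baseline_cac: float, todays_cac: float,
                       tolerance: float = 0.15) -> str:
    """Return 'hold' once CAC drifts more than `tolerance` above the
    pre-scale baseline, otherwise 'continue'. The point is that the
    owner runs a check like this daily, not that the threshold is 15%."""
    if todays_cac > baseline_cac * (1 + tolerance):
        return "hold"
    return "continue"

# Example: baseline CAC of $40; today we spent $5,500 for 110 customers,
# so today's CAC is $50, which is above the $46 guardrail.
print(guardrail_decision(40.0, cac(5500, 110)))  # prints "hold"
```

The specific tolerance matters less than the fact that it was chosen before the scale-up, which is what separates a guardrail from optimism.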
Question three: who breaks
Scaling an experiment rarely stays inside the experiment. A paid creative scale-up means the creative team owes more variants, usually starting yesterday. A paid channel scale-up means finance needs to sign off on the new cap. An onboarding change means customer support sees more confusion. Every scale-up has an operational cost that lands on someone who was not in the experiment, and if that person has not been told, the scale-up will create a second, worse problem inside a month.
What this has actually killed and cleared
Hard to count precisely, because the conversation is informal most of the time and the record-keeping is not what it should be. My recollection of the last year is that roughly a third of experiments clearing significance did not pass the three-question version of the conversation, and were held at their existing size. Some of those were revived a few weeks later, once the owner had found the mechanism. The rest stayed held, which is almost always the right outcome, and almost always feels like a loss when it happens.
Why this works when it works
Not because it is disciplined, and not because it is a framework. It works because the three questions respect the experimenter. They do not ask "prove it to me." They ask "do you understand it yet." A good experimenter can usually feel the difference, and will find the answer within a week if they did not have it at the start. A weak experimenter will hate the conversation regardless of what you call it, which is also useful information.