Comments on Digital Marketing and Analytics by Anil Batra: Significance of Statistically Significant Results in A/B Testing

Hi Olga, Yes this is from GWO. Will you be able t...

2011-03-11T11:17:19.895-08:00

Hi Olga,

Yes this is from GWO. Will you be able to send me a copy of your chart?

Thanks for the post. Is it Google website optimize...

2011-03-11T10:30:50.332-08:00

Thanks for the post. Is it a Google Website Optimizer test? We ran several tests with this tool and got the same charts every time: during the first two weeks of testing there is a significant difference between combinations, but after two weeks the chart shows no difference between them at all.

I think a problem with the null hypothesis formali...

2010-09-20T16:25:48.406-07:00

I think a problem with the null hypothesis formalism is that it can, ironically, paralyze decision making. You can get stuck worrying that you are picking the 'wrong' answer. However, it might be better to think about the decision problem in terms of maximizing revenue (or minimizing regret) rather than winners and losers or right ones and wrong ones.

A couple of things to think about before going the null hypothesis route:
1) Opportunity costs - it costs you to learn, so you want to do it efficiently. Significance testing as discussed above does not account for the reward lost by playing suboptimal options.
2) The internet is real-time! There is often no need to arbitrarily pick a time (picking a confidence level is arbitrary) to force yourself to play a pure strategy from then on. Think in continuous terms rather than in discrete terms. There is no reason you can't keep all options available but decrease the frequency with which they are played as you move through time and learn more about them. So play a mixed (portfolio) strategy. Which brings me to the last point:
3) The environment need not be stationary. More likely than not, the environment that your application is making its decisions in is non-stationary, so the notion of a 'winner' may not make any sense.

Often, these types of decision problems can be better modeled as a bandit problem rather than as a hypothesis test problem.
Of course it depends on what you want to learn and why. If you need to make generalizations about the world, then you probably want to get some confidence around your statements and should run more formal tests, but if you just want to optimize an online process, then I'm not sure that significance testing is always the way to go. I can guarantee that Google, Yahoo! etc. are not running null hypothesis tests on ad placements - they are running bandits.
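To make the bandit idea concrete, here is a minimal sketch of one common bandit strategy, Thompson sampling, in Python. It is only an illustration, not what any particular company runs: the function name `thompson_sampling`, the conversion rates, and the visitor count are all made up, and the sketch assumes the rates stay fixed, so it does not handle the non-stationary case from point 3 (that would need something like discounting old observations).

```python
import random

def thompson_sampling(true_rates, n_visitors, seed=0):
    """Simulate Thompson sampling over variations with fixed conversion rates.

    true_rates is hypothetical and unknown to the algorithm; it is only used
    to simulate whether a visitor converts.
    Returns (plays, successes) per variation.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    plays = [0] * k
    for _ in range(n_visitors):
        # Draw a plausible conversion rate for each variation from its
        # Beta posterior (uniform Beta(1, 1) prior), then show the variation
        # with the highest draw.
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(k)]
        arm = max(range(k), key=draws.__getitem__)
        plays[arm] += 1
        # Simulate the visitor's response.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return plays, successes

if __name__ == "__main__":
    plays, successes = thompson_sampling([0.05, 0.50], n_visitors=2000)
    print("visitors per variation:", plays)
    print("conversions per variation:", successes)
```

No variation is ever switched off here. A weak one simply gets shown less and less often as evidence accumulates, which is the mixed strategy described in point 2.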

Thanks

Jason, I agree with you.

2010-02-08T21:29:22.245-08:00

Jason, I agree with you.

Suresh, I always advice to wait but there are case...

2010-02-08T21:29:03.564-08:00

Suresh,
I always advise waiting, but there are cases when the test runs too long without giving any conclusive results; in that case you need to test a different variation.
In this case it seems like picking yellow might not have been a wrong decision, but it would not have been the right decision either. (Also see my answers above.)

Giadascript, the test is still running. All the tr...

2010-02-08T21:26:14.431-08:00

Giadascript, the test is still running. All the treatments are getting equal traffic.

Barbara, Since the test has not completed yet I c...

2010-02-08T21:25:23.427-08:00

Barbara,

Since the test has not completed yet, I cannot say whether picking yellow would be a mistake or not. The point I am trying to make is that you have to wait for statistically significant results before declaring a winner (or loser), as results can change (as the chart shows).

Suarbah, Once you reach statistical significance,...

2010-02-08T21:23:39.517-08:00

Suarbah, Once you reach statistical significance, the results should not change, assuming all other variables don't change.
This example shows that results can change early on and you should not make decisions in a hurry without getting statistically significant results.
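As a rough illustration of how much early results can move, the Python sketch below simulates two identical variations and measures the gap between their cumulative conversion rates at a few checkpoints. Everything in it is assumed for illustration (the 10% conversion rate, the checkpoints, the function names); it is not data from the test in the post.

```python
import random

def cumulative_gap(rate, checkpoints, rng):
    """Gap in cumulative conversion rate between two identical variations.

    checkpoints must be in ascending order; one gap is returned per
    checkpoint, measured after that many visitors per variation.
    """
    conv_a = conv_b = 0
    gaps = []
    targets = set(checkpoints)
    for visitors in range(1, max(checkpoints) + 1):
        # Both variations share the same true rate, so any gap is noise.
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if visitors in targets:
            gaps.append(abs(conv_a - conv_b) / visitors)
    return gaps

def average_gaps(rate=0.10, checkpoints=(100, 1000, 5000), runs=100, seed=1):
    """Average the checkpoint gaps over many simulated tests."""
    rng = random.Random(seed)
    totals = [0.0] * len(checkpoints)
    for _ in range(runs):
        for i, gap in enumerate(cumulative_gap(rate, checkpoints, rng)):
            totals[i] += gap
    return [total / runs for total in totals]

if __name__ == "__main__":
    for n, gap in zip((100, 1000, 5000), average_gaps()):
        print(f"after {n} visitors per variation: average gap {gap:.4f}")
```

The gap is typically largest at the first checkpoint and shrinks as visitors accumulate, even though there is no real difference between the two variations.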

Google has a calculator that might help, https://www.google.com/analytics/siteopt/siteopt/help/calculator.html

Great post! The variations are trending so close...

2010-02-07T17:12:17.872-08:00

Great post!

The variations are trending so closely - makes me think you need to go back to the drawing board and start over!
http://bit.ly/a2uVxU

Nice post. I think even more important than letti...

2010-02-07T08:16:49.552-08:00

Nice post. I think even more important than letting your AB tool pick a winner for you, you should know the statistics behind the calculations so that your results are tool independent.
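For readers who want to see the statistics outside of any tool, one standard calculation is a two-proportion z-test on the conversion counts. The Python sketch below is only an illustration under that assumption; it is not necessarily the calculation any given A/B tool performs, and the function name and example numbers are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a, conv_b: number of conversions; n_a, n_b: number of visitors.
    Uses the pooled standard error under the null hypothesis of equal rates.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

if __name__ == "__main__":
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the observed difference would be unlikely if the two variations truly converted at the same rate. It says nothing about whether the sample was large enough to begin with.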

Anil, nice post! But some would argue that picking...

2010-02-06T11:57:38.424-08:00

Anil, nice post! But some would argue that picking yellow early wouldn't have a big negative impact (based on the graphs) and indeed you would be applying testing correctly - test, fail/succeed quickly, and then retest. It would be interesting to see the analysis around when testers must make a decision and when it pays to wait.

This can also happen if your client has trend shif...

2010-02-04T16:58:43.110-08:00

This can also happen if your client has trend shifts in their traffic stream - of course you have to draw the line at some point or another or you'll be in perpetual testing mode forever.

Having said that, there are SOME times when the test page completely tanks or smokes the original page - but in most instances I agree it's better to wait it out.

I like the post. Short and sweet.

Excellent post, Anil. It's so critically impor...

2010-02-03T13:56:11.770-08:00

Excellent post, Anil. It's so critically important we understand significance before just quickly making decisions on very early results. I wrote about this a bit in a post called "Wanna be better with metrics? Watch more poker and less baseball" (http://www.retailshakennotstirred.com/retail-shaken-not-stirred/2009/11/wanna-better-understand-metrics-watch-more-poker-and-less-baseball.html) and in that post included a free spreadsheet that can be used to determine statistical significance. I thought your readers might find it useful.

Thanks for an excellent post!

While i agree with you conclusion, and prefer to h...

2010-02-03T11:58:35.583-08:00

While I agree with your conclusion, and prefer to have stat. sig. results, here is my question: would we have made a huge error by selecting yellow early (if we had no time to test any longer), as it seems to work as well as any of the other options...

Not only do you not have a winner, you might incor...

2010-02-03T10:07:26.701-08:00

Not only do you not have a winner, you might incorrectly choose "blue" as the loser. A possible scenario is that Blue is a new treatment that users need to get used to, or that some new treatments are "exciting" or novel to start with but the effect doesn't last.

I completely agree with your conclusions here.

Our paper http://exp-platform.com/hippo_long.aspx

talks in depth about Novelty and Primacy effects.

This has to be the best explaining article on stat...

2010-02-02T23:59:52.571-08:00

This has to be the best article explaining statistically significant data for A/B testing.

One of those cases when an image is worth 1000 calculations :)

Thank you.

Between two sets in an A/B split test, if the popu...

2010-02-02T21:05:41.509-08:00

Between two sets in an A/B split test, if the population of each set is sufficiently large and the difference in response rates is statistically significant, how will you know that the results may change in the future and one should not take a call based on one test?

How many days should you be running the test to be sure that the results are statistically significant and will not change in the future?
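One common way to approach the "how many days" question is to estimate the required sample size up front and divide by daily traffic. The Python sketch below uses the standard normal-approximation formula for comparing two proportions. The baseline rate, expected lift, significance level, power, and traffic figures are all assumptions you would have to supply, and the result is a planning estimate, not a guarantee that results will never change.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_lift,
                              alpha=0.05, power=0.80):
    """Visitors needed per variation to detect the given relative lift.

    Two-sided test at significance level alpha with the given power,
    using the normal approximation for two proportions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def days_needed(n_per_variation, n_variations, daily_visitors):
    """Days to collect the sample if traffic is split evenly."""
    return math.ceil(n_per_variation * n_variations / daily_visitors)

if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline, hoping to detect a 20% relative lift.
    n = sample_size_per_variation(baseline_rate=0.05, relative_lift=0.20)
    print("visitors per variation:", n)
    print("days at 1000 visitors/day, 2 variations:", days_needed(n, 2, 1000))
```

Smaller expected lifts need much larger samples, which is why tests between closely matched variations can run for a long time without a conclusive result.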