BYOMSPM

Build-Your-Own Master’s Degree in Product Management

Here you'll find my thoughts on a collection of podcasts, articles, and videos related to product management, organized like a semester of a Master's degree.

Module 4 / Product / A|B Testing



This post covers a quick refresher on A/B testing, some key tips for doing it right, and a list of some of its limitations. Anna Marie Clifton was the guest on the two Product Podcast episodes cited in this post; know that if I cite the podcast here, the idea was probably hers.

Grade I gave myself for this assignment: 95/100

Quick Refresher on A|B Testing

In product management, an A/B test involves giving a random half of a product's user base a potential new version of a product experience ("the treatment") while the other half continues to receive the legacy experience ("the control") (Huryn, 2023). Analysis of the differences in user behavior and outcomes between the two groups can be compared against a hypothesis about the effect of the potential new experience to help inform whether the new feature should be implemented. In theory, this type of testing can be used to assess the impact of anything from smaller, individual features to entire changes to a product experience.
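To make the comparison concrete, here's a minimal sketch of the analysis step, comparing conversion counts between the two groups with a standard two-proportion z-test. The function name and all numbers are hypothetical, invented for illustration, not taken from any of the cited sources.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is the treatment's conversion rate
    significantly different from the control's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical outcomes: control converted 480 of 5,000 users,
# treatment converted 560 of 5,000 users.
z = two_proportion_z(480, 5000, 560, 5000)
print(round(z, 2))  # |z| > 1.96 would be significant at the 5% level
```

Whether the observed difference clears that significance bar is what "informs whether the new feature should be implemented" in practice.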

How to Do A|B Testing Right

A/B testing promises to get us relatively closer to proving causation rather than mere correlation (Huryn, 2023), but only if some key conditions hold.

The first thing to know about A/B testing is that for the results to reach statistical significance, the test generally needs to involve at least (somewhere on the order of) thousands of users (The Product Podcast, 2017).
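You can sanity-check that rule of thumb with a standard normal-approximation sample-size formula for comparing two proportions. This is a generic statistics sketch, not something from the cited sources, and the baseline/target rates are made up for illustration.

```python
import math

def sample_size_per_arm(p_base, p_target, ):
    """Approximate users needed in EACH group to detect a change
    from p_base to p_target (two-sided alpha = 0.05, power = 0.80)."""
    z_alpha = 1.96    # critical value for 5% two-sided significance
    z_beta = 0.8416   # critical value for 80% power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    effect = abs(p_target - p_base)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical: 10% baseline conversion, hoping to detect a lift to 12%
n = sample_size_per_arm(0.10, 0.12)
print(n)  # thousands of users per arm, consistent with the rule of thumb
```

Note that the smaller the effect you want to detect, the more users you need, which is why small UI tweaks often require the largest tests.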

Another critical consideration for A/B tests is that the two experiences (potential new experience & legacy experience) need to be available at the same time, to minimize the potential for alternative explanations of any differences in user behavior or outcomes between the two groups (The Product Podcast, 2017). According to Clifton, this is commonly misunderstood (The Product Podcast, 2017). Along with this, it's critical that the two groups are assigned truly at random; otherwise the experiment will exhibit selection bias and fail to isolate the effect of the change (Huryn, 2023).

In one of the podcasts, Clifton explains that a third critical thing to keep in mind when running an A/B test is determining from the beginning whether it is a "test-to-ship" experiment or a "test-to-learn" experiment. A "test-to-ship" experiment is used for a new feature that you expect to keep in production; the feature is fully fleshed out and implemented completely prior to the A/B test. On the other hand, a "test-to-learn" experiment is usually an early iteration of a potential new feature, and the implementation is done only as thoroughly as needed to test the hypothesis; the feature is not expected to stay in production in its A/B test state. Making this determination early on is critical to effectively defining success and appropriately setting expectations for stakeholders (The Product Podcast, 2017).

The last thing I'll note is that, according to Clifton, if you're running experiments correctly, around half of them should "fail" (i.e., suggest that the potential new feature should not be implemented).

Downfalls of A|B Testing

A/B testing can be useful, but there are also limitations. Here are some of them:

Running a test on a potential new feature will invariably take longer (and therefore cost more) than simply shipping that feature. If the feature doesn't significantly affect the product experience, or you are fairly confident it will have the intended effect, it might not make sense to run a full-blown experiment (The Product Podcast, 2017).

Another downside to A/B tests is that you can't really change anything about the potential new feature once the experiment has started (even if there are bugs), because doing so compromises the integrity of the experiment's results (The Product Podcast, 2017).

On top of that, A/B testing can also cause confusion for users, as key elements of their experience with the product might be changing, or might differ from the experiences of others around them (The Product Podcast, 2017).

Another limitation of A/B tests is that, at least for software products, the code cleanup is usually imperfect in some way: reverting the changes of a rejected feature might leave remnants, bugs, or tech debt behind (The Product Podcast, 2017).

Lastly, it's important to note that A/B tests in the "wild" are rarely true double-blind tests that are perfectly statistically reliable. Clifton estimates that some A/B tests carry around a 20% chance of uncertainty or inaccuracy in their results.

Happy testing.


Works Cited

“All the Ways That A/B Testing Sucks by Yammer PM.” The Product Podcast. 29 June 2017. Spotify.

Huryn, Pawel. “A/B Testing 101 + Examples.” The Product Compass. 29 July 2023. https://www.productcompass.pm/p/ab-testing-101-for-pms.

“The Thing about A/B test — To Learn or To Ship: That is the question.” The Product Podcast. 7 May 2017. Spotify.

