Split Testing


Quick definition

Split testing is the practice of taking traffic into a website or a digital property, randomly assigning each visitor a different experience or a different piece of content and tracking their activity to see how a different experience affected the outcome.

Key takeaways:

The following information was provided during an interview with Kimen Warner, director of product management for Adobe Target.

What is split testing?
Why is split testing important?
What are the different types of split testing?
What are the limitations of split testing?
What tools do you need to perform split testing?
How do you implement split testing?
What are the methods for analyzing split testing results?
How can companies optimize split testing?
What problems do companies run into when running split tests?
Is split testing cost effective?
How often should you run split tests?
Does split testing affect SEO?
How will split testing evolve in the future?

What is split testing?

Split testing is when you take incoming traffic into a digital experience, like a website or mobile app, and randomly assign each person into a different group.

Those groups see different pieces of content or different experiences, and your goal is to see if delivering different experiences drives more of a particular outcome.

That outcome might be an action like a click, a purchase, or a newsletter sign-up, or it might be an improvement in the overall conversion rate.

Why is split testing important?

It's the easiest way to get quantitative data back from your customers without demanding anything of them. There's no focus group.

You don't have to pull customers out of their daily lives into some lab. A/B testing reaches people while they are interacting with the brand normally, and they don't even know it's happening.

It's also usually more cost effective than making a wholesale change and then hoping it works. And often, if you just hope it works, it’s difficult to track the metrics after the fact to see the impact of your change.

What are the different types of split testing?

The most common types are A/B testing and multivariate testing. A/B testing is when you manually specify each of the different full experiences that someone might see.

You might have multiple pieces of content that all get shown together as one total experience or one total view. This kind of testing isn’t limited to two experience options, so it can be more accurately defined as A/B/C/D testing.

Multivariate testing is when you have a bunch of different components that automatically get combined to create a final experience.

Let's say that you have a headline text, an image, and a call-to-action button. In a multivariate test, you might be testing different versions of the headline text and different images and a different call-to-action text in different colors for that button.

And you create each of those options individually, and then the system automatically generates each of the combinations.

An A/B approach is simpler. You can see all the different experiences a visitor would have. But it takes a lot of time to set up. Depending on how many features you’re testing, you may create more than 50 different versions to get all the combinations.

Multivariate testing automates the process. But the challenge with multivariate testing is that you must make sure that each of those combinations will actually look good.

If you had a purple background for an image, and a red button, you might not want both of those to show up at the same time because it would be visually jarring, so you must make sure that each of your combinations actually works.
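The combination step described above can be sketched in a few lines of Python. This is only an illustration, not how any particular testing product works: the component options, the file names, and the `is_allowed` rule are all hypothetical, chosen to mirror the purple-background and red-button example.

```python
from itertools import product

# Hypothetical component options; the names and values are illustrative only.
headlines = ["Save 20% today", "Free shipping on every order"]
images = ["purple_background.jpg", "white_background.jpg"]
button_colors = ["red", "green", "blue"]


def is_allowed(headline, image, button_color):
    """Reject combinations that clash visually (assumed rule: purple image + red button)."""
    return not (image == "purple_background.jpg" and button_color == "red")


# Every headline paired with every image and every button color.
all_combinations = list(product(headlines, images, button_colors))
# Keep only the combinations that pass the visual check.
valid_combinations = [combo for combo in all_combinations if is_allowed(*combo)]

print(len(all_combinations))    # 2 * 2 * 3 = 12
print(len(valid_combinations))  # 12 minus the 2 purple-and-red pairings = 10
```

Even three small lists produce 12 experiences, which shows how quickly a manual A/B setup can grow as you add options.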

What are the limitations of split testing?

Split testing can be a bit myopic. It's only looking at one moment in time. It’s hard to test a full customer journey because there's just too much data and too many differences between each visitor to really track that accurately.

All the data that comes out of split testing is quantitative, not qualitative. You don't know why someone likes the blue button over the red button or whatever you're testing; you just know that they did.

To overcome this limitation, companies couple split testing with some qualitative usability testing based on the different segments that they're most interested in through avenues like recorded sessions, in-person focus groups, or interviews.

What tools do you need to perform split testing?

First, you need something to keep track of the best ideas at your company. It's helpful to gather those ideas from various people in the company and figure out which ones you're going to implement.

Next, you need a project planner, which can be as simple as a whiteboard or an Excel spreadsheet. You need to keep track of what you're doing, and that can help you figure out how long you're going to run the test and when you can look at the results.

The most technical testing tool you need is something to actually split the traffic on your site. And then you need a way to measure the performance of that change.

If you don't use a package service, it can sometimes be challenging to get results back tied to each individual visitor and make sure that you're looking at the results accurately.

You need to know per session what someone did and make sure it's tied back to the content that you delivered to them.

How do you implement split testing?

Any time a visitor engages with a component a company wants to test, like a landing page, that request comes in, and instead of just returning the general content on the web page, the call needs to route through some sort of testing service.

The company might test internally or use a third-party split testing tool.

That testing tool needs to look at who the person is and see if they're already in a test. If they are, it will continue to show them the content that they should see based on their already existing test membership.

If they aren’t already in a test, the service will make a random assignment and put the visitor into a new experience before returning the content.
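As a rough sketch of that lookup-then-assign logic, assuming an in-memory store and made-up experience names (a real testing service would persist membership in a cookie, profile, or database), the flow might look like this in Python:

```python
import random

EXPERIENCES = ["control", "variant_b", "variant_c"]  # hypothetical experience names


class AssignmentService:
    """Keeps each visitor in the experience they were first assigned."""

    def __init__(self, experiences, seed=None):
        self.experiences = list(experiences)
        self.memberships = {}           # visitor_id -> experience
        self.rng = random.Random(seed)  # seed only to make the demo repeatable

    def get_experience(self, visitor_id):
        # Returning visitor: keep showing the experience already assigned.
        if visitor_id in self.memberships:
            return self.memberships[visitor_id]
        # New visitor: pick an experience uniformly at random and remember it.
        experience = self.rng.choice(self.experiences)
        self.memberships[visitor_id] = experience
        return experience


service = AssignmentService(EXPERIENCES, seed=42)
first = service.get_experience("visitor-123")
second = service.get_experience("visitor-123")
print(first == second)  # True: the same visitor always sees the same experience
```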

The testing service then needs to track the success. Whenever you set up a test, it's important to specify what your goal is so you’re not changing content without measuring the impact on user engagement.

You need to set up a hypothesis at the beginning: for example, that by changing the content, more people will click on it, add a product to the cart, or make a purchase.

So, you need your testing service to track that metric and see whether someone performed the action you wanted them to, and then assign that action back to the experience that they saw.
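A minimal sketch of that attribution step, assuming you already have a record of which experience each visitor saw and a list of visitors who completed the goal (both data structures here are hypothetical):

```python
from collections import defaultdict

# Hypothetical records: which experience each visitor saw, and who converted.
assignments = {"v1": "control", "v2": "variant_b", "v3": "control", "v4": "variant_b"}
conversions = ["v2", "v3", "v4"]  # visitor ids that completed the goal action


def conversion_rates(assignments, conversions):
    """Credit each conversion to the experience that visitor was shown."""
    visitors = defaultdict(int)
    converted = defaultdict(int)
    for experience in assignments.values():
        visitors[experience] += 1
    for visitor_id in set(conversions):          # count each visitor at most once
        experience = assignments.get(visitor_id)
        if experience is not None:               # ignore visitors outside the test
            converted[experience] += 1
    return {exp: converted[exp] / visitors[exp] for exp in visitors}


rates = conversion_rates(assignments, conversions)
print(rates)  # {'control': 0.5, 'variant_b': 1.0}
```

The key design point is the join on visitor id: a conversion only counts toward the experience that visitor was actually assigned.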

What are the methods for analyzing split testing results?

There are two main statistical analysis methods. The first is called a Student’s T-Test, and the second is known as a multi-armed bandit.

The big benefit of using the multi-armed bandit approach is that you have a lot more flexibility in looking at the results, while a classic Student’s T-Test is structured in its analysis.

With a Student’s T-Test, you need to figure out ahead of time how much traffic you need for the test, how long you want to run the test for, and how much statistical power you're expecting.

You can look at all that data, use a sample-size calculator, and then figure out how long you need to run the test. And then you should only look at the results at the end of that period.
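To give a sense of what a sample-size calculator does, here is a sketch that uses the standard normal-approximation formula for comparing two conversion rates. It is not the formula of any specific tool, and the baseline rate, expected rate, and daily traffic figures are made-up inputs.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed per experience for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # critical value for the desired power
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def days_to_run(visitors_per_group, daily_visitors, groups=2):
    """Days needed if daily traffic is split evenly across the groups."""
    return math.ceil(visitors_per_group * groups / daily_visitors)


# Assumed inputs: 10% baseline conversion, hoping to detect a lift to 12%.
n = sample_size_per_group(baseline_rate=0.10, expected_rate=0.12)
print(n, days_to_run(n, daily_visitors=1000))
```

Smaller expected lifts push the required sample size up sharply, because the effect size is squared in the denominator.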

Only checking the results at the end can be pretty challenging to do in business, because you have to teach everyone in your company enough about the statistical analysis to stay true to that requirement.

Often what happens is people will want to see how it's going one hour in, two hours in, two days in. And after one or two hours, it can look like your test is really doing terribly and you're losing money.

But there's a lot of variation at the beginning and you have to wait for the whole sample size to be reached at the end before you look at the results. So we see people cheat and not wait the whole time, which invalidates the test.

With the multi-armed bandit approach, you can look at the results whenever you want. You'll always see an accurate result and you can look at the results early without impacting the results of the test.

Another difference between the two tests is that with Student’s T-Tests you can understand a lot about each of the variations that you tested. So you can see which one was the winner, and you can also see which one was the worst.

With a multi-armed bandit, you can see which one's the winner, but you can't compare the other ones. It's difficult to tell which one was the second best, or third best, or fourth best.

If you just want to know the winner, a multi-armed bandit is great. If you want to take a more scientific approach and understand each of your options, then the Student's T-Test is a better statistical method to use.
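The interview does not say which bandit algorithm is meant. As one common example, here is a small simulation of Thompson sampling, with made-up conversion rates that the algorithm does not get to see. It is a sketch under those assumptions, not a production implementation.

```python
import random


def thompson_sampling(true_rates, visitors, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit over several experiences."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))   # show the arm with the highest draw
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


# Hypothetical true conversion rates, unknown to the algorithm.
successes, failures = thompson_sampling([0.05, 0.06, 0.15], visitors=5000)
shown = [s + f for s, f in zip(successes, failures)]
print(shown)  # traffic per experience; the strongest arm should receive most of it
```

Because the bandit steers traffic toward the leader, the weaker experiences collect very little data, which is why it is hard to rank the second, third, or fourth best.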

How can companies optimize split testing?

In terms of content, the bigger the change you make, the better. If you're changing something from teal blue to royal blue, you're not going to get much of a result. You should also try to affect an action as close to the content change as possible.

What I mean by that is if you change an image on the homepage hoping that it affects someone purchasing something 10 pages down the funnel, it’s pretty hard to track that back.

You introduce a lot of noise into the analysis. It’s better to test a change on a page where they actually have the option to convert.

What problems do companies run into when running split tests?

The biggest problem people run into is an executive wanting to look at the results before the test is over. They look at the results two hours later and think the new experience is doing terribly.

But any results, even normal web traffic results, fluctuate a lot when you look at them over a very short period of time. So looking at the results before a test ends can trigger people to turn things off early and make bad decisions.

Few people in marketing ever learned statistics in an academic setting, so it's easy to fall into traps because you don't have that background to understand the importance of sticking with it.

A challenge of classic A/B testing is that you are finding out the best thing for the average person on the site. But a company may not have an “average” visitor.

For example, a shoe store may have a group of high-end customers who spend $300 on shoes, and a group of customers who only spend a few dollars on socks. The average order value of the store might be about $100, but no one is actually spending $100.

So, your results might show the average best for an average group of people that don't really exist. There might be multiple peaks and valleys in your data.

In that case, you want an analysis that shows multiple winners, one for the experience that resonated with sock buyers and one for shoe buyers. Then you can continue to show different versions of the site to visitors depending on their purchase habits.
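The shoe store example can be worked through with a few made-up orders. The segment labels and dollar amounts below are illustrative assumptions, picked so that the overall average lands near $100:

```python
from statistics import mean

# Hypothetical orders: (customer segment, order value in dollars).
orders = [
    ("shoes", 300), ("shoes", 300),
    ("socks", 5), ("socks", 5), ("socks", 5), ("socks", 5),
]

# One number for the whole store hides the two very different groups.
overall_average = mean(value for _, value in orders)

# Group order values by segment, then average within each segment.
by_segment = {}
for segment, value in orders:
    by_segment.setdefault(segment, []).append(value)
segment_averages = {segment: mean(values) for segment, values in by_segment.items()}

print(overall_average)   # about 103.33, a value no customer actually spent
print(segment_averages)  # {'shoes': 300, 'socks': 5}
```

Breaking results out per segment like this is what lets an analysis report a separate winner for shoe buyers and sock buyers.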

Is split testing cost effective?

There are so many variables that affect the return on investment (ROI) of a split test, but it is cheaper than completely making a change and trying to find out why you aren’t doing as well as you thought you would.

At that point, you won’t have the data to identify the problem. Also, without split testing, you may not have the confidence to make a potentially risky change that would result in earning you more money.

How often should you run split tests?

Ideally, you test everything automatically as it goes out. You test every change you ever make. But, in reality, that can be just impossible to manage. It takes time to build each of the different versions and to keep track of it all. It can also be expensive to do.

Does split testing affect SEO?

Google has never given a definitive answer, but the general idea is that if the A/B testing is done to make the customer experience better and more relevant, then it's fine.

If A/B testing is used to serve one version to search engines and a different version to customers, in a way that intentionally hides your true intention from the customer, then that’s bad.

Companies can run into search engine optimization (SEO) issues if they aren’t updating their raw site. You could have an out-of-date website that the customers never see, but search engine spiders can see the default content.

The website being shown to the search engine won’t be the same as the site being shown to customers, and the company can get in trouble for hiding information, even if it’s not intentional.

How will split testing evolve in the future?

The biggest trend we're seeing right now is a move to multichannel experiences. Split testing used to only occur on websites because most companies only had a website.

But now you have mobile apps, mobile web, desktop web, and newer channels like voice assistants. Having a more centralized optimization capability that can interact with each of those different clients is a tremendous change.

And as your visitors interact with your company or your brand across each of those different devices, you want to make sure that you're telling a consistent story across each of them.

The other piece is a continued move toward automation. Doing split testing manually is hard and takes too much time. You have to come up with all the segments. So, companies are moving to machine learning to do it for them.

But the main goal is to determine the best thing to show to a person in a specific moment. And that usually requires a combination of more traditional split testing techniques and machine learning outcomes at the same time.
