Whether you have been testing for years or you are just getting started, building a successful website optimization program depends on careful planning, implementation, and measurement.
This is the second in a three-part series of articles that look at the steps involved in creating a successful optimization program. Part 1 discussed how to plan your testing and implementation program. In this article, we'll outline four key steps you'll need to take to implement your optimization program:
- Clearly define success and failure.
- Ensure good test design.
- Clarify your testing timeline.
- Test different audience segments.
Clearly define success and failure
A common disappointment among companies deploying testing and optimization technology stems from tests that fail to produce the gains expected. Seemingly without rhyme or reason, even the most dramatic design changes yield "no significant difference" in simple measures such as click-through rate, and even less in more involved downstream metrics such as conversion rate.
Though that is the reality of testing, much of the disappointment stems from a lack of attention to how "success" and "failure" are defined as the designs or changes are implemented.
Success in testing can be measured many different ways:
- For some, "success" is a dramatic increase in a revenue-based metric, knowing that most senior stakeholders will respond to incremental revenue.
- For others, "success" is a small increase in key visitor engagement metrics, knowing that a series of small gains eventually adds up.
- For still others, "success" is a reduction in the number of problems present throughout the site, knowing that reducing barriers improves usability.
- For some, especially those with an increasingly dated site, "success" is simply being able to deploy a new look without a negative impact on key performance indicators.
A lack of success in testing is often viewed as a failure on someone's part, but that is rarely the case. In reality, testing powers a continual learning process about your visitors and customers. If a particular image fails to increase conversion rates, you have learned that your audience does not respond to that particular image. If subsequent testing reveals that variations of the same image yield similar results, then you learn something about your audience's reaction to the image's content. In that context, there is no such thing as "failure" in testing—only a failure to achieve the specific defined objective.
Keep in mind that not every test can yield incremental millions in revenue for your business. Some tests will fail to produce the desired change; others will yield results, but not across the key performance indicators; and still others will simply fail to produce statistically significant differences.
But there are no "failures" in testing—other than a failure to carefully design your tests and a failure to carefully consider what you've learned.
Ensure good test design
Success with testing depends heavily on the quality of your test design. One of the reasons we recommend requiring a formal test plan is so that the testing team has as much information as possible to determine how the test should be run.
Especially when you start to test aggressively, good test design helps ensure that any effects from participation in multiple tests can be taken into account, either by identifying and isolating them or by removing them from the result set outright.
Accordingly, it makes sense to consult with someone experienced in experimental design in the online world, whether from your vendor or a third party.
Good test design consists of several elements, and paying attention to them is important. For example, you should...
- Know whether you need an A/B or multivariate test.
- Pick the test array that works best for your needs, either a full or a fractional factorial array (see the sketch after this list).
- Make sure you are running the test long enough based on traffic and conversions to get a statistically valid sample size.
- Make sure you are properly testing variations of factors. Improper factoring is caused by poor (or no) isolation of individual changes—for example, changing a headline's text, font, color, and size all at the same time.
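To make the full versus fractional factorial distinction concrete, here is a minimal Python sketch; the factors and levels are hypothetical, and a real multivariate testing platform would generate its own array.

```python
from itertools import product

# Hypothetical factors for a landing-page test; each factor has two levels.
factors = {
    "headline": ["control", "benefit-led"],
    "hero_image": ["control", "lifestyle"],
    "cta_color": ["control", "orange"],
}

# Full factorial: every combination of every level (2 x 2 x 2 = 8 recipes).
full_factorial = [dict(zip(factors, combo)) for combo in product(*factors.values())]

# One common half-fraction: keep only recipes with an even number of
# non-control levels, which yields a balanced subset of 4 recipes.
fractional = [
    recipe for recipe in full_factorial
    if sum(level != "control" for level in recipe.values()) % 2 == 0
]

print(f"Full factorial recipes: {len(full_factorial)}")  # 8
print(f"Half-fraction recipes: {len(fractional)}")       # 4
```

The full array measures every interaction but needs more traffic; the fraction trades some interaction detail for a test that fills up faster.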
Another mistake new testers often make is to run tests against anyone and everyone; however, a good test design means you are targeting your tests to a relevant audience, and then performing additional segmentation on the results.
Clarify your testing timeline
One of the most unfortunate mistakes that companies make when getting started with testing is to run tests only until statistical significance is reached. A great deal has been written about test design and full factorial versus fractional factorial versus A/B testing. Those are important considerations, but none is nearly as important as having a test sample that takes day-part and day-of-week variation into account.
Consider that even on the highest-volume sites, there are typical peaks and valleys in traffic caused by target audience geography, marketing efforts, and the particular interaction model promoted by the site. Within each of those peaks and valleys, your site is attracting a particular type of visitor: late-night visitors, early risers, lunch-timers, visitors in other time zones, and so on.
Assuming you're not trying to target a specific audience segment, a truly random sample of visitors will account for this variation and sample across these visitor variants. To reduce test bias as much as possible, a general rule-of-thumb for test planning is the "7+1" testing model.
In this model, you will test over an entire week (seven days) and build in a little extra time to make sure that you have a clean break in the data for analysis. Thus, "7+1" means running your test for a full week with an extra day on the front end. Giving the test a day before you start actively tracking results allows for slippage and any last-minute changes, and it lets the analysis team begin gathering data at midnight at the end of the "+1" day.
And by running the test over an entire week, you will account for all of the potential day-part and day-of-week variation, at least as much as is possible. If you have the luxury of time, you may want to consider extending the test to a "14+1" model, doubling the amount of time you run the test. With two weeks, you will be better able to account for additional variation in the data arising from tactical marketing efforts, a sudden increase in referrals from social media, holidays, current events, and so on.
One of the advantages of the "7+1" model is that you can adjust your sample size to gather only as much data as you need; you'll just gather that data more slowly. Rather than taking a 20% sample over four days to reach statistical significance, the "7+1" model may guide you to take a 5% sample over seven days. The smaller sample lessens the risk associated with testing: if the tests fare poorly, fewer visitors will have been exposed to them, and you are still able to reach statistical significance in a relatively short period of time. Furthermore, it leaves non-test traffic eligible for assignment to other tests that you may be running concurrently.
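As a rough illustration of that trade-off, the sketch below computes the sampling rate needed to fill a test over four days versus seven; the traffic volume and required sample size are hypothetical placeholders.

```python
# Hypothetical figures; substitute your own traffic and sample-size target.
daily_visitors = 50_000   # eligible visitors per day
required_sample = 17_500  # total participants the test needs

def sampling_fraction(required_sample: int, daily_visitors: int, days: int) -> float:
    """Share of traffic to assign to the test so it fills up over `days` days."""
    return required_sample / (daily_visitors * days)

print(f"4-day test: {sampling_fraction(required_sample, daily_visitors, 4):.1%} of traffic")
print(f"7-day test: {sampling_fraction(required_sample, daily_visitors, 7):.1%} of traffic")
# 4-day test: 8.8% of traffic
# 7-day test: 5.0% of traffic
```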
The major complaint about the "7+1" model is that it takes time, so if you just open the spigot on the test, the thinking goes, you can achieve statistical significance in a matter of hours in some cases. Though that sounds good, opening the spigot on testing is exactly how not to achieve success via testing.
Unless you have a very sophisticated understanding of your audience and the sampling technique employed, the "fire hose" model will likely leave you with more questions than helpful insights. Anyone who doesn't like the results can simply argue that your sample does not represent the diversity of user types coming to the site, and so refuse to accept your analysis.
Whatever kind of test you are running (A/B or multivariate), you want to make sure you've run your test long enough to obtain a statistically valid sample size—the number of participants assigned to the test.
Your sample size will be determined by a combination of traffic volume, your baseline (control) conversion rate, and the conversion rate observed among test participants. You'll want to make sure you obtain an appropriate sample without bias in time of day, day of week, holidays and events, and so on.
For example, you might run a test with a huge sample size and obtain statistically significant results in one day, but that would reflect only how visitors behaved on that particular day. So take care that the test runs across a longer period of time (at least 7+1 or 14+1, and perhaps longer, depending on the situation) to guard against bias.
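As a rough sketch of how those inputs translate into a required sample size, the following uses the standard two-proportion approximation; the conversion rates are hypothetical, and most testing platforms provide their own calculator.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate: float, expected_rate: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variation for a two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    p_bar = (baseline_rate + expected_rate) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (baseline_rate * (1 - baseline_rate)
                             + expected_rate * (1 - expected_rate)) ** 0.5) ** 2
    return ceil(numerator / (baseline_rate - expected_rate) ** 2)

# Hypothetical example: 3% baseline conversion, hoping to detect a lift to 3.6%.
print(f"~{sample_size_per_variation(0.03, 0.036):,} visitors per variation")
```

Divide the per-variation figure into your eligible daily traffic to see whether a 7+1 or 14+1 window is realistic.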
Remember: the best testers work thoughtfully and carefully, and they are willing to spend a little extra time on process or testing to make sure they deliver accurate, reliable, and believable results to socialize through the rest of the organization.
Test different audience segments
Advanced testers are testing against key visitor and customer segments. The logic is clear: why optimize your site for everyone when you can focus your optimization efforts on those visitors who have already demonstrated value to your business?
You have two ways to conduct a segmented test: ad hoc and post hoc.
The former method requires that you be able to identify segment members in real time so that the testing engine can assign people appropriately. For example, you may be targeting "first-time visitors" or "visitors referred from Google organic search results," which, depending on the testing platform you use, can be easily done.
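As a minimal sketch of what that real-time check and assignment might look like (the cookie name, referrer test, and hashing scheme here are all hypothetical; in practice this logic lives inside your testing platform):

```python
from hashlib import sha256

def in_target_segment(cookies: dict, referrer: str) -> bool:
    """Hypothetical rule: first-time visitors arriving from Google organic search."""
    is_first_visit = "returning_visitor" not in cookies
    from_google_organic = referrer.startswith("https://www.google.com/search")
    return is_first_visit and from_google_organic

def assign_variation(visitor_id: str, variations: list) -> str:
    """Deterministically map a visitor ID to one of the test variations."""
    bucket = int(sha256(visitor_id.encode()).hexdigest(), 16) % len(variations)
    return variations[bucket]

# Only visitors in the target segment are assigned to the test at all.
if in_target_segment({"session": "abc"}, "https://www.google.com/search?q=widgets"):
    print(assign_variation("visitor-123", ["control", "variation_a", "variation_b"]))
```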
The latter method for segmenting is post hoc—after the fact—which is more an analysis technique than a testing strategy. In this case, you will mine test results for segment members and compare those results across control and test groups. This strategy also involves some work between testing and analytics vendors but is often more forgiving, especially if your testing vendor supports full data export and is able to provide the analytics vendor's ID.
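As a minimal sketch of the post hoc approach, assuming the test results have been exported and joined to analytics data on a shared visitor ID (the column names and figures are hypothetical), the analysis might look like this:

```python
import pandas as pd

# Hypothetical export: one row per test participant, already joined to analytics data.
results = pd.DataFrame({
    "visitor_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "group": ["control", "test", "control", "test",
              "control", "test", "control", "test"],
    "segment": ["new", "new", "returning", "returning",
                "new", "new", "returning", "returning"],
    "converted": [0, 1, 0, 1, 1, 1, 0, 0],
})

# Conversion rate by segment, compared across control and test groups.
by_segment = (
    results.groupby(["segment", "group"])["converted"]
           .agg(participants="count", conversion_rate="mean")
           .reset_index()
)
print(by_segment)
```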
Regardless of how you produce the data, focus on your key segments when communicating your test results. If you have the data and the time, it is definitely better to be able to tell management, "The test produced an X% lift in click-through rate across all visitors and a Y% increase in click-through rate across our most valuable customer segment."
That message should resonate loud and clear, especially if your measurement team has done a good job at applying visitor segmentation.
(Those seeking more information on testing and optimization can download the whitepaper titled "Successful Web Site Testing Practices: Ten Best Practices for Building a World-Class Testing and Optimization Program.")
Articles in the series: