Leveraging Matched Control Groups to Accurately Measure Business Experiments

It can be challenging for managers to measure the financial effects of changes they’ve made to their enterprise. Interventions such as marketing campaigns, human and physical capital expenditures, changes to business processes, new product introductions, and outlet openings or closures all require rigorous assessment of their effects on the bottom line to justify the investment.

[Diagram: matched control groups]

Learning from successes and failures both drives future business decisions and lets a manager frame a new idea as a hypothesis to be validated or refuted. The gold standard in experimental testing is a randomized block design, in which subjects (customers, employees, stores, branches, markets) are randomly selected for the experiment and treatment is randomly assigned among them. Often, subjects are sub-grouped, or blocked, by common features in order to measure the effects of treatment within those specific attributes.

Grouping by features of interest ensures that sufficient observations are available to detect significant effects. Within this framework, the treated and non-treated controls are statistically indistinguishable on observable business metrics as well as unobserved features. Because treatment is randomized, any difference in outcomes between test and control can be attributed to the treatment itself.
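As a concrete sketch, blocked random assignment can be implemented in a few lines. The `region` blocking column and the DataFrame layout here are hypothetical, for illustration only:

```python
# Minimal sketch of randomized block assignment (hypothetical data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def assign_blocked_treatment(subjects: pd.DataFrame, block_col: str) -> pd.DataFrame:
    """Randomly assign half of each block to treatment."""
    out = subjects.copy()
    out["treated"] = False
    for _, block in out.groupby(block_col):
        # Within each block, randomly select half the subjects for treatment,
        # so test and control are comparable within every block.
        chosen = rng.choice(block.index, size=len(block) // 2, replace=False)
        out.loc[chosen, "treated"] = True
    return out

# Usage (hypothetical): stores blocked by region.
stores = pd.DataFrame({"store_id": range(8),
                       "region": ["N", "N", "N", "N", "S", "S", "S", "S"]})
print(assign_blocked_treatment(stores, "region"))
```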

We were approached by a pricing software company to do a deep dive into their current processes for selecting test and control stores; evaluate the statistical methodology used to calculate lift; and assess client deliverables (charts, reports, and claims about lift and significance). We were able to independently validate their measured outcomes, provide more granular insights, and suggest a more robust, standardized, and efficient process.

[Diagram: case study]

Randomized block experiments are commonplace in direct marketing, where managers have access to many observed customer features and full control over who receives marketing and who does not. Managers also control when and how they market, so they can observe all responses and non-responses. The effectiveness of the marketing is then measured from any difference in response between the treated and control groups. Randomizing the subjects and the treatment, however, is not always cost effective or even possible. Capital improvements, for example, are usually too costly to justify a purely random experiment, and experiments in human resource management could be unethical or even illegal. Furthermore, managers may have limited or no control over who is subject to treatment, such as online visitors or any scenario where customers can opt in to a program, such as a loyalty card.

Simply measuring post-intervention differences between treated and non-treated subjects without an experimental design will likely lead to false conclusions due to selection bias. For instance, only underperforming stores may be selected for new training, a new product launch may be limited to stores with available shelf space, or only qualifying customers may be approved for a store-branded credit card. If the treated and untreated subjects differ prior to the intervention, their responses to the intervention are likely to differ as well. Measuring differences between pre- and post-intervention exclusively for treated subjects requires controlling for seasonality and other time-varying trends, which imposes additional modeling requirements.

One solution for accurately measuring treatment effects without a randomized test is matching treated subjects to untreated subjects on similar pre-intervention features. Matching on observable attributes reduces or eliminates the pre-intervention differences between the treated and untreated, and the matched subset of untreated subjects then acts as the control group. The assumption is that, absent any intervention, the treated and their matched untreated counterparts would continue to look similar on these features post-intervention, so any observed difference in actual outcomes can be attributed to the treatment. In many cases, the manager can then measure the effects of the intervention with nearly the same precision as a randomized experiment.
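As a toy illustration of that attribution step, assume hypothetical post-period outcomes for four treated subjects and their matched controls (all numbers are made up):

```python
# Toy sketch: attributing post-intervention differences to treatment
# (hypothetical numbers).
import numpy as np

treated_outcome = np.array([105.0, 98.0, 112.0, 101.0])          # matched pairs'
matched_control_outcome = np.array([100.0, 97.0, 104.0, 99.0])   # post-period metric

# Under the matching assumption, each control shows what its treated
# counterpart would have done absent the intervention, so the mean
# pairwise difference estimates the treatment effect (lift).
lift = (treated_outcome - matched_control_outcome).mean()
print(f"Estimated lift: {lift:.2f}")  # 4.00
```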

Various methods are available to match treated to untreated subjects and to balance the pre-intervention features. Propensity score, exact, coarsened exact, nearest neighbor, and optimal matching are among the commonly used algorithms, each with its own tradeoffs in data requirements and precision. The analyst must also choose which features to use as the basis for matching. Propensity score matching is the most commonly used and the simplest to implement. However, because it relies on a generalized linear model, it is sensitive to the usual model assumptions and is not typically subjected to a thorough model-building process. It is also possible that pre-intervention features remain statistically unbalanced after matching. Combining propensity scores with additional constraints can improve results but requires additional technical sophistication on the part of the analyst.
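To make the approach concrete, below is a minimal propensity-score matching sketch using scikit-learn. It is a generic illustration, not the methodology of any client described here; the inputs `X` (pre-intervention features) and `treated` (treatment indicator) are hypothetical, and the greedy one-nearest-neighbor pairing with replacement is chosen for brevity over the optimal or constrained variants mentioned above.

```python
# Minimal propensity score matching sketch (hypothetical inputs).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def match_on_propensity(X: np.ndarray, treated: np.ndarray) -> np.ndarray:
    """For each treated subject, return the index of its matched untreated control."""
    # 1. Estimate propensity scores P(treated | features) with logistic regression.
    scores = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

    # 2. Greedily pair each treated subject with the untreated subject
    #    whose propensity score is closest (1-NN, with replacement).
    control_idx = np.flatnonzero(~treated)
    nn = NearestNeighbors(n_neighbors=1).fit(scores[control_idx].reshape(-1, 1))
    _, matches = nn.kneighbors(scores[treated].reshape(-1, 1))
    return control_idx[matches.ravel()]
```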

The features selected for matching should be correlated with the outcome metrics used to assess the effectiveness of the intervention, and should also include features related to how subjects were selected for treatment. For example, if a marketing campaign to acquire new customers is launched, prior new-customer acquisition rates would be a reasonable baseline on which to match stores. If that same campaign was directed toward a certain demographic, each store’s known demographic footprint could be an additional matching attribute.
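Whichever features are chosen, it is worth confirming after matching that they are in fact balanced. One common diagnostic is the standardized mean difference; the sketch below is generic, and the acquisition-rate example is hypothetical.

```python
# Balance diagnostic: standardized mean difference (SMD) of one matching
# feature between treated and matched-control groups. Absolute values
# below roughly 0.1 are conventionally considered well balanced.
import numpy as np

def standardized_mean_difference(x_treated: np.ndarray, x_control: np.ndarray) -> float:
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Hypothetical check on prior new-customer acquisition rates:
# smd = standardized_mean_difference(acq_rate[treated_idx], acq_rate[matched_idx])
```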

[Diagram: case study]

A Fulcrum client was looking to understand the impact of local mass media on new customer acquisition. We built matched control groups at the store level and aggregated up to measure the campaign’s local and total performance. The results could also be broken out by other important store dimensions such as store format, population density, and proximity to competitors.

Ultimately, validation of the manager’s hypothesis rests on finding a valid control group for the experiment using a complex set of criteria. In many ways, each set of matched controls produced by a given matching algorithm or set of criteria is comparable to a sample in the statistical sense: it yields a single point estimate of the true effect of the intervention. Just as with repeated sampling, repeated matching with different criteria or algorithms generates additional point estimates, giving a manager a range of values that bracket the true result of the intervention. The intervention itself is a single event, and repeated interventions with like criteria provide additional data points for the measured effects. Structuring interventions and their measurement to allow for meta-analysis leads to a more robust estimate of the true intervention effects.
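One way to operationalize this view is to rerun the matching under several configurations and report the spread of the resulting lift estimates rather than a single number. In the sketch below, `build_matches` is a hypothetical helper standing in for any of the algorithms discussed above.

```python
# Sketch: each matching configuration yields one point estimate of lift.
import numpy as np

def point_estimate(post_outcome: np.ndarray, treated_idx, control_idx) -> float:
    """Lift from one matched control group: treated minus control mean outcome."""
    return post_outcome[treated_idx].mean() - post_outcome[control_idx].mean()

# Hypothetical usage, repeating matching under different algorithms/criteria:
# estimates = [
#     point_estimate(post_outcome, *build_matches(features, treated, method=m))
#     for m in ("propensity", "nearest_neighbor", "coarsened_exact")
# ]
# print(min(estimates), float(np.median(estimates)), max(estimates))
```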

Another often overlooked aspect is that no manager’s intervention happens in isolation. Other interventions are likely to have occurred earlier, affecting the features on which treated and untreated subjects are matched, or concurrently, biasing the measurement of the intervention. Capturing and cataloging all interventions enterprise-wide on customers, employees, and outlets is a continuing challenge, but it is key to proper assessment of any intervention and experiment.

However, any method of matching test subjects to controls remains a second-best solution compared to a randomized test design. Although balance between test and control may be achieved on observed features, unobserved features, such as attitudes toward a brand, may remain unbalanced and confound the measured treatment effects. Balancing features may not even be possible if, for example, all subjects with a given feature are treated and none remain to match against. Knowing these limits, and being able to measure the effectiveness of the best possible non-randomized test, is itself of value.

While not every intervention directed by a manager needs to be treated as an experiment and rigorously measured with a randomized design or matched controls, those whose outcomes will drive future decisions do require a sophisticated analytical methodology. Enterprises committed to data-driven business decision-making are the ones investing in the resources necessary to capture interventions and accurately measure their effects.

To learn more about our software platform that automates the analysis of experimental interventions, check out Fulcrum Impact or contact us.