Customer Analytics
Christopher G. Healey

Introduction

This module will discuss customer analytics. Customer analytics examines a company's customer data and behaviours to identify, attract, and retain the most profitable types of customers. Since customers have access to significantly more information about products and companies in the digital age, organizations must use updated strategies to attract and retain these customers. This is the goal of customer analytics: to create a unified, accurate view of the company's customer base, and to decide which strategies can best retain and grow this base. More detailed questions fall under this basic approach, for example: who are a company's highest-value customers, and should those customers be prioritized over other business needs?

Customer analytics is normally an interdisciplinary problem that involves marketing, sales, IT, customer service, and business analytics. The skills and knowledge of each group are shared to identify the business metrics to capture, the analysis to perform, and the overall goals to address. Customer analytics begins with the capture of raw data and ends with business decisions; one definition divides this process into three stages: collecting customer data, analyzing that data, and acting on the resulting insights.

The types of business decisions or goals associated with customer analytics are numerous and varied, but they normally relate to the overall purpose of identifying, attracting, and retaining profitable customers.

Customer Analytics Tools

Not surprisingly, numerous commercial tools exist to help with customer analytics. Many focus on customer relationship management (CRM), which encompasses many of the ideas discussed above. CRM tools can collect data, aggregate data from different sources, support raw data organization and analysis, and visualize results to better highlight any relevant insights that are found. The tools can also integrate with sales and marketing applications, web content management systems, email, social media sites, customer loyalty programs, and other tools designed to help attract and retain customers. Some common CRM tools from well-known vendors include Salesforce, Oracle NetSuite, Zoho, HubSpot, Pipedrive, Insightly, and Google Analytics 360, among others.

CRM tools are sometimes divided into three categories: collaborative, operational, and analytical. A collaborative CRM is designed to remove silos between different teams to ensure they are all sharing a common set of data. An operational CRM streamlines a company's view of the customer's journey through its web site, even when the journey is represented by many highly detailed touchpoints. This is usually done by automating repetitive tasks to free employees to focus on more subjective or creative issues. An analytical CRM is optimized to analyze massive amounts of customer data and return actionable insights.

For this module, we will complete a basic introduction to Google Analytics. Google Analytics is free, is easy to set up within a web site, provides various ways to filter data, offers high-quality dashboards, and has some basic analytics built into its system. For more sophisticated organization and analytics, data from Google Analytics can be exported as CSV for use by external programs.

Google Analytics

Google Analytics is a digital analytics platform provided by Google. One of its main advantages is that Google provides the service for free: because Google Analytics tends to drive business to Google Ads, Google benefits indirectly from users of the Analytics platform. A basic description of Google Analytics is that it is a tool within Google's Marketing Platform that collects data on users visiting your web site, then allows you to compile that data into reports to develop business strategies and improve your web site's performance. By installing web site-specific tracking code, you can see who visits your site and what they do there, as well as collect a wide variety of demographic data on your visitors, such as their location, language, browser, and device type.

Digital Analytics Funnel

The Google Analytics documentation discusses the digital analytics funnel: the idea that individuals explore a web site or purchase items in stages. Marketing uses the concept of a funnel to enumerate these stages.

  1. Acquisition. Building awareness and acquiring user interest.
  2. Behaviour. User engagement with your web site or business.
  3. Conversion. A user becoming a customer through a transaction with your business.

These stages may be different for web sites that are not designed to sell products; however, the basic ideas are often analogous. For example, conversion on a non-product web site may mean that the visitor returns to the site repeatedly because they find it useful.

Although we will discuss Google Analytics in the context of web sites and web traffic, it can also be used to collect data from mobile applications, point-of-sale systems, video game consoles, CRMs, and other Internet-connected platforms.

UA versus GA4

Google Analytics was originally built using UA (Universal Analytics). This allowed you to create an account, then obtain a snippet of Javascript code that you added to each web site you wanted to track. The Javascript contained a unique ID that sent information back to your Analytics account. The Google Analytics web site allowed that information to be viewed and filtered in a variety of ways.

More recently, Google decided to move to a new platform called GA4 (Google Analytics 4). This is designed to do what UA did, but in a very different way. Originally, UA used a hierarchy of account → property → view, where an account represented one or more web sites, a property was a section of a web site (e.g., a part of the site or a subset of its visitors), and a view was a combination of a filter and a visualization of some or all of the data associated with a property. Properties were themselves given unique sub-IDs to allow you to treat them as independent data sources.

In GA4 the hierarchy is now account → property → data stream. The basic ideas of account and property are similar to UA, but view has been removed. In its place, the data stream has been added, which represents a source of raw data to be fed into a property. As before, that data can be filtered and visualized in a number of different ways. However, it is non-trivial to save a particular "view" into a property's data. This has caused a number of issues for individuals and businesses who were using the old UA system, since views are clearly valuable and difficult to replace. On the other hand, GA4 makes it much easier to separate events on a web site. Previously, UA bundled everything under a single tag ID. GA4 is also capable of capturing a much wider set of events than UA was.

All of our discussion will revolve around GA4, since UA is being deprecated in July 2024. If you search online for information on using Google Analytics, be careful to make sure you're looking at instructions for GA4 and not UA. Google has not removed any information about UA, and since this is the system with the longest history, searches that do not explicitly specify GA4 (and even some that do) will point you to information relevant to UA and not GA4.

Google Analytics versus Google Tag Manager

Google Tag Manager is now the recommended method for collecting data to be analyzed in Google Analytics. Although this is often not clear, Google Analytics and Google Tag Manager are closely integrated but entirely separate online systems with two different purposes.

Google Analytics is an online software suite used for analytics, conversion tracking, and reporting. It provides a wide range of visualization, filtering, and reporting tools to build dashboards that provide insights into visitors to a web site, what they see on the site, and how they interact on the site.

Google Tag Manager, on the other hand, is a tag management system that can detect and store events within a web site, for example, page views, clicks, scrolling, entering or exiting the site, and so on. These events can be sent to a separate analytics package for further analysis and presentation. An obvious candidate for this task is Google Analytics.

Basic Account Setup

To use Google Analytics, you first create a Google Analytics account tied to your web site. This includes creating an initial property attached to your account and a data stream to provide data to that property. At that point, you have the option of adding Javascript code to each page you want to track, or using the Google Tag Manager to automatically track different types of events (e.g., page views, clicks, etc.). Currently, Google is recommending the Google Tag Manager, since any changes you make in the Tag Manager web site should automatically apply to all pages you are managing, without the need for additional edits to the Google Javascript that is involved in manual tagging.

Complete the following steps to create your initial Google Analytics account.

  1. Navigate to https://analytics.google.com.
  2. Choose the Google account you want to use to sign up with.
  3. Click "Start measuring."
  4. Choose an "Account name" for your Analytics account. You can accept the default Account Data Sharing Settings.
  5. Choose a name for your initial, default property in "Property name (Required)." Select the proper timezone for your web site.
  6. Choose the "Industry Category (Required)" from the drop-down menu and "Business size (Required)" of the organization your web site belongs to.
  7. Choose the purpose of your Analytics site. The default is "Get baseline reports" if none of the other options apply.
  8. Once you click "Create" and accept the Terms of Service you will be asked to "Choose a Platform." For our purposes, we are only looking at web-based analytics, so choose "Web."
  9. Enter the domain of the website you want to analyze, for example, healey.wordpress.ncsu.edu. Choose a Stream name for the data stream attached to this property, and ensure "Enhanced measurement" is turned on.
  10. At this point a set of Installation instructions will appear. If you are using a web hosting service that Google Analytics recognizes (you can see the list of services by clicking "Select your platform" under "Install with a website builder or CMS") choose the proper service for instructions on how to register your Google Analytics site ID. Otherwise, choose "Install manually" and the Javascript needed for every web page you want to track will be shown. You can copy and paste this code immediately after the <head> tag on the web pages you plan to analyze.
  11. Close the Installation instructions panel and you will see a Stream details panel summarizing your Analytics options. You can return to this panel later if you need to update any information or (specifically for Google Tag Manager) if you need your Measurement ID.
  12. Press Esc to exit the Stream details panel, then click "Next." You will be told "Data collection is pending." Click "Continue to Home" to jump to the Analytics homepage. Choose any email communications you want, click "Save", and your analytics homepage will be shown. At this point the page is empty since no data has been collected yet.

Google provides additional instructions on how to set up an account and create an initial property and stream. After some time has passed, you can log in to Google Analytics and use the site's interface to explore data about users who have visited your web site.

Google Analytics groups activity into a session that begins when a user loads a page with tracking code, and ends after 30 minutes of inactivity. Data is uploaded to Google and made available through your analytics account. By default, Google Analytics will aggregate and present information based on a predefined set of criteria like geographic location, origin type, and page, but these can be modified as desired using filters and other interactive controls.

Properties & Streams

A Google Analytics account is made up of one or more properties. Within each property one or more data streams can be created. Properties are meant to collect data independent of one another by using a unique tracking ID in the tracking code. For example, a business may want to use different properties to collect data from different web sites, from a web site versus a mobile application, or from different geographic regions. Data streams are sources of data being sent to a property. Data streams come in three types: web, iOS app, or Android app. The intent is to allow you to aggregate users across different data sources within a single property.

Under UA, when a property was created, Google Analytics automatically created a default view called "All Web Site Data," containing all raw data collected for the property. As noted above, GA4 properties no longer include views.

Google Tag Manager Setup

Complete the following steps to create a Google Tag Manager account and container for your web site.

  1. Navigate to https://tagmanager.google.com.
  2. Choose the Google account you want to use to sign up with.
  3. Click the "Create Account" button.
  4. Enter an Account Name and choose a Country. Enter a Container name. The container controls the Javascript code you will use to invoke Google Tag Manager. Typically there is one container per web site. Choose Web as the Target platform and click the "Create" button.
  5. Click the "I also accept the Data Processing Terms as required by GDPR" checkbox, then click the "Yes" button to agree to the Terms of Service.
  6. Two Javascript code snippets are shown. Note the instructions: the first snippet should be placed on your web page right after the <head> tag. The second should be placed right after the <body> tag.
  7. You can test your website by typing in its URL and clicking "Test." This should place a checkmark beside the URL if Google Tag Manager sees valid Tag Manager code on the given web page.

Next, you will want to add one or more tags and associated triggers to collect data from your web site and send it to Google Analytics.

  1. Click "Add a new tag" to create a new tag within the Google Tag Manager you just created (this should be shown in the dropdown at the top of the page.)
  2. Replace Untitled Tag at the top of the panel with a more descriptive name of the type of data you plan to capture, for example, Page Views.
  3. Click on the "Tag Configuration" region, then on "Google Analytics", then on "Google Tag." You will now need to enter your Tag ID from the Google Analytics account you previously created. To find this, sign in to your Google Analytics account, click the Admin gear at the bottom-left of the page, click "Data collection and modification" under the "Property settings", click "Data streams", then choose the data stream you previously created. Your Google Tag should be the "MEASUREMENT ID" at the top of the page. Click the icon to the right to copy it, then paste it into the "Google Tag" field in Google Tag Manager.
  4. Click on the "Triggering" region and choose "All Pages" to trigger this tag on every page view. Click the "Save" button to save your new tag.
  5. Click the "Preview" button to ensure your tag is working. Enter your web site URL in the "Your website's URL" field and click the "Connect" button. Your website should appear in a new tab or window with a Tag Assistant pop-up. If you go back to the Google Tag Manager window, it should say "Connected!"
  6. Click "Continue" and you should see a "Page Views" tag under the "Tags Fired" field. If you refresh the window or tab containing your web page, the "Tags Fired" field should increase by 1.
  7. Close the debug version of your website and the Tag Assistant tab. Click "Submit" and choose a Version Name and Description (if desired) and click the "Publish" button to finalize registration of your new tag.
  8. If you go back to Google Analytics and choose "Realtime" in the Reports menu, you should be able to track visits to your site as they happen.

Note. I could only get the Preview to work when I used the Chrome browser, installed the Tag Assistant Companion extension, and disabled all other extensions since there seemed to be a conflict in my extension set. I suspect this only affects previewing whether the tag is working or not. Rather than running the Preview step, you could jump directly to Submit then Publish, then check with Realtime as you load your page to see if page views are appearing or not.

A–B Testing

In its most basic form, A–B testing is a way to determine whether changing a property of an environment makes it better or worse, based on a specific evaluation metric (often called a key performance indicator or KPI). In other words, given two versions of an environment, which performs better? Although the term "A–B testing" was coined in the 1990s, it is a form of basic randomized controlled experimentation, first documented in the 1920s by Ronald Fisher of Fisher's Iris dataset fame. A–B testing in its current form is often characterized by being run online, in real time, and on a much larger scale in terms of participants and experiments versus traditional randomized controlled trials (RCTs). A high-level overview of designing, conducting, and evaluating A–B testing might go something like this.

  1. Decide what you want to test, and construct two versions of the test environment: Version A and Version B.
  2. Determine how you will evaluate performance.
  3. Randomly assign two sets of users to Version A and Version B of the environment.
  4. Run the experiment, asking the users to operate in their version of the environment.
  5. Statistically evaluate the performance of the two sets of users to determine if there was a significant difference between the two environments.

For example, you might wonder whether one version of a button on a web site will encourage users to click more often than another version. Two versions of the web site with the two button candidates are constructed, and users are randomly assigned to view one of the two versions. The performance metric is the number of button clicks. Once all users have explored the web site, the number of button clicks is statistically compared to see whether one is significantly higher than the other. If it is, the button with more clicks is chosen for the final web design.

As with all experiments, randomization is critical. This ensures that users are not grouped based on some criteria that might influence their preferences, for example, for one colour of button over another.
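
To make these steps concrete, below is a minimal sketch in Python that simulates the button experiment through step 4; the user count and click probabilities are hypothetical values chosen only to generate data.

import numpy as np

rng = np.random.default_rng(42)

# Steps 1-3: randomly assign 2,000 users to Version A or Version B
users = np.arange(2000)
rng.shuffle(users)
group_a, group_b = users[:1000], users[1000:]

# Step 4: simulate the experiment with hypothetical true click rates
clicks_a = rng.random(len(group_a)) < 0.18   # Version A: 18% click rate
clicks_b = rng.random(len(group_b)) < 0.23   # Version B: 23% click rate

# Step 5 statistically compares these counts (see A-B Analysis below)
print("A clicks:", clicks_a.sum(), " B clicks:", clicks_b.sum())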

Sample Size Estimation

During design of the experiment, deciding how many users are needed to ensure statistical significance is important. Since A–B testing is a form of a randomized controlled experiment, we can use literature from either area to study this problem. For example, the medical community often conducts A–B-type tests and has many good sources and examples of how to calculate sample sizes for a desired level of improvement (i.e., how much "better" does the outcome need to be to be considered relevant?) Two types of experiments are considered: dichotomous, where the outcome of interest is one of two possibilities (yes/no, success/failure, and so on), and continuous, where the outcome of interest is the mean difference of an outcome variable between the two groups, for example, the difference in the average number of clicks between group A and group B.

Given the overall goal of determining whether changing the test environment leads to a significant change in KPI, experiments are often described in terms of the null hypothesis (\(H_0\)), that no significant change was found, or the alternative hypothesis (\(H_a\)), that a significant change did occur. The four possible outcomes are modelled using false positive (Type I error), false negative (Type II error), true positive, and true negative proportions, shown in the table below.

                  \(H_0\) true                                \(H_a\) true
Predict \(H_0\)   True Negative, probability \(1 - \alpha\)   False Negative, probability \(\beta\)
Predict \(H_a\)   False Positive, probability \(\alpha\)      True Positive, probability \(1 - \beta\)

Dichotomous. For a proportional metric, we need to define a significance level \(\alpha\), a power level \(P\), and the two proportions \(\mu_1\) and \(\mu_2\) from groups A and B that constitute the desired level of improvement \(\mu_2 - \mu_1\). Recall that \(\alpha\) is the probability of a false positive (Type I error) and \(\beta = 1 - P\) is the probability of a false negative (Type II error).

Notice that reducing the probability of committing a Type II error increases the probability of committing a Type I error and vice versa. Because of this, careful balance must be maintained between \(\alpha\) and \(\beta\).

Given this, the size of each group \(n_A = n_B\) is \[ n_A = n_B = c \cdot \frac{\mu_1 (1-\mu_1) + \mu_2 (1-\mu_2)}{(\mu_1 - \mu_2)^{2}} \] where \(c=7.9\) or \(c=10.5\) for the standard power levels of \(P=80\)% or \(P=90\)% and \(\alpha = 0.05\). \(c\) is computed from the standard normal distribution, \(c = \left( \Phi^{-1}(1 - \frac{\alpha}{2}) + \Phi^{-1}(1 - \beta) \right)^{2}\), where \(\Phi^{-1}\) is the inverse of the cumulative distribution function (CDF) \(\Phi\) of a standard normal distribution. \(\Phi\) is based on the Z-score. \[ \begin{gather} \Phi(x) = p(Z \leq x) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{x} \exp \left( -\frac{u^{2}}{2} \right) du \\ Z \sim N(\mu = 0, \sigma^{2} = 1) \end{gather} \] For example, if we want to go from 40% of participants answering Yes in Group A (control) to 70% answering Yes in Group B (test), \(n_A = n_B = 7.9 \cdot \frac{(0.4 \cdot 0.6) + (0.7 \cdot 0.3)}{0.3^{2}} \approx 40\) for an 80% power level or \(n_A = n_B \approx 53\) for a 90% power level at \(\alpha = 0.05\).
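
As a sanity check, the dichotomous formula is easy to reproduce in Python. This is a minimal sketch assuming only scipy is installed; scipy.stats.norm.ppf computes the inverse CDF \(\Phi^{-1}\).

from math import ceil
from scipy.stats import norm

def dichotomous_n(mu_1, mu_2, alpha=0.05, power=0.80):
    # c = (inverse-CDF(1 - alpha/2) + inverse-CDF(power))^2,
    # which is ~7.9 at 80% power and ~10.5 at 90% power
    c = (norm.ppf(1.0 - alpha / 2.0) + norm.ppf(power)) ** 2
    # Per-group sample size for a two-proportion test
    n = c * (mu_1 * (1.0 - mu_1) + mu_2 * (1.0 - mu_2)) / (mu_1 - mu_2) ** 2
    return ceil(n)

print(dichotomous_n(0.4, 0.7, power=0.80))   # 40
print(dichotomous_n(0.4, 0.7, power=0.90))   # 53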

Continuous. We need to define a significance level \(\alpha\), a power level \(P\), a desired response difference \(\mu_2 - \mu_1\), and a common (combined group) standard deviation \(\sigma\). Given this, the size of each group \(n_A = n_B\) is \[ n_A = n_B = \frac{2c}{\delta^{2}} + 1 \] where \[ \delta = \frac{|\mu_2 - \mu_1|}{\sigma} \] and where, as before, \(c=7.9\) for \(P=80\)% and \(c=10.5\) for \(P=90\)%. For example, if we wanted to go from 20% clicks in group A to 30% clicks in group B with a standard deviation \(\sigma=0.5\), then \(\delta = \frac{0.1}{0.5} = 0.2\) and \(n_A = n_B = \frac{15.8}{0.04} + 1 = 396\) for \(P=80\)% or \(n_A = n_B = \frac{21}{0.04} + 1 = 526\) for \(P=90\)%.
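
The same sketch extends to the continuous case; the small differences from the worked example above come from rounding \(c\) to 7.9 and 10.5.

from math import ceil
from scipy.stats import norm

def continuous_n(mu_1, mu_2, sigma, alpha=0.05, power=0.80):
    c = (norm.ppf(1.0 - alpha / 2.0) + norm.ppf(power)) ** 2
    # Standardized effect: desired difference over the common std. dev.
    delta = abs(mu_2 - mu_1) / sigma
    return ceil(2.0 * c / delta ** 2 + 1.0)

print(continuous_n(0.2, 0.3, 0.5, power=0.80))   # 394 (396 with c = 7.9)
print(continuous_n(0.2, 0.3, 0.5, power=0.90))   # 527 (526 with c = 10.5)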

Alternatively, you can use either Python or R to calculate minimum sample sizes, using Python's statsmodels library or R's built-in power functions. Both model the problem using false positive (Type I error), false negative (Type II error), true positive, and true negative proportions, based on the null hypothesis \(H_0\) (no difference) and the alternative hypothesis \(H_a\) (significant difference). Both Python's and R's tests use the \(\alpha\) significance level (normally 1%, 5%, or 10%), the false negative rate \(\beta\) (the probability of incorrectly failing to reject \(H_0\)), the power level (\(1 - \beta\) or the true positive rate, the probability of correctly rejecting \(H_0\)), and the effect size, defined by the minimum detectable lift (MDL, the minimum change needed to reject \(H_0\)).

For a dichotomous A–B test, R's power.prop.test() is used to determine the minimum \(n\) needed for significance.

# Historical data
p0 <- 0.12              # Group A probability

# Model parameters
alpha <- 0.05           # False positive probability
beta <- 0.20            # False negative probability
power <- 1 - beta       # True positive probability
mdl <- 0.02             # Minimum detectable lift
dir <- 'two.sided'      # Type of test

min_n <- power.prop.test(
  n=NULL, p1=p0, p2=(p0*(1+mdl)),
  sig.level=alpha, power=power, alternative=c(dir)
)
min_n$n

For a continuous A–B test, R's power.t.test() is used to determine the minimum \(n\) needed for significance.

# Historical data
mu <- 30                # Average lift
theta <- mu / 5         # Standard deviation of lift

# Model parameters
alpha <- 0.05           # False positive probability
beta <- 0.20            # False negative probability
power <- 1 - beta       # True positive probability
mdl <- 0.02             # Minimum detectable lift
dir <- 'two.sided'      # Type of t-test

min_n <- power.t.test(
  n=NULL, delta=(mu*mdl), sd=theta,
  sig.level=alpha, power=power,
  type=c('two.sample'), alternative=c(dir)
)
min_n$n

A–B Analysis

Once results are obtained from an A–B test, they are analyzed to search for significant differences. The null hypothesis \(H_0\) that there is no significant difference in the performance metric between the two groups A and B is \(p_B - p_A = 0\) for dichotomous (proportional) metrics and \(\bar{x}_A = \bar{x}_B\) for continuous metrics. Proportional significance can be measured in Python with statsmodels.stats.proportion.proportions_ztest() and in R with prop.test(). For continuous metrics use scipy.stats.ttest_ind() in Python or t.test() in R.
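
A minimal sketch of both tests in Python, using hypothetical click counts and session times purely for illustration:

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Dichotomous: 180 of 1,000 clicks in group A vs. 230 of 1,000 in group B
stat, p_value = proportions_ztest(count=[180, 230], nobs=[1000, 1000])
print("proportions z-test p =", round(p_value, 4))

# Continuous: compare mean session times between the two groups
rng = np.random.default_rng(17)
group_a = rng.normal(loc=5.0, scale=1.5, size=250)
group_b = rng.normal(loc=5.4, scale=1.5, size=250)
stat, p_value = ttest_ind(group_a, group_b)
print("two-sample t-test p =", round(p_value, 4))

In both cases, a p-value below the chosen significance level \(\alpha\) rejects \(H_0\) and indicates a significant difference between the groups.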

Effect Size

One final value you may want to calculate is effect size (ES). Intuitively, effect size states how strongly the independent variables affect the dependent variable. For t-test studies, Cohen's d is often used to measure effect size. Cohen's d calculates the ratio of the mean difference between groups to the pooled standard deviation, where \(d=0.2\) is considered small, \(d=0.5\) is considered medium, and \(d=0.8\) is considered large.

\[ \begin{gather} d = \frac{|\mu_1 - \mu_2|}{\sigma_p} \\ \sigma_p = \sqrt{\frac{(n_1 - 1) \sigma_1^{2} + (n_2 - 1 ) \sigma_2^{2}}{n_1 + n_2 - 2}} \end{gather} \]

It is often useful to complete the analysis by including both significance and effect size. For example, changing this property results in a significant change between groups, with a small/medium/large effect on the measured result or KPI.
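
As a minimal sketch, Cohen's d is straightforward to compute directly from the formula above; the two samples here are hypothetical session times generated only for illustration.

import numpy as np

def cohens_d(x1, x2):
    # Pooled standard deviation from the two sample variances
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * np.var(x1, ddof=1) +
                  (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    # Ratio of mean difference to pooled standard deviation
    return abs(np.mean(x1) - np.mean(x2)) / np.sqrt(pooled_var)

rng = np.random.default_rng(17)
group_a = rng.normal(loc=5.0, scale=1.5, size=250)
group_b = rng.normal(loc=5.4, scale=1.5, size=250)
print(round(cohens_d(group_a, group_b), 2))   # ~0.27, small-to-medium effect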

Multivariate Testing

Multivariate testing (MVT) is performed using many variations of a design, usually called factors, tested simultaneously. For example, you might design two possible headlines and two possible images for a website, then test them simultaneously as \(2 \times 2 = 4\) possibilities using a headline factor of size 2 and an image factor of size 2. MVT is more complicated than A–B testing, but it can be more efficient, since it allows multiple factors to be tested in parallel rather than sequentially. It also provides information about how combinations of factors perform: it may be that Image 1 works well with Headline 1 but not with Headline 2. Testing each factor independently would not reveal this insight.

If we define a factor being present as \(+1\) and a factor being absent as \(-1\), we can present MVT designs as a table of experiments or treatments and the associated factors and factor interactions being tested.

              A (Factor 1)   B (Factor 2)   AB (Interaction)
Treatment 1        +1             +1               +1
Treatment 2        +1             -1               -1
Treatment 3        -1             +1               -1
Treatment 4        -1             -1               +1

Recall that when two vectors' dot product is 0 they are orthogonal. A balanced design occurs when the factor columns are pairwise orthogonal, that is, when their dot products are 0. In this case, Factor A \(\cdot\) Factor B = \((1,1,-1,-1) \cdot (1,-1,1,-1) = 1 - 1 - 1 + 1 = 0\), producing a full factorial design or balanced design (all possible combinations of factors are tested).

The effect of any factor, for example A, is calculated as the difference in mean response between the rows where A is \(+1\) and the rows where A is \(-1\), \(\bar{x}_A = \bar{x}_{A+1} - \bar{x}_{A-1}\). The effect of the interaction between factors is calculated similarly, \(\bar{x}_{AB} = \bar{x}_{AB+1} - \bar{x}_{AB-1}\). The key advantage of a balanced design is that you can add more (two-level) factors without increasing the required sample size. An \(n\)-factor full factorial design has \(2^{n}\) treatments (rows), which support estimating \(1\) overall mean, \(n\) main effects, and \(2^{n} - n - 1\) interactions. For example, an \(n=3\)-factor design has \(2^{3}=8\) treatments, \(1\) overall mean, \(3\) main effects, \(3\) two-way interactions, and \(1\) three-way interaction.
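
A minimal sketch of the two-factor design above in Python, computing each effect directly from the \(\pm 1\) columns; the four response values are hypothetical.

import numpy as np

# Columns of the design table: A, B, and the AB interaction
A = np.array([+1, +1, -1, -1])
B = np.array([+1, -1, +1, -1])
AB = A * B

# Hypothetical mean responses for Treatments 1-4
x = np.array([20.0, 14.0, 11.0, 9.0])

# Balanced design: all pairwise dot products are 0 (orthogonal columns)
print(A @ B, A @ AB, B @ AB)   # 0 0 0

# Effect = mean response where the column is +1 minus mean where it is -1
for name, col in [("A", A), ("B", B), ("AB", AB)]:
    print("effect", name, "=", x[col == +1].mean() - x[col == -1].mean())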

As you can see, as we increase the number of factors, the number of treatments and the number of interactions both increase exponentially. If we choose not to test all interaction terms, we can instead focus on designs that include only a subset of the treatments. The question then becomes: which subset should we include? Consider a 3-factor design where we want to run four treatments.

              A (Factor 1)  B (Factor 2)  C (Factor 3)  AB (Interaction)  AC (Interaction)  BC (Interaction)  ABC (Interaction)
Treatment 1        +1            +1            -1              +1                -1                -1                 -1
Treatment 2        +1            -1            +1              -1                +1                -1                 -1
Treatment 3        +1            +1            +1              +1                +1                +1                 +1
Treatment 4        +1            -1            -1              -1                -1                +1                 +1
Treatment 5        -1            +1            -1              -1                +1                -1                 +1
Treatment 6        -1            -1            +1              +1                -1                -1                 +1
Treatment 7        -1            +1            +1              -1                -1                +1                 -1
Treatment 8        -1            -1            -1              +1                +1                +1                 -1

If we chose Treatments 1-4, we could not investigate the main effect of A, since A is \(+1\) in all cases, so there is no variance available. This would be a poor subset to choose. This is where the idea of fractional factorial design comes into play. Fractional factorial design focuses on a reduced set of treatments that allows the main effects and lower-order interactions to be estimated independently. Some effects are confounded with one another, so they cannot be estimated independently.

              A (Factor 1)  B (Factor 2)  C (Factor 3)  AB (Interaction)  AC (Interaction)  BC (Interaction)  ABC (Interaction)
Treatment 3        +1            +1            +1              +1                +1                +1                 +1
Treatment 4        +1            -1            -1              -1                -1                +1                 +1
Treatment 5        -1            +1            -1              -1                +1                -1                 +1
Treatment 6        -1            -1            +1              +1                -1                -1                 +1

For example, in Treatments 3-6, A=BC, B=AC, and C=AB, so no main effect can be estimated independently of its paired two-factor interaction; C, for instance, cannot be estimated independent of A and B. This is known as a Resolution III design: the main effects are confounded with the 2-factor interactions but not with each other.
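
This confounding is easy to verify numerically. The sketch below rebuilds the Treatment 3-6 half-fraction and shows that each main-effect column is identical to a two-factor interaction column.

import numpy as np

# Treatments 3-6 from the table above (columns A, B, C)
design = np.array([
    [+1, +1, +1],   # Treatment 3
    [+1, -1, -1],   # Treatment 4
    [-1, +1, -1],   # Treatment 5
    [-1, -1, +1],   # Treatment 6
])
A, B, C = design.T

# Each main effect is confounded with a two-factor interaction
print(np.array_equal(A, B * C))   # True: A = BC
print(np.array_equal(B, A * C))   # True: B = AC
print(np.array_equal(C, A * B))   # True: C = AB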

Obviously, the higher the design resolution the better, but higher resolution also requires more treatments. Typically, Resolution IV or Resolution V designs are the most practical. There are some general "rules of thumb" when trying to choose a good factorial design.

Note that depending on the actual factors, certain conditions may not be possible. For example, if you are testing the presence or absence of a banner on a web page, and whether the banner should be red or blue, the colour is not testable in the condition where the banner is not present. In this case, balance is not possible. These types of situations should be taken into account when designing an MVT experiment. Since finding optimal designs for complex treatment arrangements is difficult, most statistical software provides functionality to do this for you.

A final question is: should you use A–B testing or MVT testing? A–B testing focuses on the effect of independent components in a new environment. MVT experiments focus on the holistic effect of the overall experience in a new environment. The experience that is most relevant or important to you or your users should dictate which type of experiments you choose to run.

Propensity Score Matching

Although RCTs are the "gold standard" for assessing changes in a test environment, in certain situations they are not possible. Ethical concerns, cost, lack of known participant properties, the number of specialized participants needed for the experiment, or other factors may preclude conducting controlled experiments. In these situations, we can use propensity score matching (PSM) to "match" pairs of participants that are similar, placing one in group A and one in group B.

The most common use of propensity scoring is when participants are defined by multiple attributes. During the definition of groups A and B, rather than choosing by random selection, we would like to choose two participants that are "similar" to one another and place one in group A and one in group B. This addresses the issue of balancing potentially confounding effects across the two groups. We fall back to random selection in situations where (the correct) attributes are not available. For example, if we are testing button clicks for two types of buttons on a website, we are unlikely to know anything (useful) to define our split between users. Here we use randomization to assign users to group A or group B to best address bias.

Propensity scoring is calculated from observational data about participants. It attempts to match participants who have common observable characteristics. Rather than using the participant attributes directly, which can be difficult or expensive, we compute a propensity score for each participant: a single value determined as a function of the covariates (observable attributes) of a participant.

Propensity scoring simplifies the task of identifying (or matching) similar participants in group A and group B. Matching by covariate values, especially when there are numerous covariates, is complicated. Reducing the covariates to a single score makes it much easier to identify similar participants. The standard method to do this is to fit a logistic regression model to the covariates of interest, then use the model to convert a participant's covariates to a single propensity value on the range \(0 \ldots 1\). Recall that a (binary) logistic regression model defines the log-odds of an event as a linear combination of one or more independent variables. Here, those variables are the covariates selected to compute the propensity score.

PSM does not eliminate A–B testing. Instead, it adjusts an initial A–B randomized split to improve it by removing possible bias in the two groups. More specifically, the following steps are used to determine whether there is a significant difference between group A (control) and group B (test).

  1. Randomly divide participants into group A and group B, exactly like A–B testing.
  2. Choose which covariate attributes you will use to calculate a participant's propensity score.
  3. Fit a logistic regression model \(l\) using the selected covariates.
  4. Use \(l(p_i)\) to compute a propensity score for each participant \(p_i\).
  5. Order participants in both groups by their propensity scores.
  6. For each participant \(p_i\) find their nearest neighbour \(n_i\) in the opposite group. If \(n_i\) is farther than a threshold value \(\tau\), do not include \(p_i\) in the follow-on analysis.
  7. Store the pair \((p_i, n_i)\) as a matched pair.
  8. Once all participants are paired or removed, search for significance over the pairs' proportional differences \(|\mu_{p_i} - \mu_{n_i}|\).

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, auc
from sklearn.neighbors import NearestNeighbors
from IPython.display import display

# Generate noisy data
X, y = make_classification(
    n_samples=1000,
    n_features=4,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=1,
    class_sep=2,
    flip_y=0.2,
    weights=[0.5, 0.5],
)

# Create minmaxscaler, normalize data to usage ratios
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(X)

data = pd.DataFrame(normalized_data)
data.columns = ["A", "B", "C", "D"]
data["RenewalStatus"] = y
data["Treatment"] = (data["A"] >= 0.40) * 1

# Select covariates for PSM
covariates = ["B", "C", "D"]
X = data[covariates]
y = data["Treatment"]

# Fit logit to get coefficients for covariates
logit = LogisticRegression()
logit.fit(X, y)
PS = logit.predict_proba(X)[:, 1]
false_positive_rate, true_positive_rate, th = roc_curve(y, PS)

# Match treated and control individuals based on propensity score
treated_indices = data[data["Treatment"] == 1].index
control_indices = data[data["Treatment"] == 0].index
nbrs = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(
    np.reshape(PS[control_indices], (-1, 1))
)
distances, indices = nbrs.kneighbors(
    np.reshape(PS[treated_indices], (-1, 1))
)
matched_control_indices = control_indices[indices.flatten()]

# Duplicate entries can lead to biased estimates
new_control_indices = list(set(matched_control_indices))

control_data = data.iloc[new_control_indices].RenewalStatus
control_mean = data.iloc[new_control_indices].RenewalStatus.mean()
treatment_data = data.iloc[treated_indices].RenewalStatus
treatment_mean = data.iloc[treated_indices].RenewalStatus.mean()

effect = treatment_mean - control_mean
print("Direct PSM:")
print("Effect on renewal rates w/Product A > 40%: ", round(effect * 100), "%")
print()

This code snippet implements propensity score matching to test whether a four-product dataset will generate higher renewal rates when the usage of the first product A is greater than 40%.

  1. Create random data with four products A, B, C, and D; a binary column RenewalStatus indicating whether a customer renewed their subscription; and a binary column Treatment identifying customers with product A usage over 40%.
  2. Select covariates B, C, and D and fit a logit to predict the treatment outcome, whether a customer's usage of product A is above or below 40%.
  3. Given the logit probabilities, we pair customers in the control and treatment groups with similar probabilities using k-nearest neighbours.
  4. We compare the mean renewal rate for the control and treatment groups to determine the effect on renewal rate when product A usage is above 40%.

Python provides the package psmpy to perform propensity score matching directly. The following shows the follow-on code that uses psmpy on the same random data as the original example.

# psmpy
from psmpy import PsmPy
from psmpy.functions import cohenD
from psmpy.plotting import *

psm_data = data.copy()
psm_data["idx"] = psm_data.index

# Create propensity score matching (psm) data structure
psm = PsmPy(
    psm_data, treatment="Treatment", indx="idx",
    exclude=["A", "RenewalStatus"]
)

# Apply logit for propensity probabilities
psm.logistic_ps(balance=True)

# Match control and treatment pairs based on propensity scores, 1-many
psm.knn_matched(
    matcher="propensity_logit",
    replacement=False,
    caliper=None,
    drop_unmatched=False,
)

effect_tbl = (
    psm_data[["RenewalStatus", "Treatment"]]
    .groupby(by="Treatment")
    .aggregate(["mean", "var", "std"])
)
effect_tbl.columns = ["Mean", "Var", "Std"]
effect = effect_tbl.iloc[1]["Mean"] - effect_tbl.iloc[0]["Mean"]

print("psmpy:")
print("Effect on renewal rates w/Product A > 40%: ", round(effect * 100), "%")

Although the effect scores are not identical, they are close, suggesting psmpy performs certain steps slightly differently than the direct code.

Direct PSM:
Effect on renewal rates w/Product A > 40%:  -66 %

psmpy:
Effect on renewal rates w/Product A > 40%:  -67 %