This module will discuss customer analytics. Customer analytics analyzes a company's customer data and behaviours to try to identify, attract, and retain the most profitable types of customers. Since customers have access to significantly more information about products and companies in the digital age, organizations must use updated strategies to attract and retain these customers. This is the goal of customer analytics: to create a unified, accurate view of the company's customer base, and to decide which strategies can best retain and grow this base. More detailed questions fall under this basic approach, for example, who are a company's highest value customers, and should those customers be prioritized over other needs?
Customer analytics is normally an interdisciplinary problem that involves marketing, sales, IT, customer service, and business analytics. The skills and knowledge of each group are shared to identify the business metrics to capture, the analysis to perform, and the overall goals to address. Customer analytics begins with the capture of raw data and ends with business decisions. One definition of the stages of customer analytics includes the following three steps.
The types of business decisions or goals associated with customer analytics are numerous and varied, but they normally have a relationship to the overall purpose of identifying, attracting, and retaining profitable customers. Below are some examples of goals for customer analytics.
Not surprisingly, numerous commercial tools exist to help with customer analytics. Many focus on customer relationship management (CRM), which encompasses many of the ideas discussed above. CRM tools can collect data, aggregate data from different sources, support raw data organization and analysis, and visualize results to better highlight any relevant insights that are found. The tools can also integrate with sales and marketing applications, web content management systems, email, social media sites, customer loyalty programs, and other tools designed to help attract and retain customers. Some common CRM tools from well-known vendors include Salesforce, Oracle Netsuite, Zoho, HubSpot, Pipedrive, Insightly, and Google Analytics 360, among others.
CRM tools are sometimes divided into collaborative, operational, and analytical. A collaborative CRM is designed to remove silos between different teams to ensure they are all sharing a common set of data. An operational CRM streamlines understanding the customer's journey through a company's web site, even in situations where the journey is represented with many highly detailed touchpoints. This is usually done by automating repetitive tasks to free employees to focus on more subjective or creative issues. An analytical CRM is optimized to analyze massive amounts of customer data and return actionable insights.
For our instruction, we will complete a basic introduction to Google Analytics. Google Analytics is free, is easy to set up within a web site, provides various ways to filter data, offers high-quality dashboards, and has some basic analytics built into its system. For more sophisticated organization and analytics, data from Google Analytics can be exported as CSV for use by external programs.
Google Analytics is a digital analytics platform provided by Google. One of its main advantages is that Google provides the service for free: because Google Analytics tends to drive business to Google Ads, Google benefits indirectly from users of the Analytics platform. A basic description of Google Analytics is that it is a tool within Google's Marketing Platform that collects data on users visiting your web site, then allows you to compile that data into reports to develop business strategies and improve your web site's performance. By installing web site-specific tracking code, you can see who visits your site and what they do there, as well as collect a wide variety of demographic data on your visitors. A small sample of the data you can collect using Google Analytics includes the following.
Google Analytics discusses the digital analytics funnel, the idea that individuals explore a web site or purchase items in stages. Marketing uses the concept of a funnel to enumerate these stages.
These stages may be different for web sites that are not designed to sell products; however, the basic ideas are often analogous. For example, conversion on a non-product web site may mean that the visitor returns to the site repeatedly because they find it useful.
Although we will discuss Google Analytics in the context of web sites and web traffic, it can also be used to collect data from mobile applications, point-of-sales systems, video game consoles, CRMs, and other Internet-connected platforms.
Google Analytics was originally built using UA (Universal Analytics). This allowed you to create an account, then obtain a snippet of Javascript code that you added to each web site you wanted to track. The Javascript contained a unique ID that sent information back to your Analytics account. The Google Analytics web site allowed that information to be viewed and filtered in a variety of ways.
More recently, Google decided to move to a new platform called GA4 (Google Analytics 4). This is designed to do what UA did, but in a very different way. Originally, UA used a hierarchy of account → property → view, where an account represented one or more web sites, a property was a section of a web site (e.g., a part of the site or a subset of its visitors), and a view was a combination of a filter and a visualization of some or all of the data associated with a property. Properties were themselves given unique sub-IDs to allow you to treat them as independent data sources.
In GA4 the hierarchy is now account → property → data stream. The basic ideas of account and property are similar to UA, but view has been removed. A new data stream concept has been added, which represents a source of raw data to be fed into a property. As before, that data can be filtered and visualized in a number of different ways. However, it is non-trivial to save a particular "view" into a property's data. This has caused a number of issues for individuals and businesses who were using the old UA system, since views are clearly valuable and difficult to replace. On the other hand, GA4 makes it much easier to separate events on a web site. Previously, UA bundled everything under a single tag ID. GA4 is also capable of capturing a much wider set of events than UA was.
All of our discussion will revolve around GA4, since UA is being deprecated in July 2024. If you search online for information on using Google Analytics, be careful to make sure you're looking at instructions for GA4 and not UA. Google has not removed any information about UA, and since this is the system with the longest history, searches that do not explicitly specify GA4 (and even some that do) will point you to information relevant to UA and not GA4.
Google Tag Manager is now the recommended method for collecting data to be analyzed in Google Analytics. Although this is often not clear, Google Analytics and Google Tag Manager are closely integrated but entirely separate online systems with two different purposes.
Google Analytics is an online software suite used for analytics, conversion tracking, and reporting. It provides a wide range of visualization, filtering, and reporting tools to build dashboards that provide insights into visitors to a web site, what they see on the site, and how they interact on the site.
Google Tag Manager, on the other hand, is a tag management system that can detect and store events within a web site, for example, page views, clicks, scrolling, entering or exiting the site, and so on. These events can be sent to a separate analytics package for further analysis and presentation. An obvious candidate for this task is Google Analytics.
To use Google Analytics, you first create a Google Analytics account tied to your web site. This includes creating an initial property attached to your account and a data stream to provide data to that property. At that point, you have the option of adding Javascript code to each page you want to track, or using the Google Tag Manager to automatically track different types of events (e.g., page views, clicks, etc.). Currently, Google is recommending using the Google Tag Manager, since any changes you make in the Tag Manager web site should automatically apply to all pages you are managing, without the need for additional edits to the Google Javascript that is involved in manual tagging.
Complete the following steps to create your initial Google Analytics account.
1. Visit https://analytics.google.com and create an account.
2. Create a property for the web site you want to track, for example healey.wordpress.ncsu.edu.
3. Choose a Stream name for the data stream attached to this property, and ensure "Enhanced measurement" is turned on.
4. Add the tracking code Google provides inside the <head> tag on the web pages you plan to analyze.

This web page gives additional instructions on how to set up an account and create an initial property and stream. After some time has passed, you can log in to Google Analytics and use the site's interface to explore data about users who have visited your web site.
Google Analytics groups activity into a session that begins when a user loads a page with tracking code, and ends after 30 minutes of inactivity. Data is uploaded to Google and made available through your analytics account. By default, Google Analytics will aggregate and present information based on a predefined set of criteria like geographic location, origin type, and page, but these can be modified as desired using filters and other interactive controls.
A Google Analytics account is made up of one or more properties. Within each property one or more data streams can be created. Properties are meant to collect data independently of one another by using a unique tracking ID in the tracking code. For example, a business may want to use different properties to collect data from different web sites, from a web site versus a mobile application, or from different geographic regions. Data streams are sources of data from a web site being sent to a property. Data streams come in three types: web, iOS app, or Android app. The intent is to allow us to aggregate users across different data sources within a single property.
Under the older UA system, when a property was created Google Analytics automatically created a default view called "All Web Site Data," containing all raw data collected for the property.
Visit https://tagmanager.google.com and set up Tag Manager for your web site. Tag Manager provides two code snippets to add to each page you want to track: the first should be placed within the <HEAD> tag. The second should be placed right after the <BODY> tag.

Next, you will want to add one or more tags and associated triggers to collect data from your web site and send it to Google Analytics.
Replace the default Untitled Tag name at the top of the panel with a more descriptive name for the type of data you plan to capture, for example, Page Views.

Note. I could only get the Preview to work when I used the Chrome browser, installed the Tag Assistant Companion extension, and disabled all other extensions, since there seemed to be a conflict in my extension set. I suspect this only affects previewing whether the tag is working or not. Rather than running the Preview step, you could jump directly to Submit then Publish, then check with Realtime as you load your page to see if page views are appearing or not.
In its most basic form, A–B testing is a way to determine whether changing a property of an environment makes it better or worse, based on a specific evaluation metric (often called a key performance indicator or KPI). In other words, given two versions of an environment, which performs better? Although the term "A–B testing" was coined in the 1990s, it is a form of basic randomized controlled experimentation, first documented by Ronald Fisher (of the famous Fisher's Iris dataset) in the 1920s. A–B testing in its current form is often characterized by being run online, in real time, and on a much larger scale in terms of participants and experiments versus traditional randomized controlled trials (RCTs). A high-level overview of designing, conducting, and evaluating A–B testing might go something like this.
For example, you might wonder whether one version of a button on a web site will encourage users to click more often than another version. Two versions of the web site with the two button candidates are constructed, and users are randomly assigned to view one of the two versions. The performance metric is the number of button clicks. Once all users have explored the web site, the click counts for the two versions are statistically compared to see whether one is significantly higher than the other. If it is, the button with more clicks is chosen for the final web design.
As with all experiments, randomization is critical. This ensures that users are not grouped based on some criteria that might influence their preferences, for example, for one colour of button over another.
During design of the experiment, deciding how many users are needed to ensure statistical significance is important. Since A–B testing is a form of randomized controlled experiment, we can use literature from either area to study this problem. For example, the medical community often conducts A–B-type tests and has many good sources and examples of how to calculate sample sizes for a desired level of improvement (i.e., how much "better" does the outcome need to be to be considered relevant?). Two types of experiments are considered: dichotomous, where the outcome of interest is one of two possibilities (yes/no, success/failure, and so on), and continuous, where the outcome of interest is the mean difference of an outcome variable between the two groups, for example, the difference in the average number of clicks between group A and group B.
Given the overall goal of determining whether changing the test environment leads to a significant change in KPI, experiments are often described in terms of the null hypothesis (\(H_0\)), that no significant change was found, or the alternative hypothesis (\(H_a\)), that a significant change did occur. This is often modelled using false positive (Type I error), false negative (Type II error), true positive, and true negative proportions, as shown below.
|  | \(H_0\) | \(H_a\) |
| --- | --- | --- |
| Predict \(H_0\) | True Negative, probability \(1 - \alpha\) | False Negative, probability \(\beta\) |
| Predict \(H_a\) | False Positive, probability \(\alpha\) | True Positive, probability \(1 - \beta\) |
Dichotomous. For a proportional metric, we need to define a significance level \(\alpha\), a power level \(P\), and the two proportions \(\mu_1\) and \(\mu_2\) from groups A and B that constitute the desired level of improvement \(\mu_2 - \mu_1\). Recall that \(\alpha\) is the probability of a Type I error (false positive) and \(\beta = 1 - P\) is the probability of a Type II error (false negative).
Notice that, for a fixed sample size, reducing the probability of committing a Type II error increases the probability of committing a Type I error and vice versa. Because of this, a careful balance must be maintained between \(\alpha\) and \(\beta\).
Given this, the size of each group \(n_A = n_B\) is \[ n_A = n_B = c \cdot \frac{\mu_1 (1-\mu_1) + \mu_2 (1-\mu_2)}{(\mu_1 - \mu_2)^{2}} \] where \(c=7.9\) or \(c=10.5\) for the standard power levels of \(P=80\)% or \(P=90\)% and \(\alpha = 0.05\). \(c\) is derived from the standard normal distribution: \(c = (z_{1-\alpha/2} + z_{1-\beta})^{2}\), where \(z_p = \Phi^{-1}(p)\) is the quantile (inverse CDF) of the standard normal distribution \[ \begin{gather} \Phi(x) = p(Z \leq x) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{x} \exp \left( -\frac{u^{2}}{2} \right) du \\ Z \sim N(\mu = 0, \sigma^{2} = 1) \end{gather} \] For example, if we want to go from 40% of participants answering Yes in Group A (control) to 70% answering Yes in Group B (test), \(n_A = n_B = 7.9 \cdot \frac{(0.4 \cdot 0.6) + (0.7 \cdot 0.3)}{0.3^{2}} \approx 40\) for an 80% power level or \(n_A = n_B \approx 53\) for a 90% power level at \(\alpha = 0.05\).
Continuous. We need to define a significance level \(\alpha\), a power level \(P\), a desired response difference \(\mu_2 - \mu_1\), and a common (combined group) standard deviation \(\sigma\). Given this, the size of each group \(n_A = n_B\) is \[ n_A = n_B = \frac{2c}{\delta^{2}} + 1 \] where \[ \delta = \frac{|\mu_2 - \mu_1|}{\sigma} \] and, as before, \(c=7.9\) for \(P=80\)% and \(c=10.5\) for \(P=90\)%. For example, if we wanted to go from 20% clicks in group A to 30% clicks in group B with a standard deviation \(\sigma=0.5\), then \(\delta = \frac{0.1}{0.5} = 0.2\) and \(n_A = n_B = \frac{15.8}{0.04} + 1 = 396\) for \(P=80\)% or \(n_A = n_B = \frac{21}{0.04} + 1 = 526\) for \(P=90\)%.
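As a quick check, the two closed-form calculations above can be reproduced with a few lines of Python. This is a sketch only; the constants \(c=7.9\) and \(c=10.5\) are taken directly from the formulas above.

```python
# Reproduce the worked sample-size examples above.

def n_dichotomous(mu1, mu2, c):
    """Per-group sample size for a difference in proportions mu1 vs mu2."""
    return c * (mu1 * (1 - mu1) + mu2 * (1 - mu2)) / (mu1 - mu2) ** 2

def n_continuous(mu1, mu2, sigma, c):
    """Per-group sample size for a difference in means with common sd sigma."""
    delta = abs(mu2 - mu1) / sigma
    return 2 * c / delta ** 2 + 1

print(n_dichotomous(0.4, 0.7, c=7.9))      # ~40 per group at 80% power
print(n_dichotomous(0.4, 0.7, c=10.5))     # ~53 per group at 90% power
print(n_continuous(0.2, 0.3, 0.5, c=7.9))  # 396 per group at 80% power
print(n_continuous(0.2, 0.3, 0.5, c=10.5)) # 526 per group at 90% power
```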
Alternatively, you can use either Python or R to calculate minimum sample sizes, for example with Python's statsmodels package or R's power.* functions. Both model the problem using the same framework of Type I and Type II errors described above. The tests are parameterized by the \(\alpha\) significance level (normally 1%, 5%, or 10%), the false negative rate \(\beta\) (the probability of incorrectly failing to reject \(H_0\)), the power level (\(1 - \beta\), the true positive rate, or the probability of correctly rejecting a false \(H_0\)), and the effect size, defined by the minimum detectable lift (MDL, the minimum change needed to reject \(H_0\)).
For a dichotomous A–B test, R's power.prop.test() can be used to determine the minimum \(n\) needed for significance.
For a continuous A–B test, R's power.t.test() can be used to determine the minimum \(n\) needed for significance.
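In Python, a comparable calculation can be sketched with statsmodels' power classes. The exact sample sizes may differ slightly from the closed-form approximation above, since statsmodels uses an arcsine-based effect size for proportions; the proportions and effect size below mirror the worked examples.

```python
# Minimum sample size per group for an A-B test using statsmodels.
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Dichotomous (proportional) metric: 40% -> 70% "Yes" responses.
es = proportion_effectsize(0.7, 0.4)   # Cohen's h effect size
n_prop = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.80)
print(f"Dichotomous test: {n_prop:.1f} per group")   # close to the ~40 above

# Continuous metric: detect a standardized difference delta = 0.2.
n_cont = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.80)
print(f"Continuous test: {n_cont:.1f} per group")    # close to the 396 above
```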
Once results are obtained from an A–B test, they are analyzed to search for significant differences. The null hypothesis \(H_0\) that there is no significant difference in the performance metric between the two groups A and B is \(p_B - p_A = 0\) for dichotomous (proportional) metrics and \(\bar{p_A} = \bar{p_B}\) for continuous metrics. Proportional significance can be measured in Python with statsmodels.stats.proportion.proportions_ztest() and in R with prop.test(). For continuous metrics, use scipy.stats.ttest_ind() in Python or t.test() in R.
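A minimal sketch of the two Python tests follows; the counts and samples are illustrative values, not results from a real experiment.

```python
# Significance tests on A-B results (illustrative data only).
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Dichotomous metric: button clicks out of visitors in groups A and B.
clicks = np.array([45, 70])      # successes in A and B (hypothetical)
visitors = np.array([500, 500])  # group sizes (hypothetical)
z_stat, p_val = proportions_ztest(count=clicks, nobs=visitors)
print(f"Proportions z-test: z = {z_stat:.2f}, p = {p_val:.4f}")

# Continuous metric: e.g., time on page per visitor in each group.
rng = np.random.default_rng(17)
group_a = rng.normal(loc=3.0, scale=1.0, size=200)
group_b = rng.normal(loc=3.2, scale=1.0, size=200)
t_stat, p_val = ttest_ind(group_a, group_b)
print(f"Two-sample t-test: t = {t_stat:.2f}, p = {p_val:.4f}")
```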
One final value you may want to calculate is effect size (ES). Intuitively, effect size states how strongly the independent variables affect the dependent variable. For t-test studies, Cohen's d is often used to measure effect size. Cohen's d calculates the ratio of the mean difference between groups to the pooled standard deviation, where \(d=0.2\) is considered small, \(d=0.5\) is considered medium, and \(d=0.8\) is considered large.
\[ \begin{gather} d = \frac{|\mu_1 - \mu_2|}{\sigma_p} \\ \sigma_p = \sqrt{\frac{(n_1 - 1) \sigma_1^{2} + (n_2 - 1 ) \sigma_2^{2}}{n_1 + n_2 - 2}} \end{gather} \]

It is often useful to complete the analysis by including both significance and effect size. For example, changing this property results in a significant change between groups, with a small/medium/large effect on the measured result or KPI.
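A direct implementation of Cohen's \(d\), using the pooled standard deviation defined above and hypothetical sample data, might look like the following sketch.

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    var1, var2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return abs(np.mean(x1) - np.mean(x2)) / pooled_sd

# Hypothetical samples for groups A and B.
rng = np.random.default_rng(17)
a = rng.normal(3.0, 1.0, 200)
b = rng.normal(3.2, 1.0, 200)
print(f"d = {cohens_d(a, b):.2f}")   # roughly 0.2, a small effect
```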
Multivariate testing (MVT) is performed using many variations of a design, usually called factors, tested simultaneously. For example, you might design two possible headlines and two possible images for a web site, then test them simultaneously as \(2 \times 2 = 4\) possibilities using a headline factor of size 2 and an image factor of size 2. MVT is more complicated than A–B testing, but it can be more efficient, since it allows multiple factors to be tested in parallel rather than sequentially. It also provides information about how combinations of factors perform: it may be that Image 1 works well with Headline 1 but not with Headline 2. Testing each factor independently would not reveal this insight.
If we define a factor being present as \(+1\) and a factor being absent as \(-1\), we can present MVT designs as a table of experiments or treatments and the associated factors and factor interactions being tested.
|  | A (Factor 1) | B (Factor 2) | AB (Interaction) |
| --- | --- | --- | --- |
| Treatment 1 | +1 | +1 | +1 |
| Treatment 2 | +1 | -1 | -1 |
| Treatment 3 | -1 | +1 | -1 |
| Treatment 4 | -1 | -1 | +1 |
Recall that when two vectors' dot product is 0, they are orthogonal. A balanced design occurs when the factor columns are pairwise orthogonal, that is, their dot products are 0. In this case, Factor A \(\cdot\) Factor B = \((1,1,-1,-1) \cdot (1,-1,1,-1) = 1 - 1 - 1 + 1 = 0\), producing a full factorial design or balanced design (all possible combinations of factors are tested).
The effect of any factor, for example A, is calculated as the difference in mean response between the treatments where A is \(+1\) and those where A is \(-1\), \(\bar{x_A} = \bar{x_{A+1}} - \bar{x_{A-1}}\). The effect of the interaction between factors is calculated similarly, \(\bar{x_{AB}} = \bar{x_{AB+1}} - \bar{x_{AB-1}}\). The key advantage of a balanced design is that you can add more (two-level) factors without increasing the required sample size. An \(n\)-factor design has \(2^{n}\) rows, \(1\) mean, \(n\) main effects, \(2^{n} - n - 1\) interactions, and \(2^{n}\) treatments. For example, an \(n=3\)-factor design has \(2^3=8\) rows, \(1\) overall mean, \(3\) main effects, \(3\) two-way interactions, \(1\) three-way interaction, and \(2^{3}=8\) total treatments.
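A minimal sketch of these calculations for the two-factor design above, using hypothetical response values for the four treatments, is shown below.

```python
import numpy as np

# Columns of the 2-factor full factorial design (Treatments 1-4).
A  = np.array([+1, +1, -1, -1])
B  = np.array([+1, -1, +1, -1])
AB = A * B                      # interaction column

print(A @ B)                    # 0: the factor columns are orthogonal (balanced)

# Hypothetical responses (e.g., click-through rate) for Treatments 1-4.
y = np.array([0.30, 0.26, 0.22, 0.20])

def effect(col, y):
    """Mean response where the column is +1 minus mean response where it is -1."""
    return y[col == +1].mean() - y[col == -1].mean()

print(effect(A, y))    # main effect of A
print(effect(B, y))    # main effect of B
print(effect(AB, y))   # AB interaction effect
```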
As the number of factors increases, the number of treatments and the number of interactions both increase exponentially. If we choose not to test all interaction terms, we can instead focus on designs that include only a subset of the treatments. The question then becomes: which subset should we include? Consider a 3-factor design where we want to run four treatments.
|  | A (Factor 1) | B (Factor 2) | C (Factor 3) | AB (Interaction) | AC (Interaction) | BC (Interaction) | ABC (Interaction) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Treatment 1 | +1 | +1 | -1 | +1 | -1 | -1 | -1 |
| Treatment 2 | +1 | -1 | +1 | -1 | +1 | -1 | -1 |
| Treatment 3 | +1 | +1 | +1 | +1 | +1 | +1 | +1 |
| Treatment 4 | +1 | -1 | -1 | -1 | -1 | +1 | +1 |
| Treatment 5 | -1 | +1 | -1 | -1 | +1 | -1 | +1 |
| Treatment 6 | -1 | -1 | +1 | +1 | -1 | -1 | +1 |
| Treatment 7 | -1 | +1 | +1 | -1 | -1 | +1 | -1 |
| Treatment 8 | -1 | -1 | -1 | +1 | +1 | +1 | -1 |
If we chose Treatments 1-4, we could not investigate the main effect of A, since A is \(+1\) in all cases and there is no variance available, so this would be a poor subset to choose. This is where the idea of fractional factorial design comes into play. Fractional factorial design focuses on a reduced set of treatments that allows the main effects and lower-order interactions to be estimated independently. Some effects are confounded with one another, so they cannot be estimated separately.
|  | A (Factor 1) | B (Factor 2) | C (Factor 3) | AB (Interaction) | AC (Interaction) | BC (Interaction) | ABC (Interaction) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Treatment 3 | +1 | +1 | +1 | +1 | +1 | +1 | +1 |
| Treatment 4 | +1 | -1 | -1 | -1 | -1 | +1 | +1 |
| Treatment 5 | -1 | +1 | -1 | -1 | +1 | -1 | +1 |
| Treatment 6 | -1 | -1 | +1 | +1 | -1 | -1 | +1 |
For example, in Treatments 3-6, C=AB, so C cannot be estimated independently of A and B. Similarly, A=BC and B=AC. This is known as a Resolution III design: the main effects are confounded with the 2-factor interactions but not with each other.
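The aliasing can be verified directly: in the half-fraction above, each main-effect column equals the product of the other two, as this short numpy check sketches.

```python
import numpy as np

# Factor columns for Treatments 3-6 of the half-fraction above.
A = np.array([+1, +1, -1, -1])
B = np.array([+1, -1, +1, -1])
C = np.array([+1, -1, -1, +1])

print(np.array_equal(C, A * B))  # True: C is confounded with AB
print(np.array_equal(A, B * C))  # True: A is confounded with BC
print(np.array_equal(B, A * C))  # True: B is confounded with AC
```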
Obviously, the higher the design resolution the better, but higher resolution also requires more treatments. Typically, Resolution IV or Resolution V designs are most practical. There are some general "rules of thumb" when trying to choose a good factorial design.
Note that depending on the actual factors, certain conditions may not be possible. For example, if you are testing the presence or absence of a banner on a web page, and whether the banner should be red or blue, the colour is not testable in the condition where the banner is not present. In this case, balance is not possible. These types of situations should be taken into account when designing an MVT experiment. Since finding optimal designs for complex treatment arrangements is difficult, most statistical software provides functionality to do this for you.
A final question is: should you use A–B testing or MVT testing? A–B testing focuses on the effect of independent components in a new environment. MVT experiments focus on the holistic effect of the overall experience in a new environment. The experience that is most relevant or important to you or your users should dictate which type of experiments you choose to run.
Although RCTs are the "gold standard" for assessing changes in a test environment, in certain situations they are not possible. Ethical concerns, cost, a lack of known participant properties, the number of specialized participants needed for the experiment, or other factors may preclude conducting controlled experiments. In these situations, we can use propensity score matching (PSM) to "match" pairs of participants that are similar, placing one in group A and one in group B.
The most common use of propensity scoring is when participants are defined by multiple attributes. During the definition of groups A and B, rather than choosing by random selection, we would like to choose two participants that are "similar" to one another and place one in group A and one in group B. This addresses the issue of balancing potentially confounding effects across the two groups. We fall back to random selection in situations where (the correct) attributes are not available. For example, if we are testing button clicks for two types of buttons on a web site, we are unlikely to know anything (useful) to define our split between users. Here we use randomization to assign users to group A or group B to best address bias.
Propensity scoring is calculated from observational data about participants. It attempts to match participants who share common observable characteristics. Rather than using the participant attributes directly, which can be difficult or expensive, we compute a propensity score for each participant: a single value determined as a function of the participant's covariates (observable attributes).
Propensity scoring simplifies the task of identifying (or matching) similar participants in group A and group B. Matching by covariate values, especially when there are numerous covariates, is complicated. Reducing the covariates to a single score makes it much easier to identify similar participants. The standard method to do this is to fit a logistic regression model to the covariates of interest, then use the model to convert a participant's covariates to a single propensity score on the range \(0 \ldots 1\). Recall that a (binary) logistic regression model defines the log-odds of an event as a linear combination of one or more independent variables. Here, those variables are the covariates selected to compute the propensity score.
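A minimal sketch of this step, assuming scikit-learn's LogisticRegression and hypothetical covariate columns, might look like the following.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical observational data: two covariates and a treatment indicator.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "age":    rng.normal(40, 10, 500),
    "income": rng.normal(60, 15, 500),
})
# Treatment assignment loosely related to age (for illustration only).
df["treated"] = (rng.random(500) < 1 / (1 + np.exp(-(df["age"] - 40) / 10))).astype(int)

# Fit a logistic regression of treatment on the covariates, then use the
# predicted probability of treatment as the propensity score (0..1).
model = LogisticRegression().fit(df[["age", "income"]], df["treated"])
df["propensity"] = model.predict_proba(df[["age", "income"]])[:, 1]
print(df[["treated", "propensity"]].head())
```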
PSM does not eliminate A–B testing. Instead, it adjusts an initial A–B randomized split to improve it by removing possible bias in the two groups. More specifically, the following steps are used to determine whether there is a significant difference between group A (control) and group B (test).
This code snippet implements propensity score matching to test whether a four-product dataset will generate higher renewal rates when usage of the first product A is greater than 40%.
1. Generate random data containing usage columns for four products A, B, C, and D; a binary column RenewalStatus indicating whether a customer renewed their subscription; and a binary column identifying customers with product A usage over 40%.
2. Use B, C, and D as covariates and fit a logit that predicts whether a customer's usage of product A is above or below 40% (the treatment variable).
3. Match customers with similar propensity scores, then test whether renewal rates are significantly higher when product A usage is above 40%.

A sketch of this approach appears below.
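This is a minimal sketch of the pipeline just described, using synthetic data, scikit-learn for the logistic model, and a simple greedy nearest-neighbour match; the column names, coefficients, and matching strategy are assumptions for illustration, not the original snippet.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.proportion import proportions_ztest

# Synthetic four-product usage data (hypothetical values for illustration).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame(rng.uniform(0, 1, size=(n, 4)), columns=list("ABCD"))
df["Treatment"] = (df["A"] > 0.4).astype(int)          # product A usage > 40%
renew_prob = 0.3 + 0.3 * df["Treatment"] + 0.2 * df["B"]
df["RenewalStatus"] = (rng.random(n) < renew_prob).astype(int)

# 1. Propensity scores: logit of Treatment on the other products' usage.
X = df[["B", "C", "D"]]
df["propensity"] = LogisticRegression().fit(X, df["Treatment"]).predict_proba(X)[:, 1]

# 2. Greedy 1:1 nearest-neighbour matching on propensity score, without
#    replacement; iterate over the smaller group so every member is matched.
treated = df[df["Treatment"] == 1]
control = df[df["Treatment"] == 0]
small, large = (treated, control) if len(treated) <= len(control) else (control, treated)
pool = large.copy()
pairs = []
for idx, row in small.iterrows():
    j = (pool["propensity"] - row["propensity"]).abs().idxmin()
    pairs.append((idx, j))
    pool = pool.drop(index=j)

matched = df.loc[[i for pair in pairs for i in pair]]

# 3. Compare renewal rates between matched treated and control customers.
m_treated = matched[matched["Treatment"] == 1]
m_control = matched[matched["Treatment"] == 0]
counts = np.array([m_treated["RenewalStatus"].sum(), m_control["RenewalStatus"].sum()])
nobs = np.array([len(m_treated), len(m_control)])
z, p = proportions_ztest(counts, nobs)
print(f"Renewal rate (A > 40%):  {counts[0] / nobs[0]:.3f}")
print(f"Renewal rate (A <= 40%): {counts[1] / nobs[1]:.3f}")
print(f"z = {z:.2f}, p = {p:.4f}")
```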
Python provides the package psmpy to perform propensity score matching directly. The following code shows the follow-on code that uses psmpy on the same random data as the original example.
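The call sequence below is a sketch based on psmpy's commonly documented PsmPy interface; the explicit index column, the column names, and the df_matched attribute are assumptions that should be checked against the installed version of the package.

```python
import pandas as pd
from psmpy import PsmPy

# Reuse the synthetic DataFrame df from the previous sketch; psmpy expects an
# explicit id/index column, so one is added here (an assumption about its API).
df_psm = df[["B", "C", "D", "Treatment", "RenewalStatus"]].copy()
df_psm["customer_id"] = range(len(df_psm))

# Fit the propensity model and perform 1:1 nearest-neighbour matching.
# RenewalStatus is excluded so it is not used as a covariate.
psm = PsmPy(df_psm, treatment="Treatment", indx="customer_id",
            exclude=["RenewalStatus"])
psm.logistic_ps(balance=True)                  # logistic propensity scores
psm.knn_matched(matcher="propensity_logit",    # match on the logit of the score
                replacement=False, caliper=None)

# Compare renewal rates in the matched sample (df_matched holds matched rows).
matched = df_psm.merge(psm.df_matched[["customer_id"]], on="customer_id")
print(matched.groupby("Treatment")["RenewalStatus"].mean())
```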
Although the effect scores are not identical, they are close, suggesting psmpy performs certain steps slightly differently than the direct code.