Inclusiveness is important. Not only in our society, but also in the products that we create. This means that products should be equally usable for everyone, regardless of someone’s gender, age, ethnicity, etc.
Sadly, this is not always true. For instance, Google’s speech recognition works better for males than for females, and facial recognition systems tend to classify males more accurately than females.
A/B testing is often used to design or improve software. Generally speaking, in an A/B test users are randomly assigned to a control (A) or treatment (B) group. Those in the control group work with the existing version of the software, while those in the treatment group get to see the software with some new or changed feature. Any differences in metrics between the two groups are likely to be caused by the feature.
At least in theory, that is.
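In code, the assignment-and-comparison mechanics can be sketched roughly like this. The hash-based bucketing function and the toy opt-in data are illustrative assumptions, not how any real experimentation platform is implemented:

```python
import hashlib
import statistics

def assign_group(user_id: str, experiment: str = "oobe-theme") -> str:
    """Deterministic 50/50 bucketing: hash the user id with the experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"

# Hypothetical opt-in observations per group (1 = opted in, 0 = did not).
observations = {"A": [1, 0, 1, 1, 0, 0, 1, 0], "B": [1, 1, 1, 0, 1, 1, 0, 1]}

# The observed effect is the difference in mean opt-in rate between the
# treatment (B) and control (A) groups: here 0.75 - 0.50 = 0.25.
effect = statistics.mean(observations["B"]) - statistics.mean(observations["A"])
```

Hashing rather than calling a random number generator is a common choice because it gives every user a stable assignment across sessions.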
The Windows out-of-box experience (OOBE) is a feature that helps users set up their Windows accounts and services on a new device.
Historically, OOBE used a blue design, but in 2018 an alternative light design was considered. The goal of this change was to increase the proportion of users who link or sign up for Windows services, such as OneDrive and Office 365.
A small A/B test was conducted with about 150 participants for each of the two groups. The results appeared to show no statistically significant differences between the two designs.
However, a closer examination of the collected data painted a very different picture…
It turned out that there were large differences, if you knew where to look:
For OneDrive, the light theme worked 18% better for those who self-identified as female, but lowered opt-in from those who self-identified as male by 21%.
Results for Office 365 were similar: the light theme led to an improvement of 39% among those who self-identified as female, and an equally large reduction among self-identified males.
These insights were only possible because this particular A/B test unintentionally recruited a balanced 50/50 split in gender identity.
Because A/B tests generally use random sampling, this does not happen very often. In fact, female gender identities tend to be under-represented in computing datasets, so the result of the test might just as easily have been that the blue design is better!
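A small numeric sketch shows how opposing sub-population effects can cancel out in the pooled result. The baseline opt-in rate below is an invented placeholder; only the relative effects (+18% / -21%, the OneDrive numbers) come from the results above:

```python
# Illustrative baseline opt-in rate (assumed, not from the actual experiment).
baseline = 0.30

# Relative effects reported for the light theme: +18% for self-identified
# females, -21% for self-identified males (OneDrive numbers).
light = {"female": baseline * 1.18, "male": baseline * 0.79}
blue = {"female": baseline, "male": baseline}

# With a balanced 50/50 split, the pooled rates nearly cancel out...
pooled_light = 0.5 * light["female"] + 0.5 * light["male"]
pooled_blue = 0.5 * blue["female"] + 0.5 * blue["male"]
print(f"pooled difference: {pooled_light - pooled_blue:+.3f}")  # close to zero

# ...while each sub-population sees a large effect, in opposite directions.
for group in ("female", "male"):
    print(f"{group}: {light[group] - blue[group]:+.3f}")
```

So a test that only looks at the pooled metric would conclude "no effect", exactly as the OOBE experiment initially did.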
The A/B test used for OOBE also didn't have enough participants across other relevant dimensions (e.g. level of education or ethnicity), which makes it hard to gain insight into possible differences between those sub-populations.
The inclusiveness of the Windows Experimentation Program (which A/B testing is part of) was improved in three ways:
Pre-launch sample size computations
A/B tests are conducted on a minimal sample rather than the entire population, which minimises the possible downsides of the A/B test.
Sampling is still a good idea, but the calculation of the right sample size needs to be done differently. First, the minimal sample size needs to be computed for each metric and for each population of interest. The correct sample size is the largest value that you get from these computations.
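As a sketch, assuming opt-in rates are compared with a two-proportion z-test, the per-(metric, sub-population) sample sizes can be computed with the standard normal-approximation power formula and then maximised. All baseline rates and minimum detectable differences below are made-up placeholders:

```python
import math
from statistics import NormalDist

def min_sample_size(p_baseline: float, min_detectable_diff: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    p1 = p_baseline
    p2 = p_baseline + min_detectable_diff
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / min_detectable_diff ** 2
    return math.ceil(n)

# Hypothetical baseline rates and minimum detectable differences per
# (metric, sub-population) pair -- the numbers are made up for illustration.
requirements = {
    ("onedrive_optin", "female"): (0.30, 0.03),
    ("onedrive_optin", "male"): (0.30, 0.03),
    ("office_optin", "female"): (0.20, 0.03),
    ("office_optin", "male"): (0.20, 0.03),
}

# The experiment's sample size is the largest requirement across all pairs.
n_required = max(min_sample_size(p, d) for p, d in requirements.values())
```

Note how much larger this is than the roughly 150 participants per group of the original OOBE test: detecting effects within sub-populations typically needs thousands of users, not hundreds.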
The computations make it possible to detect differences between sub-populations.
Analyses and dashboards
All collected data are labelled with sub-population information and visible in analyses and dashboards. Notifications show when statistically significant differences occur between sub-populations and drill-downs are provided to examine differences between sub-populations.
These changes make it easy to detect differences between sub-populations.
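A drill-down of this kind can be sketched as a per-sub-population two-proportion z-test; the counts below are invented for illustration:

```python
import math
from statistics import NormalDist

def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided p-value for a two-proportion z-test with pooled variance."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical per-sub-population counts: (opt-ins A, n A, opt-ins B, n B).
by_subpopulation = {
    "female": (30, 100, 45, 100),
    "male": (35, 100, 33, 100),
}

# Flag sub-populations where the treatment effect is statistically significant.
flagged = {group for group, counts in by_subpopulation.items()
           if two_proportion_p_value(*counts) < 0.05}
```

A production dashboard would additionally need to correct for multiple comparisons, since testing many sub-populations inflates the false-positive rate.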
Education and guidance
Finally, education and training documentation were added to promote cultural change and different ways of thinking and decision making among engineers.
This creates more awareness about differences between sub-populations.
Nevertheless, inclusive A/B testing remains far from trivial:
Targeted sampling for specific sub-populations is impractical: targeting one dimension can cause imbalances in other dimensions, and some relevant dimensions may not even be known when the experiment is conducted.
Anonymity is often used to protect vulnerable sub-populations, but also makes it hard to make sure that software is engineered to treat users from those sub-populations fairly.
It's still hard to know whether the data you have is representative or biased, as it is precisely those who cannot be observed who may need the most help to be included.
Larger sample sizes are needed for A/B tests to detect differences between sub-populations (e.g. gender, ethnicity, age).
Analyses and dashboards should make it easy to examine differences between sub-populations.
Engineers need to be made aware that differences may exist between sub-populations.