The Toilet Paper

When machine learning meets software engineering

Software engineering and machine learning are like oil and water. They don’t mix – unless you add an emulsifier of course.

Machine Learnings of America for Make Benefit Glorious Nation of Engineeringstan

Search, advertising, machine translation, voice recognition, and even design advice for presentations and word processing documents: it almost seems as if every application nowadays “needs” AI-powered features. This certainly appears to be the case at Microsoft, where an ever-increasing number of software engineering teams incorporate machine learning in their products or processes.

ML workflows at Microsoft


In theory, machine learning workflows at Microsoft consist of nine consecutive stages:

  1. In the model requirements stage designers decide which features can be implemented using machine learning and what types of models are most appropriate.

  2. During data collection teams look for existing datasets or create their own. Early versions of models may already be trained during this stage.

  3. Data cleaning is done to remove inaccurate and noisy records from the dataset.

  4. Data labelling is done to assign ground-truth labels to each record.

  5. Feature engineering involves extraction and selection of features for machine learning models.

  6. During model training the models are trained and tuned on the labelled dataset.

  7. The performance of the model is measured during model evaluation. For critical domains this stage might also involve extensive human evaluation.

  8. The inference code of the model is deployed to production.

  9. The model is continuously monitored for errors during real-world execution.

In practice, machine learning workflows are highly non-linear and include several feedback loops. Non-linearity and feedback loops also feature in agile software processes, but machine learning requires far more experimentation than traditional software engineering.
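A toy sketch of such a feedback loop (all function names and numbers here are invented for illustration, not taken from the paper): a failing evaluation loops back to data collection instead of letting the workflow proceed linearly to deployment.

```python
# Illustrative sketch, not Microsoft's pipeline: evaluation gates
# deployment, and a failing evaluation loops back to earlier stages.

TARGET_ACCURACY = 0.9

def collect_data(round_num):
    # Stub: each loop iteration "collects" more data.
    return list(range(10 * (round_num + 1)))

def train_model(data):
    # Stub: model quality grows with the amount of data.
    return {"accuracy": min(0.95, 0.5 + 0.01 * len(data))}

def run_workflow(max_rounds=5):
    """Loop back to data collection until the model passes evaluation."""
    for round_num in range(max_rounds):
        data = collect_data(round_num)
        model = train_model(data)
        if model["accuracy"] >= TARGET_ACCURACY:
            return model  # hand off to deployment
    raise RuntimeError("no model met the evaluation bar")
```

The point of the sketch is the control flow: the "stages" are not a one-way pipeline but a loop whose exit condition is an evaluation result.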

Ongoing research suggests that software engineers find it hard to integrate machine learning into their existing processes, possibly due to issues arising from the inherent differences between machine learning and software engineering. This paper therefore aims to provide some insights into ML-specific best practices at Microsoft.

ML best practices


The researchers collected data using interviews with 14 senior engineers and surveys with 551 engineers at Microsoft. The data show that AI is used throughout the company; not only to implement features in end-user applications, but also for internal analyses and incident reporting.

Respondents mentioned a number of different challenges, along with lessons learnt which the researchers have converted into best practices.

End-to-end pipeline support


The first major challenge is tooling. Ideally, one would want a seamless development experience with a lot of automation that covers all nine stages in the ML workflow. Sadly, integrating machine learning components into larger software systems can be very hard due to the aforementioned differences between machine learning and software engineering.

Respondents report making use of internal infrastructure, specialised pipelines, rich dashboards that show the value provided to users, and development tools that make machine learning easier to apply for engineers with varying levels of experience. Visual tools help novice data scientists get started, but once they know the ropes and branch out, such tools may get in their way and they may need something else.

Data availability, collection, cleaning, and management


The second major challenge is data availability, collection, cleaning, and management. Collecting data is expensive: any existing data should be reused as much as possible to reduce duplicated efforts, while automation can be used to lower the costs of collecting new data.

Moreover, rapid evolution of data sources requires the use of data management tools to avoid fragmentation of data and model management activities. The article mentions an example where models are versioned with provenance tags that explain which data they were trained on and which version of the model (code) was used.
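As a rough illustration of the provenance idea (the schema, names, and versions below are hypothetical, not Microsoft's actual tooling), a model version could carry a record of exactly which dataset and which training code produced it:

```python
import hashlib
import json

def provenance_tag(dataset_version, training_code_version, params):
    """Build a provenance record so a trained model can be traced back
    to the exact data and code that produced it (illustrative only)."""
    record = {
        "dataset": dataset_version,
        "code": training_code_version,
        "params": params,
    }
    # A stable hash of the record doubles as a short, unique tag.
    blob = json.dumps(record, sort_keys=True).encode()
    record["tag"] = hashlib.sha256(blob).hexdigest()[:12]
    return record

# Hypothetical dataset and code identifiers:
tag = provenance_tag("sales-2019-03", "train.py@4f2a91c", {"lr": 0.01})
```

Storing such a record alongside each model version keeps data management and model management from fragmenting as data sources evolve.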

Education and training


As machine learning components make their way into more customer-facing products and engineers with traditional software engineering backgrounds need to learn how to work alongside ML specialists, education and training play an increasingly important role.

Microsoft facilitates education and training in various ways:

  • twice-yearly internal conferences on machine learning and data science, with at least one day devoted to ML basics and best practices;

  • employee talks about internal tools, engineering details behind novel projects and product features, and cutting-edge advances in AI research;

  • weekly open forums on machine learning and deep learning, where practitioners can get together and learn more about AI; and

  • mailing lists and online forums.

Model debugging


Debugging activities for components that learn from data not only focus on programming bugs, but also on other issues that arise from model errors and uncertainty. This is still an active research area, but several possible solutions include the use of more interpretable models, visualisation techniques that make black-box models more interpretable, and modularisation in a conventional, layered, and tiered software architecture to simplify error analysis and debuggability.
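As a toy illustration of the "more interpretable models" option (the features and weights below are invented), a linear scorer lets you read off exactly how much each feature contributed to a prediction, which makes error analysis far simpler than with a black-box model:

```python
# Illustrative: a linear scorer whose per-feature contributions can be
# inspected directly, one flavour of "more interpretable model".

WEIGHTS = {"num_errors": -2.0, "uptime_hours": 0.1, "num_restarts": -0.5}

def score(features):
    """Return the total score plus each feature's contribution to it."""
    contributions = {name: WEIGHTS[name] * value
                     for name, value in features.items()}
    return sum(contributions.values()), contributions

total, parts = score({"num_errors": 1, "uptime_hours": 40, "num_restarts": 2})
# `parts` shows exactly how each feature moved the score up or down.
```

When a prediction looks wrong, the `contributions` dictionary points straight at the feature responsible, which is precisely what debugging a black-box model lacks.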

Model evolution, evaluation, and deployment


Development of machine learning models is a highly iterative process, which naturally requires more frequent deployments. Updates may have a significant impact on system performance. Some teams therefore use agile techniques to evaluate the performance of new models. Automated tests that capture what models should do are as helpful for machine learning as they are for software engineering. However, it’s important that a human remains in the loop to understand why models don’t always work as desired.
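A minimal sketch of such an automated behavioural test, assuming a hypothetical spam classifier (the stub and test cases below are invented): failures are collected for a human to review, rather than reduced to a single pass/fail bit.

```python
# Sketch of an automated "what the model should do" test: a fixed suite
# of behavioural cases. All names here are illustrative.

def classify(text):
    # Stub standing in for a real, trained spam classifier.
    return "spam" if "free money" in text.lower() else "ham"

BEHAVIOURAL_CASES = [
    ("Claim your FREE MONEY now!!!", "spam"),
    ("Meeting moved to 3pm", "ham"),
]

def failing_cases():
    """Return every case the model gets wrong, so a human can inspect
    why the model misbehaved instead of just seeing a red build."""
    return [(text, want) for text, want in BEHAVIOURAL_CASES
            if classify(text) != want]
```

Running `failing_cases()` on every new model version turns "the model should still catch obvious spam" into a regression test, while keeping a human in the loop for diagnosis.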

Ideally, automation should not stop at the training and deployment pipeline: model building should be integrated with the rest of the software, using common versioning repositories and tightly coupled development processes.

The effect of seniority


The ability of teams to effectively deliver products with ML-based features is largely determined by their prior experience with machine learning and data science. When the researchers grouped responses by respondents’ experience levels, two patterns stood out:

  • Data availability, collection, cleaning, and management is ranked as the top challenge by many respondents, regardless of their experience level. End-to-end pipeline support is also often mentioned as a top challenge across the board.

  • Some challenges grow or shrink in importance as engineers gain more experience. For example, education and training is more important to novices, while tooling, evolution, and deployment are more likely to be major concerns for those with a lot of experience.


  1. Compared to software engineering, machine learning workflows are highly non-linear and include more feedback loops

  2. Data availability, collection, cleaning, and management are often seen as the hardest part of machine learning

  3. Best practices ensure that engineers with varying levels of experience can use machine learning reliably