After spending two days at the AI Summit fair in London and having several conversations with people from different business backgrounds, I wanted to clarify why machine learning infrastructure is one of the most important things to concentrate on when building production-level machine learning models.
If you have a business background and want to understand the requirements of machine learning development, continue reading. I bet this is something your colleagues in the machine learning/data science department have tried to tell you, but you haven’t quite grasped just yet.
Before I start explaining the unsexy topic – machine learning infrastructure – I want to share our CEO Eero Laaksonen’s words about why he wanted to start building Valohai.
“I started to think what would have the biggest impact to the future of humankind. If we would have started working on one algorithm to solve one specific problem, it would not have been enough. I wanted to empower developers all across industries to take us to the future faster. And the one thing that is slowing AI development in general is lack of proper processes and tools. That’s what Valohai is about.”
And if you are wondering why I’m writing about machine learning and not artificial intelligence, the reason is that machine learning is an application of AI, while AI is the broader umbrella term. You can read about the definitions of and differences between these two terms, for example, in this Forbes article.
Why should you understand the concept?
Currently, data scientists, who should concentrate on valuable AI development, need to do a lot of DevOps work before they are ready to do what they do best: playing with the data and the algorithms. Larger companies like Uber and Facebook have built their competitive advantage around machine learning and have proper tools and processes in place, but they are unlikely to open them to the public.
The majority of companies are left out, as they have neither the knowledge nor the resources to build an efficient and scalable machine learning workflow and pipeline. The gap between the big and small players keeps growing.
This is why companies (yes, yours too) need a shared understanding between the business development and data science teams of what it really takes to build production-level machine learning. Building and managing the machine learning infrastructure is a big part of the development work, and it will not directly bring in any revenue for the company. The good news is that it can be automated.
What is this machine learning infrastructure then?
The image above, from a Google research team’s study, illustrates the scale and the different parts that need to be taken into consideration in machine learning development. When I started working at Valohai, my understanding of the AI development pipeline was something like this: have the data, have an algorithm, feed the data to the algorithm and voilà.
In reality, the required infrastructure is vast and very complex. Valohai helps tackle all of this extraneous, but inevitable, infrastructure around the actual revenue-generating machine learning code.
Hopefully this image helps you understand the scale, and if you are interested in reading more about the research behind this image, you can find the research paper here.
When do we need infrastructure management?
Proper tools and processes to manage infrastructure aren’t only about saving data scientists’ time; they become particularly useful when finalized models don’t perform as planned. You have probably heard about the recent unfortunate incident in which an Uber self-driving car struck a pedestrian. Another example of an unwanted end result is a face recognition model that recognized only people with white skin tones as people.
In both of these examples, an explanation for the malfunctioning model is required in order to fix the flaws. There might be a problem with the data or with how it was preprocessed, or maybe some parameters worked better than others. How do you know which model is the exact one in production, and how can you spot the possible errors in it? It is not a foregone conclusion that machine learning teams actually have a proper history available. To drive the point home, here are a couple of examples of ineffective version control in data science teams that I have heard of.
One team of a hundred data scientists kept track of their models by posting each model’s binary file into a Slack channel. Compare this to a situation where salespeople had no CRM and simply jotted their client history down in a Slack channel, trying to find the notes there later on. Sounds insane, right?
One member of a 50-person machine learning team keeps track of his executions in an Excel sheet stored on his own computer. This is comparable to a salesman keeping a spreadsheet on his own laptop, so that no other member of the sales team knows which companies he has contacted or how the meetings turned out. An even more accurate comparison would be a salesman writing down every hypothesis about the outcome of every single customer in that spreadsheet: data scientists can have multiple scenarios regarding a single data set, and all of these should be tracked somehow.
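To make the contrast with the Excel sheet concrete, here is a minimal sketch of what a structured record of a single training run could look like. This is purely illustrative: the field names, model name, and values are hypothetical, not part of Valohai’s product or any specific team’s workflow.

```python
import json
from datetime import datetime, timezone

def make_experiment_record(model_name, data_version, parameters, metrics):
    """Build one structured, queryable record of a training run.

    All arguments are hypothetical examples of the kinds of things a
    team would want to track instead of a binary file in Slack.
    """
    return {
        "model_name": model_name,        # which model this run produced
        "data_version": data_version,    # which data set snapshot was used
        "parameters": parameters,        # hyperparameters tried in this run
        "metrics": metrics,              # measured results of this run
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Example: one tracked run (all values are made up for illustration).
record = make_experiment_record(
    model_name="image-classifier-v3",
    data_version="dataset-2018-04-01",
    parameters={"learning_rate": 0.001, "epochs": 20},
    metrics={"validation_accuracy": 0.87},
)
print(json.dumps(record, indent=2))
```

With records like this stored in a shared system rather than on one person’s laptop, anyone on the team can answer “which exact model is in production, trained on which data, with which parameters?”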
Valohai keeps records of trained models
Keeping records of customers in an Excel sheet would not work for a sales team, and similarly, keeping records of machine learning experiments in an Excel sheet does not work for a machine learning team either.
Ask your machine learning team how they manage their version history and if their answer is something similar to the examples above, help them out!
The original text can be found on the Valohai blog.