I’m (Berian, that is, rather than Luke) attending the SF Data Science Summit today and tomorrow. I’m taking some rough notes as I go and want to publish them in digestible bits. One of the speakers I most enjoyed today was Carlos Guestrin (@guestrin), who gave a keynote and then a little 25-minute appendix later in the day. Here’s what I wrote down.
In the last two years or so, technology companies and academic work are pivoting from a focus on optimizing model performance (‘my curve is better than your curve’) to a focus on the success factors for getting a model into production and keeping it successful once there. In short, the value of machine learning is only captured when a model goes into production.
The macro trend is one of technology companies building ‘intelligent applications powered by machine learning’. [Demo of similarity image search: a deep learning model extracts visual elements and retrieves images with similar features.] Carlos’ claim: ‘Within 5 years, every innovative application will be intelligent (i.e. driven by machine learning)’. Given these trends, this talk focuses on four key lessons for the success of modeling at scale.
Maximize the use of resources and reuse features
Even huge companies can’t afford to duplicate resources on shared tasks like data munging and feature engineering. While there are legitimately different modeling use cases (customer segmentation, sentiment analysis, ads, fraud, recommendation, churn) that may require separate teams, having each team duplicate the whole pipeline is unaffordable.
One clever way of working out what is shared is to segment the use cases by the kind of customer activity data (events, etc.) generated. Call these areas ‘data verticals’ and focus on building them once for re-use across all pipelines. The key point is that within a data vertical, feature engineering should be shared. Manual feature engineering is still common enough, but it becomes much more interesting if it can be done in an automated manner. Here’s a method for doing that: first, manually define simple counts of events as features; then, train a boosted tree model on this data; finally, ignore the model output but use the decision tree branches to define new non-linear features that optimally explain the data set.
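Here is a minimal sketch of that boosted-tree feature trick, assuming scikit-learn; the event-count data, toy target and parameters below are my own illustration, not anything from the talk.

```python
# Sketch: generate non-linear features from the leaves of a boosted tree model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Step 1: simple manual features -- e.g. counts of user events (synthetic here).
rng = np.random.default_rng(0)
X_counts = rng.poisson(lam=3.0, size=(1000, 5))
y = (X_counts[:, 0] * X_counts[:, 1] > 9).astype(int)  # toy non-linear target

# Step 2: train a boosted tree model on the count features.
gbt = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbt.fit(X_counts, y)

# Step 3: discard the predictions; instead, record which leaf each sample lands
# in (one index per tree) and one-hot encode those indices as new features.
leaf_indices = gbt.apply(X_counts)[:, :, 0]          # shape (n_samples, n_trees)
X_nonlinear = OneHotEncoder().fit_transform(leaf_indices)
print(X_nonlinear.shape)  # one binary feature per (tree, leaf) combination
```

The one-hot leaf features can then be fed to a simple downstream model (e.g. logistic regression), which is how this trick is typically used.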
Never stop learning
This is about deploying an ML model in an Agile way. The best approach seems to be to deploy models as microservices that can be queried for predictions by other services. This allows the prediction service to be monitored live for performance quality, which can be used to continually optimize the quality of the model.
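As a concrete illustration (my own, not from the talk), here is a minimal sketch of such a prediction microservice using Flask; the model.pkl path and the JSON schema are assumptions.

```python
# A minimal prediction microservice: other services POST features, we return
# a prediction and log the pair so live quality can be monitored.
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:      # assumed: a pickled scikit-learn model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. a list of numbers
    prediction = model.predict([features])[0]
    app.logger.info("features=%s prediction=%s", features, prediction)
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```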
The next question that arises is: how often should models be updated? Ideally, one would use real-time feedback to trigger re-training (Berian: this is an anomaly detection problem where the output is used to trigger a training and deployment event). In practice this sort of online learning turns out to be very difficult to debug. The best middle-ground approach found so far is ‘online re-ranking’: keep a set of models (or model outputs) available and re-rank them in real time based on some trigger.
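To make the online re-ranking idea concrete, here is a tiny sketch of my own: pre-computed candidate outputs are re-ordered in real time using a running feedback score, without re-training anything. The data structures and scoring rule are assumptions for illustration.

```python
# Online re-ranking sketch: the offline model's candidates stay fixed; only
# their ordering changes as live feedback arrives.
from collections import defaultdict

candidates = ["item_a", "item_b", "item_c"]     # outputs of the offline model
feedback_score = defaultdict(float)             # updated from live events

def record_click(item):
    """Real-time feedback: boost items users actually engage with."""
    feedback_score[item] += 1.0

def rerank(items):
    """Re-rank the offline model's outputs by live feedback, highest first."""
    return sorted(items, key=lambda item: feedback_score[item], reverse=True)

record_click("item_c")
print(rerank(candidates))  # item_c is promoted without re-training the model
```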
When does scaling really matter?
From a developer’s viewpoint, good scaling means minimizing the end-to-end time to create an intelligent application. For the end user, good scaling means good-quality output. Sometimes (fraud, ads) distributed training is a must because the $ value of high-quality prediction is so high; XGBoost (boosted trees) is presently the most effective method for this. Other times, deep learning is the better-scaling approach, where TensorFlow and MXNet are the frameworks currently under active development.
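For reference, here is a minimal single-machine XGBoost training sketch (the talk’s point was about distributed training, which needs extra cluster setup beyond this); the synthetic data and parameters are illustrative only.

```python
# Train a small boosted-tree classifier with XGBoost on synthetic data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)   # toy non-linear target

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100)
print(booster.predict(dtrain)[:5])   # predicted probabilities
```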
Explain yourself
It’s important for models to be transparent, so that the models can be used to gain deeper understanding of the processes at work. Indeed, we instinctively distrust blind predictions because it’s hard to understand why they’re correct. [An example of an untrustworthy model is the Wolf vs Husky demo: one can train an apparently accurate model to distinguish wolves from huskies, but it turns out to simply be detecting the presence of snow in the background and using that to designate which is the wolf.]
So: explaining models is important for successful production implementation too. Examples of this are Netflix (we recommend X because you like Y) and medical advice (we predict this outcome because of these studies, etc.). Within the context of deep learning on images, looking at which parts of the image are informative for a prediction can be a good gut-check on model trustworthiness. (Berian: This is the equivalent of looking at feature importances in traditional learning.)
What can we do to gain trust in an ML model? One option is to use interpretable models, e.g. decision trees; however, these are often not accurate enough, so interpretability and accuracy have to be balanced. Another problem is that many ‘highly accurate’ predictions are due to misleading features, i.e. overfitting. A third option, A/B testing, can be a gold standard but is slow, expensive and tricky to implement well. So, none of these are ideal!
The alternative proposed here: attach explanations of which input feature(s) are informative. (Berian: i.e. like weighting the nodes of a decision tree and taking the top N.) An explanation should be interpretable; it should also be faithful, i.e. it should accurately describe how the model behaves; and lastly, it should be model-agnostic.
Carlos calls this process LIME (Local Interpretable Model-agnostic Explanations). The process aims to fit a simple model (a straight line!) as a decision boundary in the region of the prediction that we are trying to explain. The important point here (the thing that makes it ‘local’) is that we are explaining the result of a single prediction, not the model as a whole. It’s an interesting idea!
How is the explanation generated? Sample points around the target point that is being predicted, weighting close-by points more highly; then fit a simple model to the locally-weighted data. This can reduce (for the target point in question) the explanation to a simple yes/no factor.
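A rough sketch of that procedure, in my own words and assuming scikit-learn: perturb the input around the target point, weight the perturbations by proximity, and fit a weighted linear model whose coefficients act as the local explanation. The black-box model, kernel width and sample count below are assumptions, not LIME’s actual defaults.

```python
# LIME-style local explanation sketch for a binary classifier with predict_proba.
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(black_box_model, x_target, n_samples=500, kernel_width=1.0):
    rng = np.random.default_rng(0)
    # Sample perturbed points around the target point.
    X_perturbed = x_target + rng.normal(scale=0.5, size=(n_samples, x_target.size))
    y_perturbed = black_box_model.predict_proba(X_perturbed)[:, 1]
    # Weight close-by points more highly (Gaussian kernel on distance).
    distances = np.linalg.norm(X_perturbed - x_target, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Fit a simple (linear) model to the locally weighted data.
    local_model = Ridge(alpha=1.0)
    local_model.fit(X_perturbed, y_perturbed, sample_weight=weights)
    return local_model.coef_   # per-feature contributions near x_target
```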
Random forests are a good example of models that can be susceptible to poor interpretability. If one sees nonsensical features turning up as ‘important’ in a random forest prediction, it is a sign that the model is a poor candidate for production.