By Joel Prince Varghese, MEng ’16 (IEOR)
TL;DR
Tip 1: Build random forest models using h2o.ai instead of scikit-learn. Tip 2: Provide local model explanations using LIME. I’ll tell you how, why, and an easy way to do this below.
Who should be reading this?
This article is intended for two audiences:
- You’re looking for your first/next data science job and are interested in going one step ahead in your take-home challenges to stand out from the crowd.
- You’re working as a data scientist and are looking to improve the performance of your random forest models, or are tired of explaining the accuracy vs. interpretability trade-off of machine learning models to your business counterparts.
Tip 1: Build random forest models using h2o.ai instead of scikit-learn
You’ve spent hours polishing that messy data, carrying out feature engineering, and are finally ready to attack your training dataset with your arsenal of machine learning models. Random forests are usually my first weapon of choice since they are easy to implement and tend to give good out-of-the-box performance without requiring intensive hyperparameter tuning. If you have categorical features in your dataset, you need to one-hot encode them to make them scikit-learn friendly. Did you know that this step, in most cases, actually prevents your model from reaching its full performance potential?
Say you have two features — Age (continuous) and Education (categorical with four levels — High School, Bachelors, Masters, and Ph.D). One-hot encoding transforms these two features into five features. When a decision tree is choosing the best feature for a split at a node, unless one of the categorical levels individually has more predictive power than Age, the continuous feature tends to get picked first. Hence, if you’ve ever looked at scikit-learn’s feature importance plots, you’ll notice that continuous variables are usually on top.
However, h2o.ai’s (or R’s) implementation of random forest circumvents this problem by considering categorical features as a whole. With one-hot encoding, only four conditions for Education are checked at a node for a split: if a training data point has the one-hot encoded value of High School equal to one, it moves to the left of the node, otherwise to the right, and the same check is made for Bachelors, Masters, and Ph.D. On the other hand, h2o.ai checks the above conditions but also checks compound conditions, where different categorical levels are grouped into one. For example: if a data point has the value High School or Bachelors, move it to the left of the node, otherwise to the right. All other combinations of levels are checked for the split as well — for example, High School or Masters, Bachelors or Ph.D, or High School or Bachelors or Masters (which is equivalent to splitting on the one-hot encoded value of Ph.D). These extra compound conditions result in a greater probability of a categorical variable being chosen for a split. Keep in mind, though, that since these implementations check additional compound conditions, their training times are usually longer than scikit-learn’s.
Scratching your head after the above paragraph? This article provides a detailed explanation of this phenomenon with examples. If you’ve never built a model using h2o.ai, worry not, I’ve got your back. There’s a link at the end of this article to a Jupyter Notebook with a simple example, and a quick sketch of the workflow appears at the end of this post. It’s pretty straightforward!
Read the full story on Joel’s Medium account: Two Tips to Stand Out in your Data Science Interview Take-Home Challenge or Job
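To make the contrast concrete, here is a minimal sketch of the two workflows side by side — it is not the linked notebook, and the toy data and column names ("age", "education", "label") are illustrative assumptions. In scikit-learn, Education has to be one-hot encoded before training; in h2o.ai it stays a single factor column, so splits can group levels together as described above.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data (illustrative only)
df = pd.DataFrame({
    "age":       [25, 32, 47, 51, 29, 38, 44, 23],
    "education": ["High School", "Bachelors", "Masters", "Ph.D",
                  "Bachelors", "Masters", "Ph.D", "High School"],
    "label":     [0, 1, 1, 1, 0, 1, 1, 0],
})

# scikit-learn: Education must first be one-hot encoded into four 0/1 columns.
X = pd.get_dummies(df[["age", "education"]])
rf_sklearn = RandomForestClassifier(n_estimators=100, random_state=42)
rf_sklearn.fit(X, df["label"])

# h2o.ai: Education stays a single categorical (factor) column, so node splits
# can evaluate the grouped-level "compound conditions" described above.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()
hf = h2o.H2OFrame(df)
hf["education"] = hf["education"].asfactor()  # keep as one categorical feature
hf["label"] = hf["label"].asfactor()          # factor target -> classification

rf_h2o = H2ORandomForestEstimator(ntrees=100, seed=42)
rf_h2o.train(x=["age", "education"], y="label", training_frame=hf)

# Education is scored as a whole in the variable importances, not level by level.
print(rf_h2o.varimp(use_pandas=True))
```
Note that `varimp` reports Education as one feature, whereas the scikit-learn model spreads its importance across the four one-hot columns — which is exactly why continuous variables tend to float to the top of scikit-learn’s importance plots.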