A Blog by Jonathan Low


Mar 23, 2020

How To Spot A Fake Data Scientist

Definitely easier than actually becoming a data scientist. JL

Conor Lazarou reports in Towards Data Science:

These days it seems like everyone and their dog are marketing themselves as data scientists, and you can hardly blame them, with “data scientist” being declared the Sexiest Job of the 21st Century and carrying the salary to boot. Laypeople think machine learning is all about black boxes that magically churn out results from raw data; please don’t contribute to this misconception. I’ve compiled this list of tells so that if you’re a hiring manager and don’t know what you’re looking for, you can filter out the slag, and if you’re an aspiring data scientist, you can fix them before you turn into a poser yourself.

1. You Don’t Bother with Data Exploration

Data exploration is the first step in any machine learning project. If you don’t take the time to familiarize yourself with your data and get well-acquainted with its features and quirks, you’re going to waste a significant amount of time barking up the wrong decision tree before arriving at a usable product — if you even make it there.

A) You don’t visualize your data

Exploratory data visualization is the best way to start any data-related project. If you’re applying machine learning, it’s likely that you’re working with a high volume of high-dimensional data; perusing a .csv in Excel or running a df.describe() is not a suitable alternative to proper data visualization. Francis Anscombe illustrated the importance of data visualization with his famous quartet:
Anscombe’s Quartet (Source: Wikimedia Commons)
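
To see why summary statistics alone can mislead, here is a minimal sketch that computes the mean and Pearson correlation for each of Anscombe's four datasets using only the Python standard library. The helper name `pearson` is my own; the data values are the published quartet.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (mean of x, mean of y, correlation) for each dataset.
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four rows print nearly identical numbers, yet the scatter plots look nothing alike; that difference only shows up when you actually draw them.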

B) You don’t clean your data

Data is messy: values get entered wrong; conversions go awry; sensors go haywire. It’s important that you resolve these issues before you waste months and months on a dead-end project, and it’s mission critical that you resolve them before pushing your models to production. Remember: garbage in ⇒ garbage out.
A histogram of adult human heights
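
As a rough illustration of the kind of check this implies, the sketch below flags height values that fall outside a plausible adult range. Everything here is hypothetical: the sample values, the 120 to 230 cm window, and the function name `split_plausible` are assumptions for illustration, not a recommended threshold.

```python
# Hypothetical adult heights, nominally recorded in centimetres.
heights_cm = [172.0, 165.5, 180.2, 68.0, 1.75, 158.9, 1770.0, 175.3, -1.0]

# Assumed plausibility window for adult height in cm; tune it for your data.
LOW_CM, HIGH_CM = 120.0, 230.0


def split_plausible(values, low=LOW_CM, high=HIGH_CM):
    """Separate values inside [low, high] from values that need review."""
    clean = [v for v in values if low <= v <= high]
    suspect = [v for v in values if not (low <= v <= high)]
    return clean, suspect


clean, suspect = split_plausible(heights_cm)
print("keep:  ", clean)
print("review:", suspect)
```

The suspect values here look like unit mix-ups (inches, metres, millimetres) and a sentinel value, which is the kind of thing a histogram makes obvious at a glance. Flagging them for review is safer than silently dropping or "correcting" them.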

C) You don’t bother with feature selection and engineering

One of the cool things about neural networks is that you can often throw all your raw data at it and it will learn some approximation of your target function. Sorry, typo, I meant one of the worst things. It’s convenient, sure, but inefficient and brittle. Worst of all, it makes beginner data scientists reliant on deep learning when it’s often the case that a more traditional machine learning approach would be appropriate, sending them on a slow descent to poserdom. There’s no “right” way to do feature selection and engineering, but there are a few key outcomes to strive for:
  • Data Formatting: Computers are dumb. You need to convert your data into a format that your model will easily understand: neural networks like numbers between -1 and 1; categorical data should be one-hot encoded; ordinal data (probably) shouldn’t be represented as a single floating point field; it may be beneficial to log transform your exponentially-distributed data. Suffice it to say, there’s a lot of model-dependent nuance in data formatting.
  • Creating Domain-Specific Features: It’s often productive to create your own features from data. If you have count data, you may want to convert it into a relevant binary threshold field, such as “≥100” vs “<100”, or “is 0” vs “is not 0”. If you have continuous data x and z, you may want to include fields x², xz, and z² alongside x and z in your feature set. This is a highly problem-dependent practice, but if done right can drastically improve model performance for some types of models.
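
A minimal sketch of a few of these transformations on a single record, using only the standard library. The field names (`count`, `income`, `colour`), the category list, and the `engineer` helper are all hypothetical, and whether any of these features helps depends entirely on the model and the problem.

```python
from math import log


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1.0 if value == c else 0.0 for c in categories]


def engineer(row, colours):
    """Build a flat feature dict from one raw record (hypothetical schema)."""
    x, z = row["x"], row["z"]
    features = {
        # Binary threshold fields derived from count data.
        "count_ge_100": float(row["count"] >= 100),
        "count_is_zero": float(row["count"] == 0),
        # Log transform for a strictly positive, heavily skewed field.
        "log_income": log(row["income"]),
        # Second-order terms alongside the original continuous fields.
        "x": x,
        "z": z,
        "x_sq": x * x,
        "xz": x * z,
        "z_sq": z * z,
    }
    # One indicator column per category.
    for colour, flag in zip(colours, one_hot(row["colour"], colours)):
        features[f"colour_{colour}"] = flag
    return features


COLOURS = ["red", "green", "blue"]
row = {"count": 120, "income": 1000.0, "x": 2.0, "z": 3.0, "colour": "green"}
features = engineer(row, COLOURS)
print(features)
```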

2. You Fail to Choose an Appropriate Model Type

Machine learning is a broad field with a rich history, and for much of that history it went by the name “statistical learning”. With the advent of easy-to-use open source machine learning tools like Scikit-Learn and TensorFlow, combined with the deluge of data we now collect and a ubiquity of fast computers, it’s never been easier to experiment with different ML model types. However, it’s not a coincidence that removing the requirement that ML practitioners actually understand how different model types work has led to many ML practitioners not understanding how different model types work.

A) You just try everything

The GitHub repos of aspiring data scientists are littered with Kaggle projects and online course assignments-cum-portfolios that look like this:
from sklearn import *
for m in [SGDClassifier, LogisticRegression, KNeighborsClassifier,  
             KMeans, KNeighborsClassifier, RandomForestClassifier]:
    m.overfit(X_train, y_train)

B) You don’t actually understand how different model types work

Why might a KNN classifier not work so well if your inputs are “car age in years” and “kilometres traveled”? What’s the problem with applying linear regression to predict global population growth? Why isn’t my random forest classifier working on my dataset with a 1000-category one-hot-encoded variable? If you can’t answer those questions, that’s okay! There are lots of great resources to learn how each of these techniques works; just be sure to read and understand them before you apply for a job in the field.
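
For a hint at the first question, here is a small sketch of a nearest-neighbour lookup with and without feature scaling. The four cars, the query point, and the helper names are made up for illustration; the point is only that raw Euclidean distance is dominated by whichever feature has the largest numeric range.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical cars: (age in years, kilometres travelled).
cars = [(2, 30_000), (15, 31_000), (3, 60_000), (14, 150_000)]
query = (2, 31_500)


def nearest(points, q):
    """Index of the point closest to q by Euclidean distance."""
    return min(range(len(points)), key=lambda i: dist(points[i], q))


def standardise(points, extra):
    """Z-score each column using the mean and std of `points`."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]

    def scale(p):
        return tuple((v - m) / s for v, m, s in zip(p, mus, sigmas))

    return [scale(p) for p in points], scale(extra)


raw_idx = nearest(cars, query)
scaled_cars, scaled_query = standardise(cars, query)
scaled_idx = nearest(scaled_cars, scaled_query)

print("nearest on raw features:   ", cars[raw_idx])
print("nearest on scaled features:", cars[scaled_idx])
```

On the raw features the 2-year-old query car is matched to a 15-year-old car because 500 km of difference swamps 13 years of difference; after standardising, it is matched to the other 2-year-old car.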

C) You don’t know if you want accuracy or interpretability, or why you have to pick

All model types have their pros and cons. An important trade-off in machine learning is that between accuracy and interpretability. You can have a model that does a poor job of making predictions but is easy to understand and effectively explains the process, you can have a black box which is very accurate but whose inner workings are an enigma, or you can land somewhere in the middle.

3. You Don’t Use Effective Metrics and Controls

Despite making up 50% of the words and 64% of the letters, the “science” component of data science is often ignored. It’s not uncommon for poser data scientists to blindly apply a single metric in a vacuum as their model evaluation. Unwitting stakeholders are easily wowed by bold claims like “90% accuracy” which are technically correct but wildly inappropriate for the task at hand.

A) You don’t establish a baseline model

I have a test for pancreatic cancer which is over 99% accurate. Incredible, right? Well, it’s true, and you can try it for yourself by clicking this link.
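
The joke presumably rests on a "test" that reports negative for everyone. Assuming that reading, the sketch below shows how such a baseline scores on a rare condition. The population size and case count are invented for illustration and are not real prevalence figures.

```python
# Hypothetical screening population: 10,000 people, 5 of whom have the disease.
y_true = [1] * 5 + [0] * 9_995

# The "test": predict negative for everyone, regardless of input.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Sensitivity: share of actual positives that the test catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
sensitivity = true_positives / sum(y_true)

print(f"accuracy:    {accuracy:.2%}")
print(f"sensitivity: {sensitivity:.2%}")
```

Any real model has to beat this do-nothing baseline before its accuracy figure means anything.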

B) You use the wrong metric

Continuing the diagnostic example above, it’s important to make sure you’re using the right metric. For cancer diagnosis, accuracy is actually a bad metric; it’s often preferable to decrease your accuracy if it means an increase in sensitivity. What’s the cost associated with a false positive? Patient stress, as well as wasted time and resources. What’s the cost of a false negative? Death. An understanding of the real-world implications of your model and an appreciation of how those implications govern metric selection clearly delineate real data scientists from their script-kiddie lookalikes.
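
To make the trade-off concrete, here is a sketch comparing two hypothetical classifiers by their confusion-matrix counts. The counts are invented (1,000 patients, 50 with the disease), and the `metrics` helper is illustrative only.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of sick patients caught
        "specificity": tn / (tn + fp),  # share of healthy patients cleared
    }


# Hypothetical model A: rarely raises an alarm, misses most cases.
cautious = metrics(tp=20, fp=5, tn=945, fn=30)

# Hypothetical model B: raises many false alarms, misses very few cases.
sensitive = metrics(tp=48, fp=100, tn=850, fn=2)

for name, m in [("A", cautious), ("B", sensitive)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Model A wins on accuracy while missing 30 of 50 patients; model B loses on accuracy while missing 2. Which one is "better" depends on the relative cost of the two kinds of error, not on the accuracy column.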

C) You bungle the train/test split

This is a big one, and it’s far too common. Properly testing a model is absolutely essential to the data science process. There are many ways this can go awry: not understanding the difference between validation and test data, performing data augmentation before splitting, not plugging data leaks, ignoring data splitting altogether… There’s not much to say about this other than that if you don’t know or care how to create a proper holdout set, all your work has been a waste of time.
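
One common leak is computing preprocessing statistics on the full dataset before splitting. A minimal sketch of the safer order, under the assumption of a simple random holdout: split first, fit the scaling parameters on the training portion only, then apply them to both. The `train_test_split` function here is a hand-rolled stand-in written for this example, not a library call, and the data is synthetic.

```python
import random
from statistics import mean, pstdev


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic one-dimensional feature.
values = [float(v) for v in range(100)]

# 1. Split before touching any statistics.
train, test = train_test_split(values)

# 2. Fit the scaling parameters on the training data only.
mu, sigma = mean(train), pstdev(train)

# 3. Apply the same parameters to both sets.
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

print(len(train), len(test), round(mu, 2), round(sigma, 2))
```

The same ordering applies to data augmentation, imputation, and feature selection: anything learned from the data should be learned from the training split alone.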

…to import tensorflow as tf

These are only a handful of tells that give up the game. With enough experience, they’re easy to spot, but if you’re just starting out in the field it can be hard to separate the Siraj Ravals of the world from the Andrew Ngs. Now, I don’t mean to gatekeep the field from aspiring data scientists; if you feel attacked by any of the above examples, I’m glad to hear it because it means you care about getting things right. Keep studying, keep climbing so that you too can be endlessly irked by the sea of posers.

