A Blog by Jonathan Low

 

Jan 4, 2019

What Amount of Time Do Data Scientists Spend Cleaning vs Analyzing Data?

The 80-20 rule appears to apply, assuming you trust the data... JL

Kirk Borne comments on Quora:

I would say that data wrangling is a subset of analyzing data (EDA: Exploratory Data Analysis), since all of that requires getting up close and personal with your data — learning about and exploring its multiple facets through data profiling (data types, formats, ranges, missing values, characteristic values, outliers, trends, clusters, etc.), plus data transformations, data normalization, feature pruning, feature selection, and feature engineering.
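As a rough illustration of the data profiling mentioned above, here is a minimal Python sketch that reports data types, missing values, range, and outliers for a single column. The function name `profile_column`, the sample `ages` list, and the 1.5 × IQR outlier rule are assumptions made for this example; none of them come from the original answer.

```python
from collections import Counter
from statistics import quantiles


def profile_column(values):
    """Summarize one column: counts, missing values, types, range, outliers."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        # Tally the Python type of every non-missing value.
        "types": dict(Counter(type(v).__name__ for v in present)),
    }
    # Keep only real numbers (bool is a subclass of int, so exclude it).
    numeric = sorted(
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    )
    if len(numeric) >= 4:
        q1, _, q3 = quantiles(numeric, n=4)
        spread = q3 - q1
        # Assumed rule of thumb: flag values beyond 1.5 * IQR from the quartiles.
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        summary["range"] = (numeric[0], numeric[-1])
        summary["outliers"] = [v for v in numeric if v < low or v > high]
    return summary


# Hypothetical sample column: one missing value and one implausible age.
ages = [34, 29, None, 41, 38, 36, 240, 33]
report = profile_column(ages)
print(report)
```

On a real dataset you would likely run something like this over every column before deciding on transformations or feature engineering.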
The real distinction that you appear to be asking about is the distinction between (a) all of those things (data prep, data wrangling, and analyzing data) and (b) data science, i.e., inferring hypotheses from the data, designing an experiment and/or model to test a hypothesis, building the model, collecting output data, evaluating and/or validating the model, and (if necessary) refining the model and starting again. In other words, data science should probably not be called “science” unless it follows the scientific method, since only then are you doing data science, not data analysis.
If your question is really about the ratio of time spent on data wrangling vs data analysis (not data science), then that’s a question about scientists in general, not specific to data scientists. In that case, the answer would depend heavily on the discipline, the scientist, the problem, the data, etc.
The standard quote about data cleaning, data prep, and data wrangling among data scientists is this:
"Data Scientists typically spend roughly 80% of their time preparing and cleaning their data. They spend the other 20% of their time complaining about preparing and cleaning their data."
