A Blog by Jonathan Low

 

May 2, 2019

How Data Management Is Essential to AI Deployment

AI and machine learning depend on good data. There are no shortcuts. But 'cleaning' data is boring, time-intensive grunt work, which is not what high-priced, highly sought-after data scientists want to be doing.

Because the job is crucial, companies are increasingly turning to software to fix the problem. JL


John Murawski reports in the Wall Street Journal:

AI can automate processes, diagnose problems and enhance customer service, but it can’t perform those tasks without high-quality data. (But) data is often stored in a variety of formats, in multiple data centers and in duplicate copies. “Everybody’s data is messy. Each (company) curates their own data and they don’t have the same standards. There’s layers and layers of how detailed you can get.” When fed sloppy data that isn’t current, complete, consistent and accurate, AI is vulnerable to decisions that are erroneous. Data-management tools allow businesses to keep up with the escalating volume, variety and pace of data they are handling.
Managing data properly to avoid errors and bias is an increasingly critical concern for information-services companies looking to deploy artificial intelligence.
AI can automate processes, diagnose production problems and enhance customer service, but it can’t perform any of those tasks without high-quality data. Unfortunately, data is often stored in a variety of formats, in multiple data centers and in duplicate copies. When fed sloppy data that isn’t current, complete, consistent and accurate, AI is vulnerable to making decisions that are erroneous or biased.
Unilog Content Solutions LLC is an e-commerce business that operates an enormous digital library listing descriptions and specifications of 5.2 million industrial products. These entries are downloaded by 166 retail distributors that sell tools, parts and other equipment to contractors, electricians and others through the distributors’ websites.
Unilog’s ever-expanding library and search engine would be unusable without the use of data preparation software, machine learning and natural language processing that allow the contractors to quickly find the products they need.
To assemble its library, Unilog, based in Wayne, Pa., had been manually cleaning up and standardizing the data it receives from thousands of manufacturers, said Noah Kays, vice president of content delivery. But it was a heavy lift.
Mr. Kays said nothing is standardized in his world. The amount of horsepower for an electrical motor, for example, can be listed as ⅓, 0.3 or .33, and that’s not even getting into the variety of ways the word “horsepower” could appear. This sort of chaos reigns for all product attributes, such as weight, size, shape, color, material and much more, he said.
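The horsepower example can be made concrete with a short Python sketch. This is not Unilog's or Paxata's method, which the article does not describe. It is a minimal illustration that assumes a hypothetical list of unit spellings (`UNIT_SUFFIX`) and a hypothetical list of standard motor ratings (`STANDARD_RATINGS`), and it snaps each parsed value to the nearest rating.

```python
import re
import unicodedata
from decimal import Decimal
from fractions import Fraction

# Hypothetical unit spellings and standard ratings, chosen for illustration.
UNIT_SUFFIX = re.compile(r"\s*(?:horsepower|h\.?p\.?)\s*$", re.IGNORECASE)
STANDARD_RATINGS = [
    Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(3, 4), Fraction(1),
]


def parse_horsepower(raw: str) -> Fraction:
    """Turn a raw listing such as '⅓', '0.3 hp' or '.33 Horsepower' into a number."""
    # NFKC expands the single character '⅓' into '1', FRACTION SLASH, '3'.
    text = unicodedata.normalize("NFKC", raw).replace("\u2044", "/")
    # Drop a trailing unit word, if one is present.
    text = UNIT_SUFFIX.sub("", text.strip())
    if "/" in text:
        numerator, denominator = text.split("/", 1)
        return Fraction(int(numerator), int(denominator))
    return Fraction(Decimal(text))


def snap_to_rating(raw: str) -> Fraction:
    """Map a parsed value to the nearest rating in the assumed standard list."""
    value = parse_horsepower(raw)
    return min(STANDARD_RATINGS, key=lambda rating: abs(rating - value))


for listing in ["⅓", "0.3", ".33", "1/3 HP"]:
    print(repr(listing), "->", snap_to_rating(listing))
```

Under these assumptions, all four spellings resolve to the same one-third horsepower rating. A real catalog would need a rule like this for every attribute, which is why doing it by hand was a heavy lift.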
“Everybody’s data is messy,” Mr. Kays said. “Each manufacturer curates their own data and they don’t have the same standards. There’s layers and layers of how detailed you can get.”
When Unilog’s digital library hit 1 million items 2½ years ago, the company turned to Paxata Inc., a Redwood City, Calif.-based company that helps businesses gather, standardize and prepare data. Paxata’s tool helped Unilog streamline descriptors in its digital library to correspond to terms most often used in online searches.
The users of Unilog’s library are parts distributors that sell to contractors, electricians, plumbers and others. Distributors such as APR Supply Co., Cooney Brothers Inc., Turtle & Hughes Inc. and Independent Electric Supply Inc. download Unilog’s curated descriptions of tools and parts, displaying them on websites that their customers, the contractors, see. But the distributors have a hard time finding exactly what they need in Unilog’s library because they don’t always use precise search terms, such as UPCs or manufacturer model numbers. So Unilog deploys its own machine learning and natural language algorithms to match the vague search terms with the contents of its master library.
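The article does not say which algorithms Unilog uses to match vague search terms. As a rough stand-in, the sketch below uses plain character-level similarity from Python's standard `difflib` module rather than machine learning. The catalog entries, the `best_match` helper and the 0.6 cutoff are all hypothetical.

```python
import difflib
from typing import List, Optional

# Hypothetical catalog entries for illustration only.
CATALOG = [
    "Cordless Drill 18V",
    "Copper Pipe Fitting 1/2 in",
    "Circuit Breaker 20 Amp",
]


def best_match(query: str, catalog: List[str] = CATALOG,
               cutoff: float = 0.6) -> Optional[str]:
    """Return the catalog entry most similar to the query, or None."""
    # Compare in lower case, but return the entry as originally written.
    lowered = {entry.lower(): entry for entry in catalog}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


print(best_match("cordles dril 18v"))
print(best_match("zzzz"))
```

Here the misspelled query still finds the cordless drill, while a query with nothing in common returns no match. Character similarity alone would miss synonyms and abbreviations, which is presumably where the natural language processing described above comes in.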
The Paxata tool, called Self-Service Data Preparation, can even spot relationships between misspelled and abbreviated company names that might sound similar if pronounced by a human. And because the tool requires no coding, it can be run by staff who are not programmers.
The data-management industry is expanding as more businesses are adopting AI and there is more data to deal with, said Ted Friedman, distinguished vice president at research and advisory firm Gartner Inc. Global revenue from data quality and integration software broke the $5 billion barrier last year and is expected to crack the $6 billion mark in 2020, according to Gartner. The revenue comes from licensing, subscriptions, maintenance, updates and technical support.
Mr. Friedman said cleaning and prepping data is “grunt work,” and a Gartner report says those tasks can occupy up to 80% of a data scientist’s time. Data-management tools allow businesses to keep up with the escalating volume, variety and pace of data they are handling.
Anil Chakravarthy, CEO of enterprise cloud data-management firm Informatica LLC, said the technology solves several core problems for businesses. One is documenting the source of the data by comparing data sets, so businesses can verify accuracy and have confidence the data is reliable. Such a verification might be needed to justify AI decisions to regulators or to the public, he said. Data management also automates the integration of multiple data sets into a master file and the migration of data to the cloud.
PrecisionProfile Inc., a three-year-old Boulder, Colo.-based startup, is using Paxata technology to help oncologists match genetic mutations in their patients’ tumors to medical histories of patients with similar attributes from records in databases. The databases, downloaded from five disparate sources, contain details that doctors can use to find treatments that were successful with other patients whose genomic profile matches that of their patients.
Once the data is searchable, PrecisionProfile uses an AI algorithm to match patients by similar tumor characteristics, analyze treatments those patients received and recommend potential treatments. PrecisionProfile uses natural language processing to analyze the mutation characteristics of the patient’s tumor, and uses machine learning and predictive analytics to collect patient histories and infer treatment outcomes from insurance claims data.
Data prep that used to take up to two weeks to do manually now takes several hours when automated, said CEO David Parkhill. He said the medical files, measuring 18 terabytes for 400 bladder-cancer patients, are simply too enormous to be swallowed whole by spreadsheet programs.
“It would be very cumbersome,” Mr. Parkhill said. “It would take us weeks to actually deal with a new file structure and ingest the data.”