A Blog by Jonathan Low

 

May 13, 2019

We have A Lot Of Data, We Just Don't Know Where

Sound familiar? JL

Favio Vasquez reports in Medium:

If you are reading this, it's probable that you are doing something with your data, or want to do something with it. But it's not that easy, right?
Most data-centered projects nowadays are, in reality, complex, expensive, require organizational and cultural change, have a long time-to-value period, and are not easy to scale. What's the problem?
In this article I'll discuss what, in my opinion, is the biggest problem facing companies that try to use their data, and it has to do with decades of doing things that used to work but don't anymore.

The Data Conundrum

Welcome to the data era, where everything produces data and everyone wants to use it. The big question is HOW?
In a recent article by Cambridge Semantics called "The Story of the Data Fabric", they point out three specific issues organizations have with their data:
  • There is more data than ever from a multitude of both structured and unstructured sources.
  • Data in its raw form is highly variable in quality. Sometimes it is well formed and clean; other times it is sparse and uneven.
  • Data comes in many different (and incompatible) formats.
[Image source: https://blog.adverity.com/data-warehouse-marketing-better-faster-insights]
All of this came from the "Data Lake Era". Some years ago, when data started to grow exponentially, we moved from Data Warehouses (in a few words, systems that pull together data from many different sources within an organization for reporting and analysis) to Data Lakes. Data warehouses were present in almost every organization, and they were built for "Business Intelligence (BI)", the predecessor of data science (together with data mining). In BI we did reports, analyses and studies on organized, structured data, mostly from relational databases; raw data, although it was there, was not used much.
[Image source: https://www.inovaprime.com/business-intelligence-transform-data-into-successful-decisions/]
When data grew and became more and more varied and unstructured, a new paradigm appeared: the mighty data lake.

In almost all companies doing big data and data science, it's the standard. The premise of the data lake is: store all of your structured and unstructured data at any scale. So we started doing that. Then a lot of new technologies were created to manage them, like Hadoop and Spark.
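For illustration, here is a minimal PySpark sketch of that premise, with invented bucket paths and column names: raw files of different shapes land in the same lake and get queried with one engine.

```python
from pyspark.sql import SparkSession

# Start a Spark session (in a real lake this would point at a cluster).
spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

# Structured data: CSV exports from an operational system (hypothetical path and schema).
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://my-lake/raw/orders/*.csv"))

# Semi-structured data: JSON event logs dumped by an application (hypothetical path).
events = spark.read.json("s3a://my-lake/raw/events/*.json")

# Both live in the same lake and can be queried with the same engine.
orders.createOrReplaceTempView("orders")
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""").show()
```

The catch, as the rest of this section argues, is that nothing in this setup tells you what the data means or whether it can be trusted.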
A while ago I created a brief history of Apache Spark that also shows how big data has transformed over the years (it's updated, by the way).

What really happened? We suddenly had the ability to analyze huge amounts of data, of many different types, and in real time, which was awesome. But even though we tried our best to govern our data and keep everything tight, it wasn't easy.
Most of our data lakes turned into data swamps.

This is not uncommon. Even though we have ways to improve the way we use our data lake and really govern it, it’s not easy to get the data we want, when we want it.
That’s why when I’m working with companies, the thing I hear the most is:
We have a lot of data, I just don’t know where. It should be here somewhere…
This is not what we want :(
Data it’s normally in silos, under the control of one department and is isolated from the rest of the organization, much like grain in a farm silo is closed off from outside elements. It’s time to stop that. Remember:
To extract value from data, it must be easy to explore, analyze and understand.

Towards the Data Fabric

If you’ve been following my research you may remember my definition of the data fabric:
The Data Fabric is the platform that supports all the data in the company: how it's managed, described, combined and universally accessed. This platform is built on an Enterprise Knowledge Graph to create a uniform and unified data environment.
There are two important points I want to make here: the data fabric is formed by the enterprise knowledge graph, and it should be as automated as possible.
To create a knowledge graph you need semantics and ontologies to find a useful way of linking your data, one that uniquely identifies it and connects it with common business terms.
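To make that concrete, here is a minimal sketch using rdflib, where the namespace, business terms and records are all invented for the example: each record becomes a node typed against a shared vocabulary and linked to other nodes, so data coming from different systems can be found through the same business terms.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace playing the role of the shared business vocabulary.
EX = Namespace("http://example.com/ontology/")

g = Graph()
g.bind("ex", EX)

# A tiny ontology: two common business terms.
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.Order, RDF.type, RDFS.Class))

# Records from two different source systems, linked to the same terms.
g.add((EX.customer_42, RDF.type, EX.Customer))
g.add((EX.customer_42, EX.companyName, Literal("Acme Corp")))   # from the CRM
g.add((EX.order_1001, RDF.type, EX.Order))
g.add((EX.order_1001, EX.placedBy, EX.customer_42))             # from billing

# Ask a question in business terms: which customers placed orders?
results = g.query("""
    PREFIX ex: <http://example.com/ontology/>
    SELECT ?name WHERE {
        ?order a ex:Order ;
               ex:placedBy ?customer .
        ?customer ex:companyName ?name .
    }
""")
for row in results:
    print(row[0])
```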
The key here is that instead of looking for possible answers, under this new model we’re seeking an answer. We want the facts — where those facts come from is less important.
The concept of the data lake is important too, because we need a place to store our data, govern it and run our jobs. But we need a smart data lake, a place that understands what we have and how to use it. We have to make an effort to organize all the data in the organization in one place and really manage and govern it.
To move toward the data fabric universe we need to start thinking about ontologies, semantics, graph databases, linked data and more to build a knowledge graph, and then find a way of automating the process of ingesting, preparing and analyzing data.
You can read more about how to start building a data fabric here:

Conclusion: Data science in the Data Fabric


The ultimate goal of using data is making decisions from it. Data science does that: we have data and after the data science workflow we should be able to make decisions from the analysis and models we created.
So far I've written two pieces on how to start doing machine learning (ML) and deep learning (DL) in the data fabric.
Before doing that, we need to break our "data silos"; harmonizing organizational data is necessary to find new insights and unlock our data's full potential.
What we actually need is a graph-based system that allows data analysis, usually called Graph Online Analytical Processing (Graph OLAP).
A Graph OLAP engine (like Anzo) can deliver the high level of performance enterprises need for big data analytics at scale, and in combination with a Graph Online Transaction Processing (OLTP) database (like Neo4j, Amazon Neptune, ArangoDB, etc.) you have a great way to start building your knowledge graph.
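As a sketch of the OLTP side, this is roughly how you might load a couple of linked records into Neo4j with its official Python driver; the connection details and the tiny data model are placeholders, not a prescription.

```python
from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_order(tx, customer_name, order_id):
    # MERGE keeps the load idempotent: nodes and edges are created only once.
    tx.run(
        """
        MERGE (c:Customer {name: $customer_name})
        MERGE (o:Order {id: $order_id})
        MERGE (c)-[:PLACED]->(o)
        """,
        customer_name=customer_name,
        order_id=order_id,
    )

with driver.session() as session:
    session.execute_write(load_order, "Acme Corp", 1001)  # driver 5.x API

    # Ask the graph who placed orders.
    for record in session.run(
        "MATCH (c:Customer)-[:PLACED]->(o:Order) RETURN c.name AS name, o.id AS id"
    ):
        print(record["name"], record["id"])

driver.close()
```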
After you have successfully created a data fabric, you will be able to do one of the most important parts of the data science workflow: machine learning, since ML in this context is:
The automatic process of discovering insights in the data fabric, using algorithms that are able to find those insights without being specifically programmed for that, from the data stored in it.
Remember also that insights generated with the fabric are themselves new data that becomes explicit/manifest as part of the fabric; i.e., insights can grow the graph, potentially yielding further insights.
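As a rough sketch of what that could look like in practice, with an invented feature table standing in for the result of a query over the fabric, you could pull a flat view of entities and their relationships out of the graph and feed it to an ordinary ML library:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical: a flat feature table exported from the knowledge graph,
# e.g. the result of a SPARQL or Cypher query over customers and their orders.
fabric_export = pd.DataFrame({
    "n_orders":      [12, 1, 7, 0, 25, 3],
    "avg_order_usd": [310.0, 45.0, 120.5, 0.0, 980.0, 60.0],
    "n_silos":       [3, 1, 2, 1, 5, 1],   # how many source systems mention this customer
    "churned":       [0, 1, 0, 1, 0, 1],   # toy label
})

X = fabric_export.drop(columns=["churned"])
y = fabric_export["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Any insight produced here (for example, a churn-risk score per customer) can be
# written back into the fabric as new nodes or edges, growing the graph as described above.
```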
So the process of doing data science inside the data fabric is much easier: we have a whole system that stores our data and automates its ingestion, processing and analysis, and that also enables us to find and explore all the data available in the organization in a faster and clearer way. No more messy data and huge queries just to get a simple value; that's one of the goals too.
There are examples of data fabrics all around us that we don't even notice. The most successful companies in the world are implementing and migrating their systems to build data fabrics, with everything that goes inside them. I think it's time for all of us to start building ours.
