Data Lake Essentials



A Data Lake is a central data repository that can hold vast amounts of structured and unstructured data in its native format until value is discovered through experimentation, exploration, and combination with other data sources, using a collaborative approach across the organization. It complements and matures existing Business Intelligence implementations, providing a way to prove the value of our data. In our design efforts, the Data Warehouse should almost always receive architectural priority over the Data Lake.



The Three Data Lake Pillars

Organization

Operations

Discovery


Organization

A common organization pattern in the Data Lake is one that enables data exploration and proving the value of the data itself. Thorough planning is crucial when implementing a Data Lake in your organization so that it does not turn into a Data Swamp.

At the end of the day, the Data Lake should enable power users, analysts, and Data Scientists to quickly discover value and promote that value to a production system. Promotion of Exploratory pipelines to Operational pipelines should be seamless and easily deployable.

[Image: Data Lake zone structure, with data landed in the RAW area]

You can begin to understand how a Data Lake can be structured from the above image. Data is landed in the RAW area, where it is organized by five sub-categories. An example follows:

Finance (Subject)
  CRM (Data Source)
    Invoices (Data Set)
      2017 (Date: Year)
        05 (Date: Month)
          27 (Date: Day)
            Invoices_20170527.txt (File)
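
As a rough sketch of this naming convention (not tied to any particular storage service), the helper below builds a RAW-zone path from a subject, data source, data set, and ingestion date. The `raw` prefix and the `build_raw_path` name are illustrative assumptions, not part of the original design.

```python
from datetime import date
from pathlib import PurePosixPath

def build_raw_path(subject: str, source: str, dataset: str,
                   ingest_date: date, file_name: str) -> PurePosixPath:
    """Build a RAW-zone path following Subject / Data Source / Data Set / YYYY / MM / DD / File."""
    return PurePosixPath(
        "raw",
        subject,
        source,
        dataset,
        f"{ingest_date.year:04d}",
        f"{ingest_date.month:02d}",
        f"{ingest_date.day:02d}",
        file_name,
    )

# Example: matches the Finance / CRM / Invoices layout shown above.
print(build_raw_path("Finance", "CRM", "Invoices", date(2017, 5, 27), "Invoices_20170527.txt"))
# raw/Finance/CRM/Invoices/2017/05/27/Invoices_20170527.txt
```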



Lake Area Attributes

As the data moves from ingestion to curation, it becomes more structured, more governed, and more available to users.

Phase A: Exploratory Pipeline - RAW > EXPLORATION
Phase B: Operational Pipeline - RAW > STAGE > CURATED
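
These two phases can be expressed as ordered zone hops. The sketch below is a minimal illustration assuming hypothetical zone names that mirror the areas above; it is not the API of any specific orchestration product.

```python
from enum import Enum
from typing import List, Optional

class Zone(str, Enum):
    RAW = "raw"
    STAGE = "stage"
    CURATED = "curated"
    EXPLORATION = "exploration"

# The two pipeline shapes described above, expressed as ordered zone hops.
EXPLORATORY_PIPELINE = [Zone.RAW, Zone.EXPLORATION]           # Phase A
OPERATIONAL_PIPELINE = [Zone.RAW, Zone.STAGE, Zone.CURATED]   # Phase B

def next_zone(pipeline: List[Zone], current: Zone) -> Optional[Zone]:
    """Return the zone a data set moves to next, or None if it is already at the end."""
    idx = pipeline.index(current)
    return pipeline[idx + 1] if idx + 1 < len(pipeline) else None

print(next_zone(OPERATIONAL_PIPELINE, Zone.RAW))  # Zone.STAGE
```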



[Image: Lake areas and pipeline phases]

Organizing by Date

It has become a best practice to partition and organize Data Lake data by date. This is particularly useful when data is collected from aggregated streams or constant large batch ingestions, and it also gives the organization a way to rebuild a destination database or system from a certain point in time.
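
A point-in-time rebuild then reduces to selecting every date partition at or before a cutoff. The snippet below is a simple sketch assuming the YYYY/MM/DD layout shown earlier; the function name and file paths are made up for illustration.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import List

def partitions_up_to(paths: List[str], cutoff: date) -> List[str]:
    """Select date-partitioned files (.../YYYY/MM/DD/file) at or before a point in time,
    e.g. to rebuild a destination system as of that date."""
    selected = []
    for p in paths:
        parts = PurePosixPath(p).parts
        year, month, day = int(parts[-4]), int(parts[-3]), int(parts[-2])
        if date(year, month, day) <= cutoff:
            selected.append(p)
    return sorted(selected)

files = [
    "raw/Finance/CRM/Invoices/2017/05/26/Invoices_20170526.txt",
    "raw/Finance/CRM/Invoices/2017/05/27/Invoices_20170527.txt",
    "raw/Finance/CRM/Invoices/2017/05/28/Invoices_20170528.txt",
]
print(partitions_up_to(files, date(2017, 5, 27)))  # excludes the 05/28 file
```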




Operations

A common operations pattern involving Exploratory Analytics
[Image: exploratory analytics operations pattern]

As companies embark on Data Science initiatives and more and more value becomes available from our Exploration, we must take AnalyticOps into consideration: the promotion of valuable data sets to Production.

Example: A Data Scientist retrieves new data that has just landed in the RAW area. This data has not yet been moved to the Operational pipeline because no accessible value has been found in it. The Scientist then runs some experiments through Azure Machine Learning, perhaps including some R and Python scripting. Highly valuable predictions are discovered, and the company wants to surface them immediately in one of its Customer Service applications. We don't want to have to rebuild this entire model to fit IT's standards. Instead, we take an approach that enables multiple tools (ML, R, Python, C#, etc.) to be used without having to change the pipeline that orchestrates this data movement.
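
One way to keep orchestration independent of the tooling is a tool-agnostic step interface: the orchestrator only knows about named steps that read from a source and write to a target. The sketch below is a hypothetical illustration of that idea, not any specific product's API; the step bodies and paths are placeholders.

```python
from typing import Callable, Dict

# Each step is a named callable that reads from a source path and writes to a target path;
# the orchestrator never cares whether the step runs R, Python, C#, or a managed ML service.
Step = Callable[[str, str], None]

def run_pipeline(steps: Dict[str, Step], source: str, target: str) -> None:
    for name, step in steps.items():
        print(f"running step: {name}")
        step(source, target)

def score_with_python_model(source: str, target: str) -> None:
    # Placeholder for, e.g., a scikit-learn or Azure Machine Learning scoring call.
    pass

def score_with_r_script(source: str, target: str) -> None:
    # Placeholder for invoking an R script (e.g., via subprocess).
    pass

# Swapping the Python model for the R script changes one entry, not the orchestration.
run_pipeline({"score": score_with_python_model},
             "raw/Finance/CRM/Invoices", "curated/Finance/Invoices")
```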

In a growing world of new open-source tools, it is imperative to have a solution in place that can handle data movement across the organization without limiting the adoption of new technology.





Discovery

A key attribute of a Data Lake implementation is enabling the capture of metadata. This includes not only capturing metadata to automate ETL and SQL development, but also examining what type of data we are bringing in and how it relates to other data.

Tagging

Phase 1 tagging usually occurs in a metadata framework used for development, while Phases 2, 3, and 4 could be stored in a different type of system such as Azure Data Catalog or an open-source equivalent. There, users tag data based on internal acronyms and understanding. When someone later wants to discover new value from existing data, they can search the catalog for tags such as Customer A or Product B; this occurs specifically in Tag 4 (see the reference table below). Other tagging can be automated, such as pre-aggregating certain data field types or counting missing values in a data set (to help determine its value).
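
As a minimal illustration of tag-based discovery (assuming a simple in-memory catalog rather than Azure Data Catalog's actual API), the sketch below stores user-supplied tags alongside each data set and searches on them. The class and data set names are made up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """A minimal data-catalog record: where the data set lives plus user-supplied tags."""
    name: str
    path: str
    tags: List[str] = field(default_factory=list)

def search_by_tag(catalog: List[CatalogEntry], tag: str) -> List[CatalogEntry]:
    """Return every data set carrying the given tag (case-insensitive)."""
    return [e for e in catalog if tag.lower() in (t.lower() for t in e.tags)]

catalog = [
    CatalogEntry("Invoices", "raw/Finance/CRM/Invoices", tags=["Finance", "Customer A"]),
    CatalogEntry("Orders", "raw/Sales/ERP/Orders", tags=["Customer A", "Product B"]),
]
for entry in search_by_tag(catalog, "Customer A"):
    print(entry.name, entry.path)
```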

There are many options when it comes to designing your Discovery methods. A more mature approach is to use graph databases and create visual maps of how data sets relate to each other.
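
A lightweight way to prototype such a map, before committing to a graph database, is a plain adjacency list of data-set relationships. The sketch below is an illustrative assumption, with made-up data set names.

```python
from collections import defaultdict

# Adjacency list: which data sets feed into, or are joined with, which others.
relationships = defaultdict(set)

def relate(source: str, target: str) -> None:
    relationships[source].add(target)
    relationships[target].add(source)

relate("CRM.Invoices", "ERP.Orders")      # joined on customer key
relate("ERP.Orders", "Web.Clickstream")   # joined on order id

# Starting point for a "visual map": everything one hop away from a data set.
print(sorted(relationships["ERP.Orders"]))  # ['CRM.Invoices', 'Web.Clickstream']
```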

[Image: tag reference table]