How to get going with Modern Analytics Sandpit Environment using Azure in no time…(Part 2)

The previous post in this series of three covered the background and purpose of “how to get started with Modern Analytics”; in this post we will cover the actual approach adopted, and why.

As the name “Modern Analytics” suggests, the approach to Information Management has to be unconventional if it is to support the idea of bringing all the different types of analytics together (Descriptive, Diagnostic, Predictive & Prescriptive), especially given the shift that has drastically challenged traditional analytics in the form of “Big Data”. To frame that shift, we have to draw some parallels between the old and new worlds (ETL vs ELT), as follows:

Information Management Approach

There are two approaches to doing information management for analytics and deriving actionable insights:

  1. Top-down (deductive approach)

This is where analytics starts with a clear understanding of corporate strategy, and theories and hypotheses are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics: what happened in the past, and why did it happen?

  2. Bottom-up (inductive approach)

This is the approach where data is collected up front, before any theories and hypotheses are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics, such as predictive or prescriptive analytics: what will happen, and/or how can we make it happen?

The following image very nicely visualises the differences between the two and the types of analytics they cover.

[Image: top-down vs bottom-up analytics approaches]

In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach, and exactly this forms the basis of my approach to bringing together and simplifying the different types of analytics within “Modern Analytics”.

This leads on to the basic question of how to start with the bottom-up approach, given that we are already familiar with top-down in the form of the traditional BI / relational paradigm. The answer: use a Data Lake, as it helps land all the different data sources in one single place to derive actionable insights, irrespective of data volume, velocity, variety or veracity challenges.
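As a minimal sketch of what “landing everything in one single place” can look like, the snippet below uploads a local extract as-is into Azure Data Lake Store using the azure-datalake-store Python package; the tenant, application, store name and paths are placeholders and will vary by environment:

```python
from azure.datalake.store import core, lib, multithread

# Authenticate with a service principal (placeholder values).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<application-id>",
                 client_secret="<application-secret>")

# Connect to the Azure Data Lake Store account.
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

# Land the raw extract untouched: no transformation on the way in (ELT rather than ETL).
multithread.ADLUploader(adls,
                        lpath="sales_extract.csv",
                        rpath="/Raw/SourceSystemA/sales_extract.csv",
                        overwrite=True)

# Quick check that the file has landed.
print(adls.ls("/Raw/SourceSystemA"))
```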

Data Lake Framework

However, before jumping straight into the Data Acquisition and Ingestion steps, some careful thinking is needed to stop the Data Lake from turning into a Data Swamp, especially around questions such as: “What Data Management structure suits data democratisation within the organisation?”, “How do we apply Data Governance?”, “How do we cater for Data Quality?”, “How do we apply Data Security based on certain roles within the organisation?” and “How do we store data for real-time and batch processing?”. Without making this too onerous, having a basic framework around these fundamental questions does assist in delivering profound, relevant and quality analytics even in a sandpit environment and, above all, helps in formulating certain aspects of an Enterprise Data Strategy.

Additionally, considering these basic questions would form a solid basis of best practices and patterns for delivering Data Science & Advanced Analytics projects in an effective manner.

We could go into a lot of detail, but some quality blogs have already been written by the likes of my colleague Tony Smith and especially Adatis; these are referenced below, and they form the adopted approach for answering the questions regarding Data Governance & Data Management mentioned above.

One core area is worth highlighting though, as it is significant and fundamental to all of the above questions: the Big Data architecture used for managing real-time and batch processing within the Data Lake.

Lambda Architecture

Briefly, this architectural approach divides data processing into “speed” (near real-time) and “batch” (Raw, Base, Curated) layers. This design is well established and is a relatively common implementation pattern on Azure.
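To make that split concrete, here is a small, self-contained Python sketch (no Azure services involved, purely illustrative) of the Lambda idea: each event is appended untouched to the batch store while a speed view keeps a running aggregate, and a periodic batch job can recompute the same view from the full history:

```python
from collections import defaultdict

raw_batch_store = []             # batch layer: immutable, append-only history of raw events
speed_view = defaultdict(float)  # speed layer: incrementally maintained near-real-time view

def ingest(event):
    """Route one incoming event to both layers, as Lambda prescribes."""
    raw_batch_store.append(event)                  # keep the raw record for later reprocessing
    speed_view[event["device"]] += event["value"]  # update the near-real-time aggregate

def batch_recompute():
    """Periodic batch job: rebuild the same view from the full raw history."""
    view = defaultdict(float)
    for event in raw_batch_store:
        view[event["device"]] += event["value"]
    return view

for e in [{"device": "device-1", "value": 3.0},
          {"device": "device-2", "value": 1.5},
          {"device": "device-1", "value": 0.5}]:
    ingest(e)

print(dict(speed_view))          # served from the speed layer
print(dict(batch_recompute()))   # served from the batch layer
```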

[Image: Lambda Architecture with speed and batch layers]

Framework

The rest of the details of the Data Lake Framework are articulated very nicely in the blogs below, especially around carving up the Data Lake for better Data Governance and Management based on the above Lambda Architecture.

Here is a screenshot of the carved Azure Data Lake discussed in detail within the following blogs:

[Image: carved Azure Data Lake folder structure]
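A hedged sketch of carving out those zones up front in code, again using the azure-datalake-store Python package; the zone names mirror the batch layers mentioned above (Raw, Base, Curated), with “Laboratory” added purely as an illustrative sandpit area:

```python
from azure.datalake.store import core, lib

token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<application-id>",
                 client_secret="<application-secret>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

# Zone names mirror the batch layers above; "Laboratory" is an illustrative
# experimentation area, not something prescribed by this post.
for zone in ["Raw", "Base", "Curated", "Laboratory"]:
    if not adls.exists(zone):
        adls.mkdir(zone)

print(adls.ls("/"))
```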

Must reads…

Shaping The Lake: Data Lake Framework

Azure Data Lake Store–Storage and Best Practices

Architectural Components or Azure Building blocks for the Environment

The main architectural components or building blocks referenced for this environment are as follows (a small streaming ingestion sketch follows the list):

  1. Azure Data Lake Store
  2. Data Science Virtual Machines
  3. App Service
  4. Event Hubs
  5. Stream Analytics
  6. Azure Blob Storage
  7. Power BI
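For the streaming side of this environment, here is a minimal sketch of pushing a test event into Event Hubs (the usual entry point for Stream Analytics), assuming the azure-eventhub Python package (v5 API); the connection string, hub name and payload are placeholders:

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders: take these from the Event Hubs namespace's shared access policy.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    eventhub_name="<event-hub-name>")

reading = {"deviceId": "device-01", "temperature": 21.5}   # dummy telemetry payload

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(reading)))  # serialise the event as JSON
    producer.send_batch(batch)                 # Stream Analytics can consume this downstream
```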

Modern Analytics Project Delivery Process (mainly Data Science and Advanced Analytics) & Key Personas

The project delivery process has been narrowed down specifically for Data Science and Advanced Analytics, in the form of the Team Data Science Process (TDSP).

The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the steps, from start to finish, that projects usually follow when they are executed.

The TDSP lifecycle is composed of five major stages that are executed iteratively. These stages include:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

[Image: TDSP lifecycle]

The key personas involved in delivering a Data Science & Advanced Analytics project can differ slightly depending on availability and organisational structure (size / maturity), but in the majority of cases the team structure would look like:

  1. Solution Architect
  2. Data Scientist
  3. Data Engineer
  4. Project Lead / Manager
  5. Developer (Integrating the Insights into downstream Apps)

For further details, click here regarding the standardised project structure, infrastructure resources, tools & utilities for delivering Data Science / Advanced Analytics projects in an effective manner.
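Purely as an illustrative sketch (the folder names below are assumptions rather than the official TDSP template), a sandpit project could be scaffolded along these lines:

```python
from pathlib import Path

# Illustrative TDSP-style layout: code, data and docs kept apart,
# with data split by how far along the processing it is.
folders = [
    "Code/DataPrep",
    "Code/Modeling",
    "Code/Operationalization",
    "Data/Raw",
    "Data/Processed",
    "Data/Modeling",
    "Docs",
]

project_root = Path("modern-analytics-sandpit")  # hypothetical project name
for folder in folders:
    (project_root / folder).mkdir(parents=True, exist_ok=True)

print(sorted(p.as_posix() for p in project_root.rglob("*") if p.is_dir()))
```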

This is by no means a definitive or exhaustive description of the DS & AA project delivery process, which, as already mentioned, can vary or be tailored depending on numerous factors; but, as the topic of this blog series suggests, it should be quite helpful for getting started.

The final post in this series of three will demonstrate the entire approach above in the form of a practical example project (Cisco Meraki).

 

Some relevant blogs

Making Sense of the Swamp – Azure Data Catalog for your Data Lake

Granting Permissions In Azure Data Lake
Assigning Resource Management Permissions For Azure Data Lake Store (Part 2)
Assigning Data Permissions For Azure Data Lake Store (Part 3)

Buck Woody’s DevOps for Data Science Series
