This environment is referred to as a Modern Analytics Environment because it covers all types of analytical projects, whether Big Data, Modern BI/DW, Data Science or Advanced Analytics.
This will be a three-part blog post series, starting with the purpose, then the details of the approach, and lastly a working example/solution (using ARM templates) covering all the aspects discussed in the series, to get you going on the Modern Analytics journey.
By no means is this approach recommended for production environments; it is merely intended to get organisations started with a sandpit environment, mainly for experimentation around Data Science and Advanced Analytics use cases.
I have been thinking of writing this blog for a while. In all my recent customer interactions, irrespective of organisation size or sector, there has been a single biggest challenge stopping them from getting going, mainly on their Data Science and Advanced Analytics journey: "How do we get started?"
The "How do we get started" question can very quickly unfold into an extensive debate seeking to cover all eventualities, and hence in many cases leads to inconclusive outcomes. To give a flavour, the discussion could take the form of queries like "How do we run different analytical projects within the same environment?", "How do we ingest data?", "How do we store data keeping our organisation and the sources of data in mind?", "How do we catalogue data?", "How can Data Scientists access the relevant data and perform exploratory analysis with ease to find key insights?", "How do we ensure secure access to the data and manage Azure data services spend?", "How can the environment follow a DevOps process?", and the list goes on.

Precisely for this reason, to just "get going", I have adopted a rather simplistic yet solid approach to building an effective Modern Analytics sandpit environment, one which allows organisations to start transforming raw data into intelligent action and reinventing their business processes quickly and efficiently. Once the organisation's maturity level starts improving, the environment and processes can be transformed towards more coherent Modern Analytics project delivery and management practices, because the foundations of the environment will still be intact.
The key and sole purpose of this blog is to provide a Modern Analytics sandpit environment boilerplate, keeping industry best practices and basic enterprise readiness requirements in perspective: Data Governance, Security, Scalability, High Availability, Monitoring, lower TCO and, most importantly, Agility.
Here is the background to this rather simplistic yet solid approach for the proposed Modern Analytics sandpit environment:
- Information Management Approach
- Data Lake Framework
- Architectural Components or Azure Building blocks for the Environment
- Example Project (Cisco Meraki)
- Key Personas (mainly Data Science and Advanced Analytics)
- Modern Analytics Project Delivery Process (mainly Data Science and Advanced Analytics)
The environment code will be available via a GitHub repository later, so watch this space.
Now, before we go into the details of each of the core areas of the approach mentioned above, here is a high-level view of the proposed sandpit environment using Azure Data & Analytics offerings, to keep things in perspective for the next blog post:
Also, just to highlight that the above high-level architecture diagram represents a logical grouping of resources using Azure Resource Groups as an example implementation; the actual grouping may differ depending on organisation/team structure, security/compliance requirements, etc.
Azure Resource Groups are logical containers that allow you to group individual resources, such as virtual machines, storage accounts, websites and databases, so they can be managed together.
The key benefits of Azure Resource Groups come in the form of cost management, security, agility, repeatability, etc.
For complete benefit details please click on the following link:
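To make the grouping concrete, here is a minimal sketch of how the three resource groups used in this environment could be provisioned from a subscription-level ARM template. The `location` default is an assumption; adjust it (and the naming) to your organisation's conventions.

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2018-05-01/subscriptionDeploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "location": { "type": "string", "defaultValue": "westeurope" }
  },
  "resources": [
    {
      "type": "Microsoft.Resources/resourceGroups",
      "apiVersion": "2018-05-01",
      "name": "rgCommonDev",
      "location": "[parameters('location')]"
    },
    {
      "type": "Microsoft.Resources/resourceGroups",
      "apiVersion": "2018-05-01",
      "name": "rgAnalyticsTeamDev",
      "location": "[parameters('location')]"
    },
    {
      "type": "Microsoft.Resources/resourceGroups",
      "apiVersion": "2018-05-01",
      "name": "rgMerakiDev",
      "location": "[parameters('location')]"
    }
  ]
}
```

Defining the groups in a template rather than by hand is what gives the repeatability benefit mentioned above: the same file can stamp out identical Dev/Test/Prod groupings.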
The majority of the services used in the proposed high-level design are either PaaS or managed services, providing lower TCO and greater agility in deriving key actionable insights.
All PaaS services include cloud features such as scalability, high availability, multi-tenancy and resiliency.
| Resource Group | Details | Technologies to be used |
| --- | --- | --- |
| 1 – rgCommonDev | Shared resources to be used by all types of projects, mainly from an Information Management perspective, and the most important | |
| 2 – rgAnalyticsTeamDev | Development resources required for Big Data, Data Science and Analytics projects | |
| 3 – rgMerakiDev | Resources required for delivering the Meraki project, which demonstrates an Advanced Analytics example project | |
In this resource group, the only resource will be Azure Data Lake Store, as that is the most important and critical part of the implementation. Any data ingested from multiple data sources will be stored here in an organised manner based on best practices and patterns.
More details regarding the structure of the Data Lake will be added in the Data Lake Framework part of the next post in this series.
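As an illustration, a Data Lake Store (Gen1) account in rgCommonDev could be declared with an ARM fragment like the one below. The parameter name is hypothetical, and the region is an assumption (Gen1 is only available in a handful of regions); encryption is enabled with service-managed keys as a sensible default.

```json
{
  "type": "Microsoft.DataLakeStore/accounts",
  "apiVersion": "2016-11-01",
  "name": "[parameters('dataLakeStoreName')]",
  "location": "northeurope",
  "properties": {
    "encryptionState": "Enabled",
    "encryptionConfig": { "type": "ServiceManaged" }
  }
}
```

Keeping the store in the shared rgCommonDev group means every project's ingestion pipeline writes to one governed location, while access is controlled per folder.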
rgAnalyticsTeamDev (Resource Group) contains one Data Science Virtual Machine (DSVM); this is a bare-minimum setup and can be altered based on the number of Data Scientists and Data Engineers. Blob Storage will be used for storing ad-hoc data files and artefacts related to the development environment.
Mainly, the DSVMs will be used by the Data Scientists to perform exploratory data analysis by accessing data stored within Azure Data Lake Store, finding patterns and creating predictive models.
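What makes a VM a DSVM is simply the marketplace image it is built from. The fragment below shows the relevant parts of a `Microsoft.Compute/virtualMachines` resource for an Ubuntu DSVM; note that marketplace image identifiers change over time, so treat the publisher/offer/sku values as illustrative and confirm them against the current Azure Marketplace listing.

```json
{
  "plan": {
    "name": "linuxdsvmubuntu",
    "publisher": "microsoft-dsvm",
    "product": "linux-data-science-vm-ubuntu"
  },
  "properties": {
    "storageProfile": {
      "imageReference": {
        "publisher": "microsoft-dsvm",
        "offer": "linux-data-science-vm-ubuntu",
        "sku": "linuxdsvmubuntu",
        "version": "latest"
      }
    }
  }
}
```

Because the image ships with the common data science toolchain pre-installed, scaling the team out is a matter of deploying more instances of the same resource rather than configuring machines by hand.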
Talking specifically about the Azure Machine Learning & AI platform portfolio, the following image shows the landscape; these offerings can be employed depending on organisational needs, but to keep things simple I have gone with DSVMs (details in the next blog post).
For further options regarding machine learning offerings, see here…
The sample project to be deployed in this environment is Meraki. I have blogged about this project earlier here. This project helps demonstrate:
a) How different analytical projects can be deployed side by side in the Modern Analytics Environment, each with a completely separate governance framework
b) A Lambda Architecture implementation, which helps showcase the underlying data architecture and framework whilst using Azure Data Lake Store
The key components of the project are: Web App, Event Hub, Stream Analytics, Azure SQL DB and Power BI. All these resources are already wrapped in an ARM template within a Visual Studio solution, to automate the deployment process and for ease of repeatability; this can be accessed separately via the GitHub repository.
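To give a feel for the shape of that template, here is a heavily trimmed sketch of the deployable resources (Power BI is connected separately, not deployed via ARM). The resource names are hypothetical, the API versions are illustrative, and the real template in the repository additionally wires in properties, SKUs and `dependsOn` relationships that are elided here.

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.EventHub/namespaces",
      "apiVersion": "2017-04-01",
      "name": "ehns-meraki-dev",
      "location": "[resourceGroup().location]",
      "sku": { "name": "Standard", "tier": "Standard" }
    },
    {
      "type": "Microsoft.StreamAnalytics/streamingjobs",
      "apiVersion": "2016-03-01",
      "name": "asa-meraki-dev",
      "location": "[resourceGroup().location]",
      "properties": { "sku": { "name": "Standard" } }
    },
    {
      "type": "Microsoft.Web/sites",
      "apiVersion": "2016-08-01",
      "name": "app-meraki-dev",
      "location": "[resourceGroup().location]"
    },
    {
      "type": "Microsoft.Sql/servers",
      "apiVersion": "2015-05-01-preview",
      "name": "sql-meraki-dev",
      "location": "[resourceGroup().location]"
    }
  ]
}
```

Because everything the project needs sits in one template scoped to rgMerakiDev, the whole project can be deployed, torn down and redeployed without touching the shared or team resource groups.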
This project will also help develop a better understanding of the environment and the processes involved in managing it going forward.
Details of the Modern Analytics approach mentioned above will follow in the next blog post…
Finally, here is a quick glance at the Modern Analytics pipeline in Azure, with multiple offerings available at each stage.
Some relevant blog posts: