Demystifying Azure Machine Learning Service (Public Preview)

 

Abstract

Since last month’s Ignite announcements – consolidated very nicely in the Book of News – I have been meaning to write this post for a couple of key reasons, despite the plethora of other really exciting announcements:

a) I have been part of the AML Service Private Preview exercise, so I have seen it develop very closely and can offer a personal point of view

b) I appreciate that this is version 2 of the AML Service within a short span of time

The IT industry’s focus is very much on the Machine Learning discipline at the moment. A second version so soon could therefore easily raise a few questions in people’s minds (e.g. What’s happening with Azure ML Workbench? – please click here for details) regarding the maturity of the service, its real effectiveness in simplifying the entire Data Science life-cycle and, moreover, confidence in its future roadmap.

This post aims to provide some clarity around the 2 fundamental aspects:

  1. Types of Azure Machine Learning Services & Differences between them
  2. Demystifying the new Azure Machine Learning Service (Public Preview)

Through this exercise, I will try to promote a better understanding of the different Azure Machine Learning offerings available today.

Most importantly, I hope this will assist in deciding whether the service strikes the right balance for the challenges organisations face today with AI & Machine Learning-based solutions, particularly the disparity in Data Science/ML skills.

Additionally, I focus on the enterprise readiness currently required for the end-to-end Data Science life-cycle: delivering key actionable insights in a repeatable fashion with a quicker time to market.

Types of Azure Machine Learning Services

At present, the following two services are available in the Azure Machine Learning space, and both are strategic offerings:

Azure Machine Learning Studio

Azure Machine Learning Studio is a collaborative, drag-and-drop visual workspace where you can build, test, and deploy machine learning solutions without needing to write code. It uses pre-built and pre-configured machine learning algorithms and data-handling modules.

You should use Machine Learning Studio when you want to experiment with machine learning models quickly and easily, and the built-in machine learning algorithms are sufficient for your solutions.

It is also aimed more towards Citizen Data Scientists, Business Analysts and Data Analysts, where a deep understanding of Data Science/Machine Learning concepts isn’t necessary.

Models created in Azure Machine Learning Studio cannot be deployed or managed by Azure Machine Learning service, which is an independent service.

Azure Machine Learning Service (Public Preview)

Azure Machine Learning service (Preview) is a cloud service that you can use to develop and deploy machine learning models. Using Azure Machine Learning service, you can track your models as you build, train, deploy, and manage them, all at the broad scale that the cloud provides.

[Image: Azure Machine Learning service]

Use Machine Learning service if you work in a Python environment, you want more control over your machine learning algorithms, or you want to use open-source machine learning libraries.

Most of the recent machine learning innovations are happening in the Python language space, which is why all the core features of the service are exposed through a Python SDK.  

Azure Machine Learning service fully supports open-source technologies, so you can use hundreds of open-source Python packages with machine learning components such as TensorFlow, Keras, PyTorch, Onnx, scikit-learn, CNTK, etc. Rich tools, such as Jupyter notebooks or the Visual Studio Code Tools for AI, make it easy to interactively explore data, transform it, and then develop and test models. Azure Machine Learning service also includes features that automate model generation and tuning to help you create models with ease, efficiency, and accuracy.
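To make that tangible, here is a minimal, hedged sketch of getting going with the SDK from Python; azureml-sdk was the meta-package name during the preview, and the exact set of classes may shift between releases:

```python
# Install the preview SDK into your Python environment first:
#   pip install azureml-sdk
#
# The concepts described in the rest of this post map onto classes in azureml.core.
import azureml.core
from azureml.core import Workspace, Experiment, Run, Datastore
from azureml.core.model import Model

print("Azure ML SDK version:", azureml.core.VERSION)
```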

Azure Machine Learning service lets you start training on your local machine, and then scale out to the cloud. With native support for Azure Batch AI and with advanced hyperparameter tuning services, you can build better models faster, using the power and scale of the cloud.

When you have the right model, you can easily deploy it in a container such as Docker. This means that it’s simple to deploy to Azure Container Instances or Azure Kubernetes Service, or you can use the container in your own deployments, either on-premises or in the cloud. You can manage the deployed models, and track multiple runs as you experiment to find the best solution.

It is mainly targeted at skilled Data Scientists, so they don’t have to worry too much about the infrastructure and engineering aspects of model training, management, operationalisation and monitoring.

Having said that, based on close interactions with quite advanced customers in the Data Science space, I strongly believe a siloed, disconnected team approach can’t deliver on the entire end-to-end data science life-cycle. That is where a change in organisational culture is required: embracing a DevOps for Data Science approach, to make the best use of the available tooling and services for increased productivity and business value.

Interestingly, one of my colleagues also wrote an excellent blog post recently on the Collaborative Approach & Blockers in Data Science, so it may be worth a read.

The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the steps, from start to finish, that projects usually follow when they are executed.

[Image: TDSP life-cycle]

The AML Service is very much geared towards this methodology, and we will now dive into the main aspects we need to understand to appreciate the value it offers.

To create the new Machine Learning service, go to the Azure Portal, click the ‘+’ sign on the top left and search for “Machine Learning service workspace”.

[Image: Creating a Machine Learning service workspace in the Azure Portal]

Workspace

The workspace is the top-level resource for the Azure Machine Learning service. It provides a centralized place to work with all the artifacts you create when using Azure Machine Learning Service.
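As a hedged illustration of what this looks like from the Python SDK (the subscription ID, resource group and region below are placeholders, and parameter names may vary slightly between preview SDK versions):

```python
from azureml.core import Workspace

# Create a new workspace; this provisions the backing Azure resources (placeholder values below).
ws = Workspace.create(name="myamlworkspace",
                      subscription_id="<your-subscription-id>",
                      resource_group="rgAmlDemo",
                      create_resource_group=True,
                      location="westeurope")

# Save a local config file so later sessions can simply re-attach to the same workspace:
ws.write_config()
ws = Workspace.from_config()
```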

Logical Artifacts

[Image: Logical artifacts]

Compute target

A compute target is the compute resource used to run your training script or host your web service deployment running your model. The supported compute targets are:

  • Your local computer
  • A Linux VM in Azure (such as the Data Science Virtual Machine)
  • Azure Batch AI Cluster
  • Apache Spark for HDInsight
  • Azure Container Instance
  • Azure Kubernetes Service

Compute targets are attached to a workspace. Compute targets other than the local machine are shared by users of the workspace.
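Here is a hedged sketch of provisioning a managed training cluster and attaching it to the workspace; during the preview the Batch AI-backed compute class was being renamed, so the exact class name depends on your SDK version, and the VM size and node counts are illustrative:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute

ws = Workspace.from_config()   # re-attach to the workspace created earlier

# Provision an auto-scaling training cluster.
compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                       min_nodes=0,
                                                       max_nodes=4)
cluster = ComputeTarget.create(ws, name="cpu-cluster",
                               provisioning_configuration=compute_config)
cluster.wait_for_completion(show_output=True)
```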

Experiment

An experiment is a grouping of many runs of a given script, and it always belongs to a workspace. When you submit a run, you provide an experiment name. Information for the run is stored under that experiment. If you submit a run and specify an experiment name that doesn’t exist, a new experiment with that name is automatically created.
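A hedged sketch of the experiment/run pattern from the SDK; the experiment name and metric are illustrative:

```python
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="churn-prediction")   # created automatically on first use

# Interactive (notebook-style) run: log metrics directly against the experiment.
run = exp.start_logging()
run.log("accuracy", 0.91)      # any metric name/value your training code produces
run.complete()
```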

Datastore

A datastore is a storage abstraction over an Azure Storage Account. The datastore can use either an Azure blob container or an Azure file share as the backend storage. Each workspace has a default datastore, and you may register additional datastores.

Use the Python SDK API or Azure Machine Learning CLI to store and retrieve files from the datastore.
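For example, here is a hedged sketch of pushing local training data up to the default datastore so that remote compute targets can reach it (paths are illustrative):

```python
from azureml.core import Workspace

ws = Workspace.from_config()
ds = ws.get_default_datastore()   # the blob-backed datastore created with the workspace

# Upload local files so they are visible to remote compute targets.
ds.upload(src_dir="./data", target_path="churn/input", overwrite=True)

# Pull them back down elsewhere if needed.
ds.download(target_path="./data_copy", prefix="churn/input")
```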

Model

At its simplest, a model is a piece of code that takes an input and produces output. Creating a machine learning model involves selecting an algorithm, providing it with data, and tuning hyperparameters. Training is an iterative process that produces a trained model, which encapsulates what the model learned during the training process.
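Once training produces a serialised model file, registering it in the workspace’s model registry is a short call; the file path and model name below are illustrative:

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

model = Model.register(workspace=ws,
                       model_path="outputs/churn_model.pkl",   # local path to the trained model file
                       model_name="churn-model",
                       description="Scikit-learn churn classifier (illustrative)")
print(model.name, model.version)
```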

Images

Images provide a way to reliably deploy a model, along with all the components needed to use it. An image contains the following items (a hedged code sketch of building one follows the list):

  • A model.
  • A scoring script or application. This script is used to pass input to the model and return the output of the model.
  • Dependencies needed by the model or scoring script/application. For example, you might include a Conda environment file that lists Python package dependencies.
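Here is the sketch referred to above: building an image from a registered model using the preview-era image API. The scoring script and Conda file are files you author yourself, the names are illustrative, and this part of the SDK changed noticeably after the preview:

```python
from azureml.core import Workspace
from azureml.core.image import ContainerImage, Image
from azureml.core.model import Model

ws = Workspace.from_config()
model = Model(ws, name="churn-model")   # the model registered earlier

# score.py must expose init() and run(raw_data); myenv.yml lists the Conda/pip dependencies.
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file="myenv.yml")

image = Image.create(workspace=ws, name="churn-image",
                     models=[model], image_config=image_config)
image.wait_for_creation(show_output=True)
```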

Deployment

A deployment is an instantiation of your image into either a Web Service that may be hosted in the cloud or an IoT Module for integrated edge device deployments.
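Continuing the sketch, deploying that image as a web service on Azure Container Instances might look like this; it is a dev/test-sized configuration, the names are illustrative, and the deployment API evolved after the preview:

```python
from azureml.core import Workspace
from azureml.core.image import Image
from azureml.core.webservice import AciWebservice, Webservice

ws = Workspace.from_config()
image = Image(ws, name="churn-image")   # the image built in the previous step

aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Webservice.deploy_from_image(workspace=ws, name="churn-service",
                                       image=image, deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)   # REST endpoint for scoring requests
```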

The Portal view of the logical artifacts looks like this:

[Image: Portal view of the logical artifacts]

Physical Artifacts

[Image: Physical artifacts]

Once you create a workspace, all of the above physical artifacts are created automatically, which keeps the process very neat, with less overhead in terms of figuring out additional aspects yourself.

Azure Machine Learning Service Workflow

[Image: Azure Machine Learning service workflow]

The workflow generally follows these steps (a minimal code sketch tying several of them together follows the list):

  1. Develop machine learning training scripts in Python.
  2. Create and configure a compute target.
  3. Submit the scripts to the configured compute target to run in that environment. During training, the compute target stores run records to a datastore. There the records are saved to an experiment.
  4. Query the experiment for logged metrics from the current and past runs. If the metrics do not indicate a desired outcome, loop back to step 1 and iterate on your scripts.
  5. Once a satisfactory run is found, register the persisted model in the model registry.
  6. Develop a scoring script.
  7. Create an Image and register it in the image registry.
  8. Deploy the image as a web service in Azure.
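Here is the minimal sketch mentioned above, covering roughly steps 2 to 4: submit a training script to the remote compute target and pull back the logged metrics to decide whether to iterate. The script name, folder and metric names are illustrative:

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()

# train.py lives in ./training and logs metrics via Run.get_context().log(...).
src = ScriptRunConfig(source_directory="./training", script="train.py")
src.run_config.target = "cpu-cluster"      # the compute target provisioned earlier

exp = Experiment(workspace=ws, name="churn-prediction")
run = exp.submit(src)
run.wait_for_completion(show_output=True)

print(run.get_metrics())   # inspect logged metrics before deciding to register the model
```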

For more details on architecture and concepts, please click here.

Conclusion

In the end, I believe the latest AML Service (currently in Public Preview) has been developed with an engineering-first mindset, reflecting the various aspects, from data collection to metrics, that need to be addressed for an AI model to be used successfully in a production environment within an organisation.

This will certainly boost productivity for data scientists and machine learning practitioners in building and deploying machine learning solutions at cloud scale.

The service is set to go GA very soon with a great set of new features, so I would say it definitely looks very promising, and I encourage you to watch this space.

How to Get Started with Azure Machine Learning Service

References:

https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml

https://azure.microsoft.com/en-gb/blog/what-s-new-in-azure-machine-learning-service/

https://github.com/Azure/MachineLearningNotebooks

 


How to get going with Modern Analytics Sandpit Environment using Azure in no time…(Part 2)

The previous post in this series of 3 covered the background and purpose of “how to get started with Modern Analytics”; in this post we will cover the actual approach adopted, and why.

As the name “Modern Analytics” suggests, the approach to Information Management has to be unconventional in order to bring all the different types of analytics together (Descriptive, Diagnostic, Predictive & Prescriptive), especially given the shift that has drastically challenged traditional analytics in the form of “Big Data”. For that, we have to draw some parallels between the old and new worlds (ETL vs ELT), as follows:

Information Management Approach

There are two approaches to information management for analytics and deriving actionable insights:

  1. Top-down (deductive approach)

This is where analytics starts with a clear understanding of corporate strategy, and theories and hypotheses are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics: what happened in the past, and why did it happen?

  2. Bottom-up (inductive approach)

This is the approach where data is collected up front, before any theories and hypotheses are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics, such as predictive or prescriptive analytics: what will happen and/or how can we make it happen?

The following image very nicely visualises the differences between the two and the types of analytics they cover.

[Image: Top-down vs bottom-up analytics approaches]

In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash,’” they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach, and exactly this forms the basis of my approach to bringing together and simplifying the different types of analytics within “Modern Analytics”.

This leads on to the question of how to start with the bottom-up approach, given we are already familiar with top-down in the form of the traditional BI / relational paradigm. The answer: use a Data Lake, as it lets you land all the different data sources in a single place to derive actionable insights, irrespective of data volume, velocity, variety or veracity challenges.

Data Lake Framework

However, before jumping straight into the Data Acquisition and Ingestion steps, some careful thinking is needed to stop the Data Lake from turning into a Data Swamp, especially around questions such as: “What should the data management structure look like, considering data democratisation within the organisation?”, “How do we apply data governance?”, “How do we cater for data quality?”, “How do we apply data security based on roles within the organisation?” and “How do we store data for real-time and batch processing?”. Without making this too onerous, having a basic framework around these fundamental questions does assist in delivering relevant, quality analytics even in a sandpit environment and, above all, helps in formulating certain aspects of an Enterprise Data Strategy.

Additionally, considering these basic questions forms a solid basis of best practices and patterns for delivering Data Science & Advanced Analytics projects in an effective manner.

We could go into a lot of detail, but there have already been some quality blogs written by the likes of my colleague Tony Smith and especially Adatis, which are referenced below; the Adatis approach is the one adopted here for answering the questions around Data Governance & Data Management mentioned above.

One core area is worth highlighting, though, as it is significant and fundamental to all of the above questions: the Big Data architecture for managing real-time and batch processing within the Data Lake.

Lambda Architecture

Briefly, this architecture approach divides data processing into “speed” (near real time) and “batch” (Raw, Base, Curated) layers. This design is well established and is a relatively common implementation pattern on Azure.

[Image: Lambda Architecture]

Framework

The rest of the details of the Data Lake Framework are articulated very nicely in the following blogs, especially around the carving of the Data Lake for better Data Governance and Management based on the above Lambda Architecture.

Here is a screenshot of the carved Azure Data Lake discussed in detail within the following blogs:

[Image: Carved Azure Data Lake structure]
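To make the carving a little more concrete, here is a hedged Python sketch using the azure-datalake-store package (Data Lake Store Gen1) that lays down an illustrative set of zones matching the batch (Raw, Base, Curated) and speed layers described above; the tenant, service principal and store names are placeholders, and the real structure should follow the referenced blogs:

```python
from azure.datalake.store import core, lib

# Authenticate with a service principal (placeholder values) and connect to the ADLS account.
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<app-id>",
                 client_secret="<app-secret>")
adl = core.AzureDLFileSystem(token, store_name="mydatalakestore")

# Illustrative top-level zones: batch layers plus a speed layer for streaming data.
for zone in ["Raw", "Base", "Curated", "Speed"]:
    adl.mkdir("/" + zone)

print(adl.ls("/"))
```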

Must reads…

Shaping The Lake: Data Lake Framework

Azure Data Lake Store–Storage and Best Practices

Architectural Components or Azure Building blocks for the Environment

The main architectural components or building blocks referenced for this environment are as follows:

  1. Azure Data Lake Store
  2. Data Science Virtual Machines
  3. App Service
  4. Event Hubs
  5. Stream Analytics
  6. Azure Blob Storage
  7. Power BI

Modern Analytics Project Delivery Process (mainly Data Science and Advanced Analytics) & Key Personas

The project delivery process has been narrowed down specifically for Data Science and Advanced Analytics in the form of the TDSP (Team Data Science Process).

The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the steps, from start to finish, that projects usually follow when they are executed.

The TDSP lifecycle is composed of five major stages that are executed iteratively. These stages include:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

[Image: TDSP life-cycle]

The key personas involved in the delivery of a Data Science & Advanced Analytics project may differ slightly depending on availability and organisational structure (size/maturity), but in the majority of cases the team structure would look like this:

  1. Solution Architect
  2. Data Scientist
  3. Data Engineer
  4. Project Lead / Manager
  5. Developer (Integrating the Insights into downstream Apps)

For further details on the standardised project structure, infrastructure resources, tools & utilities for delivering Data Science / Advanced Analytics projects in an effective manner, click here.

This is by no means a definitive or exhaustive description of the DS & AA project delivery process; as already mentioned, it can be varied and tailored depending on numerous factors. But, as the topic of the blog series suggests, it will surely be quite helpful for getting started.

The final post in this series of 3 will demonstrate the entire approach above in the form of a practical example project (Cisco Meraki).

 

Some relevant blogs

Making Sense of the Swamp – Azure Data Catalog for your Data Lake

Granting Permissions In Azure Data Lake
Assigning Resource Management Permissions For Azure Data Lake Store (Part 2)
Assigning Data Permissions For Azure Data Lake Store (Part 3)

Buck Woody’s DevOps for Data Science Series

How to get going with Modern Analytics Sandpit Environment using Azure in no time…

Purpose

This environment is referred to as a Modern Analytics Environment because it covers all types of analytical projects, whether Big Data, Modern BI/DW, Data Science or Advanced Analytics.

This will be a 3-part blog post series, starting with the purpose, then the details of the approach, and lastly a working example / solution (using ARM) covering all the aspects discussed in the series, to get going on the Modern Analytics journey.

This approach is by no means recommended for production environments; it is merely to get organisations started with a sandpit environment, mainly for experimentation around Data Science and Advanced Analytics use cases.

I have been thinking of writing this blog for a while. The reason is that in all my recent customer interactions, irrespective of organisation size or sector, there has been one big challenge stopping them from getting going on their Data Science and Advanced Analytics journey: “How do we get started?”

The “How do we get started?” question can very quickly unfold into an extensive debate seeking to cover all eventualities, which in a lot of cases leads to inconclusive outcomes. To give a flavour, the discussion could take the form of queries like “How do we run different analytical projects within the same environment?”, “How do we ingest data?”, “How do we store data with our organisation and data sources in mind?”, “How do we catalogue data?”, “How can a Data Scientist access the relevant data and do exploratory analysis with ease to find key insights?”, “How do we ensure secure access to the data and manage Azure data services spend?”, “How could the environment follow a DevOps process?”, and the list could go on. Precisely for this reason, to just “get going”, I have adopted a rather simplistic yet solid approach to building an effective Modern Analytics sandpit environment, which allows organisations to start transforming raw data into intelligent action and reinventing their business processes quickly and efficiently. Once the organisation’s maturity level starts improving, the environment and processes can be evolved into much more coherent Modern Analytics project delivery and management practices, because the foundations of the environment will still be intact.

The key and sole purpose of this blog is to provide a Modern Analytics sandpit environment boilerplate, keeping industry best practices and basic enterprise readiness requirements in perspective, i.e. Data Governance, Security, Scalability, High Availability, Monitoring, lower TCO and, most importantly, Agility.

Approach

Here is the background to the rather simplistic yet solid approach for the proposed Modern Analytics sandpit environment:

  1. Information Management Approach
  2. Data Lake Framework
  3. Architectural Components or Azure Building blocks for the Environment
  4. Example Project (Cisco Meraki)
  5. Key Personas (mainly Data Science and Advanced Analytics)
  6. Modern Analytics Project Delivery Process (mainly Data Science and Advanced Analytics)

The environment code will be available via a GitHub repository later, so watch this space.

Solution

Now, before we go into the details of each core area of the above-mentioned approach, here is a high-level view of the proposed sandpit environment using the Azure Data & Analytics offerings, to keep things in perspective for the next blog post:

[Image: High-level view of the Modern Analytics sandpit environment]

Also, just to highlight: the above high-level architecture diagram represents a logical grouping of resources using Azure Resource Groups as an example implementation; the actual implementation can differ depending on organisation/team structure, security/compliance requirements, etc.

Azure Resource Groups are logical containers that allow you to group individual resources, such as virtual machines, storage accounts, websites and databases, so they can be managed together.

The key benefits of Azure Resource Groups come in the form of cost management, security, agility, repeatability, etc.

For complete benefit details please click on the following link:

https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-overview#the-benefits-of-using-resource-manager
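If you want to stand these groups up programmatically rather than through the portal, a hedged sketch with the azure-mgmt-resource Python package might look like this; the service principal details, subscription ID and region are placeholders, and the three group names are the ones used later in this post:

```python
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient

credentials = ServicePrincipalCredentials(client_id="<app-id>",
                                          secret="<app-secret>",
                                          tenant="<tenant-id>")
client = ResourceManagementClient(credentials, "<subscription-id>")

# Create the three logical containers used by this sandpit design.
for rg in ["rgCommonDev", "rgAnalyticsTeamDev", "rgMerakiDev"]:
    client.resource_groups.create_or_update(rg, {"location": "westeurope"})
```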

The majority of the services used in the proposed high-level design are either PaaS or managed services, providing lower TCO and greater agility in deriving key actionable insights.

All PaaS services include cloud features such as scalability, high availability, multi-tenant capability and resiliency.

Resource Group | Details | Technologies to be used
1 – rgCommonDev | Shared resources used by all types of projects, mainly from an Information Management perspective (and the most important group) | Azure Data Lake Store
2 – rgAnalyticsTeamDev | Development resources required for the Big Data, Data Science and Analytics projects | Data Science Virtual Machines (one per Data Scientist/Engineer), Azure Blob Storage
3 – rgMerakiDev | Resources required for delivering the Meraki project, which demonstrates an Advanced Analytics example project | Web App, Event Hub, Stream Analytics, Azure SQL, Power BI

1. rgCommonDev

In this resource group, the only resource will be Azure Data Lake Store, as that is the most critical part of the implementation. Any data ingested from the various data sources will be stored here in an organised manner based on best practices and patterns.

More details on the structure of the Data Lake will be covered in the Data Lake Framework section of the next post in this series.

2. rgAnalyticsTeamDev

The rgAnalyticsTeamDev resource group contains 1 Data Science Virtual Machine, but this is the bare minimum setup and can be scaled based on the number of Data Scientists and Data Engineers using it. The Blob storage will be used for storing ad-hoc data files and artefacts related to the development environment.

The DSVMs will mainly be used by the Data Scientists to perform exploratory data analysis against data stored within Azure Data Lake Store, finding patterns and creating predictive models.

Talking specifically about the Azure Machine Learning & AI Platform portfolio, the following image shows the landscape; any of these options can be employed depending on organisational needs, but to keep things simple I have gone with DSVMs (details in the next blog post).

[Image: Azure AI Platform stack]

For further options regarding machine learning offerings, see here.

3. rgMerakiDev

The sample project to be deployed in this environment is Meraki. I have blogged about this project earlier here. The project helps demonstrate:

a) How different analytical projects can be deployed side by side in the Modern Analytics Environment, each with a completely separate governance framework

b) Its Lambda Architecture implementation, which helps showcase the underlying data architecture and framework whilst using Azure Data Lake Store

The key components of the project are: Web App, Event Hub, Stream Analytics, Azure SQL DB and Power BI. All of these resources are already wrapped in an ARM template within a Visual Studio solution, to automate the deployment process and make it easy to repeat; this can be accessed separately via the GitHub repository.

This project will also help develop a better understanding of the environment and the processes involved in managing it going forward.

Details of the Modern Analytics approach mentioned above will follow in the next blog post…

To close, here is a quick glance at the Modern Analytics pipeline in Azure, with the multiple offerings available at each stage:

[Image: Modern Analytics pipeline in Azure]

Some relevant blogposts:

Buck Woody’s DevOps for Data Science series

Making sense of the swamp