Building Automated Data and Machine Learning Pipelines with Azure - Automated Feature Engineering

Introduction


Artificial Intelligence (AI) has been dubbed the fourth industrial revolution, where advancements in the field represent a fundamental change in the way we live, work and relate to one another.

As with any technological breakthrough, AI has its challenges. It takes time and education for the technology to be understood, to become more accessible, and to be applied to its full potential. For example, think about how long it took before the personal computer and the Internet were seamlessly integrated into our daily lives.


So how do businesses evolve AI from ideation to productisation? What do organisations do if they lack the virtually limitless compute and talent of the Googles, Microsofts, and Amazons of the world?


We need only look to history and the successes of the first industrial revolution to find the answer: automation, standardisation, speed and scale, which today are delivered through cloud services.


Purpose


This article is one in a series discussing automation for end-to-end data and machine learning pipelines.

AI is a two-step process: data preparation and machine learning model build.

The objective of the article is to provide guidelines for the automation of the Data Preparation process.

Although Azure cloud services were used to detail the proposed implementation, the blueprint can also be used on other cloud service providers.

The topics discussed:

  • Feature Engineering Concepts

  • Azure Data Factory

  • DataOps

  • Data Ingestion Architecture

  • Feature Engineering Architecture



Feature Engineering Concepts


The proposed blueprint is a two-step process: Data Ingestion and Feature Engineering.




Data Ingestion


Data ingestion is the process of obtaining and importing data from disparate sources for immediate use or storage in a data store/database.

The proposed architecture provides the capability to ingest data from manually uploaded files, SQL database tables, and Data Lakes.

Feature Engineering


Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. It involves transforming raw data into features that better represent the underlying problem for, e.g., predictive models.

The blueprint proposes the concepts of a Base Population and a Feature Catalogue. The idea is that features are engineered as defined in a Feature Catalogue, for the standing data defined in a Base Population. For example, in a fashion retail store context, this could mean engineering a feature for the spending total in the last 24 hours (Feature Catalogue) for customers who are active members (Base Population).


Base Population


A Base Population is the reference or standing data associated with a Feature Catalogue. An example of Base Population data is customer information like name, age, address and customer account id. A Base Population in the context of the proposed architecture is defined by the following:



  • Source datastore: Location of the data store (e.g. Blob storage, database)

  • Source table/file: Source table/file name (e.g. credit card accounts table)

  • Reference ID: The unique reference id that identifies the customer in the base population (e.g. Customer Account ID)

  • Base Population Criterion: Definition of the criterion used to include/exclude customers (e.g. all active credit card customer accounts)
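
The four attributes above could be captured in a small data structure. The following is a minimal sketch only: the class name `BasePopulation`, its field names, the `select` helper, the sample rows and the placeholder data store location are all assumptions made for illustration, not part of the proposed package.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class BasePopulation:
    """Hypothetical container mirroring the four attributes listed above."""
    source_datastore: str             # e.g. a Blob storage or database location
    source_table: str                 # e.g. the credit card accounts table
    reference_id: str                 # column that uniquely identifies a customer
    criterion: Callable[[Row], bool]  # include/exclude rule applied to each row

    def select(self, rows: List[Row]) -> List[object]:
        """Return the reference ids of the rows that satisfy the criterion."""
        return [row[self.reference_id] for row in rows if self.criterion(row)]


# Illustrative rows only; column names and values are made up.
accounts = [
    {"customer_account_id": "A1", "status": "active"},
    {"customer_account_id": "A2", "status": "closed"},
    {"customer_account_id": "A3", "status": "active"},
]

active_cards = BasePopulation(
    source_datastore="raw-data-catalogue",  # placeholder location
    source_table="credit_card_accounts",
    reference_id="customer_account_id",
    criterion=lambda row: row["status"] == "active",
)

selected_ids = active_cards.select(accounts)
print(selected_ids)
```

Expressing the criterion as a plain function keeps the definition declarative: the same structure can describe "all active accounts" or any other inclusion rule without changing the surrounding code.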

Feature Catalogue


A Feature Catalogue consists of the definitions needed to derive a feature from data in its raw format. This involves taking multiple rows per reference id and creating a single row per reference id. An engineered feature in the context of the proposed architecture is defined by the following:



  • Source datastore: Location of the data store (e.g. Blob storage, database)

  • Source table/file: Source table/file name (e.g. credit card transactions table)

  • Reference ID: The unique reference id that identifies the customer in the base population (e.g. Customer Account ID)

  • Reference Sequence: The group by column name for the aggregated column (e.g. a date or date time column)

  • Reference Point: The snapshot point in time for the aggregation (e.g. group by from 14/01/2019 looking back 24 hours of customer spend)

  • Raw Column name: The raw column on which to carry out the transformation (e.g. Spend)

  • Transformation: The type of transformation (e.g. sum, average, min, max)

  • Look back unit: The look back unit (e.g. hours)

  • Look back value: The look back value (e.g. 24)
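
As an illustration of how such a definition might be represented, the sketch below stores the attributes above in a dataclass and derives the look-back window from the Reference Point, Look back unit and Look back value. The class name `FeatureDefinition`, its field names and the `window` method are hypothetical. It also assumes the look back unit is one that Python's `timedelta` accepts as a keyword (such as `hours`, `days` or `weeks`).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical Feature Catalogue entry mirroring the attributes above."""
    source_datastore: str
    source_table: str
    reference_id: str         # e.g. the customer account id column
    reference_sequence: str   # date/time column the aggregation runs over
    reference_point: datetime  # snapshot point in time
    raw_column: str           # e.g. "spend"
    transformation: str       # e.g. "sum", "average", "min", "max"
    look_back_unit: str       # assumed to be a timedelta keyword, e.g. "hours"
    look_back_value: int      # e.g. 24

    def window(self) -> Tuple[datetime, datetime]:
        """Return (start, end) of the look-back window ending at the reference point."""
        delta = timedelta(**{self.look_back_unit: self.look_back_value})
        return self.reference_point - delta, self.reference_point


spend_24h = FeatureDefinition(
    source_datastore="raw-data-catalogue",  # placeholder location
    source_table="credit_card_transactions",
    reference_id="customer_account_id",
    reference_sequence="transaction_time",
    reference_point=datetime(2019, 1, 14),
    raw_column="spend",
    transformation="sum",
    look_back_unit="hours",
    look_back_value=24,
)

window_start, window_end = spend_24h.window()
print(window_start, window_end)
```

With the example values from the list, the window runs from 13/01/2019 00:00 to 14/01/2019 00:00.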


Azure Data Factory


The proposed architecture uses Azure Data Factory (ADF), a managed cloud service for building and orchestrating data pipelines. It is used here to deploy and automate data ingestion and feature engineering in the cloud. More information on ADF can be found here.


DataOps


The blueprint applies the principles of DevOps for Data (DataOps) to standardise and streamline the feature engineering process. More information on DataOps on Azure can be found here.




Data Ingestion Architecture


Manual File Upload Ingestion Pattern

  1. A file is uploaded and stored in an Azure Blob Storage location.

  2. The ADF Data Pipeline detects new files in a predetermined Blob Storage location using an event-based trigger.

  3. The Data Flow retrieves, maps, transforms and saves the file to the destination ADLS Raw Data Catalogue in Parquet format.
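
The event-based trigger in this pattern could look roughly like the definition below, built here as a Python dictionary and printed as JSON. This is a sketch that follows the general shape of an ADF storage event trigger (type `BlobEventsTrigger`, reacting to `Microsoft.Storage.BlobCreated`); it should be checked against the ADF documentation before use. The helper name, trigger name, container, folder, pipeline name and the scope placeholders are all hypothetical.

```python
import json
from typing import Dict


def blob_created_trigger(name: str, storage_account_scope: str,
                         container: str, folder: str,
                         pipeline_name: str) -> Dict[str, object]:
    """Build a storage event trigger definition for newly created blobs.

    Hypothetical helper: it only assembles a dictionary and does not call Azure.
    """
    return {
        "name": name,
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                # Resource id of the storage account being watched.
                "scope": storage_account_scope,
                # Fire when a new blob is created.
                "events": ["Microsoft.Storage.BlobCreated"],
                # Restrict the trigger to one predetermined location.
                "blobPathBeginsWith": f"/{container}/blobs/{folder}/",
                "ignoreEmptyBlobs": True,
            },
            "pipelines": [
                {
                    "pipelineReference": {
                        "type": "PipelineReference",
                        "referenceName": pipeline_name,
                    }
                }
            ],
        },
    }


trigger = blob_created_trigger(
    name="ManualUploadTrigger",  # made-up name
    storage_account_scope=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    ),
    container="uploads",             # made-up container
    folder="incoming",               # made-up folder
    pipeline_name="IngestManualUpload",  # made-up pipeline
)

print(json.dumps(trigger, indent=2))
```

Limiting the trigger to a path prefix is what makes the Blob Storage location "predetermined": files uploaded elsewhere in the account do not start the pipeline.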

SQL Database Ingestion Pattern


Bulk Multiple Tables Copy

  1. The ADF Data Pipeline is run either manually or by a schedule trigger.

  2. The Data Flow retrieves, maps, transforms and saves the tables from the SQL Server source to the destination ADLS Raw Data Catalogue in Parquet format.

Incremental Multiple Tables Copy

  1. The ADF Data Pipeline's Lookup activity tracks the source database tables for incremental changes in data.

  2. Incremental data is mapped, transformed and saved from the SQL Server source to the destination ADLS Raw Data Catalogue in Parquet format.
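
One common way to track incremental changes is a high-watermark: remember the latest modification time already copied and, on the next run, pick up only rows changed after it. The article does not specify the mechanism behind the Lookup activity, so the sketch below is an assumption, written in plain Python rather than as ADF activities. The function name, the `last_modified` column and the sample rows are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Row = Dict[str, object]


def incremental_rows(rows: List[Row], watermark_column: str,
                     last_watermark: datetime) -> Tuple[List[Row], datetime]:
    """Return rows changed after the last watermark, plus the new watermark."""
    changed = [row for row in rows if row[watermark_column] > last_watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max(
        (row[watermark_column] for row in changed), default=last_watermark
    )
    return changed, new_watermark


# Illustrative source table; values are made up.
source_table = [
    {"id": 1, "last_modified": datetime(2019, 1, 10, 8, 0)},
    {"id": 2, "last_modified": datetime(2019, 1, 12, 9, 0)},
    {"id": 3, "last_modified": datetime(2019, 1, 13, 17, 30)},
]

# First run: everything after the stored watermark is copied.
batch, watermark = incremental_rows(
    source_table, "last_modified", datetime(2019, 1, 11)
)
print([row["id"] for row in batch], watermark)

# Second run with no new changes: nothing is copied and the watermark is unchanged.
empty_batch, same_watermark = incremental_rows(
    source_table, "last_modified", watermark
)
print(empty_batch, same_watermark)
```

In a multi-table copy, one watermark would be kept per source table so that each table advances independently.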

Data Lake Ingestion Pattern

  1. The ADF Data Pipeline detects new files at a Data Lake location using an event-based trigger.

  2. Data Flow retrieves, maps, transforms and saves the file from the Data Lake to the destination ADLS Raw Data Catalogue in a Parquet format.

Feature Engineering Architecture


Feature Engineering Package


The blueprint proposes using Feature Engineering Packages to create features:


  • The package contains Spark-based boilerplate code.

  • It defines and creates features derived from a Base Population and Feature Catalogue.

  • The package's open source libraries and APIs encourage cross-platform compatibility. For example, the package allows users to develop, test and experiment on their local laptops, as well as deploy and run in other cloud providers' ecosystems.

The package standardises and automates the feature engineering process. It removes much of the complexity and effort associated with software engineering, allowing Data Engineers/Analysts to focus on feature engineering. It also promotes good software engineering practices, reproducibility, and version control.


Adhering to DataOps principles, the package is developed, versioned and deployed to Feature Engineering workflows via Git repositories.


Feature Engineering Workflows


The proposed architecture contains two Feature Engineering workflows: Development and Production.


Development


The Development workflow allows users to engineer features for the purpose of exploration, experimentation, data analytics and machine learning. The workflow consists of Spark activities embedded in an ADF Data Pipeline that execute the Feature Engineering Packages. It is typically triggered manually or on a schedule:

  1. The Feature Engineering package retrieves the Raw and Base Population data from the ADLS Raw Data Catalogue.

  2. The package keeps only the Raw data whose Reference ID belongs to the Base Population, as selected by the Base Population Criterion.

  3. It creates engineered features by applying the Transformation to the Raw Column over a window on the Reference Sequence column, running from the Reference Point minus the Look back value to the Reference Point.

  4. The features are saved to the ADLS Development Feature Catalogue.
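
Steps 2 and 3 above can be sketched in a few lines. The example below uses plain Python on in-memory rows rather than Spark, so it shows only the logic and is not the package's actual code. The function name `engineer_feature`, the column names and the sample data are hypothetical. Whether the window boundaries are inclusive is not stated in the article, so the sketch assumes the start is exclusive and the Reference Point is inclusive. Reading from and saving to ADLS (steps 1 and 4) are left out.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable, Dict, List, Set

Row = Dict[str, object]

# Transformations named in the Feature Catalogue, mapped to Python functions.
TRANSFORMATIONS: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "average": mean,
    "min": min,
    "max": max,
}


def engineer_feature(raw_rows: List[Row], base_ids: Set[str],
                     reference_id: str, reference_sequence: str,
                     raw_column: str, transformation: str,
                     reference_point: datetime,
                     look_back: timedelta) -> Dict[str, float]:
    """Aggregate raw rows into one feature value per reference id."""
    window_start = reference_point - look_back
    grouped: Dict[str, List[float]] = {}
    for row in raw_rows:
        rid = row[reference_id]
        if rid not in base_ids:
            continue  # step 2: keep only the Base Population
        # Step 3: keep rows inside the look-back window
        # (assumed start-exclusive, end-inclusive).
        if not (window_start < row[reference_sequence] <= reference_point):
            continue
        grouped.setdefault(rid, []).append(row[raw_column])
    aggregate = TRANSFORMATIONS[transformation]
    return {rid: aggregate(values) for rid, values in grouped.items()}


# Illustrative transactions; values are made up.
transactions = [
    {"customer_account_id": "A1", "transaction_time": datetime(2019, 1, 13, 9, 30), "spend": 40.0},
    {"customer_account_id": "A1", "transaction_time": datetime(2019, 1, 13, 18, 0), "spend": 60.0},
    {"customer_account_id": "A1", "transaction_time": datetime(2019, 1, 12, 12, 0), "spend": 500.0},  # outside window
    {"customer_account_id": "A2", "transaction_time": datetime(2019, 1, 13, 10, 0), "spend": 25.0},   # not in Base Population
    {"customer_account_id": "A3", "transaction_time": datetime(2019, 1, 13, 23, 0), "spend": 15.5},
]

features = engineer_feature(
    raw_rows=transactions,
    base_ids={"A1", "A3"},
    reference_id="customer_account_id",
    reference_sequence="transaction_time",
    raw_column="spend",
    transformation="sum",
    reference_point=datetime(2019, 1, 14),
    look_back=timedelta(hours=24),
)
print(features)
```

Note that in this sketch a customer with no rows inside the window is simply absent from the result. A real implementation would need to decide whether such customers should instead receive a default value such as zero.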

Production


Once the features have been explored, tested, and validated in Development, the Feature Engineering package is promoted to the Production workflow via a Git repository.

The Production workflow engineers features as soon as data is made available, for the purpose of ML model inference and data analytics. The workflow consists of Spark activities embedded in an ADF Data Pipeline that execute the Feature Engineering Packages. It is normally event-triggered (i.e. when new data is made available), though there are use cases where schedule triggers are required.

  1. New Raw data is made available on the ADLS Raw Data Catalogue.

  2. The package keeps only the Raw data whose Reference ID belongs to the Base Population, as selected by the Base Population Criterion.

  3. It creates engineered features by applying the Transformation to the Raw Column over a window on the Reference Sequence column, running from the Reference Point minus the Look back value to the Reference Point.

  4. The features are saved to the ADLS Production Feature Catalogue.


Conclusion


Market leaders have recognised that:

  • Machine learning and AI have demonstrated benefits and value in an organisation

  • Standardisation & automation of ML drive speed, scale and efficiencies

  • Teams become productive in their day jobs

  • Giving them the capacity to innovate 

  • Allowing for breakthroughs and discoveries

  • Which improves and matures Machine Learning in an organisation

  • And increases return on investment

CorticAi has helped organisations with initiatives to evolve their Data and ML capabilities. These are typically short, incremental engagements that contribute to a company's long-term technology strategy, allowing it to demonstrate business benefits and value in the process. The fundamentals that underpin the success of these projects are Agile methodologies, cross-functional teams, and automation in the cloud.


