Eugene Tan
Building Automated Data and Machine Learning Pipelines with Azure - Automated Feature Engineering
Introduction
Artificial Intelligence (AI) has been dubbed the fourth industrial revolution, where advancements in the field represent a fundamental change in the way we live, work and relate to one another.
As with any technological breakthrough, it has its challenges. It takes time and education for it to be understood, to become more accessible, and to be applied to its fullest potential. For example, think about how long it took before the personal computer and the Internet were seamlessly integrated into our daily lives.
So how do businesses evolve AI from ideation to productisation? What do organisations do if they do not possess virtually limitless compute and talent, akin to the Googles, Microsofts, and Amazons of the world?
We need only look to history and the successes of the first industrial revolution to find the answer: automation, standardisation, speed and scale (using cloud services).
Purpose
This article is one in a series discussing automation for end-to-end data and machine learning pipelines.
Building an AI solution is a two-step process: Data Preparation and Machine Learning model build.
The objective of this article is to provide guidelines for automating the Data Preparation process.
Although Azure cloud services were used to detail the proposed implementation, the blueprint can also be used on other cloud service providers.
The topics discussed:
Feature Engineering Concepts
Azure Data Factory
DataOps
Data Ingestion Architecture
Feature Engineering Architecture

Feature Engineering Concepts
The proposed blueprint is a two-step process: Data Ingestion and Feature Engineering.

Data Ingestion
Data ingestion is the process of obtaining and importing data from disparate sources for immediate use or storage in a data store/database.
The proposed architecture provides the capability to ingest data from manually uploaded files, SQL database tables, and Data Lakes.
Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. It involves transforming raw data into features that better represent the underlying problem to, for example, predictive models.
The blueprint proposes the concepts of a Base Population and a Feature Catalogue. The idea is that features are engineered as defined in a Feature Catalogue, for the standing data defined in a Base Population. For example, in a fashion retail context, a feature for total spend in the last 24 hours (Feature Catalogue) could be engineered for customers who are active members (Base Population).
Base Population
A Base Population is the reference or standing data associated with a Feature Catalogue. An example of Base Population data is customer information like name, age, address and customer account id. A Base Population in the context of the proposed architecture is defined by the following:

Source datastore: Location of the data store (e.g. Blob Storage, database)
Source table/file: Source table/file name (e.g. credit card accounts table)
Reference ID: The unique reference ID that identifies the customer in the base population (e.g. Customer Account ID)
Base Population Criteria: The criteria used to include/exclude customers (e.g. all active credit card customer accounts)
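As an illustration, a Base Population definition could be captured as a simple configuration object. The field names and values below are hypothetical and only mirror the definition above:

# Hypothetical Base Population definition (all values illustrative)
base_population = {
    "source_datastore": "abfss://raw@mydatalake.dfs.core.windows.net",  # location of the data store
    "source_table": "credit_card_accounts",                             # source table/file name
    "reference_id": "customer_account_id",                              # unique customer reference
    "criteria": "account_status = 'ACTIVE'",                            # include/exclude rule
}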
Feature Catalogue
A Feature Catalogue consists of definitions for deriving/transforming raw data into a feature. This involves taking multiple rows per reference ID and creating a single row per reference ID. An engineered feature in the context of the proposed architecture is defined by the following:

Source datastore: Location of the data store (e.g. Blob Storage, database)
Source table/file: Source table/file name (e.g. credit card transactions table)
Reference ID: The unique reference ID that identifies the customer in the base population (e.g. Customer Account ID)
Reference Sequence: The group-by column name for the aggregated column (e.g. a date or date-time column)
Reference Point: The snapshot point in time for the aggregation (e.g. group by from 14/01/2019, looking back over 24 hours of customer spend)
Raw Column name: The raw column on which to carry out the transformation (e.g. Spend)
Transformation: The type of transformation (e.g. sum, average, min, max)
Look back unit: The look-back unit (e.g. hours)
Look back value: The look-back value (e.g. 24 hours)
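Similarly, a single Feature Catalogue entry might be expressed as configuration. Again, the field names and values are illustrative only:

# Hypothetical Feature Catalogue entry (all values illustrative)
feature_definition = {
    "source_datastore": "abfss://raw@mydatalake.dfs.core.windows.net",
    "source_table": "credit_card_transactions",
    "reference_id": "customer_account_id",
    "reference_sequence": "transaction_datetime",  # group-by/ordering column
    "reference_point": "2019-01-14 00:00:00",      # snapshot point in time
    "raw_column": "spend",                         # column to transform
    "transformation": "sum",                       # sum / avg / min / max
    "look_back_unit": "hours",
    "look_back_value": 24,
}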
Azure Data Factory
The proposed architecture uses Azure Data Factory (ADF). ADF provides an easy and seamless way to deploy and automate technology solutions for data ingestion and feature engineering in the cloud. More information on ADF is available in the Azure Data Factory documentation.
DataOps
The blueprint applies the principles of DevOps for data (DataOps) to standardise and streamline the feature engineering process. More information on DataOps on Azure is available in Microsoft's Azure documentation.

Data Ingestion Architecture
Manual File Upload Ingestion Pattern
A file is uploaded and stored in an Azure Blob Storage location.

The ADF Data Pipeline detects new files in a predetermined Blob Storage location using an event-based trigger.
The Data Flow retrieves, maps, transforms and saves the file to the destination ADLS Raw Data Catalogue in Parquet format.
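The Data Flow itself is authored in ADF, but as a rough PySpark equivalent of what it performs, the sketch below reads the uploaded file from the landing location and lands it as Parquet. The storage accounts, containers and paths are assumptions, not part of the blueprint:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the uploaded file from the Blob Storage landing location (path is illustrative)
uploaded = (spark.read
            .option("header", True)
            .csv("wasbs://landing@mystorageaccount.blob.core.windows.net/uploads/"))

# Map/transform as required, then persist to the ADLS Raw Data Catalogue in Parquet format
uploaded.write.mode("append").parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/raw-data-catalogue/uploaded_files/")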
SQL Database Ingestion Pattern
Bulk Multiple Tables Copy
The ADF Data Pipeline is executed manually or via a scheduled trigger.
The Data Flow retrieves, maps, transforms and saves each table from the SQL Server source to the destination ADLS Raw Data Catalogue in Parquet format.
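In ADF this is typically a parameterised Copy activity inside a ForEach loop over a table list. A hedged PySpark sketch of the same bulk pattern follows; the connection details, credentials and table names are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb"  # illustrative
tables = ["customers", "credit_card_accounts", "credit_card_transactions"]          # illustrative

for table in tables:
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", table)
          .option("user", "etl_user")      # in practice, retrieve credentials from a secret store
          .option("password", "<secret>")
          .load())
    # Land each source table in the ADLS Raw Data Catalogue as Parquet
    df.write.mode("overwrite").parquet(
        f"abfss://raw@mydatalake.dfs.core.windows.net/raw-data-catalogue/{table}/")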
Incremental Multiple Tables Copy
The ADF Data Pipeline's Lookup activity tracks the source database tables for incremental changes in the data.
The incremental data is mapped, transformed and saved from the SQL Server source to the destination ADLS Raw Data Catalogue in Parquet format.
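The Lookup activity typically reads a stored watermark (such as the last processed modification timestamp), and the copy then selects only newer rows. A hedged PySpark sketch of the same idea is shown below; the watermark column, table and paths are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb"  # illustrative
last_watermark = "2019-01-13 00:00:00"  # in practice, read from a watermark table or pipeline variable

# Select only rows modified since the last successful load (column name is illustrative)
incremental_query = (
    f"(SELECT * FROM credit_card_transactions WHERE last_modified > '{last_watermark}') AS incr")

incremental = (spark.read.format("jdbc")
               .option("url", jdbc_url)
               .option("dbtable", incremental_query)
               .option("user", "etl_user")
               .option("password", "<secret>")
               .load())

incremental.write.mode("append").parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/raw-data-catalogue/credit_card_transactions/")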
Datalake Ingestion Pattern
The ADF Data Pipeline detects new files at a Data Lake location using an event-based trigger.
The Data Flow retrieves, maps, transforms and saves the file from the Data Lake to the destination ADLS Raw Data Catalogue in Parquet format.
Feature Engineering Architecture
Feature Engineering Package
The blueprint proposes using Feature Engineering Packages to create features:

The package contains Spark-based boilerplate code.
It defines and creates features derived from a Base Population and Feature Catalogue.
The package's open source libraries and APIs encourage cross-platform compatibility. For example, the package allows users to develop, test and experiment on their local laptops, as well as deploy and run in other cloud providers' ecosystems.
The package standardises and automates the feature engineering process. It eliminates the complexity and effort associated with software engineering, allowing Data Engineers/Analysts to focus solely on feature engineering. It also promotes good software engineering practices, reproducibility, and version control.
Adhering to DataOps principles, the package is developed, versioned and deployed to Feature Engineering workflows via Git repositories.
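A minimal sketch of the kind of boilerplate such a package could contain is shown below, assuming the hypothetical feature_definition and base_population configuration introduced earlier. The function name, column names and transformation mapping are assumptions, not the blueprint's actual API:

from pyspark.sql import DataFrame, functions as F

def build_feature(raw_df: DataFrame, population_df: DataFrame, feature: dict) -> DataFrame:
    """Aggregate a raw column over the look-back window, producing one row per reference ID."""
    ref_id = feature["reference_id"]
    seq_col = feature["reference_sequence"]
    ref_point = F.to_timestamp(F.lit(feature["reference_point"]))
    window_start = ref_point - F.expr(
        f"INTERVAL {feature['look_back_value']} {feature['look_back_unit']}")

    # Keep only raw rows belonging to the Base Population and inside the look-back window
    in_scope = (raw_df
                .join(population_df.select(ref_id), on=ref_id, how="inner")
                .filter((F.col(seq_col) > window_start) & (F.col(seq_col) <= ref_point)))

    # Apply the configured transformation; assumes the name matches a
    # pyspark.sql.functions aggregate (e.g. sum, avg, min, max)
    agg_expr = getattr(F, feature["transformation"])(feature["raw_column"])
    feature_name = (f"{feature['raw_column']}_{feature['transformation']}"
                    f"_last_{feature['look_back_value']}_{feature['look_back_unit']}")
    return in_scope.groupBy(ref_id).agg(agg_expr.alias(feature_name))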
Feature Engineering Workflows
The proposed architecture contains two Feature Engineering workflows: Development and Production.
Development
The Development workflow allows users to engineer features for exploration, experimentation, data analytics and machine learning. The workflow consists of Spark activities embedded in an ADF Data Pipeline that execute the Feature Engineering Packages. They are typically triggered manually or on a schedule:
The Feature Engineering package retrieves the Raw and Base Population data from the ADLS Raw Data Catalogue.
The package includes only the Raw Data whose Reference IDs satisfy the Base Population Criteria.
It creates engineered features based on the Raw Column name, applying the Transformation over the look-back window on the Reference Sequence, i.e. from the Reference Point minus the Look back value up to the Reference Point.
The features are saved to the ADLS Development Feature Catalogue.
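A hedged sketch of how the Development workflow's Spark activity could wire these steps together, reusing the hypothetical base_population, feature_definition and build_feature examples above (all paths and column names are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Retrieve the Raw and Base Population data from the ADLS Raw Data Catalogue
raw_df = spark.read.parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/raw-data-catalogue/credit_card_transactions/")
accounts_df = spark.read.parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/raw-data-catalogue/credit_card_accounts/")

# 2. Apply the Base Population Criteria to obtain the in-scope Reference IDs
population_df = accounts_df.filter(base_population["criteria"])

# 3. Engineer the feature over the look-back window
features_df = build_feature(raw_df, population_df, feature_definition)

# 4. Save the result to the ADLS Development Feature Catalogue
features_df.write.mode("overwrite").parquet(
    "abfss://features@mydatalake.dfs.core.windows.net/development-feature-catalogue/spend_sum_last_24_hours/")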
Production
Once the features have been explored, tested, and validated in Development, the Feature Engineering package is promoted to the Production workflow via a GIT repository.
The Production workflow engineers features as soon as data is made available, for the purpose of ML model inference and data analytics. The workflow consists of Spark activities embedded in an ADF Data Pipeline that execute the Feature Engineering Packages. They are normally event-triggered (i.e. when new data is made available), though there are use cases where scheduled triggers are required.
New Raw Data is made available in the ADLS Raw Data Catalogue.
The package includes only the Raw Data whose Reference IDs satisfy the Base Population Criteria.
It creates engineered features based on the Raw Column name, applying the Transformation over the look-back window on the Reference Sequence, i.e. from the Reference Point minus the Look back value up to the Reference Point.
The features are saved to the ADLS Production Feature Catalogue.
Conclusion
Market leaders have recognised that:
Machine learning and AI have demonstrated benefits and value in an organisation
Standardisation & automation of ML drive speed, scale and efficiencies
Teams become productive in their day jobs
Giving them the capacity to innovate
Allowing for breakthroughs and discoveries
Which improves and matures Machine Learning in an organisation
And increases return on investment
CorticAi has helped organisations with initiatives to evolve their Data and ML capabilities. These are typically short, incremental engagements that contribute to a company's long-term technology strategy, allowing them to demonstrate business benefits and value in the process. The fundamentals that underpin the success of these projects are Agile methodologies, cross-functional teams, and automation in the cloud.