Data Preparation for Business Intelligence Versus Machine Learning

Automated machine learning (autoML) is an ideal innovation for citizen data scientists (business intelligence professionals, data analysts, and data engineers) looking to advance their careers. Simple autoML tools are already on the market that citizen data scientists can start using immediately. However, a significant knowledge gap remains around using this technology responsibly: autoML lowered the technical barriers to machine learning, but not the educational ones.

While you don’t need to be a data scientist to get immediate predictive insights and some business value, you do need to understand when it is appropriate to use autoML, how to approach these projects properly, and the steps required to prepare data correctly. In this article, I’ll share a few tips on the key differences between data preparation for business intelligence versus machine learning.


Dimensional Models for Business Intelligence

Dimensional modeling design patterns born in the 1990s still provide data models that report accurately and efficiently over time. Most SQL Server and Power BI reporting projects today use dimensional models. These models, however, are not designed for machine learning: the purpose of a dimensional design is to aggregate and summarize data for reporting query performance.

A dimensional model, also commonly referred to as a star or snowflake schema, contains facts surrounded by descriptive data called dimensions. Facts contain numerical measures of what you track, such as sales or user counts, that are additive or semi-additive in nature. Fact tables also contain the keys that link to the associated dimension tables.
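To make that shape concrete, here is a minimal sketch in Python with pandas, using hypothetical table and column names, of a fact table holding additive measures plus foreign keys into two dimension tables:

```python
import pandas as pd

# Hypothetical dimension tables: descriptive attributes keyed by surrogate keys
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_name": ["Acme Corp", "Globex"],
    "region": ["East", "West"],
})
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "calendar_date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})

# Fact table: additive measures plus foreign keys into each dimension
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102],
    "customer_key": [1, 2, 1],
    "sales_amount": [1200.00, 450.00, 300.00],
    "units_sold": [10, 4, 2],
})
```

Reporting queries join the fact table to its dimensions and aggregate the measures; that is exactly the workload this design optimizes for.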


Machine Learning Datasets

Unlike other analytical techniques, machine learning algorithms rely on carefully curated data sources. You’ll need to organize your data into one wide analytical row of input variables and outcome metrics that describes an entire lifetime of events. Let’s walk through how to amalgamate data into a machine learning-friendly format that accurately reflects business processes and outcomes.

Machine learning datasets summarize a lifetime of values in a single row with many columns, all in relation to an outcome to predict. Where columns in business intelligence might be called dimensions, columns in machine learning are referred to as features. While a business intelligence dimensional model might hold many rows for a given outcome, source data for machine learning needs to be pivoted into one row per prediction.
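As a rough illustration, assuming a hypothetical order-events table with many rows per customer, here is that pivot from a lifetime of events down to one feature row per prediction:

```python
import pandas as pd

# Hypothetical order-level events: many rows per customer
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "order_amount": [120.0, 80.0, 45.0, 60.0, 200.0],
})

# Collapse a lifetime of events into one feature row per customer
features = orders.groupby("customer_id").agg(
    lifetime_spend=("order_amount", "sum"),
    avg_order_value=("order_amount", "mean"),
    order_count=("order_amount", "count"),
).reset_index()
print(features)  # one row per customer, many feature columns
```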

Machine learning datasets look like a “flattened” table. If your data is stored in a dimensional data warehouse or a transactional database, you will need to use record identifiers or primary keys to join fields from multiple tables into a single unified, flattened “view”. Your view will contain an outcome metric along with input predictor variables, collected at a level of analytical granularity that you can make actionable decisions upon.
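Here is a minimal sketch of that flattening step, with hypothetical table names and a hypothetical churn outcome, joining everything on the customer primary key:

```python
import pandas as pd

# Hypothetical warehouse tables, joined on the customer primary key
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "region": ["East", "West"],
    "segment": ["SMB", "Enterprise"],
})
features = pd.DataFrame({
    "customer_id": [1, 2],
    "lifetime_spend": [260.0, 245.0],
    "order_count": [3, 2],
})
outcomes = pd.DataFrame({"customer_id": [1, 2], "churned": [0, 1]})

# One unified, flattened "view": input predictors plus the outcome metric
flattened = (
    customers
    .merge(features, on="customer_id", how="left")
    .merge(outcomes, on="customer_id", how="left")
)
```

Each row of the result is one unit of prediction: the predictors on the left, the outcome on the right.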

For many outcome metrics, data is captured at various business process steps in multiple data sources. A sales process might have data in a CRM, an email marketing program, an Excel spreadsheet, and an accounting system. If that is the case, you will want to identify the fields in those systems that can relate, join, or blend the different data sources together.
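For example, here is a sketch of blending records from two hypothetical systems on a shared email field; in practice the join key usually needs normalizing before records will actually match:

```python
import pandas as pd

# Hypothetical records from two systems keyed by email address
crm = pd.DataFrame({
    "email": ["Ann@Example.com", "bo@example.com"],
    "lead_source": ["Webinar", "Referral"],
})
billing = pd.DataFrame({
    "email": ["ann@example.com", "BO@EXAMPLE.COM"],
    "total_invoiced": [5000.0, 12000.0],
})

# Normalize the shared field in both systems before joining
for df in (crm, billing):
    df["email"] = df["email"].str.strip().str.lower()

blended = crm.merge(billing, on="email", how="inner")
```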

If you are trying to capture changes in data over a time period, check whether your data source keeps only the current-state values of a record. Most data warehouses are designed to save different values of a record over time and do not overwrite historical values with current ones. Transactional application data sources, like Salesforce for instance, contain only the current state of a record. If you want a prior value, you need to have stored a snapshot of the historical data or kept the prior value in custom fields on the current record.
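A common workaround is to stamp each extract with a date and append it rather than overwrite, so prior values survive. A small sketch with hypothetical data:

```python
import pandas as pd

# Two hypothetical daily extracts of the same record from a
# current-state-only source (e.g., a CRM account table)
day1 = pd.DataFrame({"account_id": [101], "plan": ["Basic"]})
day2 = pd.DataFrame({"account_id": [101], "plan": ["Pro"]})

# Stamp and append each extract instead of overwriting, so the
# prior value ("Basic") survives the change to "Pro"
history = pd.concat([
    day1.assign(snapshot_date=pd.Timestamp("2024-01-01")),
    day2.assign(snapshot_date=pd.Timestamp("2024-01-02")),
], ignore_index=True)
```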

After identifying the data sources and fields behind your outcome metric’s business process, select any fields (predictor variables) that may directly affect the outcome. In doing so, ensure that the variable data is clean and consistent. The order and meaning of input predictor variables should remain the same from record to record. Inconsistent data formats, “dirty data”, and outliers can undermine the quality of analytical findings.
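As a small illustration of that cleanup, assuming a hypothetical predictor column with inconsistent category values and an extreme outlier:

```python
import pandas as pd

# Hypothetical predictor columns: inconsistent formats plus an outlier
df = pd.DataFrame({
    "state": ["WA", "wa", "Washington", "WA"],
    "deal_size": [1000.0, 1200.0, 950.0, 9_000_000.0],
})

# Standardize categorical values so meaning stays consistent across records
df["state"] = df["state"].str.strip().str.upper().replace({"WASHINGTON": "WA"})

# Flag extreme outliers (1.5x IQR rule) for review instead of training on them
q1, q3 = df["deal_size"].quantile([0.25, 0.75])
iqr = q3 - q1
df["deal_size_outlier"] = ~df["deal_size"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```

Flagging rather than silently deleting keeps the decision about outliers explicit and reviewable.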


Don’t Use the Wrong Data Models

In the traditional BI to self-service BI evolution, I saw numerous reporting data messes generated by non-technical users with good intentions. Usually I got the call from a line-of-business sponsor who realized their team needed professional help to get back on track because reports were no longer accurate or had become difficult to maintain.

Over time, bad data models can become expensive, hard-to-manage problems. Even if you don’t immediately need a dimensional design for Power BI reports, consider the bigger picture. Those well-intentioned novice mistakes become a far more expensive problem in machine learning projects.

In one real-world example, simply tuning machine learning models with correctly prepared data delivered a 9.5% lift in accuracy, which translated into a $400M annual revenue improvement. How much would incorrectly prepared data cost you? Don’t find out the hard way. Do it right the first time.