Challenge 03 - Data Pipeline Migration


Introduction

WW Importers keeps missing the SLAs for its nightly data load process. The loads start at 1:00 AM each night and take six hours to complete. They must finish by 8:00 AM, but these jobs frequently run longer than planned. In addition, a few critical stakeholders are asking for the data to be delivered even more frequently. Since these business units are the key stakeholders, they have the funding to help replatform the data pipelines.

WW Importers realizes it needs to leverage its new Data Lake to scale the load process and land the data in its Data Warehouse for Stage 3. These data pipelines must follow an ELT (Extract, Load & Transform) pattern so the data can be written to the cloud quickly and the compute used to transform it can be scaled out.

Description

The objective of this lab is to modernize the ETL pipeline that was originally built in SSIS; a detailed breakdown of the current workflow is included below. We need to rebuild this pipeline in Azure, leveraging a scale-out architecture to transform the data. The data flow will include steps to extract the data from the OLTP platform, store it in the Azure Data Lake, and bulk ingest it into Azure Synapse Analytics. It will run on a nightly basis and will need to leverage Azure Data Factory or Synapse Pipelines as the job orchestration and scheduling tool.
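The bulk-ingest step can be handled with the COPY statement in a Synapse dedicated SQL pool (or by a Copy activity in the pipeline itself). Below is a minimal sketch, assuming a hypothetical storage account, container, and Parquet files produced by the extract step; adjust the target table, path, and authentication to match your environment.

```sql
-- Minimal sketch: bulk ingest staged Parquet files from the Data Lake into a
-- Synapse staging table. The storage account, container, and folder path are
-- placeholders; the target table follows the [Integration].[[Table]_Staging] pattern.
COPY INTO [Integration].[City_Staging]
FROM 'https://<storageaccount>.dfs.core.windows.net/<container>/wwi/city/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')  -- assumes the workspace identity can read the lake
);
```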

Current SSIS Workflow

Below is a summary of each of the tasks in the existing SSIS package. Note that we will be able to reuse the existing scripts for all of these tasks except for step 6. An illustrative T-SQL sketch of the sequence for one table follows the list.

  1. The first step of the pipeline is to retrieve the “ETL Cutoff Date”. This date can be found in the [Integration].[Load_Control] table in the Azure Synapse SQL Pool that should have been created as part of Challenge 1 (this step should have already been recreated in Challenge 2)
  2. The next step ensures that the [Dimension].[Date] table is current by executing the [Integration].[PopulateDateDimensionForYear] procedure in the Synapse SQL Pool
  3. Next, the [Integration].[GetLineageKey] procedure is executed to create a record for each activity in the [Integration].[Lineage Key] table
  4. This step truncates the [Integration].[[Table]_Staging] tables to prepare them for new data
  5. This step retrieves the cutoff date for the last successful load of each table from the [Integration].[ETL Cutoffs] table
  6. New data is now read from the OLTP source (using the [Integration].[Get[Table]Updates] procedures) and copied into the [Integration].[[Table]_Staging] tables in the target DW
  7. Finally, the staged data is merged into the [Dimension] and [Fact] tables in the target DW by executing the [Integration].[MigrateStaged[Table]] stored procedures in the target DW
    • NOTE: As part of this step, surrogate keys are generated for new records in Dimension tables (tables in the [Dimension] schema), so Dimension tables must be loaded before Fact tables to maintain data integrity
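To make the sequence concrete, the sketch below strings the calls together for the [Dimension].[City] table. In Azure Data Factory or Synapse Pipelines, each step would typically map to its own Lookup, Stored procedure, or Copy activity rather than a single script; the column names, procedure parameters, and City-specific procedure names used here are assumptions based on the patterns above and may differ in your database.

```sql
-- Illustrative T-SQL sequence for the [Dimension].[City] load.
-- Names marked "assumed" are hypothetical and should be checked against your schema.

-- Step 1: retrieve the ETL cutoff date from the load control table
DECLARE @CutoffTime DATETIME2;
SELECT @CutoffTime = MAX([Load_Date])                   -- assumed column name
FROM   [Integration].[Load_Control];

-- Step 2: ensure the [Dimension].[Date] table is current
EXEC [Integration].[PopulateDateDimensionForYear] 2024;  -- assumed parameter (year to populate)

-- Step 3: create a lineage record for this run
EXEC [Integration].[GetLineageKey] N'City', @CutoffTime; -- assumed parameters

-- Step 4: truncate the staging table
TRUNCATE TABLE [Integration].[City_Staging];

-- Step 5: look up the last successful cutoff for this table
DECLARE @LastCutoff DATETIME2;
SELECT @LastCutoff = [Cutoff Time]                       -- assumed column name
FROM   [Integration].[ETL Cutoffs]
WHERE  [Table Name] = N'City';                           -- assumed column name

-- Step 6: replaced in this challenge by a Data Factory / Synapse Pipelines Copy
-- activity that extracts rows changed since @LastCutoff from the OLTP source,
-- lands them in the Data Lake, and bulk ingests them into [Integration].[City_Staging]
-- (see the COPY INTO sketch in the Description above).

-- Step 7: merge the staged rows into [Dimension].[City]
EXEC [Integration].[MigrateStagedCityData];              -- assumed name, per the MigrateStaged[Table] pattern
```

In the orchestrated pipeline, steps 1 through 3 run once per load while steps 4 through 7 repeat per table, with Dimension tables processed before Fact tables as noted above.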

NOTE: This challenge is intended to build upon the previous 2 challenges, and you should try to reuse content wherever possible

Create a data pipeline for the [Dimension].[City] table considering the logic above. See the Tips section for hints.

(Solution diagram: the proposed Azure data pipeline architecture.)

Success Criteria

Learning Resources

Tips

Advanced Challenges (Optional)

Too comfortable? Eager to do more? Try these additional challenges!