Data Pipeline with Apache Airflow

Project overview

This repository loads JSON files from an AWS S3 bucket into an AWS Redshift cluster using Apache Airflow, with the main emphasis on creating custom Airflow operators.

Assignment

A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to its data warehouse ETL pipelines and has concluded that the best tool to achieve this is Apache Airflow.

The goal is to create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. The Sparkify team has also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and wants to run tests against the datasets after the ETL steps have been executed to catch any discrepancies.
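
As an illustration of the kind of post-ETL test described above, here is a minimal sketch of a row-count check. It is not the operator from this repository: it uses an in-memory SQLite database as a stand-in for Redshift so that it runs anywhere, and the table names (users, songs) and the function name check_has_rows are hypothetical.

```python
import sqlite3


def check_has_rows(conn, tables):
    """Return {table: row_count}; raise ValueError if any table is empty."""
    counts = {}
    for table in tables:
        # Table names are assumed to come from trusted pipeline config.
        row = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        counts[table] = row[0]
        if row[0] < 1:
            raise ValueError(f"Data quality check failed: {table} has no rows")
    return counts


def demo_connection():
    """Build a tiny stand-in warehouse: one populated and one empty table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER)")
    conn.execute("CREATE TABLE songs (song_id TEXT)")
    conn.execute("INSERT INTO users VALUES (1)")
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(check_has_rows(conn, ["users"]))
    try:
        check_has_rows(conn, ["songs"])
    except ValueError as err:
        print(err)
```

In an Airflow task the same idea would typically run through a database hook, and raising an exception is what marks the task as failed.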

The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.
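
Loading JSON from S3 into Redshift is usually done with Redshift's COPY command. The sketch below only assembles such a statement as a string. The function name build_copy_sql, the table name, the bucket path, and the credentials are placeholders, and the queries actually used in this repository may differ.

```python
def build_copy_sql(table, s3_path, access_key, secret_key, region, json_option="auto"):
    """Assemble a Redshift COPY statement for JSON files stored in S3.

    json_option is either 'auto' or the S3 path of a JSONPaths file.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"ACCESS_KEY_ID '{access_key}'\n"
        f"SECRET_ACCESS_KEY '{secret_key}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    print(
        build_copy_sql(
            table="staging_events",
            s3_path="s3://example-bucket/log_data",
            access_key="<ACCESS_KEY_ID>",
            secret_key="<SECRET_ACCESS_KEY>",
            region="us-west-2",
        )
    )
```

Keeping the statement in a small builder like this makes it easy to reuse one staging task for both the log files and the song metadata by changing only the arguments.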

Architecture of the project

Overview of the files in the repository

create_tables.sql contains the SQL queries that create the tables in the Redshift cluster
The 'dags' folder contains the dag.py file
The 'plugins' folder contains the 'helpers' and 'operators' folders, which hold additional SQL queries and custom operators respectively
The 'img' folder contains the images used in this README

Running the project

Pre-requisites

In your AWS account:
- prepare an S3 bucket with the dataset (you can use this dataset or a subset of it)
- prepare a Redshift cluster for the output (note that it should be in the same region as the S3 bucket)
- prepare an IAM User that can communicate with both the S3 bucket (read permissions) and the Redshift cluster (full permissions)
- install Apache Airflow and add your User and Cluster details as Connections
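
Connections can be added in the Airflow UI. As an alternative, Airflow also reads connections from environment variables named AIRFLOW_CONN_<CONN_ID> whose value is a connection URI. The sketch below builds such URIs. The connection ids (aws_credentials, redshift), the host name, and the credentials are placeholders, and treating Redshift as a Postgres-type connection is an assumption rather than something this repository states.

```python
from urllib.parse import quote


def connection_uri(conn_type, login, password, host="", port=None, schema=""):
    """Build a URI of the form conn-type://login:password@host:port/schema."""
    # Percent-encode credentials so characters such as '/' or '+' in an
    # AWS secret key do not break URI parsing.
    uri = f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}@{host}"
    if port is not None:
        uri += f":{port}"
    if schema:
        uri += f"/{schema}"
    return uri


if __name__ == "__main__":
    # Placeholder values only; export the printed lines in the shell
    # that starts the Airflow scheduler and webserver.
    aws = connection_uri("aws", "<ACCESS_KEY_ID>", "<SECRET_ACCESS_KEY>")
    redshift = connection_uri(
        "postgres", "awsuser", "<PASSWORD>",
        host="<cluster-endpoint>", port=5439, schema="dev",
    )
    print(f"export AIRFLOW_CONN_AWS_CREDENTIALS='{aws}'")
    print(f"export AIRFLOW_CONN_REDSHIFT='{redshift}'")
```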

How to run the project

Run the dag.py file in the terminal. You can monitor the progress manually in the Airflow UI.
