
waqarg2001/Youtube-Data-Pipeline-AWS



Leveraging AWS cloud services, an ETL pipeline transforms YouTube video statistics data. The data is downloaded from Kaggle, uploaded to an S3 bucket, and cataloged with AWS Glue for querying with Amazon Athena. AWS Lambda and AWS Glue convert the data to Parquet format and store it in a cleansed S3 bucket. Amazon QuickSight then visualizes the materialized data, providing insights into YouTube video performance.

built-with-love powered-by-coffee cc-nc-sa

Overview • Tools • Architecture • Dashboard • Screenshots • Support • License

Overview

This project uses AWS cloud services to build an ETL pipeline for processing YouTube video statistics data. The data, available here, is downloaded from Kaggle and uploaded to an S3 bucket. AWS Glue catalogs the data, enabling querying with Amazon Athena. The pipeline processes both JSON and CSV data, converting each into Parquet format: JSON data is transformed by AWS Lambda functions using an AWS Data Wrangler layer, while CSV data is processed through visual ETL jobs in AWS Glue.
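To make the JSON branch concrete, here is a minimal sketch of such a Lambda handler, assuming an S3 PUT trigger on the raw bucket. The output path, Glue database, and table names are illustrative placeholders, not the repository's exact configuration.

```python
import urllib.parse

import awswrangler as wr
import pandas as pd

# Placeholder targets -- substitute your own cleansed bucket and Glue catalog names.
OUTPUT_PATH = "s3://<cleansed-bucket>/youtube/reference_data/"
GLUE_DATABASE = "db_youtube_cleansed"
GLUE_TABLE = "reference_data"


def lambda_handler(event, context):
    # Triggered by an S3 PUT event on the raw bucket.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # The Kaggle category JSON files nest their rows under an "items" array.
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_items = pd.json_normalize(df_raw["items"])

    # Write Parquet to the cleansed bucket and update the Glue Data Catalog,
    # which makes the table immediately queryable from Athena.
    return wr.s3.to_parquet(
        df=df_items,
        path=OUTPUT_PATH,
        dataset=True,
        database=GLUE_DATABASE,
        table=GLUE_TABLE,
        mode="append",
    )
```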

Data is first stored in a raw S3 bucket, then cleaned and organized in a cleansed bucket, and finally joined and stored in an analytics (materialized) bucket. Automated ETL jobs run daily via AWS Glue workflows, keeping the processed data up to date. A simple QuickSight dashboard visualizes the materialized data, providing insights into YouTube video performance across different regions. This setup gives a scalable, efficient data processing workflow that supports detailed analysis and reporting.
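For the join step, a Glue PySpark job along these lines produces the materialized view. The S3 paths, join columns, and partition key below are assumptions for illustration, not the repository's exact script.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job bootstrap.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Placeholder paths -- point these at your own cleansed and analytics buckets.
stats = spark.read.parquet("s3://<cleansed-bucket>/youtube/raw_statistics/")
categories = spark.read.parquet("s3://<cleansed-bucket>/youtube/reference_data/")

# Join per-video statistics to category reference data; the column names
# (category_id on the stats side, id on the reference side) are assumed.
materialised = stats.join(categories, stats.category_id == categories.id, "inner")

# Persist the joined view, partitioned by region for efficient Athena scans.
materialised.write.mode("overwrite").partitionBy("region").parquet(
    "s3://<analytics-bucket>/youtube/materialised_view/"
)

job.commit()
```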

The repository directory structure is as follows:

├── assets/                        <- Assets for the repo.
│   └── (Architecture and QuickSight dashboard images)
│
├── data/                          <- Data used and processed by the project.
│   ├── raw/                       <- Raw data files (not included here due to large file sizes).
│   ├── cleansed/                  <- Cleansed data files.
│   └── analytics/                 <- Materialized view for analytics and reporting.
│
├── docs/                          <- Documentation for the project.
│   └── solution methodology.pdf   <- Detailed project documentation.
│
├── scripts/                                       <- Python scripts for the ETL pipeline.
│   ├── etl_pipeline_csv_to_parquet.py             <- CSV-to-Parquet Glue pipeline script.
│   ├── lambda_function.py                         <- Lambda function code.
│   └── etl_pipeline_materialised_view.py          <- Materialised-view Glue pipeline script.
│
├── README.md                      <- The top-level README for developers using this project.

Tools

To build this project, the following tools were used:

  • AWS S3
  • AWS Glue
  • AWS Lambda/Layers
  • Amazon Athena
  • Amazon QuickSight
  • AWS Data Wrangler
  • Amazon CloudWatch
  • AWS IAM
  • Python
  • Pandas
  • Spark
  • Git

Architecture

The following diagram shows the architecture of the project.

Dashboard

Access the simplified dashboard from here.

Screenshots

The following are project execution screenshots from the AWS portal.


Support

If you have any doubts, queries, or suggestions, please connect with me on any of the following platforms:

Linkedin Badge Gmail Badge

License

by-nc-sa

This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
