Skip to content

Building an ETL pipelineπŸ”€ that extracts data from AWS S3 πŸ—‘ stages them in Redshift πŸ—„πŸ—„ and transforms data into a set of dimensional tables for the analytics team.

Notifications You must be signed in to change notification settings

pratikwatwani/Cloud-Data-Warehousing-with-IaC-DevOps

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

9 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Cloud Data Warehousing with IaC DevOps

Description

Building an ETL pipeline that extracts data from Amazon Web Services S3, stages them in Redshift Cluster, and transforms data into a set of dimensional tables for the analytics teams. The Redshift cluster is created and maintained using the Infrastructure as Code paradigm and policies.

Directory Structure

β”œβ”€β”€ README.md
β”œβ”€β”€ dwh.cfg
β”œβ”€β”€ run.sh
β”œβ”€β”€ analysis
    └── Exploratory Data Analysis.ipynb
    └── Performance Measure.ipynb
β”œβ”€β”€ modules
    └── create_database.py
    └── create_tables.py
    └── sql_queries.py
    └── etl.py
    └── __init__.py

Running instructions

There is a shell script in the root directory that will call upon the required python scripts. Just use the shell file to run the program, logging will aid you in every step of execution.

Run: sh run.sh

Exploratory Data Analysis

Data points

table count
time 16112
staging events 8056
songplays 320
staging songs 14896
users 7770
artists 14896
songs 14896

Queries

1. Streaming by Location (Top 10)

location count
San Francisco-Oakland-Hayward, CA 41
Portland-South Portland, ME 31
Lansing-East Lansing, MI 29
Waterloo-Cedar Falls, IA 20
Tampa-St. Petersburg-Clearwater, FL 18
Sacramento--Roseville--Arden-Arcade, CA 17
Atlanta-Sandy Springs-Roswell, GA 17
Chicago-Naperville-Elgin, IL-IN-WI 15
Lake Havasu City-Kingman, AZ 13
Janesville-Beloit, WI 11

2. Streaming by Platform

platforms count
Macintosh 139
Windows NT 5.1 53
X11 46
Windows NT 6.1 40
Windows NT 6.3 32
iPhone 9
compatible 1

3. Total album length for each artist (Top 10)

artist total_album_minutes
Jean Grae 3006
Freddie Hubbard 2882
Aphex Twin 2657
Enigma 2644
Madonna 2513
Opeth 2479
Mortiis 2445
Herbie Hancock 2421
Simple Minds 2343
Stephan Micus 2298

4. Streaming by week

week count
44 972
45 2982
46 4564
47 4074
48 3520

5. Service type weight

level count
paid 262
free 58

6. Type(Level) of user by Gender

gender level sum
F paid 10972270
F free 318801
M free 386760
M paid 993345

7. Top artists by each location (Top 10)

location artist_name
Atlanta-Sandy Springs-Roswell, GA Dr. Alban
Atlanta-Sandy Springs-Roswell, GA The Smiths
Atlanta-Sandy Springs-Roswell, GA Binary Star
Atlanta-Sandy Springs-Roswell, GA The Human League
Atlanta-Sandy Springs-Roswell, GA The Presets
Atlanta-Sandy Springs-Roswell, GA Limi-T 21
Atlanta-Sandy Springs-Roswell, GA Goldfrapp
Atlanta-Sandy Springs-Roswell, GA Arctic Monkeys
Atlanta-Sandy Springs-Roswell, GA Fergie
Atlanta-Sandy Springs-Roswell, GA Jack Johnson

8. Average album length over years

year average_album_length
0 251.89
1927 185.00
1944 142.00
1952 133.00
1954 234.67

Query Performance

Loading Time Comparison (in seconds)

Table         |    W/ Distribution Style     |     W/O Distribution Style
------------------------------------------------------------------------------
staging_events            15.15                            1.92 
staging_songs            169.99                           181.38 
songs                     0.73                             0.79 
users                     0.65                             0.67 
artists                   0.67                             0.68 
time                      0.51                             0.78 
songplays                 6.08                             1.21 

Execution Time Comparison (in m seconds)

Table         |    W/ Distribution Style     |     W/O Distribution Style
------------------------------------------------------------------------------
QUERY 1                    261                             437
QUERY 2                    269                             415
QUERY 3                    276                             556 
QUERY 4                    256                             4680
QUERY 5                    263                             360
QUERY 6                    264                             508 
QUERY 7                    285                             564
QUERY 8                    251                             413

About

Building an ETL pipelineπŸ”€ that extracts data from AWS S3 πŸ—‘ stages them in Redshift πŸ—„πŸ—„ and transforms data into a set of dimensional tables for the analytics team.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages