- Establish an SSH connection to the compute instance:

  `ssh beatlytica-airflow-instance`
- Clone the Git repository:

  ```bash
  git clone https://github.com/The-Algorist/Beatlytica.git && \
  cd Beatlytica
  ```
- Install Anaconda, Docker, and Docker Compose:

  ```bash
  bash ~/Beatlytica/scripts/vm_setup.sh && \
  exec newgrp docker
  ```
- Transfer the Google Cloud service account JSON file to the VM:
  - Use `sftp` to transfer the file.
  - Move the service account JSON file from your local machine into the `~/.google/credentials/` directory on the VM. Make sure it is named `google_credentials.json`, else the DAGs will fail!
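  A minimal sketch of the transfer, assuming the key was downloaded locally as `my-service-account.json` and reusing the SSH alias from the first step (adjust both to your setup):

  ```bash
  # Make sure the target directory exists on the VM
  ssh beatlytica-airflow-instance "mkdir -p ~/.google/credentials"

  # Upload the key from your local machine with the exact name the DAGs expect
  echo "put my-service-account.json .google/credentials/google_credentials.json" \
    | sftp beatlytica-airflow-instance
  ```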
- Set the environment variables (same as the Terraform values):
  - GCP Project ID
  - Cloud Storage Bucket Name

  ```bash
  export GCP_PROJECT_ID=project-id
  export GCP_GCS_BUCKET=bucket-name
  ```

  Note: You will have to set these environment variables every time you create a new shell session.
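  One way to avoid re-exporting them in every session is to append them to `~/.bashrc` (a sketch; substitute your real values):

  ```bash
  # Persist the variables for future shell sessions
  echo 'export GCP_PROJECT_ID=project-id' >> ~/.bashrc
  echo 'export GCP_GCS_BUCKET=bucket-name' >> ~/.bashrc
  source ~/.bashrc
  ```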
- Start Airflow. This will take a few minutes:

  `bash ~/Beatlytica/scripts/airflow_startup.sh && cd ~/Beatlytica/airflow`
- Airflow should be available on port `8080` a couple of minutes after the above setup is complete. Log in with the default username and password, both `airflow`.
- Airflow will be running in detached mode. To follow the logs from Docker, run:

  `docker-compose logs --follow`
- To stop Airflow:

  `docker-compose down`
The setup has two DAGs:

- `load_songs_dag`
  - Trigger this first, and only once, to load a one-time songs file into BigQuery.
- `beatlytica_dag`
  - Trigger this after `load_songs_dag`, to make sure the songs table is available for the transformations. (A CLI sketch for triggering both follows this list.)
  - This DAG runs hourly at the 5th minute and performs the transformations to create the dimensions and fact.
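Both DAGs can be triggered from the Airflow UI, or from the command line inside the running containers. A minimal sketch, assuming the webserver service in the compose file is named `airflow-webserver` (adjust if yours differs):

```bash
# One-time songs load
docker-compose exec airflow-webserver airflow dags trigger load_songs_dag

# After it finishes, kick off the hourly transformation DAG
docker-compose exec airflow-webserver airflow dags trigger beatlytica_dag
```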
DAG Flow:

- We first create an external table for the data that was received in the past hour.
- We then create an empty table to which our hourly data will be appended. Usually, this only has an effect on the first run.
- Then we insert (append) the hourly data into the table.
- And then delete the external table.
- Finally, we run the dbt transformation to create our dimensions and facts.
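For reference, the rough BigQuery SQL behind these steps, shown as `bq` CLI calls. The dataset and table names, the Parquet format, and the GCS path are placeholders; the actual DAG uses Airflow's BigQuery operators and the project's own naming:

```bash
# 1. External table over the files received in the past hour (placeholder path)
bq query --use_legacy_sql=false "
  CREATE OR REPLACE EXTERNAL TABLE staging.streams_external
  OPTIONS (format = 'PARQUET', uris = ['gs://${GCP_GCS_BUCKET}/streams/2022-01-01-10/*']);"

# 2. Target table, created only if it does not exist yet (matters only on the first run)
bq query --use_legacy_sql=false "
  CREATE TABLE IF NOT EXISTS staging.streams
  AS SELECT * FROM staging.streams_external LIMIT 0;"

# 3. Append the hourly data
bq query --use_legacy_sql=false "
  INSERT INTO staging.streams
  SELECT * FROM staging.streams_external;"

# 4. Drop the external table
bq query --use_legacy_sql=false "
  DROP EXTERNAL TABLE staging.streams_external;"

# 5. dbt then builds the dimensions and fact on top of staging.streams
```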
The transformations happen using dbt, which is triggered by Airflow. The dbt lineage should look something like this:

Dimensions:

- `dim_artists`
- `dim_songs`
- `dim_datetime`
- `dim_location`
- `dim_users`

Facts:

- `fact_streams`
Partitioning:
- Data is partitioned on the timestamp column by hour to provide faster data updates for a dashboard that shows data for the last few hours.
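As an illustration, hourly partitioning in BigQuery looks roughly like this. In practice dbt creates the real table; the dataset, table, and column names here (`prod.fact_streams`, `ts`) are placeholders:

```bash
# Fact table partitioned by hour on its timestamp column
bq query --use_legacy_sql=false "
  CREATE TABLE IF NOT EXISTS prod.fact_streams
  PARTITION BY TIMESTAMP_TRUNC(ts, HOUR)
  AS SELECT * FROM staging.streams LIMIT 0;"

# A dashboard query for the last few hours then scans only the recent partitions
bq query --use_legacy_sql=false "
  SELECT COUNT(*) AS streams_last_3h
  FROM prod.fact_streams
  WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR);"
```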
Finally, we create a `wide_stream` view to aid dashboarding.
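A rough sketch of what such a view could look like in BigQuery SQL; the join keys and columns are illustrative, not the project's actual schema:

```bash
bq query --use_legacy_sql=false "
  CREATE OR REPLACE VIEW prod.wide_stream AS
  SELECT
    f.ts,
    s.title AS song_title,
    a.name  AS artist_name,
    u.user_name,
    l.city,
    l.state
  FROM prod.fact_streams f
  JOIN prod.dim_songs    s ON f.song_id     = s.song_id
  JOIN prod.dim_artists  a ON f.artist_id   = a.artist_id
  JOIN prod.dim_users    u ON f.user_id     = u.user_id
  JOIN prod.dim_location l ON f.location_id = l.location_id;"
```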