Cryptolytics: CoinCap Data Extraction and Analysis Pipeline

Introduction

In today's data-driven world, data plays a pivotal role in shaping decisions within organizations. The sheer volume of data generated requires data engineers to centralize it efficiently, clean and model it to fit specific business requirements, and make it easily accessible to data consumers.

The aim of this project is to build an automated data pipeline that retrieves cryptocurrency data from the CoinCap API, processes and transforms it for analysis, and presents key metrics on a near-real-time* dashboard. The dashboard provides users with valuable insights into the dynamic cryptocurrency market.

*near-real-time because the data is loaded from the source and processed every 5 minutes rather than instantly

Dataset

The data used in this project was obtained from the CoinCap API, which provides real-time pricing and market activity for over 1,000 cryptocurrencies.

Tools & Technologies used:

  • Cloud: Google Cloud Platform (GCP)
  • Infrastructure as Code (IaC): Terraform
  • Containerization: Docker, Docker Compose
  • Workflow Orchestration: Apache Airflow
  • Data Lake: Google Cloud Storage (GCS)
  • Data Warehouse: BigQuery
  • Data Transformation: Data Build Tool (DBT)
  • Visualization: Looker Studio
  • Programming Language: Python (batch processing), SQL (data transformation)

Data Architecture

(Architecture diagram: full data pipeline)

Project Map:

  1. Provisioning Resources: Terraform is used to set up the necessary GCP resources, including a Compute Engine instance, a GCS bucket, and BigQuery datasets
  2. Data Extraction: Every 5 minutes, JSON data is retrieved from the CoinCap API and converted to Parquet format for optimized storage and processing (see the sketch after this list)
  3. Data Loading: The converted data is stored in Google Cloud Storage (the data lake) and then loaded into BigQuery (the data warehouse)
  4. Data Transformation: DBT is connected to BigQuery to transform the raw data, after which the processed data is loaded back into BigQuery. The entire ELT process is automated and orchestrated with Apache Airflow
  5. Reporting: The transformed dataset is used to create an analytical report and visualizations in Looker Studio
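
To make steps 2 and 3 concrete, below is a minimal Python sketch of the extract-and-load flow. It assumes the CoinCap /v2/assets endpoint and placeholder names for the GCS bucket (crypto-data-lake) and BigQuery table (your-project.coins_dataset.raw_coins); the actual task code in this repository may differ.

import requests
import pandas as pd                      # Parquet output requires pyarrow
from google.cloud import bigquery, storage


def extract_to_parquet(local_path="coins.parquet"):
    # Fetch the latest market data for all assets from the CoinCap API
    response = requests.get("https://api.coincap.io/v2/assets", timeout=30)
    response.raise_for_status()
    assets = response.json()["data"]

    # Flatten the JSON payload into a dataframe and write it out as Parquet
    pd.DataFrame(assets).to_parquet(local_path, index=False)
    return local_path


def load_to_gcs_and_bigquery(local_path,
                             bucket_name="crypto-data-lake",                    # placeholder bucket
                             table_id="your-project.coins_dataset.raw_coins"):  # placeholder table
    # Upload the Parquet file to the GCS data lake
    blob = storage.Client().bucket(bucket_name).blob("raw/coins.parquet")
    blob.upload_from_filename(local_path)

    # Load the file from the data lake into the BigQuery data warehouse
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition="WRITE_APPEND",
    )
    client.load_table_from_uri(
        f"gs://{bucket_name}/raw/coins.parquet", table_id, job_config=job_config
    ).result()


if __name__ == "__main__":
    load_to_gcs_and_bigquery(extract_to_parquet())

In the pipeline itself, these two functions roughly correspond to the extraction and loading tasks that Airflow triggers every 5 minutes.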

Dashboard

Disclaimer: This is only a pet project. Please do not use this dashboard for actual financial decisions. T for thanks!


How to Replicate the Data Pipeline

Below are the steps to reproduce this pipeline in the cloud. Note that Windows/WSL/Git Bash was used locally for this project.

1. Set up Google Cloud Platform (GCP)

  • If you don't have a GCP account already, create a free trial account (you get $300 in free credits) by following the steps in this guide
  • Create a new project on GCP (see guide) and take note of your Project ID, as it will be needed at later stages of the project
  • Next, enable the necessary APIs for the project, create and configure a service account, and generate an auth key. While all of this can be done via the GCP web UI (see), Terraform will be used to run these processes (somebody say DevOps, hehehe), so skip this for now.
  • If you haven't already, download and install the Google Cloud SDK for local setup. You can follow this installation guide.
    • You might need to restart your system before gcloud can be used from the CLI. Check that the installation was successful by running gcloud -v in your terminal to view the installed gcloud version
    • Run gcloud auth login to authenticate the Google Cloud SDK with your Google account

2. Generate the SSH Key Pair Locally

The SSH key will be used to connect to the GCP virtual machine from your local (Linux) terminal. In your terminal, run the command:
ssh-keygen -t rsa -f ~/.ssh/<whatever-you-want-to-name-your-key> -C <the-username-that-you-want-on-your-VM> -b 2048

ex: ssh-keygen -t rsa -f ~/.ssh/ssh_key -C aayomide -b 2048

3. Provision the Needed GCP Resources via Terraform

Follow the Terraform how-to-reproduce guide

4. Create an SSH Connection to the newly created VM (on your local machine)

Create a file called config within the .ssh directory in your home folder and paste the following information:

HOST <vm-name-to-use-when-connecting>
    Hostname <external-ip-address>   # check the terraform output in the CLI or navigate to GCP > Compute Engine > VM instances.
    User <username used when running the ssh-keygen command>  # it is also the same as the gce_ssh_user
    IdentityFile <absolute-path-to-your-private-ssh-key-on-local-machine>
    LocalForward 8080 localhost:8080     # forward traffic from local port 8080 to port 8080 on the remote server where Airflow is running
    LocalForward 8888 localhost:8888     # forward traffic from local port 8888 to port 8888 on the remote server where Jupyter Notebook is running

For example:

HOST cryptolytics_vm
    Hostname 35.225.33.44
    User aayomide
    IdentityFile c:/Users/aayomide/.ssh/ssh_key
    LocalForward 8080 localhost:8080
    LocalForward 8888 localhost:8888

Afterward, connect to the virtual machine via your local terminal by running ssh cryptolytics_vm.

You can also access the VM via VS Code, as shown here

Note: the external IP address changes each time you stop and restart the VM instance

5. Set up DBT (Data Build Tool)

Follow the dbt how-to-reproduce guide

6. Orchestrate the Dataflow with Airflow

Follow the airflow how-to-reproduce guide
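
As a reference point, here is a minimal sketch of how a DAG might schedule the pipeline every 5 minutes; the DAG ID, task name, and the fetch_and_load callable are illustrative placeholders rather than the actual DAG in this repository.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_and_load():
    # Placeholder for the actual extract/load logic (CoinCap API -> GCS -> BigQuery)
    pass


with DAG(
    dag_id="coincap_elt",                # illustrative DAG name
    schedule_interval="*/5 * * * *",     # run every 5 minutes (hence "near-real-time")
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    extract_and_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=fetch_and_load,
    )

In practice, downstream tasks (e.g. triggering the DBT transformations) would be chained after this task with the >> operator.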

7. Create a Report in Looker Studio:

  • Log in to Looker Studio using your Google account
  • Click on "Blank report" and select the "BigQuery" data connector
  • Choose your data source (project -> dataset), which in this case is "prod_coins_dataset"
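
Optionally, before connecting Looker Studio, you can confirm that DBT materialized the production tables by listing them with the BigQuery Python client; the project ID below is a placeholder.

from google.cloud import bigquery

# List the tables in the prod dataset to confirm the DBT models were built
client = bigquery.Client(project="your-project-id")   # placeholder project ID
for table in client.list_tables("prod_coins_dataset"):
    print(table.table_id)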

Further Improvements

  • Use Apache Kafka to stream the data in real time
  • Perform more advanced data transformations using DBT or even PySpark
  • Implement more robust error handling with try-except blocks and write more robust data quality tests in DBT
  • Add pipeline alerting and monitoring

Resources
