Twitter Streaming Pipeline with GCP

This project connects to the Twitter API and streams tweets into BigQuery, with Google Cloud Storage acting as a dead-letter destination. Data flows from Twitter into Google Cloud Pub/Sub, passes through Apache Beam (using the DirectRunner or the DataflowRunner), and lands in BigQuery; any tweet that cannot be processed into BigQuery for whatever reason is written to Google Cloud Storage instead. The processing code creates two Apache Beam pipelines, which output data into two BigQuery datasets:

  • An ETL pipeline that performs a simple tweet count per minute
  • An ELT pipeline that streams raw tweets into BigQuery for further transformation and analysis as needed
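
For orientation, the per-minute count pipeline could look roughly like the sketch below. This is a minimal illustration, assuming a Pub/Sub subscription and BigQuery table with placeholder names (`YOUR_PROJECT`, `tweets-sub`, `tweets_dataset.tweet_counts`); the repo's actual pipeline code, schema, and windowing may differ, and the dead-letter branch to GCS is omitted for brevity.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def run():
    # streaming=True is required for an unbounded Pub/Sub source;
    # pass --runner=DataflowRunner on the command line to run on Dataflow.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            # Hypothetical subscription name; substitute your own.
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/YOUR_PROJECT/subscriptions/tweets-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WindowPerMinute" >> beam.WindowInto(FixedWindows(60))
            | "PairWithOne" >> beam.Map(lambda _: ("tweets", 1))
            | "CountPerWindow" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"label": kv[0], "tweet_count": kv[1]})
            # Hypothetical table; it must already exist (CREATE_NEVER).
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "YOUR_PROJECT:tweets_dataset.tweet_counts",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```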

Terraform To Do:

  • Create a service account with the required roles and enable the relevant Google Cloud services
  • Create a Pub/Sub topic
  • Create a BigQuery dataset
  • Create GCS bucket

Prerequisites:

  • Sign up for a Twitter Developer account and create an App
  • Generate a Bearer Token from your App in the Twitter Developer Portal
  • Create an account on GCP and create a project
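
Once the Bearer Token is generated, a quick sanity check (not part of this repo) is to call the Twitter API v2 sampled-stream endpoint and confirm a 200 response:

```python
import os

import requests

# Assumes you exported the token as TWITTER_BEARER_TOKEN.
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]

resp = requests.get(
    "https://api.twitter.com/2/tweets/sample/stream",
    headers={"Authorization": f"Bearer {bearer_token}"},
    stream=True,
)
print(resp.status_code)  # 200 means the token is valid
resp.close()
```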

Steps to Run Pipeline:

  1. Use Terraform to provision the dependent resources on Google Cloud Platform:
cd terraform
terraform init
terraform apply
  2. Install dependencies:
cd ..
make install
  3. Run the following commands in two separate terminals; this starts streaming tweets into Pub/Sub and processing them into BigQuery:
  • First, substitute the command line interface (CLI) input values in the Makefile (see the publisher sketch after these steps)
make stream_tweets
make process_tweets
  • The pipeline runs until cancelled using CTRL+C
  4. When finished, destroy the GCP resources to save cost:
terraform destroy
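
For context on what `make stream_tweets` drives, a minimal publisher that forwards tweets from the Twitter sampled stream into Pub/Sub could look like the following. The project and topic names are assumptions for illustration; the repo's actual streaming code and Makefile values may differ.

```python
import os

import requests
from google.cloud import pubsub_v1

# Hypothetical project and topic names; use the values from your Makefile.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-gcp-project", "tweets-topic")

bearer_token = os.environ["TWITTER_BEARER_TOKEN"]
resp = requests.get(
    "https://api.twitter.com/2/tweets/sample/stream",
    headers={"Authorization": f"Bearer {bearer_token}"},
    stream=True,
)

# Forward each tweet from the Twitter stream into Pub/Sub as raw JSON bytes.
for line in resp.iter_lines():
    if line:
        future = publisher.publish(topic_path, data=line)
        future.result()  # blocks per message; fine for a sketch, slow at scale
```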