
# covid-19-dataset

US county-level COVID-19 case data: daily snapshots of US cases by county.

## County Data Status

| State | Scraper | Validator | Aggregator | Time Series |
|-------|---------|-----------|------------|-------------|
| AK    | Y       | N         | N          | N           |
| AL    | Y       | N         | N          | N           |
| CA    | Y       | N         | N          | N           |
| CO    | Y       | N         | N          | N           |
| DE    | Y       | N         | N          | N           |
| FL    | Y       | N         | N          | N           |
| GA    | Y       | N         | N          | N           |
| IA    | Y       | N         | N          | N           |
| KS    | Y       | N         | N          | N           |
| KY    | Y       | N         | N          | N           |
| LA    | Y       | N         | N          | N           |
| MD    | Y       | N         | N          | N           |
| ME    | Y       | N         | N          | N           |
| MI    | Y       | N         | N          | N           |
| MO    | Y       | N         | N          | N           |
| MN    | Y       | N         | N          | N           |
| MT    | Y       | N         | N          | N           |
| NJ    | Y       | N         | N          | N           |
| NY    | Y       | N         | N          | N           |
| OH    | Y       | N         | N          | N           |
| PA    | Y       | N         | N          | N           |
| TN    | Y       | N         | N          | N           |
| TX    | Y       | N         | N          | N           |
| VA    | Y       | N         | N          | N           |
| WA    | Y       | N         | N          | N           |
| WY    | Y       | N         | N          | N           |

## Project structure

```
/data                       # county-level snapshots by scrape timestamp
    |
    - {state}_by_county_{scraper_timestamp_in_EDT}.txt  # snapshot of scraped results as of timestamp
/source_page_backup         # backups of source pages by scrape timestamp
    |
    - {state}_county_{scrape_timestamp}.html            # backup of source page; extension depends on data source
- main.ipynb                # triggers the crawler
- config.yaml               # shared scraper configuration
- {state}_by_county.ipynb   # state-specific scrapers
```

## Scraper Format

Scrapers are simple Python scripts or Jupyter notebooks that implement `fetch`, `save`, and `run` methods.

### `fetch()`

Returns:

- DataFrame containing positive cases by county.
- Source data (HTML page, etc.).

Fetch is responsible for retrieving a page and processing it into a pandas DataFrame. It must return a DataFrame containing `county` and `positive_cases` columns (additional columns are fine) and a string containing the data source that was scraped.
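A minimal sketch of what a `fetch()` implementation might look like, assuming a hypothetical source URL and an HTML table layout; real scrapers are state-specific and parse whatever their source page provides:

```python
from io import StringIO

import pandas as pd
import requests

# Hypothetical source URL for illustration; each state scraper targets
# its own health department page.
SOURCE_URL = "https://example-health-dept.gov/covid/county_counts.html"

def fetch():
    """Fetch the source page and parse it into a DataFrame.

    Returns a (DataFrame, str) tuple: the parsed county counts and the
    raw source page, so save() can back it up.
    """
    resp = requests.get(SOURCE_URL)
    resp.raise_for_status()
    source = resp.text

    # read_html returns every table found on the page; here we assume
    # the first one holds the county counts (requires lxml or html5lib).
    df = pd.read_html(StringIO(source))[0]
    df = df.rename(columns={"County": "county", "Positive Cases": "positive_cases"})
    return df[["county", "positive_cases"]], source
```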

### `save(df, source)`

Params:

- `df` (DataFrame): DataFrame containing `county` and `positive_cases` columns (additional columns are fine).
- `source` (str): the data source page that was scraped.

Save handles persisting the DataFrame and the source data. `df` is saved as a pipe-delimited text file in the `data` directory, named with the scrape timestamp in EDT.
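A sketch of `save()` under the directory layout above; the exact timestamp format and the `state` prefix are assumptions here, since the real scrapers are state-specific notebooks:

```python
import os
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; pytz works on older versions

def save(df, source, state="xx"):
    """Persist the DataFrame and back up the raw source page."""
    # EDT timestamp; the exact format string is an assumption.
    ts = datetime.now(ZoneInfo("America/New_York")).strftime("%Y-%m-%d_%H-%M-%S")

    # Pipe-delimited snapshot in /data.
    df.to_csv(os.path.join("data", f"{state}_by_county_{ts}.txt"),
              sep="|", index=False)

    # Raw page backup in /source_page_backup; extension depends on the source.
    with open(os.path.join("source_page_backup", f"{state}_county_{ts}.html"), "w") as f:
        f.write(source)
```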

### `run()`

Handles fetch and save in one action. Used by the main crawling job.
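Tying the two together, assuming the `fetch()` and `save()` signatures sketched above:

```python
def run():
    """Fetch and save in one step; main.ipynb calls this for each scraper."""
    df, source = fetch()
    save(df, source)
```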