sonar

thetanz/sonar was an ingestion framework for Project Sonar, an initiative led by Rapid7 that provided normalised datasets of global network scan data across public internet space.

Intended for monthly execution, these scripts would concurrently download and process all available Rapid7 datasets into Google BigQuery with the help of multiple Google Compute Engine instances.


Deprecation Notice

Rapid7 have deprecated the public revision of this service; see the post below, released Feb 10, 2022:

Evolving How We Share Rapid7 Research Data

It would appear that the case noted below, coupled with GDPR and CCPA regulations, has called into question what Rapid7 can publicly share.

Case C-582/14 - The court ruled that dynamic IP addresses may constitute 'personal data' even where only a third party (in this case an internet service provider) has the additional data necessary to identify the individual

If you intend to use Rapid7's service for publicly-disclosed research, it appears you can still gain access by reaching out.

This ingestion framework is being released in an archived state. No support or updates will follow.


Overview

This set of scripts ingests monthly GZIP archives of scan datasets. Data is downloaded and subsequently loaded into Google Cloud's BigQuery service.

URLs are fetched from Backblaze; the archives are then downloaded before the processing and loading steps.

These scripts expect a GCP service account JSON file named gcp-svc-sonar.json to be present in the current directory.
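
For reference, those credentials can be activated for the Cloud SDK tooling along these lines (a minimal sketch; the scripts may consume the key file differently):

# activate the service account for gcloud/gsutil/bq (sketch only)
gcloud auth activate-service-account --key-file=gcp-svc-sonar.json
# client libraries also honour this environment variable
export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/gcp-svc-sonar.json"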

By default the container image will iterate through each dataset listed within the variable file.

Downloading, processing and loading can take a fair amount of time (upwards of 15 hours) and involves many disk-heavy operations.

The Dockerfile will initially run the orchestrator script orchestrator.sh, which reads the sourcetypes in the sonardatasets array within the variable file datasets.sh.

For every sourcetype identified, the latest available download URL for the given dataset is discovered and passed to loader.sh.
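
For illustration, that orchestration step amounts to roughly the following (a sketch only, not the repository's exact code; the URL-discovery approach and the loader arguments here are assumptions):

#!/usr/bin/env bash
# illustrative sketch of the orchestration loop
set -euo pipefail

source ./datasets.sh                                  # defines the sonardatasets array
baseuri='https://opendata.rapid7.com'

for sourcetype in "${sonardatasets[@]}"; do
    study="${sourcetype%%:*}"                         # e.g. fdns_v2
    archive="${sourcetype#*:}"                        # e.g. fdns_txt_mx_dmarc.json.gz
    # discover the latest listed URL for this archive (assumed approach)
    path=$(curl -s "${baseuri}/sonar.${study}/" | grep "${archive}" | cut -d '"' -f2 | head -n 1)
    ./loader.sh "${sourcetype}" "${baseuri}${path}"
done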

The loader creates the relevant SQL table and downloads the sonar tarball. We treat downloads differently depending on BQ's quotas.

The files are simply too large to process in memory, so all datasets are currently written to disk and uploaded either directly or as chunked tarballs, depending on the final size of the archive.

When the archive is over 4GB (whilst still below 15TB) we chunk the file into three-million-line sections and create a tarball for each.
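
A chunking step of that shape can be sketched as follows (file names here are placeholders; loader.sh is the authoritative version):

# illustrative chunking sketch - file names are placeholders
gunzip -c dataset.json.gz > dataset.json              # decompress to disk (disk-heavy)
split -l 3000000 -d dataset.json dataset_chunk_       # three-million-line sections
gzip dataset_chunk_*                                  # recompress each section for upload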

Tarballs are uploaded directly as compressive uploads with the appropriate Content-Encoding value set.

Once the tarball of newline-delimited JSON is available within GCP Storage, a BQ batch load operation brings it in, leveraging inline decompressive transcoding.
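
In Cloud SDK terms, that upload-and-load step looks roughly like the below (bucket, dataset and table names are placeholders):

# illustrative upload-and-load sketch - bucket and table names are placeholders
gsutil -h "Content-Encoding:gzip" cp dataset_chunk_*.gz gs://my-sonar-bucket/fdns_v2/

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  my_dataset.fdns_v2 \
  "gs://my-sonar-bucket/fdns_v2/dataset_chunk_*.gz" \
  ./json_schema.json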

Schemas

BigQuery can 'auto detect' a JSON schema, but it can be temperamental. Establishing a fixed schema is the best way to ensure reliable ingestion and removes any auto-detection guesswork.

Whilst Rapid7 provide a standard schema, BigQuery requires a custom file to specify it. Reference the Google docs on BQ schemas for more info.
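
For illustration, a BigQuery schema file is a JSON array of column definitions along these lines (the column names here are illustrative, not the actual Sonar schema):

[
  {"name": "timestamp", "type": "STRING", "mode": "NULLABLE"},
  {"name": "name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "value", "type": "STRING", "mode": "NULLABLE"}
]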

The schema provided by Rapid7 can be fetched with the below:

# discover the published schema file for the rdns_v2 study and download it
baseuri='https://opendata.rapid7.com'
schemafile=$(curl -s "${baseuri}/sonar.rdns_v2/" | grep "schema.json" | cut -d '"' -f2)
wget --no-verbose --show-progress --progress=dot:mega "${baseuri}${schemafile}" -O json_schema.json

Running

GCP Parallel

time: 4-6 hours

Create a set of GCP container VMs to process each dataset concurrently with batch.sh.

When using batch.sh, ensure you update YOUR_GCP_PROJECT accordingly so that Google Container Registry functions correctly within the context of your project.
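
Each worker can be launched roughly as below (a sketch only; batch.sh is the authoritative version, and the instance name, zone and image path here are placeholders):

# illustrative sketch of launching a single worker VM - names are placeholders
gcloud compute instances create-with-container sonar-fdns-worker \
  --project YOUR_GCP_PROJECT \
  --zone us-central1-a \
  --container-image "gcr.io/YOUR_GCP_PROJECT/sonar" \
  --container-arg "fdns_v2:fdns_txt_mx_dmarc.json.gz"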

Local Singleton

time: 30-40 hours
free disk space: ~200GB

docker build . -t sonar
docker run sonar

You can specify a single dataset from the variable file as an input argument to process an individual dataset, e.g.

docker run sonar fdns_v2:fdns_txt_mx_dmarc.json.gz

Notes

  • piping a gzip archive directly to GCP BigQuery takes longer than uploading it to GCP Storage with transcoding and running a subsequent load job

  • we implicitly decompress and chunk any archive over 4GB - we can upload decompressed archives of up to 15TB in size, however we save on data transfer costs by only uploading tarballs/archives

  • the decompressed archives (80GB+ line-delimited JSON files) are massive, and chunking them at any good speed is rather difficult - we avoid doing so wherever possible

  • GNU split doesn't seem to do 'in place' chunking, so we end up downloading an archive, unpacking it and then chunking it, which doubles the disk space required. some of these large datasets can often grow above 80GB after decompression

  • maintaining direct pipes from stdin would be ideal, but the sheer file size of some operations makes this difficult given compute and memory availability, e.g.

    wget -qO- example.com/file.gz | gunzip | upload
  • due to the inherent size of these datasets, work with tools such as sed/awk takes extended amounts of time - no transposing of or additions to the datasets is performed

