
Welcome to the idigbio-spark wiki!

This wiki is meant to share best practices for using large biodiversity data sets.

# When to use?

When possible, use tools that you are already comfortable with (e.g. R, python/pandas). Only when a dataset no longer fits on a single machine, or when the calculations take too long, should you consider distributed processing frameworks/platforms like Hadoop or Apache Spark. Don't use distributed computing unless you really, really need it.

# pick a distributed processing platform

These pages focus on using Apache Spark, but many other distributed processing frameworks exist.
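
To give a first impression of what a Spark job looks like, here is a minimal sketch in Scala that reads a tab-separated occurrence dump and counts records per scientific name. The HDFS path and the `scientificname` column are assumptions for illustration only, not part of this wiki or any particular dataset.

```scala
// Minimal Spark job sketch: count occurrence records per scientific name.
// The HDFS path and the "scientificname" column are hypothetical examples.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object CountBySpecies {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("count-by-species")
      .getOrCreate()

    // Spark reads the file in splits and distributes them across workers.
    val occurrences = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("hdfs:///data/idigbio/occurrence.txt")

    occurrences
      .groupBy("scientificname")
      .count()
      .orderBy(desc("count"))
      .show(20)

    spark.stop()
  }
}
```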

# prepare your data

Once you realize that you need distributed processing for your analysis, the first step is to prepare your datasets: the data format and data store should be suitable for distributed processing.

Do's - use HDFS, Parquet, or other technologies specifically designed for distributed computing (see the Parquet sketch below). Splittable file formats help distributed processing frameworks like Spark distribute work by splitting a file into chunks that can be processed in parallel.

Don't - use large compressed files: most compression formats (e.g. gzip) are not splittable, so a single worker has to read the whole file.
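
As an example, here is a hedged sketch of a one-off Spark job that converts a large tab-separated dump into Parquet; the paths and separator are assumptions and will differ for your data.

```scala
// Sketch: convert a large, uncompressed occurrence dump into Parquet so
// that later jobs read only the columns they need. Paths are hypothetical.
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .getOrCreate()

    val records = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("hdfs:///data/idigbio/occurrence.txt")

    // Parquet is columnar, compressed, and splittable, so it plays well
    // with HDFS and distributed readers.
    records.write
      .mode("overwrite")
      .parquet("hdfs:///data/idigbio/occurrence.parquet")

    spark.stop()
  }
}
```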

# set up your cluster

Although it is getting easier to set up a compute cluster from scratch, it still requires considerable software and hardware skills. We suggest re-using an existing compute cluster when possible to avoid days or weeks of setup and system administration. The idea of GUODA is to provide access to such a compute cluster.

# develop your processing jobs

# test your processing jobs

# deploy your processing jobs
