Technical cheatsheet

Reading the datasets

  1. Connect to the cluster: ssh {username}
  2. Access the datasets in HDFS (the Hadoop cluster)
    • list of datasets: hadoop fs -ls /datasets
    • wikipedia datasets:
      • hadoop fs -ls /datasets/wikidatawiki
        • /datasets/wikidatawiki/wikidatawiki-20170301-pages-articles-multistream-index.txt
        • /datasets/wikidatawiki/wikidatawiki-20170301-pages-articles-multistream.xml
      • hadoop fs -ls /datasets/enwiki-20191001
        • /datasets/enwiki-20191001/enwiki-20191001-pages-articles-multistream.xml
  3. To take a sneak peek at a dataset on the cluster: hadoop fs -cat /datasets/<file_name> | less
  4. To read the first 100 lines: hadoop fs -cat /datasets/<file_name> | head -n 100
  5. To copy the output to your own space in the cluster: hadoop fs -cat /datasets/enwiki-20191001/enwiki-20191001-pages-articles-multistream.xml | head -n 200 > ~/enwiki_200_lines.txt
  • To transfer files from the cluster's regular file system to HDFS: hadoop fs -put {filename.txt}
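
The `hadoop fs -cat ... | head` pattern from steps 4 and 5 can also be driven from Python. The following is a minimal sketch, assuming the `hadoop` command is on the PATH of the machine where it runs; `first_lines` and `peek_hdfs` are hypothetical helper names, not part of any library.

```python
import subprocess
from itertools import islice


def first_lines(stream, n):
    """Return at most the first n lines of an iterable of lines."""
    return list(islice(stream, n))


def peek_hdfs(hdfs_path, n=100):
    """Rough equivalent of: hadoop fs -cat <hdfs_path> | head -n <n>.

    Assumes the `hadoop` command is available on PATH.
    """
    proc = subprocess.Popen(
        ["hadoop", "fs", "-cat", hdfs_path],
        stdout=subprocess.PIPE,
        encoding="utf-8",
        errors="replace",
    )
    try:
        return first_lines(proc.stdout, n)
    finally:
        # Stop early, like `head` does, instead of streaming the whole file.
        proc.stdout.close()
        proc.terminate()
        proc.wait()
```

For example, `peek_hdfs("/datasets/enwiki-20191001/enwiki-20191001-pages-articles-multistream.xml", 200)` would return the same 200 lines that step 5 redirects to a file.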

Useful commands from the tutorial

hadoop fs -ls: list the files in your HDFS home directory

hadoop fs -put file.txt: copy a file from the regular file system to HDFS

hadoop fs -get hdfs_file_name local_file_name: copy a file or folder from HDFS to the local file system

hadoop fs -getmerge hdfs_file_name local_file_name: copy files from HDFS to the local file system, merging multiple files into one

Copy files from/to the cluster

  1. Open a terminal and execute the commands LOCALLY

  2. copy files FROM local TO cluster (scp {source} {destination}): scp {username}

  3. copy files FROM cluster TO local

    scp {username} ./

Run spark-python files in the cluster

  1. copy your python file to the cluster (as mentioned above)

    scp {username}

  2. connect to the cluster

    ssh {username}

  3. run the command:

    spark-submit \
     --master yarn \
     --packages com.databricks:spark-xml_2.11:0.6.0 \
     --num-executors 50 \
     --executor-memory 4g \
     {python file}

    or, to request more executors and more memory per executor:

    spark-submit --master yarn --packages com.databricks:spark-xml_2.11:0.7.0 --num-executors 70 --executor-memory 6g {python file}
  4. you can follow the progress of the job you submitted here:

    (click on the ID of your job)
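
To show how the flags in step 3 fit together, here is a small sketch that assembles the same kind of `spark-submit` invocation as an argument list. `build_spark_submit` is a hypothetical helper and `my_job.py` is a placeholder file name; the defaults simply mirror the first command in step 3.

```python
def build_spark_submit(script, master="yarn",
                       packages="com.databricks:spark-xml_2.11:0.6.0",
                       num_executors=50, executor_memory="4g"):
    """Assemble a spark-submit command line as a list of arguments."""
    return [
        "spark-submit",
        "--master", master,
        "--packages", packages,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        script,  # the python file always comes after the options
    ]


# Placeholder script name; override only the values that differ from the defaults.
cmd = build_spark_submit("my_job.py", num_executors=70, executor_memory="6g")
print(" ".join(cmd))
```

On the cluster, the resulting list could be handed to `subprocess.run(cmd, check=True)`. Keeping the options in one function makes it harder to forget the python file at the end of the command.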

Run spark-python files locally

  1. go to where the ada environment is installed (usually in Anaconda3/envs/ada) and run:

    ./bin/spark-submit \
      --master local \
      --packages com.databricks:spark-xml_2.11:0.6.0 \
      {absolute path to python file}   
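
For reference, a python file submitted this way might look like the sketch below. It assumes the spark-xml package from the `--packages` flag is loaded and that each `<page>` element of a Wikipedia dump should become one row. `resolve_input`, the app name, and passing the input path as a command-line argument are choices made for this example, not something the commands above require.

```python
import sys


def resolve_input(argv):
    """Return the input path given after the script name, or None if missing."""
    return argv[1] if len(argv) > 1 else None


def main(input_path):
    # Imported here so the file can be loaded even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("xml-peek").getOrCreate()
    # spark-xml data source; rowTag="page" is an assumption about the dump layout.
    pages = (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "page")
        .load(input_path)
    )
    pages.printSchema()
    print("pages:", pages.count())
    spark.stop()


if __name__ == "__main__":
    path = resolve_input(sys.argv)
    if path is None:
        print("usage: spark-submit ... this_file.py <input path>")
    else:
        main(path)
```

With this layout, the input path (for example a small extract of one of the files under /datasets) goes after the python file on the `spark-submit` command line.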

Visualization Libraries

  • D3
  • Vega (specifies graphics in JSON format)
  • Vincent (Python-to-Vega translator)
  • Bokeh