Bayesian structure learning with parallel bnlearn on a distributed R cluster.

Arun-George-Zachariah/Parallel-R

Parallel-R

In this project, we set up a distributed R cluster using the parallel package. The parallel package supports parallel computation by forking processes on a single machine (functionality derived from the multicore package), which lets a job use the machine's multiple cores. It also supports socket-based communication (derived from the snow package), which spreads the computation across the nodes of a cluster.

We then study Bayesian structure learning by learning a Bayesian network structure from a sample dataset with the bnlearn package. The dataset is split into equal parts, one per node in the cluster. A network structure is learnt over each split in parallel, and the per-split structures are aggregated to produce the final structure.
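The splitting step can be sketched in shell. This is a minimal illustration, not the repository's own split logic (which is not shown here): the function name `split_csv`, the `part_` file prefix, and the row-wise split with a repeated header are all assumptions. It also assumes the input CSV ends with a newline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository):
#   split_csv <input.csv> <num_parts> <out_dir>
# Splits the data rows into at most <num_parts> files of near-equal size
# and repeats the CSV header at the top of each part.
split_csv() {
  local input="$1" parts="$2" out_dir="$3"
  mkdir -p "$out_dir"

  local header rows per_part f
  header="$(head -n 1 "$input")"
  # Count data rows, i.e. everything after the header line.
  rows="$(tail -n +2 "$input" | wc -l)"
  # Ceiling division, so no more than $parts files are produced.
  per_part=$(( (rows + parts - 1) / parts ))
  [ "$per_part" -gt 0 ] || per_part=1

  # split names its outputs part_aa, part_ab, ...
  tail -n +2 "$input" | split -l "$per_part" - "$out_dir/part_"

  for f in "$out_dir"/part_*; do
    [ -e "$f" ] || continue   # no data rows, nothing to do
    { printf '%s\n' "$header"; cat "$f"; } > "$f.csv"
    rm "$f"
  done
}
```

For example, `split_csv data/Sample_Data.csv 4 /mydata/splits` would write up to four files named `part_aa.csv`, `part_ab.csv`, and so on; the output directory here is only an illustration.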

Dataset

The sample data is learning.test, a small synthetic dataset from bnlearn whose network comprises 6 nodes, 5 arcs and 41 parameters.

Fig. 1 - learning.test Network (Ref: https://www.bnlearn.com/documentation/networks/)

Execution

  • To set up a distributed R cluster

    cd scripts && ./configure.sh --machines <MACHINES> --user <USERNAME> --key <PRIVATE_KEY>
    
    Parameter    Default                    Description
    --machines   ../conf/machine_list.txt   A file listing the public IP addresses of the nodes.
    --user       ${USER}                    User name, if different from the current user name.
    --key        ~/.ssh/id_rsa              Path to the private key.

    Example:
    cd scripts && ./configure.sh --machines ../conf/machine_list.txt --user arung --key ~/.ssh/id_rsa
    
  • To learn the Bayesian network

    ./exec.sh --machines <MACHINES> --user <USERNAME> --key <PRIVATE_KEY> --inp <INPUT_DATA> --data <DATA_DIR>
    
    Parameter    Default                    Description
    --machines   ../conf/machine_list.txt   A file listing the public IP addresses of the nodes.
    --user       ${USER}                    User name, if different from the current user name.
    --key        ~/.ssh/id_rsa              Path to the private key.
    --inp        ~/.ssh/id_rsa              Input data as a CSV file.
    --data       /mydata                    Path to the directory in which to install R packages and save data splits and other metadata.

    Example:
    ./exec.sh --machines conf/machine_list.txt --user arung --key ~/.ssh/id_rsa --inp data/Sample_Data.csv --data /mydata
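
Both commands read the node addresses from the `--machines` file. The sketch below shows one way to read such a file and count the nodes before running them. The file layout is an assumption (one address per line, with blank lines and lines starting with `#` ignored), and `list_nodes` and `count_nodes` are hypothetical helpers, not functions provided by the scripts in this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repository).

# Print one node address per line, skipping blank lines and '#' comments.
# Falls back to the documented default path when no file is given.
list_nodes() {
  local file="${1:-../conf/machine_list.txt}"
  # Drop carriage returns and surrounding whitespace before filtering.
  tr -d '\r' < "$file" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v -e '^$' -e '^#'
}

# Number of nodes, which is also the number of data splits described above.
count_nodes() {
  list_nodes "$1" | wc -l | tr -d '[:space:]'
}
```

Running `count_nodes ../conf/machine_list.txt` prints the number of usable entries, which is a quick sanity check that the file lists every node you expect.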
    
