In this project, we set up a distributed R cluster using the parallel package. The package supports parallel computation in two ways. First, it can fork worker processes on the same machine (functionality derived from the multicore package), which lets a computation use multiple cores of that machine. Second, it can communicate with workers over sockets (functionality derived from the snow package), which lets a computation use the resources of multiple nodes in a cluster.
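As a minimal sketch of the two modes described above (not taken from this project's scripts), the following runs the same toy task with a fork-based call and with a socket-based cluster. The function `slow_square`, the input vector, and the worker counts are illustrative assumptions.

```r
library(parallel)

# Toy task (hypothetical, for illustration only).
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

inputs <- 1:8

# 1. Fork-based parallelism on one machine (multicore heritage).
#    Forking is not available on Windows, so use a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- mclapply(inputs, slow_square, mc.cores = n_cores)

# 2. Socket-based parallelism (snow heritage). Two local workers are
#    started here; passing host names instead of a count would start
#    the workers on those machines.
cl <- makePSOCKcluster(2)
socketed <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# Both approaches should agree with each other.
stopifnot(identical(forked, socketed))
print(unlist(socketed))
```

The socket-based form is the one that extends beyond a single machine, which is why a multi-node setup relies on it.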
We then study Bayesian network structure learning on a sample dataset using the bnlearn package. The dataset is split into equal parts, one per node in the cluster. A network structure is learned on each split in parallel, and the per-split structures are aggregated to produce the final structure.
The sample data is learning.test, a small synthetic dataset from bnlearn whose network comprises 6 nodes, 5 arcs, and 41 parameters.
Fig. 1 - learning.test Network (Ref: https://www.bnlearn.com/documentation/networks/)
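The split, learn, and aggregate workflow can be sketched as follows. This is an illustration under stated assumptions, not the project's actual implementation: the number of parts (`n_parts`), the use of hill climbing (`hc`), and aggregation by arc frequency with a 0.5 threshold are all choices made for the example, since the text does not specify the learning algorithm or the aggregation rule. The per-split learning runs serially here with `lapply`; on a cluster it could be dispatched with `parallel::parLapply`.

```r
library(bnlearn)

data(learning.test)

# Assumed value; stands in for the number of nodes in the cluster.
n_parts <- 4

# Shuffle the rows, then assign them to n_parts equal-sized splits.
set.seed(42)
shuffled <- learning.test[sample(nrow(learning.test)), ]
part_id <- rep(seq_len(n_parts), length.out = nrow(shuffled))
splits <- split(shuffled, part_id)

# Learn one structure per split (hill climbing is an assumption).
nets <- lapply(splits, hc)

# Aggregate: compute how often each arc appears across the learned
# networks, then keep the arcs that appear in at least half of them.
strength <- custom.strength(nets, nodes = names(learning.test))
final <- averaged.network(strength, threshold = 0.5)

print(final)
```

Arc-frequency averaging is only one possible aggregation rule; a stricter threshold keeps fewer arcs, a looser one keeps more.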
-
To set up a distributed R cluster
cd scripts && ./configure.sh --machines <MACHINES> --user <USERNAME> --key <PRIVATE_KEY>
Parameter    Default                    Description
--machines   ../conf/machine_list.txt   A file listing the public IP addresses of the nodes.
--user       ${USER}                    User name, if different from the current user name.
--key        ~/.ssh/id_rsa              Path to the private key.

Example:
cd scripts && ./configure.sh --machines ../conf/machine_list.txt --user arung --key ~/.ssh/id_rsa
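For orientation, the three inputs above (machine list, user, private key) correspond naturally to arguments of `parallel::makePSOCKcluster`. The sketch below is an assumption about how such inputs could be used from R, not a description of what configure.sh does. The machine list is assumed to hold one host per line, and "localhost" is used so the example runs without remote machines.

```r
library(parallel)

# Hypothetical machine list with one host per line. "localhost" keeps
# the sketch runnable; a real list would contain the nodes' IP addresses.
machines_file <- tempfile(fileext = ".txt")
writeLines(c("localhost", "localhost"), machines_file)

hosts <- readLines(machines_file)
hosts <- hosts[nzchar(trimws(hosts))]  # drop blank lines

# For remote hosts, `user` and `rshcmd` control how ssh is invoked.
# They are not used for "localhost" workers, which are started directly.
cl <- makePSOCKcluster(
  hosts,
  user = Sys.info()[["user"]],
  rshcmd = "ssh -i ~/.ssh/id_rsa"
)

# Ask each worker which machine it is running on.
worker_hosts <- unlist(clusterCall(cl, function() Sys.info()[["nodename"]]))
print(worker_hosts)

stopCluster(cl)
```

Each line in the machine list starts one worker, so listing a host twice starts two workers on it.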
-
To learn the Bayesian network
./exec.sh --machines <MACHINES> --user <USERNAME> --key <PRIVATE_KEY> --inp <INPUT_DATA> --data <DATA_DIR>
Parameter    Default                    Description
--machines   ../conf/machine_list.txt   A file listing the public IP addresses of the nodes.
--user       ${USER}                    User name, if different from the current user name.
--key        ~/.ssh/id_rsa              Path to the private key.
--inp        ~/.ssh/id_rsa              Input data, as a CSV file.
--data       /mydata                    Path to the directory in which to install R packages and save data splits and other metadata.

Example:
./exec.sh --machines conf/machine_list.txt --user arung --key ~/.ssh/id_rsa --inp data/Sample_Data.csv --data /mydata