This is the repository for our paper "Predicting forest fire in Indonesia using remote sensing data and machine learning".
This document contains the necessary information about setting up the environment and reproducing the results in the paper.
NOTE: all the scripts assume Python 2.7.x on Ubuntu.
```
paper-supplementary-materials
├── evaluation_scripts
│   ├── auc_calculation
│   │   └── auc_calculation.py
│   ├── plot_roc_curve
│   │   └── plot_roc_curve.ipynb
│   ├── evaluation_script_50_epoch_stride1.py
│   ├── evaluation_script_50_epoch_stride1_3month.py
│   ├── evaluation_script_50_epoch_stride1_6month.py
│   ├── evaluation_script_50_epoch_stride1_9month.py
│   └── evaluation_script_script_logistic_baseline.py
├── models
│   ├── baseline_logistic_regression_model.hdf5
│   ├── model.hdf5
│   ├── model_3month.hdf5
│   ├── model_6month.hdf5
│   └── model_9month.hdf5
├── preprocessing_scripts
│   ├── eval_set_preprocessing_dates
│   ├── fire_only_dates
│   ├── preprocessing_dates
│   ├── preprocessing_script_new_mask.py
│   ├── preprocessing_script_new_mask_fire_only.py
│   └── test_set_preprocessing_dates
├── training_scripts
│   ├── baseline_logistic_regression_training_script_12ts_mse.py
│   ├── training_script_50_epoch_stride1.py
│   ├── training_script_50_epoch_stride1_3month.py
│   ├── training_script_50_epoch_stride1_6month.py
│   └── training_script_50_epoch_stride1_9month.py
├── readme.html
├── README.md
└── requirements.txt
```
- Google Earth Engine account
  - A Google Earth Engine account is required to access the satellite image data and perform data preprocessing.
- Google account with access to Google Drive API v3
  - The Google Drive API needs to be enabled for downloading the dataset. Each satellite image time series is first exported to Google Drive and subsequently downloaded.
- GDAL 2
Install GDAL 2.1.3 (the version used during implementation). If you are installing a different version of GDAL, make sure to change the version of `pygdal` in `requirements.txt` to the same version. You can check the available versions here.

```shell
sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get install libgdal20=2.1.3
sudo apt-get install libgdal2-dev=2.1.3
```
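If your installed GDAL differs from 2.1.3, the `pygdal` pin in `requirements.txt` must be updated to match. A minimal sketch of editing the pin (the file contents and version numbers here are illustrative only; actual `pygdal` releases may carry an extra build suffix, so check PyPI for the exact version string):

```shell
# Illustrative requirements.txt with a pygdal pin (contents are made up)
printf 'numpy\npygdal==2.1.3\n' > requirements.txt
# If you installed a different GDAL (say 2.2.4), update the pin to match;
# verify on PyPI which pygdal release corresponds to your GDAL build.
sed -i 's/^pygdal==.*/pygdal==2.2.4/' requirements.txt
grep pygdal requirements.txt
```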
- Python 2.x
  Python 2 is used because the Google Earth Engine API only supported Python 2 at the time of implementation.
To install the required Python packages:

```shell
pip install -r requirements.txt
```
If you are not using a GPU, change `tensorflow-gpu` to `tensorflow` in `requirements.txt`.
You will need to authenticate the Google Earth Engine Python API to access your account. You can do so by executing the following in a terminal:

```shell
earthengine authenticate
```
You also need to have the Google Drive API enabled to run the data retrieval and preprocessing script. Follow steps 1 and 2 here.
Note: Ensure that there is around 200 GB of free disk space.
There are two versions of the data retrieval and preprocessing script in the `preprocessing_scripts` folder: `preprocessing_script_new_mask.py` and `preprocessing_script_new_mask_fire_only.py`. As the names suggest, `preprocessing_script_new_mask.py` retrieves and processes data containing both hotspot and non-hotspot labels, whereas `preprocessing_script_new_mask_fire_only.py` retrieves and processes data only for hotspot labels.
There are a few other parameters that you should set in both of the data retrieval and preprocessing scripts:

- `save_folder_location` -- where to save the processed data files
- `period_start` and `period_end` -- the interval in which to get reference images. The date of a reference image is time t, as referred to in the paper. For each reference image, historical data from t - 52 weeks to t is retrieved, and hotspot labels in the period t + 4 weeks to t + 5 weeks are retrieved.
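The window arithmetic above can be sketched as follows (an illustrative helper, not part of the repository scripts; the function name and date format are assumptions):

```python
from datetime import datetime, timedelta

def data_windows(reference_date):
    # Given reference time t, return the historical-data window
    # (t - 52 weeks to t) and the hotspot-label window
    # (t + 4 weeks to t + 5 weeks), as described above.
    t = datetime.strptime(reference_date, '%Y-%m-%d')
    history = (t - timedelta(weeks=52), t)
    labels = (t + timedelta(weeks=4), t + timedelta(weeks=5))
    return history, labels

history, labels = data_windows('2019-08-01')
print(history[0].strftime('%Y-%m-%d'))  # 2018-08-02
```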
The data used in the paper is retrieved by running the following.

For training data:

- Set different folders for the `save_folder_location` parameter in `preprocessing_script_new_mask_fire_only.py` and `preprocessing_script_new_mask.py`
- Run the `preprocessing_script_new_mask_fire_only.py` script with the dates in `fire_only_dates`
- Run the `preprocessing_script_new_mask.py` script with the dates in `preprocessing_dates`
For test set data:

- Set the desired test data folder location for the `save_folder_location` parameter in `preprocessing_script_new_mask.py`
- Run the `preprocessing_script_new_mask.py` script with the dates in `test_set_preprocessing_dates`
For evaluation set data, for each pair of `period_start` and `period_end` in `eval_set_preprocessing_dates`:

- Set the desired evaluation data folder location for the `save_folder_location` parameter in `preprocessing_script_new_mask.py` (the folder location should be different for each pair of `period_start` and `period_end`)
- Run the `preprocessing_script_new_mask.py` script with the dates in `eval_set_preprocessing_dates`
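Keeping one output folder per evaluation period can be organised as in this sketch (the date values and folder naming are placeholders; use the actual pairs listed in `eval_set_preprocessing_dates`):

```shell
# Placeholder period_start values; substitute the pairs from eval_set_preprocessing_dates
for START in 2019-07-01 2019-08-01; do
    # One save_folder_location per period_start/period_end pair
    mkdir -p "./preprocessed_data_eval_${START}/"
    # ...edit save_folder_location, period_start and period_end in
    # preprocessing_script_new_mask.py accordingly, then run it:
    # python preprocessing_script_new_mask.py
done
ls -d ./preprocessed_data_eval_*/
```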
NOTE: the data retrieved using the above instructions may not be exactly the data used in our experiments, but it is as close as possible. This is because we originally imported the Indonesia boundary KML file via Google Fusion Tables (now discontinued). We have therefore switched to a more recent source of country boundary data, provided by the United States Department of State, Office of the Geographer, via Google Earth Engine.
- Indonesia boundary
  The Indonesian boundary used is loaded from: https://developers.google.com/earth-engine/datasets/catalog/USDOS_LSIB_SIMPLE_2017
  As this country boundary differs from the one used in our paper, there are minor differences in the data retrieved. However, we still expect our model to perform similarly.
- Landsat 7 data
  The Landsat 7 images used are the existing collection within Google Earth Engine, with ImageCollection ID `'LANDSAT/LE07/C01/T1_RT'`. Information on the Landsat 7 dataset used is available here: https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LE07_C01_T1_RT
- Hotspot data (FIRMS)
  The Fire Information for Resource Management System (FIRMS) hotspot dataset is used. It is an existing dataset on Google Earth Engine with ImageCollection ID `'FIRMS'`. Additional information is available here: https://developers.google.com/earth-engine/datasets/catalog/FIRMS
To train the model(s) in the paper, run the training scripts in the `training_scripts` folder.
Step 1: Set the user parameters at the start of the training scripts:

- saved model file name
- log file name
- directories of the dataset locations (training dataset, training dataset with fire only, and test set):
  - `training_dataset_fire_only_directory` --- set to the directory of data downloaded by running the `preprocessing_script_new_mask_fire_only.py` script with the dates in `fire_only_dates`
  - `training_dataset_directory` --- set to the directory of data downloaded by running the `preprocessing_script_new_mask.py` script with the dates in `preprocessing_dates`
  - `testset_directory` --- set to the directory of test set data downloaded using the preprocessing script, for example `'./preprocessed_data_test_set/'`

**NOTE: when setting directory parameters, include the trailing slash**, e.g. `'./preprocessed_data_test_set/'` instead of `'./preprocessed_data_test_set'`
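The trailing slash matters because the scripts build file paths by plain string concatenation. A small defensive helper (hypothetical, not part of the repository) can normalise the parameter:

```python
def ensure_trailing_slash(path):
    # The scripts concatenate file names directly onto the directory
    # parameter, so it must end with '/'.
    return path if path.endswith('/') else path + '/'

print(ensure_trailing_slash('./preprocessed_data_test_set'))  # ./preprocessed_data_test_set/
```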
Step 2: Run the training script as follows:

```shell
python <script_name>.py
```
There are 5 included training scripts; `<script_name>` can be:

- `training_script_50_epoch_stride1.py` -- trains the model on 1 year of historical data
- `training_script_50_epoch_stride1_3month.py` -- trains the model on the most recent 3 months of historical data
- `training_script_50_epoch_stride1_6month.py` -- trains the model on the most recent 6 months of historical data
- `training_script_50_epoch_stride1_9month.py` -- trains the model on the most recent 9 months of historical data
- `baseline_logistic_regression_training_script_12ts_mse.py` -- trains the baseline logistic regression model
The dataset location can be the same for all scripts; processing is done within each script to ensure the right amount of data is passed to the model during training. The scripts log the performance on the test set, using 0.5 as the prediction threshold.
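As a sketch of what that threshold means (the prediction values below are made up, and whether a value exactly at 0.5 counts as positive is an assumption, not taken from the scripts):

```python
# Illustrative only: thresholding probabilistic outputs at 0.5,
# assuming values equal to 0.5 count as the positive (hotspot) class.
preds = [0.92, 0.41, 0.65, 0.08]
labels = [1, 0, 1, 0]
pred_labels = [1 if p >= 0.5 else 0 for p in preds]
accuracy = sum(int(a == b) for a, b in zip(pred_labels, labels)) / float(len(labels))
print(accuracy)  # 1.0
```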
The scripts to evaluate performance on the evaluation set are located in the `evaluation_scripts` folder.

Step 1: Set the parameters at the start of the evaluation scripts:

- saved model file name
- log file name
- prediction pickle file name
- directory of the evaluation dataset location

The default values in the scripts are for evaluating data instances with reference time t in August 2019, predicting hotspots in September 2019. If evaluating on different data, change the location parameter in the script accordingly.
Step 2: Run the evaluation script as follows:

```shell
python <script_name>.py
```
There are 5 included evaluation scripts; `<script_name>` can be:

- `evaluation_script_50_epoch_stride1.py` -- evaluates the model trained on 1 year of historical data
- `evaluation_script_50_epoch_stride1_3month.py` -- evaluates the model trained on the most recent 3 months of historical data
- `evaluation_script_50_epoch_stride1_6month.py` -- evaluates the model trained on the most recent 6 months of historical data
- `evaluation_script_50_epoch_stride1_9month.py` -- evaluates the model trained on the most recent 9 months of historical data
- `evaluation_script_script_logistic_baseline.py` -- evaluates the baseline logistic regression model
The dataset location can be the same for all scripts; processing is done within each script to ensure the right amount of data is passed to the model during evaluation. The pickle file saved by an evaluation script is a tuple of 2 lists, the first being the model predictions and the second being the ground truth labels: `([model prediction list], [ground truth label list])`. The pickle file is then used to calculate the area under the ROC curve.
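A quick way to inspect such a file (the values here are toy data; the real files are written by the evaluation scripts):

```python
import pickle

# Write a toy pickle in the same (predictions, labels) tuple format...
with open('prediction.pickle', 'wb') as f:
    pickle.dump(([0.9, 0.2, 0.7], [1, 0, 1]), f)

# ...then load it back the way a downstream script would.
with open('prediction.pickle', 'rb') as f:
    predictions, ground_truth = pickle.load(f)

print(len(predictions), len(ground_truth))
```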
In the `auc_calculation` folder, there is a script (`auc_calculation.py`) to calculate AUC from the pickle files generated by the evaluation scripts.

Step 1: In the script, there are 2 parameters to be set:

- `prediction_filename` -- the pickle file from the evaluation script for which AUC is calculated
- `output_log_file` -- the name of the log file the AUC result will be written to

Step 2: Run the script; the AUC result will be saved to the log file.

```shell
python auc_calculation.py
```
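For intuition, AUC can be computed directly from a (predictions, labels) pair as the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one. A dependency-free sketch (this is not the repository's implementation, which may rely on a library routine):

```python
def auc_from_predictions(preds, labels):
    # Rank-based AUC (equivalent to the Mann-Whitney U statistic),
    # counting ties between a positive and a negative as half a win.
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_from_predictions([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```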
In the `plot_roc_curve` folder under the `evaluation_scripts` folder, there is a Jupyter notebook (`plot_roc_curve.ipynb`) for plotting the ROC curve from the pickle files output by the evaluation scripts. Start Jupyter on Python 2.x by running `jupyter lab` in a terminal and follow the instructions within the notebook.
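The curve plotted by the notebook is the set of (FPR, TPR) points swept over prediction thresholds. A pure-Python sketch of that computation (illustrative only, not the notebook's code):

```python
def roc_points(preds, labels):
    # One (false positive rate, true positive rate) point per threshold,
    # sweeping thresholds from high to low.
    thresholds = sorted(set(preds), reverse=True)
    P = labels.count(1)
    N = labels.count(0)
    points = [(0.0, 0.0)]
    for th in thresholds:
        tp = sum(1 for p, y in zip(preds, labels) if p >= th and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p >= th and y == 0)
        points.append((fp / float(N), tp / float(P)))
    return points

pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(pts[-1])  # (1.0, 1.0)
```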
We have included some of the trained model files in the `models` folder. The naming of the model files matches the preset names in the user parameters of the training and evaluation scripts. The models included are:

- `model.hdf5` --- model trained on 1 year of historical satellite image histogram data
- `model_3month.hdf5` --- model trained on ~3 months of historical satellite image histogram data
- `model_6month.hdf5` --- model trained on ~6 months of historical satellite image histogram data
- `model_9month.hdf5` --- model trained on ~9 months of historical satellite image histogram data
- `baseline_logistic_regression_model.hdf5` --- baseline model trained on the same data as `model.hdf5`

The model trained and evaluated on 9 months of data is omitted due to the size limit on supplementary files. You can run the respective evaluation scripts on the provided models to get an idea of each model's performance.
To replicate the tables and figure in the paper, please first retrieve the evaluation set using the preprocessing script.
To reproduce the values in Table 4 (comparison of our model with the baseline), run the evaluation script on the provided model and baseline: `model.hdf5` (our model) and `baseline_logistic_regression_model.hdf5` (baseline model) in the `models` folder. Run the evaluation script for each model on each month of the evaluation dataset; when downloading the data using the preprocessing script, change the `save_folder_location` parameter to save each month of the evaluation dataset in a separate location.
Each run of the evaluation script produces a `.pickle` file consisting of the model's predictions and the ground truth labels for all instances in that evaluation. Run the `auc_calculation.py` script (setting `prediction_filename` to the `.pickle` file output by the evaluation script) to calculate the area under the ROC curve reported in the table.
To reproduce the values in Table 5 (AUC values for reduced data), first train the 9-month model, which we did not include due to space constraints. The other models are provided in the `models` folder. Once all the models are trained, run the different evaluation scripts corresponding to the amount of data each model was trained with (e.g. 3 months, 6 months, 9 months, 1 year). The evaluation scripts need to be run on each month of the evaluation data for each model. The "1 year" column of Table 5 is the same as "Agni (our model)" in Table 4.
Step 1: Run the evaluation script for `model.hdf5` with the September 2019 and August 2019 hotspot evaluation datasets. The data are retrieved by running `preprocessing_script_new_mask.py` with `period_start = '2019-08-01'`, `period_end = '2019-08-28'` and with `period_start = '2019-07-01'`, `period_end = '2019-07-28'`. Refer to Section 2.1 for more information.

Step 2: Follow the instructions in Section 4.2 to use the provided Jupyter notebook `plot_roc_curve.ipynb` to plot the ROC curve.
The other tables in the paper are not results. Table 1 gives the specifications of the Landsat 7 satellite, which can also be found here. Table 2 shows the architecture of our model; it corresponds to the code found in the training scripts.