Green-DCC: Benchmarking Dynamic Workload Distribution Techniques for Sustainable Data Center Cluster

Table of Contents

  1. Introduction
  2. Installation
  3. Usage
  4. Benchmarking
  5. Experimental Details
  6. Provided Locations
  7. Contributing
  8. Contact
  9. License

Introduction

Green-DCC is a benchmark environment designed to evaluate dynamic workload distribution techniques for sustainable Data Center Clusters (DCC). It aims to reduce the environmental impact of cloud computing by distributing workloads within a DCC that spans multiple geographical locations. The benchmark environment supports the evaluation of various control algorithms, including reinforcement learning-based approaches.

Detailed documentation can be found here.

Figure: Green-DCC Framework for Data Center Cluster Management.

Key features of Green-DCC include:

  • Dynamic time-shifting of workloads within data centers and geographic shifting between data centers in a cluster.
  • Incorporation of non-uniform computing resources, cooling capabilities, auxiliary power resources, and varying external weather and carbon intensity conditions.
  • A dynamic bandwidth cost model that accounts for the geographical characteristics and amount of data transferred.
  • Realistic workload execution delays to reflect changes in data center capacity and demand.
  • Support for benchmarking multiple heuristic and hierarchical reinforcement learning-based approaches.
  • Customizability to address specific needs of cloud providers or enterprise data center clusters.

Green-DCC provides a complex, interdependent, and realistic benchmarking environment that is well-suited for evaluating hierarchical reinforcement learning algorithms applied to data center control. The ultimate goal is to optimize workload distribution to minimize the carbon footprint, energy usage, and energy cost, while considering various operational constraints and environmental factors.

The figure above illustrates the Hierarchical Green-DCC Framework for Data Center Cluster Management. In this framework:

  • Top-Level Agent: Controls the geographic distribution of workloads across the entire DCC. This agent makes strategic decisions to optimize resource usage and sustainability across multiple locations.
  • Lower-Level Agents: Manage the time-shifting of workloads and the cooling processes within individual data centers. These agents implement the directives from the top-level agent while addressing local operational requirements.
  • Additional Controls: Can include energy storage, among other capabilities. These controls further enhance the system's ability to optimize for multiple objectives, such as reducing the carbon footprint, minimizing energy usage and costs, and potentially extending to water usage.

The hierarchical structure allows for coordinated, multi-objective optimization that considers both global strategies and local operational constraints.

Figure: Green-DCC Framework demonstrating Geographic and Temporal Load Shifting Strategies.

The figure above shows the Green-DCC framework using two main strategies to optimize data center operations and reduce carbon emissions:

  • Geographic Load Shifting: Dynamically moves workloads between different data centers (DC1, DC2, DC3) based on decisions made by the Top-Level Agent. This strategy leverages regional differences in energy costs, carbon intensity of the grid, and external temperatures.
  • Temporal Load Shifting: Defers non-critical/shiftable tasks to future time periods within a single data center (e.g., DC3), when conditions are more favorable for energy-efficient operation. Tasks are stored in a Deferred Task Queue (DTQ) and executed during periods of lower carbon intensity, external temperatures, or lower overall data center utilization.
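
The temporal strategy can be illustrated with a minimal, purely illustrative sketch (not the actual Green-DCC implementation): shiftable tasks are pushed into a deferred task queue while the grid carbon intensity is above a threshold and flushed once it drops (or at the end of the horizon).

    from collections import deque

    def temporal_shift(tasks, carbon_intensity, ci_threshold=400.0):
        """Toy deferred-task-queue (DTQ) heuristic, for illustration only.

        tasks[t] is the list of shiftable tasks arriving at hour t;
        carbon_intensity[t] is the grid carbon intensity (gCO2/kWh) at hour t.
        """
        dtq = deque()
        executed = []  # (task, hour_executed)
        for t, (arriving, ci) in enumerate(zip(tasks, carbon_intensity)):
            dtq.extend(arriving)
            if ci <= ci_threshold:
                while dtq:
                    executed.append((dtq.popleft(), t))
        # Anything still deferred runs at the end of the horizon.
        while dtq:
            executed.append((dtq.popleft(), len(tasks) - 1))
        return executed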

A demo of Green-DCC is available on Colab.

Installation

To get started with Green-DCC, follow the steps below to set up your environment and install the necessary dependencies.

Prerequisites

  • Python 3.10+
  • Ray 2.4.0 (installed automatically when installing from requirements.txt)
  • Git
  • Conda (for creating virtual environments)

Setting Up the Environment

  1. Clone the repository

    First, clone the Green-DCC repository from GitHub:

    git clone https://github.com/HewlettPackard/green-dcc.git
    cd green-dcc
  2. Create a new Conda environment

    Create a new Conda environment with Python 3.10:

    conda create --name greendcc python=3.10
  3. Activate the environment

    Activate the newly created environment:

    conda activate greendcc
  4. Install dependencies

    Install the required dependencies using pip:

    pip install -r requirements.txt

Usage

This section provides instructions on how to run simulations, configure the environment, and use the Green-DCC benchmark.

Running Simulations

  1. Navigate to the Green-DCC directory

    Ensure you are in the green-dcc directory:

    cd green-dcc
  2. Run an experiment

    To run a basic experiment, use the following command:

    python train_truly_hierarchical.py

    This will start a simulation with the default configuration. The results will be saved in the results/ output directory.

  3. Visualize the experiments with TensorBoard

    To visualize the experiments while they are running, you can launch TensorBoard. Open a new terminal, navigate to the results/ directory, and run the following command (test is the experiment name set by NAME in the training script):

    tensorboard --logdir=./test

    This will start a TensorBoard server, and you can view the experiment visualizations by opening a web browser and navigating to http://localhost:6006.

    Figure: Example of TensorBoard visualization for Green-DCC experiments.

Benchmarking

The Green-DCC environment supports benchmarking various multi-agent and hierarchical control algorithms to evaluate their effectiveness in optimizing workload distribution and minimizing the carbon footprint of data center clusters. This section provides instructions on how to run benchmarks using different algorithms and configurations.

Tested Algorithms

While Green-DCC is compatible with a wide range of algorithms provided by Ray RLlib, our experiments have primarily tested and validated the following algorithms:

  • Advantage Actor-Critic (A2C)
  • Asynchronous Proximal Policy Optimization (APPO)
  • Proximal Policy Optimization (PPO)

These algorithms have been successfully trained and evaluated within the Green-DCC environment, demonstrating their performance in terms of energy consumption, carbon footprint, and other relevant metrics.

Other algorithms listed on the Ray RLlib documentation should also be compatible with Green-DCC, but additional work may be required to adapt the environment to the expected input and output shapes of each method as implemented in RLlib. For more details on these algorithms and how to adapt them for Green-DCC, refer to the Ray RLlib documentation.
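
As a quick sanity check when adapting another algorithm, the dummy environment used in the training script can be instantiated to inspect the spaces a high-level policy must handle. This is a minimal sketch, assuming it is run from the repository root with the dependencies installed:

    from envs.heirarchical_env import HeirarchicalDCRL, DEFAULT_CONFIG

    # Dummy env, as in train_truly_hierarchical.py, used only to query spaces.
    hdcrl_env = HeirarchicalDCRL()
    print("High-level observation space:", hdcrl_env.observation_space)
    print("High-level action space:", hdcrl_env.action_space)
    print("Configured data centers:", [k for k in DEFAULT_CONFIG if k.startswith('config')])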

The key configuration entry points are utils/dc_config.json and DEFAULT_CONFIG in envs/heirarchical_env.py.

Running Benchmarks

  1. Navigate to the Green-DCC directory

    Ensure you are in the green-dcc directory:

    cd green-dcc
  2. Configure the benchmark

    Edit the configuration files as needed to set up your desired benchmark parameters.

    • The configuration file for each simulated data center (number of cabinets, rows, HVAC configuration, etc.) can be found in the utils/dc_config_dcX.json files.
    • Update the DEFAULT_CONFIG in envs/heirarchical_env.py.

    Here is an example of the DEFAULT_CONFIG in heirarchical_env.py:

    DEFAULT_CONFIG = {
        # DC1
        'config1': {
            'location': 'NY',
            'cintensity_file': 'NY_NG_&_avgCI.csv',
            'weather_file': 'USA_NY_New.York-LaGuardia.epw',
            'workload_file': 'Alibaba_CPU_Data_Hourly_1.csv',
            'dc_config_file': 'dc_config_dc3.json',
            'datacenter_capacity_mw': 1.0,
            'timezone_shift': 0,
            'month': 7,
            'days_per_episode': 30,
            'partial_obs': True,
            'nonoverlapping_shared_obs_space': True
        },
    
        # DC2
        'config2': {
            'location': 'GA',
            'cintensity_file': 'GA_NG_&_avgCI.csv',
            'weather_file': 'USA_GA_Atlanta-Hartsfield-Jackson.epw',
            'workload_file': 'Alibaba_CPU_Data_Hourly_1.csv',
            'dc_config_file': 'dc_config_dc2.json',
            'datacenter_capacity_mw': 1.0,
            'timezone_shift': 2,
            'month': 7,
            'days_per_episode': 30,
            'partial_obs': True,
            'nonoverlapping_shared_obs_space': True
        },
    
        # DC3
        'config3': {
            'location': 'CA',
            'cintensity_file': 'CA_NG_&_avgCI.csv',
            'weather_file': 'USA_CA_San.Jose-Mineta.epw',
            'workload_file': 'Alibaba_CPU_Data_Hourly_1.csv',
            'dc_config_file': 'dc_config_dc1.json',
            'datacenter_capacity_mw': 1.0,
            'timezone_shift': 3,
            'month': 7,
            'days_per_episode': 30,
            'partial_obs': True,
            'nonoverlapping_shared_obs_space': True
        },
        
        # Number of transfers per step
        'num_transfers': 1,
    
        # List of active low-level agents
        'active_agents': ['agent_dc'],
    }
  3. Train and evaluate algorithms

    To train and evaluate an RL algorithm using Ray, use the appropriate training script. Here are the commands for different configurations:

    • HRL (Hierarchical Reinforcement Learning) Configuration:

      python train_truly_hierarchical.py
    • HL+LLP (High Level + Low-Level Pretrained) Configuration:

      python baselines/train_geo_dcrl.py
    • HLO (High Level Only) Configuration:

      python baselines/train_hierarchical.py

    The provided training script (train_truly_hierarchical.py) uses Ray for distributed training. Here's a brief overview of the script for the PPO-based HRL configuration:

    import os
    import ray
    from ray import air, tune
    from ray.rllib.algorithms.ppo import PPO, PPOConfig
    from gymnasium.spaces import Discrete, Box
    
    from envs.truly_heirarchical_env import TrulyHeirarchicalDCRL
    from envs.heirarchical_env import HeirarchicalDCRL, DEFAULT_CONFIG
    from create_trainable import create_wrapped_trainable
    
    NUM_WORKERS = 1
    NAME = "test"
    RESULTS_DIR = './results/'
    
    # Dummy env to get obs and action space
    hdcrl_env = HeirarchicalDCRL()
    
    CONFIG = (
            PPOConfig()
            .environment(
                env=TrulyHeirarchicalDCRL,
                env_config=DEFAULT_CONFIG
            )
            .framework("torch")
            .rollouts(
                num_rollout_workers=NUM_WORKERS,
                rollout_fragment_length=2,
                )
            .training(
                gamma=0.99,
                lr=1e-5,
                kl_coeff=0.2,
                clip_param=0.1,
                entropy_coeff=0.0,
                use_gae=True,
                train_batch_size=4096,
                num_sgd_iter=10,
                model={'fcnet_hiddens': [64, 64]}, 
                shuffle_sequences=True
            )
            .multi_agent(
            policies={
                "high_level_policy": (
                    None,
                    hdcrl_env.observation_space,
                    hdcrl_env.action_space,
                    PPOConfig()
                ),
                "DC1_ls_policy": (
                    None,
                    Box(-1.0, 1.0, (14,)),
                    Discrete(3),
                    PPOConfig()
                ),
                "DC2_ls_policy": (
                    None,
                    Box(-1.0, 1.0, (14,)),
                    Discrete(3),
                    PPOConfig()
                ),
                "DC3_ls_policy": (
                    None,
                    Box(-1.0, 1.0, (14,)),
                    Discrete(3),
                    PPOConfig()
                ),
            },
            policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: agent_id,
            )
            .resources(num_gpus=0)
            .debugging(seed=0)
        )
    
    
    if __name__ == "__main__":
        os.environ["RAY_DEDUP_LOGS"] = "0"
        ray.init(ignore_reinit_error=True)
        
        tune.Tuner(
            create_wrapped_trainable(PPO),
            param_space=CONFIG.to_dict(),
            run_config=air.RunConfig(
                stop={"timesteps_total": 100_000_000},
                verbose=0,
                local_dir=RESULTS_DIR,
                name=NAME,
                checkpoint_config=ray.air.CheckpointConfig(
                    checkpoint_frequency=5,
                    num_to_keep=5,
                    checkpoint_score_attribute="episode_reward_mean",
                    checkpoint_score_order="max"
                ),
            )
        ).fit()

    This example assumes a DCC with three data centers. To use a different algorithm, such as A2C, replace PPOConfig with A2CConfig (or the appropriate config class for the algorithm) and adjust the hyperparameters accordingly; PPO-specific options such as kl_coeff, clip_param, and num_sgd_iter do not apply to A2C and are omitted below. For example:

    from ray.rllib.algorithms.a2c import A2C, A2CConfig
    
    CONFIG = (
            A2CConfig()
            .environment(
                env=TrulyHeirarchicalDCRL,
                env_config=DEFAULT_CONFIG
            )
            .framework("torch")
            .rollouts(
                num_rollout_workers=NUM_WORKERS,
                rollout_fragment_length=2,
                )
            .training(
                gamma=0.99,
                lr=1e-5,
                entropy_coeff=0.0,
                use_gae=True,
                train_batch_size=4096,
                model={'fcnet_hiddens': [64, 64]},
            )
            .multi_agent(
            policies={
                "high_level_policy": (
                    None,
                    hdcrl_env.observation_space,
                    hdcrl_env.action_space,
                    A2CConfig()
                ),
                "DC1_ls_policy": (
                    None,
                    Box(-1.0, 1.0, (14,)),
                    Discrete(3),
                    A2CConfig()
                ),
                "DC2_ls_policy": (
                    None,
                    Box(-1.0, 1.0, (14,)),
                    Discrete(3),
                    A2CConfig()
                ),
                "DC3_ls_policy": (
                    None,
                    Box(-1.0, 1.0, (14,)),
                    Discrete(3),
                    A2CConfig()
                ),
            },
            policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: agent_id,
            )
            .resources(num_gpus=0)
            .debugging(seed=1)
        )
    
    
    if __name__ == "__main__":
        os.environ["RAY_DEDUP_LOGS"] = "0"
        ray.init(ignore_reinit_error=True)
        
        tune.Tuner(
            create_wrapped_trainable(A2C),
            param_space=CONFIG.to_dict(),
            run_config=air.RunConfig(
                stop={"timesteps_total": 100_000_000},
                verbose=0,
                local_dir=RESULTS_DIR,
                name=NAME,
                checkpoint_config=ray.air.CheckpointConfig(
                    checkpoint_frequency=5,
                    num_to_keep=5,
                    checkpoint_score_attribute="episode_reward_mean",
                    checkpoint_score_order="max"
                ),
            )
        ).fit()
  4. Compare results

    After running the benchmarks, you can compare the results by examining the output files in the results/ directory. These files include detailed metrics on energy consumption, carbon footprint, and workload distribution across data centers. Use these metrics to assess the relative performance of different algorithms and configurations.
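
    For a quick programmatic comparison, the metrics logged during training can be read back with pandas. This is a minimal sketch that assumes Ray Tune's default trial logging (a progress.csv file per trial under results/<experiment name>) and the standard RLlib episode_reward_mean column:

    import glob
    import os

    import pandas as pd

    EXPERIMENT_DIR = "./results/test"  # experiment name used in the training script

    # Each Ray Tune trial directory contains a progress.csv with the reported metrics.
    for csv_path in glob.glob(os.path.join(EXPERIMENT_DIR, "*", "progress.csv")):
        df = pd.read_csv(csv_path)
        trial = os.path.basename(os.path.dirname(csv_path))
        if "episode_reward_mean" in df.columns:
            print(f"{trial}: best episode_reward_mean = {df['episode_reward_mean'].max():.2f}")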

Evaluation Metrics

Green-DCC provides a range of evaluation metrics to assess the performance of the benchmarked algorithms:

  • Energy Consumption: Total energy consumed by the data centers during the simulation.
  • Carbon Footprint: Total carbon emissions generated by the data centers, calculated based on the energy mix and carbon intensity of the power grid.
  • Workload Distribution: Efficiency of workload distribution across data centers, considering factors like latency, bandwidth cost, and data center utilization.

These metrics provide a comprehensive view of the performance of different algorithms and configurations, enabling you to identify the most effective strategies for sustainable data center management.
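
As an illustration of how the carbon footprint metric is typically computed, total emissions are the sum of hourly energy consumption weighted by the grid carbon intensity in that hour. The snippet below is a minimal sketch with hypothetical values, not the exact Green-DCC accounting code:

    def carbon_footprint_kg(energy_kwh, carbon_intensity_g_per_kwh):
        """Total carbon footprint in kg CO2 for aligned hourly series."""
        assert len(energy_kwh) == len(carbon_intensity_g_per_kwh)
        grams = sum(e * ci for e, ci in zip(energy_kwh, carbon_intensity_g_per_kwh))
        return grams / 1000.0

    # Hypothetical example: three hours of operation for one data center.
    print(carbon_footprint_kg([900.0, 850.0, 1000.0], [350.0, 300.0, 420.0]))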

Customizing Benchmarks

Green-DCC is designed to be highly customizable, allowing you to tailor the benchmark environment to your specific needs. You can modify the configuration files to:

  • Add or remove data center locations.
  • Adjust the workload characteristics, such as the proportion of shiftable tasks.
  • Change the parameters of the RL algorithms, such as learning rates and discount factors.
  • Include additional control strategies, such as energy storage or renewable energy integration.

Refer to the detailed documentation for more information on customizing the Green-DCC environment and running advanced benchmarks.
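
For example, an experiment can be customized by copying DEFAULT_CONFIG and overriding individual fields before passing it as env_config. This is a minimal sketch, limited to keys shown in the Benchmarking section above:

    import copy

    from envs.heirarchical_env import DEFAULT_CONFIG

    # Copy the default cluster configuration and override a few fields.
    my_config = copy.deepcopy(DEFAULT_CONFIG)
    my_config['config1']['month'] = 1             # simulate January instead of July
    my_config['config1']['days_per_episode'] = 7  # shorter episodes
    my_config['num_transfers'] = 2                # allow two geographic transfers per step

    # Pass my_config to the trainer in place of DEFAULT_CONFIG, e.g.:
    #   .environment(env=TrulyHeirarchicalDCRL, env_config=my_config)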

We are continually expanding Green-DCC to integrate additional control strategies and external energy sources, including auxiliary battery integration and on-site renewable energy generators (solar, wind, etc.). This ongoing development ensures that Green-DCC remains a comprehensive and up-to-date benchmarking tool for sustainable data center management.

Experimental Details

For all experiments, we considered three different locations: New York (NY), Atlanta (GA), and San Jose (CA). These locations were chosen to present a variety of weather conditions and carbon intensity profiles, creating a comprehensive and challenging evaluation environment. The goal was to develop a policy capable of addressing the unique challenges specific to each location. We utilized weather and carbon intensity data from the month of July. Weather data was sourced from EnergyPlus, and carbon intensity data was retrieved from the EIA API. The base workload for our experiments was derived from open-source workload traces provided by Alibaba (GitHub repository). Users can use their own data for weather, carbon intensity, and workload.

Each data center (DC) had a capacity of 1 megawatt (MW).

Green-DCC offers support for more locations beyond the three selected for these experiments. Detailed information about these additional locations can be found in the Provided Locations section. The diverse climate and carbon intensity characteristics of these locations allow for extensive benchmarking and evaluation of RL controllers.

Weather and Carbon Intensity Data

Figure: Weather conditions (temperature) for New York, Atlanta, and San Jose over the month of July.

Figure: Carbon intensity profiles for New York, Atlanta, and San Jose over the month of July.

Workload Distribution Comparison

Figure: Comparison of workload distribution across the three data centers under the Do Nothing controller.

Figure: Comparison of workload distribution across the three data centers under the HLO RL Controller.

Provided Locations

Green-DCC supports a wide range of locations, each with distinct weather patterns and carbon intensity profiles. This diversity allows for extensive benchmarking and evaluation of RL controllers under various environmental conditions. The table and the figure below provide a summary of the provided locations, including typical weather conditions and carbon intensity characteristics.

Provided Location Summaries

Location   | Typical Weather                      | Carbon Intensity (CI)
-----------|--------------------------------------|----------------------
Arizona    | Hot, dry summers; mild winters       | High avg CI
California | Mild, Mediterranean climate          | Medium avg CI
Georgia    | Hot, humid summers; mild winters     | High avg CI
Illinois   | Cold winters; hot, humid summers     | High avg CI
New York   | Cold winters; hot, humid summers     | Medium avg CI
Texas      | Hot summers; mild winters            | Medium avg CI
Virginia   | Mild climate, seasonal variations    | Medium avg CI
Washington | Mild, temperate climate; wet winters | Low avg CI

Table: Summary of Provided Locations with Typical Weather and Carbon Intensity Characteristics

External Temperature Distribution

The figure below illustrates the external temperature profiles for the different selected locations during the month of July, highlighting the variations in weather conditions that affect cooling requirements and energy consumption.

Figure: External temperature profiles for the selected locations during July.

Carbon Intensity Distribution

The figure below shows the average daily carbon intensity for the selected locations during the month of July, providing insight into the environmental impact of energy consumption at these locations.

Figure: Average daily carbon intensity for the selected locations during July.

These locations were chosen because they are typical data center locations within the United States, offering a variety of environmental conditions that reflect real-world challenges faced by data centers.

Contributing

We welcome contributions to Green-DCC! If you are interested in contributing to the project, please follow the guidelines below.

How to Contribute

  1. Fork the Repository

    Start by forking the Green-DCC repository to your GitHub account, then clone your fork:

    git clone https://github.com/YOUR_USERNAME/green-dcc.git
    cd green-dcc
  2. Create a Branch

    Create a new branch for your feature or bug fix.

    git checkout -b feature-or-bugfix-name
  3. Make Changes

    Make your changes to the codebase. Be sure to follow the existing coding style and conventions.

  4. Commit Your Changes

    Commit your changes with a clear and descriptive commit message.

    git add .
    git commit -m "Description of your changes"
  5. Push to Your Fork

    Push your changes to your forked repository.

    git push origin feature-or-bugfix-name
  6. Create a Pull Request

    Go to the original Green-DCC repository and create a pull request. Provide a clear description of your changes and any additional context that might be useful for the review.

Code of Conduct

Please note that we have a Code of Conduct. By participating in this project, you agree to abide by its terms.

Development Guidelines

  • Follow the coding style and conventions used in the existing codebase.
  • Write clear and concise commit messages.
  • Document your code where necessary to make it easier for others to understand.
  • Ensure that your changes do not break existing functionality by running tests and validating your code.

Testing

Before submitting a pull request, make sure your changes pass the existing tests and add new tests if your changes introduce new functionality.

Thank you for your interest in contributing to Green-DCC! We appreciate your support and look forward to your contributions.

Contact

If you have any questions, feedback, or need assistance, please feel free to reach out to us. We are here to help and would love to hear from you.

For any project-specific queries or issues, you can contact us at [email protected].

Reporting Issues

If you encounter any issues or bugs with Green-DCC, please report them on our GitHub Issues page. Provide as much detail as possible to help us understand and resolve the issue.

Thank you for your interest in Green-DCC. We look forward to your contributions and feedback!

License

Green-DCC is licensed under the MIT License.

For more details, please refer to the LICENSE file in the repository.

By contributing to Green-DCC, you agree that your contributions will be licensed under the MIT License.