From d96f9802b87649abd1e0fc7367eb27d5f35340a2 Mon Sep 17 00:00:00 2001 From: Meng Xu Date: Thu, 29 Jun 2023 09:52:46 -0700 Subject: [PATCH] Resolve review comments 1. Add YAML example for design section 2. Fix typos and styles --- docs/design/node_taint_eviction.md | 17 +++++++++++++---- docs/manual/replacements_and_deletions.md | 12 ++++++------ 2 files changed, 19 insertions(+), 10 deletions(-) diff --git a/docs/design/node_taint_eviction.md b/docs/design/node_taint_eviction.md index 2a2d33414..5481b39da 100644 --- a/docs/design/node_taint_eviction.md +++ b/docs/design/node_taint_eviction.md @@ -26,9 +26,18 @@ The operator does replace failed ProcessGroups in the [Replace Failed ProcessGro ## Proposed Design ### Design Requirements -**Requirement 1.** The feature can be enabled and disabled by users, because (1) when someone wants to run the operator inside a k8s cluster without Node access, they should be able to bypass this feature; (2) when we notice problems during roll-out of this feature, we need to disable it online. +**Requirement 1.** The feature can be enabled and disabled by users, because (1) when someone wants to run the operator inside a k8s cluster without Node access, they should be able to bypass this feature; (2) when we notice problems during roll-out of this feature, we need to disable it. -The feature can be disabled by option `automationOptions.replacements.taintReplacementOptions = {}`. +The feature can be disabled by option `automationOptions.replacements.taintReplacementOptions = {}`. YAML example is as follows. Please note that `taintReplacementOptions` section is deleted. + +```yaml +... +automaticReplacementOptions: + enabled: true + failureDetectionTimeSeconds: 7200 + maxConcurrentReplacements: 1 + enableNodeFailureDetection: true +``` **Requirement 2.** The feature allows users to configure a set of specific taint keys to detect and replace. @@ -45,7 +54,7 @@ automaticReplacementOptions: failureDetectionTimeSeconds: 7200 maxConcurrentReplacements: 1 enableNodeFailureDetection: true - taints: + taintReplacementOptions: - key: "example.org/maintenance" durationInSeconds: 7200 # Mark Pods to be replaced after this taint key has been on the Node for at least 7200 seconds (i.e., 2 hours). - key: "example.org/disconnected" @@ -71,7 +80,7 @@ We need to make the following four types of changes: - `NodeTaintDetected` represents a ProcessGroup Node is tainted, but not long enough to replace it. - `NodeTaintReplacing` represents a ProcessGroup whose Node has been tainted for longer-than-configured time and the ProcessGroup should be replaced. -**Replace ProcessGroups with `NodeTaintReplacing` conditioin.** We can extend the existing replacement logic to replace ProcessGroups with `NodeTaintReplacing` condition. We can do this by adding the `NodeTaintReplacing` condition to [`conditionsThatNeedReplacement` list](https://github.com/FoundationDB/fdb-kubernetes-operator/blob/d38d6bc7abf7764215976b75d581c933280389f8/api/v1beta2/foundationdbcluster_types.go#L65-L67) +**Replace ProcessGroups with `NodeTaintReplacing` condition.** We can extend the existing replacement logic to replace ProcessGroups with `NodeTaintReplacing` condition. We can do this by adding the `NodeTaintReplacing` condition to [`conditionsThatNeedReplacement` list](https://github.com/FoundationDB/fdb-kubernetes-operator/blob/d38d6bc7abf7764215976b75d581c933280389f8/api/v1beta2/foundationdbcluster_types.go#L65-L67) ## Limitations diff --git a/docs/manual/replacements_and_deletions.md b/docs/manual/replacements_and_deletions.md index e64805b02..ff00cac3d 100644 --- a/docs/manual/replacements_and_deletions.md +++ b/docs/manual/replacements_and_deletions.md @@ -27,7 +27,7 @@ The following changes can only be rolled out through replacement: The number of inflight replacements can be configured by setting `maxConcurrentReplacements`, per default the operator will replace all misconfigured process groups. Depending on the cluster size this can require a quota that is has double the capacity of the actual required resources. -## Automatic Replacements for ProcessGroups in Bad State +## Automatic Replacements for ProcessGroups in Undesired State The operator has an option to automatically replace pods that are in a bad state. This behavior is disabled by default, but you can enable it by setting the field `automationOptions.replacements.enabled` in the cluster spec. This will replace any pods that meet the following criteria: @@ -43,7 +43,7 @@ The following conditions are currently eligible for replacement: * `MissingPVC`: This indicates that a process group that doesn't have a PVC assigned. * `MissingService`: This indicates that a process group that doesn't have a Service assigned. * `PodPending`: This indicates that a process group where the Pod is in a pending state. -* `NodeTaintReplacing`: This indicates a process group where the Pod has been running on a tainted Node for longer-than configured duration. If a ProcessGroup has the `NodeTaintReplacing` condition, the replacement cannot be stopped, even after the Node taint was removed. +* `NodeTaintReplacing`: This indicates a process group where the Pod has been running on a tainted Node for at least the configured duration. If a ProcessGroup has the `NodeTaintReplacing` condition, the replacement cannot be stopped, even after the Node taint was removed. Process groups that are set into the crash loop state with the `Buggify` setting won't be replaced by the operator. If the `cluster.Spec.Buggify.EmptyMonitorConf` setting is active the operator won't replace any process groups. @@ -57,7 +57,7 @@ We use three examples below to illustrate how to set up the feature. The following YAML setup lets the operator detect Pods running on Nodes with taint key `example.com/maintenance`, set the ProcessGroup' condition to `NodeTaintReplacing`, if their Nodes have been tainted for 3600 seconds, and replace the Pods after 1800 seconds. -``` +```yaml spec: automationOptions: replacements: @@ -72,8 +72,8 @@ If there are multiple Pods on tainted Nodes, the operator will simultaneously re ### Example Setup 2 -We can enable taint feature on all taint keys except one taint key with the following YAML configuration: -``` +We can enable the taint feature on all taint keys except one taint key with the following configuration: +```yaml spec: automationOptions: replacements: @@ -91,7 +91,7 @@ The operator will detect and mark all Pods on tainted Nodes with `NodeTaintDetec We can disable the taint feature by resetting `automationOptions.replacements.taintReplacementOptions = {}`. The following example YAML config deletes the `taintReplacementOptions` section. -``` +```yaml spec: automationOptions: replacements: