Skip to content

Commit

Permalink
Resolve review comments
Browse files Browse the repository at this point in the history
1. Add YAML example for design section
2. Fix typos and styles
  • Loading branch information
xumengpanda committed Jun 29, 2023
1 parent a248659 commit d96f980
Show file tree
Hide file tree
Showing 2 changed files with 19 additions and 10 deletions.
17 changes: 13 additions & 4 deletions docs/design/node_taint_eviction.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,18 @@ The operator does replace failed ProcessGroups in the [Replace Failed ProcessGro
## Proposed Design

### Design Requirements
**Requirement 1.** The feature can be enabled and disabled by users, because (1) when someone wants to run the operator inside a k8s cluster without Node access, they should be able to bypass this feature; (2) when we notice problems during roll-out of this feature, we need to disable it online.
**Requirement 1.** The feature can be enabled and disabled by users, because (1) when someone wants to run the operator inside a k8s cluster without Node access, they should be able to bypass this feature; (2) when we notice problems during roll-out of this feature, we need to disable it.

The feature can be disabled by option `automationOptions.replacements.taintReplacementOptions = {}`.
The feature can be disabled by option `automationOptions.replacements.taintReplacementOptions = {}`. YAML example is as follows. Please note that `taintReplacementOptions` section is deleted.

```yaml
...
automaticReplacementOptions:
enabled: true
failureDetectionTimeSeconds: 7200
maxConcurrentReplacements: 1
enableNodeFailureDetection: true
```
**Requirement 2.** The feature allows users to configure a set of specific taint keys to detect and replace.
Expand All @@ -45,7 +54,7 @@ automaticReplacementOptions:
failureDetectionTimeSeconds: 7200
maxConcurrentReplacements: 1
enableNodeFailureDetection: true
taints:
taintReplacementOptions:
- key: "example.org/maintenance"
durationInSeconds: 7200 # Mark Pods to be replaced after this taint key has been on the Node for at least 7200 seconds (i.e., 2 hours).
- key: "example.org/disconnected"
Expand All @@ -71,7 +80,7 @@ We need to make the following four types of changes:
- `NodeTaintDetected` represents a ProcessGroup Node is tainted, but not long enough to replace it.
- `NodeTaintReplacing` represents a ProcessGroup whose Node has been tainted for longer-than-configured time and the ProcessGroup should be replaced.

**Replace ProcessGroups with `NodeTaintReplacing` conditioin.** We can extend the existing replacement logic to replace ProcessGroups with `NodeTaintReplacing` condition. We can do this by adding the `NodeTaintReplacing` condition to [`conditionsThatNeedReplacement` list](https://github.com/FoundationDB/fdb-kubernetes-operator/blob/d38d6bc7abf7764215976b75d581c933280389f8/api/v1beta2/foundationdbcluster_types.go#L65-L67)
**Replace ProcessGroups with `NodeTaintReplacing` condition.** We can extend the existing replacement logic to replace ProcessGroups with `NodeTaintReplacing` condition. We can do this by adding the `NodeTaintReplacing` condition to [`conditionsThatNeedReplacement` list](https://github.com/FoundationDB/fdb-kubernetes-operator/blob/d38d6bc7abf7764215976b75d581c933280389f8/api/v1beta2/foundationdbcluster_types.go#L65-L67)


## Limitations
Expand Down
12 changes: 6 additions & 6 deletions docs/manual/replacements_and_deletions.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ The following changes can only be rolled out through replacement:
The number of inflight replacements can be configured by setting `maxConcurrentReplacements`, per default the operator will replace all misconfigured process groups.
Depending on the cluster size this can require a quota that is has double the capacity of the actual required resources.

## Automatic Replacements for ProcessGroups in Bad State
## Automatic Replacements for ProcessGroups in Undesired State

The operator has an option to automatically replace pods that are in a bad state. This behavior is disabled by default, but you can enable it by setting the field `automationOptions.replacements.enabled` in the cluster spec.
This will replace any pods that meet the following criteria:
Expand All @@ -43,7 +43,7 @@ The following conditions are currently eligible for replacement:
* `MissingPVC`: This indicates that a process group that doesn't have a PVC assigned.
* `MissingService`: This indicates that a process group that doesn't have a Service assigned.
* `PodPending`: This indicates that a process group where the Pod is in a pending state.
* `NodeTaintReplacing`: This indicates a process group where the Pod has been running on a tainted Node for longer-than configured duration. If a ProcessGroup has the `NodeTaintReplacing` condition, the replacement cannot be stopped, even after the Node taint was removed.
* `NodeTaintReplacing`: This indicates a process group where the Pod has been running on a tainted Node for at least the configured duration. If a ProcessGroup has the `NodeTaintReplacing` condition, the replacement cannot be stopped, even after the Node taint was removed.

Process groups that are set into the crash loop state with the `Buggify` setting won't be replaced by the operator.
If the `cluster.Spec.Buggify.EmptyMonitorConf` setting is active the operator won't replace any process groups.
Expand All @@ -57,7 +57,7 @@ We use three examples below to illustrate how to set up the feature.

The following YAML setup lets the operator detect Pods running on Nodes with taint key `example.com/maintenance`, set the ProcessGroup' condition to `NodeTaintReplacing`, if their Nodes have been tainted for 3600 seconds, and replace the Pods after 1800 seconds.

```
```yaml
spec:
automationOptions:
replacements:
Expand All @@ -72,8 +72,8 @@ If there are multiple Pods on tainted Nodes, the operator will simultaneously re

### Example Setup 2

We can enable taint feature on all taint keys except one taint key with the following YAML configuration:
```
We can enable the taint feature on all taint keys except one taint key with the following configuration:
```yaml
spec:
automationOptions:
replacements:
Expand All @@ -91,7 +91,7 @@ The operator will detect and mark all Pods on tainted Nodes with `NodeTaintDetec

We can disable the taint feature by resetting `automationOptions.replacements.taintReplacementOptions = {}`. The following example YAML config deletes the `taintReplacementOptions` section.

```
```yaml
spec:
automationOptions:
replacements:
Expand Down

0 comments on commit d96f980

Please sign in to comment.