
QAT plugin crashes when device is detected, but not fully configured #1571

Closed
brgavino opened this issue Oct 20, 2023 · 15 comments


@brgavino

A QAT device may be present on a node but configured only with the kernel driver, or SR-IOV mode may not have been enabled. The operator scans the available PCI devices but does not check whether the further prerequisites described in the QAT plugin prerequisites section are met (e.g. the vfio-pci driver loaded, VFs enabled on the QAT device, etc.).

In this situation a QAT device is present in the system but is not available for node resource allocation or exposed to the cluster, while other nodes may have fully configured QAT devices available.

It is unclear how to limit plugin deployment via the operator so that nodes with installed but unconfigured devices are skipped, without disabling QAT plugin deployment for the whole cluster. Ideally the operator would continue deploying plugins for other detected devices and avoid deploying to nodes whose devices are installed but unconfigured.

Please advise on the correct approach/behavior.

Example failure of the QAT plugin when a device is present per lspci but not intended to be configured on the node:

I1020 03:28:26.413601       1 qat_plugin.go:62] QAT device plugin started in 'dpdk' mode
E1020 03:28:26.413805       1 manager.go:102] Device scan failed: open /sys/bus/pci/drivers/vfio-pci/new_id: permission denied
write to driver failed: 8086 4941
github.com/intel/intel-device-plugins-for-kubernetes/cmd/qat_plugin/dpdkdrv.writeToDriver
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/cmd/qat_plugin/dpdkdrv/dpdkdrv.go:445
github.com/intel/intel-device-plugins-for-kubernetes/cmd/qat_plugin/dpdkdrv.(*DevicePlugin).setupDeviceIDs
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/cmd/qat_plugin/dpdkdrv/dpdkdrv.go:196
github.com/intel/intel-device-plugins-for-kubernetes/cmd/qat_plugin/dpdkdrv.(*DevicePlugin).Scan
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/cmd/qat_plugin/dpdkdrv/dpdkdrv.go:210
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).Run.func1
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:100
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1598
failed to set device ID 4941 for vfio-pci. Driver module not loaded?
@mythi
Contributor

mythi commented Oct 20, 2023

It is unclear how to limit plugin deployment via the operator so that nodes with installed but unconfigured devices are skipped, without disabling QAT plugin deployment for the whole cluster. Ideally the operator would continue deploying plugins for other detected devices and avoid deploying to nodes whose devices are installed but unconfigured.

The operator is not HW aware; it just deploys the device plugin DaemonSets based on user-provided resources. The CRD has two fields relevant to your question:

- initImage is used for HW provisioning. It enables SR-IOV if it is not enabled and does other configuration if requested.
- nodeSelector is the filter label used to select the right nodes for the QAT plugin to run on. The default label comes from node-feature-discovery, based on a NodeFeatureRule.
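As a sketch, a minimal QatDevicePlugin resource exercising both fields could look like the following (the image tags, driver names, and label value are illustrative, not verified defaults):

```yaml
apiVersion: deviceplugin.intel.com/v1
kind: QatDevicePlugin
metadata:
  name: qatdeviceplugin-sample
spec:
  image: intel/intel-qat-plugin:0.28.0              # illustrative tag
  initImage: intel/intel-qat-initcontainer:0.28.0   # HW provisioning, e.g. enabling SR-IOV
  dpdkDriver: vfio-pci
  kernelVfDrivers:
    - 4xxxvf
  nodeSelector:
    intel.feature.node.kubernetes.io/qat: "true"    # NFD-provided label
```

With a spec like this, only nodes carrying the nodeSelector label receive the plugin DaemonSet pods, and the init container runs the HW provisioning on those nodes before the plugin starts.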

@eero-t
Contributor

eero-t commented Oct 20, 2023

The operator scans the available PCI devices but does not check whether the further prerequisites described in the QAT plugin prerequisites section are met (e.g. the vfio-pci driver loaded, VFs enabled on the QAT device, etc.).

nodeSelector is the filter label used to select the right nodes for the QAT plugin to run on. The default label comes from node-feature-discovery, based on a NodeFeatureRule.

There could be another NFD label rule for the VF module (assuming the related features are all configured as modules on distro kernels or loaded via DKMS), and that label could be added to the nodeSelector when the dpdkDriver: vfio-pci option is used.

NFD label rule can also check PCI device sriov_totalvfs attribute for the rules:
https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/customization-guide.html#available-features

(On a quick check I did not find when that was added, but it's there e.g. in NFD v0.12 docs.)
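For illustration, a custom NodeFeatureRule along these lines could label only nodes whose QAT PCI devices report SR-IOV capability (the label name, PCI class, and device IDs below are assumptions for the sketch, not the project's actual rule):

```yaml
apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: qat-sriov-rule
spec:
  rules:
    - name: "qat sriov capable"
      labels:
        "intel.qat.sriov": "true"          # hypothetical label name
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            class: {op: In, value: ["0b40"]}          # illustrative PCI class
            sriov_totalvfs: {op: Gt, value: ["0"]}
```

The resulting label could then be added to the plugin's nodeSelector so that nodes whose devices lack SR-IOV capability are skipped.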

@brgavino
Author

Thanks, this is great information to have. Combining these, it is possible to ignore certain nodes, and also to ignore nodes that are not configured (such as for SR-IOV). Should this be added to some section of the documentation (such as the end of This section of the QAT README)?

@eero-t
Contributor

eero-t commented Oct 20, 2023

Can you do some testing with such rules?

If that works fine, project NFD rules overlay can be updated, and QAT example changed to use the new label.

I think the relation between the label used in nodeSelector, the driver variant selected in the same YAML file, and what options there are for those is better documented directly in the example file comments. The README can then have a more generic note about that, e.g. "The example file documents the relations between labels given for nodeSelector and the requested QAT attributes".

@mythi
Contributor

mythi commented Oct 23, 2023

Thanks, this is great information to have. Combining these, it is possible to ignore certain nodes, and also to ignore nodes that are not configured (such as for SR-IOV). Should this be added to some section of the documentation (such as the end of This section of the QAT README)?

We have plans to improve the documentation on HW initialization and troubleshooting (#1555) and we'll keep this feedback in mind.

On the error itself:

device scan failed: open /sys/bus/pci/drivers/vfio-pci/new_id: permission denied write to driver failed: 8086 4941

'Permission denied' is often linked to the case where the plugin runs on an Ubuntu node with AppArmor enabled. For that, the "fix" is to deploy the plugin with an unconfined AppArmor policy.
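For reference, a deployment can mark the plugin container as unconfined with the standard AppArmor annotation (the container name here is a placeholder and must match the actual plugin container name):

```yaml
# DaemonSet pod template snippet (sketch); "intel-qat-plugin" is a placeholder name
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/intel-qat-plugin: unconfined
```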

@tkatila
Contributor

tkatila commented Oct 23, 2023

Could we just print the error and not abort? It would tighten the feedback loop for cases where a node should get QAT resources but is incorrectly configured.
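A hypothetical sketch of that behavior, using placeholder helpers rather than the actual dpdkdrv code: the scan logs each per-device failure and keeps going, instead of propagating the error and crashing the plugin.

```go
package main

import (
	"fmt"
	"os"
)

// writeToDriver stands in for the real sysfs write in dpdkdrv.go;
// here it simply fails for a placeholder "bad" device.
func writeToDriver(dev string) error {
	if dev == "bad" {
		return fmt.Errorf("write to driver failed: %s", dev)
	}
	return nil
}

// setupDeviceIDs configures each device but, unlike the current code,
// logs a failure and continues instead of aborting the whole scan.
func setupDeviceIDs(devices []string) []string {
	var configured []string
	for _, dev := range devices {
		if err := writeToDriver(dev); err != nil {
			// previously the error propagated up and the plugin aborted here
			fmt.Fprintf(os.Stderr, "skipping device %s: %v\n", dev, err)
			continue
		}
		configured = append(configured, dev)
	}
	return configured
}

func main() {
	ok := setupDeviceIDs([]string{"0000:6b:00.0", "bad"})
	fmt.Println(len(ok)) // prints 1
}
```

The trade-off is that a misconfigured node then advertises zero resources silently, so the log message (or a node condition) becomes the only signal of the misconfiguration.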

@tkatila
Contributor

tkatila commented Oct 23, 2023

For the NFD rule, we could also add checks for vfio-pci being either loaded as a module (kernel.loadedmodule) or built in (kernel.enabledmodule).
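Such a check could be sketched as a matchAny block inside a NodeFeatureRule (the underscore spelling of the module name is an assumption; loaded-module names typically use underscores):

```yaml
# NodeFeatureRule rule fragment (sketch): match if vfio_pci is loaded or built in
matchAny:
  - matchFeatures:
      - feature: kernel.loadedmodule
        matchExpressions:
          vfio_pci: {op: Exists}
  - matchFeatures:
      - feature: kernel.enabledmodule
        matchExpressions:
          vfio_pci: {op: Exists}
```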

@brgavino
Author

@eero-t

NFD label rule can also check PCI device sriov_totalvfs attribute for the rules: https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/customization-guide.html#available-features

Is this going to be a helpful selector? vfio-pci may be available and sriov_totalvfs set for each PCI device, but for whatever reason SR-IOV may not be in the enabled state (particularly when vfio-pci is built into the kernel): /sys/module/vfio_pci/parameters/enable_sriov = N. (I haven't found a way to check this via features yet.)

There are several states in which sriov/vfio-pci may not be in the configured state the plugin needs. We can at least hope that with all of these present, the QAT plugin can initialize successfully and not render the node unschedulable.
(This doesn't include any rules for nodes to explicitly exclude, just features related to the QAT/vfio configuration state.)
qat-exclusion.yaml.txt

@eero-t
Contributor

eero-t commented Oct 25, 2023

I thought sriov_totalvfs is the maximum number of VFs a given kernel device supports. If SR-IOV is not enabled for the device, its being non-zero sounds like a bug...?

EDIT: with an NFD rule providing the sriov_totalvfs value as-is, the pod spec needs to use a Gt rule in nodeAffinity (instead of a label nodeSelector): https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity
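A pod spec using such a value-carrying label could look like the following sketch (the label key is hypothetical):

```yaml
# pod spec fragment (sketch); the label key is a hypothetical NFD-produced label
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/qat-sriov-totalvfs
              operator: Gt
              values: ["0"]
```

Note that Gt/Lt operators are available in nodeAffinity matchExpressions but not in the plain nodeSelector map, which is why the affinity form is needed here.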

@mythi
Contributor

mythi commented Oct 26, 2023

Is this going to be a helpful selector? vfio-pci may be available and sriov_totalvfs set for each PCI device, but for whatever reason SR-IOV may not be in the enabled state (particularly when vfio-pci is built into the kernel): /sys/module/vfio_pci/parameters/enable_sriov = N. (I haven't found a way to check this via features yet.)

enable_sriov for vfio-pci is not relevant here. As suggested by @tkatila, we can improve the labeling by adding a rule for vfio-pci being available (module loaded or built-in available). We can also add "IOMMU enabled" which is another pre-condition.

There are several states in which sriov/vfio-pci may not be in the configured state the plugin needs. We can at least hope that with all of these present, the QAT plugin can initialize successfully and not render the node unschedulable. (This doesn't include any rules for nodes to explicitly exclude, just features related to the QAT/vfio configuration state.) qat-exclusion.yaml.txt

I'm not able to follow this part. A plugin failing won't make the node unschedulable (other than workloads requesting the plugin resources won't land there).

@brgavino
Author

I'm not able to follow this part. A plugin failing won't make the node unschedulable (other than workloads requesting the plugin resources won't land there).

While the error reported above only crashes the pod (it can also lead to a node being labelled as degraded via a health monitor, for example), the (mis)configuration of QAT can cause a fatal node error (until the QAT plugin is removed).

(I won't derail this conversation here, but I have seen this happen in my test cluster)

At the minimum, we need to skip nodes that have no chance of being configured for the QAT plugin (i.e. SR-IOV not properly configured, IOMMU disabled, etc.) and define nodes (via labels) to skip deployment to (for various other reasons, such as being a control plane node).

So apart from the NFD rules we will set up to check the configuration (as above), is it possible to just set the label intel.feature.node.kubernetes.io/qat: 'true' to 'false' per node or role, or does the QAT plugin spec require a specific anti-affinity instead (applied via a kustomize or Helm value)? To summarize: should explicit "QAT plugin exclusion" be handled with NFD, or via the QAT plugin deployment?

@mythi
Contributor

mythi commented Oct 31, 2023

To summarize: handle explicit "qat plugin exclusion" either with NFD, or via the qat plugin deployment ?

Perhaps use the NFD tainting feature?
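As a sketch of that idea, a NodeFeatureRule can attach a taint to nodes matching a rule; here a node missing the vfio_pci module is tainted so that no plugin pods (without a matching toleration) land on it. The taint key, rule name, and match logic are illustrative only:

```yaml
apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: qat-unconfigured-taint
spec:
  rules:
    - name: "taint nodes missing vfio-pci"
      taints:
        - key: "example.com/qat-unconfigured"   # hypothetical taint key
          value: "true"
          effect: NoSchedule
      matchFeatures:
        - feature: kernel.loadedmodule
          matchExpressions:
            vfio_pci: {op: DoesNotExist}
```

Note that NFD's tainting support may need to be explicitly enabled on the nfd-master side; check the NFD documentation for the exact flag in your version.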

@mythi
Contributor

mythi commented Feb 14, 2024

@brgavino are you OK if we close this, or is there something still missing? We have updated the default NFD rules to make the labeling more reliable (e2a7e5d), and the default QAT deployment now sets an annotation to make the AppArmor case work (ab0e8bc). Also, the NFD tainting feature was proposed.

@brgavino
Author

looks good, thanks for the update - can close this

@mythi
Contributor

mythi commented Feb 14, 2024

@brgavino thanks for the confirmation, closing

@mythi mythi closed this as completed Feb 14, 2024