Dynamic Resource Allocation

FEATURE STATE: Kubernetes v1.35 [stable](enabled by default)

This page describes dynamic resource allocation (DRA) in Kubernetes.

About DRA

DRA is a Kubernetes feature that lets you request and share resources among Pods. These resources are often attached devices like hardware accelerators.

With DRA, device drivers and cluster admins define device classes that are available to claim in workloads. Kubernetes allocates matching devices to specific claims and places the corresponding Pods on nodes that can access the allocated devices.

Allocating resources with DRA is a similar experience to dynamic volume provisioning, in which you use PersistentVolumeClaims to claim storage capacity from storage classes and request the claimed capacity in your Pods.

Benefits of DRA

DRA provides a flexible way to categorize, request, and use devices in your cluster. Using DRA provides benefits like the following:

Flexible device filtering: use Common Expression Language (CEL) to perform fine-grained filtering for specific device attributes.
Device sharing: share the same resource with multiple containers or Pods by referencing the corresponding resource claim.
Centralized device categorization: device drivers and cluster admins can use device classes to provide app operators with hardware categories that are optimized for various use cases. For example, you can create a cost-optimized device class for general-purpose workloads, and a high-performance device class for critical jobs.
Simplified Pod requests: with DRA, app operators don't need to specify device quantities in Pod resource requests. Instead, the Pod references a resource claim, and the device configuration in that claim applies to the Pod.

These benefits provide significant improvements in the device allocation workflow when compared to device plugins, which require per-container device requests, don't support device sharing, and don't support expression-based device filtering.

Types of DRA users

The workflow of using DRA to allocate devices involves the following types of users:

Device owner: responsible for devices. Device owners might be commercial vendors, the cluster operator, or another entity. To use DRA, devices must have DRA-compatible drivers that do the following:
- Create ResourceSlices that provide Kubernetes with information about nodes and resources.
- Update ResourceSlices when resource capacity in the cluster changes.
- Optionally, create DeviceClasses that workload operators can use to claim devices.
Cluster admin: responsible for configuring clusters and nodes, attaching devices, installing drivers, and similar tasks. To use DRA, cluster admins do the following:
- Attach devices to nodes.
- Install device drivers that support DRA.
- Optionally, create DeviceClasses that workload operators can use to claim devices.
Workload operator: responsible for deploying and managing workloads in the cluster. To use DRA to allocate devices to Pods, workload operators do the following:
- Create ResourceClaims or ResourceClaimTemplates to request specific configurations within DeviceClasses.
- Deploy workloads that use specific ResourceClaims or ResourceClaimTemplates.

DRA terminology

DRA uses the following Kubernetes API kinds to provide the core allocation functionality. All of these API kinds are included in the resource.k8s.io API group, at version v1.

DeviceClass: Defines a category of devices that can be claimed and how to select specific device attributes in claims. The DeviceClass parameters can match zero or more devices in ResourceSlices. To claim devices from a DeviceClass, ResourceClaims select specific device attributes.
ResourceClaim: Describes a request for access to attached resources, such as devices, in the cluster. ResourceClaims provide Pods with access to a specific resource. ResourceClaims can be created by workload operators or generated by Kubernetes based on a ResourceClaimTemplate.
ResourceClaimTemplate: Defines a template that Kubernetes uses to create per-Pod ResourceClaims for a workload. ResourceClaimTemplates provide Pods with access to separate, similar resources. Each ResourceClaim that Kubernetes generates from the template is bound to a specific Pod. When the Pod terminates, Kubernetes deletes the corresponding ResourceClaim.
ResourceSlice: Represents one or more resources that are attached to nodes, such as devices. Drivers create and manage ResourceSlices in the cluster. When a ResourceClaim is created and used in a Pod, Kubernetes uses ResourceSlices to find nodes that have access to the claimed resources. Kubernetes allocates resources to the ResourceClaim and schedules the Pod onto a node that can access the resources.

DeviceClass

A DeviceClass lets cluster admins or device drivers define categories of devices in the cluster. DeviceClasses tell operators what devices they can request and how they can request those devices. You can use Common Expression Language (CEL) to select devices based on specific attributes. A ResourceClaim that references the DeviceClass can then request specific configurations within the DeviceClass.

To create a DeviceClass, see Set Up DRA in a Cluster.
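
As a minimal sketch, a DeviceClass that selects every device published by one driver might look like the following. The driver name resource-driver.example.com is a placeholder, and the class name is chosen only so that it matches the deviceClassName used in other examples on this page.

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  # Placeholder name, chosen to match the deviceClassName used in other examples.
  name: resource.example.com
spec:
  selectors:
  - cel:
      # Select all devices that a single (placeholder) driver publishes.
      expression: device.driver == "resource-driver.example.com"
```

ResourceClaims that reference this class can then narrow the selection further with their own selectors.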

ResourceClaims and ResourceClaimTemplates

A ResourceClaim defines the resources that a workload needs. Every ResourceClaim has requests that reference a DeviceClass and select devices from that DeviceClass. ResourceClaims can also use selectors to filter for devices that meet specific requirements, and can use constraints to limit the devices that can satisfy a request. ResourceClaims can be created by workload operators or can be generated by Kubernetes based on a ResourceClaimTemplate. A ResourceClaimTemplate defines a template that Kubernetes can use to auto-generate ResourceClaims for Pods.
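
The following sketch shows how requests, selectors, and constraints can fit together in one ResourceClaim. The claim name, request names, driver name, and attribute names are assumptions for illustration; the attributes that you can use depend on your driver.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: matching-devices-claim   # hypothetical name
spec:
  devices:
    requests:
    # First request: filter the DeviceClass with a CEL selector.
    - name: device-a
      exactly:
        deviceClassName: resource.example.com
        selectors:
        - cel:
            expression: device.attributes["resource-driver.example.com"].size == "large"
    # Second request: any device from the same DeviceClass.
    - name: device-b
      exactly:
        deviceClassName: resource.example.com
    constraints:
    # Both requests must be satisfied by devices that report the same color.
    - requests: ["device-a", "device-b"]
      matchAttribute: resource-driver.example.com/color
```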

Use cases for ResourceClaims and ResourceClaimTemplates

The method that you use depends on your requirements, as follows:

ResourceClaim: you want multiple Pods to share access to specific devices. You manually manage the lifecycle of ResourceClaims that you create.
ResourceClaimTemplate: you want Pods to have independent access to separate, similarly-configured devices. Kubernetes generates ResourceClaims from the specification in the ResourceClaimTemplate. The lifetime of each generated ResourceClaim is bound to the lifetime of the corresponding Pod.
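
As a sketch of how the two methods look in a Pod manifest, the following Pod references one existing ResourceClaim and one ResourceClaimTemplate. The Pod name, image, and the names shared-device-claim and per-pod-claim-template are hypothetical, and both objects are assumed to exist in the Pod's namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-claims   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:latest   # placeholder image
    resources:
      claims:
      # These names refer to entries in spec.resourceClaims below.
      - name: shared-device
      - name: own-device
  resourceClaims:
  # An existing ResourceClaim that other Pods can also reference.
  - name: shared-device
    resourceClaimName: shared-device-claim
  # Kubernetes generates a separate ResourceClaim for this Pod from the template.
  - name: own-device
    resourceClaimTemplateName: per-pod-claim-template
```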

When you define a workload, you can use Common Expression Language (CEL) to filter for specific device attributes or capacity. The available parameters for filtering depend on the device and the drivers.

If you directly reference a specific ResourceClaim in a Pod, that ResourceClaim must already exist in the same namespace as the Pod. If the ResourceClaim doesn't exist in the namespace, the Pod won't schedule. This behavior is similar to how a PersistentVolumeClaim must exist in the same namespace as a Pod that references it.

You can reference an auto-generated ResourceClaim in a Pod, but this isn't recommended because auto-generated ResourceClaims are bound to the lifetime of the Pod that triggered the generation.

To learn how to claim resources using one of these methods, see Allocate Devices to Workloads with DRA.

Prioritized list

FEATURE STATE: Kubernetes v1.34 [beta](enabled by default)

You can provide a prioritized list of subrequests for requests in a ResourceClaim or ResourceClaimTemplate. The scheduler will then select the first subrequest that can be allocated. This allows users to specify alternative devices that can be used by the workload if the primary choice is not available.

In the example below, the ResourceClaimTemplate requests a device with the color black and the size large. Without a prioritized list, the Pod cannot be scheduled if no device with those attributes is available. With the prioritized list feature, the template specifies a second alternative that requests two devices with the color white and the size small. The large black device is allocated if it is available. If it is not, but two small white devices are available, the Pod can still run.

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: prioritized-list-claim-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        firstAvailable:
        - name: large-black
          deviceClassName: resource.example.com
          selectors:
          - cel:
              expression: |-
                device.attributes["resource-driver.example.com"].color == "black" &&
                device.attributes["resource-driver.example.com"].size == "large"
        - name: small-white
          deviceClassName: resource.example.com
          selectors:
          - cel:
              expression: |-
                device.attributes["resource-driver.example.com"].color == "white" &&
                device.attributes["resource-driver.example.com"].size == "small"
          count: 2

If the pod is eligible for multiple nodes in the cluster, the scheduler will use the index of chosen subrequests from any prioritized lists as one of the inputs when it scores each node. So nodes that can allocate devices requested in a higher ranked subrequest are more likely to be chosen than nodes that can only allocate devices for lower ranked subrequests.

The decision is made on a per-Pod basis, so if the Pod is a member of a ReplicaSet or similar grouping, you cannot rely on all the members of the group having the same subrequest chosen. Your workload must be able to accommodate this.

Prioritized lists is a beta feature and is enabled by default with the DRAPrioritizedList feature gate in the kube-apiserver and kube-scheduler.

ResourceSlice

Each ResourceSlice represents one or more devices in a pool. The pool is managed by a device driver, which creates and manages ResourceSlices. The resources in a pool might be represented by a single ResourceSlice or span multiple ResourceSlices.

ResourceSlices provide useful information to device users and to the scheduler, and are crucial for dynamic resource allocation. Every ResourceSlice must include the following information:

Resource pool: a group of one or more resources that the driver manages. The pool can span more than one ResourceSlice. Changes to the resources in a pool must be propagated across all of the ResourceSlices in that pool. The device driver that manages the pool is responsible for ensuring that this propagation happens.
Devices: devices in the managed pool. A ResourceSlice can list every device in a pool or a subset of the devices in a pool. The ResourceSlice defines device information like attributes, versions, and capacity. Device users can select devices for allocation by filtering for device information in ResourceClaims or in DeviceClasses.
Nodes: the nodes that can access the resources. Drivers can choose which nodes can access the resources, whether that's all of the nodes in the cluster, a single named node, or nodes that have specific node labels.

Drivers use a controller to reconcile ResourceSlices in the cluster with the information that the driver has to publish. This controller overwrites any manual changes, such as cluster users creating or modifying ResourceSlices.

Consider the following example ResourceSlice:

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: cat-slice
spec:
  driver: "resource-driver.example.com"
  pool:
    generation: 1
    name: "black-cat-pool"
    resourceSliceCount: 1
  # The allNodes field defines whether any node in the cluster can access the device.
  allNodes: true
  devices:
  - name: "large-black-cat"
    attributes:
      color:
        string: "black"
      size:
        string: "large"
      cat:
        bool: true

This ResourceSlice is managed by the resource-driver.example.com driver in the black-cat-pool pool. The allNodes: true field indicates that any node in the cluster can access the devices. There's one device in the ResourceSlice, named large-black-cat, with the following attributes:

color: black
size: large
cat: true

A DeviceClass could select the device in this ResourceSlice by using these attributes, and a ResourceClaim could filter for specific devices in that DeviceClass.
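
For illustration, a DeviceClass and a ResourceClaim that would match the large-black-cat device from the example ResourceSlice might look like the following sketch. The names cats.example.com and large-black-cat-claim are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: cats.example.com   # hypothetical name
spec:
  selectors:
  - cel:
      # Match any device from the example driver that has cat == true.
      expression: |-
        device.driver == "resource-driver.example.com" &&
        device.attributes["resource-driver.example.com"].cat == true
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: large-black-cat-claim   # hypothetical name
spec:
  devices:
    requests:
    - name: req-0
      exactly:
        deviceClassName: cats.example.com
        selectors:
        - cel:
            # Narrow the class down to large black devices.
            expression: |-
              device.attributes["resource-driver.example.com"].color == "black" &&
              device.attributes["resource-driver.example.com"].size == "large"
```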

How resource allocation with DRA works

The following sections describe the workflow for the various types of DRA users and for the Kubernetes system during dynamic resource allocation.

Workflow for users

Driver creation: device owners or third-party entities create drivers that can create and manage ResourceSlices in the cluster. These drivers optionally also create DeviceClasses that define a category of devices and how to request them.
Cluster configuration: cluster admins create clusters, attach devices to nodes, and install the DRA device drivers. Cluster admins optionally create DeviceClasses that define categories of devices and how to request them.
Resource claims: workload operators create ResourceClaimTemplates or ResourceClaims that request specific device configurations within a DeviceClass. In the same step, workload operators modify their Kubernetes manifests to request those ResourceClaimTemplates or ResourceClaims.

Workflow for Kubernetes

ResourceSlice creation: drivers in the cluster create ResourceSlices that represent one or more devices in a managed pool of similar devices.
Workload creation: the cluster control plane checks new workloads for references to ResourceClaimTemplates or to specific ResourceClaims.
- If the workload uses a ResourceClaimTemplate, a controller named the resourceclaim-controller generates ResourceClaims for every Pod in the workload.
- If the workload uses a specific ResourceClaim, Kubernetes checks whether that ResourceClaim exists in the cluster. If the ResourceClaim doesn't exist, the Pods won't deploy.
ResourceSlice filtering: for every Pod, Kubernetes checks the ResourceSlices in the cluster to find a device that satisfies all of the following criteria:
- The nodes that can access the resources are eligible to run the Pod.
- The ResourceSlice has unallocated resources that match the requirements of the Pod's ResourceClaim.
Resource allocation: after finding an eligible ResourceSlice for a Pod's ResourceClaim, the Kubernetes scheduler updates the ResourceClaim with the allocation details.
Pod scheduling: when resource allocation is complete, the scheduler places the Pod on a node that can access the allocated resource. The device driver and the kubelet on that node configure the device and the Pod's access to the device.
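
As a rough sketch of the result of the allocation step, an allocated ResourceClaim might carry a status similar to the following. The output is abbreviated, the claim and Pod names are hypothetical, and the driver, pool, and device names reuse the example ResourceSlice from this page.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: large-black-cat-claim   # hypothetical name
status:
  allocation:
    devices:
      results:
      # One entry per allocated device, identified by driver, pool, and device name.
      - request: req-0
        driver: resource-driver.example.com
        pool: black-cat-pool
        device: large-black-cat
  # The Pods that are currently allowed to use the claim.
  reservedFor:
  - resource: pods
    name: pod-with-cats
    uid: "<UID of the Pod>"   # placeholder, set by Kubernetes
```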

Observability of dynamic resources

You can check the status of dynamically allocated resources by using any of the following methods:

kubelet device metrics
ResourceClaim device status
Device health monitoring

kubelet device metrics

The PodResourcesLister kubelet gRPC service lets you monitor in-use devices. The DynamicResource message provides information that's specific to dynamic resource allocation, such as the device name and the claim name. For details, see Monitoring device plugin resources.

ResourceClaim device status

FEATURE STATE: Kubernetes v1.33 [beta](enabled by default)

DRA drivers can report driver-specific device status data for each allocated device in the status.devices field of a ResourceClaim. For example, the driver might list the IP addresses that are assigned to a network interface device.

The accuracy of the information that a driver adds to a ResourceClaim status.devices field depends on the driver. Evaluate drivers to decide whether you can rely on this field as the only source of device information.

If you disable the DRAResourceClaimDeviceStatus feature gate, the status.devices field is cleared automatically when the ResourceClaim is stored. ResourceClaim device status is supported when a DRA driver is able to update an existing ResourceClaim in which the status.devices field is set.

For details about the status.devices field, see the ResourceClaim API reference.
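
As a sketch, the status that a driver reports for a network interface device might look like the following. All values are illustrative assumptions; the data that actually appears depends on the driver.

```yaml
# Excerpt from a ResourceClaim; all values are illustrative.
status:
  devices:
  - driver: dra.example.com
    pool: worker-1
    device: eth1
    networkData:
      interfaceName: net1
      # Addresses assigned to the interface, in CIDR notation.
      ips:
      - 10.9.8.1/24
      hardwareAddress: "ea:9f:cb:40:b1:7b"
```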

Device Health Monitoring

FEATURE STATE: Kubernetes v1.31 [alpha](disabled by default)

As an alpha feature, Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources. For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy. It is also helpful to find out if the device recovers.

To enable this functionality, the ResourceHealthStatus feature gate must be enabled, and the DRA driver must implement the DRAResourceHealth gRPC service.

When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the allocatedResourcesStatus field in the status of each container, detailing the health of each device assigned to that container.

This provides crucial visibility for users and controllers to react to hardware failures. For a Pod that is failing, you can inspect this status to determine if the failure was related to an unhealthy device.
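
The following excerpt sketches what the reported health might look like in a Pod's status. The container name, claim name, and request name are hypothetical, and the resource identifier is a placeholder because its format depends on the driver.

```yaml
# Excerpt from a Pod; names and identifiers are placeholders.
status:
  containerStatuses:
  - name: app
    allocatedResourcesStatus:
    # For DRA resources, the name has the form claim:<claim name>/<request name>.
    - name: "claim:my-claim/req-0"
      resources:
      - resourceID: "<driver-specific device identifier>"
        # One of Healthy, Unhealthy, or Unknown.
        health: Unhealthy
```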

Pre-scheduled Pods

When you, or another API client, create a Pod with spec.nodeName already set, the scheduler gets bypassed. If a ResourceClaim needed by that Pod does not exist yet, is not allocated, or is not reserved for the Pod, then the kubelet fails to run the Pod and re-checks periodically, because those requirements might still be fulfilled later.

Such a situation can also arise when support for dynamic resource allocation was not enabled in the scheduler at the time when the Pod got scheduled (version skew, configuration, feature gate, etc.). kube-controller-manager detects this and tries to make the Pod runnable by reserving the required ResourceClaims. However, this only works if those were allocated by the scheduler for some other pod.

It is better to avoid bypassing the scheduler because a Pod that is assigned to a node blocks normal resources (RAM, CPU) that then cannot be used for other Pods while the Pod is stuck. To make a Pod run on a specific node while still going through the normal scheduling flow, create the Pod with a node selector that exactly matches the desired node:

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  nodeSelector:
    kubernetes.io/hostname: name-of-the-intended-node
  ...

You may also be able to mutate the incoming Pod, at admission time, to unset the .spec.nodeName field and to use a node selector instead.

DRA beta features

The following sections describe DRA features that are available in the Beta feature stage. For more information, see Set up DRA in the cluster.

Admin access

FEATURE STATE: Kubernetes v1.34 [beta](enabled by default)

You can mark a request in a ResourceClaim or ResourceClaimTemplate as having privileged features for maintenance and troubleshooting tasks. A request with admin access grants access to in-use devices and may enable additional permissions when making the device available in a container:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        exactly:
          deviceClassName: resource.example.com
          allocationMode: All
          adminAccess: true

If this feature is disabled, the adminAccess field will be removed automatically when creating such a ResourceClaim.

Admin access is a privileged mode and should not be granted to regular users in multi-tenant clusters. Starting with Kubernetes v1.33, only users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with resource.k8s.io/admin-access: "true" (case-sensitive) can use the adminAccess field. This ensures that non-admin users cannot misuse the feature. Starting with Kubernetes v1.34, this label has been updated to resource.kubernetes.io/admin-access: "true".
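
For example, a namespace that permits the adminAccess field in Kubernetes v1.34 or later could be labeled as in the following sketch. The namespace name is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dra-admin   # hypothetical name
  labels:
    # Label used from Kubernetes v1.34 onward; the value is case-sensitive.
    resource.kubernetes.io/admin-access: "true"
```

Restrict permission to create ResourceClaims and ResourceClaimTemplates in such a namespace to trusted users.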

DRA alpha features

The following sections describe DRA features that are available in the Alpha feature stage. They depend on enabling feature gates and may depend on additional API groups. For more information, see Set up DRA in the cluster.

Extended resource allocation by DRA

FEATURE STATE: Kubernetes v1.34 [alpha](disabled by default)

You can provide an extended resource name for a DeviceClass. The scheduler then selects devices that match the class to satisfy requests for that extended resource. This allows users to continue using extended resource requests in a Pod to request either extended resources provided by a device plugin or DRA devices. On a single cluster node, a given extended resource can be provided either by a device plugin or by DRA. Within the same cluster, the same extended resource can be provided by a device plugin on some nodes and by DRA on other nodes.

In the example below, the DeviceClass is given the extendedResourceName example.com/gpu. If a Pod requests the extended resource example.com/gpu: 2, it can be scheduled to a node with two or more devices that match the DeviceClass.

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.example.com' && device.attributes['gpu.example.com'].type
        == 'gpu'
  extendedResourceName: example.com/gpu

In addition, users can use a special extended resource to allocate devices without having to explicitly create a ResourceClaim. The name of this extended resource consists of the prefix deviceclass.resource.kubernetes.io/ followed by the DeviceClass name. This works for any DeviceClass, even if it does not specify an extended resource name. The resulting ResourceClaim will contain a request for an ExactCount of the specified number of devices of that DeviceClass.
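
As a sketch, the following two Pods request devices from the gpu.example.com DeviceClass through extended resources. The first uses the extended resource name that the DeviceClass defines, and the second uses the special prefix with the DeviceClass name. The Pod names and the image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-extended-resource   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:latest   # placeholder image
    resources:
      limits:
        # Name taken from extendedResourceName in the DeviceClass.
        example.com/gpu: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-deviceclass-resource   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:latest   # placeholder image
    resources:
      limits:
        # Special prefix followed by the DeviceClass name.
        deviceclass.resource.kubernetes.io/gpu.example.com: 1
```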

Extended resource allocation by DRA is an alpha feature and only enabled when the DRAExtendedResource feature gate is enabled in the kube-apiserver, kube-scheduler, and kubelet.

Partitionable devices

FEATURE STATE: Kubernetes v1.33 [alpha](disabled by default)

Devices represented in DRA don't necessarily have to be a single unit connected to a single machine, but can also be a logical device comprised of multiple devices connected to multiple machines. These devices might consume overlapping resources of the underlying physical devices, meaning that when one logical device is allocated, other devices are no longer available.

In the ResourceSlice API, this is represented as a list of named CounterSets, each of which contains a set of named counters. The counters represent the resources available on the physical device that are used by the logical devices advertised through DRA.

Logical devices can specify the ConsumesCounters list. Each entry contains a reference to a CounterSet and a set of named counters with the amounts they will consume. So for a device to be allocatable, the referenced counter sets must have sufficient quantity for the counters referenced by the device.

CounterSets must be specified in separate ResourceSlices from devices. Devices can consume counters from any CounterSet defined in the same resource pool as the device.

Here is an example of two devices, each consuming 6Gi of memory from a shared counter with 8Gi of memory. Thus, only one of the devices can be allocated at any point in time. The scheduler handles this, and it is transparent to the consumer because the ResourceClaim API is not affected.

kind: ResourceSlice
apiVersion: resource.k8s.io/v1
metadata:
  name: resourceslice-with-countersets
spec:
  nodeName: worker-1
  pool:
    name: pool
    generation: 1
    resourceSliceCount: 2
  driver: dra.example.com
  sharedCounters:
  - name: gpu-1-counters
    counters:
      memory:
        value: 8Gi
---
kind: ResourceSlice
apiVersion: resource.k8s.io/v1
metadata:
  name: resourceslice-with-devices
spec:
  nodeName: worker-1
  pool:
    name: pool
    generation: 1
    resourceSliceCount: 2
  driver: dra.example.com
  devices:
  - name: device-1
    consumesCounters:
    - counterSet: gpu-1-counters
      counters:
        memory:
          value: 6Gi
  - name: device-2
    consumesCounters:
    - counterSet: gpu-1-counters
      counters:
        memory:
          value: 6Gi

Partitionable devices is an alpha feature and only enabled when the DRAPartitionableDevices feature gate is enabled in the kube-apiserver and kube-scheduler.

Consumable capacity

FEATURE STATE: Kubernetes v1.34 [alpha](disabled by default)

The consumable capacity feature allows the same devices to be consumed by multiple independent ResourceClaims, with the Kubernetes scheduler managing how much of the device's capacity is used up by each claim. This is analogous to how Pods can share the resources on a Node; ResourceClaims can share the resources on a Device.

The device driver can set the allowMultipleAllocations field in .spec.devices of a ResourceSlice to allow allocating that device to multiple independent ResourceClaims or to multiple requests within a ResourceClaim.

Users can set the capacity field in .spec.devices.requests of a ResourceClaim to specify the device resource requirements for each allocation.

For a device that allows multiple allocations, the requested capacity is drawn (consumed) from its total capacity, a concept known as consumable capacity. The scheduler then ensures that the aggregate consumed capacity across all claims does not exceed the device’s overall capacity. Furthermore, driver authors can use the requestPolicy constraints on individual device capacities to control how those capacities are consumed. For example, the driver author can specify that a given capacity is only consumed in increments of 1Gi.

Here is an example of a network device which allows multiple allocations and contains a consumable bandwidth capacity.

kind: ResourceSlice
apiVersion: resource.k8s.io/v1
metadata:
  name: resourceslice
spec:
  nodeName: worker-1
  pool:
    name: pool
    generation: 1
    resourceSliceCount: 1
  driver: dra.example.com
  devices:
  - name: eth1
    allowMultipleAllocations: true
    attributes:
      name:
        string: "eth1"
    capacity:
      bandwidth:
        requestPolicy:
          default: "1M"
          validRange:
            min: "1M"
            step: "8"
        value: "10G"

The consumable capacity can be requested as shown in the below example.

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: bandwidth-claim-template
spec:
  spec:
    devices:
      requests:
      - name: req-0
        exactly:
          deviceClassName: resource.example.com
          capacity:
            requests:
              bandwidth: 1G

The allocation result will include the consumed capacity and the identifier of the share.

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
...
status:
  allocation:
    devices:
      results:
      - consumedCapacity:
          bandwidth: 1G
        device: eth1
        shareID: "a671734a-e8e5-11e4-8fde-42010af09327"

In this example, a multiply-allocatable device was chosen. However, any resource.example.com device with at least the requested 1G bandwidth could have met the requirement. If a non-multiply-allocatable device had been chosen, the allocation would have consumed the entire device. To force the use of only multiply-allocatable devices, you can use the CEL criteria device.allowMultipleAllocations == true.
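
As a sketch, the CEL criteria could be added to the bandwidth request as a selector. The template name is hypothetical.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: shared-bandwidth-claim-template   # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: req-0
        exactly:
          deviceClassName: resource.example.com
          selectors:
          - cel:
              # Only consider devices that allow multiple allocations.
              expression: device.allowMultipleAllocations == true
          capacity:
            requests:
              bandwidth: 1G
```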

Device taints and tolerations

FEATURE STATE: Kubernetes v1.33 [alpha](disabled by default)

Device taints are similar to node taints: a taint has a string key, a string value, and an effect. The effect is applied to the ResourceClaim which is using a tainted device and to all Pods referencing that ResourceClaim. The "NoSchedule" effect prevents scheduling those Pods. Tainted devices are ignored when trying to allocate a ResourceClaim because using them would prevent scheduling of Pods.

The "NoExecute" effect implies "NoSchedule" and in addition causes eviction of all Pods which have been scheduled already. This eviction is implemented in the device taint eviction controller in kube-controller-manager by deleting affected Pods.

The "None" effect is ignored by the scheduler and eviction controller. DRA drivers can use it to communicate exceptions to admins or other controllers, such as degraded health of a device. Admins can also use it to do dry-runs of pod eviction in DeviceTaintRules (more on that below).

ResourceClaims can tolerate taints. If a taint is tolerated, its effect does not apply. An empty toleration matches all taints. A toleration can be limited to certain effects and/or match certain key/value pairs. A toleration can check that a certain key exists, regardless of which value it has, or it can check for specific values of a key. For more information on this matching see the node taint concepts.

Eviction can be delayed by tolerating a taint for a certain duration. That delay starts at the time when a taint gets added to a device, which is recorded in a field of the taint.

Taints apply as described above also to ResourceClaims allocating "all" devices on a node. All devices must be untainted or all of their taints must be tolerated. Allocating a device with admin access (described above) is not exempt either. An admin using that mode must explicitly tolerate all taints to access tainted devices.
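
The following sketch shows a ResourceClaim whose request tolerates a taint for a limited time. The claim name, the taint key, and the duration are assumptions for illustration.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: tolerant-claim   # hypothetical name
spec:
  devices:
    requests:
    - name: req-0
      exactly:
        deviceClassName: resource.example.com
        tolerations:
        # Tolerate the taint with this key, whatever its value is.
        - key: dra.example.com/unhealthy
          operator: Exists
          effect: NoExecute
          # Delay eviction by five minutes after the taint gets added.
          tolerationSeconds: 300
```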

Device taints and tolerations is an alpha feature and only enabled when the DRADeviceTaints feature gate is enabled in the kube-apiserver, kube-controller-manager and kube-scheduler. To use DeviceTaintRules, the resource.k8s.io/v1alpha3 API version must be enabled.

You can add taints to devices in the following ways. Taints set by an admin use the DeviceTaintRule API kind.

Taints set by the driver

A DRA driver can add taints to the device information that it publishes in ResourceSlices. Consult the documentation of a DRA driver to learn whether the driver uses taints and what their keys and values are.
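
As a sketch, a driver might publish a tainted device in a ResourceSlice as follows. The slice name, driver, node, device, and taint values are all hypothetical.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: slice-with-tainted-device   # hypothetical name
spec:
  driver: dra.example.com
  nodeName: worker-1
  pool:
    name: worker-1
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    taints:
    # Keep new Pods away from this device without evicting running Pods.
    - key: dra.example.com/unhealthy
      value: Degraded
      effect: NoSchedule
```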

Taints set by an admin

FEATURE STATE: Kubernetes v1.35 [alpha](disabled by default)

An admin or a control plane component can taint devices without having to tell the DRA driver to include taints in its device information in ResourceSlices. They do that by creating DeviceTaintRules. Each DeviceTaintRule adds one taint to devices which match the device selector. Without such a selector, no devices are tainted. This makes it harder to accidentally evict all pods using ResourceClaims when leaving out the selector by mistake.

Devices can be selected by giving the name of a DeviceClass, driver, pool, and/or device. The DeviceClass selects all devices that are selected by the selectors in that DeviceClass. With just the driver name, an admin can taint all devices managed by that driver, for example while doing some kind of maintenance of that driver across the entire cluster. Adding a pool name can limit the taint to a single node, if the driver manages node-local devices.

Finally, adding the device name can select one specific device. The device name and pool name can also be used alone, if desired. For example, drivers for node-local devices are encouraged to use the node name as their pool name. Then tainting with that pool name automatically taints all devices on a node.
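
For illustration, the following sketch taints all devices in one pool of one driver. It assumes that the driver uses the node name worker-1 as its pool name. The rule name, taint key, and taint value are hypothetical. The None effect makes this a dry-run that neither blocks scheduling nor evicts Pods.

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: worker-1-maintenance   # hypothetical name
spec:
  deviceSelector:
    driver: dra.example.com
    # Assumes the driver uses the node name as its pool name.
    pool: worker-1
  taint:
    key: example.com/maintenance
    value: planned
    # Change to NoSchedule or NoExecute to make the taint take effect.
    effect: None
```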

Drivers might use stable names like "gpu-0" that hide which specific device is currently assigned to that name. To support tainting a specific hardware instance, CEL selectors can be used in a DeviceTaintRule to match a vendor-specific unique ID attribute, if the driver supports one for its hardware.

The taint applies as long as the DeviceTaintRule exists. It can be modified and removed at any time. Here is one example of a DeviceTaintRule for a fictional DRA driver:

apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: example
spec:
  # The entire hardware installation for this
  # particular driver is broken.
  # Evict all pods and don't schedule new ones.
  deviceSelector:
    driver: dra.example.com
  taint:
    key: dra.example.com/unhealthy
    value: Broken
    effect: NoExecute

The apiserver automatically tracks when this taint was created, and the eviction controller adds a condition with some information:

kubectl describe devicetaintrules

Name:         example
...
Spec:
  Device Selector:
    Driver:  dra.example.com
  Taint:
    Effect:      NoExecute
    Key:         dra.example.com/unhealthy
    Time Added:  2025-11-05T18:15:37Z
    Value:       Broken
Status:
  Conditions:
    Last Transition Time:  2025-11-05T18:15:37Z
    Message:               1 pod evicted since starting the controller.
    Observed Generation:   1
    Reason:                Completed
    Status:                False
    Type:                  EvictionInProgress
Events:                    <none>

Pods get evicted by deleting them. Usually this happens very quickly, except when a toleration for the taint delays it for a certain period or when a large number of pods need to be evicted. When eviction takes longer, the message provides information about the current status:

2 pods need to be evicted in 2 different namespaces. 1 pod evicted since starting the controller.
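The delay caused by a toleration can be sketched with a ResourceClaim like the one below. It assumes the resource.k8s.io/v1 ResourceClaim layout with tolerations on the device request; the claim name, the DeviceClass name, and the 300-second period are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: tolerant-claim                         # hypothetical name
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: example-device-class  # hypothetical DeviceClass
        tolerations:
        - key: dra.example.com/unhealthy
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300               # assumed period before eviction proceeds
```

While such a toleration is in effect, pods using this claim can still appear in the "need to be evicted" count of the status message.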

The condition can be used to check whether an eviction is currently active:

kubectl wait --for=condition=EvictionInProgress=false DeviceTaintRule/example

Beware of a potential race: the scheduler and the controller can observe the new taint at different times. Pods might then still be scheduled after the controller has concluded that no pods need to be evicted and has set this condition to False. In practice, this race is very unlikely because the status is updated only after an intentional delay of a few seconds.

For effect: None, the message reports the number of affected devices, how many of those are allocated, and how many pods would be evicted if the effect were NoExecute. This can be used to do a dry run before actually triggering eviction:

Create a DeviceTaintRule with the desired selectors and effect: None.

Review the message:

3 published devices selected. 1 allocated device selected.
1 pod would be evicted in 1 namespace if the effect was NoExecute.
This information will not be updated again. Recreate the DeviceTaintRule to trigger an update.

Published devices are those listed in ResourceSlices. Tainting them prevents allocation for new pods. Only allocated devices cause eviction of the pods using them.

Edit the DeviceTaintRule and change the effect to NoExecute.
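A dry-run rule for the first step might look like the following sketch. It reuses the selector and taint of the fictional dra.example.com driver; the rule name is hypothetical.

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: example-dry-run            # hypothetical name
spec:
  deviceSelector:
    driver: dra.example.com
  taint:
    key: dra.example.com/unhealthy
    value: Broken
    # Dry run: the status message reports what NoExecute would affect.
    effect: None
```

After reviewing the message, changing effect to NoExecute in the same object triggers the actual eviction.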

Device Binding Conditions

FEATURE STATE: Kubernetes v1.34 [alpha](disabled by default)

Device Binding Conditions allow the Kubernetes scheduler to delay Pod binding until external resources, such as fabric-attached GPUs or reprogrammable FPGAs, are confirmed to be ready.

This waiting behavior is implemented in the PreBind phase of the scheduling framework. During this phase, the scheduler checks whether all required device conditions are satisfied before proceeding with binding.

This improves scheduling reliability by avoiding premature binding and enables coordination with external device controllers.

To use this feature, device drivers (typically managed by driver owners) must publish the following fields in the Device section of a ResourceSlice. Cluster administrators must enable the DRADeviceBindingConditions and DRAResourceClaimDeviceStatus feature gates for the scheduler to honor these fields.

bindingConditions: A list of condition types that must be set to True in the status.conditions field of the associated ResourceClaim before the Pod can be bound. These typically represent readiness signals such as "DeviceAttached" or "DeviceInitialized".
bindingFailureConditions: A list of condition types that, if set to True in the status.conditions field of the associated ResourceClaim, indicate a failure state. If any of these conditions is True, the scheduler aborts binding and reschedules the Pod.
bindsToNode: If set to true, the scheduler records the selected node name in the status.allocation.nodeSelector field of the ResourceClaim. This does not affect the Pod's spec.nodeSelector. Instead, it sets a node selector inside the ResourceClaim, which external controllers can use to perform node-specific operations such as device attachment or preparation.

All condition types listed in bindingConditions and bindingFailureConditions are evaluated from the status.conditions field of the ResourceClaim. External controllers are responsible for updating these conditions using standard Kubernetes condition semantics (type, status, reason, message, lastTransitionTime).
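For illustration, a condition entry that an external controller might write once a device is ready could look like the following fragment. The condition type belongs to a fictional dra.example.com driver, and the reason, message, and timestamp are hypothetical. The fragment shows only the condition list, not the surrounding ResourceClaim status.

```yaml
conditions:
- type: dra.example.com/is-prepared   # must match an entry in bindingConditions
  status: "True"                      # condition status is a string: "True", "False", or "Unknown"
  reason: DeviceAttached              # hypothetical reason
  message: Device gpu-1 is attached and initialized.
  lastTransitionTime: "2025-11-05T18:15:37Z"
```

A failure would be reported the same way, with a type listed in bindingFailureConditions and a status of "True".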

The scheduler waits up to 600 seconds (default) for all bindingConditions to become True. If the timeout is reached or any bindingFailureConditions are True, the scheduler clears the allocation and reschedules the Pod. This timeout duration is configurable by the user through KubeSchedulerConfiguration.

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu-slice
spec:
  driver: dra.example.com
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: accelerator-type
        operator: In
        values:
        - "high-performance"
  pool:
    name: gpu-pool
    generation: 1
    resourceSliceCount: 1
  devices:
    - name: gpu-1
      attributes:
        vendor:
          string: "example"
        model:
          string: "example-gpu"
      bindsToNode: true
      bindingConditions:
        - dra.example.com/is-prepared
      bindingFailureConditions:
        - dra.example.com/preparing-failed

This example ResourceSlice has the following properties:

The ResourceSlice targets nodes labeled with accelerator-type=high-performance, so that the scheduler uses only a specific set of eligible nodes.
The scheduler selects one node from the selected group (for example, node-3) and sets the status.allocation.nodeSelector field in the ResourceClaim to that node name.
The dra.example.com/is-prepared binding condition indicates that the device gpu-1 must be prepared (the is-prepared condition has a status of True) before binding.
If the gpu-1 device preparation fails (the preparing-failed condition has a status of True), the scheduler aborts binding.
The scheduler waits up to 600 seconds (default) for the device to become ready.
External controllers can use the node selector in the ResourceClaim to perform node-specific setup on the selected node.

The following example sets this timeout to 60 seconds in a KubeSchedulerConfiguration:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: DynamicResources
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: DynamicResourcesArgs
      bindingTimeout: 60s

What's next

Set Up DRA in a Cluster
Allocate devices to workloads using DRA
For more information on the design, see the Dynamic Resource Allocation with Structured Parameters KEP.

Last modified October 22, 2025 at 8:15 AM PST: DRA: device taints update for 1.35 (463d655b78)
