Enabling MIG Function¶
This section describes how to enable NVIDIA MIG function. NVIDIA currently provides two strategies for exposing MIG devices on Kubernetes nodes:
- Single mode : Nodes expose a single type of MIG device on all their GPUs.
- Mixed mode : Nodes expose a mixture of MIG device types on all their GPUs.
Tip
After disabling MIG mode, the physical node needs to be restarted in order to use the whole card mode properly.
For more details, refer to the NVIDIA GPU Card Usage Modes.
Prerequisites¶
- Check the system requirements for the GPU driver installation on the target node: GPU Support Matrix
- Ensure that the cluster nodes have GPUs of the corresponding models (NVIDIA H100, A100, and A30 Tensor Core GPUs). For more information, see the GPU Support Matrix.
- All GPUs on the nodes must belong to the same product line (e.g., A100-SXM-40GB).
Enable GPU MIG single mode¶
-
Enable MIG Single mode through the Operator. Configure the parameters in the installation interface:
-
After the installation is complete, it is necessary to label the corresponding node (the node where the GPU card is inserted) with the partitioning specifications. If this step is not executed, it will default to no partitioning.
Tip
The Single mode can only be partitioned in a single mode. It is recommended to use the default strategy, but you can also customize the partitioning strategy.
UI Configuration:
-
Search for
default-mig-parted-config
in the ConfigMap, enter the details, and find the partitioning specifications corresponding to the GPU card model. -
Find the corresponding node, select Modify Labels, and add nvidia.com/mig.config="all-1g.10gb". If you choose another specification, then partition according to that specification.
CLI Configuration:
-
-
Check the configuration result:
After the setup is complete, you can confirm the deployment of the application and then use GPU MIG resources.
Enable GPU MIG Mixed mode¶
-
Enable MIG Mixed mode through the Operator. Configure the parameters in the installation interface:
- Set DevicePlugin to enable
- Set MIG strategy to mixed
- Enable the enabled parameter under Mig Manager
- Set MigManager Config to the default MIG partitioning strategy default-mig-parted-config , or customize the partitioning strategy configuration file.
-
After the installation is complete, it is necessary to label the corresponding node (the node where the GPU card is inserted) with the partitioning specifications. If this step is not executed, it will default to no partitioning.
Tip
It is recommended to use the default strategy, but you can also customize the partitioning strategy.
UI Configuration:
-
Search for default-mig-parted-config in the ConfigMap, enter the details, and find the partitioning specifications corresponding to the GPU card model.
-
Find the corresponding node, select Modify Labels, and add nvidia.com/mig.config="all-1g.10gb". If you choose another specification, then partition according to that specification.
CLI Configuration:
-
-
Check the configuration result
After the setup is complete, you can confirm the deployment of the application and then use GPU MIG resources.
Custom Partitioning Strategy¶
You can customize the partitioning strategy configuration file, with a maximum of 7 instances per card. This needs to be created before installing the GPU Operator and specified during installation with the ConfigMap name.
-
Create a custom partitioning strategy in the ConfigMap, which needs to be in the same namespace as the GPU operator during deployment. The file name you create cannot be the same as the default default-mig-parted-config. The configuration data can be referenced in the following YAML.
Click to view detailed YAML configuration instructions
The following YAML is an example of a custom configuration named custom-mig-parted-config. The key in the configuration data is as shown in the content of config.yaml below, and you can customize and add other partitioning strategies.
config.yaml## Custom split GI instance configuration version: v1 mig-configs: all-disabled: - devices: all mig-enabled: false # A100-40GB, A800-40GB all-1g.5gb: - devices: all mig-enabled: true mig-devices: "1g.5gb": 7 all-1g.5gb.me: - devices: all mig-enabled: true mig-devices: "1g.5gb+me": 1 all-2g.10gb: - devices: all mig-enabled: true mig-devices: "2g.10gb": 3 all-3g.20gb: - devices: all mig-enabled: true mig-devices: "3g.20gb": 2 all-4g.20gb: - devices: all mig-enabled: true mig-devices: "4g.20gb": 1 all-7g.40gb: - devices: all mig-enabled: true mig-devices: "7g.40gb": 1 # H100-80GB, H800-80GB, A100-80GB, A800-80GB, A100-40GB, A800-40GB all-1g.10gb: # H100-80GB, H800-80GB, A100-80GB, A800-80GB - device-filter: ["0x233010DE", "0x233110DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"] devices: all mig-enabled: true mig-devices: "1g.10gb": 7 # A100-40GB, A800-40GB - device-filter: ["0x20B010DE", "0x20B110DE", "0x20F110DE", "0x20F610DE"] devices: all mig-enabled: true mig-devices: "1g.10gb": 4 # H100-80GB, H800-80GB, A100-80GB, A800-80GB all-1g.10gb.me: - devices: all mig-enabled: true mig-devices: "1g.10gb+me": 1 # H100-80GB, H800-80GB, A100-80GB, A800-80GB all-1g.20gb: - devices: all mig-enabled: true mig-devices: "1g.20gb": 4 all-2g.20gb: - devices: all mig-enabled: true mig-devices: "2g.20gb": 3 all-3g.40gb: - devices: all mig-enabled: true mig-devices: "3g.40gb": 2 all-4g.40gb: - devices: all mig-enabled: true mig-devices: "4g.40gb": 1 all-7g.80gb: - devices: all mig-enabled: true mig-devices: "7g.80gb": 1 # A30-24GB all-1g.6gb: - devices: all mig-enabled: true mig-devices: "1g.6gb": 4 all-1g.6gb.me: - devices: all mig-enabled: true mig-devices: "1g.6gb+me": 1 all-2g.12gb: - devices: all mig-enabled: true mig-devices: "2g.12gb": 2 all-2g.12gb.me: - devices: all mig-enabled: true mig-devices: "2g.12gb+me": 1 all-4g.24gb: - devices: all mig-enabled: true mig-devices: "4g.24gb": 1 # H100 NVL, H800 NVL all-1g.12gb: - devices: all mig-enabled: true mig-devices: "1g.12gb": 7 all-1g.12gb.me: - devices: all mig-enabled: true mig-devices: "1g.12gb+me": 1 all-2g.24gb: - devices: all mig-enabled: true mig-devices: "2g.24gb": 3 all-3g.47gb: - devices: all mig-enabled: true mig-devices: "3g.47gb": 2 all-4g.47gb: - devices: all mig-enabled: true mig-devices: "4g.47gb": 1 all-7g.94gb: - devices: all mig-enabled: true mig-devices: "7g.94gb": 1 # H100-96GB, PG506-96GB all-3g.48gb: - devices: all mig-enabled: true mig-devices: "3g.48gb": 2 all-4g.48gb: - devices: all mig-enabled: true mig-devices: "4g.48gb": 1 all-7g.96gb: - devices: all mig-enabled: true mig-devices: "7g.96gb": 1 # H100-96GB, H100 NVL, H800 NVL, H100-80GB, H800-80GB, A800-40GB, A800-80GB, A100-40GB, A100-80GB, A30-24GB, PG506-96GB all-balanced: # H100 NVL, H800 NVL - device-filter: ["0x232110DE", "0x233A10DE"] devices: all mig-enabled: true mig-devices: "1g.12gb": 1 "2g.24gb": 1 "3g.47gb": 1 # H100-80GB, H800-80GB, A100-80GB, A800-80GB - device-filter: ["0x233010DE", "0x233110DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"] devices: all mig-enabled: true mig-devices: "1g.10gb": 2 "2g.20gb": 1 "3g.40gb": 1 # A100-40GB, A800-40GB - device-filter: ["0x20B010DE", "0x20B110DE", "0x20F110DE", "0x20F610DE"] devices: all mig-enabled: true mig-devices: "1g.5gb": 2 "2g.10gb": 1 "3g.20gb": 1 # A30-24GB - device-filter: "0x20B710DE" devices: all mig-enabled: true mig-devices: "1g.6gb": 2 "2g.12gb": 1 # H100-96GB, PG506-96GB - device-filter: ["0x233D10DE", "0x20B610DE"] devices: all mig-enabled: true mig-devices: "1g.12gb": 2 "2g.24gb": 1 "3g.48gb": 1 # After setting, the CI instance will be partitioned according to the set specifications custom-config: - devices: all mig-enabled: true mig-devices: "1g.10gb": 4 "1g.20gb": 2
Set custom-config in the above YAML, and after setting, the CI instance will be partitioned according to the specification.
-
Specify this ConfigMap during the installation of the GPU Operator.