Knowledge Check
The following questions validate your technical proficiency in monitoring, troubleshooting, and optimizing AI infrastructure across four domains: installation (deployment with Base Command Manager (BCM), Kubernetes initialization, and DOCA services); administration (tools such as Slurm, Kubernetes, and Run:ai, plus GPU configuration, including MIG); workload management (inference/training deployment and NGC container use); and troubleshooting (Fabric Manager issues, DCGM diagnostics, and storage bottlenecks). This resource helps MLOps engineers and system architects prepare for the NVIDIA-Certified Professional: AI Operations (NCP-AIO) exam.
1. Which web interface allows administrators to monitor cluster performance, resource utilization, and node health in real time using widgets and resource trees?
2. Where can features available in the NVIDIA Mission Control toolkit be managed within the Base View interface?
3. When installing a workload manager after the initial BCM setup, which utility provides a Text User Interface (TUI) to guide the administrator through the configuration?
4. To synchronize software images across nodes or update a running node with a dry-run option, which command is used within the device mode of cmsh?
5. Which BCM concept allows an administrator to organize compute nodes into categories to inherit specific software images and roles?
6. Which specific BCM network object is used for booting non-head nodes and for all cluster management communication?
7. Which tool is required to create a specialized software image for BlueField DPUs and provision them to boot over the network?
8. When diagnosing cluster issues, which utility is run on the head node to gather comprehensive diagnostic data for BCM support?
9. How can an administrator configure Kubernetes on NVIDIA hosts using BCM to ensure users have dedicated 'restricted' namespaces?
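As background for this question: independent of BCM's own setup wizard, the effect of a per-user "restricted" namespace can be sketched with plain `kubectl` using the Kubernetes Pod Security Admission labels. The namespace and user names below are placeholders, not values from BCM.

```shell
# Sketch only: create a per-user namespace and enforce the Kubernetes
# "restricted" Pod Security Standard on it ("user-alice" is a placeholder)
kubectl create namespace user-alice
kubectl label namespace user-alice \
  pod-security.kubernetes.io/enforce=restricted
# Grant the user edit rights only inside their own namespace
kubectl create rolebinding alice-edit \
  --clusterrole=edit --user=alice --namespace=user-alice
```

BCM automates the equivalent of these steps when Kubernetes is deployed with per-user namespace isolation enabled.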
10. What is the primary function of the 'Mission Control' toolkit mentioned in BCM documentation?
11. To apply firmware updates to a DGX H100 system, where should the '.fwpkg' packages be placed on the head node?
12. In BCM, which command allows an administrator to interactively request that a new node be assigned a specific hostname and category based on its MAC address?
13. When managing job scheduling, which Slurm concept allows administrators to group nodes under a unique name so that features can be assigned to them collectively in 'slurm.conf'?
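For reference, grouping nodes under one definition and assigning features to all of them collectively in `slurm.conf` looks roughly like this (hostnames, GRES, and feature names are illustrative):

```
# Illustrative slurm.conf fragment: a node range grouped under one definition,
# with features assigned to every member at once
NodeName=dgx[01-04] Gres=gpu:a100:8 Features=a100,nvlink CPUs=128 RealMemory=1000000
PartitionName=defq Nodes=dgx[01-04] Default=YES MaxTime=24:00:00 State=UP
```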
14. Which DPU operation mode allows the embedded ARM system of the DPU to control the NIC resources and data path of both the host and the DPU?
15. To diagnose a node that fails to start the network during the ramdisk stage, which action should the administrator perform?
16. What is the function of the 'hasclientdaemon' parameter when configuring a Cumulus switch in BCM?
17. Which utility allows a user to submit a Slurm job directly from a Jupyter Notebook cell?
18. How does BCM handle the administration of user accounts to ensure secure access across the cluster?
19. When using BCM to monitor GPU utilization at the cluster level, which data producer is responsible for aggregating these metrics?
20. Which command in cmsh allows an administrator to see the results of a health check query across multiple nodes, grouped by category?
21. What is the purpose of the 'cm-chroot-sw-img' wrapper utility?
22. Which tool is used to generate a Slurm network topology configuration (topology.conf) automatically based on cloud or physical network hierarchy?
23. Which command must be run to make committed DPU settings active on the DPU hardware after they are modified in cmsh?
24. What does the 'drain' command do in the context of workload management in BCM?
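As a refresher on draining in Slurm-managed clusters (BCM exposes a comparable action through its own interfaces), the sketch below uses `scontrol`; `node001` is a placeholder hostname.

```shell
# Stop scheduling new jobs on node001 while letting running jobs finish
scontrol update NodeName=node001 State=DRAIN Reason="planned maintenance"
# Check the node's current state
sinfo -n node001 -o "%N %T"
# Return the node to service afterwards
scontrol update NodeName=node001 State=RESUME
```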
25. Which monitoring feature allows BCM to scale a cluster up or down automatically by powering nodes on and off based on job demand?
26. When setting up Kubernetes with BCM, which operator is recommended for managing the lifecycle of Jupyter kernels?
27. What is the purpose of the 'FrozenFile' directive in BCM?
28. In BCM, what is a 'chunk' in the context of PBS job scripts?
29. Which command-line tool is used to manage and update the BCM Ansible collection for NVIDIA Base Command Manager 11?
30. What is the maximum number of GPU instances that an NVIDIA A100 GPU can be partitioned into using Multi-Instance GPU (MIG) technology?
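As a refresher on MIG administration, the `nvidia-smi` invocations below sketch enabling MIG and partitioning an A100 into its smallest instances. Profile IDs vary by GPU model and driver version; 19 corresponds to the `1g.5gb` profile on many A100 boards, so treat it as illustrative.

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset; sketch only)
sudo nvidia-smi -i 0 -mig 1
# List the GPU instance profiles the driver offers on this GPU
sudo nvidia-smi mig -lgip
# Create seven 1g.5gb GPU instances, each with its default compute instance
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
# Verify the resulting GPU instances
nvidia-smi mig -lgi
```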
31. Which BCM monitoring tool provides the capability to view historical resource consumption metrics aggregated by user or account?
32. When a node is in the 'INSTALLER_FAILED' state, where can the administrator usually find a log explaining the failure?
33. Which Slurm command allows a user to allocate resources in real-time and spawn a shell to execute parallel tasks?
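As a reminder of how interactive allocation works in Slurm, a minimal sketch (the resource sizes are illustrative):

```shell
# Allocate resources and spawn a shell inside the allocation
salloc --nodes=2 --ntasks-per-node=4 --gres=gpu:1 --time=00:30:00
# From within the allocation, launch parallel tasks across the allocated nodes
srun hostname
# Release the allocation when done
exit
```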
34. To ensure that a passive head node can take over from the active head in an HA setup, which command is used to initially clone the active head node?
35. What is the primary benefit of using 'Offloaded Monitoring' in clusters with more than 1,000 nodes?
36. In the Jupyter integration, which tool provides an interactive way to create and customize kernels without editing JSON files?
37. Which BCM component is responsible for identifying and placing the software image on a regular node during the boot process?
38. To configure a specific node to use a different BIOS setting via Redfish, which tool should the administrator use?
39. Which command is used to monitor network performance between cluster nodes using the Python script provided by BCM?
40. What is the recommended method for an administrator to get support for a cluster that has had its BCM package versions replaced with standard distribution versions?
References
Preparing for the NVIDIA-Certified Professional: AI Operations (NCP-AIO) requires an understanding of cluster management, workload orchestration, and GPU optimization. Below is a curated list of official documentation and resources to help you master the exam domains.
- Base Command Manager: Study the “Installation Guide” and “Administrator Guide” to understand cluster provisioning and node health.
- DCGM (Data Center GPU Manager): Study how to monitor cluster health, run diagnostics, and profile performance.
- Multi-Instance GPU (MIG): You must know how to configure and manage MIG profiles for various workloads.
- Fabric Manager: Review troubleshooting procedures for the fabric manager service.
- Slurm: Focus on job scheduling, resource allocation (GRES), and partition management.
- NVIDIA GPUDirect RDMA: Understand components used to accelerate data movement across networks.
- Run:ai: Learn to manage GPU fractions, compute quotas, and project-based resource allocation for multi-tenant teams.
- Kubernetes and NVIDIA GPUs: Focus on the NVIDIA GPU Operator for automating driver deployment and managing GPU resources in containerized environments.
- NGC: Understand how to pull optimized containers, models, and Helm charts while using the NGC CLI for registry management.
- Magnum IO: Study the architecture for GPUDirect Storage (GDS) and how it bypasses the CPU to accelerate data movement.
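To make the DCGM bullet above concrete, two commonly used `dcgmi` invocations are sketched below; they assume a node with DCGM installed and the host engine running.

```shell
# Run a medium-length DCGM diagnostic (level 2 of 1-4) across all GPUs
dcgmi diag -r 2
# Watch GPU utilization (field 203) and framebuffer usage (field 252)
# for five samples
dcgmi dmon -e 203,252 -c 5
```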