AI Operations

Knowledge Check

The following questions test your technical proficiency in monitoring, troubleshooting, and optimizing AI infrastructure. They cover installation using BCM, Kubernetes initialization, and DOCA services; administration with tools such as Slurm, Kubernetes, and Run:ai, including GPU configuration with MIG; workload management, covering inference/training deployment and NGC container use; and troubleshooting, from Fabric Manager issues and DCGM diagnostics to storage bottlenecks. This resource helps MLOps engineers and system architects prepare for the NVIDIA-Certified Professional: AI Operations (NCP-AIO) exam.


1. Which web interface allows administrators to monitor cluster performance, resource utilization, and node health in real time using widgets and resource trees?
2. Where can features available in the NVIDIA Mission Control toolkit be managed within the Base View interface?
3. When installing a workload manager after the initial BCM setup, which utility provides a Text User Interface (TUI) to guide the administrator through the configuration?
4. To synchronize software images across nodes or update a running node with a dry-run option, which command is used within the device mode of cmsh?
5. Which BCM concept allows an administrator to organize compute nodes into categories to inherit specific software images and roles?
6. Which specific BCM network object is used for booting non-head nodes and for all cluster management communication?
7. Which tool is required to create a specialized software image for BlueField DPUs and provision them to boot over the network?
8. When diagnosing cluster issues, which utility is run on the head node to gather comprehensive diagnostic data for BCM support?
9. How can an administrator configure Kubernetes on NVIDIA hosts using BCM to ensure users have dedicated 'restricted' namespaces?
10. What is the primary function of the 'Mission Control' toolkit mentioned in BCM documentation?
11. To apply firmware updates to a DGX H100 system, where should the '.fwpkg' packages be placed on the head node?
12. In BCM, which command allows an administrator to interactively request that a new node be assigned a specific hostname and category based on its MAC address?
13. When managing job scheduling, which Slurm concept allows administrators to group nodes under a unique name so that features can be assigned to them collectively in 'slurm.conf'?
14. Which DPU operation mode allows the embedded ARM system of the DPU to control the NIC resources and data path of both the host and the DPU?
15. To diagnose a node that fails to start the network during the ramdisk stage, which action should the administrator perform?
16. What is the function of the 'hasclientdaemon' parameter when configuring a Cumulus switch in BCM?
17. Which utility allows a user to submit a Slurm job directly from a Jupyter Notebook cell?
18. How does BCM handle the administration of user accounts to ensure secure access across the cluster?
19. When using BCM to monitor GPU utilization at the cluster level, which data producer is responsible for aggregating these metrics?
20. Which command in cmsh allows an administrator to see the results of a health check query across multiple nodes, grouped by category?
21. What is the purpose of the 'cm-chroot-sw-img' wrapper utility?
22. Which tool is used to generate a Slurm network topology configuration (topology.conf) automatically based on cloud or physical network hierarchy?
23. Which command must be run to make committed DPU settings active on the DPU hardware after they are modified in cmsh?
24. What does the 'drain' command do in the context of workload management in BCM?
25. Which monitoring feature allows BCM to scale a cluster up or down automatically by powering nodes on and off based on job demand?
26. When setting up Kubernetes with BCM, which operator is recommended for managing the lifecycle of Jupyter kernels?
27. What is the purpose of the 'FrozenFile' directive in BCM?
28. In BCM, what is a 'chunk' in the context of PBS job scripts?
29. Which command-line tool is used to manage and update the BCM Ansible collection for NVIDIA Base Command Manager 11?
30. What is the maximum number of GPU instances that an NVIDIA A100 GPU can be partitioned into using Multi-Instance GPU (MIG) technology?
31. Which BCM monitoring tool provides the capability to view historical resource consumption metrics aggregated by user or account?
32. When a node is in the 'INSTALLER_FAILED' state, where can the administrator usually find a log explaining the failure?
33. Which Slurm command allows a user to allocate resources in real-time and spawn a shell to execute parallel tasks?
34. To ensure that a passive head node can take over from the active head in an HA setup, which command is used to initially clone the active head node?
35. What is the primary benefit of using 'Offloaded Monitoring' in clusters with more than 1,000 nodes?
36. In the Jupyter integration, which tool provides an interactive way to create and customize kernels without editing JSON files?
37. Which BCM component is responsible for identifying and placing the software image on a regular node during the boot process?
38. To configure a specific node to use a different BIOS setting via Redfish, which tool should the administrator use?
39. Which command is used to monitor network performance between cluster nodes using a handy Python script provided by BCM?
40. What is the recommended method for an administrator to get support for a cluster that has had its BCM package versions replaced with standard distribution versions?
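Several of the questions above touch on Slurm job submission and GRES-based GPU allocation (for example, questions 13, 22, and 33). The sketch below generates a minimal batch script that requests one GPU; the job name, partition name, and script path are illustrative assumptions, not exam answers:

```shell
# Minimal sketch: write out a Slurm batch script requesting one GPU via GRES.
# All names (job, partition, train.py) are hypothetical.
cat > train_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mig-train      # job name shown by squeue
#SBATCH --partition=gpu           # partition (queue) defined in slurm.conf
#SBATCH --gres=gpu:1              # request one GPU through generic resources
#SBATCH --time=01:00:00           # wall-clock limit
srun python train.py              # srun launches the task on the allocation
EOF
grep -c '^#SBATCH' train_job.sh   # prints 4
```

Submitting the script with `sbatch train_job.sh` queues it for batch execution; the interactive counterpart (question 33) is to allocate resources first and then spawn tasks with `srun` inside that allocation.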

References

Preparing for the NVIDIA-Certified Professional: AI Operations (NCP-AIO) requires an understanding of cluster management, workload orchestration, and GPU optimization. Below is a curated list of official documentation and resources to help you master the exam domains.

  • Base Command Manager: Study the “Installation Guide” and “Administrator Guide” to understand cluster provisioning and node health.
  • DCGM (Data Center GPU Manager): Study its use for monitoring cluster health, running diagnostics, and performance profiling.
  • Multi-Instance GPU (MIG): You must know how to configure and manage MIG profiles for various workloads.
  • Fabric Manager: Review troubleshooting procedures for the fabric manager service.
  • Slurm: Focus on job scheduling, resource allocation (GRES), and partition management.
  • NVIDIA GPUDirect RDMA: Understand components used to accelerate data movement across networks.
  • Run:ai: Learn to manage GPU fractions, compute quotas, and project-based resource allocation for multi-tenant teams.
  • Kubernetes and NVIDIA GPUs: Focus on the NVIDIA GPU Operator for automating driver deployment and managing GPU resources in containerized environments.
  • NGC: Understand how to pull optimized containers, models, and Helm charts while using the NGC CLI for registry management.
  • Magnum IO: Study the architecture for GPUDirect Storage (GDS) and how it bypasses the CPU to accelerate data movement.
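As a quick illustration of the GPU Operator bullet above, a minimal pod spec that requests one GPU might look like the following sketch. The pod name and image tag are illustrative assumptions; the `nvidia.com/gpu` resource is advertised by the NVIDIA device plugin that the GPU Operator deploys:

```yaml
# Minimal sketch of a pod requesting one GPU; names and tags are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example NGC CUDA image
      command: ["nvidia-smi"]      # print GPU inventory and exit
      resources:
        limits:
          nvidia.com/gpu: 1        # one whole GPU; MIG slices appear as
                                   # e.g. nvidia.com/mig-1g.5gb instead
```

Applying this with `kubectl apply -f pod.yaml` and checking the pod log is a quick way to confirm that driver deployment and device-plugin scheduling are working end to end.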