NVIDIA Data Center GPU Manager

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies, including power and clock management. Infrastructure teams can use it standalone or integrate it into cluster management tools, resource schedulers, and monitoring products from NVIDIA partners.

DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM supports Linux operating systems on x86_64 and aarch64 (sbsa) platforms. The installer packages include libraries, binaries, and source examples for using the API (C and Python). In addition, Go bindings are available via the open-source GitHub repository. Refer to the documentation for additional details and instructions.

DCGM also integrates into the Kubernetes ecosystem through DCGM-Exporter, which provides rich GPU telemetry in containerized environments. DCGM has an open-core architecture: the foundational libraries and building blocks are available as open source on GitHub, while certain components, such as diagnostics and tests, remain proprietary.
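
To make the telemetry concrete: DCGM-Exporter publishes GPU metrics in the Prometheus text format, with metric names such as DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_GPU_UTIL. The following is a minimal sketch of reading that format in Python. The sample text, its label set, and its values are made up for illustration, and the helper names parse_metrics and per_gpu are hypothetical, not part of any DCGM API. In a real deployment the text would be fetched from the exporter's metrics endpoint rather than from a string.

```python
import re

# Illustrative sample only: the label set and values are assumptions,
# not output captured from a real DCGM-Exporter instance.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 45
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 51
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 3
'''

# One sample line: metric name, optional {labels}, then a numeric value.
# Lines carrying a trailing timestamp do not match and are skipped.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


def per_gpu(samples, metric):
    """Map the 'gpu' label to the value of one metric."""
    return {labels.get('gpu'): value
            for name, labels, value in samples
            if name == metric}


if __name__ == '__main__':
    parsed = parse_metrics(SAMPLE)
    print(per_gpu(parsed, 'DCGM_FI_DEV_GPU_TEMP'))
    print(per_gpu(parsed, 'DCGM_FI_DEV_GPU_UTIL'))
```

In practice a Prometheus server usually scrapes the exporter directly, so a hand-written parser like this is mainly useful for quick checks or ad hoc scripts.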