GPU Developer relationship/NVIDIA DCGM support in Nagios
John Coombs
jcoombs at nvidia.com
Mon Feb 6 21:58:20 CET 2017
George, Scott and Ethan -
I am the Alliance Manager for cluster management and job scheduling tools at NVIDIA.
We recently introduced a new (free) software product for GPU management called NVIDIA DCGM (Data Center GPU Manager), which enables:
- Active Health Monitoring
- Active Diagnostics and System Validation
- Policy and Group Configuration Management
- Power and Clock Management
This product supplements NVML and SMI, which you already leverage in the GPU Sensor Monitoring plug-in and the nvidia-smi plugin.
More information is available at https://developer.nvidia.com/data-center-gpu-manager-dcgm (note that you need to be a member of the NVIDIA Developer Zone to download the software).
At minimum, the monitoring components of DCGM may be of interest to you. Two customer account managers asked about Nagios support at a meeting last week.
I support a number of partners in this space by arranging technical calls and sometimes providing the latest GPUs for testing and software validation. If I can help you support NVIDIA GPUs more productively and fully, we should talk.
Thanks.
John Coombs
NVIDIA Tesla Alliance Management
jcoombs at nvidia.com