Overview

[Figure: project overview diagram]

Introduction

In this project, I managed clusters comprising thousands of GPUs and hundreds of compute machines, along with their storage and network systems. These clusters supplied the computing resources that AI researchers depended on. However, no centralized system reported the availability and utilization of those resources, which made it difficult to assess how effectively they were being used. Researchers could not check the real-time availability of resources such as GPUs, which often led to wasted capacity. Executives, in turn, lacked the periodic usage summaries they needed to make informed business decisions.

To address these issues, I developed metric exporters to collect hardware and software metrics, operated a Prometheus server to gather and process metrics from thousands of computing resources, and built a dashboard to present those metrics in a readable format. For the periodic usage summaries, I integrated metering-operator into our system to generate resource usage reports. The resulting system monitored the status of GPUs, CPUs, memory, disk usage, and network performance, and produced daily, weekly, and monthly usage reports.
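
As a rough illustration of what a custom exporter hands to Prometheus, the sketch below renders gauge samples in the Prometheus text exposition format using only the Python standard library. It is not the exporter code from this project: the `Sample` type, the `render_gauges` function, and the metric and label names in the usage note are all hypothetical, and a real exporter would more likely rely on an existing client library to serve the output over HTTP.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sample:
    """One gauge sample. The type and its field names are hypothetical."""
    name: str
    value: float
    labels: dict[str, str] = field(default_factory=dict)


def escape_label_value(value: str) -> str:
    # The text format requires backslash, double quote, and newline
    # to be escaped inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauges(help_text: dict[str, str], samples: list[Sample]) -> str:
    """Render samples as Prometheus text exposition lines.

    Samples that share a metric name are grouped so that each metric's
    HELP and TYPE comments appear once, directly before its samples.
    HELP strings are assumed to be single-line and free of backslashes.
    """
    grouped: dict[str, list[Sample]] = {}
    for sample in samples:
        grouped.setdefault(sample.name, []).append(sample)

    lines: list[str] = []
    for name, group in grouped.items():
        if name in help_text:
            lines.append(f"# HELP {name} {help_text[name]}")
        lines.append(f"# TYPE {name} gauge")
        for sample in group:
            if sample.labels:
                # Sort labels so the output is deterministic.
                pairs = ",".join(
                    f'{key}="{escape_label_value(val)}"'
                    for key, val in sorted(sample.labels.items())
                )
                lines.append(f"{name}{{{pairs}}} {sample.value}")
            else:
                lines.append(f"{name} {sample.value}")
    return "\n".join(lines) + "\n"
```

Calling `render_gauges({"gpu_utilization_percent": "GPU utilization."}, [Sample("gpu_utilization_percent", 87.5, {"node": "n1", "gpu": "0"})])` yields a HELP line, a TYPE line, and one sample line with its labels in sorted order, which is the kind of payload Prometheus scrapes from an exporter endpoint.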

Task

Multi-cluster Deployment
- Packaging
Data Collection and Processing
- Metric Collection and Processing
- Report Generation
Resource Usage Visualization
- Real-time Multi-cluster Monitoring
- Periodic Data Summary

Approach

Multi-cluster Deployment
- Packaging
  - Helm chart integration: exporters, metering-operator, Prometheus, Grafana
Data Collection and Processing
- Metric Collection and Processing
  - Custom exporters for our service-specific data
  - Open-source exporters for common systems (e.g., node-exporter, kube-state-metrics, dcgm-exporter)
  - Data processing using PromQL in Prometheus
- Report Generation
  - Daily, weekly, and monthly data aggregation with metering-operator and SQL
Resource Usage Visualization
- Real-time Multi-cluster Monitoring
  - Grafana dashboard creation
  - Central Grafana connected to multiple Prometheus instances across clusters
- Periodic Data Summary
  - Designing report data APIs and delivering them to frontend developers for visualization
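
The daily, weekly, and monthly aggregation listed under Report Generation can be pictured with a small bucketing sketch. In the project this step was done with metering-operator and SQL. The Python below is only a stand-in for the same idea and does not reproduce those queries. The function names, the `(timestamp, value)` sample shape, and the choice of a plain mean as the summary statistic are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import datetime
from statistics import fmean


def bucket_key(ts: datetime, period: str) -> str:
    """Map a timestamp to the label of the reporting bucket it falls in."""
    if period == "daily":
        return ts.strftime("%Y-%m-%d")
    if period == "weekly":
        # ISO weeks start on Monday; index access works on all Python 3 versions.
        iso = ts.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    if period == "monthly":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unknown period: {period!r}")


def summarize(samples: list[tuple[datetime, float]], period: str) -> dict[str, float]:
    """Average the sample values inside each daily, weekly, or monthly bucket."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[bucket_key(ts, period)].append(value)
    # Sorted keys give the report rows a stable chronological order.
    return {key: fmean(values) for key, values in sorted(buckets.items())}
```

Given a list of utilization samples, `summarize(samples, "weekly")` returns one averaged value per ISO week, keyed by labels such as `2024-W01`. A report data API could return rows of this shape for the frontend to chart.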