Overview

[Image: Personal Webpage Overview]

Introduction

In this project, I managed clusters consisting of thousands of GPUs, hundreds of computing machines, storage systems, and network systems. These clusters provided the computing resources that AI researchers depended on. However, no centralized system reported the availability and utilization of those resources, so it was difficult to assess how effectively they were being used. Researchers could not check the real-time availability of computing resources, including GPUs, which often led to wasted resources. Executives, in turn, could not make well-informed business decisions because no periodic resource usage summaries existed.

To address these issues, I developed metric exporters to collect hardware and software metrics, managed a Prometheus server to gather and process metrics from thousands of computing resources, and built a dashboard to visualize these metrics in a readable format. For periodic resource usage summaries, I integrated metering-operator into our system to generate resource usage reports. This system monitored the status of GPUs, CPUs, memory, disk usage, and network performance, while also producing daily, weekly, and monthly usage reports.
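To make the exporter side of this pipeline concrete, here is a minimal sketch of a metric exporter that serves GPU utilization in the Prometheus text exposition format. It uses only the Python standard library. The metric name `gpu_utilization_percent`, the label names, port 9400, and the `read_gpu_utilization` helper are all hypothetical and chosen for illustration; the actual exporters in this project may have been written differently.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def escape_label(value: str) -> str:
    """Escape a label value as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric; samples is a list of (labels dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Sort labels so the output is deterministic.
        label_str = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
        )
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def read_gpu_utilization():
    """Hypothetical placeholder: a real exporter would query the GPU driver."""
    return [
        ({"node": "node-01", "gpu": "0"}, 87.5),
        ({"node": "node-01", "gpu": "1"}, 0.0),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes the /metrics path by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_gauge(
            "gpu_utilization_percent",
            "GPU utilization per device.",
            read_gpu_utilization(),
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 9400 is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 9400), MetricsHandler).serve_forever()
```

A Prometheus server configured to scrape this endpoint would then store one time series per node and GPU, which a dashboard can query for real-time availability.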

Task

Approach