Overview

Introduction
This project was initiated at Samsung Research to make language model evaluation more efficient. Before this platform existed, researchers and engineers had to download test datasets manually, implement evaluation code from research papers, and optimize evaluation speed themselves. This manual work slowed down research and development. The platform removes that overhead and thereby accelerates LLM R&D.
Task
- Software Architecture Design
- Evaluation Speedup
- RAG System Integration
- **LLM-as-a-Judge Implementation**
  - Prompt Design
  - Ensuring Evaluation Objectivity
  - Judge Model Serving
- Supporting Various Benchmarks
  - Benchmark Search and Integration
Approach
- Software Architecture Design
- Evaluation Speedup
- RAG System Integration
  - Built and integrated a RAG system using Docling, an embedding model, and FAISS. The system can be reconfigured dynamically through an API so that different RAG setups can be evaluated.
- LLM-as-a-Judge Implementation
  - Prompt Design
  - Ensuring Evaluation Objectivity
  - Judge Model Serving
- Supporting Various Benchmarks
  - Benchmark Search and Integration
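
The RAG System Integration item above can be illustrated with a minimal sketch of a retrieval pipeline whose settings can be swapped at runtime. This is not the project's code. `RagConfig`, `RagPipeline`, and every function name here are hypothetical, and the sketch uses only the Python standard library: plain strings stand in for text that a parser such as Docling would produce, normalized word counts stand in for the embedding model, and a brute-force cosine search stands in for a FAISS index.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class RagConfig:
    """Hypothetical settings a client could send to reconfigure the pipeline."""
    chunk_size: int = 8  # words per chunk
    top_k: int = 2       # chunks returned per query


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split already-parsed document text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalized word counts."""
    counts = Counter(word.strip(".,;:!?").lower() for word in text.split())
    counts.pop("", None)  # drop tokens that were pure punctuation
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two normalized sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class RagPipeline:
    """Brute-force retriever; a real system would delegate search to a vector index."""

    def __init__(self, config: RagConfig) -> None:
        self.config = config
        self.documents: list[str] = []
        self.chunks: list[str] = []
        self.vectors: list[dict[str, float]] = []

    def index(self, documents: list[str]) -> None:
        """Chunk and embed all documents under the current configuration."""
        self.documents = list(documents)
        self.chunks = [
            chunk
            for doc in self.documents
            for chunk in chunk_text(doc, self.config.chunk_size)
        ]
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def reconfigure(self, config: RagConfig) -> None:
        """Apply new settings and rebuild the index, as an API handler might."""
        self.config = config
        self.index(self.documents)

    def retrieve(self, query: str) -> list[str]:
        """Return the top_k chunks ranked by similarity to the query."""
        query_vec = embed(query)
        scores = [similarity(query_vec, vec) for vec in self.vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.chunks[i] for i in ranked[: self.config.top_k]]
```

In this sketch, an evaluation run would call `index` once, then call `reconfigure` with a new `RagConfig` between runs to compare retrieval settings without restarting the service. How the actual platform exposes its configuration API is not described here.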
Result