Overview

Introduction

This project was initiated at Samsung Research to make language model evaluation more efficient. Before the platform existed, researchers and engineers had to download test datasets manually, implement evaluation code from research papers, and optimize evaluation speed on their own, all of which slowed down research and development. The platform takes over these steps and significantly accelerates LLM R&D.

Task

Software Architecture Design
- Evaluation Speedup
- RAG System Integration
LLM-as-a-Judge Implementation
- Prompt Design
- Ensuring Evaluation Objectivity
- Judge Model Serving
Supporting Various Benchmarks
- Benchmark Search and Integration

Approach

Software Architecture Design
- Evaluation Speedup
  - Distributed system design
  - Concurrency and Parallelism in Backend
- RAG System Integration
  - Built and integrated a RAG system using Docling for document parsing, an embedding model, and FAISS for vector search. The system can be reconfigured dynamically through an API when evaluating RAG systems.
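The "Concurrency and Parallelism in Backend" item under Evaluation Speedup can be illustrated with a minimal sketch. It assumes each evaluation sample is an I/O-bound call to a model server and bounds the number of in-flight requests with a semaphore. The names `evaluate_sample`, `evaluate_all`, and `run_evaluation` are hypothetical and not taken from the platform, and the scoring step is a placeholder.

```python
import asyncio
from typing import List


async def evaluate_sample(sample: str) -> float:
    # Placeholder for a request to a model server followed by scoring.
    # The sleep stands in for network latency; the score is a dummy value.
    await asyncio.sleep(0.01)
    return float(len(sample))


async def evaluate_all(samples: List[str], max_concurrency: int = 8) -> List[float]:
    # The semaphore caps how many samples are evaluated at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(sample: str) -> float:
        async with semaphore:
            return await evaluate_sample(sample)

    # gather returns results in the same order as the input samples.
    return await asyncio.gather(*(bounded(s) for s in samples))


def run_evaluation(samples: List[str], max_concurrency: int = 8) -> List[float]:
    # Synchronous entry point for callers that are not already async.
    return asyncio.run(evaluate_all(samples, max_concurrency))
```

Calling `run_evaluation(samples, max_concurrency=2)` evaluates at most two samples at a time. A real backend would likely tune this limit to the capacity of the model server.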
LLM-as-a-Judge Implementation
- Prompt Design
  - Pointwise & Pairwise evaluation
  - Diverse Criteria and Metrics
- Ensuring Evaluation Objectivity
  - Mitigation of positional bias and self-enhancement bias
- Judge Model Serving
  - vLLM
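One common way to reduce positional bias in pairwise judging is to query the judge twice with the answer order swapped and accept a winner only when both verdicts agree. The sketch below illustrates that idea. It is an assumption about how such mitigation could look, not the platform's actual method, and the judge is a plain Python callable standing in for a served judge model. All names (`Judge`, `pairwise_verdict`, the two example judges) are hypothetical.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns
# "first", "second", or "tie". In practice this would wrap a call to a
# served judge model; here it is an ordinary function.
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    # Ask once with A shown first, then again with B shown first.
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)

    # Translate position-based verdicts back to answer labels.
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")

    # Disagreement between the two orderings is treated as a tie.
    return forward_winner if forward_winner == backward_winner else "tie"


def always_first_judge(question: str, first: str, second: str) -> str:
    # A fully position-biased judge that always prefers the first slot.
    return "first"


def longer_answer_judge(question: str, first: str, second: str) -> str:
    # A position-independent toy judge that prefers the longer answer.
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With `always_first_judge`, the two orderings contradict each other and the verdict collapses to a tie. With `longer_answer_judge`, the same answer wins in both orderings and is returned as the winner.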
Supporting Various Benchmarks
- Benchmark Search and Integration
  - Benchmarks: NeedleBench, LongBenchV1, and LongBenchV2
  - Open Source Modification and Integration
    - Custom evaluation steps, custom model integration, RAG integration, etc.
