LLM Evals 입문

AI Backend · 2026-05-12 · 5분 읽기

Markdown약 1280 tokens

테스트 코드로 잡히지 않는 품질을 측정하기

이 글에서는 LLM Evals의 기본 개념과 실무 적용 방법을 정리합니다.

일반 백엔드 테스트는 비교적 명확합니다. 입력이 같으면 출력이 같아야 하고, DB 상태가 기대한 대로 바뀌어야 합니다. 하지만 LLM 서비스는 다릅니다. 같은 질문에도 표현이 조금씩 달라질 수 있고, 답변이 그럴듯하지만 틀릴 수 있고, 프롬프트를 조금 바꿨을 뿐인데 특정 케이스 품질이 떨어질 수 있습니다.

그래서 LLM 서비스에는 별도의 평가 체계가 필요합니다.

분석 기준일: 2026-05-12
실습 기준 환경: OpenAI Evals 개념, Python, pytest, JSONL dataset
주요 참고자료: OpenAI Evals Docs, OpenAI Evaluation Best Practices

핵심 요약

Evals는 LLM 출력이 기대 기준을 만족하는지 측정하는 구조화된 테스트다.
Golden Set은 회귀 테스트를 위한 대표 질문 데이터셋이다.
LLM 품질 평가는 correctness, groundedness, citation accuracy, safety를 나눠야 한다.
프롬프트 변경과 모델 변경은 eval 결과 없이 배포하면 위험하다.
Evals는 테스트 코드의 대체물이 아니라 보완재다.

1. 왜 LLM에는 별도 평가가 필요한가

LLM 서비스의 품질 문제는 일반 테스트로 잘 잡히지 않습니다.

문제	일반 테스트	Eval
JSON schema 오류	잘 잡음	보조 가능
답변 사실성	잘 못 잡음	필요
출처 정확성	제한적	필요
답변 톤	제한적	가능
프롬프트 변경 회귀	잘 못 잡음	매우 중요
모델 변경 영향	잘 못 잡음	매우 중요

즉, 테스트 코드는 시스템이 깨졌는지 확인하고, eval은 답변이 쓸 만한지 확인합니다.

2. Unit Test와 Eval의 차이

구분	Unit Test	Eval
대상	코드 로직	모델 출력 품질
결과	deterministic	probabilistic
기준	expected value	rubric, grader, human label
실행 시점	CI	CI + 배포 전 + 주기적
실패 의미	코드 오류 가능성	품질 저하 가능성

둘 다 필요합니다. schema validation은 unit/integration test로 잡고, 답변 품질은 eval로 봅니다.

3. Golden Set 만들기

Golden Set은 대표 질문과 기대 기준을 모은 데이터셋입니다.

1// 예시 JSON 구조입니다.2{3  "id": "rag_001",4  "question": "Redis Cache Aside 패턴은 언제 사용하나요?",5  "expected_facts": [6    "캐시를 먼저 조회한다",7    "miss 시 원본 데이터 저장소를 조회한다",8    "조회 결과를 TTL과 함께 캐시에 저장한다"9  ],10  "must_cite": true,11  "category": "cache"12}

좋은 golden set은 다양한 실패 유형을 포함해야 합니다.

1# 예시입니다.2[ ] 쉬운 질문3[ ] 애매한 질문4[ ] 문서에 없는 질문5[ ] 최신성이 필요한 질문6[ ] 권한이 필요한 질문7[ ] citation이 중요한 질문8[ ] 안전 정책이 필요한 질문

4. 평가 기준 정의

평가 기준은 하나로 뭉치면 안 됩니다.

기준	질문
Correctness	답이 사실적으로 맞는가?
Groundedness	제공된 문서 근거에 기반했는가?
Citation Accuracy	citation이 실제 근거와 연결되는가?
Completeness	필요한 내용을 빠뜨리지 않았는가?
Safety	위험하거나 금지된 내용을 피했는가?
Format	요구한 schema/format을 지켰는가?

5. Grader 설계

Grader는 사람이 할 수도 있고, 규칙 기반일 수도 있고, LLM-as-a-Judge일 수도 있습니다.

Grader 유형	장점	단점
Exact match	빠르고 명확	표현 다양성에 약함
Rule-based	안정적	복잡한 품질 판단 어려움
Human review	신뢰도 높음	비용 큼
LLM-as-a-Judge	확장성 좋음	judge 편향과 불안정성

초기에는 rule-based + human review로 시작하고, 이후 LLM judge를 보조로 쓰는 편이 안전합니다.

6. RAG 평가 나누기

RAG는 검색과 생성을 분리해서 평가해야 합니다.

1# 예시입니다.2Retrieval Eval:3정답 문서가 top-k 안에 들어왔는가?4 5Generation Eval:6검색된 문서를 근거로 올바른 답을 했는가?

예시:

지표	설명
`recall@k`	정답 문서가 top-k에 포함되는 비율
`precision@k`	검색 결과 중 관련 문서 비율
`answer_correctness`	답변의 사실성
`citation_accuracy`	출처 정확성
`groundedness`	근거 기반성

7. Release Gate로 연결하기

Eval은 읽고 끝나는 리포트가 아니라 배포 기준이 되어야 합니다.

1# 예시입니다.2프롬프트 v3 배포 조건:3- overall pass rate >= 90%4- critical set pass rate >= 98%5- citation accuracy >= 95%6- regression count <= 3

프롬프트나 모델을 바꾸기 전후를 비교합니다.

1# 예시입니다.2baseline: prompt.v2 + model.A3candidate: prompt.v3 + model.A4candidate: prompt.v2 + model.B

8. 운영 지표와 dashboard

지표	의미
`eval_pass_rate`	전체 통과율
`eval_regression_count`	이전보다 나빠진 케이스 수
`eval_category_pass_rate`	카테고리별 품질
`citation_accuracy`	출처 정확도
`human_review_required_rate`	사람 검토 필요 비율

이 지표는 배포 판단, 품질 개선, 프롬프트 변경 기록에 사용됩니다.

9. 실무 체크리스트

1# 예시입니다.2[ ] 대표 질문 golden set이 있는가?3[ ] 검색 평가와 답변 평가를 분리했는가?4[ ] correctness와 groundedness를 구분했는가?5[ ] citation accuracy를 평가하는가?6[ ] 프롬프트 변경 전후 eval을 비교하는가?7[ ] critical case를 별도 관리하는가?8[ ] eval 실패 케이스를 다음 개선에 반영하는가?9[ ] eval 결과가 release gate와 연결되는가?

10. Q&A

Q1. Eval은 얼마나 자주 실행해야 하나요?

프롬프트, 모델, retrieval 로직이 바뀔 때는 반드시 실행하는 것이 좋습니다. 운영 중에는 주기적으로 샘플링해 drift를 확인할 수 있습니다.

Q2. LLM-as-a-Judge를 믿어도 되나요?

보조 수단으로는 유용하지만 완전히 신뢰하면 위험합니다. 중요한 케이스는 human review나 rule-based check와 함께 써야 합니다.

Q3. Golden Set은 몇 개부터 시작하면 되나요?

처음에는 30–50개라도 충분합니다. 중요한 것은 숫자보다 실패 유형을 다양하게 담는 것입니다.

11. 참고자료와 불확실성

참고자료

OpenAI Evals: https://platform.openai.com/docs/guides/evals
OpenAI Evaluation Best Practices: https://platform.openai.com/docs/guides/evaluation-best-practices
OpenAI Cookbook — Evals: https://developers.openai.com/cookbook/topic/evals

불확실성

Eval 기준은 서비스 도메인에 맞게 조정해야 합니다.
LLM-as-a-Judge는 judge 모델, rubric, temperature에 따라 결과가 달라질 수 있습니다.

실무 예시: golden set을 작게 시작하기

처음부터 수백 개 평가 문항을 만들 필요는 없습니다. 운영에서 자주 들어오는 질문 20개, 실패하면 큰 문제가 되는 질문 10개, 새 기능이 의도한 질문 10개 정도로 시작할 수 있습니다. 중요한 것은 질문, 기대 근거, 허용 가능한 답변 범위, 실패 조건을 같이 적는 것입니다.

항목	예
질문	"계정 삭제 후 데이터는 얼마나 보관되나요?"
기대 근거	개인정보 처리방침 section id
통과 기준	보관 기간과 예외 조건을 모두 언급
실패 기준	삭제 즉시 완전 삭제라고 단정

실패 사례는 LLM-as-a-Judge 점수만 release gate로 쓰는 것입니다. grader도 모델이므로 기준이 흔들릴 수 있습니다. 중요한 정책, 법무, 결제 답변은 rule-based check나 human review 샘플을 섞어야 합니다. Eval은 정답을 자동으로 보장하는 도구가 아니라, 배포 전후 품질 변화를 같은 기준으로 비교하게 해주는 계측 장치입니다.

작은 golden set이라도 version을 붙여야 합니다. 질문이 바뀌었는지, 기대 근거가 바뀌었는지, grader prompt가 바뀌었는지 모르면 점수 변화를 해석할 수 없습니다. 배포 gate에는 평균 점수보다 "절대 실패하면 안 되는 문항"을 따로 두는 편이 안전합니다.

Measuring quality that is not captured by test code

This article summarizes the basic concepts of LLM Evals and how to apply them in practice.

General backend testing is relatively clear. If the input is the same, the output should be the same, and the DB state should change as expected. But LLM services are different. Even the same question may be worded slightly differently, the answer may be plausible but wrong, and the quality of a particular case may decrease even if the prompt is only slightly changed.

That's why LLM services require a separate evaluation system.

Analysis date: 2026-05-12 Practice standard environment: OpenAI Evals concept, Python, pytest, JSONL dataset Main reference materials: OpenAI Evals Docs, OpenAI Evaluation Best Practices

Key takeaways

Evals are structured tests that measure whether LLM output meets expected standards.
Golden Set is a representative question dataset for regression testing.
LLM quality evaluation should be divided into correctness, groundedness, citation accuracy, and safety.
Prompt changes and model changes are dangerous if deployed without eval results.
Evals are a complement to test code, not a replacement.

1. Why does LLM require a separate assessment?

Quality issues in LLM services are not well captured by general testing.

problem	general test	Eval
JSON schema error	good noise	Assistable
Answer Factuality	I didn't catch it well	necessary
Source Accuracy	limited	necessary
answer tone	limited	possible
Prompt change regression	I didn't catch it well	very important
Model change impact	I didn't catch it well	very important

That is, the test code checks whether the system is broken, and eval checks whether the answer is usable.

2. Difference between Unit Test and Eval

division	Unit Test	Eval
Target	code logic	Model output quality
result	deterministic	probabilistic
standard	expected value	rubric, grader, human label
When to run	CI	CI + Before deployment + Periodically
failure meaning	Possibility of code errors	Possible decline in quality

You need both. Schema validation is considered unit/integration test, and answer quality is considered eval.

3. Creating a Golden Set

The Golden Set is a dataset that collects representative questions and expected standards.

1// This is an example JSON structure.2{3  "id": "rag_001",4  "question": "When to use the Redis Cache Aside pattern?",5  "expected_facts": [6    "Check the cache first",7    "In case of a miss, the original data storage is searched.",8    "Store search results in cache with TTL"9  ],10  "must_cite": true,11  "category": "cache"12}

A good golden set should contain a variety of failure types.

1# This is an example.2[ ] Easy questions3[ ] Ambiguous question4[ ] Questions not in the document5[ ] Questions that require up-to-dateness6[ ] Questions requiring permission7[ ] Citation is an important question8[ ] Questions that require a safety policy

4. Define evaluation criteria

Evaluation criteria should not be lumped together.

standard	question
Correctness	Is the answer factually correct?
Groundedness	Was it based on the documentary evidence provided?
Citation Accuracy	Is the citation connected to actual evidence?
Completeness	Did I leave out any necessary information?
Safety	Did you avoid dangerous or prohibited content?
Format	Did you follow the requested schema/format?

5. Grader design

Grader can be human, rule-based, or LLM-as-a-Judge.

Grader type	merit	disadvantage
Exact match	fast and clear	Weak in expression diversity
Rule-based	stable	Complex quality judgment difficult
Human review	High reliability	Cost is large
LLM-as-a-Judge	Good scalability	judge bias and instability

It is safer to start with rule-based + human review initially and then use an LLM judge as an assistant.

6. Divide RAG evaluation

RAGs should be evaluated separately for retrieval and generation.

1# This is an example.2Retrieval Eval:3Is the correct answer document in top-k?4 5Generation Eval:6Did you give the correct answer based on the retrieved document?

example:

characteristic	explanation
`recall@k`	Proportion of correct answer documents included in top-k
`precision@k`	Percentage of relevant documents among search results
`answer_correctness`	The truth of the answer
`citation_accuracy`	Source Accuracy
`groundedness`	evidence-based

7. Connect to Release Gate

Eval should be a standard for distribution, not just a report that you read and end.

1# This is an example.2Prompt v3 deployment terms:3- overall pass rate >= 90%4- critical set pass rate >= 98%5- citation accuracy >= 95%6- regression count <= 3

Compare before and after changing prompts or models.

1# This is an example.2baseline: prompt.v2 + model.A3candidate: prompt.v3 + model.A4candidate: prompt.v2 + model.B

8. Operational indicators and dashboard

characteristic	meaning
`eval_pass_rate`	overall pass rate
`eval_regression_count`	Number of cases worse than before
`eval_category_pass_rate`	Quality by Category
`citation_accuracy`	Source Accuracy
`human_review_required_rate`	Percentage of human review required

This metric is used to determine deployment, improve quality, and record prompt changes.

9. Practical checklist

1# This is an example.2[ ] Representative question Is there a golden set?3[ ] Are search evaluation and answer evaluation separate?4[ ] Did you distinguish between correctness and groundedness?5[ ] Do you evaluate citation accuracy?6[ ] Compare eval before and after changing the prompt?7[ ] Are critical cases managed separately?8[ ] Are eval failure cases reflected in the next improvement?9[ ] Is the eval result connected to the release gate?

10. Q&A

Q1. How often should I run Eval?

It is recommended to run this whenever the prompt, model, or retrieval logic changes. During operation, drift can be checked through periodic sampling.

Q2. Can I Trust LLM-as-a-Judge?

It's useful as an aid, but dangerous if you trust it completely. Critical cases should be combined with human review or rule-based checks.

Q3. How many Golden Sets should I start with?

At first, 30 to 50 is enough. The important thing is to include a variety of failure types rather than numbers.

11. References and uncertainty

References

OpenAI Evals:https://platform.openai.com/docs/guides/evals
OpenAI Evaluation Best Practices:https://platform.openai.com/docs/guides/evaluation-best-practices
OpenAI Cookbook — Evals:https://developers.openai.com/cookbook/topic/evals

uncertainty

Eval criteria must be tailored to the service domain.
LLM-as-a-Judge results may vary depending on the judge model, rubric, and temperature.

Practical example: starting small with a golden set

There is no need to create hundreds of assessment questions from scratch. You might start with about 20 questions that come up frequently in operations, 10 questions that would be a big problem if they fail, and 10 questions that the new feature is intended for. The important thing is to write down the question, the basis for expectations, the range of acceptable responses, and the failure conditions.

item	yes
question	“How long is my data retained after I delete my account?”
basis of expectation	Privacy Policy section id
passing criteria	Mention both storage period and exception conditions
failure criteria	Assume that deletion is complete immediately.

A failure case is using only the LLM-as-a-Judge score as the release gate. The grader is also a model, so standards can be shaky. Important policy, legal, and payment responses should incorporate rule-based checks or human review samples. Eval is not a tool that automatically guarantees the correct answer, but rather a measurement device that allows you to compare changes in quality before and after deployment using the same criteria.

Even if it is a small golden set, a version must be added. It is impossible to interpret changes in scores without knowing whether the questions changed, the expectations changed, or the grader prompt changed. It is safer to set aside “questions that must never be failed” rather than average scores at the distribution gate.

GitHub 계정으로 로그인하면 댓글을 남길 수 있습니다. 댓글은 GitHub Discussions를 통해 운영됩니다.