OpenTelemetry로 LLM 요청 Trace 연결하기

AI Backend · 2026-05-12 · 마지막 수정일 2026-05-26 · 6분 읽기

Markdown약 2141 tokens

API, RAG, 모델 호출, 응답 검증을 하나의 흐름으로 보기

이 글에서는 OpenTelemetry를 사용해 LLM 서비스의 요청 흐름을 trace로 연결하는 방법을 정리합니다.

LLM 서비스의 장애 분석은 일반 API보다 어렵습니다. 한 요청 안에 검색, 모델 호출, 응답 검증, 저장, 캐시, 외부 provider 호출이 섞여 있기 때문입니다. 사용자는 “답변이 느리다”고 말하지만, 실제 원인은 vector search일 수도 있고, 모델 호출일 수도 있고, schema validation 재시도일 수도 있습니다.

그래서 trace가 필요합니다.

분석 기준일: 2026-05-12
실습 기준 환경: FastAPI, OpenTelemetry, PostgreSQL, Redis, LLM Provider API
주요 참고자료: OpenTelemetry Docs, W3C Trace Context, Google SRE

핵심 요약

OpenTelemetry는 traces, metrics, logs를 생성·수집·내보내기 위한 관측성 프레임워크다.
LLM 요청은 API, retrieval, LLM call, validation, DB save를 span으로 나눠야 한다.
trace_id는 로그, metric, eval result와 연결되어야 한다.
prompt 원문과 문서 전문은 span attribute에 넣지 않는다.
p95/p99 latency, error rate, token usage를 trace와 함께 봐야 한다.

1. 왜 LLM 서비스에 trace가 필요한가

LLM 요청은 여러 하위 작업으로 나뉩니다.

1# 예시입니다.2POST /answers3→ authenticate4→ rate limit check5→ retrieval6→ prompt build7→ llm call8→ schema validation9→ db save10→ response

전체 응답 시간이 8초라고 할 때 어디서 시간이 걸렸는지 모르면 개선할 수 없습니다.

병목 위치	가능한 원인
retrieval	vector index, metadata filter, DB 부하
prompt build	context 과다, token 계산
llm call	provider 지연, rate limit, output 길이
validation	schema 실패, 재시도
db save	connection pool, transaction 지연

Trace는 이 흐름을 span 단위로 나눠 보여줍니다.

2. OpenTelemetry 기본 개념

개념	설명
Trace	하나의 요청 전체 흐름
Span	trace 안의 개별 작업 단위
Attribute	span에 붙는 key-value metadata
Metric	시간에 따른 수치 데이터
Log	개별 이벤트 기록
Collector	telemetry 수집·가공·전송 컴포넌트

LLM 서비스에서는 trace와 token usage metric을 연결하는 것이 중요합니다.

3. LLM 요청 span 설계

4. Trace ID 전파

외부에서 들어온 traceparent header가 있으면 이어받고, 없으면 새 trace를 만듭니다.

1# 예시입니다.2traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

응답에도 request_id나 trace_id를 포함하면 장애 신고를 받을 때 추적이 쉬워집니다.

1// 예시 JSON 구조입니다.2{3  "answer": "...",4  "trace_id": "0af7651916cd43dd8448eb211c80319c"5}

5. Span Attribute 설계

좋은 attribute는 분석에 도움이 되면서 민감정보를 포함하지 않아야 합니다.

Attribute	예시	주의
`llm.provider`	openai	OK
`llm.model`	configured-model	OK
`llm.prompt_version`	answer.v3	OK
`llm.input_tokens`	3200	OK
`llm.output_tokens`	800	OK
`rag.top_k`	5	OK
`rag.document_scope`	official_docs	OK
`user.question`	원문 질문	넣지 않기
`prompt.full`	전체 prompt	넣지 않기
`document.content`	chunk 전문	넣지 않기

6. 민감정보와 보안

보안 주의
OpenTelemetry attribute와 log에는 prompt 원문, 문서 전문, 개인정보, API key, access token을 넣지 않아야 합니다. 관측성 데이터도 외부 시스템으로 전송될 수 있으므로 보안 등급을 별도로 관리해야 합니다.

대신 hash나 길이, category를 기록합니다.

1# 예시입니다.2question_hash3prompt_version4context_token_count5retrieved_chunk_count6document_scope

7. Metrics와 Logs 연결

Trace만으로는 전체 경향을 보기 어렵습니다. Metrics와 Logs도 함께 설계해야 합니다.

Signal	역할
Trace	한 요청의 상세 흐름
Metric	전체 경향과 알림
Log	이벤트와 디버깅 정보

예시 metric:

1# 예시입니다.2llm_request_duration_ms3llm_provider_latency_ms4llm_schema_validation_fail_total5rag_retrieval_latency_ms6rag_empty_result_total7llm_input_tokens_total8llm_output_tokens_total

로그에는 trace_id를 반드시 포함합니다.

1// 예시 JSON 구조입니다.2{3  "level": "ERROR",4  "message": "LLM schema validation failed",5  "trace_id": "0af...",6  "prompt_version": "answer.v3",7  "error_code": "SCHEMA_VALIDATION_FAILED"8}

8. Dashboard에서 볼 것

초기 dashboard는 복잡할 필요가 없습니다.

1# 예시입니다.2[ ] 요청 수3[ ] p50/p95/p99 latency4[ ] error rate5[ ] provider latency6[ ] schema validation failure rate7[ ] retrieval latency8[ ] token usage9[ ] cache hit rate10[ ] eval pass rate

Google SRE의 four golden signals인 latency, traffic, errors, saturation을 LLM 서비스에 맞게 확장하면 됩니다.

9. FastAPI 예제 구조

1# 예시 코드입니다.2from opentelemetry import trace3 4tracer = trace.get_tracer(__name__)5 6# 이 선언은 예시 흐름을 보여줍니다.7async def create_answer(req):8    with tracer.start_as_current_span("answer.create") as span:9        span.set_attribute("llm.prompt_version", "answer.v3")10 11        with tracer.start_as_current_span("retrieval.search") as s:12            chunks = await retrieve(req.question)13            s.set_attribute("rag.top_k", len(chunks))14 15        with tracer.start_as_current_span("llm.call") as s:16            result = await call_llm(req, chunks)17            s.set_attribute("llm.input_tokens", result.usage.input_tokens)18            s.set_attribute("llm.output_tokens", result.usage.output_tokens)19 20        with tracer.start_as_current_span("output.validate"):21            validated = validate_output(result)22 23        return validated

이 예시는 구조를 보여주기 위한 코드입니다. 실제 서비스에서는 middleware, exporter, collector 설정이 필요합니다.

10. 실무 체크리스트

1# 예시입니다.2[ ] 요청 전체를 하나의 trace로 볼 수 있는가?3[ ] retrieval, llm call, validation이 별도 span인가?4[ ] trace_id가 API 응답과 로그에 포함되는가?5[ ] span attribute에 민감정보가 없는가?6[ ] token usage가 metric으로 기록되는가?7[ ] schema validation failure를 추적하는가?8[ ] p95/p99 latency를 dashboard에서 보는가?9[ ] eval result와 prompt_version을 연결할 수 있는가?

실패 사례: 로그는 많은데 병목을 찾지 못하는 경우

LLM 서비스에서 request started, retrieval done, model done 같은 로그를 많이 남겨도 장애 분석이 어려울 때가 있습니다. 각 로그가 같은 request에 속한다는 것은 알지만, 어느 단계가 p95 latency를 밀어 올리는지, retry가 몇 번 일어났는지, validation 실패가 model 호출 전인지 후인지 한눈에 보기 어렵기 때문입니다. 로그는 사건 목록이고, trace는 사건 사이의 부모-자식 관계와 시간을 보여줍니다.

특히 RAG 요청은 병목이 매번 바뀝니다. 어떤 요청은 vector search가 느리고, 어떤 요청은 provider queue가 길고, 어떤 요청은 schema validation retry 때문에 두 번째 모델 호출이 생깁니다. 하나의 trace 안에 api.request, retrieval.search, llm.call, output.validate, answer.persist span이 있으면 문제 위치를 단계별로 분리할 수 있습니다.

구현 예시: LLM 요청 span 구조

1trace llm_answer_request2  span api.request3    span auth.check4    span retrieval.search5    span llm.call6    span output.validate7    span answer.persist

각 span attribute에는 원문이 아니라 운영 가능한 요약값을 넣습니다.

1llm.model = "runtime-selected-model"2llm.prompt_version = "rag_answer.v4"3llm.input_tokens = 42004llm.output_tokens = 6205retrieval.top_k = 86retrieval.index = "document_chunks_hnsw_v2"7validation.schema = "answer_with_citations.v2"

prompt 전문, 문서 전문, 사용자 개인정보를 attribute에 넣지 않는 것이 중요합니다. 대신 hash, token count, version, chunk id처럼 재현과 분석에 필요한 최소값을 남깁니다.

체크리스트 적용 결과

확인 항목	좋은 신호	위험 신호
trace 연결	API, retrieval, LLM, validation span이 한 trace에 있음	단계별 로그만 흩어져 있음
민감정보	prompt hash와 token count만 저장	원문 prompt와 chunk 전문 저장
비용 분석	token usage가 trace와 metric에 연결	비용 리포트와 장애 trace가 분리
재시도	retry span이 원 호출의 child로 보임	재시도가 새 request처럼 보임

이 구조가 있으면 "LLM이 느리다"는 모호한 신고를 "retrieval p95가 증가했다" 또는 "schema validation retry가 늘었다"로 바꿀 수 있습니다.

운영 예시: 느린 답변 신고를 trace로 좁히기

사용자가 "답변이 12초나 걸렸다"고 신고했다고 해봅시다. trace가 없으면 로그를 시간순으로 뒤져야 합니다. trace가 있으면 한 요청 안에서 어느 span이 시간을 썼는지 바로 비교할 수 있습니다.

Span	정상 범위 예	신고 trace	해석
auth.check	20ms	18ms	정상
retrieval.search	300ms	4,800ms	검색 병목
prompt.build	80ms	120ms	정상
llm.call	2,500ms	2,900ms	약간 증가
output.validate	50ms	1,700ms	retry 의심

이 trace에서는 provider만 탓하면 안 됩니다. retrieval과 validation이 함께 느립니다. 다음 확인은 vector index 상태, metadata filter selectivity, validation 실패 로그입니다. 같은 시간대에 rag_empty_result_total이나 schema_validation_fail_total이 늘었는지 metric도 같이 봐야 합니다.

경계 사례는 민감정보 마스킹입니다. 장애를 빨리 보려고 prompt 전문과 chunk 전문을 span attribute에 넣으면 나중에 관측성 저장소가 개인정보 저장소가 됩니다. 대신 prompt_hash, chunk_ids, token_count, schema_version을 남기고, 원문이 필요한 재현은 권한이 있는 별도 eval 저장소에서 처리하는 편이 안전합니다. 좋은 trace는 모든 내용을 담는 로그가 아니라 원인 범위를 좁히는 지도에 가깝습니다.

11. Q&A

Q1. 로그만 잘 남기면 trace가 없어도 되나요?

작은 서비스에서는 로그로 시작할 수 있습니다. 하지만 요청이 여러 단계로 나뉘면 trace가 훨씬 유리합니다. 특히 LLM provider 호출과 retrieval 병목을 구분하려면 span이 필요합니다.

Q2. 모든 요청을 trace하면 비용이 크지 않나요?

샘플링을 사용할 수 있습니다. 다만 오류 요청, 긴 지연 요청, eval 실패 요청은 우선적으로 보존하는 정책이 좋습니다.

Q3. prompt 내용을 trace에 넣어도 되나요?

권장하지 않습니다. prompt와 문서 chunk에는 민감정보가 포함될 수 있습니다. hash, token count, version 등으로 대체합니다.

12. 참고자료와 불확실성

참고자료

OpenTelemetry Docs: https://opentelemetry.io/docs/
What is OpenTelemetry: https://opentelemetry.io/docs/what-is-opentelemetry/
W3C Trace Context: https://www.w3.org/TR/trace-context/
Google SRE — Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/

불확실성

OpenTelemetry SDK 설정은 언어와 프레임워크에 따라 달라집니다.
샘플링 정책과 데이터 보관 기간은 조직의 비용·보안 정책에 따라 결정해야 합니다.

View API, RAG, model call, and response validation as one flow

In this article, we outline how to connect the request flow of the LLM service with a trace using OpenTelemetry.

Failure analysis for LLM services is more difficult than for regular APIs. This is because search, model call, response validation, storage, cache, and external provider call are mixed within one request. Users say “response is slow,” but the actual cause may be vector search, model call, or schema validation retry.

So we need a trace.

Analysis base date: 2026-05-12 Practice standard environment: FastAPI, OpenTelemetry, PostgreSQL, Redis, LLM Provider API Main reference materials: OpenTelemetry Docs, W3C Trace Context, Google SRE

Key takeaways

OpenTelemetry is an observability framework for creating, collecting, and exporting traces, metrics, and logs.
LLM requests must be divided into API, retrieval, LLM call, validation, and DB save into spans.
trace_id must be connected to log, metric, and eval result.
The original text of the prompt and the full text of the document are not included in the span attribute.
You should look at p95/p99 latency, error rate, and token usage along with trace.

1. Why trace is needed for LLM services

The LLM request is divided into several subtasks.

1# This is an example.2POST /answers3→ authenticate4→ rate limit check5→ retrieval6→ prompt build7→ llm call8→ schema validation9→ db save10→ response

If your overall response time is 8 seconds, you can't improve it if you don't know where the time was taken.

bottleneck location	possible cause
retrieval	vector index, metadata filter, DB load
prompt build	Excessive context, token calculation
llm call	provider delay, rate limit, output length
validation	schema failed, retry
db save	connection pool, transaction delay

Trace shows this flow divided into spans.

2. OpenTelemetry basic concepts

concept	explanation
Trace	One request entire flow
Span	Individual units of work within a trace
Attribute	key-value metadata attached to span
Metric	Numerical data over time
Log	Individual event recording
Collector	Telemetry collection, processing, and transmission components

In LLM services, it is important to connect trace and token usage metrics.

3. LLM request span design

Recommended span structure:

1# This is an example.2answer.create3├── auth.check4├── rate_limit.check5├── retrieval.search6│   └── vector.query7├── prompt.build8├── llm.call9├── output.validate10└── answer.save

The duration and error status are recorded for each span.

4. trace ID propagation

If there is atraceparentheader coming from outside, it is inherited, and if not, a new trace is created.

1# This is an example.2traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Including a request_id or trace_id in the response also makes tracking easier when receiving a crash report.

1// This is an example JSON structure.2{3  "answer": "...",4  "trace_id": "0af7651916cd43dd8448eb211c80319c"5}

5. Span Attribute Design

A good attribute should be helpful for analysis and not contain sensitive information.

Attribute	example	caution
`llm.provider`	openai	OK
`llm.model`	configured-model	OK
`llm.prompt_version`	answer.v3	OK
`llm.input_tokens`	3200	OK
`llm.output_tokens`	800	OK
`rag.top_k`	5	OK
`rag.document_scope`	official_docs	OK
`user.question`	original question	Don't put it in
`prompt.full`	full prompt	Don't put it in
`document.content`	chunk specialty	Don't put it in

6. Sensitive information and security

Security Caution OpenTelemetry attributes and logs should not include the original text of the prompt, full text of the document, personal information, API key, or access token. Since observable data can also be transmitted to external systems, its security level must be managed separately.

Instead, record hash, length, and category.

1# This is an example.2question_hash3prompt_version4context_token_count5retrieved_chunk_count6document_scope

7. Connect Metrics and Logs

It is difficult to see the overall trend using Trace alone. Metrics and logs must also be designed together.

Signal	role
Trace	Detailed flow of one request
Metric	All trends and notifications
Log	Events and debugging information

Example metric:

1# This is an example.2llm_request_duration_ms3llm_provider_latency_ms4llm_schema_validation_fail_total5rag_retrieval_latency_ms6rag_empty_result_total7llm_input_tokens_total8llm_output_tokens_total

Logs must include trace_id.

1// This is an example JSON structure.2{3  "level": "ERROR",4  "message": "LLM schema validation failed",5  "trace_id": "0af...",6  "prompt_version": "answer.v3",7  "error_code": "SCHEMA_VALIDATION_FAILED"8}

8. What to see in Dashboard

Your initial dashboard doesn't have to be complicated.

1# This is an example.2[ ] Number of requests3[ ] p50/p95/p99 latency4[ ] error rate5[ ] provider latency6[ ] schema validation failure rate7[ ] retrieval latency8[ ] token usage9[ ] cache hit rate10[ ] eval pass rate

Google SRE's four golden signals - latency, traffic, errors, and saturation - can be expanded to suit the LLM service.

9. FastAPI example structure

1# This is example code.2from opentelemetry import trace3 4tracer = trace.get_tracer(__name__)5 6# This declaration shows an example flow.7async def create_answer(req):8    with tracer.start_as_current_span("answer.create") as span:9        span.set_attribute("llm.prompt_version", "answer.v3")10 11        with tracer.start_as_current_span("retrieval.search") as s:12            chunks = await retrieve(req.question)13            s.set_attribute("rag.top_k", len(chunks))14 15        with tracer.start_as_current_span("llm.call") as s:16            result = await call_llm(req, chunks)17            s.set_attribute("llm.input_tokens", result.usage.input_tokens)18            s.set_attribute("llm.output_tokens", result.usage.output_tokens)19 20        with tracer.start_as_current_span("output.validate"):21            validated = validate_output(result)22 23        return validated

This example is code to demonstrate the structure. In the actual service, middleware, exporter, and collector settings are required.

10. Practical checklist

1# This is an example.2[ ] Can I view the entire request as one trace?3[ ] Are retrieval, llm call, and validation separate spans?4[ ] Is trace_id included in API responses and logs?5[ ] Is there no sensitive information in the span attribute?6[ ] Is token usage recorded as a metric?7[ ] Do you track schema validation failure?8[ ] Do you see p95/p99 latency in the dashboard?9[ ] Is it possible to connect eval result and prompt_version?

Failure case: There are a lot of logs, but the bottleneck cannot be found.

Even though the LLM service leaves many logs such asrequest started,retrieval done, andmodel done, it is sometimes difficult to analyze the failure. Although we know that each log belongs to the same request, it is difficult to see at a glance which step is pushing up the p95 latency, how many retries occurred, and whether the validation failure occurred before or after the model call. A log is a list of events, and a trace shows the parent-child relationship and time between events.

In particular, RAG requests are a bottleneck that changes every time. Some requests have slow vector search, some have long provider queues, and some require a second model call due to schema validation retry. If there are spansapi.request,retrieval.search,llm.call,output.validate, andanswer.persistin one trace, the problem location can be separated step by step.

Example implementation: LLM request span structure

1trace llm_answer_request2span api.request3span auth.check4span retrieval.search5span llm.call6span output.validate7span answer.persist

Each span attribute contains an operable summary value rather than the original text.

1llm.model = "runtime-selected-model"2llm.prompt_version = "rag_answer.v4"3llm.input_tokens = 42004llm.output_tokens = 6205retrieval.top_k = 86retrieval.index = "document_chunks_hnsw_v2"7validation.schema = "answer_with_citations.v2"

It is important not to include the full text of the prompt, the full text of the document, or user personal information in the attribute. Instead, it leaves the minimum values necessary for reproduction and analysis, such as hash, token count, version, and chunk id.

Checklist application results

Check items	good signal	danger signal
trace connection	API, retrieval, LLM, validation span in one trace	Only step-by-step logs are scattered
Sensitive information	Save only the prompt hash and token count	Save the original prompt and chunk text
cost analysis	Token usage is linked to trace and metrics	Separate cost report and fault trace
retry	The retry span appears to be a child of the original call.	Retry looks like a new request

With this structure, a vague report of “LLM is slow” can be changed to “retrieval p95 has increased” or “schema validation retry has increased”.

Operational example: Narrowing down slow response reports to traces

Let's say a user reports that "it took 12 seconds to reply." If there is no trace, you will have to search through the logs chronologically. If you have a trace, you can immediately compare which spans spent time within one request.

Span	Normal range example	report trace	analysis
auth.check	20ms	18ms	normal
retrieval.search	300ms	4,800ms	Search bottleneck
prompt.build	80ms	120ms	normal
llm.call	2,500ms	2,900ms	slightly increased
output.validate	50ms	1,700ms	retry doubt

In this trace, you can't just blame the provider. Both retrieval and validation are slow. Next checks are vector index status, metadata filter selectivity, and validation failure log. You should also check the metric to see ifrag_empty_result_totalorschema_validation_fail_totalhas increased during the same time period.

A borderline case is the masking of sensitive information. To quickly see the failure, put the prompt and chunk text into the span attribute, and later the observable storage becomes a private information storage. Instead, it is safer to leaveprompt_hash,chunk_ids,token_count, andschema_version, and handle reproductions that require the original text in a separate eval repository with permission. A good trace is less of an exhaustive log and more of a map that narrows down the cause.

11. Q&A

Q1. If I just leave a good log, is there no need to have a trace?

For small services, you can start with logs. However, if the request is broken down into multiple steps, trace is much more advantageous. In particular, a span is needed to distinguish between the LLM provider call and the retrieval bottleneck.

Q2. Wouldn't it be expensive to trace every request?

Sampling is available. However, it is recommended to preferentially preserve error requests, long-delay requests, and eval failure requests.

Q3. Can I put the contents of the prompt in the trace?

Not recommended. Prompts and document chunks may contain sensitive information. Replace with hash, token count, version, etc.

12. References and uncertainty

References

OpenTelemetry Docs:https://opentelemetry.io/docs/
What is OpenTelemetry:https://opentelemetry.io/docs/what-is-opentelemetry/
W3C Trace Context:https://www.w3.org/TR/trace-context/
Google SRE — Monitoring Distributed Systems:https://sre.google/sre-book/monitoring-distributed-systems/

uncertainty

OpenTelemetry SDK settings vary depending on the language and framework.
Sampling policies and data retention periods should be determined based on the organization's cost and security policies.

GitHub 계정으로 로그인하면 댓글을 남길 수 있습니다. 댓글은 GitHub Discussions를 통해 운영됩니다.