Designing an Operable API

AI Backend · 2026-05-12 · 5 min read


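Failure responses and trace IDs come before endpoints

This article describes how to design an LLM service's API so that it can be operated in production.

API design usually starts with URLs and HTTP methods: you pick endpoints such as POST /chat and GET /documents/{id} first. From an operational perspective, though, several things need to be decided before the endpoints.

The failure response format.
The trace ID.
Whether a request can be retried.
The rate limit policy.
The criteria for health checks and readiness.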

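Analysis date: 2026-05-12
Reference environment: FastAPI, Pydantic, PostgreSQL, OpenTelemetry
Key references: OpenAPI Specification, FastAPI Error Handling, W3C Trace Context, Google SRE. The main text separates facts checked against official documentation from this article's own operational interpretation.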

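Key takeaways

An API contract covers failure responses, not just success responses.
Every request should carry a request_id or a trace_id.
An LLM API should document its retry, timeout, and idempotency behavior.
Health checks should separate process liveness from dependency readiness.
API documentation is not only a spec for frontend developers; it is also a document for operators and incident responders.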

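1. What is an operable API?

An operable API is not one that only works well when everything is healthy. It is an API where the cause of a failure can be traced, the caller can decide whether to retry, and the operator can check its state through metrics.

Item	Simple API	Operable API
Success response	Returns data	Includes schema and version
Failure response	String or arbitrary JSON	Standard error envelope
Traceability	Log search	request_id, trace_id
Retry	Left to the caller	Explicit retryable field
Limit policy	None	Rate limit headers
Status check	A single `/health`	Separate liveness and readiness

This difference matters even more for LLM services, because model call failures, schema validation failures, rate limits, and long latencies are common.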

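2. Failure responses come before success responses

In practice, the APIs that make incident response hard are mostly the ones whose failure responses differ from one endpoint to the next.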

```json
{
  "message": "failed"
}
```

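A response like this says nothing about the cause. You cannot tell which request failed, whether it can be retried, whether the caller sent something wrong, or whether the server had an internal problem.

An operable failure response should include at least the following information.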

```json
{
  "error": {
    "code": "LLM_SCHEMA_VALIDATION_FAILED",
    "message": "Model response did not match the expected schema.",
    "retryable": true,
    "details": {
      "schema_version": "answer.v1"
    }
  },
  "request_id": "req_01HX...",
  "trace_id": "0af7651916cd43dd8448eb211c80319c"
}
```

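3. Designing the error response envelope

The recommended error envelope structure is as follows.

Field	Description
`error.code`	Unique code that can be searched for in logs and internal systems
`error.message`	Description that a user or developer can understand
`error.retryable`	Whether the client may retry the request
`error.details`	Extra information for debugging, with sensitive data excluded
`request_id`	Identifier of a single request
`trace_id`	Distributed tracing identifier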
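The fields in the table above can be modeled as a small typed structure so that every handler produces the same shape. The following is a minimal, framework-agnostic sketch using only the standard library; the names `ErrorBody` and `build_error_envelope` are hypothetical and not part of FastAPI or Pydantic.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class ErrorBody:
    """The `error` object of the envelope (hypothetical helper type)."""

    code: str
    message: str
    retryable: bool
    details: dict[str, Any] = field(default_factory=dict)


def build_error_envelope(error: ErrorBody, request_id: str, trace_id: str) -> dict[str, Any]:
    """Return a JSON-serializable envelope with the IDs at the top level."""
    return {
        "error": asdict(error),
        "request_id": request_id,
        "trace_id": trace_id,
    }


envelope = build_error_envelope(
    ErrorBody(
        code="LLM_SCHEMA_VALIDATION_FAILED",
        message="Model response did not match the expected schema.",
        retryable=True,
        details={"schema_version": "answer.v1"},
    ),
    request_id="req_example",  # placeholder value for illustration
    trace_id="0af7651916cd43dd8448eb211c80319c",
)
```

Building the envelope in one place means a new error code cannot forget `retryable` or the IDs, because the constructor requires them.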

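Take care not to put raw prompts, personal data, API keys, or the full text of internal documents in details.

Security note
LLM requests and responses may contain sensitive documents. Define a masking policy first so that raw prompts, document chunks, and user input are never copied verbatim into failure responses or logs.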
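One way to enforce such a masking policy is to pass `details` through a redaction step before it reaches a response or a log line. The sketch below assumes a simple deny-list of key names; the key names and the `redact_details` function are hypothetical, and a real policy would come from your own security review.

```python
from typing import Any

# Hypothetical deny-list of keys whose values must never leave the server.
SENSITIVE_KEYS = {"prompt", "user_input", "document_chunk", "api_key"}


def redact_details(details: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `details` with sensitive values replaced.

    Nested dictionaries are processed recursively; other values are kept.
    """
    redacted: dict[str, Any] = {}
    for key, value in details.items():
        if key.lower() in SENSITIVE_KEYS:
            redacted[key] = "[REDACTED]"
        elif isinstance(value, dict):
            redacted[key] = redact_details(value)
        else:
            redacted[key] = value
    return redacted


safe = redact_details(
    {
        "schema_version": "answer.v1",
        "prompt": "full prompt text",
        "provider": {"api_key": "secret", "name": "example"},
    }
)
```

A deny-list only catches the keys you anticipated. An allow-list that keeps only known-safe keys is stricter and may suit `error.details` better.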

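4. Designing the trace ID and request ID

request_id identifies a single request within one service. trace_id is a distributed tracing identifier that follows the request across multiple services and external calls.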

```text
Client
→ API Server span
→ Retrieval span
→ LLM Provider span
→ Validation span
→ DB Save span
```

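In an LLM service, one request fans out into several sub-operations, so the spans have to be linked into a single trace before you can see where the time went.

An example of the response headers: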

```http
X-Request-Id: req_01HXABC
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```
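The `traceparent` value follows the W3C Trace Context layout `version-traceid-parentid-flags`. The sketch below shows how a server could extract the trace ID from an incoming header or start a new trace when none is present. In a stack using OpenTelemetry, the SDK would normally handle this propagation for you. The function names and the `req_` ID format are assumptions for illustration, and only version `00` is accepted as a simplification.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_trace_id(traceparent: str) -> Optional[str]:
    """Return the trace-id of a version-00 traceparent, or None if invalid."""
    match = _TRACEPARENT.match(traceparent.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id


def new_traceparent() -> str:
    """Start a new trace with random IDs and the sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def new_request_id() -> str:
    """Hypothetical request ID format: 'req_' plus 16 random hex characters."""
    return f"req_{secrets.token_hex(8)}"


trace_id = parse_trace_id("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

Treating an invalid header as absent and starting a fresh trace keeps a malformed client value from breaking request handling.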

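5. APIs that need an idempotency key

Not every API needs an idempotency key, but APIs that change state do.

API	Need for idempotency	Reason
`GET /answers/{id}`	Low	Read-only request
`POST /questions`	Medium	Can prevent the same question from being created twice
`POST /documents/{id}/index`	High	Prevents duplicate indexing
`POST /payments`	Very high	Prevents duplicate payments

The document indexing API is a typical example. If the same document is indexed several times, vector rows are duplicated and search quality can degrade.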

```http
POST /documents/doc_123/index
Idempotency-Key: doc_123:index:v4
```

The server stores this key in a job table, and when the same key arrives again it returns the status of the existing job.
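The lookup-then-create behavior described above can be sketched with an in-memory store. This is an illustration only: `IndexJobStore` and `IndexJob` are hypothetical names, and a real implementation on PostgreSQL would rely on a unique constraint on the key column so that concurrent requests cannot both insert.

```python
import itertools
from dataclasses import dataclass


@dataclass
class IndexJob:
    job_id: int
    document_id: str
    status: str = "queued"


class IndexJobStore:
    """In-memory stand-in for the job table (not safe across processes)."""

    def __init__(self) -> None:
        self._jobs_by_key: dict[str, IndexJob] = {}
        self._ids = itertools.count(1)

    def submit(self, idempotency_key: str, document_id: str) -> tuple[IndexJob, bool]:
        """Return (job, created). A repeated key returns the existing job."""
        existing = self._jobs_by_key.get(idempotency_key)
        if existing is not None:
            return existing, False
        job = IndexJob(job_id=next(self._ids), document_id=document_id)
        self._jobs_by_key[idempotency_key] = job
        return job, True


store = IndexJobStore()
first, created_first = store.submit("doc_123:index:v4", "doc_123")
second, created_second = store.submit("doc_123:index:v4", "doc_123")
```

The `created` flag lets the handler choose a status code, for example 202 for a new job and 200 when replaying an existing one. That mapping is a design choice rather than a standard.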

6. Designing the rate limit response

LLM APIs are constrained by both cost and provider limits, so a rate limit is not just a security mechanism; it is also a cost control mechanism.

```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: <unix_timestamp>
Retry-After: 30
```

A 429 response should also include the request_id and an error code.

```json
{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Too many requests. Retry after 30 seconds.",
    "retryable": true
  },
  "request_id": "req_01HX..."
}
```
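To show where the header values come from, here is a minimal fixed-window limiter that returns both the decision and the headers shown above. It is a sketch under simplifying assumptions: state lives in process memory, old windows are never purged, and `FixedWindowLimiter` is a hypothetical name. A shared store such as Redis would be needed once several API instances serve traffic.

```python
import math
from dataclasses import dataclass


@dataclass
class RateLimitDecision:
    allowed: bool
    headers: dict[str, str]


class FixedWindowLimiter:
    """Per-client fixed-window counter (illustrative, in-memory only)."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts: dict[tuple[str, int], int] = {}

    def check(self, client_id: str, now: float) -> RateLimitDecision:
        window = int(now // self.window_seconds)
        reset_at = (window + 1) * self.window_seconds  # Unix time the window ends
        used = self._counts.get((client_id, window), 0)
        allowed = used < self.limit
        if allowed:
            used += 1
            self._counts[(client_id, window)] = used
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - used),
            "X-RateLimit-Reset": str(reset_at),
        }
        if not allowed:
            # Retry-After is in whole seconds, rounded up, and at least 1.
            headers["Retry-After"] = str(max(1, math.ceil(reset_at - now)))
        return RateLimitDecision(allowed, headers)


limiter = FixedWindowLimiter(limit=2, window_seconds=60)
d1 = limiter.check("client_a", now=120.0)
d2 = limiter.check("client_a", now=130.0)
d3 = limiter.check("client_a", now=150.0)
```

Passing `now` in explicitly keeps the limiter deterministic in tests. A fixed window allows bursts at window boundaries, so a sliding window or token bucket may be a better fit when provider quotas are strict.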

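7. Health check and readiness

A single /health endpoint is not enough.

Endpoint	Purpose	What it checks
`/livez`	Process is alive	Application process
`/readyz`	Ready to receive traffic	DB, Redis, queue, config
`/metrics`	Metrics collection	Scraped by Prometheus or similar

Think carefully before including the LLM provider's status in readiness. If a temporarily slow external model API takes the whole service out of rotation, the outage can get bigger rather than smaller. A common approach is to track provider status as a separate metric and define a fallback policy.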
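The split between the two endpoints can be sketched as two plain functions that return a status code and a body. The check callables here are placeholders; in a real service they would ping the database, Redis, and so on. Consistent with the paragraph above, the LLM provider is deliberately left out of the readiness checks. All names are hypothetical.

```python
from typing import Callable

# Each check returns True when the dependency is usable.
Check = Callable[[], bool]


def liveness() -> tuple[int, dict]:
    """Liveness only proves the process can answer; it touches no dependencies."""
    return 200, {"status": "alive"}


def readiness(checks: dict[str, Check]) -> tuple[int, dict]:
    """Return 200 only if every required dependency check passes, else 503."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    ready = all(results.values())
    return (200 if ready else 503), {
        "status": "ready" if ready else "not_ready",
        "checks": results,
    }


def _broken_redis() -> bool:
    raise ConnectionError("redis unreachable")


ok_status, ok_body = readiness({"db": lambda: True, "config": lambda: True})
bad_status, bad_body = readiness({"db": lambda: True, "redis": _broken_redis})
```

Returning the per-dependency results in the body lets an operator see which dependency failed without searching logs. Whether to expose that detail publicly is a separate decision.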

8. OpenAPI documentation criteria

API documentation should cover failure responses as well as success responses.

```yaml
responses:
  '200':
    description: Answer created
  '400':
    description: Invalid request
  '429':
    description: Rate limit exceeded
  '500':
    description: Internal server error
```

The items to document are:

[ ] request schema
[ ] response schema
[ ] error schema
[ ] authentication
[ ] rate limit
[ ] retry policy
[ ] idempotency key
[ ] timeout expectation
[ ] trace/request ID header

9. Practical checklist

[ ] Does every API response include a request_id?
[ ] Is error.code a unique, searchable string?
[ ] Is there a retryable field?
[ ] Does the 429 response include Retry-After?
[ ] Do state-changing APIs support an idempotency key?
[ ] Are liveness and readiness separated?
[ ] Does the OpenAPI document include failure responses?
[ ] Are logs free of sensitive information?

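10. Q&A

Q1. Do I need both request_id and trace_id?

A small service can start with request_id alone. Once the service is split into multiple components, trace_id becomes necessary. request_id is good for following a single request; trace_id is good for linking multiple spans.

Q2. Should every API take an idempotency key?

No. Start with APIs that change state and are likely to be retried. Document indexing, payments, and external event handling are the priorities.

Q3. Should an LLM provider outage count as a readiness failure?

In most cases it is better tracked as a separate provider health metric. Fallback, provider routing, or a degraded response can be safer than taking the whole API server out of service because of a provider outage.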

11. References and uncertainty

References

OpenAPI Specification: https://swagger.io/specification/
FastAPI Error Handling: https://fastapi.tiangolo.com/tutorial/handling-errors/
W3C Trace Context: https://www.w3.org/TR/trace-context/
Google SRE: Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/
AWS Builders Library: Idempotent APIs: https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/

Uncertainty

The actual error code scheme should follow your organization's API governance standards.
The implementation details of the trace ID header may differ depending on the observability stack in use.

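Practical example: an API designed failure-first

When building an LLM answer generation API, it helps to define the failure responses before the success response, because whether a retry makes sense depends on the cause: the user uploaded a document that is too large, the user is not authorized to read the document, the provider timed out, or schema validation failed.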

```json
{
  "error": {
    "code": "CONTEXT_TOO_LARGE",
    "message": "The input document exceeds the current processing limit.",
    "retryable": false
  },
  "request_id": "req_123",
  "trace_id": "trace_456"
}
```

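The point of comparison is retryable. A rate-limited request can be retried after a delay, but a permission failure will not be fixed by retrying, and a context overflow requires shortening the document or splitting the job. Without this distinction, the client shows every failure as the same alert and the user has no idea what to do next.

Failure code	Client action
`RATE_LIMITED`	Retry after backoff
`UNAUTHORIZED_DOCUMENT`	Explain how to get access
`CONTEXT_TOO_LARGE`	Ask the user to reduce the input
`PROVIDER_TIMEOUT`	Retry or fall back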
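On the client side, the table above reduces to a lookup from error code to action, with the `retryable` flag as a fallback for codes the client does not recognize yet. The sketch below is one possible shape. The action names are hypothetical, and the backoff uses exponential growth with full jitter, which is a common choice rather than something this article prescribes.

```python
import random

# Mirrors the table above; the action names are hypothetical.
ACTIONS = {
    "RATE_LIMITED": "retry_after_backoff",
    "UNAUTHORIZED_DOCUMENT": "show_access_guidance",
    "CONTEXT_TOO_LARGE": "ask_to_reduce_input",
    "PROVIDER_TIMEOUT": "retry_or_fallback",
}


def next_action(envelope: dict) -> str:
    """Pick a client action from an error envelope."""
    error = envelope.get("error", {})
    action = ACTIONS.get(error.get("code"))
    if action is not None:
        return action
    # Unknown codes fall back to the retryable flag.
    return "retry_after_backoff" if error.get("retryable") else "show_generic_error"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; `attempt` starts at 0."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


action = next_action({"error": {"code": "CONTEXT_TOO_LARGE", "retryable": False}})
```

Because unknown codes still honor `retryable`, the server can introduce a new error code without forcing every client to ship an update first.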

