Why Coding Agents Become Runtimes

AI Agent · 2026-05-06 · 5 min read



The first problem you hit when bringing AI coding tools into real work is not whether they can generate code. It is whether the generated code gets verified end to end inside the product repository, whether a failure leaves its reason behind, and whether the agent stops before a risky change.

For short functions and test drafts, models alone have become good enough. But real product development does not end with a single answer. The agent has to read files, judge the scope of a change, run tests, interpret failure logs, and fix the code again.

So what teams need to prepare is not a longer prompt but an execution environment in which an agent can work safely for a long time.

This article is part 1 of the “Coding Agent Runtime Design” series. The bottom line: a coding agent should no longer be seen as a prompt responder, but as a development runtime with goals, state, permissions, and validation.

Key takeaways

The basic unit of earlier AI coding tools was the prompt.
Real development work is longer and more state-dependent than a single prompt.
SWE-bench and SWE-agent show why an environment that can explore a repository, edit files, and run tests matters.
The OpenAI Codex harness and App Server architecture show that the agent loop and thread state underneath are becoming more important than the UIs on top.
What practice calls for is not a collection of prompts but a work contract that keeps an agent safe over long runs.

Prompts alone cannot finish development

Early AI coding tools were used in a simple way.

Refactor this function.

A slightly more specific request looked like this:

Write test code for the auth module and fix the failing tests.

Requests like these are still valid. They work well enough for small function changes, single-component improvements, and test drafts.

But real product development rarely ends with one such request. Suppose you are replacing the payment module with a different payment gateway (PG). That work includes analyzing the existing payment flow, swapping the API client, changing webhooks, managing environment variables and secrets, updating tests, verifying before deployment, and planning a rollback in case of failure.

The actual flow is more of an iterative execution loop than a single response. If the test fails, we go back to editing. If it passes, we move on to review and documentation.

Preparation

Understand the requirements
Explore the codebase
Estimate the change scope and plan the implementation

Iterative execution

Edit files
Run tests
Analyze the cause of failure
Edit again

Retry

Wrap-up

Review
Documentation
Update task status

What matters here is not “a single prompt” but “task state.”
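The loop above can be sketched in a few lines. This is only an illustration under stated assumptions: `edit` and `run_tests` are hypothetical callables standing in for whatever the agent and test runner actually do, and `RunRecord` is an invented name for the state the loop leaves behind.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RunRecord:
    """What one run of the loop leaves behind for later review."""
    passed: bool = False
    attempts: int = 0
    failures: List[str] = field(default_factory=list)


def run_until_verified(
    edit: Callable[[List[str]], None],
    run_tests: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,
) -> RunRecord:
    """Edit, test, and retry until the tests pass or the budget runs out."""
    record = RunRecord()
    for _ in range(max_attempts):
        record.attempts += 1
        # The edit step receives every earlier failure log, not just the last one.
        edit(list(record.failures))
        ok, log = run_tests()
        if ok:
            record.passed = True
            break
        # A failed attempt is recorded as state instead of being lost in chat history.
        record.failures.append(log)
    return record
```

The point of the sketch is that the result is a record (attempt count, failure logs, pass or fail), not a chat message. The attempt budget also gives the loop an explicit stopping condition.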

Real development work carries state

Developers keep a lot of state in their heads while they work.

State	Example
Goal	Why is this feature being changed?
Scope	Which files and modules may be touched?
Constraints	Is a DB schema change allowed?
Validation	Which tests must pass?
Failure history	Which approaches have failed?
Artifacts	Final diff, test logs, review notes

AI coding agents also need to be able to handle this state to properly perform their development tasks. The state here is not just a record of a conversation. Conversation records are long, noisy, and difficult to review.

What a development workflow needs is structured state, separated out as below.

Agent Runtime State

Task definition

Goal: what must be finished

Task: the subtask currently in progress

Execution evidence

Validation: checks that passed

Failure: cause of each failure

Artifact: results left behind

Control and reuse

Permission: points that need human approval

Memory: knowledge that stays valid for the next task

Ultimately, the coding agent becomes more of a task executor than a response generator.
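As a minimal sketch, the seven state fields could be held in one serializable record. The class name, field names, and JSON layout below are assumptions made for illustration. They are not taken from any particular tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AgentRuntimeState:
    # Task definition
    goal: str
    task: str = ""
    # Execution evidence
    validations: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    # Control and reuse
    pending_approvals: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the state so a later session can reload it."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "AgentRuntimeState":
        """Rebuild the state from its serialized form."""
        return cls(**json.loads(text))
```

Keeping the state separate from the conversation log is what makes it reviewable: a person can read the failures and validations without replaying the whole dialogue.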

The difficulty SWE-bench revealed

SWE-bench is a benchmark that evaluates models on software engineering problems drawn from real GitHub issues and pull requests. It covers bugs and feature fixes that arose in real repositories, not simple algorithm puzzles.

The core of SWE-bench is that it gives the model a codebase and an issue, then requires a patch that passes the real tests. [Princeton's introduction to SWE-bench](https://pli.princeton.edu/blog/2023/swe-bench-can-language-models-resolve-real-world-github-issues) describes the task as one of locating the problem inside files that run to thousands of lines and producing a fix that interacts with several parts of the code.

This point is important.

Coding agents are hard not because the model “doesn't know code.” LLMs generate short code snippets quite well. The hard part is deciding, within the context of the whole repository, which files to read, which tests to run, and how to interpret the failure logs.

In other words, it's not just model intelligence that's the problem, but the execution environment.

The importance of the interface, as shown by SWE-agent

The SWE-agent paper looks at this problem through the lens of the Agent-Computer Interface. Its argument is that just as people need an IDE, language model agents need a dedicated interface built for exploring code, editing files, and running tests.

Princeton's paper page explains that SWE-agent's custom interface improved the agent's ability to create and edit code files, navigate repositories, and run tests. The practical takeaway is clear.

A good coding agent is not made by good prompts alone. It needs a good execution interface and a work contract.

A harness is not just a wrapper that calls a model. It is an environment that gives the agent both an execution interface and a work contract.

Development harness

Execution interface

Files the agent may read

Files the agent may modify

Commands the agent may run

Tests that must be verified

Work contract

Conditions under which the agent must stop

Artifacts the agent must leave behind

Information the agent is allowed to remember
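The execution-interface half of that list can be expressed as a small policy object. The sketch below is an assumption-laden illustration: `HarnessPolicy` and its fields are hypothetical names, paths are treated as repository-relative POSIX paths, and commands are checked only by program name on the assumption that they run without a shell.

```python
import shlex
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class HarnessPolicy:
    readable: Tuple[str, ...]          # directories the agent may read
    writable: Tuple[str, ...]          # directories the agent may modify
    allowed_commands: Tuple[str, ...]  # program names the agent may run

    @staticmethod
    def _within(path: str, roots: Tuple[str, ...]) -> bool:
        p = PurePosixPath(path)
        # Reject absolute paths and any attempt to climb out with "..".
        if p.is_absolute() or ".." in p.parts:
            return False
        return any(
            p == PurePosixPath(root) or PurePosixPath(root) in p.parents
            for root in roots
        )

    def can_read(self, path: str) -> bool:
        return self._within(path, self.readable)

    def can_write(self, path: str) -> bool:
        return self._within(path, self.writable)

    def can_run(self, command: str) -> bool:
        # Simple allowlist on the program name; arguments are not inspected.
        argv = shlex.split(command)
        return bool(argv) and argv[0] in self.allowed_commands
```

For example, a policy with `writable=("src/auth",)` and `allowed_commands=("pytest",)` lets the agent edit the auth module and run its tests, while a write to `src/billing/charge.py` or an `rm` command is refused before it reaches the workspace.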

What the Codex harness suggests

OpenAI's [Codex harness and App Server article](https://openai.com/index/unlocking-the-codex-harness/) shows this shift from a product-architecture perspective. Codex runs on multiple surfaces such as the web app, CLI, IDE extension, and macOS app, but underneath they share the same harness and agent loop.

In that article, Codex core is described as both a library that holds the agent code and a runtime that manages a single Codex thread along with its persistence. App Server hosts core threads in a long-running process and exposes a JSON-RPC API over which client requests and server notifications flow in both directions.

From a developer's perspective, this can be read as follows.

The UI is not the core.
The core is the agent loop, thread state, workspace access, diff events, and approval requests.

VS Code, the CLI, the web app, and the desktop app are surfaces. What really matters is the runtime layer underneath, which holds the goal, stores task state, controls tool execution, and records validation results.
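To make the request versus notification distinction concrete, here is a small sketch of the two JSON-RPC 2.0 message shapes. The envelope fields (`jsonrpc`, `id`, `method`, `params`) come from the JSON-RPC 2.0 specification. The method names used in the usage note are hypothetical placeholders and are not the Codex App Server's actual API.

```python
import json
from typing import Any, Dict


def make_request(request_id: int, method: str, params: Dict[str, Any]) -> str:
    """A JSON-RPC 2.0 request carries an id, so the peer must answer it."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


def make_notification(method: str, params: Dict[str, Any]) -> str:
    """A notification has no id, so no response is expected."""
    return json.dumps({"jsonrpc": "2.0", "method": method, "params": params})


def expects_response(raw: str) -> bool:
    """Only messages that include an id are answered."""
    return "id" in json.loads(raw)
```

A client might send `make_request(1, "example/startTask", {...})` and wait for the reply, while the server pushes `make_notification("example/diffUpdated", {...})` as work progresses. Both method names are invented for this example.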

The questions change

When evaluating AI coding tools, the questions have to change as well.

Old question	Question to ask now
Does it write good code?	Can it hold a goal over a long run?
Does it respond quickly?	Can it recover task state?
Does it write tests?	Does it interpret test failures and retry?
Does it edit files?	Does it leave the diff and validation results as artifacts?
Does it remember?	What does it promote to memory, and what does it discard?

Model performance matters. But the model alone is not enough. In long-running development work, pursuing the wrong goal for a long time is a risk, and so is declaring “done” without verification.

That is why runtime design is needed.

What to prepare in practice

The conclusion of this series is not that you should build a huge multi-agent system from the start. Quite the opposite. You should start small.

The starting point, however, should be a work contract, not a collection of prompts.

Document	Role
`goal.md`	Goal, completion conditions, stopping conditions
`run-ledger.md`	Execution record for each unit of work
`artifact-contract.md`	Format of the results to leave at the end
`memory-policy.md`	Criteria for what to remember and what to discard
`permission-policy.md`	Actions that require approval and actions that are blocked

These five documents alone will make a difference in the quality of your coding agent. Prompts communicate requests. But work contracts create boundaries within which agents can safely work for long periods of time.
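One lightweight way to enforce this is a pre-flight check that refuses to start an agent run while any contract document is missing or empty. The sketch below assumes the five files sit together in one directory. That layout and the function name are illustrative choices, not something the documents themselves require.

```python
from pathlib import Path
from typing import List

# The five contract documents named in the table above.
CONTRACT_DOCS = (
    "goal.md",
    "run-ledger.md",
    "artifact-contract.md",
    "memory-policy.md",
    "permission-policy.md",
)


def missing_contract_docs(root: Path) -> List[str]:
    """Return the contract documents that are absent or empty under root."""
    missing = []
    for name in CONTRACT_DOCS:
        doc = root / name
        # A file containing only whitespace counts as missing.
        if not doc.is_file() or not doc.read_text(encoding="utf-8").strip():
            missing.append(name)
    return missing
```

A harness could call `missing_contract_docs(Path("agent"))` before each run and stop with a clear message when the list is not empty.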

Adoption checklist

When reviewing how your team currently runs coding agents, start with the items below.

[ ] Is the task goal stated in one sentence?
[ ] Can the completion conditions be confirmed by tests or artifacts?
[ ] Are there stopping conditions for risky work?
[ ] Are failed attempts kept as records rather than left in the conversation?
[ ] Does the agent leave a reviewable artifact rather than just a final response?
[ ] Do you separate reusable knowledge from one-off logs?

The takeaway from this part

When adopting a coding agent, the first question should not be “Which model should we use?” but “What runtime contract will this agent work within?” At a minimum, the goal, allowed scope, stopping conditions, validation commands, and artifact format should be defined separately before the work starts.

Next part

Part 2 covers goal-based development, starting from Codex /goal. The key is not how to use the command but the Goal Contract.

Continue reading the series

Part 1: Why Coding Agents Become Runtimes
Part 2: Goal-based development through Codex /goal
Part 3: Multi-agent development workflows through A2A and MCP
Part 4: AI Memory is not RAG
Part 5: An AI coding agent document set for your development harness

