AWTL: Turning Failure Logs into Hints for the Next Run

Workflow · 2026-05-08 · 5 min read


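A good harness does not stop at collecting plenty of failure logs. It has to turn each failure into a hint that the next run can use.

Part 1 applied Ctx2Skill as operational memory for a development harness, and Part 2 argued that MemoryGraph should accept only verified compact rules. This article covers the layer in between: AWTL, or Agent Work Trace Logging.

The purpose of AWTL is not log collection. The purpose is to attribute each failure to a specific action and, before the next attempt, inject only as many recurrence-prevention hints as are needed.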

Key takeaways

AWTL records agent work in action/span/event units.
It links each failed judge event to its source action, artifacts, and memory reads.
Attribution results are compressed into failed turn cases.
Before the next attempt, only the relevant cases are injected, as a small Failure Prevention Brief.
The replay scorecard filters out stale, risky, and blocked cases.
MemoryGraph promotion is allowed only after passing replay or human approval.

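Overall flow

AWTL starts from the trace, but it does not end there.

Preparation

Trace event collection
Failed judge event detection
Failure attribution
Failed turn case

Repeated execution

Failure Prevention Brief
Replay scorecard

On failure, the flow returns to the execution stage.

Wrap-up

Verified-only memory promotion

The point here is not "attach every log to the next prompt." AWTL picks only the failure cases relevant to the current task and turns them into small, actionable hints.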

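The minimum unit of a trace event

Looking only at the final PR rarely reveals why something failed. The failure has usually already started at an intermediate action.

Where the failure starts	Problem when only the final result is visible	What AWTL must record
Wrong file selection	No way to tell why the wrong file was edited	search/read actions and the basis for the selection
Misjudged test command	Only the missing verification is visible	Executed command, exit code, stderr
Misread error log	The follow-up fix goes in the wrong direction	Link between the observation and the next action
Misapplied memory	No way to tell whether memory was the cause	Link between the memory_read event and the action
Verifier failure	No way to tell which artifact failed	judge_result and artifact_refs

A trace therefore needs at least the following layers.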

Task

Run

Turn / Span

Action

Observation

judge_result
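As a rough illustration of these layers, the sketch below tags each flat event with the identifiers of the layers above it and regroups the events by turn. The field names (`run_id`, `turn_id`, `action_id`, `seq`) and the `groupByTurn` helper are assumptions made for this example, not the harness's actual schema.

```javascript
// Hypothetical flat events: each one carries the ids of the layers above it.
const sampleEvents = [
  { seq: 2, run_id: "run-1", turn_id: "turn-1", action_id: "a-1", type: "observation", exitCode: 1 },
  { seq: 1, run_id: "run-1", turn_id: "turn-1", action_id: "a-1", type: "action", command: "npm test" },
  { seq: 3, run_id: "run-1", turn_id: "turn-1", action_id: "a-1", type: "judge_result", result: "failed" },
  { seq: 4, run_id: "run-1", turn_id: "turn-2", action_id: "a-2", type: "action", command: "npm run lint" },
];

// Group events by turn, ordered by sequence number within each turn.
function groupByTurn(flatEvents) {
  const turns = new Map();
  const ordered = [...flatEvents].sort((a, b) => a.seq - b.seq);
  for (const event of ordered) {
    if (!turns.has(event.turn_id)) {
      turns.set(event.turn_id, []);
    }
    turns.get(event.turn_id).push(event);
  }
  return turns;
}

const byTurn = groupByTurn(sampleEvents);
console.log(byTurn.get("turn-1").map((event) => event.type));
// [ 'action', 'observation', 'judge_result' ]
```

Because every event names its turn and action, the order action, observation, judge_result can be recovered even when events arrive out of order.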

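Trace Sink: building a replayable event stream

The Trace Sink is the layer that receives what the agent emits during execution, one line at a time, and stores it as an execution log that can be read back later. What matters is not "store a lot" but "store it so that it can be re-analyzed later against the same criteria."

The trace in this article is essentially a JSONL file. Each line holds one event.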

    {"type":"action","spanId":"s-42","tool":"shell","command":"npm test"}
    {"type":"observation","spanId":"s-42","exitCode":1,"summary":"contract test failed"}
    {"type":"judge_result","spanId":"s-42","status":"failed","reason":"contract mismatch"}

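With a file like this, you can later recompute "which observation followed which action, and why the judge called it a failure." That is why the Trace Sink's first responsibilities are twofold.

Responsibility	Reason
Write only to the designated trace location	If other projects' logs or temporary files get mixed in, the failure analysis is contaminated.
Keep events in one canonical format	If field names and structure differ from run to run, replay and aggregation become impossible.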
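The second responsibility, keeping events in one canonical format, could look roughly like the sketch below. The alias table and the `toCanonicalEvent` helper are hypothetical; they only show the idea of mapping alternative field names onto a single spelling before an event is written.

```javascript
// Hypothetical alias table: alternative field names mapped to one canonical name.
const FIELD_ALIASES = new Map([
  ["span_id", "spanId"],
  ["exit_code", "exitCode"],
  ["event_type", "type"],
]);

// Return a copy of the event that uses canonical field names only.
function toCanonicalEvent(rawEvent) {
  const canonical = {};
  for (const [key, value] of Object.entries(rawEvent)) {
    canonical[FIELD_ALIASES.get(key) ?? key] = value;
  }
  // An event without a type cannot be replayed or aggregated, so reject it early.
  if (typeof canonical.type !== "string" || canonical.type === "") {
    throw new Error("event is missing a type");
  }
  return canonical;
}

console.log(toCanonicalEvent({ event_type: "observation", span_id: "s-42", exit_code: 1 }));
// { type: 'observation', spanId: 's-42', exitCode: 1 }
```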

The code below is the boundary that keeps traces from being written to arbitrary directories. `EXPECTED_TRACE_ROOT` is the storage location the harness allows; any other path fails immediately.

    // Reject any trace root other than the one the harness allows.
    function assertExpectedTraceRoot(traceRoot) {
      const resolvedRoot = normalizeAbsolutePath(traceRoot);
      const expectedRoot = normalizeAbsolutePath(EXPECTED_TRACE_ROOT);

      if (resolvedRoot !== expectedRoot) {
        throw new Error(`Invalid trace root: ${traceRoot}`);
      }

      return resolvedRoot;
    }

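The second problem is broken JSONL lines. If the process dies mid-run or a file write is cut off, a line may fail to parse as JSON. Discarding the whole trace at that point loses the valid events too. So only the broken lines are moved to a quarantine file, and analysis continues on the valid lines.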

    // Move the unparseable line to the quarantine file and keep processing the rest.
    quarantined.push(quarantineLine(
      quarantinePath,
      rawLine,
      reason,
      canonicalPath,
      index + 1,
    ));

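This design guarantees three things.

Guarantee	Meaning
Fixed trace root	Trace artifacts never leave the expected location.
Corrupt line isolation	One bad line does not ruin analysis of the whole run.
Materialized view regeneration	Derived outputs can be rebuilt from the canonical trace.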
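To make corrupt line isolation concrete, here is a minimal sketch of a reader that keeps valid events and sets broken lines aside. The `parseTraceLines` function and its return shape are assumptions for illustration; the real sink writes quarantined lines to a file through `quarantineLine`, which is not reproduced here.

```javascript
// Split raw JSONL text into parsed events and quarantined lines.
// The returned shape ({ events, quarantined }) is an assumption for this sketch.
function parseTraceLines(text) {
  const events = [];
  const quarantined = [];
  text.split("\n").forEach((rawLine, index) => {
    if (rawLine.trim() === "") {
      return; // skip blank lines such as a trailing newline
    }
    try {
      const parsed = JSON.parse(rawLine);
      if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
        throw new Error("event must be a JSON object");
      }
      events.push(parsed);
    } catch (error) {
      // Keep the raw text and 1-based line number so the line can be inspected later.
      quarantined.push({ lineNumber: index + 1, rawLine, reason: error.message });
    }
  });
  return { events, quarantined };
}

const sampleTrace = [
  '{"type":"action","spanId":"s-42","tool":"shell","command":"npm test"}',
  '{"type":"observation","spanId":"s-42","exitCo', // simulated truncated write
  '{"type":"judge_result","spanId":"s-42","status":"failed"}',
].join("\n");

const parsedTrace = parseTraceLines(sampleTrace);
console.log(parsedTrace.events.length, parsedTrace.quarantined.map((entry) => entry.lineNumber));
// 2 [ 2 ]
```

The truncated second line is set aside with its line number, while the action and the judge result on either side of it remain available for attribution.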

Harness Capture: linking actions to judges

A judge failure recorded as a bare `failed` is hard to use for preventing recurrence. The record has to link which action produced which artifact, and which verifier looked at that artifact and failed it.

    // Emit a judge_result event tied to the action and artifacts it judged.
    async function recordJudgeResult(details = {}) {
      return emit("judge_result", {
        judge_name: toText(details.judgeName, "phase-verifier"),
        result: toText(details.result, "warn"),
        lifecycle_event: "judge_result",
        source_action_id: actionId,
        artifact_refs: normalizeArtifactRefs(details.artifactRefs ?? [], repoRoot),
        detail: toText(details.detail, ""),
      });
    }

`source_action_id` and `artifact_refs` are the key fields. Without them, a failure is hard to reproduce or prevent in the next run.
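The sketch below shows why those two fields matter: with them, a failed judge event can be resolved back to the action that produced the judged artifact. The event shapes (including the `id` field) and the `linkFailedJudges` helper are hypothetical and exist only for this example.

```javascript
// Hypothetical events in the shape a judge_result capture might emit.
const capturedEvents = [
  { id: "a-7", type: "action", command: "npm run build" },
  {
    id: "judge-17", type: "judge_result", result: "failed",
    source_action_id: "a-7", artifact_refs: ["artifact:build-output"],
  },
  {
    id: "judge-18", type: "judge_result", result: "passed",
    source_action_id: "a-7", artifact_refs: ["artifact:build-output"],
  },
];

// For each failed judge event, look up the action that produced the judged artifact.
function linkFailedJudges(events) {
  const actionsById = new Map(
    events.filter((event) => event.type === "action").map((event) => [event.id, event]),
  );
  return events
    .filter((event) => event.type === "judge_result" && event.result === "failed")
    .map((judge) => ({
      failureEventId: judge.id,
      sourceAction: actionsById.get(judge.source_action_id) ?? null,
      artifactRefs: judge.artifact_refs ?? [],
    }));
}

console.log(linkFailedJudges(capturedEvents));
```

A `sourceAction` of `null` in the result would signal exactly the gap described above: a failure that cannot be traced to any action.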

Failure Attribution: separating cause from promotability

Attribution links a failed judge event to its source action, artifacts, and memory reads. At the same time, it has to decide separately whether the failure is eligible for memory promotion.

    return {
      failureEvent,
      failureTurnId,
      failedArtifactRefs,
      sourceActionIds,
      verifierActionId,
      touchedActionIds,
      memoryReadNodeIds,
      evidenceRefs,
      rootCauseSummary,
      verificationProbeCandidate,
      classification,
      failureTypeInfo,
      attributionHeuristics,
    };

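What matters most is not the one-sentence root cause. It is the `classification`.

classification	Default decision
`agent_failure`	Candidate for a failed turn case and a memory candidate
`verification_failure`	Candidate for a replay probe
`environment_blocker`	Memory promotion blocked
`flaky_blocker`	Blocked until reproducibility is confirmed
`harness_blocker`	Routed to the harness fix backlog

If environment failures are not distinguished from agent failures, long-term memory goes wrong. For example, if e2e failed because the browser binary was missing, no memory saying "skip e2e in this repository" should ever be created.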
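One way to encode the table above is a lookup from classification to default handling. This is a sketch under assumptions: the `route` and `promotionBlocked` names are invented for illustration, and blocking unknown classifications by default is a choice made here, not something the article specifies.

```javascript
// Default handling per classification, mirroring the table above.
// The field names (route, promotionBlocked) are assumptions for this sketch.
const DEFAULT_POLICY = new Map([
  ["agent_failure", { route: "failed_turn_case", promotionBlocked: false }],
  ["verification_failure", { route: "replay_probe", promotionBlocked: false }],
  ["environment_blocker", { route: "none", promotionBlocked: true }],
  ["flaky_blocker", { route: "reproducibility_check", promotionBlocked: true }],
  ["harness_blocker", { route: "harness_backlog", promotionBlocked: true }],
]);

// Unknown classifications are blocked and sent to manual review (an assumption).
function defaultDecision(classification) {
  return DEFAULT_POLICY.get(classification)
    ?? { route: "manual_review", promotionBlocked: true };
}

console.log(defaultDecision("environment_blocker").promotionBlocked); // true
console.log(defaultDecision("agent_failure").route); // failed_turn_case
```

With a table like this, the missing-browser e2e failure is classified as `environment_blocker` and never reaches the memory candidate path.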

Failed Turn Case: a compact case for the next run

Attribution results should be compressed into a compact case, not kept as the entire raw trace.

    {
      "schema_version": 1,
      "case_id": "case-demo-a",
      "turn_id": "turn-3-1",
      "failure_turn_id": "turn-3-1",
      "failure_event_id": "judge-17",
      "artifact_refs": ["artifact:build-output"],
      "memory_read_node_ids": ["memory:contract-rule"],
      "prevention_hint": "Before closeout, rerun the same verifier against the changed artifact.",
      "applicability": ["contract_change", "public_api"],
      "evidence_refs": ["judge:contract-verifier"]
    }

Validation has to be strict as well.

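    if (caseValue.turn_id !== caseValue.failure_turn_id) {
      errors.push("turn_id and failure_turn_id must match");
    }

    // The message promises strings, so check the element type as well as the length.
    if (
      !Array.isArray(caseValue.artifact_refs)
      || caseValue.artifact_refs.length === 0
      || !caseValue.artifact_refs.every((ref) => typeof ref === "string")
    ) {
      errors.push("artifact_refs must be a non-empty array of strings");
    }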

A case influences the next attempt. A case without `artifact_refs`, `evidence_refs`, and `applicability` is therefore hard to use as recurrence-prevention data.
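Extending the same check to all three fields might look like the sketch below. `validateFailedTurnCase` and the rule that `prevention_hint` must be non-empty are assumptions for illustration, not the harness's actual validator.

```javascript
// Fields that must be non-empty arrays of strings (assumption based on the text above).
const REQUIRED_REF_FIELDS = ["artifact_refs", "evidence_refs", "applicability"];

// Return a list of validation errors; an empty list means the case is usable.
function validateFailedTurnCase(caseValue) {
  const errors = [];
  if (caseValue.turn_id !== caseValue.failure_turn_id) {
    errors.push("turn_id and failure_turn_id must match");
  }
  for (const field of REQUIRED_REF_FIELDS) {
    const value = caseValue[field];
    const ok = Array.isArray(value)
      && value.length > 0
      && value.every((item) => typeof item === "string" && item !== "");
    if (!ok) {
      errors.push(`${field} must be a non-empty array of strings`);
    }
  }
  if (typeof caseValue.prevention_hint !== "string" || caseValue.prevention_hint.trim() === "") {
    errors.push("prevention_hint must be a non-empty string");
  }
  return errors;
}

const demoCase = {
  turn_id: "turn-3-1",
  failure_turn_id: "turn-3-1",
  artifact_refs: ["artifact:build-output"],
  evidence_refs: ["judge:contract-verifier"],
  applicability: ["contract_change", "public_api"],
  prevention_hint: "Before closeout, rerun the same verifier against the changed artifact.",
};

console.log(validateFailedTurnCase(demoCase)); // []
console.log(validateFailedTurnCase({ ...demoCase, evidence_refs: [] }));
```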

Failure Prevention Brief: injecting small

Do not attach every failed case before the next attempt. Include only the cases that match the current phase, and only a limited number of them.

    const selectedCases = selectFailurePreventionCases(loaded.cases, context, options);

    // Nothing relevant to inject: return a no-op with an empty section.
    if (selectedCases.length === 0) {
      return {
        status: "no-op",
        section: "",
      };
    }

The brief has to be short.

    Failure Prevention Brief
    - [high-confidence] Before closeout, rerun the verifier that failed, against the changed artifact.
    - [scope: public_api] Update the contract artifact and the verifier evidence together.

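A good brief meets three conditions.

Condition	Reason
Relevant to the current task	Cuts unnecessary memory noise.
Phrased as an actionable sentence	Changes the agent's next action.
Backed by evidence	Traceable to failure logs and verifier evidence.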
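A minimal sketch of selection and rendering under these conditions follows. It assumes that relevance is decided by overlap between a case's `applicability` tags and the current context tags; `selectCasesForContext`, `renderBrief`, the `"injected"` status, and the second demo case are all hypothetical.

```javascript
// Pick cases whose applicability overlaps the current context tags, up to a limit.
function selectCasesForContext(cases, contextTags, maxCases = 3) {
  const tags = new Set(contextTags);
  return cases
    .filter((item) => item.applicability.some((tag) => tags.has(tag)))
    .slice(0, maxCases);
}

// Render the selected cases as a short prompt section; empty input yields a no-op.
function renderBrief(selectedCases) {
  if (selectedCases.length === 0) {
    return { status: "no-op", section: "" };
  }
  const lines = selectedCases.map((item) => `- [case: ${item.case_id}] ${item.prevention_hint}`);
  return { status: "injected", section: ["Failure Prevention Brief", ...lines].join("\n") };
}

const demoCases = [
  {
    case_id: "case-demo-a",
    applicability: ["contract_change", "public_api"],
    prevention_hint: "Before closeout, rerun the same verifier against the changed artifact.",
  },
  {
    case_id: "case-demo-b", // invented second case, used only to show filtering
    applicability: ["internal_refactor"],
    prevention_hint: "Check import cycles after moving modules.",
  },
];

const brief = renderBrief(selectCasesForContext(demoCases, ["public_api"]));
console.log(brief.section);
```

Each rendered line keeps the `case_id`, so a hint in the prompt stays traceable to the case and evidence behind it.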

Replay Scorecard: managing the validity of failure memories

A failed case that was valid once can become stale over time. The verifier may change, the code structure may change, or the failure may no longer reproduce.

    {
      "schema_version": 1,
      "record_id": "replay-demo-a",
      "status": "passed",
      "decision": "allow_brief_and_promotion",
      "candidate_id": "memcand-demo-a",
      "case_id": "case-demo-a",
      "validated_by": "replay",
      "last_validated_at": "example-timestamp",
      "memory_graph_status": "candidate",
      "replay_status": "passed",
      "risk_level": "low",
      "applies_to": ["public_api"],
      "does_not_apply_to": ["internal_refactor"],
      "evidence_refs": ["judge:contract-verifier"]
    }

Before a brief is injected, the scorecard is checked for exclusion conditions.

    // Exclude a record that is stale or risky, or whose status is blocked, skipped, unavailable, or denied.
    export function isReplayScorecardExcluded(record = {}) {
      const status = normalizeStatus(record.status ?? record.result ?? record.outcome);

      return isReplayScorecardStaleOrRisky(record)
        || ["blocked", "skipped", "unavailable", "denied"].includes(status);
    }

Without this filter, old failure memories would keep lingering in the prompt.
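The article does not show `isReplayScorecardStaleOrRisky`, so the following is only a guess at what such a check could involve. It assumes `last_validated_at` holds an ISO 8601 timestamp, that a record older than a configurable number of days counts as stale, and that `risk_level` of `"high"` counts as risky. The name `looksStaleOrRisky` and the 30-day default are invented for this sketch.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical staleness check: unreadable or old timestamps, or high risk, exclude the record.
function looksStaleOrRisky(record, now = new Date(), maxAgeDays = 30) {
  const validatedAt = Date.parse(record.last_validated_at ?? "");
  if (Number.isNaN(validatedAt)) {
    return true; // never validated, or the timestamp could not be read
  }
  const ageDays = (now.getTime() - validatedAt) / DAY_MS;
  return ageDays > maxAgeDays || record.risk_level === "high";
}

const referenceNow = new Date("2026-05-08T00:00:00Z");
console.log(looksStaleOrRisky({ last_validated_at: "2026-05-01T00:00:00Z", risk_level: "low" }, referenceNow)); // false
console.log(looksStaleOrRisky({ last_validated_at: "2026-01-01T00:00:00Z", risk_level: "low" }, referenceNow)); // true
```

Treating a missing timestamp as stale errs on the side of leaving a hint out of the prompt rather than injecting one that was never validated.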

Memory promotion is the last step

AWTL output must not flow straight into MemoryGraph. Promotion is the last step.

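    // Promotion needs a passed replay or an explicit human approval.
    if (!approval.approved && !replayOk) {
      reasons.push("replay or human approval is required before promotion");
    }

    // Candidates that were only imported or only seen in a trace are blocked.
    if (isImportedOnlyCandidate(candidate)) {
      reasons.push("imported-only or trace-only candidate is blocked");
    }

    // Write only with an explicit flag, and only in verified-only mode.
    const shouldWrite = options.writeMemoryGraph === true
      && toText(options.autoPromote, "verified-only") === "verified-only";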

These checks add up to a clear policy.

No promotion without replay or human approval.
Imported-only candidates are blocked.
Nothing is written to MemoryGraph without an explicit write flag.
Automatic promotion is allowed only in verified-only mode.
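Put together, the policy can be read as one decision function. The sketch below is an assumption about how the blocking reasons and the write flag combine, since the excerpt above does not show that step; `decidePromotion` and its input shape are hypothetical.

```javascript
// Combine the gates into one decision. How reasons and the write flag are
// combined here is an assumption, not the harness's confirmed behavior.
function decidePromotion({ approved, replayOk, importedOnly }, options = {}) {
  const reasons = [];
  if (!approved && !replayOk) {
    reasons.push("replay or human approval is required before promotion");
  }
  if (importedOnly) {
    reasons.push("imported-only or trace-only candidate is blocked");
  }
  const writeEnabled = options.writeMemoryGraph === true
    && (options.autoPromote ?? "verified-only") === "verified-only";
  return { promote: reasons.length === 0 && writeEnabled, reasons };
}

// Replay passed and the write flag is explicit: promotion goes through.
console.log(decidePromotion(
  { approved: false, replayOk: true, importedOnly: false },
  { writeMemoryGraph: true },
).promote); // true

// Same candidate without the explicit write flag: nothing is written.
console.log(decidePromotion({ approved: false, replayOk: true, importedOnly: false }).promote); // false
```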

Practical checklist

Before attaching AWTL to a real harness, check the following items.

    AWTL application checklist

    [ ] The event schema includes action, observation, judge_result, and artifact_ref.
    [ ] Every trace event carries run_id, attempt_id, and turn/span/action identifiers.
    [ ] writer_seq or ingest_seq guarantees that events can be ordered.
    [ ] The trace root is pinned safely and path traversal is blocked.
    [ ] Corrupt JSONL lines are quarantined without ruining the whole trace.
    [ ] judge_result carries source_action_id and artifact_refs.
    [ ] memory_read events are included in attribution.
    [ ] Failure attribution assigns a failure class.
    [ ] Environment, flaky, and harness failures are blocked from promotion by default.
    [ ] A failed turn case holds compact metadata only.
    [ ] The prevention brief injects only a limited set of cases matching the current phase.
    [ ] The replay scorecard excludes stale, risky, and blocked cases.
    [ ] MemoryGraph promotion is limited to verified-only.

Closing

The core of AWTL is not keeping good logs. It is turning failures into hints for the next run.

failure log
failure attribution
failed turn case
prevention brief
replay scorecard
verified-only memory promotion

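Once this loop is in place, the harness is no longer a simple executor. It becomes an operational system that observes execution, interprets failures, and promotes only verified knowledge to long-term memory.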

