
api-integration-test Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Evaluation target: api-integration-test


api-integration-test is an integration-test skill for Go internal HTTP/gRPC APIs. It creates, maintains, and runs gated integration tests with real configuration, focusing on contract verification, failure triage, and safe, controlled execution. Its three main strengths are: strict scope determination up front, clearly distinguishing internal APIs, third-party APIs, and unit-test scenarios; built-in Production Safety Gate and config-completeness gates that default to rejecting unsafe or insufficient execution paths; and requirements for build-tag isolation and structured output reports, making tests both safe to integrate into CI and easy to diagnose and audit.

1. Evaluation Overview

This evaluation reviews the api-integration-test skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (internal API standard test, third-party API scope rejection, comprehensive mode upgrade). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 38 assertions.

| Dimension | With Skill | Without Skill | Delta |
| --- | --- | --- | --- |
| Assertion pass rate | 36/38 (94.7%) | 22/38 (57.9%) | +36.8 pp |
| Production Safety Gate | 3/3 correct | 0/3 | Skill-only |
| Build tag isolation | 3/3 correct | 0/3 | Skill-only |
| Output Contract structured report | 3/3 correct | 0/3 | Skill-only |
| Scope identification | ✅ Identifies third-party API | ❌ No scope awareness | Skill-only |
| Skill token cost (SKILL.md only) | ~2,100 tokens | 0 | |
| Skill token cost (all references) | ~6,000 tokens | 0 | |
| Typical load cost | ~3,500 tokens | 0 | |
| Token cost per 1% pass-rate gain | ~57 tokens (SKILL.md) / ~95 tokens (typical) | | |

2. Test Methodology

2.1 Scenario Design

| Scenario | Target | Core focus | Assertions |
| --- | --- | --- | --- |
| Eval 1: Internal webapp API (Standard mode) | internal/webapp HTTP endpoints | build tag, env gate, prod safety, protocol+business assertions, Output Contract | 15 |
| Eval 2: Third-party API scope rejection | internal/github REST client | scope validation gate, redirect to correct skill, whether test code is generated | 8 |
| Eval 3: Comprehensive mode upgrade | Existing integration tests → Comprehensive | concurrency safety, large payload, timeout policy, full Output Contract | 15 |

2.2 Execution

  • issue2md project source used as real code context for all scenarios
  • With-skill runs load SKILL.md and its referenced materials first
  • Without-skill runs load no skill; output is generated by model default behavior
  • All runs execute in independent subagents; output saved to /tmp/api-integ-eval/

2.3 issue2md Project Context

  • Go version: 1.25.8
  • Internal API: internal/webapp (5 endpoints: /, /convert, /openapi.json, /swagger, /swagger/index.html)
  • Third-party client: internal/github (REST + GraphQL calls to api.github.com)
  • Existing integration tests: tests/integration/http/web_api_integration_test.go (httptest.NewRecorder + fake fetcher)
  • Gate env var: ISSUE2MD_API_INTEGRATION=1

3. Assertion Pass Rate

3.1 Summary

| Scenario | Assertions | With Skill | Without Skill | Delta |
| --- | --- | --- | --- | --- |
| Eval 1: Internal webapp API | 15 | 15/15 (100%) | 11/15 (73.3%) | +26.7 pp |
| Eval 2: Third-party API scope | 8 | 6/8 (75.0%) | 2/8 (25.0%) | +50.0 pp |
| Eval 3: Comprehensive mode upgrade | 15 | 15/15 (100%) | 9/15 (60.0%) | +40.0 pp |
| Total | 38 | 36/38 (94.7%) | 22/38 (57.9%) | +36.8 pp |

3.2 Classification of 16 Without-Skill Failures

| Failure type | Count | Evals | Notes |
| --- | --- | --- | --- |
| Missing //go:build integration build tag | 3 | Eval 1/2/3 | Tests compile and skip in go test ./..., wasting build time |
| Missing Production Safety Gate | 3 | Eval 1/2/3 | If ENV=prod and gate var is mis-set, tests run in production |
| Missing Output Contract structured report | 3 | Eval 1/2/3 | No structured output for execution mode, degradation level, or var list |
| No scope identification | 3 | Eval 2 | Generated tests for third-party API with no scope check |
| Missing Execution Mode declaration | 2 | Eval 1/3 | No Standard/Comprehensive mode stated |
| Missing context.WithTimeout coverage | 1 | Eval 3 | Used http.Post directly, no context timeout |
| Missing Quality Scorecard | 1 | Eval 3 | No pre-authoring/code-quality checklist |

3.3 Analysis of 2 With-Skill Failures

| Assertion | Scenario | Analysis |
| --- | --- | --- |
| Scope gate did not hard-stop | Eval 2 | Agent identified GitHub API as third-party but continued generating test code, citing "evaluation task" |
| Did not fully prevent test code generation | Eval 2 | Report clearly included a Scope Note, but still produced a full _integration_test.go file |

Root cause: The skill’s Scope Validation Gate says "redirect to $thirdparty-api-integration-test, stop" but lacks a hard-stop mechanism (e.g. like fuzzing-test’s "If item 2 or 3 fails → stop, do not write tests"). The agent had good scope awareness but chose a "best effort" continuation.


4. Dimension-by-Dimension Comparison

4.1 Build Tag Isolation (//go:build integration)

Without the skill, the build tag was missing in all 3 scenarios, which gives this failure the widest impact.

| Scenario | With Skill | Without Skill |
| --- | --- | --- |
| Eval 1 | //go:build integration + // +build integration | ❌ No build tag |
| Eval 2 | //go:build integration + // +build integration | ❌ No build tag |
| Eval 3 | //go:build integration + // +build integration | ❌ No build tag |

Practical impact: Without build tags:

  • go test ./... compiles integration test files and their deps (even if they eventually t.Skip)
  • CI build time increases
  • If tests depend on special packages (e.g. service container), builds fail in environments without them
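The effect of the tag can be checked mechanically. The following is a minimal sketch, not part of the skill, that uses the standard library's go/build/constraint package to evaluate the two constraint lines from the table against a set of enabled tags. The helper name includedInBuild is hypothetical, and the sketch ignores the implicit tags (GOOS, GOARCH, and so on) that a real build also sets.

```go
package main

import "go/build/constraint"

// includedInBuild reports whether a file carrying the given constraint line
// is compiled when exactly the listed build tags are enabled.
// Hypothetical helper for illustration only.
func includedInBuild(line string, tags ...string) (bool, error) {
	// Parse accepts both "//go:build ..." and legacy "// +build ..." lines.
	expr, err := constraint.Parse(line)
	if err != nil {
		return false, err
	}
	enabled := make(map[string]bool, len(tags))
	for _, tag := range tags {
		enabled[tag] = true
	}
	// Eval asks the callback whether each tag named in the expression is set.
	return expr.Eval(func(tag string) bool { return enabled[tag] }), nil
}
```

Called with no tags, which corresponds to a plain go test ./..., the helper reports that the file is excluded. Called with the integration tag, which corresponds to go test -tags=integration ./..., it reports that the file is included.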

4.2 Production Safety Gate

| Scenario | With Skill | Without Skill |
| --- | --- | --- |
| Eval 1 | ENV=prod → t.Skip unless INTEGRATION_ALLOW_PROD=1 | ❌ None |
| Eval 2 | ENV=prod → t.Skip unless INTEGRATION_ALLOW_PROD=1 | ❌ None |
| Eval 3 | ENV=prod → t.Skip unless INTEGRATION_ALLOW_PROD=1 | ❌ None |

Practical impact: This is a safety-critical dimension. Without-skill tests would run when ISSUE2MD_API_INTEGRATION=1 and ENV=production, potentially hitting production. The skill’s dual gates (gate + prod safety) add defense in depth.
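The dual gate can be sketched as a small decision function. This is an illustration under stated assumptions rather than the code the skill generated: the function names are hypothetical, the environment lookup is injected so the logic can be checked without touching the real environment, and treating both "prod" and "production" as production is an assumption of this sketch. The variable names ISSUE2MD_API_INTEGRATION, ENV, and INTEGRATION_ALLOW_PROD come from the report.

```go
package main

import "strings"

// gateDecision reports whether integration tests may run. When they may not,
// reason is the message a test would pass to t.Skip.
// getenv is injected; a real test would pass os.Getenv.
func gateDecision(getenv func(string) string) (run bool, reason string) {
	// Gate 1: explicit opt-in via the gate variable.
	if getenv("ISSUE2MD_API_INTEGRATION") != "1" {
		return false, "set ISSUE2MD_API_INTEGRATION=1 to run integration tests"
	}
	// Gate 2: production safety. Accepting both spellings of the
	// environment name is an assumption of this sketch.
	env := strings.ToLower(strings.TrimSpace(getenv("ENV")))
	if (env == "prod" || env == "production") && getenv("INTEGRATION_ALLOW_PROD") != "1" {
		return false, "ENV=" + env + ": set INTEGRATION_ALLOW_PROD=1 to run against production"
	}
	return true, ""
}

// fromMap adapts a map to the getenv signature for examples and tests.
func fromMap(vars map[string]string) func(string) string {
	return func(key string) string { return vars[key] }
}
```

In a test function the result would typically be used as: if run, reason := gateDecision(os.Getenv); !run { t.Skip(reason) }. With only the first gate, the production case would fall through and the tests would run.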

4.3 Output Contract (Structured Report)

With-skill produced reports containing:

| Report field | Eval 1 | Eval 2 | Eval 3 |
| --- | --- | --- | --- |
| Execution Mode | Standard | Standard | Comprehensive |
| Integration Target | 5 endpoints | 5 REST methods | 5 endpoints |
| Degradation Level | Full | Full | Full |
| Gate Variables list | 3 vars | 8 vars | 3 vars |
| Exact Commands | | | |
| Timeout/Retry Policy | 15s/none | 15s/none | 30s/none |
| Result Summary | 15 pass | 8 skip | 38 pass |
| Failure Classification | N/A | N/A | N/A |
| Quality Scorecard | ✅ Complete | ✅ Complete | ✅ Complete |

Without-skill produced brief text summaries with no structured fields.

4.4 Scope Validation (Eval 2 Specific)

| Dimension | With Skill | Without Skill |
| --- | --- | --- |
| Identifies GitHub API as third-party | ✅ (explicit Scope Note) | ❌ |
| Mentions correct alternative skill | ✅ ($thirdparty-api-integration-test) | ❌ |
| Hard-stop, no test code generated | ❌ (continued for evaluation) | ❌ (generated directly) |
| Scope explanation | ✅ (detailed in report) | ❌ |

Analysis: With-skill scope awareness is far better than baseline (3/4 vs 0/4 scope-related assertions) but does not enforce a hard stop. The skill’s Gate 1 instruction "redirect to $thirdparty-api-integration-test, stop" is not strong enough in its current wording—the agent had enough "reason" to bypass it.

4.5 context.WithTimeout Coverage

| Scenario | With Skill | Without Skill |
| --- | --- | --- |
| Eval 1 | Every HTTP call has 15s context | 5s context via helper ✅ |
| Eval 2 | Every API call has 15s context | 30s http.Client.Timeout ⚠️ |
| Eval 3 | Every request has 30s context | http.Post used directly, no context ❌ |

Analysis: Without-skill used http.Post and http.Get in Eval 3; these package-level helpers take no context.Context, so the requests carry no per-call deadline or cancellation. The skill requires "Guard each external call with context.WithTimeout", ensuring consistent, testable timeout behavior.
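A context-bounded call can be sketched with http.NewRequestWithContext, which is the standard-library way to attach a deadline to a single request. The helper names below (getStatus, timedOut, newDelayedServer) are hypothetical and the local httptest server stands in for a real endpoint; this is not the code produced in the evaluation.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"time"
)

// getStatus issues a GET that is bounded by its own context deadline and
// returns the HTTP status code.
func getStatus(client *http.Client, url string, timeout time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// timedOut reports whether err was caused by the context deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// newDelayedServer returns a local test server that waits for delay before
// answering 200, or gives up early if the client goes away.
func newDelayedServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			w.WriteHeader(http.StatusOK)
		case <-r.Context().Done():
		}
	}))
}
```

Because the deadline lives on the request rather than on the client, a test can assert on it directly: a call to a server that answers after 500ms with a 50ms timeout fails with an error for which timedOut returns true.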


5. Skill Differentiators

5.1 Skill-Only Capabilities (Never Observed Without-Skill)

| Capability | Description | Occurrences |
| --- | --- | --- |
| Build tag isolation | //go:build integration at file top | 3/3 |
| Production Safety Gate | ENV=prod → t.Skip dual gate | 3/3 |
| Output Contract | Structured report with 9 required fields | 3/3 |
| Scope validation | Distinguishes internal vs third-party API, suggests redirect | 1/1 |
| Quality Scorecard | Pre-Authoring + Test Quality checklists | 3/3 |
| Execution Mode declaration | Smoke/Standard/Comprehensive auto-selection | 3/3 |
| Degradation Level | Full/Scaffold/Blocked degradation judgment | 3/3 |

5.2 Capabilities Where Both Performed Well

| Capability | With Skill | Without Skill |
| --- | --- | --- |
| Env var gate | 3/3 | 3/3 |
| Protocol-level assertions (HTTP status) | 3/3 | 3/3 |
| Business-level assertions (response content) | 3/3 | 3/3 |
| Success + failure path coverage | 3/3 | 3/3 |
| Actionable skip messages | 3/3 | 3/3 |
| File naming convention | 3/3 | 3/3 |
| Run command provided | 3/3 | 3/3 |

5.3 Skill Advantage by Prompt Density

| Prompt density | Eval | With-Skill advantage |
| --- | --- | --- |
| Standard prompt (medium info) | Eval 1 | +26.7 pp |
| Scope-boundary prompt (high info) | Eval 2 | +50.0 pp |
| Explicit Comprehensive prompt (high info) | Eval 3 | +40.0 pp |

The skill's advantage is largest in the scope-boundary scenario (+50.0 pp), because scope validation is skill-specific knowledge the model does not have by default.


6. Token Cost-Effectiveness

6.1 Skill File Token Estimates

| File | Lines | Est. Tokens |
| --- | --- | --- |
| SKILL.md | 336 | ~2,100 |
| references/common-integration-gate.md | 97 | ~600 |
| references/common-output-contract.md | 30 | ~200 |
| references/checklists.md | 98 | ~600 |
| references/internal-api-patterns.md | 415 | ~2,500 |
| Full load | 976 | ~6,000 |

6.2 Typical Load Scenarios

The skill’s Reference Loading Gate specifies on-demand loading:

| Scenario | Loaded files | Token cost |
| --- | --- | --- |
| HTTP standard test | SKILL.md + gate + output + checklists + api-patterns | ~6,000 |
| Scope rejection only | SKILL.md + gate + output | ~2,900 |
| Result reporting | SKILL.md + output | ~2,300 |
| Typical (standard HTTP test) | Full | ~6,000 |

Note: The current Reference Loading Gate design causes most HTTP test scenarios to load all references (because internal-api-patterns.md is always triggered for HTTP). Typical load ≈ full load.

6.3 Cost-Effectiveness

| Metric | Value |
| --- | --- |
| With-Skill pass rate | 94.7% |
| Without-Skill pass rate | 57.9% |
| Improvement | +36.8 pp |
| SKILL.md token cost | ~2,100 |
| Full-load token cost | ~6,000 |
| Token cost per 1% gain (SKILL.md) | ~57 tokens |
| Token cost per 1% gain (full) | ~163 tokens |
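The per-point figures follow from dividing the token cost by the pass-rate gain in percentage points, where the gain is (36 - 22) / 38 × 100 ≈ 36.8. A minimal sketch of that arithmetic, with a hypothetical function name:

```go
package main

import "math"

// costPerPoint returns the token cost divided by the pass-rate gain in
// percentage points, rounded to the nearest whole token.
func costPerPoint(tokens, passedWith, passedWithout, total int) int {
	// Gain in percentage points, e.g. (36-22)/38*100 ≈ 36.8.
	gain := float64(passedWith-passedWithout) / float64(total) * 100
	return int(math.Round(float64(tokens) / gain))
}
```

With the report's inputs, costPerPoint(2100, 36, 22, 38) gives 57 and costPerPoint(6000, 36, 22, 38) gives 163, matching the two rows above.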

6.4 Comparison with Other Skills

| Skill | Pass-rate gain | SKILL.md Tokens | Cost per 1% |
| --- | --- | --- | --- |
| go-makefile-writer | +31.0 pp | ~1,960 | ~63 |
| api-integration-test | +36.8 pp | ~2,100 | ~57 |
| fuzzing-test | +54.3 pp | ~2,250 | ~41 |
| go-ci-workflow | +33.0 pp | ~1,500 | ~45 |

api-integration-test's SKILL.md-only cost (57 tokens/1%) is in line with the other skills, which range from ~41 to ~63. The full-load cost is much higher (163 tokens/1%), so full-load cost-effectiveness is worse, mainly because internal-api-patterns.md (~2,500 tokens) is loaded in almost all scenarios.


7. Overall Score

7.1 Dimensions and Weights

| Dimension | Weight | Score | Weighted |
| --- | --- | --- | --- |
| Safety gates (build tag + prod safety) | 25% | 10.0 | 2.50 |
| Test code quality (protocol + business assertions) | 20% | 9.5 | 1.90 |
| Scope validation | 15% | 7.5 | 1.13 |
| Output Contract completeness | 15% | 10.0 | 1.50 |
| Token cost-effectiveness | 10% | 7.5 | 0.75 |
| Actual pass-rate gain | 10% | 9.5 | 0.95 |
| Execution Mode auto-selection | 5% | 9.0 | 0.45 |
| Weighted total | 100% | | 9.18 / 10 |
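The weighted total can be reproduced with integer arithmetic, which sidesteps floating-point rounding. In this sketch (type and function names are hypothetical) weights are stored in percent and scores in tenths of a point, so each product is in thousandths of a point. The exact sum is 9.175, which the report shows as 9.18; the Scope validation row's 1.13 is likewise 1.125 rounded.

```go
package main

// dimension pairs a weight in percent with a score in tenths of a point,
// so a score of 9.5 is stored as 95.
type dimension struct {
	name        string
	weightPct   int
	scoreTenths int
}

// reportDimensions returns the weights and scores from the table above.
func reportDimensions() []dimension {
	return []dimension{
		{"Safety gates", 25, 100},
		{"Test code quality", 20, 95},
		{"Scope validation", 15, 75},
		{"Output Contract completeness", 15, 100},
		{"Token cost-effectiveness", 10, 75},
		{"Actual pass-rate gain", 10, 95},
		{"Execution Mode auto-selection", 5, 90},
	}
}

// weightedThousandths returns the weighted total in thousandths of a point
// (percent × tenths), so 9175 means 9.175 out of 10.
func weightedThousandths(dims []dimension) int {
	total := 0
	for _, d := range dims {
		total += d.weightPct * d.scoreTenths
	}
	return total
}
```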

7.2 Dimension Notes

  • Safety gates (10.0): 3/3 scenarios correct on build tag + prod safety gate; this dimension is entirely absent without-skill
  • Test code quality (9.5): Protocol + business assertions complete, consistent context.WithTimeout coverage; -0.5 for Eval 1 httptest.NewServer pattern overlapping with Eval 3
  • Scope validation (7.5): Identified GitHub API as third-party with detailed explanation, but no hard-stop (-2.5)
  • Output Contract (10.0): All 9 required fields present in all 3 scenarios
  • Token cost-effectiveness (7.5): SKILL.md alone is cheap per point of gain (57 tokens/1%); a full load costs considerably more (163 tokens/1%) because internal-api-patterns.md is large
  • Pass-rate gain (9.5): +36.8 pp is a significant gain, especially in safety
  • Mode auto-selection (9.0): Correctly chose Standard and Comprehensive modes

8. Per-Eval Detailed Scores

Eval 1: Internal webapp API — Standard Mode

# Assertion With Skill Without Skill
1 //go:build integration build tag
2 Gate env var check (ISSUE2MD_API_INTEGRATION=1)
3 Production safety gate (ENV=prod → t.Skip)
4 context.WithTimeout on HTTP calls
5 Protocol-level assertion (HTTP status codes)
6 Business-level assertion (response content/fields)
7 Success path test case
8 Expected-failure path test case
9 Actionable skip messages
10 File naming *_integration_test.go
11 No hardcoded secrets/endpoints
12 Execution mode stated (Standard)
13 Degradation level stated (Full)
14 Output Contract with mandatory fields
15 Exact run command provided
Total 15/15 11/15

Eval 2: Third-Party API Scope Rejection

# Assertion With Skill Without Skill
1 Identifies GitHub API as third-party
2 Mentions $thirdparty-api-integration-test
3 Hard-stop, no test code generated
4 Provides clear scope explanation
5 Build tag //go:build integration (if tests generated)
6 Production safety gate (if tests generated)
7 context.WithTimeout on calls (if tests generated)
8 Actionable skip messages (if tests generated)
Total 6/8 2/8

Eval 3: Comprehensive Mode Upgrade

# Assertion With Skill Without Skill
1 //go:build integration build tag
2 Gate env var check
3 Production safety gate
4 Concurrent request safety test
5 Large payload test (1MB MaxBytesReader)
6 All error status codes (400, 401, 403, 404, 429, 502)
7 context.WithTimeout on all HTTP calls
8 Response header assertions
9 Protocol-level assertions
10 Business-level assertions
11 Execution mode stated (Comprehensive)
12 Timeout documented (30s)
13 Output Contract present
14 Run command with comprehensive timeout
15 Quality Scorecard
Total 15/15 9/15
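As an illustration of assertion 5, a large-payload check against a 1MB http.MaxBytesReader limit might look like the sketch below. The handler is a stand-in and not the issue2md handler, all names are hypothetical, and responding with 413 on overflow is an assumption of this sketch (the report does not say which status the real endpoint returns). The 30s deadline mirrors the Comprehensive-mode timeout from the report.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

const maxBody = 1 << 20 // 1MB limit, as named in the Eval 3 assertion

// limitedHandler is a stand-in endpoint that rejects bodies over maxBody.
func limitedHandler(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxBody)
	if _, err := io.ReadAll(r.Body); err != nil {
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			// Assumed behavior: answer 413 when the limit is exceeded.
			http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// newLimitedServer starts a local test server around limitedHandler.
func newLimitedServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(limitedHandler))
}

// postStatus posts size zero bytes under a 30s context deadline and returns
// the HTTP status code.
func postStatus(client *http.Client, url string, size int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	body := bytes.NewReader(make([]byte, size))
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, body)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}
```

Testing both sides of the boundary (exactly maxBody bytes and maxBody+1 bytes) pins the limit down more precisely than a single oversized request would.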

9. Summary

The api-integration-test skill achieves a +36.8 pp pass-rate gain at ~2,100 tokens (SKILL.md), with full differentiation in safety (build-tag isolation + production safety gate) and structured output (Output Contract + Quality Scorecard). Scope validation is a unique strength: the baseline shows no awareness of third-party APIs, while the skill identifies them and suggests a redirect.

Main improvement areas: strengthen the Scope Validation Gate’s hard-stop semantics (currently advisory, not mandatory) and trim internal-api-patterns.md to improve full-load cost-effectiveness.