api-integration-test is an integration-test skill for Go internal HTTP/gRPC APIs. It creates, maintains, and runs gated integration tests with real configuration, focusing on contract verification, failure triage, and safe, controlled execution. Its three main strengths are: strict scope determination up front, clearly distinguishing internal APIs, third-party APIs, and unit-test scenarios; built-in Production Safety Gate and config-completeness gates that default to rejecting unsafe or insufficient execution paths; and requirements for build-tag isolation and structured output reports, making tests both safe to integrate into CI and easy to diagnose and audit.
This evaluation reviews the api-integration-test skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (internal API standard test, third-party API scope rejection, comprehensive mode upgrade). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 38 assertions.
Eval 2: The agent identified the GitHub API as third-party but continued generating test code, citing the "evaluation task". Test code generation was therefore not fully prevented: the report clearly included a Scope Note, yet a full _integration_test.go file was still produced.
Root cause: The skill’s Scope Validation Gate says "redirect to $thirdparty-api-integration-test, stop" but lacks a hard-stop mechanism such as fuzzing-test’s "If item 2 or 3 fails → stop, do not write tests". The agent showed good scope awareness but chose a "best effort" continuation.
Build-tag isolation failed in all 3 without-skill scenarios and has the widest impact.
| Scenario | With Skill | Without Skill |
|----------|------------|---------------|
| Eval 1 | //go:build integration + // +build integration | ❌ No build tag |
| Eval 2 | //go:build integration + // +build integration | ❌ No build tag |
| Eval 3 | //go:build integration + // +build integration | ❌ No build tag |
Practical impact: Without build tags:
- go test ./... compiles integration test files and their dependencies (even if the tests eventually call t.Skip)
- CI build time increases
- If tests depend on special packages (e.g. a service container), builds fail in environments without them
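For illustration, a file header of the kind the with-skill runs produced might look like the fragment below. This is a minimal sketch: the package name and the test name are placeholders, not taken from the evaluated output.

```go
//go:build integration
// +build integration

// Package name is a placeholder for this sketch.
package main

import "testing"

// TestIntegrationSmoke is a hypothetical test. Because of the build
// constraint above, this file is compiled only when the "integration"
// tag is supplied.
func TestIntegrationSmoke(t *testing.T) {
	t.Log("integration build tag is active")
}
```

With this header, a plain `go test ./...` leaves the file out of the build entirely, while `go test -tags=integration ./...` compiles and runs it.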
Practical impact: This is a safety-critical dimension. Without-skill tests would run when ISSUE2MD_API_INTEGRATION=1 and ENV=production, potentially hitting production. The skill’s dual gates (run gate + production safety gate) add defense in depth.
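The two runtime gates can be sketched roughly as follows. The environment variable names come from the evaluation above; the helper `gateDecision`, the test name, the exact match on "production", and the choice to skip rather than fail are assumptions made for this example, not the skill's actual implementation. The build-tag header is left out to keep the focus on the runtime checks.

```go
// Package name is a placeholder for this sketch.
package main

import (
	"os"
	"testing"
)

// gateDecision is a hypothetical helper that reports whether an integration
// test may run. getenv is injected so the decision can be checked without
// touching the real process environment.
func gateDecision(getenv func(string) string) (ok bool, reason string) {
	// Gate 1: tests are opt-in and stay off unless explicitly enabled.
	if getenv("ISSUE2MD_API_INTEGRATION") != "1" {
		return false, "set ISSUE2MD_API_INTEGRATION=1 to run integration tests"
	}
	// Gate 2: never run against production, even when opted in.
	if getenv("ENV") == "production" {
		return false, "refusing to run integration tests with ENV=production"
	}
	return true, ""
}

// TestCreateIssue is a placeholder test showing where the gate is applied.
func TestCreateIssue(t *testing.T) {
	if ok, reason := gateDecision(os.Getenv); !ok {
		t.Skip(reason)
	}
	// Real API calls would go here.
}
```

A baseline test that checks only the first variable would proceed with ENV=production; the second check is what closes that path. Whether a production hit should skip or fail the test is a design choice, and `t.Fatal` is a reasonable alternative when a misconfigured CI job should be loud.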
Analysis: With-skill scope awareness is far better than the baseline (3/4 vs 0/4 scope-related assertions) but does not enforce a hard stop. The current wording of the skill’s Gate 1 instruction, "redirect to $thirdparty-api-integration-test, stop", is not strong enough: the agent found enough "reason" to bypass it.
Analysis: Without-skill used http.Post and http.Get in Eval 3; these helpers do not accept context. The skill’s mode requires "Guard each external call with context.WithTimeout", ensuring consistent, testable timeout behavior.
The skill’s advantage is largest in scope-boundary scenarios (+50%), because scope validation is skill-specific knowledge the model does not have by default.
Note: The current Reference Loading Gate design causes most HTTP test scenarios to load all references (because internal-api-patterns.md is always triggered for HTTP). Typical load ≈ full load.
api-integration-test’s SKILL.md-only cost-effectiveness (57 tokens/1%) is in a reasonable range. Full-load cost-effectiveness (163 tokens/1%) is higher, mainly because internal-api-patterns.md (2,500 tokens) is loaded in almost all scenarios.
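The tokens/1% figure appears to be the tokens loaded divided by the pass-rate gain in percentage points. Assuming that definition, and using the figures reported elsewhere in this evaluation (about 2,100 tokens for SKILL.md and a +36.8 pp gain), the SKILL.md-only value can be checked:

```latex
\[
\text{cost-effectiveness}
  = \frac{\text{tokens loaded}}{\text{pass-rate gain (pp)}}
  = \frac{2100}{36.8}
  \approx 57 \ \text{tokens per } 1\%
\]
```

Under the same assumption, the full-load figure of 163 tokens/1% corresponds to roughly 163 × 36.8 ≈ 6,000 tokens loaded. That total is back-calculated here, not a number reported in this section.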
Safety gates (10.0): 3/3 scenarios correct on build tag + prod safety gate; this dimension is entirely absent without-skill
Test code quality (9.5): Protocol + business assertions complete, consistent context.WithTimeout coverage; -0.5 for Eval 1 httptest.NewServer pattern overlapping with Eval 3
Scope validation (7.5): Identified GitHub API as third-party with detailed explanation, but no hard-stop (-2.5)
Output Contract (10.0): All 9 required fields present in all 3 scenarios
Token cost-effectiveness (7.5): SKILL.md cost-effectiveness is strong (57 tokens/1%); full load is higher (163 tokens/1%) due to large internal-api-patterns.md
Pass-rate gain (9.5): +36.8 pp is a significant gain, especially in safety
Mode auto-selection (9.0): Correctly chose Standard and Comprehensive modes
The api-integration-test skill achieves a +36.8 pp pass-rate gain at ~2,100 tokens (SKILL.md), with full differentiation in safety (build-tag isolation + production safety gate) and structured output (Output Contract + Quality Scorecard). Scope validation is a unique strength: the baseline shows no awareness of third-party API boundaries, while the skill identifies them and suggests redirects.
Main improvement areas: strengthen the Scope Validation Gate’s hard-stop semantics (currently advisory, not mandatory) and trim internal-api-patterns.md to improve full-load cost-effectiveness.