api-integration-test Skill Evaluation Report

Evaluation framework: skill-creator
Evaluation date: 2026-03-11
Evaluation target: api-integration-test

api-integration-test is an integration-test skill for Go internal HTTP/gRPC APIs. It creates, maintains, and runs gated integration tests with real configuration, focusing on contract verification, failure triage, and safe, controlled execution. Its three main strengths are: strict scope determination up front, clearly distinguishing internal APIs, third-party APIs, and unit-test scenarios; built-in Production Safety Gate and config-completeness gates that default to rejecting unsafe or insufficient execution paths; and requirements for build-tag isolation and structured output reports, making tests both safe to integrate into CI and easy to diagnose and audit.

1. Evaluation Overview

This evaluation reviews the api-integration-test skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (internal API standard test, third-party API scope rejection, comprehensive mode upgrade). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 38 assertions.

Dimension	With Skill	Without Skill	Delta
Assertion pass rate	36/38 (94.7%)	22/38 (57.9%)	+36.8 pp
Production Safety Gate	3/3 correct	0/3	Skill-only
Build tag isolation	3/3 correct	0/3	Skill-only
Output Contract structured report	3/3 correct	0/3	Skill-only
Scope identification	✅ Identifies third-party API	❌ No scope awareness	Skill-only
Skill Token cost (SKILL.md only)	~2,100 tokens	0	—
Skill Token cost (all references)	~6,000 tokens	0	—
Typical load cost	~6,000 tokens	0	—
Token cost per 1% pass-rate gain	~57 tokens (SKILL.md) / ~163 tokens (typical)	—	—

2. Test Methodology

2.1 Scenario Design

Scenario	Goal	Core focus	Assertions
Eval 1: Internal webapp API (Standard mode)	`internal/webapp` HTTP endpoints	build tag, env gate, prod safety, protocol+business assertions, Output Contract	15
Eval 2: Third-party API scope rejection	`internal/github` REST client	scope validation gate, redirect to correct skill, whether test code is generated	8
Eval 3: Comprehensive mode upgrade	Existing integration tests → Comprehensive	concurrency safety, large payload, timeout policy, full Output Contract	15

2.2 Execution

issue2md project source used as real code context for all scenarios
With-skill runs load SKILL.md and its referenced materials first
Without-skill runs load no skill; output is generated by model default behavior
All runs execute in independent subagents; output saved to /tmp/api-integ-eval/

2.3 issue2md Project Context

Go version: 1.25.8
Internal API: internal/webapp (5 endpoints: /, /convert, /openapi.json, /swagger, /swagger/index.html)
Third-party client: internal/github (REST + GraphQL calls to api.github.com)
Existing integration tests: tests/integration/http/web_api_integration_test.go (httptest.NewRecorder + fake fetcher)
Gate env var: ISSUE2MD_API_INTEGRATION=1

3. Assertion Pass Rate

3.1 Summary

Scenario	Assertions	With Skill	Without Skill	Delta
Eval 1: Internal webapp API	15	15/15 (100%)	11/15 (73.3%)	+26.7 pp
Eval 2: Third-party API scope	8	6/8 (75%)	2/8 (25.0%)	+50.0 pp
Eval 3: Comprehensive mode upgrade	15	15/15 (100%)	9/15 (60.0%)	+40.0 pp
Total	38	36/38 (94.7%)	22/38 (57.9%)	+36.8 pp
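
The pass rates and deltas in this table follow directly from the raw counts. The sketch below recomputes them; the type and function names are illustrative only and not part of the evaluation tooling. The delta is a difference between two percentages, so it is expressed in percentage points (pp).

```go
package main

import "fmt"

// scenario holds passed-assertion counts copied from the summary table.
type scenario struct {
	name         string
	assertions   int
	withSkill    int
	withoutSkill int
}

var summary = []scenario{
	{"Eval 1", 15, 15, 11},
	{"Eval 2", 8, 6, 2},
	{"Eval 3", 15, 15, 9},
}

// rate converts a passed/total pair into a percentage.
func rate(passed, total int) float64 {
	return 100 * float64(passed) / float64(total)
}

// totals sums the assertion and pass counts across scenarios.
func totals(rows []scenario) (assertions, withSkill, withoutSkill int) {
	for _, r := range rows {
		assertions += r.assertions
		withSkill += r.withSkill
		withoutSkill += r.withoutSkill
	}
	return assertions, withSkill, withoutSkill
}

func main() {
	for _, r := range summary {
		w, wo := rate(r.withSkill, r.assertions), rate(r.withoutSkill, r.assertions)
		fmt.Printf("%s: %.1f%% vs %.1f%% (%+.1f pp)\n", r.name, w, wo, w-wo)
	}
	n, w, wo := totals(summary)
	fmt.Printf("Total: %d/%d vs %d/%d (%+.1f pp)\n", w, n, wo, n, rate(w, n)-rate(wo, n))
}
```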

3.2 Classification of 16 Without-Skill Failures

Failure type	Count	Evals	Notes
Missing `//go:build integration` build tag	3	Eval 1/2/3	Tests compile and skip in `go test ./...`, wasting build time
Missing Production Safety Gate	3	Eval 1/2/3	If ENV=prod and gate var is mis-set, tests run in production
Missing Output Contract structured report	3	Eval 1/2/3	No structured output for execution mode, degradation level, or var list
No scope identification	3	Eval 2	Generated tests for third-party API with no scope check
Missing Execution Mode declaration	2	Eval 1/3	No Standard/Comprehensive mode stated
Missing context.WithTimeout coverage	1	Eval 3	Used `http.Post` directly, no context timeout
Missing Quality Scorecard	1	Eval 3	No pre-authoring/code-quality checklist

3.3 Analysis of 2 With-Skill Failures

Assertion	Scenario	Analysis
Scope gate did not hard-stop	Eval 2	Agent identified GitHub API as third-party but continued generating test code citing "evaluation task"
Did not fully prevent test code generation	Eval 2	Report clearly noted Scope Note, but still produced full `_integration_test.go` file

Root cause: The skill’s Scope Validation Gate says "redirect to $thirdparty-api-integration-test, stop" but lacks a hard-stop mechanism such as fuzzing-test’s "If item 2 or 3 fails → stop, do not write tests". The agent showed good scope awareness but chose a "best effort" continuation.

4. Dimension-by-Dimension Comparison

4.1 Build Tag Isolation (`//go:build integration`)

Without the skill, this dimension failed in all 3 scenarios; it has the widest impact of any gap.

Scenario	With Skill	Without Skill
Eval 1	`//go:build integration` + `// +build integration`	❌ No build tag
Eval 2	`//go:build integration` + `// +build integration`	❌ No build tag
Eval 3	`//go:build integration` + `// +build integration`	❌ No build tag

Practical impact: Without build tags:

- `go test ./...` compiles integration test files and their dependencies, even if the tests eventually call `t.Skip`
- CI build time increases
- If tests depend on special packages (e.g. a service container), builds fail in environments that lack them
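
To illustrate why a tagged file drops out of a plain `go test ./...`, the sketch below evaluates the constraint line with the standard `go/build/constraint` package. The helper name `compiled` is an assumption made for this example; the Go toolchain performs the equivalent check itself when it selects files to build.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// compiled reports whether a file carrying the given constraint line would be
// selected for compilation when the listed build tags are enabled.
func compiled(line string, tags ...string) (bool, error) {
	expr, err := constraint.Parse(line)
	if err != nil {
		return false, err
	}
	enabled := make(map[string]bool, len(tags))
	for _, t := range tags {
		enabled[t] = true
	}
	return expr.Eval(func(tag string) bool { return enabled[tag] }), nil
}

func main() {
	const header = "//go:build integration"

	plain, _ := compiled(header)                 // like: go test ./...
	tagged, _ := compiled(header, "integration") // like: go test -tags=integration ./...

	fmt.Println("go test ./...:", plain)
	fmt.Println("go test -tags=integration ./...:", tagged)
}
```

With no tags enabled the constraint evaluates to false, so the file and its imports are never compiled; passing `-tags=integration` flips it to true.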

4.2 Production Safety Gate

Scenario	With Skill	Without Skill
Eval 1	`ENV=prod` → `t.Skip` unless `INTEGRATION_ALLOW_PROD=1`	❌ None
Eval 2	`ENV=prod` → `t.Skip` unless `INTEGRATION_ALLOW_PROD=1`	❌ None
Eval 3	`ENV=prod` → `t.Skip` unless `INTEGRATION_ALLOW_PROD=1`	❌ None

Practical impact: This is a safety-critical dimension. Without-skill tests run whenever ISSUE2MD_API_INTEGRATION=1 is set, including when ENV=prod, so they could hit production. The skill’s dual gates (env-var gate + production safety gate) add defense in depth.
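
A minimal sketch of the dual gate described above, assuming the variable names shown in the tables (`ISSUE2MD_API_INTEGRATION`, `ENV`, `INTEGRATION_ALLOW_PROD`). The `skipReason` helper and its messages are hypothetical; the generated tests may structure this differently, and matching only the literal value `prod` is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// skipReason returns an actionable message when the integration tests should
// be skipped, or "" when both gates allow the run. The environment lookup is
// injected so the decision can be exercised without touching real env vars.
func skipReason(getenv func(string) string) string {
	// Gate 1: explicit opt-in.
	if getenv("ISSUE2MD_API_INTEGRATION") != "1" {
		return "set ISSUE2MD_API_INTEGRATION=1 to run integration tests"
	}
	// Gate 2: production is refused unless explicitly allowed.
	if getenv("ENV") == "prod" && getenv("INTEGRATION_ALLOW_PROD") != "1" {
		return "ENV=prod detected: set INTEGRATION_ALLOW_PROD=1 to run against production"
	}
	return ""
}

func main() {
	if reason := skipReason(os.Getenv); reason != "" {
		fmt.Println("skip:", reason)
		return
	}
	fmt.Println("run")
}
```

Inside a test, the same helper would typically be called as `if reason := skipReason(os.Getenv); reason != "" { t.Skip(reason) }`, so that a mis-set opt-in variable alone is not enough to reach production.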

4.3 Output Contract (Structured Report)

With-skill produced reports containing:

Report field	Eval 1	Eval 2	Eval 3
Execution Mode	Standard	Standard	Comprehensive
Integration Target	5 endpoints	5 REST methods	5 endpoints
Degradation Level	Full	Full	Full
Gate Variables list	3 vars	8 vars	3 vars
Exact Commands	✅	✅	✅
Timeout/Retry Policy	15s/none	15s/none	30s/none
Result Summary	15 pass	8 skip	38 pass
Failure Classification	N/A	N/A	N/A
Quality Scorecard	✅ Complete	✅ Complete	✅ Complete

Without-skill produced brief text summaries with no structured fields.
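
One way to picture the difference is to model the nine report fields as a typed record and check it for completeness. The struct, JSON keys, and `missingFields` helper below are hypothetical; the skill defines which fields are required, not this layout. The Eval 1 values are taken from the table above, except where a comment marks them as illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Report mirrors the nine Output Contract fields (hypothetical layout).
type Report struct {
	ExecutionMode         string   `json:"execution_mode"`
	IntegrationTarget     string   `json:"integration_target"`
	DegradationLevel      string   `json:"degradation_level"`
	GateVariables         []string `json:"gate_variables"`
	ExactCommands         []string `json:"exact_commands"`
	TimeoutRetryPolicy    string   `json:"timeout_retry_policy"`
	ResultSummary         string   `json:"result_summary"`
	FailureClassification string   `json:"failure_classification"`
	QualityScorecard      string   `json:"quality_scorecard"`
}

// missingFields lists the required fields that are empty in r.
func missingFields(r Report) []string {
	var missing []string
	check := func(name string, empty bool) {
		if empty {
			missing = append(missing, name)
		}
	}
	check("Execution Mode", r.ExecutionMode == "")
	check("Integration Target", r.IntegrationTarget == "")
	check("Degradation Level", r.DegradationLevel == "")
	check("Gate Variables list", len(r.GateVariables) == 0)
	check("Exact Commands", len(r.ExactCommands) == 0)
	check("Timeout/Retry Policy", r.TimeoutRetryPolicy == "")
	check("Result Summary", r.ResultSummary == "")
	check("Failure Classification", r.FailureClassification == "")
	check("Quality Scorecard", r.QualityScorecard == "")
	return missing
}

// eval1Report fills the record with the Eval 1 column of the table.
func eval1Report() Report {
	return Report{
		ExecutionMode:     "Standard",
		IntegrationTarget: "5 endpoints",
		DegradationLevel:  "Full",
		// Illustrative names: the report only states that 3 variables were listed.
		GateVariables: []string{"ISSUE2MD_API_INTEGRATION", "ENV", "INTEGRATION_ALLOW_PROD"},
		// Illustrative command: the exact command is not reproduced in this report.
		ExactCommands:         []string{"ISSUE2MD_API_INTEGRATION=1 go test -tags=integration ./tests/integration/..."},
		TimeoutRetryPolicy:    "15s/none",
		ResultSummary:         "15 pass",
		FailureClassification: "N/A",
		QualityScorecard:      "Complete",
	}
}

func main() {
	r := eval1Report()
	out, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println("missing fields:", len(missingFields(r)))
}
```

A free-text summary of the kind produced without the skill cannot be checked this way, which is why the structured form is easier to audit.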

4.4 Scope Validation (Eval 2 Specific)

Dimension	With Skill	Without Skill
Identifies GitHub API as third-party	✅ (explicit Scope Note)	❌
Mentions correct alternative skill	✅ (`$thirdparty-api-integration-test`)	❌
Hard-stop, no test code generated	❌ (continued for evaluation)	❌ (generated directly)
Scope explanation	✅ (detailed in report)	❌

Analysis: With-skill scope awareness is far better than the baseline (3/4 vs 0/4 scope-related assertions), but the skill does not enforce a hard stop. The Gate 1 instruction "redirect to $thirdparty-api-integration-test, stop" is not strong enough in its current wording: the agent found enough justification to bypass it.

4.5 context.WithTimeout Coverage

Scenario	With Skill	Without Skill
Eval 1	Every HTTP call has 15s context	5s context via helper ✅
Eval 2	Every API call has 15s context	30s `http.Client.Timeout` ⚠️
Eval 3	Every request has 30s context	`http.Post` used directly, no context ❌

Analysis: Without the skill, Eval 3 called `http.Post` and `http.Get` directly; these package-level helpers take no `context.Context`, so the requests carry no per-call deadline. The skill requires "Guard each external call with context.WithTimeout", which gives consistent, testable timeout behavior.
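
The pattern the skill asks for can be sketched as follows, using `http.NewRequestWithContext` in place of `http.Get`. The helper names (`getStatus`, `newSlowServer`, `demo`) and the durations are assumptions chosen so the example finishes quickly; they are not the 15s/30s values used in the evaluated tests.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// getStatus issues a GET bounded by timeout and returns the HTTP status code.
func getStatus(client *http.Client, url string, timeout time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// newSlowServer starts a local server whose handler waits for delay before replying.
func newSlowServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			w.WriteHeader(http.StatusOK)
		case <-r.Context().Done(): // client gave up; stop early
		}
	}))
}

// demo runs one request that exceeds its deadline and one that fits within it.
func demo() (timedOut bool, code int, err error) {
	srv := newSlowServer(200 * time.Millisecond)
	defer srv.Close()

	_, err = getStatus(srv.Client(), srv.URL, 20*time.Millisecond)
	timedOut = errors.Is(err, context.DeadlineExceeded)

	code, err = getStatus(srv.Client(), srv.URL, 2*time.Second)
	return timedOut, code, err
}

func main() {
	timedOut, code, err := demo()
	fmt.Println("short deadline timed out:", timedOut)
	fmt.Println("long deadline status:", code, "err:", err)
}
```

Because the deadline travels with each request, a test can assert on `context.DeadlineExceeded` directly instead of relying on a client-wide `http.Client.Timeout`.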

5. Skill Differentiators

5.1 Skill-Only Capabilities (Never Observed Without-Skill)

Capability	Description	Occurrences
Build tag isolation	`//go:build integration` at file top	3/3
Production Safety Gate	`ENV=prod` → `t.Skip` dual gate	3/3
Output Contract	Structured report with 9 required fields	3/3
Scope validation	Distinguishes internal vs third-party API, suggests redirect	1/1
Quality Scorecard	Pre-Authoring + Test Quality checklists	3/3
Execution Mode declaration	Smoke/Standard/Comprehensive auto-selection	3/3
Degradation Level	Full/Scaffold/Blocked degradation judgment	3/3

5.2 Capabilities Where Both Performed Well

Capability	With Skill	Without Skill
Env var gate	3/3	3/3
Protocol-level assertions (HTTP status)	3/3	3/3
Business-level assertions (response content)	3/3	3/3
Success + failure path coverage	3/3	3/3
Actionable skip messages	3/3	3/3
File naming convention	3/3	3/3
Run command provided	3/3	3/3

5.3 Skill Advantage by Prompt Density

Prompt density	Eval	With-Skill advantage
Standard prompt (medium info)	Eval 1	+26.7 pp
Scope-boundary prompt (high info)	Eval 2	+50.0 pp
Explicit Comprehensive prompt (high info)	Eval 3	+40.0 pp

The skill’s advantage is largest in the scope-boundary scenario (+50.0 pp), because scope validation is skill-specific knowledge the model does not have by default.

6. Token Cost-Effectiveness

6.1 Skill File Token Estimates

File	Lines	Est. Tokens
`SKILL.md`	336	~2,100
`references/common-integration-gate.md`	97	~600
`references/common-output-contract.md`	30	~200
`references/checklists.md`	98	~600
`references/internal-api-patterns.md`	415	~2,500
Full load	976	~6,000

6.2 Typical Load Scenarios

The skill’s Reference Loading Gate specifies on-demand loading:

Scenario	Loaded files	Token cost
HTTP standard test	SKILL.md + gate + output + checklists + api-patterns	~6,000
Scope rejection only	SKILL.md + gate + output	~2,900
Result reporting	SKILL.md + output	~2,300
Typical (standard HTTP test)	Full	~6,000

Note: The current Reference Loading Gate design causes most HTTP test scenarios to load all references (because internal-api-patterns.md is always triggered for HTTP). Typical load ≈ full load.
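
The scenario costs above are sums of the per-file estimates from section 6.1. The sketch below reproduces that arithmetic; the map and the `loadCost` helper are illustrative and do not represent how the skill actually loads references.

```go
package main

import "fmt"

// fileTokens holds the per-file token estimates from the table in 6.1.
var fileTokens = map[string]int{
	"SKILL.md":                              2100,
	"references/common-integration-gate.md": 600,
	"references/common-output-contract.md":  200,
	"references/checklists.md":              600,
	"references/internal-api-patterns.md":   2500,
}

// loadCost sums the estimates for the files a scenario loads.
// Unknown file names contribute 0.
func loadCost(files ...string) int {
	total := 0
	for _, f := range files {
		total += fileTokens[f]
	}
	return total
}

func main() {
	const (
		skill      = "SKILL.md"
		gate       = "references/common-integration-gate.md"
		output     = "references/common-output-contract.md"
		checklists = "references/checklists.md"
		patterns   = "references/internal-api-patterns.md"
	)
	fmt.Println("HTTP standard test:  ", loadCost(skill, gate, output, checklists, patterns))
	fmt.Println("Scope rejection only:", loadCost(skill, gate, output))
	fmt.Println("Result reporting:    ", loadCost(skill, output))
}
```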

6.3 Cost-Effectiveness

Metric	Value
With-Skill pass rate	94.7%
Without-Skill pass rate	57.9%
Improvement	+36.8 pp
SKILL.md token cost	~2,100
Full-load token cost	~6,000
Token cost per 1% gain (SKILL.md)	~57 tokens
Token cost per 1% gain (full)	~163 tokens
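
The last two rows divide a token cost by the pass-rate improvement in percentage points. A small sketch of that calculation, with `costPerPoint` as an illustrative helper name:

```go
package main

import "fmt"

// costPerPoint returns the tokens spent per percentage point of pass-rate gain.
func costPerPoint(tokens int, gainPP float64) float64 {
	return float64(tokens) / gainPP
}

func main() {
	const gainPP = 36.8 // 94.7% - 57.9%

	fmt.Printf("SKILL.md only: ~%.0f tokens per point\n", costPerPoint(2100, gainPP))
	fmt.Printf("Full load:     ~%.0f tokens per point\n", costPerPoint(6000, gainPP))
}
```

Both figures depend on the rough token estimates in section 6.1, so they are best read as order-of-magnitude indicators.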

6.4 Comparison with Other Skills

Skill	Pass-rate gain	SKILL.md Tokens	Cost per 1%
go-makefile-writer	+31.0 pp	~1,960	~63
api-integration-test	+36.8 pp	~2,100	~57
fuzzing-test	+54.3 pp	~2,250	~41
go-ci-workflow	+33.0 pp	~1,500	~45

api-integration-test’s SKILL.md-only cost (57 tokens per 1% gain) is in a reasonable range. The full-load cost (163 tokens per 1% gain) is higher, mainly because internal-api-patterns.md (~2,500 tokens) is loaded in almost all scenarios.

7. Overall Score

7.1 Dimensions and Weights

Dimension	Weight	Score	Weighted
Safety gates (build tag + prod safety)	25%	10.0	2.50
Test code quality (protocol + business assertions)	20%	9.5	1.90
Scope validation	15%	7.5	1.13
Output Contract completeness	15%	10.0	1.50
Token cost-effectiveness	10%	7.5	0.75
Actual pass-rate gain	10%	9.5	0.95
Execution Mode auto-selection	5%	9.0	0.45
Weighted total	100%	—	9.18 / 10

7.2 Dimension Notes

Safety gates (10.0): 3/3 scenarios correct on build tag + prod safety gate; this dimension is entirely absent without-skill
Test code quality (9.5): Protocol + business assertions complete, consistent context.WithTimeout coverage; -0.5 for Eval 1 httptest.NewServer pattern overlapping with Eval 3
Scope validation (7.5): Identified GitHub API as third-party with detailed explanation, but no hard-stop (-2.5)
Output Contract (10.0): All 9 required fields present in all 3 scenarios
Token cost-effectiveness (7.5): SKILL.md-only cost is low (57 tokens/1%); full-load cost is higher (163 tokens/1%) due to the large internal-api-patterns.md
Pass-rate gain (9.5): +36.8 pp is a significant gain, especially in safety
Mode auto-selection (9.0): Correctly chose Standard and Comprehensive modes

8. Per-Eval Detailed Scores

Eval 1: Internal webapp API — Standard Mode

#	Assertion	With Skill	Without Skill
1	`//go:build integration` build tag	✅	❌
2	Gate env var check (`ISSUE2MD_API_INTEGRATION=1`)	✅	✅
3	Production safety gate (`ENV=prod` → `t.Skip`)	✅	❌
4	`context.WithTimeout` on HTTP calls	✅	✅
5	Protocol-level assertion (HTTP status codes)	✅	✅
6	Business-level assertion (response content/fields)	✅	✅
7	Success path test case	✅	✅
8	Expected-failure path test case	✅	✅
9	Actionable skip messages	✅	✅
10	File naming `*_integration_test.go`	✅	✅
11	No hardcoded secrets/endpoints	✅	✅
12	Execution mode stated (Standard)	✅	❌
13	Degradation level stated (Full)	✅	❌
14	Output Contract with mandatory fields	✅	❌
15	Exact run command provided	✅	✅
	Total	15/15	11/15

Eval 2: Third-Party API Scope Rejection

#	Assertion	With Skill	Without Skill
1	Identifies GitHub API as third-party	✅	❌
2	Mentions `$thirdparty-api-integration-test`	✅	❌
3	Hard-stop, no test code generated	❌	❌
4	Provides clear scope explanation	✅	❌
5	Build tag `//go:build integration` (if tests generated)	✅	❌
6	Production safety gate (if tests generated)	✅	❌
7	`context.WithTimeout` on calls (if tests generated)	✅	✅
8	Actionable skip messages (if tests generated)	✅	✅
	Total	6/8	2/8

Eval 3: Comprehensive Mode Upgrade

#	Assertion	With Skill	Without Skill
1	`//go:build integration` build tag	✅	❌
2	Gate env var check	✅	✅
3	Production safety gate	✅	❌
4	Concurrent request safety test	✅	✅
5	Large payload test (1MB MaxBytesReader)	✅	✅
6	All error status codes (400, 401, 403, 404, 429, 502)	✅	✅
7	`context.WithTimeout` on all HTTP calls	✅	❌
8	Response header assertions	✅	✅
9	Protocol-level assertions	✅	✅
10	Business-level assertions	✅	✅
11	Execution mode stated (Comprehensive)	✅	❌
12	Timeout documented (30s)	✅	❌
13	Output Contract present	✅	❌
14	Run command with comprehensive timeout	✅	✅
15	Quality Scorecard	✅	❌
	Total	15/15	9/15

9. Summary

The api-integration-test skill achieves a +36.8 pp pass-rate gain at ~2,100 tokens (SKILL.md), with full differentiation in safety (build-tag isolation + production safety gate) and structured output (Output Contract + Quality Scorecard). Scope validation is a unique strength: the baseline has no awareness of the internal/third-party API boundary, while the skill identifies third-party APIs and suggests a redirect.

Main improvement areas: strengthen the Scope Validation Gate’s hard-stop semantics (currently advisory, not mandatory) and trim internal-api-patterns.md to improve full-load cost-effectiveness.