Skip to content

thirdparty-api-integration-test Skill Evaluation Report

Evaluation framework: skill-creator Evaluation date: 2026-03-11 Evaluation subject: thirdparty-api-integration-test


thirdparty-api-integration-test is a skill for writing and running real integration tests for Go third-party API clients. It is suited for verifying vendor interface contracts, troubleshooting external call failures, and performing bounded regression checks under real runtime configuration. Its three standout strengths are: strict scope validation that clearly distinguishes third-party APIs, internal APIs, and unit tests to avoid test strategy mismatch; explicit safety gates for environment variables, runtime configuration, and production access, defaulting to rejecting high-risk execution paths; and build tag isolation plus structured output reports, making these high-cost tests suitable for on-demand runs and easier to capture results.

1. Evaluation Overview

This evaluation reviews the thirdparty-api-integration-test skill along two axes: actual task performance and token cost-effectiveness. Three scenarios were designed (GitHub REST API integration test, OpenAI Responses API integration test, internal webapp API scope boundary test). Each scenario was run with both with-skill and without-skill configurations, for 3 scenarios × 2 configs = 6 independent subagent runs, scored against 36 assertions.

Dimension With Skill Without Skill Delta
Assertion pass rate 36/36 (100%) 24/36 (66.7%) +33.3 pp
Gate env var isolation 3/3 correct 1/3 Largest single-item delta
Production Safety Gate 3/3 0/3 Skill-only
Build tag isolation 3/3 2/3 Eval 2 without-skill missing
Output Contract structured report 3/3 0/3 Skill-only
Scope boundary identification 4/4 0/4 Skill-only
Skill Token cost (SKILL.md only) ~680 tokens 0
Skill Token cost (all references) ~1,660 tokens 0
Token cost per 1% pass-rate gain ~20 tokens (SKILL.md only) / ~50 tokens (full)

2. Test Methodology

2.1 Scenario Design

Scenario Target Core focus Assertions
Eval 1: GitHub REST client internal/github/rest_client.go (5 methods) Standard third-party API integration: gate, safety gates, assertion quality 15
Eval 2: OpenAI Responses API internal/converter/summary_openai.go (Summarize method) Paid API test: API key management, i18n, timeout boundaries 13
Eval 3: Internal webapp (scope) internal/webapp/handler.go (6 HTTP endpoints) Scope boundary: internal API should not use third-party pattern 8

2.2 Execution

  • With-skill runs first read SKILL.md and its 4 referenced files
  • Without-skill runs read no skill, using model default behavior to generate tests
  • All runs execute in parallel in independent subagents
  • Eval 3 tests the skill’s scope identification for internal APIs

3. Assertion Pass Rate

3.1 Overview

Scenario Assertions With Skill Without Skill Delta
Eval 1: GitHub REST 15 15/15 (100%) 10/15 (66.7%) +33.3%
Eval 2: OpenAI API 13 13/13 (100%) 10/13 (76.9%) +23.1%
Eval 3: Scope boundary 8 8/8 (100%) 4/8 (50.0%) +50.0%
Total 36 36/36 (100%) 24/36 (66.7%) +33.3%

3.2 Per-Assertion Details

Eval 1: GitHub REST Client (15 assertions)

# Assertion With Without Notes
A1 //go:build integration build tag
A2 Dedicated gate env var (GITHUB_INTEGRATION=1) Without uses GITHUB_TOKEN as implicit gate only
A3 Production safety gate (ENV=prod → skip) Without completely missing
A4 GITHUB_TOKEN read from env
A5 Actionable skip messages
A6 context.WithTimeout wraps each API call
A7 Protocol-level assertions (number match, non-nil)
A8 Business-level assertions (title, user, state non-empty)
A9 Failure path explicit error type/code check Without only err != nil, no *statusError assertion
A10 Uses production code path (real client)
A11 No retry (or bounded retry)
A12 File naming *_integration_test.go
A13 Env var validation (TrimSpace + ParseInt) Without no TrimSpace
A14 Test data lifecycle (stable fixtures, overwritable)
A15 Output Contract structured report Without only brief summary

Eval 2: OpenAI Responses API (13 assertions)

# Assertion With Without Notes
B1 //go:build integration build tag Without missing build tag; go test ./... would trigger
B2 Dedicated gate env var Both use gate
B3 Production safety gate Without completely missing
B4 OPENAI_API_KEY read from env
B5 Actionable skip messages
B6 context.WithTimeout wraps each call
B7 Protocol-level assertions (Status == "ok")
B8 Business-level assertions (Summary/Language/KeyDecisions)
B9 Failure path tests (empty key, invalid key)
B10 Uses production code path
B11 No retry (paid API default no retry)
B12 Timeout boundary test (expired context)
B13 Output Contract structured report Without only brief summary

Eval 3: Scope Boundary Test (8 assertions)

# Assertion With Without Notes
C1 Identifies target as internal API (not third-party) Without has no scope concept
C2 Does not apply thirdparty pattern (build tag, gate)
C3 Report indicates scope determination Without no scope analysis
C4 Recommends correct skill ($api-integration-test) Without no skill recommendation concept
C5 Uses httptest mode
C6 Does not wrongly add THIRDPARTY_INTEGRATION gate
C7 Tests cover internal endpoints
C8 Provides clear scope boundary explanation Without no scope analysis

3.3 Classification of 12 Failed Assertions (Without-Skill)

Failure type Count Evals Notes
Production Safety Gate missing 2 Eval 1, 2 Neither has ENV=prod protection; direct run may hit production
Gate env var missing or implicit 1 Eval 1 Only GITHUB_TOKEN implicit gate; no dedicated switch var
Build tag missing 1 Eval 2 go test ./... would accidentally trigger OpenAI integration test
Structured Output Contract missing 2 Eval 1, 2 No gate vars, failure classification, missing prerequisites
Error type/code imprecise 1 Eval 1 404 path only err != nil, no *statusError check
Env var validation non-standard 1 Eval 1 No TrimSpace; leading/trailing spaces can cause misjudgment
Scope identification missing 4 Eval 3 No scope analysis, no alternative skill recommendation

3.4 Risk Matrix: Real Impact of Without-Skill Gaps

Gap Risk level Real scenario
Production Safety Gate High With ENV=prod, tests directly call production third-party API; may consume quota, incur cost, trigger rate limits
Build tag missing High Eval 2 no tag → go test ./... calls OpenAI API → every CI run consumes API quota
Gate env var implicit Medium Runs whenever GITHUB_TOKEN exists → developer shell usually has token configured
Error type imprecise Medium When 404 regresses to 500, err != nil still passes; cannot distinguish error type change
Env var no TrimSpace Low Trailing space in env var causes gate misjudgment (rare but possible)

4. Dimension-by-Dimension Comparison

4.1 Gate Env Var Design

This is the highest practical-value differentiator.

Scenario With Skill Without Skill
GitHub REST GITHUB_INTEGRATION=1 + GITHUB_TOKEN two-level gate Only GITHUB_TOKEN implicit gate
OpenAI API OPENAI_INTEGRATION=1 + OPENAI_API_KEY two-level gate ISSUE2MD_OPENAI_INTEGRATION=1 + OPENAI_API_KEY

Analysis: With-skill consistently uses explicit two-level gates (switch var + credential var separated). Without-skill in Eval 1 relies only on the credential var, so when the developer shell already has GITHUB_TOKEN, running go test -tags=integration ./... would accidentally trigger tests.

The skill’s pattern "Add explicit run gate env var, otherwise t.Skip(...)" addresses this safety design issue.

4.2 Production Safety Gate

Scenario With Skill Without Skill
Eval 1 ENV=prod → t.Skip ❌ Completely missing
Eval 2 ENV=prod → t.Skip ❌ Completely missing

Analysis: This is a Skill-only safety mechanism; Without-skill did not implement it in either scenario. For third-party API tests (especially paid OpenAI API), missing production protection can lead to: - Tests accidentally running in production - Consuming real API quota and cost - Triggering vendor rate-limit policies

4.3 Build Tag Isolation

Scenario With Skill Without Skill
Eval 1 //go:build integration + // +build integration //go:build integration
Eval 2 //go:build integration + // +build integration ❌ No build tag

Analysis: Eval 2 Without-skill output completely lacks a build tag, so go test ./... would compile and run the OpenAI integration test. For paid APIs this is serious — every CI run could incur API cost.

With-skill always outputs both build tag formats (new and legacy) for backward compatibility.

4.4 Error Assertion Precision

Dimension With Skill Without Skill
Eval 1 404 path errors.As(err, &stErr) + stErr.StatusCode == 404 err != nil
Eval 2 auth failure strings.Contains(err.Error(), "status 4") strings.Contains(err.Error(), "status 4")

The skill’s rule "For expected failure paths, assert explicit error type/code (not only require.Error)" produced a clear difference in Eval 1. With-skill uses errors.As to check the concrete *statusError type and 404 status code; Without-skill only checks err != nil.

Practical value: If the GitHub API 404 response format changes (e.g. returns 403 instead of 404), Without-skill’s test would still pass and hide the issue.

4.5 Output Contract (Structured Report)

With-skill produces a report after each run containing:

Report item Eval 1 Eval 2 Eval 3
Integration target details
Gate variable full list ✅ (10 vars) ✅ (6 vars) ✅ (N/A)
Exact run commands
Timeout / Retry policy 30s / none 30s / none 10s / none
Result summary (pass/fail/skip)
Failure classification N/A N/A N/A
Missing prerequisites
Checklist compliance
Scope determination

Without-skill produces a concise task summary but no gate variable table, no failure classification, no missing prerequisites list.

4.6 Scope Boundary Identification (Eval 3)

This is the most distinctive capability in this evaluation.

With-skill in Eval 3: 1. Actively identified internal/webapp/handler.go as an internal API after reading it 2. Explicitly stated "OUT OF SCOPE for thirdparty-api-integration-test skill" 3. Provided a stepwise gate evaluation table proving inapplicability 4. Recommended the correct $api-integration-test skill 5. Still produced high-quality internal API tests (httptest mode)

Without-skill directly produced high-quality webapp tests (25 test functions covering all endpoints) but no scope analysis.

Analysis: The skill’s scope comes from SKILL.md’s "Validate external API integration end-to-end" and "Apply to any third-party API integration" statements. Although SKILL.md has no explicit scope validation gate (unlike api-integration-test’s "Scope Validation Gate" section), the agent still inferred the boundary from context. This shows SKILL.md’s implicit scope definition is sufficient to guide correct judgment.


5. Token Cost-Effectiveness Analysis

5.1 Skill Size

File Lines Words Bytes Est. tokens
SKILL.md 80 482 3,699 ~680
references/common-integration-gate.md 38 209 1,513 ~370
references/common-output-contract.md 19 96 658 ~150
references/checklists.md 31 184 1,397 ~280
references/vendor-examples.md 99 326 3,094 ~570
Description (always in context) ~30 ~40
Total 267 1,327 10,361 ~2,090

Typical load scenarios:

Scenario Files read Total tokens
Full load (all references) SKILL.md + 4 references ~2,050
Standard load (no vendor examples) SKILL.md + gate + contract + checklists ~1,480
Minimal load SKILL.md ~680

5.2 Token Cost for Quality Gain

Metric Value
With-skill pass rate 100% (36/36)
Without-skill pass rate 66.7% (24/36)
Pass-rate gain +33.3 pp
Token cost per assertion fixed ~57 tokens (SKILL.md only) / ~171 tokens (full)
Token cost per 1% pass-rate gain ~20 tokens (SKILL.md only) / ~50 tokens (full)

5.3 Comparison with Sister Skill Cost-Effectiveness

Metric thirdparty-api-integration-test api-integration-test go-makefile-writer git-commit
SKILL.md tokens ~680 ~1,800 ~1,960 ~1,120
Total load tokens ~2,050 ~2,850 ~4,600 ~1,120
Pass-rate gain +33.3% +36.8% +31.0% +22.7%
Tokens per 1% (SKILL.md) ~20 tok ~49 tok ~63 tok ~51 tok
Tokens per 1% (full) ~62 tok ~77 tok ~149 tok ~51 tok

Analysis: thirdparty-api-integration-test’s SKILL.md has best token cost-effectiveness in the series — only ~680 tokens achieves +33.3% pass-rate gain. This is due to: 1. Extremely lean SKILL.md (80 lines vs api-integration-test’s 290) 2. High rule density — 13 Required Patterns cover all core differences 3. Well-designed references — vendor-examples.md provides copy-paste templates

5.4 Token Segment Cost-Effectiveness

Module Est. tokens Related assertion delta Cost-effectiveness
Required Pattern §3–5 (gate + prod safety + env validation) ~80 5 assertions (A2, A3, A13, B1, B3) Very high — 16 tok/assertion
Required Pattern §12 (explicit error type/code) ~20 1 assertion (A9) Very high — 20 tok/assertion
Output Contract pointer ~15 2 assertions (A15, B13) Very high — 8 tok/assertion
common-output-contract.md ~150 Indirect support for Output Contract quality High
Scope definition ("Apply to any third-party API integration") ~30 4 assertions (C1, C3, C4, C8) Very high — 8 tok/assertion
common-integration-gate.md ~370 Indirect support for gate design consistency High
checklists.md ~280 Indirect support for test quality completeness Medium
vendor-examples.md ~570 Indirect support for template consistency (MCS/USS templates) Medium — no direct match for issue2md project

5.5 High-Leverage vs Low-Leverage Instructions

High leverage (~145 tokens SKILL.md → 12 assertion deltas): - Gate + prod safety + env validation rules (80 tok → 5 assertions) - Explicit error type/code rule (20 tok → 1 assertion) - Scope definition (30 tok → 4 assertions) - Output Contract pointer (15 tok → 2 assertions)

Medium leverage (~535 tokens SKILL.md → indirect contribution): - Vendor-Specific Safety Additions (~150 tok) — drives idempotent preference, rate-limit awareness - Safety Rules summary (~130 tok) — reinforces safety constraints - Configuration Gate pointer (~25 tok) — guides reading detailed gate docs - References section (~230 tok) — points to all references

Low leverage (~570 tokens references → limited contribution): - vendor-examples.md (~570 tok) — MCS/USS templates have no direct mapping for issue2md

5.6 Token Efficiency Rating

Rating Conclusion
Overall ROI Excellent — ~2,050 tokens for +33.3% pass rate
SKILL.md ROI Outstanding — ~680 tokens contains all high-leverage rules; best cost-effectiveness in series
High-leverage token share ~21% (145/680) directly contributes to all 12 assertion deltas
Low-leverage token share vendor-examples.md is 27% of budget; no direct mapping for current project
Reference cost-effectiveness High — gate + contract references are small and focused; vendor-examples is large

6. Boundary Analysis vs Claude Base Model Capabilities

6.1 Capabilities Base Model Already Has (No Skill Increment)

Capability Evidence
context.WithTimeout wraps API calls 3/3 scenarios correct
Protocol-level assertions (number match, non-nil) 3/3 scenarios correct
Business-level assertions (non-empty fields) 3/3 scenarios correct
File naming *_integration_test.go 2/2 API scenarios correct
Uses production code path (real client) 3/3 scenarios correct
No retry default policy 3/3 scenarios correct
Basic actionable skip messages 3/3 scenarios correct
httptest.NewServer for internal API 1/1 scenario correct
Stable fixture + overwrite capability 2/2 API scenarios correct

6.2 Base Model Gaps (Skill Fills)

Gap Evidence Risk level
Production Safety Gate missing 2/2 API scenarios lack ENV=prod protection High — production trigger can consume quota
Build tag inconsistent 1/2 API scenarios missing build tag Highgo test ./... triggers paid API
Gate env var implicit 1/2 scenarios use credential as gate only Medium — runs when credential exists
Error type imprecise 1/2 scenarios only err != nil Medium — cannot distinguish error type change
Env var no TrimSpace 1/2 scenarios no validation Low — trailing space causes misjudgment
No structured Output report 2/2 scenarios no report Medium — lacks audit traceability
Scope identification 1/1 scenario no scope analysis Low — still produced suitable tests

7. Overall Score

7.1 Dimension Scores

Dimension With Skill Without Skill Delta
Gate and safety gates 5.0/5 2.0/5 +3.0
Build tag isolation 5.0/5 3.5/5 +1.5
Assertion quality (protocol + business + error) 5.0/5 3.5/5 +1.5
Env var validation and test data lifecycle 5.0/5 4.0/5 +1.0
Structured report 5.0/5 1.0/5 +4.0
Scope boundary identification 5.0/5 1.0/5 +4.0
Overall mean 5.00/5 2.50/5 +2.50

7.2 Weighted Total Score

Dimension Weight Score Weighted
Assertion pass rate (delta) 25% 9.5/10 2.38
Gate and safety gates 20% 10/10 2.00
Build tag isolation and consistency 10% 10/10 1.00
Assertion quality and error precision 10% 10/10 1.00
Structured report (Output Contract) 10% 10/10 1.00
Scope boundary identification 10% 10/10 1.00
Token cost-effectiveness 15% 9.5/10 1.43
Weighted total 9.81/10

8. Evaluation Materials

Material Path
Eval 1 with-skill output /tmp/tp-integ-eval/eval-1/with_skill/
Eval 1 without-skill output /tmp/tp-integ-eval/eval-1/without_skill/
Eval 2 with-skill output /tmp/tp-integ-eval/eval-2/with_skill/
Eval 2 without-skill output /tmp/tp-integ-eval/eval-2/without_skill/
Eval 3 with-skill output /tmp/tp-integ-eval/eval-3/with_skill/
Eval 3 without-skill output /tmp/tp-integ-eval/eval-3/without_skill/