Skip to content

google-search Skill Evaluation Report

Evaluation framework: skill-creator Evaluation date: 2026-03-12 Subject: google-search


google-search is a research/search skill that turns "help me search for this" into a verifiable search workflow. It suits fact lookups, error debugging, official docs retrieval, technology comparisons, and public-information gathering that needs source support. Its three main strengths are: classifying the question, defining the evidence chain, and choosing the mode first—elevating search from "finding links" to "finding evidence for conclusions"; outputs include confidence, source tier, budget status, and reusable queries so the search process is reviewable and continuable; and it emphasizes execution completeness and degradation declarations, clearly distinguishing "verified conclusions" from "partial results with insufficient evidence".

1. Evaluation Overview

This evaluation assesses the google-search skill along two dimensions: actual task performance and Token cost-effectiveness. It uses 3 search scenarios of increasing complexity (Quick-mode fact lookup, Standard-mode error debugging, Deep-mode framework comparison). Each scenario runs with both with-skill and without-skill configurations, for 3 scenarios × 2 configurations = 6 independent subagent runs, scored against 27 assertions.

Dimension With Skill Without Skill Delta
Assertion pass rate 27/27 (100%) 7/27 (25.9%) +74.1 percentage points
Output Contract 8 fields complete 3/3 correct 0/3 Skill-only
Confidence + Source-tier labels 3/3 correct 0/3 Skill-only
Reusable search queries 3/3 correct 0/3 Skill-only
Evidence chain status tracking 3/3 correct 0/3 Skill-only
Content quality (answer correctness/depth) 3/3 correct 3/3 correct No difference
Skill Token cost (SKILL.md only) ~3,100 tokens 0
Skill Token cost (incl. conditional references) ~6,400–7,800 tokens 0
Token cost per 1% pass-rate gain ~42 tok (SKILL.md) / ~99 tok (full)

Key finding: The core value of the google-search skill is search discipline and report structure, not search content quality. The base model already has strong search and synthesis ability (answer correctness, source coverage, code example quality all good), but lacks metadata for the search process (mode choice, budget control, evidence chain tracking, degradation declaration, confidence labels, reusable queries). The skill fills this "search operation discipline" gap.


2. Test Methodology

2.1 Scenario Design

Scenario User request Expected mode Assertions
Eval 1: Fact lookup "Go database/sql package MaxOpenConns and MaxIdleConns default values" Quick 9
Eval 2: Error debugging "gRPC context deadline exceeded — works locally, fails in production" Standard 9
Eval 3: Framework comparison "Compare Gin/Echo/Fiber performance for high-traffic REST API 2026" Deep 9

2.2 Execution

  • With-skill runs first read SKILL.md and related references (query-patterns, programmer-search-patterns, source-evaluation, etc.)
  • Without-skill runs read no skill; search follows model default behavior
  • All runs may use WebSearch and WebFetch tools
  • 6 subagents run in parallel (with-skill uses default model, without-skill uses fast model)

2.3 Skill Characteristics

google-search is a multi-file skill (1 SKILL.md + 6 reference files) with conditional loading.

File Word count Est. Tokens Load condition
SKILL.md 2,085 ~3,100 Always
references/query-patterns.md 1,191 ~1,800 Always (query construction)
references/programmer-search-patterns.md 1,031 ~1,500 Programmer search
references/source-evaluation.md 911 ~1,400 Source evaluation / conflict handling
references/ai-search-and-termination.md 549 ~800 Termination / escalation decisions
references/high-conflict-topics.md 947 ~1,400 High-conflict topics
references/chinese-search-ecosystem.md 279 ~400 Chinese / China topics
SKILL.md description (always in context) ~60 ~80 Always

Actual load per scenario:

Scenario Files loaded Est. Tokens
Eval 1 (Quick, programmer) SKILL.md + query-patterns + programmer-search ~6,400
Eval 2 (Standard, programmer) SKILL.md + query-patterns + programmer-search + source-evaluation ~7,800
Eval 3 (Deep, comparison) SKILL.md + query-patterns + programmer-search + source-evaluation ~7,800
Average ~7,300

3. Assertion Pass Rate

3.1 Summary

Scenario Assertions With Skill Without Skill Delta
Eval 1: Fact lookup (Quick) 9 9/9 (100%) 3/9 (33.3%) +66.7%
Eval 2: Error debugging (Standard) 9 9/9 (100%) 2/9 (22.2%) +77.8%
Eval 3: Framework comparison (Deep) 9 9/9 (100%) 2/9 (22.2%) +77.8%
Total 27 27/27 (100%) 7/27 (25.9%) +74.1%

3.2 Item-by-Item Scoring

Eval 1: Go database/sql default pool size (Quick mode)

# Assertion With Skill Without Skill
A1 Output includes execution mode label ✅ "Quick"
A2 Output includes degradation level ✅ "Full"
A3 Conclusion directly answers question
A4 Output includes reusable queries (≥2) ✅ (5 queries)
A5 At least 1 query uses site:go.dev
A6 Conclusion cites official sources ✅ go.dev, pkg.go.dev ✅ go.dev, pkg.go.dev
A7 Output includes evidence chain status ✅ Explicit table
A8 Conclusion includes specific values ✅ MaxOpenConns=0, MaxIdleConns=2
A9 Key numbers have confidence + source-tier labels ✅ "High" + "Official"

Eval 2: gRPC context deadline exceeded (Standard mode)

# Assertion With Skill Without Skill
B1 Output includes execution mode label ✅ "Standard"
B2 Output includes degradation level ✅ "Full"
B3 Conclusion includes multiple causes ✅ (5 structured causes) ✅ (6 causes)
B4 Output includes reusable queries (≥3) ✅ (5 queries)
B5 At least 1 query targets SO or GitHub site:github.com/grpc/grpc-go
B6 At least 1 query uses quoted exact match for error "context deadline exceeded"
B7 Sources include cross-validation (≥2 independent) ✅ (6 independent sources) ✅ (6 referenced sources)
B8 Output includes evidence chain status ✅ Explicit table
B9 Output includes source assessment ✅ Credibility/recency/gaps/conflicts/confidence reasoning

Eval 3: Go HTTP framework comparison (Deep mode)

# Assertion With Skill Without Skill
C1 Output includes execution mode label (Deep) ✅ "Deep"
C2 Output includes degradation level ✅ "Partial" (honest degradation)
C3 Conclusion includes recommendation ✅ Decision tree + framework positioning ✅ Decision matrix + recommendation
C4 Output includes reusable queries (≥3) ✅ (5 incl. gap-closing)
C5 Key numbers have confidence + source-tier labels ✅ (14 numbers all labeled)
C6 ≥3 independent sources ✅ (5+ sources with detailed assessment) ✅ (16 sources)
C7 Sources include credibility assessment ✅ Source Comparison Table (tier/credibility/gaps/recency/bias)
C8 Output includes evidence chain status ✅ Explicit chain status table
C9 Comparison covers ≥3 frameworks with concrete data ✅ Gin/Echo/Fiber + RPS + latency + stars

3.3 Classification of 20 Without-Skill Failed Assertions

Failure type Count Notes
Missing Output Contract metadata fields 6 execution mode (3) + degradation level (3)
Missing reusable search queries 3 3/3 scenarios no reusable queries section
Missing evidence chain status tracking 3 3/3 scenarios no evidence chain status
Missing confidence + source-tier labels 3 Key numbers lack dual labels
Missing source assessment 3 No credibility/bias/recency assessment
Missing search strategy display 2 No site: precise query, no quoted match

Note: As with the deep-research evaluation, all 20 failures are search discipline / report format failures, not content quality failures. Without-skill passed on answer correctness, source coverage, and code examples.

3.4 Comparison with deep-research Skill

Metric google-search deep-research
With-skill pass rate 100% 100%
Without-skill pass rate 25.9% 33.3%
Delta +74.1% +66.7%
Failure type Search discipline + report format Report format

google-search has a larger assertion delta because it requires not only a report template (deep-research’s 7-section) but also search process metadata (mode, budget, evidence chain, degradation level, reusable queries, precise query strategies). The base model does not produce these concepts at all.


4. Dimension-by-Dimension Comparison

4.1 Output Contract (8 Fields)

Field With Skill 3/3 Without Skill output
1. Execution mode ✅ Quick/Standard/Deep ❌ No mode concept
2. Degradation level ✅ Full/Partial/Blocked ❌ No degradation concept
3. Conclusion summary ✅ (equivalent)
4. Evidence chain status ✅ Explicit table ❌ No tracking
5. Key evidence ✅ Structured table with contribution notes ⚠️ Source list but no structured assessment
6. Source assessment ✅ Credibility/bias/recency/gaps/conflicts ❌ No assessment
7. Key numbers + dual labels ✅ confidence + source-tier ❌ Numbers but no labels
8. Reusable queries ✅ 3–5 with precision/expansion/gap-closing ❌ None

Practical value: - Degradation level showed highest value in Eval 3—With-skill honestly declared "Partial" (TechEmpower data from third-party interpretation, no named company production cases), while Without-skill gave conclusions without marking uncertainty - Evidence chain status lets readers track "which evidence is satisfied, which is missing", avoiding treating partial data as complete conclusions - Reusable queries give readers the ability to "continue searching"—5 well-designed Google queries are more lasting value than a single answer

4.2 Search Strategy Discipline

Dimension With Skill Without Skill
Query construction strategy Primary + Precision + Expansion variants Direct search, no explicit strategy
site: domain constraint ✅ site:go.dev, site:github.com/grpc/grpc-go Occasional but not systematic
Quoted exact match "context deadline exceeded" Not shown
Query budget control ✅ Quick 2 / Standard 5 / Deep 8 No budget concept
Query history log ✅ Gate Execution Log ❌ No log
Post-search strategy ✅ gap-closing queries ❌ None

4.3 Confidence + Source-Tier Labels

Eval 3 With-skill output labeled all 14 key numbers with dual labels:

| Fiber real-world RPS | ~36,000 | May 2024 | Medium | Primary (independent benchmark) |
| Fiber JSON RPS (TechEmpower R23) | ~735,000 | March 2025 | Low | Third-party interpretation of Official |

This distinguishes "Medium confidence from Primary source" from "Low confidence from Third-party interpretation", so readers know TechEmpower data is secondhand and downgraded. Without-skill Eval 3 cited 16 sources and many numbers but no number had credibility or source-tier labels.

4.4 Honest Degradation

Eval 3 With-skill output best illustrates this mechanism:

Degradation Level: Partial — Strong benchmark data and ecosystem analysis available. However: TechEmpower Round 23 Go-specific per-framework numbers could not be directly verified from TechEmpower's own site... Large-scale production experience reports... were not found from named companies with disclosed architectures.

This degradation statement clearly informs readers of two specific uncertainties, avoiding treating the comparison as fully confirmed fact. Without-skill Eval 3 also found no named company cases but did not declare this limitation.

4.5 Content Quality Comparison

Dimension With Skill Without Skill Delta
Answer correctness 3/3 correct 3/3 correct No difference
Source count 2 / 6 / 5 4 / 6 / 16 Without-skill slightly more (Eval 3)
Code examples Excellent (Eval 2: 6 blocks) Excellent (Eval 2: 5 blocks) No significant difference
Debug steps (Eval 2) 6-step structured flow 5-step flow Comparable
Framework comparison table (Eval 3) Source Comparison Table + Decision Tree Decision Matrix + Star ratings Each has strengths
Production advice Excellent Excellent No significant difference

Key conclusion: Consistent with the deep-research skill evaluation—the base model is already strong on content; the skill’s increment is entirely in search discipline and report metadata.


5. Token Cost-Effectiveness Analysis

5.1 Skill Size

File Est. Tokens Load condition
SKILL.md ~3,100 Always
query-patterns.md ~1,800 Always
programmer-search-patterns.md ~1,500 Programmer search
source-evaluation.md ~1,400 Source evaluation
ai-search-and-termination.md ~800 Termination decisions
high-conflict-topics.md ~1,400 High conflict
chinese-search-ecosystem.md ~400 Chinese topics
Max load ~10,400 All loaded
Typical load (programmer search) ~7,800 SKILL + query + programmer + source-eval
Min load (non-programmer Quick) ~4,900 SKILL + query

5.2 Token Cost for Quality Gain

Metric Value
With-skill pass rate 100% (27/27)
Without-skill pass rate 25.9% (7/27)
Pass-rate gain +74.1 percentage points
Token cost per assertion fixed (SKILL.md) ~155 tok
Token cost per assertion fixed (typical load) ~390 tok
Token cost per 1% pass-rate gain (SKILL.md) ~42 tok
Token cost per 1% pass-rate gain (typical load) ~105 tok

5.3 Token Segment Cost-Effectiveness

Module Est. Tokens Related assertion delta Cost-effectiveness
Output Contract (SKILL.md) ~300 6 (mode 3 + degradation 3) Very high — 50 tok/assertion
Confidence + Source-tier rules ~200 3 Very high — 67 tok/assertion
Reusable Queries requirement ~100 3 Very high — 33 tok/assertion
Evidence Chain Gate (Gate 3) ~300 3 High — 100 tok/assertion
Source Assessment requirement ~150 3 High — 50 tok/assertion
query-patterns.md ~1,800 2 (site: + quoted strategy) Medium — 900 tok/assertion
programmer-search-patterns.md ~1,500 Indirect (search quality) Medium — no direct assertion
source-evaluation.md ~1,400 Indirect (assessment quality) Medium — no direct assertion
Worked Examples (SKILL.md) ~500 0 direct Low
Anti-Examples (SKILL.md) ~300 0 direct Low
Other Gates (1,2,4,5,6,7,8) ~450 Indirect Medium

5.4 High-Leverage vs Low-Leverage Instructions

High leverage (~1,050 tokens → 18 assertion delta): - Output Contract 8-field definition (~300 tok → 6) - Confidence + Source-tier dual-label rules (~200 tok → 3) - Reusable Queries requirement (~100 tok → 3) - Evidence Chain Gate (~300 tok → 3) - Source Assessment requirement (~150 tok → 3)

Medium leverage (~5,150 tokens → 2 direct + indirect): - query-patterns.md (~1,800 tok → 2 + indirect search quality) - programmer-search-patterns.md (~1,500 tok → indirect) - source-evaluation.md (~1,400 tok → indirect) - Other Gates (~450 tok → indirect)

Low leverage (~800 tokens → 0 direct delta): - Worked Examples (~500 tok) - Anti-Examples (~300 tok)

5.5 Comparison with Other Skills’ Cost-Effectiveness

Metric google-search deep-research yt-dlp-downloader tdd-workflow go-makefile-writer
SKILL.md Tokens ~3,100 ~1,350 ~2,370 ~2,100 ~1,960
Typical load Tokens ~7,800 ~1,350 ~5,100 ~3,600 ~4,100
Pass-rate gain +74.1% +66.7% +55.0% +46.2% +31.0%
Tokens per 1% (SKILL.md) ~42 tok ~20 tok ~43 tok ~45 tok ~63 tok
Tokens per 1% (typical load) ~105 tok ~20 tok ~93 tok ~78 tok ~132 tok

google-search has the highest absolute pass-rate gain (+74.1%), but SKILL.md-level unit cost-effectiveness (~42 tok/1%) is similar to yt-dlp-downloader (~43) and tdd-workflow (~45). Typical-load cost-effectiveness (~105 tok/1%) is higher due to more reference files.


6. Boundary Analysis vs Base Model Capabilities

6.1 Base Model Capabilities (No Skill Increment)

Capability Evidence
WebSearch information retrieval 3/3 scenarios actively searched and found correct answers
Official sources preferred Eval 1 located go.dev and pkg.go.dev on its own
Error message search Eval 2 searched gRPC error and found GitHub issues
Multi-source synthesis Eval 3 cited 16 sources for framework comparison
Code example generation Eval 2 produced complete debug code snippets
Structured comparison tables Eval 3 produced decision matrix and star ratings

6.2 Base Model Gaps (Skill Fills)

Gap Evidence Risk level
No search mode/budget control 3/3 scenarios no Quick/Standard/Deep concept Medium — may over-search simple or under-search complex
No degradation declaration 3/3 scenarios give conclusions without marking uncertainty High — readers treat Partial as Full
No evidence chain tracking 3/3 scenarios don’t track "what evidence needed, what found" High — can’t assess conclusion reliability
No confidence + source-tier dual labels 3/3 scenarios numbers unlabeled High — third-party vs official treated equally
No reusable queries 3/3 scenarios don’t output search queries Medium — users can’t continue searching
No source credibility assessment 3/3 scenarios don’t assess bias/recency/gaps Medium — competitor blogs and official docs treated equally
No search strategy display Search process opaque Low — no direct impact on final answer

Core finding: The base model’s "search results → answer" ability is strong, but "search process auditability" and "conclusion credibility labeling" are zero. The google-search skill’s value lies in the latter two.


7. Overall Score

7.1 Dimension Scores

Dimension With Skill Without Skill Delta
Output Contract compliance 5.0/5 0.5/5 +4.5
Search discipline (mode/budget/strategy) 5.0/5 1.0/5 +4.0
Confidence + Source-tier 5.0/5 0.5/5 +4.5
Honest degradation 5.0/5 1.0/5 +4.0
Reusable queries 5.0/5 0.0/5 +5.0
Content quality (answer correctness/depth) 5.0/5 4.5/5 +0.5
Source count/diversity 5.0/5 4.5/5 +0.5
Overall mean 5.0/5 1.71/5 +3.29

7.2 Weighted Total

Dimension Weight Score Weighted
Assertion pass rate (delta) 25% 10/10 2.50
Output Contract compliance 15% 10/10 1.50
Search discipline + honest degradation 15% 10/10 1.50
Confidence + Source-tier 10% 10/10 1.00
Reusable queries 10% 10/10 1.00
Token cost-effectiveness 10% 7.0/10 0.70
Content quality increment 10% 2.0/10 0.20
Source count/quality increment 5% 2.0/10 0.10
Weighted total 8.50/10

The lower Token cost-effectiveness score (7.0/10) reflects higher typical load (~7,800 tok) from more reference files, even though SKILL.md cost-effectiveness (~42 tok/1%) is comparable to peer skills.


8. Evaluation Artifacts

Artifact Path
Eval 1 with-skill output /tmp/gsearch-eval/eval-1/with_skill/response.md
Eval 1 without-skill output /tmp/gsearch-eval/eval-1/without_skill/response.md
Eval 2 with-skill output /tmp/gsearch-eval/eval-2/with_skill/response.md
Eval 2 without-skill output /tmp/gsearch-eval/eval-2/without_skill/response.md
Eval 3 with-skill output /tmp/gsearch-eval/eval-3/with_skill/response.md
Eval 3 without-skill output /tmp/gsearch-eval/eval-3/without_skill/response.md