google-search is a research/search skill that turns "help me search for this" into a verifiable search workflow. It suits fact lookups, error debugging, official-docs retrieval, technology comparisons, and public-information gathering that needs source support. Its three main strengths:

- It classifies the question, defines the evidence chain, and chooses the mode first, elevating search from "finding links" to "finding evidence for conclusions".
- Outputs include confidence, source tier, budget status, and reusable queries, so the search process is reviewable and continuable.
- It emphasizes execution completeness and degradation declarations, clearly distinguishing "verified conclusions" from "partial results with insufficient evidence".
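As a rough sketch of what that per-report metadata might look like as data, here is a minimal model assuming field names inferred from this report (mode, budget, evidence chain, degradation level, confidence/source-tier labels, reusable queries); the actual field names and grouping in SKILL.md may differ:

```python
from dataclasses import dataclass, field


@dataclass
class SearchReportContract:
    """Hypothetical shape of the skill's per-report metadata.

    Field names are inferred from the evaluation text, not copied
    from SKILL.md; the real Output Contract may differ.
    """
    execution_mode: str               # e.g. "Quick" | "Standard" | "Deep"
    budget_status: str                # e.g. "3/5 searches used"
    degradation_level: str            # e.g. "Full" | "Partial"
    evidence_chain: dict[str, bool]   # evidence item -> satisfied?
    confidence: str                   # e.g. "High" | "Medium" | "Low"
    source_tier: str                  # e.g. "Primary", "Third-party interpretation"
    reusable_queries: list[str] = field(default_factory=list)


# Example instance mirroring the Eval 3 findings described later in this report.
report = SearchReportContract(
    execution_mode="Deep",
    budget_status="within budget",
    degradation_level="Partial",
    evidence_chain={"benchmark data": True, "named production cases": False},
    confidence="Medium",
    source_tier="Primary (independent benchmark)",
    reusable_queries=["go web framework benchmark site:techempower.com"],
)
```

The point of such a structure is that every report carries the same auditable metadata, regardless of the search topic.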
This evaluation assesses the google-search skill along two dimensions: actual task performance and token cost-effectiveness. It uses three search scenarios of increasing complexity (a Quick-mode fact lookup, a Standard-mode error-debugging task, and a Deep-mode framework comparison). Each scenario runs in both with-skill and without-skill configurations (3 scenarios × 2 configurations = 6 independent subagent runs), scored against 27 assertions.
| Dimension | With Skill | Without Skill | Delta |
|---|---|---|---|
| Assertion pass rate | 27/27 (100%) | 7/27 (25.9%) | +74.1 percentage points |
| Output Contract: all 8 fields complete | 3/3 correct | 0/3 | Skill-only |
| Confidence + source-tier labels | 3/3 correct | 0/3 | Skill-only |
| Reusable search queries | 3/3 correct | 0/3 | Skill-only |
| Evidence chain status tracking | 3/3 correct | 0/3 | Skill-only |
| Content quality (answer correctness/depth) | 3/3 correct | 3/3 correct | No difference |
| Skill token cost (SKILL.md only) | ~3,100 tokens | 0 | — |
| Skill token cost (incl. conditional references) | ~6,400–7,800 tokens | 0 | — |
| Token cost per 1% pass-rate gain | ~42 tok (SKILL.md) / ~99 tok (full) | — | — |
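The cost-per-gain figures are straight division of token cost by pass-rate delta. A quick sketch, using the SKILL.md-only figure and the upper bound (~7,800 tok) of the conditional-load range:

```python
# Tokens spent per percentage point of assertion pass-rate gain.
# Figures taken from the summary table above.
gain_pp = 74.1            # +74.1 percentage points

skill_md_tokens = 3_100   # SKILL.md only
full_load_tokens = 7_800  # upper bound incl. conditional references

print(round(skill_md_tokens / gain_pp))   # ~42 tok per 1%
print(round(full_load_tokens / gain_pp))  # ~105 tok per 1%
```

(The ~99 tok "full" figure in the table evidently uses a lower point in the ~6,400–7,800 range; the Section 5.5 comparison uses the ~7,800 upper bound, giving ~105.)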
Key finding: The core value of the google-search skill is search discipline and report structure, not search content quality. The base model already has strong search and synthesis ability (answer correctness, source coverage, and code-example quality were all good), but it produces no metadata about the search process (mode choice, budget control, evidence chain tracking, degradation declaration, confidence labels, reusable queries). The skill fills this "search operation discipline" gap.
| Assertion | With Skill | Without Skill |
|---|---|---|
| Comparison covers ≥3 frameworks with concrete data | ✅ Gin/Echo/Fiber + RPS + latency + stars | ✅ |
### 3.3 Classification of 20 Without-Skill Failed Assertions
| Failure type | Count | Notes |
|---|---|---|
| Missing Output Contract metadata fields | 6 | execution mode (3) + degradation level (3) |
| Missing reusable search queries | 3 | 3/3 scenarios: no reusable-queries section |
| Missing evidence chain status tracking | 3 | 3/3 scenarios: no evidence chain status |
| Missing confidence + source-tier labels | 3 | key numbers lack dual labels |
| Missing source assessment | 3 | no credibility/bias/recency assessment |
| Missing search strategy display | 2 | no site: precise query, no quoted match |
Note: As with the deep-research evaluation, all 20 failures are search-discipline / report-format failures, not content-quality failures. The without-skill runs still passed on answer correctness, source coverage, and code examples.
google-search shows a larger assertion delta because it requires not only a report template (deep-research's 7-section structure) but also search-process metadata (mode, budget, evidence chain, degradation level, reusable queries, precise-query strategies). The base model does not produce these concepts at all.
Practical value:

- Degradation level showed its highest value in Eval 3: with-skill honestly declared "Partial" (TechEmpower data came from third-party interpretation; no named-company production cases were found), while without-skill gave conclusions without marking uncertainty.
- Evidence chain status lets readers track which evidence is satisfied and which is missing, avoiding treating partial data as a complete conclusion.
- Reusable queries give readers the ability to continue searching; five well-designed Google queries have more lasting value than a single answer.
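To make "reusable queries" concrete, here is a hypothetical example of such a section, using the precise-query operators the skill checks for (site: scoping and quoted exact match). These specific query strings are illustrative, not taken from the evaluation runs:

```python
# Hypothetical reusable-queries section for the Eval 3 framework
# comparison. Note the site: scoping and quoted exact-match operators
# that the without-skill runs never used.
reusable_queries = [
    'gin vs echo vs fiber benchmark 2024',
    'go web framework benchmark site:github.com',       # site: scoping
    '"TechEmpower Round 23" go frameworks',             # quoted exact match
    'fiber production experience site:reddit.com',
    'echo framework "memory usage" benchmark',
]

for query in reusable_queries:
    print(query)
```

A reader can paste any of these into Google to pick up the search where the report left off.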
The Eval 3 with-skill output labeled all 14 key numbers with dual labels, for example:

| Metric | Value | As of | Confidence | Source tier |
|---|---|---|---|---|
| Fiber real-world RPS | ~36,000 | May 2024 | Medium | Primary (independent benchmark) |
| Fiber JSON RPS (TechEmpower R23) | ~735,000 | March 2025 | Low | Third-party interpretation of official |
This distinguishes "Medium confidence from a Primary source" from "Low confidence from a third-party interpretation", so readers know the TechEmpower data is secondhand and downgraded accordingly. The without-skill Eval 3 run cited 16 sources and many numbers, but none of those numbers carried credibility or source-tier labels.
The Eval 3 with-skill output best illustrates this mechanism:
> Degradation Level: Partial — Strong benchmark data and ecosystem analysis available. However: TechEmpower Round 23 Go-specific per-framework numbers could not be directly verified from TechEmpower's own site... Large-scale production experience reports... were not found from named companies with disclosed architectures.
This degradation statement clearly informs readers of two specific uncertainties, avoiding treating the comparison as fully confirmed fact. The without-skill Eval 3 run also found no named-company cases, but it never declared that limitation.
Key conclusion: Consistent with the deep-research skill evaluation, the base model is already strong on content; the skill's increment is entirely in search discipline and report metadata.
High leverage (~1,050 tokens → 18-assertion delta):

- Output Contract 8-field definition (~300 tok → 6)
- Confidence + source-tier dual-label rules (~200 tok → 3)
- Reusable-queries requirement (~100 tok → 3)
- Evidence Chain Gate (~300 tok → 3)
- Source-assessment requirement (~150 tok → 3)

Medium leverage (~5,150 tokens → 2 direct + indirect):

- query-patterns.md (~1,800 tok → 2 + indirect search quality)
- programmer-search-patterns.md (~1,500 tok → indirect)
- source-evaluation.md (~1,400 tok → indirect)
- Other Gates (~450 tok → indirect)

Low leverage (~800 tokens → 0 direct delta):

- Worked Examples (~500 tok)
- Anti-Examples (~300 tok)
### 5.5 Comparison with Other Skills' Cost-Effectiveness
| Metric | google-search | deep-research | yt-dlp-downloader | tdd-workflow | go-makefile-writer |
|---|---|---|---|---|---|
| SKILL.md tokens | ~3,100 | ~1,350 | ~2,370 | ~2,100 | ~1,960 |
| Typical load tokens | ~7,800 | ~1,350 | ~5,100 | ~3,600 | ~4,100 |
| Pass-rate gain | +74.1% | +66.7% | +55.0% | +46.2% | +31.0% |
| Tokens per 1% (SKILL.md) | ~42 tok | ~20 tok | ~43 tok | ~45 tok | ~63 tok |
| Tokens per 1% (typical load) | ~105 tok | ~20 tok | ~93 tok | ~78 tok | ~132 tok |
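The per-1% rows are token cost divided by pass-rate gain; a quick check of the SKILL.md row, using the figures from the table:

```python
# (skill_md_tokens, pass_rate_gain_pp) per skill, from the table above.
skills = {
    "google-search":      (3_100, 74.1),
    "deep-research":      (1_350, 66.7),
    "yt-dlp-downloader":  (2_370, 55.0),
    "tdd-workflow":       (2_100, 46.2),
    "go-makefile-writer": (1_960, 31.0),
}

for name, (tokens, gain_pp) in skills.items():
    print(f"{name}: ~{round(tokens / gain_pp)} tok per 1%")
```

The computed values (~42, ~20, ~43, ~45, ~63) reproduce the "Tokens per 1% (SKILL.md)" row.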
google-search has the highest absolute pass-rate gain (+74.1 pp), and its SKILL.md-level unit cost (~42 tok per 1%) is similar to yt-dlp-downloader (~43) and tdd-workflow (~45). Its typical-load unit cost (~105 tok per 1%) is worse, however, because it loads more reference files.
| Missing discipline | Without-skill behavior | Impact |
|---|---|---|
| No mode selection | — | Medium: may over-search simple questions or under-search complex ones |
| No degradation declaration | 3/3 scenarios give conclusions without marking uncertainty | High: readers treat Partial as Full |
| No evidence chain tracking | 3/3 scenarios don't track "what evidence is needed, what was found" | High: can't assess conclusion reliability |
| No confidence + source-tier dual labels | 3/3 scenarios leave numbers unlabeled | High: third-party and official numbers treated equally |
| No reusable queries | 3/3 scenarios don't output search queries | Medium: users can't continue searching |
| No source credibility assessment | 3/3 scenarios don't assess bias/recency/gaps | Medium: competitor blogs and official docs treated equally |
| No search strategy display | Search process opaque | Low: no direct impact on the final answer |
Core finding: The base model's "search results → answer" ability is strong, but its search-process auditability and conclusion-credibility labeling are essentially zero. The google-search skill's value lies in those latter two.
The lower token cost-effectiveness score (7.0/10) reflects the higher typical load (~7,800 tok) from additional reference files, even though SKILL.md-level cost-effectiveness (~42 tok per 1%) is comparable to peer skills.