Session Date: 2026-02-25
Project: ast-grep-mcp
Focus: End-to-end live validation of condense feature tools
Session Type: Validation | Bugfix

Executive Summary

Ran all 6 condense MCP tools against the production tcad-scraper codebase (219 files, 1.1 MB of TypeScript/JS/Python). Every tool executed successfully with zero errors. During validation, discovered and fixed a serialization gap: per_language_stats in condense_pack output was missing byte-level metrics, and only line counts were emitted. After the fix, all 117 condense tests continue to pass.

The ai_chat strategy achieved a 15.7% actual reduction (43,827 tokens saved) against a theoretical estimate of ~85%. The gap confirms that the JS/TS surface extractor's brace-matching heuristic retains more code than the theoretical model assumes; this is documented as a future improvement area.

Key Metrics

| Metric | Value |
| --- | --- |
| Tools validated | 6 / 6 |
| Target codebase | tcad-scraper (219 files, 1.1 MB) |
| Errors encountered | 0 |
| Bug found and fixed | 1 (per-language byte stats) |
| Tests passing | 117 / 117 |
| Best actual reduction | 15.7% (ai_chat) |
| Tokens saved (ai_chat) | 43,827 |

Tool-by-Tool Results

1. condense_normalize (batch, 212 TS files)

Processed in 5 batches of 50. Completed in 0.4s.

| Metric | Value |
| --- | --- |
| Files processed | 212 |
| Normalizations applied | 11,494 |
| Files with changes | 210 / 212 |
| Byte delta | +1,774 (0.16% expansion) |

Quote canonicalization was the dominant transform. Net byte expansion is expected: normalization targets compression consistency, not direct size reduction. Top file: continuous-batch-scraper.ts with 1,159 normalizations.
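The batch split above can be illustrated with a minimal sketch. The helper name `batched` and the placeholder file names are hypothetical; the report does not show how the tool partitions its input.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder names standing in for the 212 TypeScript files.
files = [f"file_{i}.ts" for i in range(212)]
batch_sizes = [len(batch) for batch in batched(files, 50)]
print(batch_sizes)  # [50, 50, 50, 50, 12] -> 5 batches
```

With a batch size of 50, 212 files give four full batches and one partial batch of 12, which matches the 5 batches reported.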

2. condense_strip (batch, 219 files)

| Metric | Value |
| --- | --- |
| Files processed | 219 |
| Lines removed | 164 |
| Line reduction | 0.40% |
| Files with removals | 20 / 219 |
| Elapsed | 0.6s |

Removed console.log, debugger, print(), and pdb.set_trace statements. Top file: setup-test-db.ts (37 lines removed). The codebase is relatively clean: debug statements account for only 0.40% of lines.
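As a rough illustration of what this kind of stripping does, here is a line-based regex stand-in. The report does not say how condense_strip matches these statements, so the pattern and the function name `strip_debug` are assumptions, and the sketch only handles statements that occupy a whole line.

```python
import re

# Assumption: a whole-line regex is a stand-in for the tool's real matcher.
DEBUG_LINE = re.compile(
    r"^\s*(?:console\.log\(.*\);?|debugger;?|print\(.*\)|pdb\.set_trace\(\))\s*$"
)


def strip_debug(source: str) -> tuple[str, int]:
    """Drop whole-line debug statements; return (new_source, lines_removed)."""
    lines = source.splitlines()
    kept = [line for line in lines if not DEBUG_LINE.match(line)]
    return "\n".join(kept), len(lines) - len(kept)


ts_source = "const x = 1;\nconsole.log(x);\ndebugger;\nreturn x;"
stripped, removed = strip_debug(ts_source)
print(removed)  # 2
```

A regex like this cannot tell a real call from the same text inside a string or comment, which is one reason a syntax-aware matcher would be preferable in practice.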

3. condense_extract_surface (212 TS files)

| Metric | Value |
| --- | --- |
| Files processed | 212 |
| Condensed lines | 33,938 |
| Reduction | 15.0% |
| Output size | 946,780 chars |
| Elapsed | 22.5s |

Kept only export declarations with brace-matched blocks. Test files with describe/it (no export prefix) fall back to keeping all lines, limiting reduction.
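The behavior described above can be sketched as follows. This is a simplified reconstruction from the description, not the extractor's actual code: the function name `extract_surface` is hypothetical, and the brace counting ignores braces inside strings and comments.

```python
def extract_surface(source: str) -> str:
    """Keep `export` declarations plus their brace-matched blocks (sketch)."""
    kept: list[str] = []
    depth = 0          # brace depth inside the current export block
    in_export = False
    for line in source.splitlines():
        if not in_export and line.lstrip().startswith("export "):
            in_export = True
            depth = 0
        if in_export:
            kept.append(line)
            depth += line.count("{") - line.count("}")
            # The block ends once braces balance (also covers one-line exports).
            if depth <= 0:
                in_export = False
    # Fallback: a file with no exports is kept unchanged.
    return "\n".join(kept) if kept else source


module = "import y from 'y';\nconst a = 1;\nexport function f() {\n  return a;\n}\nconst b = 2;"
test_file = "describe('x', () => {\n  it('works', () => {});\n});"
print(extract_surface(module))
print(extract_surface(test_file) == test_file)  # True: nothing is removed
```

The sketch makes both limits visible: the whole body of `f` survives because it sits inside the matched braces, and the test file passes through untouched because it contains no `export` line.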

4. condense_pack (all 4 strategies)

| Strategy | Condensed | Reduction | Tokens (est) | Time |
| --- | --- | --- | --- | --- |
| ai_chat | 938,923 B | 15.7% | 234,730 | 10.6s |
| ai_analysis | 1,102,420 B | 1.1% | 275,605 | 19.2s |
| archival | 1,102,420 B | 1.1% | 275,605 | 14.9s |
| polyglot | 938,923 B | 15.7% | 234,730 | 15.5s |

ai_chat and polyglot produce identical output (all files are code, no config/text routing divergence). ai_analysis and archival are identical (both lossless, normalize+strip only).

Per-language breakdown (ai_chat):

| Language | Files | Reduction |
| --- | --- | --- |
| TypeScript | 212 | 15.6% |
| JavaScript | 4 | 26.4% |
| Python | 3 | 23.5% |

5. condense_estimate

| Strategy | Est. Bytes | Est. Tokens | Theoretical Reduction |
| --- | --- | --- | --- |
| ai_chat | 167,134 | 41,783 | ~85% |
| ai_analysis | 668,538 | 167,134 | ~40% |
| archival | 779,961 | 194,990 | ~30% |
| polyglot | 389,980 | 97,495 | ~65% |

Top reduction candidates: continuous-batch-scraper.ts (1,769 lines, 4.3% of codebase).
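The estimate table is consistent with a fixed per-strategy reduction applied to the total input size, followed by a bytes-to-tokens conversion. The sketch below reproduces the table under two assumptions that the report does not state: an input size of 1,114,230 bytes (back-solved from the table) and roughly 4 bytes per token. The dictionary name `STRATEGY_REDUCTION_PCT` is hypothetical.

```python
# Theoretical reduction percentages, taken from the estimate table.
STRATEGY_REDUCTION_PCT = {"ai_chat": 85, "ai_analysis": 40, "archival": 30, "polyglot": 65}

ORIGINAL_BYTES = 1_114_230  # assumption: back-solved from the table, not reported
BYTES_PER_TOKEN = 4         # assumption: inferred from the bytes/tokens columns


def estimate(strategy: str, original_bytes: int = ORIGINAL_BYTES) -> tuple[int, int]:
    """Return (estimated_bytes, estimated_tokens) for a strategy."""
    pct = STRATEGY_REDUCTION_PCT[strategy]
    est_bytes = original_bytes * (100 - pct) // 100  # integer math avoids float drift
    return est_bytes, est_bytes // BYTES_PER_TOKEN


for name in STRATEGY_REDUCTION_PCT:
    print(name, *estimate(name))
```

Because every figure follows from one constant per strategy, the estimate cannot reflect how much code a given extractor actually keeps.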

6. condense_normalize on ~/reports/ (32 files)

Also validated against the reports site (JS, Python, CSS files):

| Metric | Value |
| --- | --- |
| Files processed | 32 |
| Normalizations applied | 59 |
| Files with changes | 16 / 32 |
| Byte reduction | 75 (0.02%) |

Bug Found and Fixed

Problem: per_language_stats in condense_pack output serialized only files_processed, original_lines, and condensed_lines; byte-level metrics were missing entirely. As a result, per-language stats appeared as all zeros when the original_bytes/condensed_bytes keys were accessed.

Root cause: LanguageCondenseStats dataclass had no byte fields, and condense_pack_impl only aggregated line counts per language.

Fix (2 files, 3 locations):

src/ast_grep_mcp/models/condense.py:11-12 — Added fields:

original_bytes: int = 0
condensed_bytes: int = 0

src/ast_grep_mcp/features/condense/service.py:399-400 — Aggregate bytes:

stats.original_bytes += file_result["original_bytes"]
stats.condensed_bytes += file_result["condensed_bytes"]

src/ast_grep_mcp/features/condense/service.py:431-436 — Serialize with computed reduction:

"original_bytes": s.original_bytes,
"condensed_bytes": s.condensed_bytes,
"reduction_pct": round((1.0 - s.condensed_bytes / s.original_bytes) * 100, 1)

All 117 condense tests pass after the fix.
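Put together, the three changes look roughly like the self-contained sketch below. Only the two byte fields, the two `+=` lines, and the `reduction_pct` expression come from the fix itself. The functions `aggregate` and `serialize`, the `"language"` key, and the zero-byte guard are illustrative assumptions and not the project's code.

```python
from dataclasses import dataclass


@dataclass
class LanguageCondenseStats:
    files_processed: int = 0
    original_lines: int = 0
    condensed_lines: int = 0
    original_bytes: int = 0   # added by the fix
    condensed_bytes: int = 0  # added by the fix


def aggregate(file_results: list[dict]) -> dict[str, LanguageCondenseStats]:
    """Sum per-file results into per-language stats (hypothetical helper)."""
    per_language: dict[str, LanguageCondenseStats] = {}
    for file_result in file_results:
        stats = per_language.setdefault(file_result["language"], LanguageCondenseStats())
        stats.files_processed += 1
        stats.original_lines += file_result["original_lines"]
        stats.condensed_lines += file_result["condensed_lines"]
        stats.original_bytes += file_result["original_bytes"]
        stats.condensed_bytes += file_result["condensed_bytes"]
    return per_language


def serialize(s: LanguageCondenseStats) -> dict:
    """Emit line and byte metrics; the zero guard is an addition in this sketch."""
    reduction = (1.0 - s.condensed_bytes / s.original_bytes) * 100 if s.original_bytes else 0.0
    return {
        "files_processed": s.files_processed,
        "original_lines": s.original_lines,
        "condensed_lines": s.condensed_lines,
        "original_bytes": s.original_bytes,
        "condensed_bytes": s.condensed_bytes,
        "reduction_pct": round(reduction, 1),
    }


results = [
    {"language": "typescript", "original_lines": 50, "condensed_lines": 40,
     "original_bytes": 1000, "condensed_bytes": 800},
    {"language": "typescript", "original_lines": 50, "condensed_lines": 45,
     "original_bytes": 1000, "condensed_bytes": 900},
    {"language": "python", "original_lines": 10, "condensed_lines": 8,
     "original_bytes": 200, "condensed_bytes": 150},
]
per_language = aggregate(results)
print(serialize(per_language["typescript"]))
```

The guard matters for a language whose files are all empty: without it, the `reduction_pct` expression would divide by zero.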

Estimate vs Actual Gap

| Strategy | Estimated Reduction | Actual Reduction | Gap |
| --- | --- | --- | --- |
| ai_chat | ~85% | 15.7% | 69.3pp |
| ai_analysis | ~40% | 1.1% | 38.9pp |

The estimator uses theoretical STRATEGY_REDUCTION_RATIOS constants. The actual JS/TS surface extractor keeps entire brace-matched export blocks (including function bodies), and test files with describe/it fall back to keeping everything. This is the primary improvement target for the next phase.
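The gap figures and the tokens-saved number can be rederived from the condense_pack byte counts. The sketch assumes an input size of 1,114,230 bytes and about 4 bytes per token; both values are back-solved from the tables and are not stated in the report.

```python
ORIGINAL_BYTES = 1_114_230  # assumption: back-solved, not reported
BYTES_PER_TOKEN = 4         # assumption: inferred from the pack table


def actual_reduction_pct(condensed_bytes: int, original_bytes: int = ORIGINAL_BYTES) -> float:
    """Measured reduction as a percentage, rounded to one decimal."""
    return round((1 - condensed_bytes / original_bytes) * 100, 1)


def gap_pp(theoretical_pct: float, actual_pct: float) -> float:
    """Difference between estimate and measurement, in percentage points."""
    return round(theoretical_pct - actual_pct, 1)


chat_actual = actual_reduction_pct(938_923)        # ai_chat condensed size
analysis_actual = actual_reduction_pct(1_102_420)  # ai_analysis condensed size
tokens_saved = ORIGINAL_BYTES // BYTES_PER_TOKEN - 938_923 // BYTES_PER_TOKEN

print(chat_actual, gap_pp(85, chat_actual))          # 15.7 69.3
print(analysis_actual, gap_pp(40, analysis_actual))  # 1.1 38.9
print(tokens_saved)                                  # 43827
```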

Files Modified

| File | Change |
| --- | --- |
| src/ast_grep_mcp/models/condense.py:11-12 | Added original_bytes, condensed_bytes fields |
| src/ast_grep_mcp/features/condense/service.py:399-400 | Aggregate byte counts per language |
| src/ast_grep_mcp/features/condense/service.py:429-436 | Serialize byte metrics + reduction_pct |

Git Context

d97d782 refactor(condense): remove unused CondenseDefaults constants and standardize field naming
1ffc15b feat(condense): implement P9 — condense_train_dictionary tool (zstd)
9a09893 fix(condense): address critical/high code review findings

6. condense_train_dictionary (TypeScript)

| Metric | Value |
| --- | --- |
| Dictionary path | .condense/dictionaries/dict_typescript.zdict |
| Dictionary size | 112,640 B (110 KB) |
| Samples used | 200 |
| Total sample bytes | 999,080 B (~1 MB) |
| Est. compression improvement | 15.0% |
| Elapsed | 13.2s |

Trained a zstd dictionary on 200 TypeScript files from tcad-scraper, written to tcad-scraper/.condense/dictionaries/dict_typescript.zdict. The dictionary captures repeated cross-file patterns (import paths, type annotations, test boilerplate) that zstd without a dictionary cannot exploit when files are compressed individually. Usage:

zstd -D .condense/dictionaries/dict_typescript.zdict <file>

The 15% estimated improvement applies on top of standard zstd compression ratios. It is most effective for small-to-medium files (<100 KB) with consistent coding patterns across the codebase.
