Build Log

Building an AI Telemetry CLI

Raw logs are noise. Turn your AI CLI sessions into queryable data and discover what your tools are *actually* doing.

Day 1 of the Build

November 28, 2025

Log Info
Day: 1 of the build
Type: Struggle
Project: Plenty of SaaS (active)
Commits today: 47

The Real Problem

You run Claude Code for 3 hours. It spits out 284,489 lines of terminal output. You could read it. You probably shouldn’t. But here’s what you actually want to know:

  • Which tools does Claude Code use most? (Spoiler: 66% are Bash commands)
  • How much time is wasted on UI progress indicators? (15% of all output)
  • Did it use the model I requested or something else?
  • How do different AI assistants solve the same problem differently?

Raw logs don’t answer these questions. You need queryable data.

What We Built

ait (AI CLI Telemetry) is a lightweight wrapper that turns your AI CLI output into structured, analyzable data. Think of it like adding “events” to a CLI tool—same output you already see, but with metadata you can actually query.
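
To make "events with metadata" concrete, here is a rough sketch of what a single JSONL record might look like, reverse-engineered from the jq queries used later in this post. The field names match the keys those queries reference; the exact shapes (token_usage especially) are assumptions, not ait's canonical schema.

// Sketch of one telemetry event record, inferred from the jq queries
// in this post. The real ait schema may carry more (or differently
// shaped) fields.
package telemetry

import "time"

type ToolInput struct {
    Value string `json:"value"` // e.g. the file path passed to Read
}

type ToolCall struct {
    Name  string    `json:"name"` // "Bash", "Read", "Write", ...
    Input ToolInput `json:"input"`
}

type LLMMeta struct {
    Model     string    `json:"model"`               // model actually used, not just requested
    Milestone string    `json:"milestone,omitempty"` // e.g. "progress" for UI noise
    ToolCall  *ToolCall `json:"tool_call,omitempty"` // present only on tool-call events
}

type Event struct {
    Timestamp  time.Time      `json:"timestamp"`             // RFC 3339, per the awk -F'T' query below
    LLMMeta    LLMMeta        `json:"llm_meta"`
    TokenUsage map[string]int `json:"token_usage,omitempty"` // shape is an assumption
}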

The CLI Interface

# Wrap any AI assistant
$ ait run -- claude "Analyze this codebase"

# Or use quick aliases
$ alias ac='ait run claude'
$ ac "Generate tests for src/auth.go"

# Export everything from the last 7 days
$ ait export --days 7 | jq '.llm_meta.tool_call.name' | sort | uniq -c
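
Under the hood, a wrapper like this is mostly process plumbing: start the child CLI, tee its output so you still see everything live, and hand each line to the telemetry parser. A minimal Go sketch of that core loop (hypothetical, not ait's actual code; interactive TTY handling is glossed over):

// Core loop of a CLI wrapper: run the child command, pass its output
// through unchanged, and feed each line to a telemetry parser.
// Hypothetical sketch, not ait's implementation.
package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
    "os/exec"
)

// recordEvent is a placeholder: the real parser would classify the line
// (tool call, progress noise, model info) and append a JSONL record.
func recordEvent(line string) {}

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: wrap <command> [args...]")
        os.Exit(2)
    }

    cmd := exec.Command(os.Args[1], os.Args[2:]...)
    cmd.Stdin = os.Stdin
    cmd.Stderr = os.Stderr

    stdout, err := cmd.StdoutPipe()
    if err != nil {
        panic(err)
    }
    if err := cmd.Start(); err != nil {
        panic(err)
    }

    // Tee: the user sees the same output as before, and the wrapper
    // gets a copy of every line to turn into events.
    sc := bufio.NewScanner(io.TeeReader(stdout, os.Stdout))
    sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // CLI output lines can be long
    for sc.Scan() {
        recordEvent(sc.Text())
    }
    _ = cmd.Wait()
}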

Help System

$ ait --help
ait - AI CLI Telemetry Wrapper

QUICK START:
  ait init --local
  ait run claude --model sonnet-4 --tag project=myapp
  ait export --days 7 --out last-week.jsonl

$ ait export --help
Export session logs as JSONL

EXAMPLES:
  ait export --days 7 --out last-week.jsonl
  ait export --since 2025-01-01 --limit 100
  ait export | jq '.token_usage'

What We Actually Learned (The Good Stuff)

I analyzed a real 3+ hour Claude Code session building this feature. Here’s what the data revealed:

Finding 1: Tool Usage is Wildly Skewed

Most people assume an AI assistant spreads its tool usage evenly. It doesn't.

$ jq -r 'select(.llm_meta.tool_call) | .llm_meta.tool_call.name' session.jsonl | \
  sort | uniq -c | sort -rn

4394 Bash      # 66.0% - Tests, builds, git operations
 931 Read      # 14.0% - Reading files
 733 Write     # 11.0% - Editing files
 163 Search    #  2.4% - Code search
  67 Explore   #  1.0% - Exploration

Why this matters: If two-thirds of Claude's tool calls are shell commands, maybe your build is slow. Maybe your tests take forever. Maybe better feedback would help the AI make faster decisions.

Finding 2: 15% of Your Output is Garbage

Those horizontal separator lines (───────────────)? They show up 42,174 times in a single session.

Total output lines:    284,489
Separator lines:        42,174
Percentage:            14.8%

Why this matters: When you’re measuring latency or trying to understand what actually happened, that UI noise pollutes everything. With telemetry metadata, you can filter it out:

# Skip all the progress junk, get real events only
$ jq 'select(.llm_meta.milestone != "progress")' session.jsonl

Finding 3: Tool Call Detection is Tricky

We discovered 3,355 “hits” for get_weather — except we never called that. Why? The AI was showing us example XML in the conversation, and our parser thought it was real.

The lesson: it's easy to generate false positives, and hard to know you're wrong until you look at the data. We fixed it by distinguishing the two cases (a minimal sketch follows this list):

  • Real tool calls: ⏺ Read(file.go) (Claude Code format)
  • Examples: <function_calls> in conversation
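
The real parser is more involved, but a minimal sketch of that distinction might look like this in Go. The ⏺ line format and the <function_calls> markers are taken from the two cases above; everything else is assumption.

// Separate real tool calls (Claude Code's "⏺ Name(input)" lines) from
// example XML quoted inside the conversation. Sketch only; ait's
// actual detection logic differs in the details.
package parse

import (
    "regexp"
    "strings"
)

// e.g. "⏺ Read(src/auth.go)" -> name "Read", input "src/auth.go"
var toolCallLine = regexp.MustCompile(`^⏺\s+(\w+)\((.*)\)\s*$`)

type Detector struct {
    inExample bool // true while inside quoted example XML
}

// ToolCall returns the tool name and input for a real tool-call line,
// and ok=false for everything else, including quoted examples.
func (d *Detector) ToolCall(line string) (name, input string, ok bool) {
    trimmed := strings.TrimSpace(line)

    // Track example blocks so their contents are never counted.
    switch {
    case strings.Contains(trimmed, "<function_calls>"):
        d.inExample = true
        return "", "", false
    case strings.Contains(trimmed, "</function_calls>"):
        d.inExample = false
        return "", "", false
    case d.inExample:
        return "", "", false
    }

    m := toolCallLine.FindStringSubmatch(trimmed)
    if m == nil {
        return "", "", false
    }
    return m[1], m[2], true
}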

The Shift: From Logs to Queries

Before telemetry, asking questions about your session meant reading tea leaves.

Now, you can ask:

# Which files got read most?
$ jq -r 'select(.llm_meta.tool_call.name == "Read") |
  .llm_meta.tool_call.input.value' session.jsonl | sort | uniq -c

# How are events distributed over time? (eyeball the gaps between tool calls)
$ jq -r '.timestamp' session.jsonl | awk -F'T' '{print $2}' |
  sort | uniq -c

# What's the actual model being used?
$ jq '.llm_meta.model' session.jsonl | sort | uniq -c
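
jq is plenty for one session. Once you have a directory full of exported sessions, the same questions are just as easy to answer in a few lines of Go. A hypothetical sketch (the sessions/ layout and field names are assumptions) that tallies tool-call names across every .jsonl file:

// Tally tool-call names across all *.jsonl files in sessions/.
// Hypothetical sketch built on the event shape sketched earlier.
package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
)

type event struct {
    LLMMeta struct {
        ToolCall *struct {
            Name string `json:"name"`
        } `json:"tool_call"`
    } `json:"llm_meta"`
}

func main() {
    counts := map[string]int{}

    files, _ := filepath.Glob("sessions/*.jsonl") // fixed pattern, so no error to handle
    for _, path := range files {
        f, err := os.Open(path)
        if err != nil {
            continue
        }
        sc := bufio.NewScanner(f)
        sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // long lines happen
        for sc.Scan() {
            var e event
            if json.Unmarshal(sc.Bytes(), &e) != nil {
                continue // skip malformed lines
            }
            if e.LLMMeta.ToolCall != nil {
                counts[e.LLMMeta.ToolCall.Name]++
            }
        }
        f.Close()
    }

    for name, n := range counts {
        fmt.Printf("%6d %s\n", n, name)
    }
}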

Why You Want This

You just turned unreadable noise into queryable data. That matters in two ways:

For Builders

  • Optimize your workflow: See what’s actually slow. Is it the AI? Your tests? Your code? Data reveals the truth that intuition misses.
  • Model comparison: Run the same task with Claude vs. Copilot vs. Gemini, compare tool usage patterns. Who finishes first? Who finds the most bugs?
  • Cost tracking: Know exactly which model was used (not just what you requested). Spot drift before it explodes your bill.

For AI Tool Developers

  • Real usage data: Understand how users actually interact with your tool. Stop guessing. Start measuring.
  • Telemetry without the creep factor: Privacy-first and completely anonymous, so users can trust it.
  • Query-driven insights: Not logs, actual data you can aggregate and analyze. Scale your understanding across hundreds of sessions.

The Lane Decision

Here’s the interesting part: I needed to build this in Go, but Go isn’t my primary lane.

Rails is my lane. I’ve shipped a dozen SaaS apps. I can spot architecture problems before they happen. I know the ecosystem deeply.

But this problem required Go. Not because I was chasing shiny objects. Because:

  1. Single binary distribution - One executable, works on any machine. No dependency hell for users.
  2. Performance - Parsing 284k lines of output needs to be fast. Go handles it efficiently.
  3. Type safety - CLI tools have weird edge cases. Go’s type system catches them before production.
  4. Concurrency - Future versions need to handle parallel sessions. Go’s goroutines make this natural.

Rails would’ve been slower to distribute and architectural overkill. Node adds dependency management overhead. Python forces users to install packages. Go was the right tool for the job.

The key: I didn’t wing it.

The Vibe Engineering manifesto says: “Use AI to go 10x faster in areas you know well. For new domains, learn the fundamentals first—then let AI accelerate you.”

Here's how I de-risked the lane change:

Day 1 - Fundamentals: I spent time learning Go patterns—error handling, interfaces, concurrency primitives, testing patterns. Not casual prompting. Intentional study. Just enough to know what good looks like, so I could evaluate Claude’s suggestions.

Architecture-first: Before any code, I worked through the design with Codex as the planner. What are the data structures? What are the failure modes? What does the test suite need to cover? Then I knew what I was building.

TDD from the start: Every feature was test-first. This forced clarity about contracts and edge cases. It also meant Claude Code was working within constraints—the tests defined success, not my fuzzy requirements.

Dual review cycle: Claude Code built the implementation. Codex reviewed it with fresh eyes, caught edge cases I missed, suggested optimizations. Two AI agents, different personas, both playing to their strengths.

Why This Was a “Struggle”

The rigor helped, but early testing revealed something I wasn’t prepared for: terminal corruption from control character overflow.

We were parsing ANSI codes and control characters from raw CLI output. Something in the logic—I can’t fully explain what—was causing buffer issues or character encoding problems. Terminal tabs would freeze. Sessions would crash. I’d have to kill the tab and restart.

This is the kind of edge case I'd hoped Go's type system would catch, but I didn't know Go well enough to spot it immediately. I had to debug empirically: run a test, watch the terminal corrupt, kill it, adjust the logic, try again.

That’s the struggle. Not architectural. Not algorithmic. Just unfamiliar territory + complex input data = mysterious failures.

But here’s where experience mattered: I knew I wasn’t going crazy. I knew this was a solvable problem. I didn’t panic and rewrite everything in Rails. I stayed in the debug loop, narrowed the reproduction, isolated the parsing logic, and eventually found the issue.
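
For anyone who hits the same wall: the general defensive pattern is to strip ANSI escape sequences and other non-printable control characters from the raw stream before it gets logged or re-emitted. A minimal Go sketch of that pattern (not ait's exact fix):

// Strip ANSI escape sequences and non-printable control characters
// from a line before it's logged or re-emitted. Sketch of the
// defensive pattern only, not ait's actual implementation.
package sanitize

import (
    "regexp"
    "strings"
)

// Matches CSI sequences like "\x1b[2K" or "\x1b[38;5;208m" and OSC
// sequences like "\x1b]0;title\x07".
var ansiEscape = regexp.MustCompile(`\x1b\[[0-9;?]*[ -/]*[@-~]|\x1b\][^\x07]*\x07`)

func Clean(line string) string {
    line = ansiEscape.ReplaceAllString(line, "")
    // Drop remaining control characters except tab; stray carriage
    // returns and backspaces are what tend to corrupt a terminal.
    return strings.Map(func(r rune) rune {
        if r == '\t' || (r >= 0x20 && r != 0x7f) {
            return r
        }
        return -1
    }, line)
}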

A vibe coder would’ve given up or shipped broken code. An experienced engineer knows: “This is weird, but it’s solvable. Keep going.”

The result: 47 commits, 14 PRs, 32 new tests across 14 packages, zero regressions. The struggle wasn’t failure—it was the messy middle where you don’t fully understand your tools but you’re stubborn enough to figure it out.

That’s what staying in your lane means at this level:

It’s not “only work in Rails.” It’s “know enough to evaluate. Design first. Write tests first. Then let AI accelerate you through the implementation.”

Experience isn’t about mastery of every language. It’s about knowing how to reduce risk before moving fast.

What’s Next

The foundation is solid. Obvious next steps:

  1. Dashboard: Visualize tool usage patterns over time
  2. Cross-tool comparison: Run same task with different AI assistants, see the differences
  3. Cost analysis: Aggregate token usage and actual model usage across team
  4. Predictive insights: Which tools tend to finish fastest? Which models solve your problems best?

But here’s the truth: These are probably YAGNI.

Right now, the telemetry works. It exports data. I can query it with jq. That’s enough. I don’t need a dashboard until I have enough data to show meaningful patterns. I don’t need cross-tool comparison until someone actually asks for it. I don’t need cost analysis until I’m running this across a team.

Building features “just in case” is how side projects become maintenance burdens. The smart move: ship what works, let real usage drive the next features. If users pile up requests for a dashboard, I’ll build one. If they’re happy with JSONL exports, I won’t waste time on visualization.

The best code is code you didn’t write.


Built with: Go, JSONL, jq-friendly output
Test coverage: 32 new tests across 14 packages
Session analyzed: 284,489 lines → queryable structured data
Lane tension: Real, worked through, still learning
Next features: Only if needed

Today's Metrics

Commits: 47
Lines Added: +2847
Lines Removed: -156
AI Prompts: 8
Hours: 4.5
