Industry

AI infrastructure that ships and stays cheap.

LLM-native engineering, real evals, and infra cost discipline.

Most AI startups burn the same way: a brilliant prototype, a six-month effort to "make it production," and an OpenAI bill that grows faster than revenue. Sage Ideas builds AI-native systems the way they should be built — RAG pipelines with measurable evals, prompt versioning under source control, LLM cost tracking per request, and the boring infrastructure that turns a demo into a real product.

Why us

Why Sage Ideas for AI Startups

LLM orchestration in production: tool-use, function-calling, structured output, multi-step agents, and the streaming UX that does not fall apart on retries — across OpenAI, Anthropic, Gemini, and open-weight models.
RAG built like a search engine, not a magic trick: chunking strategy informed by your data, hybrid sparse-plus-dense retrieval, reranking with cross-encoders, and recall@k metrics you can actually optimize (see the fusion sketch after this list).
Eval frameworks that catch regressions before they ship: golden datasets, LLM-as-judge with calibration, A/B-able prompts under version control, and CI that fails when quality drops.
Cost discipline: per-request token attribution, model routing (use cheap models for cheap tasks), caching layers, and dashboards that show the unit economics of every feature.
Vector database choice driven by your access patterns — pgvector for under 10M vectors, Pinecone or Weaviate when you need namespace isolation and high QPS, Turbopuffer when cost is the constraint.
Rapid sprint cadence appropriate for AI products: weekly user feedback loops, prompt tweaks shipped behind feature flags, and the discipline to know when to ship a new model and when to retrieve better.
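
Hybrid retrieval usually comes down to merging two ranked lists. Here is a minimal sketch of reciprocal rank fusion (RRF), assuming `sparse_hits` and `dense_hits` are ranked document-ID lists from a BM25 index and a vector index respectively (both names are illustrative):

```python
# Minimal sketch of reciprocal rank fusion (RRF) for hybrid retrieval.
# `sparse_hits` and `dense_hits` are ranked document-ID lists from a
# BM25 index and a vector index respectively (names are illustrative).

def rrf_fuse(sparse_hits: list[str], dense_hits: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists; k=60 is the conventional RRF constant."""
    scores: dict[str, float] = {}
    for hits in (sparse_hits, dense_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization across the two retrievers, which is why it is a sane default; a cross-encoder reranker then reorders the fused top results before they reach the LLM.
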
Challenges

What we solve

The specific operational challenges we've already debugged in the AI stack.

A demo that breaks at scale

The single-user prototype hits 50 RPS and falls apart — rate limits, hot keys, streaming connections that pile up, and timeouts that take the whole worker pool down. We harden the request lifecycle: queues, backoff, circuit breakers, and the model-routing layer that keeps latency budgets intact.
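
As a concrete flavor of that hardening, here is a minimal sketch of full-jitter exponential backoff around a provider call; `call_model` and `RateLimitError` are stand-ins for your client's real call and its 429 exception:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your client's 429 exception."""

def call_model(payload: dict) -> dict:
    raise RateLimitError  # placeholder for the real provider call

def call_with_backoff(payload: dict, max_attempts: int = 5, base: float = 0.5) -> dict:
    for attempt in range(max_attempts):
        try:
            return call_model(payload)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Full-jitter exponential backoff keeps retries from
            # synchronizing into a thundering herd at 50+ RPS.
            time.sleep(random.uniform(0, base * 2 ** attempt))
```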

No evals — every prompt change is a coin flip

You change the system prompt, ship it, and find out a week later it broke a use case nobody tested. We build a golden dataset, an LLM-as-judge harness with calibration against human ratings, and CI that blocks merges when quality regresses on any segment.
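
A golden-dataset gate can start very small. A minimal sketch, assuming cases live in a JSONL file of inputs and expected substrings, with `run_prompt` standing in for your LLM call; pytest runs this in CI and a failure blocks the merge:

```python
import json

def run_prompt(text: str) -> str:
    raise NotImplementedError  # your LLM call goes here

def test_golden_dataset():
    with open("evals/golden.jsonl") as f:
        cases = [json.loads(line) for line in f]
    failures = [c for c in cases if c["must_contain"] not in run_prompt(c["input"])]
    # A failing assert here fails CI and blocks the merge.
    assert not failures, f"{len(failures)}/{len(cases)} golden cases regressed"
```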

OpenAI bill growing faster than revenue

No per-feature token attribution, no caching, no model routing, no cap on runaway agent loops. We instrument cost per request and per feature, push easy work to cheaper models, add semantic and exact-match caches, and surface the unit economics every PM should be staring at.
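
The attribution piece is mostly bookkeeping. A minimal sketch, assuming usage token counts come back on the provider response (as they do for OpenAI- and Anthropic-style APIs); the model names and prices here are illustrative, not quotes:

```python
from dataclasses import dataclass

# Illustrative prices in USD per 1M (input, output) tokens; keep real
# prices in config, not code.
PRICE_PER_MTOK = {"small-model": (0.15, 0.60), "large-model": (2.50, 10.00)}

@dataclass
class RequestCost:
    feature: str
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def usd(self) -> float:
        p_in, p_out = PRICE_PER_MTOK[self.model]
        return (self.input_tokens * p_in + self.output_tokens * p_out) / 1_000_000
```

Write one row per request to your warehouse and "cost per feature" becomes a SQL query instead of a month-end surprise.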

RAG quality flatlining at 60%

You have a vector DB, an embedding model, and a "good enough" retrieval step — but answers are wrong half the time. We diagnose with recall@k metrics, fix the chunking and hybrid retrieval, add a reranker, and prove the lift on a held-out eval set rather than vibes.
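
The diagnosis starts with a number, not a feeling. A minimal sketch of recall@k over a labeled retrieval set, where each query maps to the set of document IDs that should come back:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Run it at k=5, 10, 20 on a held-out set before and after each change
# (chunking, hybrid retrieval, reranker) to prove the lift is real.
```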

Engagements

Recommended tiers

Productized engagements ordered by relevance to AI startup workloads.

Proof

Relevant work

FAQ

AI Startups questions

What evals framework do you use?

We avoid frameworks for frameworks' sake. The minimum viable eval stack is: a golden dataset of 50–200 representative inputs and expected behaviors, a deterministic test runner (pytest or vitest works fine), and an LLM-as-judge prompt calibrated against human-rated samples to confirm it correlates. CI runs the suite on every prompt change, comparing pass rate and segment-level metrics against the previous prompt. For richer needs we use Braintrust, Langfuse, or Inspect — but only after the basics are in place.
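
The calibration step is the part teams skip. A minimal sketch, assuming a hypothetical `judge_score` that wraps your judge prompt and returns a numeric rating, checked against human ratings with Pearson correlation (`statistics.correlation`, Python 3.10+):

```python
from statistics import correlation  # Pearson, Python 3.10+

def judge_score(output: str) -> float:
    raise NotImplementedError  # wraps the judge prompt, returns e.g. 1-5

def judge_is_calibrated(samples: list[tuple[str, float]], threshold: float = 0.7) -> bool:
    """samples: (model_output, human_rating) pairs from your rated set."""
    judge = [judge_score(out) for out, _ in samples]
    human = [rating for _, rating in samples]
    return correlation(judge, human) >= threshold
```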

How do you keep LLM costs under control?

Four levers. First, attribution — every request is tagged with feature, user, and model so cost per feature is queryable. Second, routing — cheap models (gpt-4o-mini, Haiku, Flash) handle classification and extraction, expensive models handle generation only when needed. Third, caching — exact-match caches for deterministic prompts, semantic caches for retrieval-heavy flows. Fourth, hard limits — per-user and per-feature token budgets enforced server-side, so a runaway agent loop cannot torch the bill before someone notices.
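
The fourth lever is the simplest to sketch. A minimal server-side token budget, with an in-memory counter standing in for what would be Redis or Postgres in production; the feature names and budget numbers are illustrative:

```python
from collections import defaultdict

BUDGETS = {"summarize": 2_000_000, "agent": 500_000}  # tokens per day per feature
_spent: dict[str, int] = defaultdict(int)  # Redis in production, not a dict

class BudgetExceeded(Exception):
    pass

def charge(feature: str, tokens: int) -> None:
    if _spent[feature] + tokens > BUDGETS[feature]:
        # Fail the request loudly instead of letting an agent loop run all night.
        raise BudgetExceeded(f"{feature} exceeded its daily token budget")
    _spent[feature] += tokens
```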

Which vector database should we use?

For most products under 10M vectors with predictable QPS, pgvector on Postgres is the right answer — one database, transactional consistency with your application data, no separate operational surface. Pinecone or Weaviate make sense when you need multi-tenant namespace isolation, high QPS with sub-50ms latency, or features like hybrid search out of the box. Turbopuffer is excellent when you have hundreds of millions of vectors and cost is dominant. We will not pick the trendy answer; we will pick what your access pattern justifies.
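
For the pgvector path, the query is plain SQL. A minimal sketch using psycopg against a table with an `embedding vector(1536)` column, where `<=>` is pgvector's cosine-distance operator; the table and column names are illustrative:

```python
import psycopg

def top_k(conn: psycopg.Connection, query_embedding: list[float], k: int = 10):
    # pgvector accepts a bracketed literal cast to ::vector.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return conn.execute(
        """
        SELECT id, content, embedding <=> %s::vector AS distance
        FROM chunks
        ORDER BY distance
        LIMIT %s
        """,
        (vec, k),
    ).fetchall()
```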

How do you version and test prompts?

Prompts live in source control as templated files (typically Markdown or TOML), not in a database. Every change ships as a PR, runs through the eval suite in CI, and is deployed behind a feature flag so it can be rolled back instantly. We version model + prompt as a unit because they co-evolve. For richer experimentation we plug in Braintrust or Langfuse, but the source of truth is always the repo.
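
A minimal sketch of that loading path, assuming prompts live as TOML files that pair a model with a template (the file layout and field names are illustrative); `tomllib` requires Python 3.11+:

```python
import tomllib  # Python 3.11+
from pathlib import Path

def load_prompt(name: str) -> dict:
    # Model and prompt ship as one file because they co-evolve;
    # `git log` on this path is the change history.
    return tomllib.loads(Path(f"prompts/{name}.toml").read_text())

def render(name: str, **values: str) -> tuple[str, str]:
    p = load_prompt(name)
    return p["model"], p["template"].format(**values)
```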

Can you build agents — or do you think they are overhyped?

Both. Multi-step tool-using agents work for narrow, well-bounded tasks where the action space is small and reversible: data extraction, code review, scheduled research, narrow customer-support flows. They struggle when the action space is large, irreversible, or requires real judgment under ambiguity. We build agents with hard step limits, tool-call budgets, structured outputs, deterministic fallbacks, and full observability into every reasoning trace — because debugging an agent that "just stopped working" without traces is genuinely awful.
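
A minimal sketch of the bounded loop, with `plan_next`, `TOOLS`, and `escalate_to_human` as hypothetical stand-ins for one structured-output LLM call, your tool registry, and the deterministic fallback:

```python
MAX_STEPS = 8
MAX_TOOL_CALLS = 12
TOOLS: dict = {}  # your tool registry: name -> callable

def plan_next(task: str, trace: list[dict]) -> dict:
    raise NotImplementedError  # one structured-output LLM call

def escalate_to_human(task: str, trace: list[dict]) -> str:
    return "escalated"  # deterministic fallback, never a silent stall

def run_agent(task: str) -> str:
    trace: list[dict] = []  # full trace of every step, for observability
    tool_calls = 0
    for _ in range(MAX_STEPS):
        action = plan_next(task, trace)
        trace.append(action)
        if action["type"] == "final":
            return action["answer"]
        tool_calls += 1
        if tool_calls > MAX_TOOL_CALLS:
            break
        action["result"] = TOOLS[action["tool"]](**action["args"])
    return escalate_to_human(task, trace)
```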

Topics: AI startup engineering, LLM application development, RAG implementation consultant, AI evals framework, LLM cost optimization, vector database consultant, AI agent development, AI startup CTO for hire, prompt engineering production, AI MVP development, AI infrastructure consultant, LangChain alternative consultant

Bring your demo, your eval gap, or your runaway OpenAI bill — we will turn it into infrastructure.

Book a 30-minute discovery call. We'll talk through your AI stack and tell you directly which engagement — if any — is the right fit.

Book a Discovery Call