Last updated: 2026-03-30

Before You Build an AI Product: The Architecture Questions Nobody Asks

AI product architecture review. Model selection, RAG vs fine-tuning, cost calculation, when NOT to use AI.

TL;DR

Most AI products fail not because of bad models but because of bad architecture decisions made before the first prompt was written. This guide covers the questions you must answer before building: How much latency can your users tolerate? What's your cost per request at scale? Is AI even the right solution? You'll get a model selection framework, a RAG vs fine-tuning decision tree, and a realistic cost calculator — because "we'll use GPT-4" is not an architecture.

Step 1: The Latency-Cost-Accuracy Triangle

Every AI product must make explicit tradeoffs between three competing constraints. You cannot optimize all three simultaneously.

The Triangle

         ACCURACY
           /\
          /  \
         /    \
        /  Pick \
       /   Two   \
      /____________\
   LATENCY      COST

Concrete Tradeoffs

Priority                | Architecture Implication
------------------------+--------------------------------------------
Accuracy + Low Latency  | Expensive. Large model, powerful GPU, edge
                        | deployment. Budget: $$$ per request.

Accuracy + Low Cost     | Slow. Use smaller model with multiple passes,
                        | caching, async processing. User waits.

Low Latency + Low Cost  | Less accurate. Smaller/faster models, fewer
                        | tokens, simpler prompts. Good enough wins.

Define Your Constraints

Before choosing any technology, fill in these numbers:

Maximum acceptable latency (p95): _____ ms
  - Conversational chat: 500-2000ms for first token
  - Search/recommendation: 200-500ms
  - Batch processing: minutes to hours (doesn't matter)
  - Real-time analysis: 50-200ms

Target cost per request: $_____ 
  - Consumer product (high volume): $0.001-0.01
  - B2B SaaS (medium volume): $0.01-0.10
  - Enterprise tool (low volume): $0.10-1.00
  - Internal tool: whatever it costs

Minimum accuracy threshold: _____% 
  - Medical/legal advice: 99%+ (probably don't use AI alone)
  - Content generation: 85-95% (human review catches rest)
  - Classification/routing: 90-98%
  - Search/recommendation: 80-90%
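
The worksheet above can be turned into a quick sanity check. The cutoffs below (500 ms, $0.01, 95%) are illustrative assumptions drawn from the ranges above, not universal rules:

```python
from dataclasses import dataclass

@dataclass
class Constraints:
    max_latency_ms: int          # p95 latency target
    max_cost_per_request: float  # USD
    min_accuracy: float          # 0.0 - 1.0

def check_triangle(c: Constraints) -> str:
    """Flag constraint sets that try to win all three corners at once."""
    strict = []
    if c.max_latency_ms < 500:
        strict.append("latency")
    if c.max_cost_per_request < 0.01:
        strict.append("cost")
    if c.min_accuracy > 0.95:
        strict.append("accuracy")
    if len(strict) == 3:
        return "infeasible: pick two of " + ", ".join(strict)
    return "feasible: prioritizing " + (", ".join(strict) or "nothing (all relaxed)")

print(check_triangle(Constraints(200, 0.005, 0.98)))
# infeasible: pick two of latency, cost, accuracy
```

If the check flags all three constraints as strict, relax one deliberately before choosing any technology.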

Step 2: Model Selection Framework

Choosing a model is not about picking the "best" one. It's about picking the right one for your constraints.

Decision Matrix

Use Case                | Recommended Model Tier              | Why
------------------------+-------------------------------------+------------------------
Simple classification   | Small (Llama 3 8B, Mistral 7B)      | Fast, cheap, sufficient
Named entity extraction | Small/Medium                        | Structured output
Summarization           | Medium (Llama 3 70B, Claude Haiku)  | Balance of quality/cost
Code generation         | Large (GPT-4o, Claude Sonnet)       | Needs reasoning
Complex reasoning       | Large (GPT-4o, Claude Opus)         | Accuracy critical
Multi-modal (images)    | Large (GPT-4o, Claude Sonnet)       | Limited options
Embeddings/search       | Embedding model (text-embedding-3)  | Purpose-built
Real-time conversation  | Medium with streaming               | Latency matters

Self-Hosted vs API

Factor               | Self-Hosted (Ollama, vLLM)   | API (OpenAI, Anthropic)
---------------------+------------------------------+------------------------
Latency control      | Full control                 | Depends on provider
Cost at low volume   | High (GPU costs)             | Low (pay per token)
Cost at high volume  | Lower per request            | Linear scaling
Data privacy         | Data stays on-premise        | Leaves your infra
Model updates        | Manual                       | Automatic
Operational burden   | High (GPU, VRAM, scaling)    | None
Max model quality    | Limited by your GPU          | State of the art

Break-even point: ~100,000 requests/day is where self-hosting
becomes cheaper than API calls (rough estimate, model-dependent).
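
A minimal sketch of that break-even math, assuming a fixed monthly GPU cost and a flat per-request API price (both numbers are placeholders; plug in your own):

```python
def monthly_api_cost(requests_per_day: int, cost_per_request: float) -> float:
    """Pay-per-token cost scales linearly with volume."""
    return requests_per_day * 30 * cost_per_request

def breakeven_requests_per_day(gpu_monthly_cost: float,
                               api_cost_per_request: float) -> float:
    """Volume at which a fixed-cost GPU box beats the API.
    Ignores ops time and assumes one box can serve the load."""
    return gpu_monthly_cost / (api_cost_per_request * 30)

# e.g. a $3,000/month GPU server vs $0.001/request on a small-model API
print(breakeven_requests_per_day(3000, 0.001))  # 100000.0 requests/day
```

Note the caveat in the docstring: self-hosting also costs engineering time, which this calculation deliberately ignores.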

The "Start With API, Move to Self-Hosted" Pattern

This is almost always the right approach for new products:

  1. Build your product using an API provider (OpenAI, Anthropic)
  2. Abstract the LLM call behind an interface (never hardcode the provider)
  3. Measure actual usage patterns: tokens/request, requests/day, latency needs
  4. When cost or data privacy requires it, switch to self-hosted for specific workloads
  5. Keep the API for complex reasoning tasks where model quality matters most

# Good: Abstracted LLM interface
class LLMProvider:
    def complete(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError

class OpenAIProvider(LLMProvider):
    def complete(self, prompt, **kwargs):
        # OpenAI API call
        ...

class OllamaProvider(LLMProvider):
    def complete(self, prompt, **kwargs):
        # Local Ollama call
        ...

# Switch providers without changing business logic
llm = OpenAIProvider()  # or OllamaProvider()
result = llm.complete("Classify this ticket: ...")

Step 3: RAG vs Fine-Tuning Decision Tree

This is the most misunderstood decision in AI product architecture. Most teams default to RAG without understanding when fine-tuning is better — and vice versa.

Decision Tree

Does the model need to know facts specific to your domain?
├── NO → Use the base model with good prompting. You're done.
└── YES → Does the knowledge change frequently (weekly or more)?
    ├── YES → RAG (Retrieval-Augmented Generation)
    │   Reason: Fine-tuning on changing data is impractical.
    └── NO → Is it about KNOWLEDGE (facts) or BEHAVIOR (style/format)?
        ├── KNOWLEDGE → RAG
        │   Reason: RAG is better at injecting specific facts.
        └── BEHAVIOR → Fine-tuning
            Reason: Fine-tuning changes how the model responds.

Can you combine both?
└── YES, and this is often the best answer.
    Fine-tune for behavior + RAG for knowledge.
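
The tree above translates directly into code; a sketch:

```python
def rag_or_finetune(needs_domain_facts: bool,
                    knowledge_changes_weekly: bool,
                    need_is_behavior: bool) -> str:
    """Walks the decision tree above, top to bottom."""
    if not needs_domain_facts:
        return "base model + prompting"
    if knowledge_changes_weekly:
        return "RAG"
    if need_is_behavior:
        return "fine-tuning"
    return "RAG"

# Static domain facts that rarely change -> RAG
print(rag_or_finetune(True, False, False))  # RAG
```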

RAG Architecture Decisions

Component           | Options                        | Recommendation
--------------------+--------------------------------+------------------
Vector database     | Pinecone, Weaviate, Qdrant,    | pgvector if you
                    | pgvector, Chroma               | already use Postgres.
                    |                                | Qdrant for dedicated.
Embedding model     | text-embedding-3-small/large,  | text-embedding-3-small
                    | Cohere embed, BGE              | for most cases.
Chunk size          | 256-2048 tokens                | Start at 512, measure.
Chunk overlap       | 10-20% of chunk size           | 50-100 tokens.
Retrieval strategy  | Semantic, keyword, hybrid      | Hybrid (semantic +
                    |                                | BM25 keyword) is best.
Reranking           | Cohere rerank, cross-encoder   | Yes, always rerank.
                    |                                | Improves relevance 20-40%.

RAG Pipeline Example

# Simplified RAG pipeline

# 1. Ingestion (offline)
documents = load_documents("./knowledge_base/")
chunks = split_into_chunks(documents, size=512, overlap=50)
embeddings = embed_model.encode(chunks)
vector_db.upsert(chunks, embeddings)

# 2. Query (online, per-request)
query = "What is our refund policy?"
query_embedding = embed_model.encode(query)

# Retrieve top-k relevant chunks
candidates = vector_db.search(query_embedding, top_k=20)

# Rerank for relevance
reranked = reranker.rank(query, candidates, top_k=5)

# Build prompt with context
context = "\n\n".join([chunk.text for chunk in reranked])
prompt = f"""Answer the question based on the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {query}
Answer:"""

# Generate answer
answer = llm.complete(prompt)
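
The `split_into_chunks` helper above is left undefined; a minimal sketch, using whitespace-separated words as a rough token proxy (a production version would count real tokens with the embedding model's tokenizer):

```python
def split_into_chunks(text: str, size: int = 512, overlap: int = 50) -> list:
    """Sliding-window chunker: each chunk shares `overlap` words with
    the previous one, so sentences cut at a boundary stay retrievable."""
    words = text.split()
    if not words:
        return []
    step = max(size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```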

Step 4: Cost Calculation — The Math Nobody Does

Before building, calculate your projected costs at scale. Surprises here kill products.

Token Cost Calculator

# Per-request cost estimation

# Inputs (measure these from your prototype)
avg_input_tokens = 2000    # System prompt + RAG context + user input
avg_output_tokens = 500    # Model response

# Pricing (GPT-4o as of March 2026)
input_price_per_1k = 0.0025   # $2.50 per 1M input tokens
output_price_per_1k = 0.01    # $10.00 per 1M output tokens

# Per-request cost
cost_per_request = (avg_input_tokens / 1000 * input_price_per_1k) + \
                   (avg_output_tokens / 1000 * output_price_per_1k)
# = (2 * 0.0025) + (0.5 * 0.01) = $0.005 + $0.005 = $0.01

# Monthly cost projections
requests_per_user_per_day = 20
active_users = 1000
monthly_requests = requests_per_user_per_day * active_users * 30
# = 600,000 requests

monthly_llm_cost = monthly_requests * cost_per_request
# = 600,000 * $0.01 = $6,000/month

# Don't forget:
# + Embedding costs for RAG (~$0.00002 per query)
# + Vector DB hosting ($50-500/month)
# + Compute for self-hosted models ($500-5000/month per GPU)
# + Reranking costs if using API reranker

Cost Optimization Strategies

Strategy                       | Savings | Tradeoff
-------------------------------+---------+----------------------------------------
Caching common queries         | 30-60%  | Stale responses for some queries
Smaller model for simple tasks | 50-80%  | Lower accuracy on edge cases
Shorter system prompts         | 10-30%  | Less instruction, possibly worse output
Batch processing               | 50%     | Higher latency
Prompt compression             | 20-40%  | Slightly worse context understanding
Self-hosted for high-volume    | 60-80%  | Operational overhead
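
Caching is usually the first strategy worth implementing. A minimal exact-match cache sketch (only literal repeats hit; semantic caching with embeddings would also catch paraphrases):

```python
import hashlib
import time
from typing import Optional

class ResponseCache:
    """Exact-match LLM response cache with a TTL."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, response)

    def _key(self, model: str, prompt: str) -> str:
        # Normalize case and whitespace so trivial variations still hit
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        entry = self._store.get(self._key(model, prompt))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[self._key(model, prompt)] = (time.time(), response)
```

Check the cache before every LLM call and write through on a miss; even a modest hit rate compounds at volume.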

The Unit Economics Test

Revenue per user per month:     $___
AI cost per user per month:     $___
Other costs per user per month: $___

If AI cost > 30% of revenue per user, your unit economics
are broken. Either:
  1. Increase price
  2. Reduce AI costs (smaller model, caching, fewer calls)
  3. Reconsider whether AI is the right approach
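
The 30% rule of thumb as a one-liner:

```python
def unit_economics_ok(revenue_per_user: float, ai_cost_per_user: float,
                      threshold: float = 0.30) -> bool:
    """True if AI spend stays within `threshold` of per-user revenue."""
    return ai_cost_per_user <= revenue_per_user * threshold

# $20/month subscription; $6,000 monthly LLM bill across 1,000 users = $6/user
print(unit_economics_ok(20.0, 6.0))  # True (exactly at the 30% line)
```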

Step 5: When NOT to Use AI

The hardest architecture question: should you use AI at all?

Don't Use AI When:

  - A deterministic rule, regex, or database query already solves the problem reliably
  - Errors are unacceptable and cannot be caught by human review (the 99%+ accuracy cases above)
  - The unit economics fail the 30% test even after optimization
  - You have no way to build an evaluation set and measure quality

Use AI When:

  - The task involves unstructured language or content at a scale humans cannot handle
  - "Good enough" output is valuable and mistakes are cheap to catch
  - The requirements are fuzzy enough that rule-based systems keep breaking
  - The cost per request works at your projected volume

Troubleshooting & Considerations

"Our RAG pipeline gives irrelevant results"

Check in this order: (1) Are your chunks the right size? Too large = noise, too small = missing context. (2) Are you using a reranker? It makes a massive difference. (3) Is your embedding model appropriate for your domain? (4) Are you using hybrid search (semantic + keyword)? Pure semantic search misses exact matches.

"The AI is too slow for our users"

First, measure where the time goes. Common breakdown: (1) Embedding the query: 50-200ms. (2) Vector search: 20-100ms. (3) Reranking: 100-300ms. (4) LLM generation: 500-5000ms. Use streaming for the LLM response — first-token latency matters more than total time. Cache embeddings. Use a faster vector DB if search is the bottleneck.
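
A lightweight way to get that breakdown is to time each stage explicitly. A sketch using a context manager (the commented-out calls stand in for your real pipeline):

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> milliseconds

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# Wrap each stage of the request path:
with timed("embed"):
    pass  # query_embedding = embed_model.encode(query)
with timed("search"):
    pass  # candidates = vector_db.search(query_embedding, top_k=20)
with timed("generate"):
    pass  # answer = llm.complete(prompt)
print(timings)
```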

"Costs are higher than projected"

Measure actual token usage — it's almost always higher than estimates because system prompts, context, and conversation history grow. Implement token counting. Add caching for repeated queries (semantic caching with embeddings can catch paraphrased duplicates). Route simple queries to cheaper models.

"The model hallucinates too much"

Reduce context window noise (send only the most relevant chunks). Add explicit instructions: "If the answer is not in the provided context, say I don't know." Use structured output (JSON mode) for extractive tasks. Lower the temperature. Consider using citations: ask the model to quote the source chunk.

Prevention & Best Practices

Build Evaluation First

Before building the product, build the evaluation pipeline. Create a test set of 50-100 question-answer pairs. Run every architecture change against this test set. If you can't measure quality, you can't improve it.
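
A minimal harness for such a test set might look like this. Exact substring match is a deliberately crude scoring function; swap in semantic similarity or an LLM judge as your needs grow:

```python
def run_eval(test_set, answer_fn) -> float:
    """Scores answer_fn against (question, expected_answer) pairs."""
    hits = 0
    for question, expected in test_set:
        if expected.lower() in answer_fn(question).lower():
            hits += 1
    return hits / len(test_set)

# Wire up a stub first, then swap in the real pipeline
test_set = [("What is the refund window?", "30 days")]
print(run_eval(test_set, lambda q: "Refunds are accepted within 30 days."))  # 1.0
```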

Log Everything

Log every LLM call: input, output, latency, token count, cost, model version. This data is invaluable for debugging, cost optimization, and fine-tuning later. Use a tool like LangSmith, Weights & Biases, or a simple structured logging setup.
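
A sketch of such a wrapper using only the standard library. The token counts here are word-count estimates; in practice, read the exact counts from the provider's usage response:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def logged_complete(llm, prompt: str, model: str) -> str:
    """Calls `llm(prompt)` and emits one structured log record per call."""
    start = time.perf_counter()
    output = llm(prompt)
    record = {
        "model": model,
        "input_tokens_est": len(prompt.split()),   # rough word-based estimate
        "output_tokens_est": len(output.split()),
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
    }
    log.info(json.dumps(record))
    return output
```

Because each record is one JSON line, the logs feed directly into cost dashboards or later fine-tuning datasets.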

Plan for Model Changes

Models improve and change. Pricing changes. Providers deprecate versions. Abstract your LLM calls behind an interface so you can switch models without rewriting business logic. Test new models against your evaluation set before switching.

Start Simple, Add Complexity

Build the simplest possible version first: a basic prompt with the base model, no RAG, no fine-tuning. Measure quality. Then add RAG if the model lacks domain knowledge. Then add reranking if relevance is poor. Then consider fine-tuning if behavior needs adjustment. Each layer should demonstrably improve your evaluation metrics.

Security and Privacy

Define data classification before building. PII in prompts? Require a data processing agreement with API providers or self-host. User inputs as training data? Most providers offer opt-out, but verify. Prompt injection? Implement input sanitization and output validation. This is not optional — it is architecture.

Need Expert Help?

Want your AI architecture reviewed by an expert? €200, 60-min deep dive + written report.

100% money-back guarantee

Harald Roessler

Infrastructure Engineer with 20+ years experience. Founder of DSNCON GmbH.