The Model Architecture Decisions That Define Your System's Ceiling

Framework selection, fine-tuning vs. RAG analysis, training approach specification, and a benchmark methodology — the model architecture decisions made with rigor, not gut feel.

Duration: 3–5 days · Team: 1 Senior ML Architect

You might be experiencing...

You are building an LLM-powered product and can't decide between fine-tuning your own model and building a RAG pipeline. The blog posts say different things and your team is split — you need a structured decision, not more opinions.
Your ML team is debating PyTorch vs. JAX for your next model. The debate has been running for three weeks, nobody is moving, and you are about to miss a milestone because of an unresolved framework decision.
You need a real-time fraud detection model. Half your team wants a neural network, the other half wants gradient-boosted trees. You don't have a benchmark methodology to settle it — just intuitions.
You are about to start training a large model and have no formal benchmark methodology. You don't know what metrics matter, what test set to use, or how to compare model versions in a way your business stakeholders will accept.

The Model Design & Selection sprint resolves the architecture decisions that determine your ML system’s capability ceiling — before you build infrastructure around the wrong choice.

Why Model Architecture Decisions Fail

Most ML model architecture decisions fail in one of three ways:

By default — the team uses the architecture they already know. PyTorch because the last project used PyTorch. Fine-tuning because someone read a fine-tuning tutorial. The decision is never made — it just happens.

By debate — the team identifies the right frameworks and approaches, forms opinions, and then cannot converge. The debate runs for weeks because there is no structured decision process and no agreed criteria.

By premature commitment — the team makes a decision quickly to unblock execution, without documenting the constraints that drove it. Six months later, when those constraints change, the decision gets relitigated — and nobody remembers why the original choice was made.

The Constraints That Drive Correct Decisions

Every model architecture decision is correct or incorrect relative to a specific set of constraints. The same use case has a different correct answer depending on:

Latency requirements — a fine-tuned 7B-parameter model typically serves requests faster than a RAG pipeline, which adds retrieval overhead and longer prompts to every query. If you need sub-100ms inference, that matters.

Training data availability — fine-tuning requires thousands of high-quality labelled examples. RAG requires a document corpus and retrieval infrastructure. The correct choice depends on what you have and can obtain.

Inference budget — a large fine-tuned model running on GPU is expensive at scale. A retrieval-augmented pipeline over a smaller model may achieve comparable quality at lower cost. The cost model needs to be explicit.

Team capabilities — the correct architecture for a team that has run RAG pipelines before is different from the correct architecture for a team that has never done retrieval. We design for your team’s actual capabilities, not an idealised team.
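The inference-budget constraint above can be made explicit with a back-of-envelope per-query cost comparison. A minimal sketch, with all prices and volumes as illustrative placeholders to be replaced by your own measurements:

```python
# Back-of-envelope per-query cost: dedicated fine-tuned model vs. RAG.
# Every number here is a placeholder, not a quoted price.

def fine_tuned_cost_per_query(gpu_hourly_usd: float, queries_per_hour: float) -> float:
    """Amortised GPU cost per query for a dedicated fine-tuned model."""
    return gpu_hourly_usd / queries_per_hour

def rag_cost_per_query(base_model_cost: float, retrieval_cost: float,
                       extra_context_tokens: int, cost_per_1k_tokens: float) -> float:
    """RAG adds a retrieval step plus the cost of the retrieved context tokens."""
    return base_model_cost + retrieval_cost + (extra_context_tokens / 1000) * cost_per_1k_tokens

ft = fine_tuned_cost_per_query(gpu_hourly_usd=2.50, queries_per_hour=10_000)
rag = rag_cost_per_query(base_model_cost=0.0002, retrieval_cost=0.00005,
                         extra_context_tokens=2_000, cost_per_1k_tokens=0.0005)
print(f"fine-tuned: ${ft:.6f}/query  RAG: ${rag:.6f}/query")
```

With these particular placeholder numbers the fine-tuned model is cheaper per query, but the ranking flips as query volume falls or context length grows, which is exactly why the cost model needs to be written down rather than assumed.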

What the Decision Documentation Delivers

Architecture decisions documented with explicit rationale survive team changes. When a new engineer joins and asks why the system uses RAG instead of fine-tuning, the answer is in the decision log — not in someone’s memory, not in a Slack thread, not in a document that says “we chose RAG” without saying why.

Engagement Phases

Day 1

Use Case & Constraints Mapping

Structured analysis of your use case requirements — latency, throughput, accuracy targets, training data availability, inference budget, and team capabilities. We map every constraint that will determine the correct model architecture and framework choice. This phase produces the decision criteria that drive the rest of the sprint.
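The output of this phase is a constraints record that the rest of the sprint scores options against. One way to capture it, with field names that are illustrative rather than a fixed schema:

```python
from dataclasses import dataclass

# Sketch of a Day 1 constraints record. Field names and values are
# illustrative -- the real record is built from your requirements.

@dataclass(frozen=True)
class UseCaseConstraints:
    p99_latency_ms: int              # hard latency budget at the 99th percentile
    throughput_qps: int              # expected peak queries per second
    labelled_examples: int           # high-quality labelled examples available today
    monthly_inference_budget_usd: float
    team_has_rag_experience: bool
    data_can_leave_premises: bool    # governs open-source vs. API-based models

constraints = UseCaseConstraints(
    p99_latency_ms=100,
    throughput_qps=50,
    labelled_examples=1_200,
    monthly_inference_budget_usd=5_000.0,
    team_has_rag_experience=True,
    data_can_leave_premises=False,
)
print(constraints)
```

Freezing the dataclass is deliberate: the constraints are fixed at the start of the evaluation so later phases argue against a stable baseline, not a moving target.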

Days 2–3

Model Architecture Evaluation

Systematic evaluation of the model architecture options against your documented constraints. For LLM use cases: fine-tuning vs. RAG vs. prompt engineering vs. hybrid approaches, with analysis of data requirements, cost, latency, and maintenance burden. For classical ML: model family selection with complexity-performance tradeoff analysis. Framework evaluation where relevant — PyTorch, JAX, TensorFlow, XGBoost, scikit-learn.

Days 4–5

Benchmark Methodology & Decision Documentation

Design of the benchmark methodology that will be used to validate the chosen architecture: evaluation metrics, test set construction, baseline comparisons, and acceptance criteria. Delivery of the full decision documentation package — framework recommendation, fine-tuning vs. RAG decision doc, training approach specification, and architecture decision log.
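The shape of such a benchmark can be sketched in a few lines: acceptance criteria fixed before any model is evaluated, and every candidate scored against the same baseline on the same test set. Names, thresholds, and the toy data below are all illustrative:

```python
# Minimal benchmark-harness sketch. Acceptance criteria are declared up
# front; evaluation only reads them, never adjusts them.

ACCEPTANCE = {"min_accuracy": 0.90, "min_lift_over_baseline": 0.02}

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def evaluate(candidate_preds, baseline_preds, labels):
    cand = accuracy(candidate_preds, labels)
    base = accuracy(baseline_preds, labels)
    return {
        "candidate_accuracy": cand,
        "baseline_accuracy": base,
        "accepted": cand >= ACCEPTANCE["min_accuracy"]
                    and cand - base >= ACCEPTANCE["min_lift_over_baseline"],
    }

labels          = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
baseline_preds  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # 8/10 correct
candidate_preds = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # 9/10 correct
report = evaluate(candidate_preds, baseline_preds, labels)
print(report)
```

A production harness swaps the toy lists for a held-out test set that reflects the real-world distribution and adds latency and cost columns alongside accuracy, but the structure stays the same.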

Deliverables

Framework Recommendation — scored comparison with explicit rationale tied to your constraints
Fine-Tuning vs. RAG Decision Document — structured analysis with recommendation and implementation guidance
Training Approach Specification — data requirements, training infrastructure, and evaluation protocol
Benchmark Methodology — metrics, test set design, baseline definitions, and acceptance criteria
Architecture Decision Log — all major model decisions in ADR format for future reference

Before & After

Decision Confidence
Before: Framework debate running for weeks — team split, milestone at risk
After: Documented decision with explicit rationale — team aligned and execution unblocked

Benchmark Clarity
Before: No formal benchmark methodology — model comparison based on intuition and ad hoc tests
After: Defined metrics, test set, baselines, and acceptance criteria — model selection objective and defensible

Architecture Risk Reduction
Before: Major model architecture commitment made without structured analysis — risk of a costly pivot
After: Constraints documented, options evaluated, decision justified — architecture risk quantified and mitigated

Tools We Use

PyTorch / JAX / TensorFlow · Hugging Face Transformers · RAGAS · Custom Benchmark Harness

Frequently Asked Questions

What are the actual tradeoffs between fine-tuning and RAG — when does each make sense?

RAG is the right default for most LLM use cases where the knowledge base changes frequently, where factual grounding is critical, or where you cannot collect 10,000+ high-quality labelled examples. Fine-tuning is appropriate when you need the model to adopt a specific style or format, when latency is critical and retrieval overhead is unacceptable, or when you have a narrow, well-defined task with sufficient labelled data. The decision document we deliver maps these tradeoffs against your specific requirements — not the general case.
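The heuristics above can be encoded as a first-pass decision rule. The 10,000-example threshold comes from the paragraph; the function is a starting point for a structured discussion, not a substitute for the full analysis:

```python
# First-pass fine-tuning vs. RAG heuristic, encoding the rules of thumb
# stated above. Treat the output as a prior, not a verdict.

def first_pass_recommendation(knowledge_changes_often: bool,
                              needs_factual_grounding: bool,
                              labelled_examples: int,
                              needs_style_control: bool,
                              latency_critical: bool) -> str:
    if knowledge_changes_often or needs_factual_grounding or labelled_examples < 10_000:
        if needs_style_control and labelled_examples >= 10_000:
            return "hybrid: RAG + fine-tuning"
        return "RAG"
    if needs_style_control or latency_critical:
        return "fine-tuning"
    return "prompt engineering"

print(first_pass_recommendation(knowledge_changes_often=True,
                                needs_factual_grounding=True,
                                labelled_examples=3_000,
                                needs_style_control=False,
                                latency_critical=False))
```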

When should we use open-source models vs. API-based models (GPT-4, Claude)?

API-based models are the right default for prototyping and for use cases where data privacy allows it — they are faster to iterate with and the capability ceiling is high. Open-source models (Llama, Mistral, Qwen) become the right choice when data privacy requirements preclude sending data to third-party APIs, when inference volume makes API costs prohibitive at scale, or when you need fine-tuning control that API providers do not offer. We document this analysis with cost modelling specific to your expected inference volume.
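The volume-driven crossover can be checked with simple arithmetic. A sketch of the break-even comparison, with all prices as placeholders to be replaced by real quotes for your workload:

```python
# Illustrative monthly cost: API-based model vs. self-hosted open-source
# serving. Every price and volume is a placeholder.

def api_monthly_cost(queries: int, tokens_per_query: int, usd_per_1k_tokens: float) -> float:
    """Pay-per-token API billing."""
    return queries * tokens_per_query / 1000 * usd_per_1k_tokens

def self_hosted_monthly_cost(gpu_hourly_usd: float, gpus: int, hours: int = 730) -> float:
    """Always-on GPU fleet, billed by the hour (~730 hours/month)."""
    return gpu_hourly_usd * gpus * hours

api = api_monthly_cost(queries=5_000_000, tokens_per_query=1_500, usd_per_1k_tokens=0.002)
hosted = self_hosted_monthly_cost(gpu_hourly_usd=2.50, gpus=4)
print(f"API: ${api:,.0f}/mo  self-hosted: ${hosted:,.0f}/mo")
```

At these placeholder numbers self-hosting wins; cut the query volume by an order of magnitude and the API wins, which is why the analysis has to be run at your expected inference volume rather than in the abstract.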

When do gradient-boosted trees beat a neural network on structured data?

For most tabular data tasks — fraud detection, demand forecasting, credit scoring — gradient-boosted trees (XGBoost, LightGBM) outperform neural networks when the dataset is under 1M rows, latency is critical, and interpretability matters to stakeholders or regulators. Neural networks are competitive when the dataset is large, the feature space includes unstructured data (text, images), or the task requires learning complex cross-feature interactions at scale. We benchmark both approaches against your data and make a recommendation based on results, not convention.

How do you design a benchmark methodology that business stakeholders will accept?

A benchmark stakeholders accept has three properties: the metrics map to business outcomes (not just ML metrics), the test set reflects real-world distribution (not a held-out slice of training data), and the acceptance criteria are defined before evaluation starts. We design each of these with your team. The benchmark methodology we deliver includes the offline evaluation protocol and the online success metrics — connecting model performance to the business outcomes that justify the investment.

Build ML that scales.

Book a free 30-minute ML architecture scope call with our experts. We review your stack and tell you exactly what to fix before it breaks at scale.

Talk to an Expert