How We Build

Our stack — and why we picked it.

Most AI consultancies wave their hands at “custom AI software.” Here's exactly what that means at OpenGate — model selection, retrieval, evals, guardrails, observability. So you can tell whether we know what we're talking about.

01 — Model selection

Specialist models, not generalists. Right-sized for the job.

For operational workflows (alert classification, ticket routing, document extraction, retrieval) we default to specialist models in the 7B–70B parameter range, fine-tuned on your domain data. Our usual picks are Llama 3.x, Mistral, Qwen, and specialist embedding models (BAAI's BGE, Nomic). We reach for frontier models (Claude, GPT-class) only when the need for reasoning depth genuinely outweighs data sovereignty, and we'll tell you exactly why we're doing it.

Fine-tuning happens with LoRA adapters (typically rank 16–64), not full-parameter retraining — so we can iterate on your data without burning weeks on a single training run.

Llama 3.1 8B / 70B (quantized via TensorRT-LLM)
Mistral Small / Large
Qwen 2.5 for multilingual or code-heavy domains
Frontier API (Claude / GPT) where appropriate
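
To make the LoRA trade-off above concrete, here is a back-of-the-envelope sketch of how many parameters an adapter trains compared with the dense weight matrix it sits beside. The hidden size of 4096 is a hypothetical value chosen for illustration, not a figure for any specific model listed here.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix the adapter is attached to."""
    return d_out * d_in


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, for illustration only
    for rank in (16, 64):  # the two ends of the rank range mentioned above
        adapter = lora_trainable_params(d, d, rank)
        share = adapter / full_params(d, d)
        print(f"rank={rank}: {adapter:,} trainable params ({share:.2%} of the full matrix)")
```

Even at the top of the rank range the adapter is a small fraction of the matrix it modifies, which is why iteration is fast.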

02 — Retrieval & grounding

Hybrid retrieval over your operational data.

Pure semantic search misses obvious things; pure keyword search misses everything else. We default to hybrid retrieval — BM25 over chunked documents combined with dense embeddings (Nomic, BGE, or your domain-tuned encoder), reranked with a cross-encoder before context lands in the prompt.
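
One common way to combine a keyword ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below assumes RRF as the fusion step, which is our illustrative choice and not necessarily the method used in any given deployment. The document IDs and the constant k=60 are placeholder values, and the cross-encoder rerank that would follow is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

# Candidates that would be handed to the cross-encoder reranker.
candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that both retrievers agree on rise to the top, while a document found by only one retriever still survives as a candidate.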

Chunking is configured per data source — 256-token windows for ticket bodies, 512 with overlap for technical docs, structured extraction for tables. Vector storage in Qdrant, pgvector, or your existing infrastructure where applicable.
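
A minimal sketch of per-source chunking with a sliding window. The window sizes come from the paragraph above. The 64-token overlap and the source names are assumptions, and tokens are passed in as a plain list, so a real pipeline would first run the encoder's own tokenizer.

```python
def chunk_tokens(tokens: list[str], window: int, overlap: int = 0) -> list[list[str]]:
    """Split a token list into windows of `window` tokens sharing `overlap` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Hypothetical per-source settings; the overlap value is an assumption.
CHUNKING = {
    "ticket_body": {"window": 256, "overlap": 0},
    "technical_doc": {"window": 512, "overlap": 64},
}


def chunk_for_source(source: str, tokens: list[str]) -> list[list[str]]:
    """Look up the settings for a data source and chunk accordingly."""
    return chunk_tokens(tokens, **CHUNKING[source])
```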

Hybrid BM25 + dense embeddings
Cross-encoder reranking before prompt
Per-source chunking strategy
Citations always returned, never optional

03 — Evals (the unglamorous part that matters most)

Nothing ships without passing an eval gate, and a failing run can't be waved through.

Every workflow we build has a golden dataset co-developed with your subject-matter experts during week 1 — typically 100–500 examples per workflow, scored against rubrics that match the actual business decision. Every model change, prompt change, and retrieval change reruns the eval suite before it touches production.
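
The gate described above can be sketched as a score over the golden dataset plus a pass/fail rule. Everything named here (`GoldenExample`, `run_eval`, `passes_gate`, and the 0.90 floor and 0.01 regression tolerance) is hypothetical; real rubrics would be richer than a single number.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    input_text: str
    expected: str


def run_eval(
    predict: Callable[[str], str],
    examples: Sequence[GoldenExample],
    score: Callable[[str, str], float],
) -> float:
    """Mean rubric score of `predict` over the golden dataset."""
    if not examples:
        raise ValueError("golden dataset is empty")
    total = sum(score(predict(ex.input_text), ex.expected) for ex in examples)
    return total / len(examples)


def passes_gate(
    candidate_score: float,
    baseline_score: float,
    min_score: float = 0.90,       # hypothetical absolute floor
    max_regression: float = 0.01,  # hypothetical tolerance versus production
) -> bool:
    """A change ships only if it clears the floor and does not regress the baseline."""
    return candidate_score >= min_score and candidate_score >= baseline_score - max_regression
```

Wired into CI, a `False` from `passes_gate` would fail the build for any model, prompt, or retrieval change.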

For subjective tasks we use LLM-as-judge with calibration checks against human ratings. Regression suites run on schedule. Drift detection runs continuously. This is the part 90% of AI consultancies skip — and it's the reason their POCs die in production.

Golden datasets co-built with your SMEs
Rubric-scored, not just accuracy
LLM-as-judge + human calibration
Regression + drift suites on schedule
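
One way to run the judge-versus-human calibration check is to measure chance-corrected agreement on a shared sample. The sketch below uses Cohen's kappa, which is our assumption about a reasonable metric and not a statement of the only one used. The labels are made up.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Chance-corrected agreement between LLM-judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings on the same four outputs.
judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge_labels, human_labels)
```

A kappa that falls below an agreed threshold would be the signal to revise the judge prompt or send more cases to human raters.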

04 — Guardrails & governance

Output validation, PII handling, audit trail. By default.

Every production response passes through validation: schema checks for structured outputs, PII redaction at input and output, content policy filters, and confidence thresholds that route low-confidence cases to humans instead of guessing.
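
A minimal sketch of that validation path, covering schema checks, PII redaction, and confidence routing (content policy filters are omitted). The field names, the 0.8 confidence floor, and the two regex patterns are illustrative assumptions. Production PII detection needs far broader coverage than two patterns.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical output schema and threshold.
REQUIRED_FIELDS = {"category": str, "confidence": (int, float)}
CONFIDENCE_FLOOR = 0.8


def redact_pii(text: str) -> str:
    """Mask emails and US SSN-shaped numbers; applied to both inputs and outputs."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)


def validate(raw_output: str) -> dict:
    """Return an 'auto' decision, or route to a human with the reason attached."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"route": "human", "reason": "invalid_json"}
    if not isinstance(data, dict):
        return {"route": "human", "reason": "invalid_json"}
    for field, kind in REQUIRED_FIELDS.items():
        value = data.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, kind):
            return {"route": "human", "reason": f"schema:{field}"}
    if data["confidence"] < CONFIDENCE_FLOOR:
        return {"route": "human", "reason": "low_confidence"}
    return {"route": "auto", "category": redact_pii(data["category"])}
```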

Every prompt, retrieval, and output is logged to an append-only PostgreSQL audit trail with row-level access controls. SOC 2 and HIPAA-aligned by architecture, not by promise. Air-gap mode is a flag, not a refactor.
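
To illustrate what "append-only" means at the database level, the sketch below uses SQLite triggers purely so it runs without a server. This is a stand-in, not the production design. In PostgreSQL the same effect is usually achieved by revoking UPDATE and DELETE privileges or with equivalent triggers, and row-level access controls are not shown. The table and column names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    event_type TEXT NOT NULL CHECK (event_type IN ('prompt', 'retrieval', 'output')),
    payload TEXT NOT NULL
);
CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
"""


def open_audit_log(path: str = ":memory:") -> sqlite3.Connection:
    """Create the audit table along with triggers that reject edits and deletes."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record(conn: sqlite3.Connection, event_type: str, payload: str) -> None:
    """Append one event; the connection context manager commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO audit_log (event_type, payload) VALUES (?, ?)",
            (event_type, payload),
        )


def is_blocked(conn: sqlite3.Connection, statement: str) -> bool:
    """Return True if the database refuses the given tampering statement."""
    try:
        conn.execute(statement)
    except sqlite3.DatabaseError:
        return True
    return False
```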

05 — Integrations & tools

Native integrations with the systems you already run.

Function calling over your existing APIs — ServiceNow REST, Microsoft Graph, Salesforce REST, Stripe, Twilio, ITSM webhooks, custom internal APIs. MCP (Model Context Protocol) servers where we want to expose your data to multiple AI workflows without rewriting integrations every time.
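
Function calling boils down to a registry of allowed tools plus a dispatcher that refuses anything outside it. The sketch below assumes the model emits a JSON object with `name` and `arguments` keys. That shape, the `get_ticket` tool, and its stubbed response are all hypothetical; a real handler would call the target system's REST API.

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., dict]] = {}


def tool(name: str) -> Callable[[Callable[..., dict]], Callable[..., dict]]:
    """Decorator that registers a function as a callable tool under `name`."""
    def register(fn: Callable[..., dict]) -> Callable[..., dict]:
        TOOLS[name] = fn
        return fn
    return register


@tool("get_ticket")
def get_ticket(ticket_id: str) -> dict:
    # Hypothetical stub standing in for a call to a ticketing system.
    return {"ticket_id": ticket_id, "status": "open"}


def dispatch(tool_call_json: str) -> dict:
    """Run a model-issued tool call, rejecting unknown tools and bad arguments."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**call.get("arguments", {}))}
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected arguments.
        return {"error": f"bad arguments for {name}: {exc}"}
```

Because the model can only name tools that exist in the registry, adding an integration means registering one more function, not widening what the model is trusted to do.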

No new platform to administer. No new SSO to configure. The AI workflow shows up inside the tools your team already opens every day.

06 — Observability of the AI itself

Treat the AI like any other production system.

OpenTelemetry traces from every prompt to every tool call. P50/P95/P99 latency, token usage, and cost dashboards in Grafana. Drift alerts on output distribution, retrieval recall, and eval score deltas, so you know something changed before the business hears about it.
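
Two of the signals above are simple to compute from raw samples. The sketch below derives latency percentiles with the standard library and scores output-distribution drift with the population stability index (PSI). PSI is our illustrative choice of drift metric, and any alert threshold applied to it would be an assumption to tune per workflow.

```python
import math
from collections import Counter
from statistics import quantiles
from typing import Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> dict[str, float]:
    """P50/P95/P99 from raw latency samples (needs at least two samples)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def population_stability_index(
    baseline: Sequence[str], current: Sequence[str], eps: float = 1e-6
) -> float:
    """PSI between two samples of categorical outputs; 0.0 means identical shares."""
    labels = set(baseline) | set(current)
    base, cur = Counter(baseline), Counter(current)
    psi = 0.0
    for label in labels:
        # Floor each share at eps so a label missing from one side stays finite.
        p = max(base[label] / len(baseline), eps)
        q = max(cur[label] / len(current), eps)
        psi += (q - p) * math.log(q / p)
    return psi
```

In practice these numbers would be exported as metrics and alerted on from the dashboard layer, not computed ad hoc.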

This is what an operator would build for any production system. AI doesn't get a free pass.

OTEL traces
Prometheus + Grafana
Token / cost budgets
Drift alerting

One opinion we hold strongly.

In 2026 there is enormous market pressure to build “agentic” systems that can do anything. We think that's a mistake for most enterprise workflows. Specialist models outperform generalists on constrained problems. Most operational work is a constrained problem. We build narrow agents that do one job well, with verification gates between them, instead of one mega-agent that does ten jobs with hope as the architecture.
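
The "narrow agents with verification gates" pattern can be sketched as a list of stages, each paired with a check that must pass before the next stage runs. The stage names, the keyword classifier, and the queue naming are hypothetical placeholders. Real agents would be model calls and real gates would be the validations and evals described earlier.

```python
from typing import Callable

# A stage is (name, agent, gate): the agent transforms state, the gate verifies it.
Stage = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


def run_pipeline(stages: list[Stage], state: dict) -> dict:
    """Run narrow agents in order, halting at the first gate that fails."""
    for name, agent, gate in stages:
        state = agent(state)
        if not gate(state):
            return {**state, "halted_at": name}  # hand off to a human from here
    return {**state, "halted_at": None}


def classify(state: dict) -> dict:
    # Placeholder for a specialist classifier model.
    label = "network" if "router" in state["text"].lower() else None
    return {**state, "label": label}


def route(state: dict) -> dict:
    # Placeholder for a routing agent that only runs on verified labels.
    return {**state, "queue": f"{state['label']}-oncall"}


STAGES: list[Stage] = [
    ("classify", classify, lambda s: s["label"] is not None),
    ("route", route, lambda s: s["queue"].endswith("-oncall")),
]
```

A failed gate stops the chain with the failing stage recorded, so a bad classification never reaches the routing step.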

When the workflow legitimately needs a frontier reasoning model, we use one — and we'll tell you why and what it costs. When it doesn't, we won't bill you for capability you didn't need.

Want to see this applied to your stack?

Tell us about one workflow that's broken. We'll walk you through the stack we'd use, the evals we'd build, and what the first 30 days look like.