Our stack — and why we picked it.
Most AI consultancies wave their hands at “custom AI software.” Here's exactly what that means at OpenGate (model selection, retrieval, evals, guardrails, integrations, observability), so you can tell whether we know what we're talking about.
01 — Model selection
Specialist models, not generalists. Right-sized for the job.
For operational workflows (alert classification, ticket routing, document extraction, retrieval) we default to specialist models in the 7B–70B parameter range, fine-tuned on your domain data. Our usual picks are Llama 3.x, Mistral, and Qwen, plus specialist embedding models (BAAI's BGE, Nomic). We reach for frontier models (Claude, GPT-class) only when the need for reasoning depth genuinely outweighs data sovereignty, and we'll tell you exactly why we're doing it.
Fine-tuning happens with LoRA adapters (typically rank 16–64), not full-parameter retraining — so we can iterate on your data without burning weeks on a single training run.
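To make the cost difference concrete, here is a rough arithmetic sketch. LoRA freezes the original weight matrix and trains two low-rank factors, which adds rank × (d_in + d_out) parameters per adapted matrix. The 4096 × 4096 matrix size and the function names are illustrative assumptions, not a specific model configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out


# Hypothetical square projection matrix (assumed size, for illustration only).
D = 4096
for rank in (16, 64):
    added = lora_trainable_params(D, D, rank)
    share = added / full_params(D, D)
    print(f"rank={rank}: {added:,} trainable params ({share:.2%} of the full matrix)")
```

Even at the top of the rank range, the adapter is a small fraction of the matrix it adapts, which is what keeps iteration cycles short.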
02 — Retrieval & grounding
Hybrid retrieval over your operational data.
Pure semantic search misses obvious things; pure keyword search misses everything else. We default to hybrid retrieval — BM25 over chunked documents combined with dense embeddings (Nomic, BGE, or your domain-tuned encoder), reranked with a cross-encoder before context lands in the prompt.
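One common way to combine a keyword ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below assumes RRF as the fusion step; the text above does not commit to a particular method. The document IDs and the constant k=60 are illustrative, and the cross-encoder rerank that follows fusion is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used constant, assumed here.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs from a BM25 index and a dense-embedding index.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(candidates)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever still stay in the candidate set for the reranker to judge.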
Chunking is configured per data source — 256-token windows for ticket bodies, 512 with overlap for technical docs, structured extraction for tables. Vector storage in Qdrant, pgvector, or your existing infrastructure where applicable.
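A minimal sketch of per-source windowed chunking. The window sizes follow the numbers above; the 64-token overlap, the config keys, and the synthetic tokens are assumptions for illustration. A real pipeline would use the tokenizer of the embedding model.

```python
def chunk_tokens(tokens: list[str], window: int, overlap: int = 0) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Assumed per-source settings; the overlap of 64 is illustrative.
CHUNKING = {
    "ticket_body": {"window": 256, "overlap": 0},
    "technical_doc": {"window": 512, "overlap": 64},
}

# Synthetic tokens stand in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1000)]
doc_chunks = chunk_tokens(tokens, **CHUNKING["technical_doc"])
print([len(c) for c in doc_chunks])
```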
03 — Evals (the unglamorous part that matters most)
Nothing ships until it passes an eval gate.
Every workflow we build has a golden dataset co-developed with your subject-matter experts during week 1 — typically 100–500 examples per workflow, scored against rubrics that match the actual business decision. Every model change, prompt change, and retrieval change reruns the eval suite before it touches production.
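As a rough sketch of what such a gate can look like: score the candidate against the golden set and block the release if it falls below an agreed bar. Exact-match scoring, the 0.9 threshold, and the toy keyword router are simplifying assumptions here; real rubrics are richer than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    input_text: str
    expected: str


def run_eval_gate(
    predict: Callable[[str], str],
    golden: list[GoldenExample],
    min_accuracy: float,
) -> tuple[bool, float]:
    """Score a candidate on the golden set and report whether it clears the bar."""
    if not golden:
        raise ValueError("golden dataset is empty")
    correct = sum(1 for ex in golden if predict(ex.input_text) == ex.expected)
    accuracy = correct / len(golden)
    return accuracy >= min_accuracy, accuracy


# Toy stand-in: a keyword router plays the role of the model under test.
def candidate_router(text: str) -> str:
    return "network" if "vpn" in text.lower() else "helpdesk"


golden_set = [
    GoldenExample("VPN drops every hour", "network"),
    GoldenExample("Need a password reset", "helpdesk"),
    GoldenExample("Cannot reach VPN gateway", "network"),
    GoldenExample("Switch port flapping", "network"),
]

passed, accuracy = run_eval_gate(candidate_router, golden_set, min_accuracy=0.9)
print(f"accuracy={accuracy:.2f} passed={passed}")
```

In a CI pipeline, a failed gate would typically exit non-zero so the model, prompt, or retrieval change cannot be promoted.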
For subjective tasks we use LLM-as-judge with calibration checks against human ratings. Regression suites run on schedule. Drift detection runs continuously. This is the part most AI consultancies skip, and it's a big reason their POCs die in production.
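One way to run a calibration check is to have humans and the judge model rate the same sample, then measure chance-corrected agreement. The sketch below uses Cohen's kappa as that measure, which is our assumption for illustration; the ratings are made up.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(
        (human_freq[label] / n) * (judge_freq[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up ratings for the same eight responses.
human_ratings = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_ratings = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human_ratings, judge_ratings)
print(f"kappa={kappa:.2f}")
```

If agreement drops below a level the team has agreed on, the judge prompt or rubric gets revisited before its scores are trusted in the eval gate.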
04 — Guardrails & governance
Output validation, PII handling, audit trail. By default.
Every production response passes through validation: schema checks for structured outputs, PII redaction at input and output, content policy filters, and confidence thresholds that route low-confidence cases to humans instead of guessing.
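A minimal sketch of that validation path, assuming a JSON response with three fields. The field names, the 0.8 threshold, and the two regex patterns are illustrative; production PII detection needs much broader coverage than this, and content policy filtering is not shown.

```python
import json
import re

# Deliberately simple patterns; real PII detection needs far more coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Assumed response schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"category": str, "confidence": (int, float), "summary": str}
CONFIDENCE_THRESHOLD = 0.8  # assumed value; tuned per workflow in practice


def redact_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_PHONE.sub("[PHONE]", text)


def validate_and_route(raw_output: str) -> dict:
    """Return the action to take for one model response."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"action": "reject", "reason": "not valid JSON"}
    if not isinstance(data, dict):
        return {"action": "reject", "reason": "expected a JSON object"}
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return {"action": "reject", "reason": f"bad or missing field: {field}"}
    data["summary"] = redact_pii(data["summary"])
    if data["confidence"] < CONFIDENCE_THRESHOLD:
        return {"action": "human_review", "payload": data}
    return {"action": "auto", "payload": data}


raw = (
    '{"category": "billing", "confidence": 0.55, '
    '"summary": "Refund for jane@example.com, call 555-123-4567"}'
)
result = validate_and_route(raw)
print(result)
```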
Every prompt, retrieval, and output is logged to an append-only PostgreSQL audit trail with row-level access controls. Aligned with SOC 2 and HIPAA by architecture, not by promise. Air-gap mode is a flag, not a refactor.
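To illustrate the append-only idea, here is a small runnable sketch that uses Python's built-in SQLite module as a stand-in for the production database. The table layout and trigger names are assumptions. SQLite has no row-level access controls, so that part is omitted; in PostgreSQL the same intent is usually expressed with table privileges and row-level security policies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        event_type TEXT NOT NULL
            CHECK (event_type IN ('prompt', 'retrieval', 'output')),
        payload TEXT NOT NULL
    );
    -- Reject any attempt to rewrite or remove history.
    CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
    BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
    CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
    BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
    """
)


def log_event(event_type: str, payload: str) -> None:
    """Append one event; the connection context commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO audit_log (event_type, payload) VALUES (?, ?)",
            (event_type, payload),
        )


log_event("prompt", "classify ticket 123")
log_event("output", '{"category": "network"}')

# Any attempt to delete history should be refused by the trigger.
try:
    with conn:
        conn.execute("DELETE FROM audit_log")
    tamper_blocked = False
except sqlite3.DatabaseError:
    tamper_blocked = True

row_count = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
print(tamper_blocked, row_count)
```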
05 — Integrations & tools
Native integrations with the systems you already run.
Function calling over your existing APIs — ServiceNow REST, Microsoft Graph, Salesforce REST, Stripe, Twilio, ITSM webhooks, custom internal APIs. MCP (Model Context Protocol) servers where we want to expose your data to multiple AI workflows without rewriting integrations every time.
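The general shape of function calling can be sketched as a tool registry plus a dispatcher that validates each model-issued call before executing it. Everything named below (the `tool` decorator, `get_ticket_status`, the in-memory ticket data) is hypothetical; a real tool would wrap a call to one of the APIs listed above.

```python
import json
from typing import Any, Callable

TOOLS: dict[str, dict[str, Any]] = {}


def tool(name: str, description: str, parameters: dict[str, type]):
    """Register a function so a model can request it by name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = {"fn": fn, "description": description, "parameters": parameters}
        return fn
    return decorator


# Hypothetical tool; a real one would call your ticketing system's REST API.
@tool("get_ticket_status", "Look up the status of a ticket", {"ticket_id": str})
def get_ticket_status(ticket_id: str) -> dict:
    fake_db = {"INC001": "open", "INC002": "resolved"}
    return {"ticket_id": ticket_id, "status": fake_db.get(ticket_id, "unknown")}


def dispatch(tool_call_json: str) -> dict:
    """Validate a model-issued tool call, then execute it."""
    call = json.loads(tool_call_json)
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return {"error": f"unknown tool: {call.get('name')}"}
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    for param, expected_type in spec["parameters"].items():
        if not isinstance(args.get(param), expected_type):
            return {"error": f"bad or missing argument: {param}"}
    unexpected = set(args) - set(spec["parameters"])
    if unexpected:
        return {"error": f"unexpected arguments: {sorted(unexpected)}"}
    return {"result": spec["fn"](**args)}


response = dispatch('{"name": "get_ticket_status", "arguments": {"ticket_id": "INC002"}}')
print(response)
```

Validating the name and arguments before execution means a malformed or hallucinated call returns an error the model can read instead of reaching your systems.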
No new platform to administer. No new SSO to configure. The AI workflow shows up inside the tools your team already opens every day.
06 — Observability of the AI itself
Treat the AI like any other production system.
OpenTelemetry traces from every prompt to every tool call. P50/P95/P99 latency, token usage, and cost dashboards in Grafana. Drift alerts on output distribution, retrieval recall, and eval score deltas, so you know something changed before the business hears about it.
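As one example of an output-distribution drift check, the sketch below compares the label mix of recent predictions against a baseline window using total variation distance. The choice of metric, the 0.15 alert threshold, and the label counts are assumptions for illustration.

```python
from collections import Counter


def label_distribution(labels: list[str]) -> dict[str, float]:
    """Relative frequency of each predicted label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


DRIFT_THRESHOLD = 0.15  # assumed alerting threshold

# Made-up prediction counts for a baseline week and the current week.
baseline = ["network"] * 50 + ["helpdesk"] * 40 + ["security"] * 10
current = ["network"] * 30 + ["helpdesk"] * 40 + ["security"] * 30

drift = total_variation_distance(
    label_distribution(baseline), label_distribution(current)
)
alert = drift > DRIFT_THRESHOLD
print(f"drift={drift:.2f} alert={alert}")
```

A shift like this does not say whether the model or the incoming data changed, only that something did, which is the cue to look at traces and eval scores.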
This is what an operator would build for any production system. AI doesn't get a free pass.
One opinion we hold strongly.
In 2026 there is enormous market pressure to build “agentic” systems that can do anything. We think that's a mistake for most enterprise workflows. Specialist models outperform generalists on constrained problems. Most operational work is a constrained problem. We build narrow agents that do one job well, with verification gates between them, instead of one mega-agent that does ten jobs with hope as the architecture.
When the workflow legitimately needs a frontier reasoning model, we use one — and we'll tell you why and what it costs. When it doesn't, we won't bill you for capability you didn't need.
Want to see this applied to your stack?
Tell us about one workflow that's broken. We'll walk you through the stack we'd use, the evals we'd build, and what the first 30 days look like.