Large Language Model Development

Utility and Bust: Scale in Future AI Provision

Key Takeaways

Large Language Model (LLM) development and hosting technology has matured and diffused, making it now widely accessible.

Our investigation and experiments show that it is now possible for virtually any company to host and customise near state-of-the-art models. However, costs are such that it is not realistic to offer full-scale, open-source models at prices that are competitive with the hyperscale providers.

This paper demonstrates what can be done with commodity hardware and discusses the implications of the change in AI technology that enabled this demonstration.

Three futures for AI technology are described:

  • a future where scale dominates,
  • a future where scale is important,
  • and a future where scale is irrelevant.


Current evidence points to the ‘scale is important’ future rather than the ‘scale dominates’ future, but there are also indications that scale may matter even less. Download our white paper to find out more about how virtually any company can host and customise near state-of-the-art models.

Enterprise LLM Development: Key Questions Answered

What is model distillation in large language model development?

Model distillation is a technique that trains a smaller “student” model to replicate the behavior of a larger “teacher” large language model (LLM). It reduces model size and compute requirements while preserving most of the original performance.

In enterprise settings, model distillation enables cost-efficient deployment of LLMs on mid-range GPU infrastructure instead of hyperscale clusters. When combined with quantization and parameter-efficient tuning, it significantly lowers inference costs and latency. Distillation is particularly valuable for domain-specific applications where extreme model scale is not required.
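The core idea can be sketched in a few lines: the student is trained against the teacher's temperature-softened output distribution rather than hard labels. The temperature value and toy logits below are illustrative assumptions, not settings or benchmarks from the paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a higher temperature softens the
    # distribution, exposing the teacher's "dark knowledge" about
    # relative class similarities.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between softened teacher and student distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return (temperature ** 2) * kl.mean()

# A student whose logits match the teacher's incurs (near) zero loss.
teacher = np.array([[2.0, 0.5, -1.0]])
aligned = distillation_loss(teacher.copy(), teacher)
misaligned = distillation_loss(np.array([[-1.0, 0.5, 2.0]]), teacher)
print(aligned < misaligned)
```

In practice this soft-target term is usually blended with a standard cross-entropy loss on ground-truth labels, and training runs over the full dataset rather than a single logit vector.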

The full Thought Leadership explains how distillation reshapes LLM infrastructure economics and enterprise AI strategy. Download the report for benchmarks and implementation insights.

What is the difference between model distillation and quantization?

Model distillation and quantization are both LLM optimization techniques, but they address different aspects of efficiency. Distillation reduces model size by training a smaller model to imitate a larger one. Quantization reduces memory usage by lowering numerical precision (e.g., from 16-bit to 8-bit).

Distillation changes the model architecture and parameter count, while quantization modifies how parameters are represented. Combined, these techniques can significantly reduce GPU requirements and inference costs without major performance degradation.
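To make the contrast concrete, here is a minimal sketch of symmetric per-tensor int8 quantization: the parameter values are unchanged in meaning, only their numerical representation shrinks from 32-bit floats to 8-bit integers plus one scale factor. The array shapes and random weights are illustrative assumptions:

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: map floats onto int8 using a
    # single scale chosen so the largest magnitude lands on +/-127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for computation.
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is a quarter of float32, and the reconstruction error is
# bounded by about half a quantization step.
print(q.nbytes, w.nbytes)
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)
```

Production schemes typically quantize per-channel or per-group and calibrate activations as well, but the memory arithmetic is the same: 8-bit weights occupy a quarter of the space of 16- or 32-bit weights on the same GPU.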

The Thought Leadership explores how combining distillation and quantization enables near state-of-the-art performance on accessible hardware. Download the full paper for technical benchmarks.

Is building a private LLM more cost-effective than using API-based models?

Building a private LLM can be more cost-effective at scale, particularly for sustained workloads with high token volume and strict data governance requirements. However, total cost of ownership depends on infrastructure, engineering expertise, and utilization rates.

API-based models offer rapid deployment and elasticity, but long-term usage fees can exceed the cost of operating a fine-tuned, self-hosted model. Enterprises must evaluate GPU capital expense, MLOps maturity, compliance risk, and strategic differentiation.
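The break-even logic can be sketched as a simple fixed-versus-marginal cost comparison. Every price and volume below is a hypothetical placeholder, not a figure from the report; the point is the shape of the calculation:

```python
# Hypothetical monthly cost model (all values are illustrative assumptions).
API_COST_PER_M_TOKENS = 10.0        # USD per million tokens via an API provider
SELF_HOSTED_MONTHLY_FIXED = 8000.0  # GPU amortisation, hosting, and ops
SELF_HOSTED_PER_M_TOKENS = 0.5      # marginal power/compute per million tokens

def monthly_cost_api(m_tokens):
    # Pure pay-as-you-go: cost scales linearly with token volume.
    return API_COST_PER_M_TOKENS * m_tokens

def monthly_cost_self_hosted(m_tokens):
    # Large fixed base, small marginal cost per token.
    return SELF_HOSTED_MONTHLY_FIXED + SELF_HOSTED_PER_M_TOKENS * m_tokens

# Break-even volume: fixed cost / (API rate - marginal self-hosted rate).
break_even = SELF_HOSTED_MONTHLY_FIXED / (
    API_COST_PER_M_TOKENS - SELF_HOSTED_PER_M_TOKENS
)
print(round(break_even, 1))  # millions of tokens per month
```

Below the break-even volume the API is cheaper; above it, self-hosting wins, which is why sustained high-volume workloads are the usual candidates for private deployment. A fuller model would also price engineering time, utilization rates, and compliance overhead.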

The full report compares API pricing scenarios with self-hosted and distilled model strategies under different utilization assumptions. Download the analysis for detailed cost modeling.


Download Our Thought Leadership Paper

Complete the form to receive your copy.

The Controller of the personal data is GFT Group. The data entered in the form will be processed to maintain contact and analyse interest in our materials. You can withdraw any consent given at any time. For additional information or to exercise your rights, see the privacy notice.
