What does MLPerf measure in tokens per second, and how is the metric used with LLMs?

Last update: 16th September 2025
Author: Isaac
  • LLMs are best evaluated in tokens per second: input and output tokens determine latency.
  • Databricks provisions endpoints by TPS and autoscales; MLPerf standardizes the metrics.
  • New benchmarks (DeepSeek-R1, Whisper, Llama 3.1-8B) tighten TTFT/TPOT limits.


If you work with language models, you've heard the term "tokens per second" a thousand times, but it is rarely explained in detail what it means in real-world environments and, above all, how MLPerf measures it. In this article, we explain clearly what tokens are, why the tokens-per-second metric is so important in inference, and how platforms like Databricks and the MLPerf benchmark use it to size, compare, and scale. We also include specific figures from manufacturers and clouds to ground performance expectations.

The issue is not minor: the industry has standardized on tokens per second to evaluate LLM performance in data centers and at the edge. MLPerf, the peer-reviewed MLCommons suite, has become the benchmark for comparing hardware and software. In parallel, operators like Databricks already provision their model endpoints directly on a range of tokens per second. Let's break this all down, with numbers and use cases in hand.

What is a token and why does it matter in LLMs?

Language models don't process individual letters or words as such; they work with units called tokens. A token is usually about 4 characters long or, on average, 0.75 words. This ratio varies depending on the language and the model's tokenizer, but it serves as a quick reference: a 10-word text works out to around 13–14 tokens.
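As a quick illustration, here is a minimal Python sketch of that rule of thumb. The function names and the 4-characters / 0.75-words constants are just the approximations quoted above, not any real tokenizer's output.

```python
# Rough token estimators based on the rule of thumb above (~4 characters
# or ~0.75 words per token). Real counts depend on each model's tokenizer.

def tokens_from_words(words: float) -> float:
    return words / 0.75            # e.g. 10 words -> ~13.3 tokens

def tokens_from_chars(chars: float) -> float:
    return chars / 4               # e.g. 100 characters -> ~25 tokens

print(round(tokens_from_words(10)))  # ~13, matching the 13-14 range cited
```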

The exact segmentation depends on the model: each LLM uses its own tokenizer and splits text into whole-word tokens or subwords. Online tools let you see, for example, how Llama tokenizes a specific phrase. This variability, which seems like a small detail, influences latency and computing costs.

When talking about generation rate, it is usually expressed in tokens per second rather than words per second. This homogenizes the metric across languages, context lengths, and output styles, and allows you to accurately calculate inference cost and the required capacity.

Why measure performance in tokens per second and not in RPS?

Traditional API services focus on RPS (requests per second). For LLMs, that approach falls short: two requests can take very different times depending on their input and output tokens. That is, the actual payload comes in tokens, not in "number of calls."

There are two key sources of variability. First, the length of the input context: a short prompt may have just a few tokens, but a document to summarize can skyrocket to hundreds or thousands. Second, the length of the output: summarizing usually produces fewer tokens, while generating a long article or description increases the time, because output decoding is the most expensive phase.

Therefore, to realistically scale an inference endpoint, it helps to think in terms of tokens. Databricks, for example, provisions its Model Serving endpoints with a range of tokens per second and bills hourly based on scaling. This way, you can align capacity with actual load without being fooled by an RPS figure that doesn't tell the whole story.

How Databricks and MLPerf measure tokens per second


Databricks takes a representative RAG workload as its reference: 2048 input tokens and 256 output tokens. It combines both phases (prefill and decode) and, by default, optimizes the balance between throughput and latency for a batch size of 1 per request, simulating multiple concurrent requests.

With that rule, the numbers read like this: if you configure an endpoint at 2304 tokens per second (2048 + 256), a request with those sizes takes about one second. If you set it to 5600 tokens per second, the same request drops to about 0.5 s, and you can process two similar requests per second.

When your workload changes, latency will change. Generating more output tokens penalizes latency more than increasing input tokens does. If you're doing batch inference, calculate the average number of input and output tokens for your dataset and compare them to the reference above to estimate times.

Practical examples: with 1000 rows, an average of 3000 input and 500 output tokens, and a provisioned throughput of 3500 tokens per second, it will take more than 1000 seconds, because your averages exceed the reference mix. If instead you average 1500 input and 100 output tokens with 1600 tokens per second provisioned, you will stay below 1000 seconds in total for those 1000 rows.
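Here is a minimal sketch of that sizing arithmetic, assuming, as the reference does, that a request "costs" its input plus output tokens against the provisioned budget. It is a first-order estimate only: because decoding output tokens weighs more than ingesting input, decode-heavy mixes will run slower than it predicts (hence the "more than 1000 seconds" above) and lighter mixes faster.

```python
# First-order time estimates against a provisioned tokens-per-second budget.
# Output tokens actually cost more than input tokens, so treat these numbers
# as a baseline, not a guarantee.

def request_latency_s(input_tokens: int, output_tokens: int, tps: float) -> float:
    return (input_tokens + output_tokens) / tps

def batch_time_s(rows: int, avg_in: int, avg_out: int, tps: float) -> float:
    return rows * request_latency_s(avg_in, avg_out, tps)

print(request_latency_s(2048, 256, 2304))   # ~1.0 s, the Databricks reference
print(batch_time_s(1000, 3000, 500, 3500))  # ~1000 s baseline for the example
```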


On-demand autoscaling and actual scaling calculation

Databricks Model Serving includes fast autoscaling that increases or decreases resources based on demand in tokens per second. The system scales in capacity blocks, and additional capacity is only billed when used. In tests with more parallel requests, provisioned throughput increases until it stabilizes at around 8000 tokens per second once resources are saturated, at which point queuing latency grows.

If you observe fewer tokens per second than you configured, check two things: the provisioned concurrency reported in the endpoint metrics and the configured minimum band size. With this data, the actual scaling is estimated with the formula: provisioned concurrency × minimum band size / 4.

A concrete example: with a maximum concurrency of 8 and a minimum band size of 850 tokens per second, the effective limit would be 1700 tokens per second (8 × 850 / 4). Understanding this calculation prevents surprises and helps you fine-tune your settings to your latency SLOs.
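In code, the calculation is trivial; a minimal sketch follows. The function and parameter names are ours for illustration, not an official Databricks API.

```python
# Effective scaling estimate: provisioned concurrency x minimum band size / 4,
# as described above. Names here are illustrative only.

def effective_tps_limit(provisioned_concurrency: int, min_band_tps: float) -> float:
    return provisioned_concurrency * min_band_tps / 4

print(effective_tps_limit(8, 850))  # 1700.0 tokens per second, as in the example
```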

MLPerf Inference: What it is and what it measures today

MLPerf, developed by MLCommons, is the open, standardized suite for measuring AI performance in the data center and at the edge, from vision to LLMs. Its goal is to compare platforms in a fair and reproducible way to drive ecosystem efficiency. In recent years, the focus has clearly shifted towards GenAI and large LLMs.

In the fifth edition, Llama 2 70B was consolidated as the star benchmark, displacing ResNet50, and tokens-per-second results improved up to 3.3x in the best case within one year, with median performance 5 times higher thanks to hardware and software optimizations. The presence of CPUs like Intel Xeon 6 in the official results also showed there is room for efficient general-purpose solutions in certain scenarios.

Version 5.1 of MLPerf Inference has taken another leap forward, incorporating three new key benchmarks: reasoning with DeepSeek-R1, speech-to-text with Whisper Large v3, and a small LLM based on Llama 3.1 8B. Overall, the consortium reported 27 participants, reached the milestone of 90,000 results, and tightened the latency limits for interactive scenarios.

Metrics and objectives in the new benchmarks

The reasoning benchmark with DeepSeek‑R1, a 671B-parameter MoE, reflects that these models produce long chains of reasoning before the answer. It supports outputs of up to 20,000 tokens, with an average of 3880 tokens per output in the dataset, the longest to date in the inference suite.

The rules measure throughput in offline mode and server mode with strict limits: a time to first token of 2 seconds and a per-token latency of 80 ms at p99. This seeks to balance the "thinking" budget against the responsiveness needed to deploy it.

The small LLM benchmark with Llama 3.1‑8B replaces GPT‑J 6B as the entry point. It supports contexts of up to 128,000 tokens and evaluates summarization on CNN‑DailyMail with 778 input tokens and 73 output tokens. Accuracy is validated with ROUGE and, in the closed division, submissions must match 99 percent of the accuracy of a high-precision reference.

For latency, two indicators are used: TTFT (time to first token) and TPOT (time per output token). In server mode, the targets are 2 s TTFT and 100 ms TPOT (around 480 words per minute); in the new interactive scenario they tighten to 0.5 s and 30 ms respectively (around 1600 words per minute) for cases such as chat, coding, or creative tools.
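To see what those TPOT targets mean per stream, here is a small sketch. Note that the ~480 and ~1600 words-per-minute figures quoted imply roughly 0.8 words per token, slightly above the 0.75 rule of thumb used earlier.

```python
# Convert a TPOT target into per-stream throughput. The 0.8 words-per-token
# factor is inferred from the ~480/~1600 words-per-minute figures above.

def stream_rates(tpot_ms: float, words_per_token: float = 0.8) -> tuple[float, float]:
    tokens_per_s = 1000 / tpot_ms
    words_per_min = tokens_per_s * words_per_token * 60
    return tokens_per_s, words_per_min

print(stream_rates(100))  # server targets: 10 tok/s, ~480 words/min
print(stream_rates(30))   # interactive: ~33 tok/s, ~1600 words/min
```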

Performance highlights by manufacturer and operator

  • NVIDIA led again, this time with Blackwell Ultra on the GB300 NVL72 system, setting a reasoning record with 45 percent more DeepSeek‑R1 throughput than GB200 NVL72, reaching 5842 tokens per second per GPU offline and 2907 in server mode, with improvements close to 5x over unverified Hopper results.
  • In the new interactive Llama 3.1 405B benchmark, NVIDIA applied disaggregated serving with Dynamo, separating context and generation onto different GPUs and transferring the KV cache over NVLink, achieving 1.5× more throughput per GPU than traditional serving on Blackwell and more than 5× more than systems with Hopper.
  • For smaller models, NVIDIA reported over 18,000 tokens per second per GPU on Llama 3.1 8B offline and 5667 tokens per second per GPU on Whisper, maintaining GPU leadership in all scenarios (offline, server, and interactive).
  • AMD expanded its presence with the first submission of the Instinct MI355X GPU, which now appears in the Llama 2‑70B results. It showed multi-node scaling and a 2.7x increase in tokens per second over the MI325X in FP8. In the open division, structured pruning was applied to Llama 3.1‑405B (FP4), increasing throughput by 82 percent with a 21 percent depth-pruned model and by 90 percent with a 33 percent pruned, fine-tuned model, while maintaining accuracy.
  • It also debuted submissions in Llama 2‑70B Interactive, Mixtral‑8×7B, and Stable Diffusion XL, and presented mixed MI300X/MI325X results: when scaling to 4 nodes, the MI355X achieved 3.4x more throughput than the MI300X, extending to 8 nodes with good scalability.
  • HPE, combining ProLiant and Cray, reported 14 number-one results. The DL380a Gen12 stood out in DLRM and Llama 3.1‑8B (Server) among 8-GPU PCIe systems; the DL385 Gen11 posted the best per-GPU performance in Whisper with the H200 NVL; and the Cray XD670 (8× H200) scored six firsts across RetinaNet, Llama 3.1‑8B, Mixtral, and Whisper, plus firsts with RTX Pro 6000 Blackwell SE and GH200 NVL2 results in DLRM.
  • CoreWeave was the first cloud to report results with the GB300, delivering 6005 tokens per second per GPU in DeepSeek‑R1 offline and demonstrating orchestration and scaling with Slurm on Kubernetes and topology-aware scheduling to get the most out of NVLink.
  • Dell submitted 12 systems with AMD and NVIDIA accelerators, shining in Llama 2 70B Interactive with the PowerEdge XE9680L and B200, Llama 3.1‑8B Server on the XE9685L with B200, SDXL on the XE9685L, and Whisper on the XE9680L, demonstrating versatility from image to voice through LLMs.
  • Intel stressed that it remains the only vendor to submit results with server CPUs and showed that Xeon 6 with P-cores improves 1.9× over 5th Gen Xeon across five benchmarks, cementing its role in general-purpose inference. It also introduced workstations with 8 Arc Pro B60 GPUs and 192 GB of VRAM to serve Llama 2‑70B to multiple users, and bundled drivers and frameworks to simplify multi-GPU deployment.
  • Among the integrators and partners, ASUSTeK optimized latency and throughput with quantization, kernels, and stack tuning; Broadcom demonstrated VCF virtualization with minimal overhead versus bare metal across multiple workloads (Whisper, SDXL, Llama 3.1-405B, Llama 2-70B, RGAT, RetinaNet); Cisco scaled almost linearly with the UCS C885A M8 (8× H200 SXM) and UCS C845A M8 (8× H200 NVL or L40S), supported by One G200 networks.
  • KRAI, using the OpenAI API and realistic overheads, compared SGLang and vLLM with Llama 3.1‑70B: 31,391 tokens per second offline with SGLang 0.4.9 and 26,319 with vLLM 0.9.2 on a single server with 8× H200; with dynamic quantization it reached 27,697 with SGLang and 30,893 with vLLM, and on multi-node it scaled up to 87,334 tokens per second across three servers.
  • Lambda, with 8× B200 180 GB SXM, showed throughput improvements of up to 7 percent in SDXL and 15 percent in Llama 3.1‑405B compared to the previous round, and offers clusters from 16 to 1536 GPUs with managed Kubernetes or Slurm.
  • MiTAC, with its G8825Z5 series, shone in Llama 2 70B Interactive with 18,846.1 tokens per second and good results in Server and Mixtral; Nebius certified its virtualized performance almost on par with bare metal on GB200 NVL72, HGX B200, and HGX H200, with 596.11 tokens per second in server mode and 855.82 offline on Llama 3.1‑405B with 4 GB200 GPUs.
  • Red Hat demonstrated vLLM as a supported runtime on its AI Inference Server, which, with CUTLASS FP8 kernels, FlashAttention‑3, and an improved vLLM v1 engine, serves Llama 3.1‑8B on H100 and L40S with a strong cost-performance ratio.
  • Supermicro posted leading results with the HGX‑B200 8‑GPU (air- and liquid-cooled) with both Intel and AMD CPUs, highlighting Llama 3.1‑8B and Llama 2‑70B in server/offline/interactive modes and Whisper; in collaborations, it showed excellent scaling with 32× H100‑SXM and alternatives with MI325X.
  • Vultr debuted with the Supermicro AS‑8126GS‑TNMR and 8× MI325X, certifying competitive performance as a cloud GPU provider; GATEOverflow promoted reproducibility with MLCFlow on RTX 4090 and AMD/Intel CPUs; Giga Computing shipped 8U air-cooled EPYC+MI325X and Xeon+HGX B200 systems; QCT covered Xeon 6 configurations with H200 NVL (4 GPUs) and 8× H200 SXM5 platforms with NVLink and GPUDirect Storage, in addition to 8× MI325X systems.

Academia also had its moment. The University of Florida, with its DGX B200 SuperPOD integrated with HiPerGator, was the first institution to submit inference results, meeting server latencies under closed partitioning, using Apptainer without Docker or sudo, and fitting into multi-user SLURM. At the opposite extreme, a single submission on an M1 MacBook Pro, with ONNX Runtime and CoreML on GPU and Neural Engine, surpassed the target accuracy in the edge category and demonstrated that quality inference can be evaluated on consumer hardware.

Speed perceived by users and practical limits

User experience is not only measured in benchmarks; in everyday use, the feeling of fluidity comes when you exceed a certain threshold of tokens per second. One user commented that their limit for conversation is 4 tokens per second, and for story writing it's around 10 tokens per second; below that, interaction feels slow.

If you try to run an LLM locally, there are three realities. On a desktop CPU, it is normal to move at 1–2 tokens per second, unfeasible for long answers. With a high-end gaming GPU, you can get close to 5 tokens per second. With an NVIDIA H100, we're already talking about 60 tokens per second, but that's data center hardware, not desktop hardware.

What's happening in the cloud? The most powerful providers beat these numbers thanks to specialized hardware and optimized inference stacks. Averages of around 119 tokens per second have been reported for ChatGPT‑4 and 168 for Gemini, while popular open source models like DeepSeek hover around 21 tokens per second. Converted to words, 119 tokens per second is around 90 words per second.


Operational conclusion: for most users, running an AI on their own computer is possible but impractical due to slowness. To work at comfortable speeds and with tight latencies, managed services remain the sensible option.

How to size your endpoint by TPS and what to expect from latency

Practical steps for sizing. First, outline your use case: average number of input and output tokens, length distribution, and expected concurrency. Second, run a load test with a representative dataset, measuring TTFT and sustained tokens per second per request, for example with a harness like the sketch below.
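A minimal timing harness might look like this. `stream_tokens` is a placeholder you would wire to your endpoint's streaming API (it should yield tokens as they arrive); it is not any specific vendor SDK.

```python
import time
from typing import Callable, Iterable, Tuple

# Measure TTFT and sustained tokens per second for one streamed request.
# `stream_tokens` is a hypothetical callable that yields tokens as they arrive.

def measure(stream_tokens: Callable[[str], Iterable[str]], prompt: str) -> Tuple[float, float]:
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream_tokens(prompt):
        if first is None:
            first = time.perf_counter()  # first token observed -> TTFT
        count += 1
    elapsed = time.perf_counter() - start
    return first - start, count / elapsed  # (TTFT in seconds, tokens per second)
```

Run it across your representative prompts and concurrency levels, then compare the distributions (not just the averages) against your latency budget.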

Next, align the configuration with your pattern. If your workload resembles the Databricks reference (2048 in, 256 out), choose a tokens-per-second range such that a request falls within the desired latency budget. Remember that doubling output tokens usually costs more than doubling input tokens, and that effective concurrency depends on actual autoscaling.

Monitor and adjust. Keep an eye on provisioned concurrency, queue depth, TTFT, and TPOT, and compare them to your SLOs. If you're short on capacity, expand the range; if you have excess resources, lower it and adjust blocks to save. The effective scaling formula will help you understand why an endpoint isn't performing as configured when it hasn't created enough replicas.

Finally, be aware of the scenario. In interactive, chatbot-style mode, aim for a TTFT of 0.5 s and 30 ms per token; this delivers a premium user experience. In server mode, 2 s and 100 ms per token are reasonable guidelines, and offline, aim for maximum throughput while maintaining the accuracy the benchmark requires.
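As a sanity check for those budgets, the end-to-end latency of a streamed response is roughly TTFT plus output tokens times TPOT; a minimal sketch under that assumption:

```python
# Approximate end-to-end latency of a streamed response: TTFT + out_tokens * TPOT.

def response_latency_s(ttft_s: float, tpot_ms: float, output_tokens: int) -> float:
    return ttft_s + output_tokens * tpot_ms / 1000

print(response_latency_s(0.5, 30, 256))   # interactive targets: ~8.2 s
print(response_latency_s(2.0, 100, 256))  # server targets: ~27.6 s
```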

Looking at MLPerf trends, the vector is clear: more context, more tokens, and better efficiency techniques (disaggregated serving, FP4/FP8, structured pruning, custom kernels, KV cache scheduling) are pushing the tokens-per-second ceiling up year over year, both per chip and per system.

The overall picture drawn by Databricks and MLPerf is consistent: thinking in tokens per second is the correct way to reason about cost, latency, and scalability for LLMs. With a good representative benchmark, TTFT/TPOT metrics, and well-calibrated autoscaling, it's possible to deliver fast, stable responses without oversizing the infrastructure.
