TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh
Google Research
zandieh@google.com

Majid Daliri
New York University
daliri.majid@nyu.edu

Majid Hadian
Google DeepMind
majidh@google.com

Vahab Mirrokni
Google Research
mirrokni@google.com

###### Abstract
Vector quantization, a problem rooted in Shannon’s source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence of distinct coordinates in high dimensions to apply optimal scalar quantizers to each coordinate independently. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on the best achievable distortion rate of any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant factor ($\approx 2.7$). Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
1 Introduction
Vector quantization (VQ) in Euclidean space is crucial for efficiently handling high-dimensional vectors across a spectrum of computational domains, from training and deploying large-scale AI and deep learning models to powering vector databases for search/retrieval systems. The core objective is to compress high-dimensional vectors by quantizing them–converting floating-point coordinate values to low-bitwidth integers–while minimizing distortion, quantified by metrics such as
mean-squared error (MSE) or inner product errors. By preserving these properties, inner product queries can be answered rapidly, with minimal latency, and using reduced computational and communication resources.
This problem’s roots trace back to Shannon’s seminal work on Source Coding theory *[48, 49]*, which established that the least distortion achievable by block source codes, now known as vector quantizers, is defined by the Shannon distortion-rate function, determined by the statistical properties of the source and the chosen distortion measure, such as MSE. Today, VQ plays a critical role in fundamental computational domains, including AI, deep learning, and search systems.
A key application of VQ is in the deployment of AI models, including large language models (LLMs) *[5, 18, 7, 52]*. As LLM capabilities depend heavily on their model size and context length *[34]*, serving them requires substantial memory demands and increased inference latency. This latency is primarily attributed to communication bottlenecks between HBM and SRAM on accelerators, or across distributed clusters. By compressing or quantizing model weights and activations, we can effectively mitigate these bottlenecks, resulting in significant reductions in inference costs. Inner product operations between activations and weights are at the core of deep learning models. Thus, model quantization schemes strive to compress weights and/or activation vectors while accurately preserving these inner products.
Decoder based transformer models *[54]* present another compelling use case. These models must store key/value (KV) embeddings from previously generated tokens in the KV cache, the size of which scales with both model size (number of layers and attention heads) and context length. This scaling is a significant bottleneck in terms of memory usage and computational speed, especially for long context models. Therefore, reducing the KV cache size without compromising accuracy is essential. In this context, the preservation of the Euclidean structure of these embedding vectors–their inner products and distances–is crucial for maintaining model performance. VQ emerges as the most suitable framework for addressing this challenge, offering a robust approach to compressing high-dimensional embeddings while preserving their essential geometric properties.
Additionally, nearest neighbor (NN) search in high-dimensional spaces with inner product or cosine similarity *[1, 27]* is a cornerstone of vector databases *[4, 2, 3]*. These databases are fundamental for retrieval-augmented generation *[23, 19]* and information retrieval *[35, 46]*. VQ, a.k.a. product quantization (PQ), plays a critical role in these applications. It enables efficient compression of database vectors, optimizes memory usage, and facilitates low-latency, accurate estimations of inner products with query vectors, thereby enabling fast and precise nearest neighbor searches.
Existing VQ algorithms present a trade-off: either they lack accelerator (vectorization) compatibility and exhibit slow computation, making them unsuitable for real-time AI applications like KV cache quantization, or they suffer from suboptimal distortion bounds relative to bit-width. Our objective is to introduce an algorithm that addresses these limitations. Specifically, we design TurboQuant: a lightweight algorithm that can be applied online (crucial for scenarios like KV cache quantization) and is highly accelerator-friendly, a critical attribute for modern AI workloads.
The core of TurboQuant is a two-stage process. First, we develop a vector quantizer with optimal distortion rate in terms of mean-squared error (MSE). Subsequently, we apply a 1-bit quantizer to the residual, resulting in an unbiased and low-distortion inner product quantizer. We demonstrate that quantizers optimized for MSE do not produce unbiased estimators for inner products, and
our two-stage solution effectively bridges this gap. Our MSE-optimal quantizer starts by randomly rotating the $d$-dimensional input vectors. Observing the key fact that each coordinate of the rotated vectors follows a Beta distribution, we design an optimal Lloyd-Max quantizer *[42, 43]* for each coordinate by solving a continuous k-means problem. This method gives an optimal MSE distortion bound and minimizes the L2 norm of the residual. To obtain an unbiased and low-distortion quantizer for inner products, we compose our quantizer with the recently developed Quantized Johnson-Lindenstrauss (QJL) transform *[62]*, which quantizes each coordinate of the residual vector to a single bit. Our algorithm offers provably optimal distortion bounds for both MSE and inner products, achieving an exponential improvement over existing methods in terms of bit-width dependence.
1.1 Problem Definition
Formally, our goal is to design a quantization map, denoted as $Q:\mathbb{R}^{d}\to\{0,1\}^{B}$, that transforms $d$-dimensional vectors to a binary string of $B$ bits. If we set $B=b\cdot d$ for some $b\geq 0$, this quantizer will have a bit-width of $b$, representing the average number of bits used to encode each real-valued coordinate of $\mathbb{R}^{d}$. Crucially, we require an inverse map, $Q^{-1}:\{0,1\}^{B}\to\mathbb{R}^{d}$ that performs dequantization, approximately reconstructing original vectors from their quantized representations. Of course, this transformation is inherently lossy, as $Q$ is not a bijection. So, our primary objective is to minimize distortion, with a specific focus on mean-squared error (MSE) and inner product distortion.
We make no assumptions about the input vector dataset, considering the worst-case scenario. We allow the quantizer $Q(\cdot)$ to be randomized, leading to stochastic outputs. For randomized quantizers, it is more appropriate to define the expected distortion over the randomness of the quantizer’s output. Thus, we aim to design quantizers that, for any desired bit-width $b$, minimize the following expected distortion measures for any (worst-case) vectors $\bm{x},\bm{y}\in\mathbb{R}^{d}$:
$\mathbf{(MSE)}$ $D_{\texttt{mse}}:=\mathop{\mathbb{E}}_{Q}\left[\left\|\bm{x}-Q^{-1}\left(Q(\bm{x})\right)\right\|_{2}^{2}\right]$ (1) $\mathbf{(inner\text{-}prod~{}error)}$ $D_{\texttt{prod}}:=\mathop{\mathbb{E}}_{Q}\left[\left|\left\langle\bm{y},\bm{x}\right\rangle-\left\langle\bm{y},Q^{-1}\left(Q(\bm{x})\right)\right\rangle\right|^{2}\right].$ (2)
The expectations above are taken with respect to the randomness of the quantizer $Q(\cdot)$. Furthermore, for inner-product quantizers, we require unbiasedness of the inner product estimator, a desirable property for numerous applications. More precisely, we require:
$\mathbf{(unbiased~{}inner\text{-}prod)}$ $\mathop{\mathbb{E}}_{Q}\left[\left\langle\bm{y},Q^{-1}\left(Q(\bm{x})\right)\right\rangle\right]=\left\langle\bm{y},\bm{x}\right\rangle.$
We aim to design computationally efficient quantizers $Q_{\texttt{mse}}$ and $Q_{\texttt{prod}}$ that achieve optimal bounds for the distortion measures defined above, for any given bit-width $b$. Additionally, we aim for $Q_{\texttt{prod}}$ to provide unbiased inner product estimates. In particular, assume that we are given $n$ real-valued vectors $\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{n}\in\mathbb{R}^{d}$. We design the following primitives:
- Quant: efficiently quantizes the dataset and computes $Q(\bm{x}_{1}),Q(\bm{x}_{2}),\ldots,Q(\bm{x}_{n})$.
- DeQuant: given a quantized dataset, efficiently reconstructs original vectors by computing $Q^{-1}\left(Q(\bm{x}_{i})\right)$ for any $i\in[n]$.
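To make the interface concrete, here is a minimal Python sketch (our illustration, not the paper's implementation) of the Quant/DeQuant primitives. A placeholder $b$-bit uniform scalar quantizer on $[-1,1]$ stands in for TurboQuant's codebook, and the empirical MSE of Eq. (1) is measured on unit-norm vectors:

```python
import numpy as np

def quant(X, b):
    """Quant: map each coordinate to a b-bit index on a uniform grid over [-1, 1]."""
    levels = 2 ** b
    cells = ((np.clip(X, -1, 1) + 1) / 2 * levels).astype(int)
    return np.clip(cells, 0, levels - 1)

def dequant(idx, b):
    """DeQuant: map each index back to the midpoint of its grid cell."""
    levels = 2 ** b
    return (idx + 0.5) / levels * 2 - 1

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows
Xh = dequant(quant(X, b=4), b=4)
mse = np.mean(np.sum((X - Xh) ** 2, axis=1))    # empirical D_mse of Eq. (1)
```

The uniform grid is deliberately naive; the point of the paper is precisely that a codebook adapted to the coordinate distribution achieves much lower distortion at the same bit-width.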
1.2 Related Work
#### Beginnings of VQ.
The theory of vector quantization started with Shannon’s seminal work *[48, 49]* on achievable distortion-rate functions. In 1963, Zador *[61]* made significant advances by employing high-resolution methods to derive the limiting operational distortion-rate function for fixed-rate quantization at high rates, which closely matches Shannon’s distortion-rate function. However, Zador did not specifically consider implementable algorithms. Gersho’s influential paper *[25]* further advanced vector quantization by popularizing high-resolution theory, simplifying Zador’s results, introducing lattice vector quantization, and proposing a key conjecture that shaped the field. Despite these theoretical advancements, the practical applicability of vector quantization remained unclear in its early years. The most straightforward encoding method, brute-force nearest neighbor search, was computationally expensive, hindering the adoption of VQ in practice.
#### Online vs Offline Quantization.
Online (data-oblivious) quantization methods apply instantly without needing data-specific tuning or calibrations *[16, 8, 41, 47, 28]*. In contrast, offline (data-dependent) methods require heavy preprocessing and learning to adapt the quantization map to the data, making them unsuitable for dynamic data scenarios *[37]*. For instance, methods such as those presented in *[20, 39, 57, 13]* use second-order (Hessian) information to tune the quantization map which requires heavy preprocessing and even in some cases post processing as well.
#### Online KV Cache Compression.
Several approaches have been proposed to compress the KV cache. These include architectural modifications *[50, 6, 15]* which restructure the transformer to minimize the number of stored key-value pairs. Additionally, pruning or evicting redundant or less critical tokens has emerged as another approach *[11, 66, 40, 58, 64, 38, 29]*.
A simple yet effective approach to reducing KV cache size is quantizing the KV cache. Several quantization techniques have been developed specifically for this purpose *[60, 59, 17, 33, 65, 41, 30, 36, 28]*. Recently, a new quantization scheme called QJL *[62]* introduced an efficient, data-oblivious 1-bit quantization approach based on sketching techniques, which provides unbiased estimates for inner product queries. This method does not require tuning or adaptation to the input data, and we make use of it in our quantizer optimized for inner product distortion.
#### Product Quantization (PQ).
In Near Neighbor (NN) search problem with Euclidean datasets, the index size poses a significant memory bottleneck, often mitigated by quantization techniques, commonly referred to as Product Quantization (PQ) in the NN literature. Many of these algorithms rely on constructing a quantization codebook using variations of k-means during the indexing phase *[31, 9, 24, 56, 27]*. Therefore, these methods are ill-suited for online settings due to their requirement for extensive preprocessing.
Recently, a grid-based PQ method was introduced in *[22]*, eliminating the need for preprocessing. This approach operates by projecting a uniform grid onto the unit sphere and conducting a search to identify the nearest projection to the data points. While the paper’s theoretical guarantees are suboptimal, likely due to loose analysis—as practical performance surpasses theoretical bounds—the grid projection and binary search algorithm is also computationally slow and particularly inefficient on accelerators like GPUs, because its inherent lack of vectorization prevents parallel processing.
1.3 Overview of Techniques and Contributions
#### MSE Optimized TurboQuant.
Our first VQ algorithm is designed to minimize the MSE distortion defined in Eq. (1). To achieve this, we apply a random rotation to the input vectors, thereby inducing a Beta distribution on each coordinate, irrespective of the input vectors themselves. In high dimensions $d$, the distribution of each coordinate converges to a Gaussian distribution $\mathcal{N}(0,1/d)$ due to concentration of measure and the central limit theorem. Furthermore, any two distinct coordinates become nearly uncorrelated and, more importantly, almost independent (a deeper result that goes beyond just correlation). This near-independence is a crucial aspect that simplifies our quantization design. It allows us to quantize each coordinate using optimal scalar quantization, disregarding interactions or correlations between different coordinates, while still achieving near-optimal distortion.
We find optimal scalar quantizers for random variables with Beta distributions by solving a continuous 1-dimensional k-means problem using the Lloyd-Max algorithm. We precompute and store these optimal codebooks for a range of practically useful bit-widths, to enable efficient subsequent invocations of our TurboQuant algorithm.
In Theorem 1 we prove that the $b$-bit MSE optimized TurboQuant $Q_{\texttt{mse}}:\mathbb{R}^{d}\rightarrow\{0,1\}^{b\cdot d}$ achieves the following distortion for any worst-case vector $\bm{x}\in\mathbb{R}^{d}$ with $\|\bm{x}\|=1$:
- $D_{\texttt{mse}}(Q_{\texttt{mse}}):=\mathbb{E}\left[\left\|\bm{x}-Q_{\texttt{mse}}^{-1}\left(Q_{\texttt{mse}}(\bm{x})\right)\right\|_{2}^{2}\right]\leq\frac{\sqrt{3}\pi}{2}\cdot\frac{1}{4^{b}}$ for any $b\geq 0$.
- For small bit-widths the above distortion upper bound can be further refined. Specifically, for $b=1,2,3,4$ we have $D_{\texttt{mse}}(Q_{\texttt{mse}})\approx\mathbf{0.36,0.117,0.03,0.009}$, respectively.
Note that the unit norm assumption, $\left\|x\right\|_{2}=1$, is standard and not restrictive. For datasets that do not satisfy this assumption we can compute and store the $L2$ norms in floating-point precision and rescale the dequantized points using these stored norms.
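A sketch of this normalization trick, with a hypothetical `quant`/`dequant` pair standing in for any unit-sphere quantizer (the identity here, so that the sketch stays self-contained):

```python
import numpy as np

def quantize_with_norm(x, quant):
    """Store the L2 norm in float and quantize only the unit-norm direction."""
    norm = np.linalg.norm(x)
    return quant(x / norm), norm

def dequantize_with_norm(codes, norm, dequant):
    """Dequantize the direction, then rescale by the stored norm."""
    return norm * dequant(codes)

rng = np.random.default_rng(1)
x = 5.0 * rng.standard_normal(32)                       # arbitrary, non-unit norm
codes, norm = quantize_with_norm(x, quant=lambda u: u)  # identity placeholder
x_hat = dequantize_with_norm(codes, norm, dequant=lambda c: c)
```

Storing one float per vector is negligible overhead next to the $b\cdot d$ quantization bits.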
#### Inner Product TurboQuant.
We show that MSE-optimized quantizers are biased for inner product estimation, and thus a different VQ scheme is needed to obtain an unbiased inner product quantizer. Our solution is a two-stage algorithm that first applies the aforementioned $Q_{\texttt{mse}}$ with a bit-width one less than our target budget and then applies QJL *[62]* on the residual error. This scheme is provably unbiased and also has a nearly optimal inner product error rate.
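The bias can be seen numerically. In the following sketch (ours, for illustration), a 1-bit MSE-optimal quantizer with centroids $\pm\sqrt{2/\pi}/\sqrt{d}$ is applied after a random rotation; the estimate $\langle\bm{x},Q^{-1}(Q(\bm{x}))\rangle$ concentrates near $2/\pi\approx 0.64$ instead of $\|\bm{x}\|_2^{2}=1$, a multiplicative shrinkage bias:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 256, 200
x = np.zeros(d); x[0] = 1.0                    # any fixed unit vector
c = np.sqrt(2 / np.pi) / np.sqrt(d)            # 1-bit Lloyd-Max centroids +/- c
est = []
for _ in range(trials):
    P, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation
    y = P @ x
    y_hat = c * np.sign(y)                     # nearest of {-c, +c} per coordinate
    est.append(x @ (P.T @ y_hat))              # <x, Q^{-1}(Q(x))>
shrink = float(np.mean(est))                   # concentrates near 2/pi, not 1
```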
In Theorem 2 we prove that the $b$-bit inner product optimized TurboQuant $Q_{\texttt{prod}}:\mathbb{R}^{d}\rightarrow\{0,1\}^{b\cdot d}$ achieves the following distortion for any worst-case vectors $\bm{x},\bm{y}\in\mathbb{R}^{d}$ with $\|\bm{x}\|=1$:
- $\mathbb{E}\left[\left\langle\bm{y},Q_{\texttt{prod}}^{-1}\left(Q_{\texttt{prod}}(\bm{x})\right)\right\rangle\right]=\left\langle\bm{y},\bm{x}\right\rangle$
- $D_{\texttt{prod}}(Q_{\texttt{prod}}):=\mathbb{E}\left[\left|\left\langle\bm{y},\bm{x}\right\rangle-\left\langle\bm{y},Q_{\texttt{prod}}^{-1}\left(Q_{\texttt{prod}}(\bm{x})\right)\right\rangle\right|^{2}\right]\leq\frac{\sqrt{3}\pi^{2}\cdot\|\bm{y}\|_{2}^{2}}{d}\cdot\frac{1}{4^{b}}$ for any $b\geq 0$.
Note that for small bit-widths the above distortion upper bound can be further refined. Specifically, for $b=1,2,3,4$ we have $D_{\mathsf{prod}}(Q_{\mathsf{prod}})\approx\frac{1.57}{d},\frac{0.56}{d},\frac{0.18}{d},\frac{0.047}{d}$, respectively.
#### Lower Bound.
In Theorem 3, we leverage Shannon’s lower bound and Yao’s minimax principle to prove that for any randomized quantization algorithm $Q:\mathbb{R}^{d}\to\{0,1\}^{b\cdot d}$ with bit-width $b$, there exist hard input instances $\bm{x},\bm{y}\in\mathbb{R}^{d}$ with $\|\bm{x}\|=1$ such that the following lower bounds hold:
- $D_{\mathtt{mse}}(Q):=\mathbb{E}\left[\left\|\bm{x}-Q^{-1}\left(Q(\bm{x})\right)\right\|_{2}^{2}\right]\geq\frac{1}{4^{b}}$
- $D_{\mathsf{prod}}(Q):=\mathbb{E}\left[\left|\langle\bm{y},\bm{x}\rangle-\langle\bm{y},Q^{-1}\left(Q(\bm{x})\right)\rangle\right|^{2}\right]\geq\frac{\|\bm{y}\|_{2}^{2}}{d}\cdot\frac{1}{4^{b}}$
As demonstrated by our lower bounds, TurboQuant’s MSE distortion is provably within a factor of at most $\frac{\sqrt{3}\pi}{2}\approx\mathbf{2.7}$ of the information-theoretic lower bound. Notably, for smaller bit-widths, this factor decreases significantly. For instance, at a bit-width of $b=1$, TurboQuant achieves a distortion that is only a factor of approximately $\mathbf{1.45}$ away from optimal, which is also confirmed by our experimental results, indicating its efficiency in low-bit-width scenarios.
#### Experimental Results.
In Section 4.1, we empirically validate our theoretical distortion bounds, demonstrating that TurboQuant’s observed distortions closely align with our predictions across various real-world datasets, approaching the established lower bounds.
Furthermore, in Section 4.2 and Section 4.3, we showcase TurboQuant’s efficacy in online KV cache quantization. Specifically, we achieve perfect long-context retrieval in needle-in-a-haystack tasks and maintain high performance on other long-context downstream tasks, all while compressing the KV cache by a factor exceeding $5\times$.
Finally, in Section 4.4, we apply TurboQuant to various high-dimensional near neighbor search tasks. TurboQuant consistently outperforms data-dependent product quantization (PQ), while reducing the indexing time to essentially zero.
2 Preliminaries
We use boldface lowercase letters, such as $\bm{x}$ and $\bm{y}$, to denote vectors, and boldface uppercase letters, like $\bm{M}$, to denote matrices. To denote a slice of a vector $\bm{x}$ between the coordinate indices $i$ and $j$ inclusive of the endpoints, we use the notation $\bm{x}_{i:j}$. For a matrix $\bm{M}$, we write $\bm{M}_{i,:}$ to denote its $i$-th row vector, which we will simply refer to as $\bm{M}_{i}$.
We use the notation $\mathbb{S}^{d-1}$ to denote the hypersphere in $\mathbb{R}^{d}$ of radius $1$. For a random variable $x$ we denote its differential entropy as $h(x)$. For random variables $x$ and $y$, the mutual information between them is denoted as $I(x;y)=h(x)-h(x|y)$.
Given that TurboQuant employs random rotation to mitigate worst-case input scenarios, understanding the statistical properties of random points on a hypersphere is essential. The following lemma outlines one such property that we will need for analysis and design purposes:
###### Lemma 1 (coordinate distribution of random point on hypersphere).
For any positive integer $d$ if $\bm{x}\in\mathbb{S}^{d-1}$ is a random variable uniformly distributed over the unit hypersphere, then for any $j\in[d]$ the coordinate $\bm{x}_{j}$ follows the following (scaled/shifted) Beta distribution:
$\bm{x}_{j}\sim f_{X}(x):=\frac{\Gamma(d/2)}{\sqrt{\pi}\cdot\Gamma((d-1)/2)}\left(1-x^{2}\right)^{(d-3)/2}.$
In high dimensions this Beta distribution converges to the normal distribution $f_{X}(\cdot)\to\mathcal{N}(0,1/d)$.
###### Proof.
$f_{X}(x)$ equals the ratio of the area of a sphere of radius $\sqrt{1-x^{2}}$ in dimension $d-1$ to the area of the unit sphere in dimension $d$, multiplied by the factor $1/\sqrt{1-x^{2}}$ (which accounts for the slab thickness, by the Pythagorean theorem). Therefore,
$f_{X}(x)=\frac{\frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2)}\cdot(1-x^{2})^{(d-2)/2}}{\frac{2\pi^{d/2}}{\Gamma(d/2)}}\cdot 1/\sqrt{1-x^{2}}=\frac{\Gamma(d/2)}{\sqrt{\pi}\cdot\Gamma((d-1)/2)}\left(1-x^{2}\right)^{(d-3)/2}.$
∎
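A quick numerical sanity check of Lemma 1 (our sketch): sampling uniform points on the sphere by normalizing Gaussian vectors, a single coordinate should have mean $0$, variance $1/d$, and near-Gaussian kurtosis for large $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 20000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform samples on S^{d-1}
coord = X[:, 0]                                # one coordinate, ~ f_X of Lemma 1
mean, var = coord.mean(), coord.var()          # expect ~0 and ~1/d
kurt = np.mean(coord ** 4) / var ** 2          # expect ~3 (Gaussian-like)
```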
2.1 Shannon Lower Bound on Distortion
The Shannon Lower Bound (SLB) is a powerful tool, derived from Shannon’s lossy source coding theorem *[49]*, that provides a universal lower bound on the optimal achievable distortion rate for any lossy compression scheme. Specifically, we use a version of SLB tailored for the mean-squared error (MSE) distortion measure applied to general $d$-dimensional sources.
###### Lemma 2 (SLB).
Let $\bm{x}\in\mathbb{R}^{d}$ be a random vector with an arbitrary probability distribution $p_{X}$ and finite differential entropy $h(\bm{x})$. Define the MSE distortion-rate function $D(B)$ for total bit complexity $B\geq 0$ as:
$D(p_{X},B):=\inf\left\{\mathbb{E}\left[\|\bm{x}-\bm{y}\|_{2}^{2}\right]:I(\bm{x};\bm{y})\leq B\right\},$
where the infimum is taken over all joint distributions of $\bm{x}$ and a reconstruction random vector $\bm{y}\in\mathbb{R}^{d}$ such that the mutual information $I(\bm{x};\bm{y})$ is at most $B$ and $\mathbb{E}\left[\|\bm{x}-\bm{y}\|_{2}^{2}\right]$ is the expected MSE distortion, calculated with respect to the joint distribution of $\bm{x}$ and $\bm{y}$. Then, for any bit complexity $B\geq 0$, the following Shannon Lower Bound holds:
$D(p_{X},B)\geq\frac{d}{2\pi e}\cdot 2^{(2/d)(h(\bm{x})-B)}.$
This is a classic result proved using the backward Gaussian test channel (for a proof see *[14]*). Our lower bound result uses a corollary of SLB that corresponds to uniformly distributed random points on the unit hypersphere. We present this in the following lemma:
###### Lemma 3 (SLB for random point on hypersphere).
Let $\bm{x}\in\mathbb{S}^{d-1}$ be a random variable uniformly distributed over the unit hypersphere and define the MSE distortion-rate function $D(B)$ for total bit complexity $B$ as per Lemma 2. Then, for any bit complexity $B\geq 0$, the following distortion lower bound holds:
$D(B)\geq 2^{-2B/d}.$
###### Proof.
If we let $A_{d}$ denote the area of the hypersphere $\mathbb{S}^{d-1}$, the differential entropy of the uniform distribution over the hypersphere is $h(\bm{x})=\log_{2}A_{d}$. Plugging this into the SLB from Lemma 2 we get $D(B)\geq\frac{d}{2\pi e}\cdot A_{d}^{2/d}\cdot 2^{-2B/d}$. Using Stirling’s approximation for the Gamma function we have $A_{d}=\frac{2\pi^{d/2}}{\Gamma(d/2)}\geq\left(\frac{2\pi e}{d}\right)^{d/2}\cdot\sqrt{\frac{2d}{\pi}}\cdot(1-O(1/d))$. Substituting this into the inequality obtained from Lemma 2 yields the desired lower bound. ∎
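The constant in this argument can be checked numerically. The sketch below (ours) evaluates the prefactor $\frac{d}{2\pi e}A_{d}^{2/d}$ in log-space and confirms it is at least $1$ for moderate dimensions and approaches $1$ as $d$ grows, consistent with the stated bound $D(B)\geq 2^{-2B/d}$:

```python
import math

def slb_prefactor(d):
    """(d / (2*pi*e)) * A_d^(2/d), with A_d = 2*pi^(d/2)/Gamma(d/2), in log-space."""
    log_area = math.log(2) + (d / 2) * math.log(math.pi) - math.lgamma(d / 2)
    return d / (2 * math.pi * math.e) * math.exp(2 * log_area / d)

# prefactor >= 1 means D(B) >= 2^(-2B/d) follows from the SLB above
prefactors = [slb_prefactor(d) for d in (8, 32, 128, 512, 2048)]
```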
2.2 QJL: 1-bit inner product quantization
As previously stated, we design two VQ algorithms: one optimized for minimizing MSE and the other for minimizing inner product error. We show that MSE-optimal quantizers do not necessarily provide unbiased inner product estimates, particularly exhibiting significant bias at lower bit-widths. Our solution for inner product quantization is a two-stage algorithm. First, we apply the MSE-optimal quantizer using one less bit than the desired bit-width budget, thus minimizing the L2 norm of the residuals. Next we apply an unbiased and optimal single-bit quantizer to the residual. For the single-bit inner product quantizer, we utilize the recently proposed Quantized Johnson-Lindenstrauss (QJL) algorithm *[62]*, which is an optimal inner product quantizer with a bit-width of one. Here, we present the QJL algorithm and its essential theoretical guarantees.
###### Definition 1 (QJL).
For any positive integer $d$ the QJL map $Q_{\mathtt{qj1}}:\mathbb{R}^{d}\to\{-1,+1\}^{d}$ is defined as:
$Q_{\mathtt{qj1}}(\bm{x}):=\mathtt{sign}\left(\bm{S}\cdot\bm{x}\right)\qquad\text{for any }\bm{x}\in\mathbb{R}^{d},$
where $\bm{S}\in\mathbb{R}^{d\times d}$ is a random matrix with i.i.d. entries sampled from the normal distribution $\mathcal{N}(0,1)$ and the $\mathtt{sign}$ function is applied entry-wise to its vector input. The inverse/dequantization map $Q_{\mathtt{qj1}}^{-1}:\{-1,+1\}^{d}\to\mathbb{R}^{d}$ is defined as:
$Q_{\mathtt{qj1}}^{-1}(\bm{z}):=\frac{\sqrt{\pi/2}}{d}\cdot\bm{S}^{\top}\cdot\bm{z}\qquad\text{for any }\bm{z}\in\{-1,+1\}^{d}.$
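A minimal sketch of the QJL maps in Definition 1 (our illustration of *[62]*, not their reference code; function names are ours), with an empirical unbiasedness check that averages the dequantized inner product over fresh draws of $\bm{S}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

def qjl_quant(x, S):
    return np.sign(S @ x)                               # Q_qjl(x): d sign bits

def qjl_dequant(z, S):
    return np.sqrt(np.pi / 2) / S.shape[0] * (S.T @ z)  # Q_qjl^{-1}(z)

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d)
# Average the estimator over fresh Gaussian S to observe unbiasedness
est = np.mean([y @ qjl_dequant(qjl_quant(x, S), S)
               for S in (rng.standard_normal((d, d)) for _ in range(1000))])
```

The average should approach $\langle\bm{y},\bm{x}\rangle$, with per-draw variance bounded as in Lemma 4.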
In the next lemma we restate the results from *[62]*, showing that QJL is unbiased and has small inner product distortion:
###### Lemma 4 (performance guarantee: QJL).
Let $Q_{\mathtt{qj1}}$ and $Q_{\mathtt{qj1}}^{-1}$ be defined as per Definition 1. For any vector $\bm{x}\in\mathbb{S}^{d-1}$ and any $\bm{y}\in\mathbb{R}^{d}$ we have the following:
- Unbiased: $\mathbb{E}\left[\left\langle\bm{y},Q_{\mathtt{qj1}}^{-1}\left(Q_{\mathtt{qj1}}(\bm{x})\right)\right\rangle\right]=\left\langle\bm{y},\bm{x}\right\rangle$.
- Variance Bound: $\mathtt{Var}\left(\left\langle\bm{y},Q_{\mathtt{qj1}}^{-1}\left(Q_{\mathtt{qj1}}(\bm{x})\right)\right\rangle\right)\leq\frac{\pi}{2d}\cdot\|\bm{y}\|_{2}^{2}$
###### Proof.
The unbiasedness immediately follows from Lemma 3.2 of *[62]*. To show the variance bound, let $\bm{s}_{1},\bm{s}_{2},\ldots,\bm{s}_{d}$ denote the rows of the random matrix $\bm{S}$ in Definition 1. We have:
$\left\langle\bm{y},Q_{\mathtt{qj1}}^{-1}\left(Q_{\mathtt{qj1}}(\bm{x})\right)\right\rangle=\frac{1}{d}\sum_{i\in[d]}\sqrt{\pi/2}\cdot\bm{s}_{i}^{\top}\bm{y}\cdot\mathtt{sign}(\bm{s}_{i}^{\top}\bm{x}).$
Since $\bm{s}_{i}$’s are i.i.d. the above is indeed the average of $d$ i.i.d. random samples defined as $z_{i}:=\sqrt{\pi/2}\cdot\bm{s}_{i}^{\top}\bm{y}\cdot\mathtt{sign}(\bm{s}_{i}^{\top}\bm{x})$ for $i\in[d]$. Let us now upper bound the variance of a single $z_{i}$ using Fact 3.4 from *[62]*:
$\mathtt{Var}\left(z_{i}\right)=\pi/2\cdot\mathtt{Var}\left(\bm{s}_{i}^{\top}\bm{y}\cdot\mathtt{sign}(\bm{s}_{i}^{\top}\bm{x})\right)\leq\pi/2\cdot\mathbb{E}\left[(\bm{s}_{i}^{\top}\bm{y})^{2}\right]=\pi/2\cdot\|\bm{y}\|_{2}^{2}\,,$ (3)
where the last equality above follows because $\bm{s}_{i}^{\top}\bm{y}$ is a Gaussian random variable with mean zero and variance $\|\bm{y}\|_{2}^{2}$. Now the variance of the average of $d$ i.i.d. random samples $z_{1},z_{2},\ldots z_{d}$ is:
$\mathtt{Var}\left(\left\langle\bm{y},Q_{\mathtt{qj1}}^{-1}\left(Q_{\mathtt{qj1}}(\bm{x})\right)\right\rangle\right)=\frac{1}{d^{2}}\sum_{i\in[d]}\mathtt{Var}(z_{i})\leq\frac{\pi}{2d}\cdot\|\bm{y}\|_{2}^{2}\,.$
∎
3 TurboQuant: High Performance Quantization
We develop two VQ algorithms, each tailored to a specific objective. The first algorithm is designed to minimize the MSE between the original and reconstructed vectors after quantization. The second is optimized for unbiased inner product estimation, addressing the bias inherent in MSE-optimal quantizers. These algorithms are detailed in the following subsections.
Furthermore, in Section 3.3, we establish information-theoretic lower bounds on the best achievable distortion rates for any vector quantizer. This analysis demonstrates that TurboQuant achieves near-optimality, differing from the lower bound by only a small constant factor across all bit-widths.
3.1 MSE Optimal TurboQuant
Let $\bm{x}\in\mathbb{S}^{d-1}$ be a (worst-case) vector on the unit sphere in dimension $d$. We aim to quantize $\bm{x}$ to $b$ bits per coordinate while minimizing the reconstruction MSE defined in Eq. (1). We start by randomizing this vector by multiplying it with a random rotation matrix $\bm{\Pi}\in\mathbb{R}^{d\times d}$. We can generate $\bm{\Pi}$ by applying QR decomposition to a random matrix with i.i.d. normal entries.
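A sketch of this rotation step (the helper name is ours). One subtlety: the raw QR output of a Gaussian matrix is not exactly Haar-distributed; multiplying each column of $Q$ by the sign of the corresponding diagonal entry of $R$ is the standard correction:

```python
import numpy as np

def random_rotation(d, rng):
    """Haar-distributed orthogonal matrix via QR of an i.i.d. Gaussian matrix."""
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))   # column sign correction for exact Haar measure

rng = np.random.default_rng(0)
P = random_rotation(64, rng)         # orthogonal: preserves norms and inner products
```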
The resulting rotated vector, $\bm{\Pi}\cdot\bm{x}$, is uniformly distributed on the unit sphere $\mathbb{S}^{d-1}$. As shown in Lemma 1, each coordinate of $\bm{\Pi}\cdot\bm{x}$ follows a Beta distribution, which converges to a normal distribution in high dimensions. Furthermore, in high dimensions, distinct coordinates of $\bm{\Pi}\cdot\bm{x}$ become nearly independent *[55]*, allowing us to apply optimal scalar quantizers to each coordinate independently. Therefore, by Lemma 1, our task reduces to designing a scalar quantizer for random variables with the distribution $f_{X}(x)=\frac{\Gamma(d/2)}{\sqrt{\pi}\cdot\Gamma((d-1)/2)}\left(1-x^{2}\right)^{(d-3)/2}$ for $x\in[-1,1]$.
The optimal scalar quantization problem, given a known probability distribution, can be framed as a continuous k-means problem in dimension one. Specifically, we aim to partition the interval $[-1,1]$ into $2^{b}$ clusters/buckets. The optimal solution adheres to a Voronoi tessellation *[42]*, meaning interval boundaries are the midpoints between consecutive centroids, when arranged in sorted order. Therefore, with $c_{i}$’s denoting the centroids in ascending order, we can formulate the scalar quantization as the following k-means optimization problem:

$$\mathcal{C}\left(f_{X},b\right):=\min_{-1\leq c_{1}\leq c_{2}\leq\dots\leq c_{2^{b}}\leq 1}\sum_{i=1}^{2^{b}}\int_{\frac{c_{i-1}+c_{i}}{2}}^{\frac{c_{i}+c_{i+1}}{2}}|x-c_{i}|^{2}\cdot f_{X}(x)\,dx,\tag{4}$$

with the boundary conventions $\frac{c_{0}+c_{1}}{2}=-1$ and $\frac{c_{2^{b}}+c_{2^{b}+1}}{2}=1$.

Algorithm 1 TURBOQUANTmse: optimized for MSE

1: input: dimension $d$ and bit-width $b$ // Global parameters for setting up TURBOQUANTmse
2: Generate a random rotation matrix $\bm{\Pi}\in\mathbb{R}^{d\times d}$
3: Construct codebook by finding centroids $c_{1},c_{2},\ldots c_{2^{b}}\in[-1,1]$ that minimize the MSE cost in Eq. (4)
4: procedure QUANTmse($\bm{x}$)
5: $\bm{y}\gets\bm{\Pi}\cdot\bm{x}$
6: $\mathtt{idx}_{j}\gets\arg\min_{k\in[2^{b}]}|\bm{y}_{j}-c_{k}|$ for every $j\in[d]$ {$\mathtt{idx}_{j}$’s are $b$-bit integers}
7: output: $\mathtt{idx}$
8: procedure DEQUANTmse($\mathtt{idx}$)
9: $\tilde{\bm{y}}_{j}\gets c_{\mathtt{idx}_{j}}$ for every $j\in[d]$
10: $\tilde{\bm{x}}\gets\bm{\Pi}^{\top}\cdot\tilde{\bm{y}}$
11: output: $\tilde{\bm{x}}$
Note that $\mathcal{C}(f_X, b)$ in Eq. (4) denotes the optimal MSE cost for bit-width $b$, a quantity we will bound to prove the upper bound on the end-to-end MSE of TURBOQUANT. The problem in Eq. (4) can be solved using iterative numerical methods to achieve any desired precision. We solve Eq. (4) once for a range of practically relevant bit-widths $b$, and store the results for future use by the quantizer.
For example, in moderately high dimensions $d$ , where the distribution $f_{X}(x)$ closely approximates a normal distribution, the optimal quantization centroids for bit-widths $b = 1,2$ are $\left\{\pm \frac{\sqrt{2 / \pi}}{\sqrt{d}}\right\}$ and $\left\{\pm \frac{0.453}{\sqrt{d}}, \pm \frac{1.51}{\sqrt{d}}\right\}$ , respectively.
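These centroids can be recovered numerically by running Lloyd's algorithm (the 1-dimensional k-means of Eq. (4)) on samples of the coordinate distribution. The sketch below (ours) works in units of the coordinate standard deviation, i.e., on $\mathcal{N}(0,1)$ samples, where the 2-bit Lloyd-Max centroids are approximately $\pm 0.453$ and $\pm 1.510$:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)        # sigma = 1 units of the coordinate
c = np.array([-1.5, -0.5, 0.5, 1.5])          # initial 2-bit (4-level) centroids
for _ in range(60):
    # assignment step: nearest centroid per sample (Voronoi cells)
    idx = np.abs(samples[:, None] - c[None, :]).argmin(axis=1)
    # update step: each centroid moves to the conditional mean of its cell
    c = np.array([samples[idx == k].mean() for k in range(4)])
centroids = np.sort(c)
```

Dividing the result by $\sqrt{d}$ gives the codebook for the coordinate distribution of a rotated unit vector.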
Therefore the quantizer $Q_{\mathtt{mse}}: \mathbb{R}^d \to \{0,1\}^{b \cdot d}$ first computes $\bm{\Pi}\cdot\bm{x}$ and then computes and stores the indices of the nearest centroids to each coordinate of this vector. The dequantization map $Q_{\mathtt{mse}}^{-1}: \{0,1\}^{b \cdot d} \to \mathbb{R}^d$ reconstructs the vector by retrieving the centroids corresponding to the stored indices and then rotating the result back to the original basis through multiplication by $\bm{\Pi}^{\top}$. Pseudocode for these procedures is given in Algorithm 1.
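Putting the pieces together, here is an end-to-end sketch of Algorithm 1 for bit-width $b=1$ (our illustration), using the closed-form 1-bit codebook $\{\pm\sqrt{2/\pi}/\sqrt{d}\}$ mentioned above; the empirical MSE should land near the refined $b=1$ value of roughly $0.36$ from Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
P, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random rotation Pi
codebook = np.array([-1.0, 1.0]) * np.sqrt(2 / np.pi) / np.sqrt(d)

def quant_mse(x):
    y = P @ x                                        # rotate
    return np.abs(y[:, None] - codebook[None, :]).argmin(axis=1)  # 1 bit/coord

def dequant_mse(idx):
    return P.T @ codebook[idx]                       # look up centroids, rotate back

errs = []
for _ in range(50):
    x = rng.standard_normal(d); x /= np.linalg.norm(x)
    errs.append(np.sum((x - dequant_mse(quant_mse(x))) ** 2))
mse = float(np.mean(errs))                           # ~ 1 - 2/pi ~ 0.36
```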
We are now ready to prove our main theorem for TURBOQUANTmse.
Theorem 1 (performance guarantee: TURBOQUANTmse). For any bit-width $b \geq 1$ and any vector $\pmb{x} \in \mathbb{S}^{d-1}$ , the procedure QUANTmse( $\pmb{x}$ ) in Algorithm 1 outputs an index vector $\mathbf{idx} \in [2^b]^d$ . When this index vector is passed to the primitive DEQUANTmse( $\mathbf{idx}$ ), it produces a reconstructed vector $\tilde{\pmb{x}} \in \mathbb{R}^d$ that satisfies the following distortion bounds:
- MSE defined as $D_{\mathrm{mse}} \coloneqq \mathbb{E}_{\tilde{\boldsymbol{x}}}[\| \boldsymbol{x} - \tilde{\boldsymbol{x}} \|_2^2]$ is bounded by $D_{\mathrm{mse}} \leq \frac{\sqrt{3}\pi}{2} \cdot \frac{1}{4^b}$ for any $b \geq 1$ .
- For small bit-widths, specifically $b=1,2,3,4$ the MSE exhibits finer-grained distortion values: $D_{\mathtt{mse}}\approx\mathbf{0.36,0.117,0.03,0.009}$, respectively.
###### Proof.
We start the proof by showing that $D_{\mathtt{mse}}=d\cdot\mathcal{C}(f_{X},b)$, where $\mathcal{C}(f_{X},b)$ is the optimal MSE cost for scalar quantizer defined in Eq. (4). Let $\tilde{\bm{y}}$ be defined as per line 9 of Algorithm 1. Since $\bm{\Pi}$ is a rotation matrix we can write: $\left\|\bm{x}-\tilde{\bm{x}}\right\|_{2}=\left\|\bm{\Pi}\cdot\bm{x}-\tilde{\bm{y}}\right\|_{2}$. Using the notation $\bm{y}=\bm{\Pi}\cdot\bm{x}$ as per line 5 of Algorithm 1 and plugging this into the definition of $D_{\mathtt{mse}}$ we can write:
$$ \begin{array}{rl} D_{\mathtt{mse}} &= \mathbb{E}[\left\|\bm{y}-\tilde{\bm{y}}\right\|_{2}^{2}] = \sum_{j\in[d]}\mathbb{E}\left[|\bm{y}_{j}-\tilde{\bm{y}}_{j}|^{2}\right] = \sum_{j\in[d]}\mathbb{E}\left[|\bm{y}_{j}-c_{\mathtt{idx}_{j}}|^{2}\right] \\ &= d\cdot\mathbb{E}\left[|\bm{y}_{1}-c_{\mathtt{idx}_{1}}|^{2}\right] \\ &= d\cdot\min_{-1\leq c_{1}\leq c_{2}\leq\ldots\leq c_{2^{b}}\leq 1}\sum_{i=1}^{2^{b}}\int_{\frac{c_{i-1}+c_{i}}{2}}^{\frac{c_{i}+c_{i+1}}{2}}\left|x-c_{i}\right|^{2}\cdot f_{X}(x)\ dx \\ &= d\cdot\mathcal{C}(f_{X},b). \end{array} $$
The third equality above follows from the definition of $\tilde{\bm{y}}$ in line 9 of Algorithm 1, and the fourth equality follows because all $\bm{y}_{j}$'s are identically distributed with $\bm{y}_{j}\sim f_{X}(\cdot)$, as shown in Lemma 1. The last two equalities follow because $c_{\mathtt{idx}_{j}}$ is chosen to be the nearest centroid to each coordinate $\bm{y}_{j}$ in line 6.
Now we must bound the optimal k-means cost $\mathcal{C}(f_{X},b)$. For moderate values of $d$, $f_{X}\to\mathcal{N}(0,1/d)$. By numerically solving the optimization problem in Eq. (4) for values $b=1,2,3,4$ we get that $\mathcal{C}(f_{X},b)\approx\frac{0.36}{d},\frac{0.117}{d},\frac{0.03}{d},\frac{0.009}{d}$, respectively. For larger bit-widths $b>4$, we can apply the Panter-Dite *[44]* high-resolution formula for the distortion of a fixed-rate scalar quantizer, yielding the following bound:
$\mathcal{C}(f_{X},b)\leq\frac{1}{12}\cdot\left(\int f_{X}(x)^{1/3}\ dx\right)^{3}\cdot\frac{1}{4^{b}}=\frac{\sqrt{3}\pi}{2d}\cdot\frac{1}{4^{b}}.$
This completes the proof. ∎
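As a quick numerical sanity check on the constant above (an illustrative check, not part of the proof): for a standard normal density, $\frac{1}{12}\left(\int f(x)^{1/3}\,dx\right)^{3}$ evaluates to $\frac{\sqrt{3}\pi}{2}\approx 2.72$, which rescales to $\frac{\sqrt{3}\pi}{2d}$ for variance $1/d$:

```python
import math

# check (1/12) * (integral of f^{1/3})^3 = sqrt(3)*pi/2 for f = N(0,1),
# the constant behind the Panter-Dite bound above
def f13(x):
    return (math.exp(-x * x / 2) / math.sqrt(2 * math.pi)) ** (1 / 3)

# simple trapezoidal quadrature on [-20, 20]
a, b_, n = -20.0, 20.0, 100_000
h = (b_ - a) / n
integral = h * (sum(f13(a + i * h) for i in range(1, n)) + (f13(a) + f13(b_)) / 2)
const = integral ** 3 / 12
print(const, math.sqrt(3) * math.pi / 2)  # both ≈ 2.7207
```

Note that this is exactly the $\approx 2.7$ gap to the information-theoretic lower bound mentioned in the abstract.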
#### Entropy Encoding Codebook Pointers.
TurboQuant’s efficiency can be further increased by applying entropy encoding to the indices that point to the closest codebook elements. Specifically, the probability of each codeword index appearing in the quantized vectors can be computed as $p_{\ell}:=\int_{\frac{c_{\ell-1}+c_{\ell}}{2}}^{\frac{c_{\ell}+c_{\ell+1}}{2}} f_{X}(x)\ dx$. Optimally coding the indices reduces the average bit-width to nearly the entropy of the distribution $\{p_{i}\}_{i\in[2^{b}]}$. This lossless compression does not affect the distortion and provides a bit-width reduction at no cost. The most significant reduction occurs for $b=4$, where the entropy of $\{p_{i}\}_{i\in[2^{b}]}$ is approximately $3.8$. Detailed calculations for optimal prefix codes reveal that the average bit-width can be reduced by $5\%$. However, given the limited gain, we have chosen not to incorporate this technique into TurboQuant to maintain simplicity and speed.
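The entropy figure can be reproduced by solving Eq. (4) for a standard normal at $b=4$ and computing the entropy of the resulting cell probabilities (an illustrative sketch, not the authors' code; the Lloyd-Max solve is our assumption for how Eq. (4) is solved):

```python
import math

pdf = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
cdf = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

k = 16  # 2^b levels for b = 4
c = [-3 + 6 * (i + 0.5) / k for i in range(k)]
for _ in range(2000):  # Lloyd-Max iteration for N(0,1)
    t = [-math.inf] + [(c[i] + c[i + 1]) / 2 for i in range(k - 1)] + [math.inf]
    c = [(pdf(t[i]) - pdf(t[i + 1])) / (cdf(t[i + 1]) - cdf(t[i])) for i in range(k)]

# final cell boundaries, cell probabilities p_i, and their entropy in bits
t = [-math.inf] + [(c[i] + c[i + 1]) / 2 for i in range(k - 1)] + [math.inf]
p = [cdf(t[i + 1]) - cdf(t[i]) for i in range(k)]
entropy = -sum(q * math.log2(q) for q in p)
print(entropy)  # ≈ 3.8 bits, as stated above
```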
Algorithm 2 TURBOQUANTprod: optimized for inner product

1: input: dimension $d$ and bit-width $b$ // Global Parameters for Setting up TURBOQUANTprod
2: Instantiate a TURBOQUANTmse with bit-width $b-1$ as per Algorithm 1
3: Generate a random projection matrix $\bm{S}\in\mathbb{R}^{d\times d}$ with i.i.d. entries $\bm{S}_{i,j}\sim\mathcal{N}(0,1)$
4: Procedure QUANTprod($\bm{x}$)
5: idx $\leftarrow$ QUANTmse($\bm{x}$)
6: $\bm{r}\leftarrow\bm{x}-$ DEQUANTmse(idx) {residual vector}
7: qjl $\leftarrow$ sign($\bm{S}\cdot\bm{r}$) {QJL on residual vector}
8: output: (idx, qjl, $\|\bm{r}\|_{2}$)
9: Procedure DEQUANTprod(idx, qjl, $\gamma$)
10: $\tilde{\bm{x}}_{\text{mse}}\leftarrow$ DEQUANTmse(idx)
11: $\tilde{\bm{x}}_{\text{qjl}}\leftarrow\frac{\sqrt{\pi/2}}{d}\cdot\gamma\cdot\bm{S}^{\top}\cdot$ qjl
12: output: $\tilde{\bm{x}}_{\text{mse}}+\tilde{\bm{x}}_{\text{qjl}}$

3.2 Inner-product Optimal TurboQuant
For important applications like nearest neighbor search, having an unbiased inner product estimator is essential. However, TURBOQUANTmse presented in Section 3.1 does not provide unbiased inner product estimates with query vectors. To illustrate this, consider the case with a bit-width of $b=1$. In this scenario, the optimal codebook that solves the optimization problem in Eq. (4), for sufficiently large $d$, is $\left\{\pm\sqrt{\frac{2}{\pi d}}\right\}$. This implies that the quantization map for TURBOQUANTmse is $Q_{\text{mse}}(\bm{x})=\text{sign}(\bm{\Pi}\cdot\bm{x})$ for any $\bm{x}\in\mathbb{R}^{d}$, and the dequantization map is $Q_{\text{mse}}^{-1}(\bm{z})=\sqrt{\frac{2}{\pi d}}\cdot\bm{\Pi}^{\top}\cdot\bm{z}$ for any $\bm{z}\in\{-1,+1\}^{d}$. Therefore, for large enough $d$, according to Lemma 4, we have $\mathbb{E}\left[\left\langle\bm{y},Q_{\text{mse}}^{-1}\left(Q_{\text{mse}}(\bm{x})\right)\right\rangle\right]=\frac{2}{\pi}\cdot\left\langle\bm{y},\bm{x}\right\rangle$, which has a multiplicative bias of $2/\pi$. This bias diminishes with increasing bit-widths $b$, as we empirically demonstrate in Section 4.1.
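This $2/\pi$ bias is easy to observe numerically. The sketch below (illustrative, under our assumptions: a fresh Haar-random $\bm{\Pi}$ per trial and a query $\bm{y}$ correlated with $\bm{x}$) averages the 1-bit estimate over independent rotations:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
z = rng.standard_normal(d)
y = x + 0.5 * z / np.linalg.norm(z)  # query correlated with x
y /= np.linalg.norm(y)

est = []
for _ in range(300):
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Pi = Q * np.sign(np.diag(R))  # fresh Haar-random rotation per trial
    # 1-bit quantize then dequantize: sqrt(2/(pi*d)) * Pi^T * sign(Pi x)
    x_hat = np.sqrt(2 / (np.pi * d)) * (Pi.T @ np.sign(Pi @ x))
    est.append(y @ x_hat)

print(np.mean(est) / (y @ x))  # ≈ 2/pi ≈ 0.64
```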
To address this bias, we propose a solution that combines TURBOQUANTmse with an instance of QJL *[62]*. Specifically, let $Q_{\text{mse}}$ be the quantization map corresponding to TURBOQUANTmse with a bit-width of $b-1$. For any $\bm{x}\in\mathbb{S}^{d-1}$ the residual vector, defined as $\bm{r}:=\bm{x}-Q_{\text{mse}}^{-1}\left(Q_{\text{mse}}(\bm{x})\right)$, has a small L2 norm: in expectation, $\mathbb{E}[\|\bm{r}\|_2^2]=d\cdot\mathcal{C}(f_{X},b-1)$ (per Eq. (4)). We can then apply the QJL quantization map $Q_{\text{qjl}}$ on this residual vector, resulting in an overall bit-width of $b$ and providing the following unbiased inner product estimator:
$\left\langle\bm{y},Q_{\text{mse}}^{-1}\left(Q_{\text{mse}}(\bm{x})\right)\right\rangle+\left\|\bm{r}\right\|_{2}\cdot\left\langle\bm{y},Q_{\text{qjl}}^{-1}\left(Q_{\text{qjl}}(\bm{r})\right)\right\rangle.$
More formally, the quantization map $Q_{\text{prod}}:\mathbb{S}^{d-1}\rightarrow[2^{b-1}]^{d}\times\{-1,1\}^{d}\times\mathbb{R}$ is defined as:
$Q_{\text{prod}}(\bm{x})=\left[Q_{\text{mse}}(\bm{x}),Q_{\text{qjl}}\left(\bm{x}-Q_{\text{mse}}^{-1}\left(Q_{\text{mse}}(\bm{x})\right)\right),\left\|\bm{x}-Q_{\text{mse}}^{-1}\left(Q_{\text{mse}}(\bm{x})\right)\right\|_{2}\right].$
Pseudocode for this procedure is given in Algorithm 2.
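The two procedures of Algorithm 2 can be sketched in NumPy as follows (a hedged illustration for total bit-width $b=2$, i.e. a 1-bit MSE stage plus the 1-bit QJL; the rotation construction and variable names are our assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_prod(x, Pi, S, centroids):
    y = Pi @ x
    idx = np.abs(y[:, None] - centroids[None, :]).argmin(axis=1)  # MSE stage
    r = x - Pi.T @ centroids[idx]   # residual vector (line 6)
    qjl = np.sign(S @ r)            # 1-bit QJL sketch of the residual (line 7)
    return idx, qjl, np.linalg.norm(r)

def dequant_prod(idx, qjl, gamma, Pi, S, centroids):
    d = S.shape[1]
    x_mse = Pi.T @ centroids[idx]
    x_qjl = (np.sqrt(np.pi / 2) / d) * gamma * (S.T @ qjl)  # line 11
    return x_mse + x_qjl

d = 256
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # fixed random rotation
centroids = np.array([-1.0, 1.0]) * np.sqrt(2 / (np.pi * d))  # b - 1 = 1 bit
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
z = rng.standard_normal(d)
y = x + 0.5 * z / np.linalg.norm(z)  # query correlated with x
y /= np.linalg.norm(y)

# averaging over fresh QJL matrices S shows the estimator is unbiased
est = []
for _ in range(300):
    S = rng.standard_normal((d, d))
    est.append(y @ dequant_prod(*quant_prod(x, Pi, S, centroids), Pi, S, centroids))
print(np.mean(est), y @ x)  # the two values nearly coincide
```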
We prove the main result for TURBOQUANTprod in the following theorem.
Theorem 2 (performance guarantee: $\mathrm{TURBOQUANT}_{\mathrm{prod}}$). For any bit-width $b \geq 1$ and any vector $\mathbf{x} \in \mathbb{S}^{d-1}$, the procedure $\mathrm{QUANT}_{\mathrm{prod}}(\mathbf{x})$ in Algorithm 2 outputs an index vector $\mathbf{idx} \in [2^{b-1}]^d$ along with a sign vector $\mathbf{qjl} \in \{-1,1\}^d$ and a positive number $\gamma \geq 0$. When these vectors and the scalar value are passed to the primitive $\mathrm{DEQUANT}_{\mathrm{prod}}(\mathbf{idx}, \mathbf{qjl}, \gamma)$, it produces a reconstructed vector $\tilde{\mathbf{x}} \in \mathbb{R}^d$ that for any vector $\mathbf{y} \in \mathbb{R}^d$ satisfies the following properties:
- Expected inner-product: $\mathbb{E}_{\tilde{\boldsymbol{x}}}[\langle \boldsymbol{y},\tilde{\boldsymbol{x}}\rangle ] = \langle \boldsymbol{y},\boldsymbol{x}\rangle$ .
- Inner-product distortion, defined as $D_{\mathrm{prod}} := \mathbb{E}_{\tilde{\boldsymbol{x}}}\left[|\langle \boldsymbol{y},\boldsymbol{x}\rangle -\langle \boldsymbol{y},\tilde{\boldsymbol{x}}\rangle |^2\right]$ , is bounded by $D_{\mathrm{prod}} \leq \frac{\sqrt{3}\pi^2 \cdot \| \boldsymbol{y} \|_2^2}{d} \cdot \frac{1}{4^b}$ for any $b \geq 1$ .
- For small bit-widths, specifically $b = 1,2,3,4$ , $D_{\mathrm{prod}}$ exhibits finer-grained distortion values: $D_{\mathrm{prod}} \approx \frac{1.57}{d}, \frac{0.56}{d}, \frac{0.18}{d}, \frac{0.047}{d}$ , respectively.
Proof. First we compute the conditional expectation of the inner product estimate $\langle \pmb {y},\tilde{\pmb{x}}\rangle$ conditioned on $\tilde{\pmb{x}}_{\mathrm{mse}}$ as follows:
$$ \begin{array}{rl} \mathbb{E}\left[\langle \boldsymbol{y},\tilde{\boldsymbol{x}}\rangle \,\middle|\, \tilde{\boldsymbol{x}}_{\mathrm{mse}}\right] &= \mathbb{E}_{\tilde{\boldsymbol{x}}_{\mathrm{qjl}}}\left[\langle \boldsymbol{y},\tilde{\boldsymbol{x}}_{\mathrm{mse}}+\tilde{\boldsymbol{x}}_{\mathrm{qjl}}\rangle \,\middle|\, \tilde{\boldsymbol{x}}_{\mathrm{mse}}\right] \\ &= \langle \boldsymbol{y},\tilde{\boldsymbol{x}}_{\mathrm{mse}}\rangle + \mathbb{E}_{\tilde{\boldsymbol{x}}_{\mathrm{qjl}}}\left[\langle \boldsymbol{y},\tilde{\boldsymbol{x}}_{\mathrm{qjl}}\rangle \,\middle|\, \tilde{\boldsymbol{x}}_{\mathrm{mse}}\right] \\ &= \langle \boldsymbol{y},\tilde{\boldsymbol{x}}_{\mathrm{mse}}\rangle + \langle \boldsymbol{y},\boldsymbol{r}\rangle \\ &= \langle \boldsymbol{y},\boldsymbol{x}\rangle , \end{array} $$
where the first equality follows from the definition of $\tilde{\pmb{x}}$ in line 12 of the algorithm. The third equality above follows from Lemma 4, and the last line follows from the definition of the residual vector $\pmb{r} = \pmb{x} - \tilde{\pmb{x}}_{\mathrm{mse}}$ in line 6. Now we can compute the unconditional expectation using the law of total expectation: $\mathbb{E}_{\tilde{\pmb{x}}}[\langle \pmb {y},\tilde{\pmb{x}}\rangle ] = \mathbb{E}_{\tilde{\pmb{x}}_{\mathrm{mse}}}[\mathbb{E}[\langle \pmb {y},\tilde{\pmb{x}}\rangle |\tilde{\pmb{x}}_{\mathrm{mse}}]] = \mathbb{E}[\langle \pmb {y},\pmb {x}\rangle ] = \langle \pmb {y},\pmb {x}\rangle$, which proves the first claim of the theorem.
We apply the same conditioning on $\tilde{\pmb{x}}_{\mathrm{mse}}$, when computing the distortion, and then compute the resulting conditional distortion:
$$ \begin{array}{rl} \mathbb{E}\left[|\langle \boldsymbol{y},\boldsymbol{x}\rangle - \langle \boldsymbol{y},\tilde{\boldsymbol{x}}\rangle|^{2} \,\middle|\, \tilde{\boldsymbol{x}}_{\mathrm{mse}}\right] &= \mathbb{E}_{\tilde{\boldsymbol{x}}_{\mathrm{qjl}}}\left[|\langle \boldsymbol{y},\boldsymbol{x}\rangle - \langle \boldsymbol{y},\tilde{\boldsymbol{x}}_{\mathrm{mse}}+\tilde{\boldsymbol{x}}_{\mathrm{qjl}}\rangle|^{2} \,\middle|\, \tilde{\boldsymbol{x}}_{\mathrm{mse}}\right] \\ &= \mathbb{E}_{\tilde{\boldsymbol{x}}_{\mathrm{qjl}}}\left[|\langle \boldsymbol{y},\boldsymbol{r}\rangle - \langle \boldsymbol{y},\tilde{\boldsymbol{x}}_{\mathrm{qjl}}\rangle|^{2} \,\middle|\, \tilde{\boldsymbol{x}}_{\mathrm{mse}}\right] \\ &= \operatorname{Var}\left(\langle \boldsymbol{y},\tilde{\boldsymbol{x}}_{\mathrm{qjl}}\rangle \mid \tilde{\boldsymbol{x}}_{\mathrm{mse}}\right) \\ &\leq \frac{\pi}{2d}\cdot \|\boldsymbol{r}\|_{2}^{2}\,\|\boldsymbol{y}\|_{2}^{2}, \end{array} $$
where the second equality above follows from the definitions of $\pmb{r}$ and $\tilde{\pmb{x}}_{\mathrm{mse}}$ in lines 6 and 10 of Algorithm 2. The third line above follows because $\mathbb{E}[\langle \pmb {y},\tilde{\pmb{x}}_{\mathrm{qjl}}\rangle ] = \langle \pmb {y},\pmb {r}\rangle$, by Lemma 4. The last line follows from the variance bound of the QJL estimator shown in Lemma 4 and the fact that $\tilde{\pmb{x}}_{\mathrm{qjl}}$ in line 11 is re-scaled by $\gamma = \| \pmb {r}\|_2$.
Now by law of total expectation along with the fact that $\bm{r}=\bm{x}-\tilde{\bm{x}}_{\texttt{mse}}$ we can bound the inner product distortion as follows:
$$ \begin{array}{rl} D_{\texttt{prod}} &= \mathbb{E}_{\tilde{\bm{x}}_{\texttt{mse}}}\left[\mathbb{E}\left[\left|\left\langle\bm{y},\bm{x}\right\rangle-\left\langle\bm{y},\tilde{\bm{x}}\right\rangle\right|^{2} \,\middle|\, \tilde{\bm{x}}_{\texttt{mse}}\right]\right] \\ &\leq \frac{\pi}{2d}\cdot\left\|\bm{y}\right\|_{2}^{2}\cdot\mathbb{E}\left[\left\|\bm{x}-\tilde{\bm{x}}_{\texttt{mse}}\right\|_{2}^{2}\right] \\ &= \frac{\pi}{2d}\cdot\left\|\bm{y}\right\|_{2}^{2}\cdot D_{\texttt{mse}}. \end{array} $$
The theorem follows by invoking the MSE bounds from Theorem 1 with bit-width $b-1$. ∎
3.3 Lower Bounds
We show that TurboQuant achieves an optimal distortion rate, up to a small constant factor, for any bit-width by proving lower bounds on the best achievable distortion for any compression algorithm. Our lower bound proof leverages Yao’s minimax principle. This principle allows us to relate the lower bound for randomized algorithms with worst-case deterministic input vectors to the lower bound for deterministic algorithms with randomized input vectors. Subsequently, we derive a lower bound on the achievable distortion rate for the latter using Shannon’s lower bound (SLB) presented in Section 2.1. Formally, we prove the following theorem.
###### Theorem 3 (lower bound on best achievable compression distortion).
For any randomized quantization algorithm $Q:\mathbb{S}^{d-1}\to\{0,1\}^{b\cdot d}$ with bit-width $b$ and any reconstruction map $Q^{-1}:\{0,1\}^{b\cdot d}\to\mathbb{R}^{d}$, there exists a hard input instance $\bm{x}\in\mathbb{S}^{d-1}$ such that:
$D_{\texttt{mse}}(Q):=\operatorname*{\mathbb{E}}\left[\left\|\bm{x}-Q^{-1}\left(Q(\bm{x})\right)\right\|_{2}^{2}\right]\geq\frac{1}{4^{b}}.$
Furthermore, there exists a $\bm{y}\in\mathbb{S}^{d-1}$ such that:
$D_{\texttt{prod}}(Q)=\operatorname*{\mathbb{E}}\left[\left|\left\langle\bm{y},\bm{x}\right\rangle-\left\langle\bm{y},Q^{-1}\left(Q(\bm{x})\right)\right\rangle\right|^{2}\right]\geq\frac{1}{d}\cdot\frac{1}{4^{b}}.$
###### Proof.
By Yao’s minimax principle the expected MSE of the optimal randomized compression algorithm for worst-case inputs ($D_{\texttt{mse}}$) is equal to the expected MSE of the optimal deterministic compression algorithm when applied to inputs drawn from a maximally difficult randomized distribution. By definition, the MSE of the latter scenario is lower-bounded by the best achievable MSE for inputs uniformly distributed on the unit hypersphere.
The best achievable MSE for a compression algorithm with bit-width $b$, operating on uniformly distributed inputs from the sphere $\mathbb{S}^{d-1}$, is lower bounded in Lemma 3. Therefore, by invoking Lemma 3 we conclude that $D_{\texttt{mse}}\geq\frac{1}{4^{b}}$.
Furthermore, from $D_{\mathtt{mse}}\geq\frac{1}{4^{b}}$ and using the definition of $D_{\mathtt{mse}}$ we conclude that:
$$ D_{\mathtt{mse}} = \sum_{j=1}^{d}\mathbb{E}\left[\left|\bm{x}_{j}-\left[Q^{-1}\left(Q(\bm{x})\right)\right]_{j}\right|^{2}\right] = \sum_{j=1}^{d}\mathbb{E}\left[\left|\left\langle\bm{e}_{j},\bm{x}\right\rangle-\left\langle\bm{e}_{j},Q^{-1}\left(Q(\bm{x})\right)\right\rangle\right|^{2}\right] \geq \frac{1}{4^{b}}. $$
By the pigeonhole principle there exists an index $j\in[d]$ such that $\mathbb{E}\left[\left|\left\langle\bm{e}_{j},\bm{x}\right\rangle-\left\langle\bm{e}_{j},Q^{-1}\left(Q(\bm{x})\right)\right\rangle\right|^{2}\right]\geq\frac{1}{d}\cdot\frac{1}{4^{b}}$, which completes the proof. ∎
We note that a comparable lower bound for the *worst-case* distortion in vector quantization can be derived using “sphere packing” arguments (indeed, with larger constants as this is a harder problem) *[26]*. However, Theorem 3 offers a more robust and relevant lower bound for our analysis. This is because it establishes a lower bound on the *expected distortion*, rather than the worst-case error, and aligns seamlessly with our upper bounds presented in Theorem 1 and Theorem 2.
4 Experiments
All experiments are performed using a single NVIDIA A100 GPU. The experimental section is divided into two parts: one to empirically validate the theoretical results, and another to evaluate the performance of our methods on downstream tasks, specifically KV cache quantization and nearest neighbor vector search.
4.1 Empirical Validation
In this section, we verify the theoretical results established in previous sections. We conduct our experiments using the DBpedia Entities dataset, encoded into a 1536-dimensional space using OpenAI embeddings. We randomly sample 100,000 data points from the dataset, denoted the training set, which serves as our primary dataset. Additionally, we extract 1,000 distinct entries, denoted the query set, to be used as query points.
We evaluate two quantization methods: TURBOQUANTprod and TURBOQUANTmse. TURBOQUANTmse is optimized for minimizing the mean squared error (MSE) between the quantized and original vectors. In contrast, TURBOQUANTprod is unbiased for estimating inner products with the original vectors.
Both methods are applied to the task of inner product estimation by quantizing the training set and analyzing the distortion in inner product calculations across different bit-widths. As shown in Fig. 1, increasing the bit-width reduces variance in both methods. However, when used for inner product estimation, TURBOQUANTmse introduces bias. This bias diminishes as the bit-width increases and eventually converges to zero.

Figure 1: Error distribution of TURBOQUANTprod (a) and TURBOQUANTmse (b) for inner product estimation.
The experimental results, illustrated in Fig. 1, confirm that TURBOQUANTprod remains unbiased for inner product estimation across all bit widths, while TURBOQUANTmse gradually improves with increasing bit width.
As observed in Fig. 2, when quantizing to 2 bits, the variance remains constant regardless of the inner product of the original vector in the TURBOQUANTprod approach. However, the same plot indicates that the bias in the TURBOQUANTmse approach is dependent on the average inner product. As the average inner product increases, the bias also increases.
Along with the histograms, we also plot the average inner product error and MSE between the original and quantized vectors across different bit-widths, alongside the upper and lower bounds established in our theoretical analysis. Our observations confirm that the results align with the theoretical predictions. Specifically, for inner product estimation, the TURBOQUANTprod approach performs better at lower bit-widths. However, as the bit count increases, TURBOQUANTmse reduces bias and ultimately achieves superior performance in inner product estimation.