LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Randall Balestriero $^{1,2,*}$ Yann LeCun $^{3,2,*}$

$^{1}$ Brown University $^{3}$ New York University (NYU) $^{2}$ Meta-FAIR

* Equal contribution

Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but the lack of practical guidance and theory has led to ad hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA, which has numerous theoretical and practical benefits: (i) a single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyperparameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) no heuristics, e.g., no stop-gradient, no teacher-student, no hyperparameter schedulers, and (v) a distributed-training-friendly implementation requiring only $\approx 50$ lines of code. Our empirical validation covers $10+$ datasets and $60+$ architectures, with varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with a frozen backbone, LeJEPA reaches $79\%$ accuracy with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo).

![img-0.jpeg](img-0.jpeg)

![img-1.jpeg](img-1.jpeg)

![img-2.jpeg](img-2.jpeg) Figure 1. LeJEPA overview. Top-left: Training loss correlates strongly with downstream linear-probe performance on ImageNet-1k (ViT-Base), providing the first practical loss for model selection without supervised probing. Top-right: The training loss remains stable without heuristics, even for 1.8B-parameter ViT-g models. Bottom-left: PCA features from an ImageNet-1k-pretrained LeJEPA ViT-Large show clear semantic relationships. Bottom-right: Galaxy10 results, where LeJEPA's in-domain pretraining consistently outperforms transfer learning from state-of-the-art frontier foundation models (DINOv2/v3 trained on natural images) across data regimes from 1-shot to full supervision. This demonstrates that domain-specific SSL beats generic transfer learning, even against massive-scale frontier models, when the framework scales effortlessly to any domain, model, and data scale.

![img-3.jpeg](img-3.jpeg)

| Method | Full FT (1-shot) | Full FT (full) | Frozen (1-shot) | Frozen (full) |
| --- | --- | --- | --- | --- |
| LeJEPA (in-domain) | | | | |
| ConvNeXt-V2 Nano | 29.42 | 82.72 | 28.74 | 76.52 |
| ResNet-34 | 24.27 | 83.28 | 31.08 | 78.17 |
| Frontier (transfer) | | | | |
| DINOv2 ViT-S/16 | 21.05 | 78.34 | 27.68 | 67.62 |
| DINOv3 ViT-S/16 | 24.71 | 81.60 | 30.17 | 71.38 |

1 Introduction

Learning manipulable representations of the world and its dynamics is a long-standing question in AI, with roots dating back more than a century *(Von Helmholtz, 1867; Tolman, 1948; Gregory, 1980; Sutton, 1991; Friston, 2010)*. Across domains, e.g., image recognition, robotics, physics, space exploration, the unifying question is how to learn an organized and actionable high-dimensional embedding space from observations. Using Deep Networks (parameterized non-linear operators $f_{\theta}$) to map observations to embeddings is a standard first piece of that puzzle *(LeCun et al., 2015; Goodfellow et al., 2016)*. The second, less standardized, piece of that puzzle is how to train $f_{\theta}$. Joint-Embedding Predictive Architectures (JEPAs) suggest training $f_{\theta}$ by maximizing predictive agreement between the embeddings of semantically related views *(Bromley et al., 1993; LeCun, 2022; Balestriero et al., 2023)*. Views can come in two forms: transformations or corruptions. They can involve masking, cropping, blurring, temporal or spatial translations, geometric or photometric transformations, viewpoint changes, views from different sensor modalities, etc. The supervised forms involve human-produced components such as image-caption pairs, text-code pairs, etc. *(Tian et al., 2020)*. In any case, views are expected to share some degree of semantic relationship so that the prediction task aligns $f_{\theta}$’s embeddings with the underlying knowledge present in the data.

Alas, JEPA’s prediction task admits failure modes, such as representation collapse, where $f_{\theta}$ maps all inputs to nearly identical embeddings (complete collapse) or to a low-dimensional subspace (dimensional collapse) *(Jing et al., 2021; Cosentino et al., 2022; Balestriero and LeCun, 2022)*. To mitigate such shortcut solutions, state-of-the-art recipes rely on heuristics such as stop-gradient *(Chen et al., 2020a)*, asymmetric view generation *(Wang et al., 2022)*, teacher–student networks with carefully tuned EMA schedules *(Caron et al., 2021; Tian et al., 2021)*, and explicit normalization and whitening layers *(Ermolov et al., 2021; Chen et al., 2021)*, together with a delicate balance of hyperparameters. As a result, today’s JEPA training is brittle, and most research has shifted toward scaling data *(Vo et al., 2024)*, models *(Fan et al., 2025)* and even post-training *(Rodas et al., 2025)* while leaving the theoretical foundations of JEPAs largely unexplored.
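The two collapse modes described above can be told apart with simple statistics of a batch of embeddings. The following is a minimal sketch for illustration only: the helper names (`total_variance`, `numerical_rank`) and the tolerance `tol` are assumptions introduced here, not part of any JEPA recipe, and the code is plain Python to stay self-contained (an array library would normally be used instead).

```python
import random


def centered(embeddings):
    """Subtract the per-dimension mean from every embedding."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(z[j] for z in embeddings) / n for j in range(dim)]
    return [[z[j] - mean[j] for j in range(dim)] for z in embeddings]


def total_variance(embeddings):
    """Sum of per-dimension variances; near zero under complete collapse."""
    rows = centered(embeddings)
    return sum(x * x for r in rows for x in r) / len(rows)


def numerical_rank(embeddings, tol=1e-8):
    """Rank of the centered embedding matrix via Gaussian elimination with
    partial pivoting; a rank below the embedding dimension signals
    dimensional collapse. `tol` is an assumed threshold."""
    rows = [r[:] for r in centered(embeddings)]
    dim = len(rows[0])
    rank = 0
    for col in range(dim):
        if rank == len(rows):
            break
        pivot = max(range(rank, len(rows)), key=lambda i: abs(rows[i][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no new direction
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            for j in range(col, dim):
                rows[i][j] -= factor * rows[rank][j]
        rank += 1
    return rank


rng = random.Random(0)
healthy = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
constant = [[1.0, 1.0, 1.0] for _ in range(50)]   # complete collapse
planar = [[a, b, a + b] for a, b, _ in healthy]   # dimensional collapse
for name, batch in [("healthy", healthy), ("constant", constant), ("planar", planar)]:
    print(name, total_variance(batch), numerical_rank(batch))
```

In this toy setting, the constant batch has zero variance, while the planar batch keeps non-zero variance but spans fewer directions than the embedding dimension.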

Our study proposes to break that cycle by questioning some of the fundamental design principles underpinning JEPAs. That introspection starts by asking which necessary conditions JEPAs should abide by. Those minimal conditions then act as axioms from which we design a novel and lean JEPA. We identify two axioms: (i) solving the prediction task while (ii) enforcing an isotropic Gaussian distribution of the embeddings (Section 3). While (i) follows standard practice *(Balestriero and LeCun, 2022)*, we introduce in Section 4 a novel distribution matching objective, Sketched Isotropic Gaussian Regularization (SIGReg), to enforce (ii). SIGReg not only removes the need for the numerous heuristics previously employed to prevent representation collapse, but also exhibits favorable scaling properties, as its memory and computational complexities are linear in dimension and sample size. Crucially, SIGReg’s isotropic Gaussian enforcement rules out the collapsed shortcut solution and provably minimizes the model’s expected risk over the space of downstream tasks to be encountered post-training. The resulting JEPA solution, coined Latent-Euclidean JEPA (LeJEPA), is introduced in Section 5. Beyond theoretical optimality, LeJEPA offers numerous benefits such as (i) provable statistical guarantees, (ii) removal of heuristics such as teacher-student networks, (iii) linear memory and computational complexity, and most importantly (iv) a unified design with a single trade-off parameter that works out of the box across datasets, architectures and scales (see Section 6). We summarize our contributions below.
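To make the role of the single trade-off parameter concrete, the two axioms can be read as one objective. The form below is a schematic sketch only: the symbols $\mathcal{L}_{\mathrm{pred}}$ and $\lambda$, and the simple additive weighting, are notation assumed here for illustration, and the precise definition is the one given in Section 5.

$$
\mathcal{L}_{\mathrm{LeJEPA}}(\theta) \;=\; \underbrace{\mathcal{L}_{\mathrm{pred}}(\theta)}_{\text{axiom (i): prediction}} \;+\; \lambda \, \underbrace{\mathrm{SIGReg}\big(\{f_{\theta}(x_n)\}_{n=1}^{N}\big)}_{\text{axiom (ii): isotropic Gaussian}}, \qquad \lambda > 0 .
$$

Under this reading, $\lambda$ is the only quantity that balances predictive agreement against the distributional constraint.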

Contribution 1: We prove the optimal embedding distribution for foundation models. We establish that the isotropic Gaussian uniquely minimizes downstream prediction risk across broad task families. In Section 3, we derive this result rigorously for both linear (Section 3.1) and nonlinear probes (Section 3.2), providing the first principled answer to what distribution $f_{\theta}$’s embeddings should follow. This theoretical result transforms JEPA design from heuristic exploration to targeted optimization.

Contribution 2: We introduce SIGReg, a distribution matching objective that uniquely combines provable correctness with computational efficiency at scale. We present Sketched Isotropic Gaussian Regularization (SIGReg), a novel objective that enforces distributional alignment via random projections and characteristic-function matching (Section 4 and Figure 2). SIGReg provides statistical guarantees (Sections 4.1 and 4.2) while achieving linear complexity and bounded gradients, a combination that existing distribution matching methods do not offer. Critically, its projection-based construction defeats the curse of dimensionality (Section 4.3), making it both theoretically sound and practically efficient for high-dimensional embeddings.
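The idea of combining random projections with characteristic-function matching can be illustrated with a small sketch. It is not the SIGReg implementation: the function names, the Gaussian weighting, the frequency range `t_max`, the number of quadrature points `num_t`, and the number of directions are all assumptions chosen for illustration, and plain Python is used only to keep the example self-contained. Each embedding is projected onto random unit directions, and the empirical characteristic function of each 1D projection is compared with $e^{-t^2/2}$, the characteristic function of $\mathcal{N}(0, 1)$.

```python
import cmath
import math
import random


def random_unit_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere (normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cf_mismatch_1d(samples, t_max=3.0, num_t=17):
    """Weighted squared gap between the empirical characteristic function of
    `samples` and that of N(0, 1), integrated over [-t_max, t_max] with the
    trapezoid rule. Weight and grid are assumed, illustrative choices."""
    n = len(samples)
    dt = 2.0 * t_max / (num_t - 1)
    total = 0.0
    for k in range(num_t):
        t = -t_max + k * dt
        ecf = sum(cmath.exp(1j * t * s) for s in samples) / n
        target = math.exp(-0.5 * t * t)   # characteristic function of N(0, 1)
        weight = math.exp(-0.5 * t * t)   # assumed Gaussian weighting
        edge = 0.5 if k in (0, num_t - 1) else 1.0  # trapezoid end points
        total += edge * weight * abs(ecf - target) ** 2 * dt
    return total


def sketched_gaussian_penalty(embeddings, num_directions=16, seed=0):
    """Average 1D mismatch over random directions; cost grows linearly in the
    number of samples and in the embedding dimension."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        a = random_unit_direction(dim, rng)
        projected = [sum(ai * zi for ai, zi in zip(a, z)) for z in embeddings]
        total += cf_mismatch_1d(projected)
    return total / num_directions


rng = random.Random(1)
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]
collapsed = [[0.0] * 8 for _ in range(256)]
print("gaussian :", sketched_gaussian_penalty(gaussian))
print("collapsed:", sketched_gaussian_penalty(collapsed))
```

In this sketch, isotropic Gaussian embeddings yield a penalty close to zero, whereas fully collapsed embeddings yield a much larger one, which is the behaviour a collapse-preventing regularizer needs.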

Contribution 3: We design LeJEPA, a statistically optimal JEPA that eliminates collapse by construction. By combining JEPA’s predictive objective with SIGReg targeting the isotropic Gaussian, we introduce LeJEPA, the Latent-Euclidean JEPA (Section 5). LeJEPA requires only a single hyperparameter, eliminates representational collapse without stop-gradients or teacher-student architectures, and transfers across architectures and datasets without hyperparameter tuning. This demonstrates that principled theory directly yields practical simplicity.

![img-4.jpeg](img-4.jpeg) Figure 2. Sketched Isotropic Gaussian Regularization (SIGReg): Given some arbitrary input data with density $p_x$ whose support may or may not lie on a manifold (left), a Deep Network (DN) encoder $(f_{\theta})$ produces embeddings $z = f_{\theta}(x)$ with some distribution $z \sim p_z$ (middle). Our proposed Backward Cramér-Wold Statistics (Section 4) objective pushes $p_z$ to match a target distribution $p_t$ by projecting the embeddings along $1d$ directions (middle, arrows) and enforcing that the univariate densities (right, colored lines) match the distribution of $p_t$, projected along the same directions. Any popular statistical test (provided in Section 4.2) can assess the goodness-of-fit; in practice, we argue for characteristic function tests (Section 4.2). By using SIGReg with $p_t$ isotropic Gaussian (right, black lines), we introduce a lean and provably optimal (Section 3) JEPA, coined LeJEPA, free of numerous heuristics and able to produce competitive performance (Sections 5 and 6).

Contribution 4: We validate LeJEPA at scale across diverse architectures and establish in-domain pretraining as viable. Our experiments (Section 6) span ViTs, ConvNeXts, ResNets, MaxViTs, and Swin Transformers at scales approaching 1 billion parameters, where LeJEPA matches or exceeds state-of-the-art methods while maintaining training simplicity and robustness. Critically, on domain-specific datasets (Galaxy10, Food101), LeJEPA pretrained directly on the target data outperforms DINOv2-based transfer learning. This challenges the transfer learning paradigm and demonstrates that principled SSL can unlock effective in-domain pretraining, which was previously considered impractical for small datasets.
