NeurIPS 2024

Deep Support Vectors

DeepKKT uncovers support-vector-like samples inside deep classifiers, enables few-shot distillation with almost no data, and turns a classifier into a latent generator.

Junhoo Lee, Hyunho Lee, Kyomin Hwang, Nojun Kwak
Seoul National University · Advances in Neural Information Processing Systems 37

The opening figure is rendered directly from `figure/overall_photo.pdf` in the paper source. It shows DSV generations on ImageNet, CIFAR10, and CIFAR100 produced without revisiting the original training data, which frames the paper's core claim: modern classifiers really do contain recoverable support-vector-like structure.

Generated images from the original overall_photo figure in the Deep Support Vectors source.


Abstract

Deep learning achieves remarkable accuracy, but it still hides the samples and cues that actually shape its decision boundary. Deep Support Vectors asks whether a modern classifier can recover something as structurally important and interpretable as classic support vectors.

The paper introduces DeepKKT, a deep-learning adaptation of the Karush-Kuhn-Tucker condition, to identify or synthesize Deep Support Vectors (DSVs). These DSVs behave like boundary-defining samples: they exhibit high uncertainty, summarize model behavior, and reveal what features a classifier relies on to separate classes.

That same optimization recipe also leads to practical few-shot distillation and classifier-driven generation. One framework ties together interpretability, compression, and synthesis without requiring the original training set at reconstruction time.

Key Ideas

Recover boundary-critical samples

DSVs act like support vectors for deep networks: they sit near the boundary, surface uncertainty, and summarize what the classifier treats as essential.

Distill from a trained classifier

DeepKKT turns a frozen classifier into a practical distillation signal, avoiding the usual full-dataset, snapshot-heavy matching pipelines.

Reuse classification as generation

Once labels become latent variables, the same optimization supports synthesis, interpolation, and editing without a separate generator.

DeepKKT

DeepKKT extends the KKT recipe from maximum-margin classifiers to overparameterized deep networks. Instead of only enforcing correct classification, it also tracks dual feasibility, a stationarity-style relation between the trained model and candidate samples, and a manifold prior that keeps synthesized DSVs on plausible data support.

This produces a practical recipe for recovering support-vector-like samples from a trained classifier without replaying full training. In the paper, the conditions are evaluated on ConvNet and ResNet backbones across CIFAR10, CIFAR100, SVHN, and ImageNet.

Primal
\forall i \in \mathcal{I},\ \operatorname*{arg\,max}_c \Phi_c(x_i; \theta^*) = y_i

Candidate DSVs must still land on the target class.

Dual
\forall i \in \mathcal{I},\ \lambda_i \ge 0

The multiplier becomes a natural measure of sample importance.

Stationarity
\theta^* = -\sum_{i=1}^{n} \lambda_i \nabla_{\theta}\mathcal{L}(\Phi(x_i; \theta^*), y_i)

Support-vector structure is tied back to the trained model.

Manifold
\forall i \in \mathcal{I},\ x_i \in \mathcal{M}

Synthesized DSVs stay close to plausible data support.

From the paper

Relaxed DeepKKT Conditions

This is the actual TeX-level system from `sec/4Deep_Support_Vector.tex`, rendered on-page with KaTeX instead of being approximated as plaintext.

\begin{aligned} &\text{Primal feasibility:} && \forall i \in \mathcal{I},\ \operatorname*{arg\,max}_c \Phi_c(x_i; \theta^*) = y_i \\ &\text{Dual feasibility:} && \forall i \in \mathcal{I},\ \lambda_i \ge 0 \\ &\text{Stationarity:} && \theta^* = -\sum_{i=1}^{n} \lambda_i \nabla_{\theta}\mathcal{L}(\Phi(x_i; \theta^*), y_i) \\ &\text{Manifold:} && \forall i \in \mathcal{I},\ x_i \in \mathcal{M} \end{aligned}
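In code, the primal and dual conditions reduce to two cheap checks on a candidate batch; stationarity and the manifold prior are enforced through the loss rather than checked directly. A minimal sketch (the `check_relaxed_kkt` helper and its argument names are illustrative, not from the release code):

```python
import torch

def check_relaxed_kkt(model: torch.nn.Module,
                      x: torch.Tensor,
                      y: torch.Tensor,
                      lam: torch.Tensor) -> dict:
    """Cheap checks for the relaxed DeepKKT conditions on candidate DSVs.

    Primal feasibility: every candidate must land on its target class.
    Dual feasibility:   every multiplier lambda_i must be non-negative.
    """
    with torch.no_grad():
        primal_ok = bool((model(x).argmax(dim=1) == y).all())
    dual_ok = bool((lam >= 0).all())
    return {"primal": primal_ok, "dual": dual_ok}
```

Stationarity needs gradients of the loss with respect to the frozen weights, so it is handled separately in the optimization objective below.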
Optimization

Primal Surrogate Loss

The paper replaces hard feasibility with a hinge-like surrogate that only penalizes a DSV when the classifier fails to predict the target class.

L_{\text{primal}} = \frac{1}{n}\sum_{i=1}^{n} L_i,\qquad L_i = \begin{cases} 0 & \text{if } \operatorname*{arg\,max}_c \Phi_c(x_i; \theta^*) = y_i, \\ \mathcal{L}(\Phi(x_i; \theta^*), y_i) & \text{otherwise.} \end{cases}
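Concretely, the surrogate is a per-sample classification loss that is masked out whenever the argmax already matches the target. A minimal PyTorch sketch (`primal_surrogate` is an illustrative name, with the loss \(\mathcal{L}\) assumed to be cross-entropy as in standard classification):

```python
import torch
import torch.nn.functional as F

def primal_surrogate(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Hinge-like primal loss: zero once the target class wins the argmax,
    otherwise the usual cross-entropy toward the target label."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    misclassified = (logits.argmax(dim=1) != targets).float()
    return (per_sample * misclassified).mean()  # averages over all n samples
```

Because correctly classified candidates contribute exactly zero, the term stops pushing a DSV once it sits on the right side of the boundary.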
Optimization

Stationarity and Final DSV Objective

The stationarity loss links synthesized samples back to the trained model, and the final objective augments it with primal, total-variation, and norm priors.

L_{\text{stat}} = D\!\left( \theta^*, -\sum_{i=1}^{n} \lambda_i \nabla_{\theta}\mathcal{L}(\Phi(x_i; \theta^*), y_i) \right)
\operatorname{DSV} = \operatorname*{arg\,min}_x \mathbb{E}_{\mathcal{A}} \left[ L_{\text{stationarity}}(\mathcal{A}(x)) + \beta_1 L_{\text{primal}}(\mathcal{A}(x)) + \beta_2 L_{\text{tot}}(x) + \beta_3 L_{\text{norm}}(x) \right]
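The stationarity term can be sketched with double backpropagation: differentiate the λ-weighted per-sample losses with respect to the frozen weights, then penalize the mismatch between θ* and the negated gradient sum. Everything below is an illustrative sketch only: the distance D is assumed here to be one minus cosine similarity, softplus is one convenient way to keep λ non-negative, and the TV, norm, and augmentation terms are omitted; none of these choices is claimed to match the release code.

```python
import torch
import torch.nn.functional as F

def stationarity_loss(model, x, y, lam, theta_star):
    """D(theta*, -sum_i lam_i grad_theta L_i); D assumed to be 1 - cosine."""
    per_sample = F.cross_entropy(model(x), y, reduction="none")
    grads = torch.autograd.grad((lam * per_sample).sum(),
                                list(model.parameters()),
                                create_graph=True)  # keep graph for d/dx
    g = torch.cat([t.reshape(-1) for t in grads])
    return 1.0 - F.cosine_similarity(theta_star, -g, dim=0)

def synthesize_dsvs(model, n, num_classes, shape, steps=100, beta1=1.0):
    """Optimize candidate inputs and multipliers against the DSV objective."""
    theta_star = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
    x = torch.randn(n, *shape, requires_grad=True)
    raw_lam = torch.zeros(n, requires_grad=True)
    y = torch.randint(num_classes, (n,))
    opt = torch.optim.Adam([x, raw_lam], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        lam = F.softplus(raw_lam)                # dual feasibility: lam >= 0
        logits = model(x)
        per_sample = F.cross_entropy(logits, y, reduction="none")
        miss = (logits.argmax(dim=1) != y).float()
        primal = (per_sample * miss).mean()      # hinge-like primal surrogate
        stat = stationarity_loss(model, x, y, lam, theta_star)
        (stat + beta1 * primal).backward()       # only x and raw_lam are stepped
        opt.step()
    return x.detach(), F.softplus(raw_lam).detach()
```

The frozen classifier's parameters receive gradients during the double backward pass but are never stepped, so only the candidate images and multipliers move.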

Why the full objective matters

This figure is rendered directly from `figure/primalstat.pdf`. The paper uses it to show that primal-only optimization captures weak texture cues, while the full DeepKKT objective moves samples toward meaningful class structure.

Original primalstat figure from the Deep Support Vectors source.

Few-Shot Distillation

The few-shot result is especially striking because DeepKKT does not depend on full-dataset access, trajectory matching, or Hessian-heavy objectives. A pretrained classifier itself becomes the supervisory signal for distillation.

On CIFAR10, the source table shows that DSV-based distillation remains competitive even in extremely data-starved settings, including zero-shot cases where no original class image is used to initialize the distilled set.

| img/cls | shot/class | ratio (%) | DC | DSA | DM | DSV |
|---|---|---|---|---|---|---|
| 100 | - | - | - | - | - | 21.68 ± 0.80 |
| | 1 | 0.02 | 16.48 ± 0.81 | 15.41 ± 1.91 | 13.03 ± 0.15 | 22.69 ± 0.38 |
| | 10 | 0.2 | 19.66 ± 0.78 | 21.15 ± 0.58 | 22.42 ± 0.43 | - |
| | 50 | 1 | 25.90 ± 0.62 | 26.01 ± 0.70 | 24.42 ± 0.29 | - |
| | 500 | 10 | 28.06 ± 0.61 | 28.20 ± 0.63 | 25.06 ± 1.20 | - |
| 1000 | - | - | - | - | - | 30.35 ± 0.99 |
| | 10 | 0.2 | 25.06 ± 1.20 | 26.67 ± 1.04 | 29.77 ± 0.66 | 37.90 ± 1.69 |
| | 50 | 1 | 36.44 ± 0.52 | 36.63 ± 0.52 | 36.63 ± 0.52 | - |
| | 500 | 10 | 43.55 ± 0.50 | 44.66 ± 0.59 | 47.96 ± 0.95 | - |
| 5000 | - | - | - | - | - | 39.35 ± 0.54 |
| | 50 | 1 | 41.22 ± 0.90 | 41.29 ± 0.45 | 48.93 ± 0.92 | 53.56 ± 0.73 |
| | 500 | 10 | 52.00 ± 0.59 | 52.19 ± 0.53 | 60.59 ± 0.41 | - |

Reproduced as HTML from `sec/main_table.tex` in the source, using the CIFAR10 few-shot dataset distillation numbers.
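The protocol behind numbers like these is: train a fresh network from scratch on the small distilled set, then report its accuracy on real held-out test data. A minimal illustrative loop, with a linear probe standing in for the ConvNet and synthetic tensors standing in for DSVs and the CIFAR10 test split (`evaluate_distilled_set` is not a function from the release):

```python
import torch
import torch.nn.functional as F

def evaluate_distilled_set(distilled_x, distilled_y, test_x, test_y,
                           num_classes, epochs=50, lr=1e-2):
    """Train a fresh linear probe on the distilled set only,
    then report accuracy on the held-out test tensors."""
    d = distilled_x.flatten(1).shape[1]
    net = torch.nn.Linear(d, num_classes)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(net(distilled_x.flatten(1)), distilled_y).backward()
        opt.step()
    with torch.no_grad():
        acc = (net(test_x.flatten(1)).argmax(1) == test_y).float().mean()
    return acc.item()
```

The distilled set never mixes with test data; only its ability to teach a fresh model is scored.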

Visual Evidence

The paper makes the support-vector analogy concrete in several ways. Candidate DSVs become higher-entropy over optimization, consistent with moving toward a decision boundary, and the learned Lagrange multipliers correlate with downstream classwise test accuracy.

The same decision signals are also visible. DSV-guided edits expose the precise traits a classifier leans on, showing how local semantic changes can flip the model's prediction.

Entropy rises near the boundary

Rendered from `figure/Entropy_DSV.pdf`. As optimization proceeds, DSV candidates become higher-entropy, which matches the intuition that support vectors should lie close to the decision boundary.

Entropy change figure from the Deep Support Vectors source.
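The entropy trend is straightforward to measure: softmax the classifier's logits for each candidate and track Shannon entropy across optimization steps. A minimal sketch (`prediction_entropy` is an illustrative helper, not from the release code):

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-sample Shannon entropy (in nats) of the softmax distribution.
    Higher entropy suggests the candidate sits closer to the decision boundary."""
    log_p = F.log_softmax(logits, dim=1)
    return -(log_p.exp() * log_p).sum(dim=1)
```

A confidently classified image yields near-zero entropy, while a boundary sample approaches log(C) for C classes.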

Lambda tracks classwise generalization

Rendered from `figure/pearson.pdf`. The source figure links the sum of learned lambda values to classwise test accuracy, turning DeepKKT multipliers into a readable importance signal.

Pearson correlation figure from the Deep Support Vectors source.

Decision criteria become editable

Rendered from `figure/editimages.pdf`. The examples show original images, DSV-informed manual edits, and DeepKKT-based edits that change the classifier's prediction by altering the precise cue it relies on.

Edited image examples from the Deep Support Vectors source.

Classifier as Generator

DeepKKT also reframes a classifier as a latent generative model. By treating labels or soft labels as latent variables, the method can synthesize unseen images, interpolate between classes, and perform semantically meaningful edits.
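Treating the label as a latent variable amounts to replacing the hard cross-entropy target with a soft distribution over classes; class interpolation is then just a convex mix of two one-hot vectors fed to the same reconstruction objective. An illustrative sketch, not release code:

```python
import torch
import torch.nn.functional as F

def soft_label(class_a: int, class_b: int, alpha: float,
               num_classes: int) -> torch.Tensor:
    """Convex mix of two one-hot labels:
    alpha toward class_a, (1 - alpha) toward class_b."""
    y = torch.zeros(num_classes)
    y[class_a] += alpha
    y[class_b] += 1.0 - alpha
    return y

def soft_primal_loss(logits: torch.Tensor, soft_y: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a soft target, used in place of the hard primal term."""
    return -(soft_y * F.log_softmax(logits, dim=-1)).sum()
```

Sweeping alpha from 1 to 0 and re-running the DSV optimization at each value traces the kind of class interpolation shown in the figures below; mixup-style targets fall out of the same helper.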

The source figures further show why the full objective matters. Primal-only optimization does not yield meaningful samples, while the stationarity-aware objective produces sharper, more class-consistent generations that stay close to the data manifold.

Class interpolation from soft labels

Rendered from `figure/interpolation_fig_compressed.pdf`. The interpolation figure shows that DeepKKT can move smoothly between class semantics while staying in-distribution.

Interpolation figure from the Deep Support Vectors source.

Mixup in latent label space

Rendered from `figure/mixup_overall.pdf_compressed.pdf`. The supplementary figure illustrates label-space mixing as a direct extension of the same DeepKKT reconstruction objective.

Mixup figure from the Deep Support Vectors supplementary source.

ImageNet mixing examples

Rendered from `figure/mix_imagenet_compressed.pdf`. The ImageNet examples make the same point at larger scale: semantic mixing comes from the classifier-guided objective itself, not from a separately trained generator.

ImageNet mixing figure from the Deep Support Vectors source.

Supplementary Code Layout

The supplementary release ships a compact reconstruction pipeline under `submit_code/`. The page below mirrors that layout rather than inventing a new pseudo-repo structure: `pl_main.py` is the Lightning entrypoint, `base_reconstruct_imagenet.yaml` holds the default reconstruction config, and the optimization logic is split across `models.py`, `svm_modules.py`, and the dataset helpers.

On this project page, the code block below is organized around that release entrypoint, so the GitHub link, README, and reproduction snippet all line up with the supplementary package users actually receive.

Main command
python pl_main.py --yaml base_reconstruct_imagenet.yaml \
  --loss_weight 1e-4 \
  --UPSCALE_cycle 2000 \
  --tv_scale 200000 \
  --alpha_scale 1e-4
Base yaml
ARCHITECTURE: resnet
BATCH_SIZE: 256
CHECKPATH: "./checkpoints/"
SAVE_NAME: RECONSTRUCT
DATASET: CIFAR10
NUM_SAMPLES: 1
PATH_DATASETS: "./data"
RECONSTRUCT_DATA: true
REF_MODEL_PATH: ./models/imagenet.ckpt
DATAPATH: "./data"
MODE: "RECONSTRUCT"
NUM_WORKERS: 12

pl_main.py

PyTorch Lightning entrypoint that parses optimization flags, loads the yaml config, and dispatches reconstruction, base training, or SVM retraining modes.

base_reconstruct_imagenet.yaml

Default reconstruction config used by the supplementary release for DSV synthesis runs.

models.py / svm_modules.py

Core model and DeepKKT optimization logic used for reconstruction and downstream evaluation.

BibTeX

@article{lee2024deepsupportvectors,
  title         = {Deep Support Vectors},
  author        = {Lee, Junhoo and Lee, Hyunho and Hwang, Kyomin and Kwak, Nojun},
  journal       = {Advances in Neural Information Processing Systems},
  volume        = {37},
  year          = {2024},
  eprint        = {2403.17329},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2403.17329}
}