Recover boundary-critical samples
NeurIPS 2024
DeepKKT uncovers support-vector-like samples inside deep classifiers, enables few-shot distillation with almost no data, and turns a classifier into a latent generator.
Deep learning achieves remarkable accuracy, but it still hides the samples and cues that actually shape its decision boundary. Deep Support Vectors asks whether a modern classifier can recover something as structurally important and interpretable as classic support vectors.
The paper introduces DeepKKT, a deep-learning adaptation of the Karush-Kuhn-Tucker (KKT) conditions, to identify or synthesize Deep Support Vectors (DSVs). These DSVs behave like boundary-defining samples: they exhibit high uncertainty, summarize model behavior, and reveal which features a classifier relies on to separate classes.
That same optimization recipe also leads to practical few-shot distillation and classifier-driven generation. One framework ties together interpretability, compression, and synthesis without requiring the original training set at reconstruction time.
DSVs act like support vectors for deep networks: they sit near the boundary, surface uncertainty, and summarize what the classifier treats as essential.
DeepKKT turns a frozen classifier into a practical distillation signal, avoiding the usual full-dataset, snapshot-heavy matching pipelines.
Once labels become latent variables, the same optimization supports synthesis, interpolation, and editing without a separate generator.
DeepKKT extends the KKT recipe from maximum-margin classifiers to overparameterized deep networks. Instead of only enforcing correct classification, it also tracks dual feasibility, a stationarity-style relation between the trained model and candidate samples, and a manifold prior that keeps synthesized DSVs on plausible data support.
This produces a practical recipe for recovering support-vector-like samples from a trained classifier without replaying full training. In the paper, the conditions are evaluated on ConvNet and ResNet backbones across CIFAR10, CIFAR100, SVHN, and ImageNet.
Candidate DSVs must still land on the target class.
The multiplier becomes a natural measure of sample importance.
Support-vector structure is tied back to the trained model.
Synthesized DSVs stay close to plausible data support.
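For reference, the four conditions above mirror the classical hard-margin SVM KKT system that DeepKKT generalizes (standard textbook form, not the paper's TeX):

```latex
% Primal problem: \min_{w,b} \tfrac{1}{2}\|w\|^2
%                 \text{ s.t. } y_i (w^\top x_i + b) \ge 1
w = \sum_i \alpha_i y_i x_i,
\qquad \sum_i \alpha_i y_i = 0
  && \text{(stationarity)} \\
\alpha_i \ge 0
  && \text{(dual feasibility)} \\
y_i (w^\top x_i + b) \ge 1
  && \text{(primal feasibility)} \\
\alpha_i \left[ y_i (w^\top x_i + b) - 1 \right] = 0
  && \text{(complementary slackness)}
```

Samples with \(\alpha_i > 0\) are exactly the support vectors; DeepKKT ports each condition to an overparameterized classifier.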
This is the actual TeX-level system from `sec/4Deep_Support_Vector.tex`, rendered on-page with KaTeX instead of being approximated as plaintext.
The paper replaces hard feasibility with a hinge-like surrogate that only penalizes a DSV when the classifier fails to predict the target class.
The stationarity loss links synthesized samples back to the trained model, and the final objective augments it with primal, total-variation, and norm priors.
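A minimal sketch of that combined objective, assuming a frozen classifier, candidate DSVs being optimized, and learnable multipliers. Function names, the exponential parameterization of the multipliers, and all weights here are illustrative, not the release code:

```python
import torch
import torch.nn.functional as F

def deepkkt_loss(model, x, y, log_alpha, stat_w=1.0, tv_w=1e-2, norm_w=1e-3):
    """Illustrative DeepKKT-style objective: a hinge primal surrogate,
    a stationarity term tying samples to the trained weights, and
    total-variation / norm priors keeping x on plausible image support."""
    # Dual feasibility alpha >= 0 holds by construction via exp().
    alpha = log_alpha.exp()
    logits = model(x)

    # Primal surrogate: penalize only when the target class is not predicted.
    top2 = logits.topk(2, dim=1).values
    target = logits.gather(1, y[:, None]).squeeze(1)
    runner_up = torch.where(logits.argmax(1) == y, top2[:, 1], top2[:, 0])
    primal = F.relu(runner_up - target).mean()

    # Stationarity surrogate: the alpha-weighted loss gradient w.r.t. the
    # trained parameters should (approximately) vanish. The weights stay
    # fixed (no optimizer step on them) but must require grad for this term.
    per_sample = F.cross_entropy(logits, y, reduction="none")
    weighted = (alpha * per_sample).sum()
    grads = torch.autograd.grad(weighted, list(model.parameters()),
                                create_graph=True)
    stationarity = sum(g.pow(2).sum() for g in grads)

    # Manifold-style priors on the synthesized images (B, C, H, W).
    tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() \
       + (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    norm = x.pow(2).mean()

    return primal + stat_w * stationarity + tv_w * tv + norm_w * norm
```

In use, only `x` and `log_alpha` receive optimizer steps; the classifier supplies gradients but is never updated.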
This figure is rendered directly from `figure/primalstat.pdf`. The paper uses it to show that primal-only optimization captures weak texture cues, while the full DeepKKT objective moves samples toward meaningful class structure.

The few-shot result is especially striking because DeepKKT does not depend on full-dataset access, trajectory matching, or Hessian-heavy objectives. A pretrained classifier itself becomes the supervisory signal for distillation.
On CIFAR10, the source table shows that DSV-based distillation remains competitive even in extremely data-starved settings, including zero-shot cases where no original class image is used to initialize the distilled set.
| img/cls | shot/class | ratio (%) | DC | DSA | DM | DSV |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | - | - | - | 21.68 ± 0.80 |
| 1 | 1 | 0.02 | 16.48 ± 0.81 | 15.41 ± 1.91 | 13.03 ± 0.15 | 22.69 ± 0.38 |
| 1 | 10 | 0.2 | 19.66 ± 0.78 | 21.15 ± 0.58 | 22.42 ± 0.43 | - |
| 1 | 50 | 1 | 25.90 ± 0.62 | 26.01 ± 0.70 | 24.42 ± 0.29 | - |
| 1 | 500 | 10 | 28.06 ± 0.61 | 28.20 ± 0.63 | 25.06 ± 1.20 | - |
| 10 | 0 | 0 | - | - | - | 30.35 ± 0.99 |
| 10 | 10 | 0.2 | 25.06 ± 1.20 | 26.67 ± 1.04 | 29.77 ± 0.66 | 37.90 ± 1.69 |
| 10 | 50 | 1 | 36.44 ± 0.52 | 36.63 ± 0.52 | 36.63 ± 0.52 | - |
| 10 | 500 | 10 | 43.55 ± 0.50 | 44.66 ± 0.59 | 47.96 ± 0.95 | - |
| 50 | 0 | 0 | - | - | - | 39.35 ± 0.54 |
| 50 | 50 | 1 | 41.22 ± 0.90 | 41.29 ± 0.45 | 48.93 ± 0.92 | 53.56 ± 0.73 |
| 50 | 500 | 10 | 52.00 ± 0.59 | 52.19 ± 0.53 | 60.59 ± 0.41 | - |
Reproduced as HTML from `sec/main_table.tex` in the source, using the CIFAR10 few-shot dataset distillation numbers.
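Accuracy in tables like this is typically measured by training a fresh model on the distilled images alone and reporting test accuracy. A generic evaluation sketch, where a linear probe stands in for the paper's ConvNet and all names, epochs, and learning rates are illustrative:

```python
import torch
import torch.nn.functional as F

def evaluate_distilled(train_x, train_y, test_x, test_y, epochs=200, lr=0.01):
    """Fit a fresh model on a tiny distilled set, then report test
    accuracy -- the standard figure of merit in distillation tables."""
    model = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(train_x[0].numel(), int(train_y.max()) + 1),
    )
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(train_x), train_y).backward()
        opt.step()
    with torch.no_grad():
        acc = (model(test_x).argmax(1) == test_y).float().mean().item()
    return acc
```

The distilled set is tiny (e.g. 1-50 images per class above), so the full batch fits in one gradient step per epoch.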
The paper makes the support-vector analogy concrete in several ways. Candidate DSVs grow higher-entropy as optimization proceeds, consistent with moving toward the decision boundary, and the learned Lagrange multipliers correlate with classwise test accuracy.
The same boundary-defining cues are visible in image space: DSV-guided edits expose the precise traits a classifier leans on, showing how local semantic changes can flip the model's prediction.
Rendered from `figure/Entropy_DSV.pdf`. As optimization proceeds, DSV candidates become higher-entropy, which matches the intuition that support vectors should lie close to the decision boundary.
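The entropy curve is straightforward to track during optimization. A minimal numpy helper (the function name is illustrative):

```python
import numpy as np

def prediction_entropy(logits):
    """Shannon entropy of the softmax distribution, per sample.
    Rising entropy over DSV optimization indicates movement toward
    the decision boundary, where class membership becomes ambiguous."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)
```

Uniform logits give the maximum value, log(num_classes); a confidently classified sample gives a value near zero.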

Rendered from `figure/pearson.pdf`. The source figure links the sum of learned lambda values to classwise test accuracy, turning DeepKKT multipliers into a readable importance signal.
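The relationship in that figure reduces to a Pearson coefficient between per-class sums of the learned multipliers and classwise test accuracy. An illustrative numpy sketch (names are hypothetical):

```python
import numpy as np

def classwise_lambda_correlation(lambdas, labels, class_acc):
    """Pearson correlation between the summed multiplier mass of each
    class and that class's test accuracy; a high value means lambda
    reads as an importance signal, as the figure suggests."""
    num_classes = len(class_acc)
    sums = np.array([lambdas[labels == c].sum() for c in range(num_classes)])
    return np.corrcoef(sums, np.asarray(class_acc))[0, 1]
```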

Rendered from `figure/editimages.pdf`. The examples show original images, DSV-informed manual edits, and DeepKKT-based edits that change the classifier's prediction by altering the precise cue it relies on.

DeepKKT also reframes a classifier as a latent generative model. By treating labels or soft labels as latent variables, the method can synthesize unseen images, interpolate between classes, and perform semantically meaningful edits.
The source figures further show why the full objective matters. Primal-only optimization does not yield meaningful samples, while the stationarity-aware objective produces sharper, more class-consistent generations that stay close to the data manifold.
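Treating the label as a latent variable amounts to optimizing an image against a chosen, possibly mixed, target distribution. A bare-bones classifier-guided sketch, with hypothetical names, step counts, and no DeepKKT priors (the full method adds them):

```python
import torch
import torch.nn.functional as F

def synthesize(model, soft_label, shape=(1, 3, 32, 32), steps=100, lr=0.1):
    """Start from noise and optimize the input so the frozen model's
    prediction matches `soft_label`, e.g. a 0.5/0.5 mix of two classes
    for interpolation. Only the image receives gradient updates."""
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = F.log_softmax(model(x), dim=1)
        # Soft-label cross-entropy; the paper's objective additionally
        # applies the DeepKKT priors (TV, norm) to stay on the manifold.
        loss = -(soft_label * log_p).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return x.detach()
```

Swapping `soft_label` along a line between two one-hot vectors yields the class-interpolation behavior shown in the figures below.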
Rendered from `figure/interpolation_fig_compressed.pdf`. The interpolation figure shows that DeepKKT can move smoothly between class semantics while staying in-distribution.

Rendered from `figure/mixup_overall.pdf_compressed.pdf`. The supplementary figure illustrates label-space mixing as a direct extension of the same DeepKKT reconstruction objective.

Rendered from `figure/mix_imagenet_compressed.pdf`. The ImageNet examples make the same point at larger scale: semantic mixing comes from the classifier-guided objective itself, not from a separately trained generator.

The supplementary release ships a compact reconstruction pipeline under `submit_code/`. The page below mirrors that layout rather than inventing a new pseudo-repo structure: `pl_main.py` is the Lightning entrypoint, `base_reconstruct_imagenet.yaml` holds the default reconstruction config, and the optimization logic is split across `models.py`, `svm_modules.py`, and the dataset helpers.
For the project page, the reproduction snippet below is anchored on that release entrypoint, so the GitHub link, README, and code all line up with the supplementary package users actually receive.
```shell
python pl_main.py --yaml base_reconstruct_imagenet.yaml \
    --loss_weight 1e-4 \
    --UPSCALE_cycle 2000 \
    --tv_scale 200000 \
    --alpha_scale 1e-4
```

```yaml
ARCHITECTURE: resnet
BATCH_SIZE: 256
CHECKPATH: "./checkpoints/"
SAVE_NAME: RECONSTRUCT
DATASET: CIFAR10
NUM_SAMPLES: 1
PATH_DATASETS: "./data"
RECONSTRUCT_DATA: true
REF_MODEL_PATH: ./models/imagenet.ckpt
DATAPATH: "./data"
MODE: "RECONSTRUCT"
NUM_WORKERS: 12
```

- `pl_main.py`: PyTorch Lightning entrypoint that parses optimization flags, loads the YAML config, and dispatches reconstruction, base training, or SVM retraining modes.
- `base_reconstruct_imagenet.yaml`: default reconstruction config used by the supplementary release for DSV synthesis runs.
- `models.py` / `svm_modules.py`: core model and DeepKKT optimization logic used for reconstruction and downstream evaluation.
@article{lee2024deepsupportvectors,
title = {Deep Support Vectors},
author = {Lee, Junhoo and Lee, Hyunho and Hwang, Kyomin and Kwak, Nojun},
journal = {Advances in Neural Information Processing Systems},
volume = {37},
year = {2024},
eprint = {2403.17329},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2403.17329}
}