ICML 2026

BASIL: Scalable Bayesian Semi-supervised Clustering
with Feature Selection and Adaptive Constraint Weighting

Luwei Wang1, Dagmara Panas1, Ke Wang1, Bruce Guthrie2, Sohan Seth1

1School of Informatics, University of Edinburgh    2Usher Institute, University of Edinburgh

BASIL on MNIST digits 0 to 9. Each panel is the learned per-cluster feature importance map after training. Pixel colour intensity reflects the posterior mean of the per-cluster relevance score γkd, so the panels recover the digit silhouette purely from data and pairwise constraints, without any pixel-level label.

BASIL learns clusters from data that comes with a few user-supplied pairwise constraints, scales to half a million patients, and stays accurate when up to 30% of those constraints are wrong.

Abstract

Constrained clustering incorporates prior knowledge in the form of pairwise constraints to guide data partitioning. Existing Bayesian approaches struggle to scale and offer weak interpretability. We propose BASIL, a scalable Bayesian semi-supervised clustering framework that uses stochastic variational inference to jointly infer cluster assignments and feature importance weights. To handle noisy supervision, BASIL introduces an adaptive constraint-weighting mechanism that downweights unreliable constraints. Experiments on synthetic and real-world benchmarks show BASIL attains competitive accuracy while reducing training time by over 96% on large datasets, learns interpretable per-cluster feature importance maps, and remains robust to up to 30% noisy constraints under sufficient supervision. We further demonstrate applicability to large-scale health data, including medical imaging and electronic health records.

Why BASIL?

Scalable

Over 96% training-time reduction on large datasets thanks to stochastic variational inference. Tested up to 500k patients.

Robust

An adaptive Gamma prior on constraint weights downweights inconsistent supervision. Holds up under 30% noisy constraints.

Interpretable

Per-cluster feature selection produces a relevance map per cluster, so clinicians can read off what each cluster is about.

Method

Plate diagram of BASIL. Latent variables include cluster assignments z_n, cluster parameters theta_k, per-cluster feature relevance gamma_k, mixture weights pi_k, and must-link and cannot-link constraint weights W and W-bar. Observed nodes are the data x_n and the constraint matrix Y. Plate K encloses theta_k, gamma_k, pi_k. Plate N encloses z_n and x_n.
Plate diagram of BASIL with feature selection and adaptive constraint weight learning. Shaded nodes ($\mathbf{x}_n$, $Y$) are observed; unshaded nodes ($z_n$, $\boldsymbol{\theta}_k$, $\boldsymbol{\gamma}_k$, $\pi_k$, $W$, $\bar{W}$) are latent; labels without circles are fixed hyperparameters.

BASIL formulates HMRF-structured cluster assignments, mixture likelihoods, per-cluster feature relevance, and latent constraint reliability inside a single hierarchical generative model. Each cluster assignment $z_n$ depends on mixture weights $\pi_k$ and on the pairwise-constraint matrix $Y$ through adaptive must-link and cannot-link weights $W$ and $\bar{W}$. Per-feature relevance $\gamma_{kd} \sim \mathrm{Beta}(\lambda_{\gamma 1}, \lambda_{\gamma 2})$ shrinks each cluster mean $\boldsymbol{\theta}_k$ toward the population mean $\boldsymbol{\theta}_0$. SVI delivers joint posterior inference by maximising the ELBO,

$$\mathcal{L}(q) = \mathbb{E}_q\!\left[\log p(X, Z, \boldsymbol{\theta}, \boldsymbol{\gamma}, W, \bar{W} \mid Y)\right] - \mathbb{E}_q\!\left[\log q(Z, \boldsymbol{\theta}, \boldsymbol{\gamma}, W, \bar{W})\right],$$

under the mean-field factorisation $q(Z, \boldsymbol{\theta}, \boldsymbol{\gamma}, W, \bar{W}) = \prod_n q(z_n) \prod_k q(\boldsymbol{\theta}_k) \prod_{k,d} q(\gamma_{kd}) \prod_{(n,m)} q(w_{nm}) q(\bar{w}_{nm})$, which yields closed-form local updates compatible with mini-batching.

Hyperparameter explorer

The three knobs the paper identifies as the ones most worth tuning. Drag the sliders to feel how each one shapes the model.

warmup_prop

Phased schedule from 1+D+FS to w+FS

0.30

warmup_prop only takes effect when w_learn, KL, and feat_select are all True.

wprior

Gamma rate β on constraint weight w

Prior: w ∼ Gamma(1, wprior)  ·  default wprior = 1.0

1.00
Prior mean of w 1.000
Fixed W (w_learn=False) 1.000

Increasing wprior shifts the Gamma prior toward zero, lowering both the adaptive prior mean and the fixed weight to 1/wprior.

fsprior

Beta rate on per-cluster feature importance γkd

Prior: γkd ∼ Beta(0.5, fsprior)  ·  default fsprior = 0.1

0.100
Prior mean of γkd 0.833
Selection strength weak

Increasing fsprior shifts the Beta prior toward zero importance, so only features the data strongly supports remain salient.

Results

NMI 0.77 in 19.4 min on N=50k synthetic data at 50% supervision (vs 18.9 hr for the closest full-batch baseline, ~58x wall-clock reduction).

Synthetic Gaussian-mixture clustering result showing recovered cluster boundaries against ground truth.

Citation

@inproceedings{wang2026basil,
  title     = {BASIL: Scalable Bayesian Semi-supervised Clustering with Feature Selection and Adaptive Constraint Weighting},
  author    = {Wang, Luwei and Panas, Dagmara and Wang, Ke and Guthrie, Bruce and Seth, Sohan},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}