BASIL: Scalable Bayesian Semi-supervised Clustering
with Feature Selection and Adaptive Constraint Weighting

Luwei Wang¹, Dagmara Panas¹, Ke Wang¹, Bruce Guthrie², Sohan Seth¹

¹School of Informatics, University of Edinburgh ²Usher Institute, University of Edinburgh

Abstract

Constrained clustering incorporates prior knowledge in the form of pairwise constraints to guide data partitioning. Existing Bayesian approaches struggle to scale and offer weak interpretability. We propose BASIL, a scalable Bayesian semi-supervised clustering framework that uses stochastic variational inference to jointly infer cluster assignments and feature importance weights. To handle noisy supervision, BASIL introduces an adaptive constraint-weighting mechanism that downweights unreliable constraints. Experiments on synthetic and real-world benchmarks show BASIL attains competitive accuracy while reducing training time by over 96% on large datasets, learns interpretable per-cluster feature importance maps, and remains robust to up to 30% noisy constraints under sufficient supervision. We further demonstrate applicability to large-scale health data, including medical imaging and electronic health records.

Why BASIL?

Scalable

Over 96% training-time reduction on large datasets thanks to stochastic variational inference. Tested on up to 500k patients.

Robust

An adaptive Gamma prior on constraint weights downweights inconsistent supervision. Remains robust to up to 30% noisy constraints given sufficient supervision.

Interpretable

Per-cluster feature selection produces a relevance map per cluster, so clinicians can read off what each cluster is about.

Method

Plate diagram of BASIL with feature selection and adaptive constraint weight learning. Shaded nodes ($\mathbf{x}_n$, $Y$) are observed; unshaded nodes ($z_n$, $\boldsymbol{\theta}_k$, $\boldsymbol{\gamma}_k$, $\pi_k$, $W$, $\bar{W}$) are latent; labels without circles are fixed hyperparameters. $W$ and $\bar{W}$ are the must-link and cannot-link constraint weights. Plate $K$ encloses $\boldsymbol{\theta}_k$, $\boldsymbol{\gamma}_k$, and $\pi_k$; plate $N$ encloses $z_n$ and $\mathbf{x}_n$.

BASIL combines HMRF-structured cluster assignments, mixture likelihoods, per-cluster feature relevance, and latent constraint reliability in a single hierarchical generative model. Each cluster assignment $z_n$ depends on mixture weights $\pi_k$ and on the pairwise-constraint matrix $Y$ through adaptive must-link and cannot-link weights $W$ and $\bar{W}$. Per-feature relevance $\gamma_{kd} \sim \mathrm{Beta}(\lambda_{\gamma 1}, \lambda_{\gamma 2})$ shrinks each cluster mean $\boldsymbol{\theta}_k$ toward the population mean $\boldsymbol{\theta}_0$. Stochastic variational inference (SVI) performs joint posterior inference by maximising the ELBO,

$$\mathcal{L}(q) = \mathbb{E}_q\!\left[\log p(X, Z, \boldsymbol{\theta}, \boldsymbol{\gamma}, W, \bar{W} \mid Y)\right] - \mathbb{E}_q\!\left[\log q(Z, \boldsymbol{\theta}, \boldsymbol{\gamma}, W, \bar{W})\right],$$

under the mean-field factorisation $q(Z, \boldsymbol{\theta}, \boldsymbol{\gamma}, W, \bar{W}) = \prod_n q(z_n) \prod_k q(\boldsymbol{\theta}_k) \prod_{k,d} q(\gamma_{kd}) \prod_{(n,m)} q(w_{nm}) q(\bar{w}_{nm})$, which yields closed-form local updates compatible with mini-batching.
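To illustrate why a sum of per-point terms is compatible with mini-batching, the following minimal sketch shows the generic SVI device of rescaling a mini-batch sum by $N/|B|$ to obtain an unbiased estimate of the full-data sum. It is not the BASIL implementation: the function name `minibatch_estimate` and the per-point values are hypothetical stand-ins for per-point ELBO contributions, and the pairwise constraint terms are not modelled here.

```python
import random


def minibatch_estimate(per_point_terms, batch_indices):
    """Estimate sum(per_point_terms) from a mini-batch.

    Rescales the mini-batch sum by N / |B|, so the estimate is unbiased
    when batches are drawn uniformly at random.
    """
    n_total = len(per_point_terms)
    batch_sum = sum(per_point_terms[i] for i in batch_indices)
    return n_total / len(batch_indices) * batch_sum


# Hypothetical per-point ELBO contributions (illustrative numbers only).
rng = random.Random(0)
terms = [rng.gauss(-1.0, 0.5) for _ in range(1000)]
full_sum = sum(terms)

# Split the shuffled indices into equal-size mini-batches.
indices = list(range(len(terms)))
rng.shuffle(indices)
batch_size = 100
batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

# Averaging the rescaled estimates over a full partition recovers the
# full-data sum (up to floating-point error).
estimates = [minibatch_estimate(terms, b) for b in batches]
mean_estimate = sum(estimates) / len(estimates)
print(f"full sum = {full_sum:.4f}, mean of batch estimates = {mean_estimate:.4f}")
```

Each individual estimate is noisy, which is why SVI pairs it with a decaying step size; only the average over batches matches the full sum.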

Hyperparameter explorer

The paper identifies three hyperparameters as the most worth tuning: warmup_prop, wprior, and fsprior. Each is described below with its effect on the model.

warmup_prop

Phased schedule from 1+D+FS to w+FS

Shown value: warmup_prop = 0.30

warmup_prop only takes effect when w_learn, KL, and feat_select are all True.
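The gating rule above can be sketched as a small helper. This is a hedged illustration only: the function name `warmup_steps` is hypothetical, and reading warmup_prop as a fraction of the total number of training steps is an assumption, not something the text states.

```python
def warmup_steps(total_steps, warmup_prop, w_learn, KL, feat_select):
    """Return the number of warm-up steps under the gating rule.

    Warm-up is active only when w_learn, KL, and feat_select are all True.
    ASSUMPTION: warmup_prop is a fraction of total_steps.
    """
    if not (w_learn and KL and feat_select):
        return 0  # warmup_prop is ignored unless all three flags are set
    return int(round(warmup_prop * total_steps))


# All three flags set: warm-up covers the assumed fraction of training.
print(warmup_steps(1000, 0.30, w_learn=True, KL=True, feat_select=True))
# Any flag unset: warmup_prop has no effect.
print(warmup_steps(1000, 0.30, w_learn=False, KL=True, feat_select=True))
```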

wprior

Gamma rate β on constraint weight w

Prior: w ∼ Gamma(1, wprior) · default wprior = 1.0

At wprior = 1.00: prior mean of w = 1.000; fixed W (w_learn=False) = 1.000.

Increasing wprior shifts the Gamma prior toward zero, lowering both the adaptive prior mean and the fixed weight to 1/wprior.
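As a quick check of this relationship, the sketch below computes the mean of a Gamma(1, wprior) prior in the shape-rate parameterisation, which is 1/wprior. It also evaluates the density, which for shape 1 is an exponential distribution with rate wprior. The helper names `gamma_prior_mean` and `gamma1_pdf` are hypothetical and chosen for illustration.

```python
import math


def gamma_prior_mean(wprior, shape=1.0):
    """Mean of Gamma(shape, rate=wprior), i.e. shape / rate."""
    return shape / wprior


def gamma1_pdf(w, wprior):
    """Density of Gamma(1, rate=wprior), which equals Exponential(wprior)."""
    return wprior * math.exp(-wprior * w)


# A larger rate concentrates prior mass near zero and lowers the mean.
for wprior in (0.5, 1.0, 2.0):
    mean = gamma_prior_mean(wprior)
    print(f"wprior={wprior:.1f}  prior mean={mean:.3f}  pdf at w=1: {gamma1_pdf(1.0, wprior):.3f}")
```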

fsprior

Second shape parameter of the Beta prior on per-cluster feature importance γ_kd

Prior: γ_kd ∼ Beta(0.5, fsprior) · default fsprior = 0.1

At fsprior = 0.100: prior mean of γ_kd = 0.833; selection strength: weak.

Increasing fsprior shifts the Beta prior toward zero importance, so only features the data strongly supports remain salient.
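The prior mean quoted for the default setting follows from the standard Beta mean, a / (a + b), with a = 0.5 and b = fsprior. The sketch below reproduces it; the helper name `beta_prior_mean` is hypothetical.

```python
def beta_prior_mean(fsprior, a=0.5):
    """Mean of Beta(a, fsprior), i.e. a / (a + fsprior)."""
    return a / (a + fsprior)


# The prior mean of gamma_kd falls as fsprior grows.
for fsprior in (0.1, 0.5, 2.0):
    print(f"fsprior={fsprior:.1f}  prior mean of gamma_kd={beta_prior_mean(fsprior):.3f}")
```

At the default fsprior = 0.1 this gives 0.5 / 0.6 ≈ 0.833, so the prior favours keeping features unless fsprior is raised.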

Results

On synthetic data with N = 50k at 50% supervision, BASIL reaches NMI 0.77 in 19.4 min, versus 18.9 hr for the closest full-batch baseline (a roughly 58× wall-clock reduction).
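The speed-up figure can be checked directly from the two reported wall-clock times. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
# Reported wall-clock times, converted to minutes.
baseline_minutes = 18.9 * 60  # closest full-batch baseline: 18.9 hr
basil_minutes = 19.4          # BASIL: 19.4 min

speedup = baseline_minutes / basil_minutes        # ratio of run times
reduction = 1.0 - basil_minutes / baseline_minutes  # fraction of time saved

print(f"speed-up: {speedup:.1f}x, training-time reduction: {reduction:.1%}")
```

The ratio comes out between 58 and 59, which corresponds to a time reduction of about 98%, consistent with the "over 96%" figure quoted for large datasets.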

Citation

@inproceedings{wang2026basil,
  title     = {BASIL: Scalable Bayesian Semi-supervised Clustering with Feature Selection and Adaptive Constraint Weighting},
  author    = {Wang, Luwei and Panas, Dagmara and Wang, Ke and Guthrie, Bruce and Seth, Sohan},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}