Prior Basics

This page explains the prior distributions used in bgms and how they connect to the Bayesian graphical modeling framework introduced on the Bayesian Approach page. Analysis-specific guidance appears on the Edge Selection, Group Comparison, and Edge Clustering pages.

What is a prior?

A prior distribution encodes assumptions about a parameter before observing data. In Bayesian inference, the prior is combined with the likelihood to produce the posterior distribution. The prior matters most when data are sparse; as data accumulate, the posterior is increasingly driven by the likelihood.

The spike-and-slab prior

The Bayesian Approach page introduced two prior distributions: one for the network structure, \(p(\mathcal{S})\), and one for the partial associations conditional on that structure, \(p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S})\). In bgms, these two priors are linked through a spike-and-slab prior on each partial association \(\omega_{ij}\).

The idea is that each edge in the network is either absent or present. When the edge is absent, the partial association is exactly zero (\(\omega_{ij} = 0\)). When the edge is present, the partial association is free to take any nonzero value (\(\omega_{ij} \in \mathbb{R} \setminus \{0\}\)). A binary indicator \(\gamma_{ij}\) tracks the state of each edge: \(\gamma_{ij} = 0\) means the edge is absent and \(\gamma_{ij} = 1\) means it is present. Together, the collection of indicators forms the network structure \(\mathcal{S}\).

We use the mixture-of-mutually-singular-distributions formulation of the spike-and-slab prior (Gottardo & Raftery, 2008; van den Bergh et al., 2026), which specifies a mixture of two components with disjoint support:

\[ p(\omega_{ij} \mid \gamma_{ij}) = (1 - \gamma_{ij})\;\mathbf{1}_{\{0\}}(\omega_{ij}) \;+\; \gamma_{ij}\;p_{\text{slab}}(\omega_{ij})\;\mathbf{1}_{\mathbb{R}\setminus\{0\}}(\omega_{ij}) \]

Here \(\mathbf{1}_{A}(\omega_{ij})\) is the indicator function that equals one when \(\omega_{ij} \in A\) and zero otherwise. The two indicator functions make the disjoint support of the components explicit: the first component assigns all probability to \(\omega_{ij} \in \{0\}\), the second to \(\omega_{ij} \in \mathbb{R}\setminus\{0\}\).

The first component is the spike: when \(\gamma_{ij} = 0\), all prior probability is concentrated on the value \(\omega_{ij} = 0\), enforcing the absence of the edge. The second component is the slab: when \(\gamma_{ij} = 1\), the partial association receives a continuous distribution over nonzero values, which determines what effect sizes are considered plausible for present edges.
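The two components can be made concrete with a small simulation sketch (in Python, with a standard Cauchy slab matching the default described below; the function name is illustrative, not part of bgms):

```python
import math
import random

def sample_spike_and_slab(gamma, scale=1.0, rng=random):
    """Draw omega_ij from p(omega | gamma): a point mass at zero when
    gamma == 0 (spike), a Cauchy(0, scale) draw when gamma == 1 (slab)."""
    if gamma == 0:
        return 0.0  # spike: the edge is absent, omega is exactly zero
    # slab: Cauchy draw via the inverse-CDF method
    u = rng.random()
    return scale * math.tan(math.pi * (u - 0.5))

rng = random.Random(1)
absent = sample_spike_and_slab(0, rng=rng)    # exactly 0.0
present = sample_spike_and_slab(1, rng=rng)   # nonzero (almost surely)
```

Note that the spike draw is exactly zero, not merely small: absent edges are excluded from the model, which is what distinguishes this prior from pure shrinkage priors.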

The spike-and-slab above is specified for a single edge. The joint prior on all partial associations is the product of independent spike-and-slab priors across edges:

\[ p(\boldsymbol{\Omega} \mid \mathcal{S}) = \prod_{i < j} p(\omega_{ij} \mid \gamma_{ij}) \]

This is the prior \(p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S})\) from the Bayesian Approach page. For a given structure \(\mathcal{S}\), the spike components fix all excluded edges to zero, and the slab components provide independent priors on the partial associations of the included edges. The joint prior therefore reduces to a product of slab distributions over the edges in \(\mathcal{S}\).

The spike-and-slab prior decomposes inference into two questions.¹ The first is a structural question: is the edge present? This is governed by \(\gamma_{ij}\) and its prior, which together define \(p(\mathcal{S})\). The second is a question about effect size: given that the edge is present, how strong is it? This is governed by the slab distribution, which defines \(p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S})\) for the nonzero entries.

The sections below describe each component: the slab distribution that regularizes effect sizes, and the edge priors that govern which edges are included.

The slab: prior on effect sizes

The slab distribution determines the prior plausibility of different effect sizes for edges that are present. In bgms, the slab is a Cauchy distribution centered at zero, with a scale parameter controlled by pairwise_scale (default: 1).

The Cauchy distribution has heavy tails, which means it allows large effects when the data support them, while the peak at zero provides moderate shrinkage toward small effects. In models with many variables and limited data, this regularization prevents extreme estimates.

The scale parameter controls the width of the slab. A wider slab spreads prior probability over a larger range of effect sizes. This has a direct consequence for edge selection: because the slab serves as the prior on the effect size under the alternative hypothesis in the inclusion Bayes factor, a wider slab makes it harder for the Bayes factor to favor inclusion when the true effect is small.
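The effect of the scale on small-effect evidence can be checked directly from the Cauchy density (a Python sketch; the scale values are illustrative, with 1 matching the default pairwise_scale):

```python
import math

def cauchy_pdf(x, scale):
    """Density of a Cauchy(0, scale) slab at x."""
    return scale / (math.pi * (scale * scale + x * x))

# Prior density of a small effect under a narrow vs. a wide slab:
narrow = cauchy_pdf(0.1, scale=1.0)   # default scale
wide = cauchy_pdf(0.1, scale=2.5)     # a wider, less informative slab

# The wider slab assigns less prior density to small effects, so a small
# observed effect provides weaker evidence for inclusion under it.
assert narrow > wide
```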

When standardize = TRUE, the prior scale is adjusted based on the range of response scores. Variables with more categories have larger score products, which typically correspond to smaller interaction effects. Without standardization, a fixed scale is relatively wide for high-category pairs and narrow for low-category pairs. The adjustment ensures equivalent relative shrinkage across all variable pairs.

The spike: priors on the edge indicators

Edge indicators \(\gamma_{ij}\) govern edge inclusion in the spike-and-slab prior. Their prior determines, before seeing the data, how likely each edge is to be excluded (\(\gamma_{ij}=0\)) or included (\(\gamma_{ij}=1\)), and therefore governs overall sparsity and structure.

bgms provides three edge priors, each encoding a different type of structural assumption. They are selected with the edge_prior argument.

Bernoulli prior

The simplest edge prior treats each edge independently: the indicator \(\gamma_{ij}\) follows a Bernoulli distribution with inclusion probability \(\pi_{ij}\). Across all edges, this gives

\[ p(\mathcal{S}) = \prod_{i < j} \pi_{ij}^{\gamma_{ij}}\,(1 - \pi_{ij})^{1 - \gamma_{ij}} \]

The inclusion probability can be set to the same value for all edges (a scalar) or specified per edge (a matrix). The default is \(\pi_{ij} = 0.5\) for all edges, which assigns equal prior weight to edge presence and absence. This is the classic random-graph model of Erdős & Rényi (1959): every possible network structure is equally likely a priori.

Lowering \(\pi_{ij}\) encodes a prior expectation that the network is sparse. Raising it encodes a prior expectation of a dense network. Setting different values for different edges allows prior information about specific relationships to be incorporated.
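The structure prior above is just a product over independent edge indicators, which a few lines of Python make explicit (the function name and example values are illustrative):

```python
def bernoulli_structure_prior(gamma, pi):
    """p(S) = prod over edges of pi^gamma * (1 - pi)^(1 - gamma)."""
    prob = 1.0
    for g, p in zip(gamma, pi):
        prob *= p if g == 1 else (1.0 - p)
    return prob

k = 3                 # number of possible edges
gamma = [1, 0, 1]     # an example structure: edges 1 and 3 included

# Default pi = 0.5: every structure has the same prior probability 0.5^k.
p_default = bernoulli_structure_prior(gamma, [0.5] * k)
assert abs(p_default - 0.5 ** k) < 1e-12

# A sparsity-encoding prior, pi = 0.1: structures with fewer edges are
# favored (here 0.1 * 0.9 * 0.1).
p_sparse = bernoulli_structure_prior(gamma, [0.1] * k)
```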

Beta-Bernoulli prior

The Beta-Bernoulli prior extends the Bernoulli prior by treating the inclusion probability as unknown and learning it from the data. All edges share a single inclusion probability \(\pi\), which receives a Beta prior with shape parameters \(\alpha\) and \(\beta\):

\[ p(\mathcal{S}) = \frac{B(\alpha + \gamma_{++},\; \beta + k - \gamma_{++})}{B(\alpha,\; \beta)} \]

where \(\gamma_{++}\) is the total number of included edges, \(k = \binom{p}{2}\) is the number of possible edges, and \(B(\cdot, \cdot)\) is the beta function.

Because the inclusion probability is integrated out, the edges are no longer independent a priori: the inclusion of one edge makes others slightly more likely. In practice, this means the prior adapts to the overall complexity of the network. If the data suggest a sparse network, the learned inclusion probability decreases; if they suggest a dense network, it increases.

An important special case arises when \(\alpha = \beta = 1\) (the default). The prior then places a uniform distribution over different levels of network complexity — zero edges, one edge, two edges, and so on up to \(k\) edges are all equally likely a priori — and, within each complexity level, all structures of the same size are equally likely.
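This uniformity over complexity levels can be verified numerically from the formula for \(p(\mathcal{S})\): multiplying the prior probability of a single structure with \(g\) edges by the number of such structures, \(\binom{k}{g}\), gives \(1/(k+1)\) for every \(g\) (a Python check; the function names are illustrative):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_bernoulli_structure_prior(g, k, alpha=1.0, beta=1.0):
    """p(S) for a structure with g of k possible edges included:
    B(alpha + g, beta + k - g) / B(alpha, beta)."""
    return exp(log_beta(alpha + g, beta + k - g) - log_beta(alpha, beta))

k = 10  # possible edges, e.g. p = 5 variables
# With alpha = beta = 1, the total prior mass at each complexity level
# (each number of included edges) is uniform: 1 / (k + 1).
for g in range(k + 1):
    level_mass = comb(k, g) * beta_bernoulli_structure_prior(g, k)
    assert abs(level_mass - 1.0 / (k + 1)) < 1e-12
```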

Stochastic block model prior

The Bernoulli and Beta-Bernoulli priors treat all edges the same way: either each edge has a fixed inclusion probability, or all edges share a single learned inclusion probability. Neither prior can express the expectation that some groups of variables are more densely connected than others.

The stochastic block model (SBM) prior addresses this by assigning variables to latent blocks. Edges between two variables in the same block share a within-block inclusion probability, while edges between variables in different blocks share a between-block inclusion probability. Each block pair effectively has its own Beta-Bernoulli prior, so the model can learn different levels of density within and between clusters.

When there is only one block, the SBM reduces to the standard Beta-Bernoulli prior. With multiple blocks, it can capture a common pattern in applied networks: clusters of densely connected variables with sparser connections between clusters.

The SBM prior introduces additional parameters: the number of blocks, the assignment of variables to blocks, and the within- and between-block Beta hyperparameters. These are estimated from the data by default. See Edge Clustering for the full specification, hyperparameter guidance, and interpretation of results.
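The block structure of the resulting edge prior can be sketched as follows (a Python illustration with fixed within- and between-block probabilities; in bgms these quantities, like the block assignment itself, are learned from the data):

```python
def sbm_edge_probabilities(blocks, pi_within, pi_between):
    """Prior inclusion probability for each edge (i < j) given a block
    assignment: edges within a block share pi_within, edges between
    blocks share pi_between. Illustrative fixed values; the SBM prior
    in bgms estimates these from the data."""
    n = len(blocks)
    probs = {}
    for i in range(n):
        for j in range(i + 1, n):
            same_block = blocks[i] == blocks[j]
            probs[(i, j)] = pi_within if same_block else pi_between
    return probs

# Four variables in two blocks, {0, 1} and {2, 3}:
probs = sbm_edge_probabilities([0, 0, 1, 1], pi_within=0.8, pi_between=0.2)
assert probs[(0, 1)] == 0.8   # same block: dense connections expected
assert probs[(0, 2)] == 0.2   # different blocks: sparse connections
```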

Nuisance parameter priors

Beyond the spike-and-slab, bgms places priors on several nuisance parameters that are estimated but typically not of direct inferential interest:

| Parameter | Model | Prior | Control |
|---|---|---|---|
| Category thresholds | Ordinal MRF, Mixed MRF | Beta-prime (via \(\exp(\mu)\)) | main_alpha, main_beta (default 0.5) |
| Blume-Capel parameters | Ordinal MRF, Mixed MRF | Beta-prime (via \(\exp(\mu)\)) | main_alpha, main_beta (default 0.5) |
| Diagonal of precision matrix | GGM, Mixed MRF | Gamma\((1, 1)\) | Fixed |
| Continuous variable means | Mixed MRF | Normal\((0, 1)\) | Fixed |
For the category thresholds and Blume-Capel parameters, when the two shape parameters are equal the induced prior on \(\mu\) is symmetric around zero. The GGM and mixed MRF priors are not user-configurable.
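The symmetry claim can be checked numerically: if \(x\) follows a standard beta-prime distribution with density \(x^{a-1}(1+x)^{-a-b}/B(a,b)\) (assumed here as the parameterization), then a change of variables gives the density of \(\mu = \log x\), which is symmetric around zero exactly when \(a = b\) (a Python sketch):

```python
from math import exp, lgamma

def log_beta(a, b):
    """log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def mu_density(m, a, b):
    """Density of mu = log(x), where x is beta-prime(a, b) with density
    x^(a-1) * (1 + x)^(-a-b) / B(a, b). Change of variables gives
    g(m) = f(exp(m)) * exp(m) = exp(a*m) * (1 + exp(m))^(-a-b) / B(a, b)."""
    x = exp(m)
    return x ** a * (1.0 + x) ** (-a - b) / exp(log_beta(a, b))

# Equal shapes (the default main_alpha = main_beta = 0.5): symmetric in mu.
for m in [0.5, 1.0, 2.0]:
    assert abs(mu_density(m, 0.5, 0.5) - mu_density(-m, 0.5, 0.5)) < 1e-12

# Unequal shapes: the induced prior on mu is skewed.
assert mu_density(1.0, 2.0, 0.5) != mu_density(-1.0, 2.0, 0.5)
```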

References

Erdős, P., & Rényi, A. (1959). On random graphs I. Publicationes Mathematicae Debrecen, 6, 290–297.
Gottardo, R., & Raftery, A. E. (2008). Markov chain Monte Carlo with mixtures of mutually singular distributions. Journal of Computational and Graphical Statistics, 17(4), 949–975. https://doi.org/10.1198/106186008X386102
van den Bergh, D., Clyde, M. A., Raftery, A. E., & Marsman, M. (2026). Reversible jump MCMC with no regrets: Bayesian variable selection using mixtures of mutually singular distributions. Manuscript in preparation.

Footnotes

1. When edge_selection = TRUE (the default), both questions are answered simultaneously during estimation. When edge_selection = FALSE, the structural question is bypassed: all edges are present, and only the slab prior is active.