The Bayesian Approach

bgms uses Bayesian inference to learn about the network structure and its parameters from data. This page describes the general framework; details about prior specifications are covered in Prior Basics and the mechanics of edge selection in Edge Selection. The posterior is estimated using Markov chain Monte Carlo; see MCMC Diagnostics for convergence checks and MCMC Output for the output structure.

What we want to learn

Given a dataset of \(p\) variables, we want to learn two things: the network structure \(\mathcal{S}\) — the collection of edges in the network — and the network parameters \(\boldsymbol{\Omega}_\mathcal{S}\) — the partial associations for the edges in \(\mathcal{S}\).1

There are many possible structures that could underlie the network. For each structure, there are also many possible values for the corresponding partial associations. Because we seek to learn about these quantities from the data, the specific configuration of edges and the exact parameter values are typically unknown.

Prior distributions

To account for this uncertainty, we assign prior distributions to the network structure \(p(\mathcal{S})\) and to the parameters conditional on that structure \(p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S})\). A prior distribution assigns plausibility to different values of the parameters and structures before the data are observed.

The prior distribution on the structure \(p(\mathcal{S})\) expresses which edge configurations we consider plausible a priori. The prior distribution on the parameters \(p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S})\) expresses, for a given structure, which values of the partial associations we consider plausible.

These prior distributions can formalize substantive knowledge — for example, results from previous research — or they can represent relative ignorance using default specifications. See Prior Basics for the prior distributions available in bgms.

From prior to posterior

Bayes’ rule combines the prior distributions with the information in the observed data to produce a joint posterior distribution over the structure and parameters:

\[ p(\boldsymbol{\Omega}_\mathcal{S},\, \mathcal{S} \mid \text{data}) = \frac{p(\text{data} \mid \boldsymbol{\Omega}_\mathcal{S}) \; p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S}) \; p(\mathcal{S})} {p(\text{data})} \]

This joint posterior distribution expresses everything we know about the network structure and parameter values after observing the data. It is central to Bayesian graphical modeling because different inferential tasks — parameter estimation and structure learning — rely on different aspects of this distribution.

To make this explicit, we factor the joint posterior into two components:

\[ p(\boldsymbol{\Omega}_\mathcal{S},\, \mathcal{S} \mid \text{data}) = p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S},\, \text{data}) \times p(\mathcal{S} \mid \text{data}) \]

The first factor, \(p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S}, \text{data})\), is the conditional posterior distribution for the parameters under a specific structure — the posterior distribution of the partial associations given that the network has structure \(\mathcal{S}\). This distribution is used for Bayesian parameter estimation.

The second factor, \(p(\mathcal{S} \mid \text{data})\), is the marginal posterior distribution of the network structure, with the parameters integrated out. This distribution is used for Bayesian structure learning.
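As a toy illustration of how Bayes' rule produces this marginal structure posterior, consider a hypothetical three-variable network with \(2^3 = 8\) structures. The marginal likelihood values in the sketch below are invented for illustration; only the mechanics of Bayes' rule are real:

```python
import itertools

# Toy version of Bayes' rule over structures for a three-variable
# network: 3 possible edges, hence 2^3 = 8 structures. The marginal
# likelihoods p(data | S) are invented numbers, for illustration only.
structures = list(itertools.product([0, 1], repeat=3))      # (gamma_12, gamma_13, gamma_23)
marginal_lik = {s: 0.1 + 0.2 * sum(s) for s in structures}  # hypothetical p(data | S)
prior = {s: 1 / len(structures) for s in structures}        # uniform structure prior p(S)

# Bayes' rule: posterior is likelihood times prior, normalized by p(data)
unnorm = {s: marginal_lik[s] * prior[s] for s in structures}
p_data = sum(unnorm.values())                               # the normalizing constant
posterior = {s: u / p_data for s, u in unnorm.items()}      # p(S | data)
```

Because the invented likelihoods favor denser structures, the fully connected structure receives the largest posterior probability here; with real data, the ordering is determined by how well each structure predicts the observations.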

Model uncertainty

For a network with \(p\) variables there are \(\binom{p}{2} = p(p-1)/2\) possible edges, and each can independently be present or absent. The total number of possible network structures is therefore \(2^{p(p-1)/2}\). Even for a modest network of 10 variables this gives \(2^{45} \approx 3.5 \times 10^{13}\) possible structures; for 20 variables the number exceeds \(10^{57}\).
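These counts follow directly from the formula above; a short sketch to evaluate it:

```python
from math import comb

def n_structures(p):
    # Each of the comb(p, 2) = p(p-1)/2 possible edges can independently
    # be present or absent, giving 2 ** comb(p, 2) structures.
    return 2 ** comb(p, 2)

print(n_structures(10))           # 35184372088832, i.e. 2^45, about 3.5e13
print(n_structures(20) > 10**57)  # True: 2^190 exceeds 10^57
```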

In practice, it is very unlikely that any single structure completely dominates the posterior distribution. Many structures that differ by only a few edges will have comparable posterior probabilities. Ignoring this uncertainty and conditioning on a single structure — for instance the most probable one — amounts to acting as though that structure generated the data with certainty. This leads to overconfident inferences: parameter estimates that are too precise because they do not reflect uncertainty about which edges belong in the model, and conclusions about individual edges that may change depending on what is assumed about the remaining edges (Hinne et al., 2020; Hoeting et al., 1999).

Bayesian model averaging

Bayesian model averaging (BMA; Hoeting et al., 1999; Hinne et al., 2020) provides the principled way to account for model uncertainty. Instead of conditioning on a single structure, BMA averages inferences across all possible structures, weighting each by its posterior probability \(p(\mathcal{S} \mid \text{data})\). Structures that predict the data well gain plausibility; those that predict it poorly lose it. No structure is discarded entirely, and no structure is assumed to be true with certainty.

In bgms, BMA is the default. The Markov chain Monte Carlo sampler explores the joint posterior over structures and parameters simultaneously using the mixtures of mutually singular distributions (MoMS) framework (Gottardo & Raftery, 2008; van den Bergh et al., 2026). All reported quantities — parameter estimates, inclusion probabilities, and Bayes factors — automatically incorporate model averaging. Both parameter estimation and structure learning therefore account for structural uncertainty.
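A minimal sketch of how model-averaged quantities fall out of such joint draws. The simulation below is a stand-in, not the bgms sampler: the 0.8 inclusion rate and the normal conditional posterior are assumed purely for illustration.

```python
import random

random.seed(1)

# Each draw carries a structure indicator gamma (edge present/absent) and
# a parameter value omega, which is exactly zero whenever the edge is
# absent. The sampling scheme is a stand-in, NOT the bgms sampler: the
# 0.8 inclusion rate and N(0.30, 0.05) conditional posterior are assumed.
draws = []
for _ in range(10_000):
    gamma = 1 if random.random() < 0.8 else 0
    omega = random.gauss(0.30, 0.05) if gamma == 1 else 0.0
    draws.append((gamma, omega))

# Plain averages over the draws are already model-averaged quantities:
incl_prob = sum(g for g, _ in draws) / len(draws)  # posterior inclusion probability
bma_mean = sum(w for _, w in draws) / len(draws)   # model-averaged posterior mean
```

The point of the sketch is that no separate averaging step is needed: because each draw includes the structure, plain summaries over the draws already integrate over structural uncertainty.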

Parameter estimation

For a given structure \(\mathcal{S}\), the conditional posterior \(p(\boldsymbol{\Omega}_\mathcal{S} \mid \mathcal{S}, \text{data})\) tells us what we know about each partial association after observing the data. The prior distribution assigns plausibility to each value of the parameter, and the observed data then update this distribution: parameter values that predict the data well gain plausibility, while those that predict it poorly lose it.

Because bgms averages over structures, the reported parameter estimates reflect both parameter uncertainty and structural uncertainty. In structures where an edge is absent, the corresponding partial association is zero; in structures where it is present, the partial association follows its conditional posterior. The model-averaged posterior mean is therefore automatically shrunk toward zero for edges that receive little structural support.
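A small worked example of this shrinkage, with hypothetical numbers (inclusion probability 0.6, conditional posterior mean 0.25):

```python
# The model-averaged posterior mean mixes a point mass at zero (edge
# absent) with the conditional posterior mean (edge present). Both
# numbers below are hypothetical.
p_incl = 0.6      # assumed posterior inclusion probability
cond_mean = 0.25  # assumed E[Omega | edge present, data]

bma_mean = p_incl * cond_mean + (1 - p_incl) * 0.0
print(bma_mean)   # 0.15: shrunk toward zero relative to 0.25
```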

In practice, the posterior distribution for each partial association is usually summarized with three quantities:

  • The posterior mean as a point estimate of the partial association.
  • The posterior standard deviation as a measure of estimation precision.
  • A credible interval — a range that contains a given proportion of the posterior probability mass (e.g., 95%). Unlike frequentist confidence intervals, a 95% credible interval has a direct probability interpretation: given the prior and the observed data, there is a 95% posterior probability that the parameter lies within the interval.
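These summaries are straightforward to compute from posterior draws. The sketch below uses simulated draws in place of real MCMC output:

```python
import random
import statistics

random.seed(42)

# Summaries of the posterior for one partial association, computed from
# draws. The 5,000 draws below are simulated stand-ins, N(0.20, 0.05),
# for real MCMC output.
samples = [random.gauss(0.20, 0.05) for _ in range(5_000)]

post_mean = statistics.mean(samples)  # point estimate
post_sd = statistics.stdev(samples)   # estimation precision

# 95% equal-tailed credible interval: the 2.5th and 97.5th percentiles.
# quantiles(n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(samples, n=40)
ci_lower, ci_upper = cuts[0], cuts[-1]
```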

In bgms, coef(fit)$pairwise returns the posterior means of the partial associations, and summary(fit)$pairwise adds the posterior standard deviation, Monte Carlo standard error, effective sample size, and R-hat.

Structure learning

The posterior inclusion probability of an edge is obtained by summing the posterior probabilities of all structures that contain that edge. Let \(\mathcal{S}^{(ij)}\) denote the set of structures that include an edge between variables \(i\) and \(j\); then:

\[ p(\gamma_{ij} = 1 \mid \text{data}) = \sum_{\mathcal{S} \in \mathcal{S}^{(ij)}} p(\mathcal{S} \mid \text{data}) \]

This posterior inclusion probability aggregates evidence across all possible network structures and therefore accounts for uncertainty about the remaining edges in the network.
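A toy computation of this sum, using an invented structure posterior over the eight structures of a three-variable network:

```python
import itertools

# Toy computation of the posterior inclusion probability for edge (1, 2)
# in a three-variable network. The structure posterior is invented:
# unnormalized weight 1 + (number of edges), normalized to sum to one.
structures = list(itertools.product([0, 1], repeat=3))  # (gamma_12, gamma_13, gamma_23)
weights = {s: 1 + sum(s) for s in structures}           # hypothetical p(S | data), unnormalized
total = sum(weights.values())
posterior = {s: w / total for s, w in weights.items()}

# Sum p(S | data) over all structures that contain the edge (gamma_12 = 1)
p_incl_12 = sum(w for s, w in posterior.items() if s[0] == 1)
```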

In bgms, coef(fit)$indicator returns the posterior means of the inclusion probabilities, and summary(fit)$indicator adds the posterior standard deviation, Monte Carlo standard error, effective sample size, and R-hat.

The inclusion Bayes factor quantifies the change in evidence for including an edge from prior to posterior (Hinne et al., 2020; Marsman et al., 2022):

\[ \underbrace{\frac{p(\text{data} \mid \gamma_{ij} = 1)} {p(\text{data} \mid \gamma_{ij} = 0)}}_{\text{Inclusion Bayes factor}} = \underbrace{\frac{p(\gamma_{ij} = 1 \mid \text{data})} {p(\gamma_{ij} = 0 \mid \text{data})}}_{\text{Posterior inclusion odds}} \bigg/ \underbrace{\frac{p(\gamma_{ij} = 1)} {p(\gamma_{ij} = 0)}}_{\text{Prior inclusion odds}} \]

The inclusion Bayes factor is the ratio of the posterior inclusion odds to the prior inclusion odds. Values greater than one indicate evidence for the presence of the edge (\(\Omega_{ij} \neq 0\)), values less than one indicate evidence for its absence (\(\Omega_{ij} = 0\)), and values close to one indicate that the data are uninformative — the evidence does not clearly favor either hypothesis. This three-way distinction is a key advantage of the Bayes factor: it separates evidence for presence, evidence for absence, and absence of evidence. When the prior inclusion probability is \(\frac{1}{2}\), the prior inclusion odds equal one, and the inclusion Bayes factor simplifies to the posterior odds.
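A quick numeric check of this identity, with hypothetical prior and posterior inclusion probabilities:

```python
# Inclusion Bayes factor: posterior inclusion odds divided by prior
# inclusion odds. Both probabilities below are hypothetical.
prior_incl = 0.5  # prior inclusion probability (a common default)
post_incl = 0.9   # assumed posterior inclusion probability

prior_odds = prior_incl / (1 - prior_incl)  # equals 1 when the prior probability is 1/2
post_odds = post_incl / (1 - post_incl)     # 0.9 / 0.1 = 9
bf_incl = post_odds / prior_odds            # about 9: evidence for edge presence
```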

Because the inclusion Bayes factor averages over all possible network structures, it does not depend on assumptions about the remaining edges in the network. This property makes it a robust measure of the evidence for or against each individual edge (Sekulovski et al., 2024). See Edge Selection for interpretation guidelines and the prior specifications used in bgms.

References

Gottardo, R., & Raftery, A. E. (2008). Markov chain Monte Carlo with mixtures of mutually singular distributions. Journal of Computational and Graphical Statistics, 17(4), 949–975. https://doi.org/10.1198/106186008X386102
Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E.-J. (2020). A conceptual introduction to Bayesian model averaging. Advances in Methods and Practices in Psychological Science, 3(2), 200–215. https://doi.org/10.1177/2515245919898657
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4), 382–417. https://doi.org/10.1214/ss/1009212519
Marsman, M., Huth, K. B. S., Waldorp, L. J., & Ntzoufras, I. (2022). Objective Bayesian edge screening and structure selection for Ising networks. Psychometrika, 87(1), 47–82. https://doi.org/10.1007/s11336-022-09848-8
Sekulovski, N., Keetelaar, S., Huth, K. B. S., Wagenmakers, E.-J., van Bork, R., van den Bergh, D., & Marsman, M. (2024). Testing conditional independence in psychometric networks: An analysis of three Bayesian methods. Multivariate Behavioral Research, 59, 913–933. https://doi.org/10.1080/00273171.2024.2345915
van den Bergh, D., Clyde, M. A., Raftery, A. E., & Marsman, M. (2026). Reversible jump MCMC with no regrets: Bayesian variable selection using mixtures of mutually singular distributions. Manuscript in Preparation.

Footnotes

  1. In graphical modeling, the primary inferential targets are the partial associations \(\boldsymbol{\Omega}\) and the network structure \(\mathcal{S}\). The node potentials \(\mu_i(x_i)\) are nuisance parameters: they are estimated but typically not of direct inferential interest.