
Self-orthogonalizing attractor neural networks emerging from the free energy principle

Appendix

For the manuscript titled “Self-orthogonalizing attractor neural networks emerging from the free energy principle” by Tamas Spisak and Karl Friston

Library imports for inline code.

import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import scipy.integrate as integrate

sns.set(style="white")

Appendix 1

Continuous Bernoulli distribution

$$x \sim \mathcal{CB}(L) \iff P(x) \propto e^{Lx}, \quad x \in [-1, 1] \subset \mathbb{R}$$

Below we provide a Python implementation of the continuous Bernoulli distribution $\mathcal{CB}(L)$, parametrized by the log odds $L$ and adjusted to the $[-1, 1]$ interval.

def CB(x, b, logodds=True):
    """Continuous Bernoulli density on [-1, 1], parametrized by the log odds b."""
    if not logodds:
        b = np.log(b / (1 - b))
    if np.isclose(b, 0):
        return np.ones_like(x) / 2  # uniform limit as b -> 0
    else:
        return b * np.exp(b * x) / (2 * np.sinh(b))
# Plot some examples
delta = 100
sigma = np.linspace(-1, 1, delta)
for b in np.linspace(-2, 2, 5):
    p_sigma = CB(sigma, b)
    sns.lineplot(x=sigma, y=p_sigma, label=b)
(Figure: $\mathcal{CB}$ densities for log odds $b \in \{-2, -1, 0, 1, 2\}$.)

Appendix 2

Continuous Bernoulli distribution full derivation

Derivation of the exponential form of the continuous Bernoulli distribution, parametrized by the log odds L and adjusted to the [-1,1] interval.

$$P(x; L) = \frac{e^{Lx}}{\int_{-1}^1 e^{Lx'}\,dx'} = \frac{e^{Lx}}{\frac{e^{L} - e^{-L}}{L}} = \frac{L\, e^{Lx}}{2\sinh(L)}$$
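As a quick numerical check of this normalization (a minimal sketch; the `CB` helper from Appendix 1 is reproduced inline so the block is self-contained), the density should integrate to one over $[-1, 1]$ for any log odds:

```python
import numpy as np
from scipy import integrate

# CB density from Appendix 1, reproduced so this block is self-contained.
def CB(x, b):
    if np.isclose(b, 0):
        return np.ones_like(np.asarray(x, dtype=float)) / 2
    return b * np.exp(b * x) / (2 * np.sinh(b))

# The normalization constant L / (2 sinh L) should make the density
# integrate to one over [-1, 1] for any log odds L:
for L in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    total, _ = integrate.quad(lambda x: CB(x, L), -1, 1)
    assert np.isclose(total, 1.0, atol=1e-8)
```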

Appendix 3

Visualization of the likelihood function with various parameters, in Python.

delta = 100
s_i = np.linspace(-1, 1, delta)

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for idx, mu in enumerate([0.1, 0.5, 1]):
    for w_i in np.linspace(-2, 2, 5):
            p_mu = CB(s_i, w_i*mu)
            sns.lineplot(x=s_i, y=p_mu, ax=axes[idx], label=w_i).set(title=f"$\\mu={mu}$")
(Figure: likelihood $\mathcal{CB}(w_i \mu)$ for $\mu \in \{0.1, 0.5, 1\}$ and weights $w_i \in \{-2, -1, 0, 1, 2\}$.)

Appendix 4

Expected value of the $\mathcal{CB}$

$$\begin{align*} \mathbb{E}_{\mathcal{CB}(b)}[\sigma] &= \int_{-1}^{1} \sigma \frac{b\, e^{b \sigma}}{2\sinh(b)}\, d\sigma = \frac{b \left( \frac{(b-1)e^{b}}{b^2} + \frac{(b+1)e^{-b}}{b^2} \right)}{2\sinh(b)} \\ &= \coth(b) - \frac{1}{b} \end{align*}$$
bs = np.linspace(-10, 10, 100)
plt.figure(figsize=(4, 1))
sns.lineplot(x=bs, y=1/np.tanh(bs) - 1/bs)  # Langevin function: coth(b) - 1/b
plt.show()
(Figure: the Langevin function $\coth(b) - 1/b$ over $b \in [-10, 10]$.)
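The closed form $\coth(b) - 1/b$ can be checked against a direct numerical expectation (a small self-contained sketch; the $\mathcal{CB}$ density is redefined inline):

```python
import numpy as np
from scipy import integrate

def CB(x, b):  # CB density from Appendix 1 (b != 0)
    return b * np.exp(b * x) / (2 * np.sinh(b))

def langevin(b):  # closed-form expectation derived above
    return 1 / np.tanh(b) - 1 / b

# The direct numerical expectation matches coth(b) - 1/b
# for both positive and negative biases:
for b in [-3.0, -0.7, 0.4, 2.5]:
    mean, _ = integrate.quad(lambda x: x * CB(x, b), -1, 1)
    assert np.isclose(mean, langevin(b), atol=1e-8)
```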

Appendix 5

Conservative dynamics

Let $x \in \mathbb{R}^n$ denote the system’s states (internal, blanket, and external states). The drift can be decomposed into a gradient part (from a potential $U$) and a solenoidal part $R$:

$$\dot{x} = -\nabla U(x) + R(x)$$

where $R = -R^\top$ is antisymmetric in state space.

The probability density $p(x, t)$ over states evolves according to the Fokker–Planck equation:

$$\partial_t p(x,t) = -\nabla \cdot \left[ (-\nabla U + R)\, p(x,t) \right] + \text{diffusion terms}.$$

To determine the stationary distribution $p_s(x)$, we set $\partial_t p(x,t) = 0$. The Fokker–Planck equation then implies that the net probability current $J_s(x) = (-\nabla U(x) + R(x))\, p_s(x) - D \nabla p_s(x)$ (assuming isotropic diffusion $D$) must be divergence-free: $\nabla \cdot J_s(x) = 0$. If we propose $p_s(x) = Z^{-1} \exp(-U(x)/D)$ (where $Z$ is a normalization constant), then the diffusive current $-D \nabla p_s(x)$ becomes $p_s(x) \nabla U(x)$.

Substituting this into Js(x)J_s(x), we get:

$$J_s(x) = (-\nabla U(x) + R(x))\, p_s(x) + p_s(x) \nabla U(x) = R(x)\, p_s(x).$$

Thus, $p_s(x) \propto \exp(-U(x)/D)$ is the stationary distribution if and only if the solenoidal component of the probability current, $J_R(x) = R(x)\, p_s(x)$, is itself divergence-free: $\nabla \cdot (R(x)\, p_s(x)) = 0$.

It is a key result from the study of non-equilibrium systems under the Free Energy Principle, particularly those possessing a Markov blanket and involving conservative particles, that this condition $\nabla \cdot (R(x)\, p_s(x)) = 0$ holds (see Friston et al., 2023; see also Ao, 2004; Xing, 2010). The conditional independence structure imposed by the Markov blanket, along with other FEP-specific assumptions (like conservative particles), constrains the system’s dynamics such that solenoidal forces $R(x)$ do not alter the Boltzmann form of the stationary distribution. Instead, they drive persistent, divergence-free probability currents $J_R(x)$ along iso-potential surfaces.

In steady state ($\partial_t p = 0$), only the gradient term $-\nabla U$ influences $p(x)$. The stationary (nonequilibrium) distribution is

$$p(x) \propto \exp\{-U(x)\},$$

determined solely by the symmetric (potential) part $-\nabla U$. The antisymmetric $R$ does not reweight $p(x)$; it just circulates probability around isocontours of $U$.

In most NESS systems, antisymmetric flows do alter the stationary measure. Under particular‐partition (Markov‐blanket) constraints, however, the internal–external factorization ensures that solenoidal (antisymmetric) flows remain divergence‐free, leaving the Boltzmann‐like steady distribution intact. This is why, in these Markov‐blanketed systems, one can have persistent solenoidal currents (nonequilibrium flows) yet preserve a stationary distribution that depends only on the symmetric part of the couplings.
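This claim can be illustrated with a minimal two-dimensional simulation (an illustrative sketch, not from the manuscript): with $U(x) = \|x\|^2/2$ and an antisymmetric rotation $R$, the condition $\nabla \cdot (Rx\, p_s) = 0$ holds, so the empirical stationary covariance should match the Boltzmann prediction (the identity matrix) for any rotation strength:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2D Langevin dynamics dx = (-grad U + R x) dt + sqrt(2) dW, with U(x) = |x|^2 / 2.
# R is antisymmetric (solenoidal). For this U, div(R x * p_s) = 0, so the
# stationary density exp(-U) should be unchanged for any rotation strength omega.
def simulate_cov(omega, n=10000, steps=1500, dt=0.01):
    R = np.array([[0.0, omega], [-omega, 0.0]])
    x = rng.standard_normal((n, 2))        # start near the stationary density
    for _ in range(steps):
        drift = -x + x @ R.T               # -grad U(x) + R x, applied row-wise
        x = x + drift * dt + np.sqrt(2 * dt) * rng.standard_normal((n, 2))
    return np.cov(x.T)

# The empirical stationary covariance matches the Boltzmann prediction
# (identity), with and without solenoidal flow:
for omega in [0.0, 1.0, 3.0]:
    assert np.allclose(simulate_cov(omega), np.eye(2), atol=0.08)
```

The solenoidal term only circulates probability around the circular isocontours of $U$; increasing `omega` changes the currents but not the stationary measure.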

Appendix 6

Detailed inference derivation: accuracy–complexity decomposition and $\partial F / \partial b_q$

The main text derives the inference rule (eq. (15)) via a compact local VFE argument. Here we provide the equivalent — but more detailed — derivation via the accuracy–complexity decomposition of VFE, confirming the same result.

1. Substitute our parametrization into $F$

Let’s start by substituting our parametrization into eq. .

1a. Accuracy term

From the RBM marginalization (eq. ):

$$E(\bm{\sigma}) = \underbrace{-b_i\sigma_i - \sum_{j\neq i}J_{ij}\sigma_i\sigma_j}_{\text{Terms with } \sigma_i} \underbrace{-\sum_{j\neq i}b_j\sigma_j - \frac{1}{2}\sum_{j,k\neq i}J_{jk}\sigma_j\sigma_k}_{\text{Terms without } \sigma_i}$$

Here, $-b_i\sigma_i$ becomes constant, since $\sigma_i$ is fixed. So we get:

$$P(\sigma_{\backslash i}|\sigma_i) \propto \exp\left(\sum_{j\neq i}(b_j + J_{ij}\sigma_i)\sigma_j + \frac{1}{2}\sum_{j,k\neq i}J_{jk}\sigma_j\sigma_k\right)$$

Taking the expectation of $\ln P(\sigma_{\backslash i}|\sigma_i)$ under $q(\sigma_i)$:

$$\mathbb{E}_q[\ln P(\sigma_{\backslash i}|\sigma_i)] = \text{const} + \sum_{j\neq i}b_j\sigma_j + S(b_q)\sum_{j\neq i}J_{ij}\sigma_j + \frac{1}{2}\sum_{j,k\neq i}J_{jk}\sigma_j\sigma_k$$

where $S(b_q) = \mathbb{E}_q[\sigma_i] = \coth b_q - 1/b_q$ is the expected value of the $\mathcal{CB}$, a sigmoid function of the bias (Appendix 4).

1b. Complexity term

The complexity term in eq. is simply the KL divergence between two $\mathcal{CB}$ distributions. For $\mathcal{CB}$ distributions:

$$q(x) = \frac{b_q}{2\sinh b_q}e^{b_q x},\quad p(x) = \frac{b}{2\sinh b}e^{b x}$$

• KL divergence definition:

$$D_{KL} = \int_{-1}^1 q(x) \ln\frac{q(x)}{p(x)}\, dx = \mathbb{E}_q[\ln q(x) - \ln p(x)]$$

• Expand the log terms:

$$\ln q(x) = \ln b_q - \ln(2\sinh b_q) + b_q x$$

$$\ln p(x) = \ln b - \ln(2\sinh b) + b x$$

• Subtract the log terms:

$$\ln\frac{q(x)}{p(x)} = \ln\frac{b_q}{b} + \ln\frac{\sinh b}{\sinh b_q} + (b_q - b)x$$

• Take the expectation under $q(x)$:

$$D_{KL} = \ln\frac{b_q}{b} + \ln\frac{\sinh b}{\sinh b_q} + (b_q - b)\mathbb{E}_q[x]$$

• Compute the expectation $\mathbb{E}_q[x]$:

$$\mathbb{E}_q[x] = \int_{-1}^1 x \frac{b_q e^{b_q x}}{2\sinh b_q}\, dx = \frac{1}{2\sinh b_q}\left[\frac{e^{b_q x}}{b_q}(b_q x - 1)\right]_{-1}^1$$

• Evaluate at the bounds:

$$= \frac{1}{2\sinh b_q}\left(\frac{e^{b_q}(b_q - 1) - e^{-b_q}(-b_q - 1)}{b_q}\right)$$

• Simplify using hyperbolic identities:

$$= \frac{b_q \cosh b_q - \sinh b_q}{b_q \sinh b_q} = \coth b_q - \frac{1}{b_q}$$

• Final substitution for the complexity term:

$$D_{KL} = \ln\frac{b_q \sinh b}{b \sinh b_q} + (b_q - b)\left(\coth b_q - \frac{1}{b_q}\right)$$
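The closed-form KL can be verified against direct numerical integration (a self-contained sketch; the $\mathcal{CB}$ density is redefined inline):

```python
import numpy as np
from scipy import integrate

def CB(x, b):  # CB density (b != 0), as in Appendix 1
    return b * np.exp(b * x) / (2 * np.sinh(b))

def kl_closed_form(bq, b):
    S = 1 / np.tanh(bq) - 1 / bq  # E_q[x], the Langevin function
    return np.log(bq * np.sinh(b) / (b * np.sinh(bq))) + (bq - b) * S

# The closed form agrees with direct numerical integration of q ln(q/p):
for bq, b in [(0.5, 1.5), (2.0, 0.3), (-1.0, -2.5)]:
    num, _ = integrate.quad(lambda x: CB(x, bq) * np.log(CB(x, bq) / CB(x, b)), -1, 1)
    assert np.isclose(num, kl_closed_form(bq, b), atol=1e-8)
```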

1c. Combining the two terms

Combining the two terms, we get the following expression for the free energy:

$$F = \ln\left(\frac{b_q}{b}\right) + \ln\left(\frac{\sinh(b)}{\sinh(b_q)}\right) + (b_q - b)\, S(b_q) - \sum_{j \ne i} \left( b_j + S(b_q)\, J_{ij} \right) \sigma_j - \frac{1}{2} \sum_{j \ne i} \sum_{k \ne i} J_{jk} \sigma_j \sigma_k + C$$

where $C$ denotes all constants in the equation that are independent of $\sigma$ or $b_q$.

2. Free Energy partial derivative calculation

• First, we differentiate the log terms:

$$\frac{\partial}{\partial b_q}\left[\ln\frac{b_q}{b} + \ln\frac{\sinh b}{\sinh b_q}\right] = \frac{1}{b_q} - \coth b_q$$

• Then, the KL core term:

$$\frac{\partial}{\partial b_q}\left[(b_q - b)S(b_q)\right] = S(b_q) + (b_q - b)\frac{dS}{db_q}$$

• The linear terms:

$$\frac{\partial}{\partial b_q}\left[-\sum_{j\neq i}(b_j + S(b_q)J_{ij})\sigma_j\right] = -\sum_{j\neq i}J_{ij}\sigma_j\frac{dS}{db_q}$$

• The constants vanish.

• Now, combining all terms:

$$\frac{\partial F}{\partial b_q} = \left(\frac{1}{b_q} - \coth b_q\right) + \left(S(b_q) + (b_q - b)\frac{dS}{db_q}\right) - \sum_{j\neq i}J_{ij}\sigma_j\frac{dS}{db_q}$$

• Substituting $S(b_q) = \coth b_q - 1/b_q$:

$$= \left(\frac{1}{b_q} - \coth b_q\right) + \left(\coth b_q - \frac{1}{b_q} + (b_q - b)\frac{dS}{db_q}\right) - \sum_{j\neq i}J_{ij}\sigma_j\frac{dS}{db_q}$$

• Cancel terms:

$$\cancel{\frac{1}{b_q}} \cancel{- \coth b_q} + \cancel{\coth b_q} \cancel{- \frac{1}{b_q}} + (b_q - b)\frac{dS}{db_q} - \sum_{j\neq i}J_{ij}\sigma_j\frac{dS}{db_q}$$

This gives us the final derivative:

$$\frac{\partial F}{\partial b_q} = \left(b_q - b - \sum_{j\neq i}J_{ij}\sigma_j\right)\frac{dS}{db_q}$$

where $\frac{dS}{db_q} = \frac{1}{b_q^2} - \operatorname{csch}^2 b_q$.

Setting the derivative to zero and solving for $b_q$ (note that $\frac{dS}{db_q} > 0$ for all $b_q$, so this factor cannot vanish), we get:

$$b_q = b + \sum_{j \ne i} J_{ij} \sigma_j$$

Now recall that the expected value of the $\mathcal{CB}$ is the Langevin function of its bias, $\mathbb{E}[x] = \coth(b) - 1/b$ (Appendix 4). For simplicity, we will denote it as $L(x)$. Now we can write:

$$\mathbb{E}_{q}[\sigma_i] = L(b_q) = L\left( b + \sum_{j \ne i} J_{ij} \sigma_j \right)$$

Q.E.D.
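As a numerical cross-check (an illustrative sketch, with the lateral input $\sum_{j \ne i} J_{ij}\sigma_j$ abbreviated as a scalar `h`), the derivative formula and its fixed point can be verified by finite differences on the $b_q$-dependent part of $F$:

```python
import numpy as np

def S(b):                       # Langevin function, E_q[sigma_i]
    return 1 / np.tanh(b) - 1 / b

def dS(b):                      # dS/db = 1/b^2 - csch^2(b)
    return 1 / b**2 - 1 / np.sinh(b)**2

def F(bq, b, h):
    # b_q-dependent part of the free energy above, with h denoting the
    # lateral input sum_{j != i} J_ij sigma_j; constant terms are dropped.
    return (np.log(bq / b) + np.log(np.sinh(b) / np.sinh(bq))
            + (bq - b) * S(bq) - S(bq) * h)

b, h, eps = 0.7, 1.2, 1e-6
for bq in [0.3, 1.1, 2.4]:
    fd = (F(bq + eps, b, h) - F(bq - eps, b, h)) / (2 * eps)
    assert np.isclose(fd, (bq - b - h) * dS(bq), atol=1e-6)

# The derivative vanishes at the fixed point b_q = b + h:
fd0 = (F(b + h + eps, b, h) - F(b + h - eps, b, h)) / (2 * eps)
assert np.isclose(fd0, 0.0, atol=1e-6)
```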

Appendix 7

Basin geometry and adversarial robustness

We provide a geometric analysis of how attractor orthogonality and the precision parameter jointly affect adversarial robustness in the proposed framework.

Setup. Consider a network with $N$ nodes, inverse temperature $iT$, and $K$ stored attractors $\{\boldsymbol{\sigma}^{(\mu)}\}_{\mu=1}^K$ with $\|\boldsymbol{\sigma}^{(\mu)}\|^2 = N$. Under the projector weight matrix $\mathbf{J}^\dagger \approx \frac{1}{N}\sum_\mu \boldsymbol{\sigma}^{(\mu)}{\boldsymbol{\sigma}^{(\mu)}}^\top$ (zero diagonal), which the learning rule approximates, the energy at a stored attractor satisfies $E(\boldsymbol{\sigma}^{(\mu)}) \approx -N$.

Adversarial budget and representational redundancy. An adversarial perturbation $\delta\mathbf{s}$ applied to the bias vector shifts the energy difference between two attractors $\boldsymbol{\sigma}^{(\mu)}$ and $\boldsymbol{\sigma}^{(\nu)}$ by $\Delta E \approx -\delta\mathbf{s}^\top (\boldsymbol{\sigma}^{(\mu)} - \boldsymbol{\sigma}^{(\nu)})$. The most efficient attack aligns $\delta\mathbf{s}$ with the difference vector, yielding a minimum perturbation norm $\|\delta\mathbf{s}\|_{\min} \propto 1/\|\boldsymbol{\sigma}^{(\mu)} - \boldsymbol{\sigma}^{(\nu)}\|$. For orthogonal attractors ($\boldsymbol{\sigma}^{(\mu)\top}\boldsymbol{\sigma}^{(\nu)} = 0$): $\|\boldsymbol{\sigma}^{(\mu)} - \boldsymbol{\sigma}^{(\nu)}\|^2 = 2N$. For attractors with normalized overlap $m = \frac{1}{N}\boldsymbol{\sigma}^{(\mu)\top}\boldsymbol{\sigma}^{(\nu)}$: $\|\boldsymbol{\sigma}^{(\mu)} - \boldsymbol{\sigma}^{(\nu)}\|^2 = 2N(1-m)$. Since $0 \leq m < 1$, orthogonal attractors ($m = 0$) maximize the adversarial budget. Self-orthogonalization thus reduces representational redundancy (overlap between attractor states) while increasing robustness to targeted perturbations.
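The distance identity $\|\boldsymbol{\sigma}^{(\mu)} - \boldsymbol{\sigma}^{(\nu)}\|^2 = 2N(1-m)$ is easy to confirm numerically (a minimal sketch with random $\pm 1$ patterns, which are near-orthogonal for large $N$):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4096

# Random +/-1 patterns are near-orthogonal (overlap m ~ 1/sqrt(N)), so their
# squared distance approaches the maximum 2N, which maximizes the minimal
# adversarial perturbation norm ~ 1 / ||sigma_mu - sigma_nu||.
sigma_mu = rng.choice([-1.0, 1.0], size=N)
sigma_nu = rng.choice([-1.0, 1.0], size=N)

m = sigma_mu @ sigma_nu / N              # normalized overlap
d2 = np.sum((sigma_mu - sigma_nu) ** 2)  # squared distance between attractors

assert np.isclose(d2, 2 * N * (1 - m))   # the identity used in the text (exact)
assert abs(m) < 0.1                      # near-orthogonality for large N
```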

Structural redundancy. Although representational redundancy is minimized, structural redundancy is fully preserved: each attractor is encoded across all $O(N^2)$ synaptic weights, with each weight contributing $O(1/N)$ to the basin depth. Corrupting or removing a bounded fraction of weights therefore has a limited effect on the energy landscape, regardless of attractor orthogonality.

Prior precision as a defense. Under finite temperature, the posterior is $p(\boldsymbol{\sigma}\mid\mathbf{s}) \propto \exp\bigl\{iT\bigl[\sum_i (b_i + \delta s_i)\sigma_i + \frac{1}{2}\sum_{ij}J_{ij}\sigma_i\sigma_j\bigr]\bigr\}$. The effective basin depth scales as $iT \cdot N$, while the perturbation contributes $iT \cdot \delta\mathbf{s}^\top\boldsymbol{\sigma}$. The ratio of basin depth to perturbation magnitude grows with $iT$: high precision makes prior-encoded attractors robust to sensory-level attacks. The converse also holds: low $iT$ down-weights the prior, limiting the influence of poisoned attractors on the posterior.

Stochastic averaging. Beyond the geometric analysis, stochastic MCMC inference provides an additional defense layer. The time-averaged network response converges to the posterior mean $\mathbb{E}_{p(\boldsymbol{\sigma}\mid\mathbf{s})}[\boldsymbol{\sigma}]$, with variance decreasing as $O(1/T_{\text{eff}})$, where $T_{\text{eff}}$ is the effective number of independent samples. This averaging smooths out the effect of perturbations that shift individual samples but do not move the posterior mode across a basin boundary.

Implicit regularization against data poisoning. During learning, the anti-Hebbian term in the learning rule subtracts the network’s current prediction from the observed correlation. For a corrupted training pattern, the prediction, dominated by the clean attractor subspace, absorbs most of the pattern’s variance, leaving only a small residual update. For a poisoning fraction $\epsilon \ll 1$, the cumulative effect of poisoned updates remains bounded relative to the clean learning signal, analogous to the outlier-robustness of Bayesian posteriors under informative priors.

Appendix 8

Detailed comparison with classical models

We compare the proposed FEP-based attractor network (FEP-ANN) with the canonical formulations of classical Hopfield networks, Boltzmann machines, projector (pseudo-inverse) networks, and hierarchical predictive coding. Each of these model families encompasses many variants; the comparison targets the canonical baseline in each case.

Learning rule and computational complexity

  • Boltzmann machines (contrastive divergence). The exact gradient of the Boltzmann log-likelihood with respect to weights $J_{ij}$ is

    $$\frac{\partial \ln p(\mathbf{v})}{\partial J_{ij}} = \langle \sigma_i \sigma_j \rangle_{\text{data}} - \langle \sigma_i \sigma_j \rangle_{\text{model}}$$

    where $\langle \cdot \rangle_{\text{model}}$ requires sampling from the model’s equilibrium distribution, which is computationally intractable for large networks. Contrastive divergence (CD-$k$) approximates $\langle \cdot \rangle_{\text{model}}$ by running $k$ steps of Gibbs sampling from the data-initialized state. Each Gibbs sweep visits all $N$ nodes, and computing the input to each node requires $O(N)$ multiplications, yielding a cost of $O(N^2)$ per Gibbs sweep and $O(N^2 k)$ per weight update.

  • FEP-ANN. The learning rule for node $i$ uses the network’s instantaneous predicted state $\hat{\sigma}_i = L(b_i + \sum_k J_{ik} \sigma_k)$ (a single forward pass through the current weights) rather than a sample average over model-generated states. Computing $\hat{\sigma}_i$ for all nodes requires one matrix-vector product $\mathbf{J}\boldsymbol{\sigma}$ at $O(N^2)$; the outer-product updates for all weights then cost $O(N^2)$. The total per-step learning cost is therefore $O(N^2)$, a factor of $k$ cheaper than CD-$k$. The free-running phase is eliminated entirely.

  • Solenoidal flows and mixing times. In a symmetric Boltzmann machine, inference proceeds via reversible Gibbs sampling; reversibility implies that probability mass can only move between attractor basins by climbing over energy barriers, leading to mixing times that scale exponentially with barrier height.

    In the FEP-ANN with asymmetric couplings $\mathbf{J} \neq \mathbf{J}^\top$, the dynamics decompose into a gradient (dissipative) component driven by $\mathbf{J}^\dagger = \frac{1}{2}(\mathbf{J}+\mathbf{J}^\top)$ and a solenoidal (conservative) component driven by $\mathbf{J}^- = \frac{1}{2}(\mathbf{J}-\mathbf{J}^\top)$. The solenoidal term does not alter the stationary distribution but induces persistent probability currents along iso-energy contours. These non-reversible currents transport probability mass between attractor basins without climbing energy barriers; they traverse the iso-energy surface instead, accelerating mixing. The resulting dynamics are formally analogous to continuous normalizing flows and Markovian flow matching.

  • Orthogonalization and memory capacity. Classical Hopfield networks with Hebbian (outer-product) weights store $K$ patterns; cross-talk limits capacity to $K_{\max} \approx 0.14N$. Projector (pseudo-inverse) networks achieve optimal capacity $K_{\max} = N$ with error-free retrieval but require batch access and $O(N^3)$ computation. In the FEP-ANN, the Hebbian/anti-Hebbian learning rule incrementally approximates the projector matrix through online learning: the Hebbian term accumulates pattern correlations while the anti-Hebbian term subtracts the model’s current prediction, progressively decorrelating the stored representations. Memory capacity thus approaches the projector limit without batch computation or matrix inversion.

  • Relationship to hierarchical predictive coding. Both the FEP-ANN and hierarchical predictive coding (PC) networks minimize variational free energy via local, prediction-error-driven updates. The structural difference is topology: PC operates on a bidirectional hierarchy with directed inter-layer connections (feedforward predictions, feedback prediction errors), while the FEP-ANN uses a fully recurrent, single-layer topology with arbitrary lateral couplings. Within the deep particular partition formalism, both are instances of the same FEP-based inference architecture applied to different coupling topologies. Exploring the formal equivalences between these topologies is a promising direction for future work.
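The per-step cost structure described above can be sketched in a few lines (an illustrative sketch only: the specific update form $\Delta \mathbf{J} \propto \boldsymbol{\sigma}\boldsymbol{\sigma}^\top - \hat{\boldsymbol{\sigma}}\hat{\boldsymbol{\sigma}}^\top$ is an assumption consistent with the Hebbian/anti-Hebbian description, not the manuscript’s exact rule):

```python
import numpy as np

rng = np.random.default_rng(2)

def langevin(b):
    return 1 / np.tanh(b) - 1 / b

# Illustrative sketch of the O(N^2) per-step cost structure described above.
# ASSUMPTION: a Hebbian term (observed correlations) minus an anti-Hebbian
# term (the network's own prediction), with the predicted state computed in
# one forward pass through the current weights.
def learning_step(J, b, sigma, eta=0.01):
    sigma_hat = langevin(b + J @ sigma)   # one O(N^2) matrix-vector product
    dJ = np.outer(sigma, sigma) - np.outer(sigma_hat, sigma_hat)  # O(N^2)
    np.fill_diagonal(dJ, 0.0)             # no self-coupling
    return J + eta * dJ

N = 256
J = np.zeros((N, N))
b = 0.1 * rng.standard_normal(N)
sigma = rng.choice([-1.0, 1.0], size=N)
J = learning_step(J, b, sigma)
assert J.shape == (N, N)
assert np.allclose(np.diag(J), 0.0)
```

The single forward pass replaces the free-running (model) phase of CD-$k$, which is where the factor-of-$k$ saving comes from.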

Table 1: Comparison of the FEP-ANN with canonical formulations of classical models.

| Feature | Hopfield | Boltzmann machine | Projector network | Predictive coding | FEP-ANN (ours) |
|---|---|---|---|---|---|
| Derivation | Energy-based | Statistical mechanics | Optimal storage | Hierarchical Bayesian | First principles (FEP) |
| State space | Binary $\{-1,+1\}$ | Binary $\{0,1\}$ | Binary | Continuous (Gaussian) | Continuous $[-1,+1]$ ($\mathcal{CB}$) |
| Activation | Sign | Logistic sigmoid | Sign | Linear / nonlinear | Langevin (emergent) |
| Learning | Hebbian (batch) | Contrastive divergence | Pseudo-inverse (batch) | Prediction-error min. | Hebbian/anti-Hebbian (online) |
| Learning phases | One-shot | Clamped + free | One-shot | Layerwise | Single phase |
| Coupling | Symmetric | Symmetric | Symmetric | Directed (hierarchy) | Symmetric or asymmetric |
| Sequence dynamics | No | No | No | Temporal hierarchy | NESS solenoidal flow |
| Orthogonalization | No | No | By construction | No | Emergent (VFE) |
| Memory capacity | $\sim 0.14N$ | $\sim 0.14N$ | $N$ (optimal) | N/A (generative) | Approaches $N$ |
| Inference | Deterministic | Gibbs MCMC | Deterministic | Message passing | MCMC + solenoidal flow |
| Precision control | None | Temperature (fixed) | None | Precision-weighting | $iT$ (Bayesian) |
| Continual learning | Catastrophic forgetting | Catastrophic forgetting | Catastrophic forgetting | Possible (with replay) | Built-in (spontaneous replay) |
| Learning complexity | $O(KN^2)$ one-shot | $O(N^2 k)$ per step | $O(KN^2 + N^3)$ one-shot | $O(N_l)$ per layer per step | $O(N^2)$ per step |
| FEP / active inference | No formal link | Variational (post hoc) | No formal link | Compatible | Derived from FEP |

Appendix 9

Scaling benchmarks

We report runtime and memory capacity benchmarks for the JAX implementation across network sizes from $N = 64$ to $N = 4096$ and pattern counts from $K = 10$ to $K = N$. Training and inference wall-clock time as a function of network size $N$ confirms the expected $O(N^2)$ per-step scaling. Panel to be finalized with benchmark data from 07-simulation-scaling-jax.ipynb.

Table: Runtime benchmarks (JAX implementation, 10 updates)

| $N$ (nodes) | Training time (s) | Throughput (steps/s) |
|---|---|---|
| 100 | 0.004 | 2455 |
| 1 000 | 0.007 | 1390 |
| 5 000 | 0.26 | 39.1 |
| 10 000 | 0.392 | 25.5 |
| 20 000 | 1.49 | 6.69 |
| 50 000 | 59.8 | 0.167 |
References
  1. Friston, K., Da Costa, L., Sakthivadivel, D. A. R., Heins, C., Pavliotis, G. A., Ramstead, M., & Parr, T. (2023). Path integrals, particular kinds, and strange things. Physics of Life Reviews, 47, 35–62. 10.1016/j.plrev.2023.08.016
  2. Ao, P. (2004). Potential in stochastic differential equations: novel construction. Journal of Physics A: Mathematical and General, 37(3), L25–L30. 10.1088/0305-4470/37/3/l01
  3. Xing, J. (2010). Mapping between dissipative and Hamiltonian systems. Journal of Physics A: Mathematical and Theoretical, 43(37), 375003. 10.1088/1751-8113/43/37/375003