LDLT L-Lipschitz Network Weight Parameterization Initialization
Abstract
We analyze initialization dynamics for LDLT-based L-Lipschitz layers by deriving the exact marginal output variance when the underlying parameter matrix $W_0 \in \mathbb{R}^{m \times n}$ is initialized with IID Gaussian entries $\mathcal{N}(0, \sigma^2)$. The output marginal variance, which depends on the Wishart-distributed matrix $S = W_0 W_0^\top \sim \mathcal{W}_m(n, \sigma^2 I_m)$, is derived in closed form using expectations of zonal polynomials via James' theorem and a Laplace-integral expansion of $(\alpha I_m + S)^{-1}$. We develop an Isserlis/Wick-based combinatorial expansion for $\mathbb{E}[\operatorname{tr}(S^k)]$ and provide explicit truncated moments up to $k = 10$, which yield accurate series approximations for small-to-moderate $\sigma^2$. Monte Carlo experiments confirm the theoretical estimates. Furthermore, an empirical analysis shows that, under the current He (Kaiming) initialization with scaling $1/n$, the output variance is 0.41, whereas the new parameterization with scaling $10/n$ for $\alpha = 1$ yields an output variance of 0.9. The findings clarify why deep L-Lipschitz networks suffer rapid information loss at initialization and offer practical prescriptions for choosing initialization hyperparameters to mitigate this effect. However, a hyperparameter sweep over optimizers, initialization scale, and depth on the Higgs boson classification dataset, conducted to validate the results on real-world data, shows that although the derivation ensures variance preservation, He initialization still performs better empirically.
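As a small illustration of the Monte Carlo check mentioned above, the sketch below samples $W_0$ with IID $\mathcal{N}(0, \sigma^2)$ entries, forms $S = W_0 W_0^\top$, and compares empirical estimates of $\mathbb{E}[\operatorname{tr}(S^k)]$ for $k = 1, 2$ against the standard Wishart identities $\mathbb{E}[\operatorname{tr}(S)] = \sigma^2 mn$ and $\mathbb{E}[\operatorname{tr}(S^2)] = \sigma^4 mn(m + n + 1)$. The function names, matrix sizes, and trial count are assumptions chosen for illustration; this is not the paper's code, and it does not reproduce the full expansion up to $k = 10$ or the output-variance formula.

```python
import numpy as np

def mc_trace_moments(m, n, sigma2, k_max=2, trials=4000, seed=0):
    """Monte Carlo estimate of E[tr(S^k)], k = 1..k_max, for S = W0 W0^T.

    Hypothetical helper: W0 is m x n with IID N(0, sigma2) entries.
    """
    rng = np.random.default_rng(seed)
    sums = np.zeros(k_max)
    for _ in range(trials):
        W0 = rng.normal(0.0, np.sqrt(sigma2), size=(m, n))
        S = W0 @ W0.T
        # tr(S^k) equals the sum of the k-th powers of the eigenvalues of S.
        eig = np.linalg.eigvalsh(S)
        for k in range(1, k_max + 1):
            sums[k - 1] += np.sum(eig ** k)
    return sums / trials

def exact_trace_moments(m, n, sigma2):
    """Closed-form E[tr(S)] and E[tr(S^2)] for the Wishart matrix S."""
    first = sigma2 * m * n
    second = sigma2 ** 2 * m * n * (m + n + 1)
    return np.array([first, second])

if __name__ == "__main__":
    m, n, sigma2 = 4, 6, 0.1  # illustrative sizes, not taken from the paper
    print("Monte Carlo:", mc_trace_moments(m, n, sigma2))
    print("Closed form:", exact_trace_moments(m, n, sigma2))
```

The two printed vectors should agree to within Monte Carlo error, which shrinks as the number of trials grows.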