Metrics

List of metrics currently considered by the package for a hook point.

Name	Description	Category
L1	The L1-norm, averaged over matrices. \(\frac{1}{K}\lVert\omega\rVert_1 = \frac{1}{K}\sum_{i=1}^{n}\lvert\omega_i\rvert\), where \(K\) is the number of weight matrices in the neural network and \(n\) is the number of weights. We average over matrices so that models with different depths are comparable.	1
L2	The L2-norm, averaged over matrices. \(\frac{1}{K}\lVert\omega\rVert_2 = \frac{1}{K}\sqrt{\sum_{i=1}^{n}\omega_i^2}\)	1
\(\frac{L1}{L2}\)	Measures the sparsity of the weights (Repetti et al., 2014). \(\frac{1}{K}\sum_{i=1}^{K}\frac{L_1^{(i)}}{L_2^{(i)}}\), which is the ratio \(\frac{L1}{L2}\) averaged over the \(K\) weight matrices. Lower is more sparse. For example, a one-hot vector is fully sparse and has a ratio of 1, the smallest value a non-zero vector can take. See Hurley & Rickard (2008) for a discussion of measures of sparsity.	1
\(\mu(w)\)	Sample mean of the weights. \(\frac{1}{N}\sum_{i=1}^{N}w_i\), where \(N\) is the number of parameters in the network.	1
\(median(w)\)	Median of the weights, with all weights concatenated into a single set.	1
\(\sigma(w)\)	Sample variance of the weights without Bessel’s correction. \(\frac{1}{N}\sum_{i=1}^{N}(w_i-\overline{w})^2\)	1
\(\mu(b)\)	Sample mean of the biases. We treat the biases separately because they have a distinct interpretation from the weights.	1
\(median(b)\)	Median of the biases, with all biases concatenated into a single set.	1
\(\sigma(b)\)	Sample variance of the biases without Bessel’s correction.	1
trace	The average trace over the \(K\) weight matrices. \(\frac{1}{K}\sum_{k=1}^{K}\operatorname{tr}(W_k)\), where \(W_k\) is the \(k\)th weight matrix.	2
\(\lambda_{max}\)	The average spectral norm. \(\frac{1}{K}\sum_{k=1}^{K}\lVert W_k\rVert_2\).	2
\(\frac{trace}{\lambda_{max}}\)	Average ratio of trace to spectral norm. \(\frac{1}{K}\sum_{k=1}^{K}\frac{\operatorname{tr}(W_k)}{\lVert W_k\rVert_2}\).	2
\(\mu(\lambda)\)	Average singular value over all matrices.	2
\(\sigma(\lambda)\)	Sample variance of singular values over all matrices.	2
Gradient Symmetricity	A metric specific to modular addition. Computes the cosine similarity between gradient vectors of the output logits with respect to the input embeddings. Averaging over many pairs yields the gradient symmetricity. See Zhong & Liu (2008).	2
Distant Irrelevance	A metric specific to modular addition. Measures the dependence of the correct logits on the difference between the two inputs.	2
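
The Category 1 formulas in the table can be sketched in plain Python to make the definitions concrete. This is an illustrative sketch, not the package's implementation: the function name `weight_statistics` and the nested-list representation of matrices are assumptions made for the example. It takes the L1 and L2 rows as written (norm of all weights concatenated, divided by \(K\)). For L2, averaging per-matrix norms instead would give a different value.

```python
import math
import statistics


def weight_statistics(matrices):
    """Category 1 statistics for a list of weight matrices (lists of rows).

    Hypothetical helper for illustration; not part of the package API.
    """
    K = len(matrices)
    flat = []    # all weights concatenated together
    ratios = []  # per-matrix L1/L2 ratios
    for W in matrices:
        entries = [float(w) for row in W for w in row]
        l1 = sum(abs(w) for w in entries)
        l2 = math.sqrt(sum(w * w for w in entries))
        ratios.append(l1 / l2)  # assumes no matrix is entirely zero
        flat.extend(entries)
    return {
        "L1": sum(abs(w) for w in flat) / K,
        "L2": math.sqrt(sum(w * w for w in flat)) / K,
        "L1/L2": sum(ratios) / K,
        "mean": statistics.fmean(flat),
        "median": statistics.median(flat),
        # pvariance divides by N, i.e. no Bessel's correction
        "variance": statistics.pvariance(flat),
    }


# Two small 2x2 matrices; the first is one-hot, so its L1/L2 ratio is 1.
example = [[[1, 0], [0, 0]], [[3, -4], [0, 0]]]
stats = weight_statistics(example)
```

The same function can be applied to the biases by passing each bias vector as a single-row matrix.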

Category Glossary

Category	Description
1	The statistic intends to capture how the neural network weights are dispersed in space.
2	The statistic intends to capture properties of the function computed by a layer.
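
As a rough illustration of the Category 2 matrix metrics, the sketch below computes the average trace, average spectral norm, and average trace over spectral norm using only the standard library. The helper names (`trace`, `spectral_norm`, `layer_metrics`) are hypothetical, and the spectral norm is estimated by power iteration, which is a choice made for this example rather than the package's method. The singular-value mean and variance need a full SVD and are left out.

```python
import math


def trace(W):
    """Sum of the diagonal entries (main diagonal if W is not square)."""
    return sum(W[i][i] for i in range(min(len(W), len(W[0]))))


def spectral_norm(W, iters=500):
    """Estimate the largest singular value by power iteration on W^T W.

    The all-ones start vector is an assumption of this sketch; it fails
    only if it is exactly orthogonal to the top right-singular vector.
    """
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(W[r][c] * v[c] for c in range(cols)) for r in range(rows)]  # W v
        z = [sum(W[r][c] * u[r] for r in range(rows)) for c in range(cols)]  # W^T u
        norm = math.sqrt(sum(x * x for x in z))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in z]
    u = [sum(W[r][c] * v[c] for c in range(cols)) for r in range(rows)]
    return math.sqrt(sum(x * x for x in u))  # ||W v|| for unit-norm v


def layer_metrics(matrices):
    """Average each metric over the K weight matrices."""
    K = len(matrices)
    traces = [trace(W) for W in matrices]
    norms = [spectral_norm(W) for W in matrices]  # assumed non-zero below
    return {
        "trace": sum(traces) / K,
        "lambda_max": sum(norms) / K,
        "trace/lambda_max": sum(t / s for t, s in zip(traces, norms)) / K,
    }


# Two diagonal matrices with spectral norms 3 and 2, and trace 4 each.
metrics = layer_metrics([[[3.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]])
```

With a numerical library available, a full singular value decomposition would give the spectral norm and the singular-value statistics in one step.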