# Metrics

List of metrics currently considered by the package for a hook point.

| Name | Description | Category |
| ---- | ----------- | -------- |
| L1 | The L1-norm, averaged over matrices. {math}`\frac{1}{K}\lvert\lvert\omega\rvert\rvert_1 = \frac{1}{K}\sum_{i=1}^{n}\lvert\omega_i\rvert`, where {math}`K` is the number of weight matrices in the neural network. We average over matrices so that models with different depths are comparable. | 1 |
| L2 | The L2-norm, averaged over matrices. {math}`\frac{1}{K}\lvert\lvert\omega\rvert\rvert_2 = \frac{1}{K}\sqrt{\sum_{i=1}^{n}\omega_i^2}` | 1 |
| {math}`\frac{L1}{L2}` | Measures the sparsity of the weights ([Repetti et al., 2014](https://arxiv.org/abs/1407.5465)). {math}`\frac{1}{K}\sum_{i=1}^{K}\frac{L_1^{(i)}}{L_2^{(i)}}`, which is the ratio {math}`\frac{L1}{L2}` averaged over the {math}`K` weight matrices. Lower values indicate sparser weights: a one-hot vector is maximally sparse and has a ratio of 1. See [Hurley & Rickard (2008)](https://arxiv.org/abs/0811.4706) for a discussion of sparsity measures. | 1 |
| {math}`\mu(w)` | Sample mean of the weights. {math}`\frac{1}{N}\sum_{i=1}^{N}w_i`, where {math}`N` is the number of parameters in the network. | 1 |
| {math}`median(w)` | Median of the weights, computed over all weights concatenated into a single collection. | 1 |
| {math}`\sigma(w)` | Sample variance of the weights without Bessel's correction. {math}`\frac{\sum_{i=1}^{N}(w_i-\overline{w})^2}{N}` | 1 |
| {math}`\mu(b)` | Sample mean of the biases. We treat the biases separately because they have a distinct interpretation from the weights. | 1 |
| {math}`median(b)` | Median of the biases, computed over all biases concatenated into a single collection. | 1 |
| {math}`\sigma(b)` | Sample variance of the biases without Bessel's correction. | 1 |
| trace | The average trace over the {math}`K` weight matrices. {math}`\frac{1}{K}\sum_{k=1}^{K}\operatorname{tr}(W_k)`, where {math}`W_k` is the {math}`k`-th weight matrix. | 2 |
| {math}`\lambda_{max}` | The average spectral norm. {math}`\frac{1}{K}\sum_{k=1}^{K}\lvert\lvert W_k\rvert\rvert_2`. | 2 |
| {math}`\frac{trace}{\lambda_{max}}` | Average ratio of trace to spectral norm. {math}`\frac{1}{K}\sum_{k=1}^{K}\frac{\operatorname{tr}(W_k)}{\lvert\lvert W_k\rvert\rvert_2}`. | 2 |
| {math}`\mu(\lambda)` | Average singular value over all matrices. | 2 |
| {math}`\sigma(\lambda)` | Sample variance of the singular values over all matrices. | 2 |
| Gradient Symmetricity | A metric specific to modular addition. Compute the cosine similarity between gradient vectors of the output logits with respect to the input embeddings. Averaging over many pairs yields the gradient symmetricity. See [Zhong et al. (2023)](https://arxiv.org/abs/2306.17844). | 2 |
| Distant Irrelevance | A metric specific to modular addition. Measures the dependence of the correct logits on the difference between the two inputs. | 2 |

#### Category Glossary

| Category | Description |
| -------- | ----------- |
| 1 | The statistic is intended to capture how the neural network weights are dispersed in space. |
| 2 | The statistic is intended to capture properties of the function computed by a layer. |
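
As a rough illustration of how the Category 1 formulas and the trace metric in the table above could be evaluated, here is a minimal sketch in plain Python. It is not the package's implementation: the function names (`l1_norm`, `l2_norm`, `trace`, `weight_metrics`) and the dictionary keys are hypothetical, nested lists stand in for weight tensors, and the sketch assumes that "averaged over matrices" means computing each norm per matrix and then taking the mean over the {math}`K` matrices.

```python
import math
import statistics


def l1_norm(matrix):
    """Sum of the absolute values of all entries of one matrix."""
    return sum(abs(x) for row in matrix for x in row)


def l2_norm(matrix):
    """Square root of the sum of squared entries of one matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def trace(matrix):
    """Sum of the diagonal entries (leading diagonal if not square)."""
    size = min(len(matrix), len(matrix[0]))
    return sum(matrix[i][i] for i in range(size))


def weight_metrics(matrices):
    """Hypothetical helper: summary statistics for a list of weight matrices.

    Assumption: norms are computed per matrix, then averaged over the K
    matrices. An all-zero matrix would divide by zero in the L1/L2 ratio;
    this sketch does not guard against that case.
    """
    k = len(matrices)
    # All weights concatenated into one flat list, as used by the
    # mean, median and variance rows of the table.
    flat = [x for m in matrices for row in m for x in row]
    return {
        "l1": sum(l1_norm(m) for m in matrices) / k,
        "l2": sum(l2_norm(m) for m in matrices) / k,
        "l1_over_l2": sum(l1_norm(m) / l2_norm(m) for m in matrices) / k,
        "mean": statistics.fmean(flat),
        "median": statistics.median(flat),
        # pvariance divides by N, i.e. no Bessel's correction.
        "variance": statistics.pvariance(flat),
        "trace": sum(trace(m) for m in matrices) / k,
    }


if __name__ == "__main__":
    example = [
        [[3.0, 4.0], [0.0, 0.0]],
        [[1.0, 0.0], [0.0, 0.0]],
    ]
    for name, value in weight_metrics(example).items():
        print(f"{name}: {value}")
```

Under these assumptions, a single one-hot row such as `[[0.0, 1.0, 0.0]]` gives an `l1_over_l2` value of 1.0, which matches the table's remark that a one-hot vector has a ratio of 1. The spectral-norm and singular-value metrics are left out because they need a linear algebra routine beyond the standard library.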