# Metrics

List of metrics currently considered by the package for a hook point.

| Name                                | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Category |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| L1                                  | The L1-norm, averaged over matrices. {math}`\frac{1}{K}\lvert\lvert\omega\rvert\rvert_1 = \frac{1}{K} \Sigma^n_{i=1}\lvert\omega_i\rvert`, where {math}`K` is the number of weight matrices in the neural network. We average over matrices so that models with different depths are comparable.                                                                                                                                                              | 1        |
| L2                                  | The L2-norm, averaged over matrices. {math}`\frac{1}{K}\lvert\lvert\omega\rvert\rvert_2 = \frac{1}{K}\sqrt{\Sigma^n_{i=1}\omega^2_i}`. | 1        |
| {math}`\frac{L1}{L2}`               | Measures the sparsity of the weights ([Repetti et al., 2014](https://arxiv.org/abs/1407.5465)). {math}`\frac{1}{K}\Sigma^K_{i=1}\frac{L_1^{(i)}}{L_2^{(i)}}`, which is the ratio {math}`\frac{L1}{L2}` averaged over the {math}`K` weight matrices. Lower is more sparse: the ratio is at least 1, and a one-hot vector, which is fully sparse, attains this minimum of 1. See [Hurley & Rickard (2008)](https://arxiv.org/abs/0811.4706) for a discussion of measures of sparsity. | 1        |
| {math}`\mu(w)`                      | Sample mean of the weights. {math}`\frac{1}{N}\Sigma^N_{i=1}w_i`, where {math}`N` is the number of parameters in the network. | 1        |
| {math}`median(w)`                   | Median of the weights, treated as a set concatenated together. | 1        |
| {math}`\sigma(w)`                   | Sample variance of the weights without Bessel's correction. {math}`\frac{\Sigma^N_{i=1}(w_i-\overline w)^2}{N}`. | 1        |
| {math}`\mu(b)`                      | Sample mean of the biases. We treat the biases separately because they have a distinct interpretation from the weights.                                                                                                                                                                                                                                                                                                                                       | 1        |
| {math}`median(b)`                   | Median of the biases, treated as a set concatenated together.                                                                                                                                                                                                                                                                                                                                                                                                 | 1        |
| {math}`\sigma(b)`                   | Sample variance of the biases without Bessel’s correction. | 1        |
| trace                               | The average trace over the {math}`K` weight matrices. {math}`\frac{1}{K}\Sigma^K_{k=1}tr(W_k)`, where {math}`W_k` is the {math}`k`-th weight matrix. | 2        |
| {math}`\lambda_{max}`               | The average spectral norm. {math}`\frac{1}{K}\Sigma^K_{k=1}\lvert\lvert W_k\rvert\rvert _2`. | 2        |
| {math}`\frac{trace}{\lambda_{max}}` | Average trace over spectral norm. {math}`\frac{1}{K}\Sigma^K_{k=1}\frac{tr(W_k)}{\lvert\lvert W_k\rvert\rvert _2}`. | 2        |
| {math}`\mu(\lambda)`                | Average singular value over all matrices.                                                                                                                                                                                                                                                                                                                                                                                                                     | 2        |
| {math}`\sigma(\lambda)`             | Sample variance of singular values over all matrices.                                                                                                                                                                                                                                                                                                                                                                                                         | 2        |
| Gradient Symmetricity               | A metric specific to modular addition. Compute the cosine similarity between the gradient vectors of the output logits w.r.t. the input embeddings. Averaging over many pairs yields the gradient symmetricity. See [Zhong et al. (2023)](https://arxiv.org/abs/2306.17844). | 2        |
| Distant Irrelevance                 | A metric specific to modular addition. Measures how much the correct logits depend on the difference between the two inputs. | 2        |
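
As an illustration of the Category 1 rows, the following is a minimal NumPy sketch, not the package's implementation. The function name `weight_statistics` and its list-of-arrays input are hypothetical. It also assumes one reading of "averaged over matrices": each norm is computed per weight matrix and the results are then averaged.

```python
import numpy as np


def weight_statistics(weights):
    """Category 1 statistics for a list of weight arrays (hypothetical helper)."""
    K = len(weights)
    # Per-matrix norms over the flattened entries (assumed interpretation).
    l1 = [np.abs(w).sum() for w in weights]
    l2 = [np.sqrt((w ** 2).sum()) for w in weights]
    # Concatenate every entry so the weights are treated as one set.
    flat = np.concatenate([w.ravel() for w in weights])
    return {
        "l1": sum(l1) / K,
        "l2": sum(l2) / K,
        "l1_over_l2": sum(a / b for a, b in zip(l1, l2)) / K,
        "mean": flat.mean(),
        "median": np.median(flat),
        "variance": flat.var(),  # ddof=0, i.e. no Bessel's correction
    }


# Toy input: a one-hot matrix (ratio 1) and a denser matrix (ratio 7/5).
weights = [
    np.array([[0.0, 1.0], [0.0, 0.0]]),
    np.array([[3.0, 4.0], [0.0, 0.0]]),
]
stats = weight_statistics(weights)
```

For the toy input, the one-hot matrix has an L1/L2 ratio of exactly 1, while the second matrix has a ratio of 7/5, so the averaged ratio is 1.2.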

## Category Glossary

| Category | Description                                                                            |
| -------- | -------------------------------------------------------------------------------------- |
| 1        | The statistic intends to capture how the neural network weights are dispersed in space. |
| 2        | The statistic intends to capture properties of the function computed by a layer.       |
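
The matrix-based Category 2 metrics (trace, spectral norm, and singular-value statistics) can be sketched along the same lines. This is a hedged illustration rather than the package's code: `spectral_statistics` is a hypothetical name, the input is assumed to be a list of 2-D NumPy arrays, and the singular-value variance is assumed to omit Bessel's correction, as the weight variance does.

```python
import numpy as np


def spectral_statistics(weights):
    """Category 2 matrix statistics for a list of 2-D arrays (hypothetical helper)."""
    K = len(weights)
    traces = [np.trace(W) for W in weights]
    # Singular values of each matrix, returned in descending order.
    svals = [np.linalg.svd(W, compute_uv=False) for W in weights]
    # The spectral norm is the largest singular value.
    spec = [s[0] for s in svals]
    all_svals = np.concatenate(svals)
    return {
        "trace": sum(traces) / K,
        "spectral_norm": sum(spec) / K,
        "trace_over_spectral_norm": sum(t / s for t, s in zip(traces, spec)) / K,
        "singular_value_mean": all_svals.mean(),
        "singular_value_variance": all_svals.var(),  # ddof=0 (assumption)
    }


# Toy input: two diagonal matrices whose singular values are easy to read off.
weights = [np.diag([3.0, 1.0]), np.diag([2.0, 2.0])]
stats = spectral_statistics(weights)
```

With diagonal inputs the singular values are the absolute diagonal entries, which makes the averages straightforward to verify by hand.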