Metrics
List of metrics currently considered by the package for a hook point.
| Name | Description | Category |
|---|---|---|
| L1 | The L1-norm, averaged over matrices. \(\frac{1}{K}\lVert\omega\rVert_1 = \frac{1}{K}\sum_{i=1}^{n}\lvert\omega_i\rvert\), where \(n\) is the number of entries of \(\omega\) and \(K\) is the number of weight matrices in the neural network. We average over matrices so that models with different depths are comparable. | 1 |
| L2 | The L2-norm, averaged over matrices. \(\frac{1}{K}\lVert\omega\rVert_2 = \frac{1}{K}\sqrt{\sum_{i=1}^{n}\omega_i^2}\). | 1 |
| \(\frac{L1}{L2}\) | Measures the sparsity of the weights (Repetti et al., 2014). \(\frac{1}{K}\sum_{i=1}^{K}\frac{L_1^{(i)}}{L_2^{(i)}}\), which is the metric \(\frac{L1}{L2}\) averaged over the \(K\) weight matrices. Lower is more sparse. For example, a one-hot vector is fully sparse and has an \(\frac{L1}{L2}\) value of 1, the smallest value a nonzero vector can take. See Hurley & Rickard (2008) for a discussion of measures of sparsity. | 1 |
| \(\mu(w)\) | Sample mean of the weights. \(\frac{1}{N}\sum_{i=1}^{N}w_i\), where \(N\) is the number of parameters in the network. | 1 |
| \(median(w)\) | Median of the weights, with all weights concatenated into a single set. | 1 |
| \(\sigma(w)\) | Sample variance of the weights without Bessel's correction. \(\frac{1}{N}\sum_{i=1}^{N}(w_i-\overline{w})^2\). | 1 |
| \(\mu(b)\) | Sample mean of the biases. We treat the biases separately because they have a distinct interpretation from the weights. | 1 |
| \(median(b)\) | Median of the biases, with all biases concatenated into a single set. | 1 |
| \(\sigma(b)\) | Sample variance of the biases without Bessel's correction. | 1 |
| trace | The average trace over the \(K\) weight matrices. \(\frac{1}{K}\sum_{k=1}^{K}\mathrm{tr}(W_k)\), where \(W_k\) is the \(k\)-th weight matrix. | 2 |
| \(\lambda_{max}\) | The average spectral norm. \(\frac{1}{K}\sum_{k=1}^{K}\lVert W_k\rVert_2\). | 2 |
| \(\frac{trace}{\lambda_{max}}\) | Average ratio of trace to spectral norm. \(\frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{tr}(W_k)}{\lVert W_k\rVert_2}\). | 2 |
| \(\mu(\lambda)\) | Average singular value over all matrices. | 2 |
| \(\sigma(\lambda)\) | Sample variance of the singular values over all matrices. | 2 |
| Gradient Symmetricity | A metric specific to modular addition. Computes the cosine similarity between the gradient vectors of the output logits w.r.t. the input embeddings. Averaging over many pairs yields the gradient symmetricity. See Zhong & Liu (2008). | 2 |
| Distant Irrelevance | A metric specific to modular addition. Measures the dependence of the correct logits on the difference between the two inputs. | 2 |
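
The category 1 definitions in the table above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the package's implementation: the function names `averaged_norms`, `l1_over_l2`, and `weight_stats` are hypothetical, and each function takes a plain list of weight arrays rather than a model or hook point.

```python
import numpy as np


def _flatten(matrices):
    """Concatenate every entry of every matrix into one 1-D float array."""
    return np.concatenate([np.asarray(w, dtype=float).ravel() for w in matrices])


def averaged_norms(matrices):
    """L1 and L2 norms of all weights, each divided by the number of matrices K."""
    flat = _flatten(matrices)
    k = len(matrices)
    l1 = np.abs(flat).sum() / k
    l2 = np.sqrt((flat ** 2).sum()) / k
    return float(l1), float(l2)


def l1_over_l2(matrices):
    """Per-matrix L1/L2 ratio, averaged over the K matrices (lower is sparser).

    Assumes no matrix is all zeros, since the L2 norm would then be 0.
    """
    ratios = []
    for w in matrices:
        flat = np.asarray(w, dtype=float).ravel()
        ratios.append(np.abs(flat).sum() / np.sqrt((flat ** 2).sum()))
    return float(np.mean(ratios))


def weight_stats(matrices):
    """Mean, median, and variance (no Bessel's correction) of all weights."""
    flat = _flatten(matrices)
    # np.var uses ddof=0 by default, i.e. it divides by N rather than N - 1.
    return float(flat.mean()), float(np.median(flat)), float(flat.var())


one_hot = np.array([[0.0, 0.0, 1.0, 0.0]])
dense = np.ones((1, 4))
print(l1_over_l2([one_hot]))  # 1.0: a one-hot vector is fully sparse
print(l1_over_l2([dense]))    # 2.0: equal entries are the least sparse case
```

For a vector with \(n\) entries the ratio ranges from 1 (one-hot) to \(\sqrt{n}\) (all entries equal in magnitude), which is why the dense example with four entries gives 2.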
Category Glossary:
| Category | Description |
|---|---|
| 1 | The statistic intends to capture how the neural network weights are dispersed in space. |
| 2 | The statistic intends to capture properties of the function computed by a layer. |
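
The spectral category 2 metrics (trace, \(\lambda_{max}\), their ratio, and the singular value statistics) reduce to a few linear-algebra calls. The sketch below is illustrative only: `spectral_metrics` and its dictionary keys are hypothetical names, and it assumes that "over all matrices" means pooling the singular values of every matrix before taking the mean and variance.

```python
import numpy as np


def spectral_metrics(matrices):
    """Trace and singular-value statistics averaged over a list of weight matrices."""
    traces, spectral_norms, pooled = [], [], []
    for w in matrices:
        w = np.asarray(w, dtype=float)
        # Singular values are returned in descending order, so the first
        # one is the spectral norm ||W||_2.
        sv = np.linalg.svd(w, compute_uv=False)
        traces.append(np.trace(w))
        spectral_norms.append(sv[0])
        pooled.append(sv)

    traces = np.array(traces)
    spectral_norms = np.array(spectral_norms)
    pooled = np.concatenate(pooled)  # assumption: pool singular values across matrices

    return {
        "trace": float(traces.mean()),
        "lambda_max": float(spectral_norms.mean()),
        "trace_over_lambda_max": float((traces / spectral_norms).mean()),
        "mean_singular_value": float(pooled.mean()),
        # Variance without Bessel's correction (ddof=0 is the NumPy default).
        "var_singular_value": float(pooled.var()),
    }


metrics = spectral_metrics([np.diag([3.0, 1.0]), np.diag([2.0, 2.0])])
print(metrics)
```

For the two diagonal matrices in the usage example, both traces are 4 and the spectral norms are 3 and 2, so the averaged trace is 4 and the averaged spectral norm is 2.5.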