Metrics

List of metrics currently considered by the package for a hook point.

Name

Description

Category

L1

The L1-norm, averaged over matrices: \(\frac{1}{K}\lVert\omega\rVert_1 = \frac{1}{K}\sum_{i=1}^{n}\lvert\omega_i\rvert\), where \(n\) is the number of weights and \(K\) is the number of weight matrices in the neural network. We average over matrices so that models with different depths are comparable.

1

L2

The L2-norm, averaged over matrices: \(\frac{1}{K}\lVert\omega\rVert_2 = \frac{1}{K}\sqrt{\sum_{i=1}^{n}\omega_i^2}\).

1

\(\frac{L1}{L2}\)

Measures the sparsity of the weights (Repetti et al., 2014): \(\frac{1}{K}\sum_{i=1}^{K}\frac{L_1^{(i)}}{L_2^{(i)}}\), which is the ratio \(\frac{L1}{L2}\) averaged over the \(K\) weight matrices. Lower is more sparse. For example, a one-hot vector is fully sparse and attains the minimum value of 1. See Hurley & Rickard (2008) for a discussion of measures of sparsity.

1

\(\mu(w)\)

Sample mean of the weights: \(\frac{1}{N}\sum_{i=1}^{N}w_i\), where \(N\) is the number of weight parameters in the network.

1

\(median(w)\)

Median of the weights, with all weight matrices flattened and concatenated into one collection.

1

\(\sigma(w)\)

Sample variance of the weights without Bessel’s correction: \(\frac{1}{N}\sum_{i=1}^{N}(w_i-\overline{w})^2\).

1

\(\mu(b)\)

Sample mean of the biases. We treat the biases separately because they have a distinct interpretation from the weights.

1

\(median(b)\)

Median of the biases, with all bias vectors concatenated into one collection.

1

\(\sigma(b)\)

Sample variance of the biases without Bessel’s correction.

1

trace

The average trace over the \(K\) weight matrices: \(\frac{1}{K}\sum_{k=1}^{K}\operatorname{tr}(W_k)\), where \(W_k\) is the \(k\)-th weight matrix.

2

\(\lambda_{max}\)

The average spectral norm: \(\frac{1}{K}\sum_{k=1}^{K}\lVert W_k\rVert_2\), where \(\lVert W_k\rVert_2\) is the largest singular value of \(W_k\).

2

\(\frac{trace}{\lambda_{max}}\)

Average ratio of trace to spectral norm: \(\frac{1}{K}\sum_{k=1}^{K}\frac{\operatorname{tr}(W_k)}{\lVert W_k\rVert_2}\).

2

\(\mu(\lambda)\)

Average singular value over all matrices.

2

\(\sigma(\lambda)\)

Sample variance of singular values over all matrices.

2

Gradient Symmetricity

A metric specific to modular addition. Computes the cosine similarity between the gradient vectors of the output logits with respect to the input embeddings. Averaging over many pairs yields the gradient symmetricity. See Zhong & Liu (2008).

2

Distant Irrelevance

A metric specific to modular addition. Measures how strongly the correct logits depend on the difference between the two inputs.

2
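The category 1 statistics in the table above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: the function names are hypothetical, each weight matrix is represented as a list of rows, the L1 and L2 norms are taken per matrix (entrywise) before averaging over the \(K\) matrices, and every matrix is assumed to contain at least one nonzero entry so the ratio is defined.

```python
import math
import statistics


def l1_norm(matrix):
    """Entrywise L1 norm: sum of absolute values of all entries."""
    return sum(abs(x) for row in matrix for x in row)


def l2_norm(matrix):
    """Entrywise L2 norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def category_one_metrics(weight_matrices):
    """Hypothetical helper: weight statistics averaged over K matrices.

    weight_matrices is a list of matrices, each a list of rows of floats.
    Assumes every matrix has a nonzero entry (otherwise L1/L2 is undefined).
    """
    k = len(weight_matrices)
    l1s = [l1_norm(w) for w in weight_matrices]
    l2s = [l2_norm(w) for w in weight_matrices]
    # Flatten and concatenate all weights for the mean, median and variance.
    flat = [x for w in weight_matrices for row in w for x in row]
    return {
        "l1": sum(l1s) / k,
        "l2": sum(l2s) / k,
        "l1_over_l2": sum(a / b for a, b in zip(l1s, l2s)) / k,
        "mean": statistics.fmean(flat),
        "median": statistics.median(flat),
        # pvariance divides by N, i.e. no Bessel's correction.
        "variance": statistics.pvariance(flat),
    }


if __name__ == "__main__":
    example = [[[1.0, 0.0], [0.0, 0.0]], [[3.0, 4.0]]]
    for name, value in category_one_metrics(example).items():
        print(name, value)
```

The same function can be applied to the biases by passing each bias vector as a one-row matrix.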

Category Glossary:

Category

Description

1

The statistic is intended to capture how the neural network weights are dispersed in space.

2

The statistic is intended to capture properties of the function computed by a layer.
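As an illustration of the category 2 metrics, the trace, the spectral norm and their ratio can be sketched with the standard library alone. The names below are hypothetical and this is not the package's code. Two assumptions are made: the spectral norm is estimated by power iteration on \(W^\top W\) rather than by a full singular value decomposition, and for a non-square matrix the trace is taken over the leading diagonal. The mean and variance of the singular values need the full spectrum and are left out of this sketch.

```python
import math
import random


def trace(matrix):
    """Sum of the leading-diagonal entries (assumption for non-square input)."""
    size = min(len(matrix), len(matrix[0]))
    return sum(matrix[i][i] for i in range(size))


def _matvec(matrix, v):
    """Compute W v."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def _rmatvec(matrix, u):
    """Compute W^T u."""
    cols = len(matrix[0])
    return [sum(matrix[r][c] * u[r] for r in range(len(matrix))) for c in range(cols)]


def spectral_norm(matrix, iterations=200, seed=0):
    """Estimate the largest singular value by power iteration on W^T W."""
    rng = random.Random(seed)
    # A random start avoids being orthogonal to the top singular vector
    # except in measure-zero cases.
    v = [rng.gauss(0.0, 1.0) for _ in range(len(matrix[0]))]
    for _ in range(iterations):
        z = _rmatvec(matrix, _matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in z))
        if norm == 0.0:
            # W^T W v vanished; treated here as a zero matrix.
            return 0.0
        v = [x / norm for x in z]
    # v is a unit vector, so |W v| approximates the largest singular value.
    return math.sqrt(sum(x * x for x in _matvec(matrix, v)))


def category_two_metrics(weight_matrices):
    """Hypothetical helper: averages over the K weight matrices.

    Assumes no matrix is identically zero, so the ratio is defined.
    """
    k = len(weight_matrices)
    traces = [trace(w) for w in weight_matrices]
    norms = [spectral_norm(w) for w in weight_matrices]
    return {
        "trace": sum(traces) / k,
        "lambda_max": sum(norms) / k,
        "trace_over_lambda_max": sum(t / s for t, s in zip(traces, norms)) / k,
    }
```

In practice a linear algebra library would compute all singular values at once, which also yields their mean and variance; power iteration is used here only to keep the sketch dependency-free.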