This webpage showcases the master project developed by João Carvalho at the Chair of Algorithms and Data Structures of the University of Freiburg, as part of the MSc degree in Computer Science.

In this project we explored the capabilities of neural language models. More precisely, we questioned if a neural network would be able to encode Part-of-Speech (POS) tags in its neurons, just by training a simple language model.

We first trained a byte-level language model with a Long Short-Term Memory (LSTM) network using a large collection of text. Then, taking a sentence, which can be viewed as a byte sequence, we used its inner representations (the cell states of the LSTM), along with its corresponding POS tags, as the inputs and targets to train a logistic regression classifier. Looking at the classifier weights, we observed that some concepts (POS tags) are encoded in one neuron, i.e., the POS tag of a byte can be derived from one neuron's activation value, while others are derived with more than one neuron together with the logistic regression classifier. For some tags, using three neurons yielded satisfactory results.

The idea for this project started from the openAI paper (Radford et al, 2017). In this article, the authors found a dimension in the cell states (a neuron) that strongly correlates to the semantic concept of sentiment, which they called the Sentiment Neuron. In this project we also replicated their results.

*Natural Language* is the system humans use to communicate with each other. *Natural Language Processing* (NLP) is a field of study concerned with the automatic processing of natural language (Manning et al, 1999). Its main focus is to enable machines to independently process the base concepts in linguistics: morphology (word formation); syntax (sentence structure / grammar); and semantics (word / sentence meaning). This means that machines should not only recognize the structure of language, but also *understand it*.

When analysing / processing language, there are two fundamentally different approaches (Manning et al, 1999). On the one hand, *rationalists* (usually connected to Noam Chomksy's ideas) assume language is an innate characteristic of humans, not derived by the senses, but rather somehow encoded in the genome. On the other hand, *empiricists* assume the brain has some initial cognitive ability to learn and organize language, but ultimately the structure is not innate, but rather something that needs to be learned. The difference between these views can sometimes be subtle, but a clear distinction can be found, because the two parties describe different things. *Rationalists* seek to describe how the human mind modules language, for instance in the form of grammars, by viewing written or spoken language as an indirect method. *Empiricists* describe language as it occurs.

In this work we will analyse language from the *empiricists'* perspective. Due to large amount of written and spoken language in a digital form, which can be easily used by computers, we can employ statistical methods to analyse it. This approach is commonly referred as Statistical NLP, since it performs statistical inference and prediction in natural language. In this setting, we model language using probability distributions instead of formal grammars (Manning et al, 1999).

In the context of NLP, a *language model* is a function that assigns a probability measure to a sequence of tokens drawn from a vocabulary (Bengio et al, 2016). These tokens can be of multiple forms: in a word-language model, they are words; in a character-language model, they are characters; and in a byte-language model, they are bytes. In this work we deal with models at the byte-level. However, to keep the explanation simple and more intuitive, we will describe language models at the word-level.

Consider a string \(s\) composed by \(n\) words \(s = w_1 w_2 \ldots w_n \), where \(w_i\) is a word from an alphabet \(\Sigma\). We can define the probability of string \(s\) using the chain rule of probability:

\begin{align} P(s) & = P( w_1 w_2 \ldots w_n) = P(\bigcap_{i=1}^{n} w_i) = \prod_{k=1}^{n} P(w_k | \bigcap_{j=1}^{k-1} w_j) \\ & = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_1 w_2) \cdot \ldots \cdot P(w_n | w_1 \ldots w_{n-1}) \end{align}

If the ordering of words is not random, then we can assume that there exists an underlying probability distribution that models the language, and that can be computed with the expression above. As result, it is possible to compare how likely two sentences are. For instance, in the English language we should be able to observe that:

$$ P(\text{I am going to the gym today}) > P(\text{To the gym today going I am}) $$By modelling this probability distribution, we can use it to infer the most probable word after a sequence of words. For instance, in the sentence \(s = \text{I am going to the gym to [x]} \), \(\text{[x]}\) is to be replaced by the word (from the seen vocabulary) that maximizes \(P(s)\) (for example, exercise).

The goal of Statistical NLP is to model language by finding the underlying conditional probability distribution inherent to it, such that the inference of a new word giving a previous set of words is maximized. For that we present two different approaches: n-grams and neural language models.

*N-gram* models are based on the Markov assumption that given a sequence of words \(s = w_1 w_2 \ldots w_{\tau} \), the probability of the word \(w_{\tau}\) appearing after \(w_1 w_2 \ldots w_{\tau-1} \) is only dependent on the previous \(n-1\) words: \( P(w_{\tau} | w_1 \ldots w_{\tau - 1}) \approx P(w_{\tau} | w_{\tau-(n-1)} \ldots w_{\tau - 1}) \) (Bengio et al, 2016). Thus, the probability of the sequence \(s\) is given by:

To train n-gram models we use the principle of Maximum Likelihood Estimation (MLE) to derive the underlying conditional probabilities that maximize the likelihood function. For n-gram models this is straight forward, since the conditional probabilities can be estimated by counting the number of times each n-gram occurs in the training set, and store the computed probabilities to use at prediction time (Jurafsky et al, 2017). For instance, with bi-gram models (\(n=2\)):

$$ P(w_n | w_{n-1}) = C(w_{n-1}w_n) / \sum_{w} C(w_{n-1}w) $$Where \(C(xy)\) is the number of times the bi-gram \(xy\) is found in the training set.

Naturally, since not all n-grams will be in the training set, this raises a problem when making predictions, and thus techniques like smoothing are employed (Jurafsky et al, 2017). To infer what is the most probable word after a previous sequence of words, we just need to look up the stored probabilities computed during training and pick the most probable word, given the previous \(n-1\) words.

N-grams have been used extensively to model language not only because of their simple training and sampling methods, but also because they are easy to understand.

N-gram models suffer from the *curse of dimensionality*, a problem encountered when solving machine learning problems with a large number of features (Domingos, 2015). Particularly, for a vocabulary \(V\) of size \(|V|\), there are \(|V|^n\) possible n-grams, i.e., the number of possible n-grams grows exponentially for a fixed-size vocabulary. Since not all n-grams occur in the training set, at prediction time the model is just doing a nearest-neighbor look up on the input space (Bengio et al, 2016).

To overcome this problem, a solution proposed by (Bengio et al, 2013) does not represent words as traditional n-gram models in a one-hot representation, but uses a *distributed representation* (in the literature, these are also called *word embeddings*). In this approach, a word is represented as a vector of real numbers of dimension \(D\). In a one-hot vector representation, words like "father", "mother" and "sun" would be at an euclidean distance of \(\sqrt{2} \) of each other, whilst in a distributed representation we expect the distance between "father" and "mother" to be closer than "father" and "sun". This desire has some roots in neuroscience, but more importantly, because empirically, learning algorithms achieve better performances in language tasks if similar words are closer in the feature space (Mikolov et al, 2013).

In neural language models there are two (not separate) tasks: learning the mapping from the input-space (word) to the feature-space (word-embedding); and learning the underlying joint probability distribution of a sequence of words, in terms of the word-embeddings. The model can be seen as a composition of functions \(h(w) = g(f(w))\), in which \(f\) maps a word to its embedding, and \(g\) transforms an embedding into a probability distribution.

$$ f : \mathbb{N}^{|V|} \rightarrow \mathbb{R}^D \text{, } g : \mathbb{R}^D \rightarrow \mathbb{R}^{|V|} \text{, } h : \mathbb{N}^{|V|} \rightarrow \mathbb{R}^{|V|} $$\(\mathbb{N}^{|V|}\) indicates that in the input-space a word is represented in a one-hot vector of dimension \(|V|\), and \(\mathbb{R}^D\) specifies a real-valued embedding of size \(D\).

More concretely, taking a sequence of tokens \(X = {x_1, \ldots, x_n}\), a neural language model finds the set of parameters \(\theta\) of a function \(h\) that maximizes the likelihood of a token \(x_i\), given the previous \(k\) tokens and the model parameters:

\begin{align} & \theta = \underset{\theta}{\arg\max} \text{ } L(X) \\ & L(X) = \prod_i P(x_i | x_{i-k}, \ldots, x_{i-1}; \theta) \end{align}In a neural language model, \(h\) is represented as a neural network (NN) and \(\theta\) are its internal weights and biases.

In this project we used byte-level language models instead of word-level models (all the details above are nonetheless aplicable). There are some reasons to prefer these models over character or word-level ones (Baisa, 2016):

- In English, character-level models are already byte-level models, since each English character is encoded in one byte (ASCII encoding);
- The input-space of a byte is \(2^8 = 256\), and thus smaller than a character space, since its vocabulary does not change with the number of unique characters;
- Since any character of any language can be encoded in its byte representation (for instance, via UTF-8 encoding), the model can be applied to any language;
- For some languages, the input-space (equal to the number of unique characters) is very large (Chinese has around 100,000 different characters).

We describe here the basic neural language models used to model language - Recurrent Neural Networks. For this part and the remaining work, we represent scalars with normal font \( x\), vectors in bold \(\boldsymbol{x}\), and matrices in capitalized bold \( \boldsymbol{X} \).

To learn the parameters of neural language models we used Recurrent Neural Networks (RNNs), which are a type of neural networks designed to process sequential data, where the input's size does not need to be fixed (Bengio et al, 2016). Since language shows an inherent sequential structure, and by empiricists' believe, it has an underlying joint probability distribution, RNNs are well suited for the language modelling task.

A RNN is a feed forward neural network extended over time, governed essentially by the equation that relates the hidden state \(\boldsymbol{h}\) of the network at time \(t\), with the previous hidden state at time \(t-1\), the current input \(\boldsymbol{x}\) at time \(t\) and the set of network parameters \(\theta\):

$$ \boldsymbol{h}^{(t)} = f(\boldsymbol{h}^{(t-1)}, \boldsymbol{x}^{(t)}; \theta) $$From this relation we notice that a RNN is by construction a recursive function, where the hidden state changes with time, but the parameters are fixed and shared. Morevover, since \(\boldsymbol{h}^{(t)}\) depends on \(\boldsymbol{h}^{(t-1)}\), the hidden state is sometimes regarded as the model's memory, because (in theory) it stores the information from all the previously processed inputs.

In the byte-level language modelling problem we are interested in describing how a byte is deduced from previous ones. For this task, we used an RNN architecture that produces an output at each time step and has recurrent connections between hidden units. The following picture and equations depict this idea:

\begin{align} \boldsymbol{h}^{(t)} & = \tanh(\boldsymbol{Ux}^{(t)} + \boldsymbol{Wh}^{(t-1)}) \\ \boldsymbol{o}^{(t)} & = \boldsymbol{Vh}^{(t)} \\ \boldsymbol{\hat{y}}^{(t)} & = \text{softmax}(\boldsymbol{o}^{(t)}) \end{align}Here the set of parameters \(\theta\) to be learned are the matrices \(\boldsymbol{U}\), \(\boldsymbol{W}\) and \(\boldsymbol{V}\) (we do not account biases separetely), which represent the input-to-hidden, hidden-to-hidden and hidden-to-output linear connections. As with feed forward neural networks, the hidden state is the result of linear transformations (given by matrices \(\boldsymbol{U}\) and \(\boldsymbol{W}\)) followed by a non-linear activation function (hyperbolic tangent). To clarify the dimensions of these matrices, consider that \(V\) is the input / ouput vocabulary size and \( D \) is the size of the hidden-state. Then, the input-space is in \(\mathbb{N}^{V}\), the feature-space in \(\mathbb{R}^{D}\) and the output-space in \(\mathbb{R}^{V}\). Also, for simplicity, we assume that only one input is fed at a time (in constrast to a batch of inputs). We obtain \(\boldsymbol{x} \in \mathbb{N}^{V}\), \(\boldsymbol{h} \in \mathbb{R}^{D}\), \(\boldsymbol{o} \in \mathbb{R}^{D}\), \(\boldsymbol{y} \in \mathbb{R}^{D}\), \(\boldsymbol{U} \in \mathbb{R}^{D \times V}\), \(\boldsymbol{W} \in \mathbb{R}^{D \times D}\) and \(\boldsymbol{V} \in \mathbb{R}^{V \times D}\).

Considering \(\boldsymbol{X} = \boldsymbol{x}^{(1)} \ldots \boldsymbol{x}^{(n)} \) as the network input and \(\boldsymbol{Y} = \boldsymbol{y}^{(1)} \ldots \boldsymbol{y}^{(n)} \) the network output, the loss we want to minimize is the sum of the losses at each time \( t \), \(L^{(t)}\), which is the cross-entropy between \(\boldsymbol{y}\) and \(\hat{\boldsymbol{y}}\):

\begin{align} L (\boldsymbol{y}^{(1)} \ldots \boldsymbol{y}^{(n)}; \boldsymbol{x}^{(1)} \ldots \boldsymbol{x}^{(n)}) & = \sum_{t=1}^n L^{(t)} \\ & = -\sum_{t=1}^n \log p (\boldsymbol{y}^{(t)} | \boldsymbol{x}^{(1)} \ldots \boldsymbol{x}^{(t)}) \\ & = -\sum_{t=1}^n \log (\hat{\boldsymbol{y}}^{(t)}[\boldsymbol{y}^{(t)} == 1]) \end{align}In the equation above \(\hat{\boldsymbol{y}}^{(t)}[\boldsymbol{y}^{(t)} == 1]\) denotes the entry of \(\hat{\boldsymbol{y}}^{(t)}\) where \(\boldsymbol{y}^{(t)}\) is equal to 1. Minimizing the cross-entropy loss is the same as maximizing the log-likelihood (the logarithm is used for numerical stability). Intuitively, the cross-entropy between random variables \(p\) and \(q\), denoted by \(H(p, q)\), measures the average number of bits needed to represent a symbol from \(p\), given that its coding scheme was derived with the probability distribution governing the random variable \(q\). If \(q\) follows the exact same distribution as \(p\), then the cross entropy is the entropy itself. Moreover, this also yields the minimization of the Kullback–Leibler divergence - a measure of how distant two probability distribution are. In the end, we want the estimated probability distribution \(\hat{\boldsymbol{y}}^{(t)}\) to be as close as possible to the true one \(\boldsymbol{y}^{(t)}\).

Similar to feed forward neural networks, RNNs parameters are updated by a gradient descent method. For example, parameter \(\boldsymbol{W}\) is updated as:

$$ \boldsymbol{W} \leftarrow \boldsymbol{W} - \eta \nabla_{\boldsymbol{W}} L $$Where \(\eta\) is the learning rate and \(\nabla_{\boldsymbol{W}} L\) the derivative of the loss with respect to the weigths \(\boldsymbol{W}\). In our experiments we used a variation of the traditional gradient descent, the ADAM method (Kingma et al, 2014). Also, from the multiple ways to update the learning rate during training, we chose to linearly decay the initial learning rate to zero throughout the iterations. In feed forward NNs, the gradients \( \nabla_{\boldsymbol{W}}\) are computed using a dynamic programming approach called back-propagation (Bengio et al, 2016). The difference of computing gradients of RNNs and feed forward NNs is that the back-propagation is done through time - Back-Propagation Through Time (BPTT). This means that a parameter is updated by adding all the contributions to the loss over time. For instance, the derivative of the loss with respect to the output matrix \(\boldsymbol{V}\) is given by (Bengio et al, 2016):

$$ \nabla_{\boldsymbol{V}}L = \sum_{t}(\nabla_{\boldsymbol{o}^{(t)}}L)\boldsymbol{h}^{(t)} $$Two problems related to the gradient computation arise when training RNNs (Pascanu et al, 2013): *vanishing* and *exploding* gradients. They appear because when computing the gradient of the loss at time \(t\) with respect to the parameters, a sum over \(k\) of the terms \( \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \prod_{t \geq i > k} \frac{\partial \boldsymbol{h}^{(i)}}{\partial \boldsymbol{h}^{(i-1)}} \) arises (see (Pascanu et al, 2013) for a detailed derivation). Since \( \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(t-1)}} = \boldsymbol{W} \tanh'(\boldsymbol{W} \boldsymbol{h}^{(t-1)}) \), the product from the previous equation results in a term equal to \( \boldsymbol{W}^{t-k} \). The *vanishing gradient* problem comes from the fact that, assuming \(\boldsymbol{W}\) to have an eigendecomposition, the *eigenvalues* of \(\boldsymbol{W}\) that are less than 1 decrease exponentially to zero with \( t \), while the *exploding gradient* problem exists if the *eingenvalues* are greater than 1. Both situations are not desirable when training the network. The first one leads to a slower convergence, while the latter can move the gradient descent trajectory out of the descent path.

A practical solution for the *exploding gradients* proposed in (Pascanu et al, 2013) is to clip the gradients if they are higher than a *threshold* (empirical value), by scaling the gradient, but keeping its direction. A solution to the *vanishing gradients* is to not propagate the loss through all time steps, but to perform a Truncated Back-Propagation Through Time (TBPTT). In this setting, instead of training on a whole sequence \(\boldsymbol{X} = {\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n}\) from \(1\) to \(n\), we choose a value \( 1 \leq \tau \leq n \), for which when computing the loss we only consider the sequence from \(\tau\) to \(n\). Naturally, TBPTT also accelerates the training procedure. A second approach is to use another type of RNN architecture, the Long Short-Term Memory (LSTM) architecture, which we mention in a section below.

Using Truncated Back-Propagation Through Time there are two possibilities to deal with hidden state of the RNN when moving between training batches. In *stateful* RNNs we have reason to believe that the dependencies in the data span over the truncated sequence, so we keep the hidden state across batches in order to simulate a full back-propagation. While in *stateless* RNNs, the hidden state is reseted after processing a training sequence, if the there are no long term dependencies in the data.

In our experiments, we used the TBPTT to solve the vanishing gradient and memory related problems, and we used stateful RNNs, because in a language modelling task, we wish to keep the long range information from previously processed text.

In the context of language modelling, *sampling* means generating the next word (or byte) based on the history processed so far. The inputs are a text to start sampling from, denoted by *prime* \(\boldsymbol{P} = \boldsymbol{p}_1 \ldots \boldsymbol{p}_m \), where \( \boldsymbol{p}_i \) is a one-hot encoded vector of the vocabulary's size, and the number of words to sample \(n > m\). With the pre-trained parameters, we begin with an initial hidden state \( \boldsymbol{h}^{(0)} = \boldsymbol{0} \) and input \( \boldsymbol{x}^{(1)} = \boldsymbol{p}_1 \), and compute the hidden states until \(m\) by using the previous hidden state and by taking each input from \( \boldsymbol{P} \) sequentially. To process the words after the \(m^{\text{th}}\) word, we predict the words \( m+1 \) to \( n \) as an index \(k\) of the prediction vector \( \hat{\boldsymbol{y}}^{(j)} \), which we use as input to the prediction of the next word, i.e., \( \forall j > m \text{, } \boldsymbol{x}^{(j+1)} = \boldsymbol{\hat{y}}^{(j)} \).

For every time step \(t\) we pick a value from \(\hat{\boldsymbol{y}}^{(t)}\), the vector resulting from the softmax operation of the output vector \(\boldsymbol{o}^{(t)}\). The softmax function is used as an extension of the sigmoid function to multiple variables, transforming its input into a probability distribution over the possible discrete outputs:

$$ \hat{y}_i^{(t)} = \text{softmax}(o_i^{(t)}) = \frac{\exp(o_i^{(t)})}{\sum_{i=1}^V \exp(o_i^{(t)})} $$Thus, \(\hat{\boldsymbol{y}}^{(t)}\) represents the probability distribution over the words indexed with \(0 \leq i \leq V-1\). Our language model will then pick a word from this probability distribution. For that we found two different approaches:

- Take the word with
**maximum probability**. Simply take the index of \(\hat{\boldsymbol{y}}^{(t)}\) as \( i = \underset{i}{\arg\max} \text{ } \hat{\boldsymbol{y}}^{(t)} \). With this approach we observed that the sampling process enters a looping state and outputs repetitive text. - Take the word as a sample from the
**multinomial distribution**represented by \( \hat{\boldsymbol{y}}^{(t)} \). In this case the model picks the index \( i \) with probability \( \hat{\boldsymbol{y}}_i^{(t)} \). We observed that, contrary to taking the maximum probability, this approach prevents the sample to loop. Additionally, for sampling, one can introduce another parameter called*temperature*(Hinton et al, 2015).*Temperature*(\( T \)) is a parameter used to control the randomness of the predictions. With this parameter each entry of the vector \( \hat{\boldsymbol{y}}^{(t)} \) is modified to: $$ \hat{y}_i^{(t)} = \text{softmax}(o_i^{(t/T)}) = \frac{\exp(o_i^{(t/T)})}{\sum_{i=1}^V \exp(o_i^{(t/T)})} $$If \( T \) grows to infinity, \( \boldsymbol{\hat{y}} \) results in an uniform distribution and all predictions are equally likely. If \( T\) is equal to 1, we obtain the traditional model. If \( T\) tends to 0, we obtain the sampling from the

**maximum probability**variant. Thus, with lower temperatures the model is more conservative, and is not likely to generate random sentences. On the other hand, higher temperatures result in more diversity in the output, at the cost of more syntax mistakes.

As previously explained, simple RNNs suffer from the *vanishing* and *exploding* gradient problems. LSTM networks (Schmidhuber et al, 1997) share the same principles of RNNs, but their architecture prevents these two identified problems. So, they are capable of learning long-term dependencies more effectively than traditional RNNs. For this reason, we used in our language model this type of architecture.

While RNNs' hidden states are computed using a linear transformation followed by a non-linearity, LSTMs replace this computation with LSTM cells, and additionally to the hidden state they introduce a *cell state*. Intuitively, the hidden state can be perceived as the working (short term) memory and the cell state as the long term memory. The inputs to LSTMs cell are: the previous hidden state \(\boldsymbol{h}^{(t-1)}\); the previous cell state \(\boldsymbol{c}^{(t-1)}\); and the current input \(\boldsymbol{x}^{(t)}\). The outputs are: the current hidden state \(\boldsymbol{h}^{(t)}\); and the current cell state \(\boldsymbol{c}^{(t)}\). The output prediction is computed as with RNNs by performing a linear operation with the hidden states followed by a softmax function. The following equations define LSTMs in general:

Where \( [\boldsymbol{h}^{(t-1)}, \boldsymbol{x}^{(t)}] \in \mathbb{R}^{D+V} \) represents the concatenation of hidden states and input. \( \sigma (.) \) is the sigmoid function. \( \boldsymbol{W}_f, \boldsymbol{W}_i, \boldsymbol{W}_c, \boldsymbol{W}_o \in \mathbb{R}^{D \times (D+V)} \) are the parameters to be learned. \( \boldsymbol{f}^{(t)}, \boldsymbol{i}^{(t)}, \boldsymbol{c}^{(t)}, \boldsymbol{o}^{(t)}, \boldsymbol{h}^{(t)} \in \mathbb{R}^{D} \).

The training and sampling processes of LSTMs are very similar to RNNs, and so we won't detail them here. For further information we redirect the reader to (Bengio et al, 2016).

Logistic regression is a model used for supervised classification tasks, based on a linear model followed by a non-linearity. In a binary classification setting with classes 0 and 1, the output of the logistic regression can be interpreted as a probability of belonging to one of the classes. Specifically, if \( \boldsymbol{x} \in \mathbb{R}^{1 \times D} \) is our input vector, the estimated output \( \hat{y} \in \mathbb{R} \) is given by:

$$ \hat{y} = \sigma(\boldsymbol{x}\boldsymbol{W} + b) $$Where \( \boldsymbol{W} \in \mathbb{R}^{D \times 1} \) and \( b \in \mathbb{R} \) are model parameters, and \( \sigma(a) = 1 / (1 + e^{-a}) \) is the sigmoid function, which forces the output to be in the interval \( [0, 1] \).

For \( N \) data points, the loss function to minimize is given by the negative log-likelihood, or as explained before, the cross-entropy:

$$ L(\boldsymbol{y}, \boldsymbol{\hat{y}}) = \sum_{i=1}^{N} -y_i\log(\hat{y}_i) - (1-y_i) \log(1 - \hat{y}_i) $$To prevent overfitting it is common practice to use a regularizer to penalize the loss function. We used a \( L_1 \) regularizer to enforce sparse models, which results in the loss:

$$ L_R = L(\boldsymbol{y}, \boldsymbol{\hat{y}}) + \lambda \sum_{i=1}^D |W_i| $$Where \( \lambda \) is the regularization coefficient. With this loss function there is no analytical solution for deriving the parameters. However, they can be efficiently computed using a gradient descent approach.

In the multiclass setting there are labels with \(C\) distinct classes. One way to train such a model is to separately train \( C \) classifiers in a One-vs-All approach, which can be trained in parallel. To infer the class \(j \in \{1, \ldots, C\}\) of a new unseen datapoint \( \boldsymbol{x} \) we simply compute:

$$ j = \underset{j = 1, \ldots, C}{\arg\max} \text{ } \boldsymbol{x} \boldsymbol{W}_n^{(j)} + b_n^{(j)} $$Where \( \boldsymbol{W}_n^{(j)} = \boldsymbol{W}^{(j)} / \left\lVert \boldsymbol{W}^{(j)} \right\rVert _2 \) and \( b_n^{(j)} = b^{(j)} / \left\lVert \boldsymbol{W}^{(j)} \right\rVert _2 \) are the normalized pre-trained parameters.

To evaluate the results of a classifier we used three common metrics: accuracy; precision; and recall. In the next table we show the confusion matrix, and how each metric is computed.

True class | |||
---|---|---|---|

Correct | Not Correct | ||

Predicted class | |||

Selected | True Positives (TP) | False Positives (FP) | |

Not Selected | False Negatives (FN) | True Negatives (TN) |

\( \text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + FP + TN + FN}} \), is the fraction of predictions the model got right.

\( \text{Precision} = \frac{\text{TP}}{\text{TP + FP}} \), is the fraction of selected items that are correct.

\( \text{Recall} = \frac{\text{TP}}{\text{TP + FN}} \), is the fraction of correct items that are selected.

The language models based on LSTMs were implemented in Tensorflow using simple matrix multiplications (we only left the gradient computation to the Tensorflow optimizer routine). For classification tasks, we used the logistic regression classifier module from the PYTHON package sickit-learn.

In this section we train and sample a byte-level language model using a LSTM network.

The data was obtained from the Wikitext-103 dataset (Merity et al, 2016). It contains a collection of *good* and *featured* Wikipedia articles, well suited for language modeling.

The training data contains 28,475 articles, with 103,227,021 tokens (a token is a word or punctuation); the validation data contains 60 articles, with 217,646 tokens; and the testing data contains 60 articles, with 245,569 tokens. The approximate size of the whole dataset is close to 518 MB.

The following is a sample from this dataset:

= Gold dollar =

The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from 1849 to 1889 . The coin had three types over its lifetime , all designed by Mint Chief Engraver James B. Longacre . The Type 1 issue had the smallest diameter of any United States coin ever minted .

A gold dollar had been proposed several times in the 1830s and 1840s , but was not initially adopted . Congress was finally galvanized into action by the increased supply of bullion caused by the California gold rush , and in 1849 authorized a gold dollar . In its early years , silver coins were being hoarded or exported , and the gold dollar found a ready place in commerce . Silver again circulated after Congress in 1853 required that new coins of that metal be made lighter , and the gold dollar became a rarity in commerce even before federal coins vanished from circulation because of the economic disruption caused by the American Civil War .

Gold did not again circulate in most of the nation until 1879 ; once it did , the gold dollar did not regain its place . In its final years , it was struck in small numbers , causing speculation by hoarders . It was also in demand to be mounted in jewelry . The regular issue gold dollar was last struck in 1889 ; the following year , Congress ended the series .

To facilitate training, the dataset was split into 98 parts (shards), with 96 used for training, 1 for validation and 1 for testing. We removed the lines containing headers and sub-headers (the ones containing = X =, = = XX ==, ...). Other than that, as previously explained, byte-level language models do not require any special pre-processing task.

- Hidden / Cell state size: 2048
- Batch size: 128
- Sequence length: 128
- Epochs per shard: 10
- Initial learning rate: 5e-4, decayed linearly to zero over the training set
- Gradient clipping threshold: 5.0
- States were kept during batches (stateful LSTM) and reset between epochs and shards

At the end of training, the model achieved training and validation losses close to 1.37 and 1.31 bits, respectively, and accuracies close to 0.72 and 0.73, respectively. Accuracy is computed by picking the entry of the estimated vector \(\boldsymbol{\hat{y}}\) with the max probability. In the following figure we can see the evolution of these quantities during training. We observe that by the end of training, the training loss starts to increase. An explanation for this could be that the learning rate is not small enough, and so the model starts to escape a local minima. This can be altered further by ajdusting the learning rate decreasing schedule. However, we left the model as is, since we our goal was not to obtain the best possible model, but rather one with *reasonable* loss / accuracy that could be the basis for further analysis.

A qualitative way to assess the training of language models is to generate samples to observe if the language's syntax and morphology is correct, and also if the model was capable of generating sentences with satisfiable semantics. The following is an example of sampling 500 bytes from the prime text "Riemann was a German mathematician who made contributions to analysis ", choosing bytes from a multinomial distribution, with temperature 0.4 (obtaining *good* sentences is very dependent on the temperature hyperparameter, and there's no straightforward rule for setting it, but rather observing the generated sentences).

Analysing the generated text it's easy to see that although syntax and morphology are well formed, semantics is not so easy to obtain.

Here it is possible to sample the pre-trained model and observe the empirical results of language generation by changing with the previously explained parameters. In most experiments we observed that model clearly learned the syntactic structure of language, but sometimes failed to produce a sentences with a clear meaning.

**Sample type**

During this project we implemented and tried to replicate to some extent the results from the openAI paper (Radford et al, 2017), which discovered a dimension of the cell state (a neuron) that is closely related to concept of sentiment.

The approach by openAI was to first train a large unsupervised byte-level language model from a collection of reviews (around 80 million) from the Amazon product reviews dataset (He et al, 2016), and to train a logistic regression classifier on top of the cell state representation of a review. The inputs for the classifier are then the LSTM cell states obtained after processing a review, and the target is the review score (positive or negative).

Surprisingly, there was a dimension in the logistic regression classifier that showed a much larger weight. They called this dimension the sentiment neuron, since alone it showed impressive results to determine the strength of the sentiment of a sentence, even at the byte level.

The dataset used is the Amazon product reviews dataset (He et al, 2016). It contains approximatelly 80 million product reviews, classified between 1 (negative) and 5 (positive), although their are not used for the language modelling task. Each review is preprocessed with the steps:

- Replace newline characters with spaces;
- Remove spaces from the beggining and end;
- Insert a newline at the end;
- Insert a newline plus a space at the beggining.

To test the ability of the model to generalize to data from another dataset, the logistic regression classifier was trained using the binary version of the Stanford Sentiment Treebank (SST) dataset (Socher et al). This dataset is composed of 9613 movie reviews (train, valid, test: 6921, 873, 1822), with binary sentiment classification (positive or negative).

- Hidden / Cell state size: 2048 (vs openAI's 4096)
- Batch size: 128
- Sequence length: 128 (vs openAI's 256)
- Epochs per shard: 1
- Initial learning rate: 5e-4, decayed linearly to zero over the training set
- Gradient clipping threshold: 5.0
- States were kept during batches (stateful LSTM) and reset between epochs and shards

To overcome memory exhaustion, the full set of 80 million reviews was splitted into 995 shards (993 for training, 1 for validation, 1 for testing), each containing approximatelly 80,000 reviews.

The binary logistic regression classifier was trained with a L1 penalty, using a regularization coeficient of 0.25.

Training the language model on this giant dataset takes a lot of time (for openAI it took approximately a month). So, we stopped training after processing around 65 million reviews (22 days of training). We obtained a loss close to the results from openAI (~1.21 vs ~1.18 - LSTM, for ~1 900 000 updates) (these values are not directly comparable due to the different LSTM internal sizes and backpropagation length used. Although they provide a guideline).

The trained logistic regression classifier obtained an 84.8% accuracy on the SST dataset (vs 91.8% obtained by openAI). The sentiment neuron alone achieves an 80% accuracy. The following figure shows the value of each weight after training the logistic regression classifier. We observe that one weight stands out, namely the one at dimension 981.

Taking the value of the largest weight as our sentiment neuron, we can see in the next figure that its value alone is a good indicator of the sentiment present in a review. This figure is produced by taking each review from the SST training dataset, compute its LSTM cell state representation, and take the cell state value at dimension 981. We then plot how often this value arised from processing all the reviews. Furthermore, we color positive reviews in blue, and negative ones in red. By visual inspection we see that this neuron value alone is able to separate positive from negative reviews. If the neuron value is less than 0 we classify the review as positive, and if it is greater than 0 we classify it as negative.

The reviews that overlap are also not easy for humans to decipher. The following is a set of missclassified reviews by just taking the sentiment neuron's value:

Predicted, True Label, Review

0, 1, Sometimes, nothing satisfies like old-fashioned swashbuckling.

1, 0, Enough similarities to Gymkata and Howie Long's Firestorm that my fingernails instinctively crawled towards my long-suffering eyeballs.

0, 1, Here's a British flick gleefully unconcerned with plausibility, yet just as determined to entertain you.

0, 1, While not as aggressively impressive as its American counterpart, "In the Bedroom," Moretti's film makes its own, quieter observations

1, 0, Imagine if you will a Tony Hawk skating video interspliced with footage from Behind Enemy Lines and set to Jersey shore techno.

Here it is possible to sample the pre-trained model to generate reviews, using the common parameters.

**Sample type**

Each byte in the generated text is colored with the hyperbolic tangent (places the values in the range [-1, 1]) of the cell state value of neuron 981, where -1 is blue, indicating positive sentiment, and +1 is red, indicating negative sentiment.

Here it is possible to classify a review by either using the pre-trained logistic regression classifier or using only the sentiment neuron value.

A negative (blue) value is indicative of a positive sentiment. A positive (red) value is indicative of a negative sentiment.

-1

+1

The blue (red) bar indicates how probable is the review to be positive (negative).

50%

50%

Following the idea of finding a sentiment neuron (Radford et al, 2017), we questioned if it would be possible to find wether and how other concepts are represented in the cell states of a LSTM network. We chose as concepts a set of POS tags from the Penn Treebank tagset, and our hope was to observe if there was a particular neuron (or set of neurons), which alone encoded the information of a particular tag.

We used the approach of (Radford et al, 2017) using the cell states of a pre-trained unsupervised LSTM neural language model as an input to a supervised classification task. More recently, the idea of using language models as a pre-processing step to solve language related tasks has shown empirical advances in many NLP tasks (Radford et al, 2018).

Our method works as follows:

- Take a sentence as a byte sequence \( \boldsymbol{X} = \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\), where \(\boldsymbol{x}_i\) is a byte represented as a one-hot-vector of an integer in the range [0, 255];
- Use a POS-tagger to obtain the tags for each word in the sentence. We used the tagger from the PYTHON package
*Natural Language Toolkit*(nltk), which outputs tags from the Penn Treebank tagset, by implementing a Greedy Averaged Perceptron. This tagger works on tokenized sentences and does not include whitespace characters. So, after tagging, we add the additional tag \( \text{SPACE} \) between each word or punctuation, to represent these kind of characters. Therefore, we associate each byte with a POS tag and we have a sequence \(X_{\text{tag}} = x_{1_{\text{tag}}}, \ldots, x_{n_{\text{tag}}} \), where \(x_{i_{\text{tag}}}\) is the tag of byte \(\boldsymbol{x}_i\). Alternatively, we can use a set of manually tagged sentences; - Obtain the cell states of each byte by transforming \(\boldsymbol{X}\) through a previously trained LSTM language model, obtaining a sequence of cell states \( \boldsymbol{X}_{t} = \boldsymbol{x}_{1_{t}}, \ldots, \boldsymbol{x}_{n_{t}} \), where \(\boldsymbol{x}_{i_{t}} \in \mathbb{R}^{D} \) (\(D\) is the LSTM cell state size);
- For each identified POS tag (or the set of desired POS tags to analyse), train a binary logistic regression classifier \( \mathcal{L} = \text{LRC}(\boldsymbol{X}_{\text{LRC}}, \boldsymbol{Y}_{\text{LRC}}) \) in a One-vs-All manner. The input data \(\boldsymbol{X}_{\text{LRC}}\) is the collection of cell states of each processed byte. The target data \(\boldsymbol{Y}_{\text{LRC}}\) are the corresponding POS tags. With \(n\) processed bytes, \( \boldsymbol{X}_{\text{LRC}} \in \mathbb{R}^{n \times D} \) and \( \boldsymbol{Y}_{\text{LRC}} \in \mathbb{R}^{n} \). The classifier \( \mathcal{L} \) produces a vector of weights \(\boldsymbol{W}_{c} \in \mathbb{R}^{D}\) for each unique class \(c\) from the set of POS tags of \( \boldsymbol{Y}_{\text{LRC}} \).
- Note the dimensions of the top \( k \) weights, i.e., the weights with the largest absolute values, and sort them by decreasing value in \( \text{largest_weights} = [d_1, \ldots, d_j, \ldots, d_k] \) (we used in our experiments \( k = 3\));
- Train \(k\) logistic regression classifiers, where the inputs dimensions vary from \(1\) to \(k\). For instance, for \(k=3\) classifiers use as input for the first classifier the dimension of the cell state with largest weight (\(d_1\)), for the second classifier use dimensions (\(d_1, d_2\)) and for the third one use (\(d_1, d_2, d_3\)). More formally, train further \(k\) logistic regression classifiers as \( \mathcal{L}_j = \text{LRC}(\boldsymbol{X}^{(1 \ldots j)}_{\text{LRC}}, \boldsymbol{Y}_{\text{LRC}}) \), where the input \( \boldsymbol{X}^{(1 \ldots j)}_{\text{LRC}} \in \mathbb{R}^{j} \) are the entries of \( \boldsymbol{X}_{\text{LRC}} \) indexed by \( [d_1, \ldots, d_j] \), i.e., the cell state values corresponding to the largest absolute weights of \( \mathcal{L} \).

The method above is described using only one sentence. Naturally, it is easily extended to use more sentences, by concatenating them to \( \boldsymbol{X}_{\text{LRC}} \) .

As a neural language model we used the one previously trained on the Wikitext-103 dataset.

We trained the logistic regression classifiers on three different datasets:

- The first 250 lines of the validation dataset of Wikitext-103 (approximatelly 122,000 pairs of bytes and tags), splitted into training / validation / testing (80% / 10% / 10%). This way we are in the domain of the language model dataset, but using a part of the data we did not use to build the model;
- The first 500 lines of the validation dataset of Wikitext-103 (approximatelly 260,000 pairs of bytes and tags), splitted into training / validation / testing (80% / 10% / 10%);
- The first 1000 sentences of the nltk treebank corpus (approximatelly 137,000 pairs of bytes and tags), splitted into training / validation / testing (80% / 10% / 10%). The nltk treebank corpus contains a 5% fragment of the Penn Treebank, (C) LDC 1995, containing manually tagged sentences from the Wall Street Journal (WSJ). When using this dataset it is not necessary to perform the first two steps of the method presented above.

Additionally, we also opted to train the classifiers with the option of grouping (or not) several tags into one. We thought this step could lead to better results, since some tags are separated into singular or plural, and our language model could not have captured those nuances. For instance, if a neuron captures the tag \( \text{NNP} \) (proper noun singular), probably that same neuron also captured the tag \( \text{NNPS} \) (proper noun plural). With this option selected we have a set of tags \( \text{{ ( ; ) ; , ; . ; CC ; CD ; DT ; IN ; JJ ; MD ; NN ; NNP ; PRP ; RB ; TO ; VB }} \), which we may group as follows:

- Group all verb tags into \( \text{VB} \). If tag \( \in \text{{VB, VBD, VBG, VBN, VBP, VBZ}} \), set tag to \( \text{VB} \);
- Group all adjective tags into \( \text{JJ} \). If tag \( \in \text{{JJ, JJR, JJS}} \), set tag to \( \text{JJ} \);
- Group all noun tags into \( \text{NN} \). If tag \( \in \text{{NN, NNS}} \), set tag to \( \text{NN} \);
- Group all proper noun tags into \( \text{NNP} \). If tag \( \in \text{{NNP, NNPS}} \), set tag to \( \text{NNP} \);
- Group all adverb tags into \( \text{RB} \). If tag \( \in \text{{RB, RBR, RBS}} \), set tag to \( \text{RB} \);
- If a tag is not in the set of tags, tag it with \( \text{OTHER} \). We found a large portion of tags have a low frequency, so we don't train a separate classifier for them, but group them in this tag;
- Tag spaces as \( \text{SPACE} \).

To evaluate the overall results of the classifiers we take a sentence, transform each byte into its cell state representation, and predicted the label (POS tag) using the full or the top 1, 2 or 3 dimensions as input to the \( C \) classifiers \( \mathcal{L} \) (assuming there are \(|C|\) unique classes). In the multiclass classification setting, we take the label of the most probable tag as the predicted one. The results obtained in the test set of each experiment are depicted in the next table. The individual results for each classifier can be consulted in table further down.

k = All | k = 1 | k = 2 | k = 3 | ||
---|---|---|---|---|---|

Dataset | Group tags | Acc | Prec | Recall | Acc | Prec | Recall | Acc | Prec | Recall | Acc | Prec | Recall |

250 lines Wikitext-103 | yes | 0.873 | 0.869 | 0.873 | 0.526 | 0.417 | 0.526 | 0.604 | 0.519 | 0.526 | 0.617 | 0.534 | 0.617 |

no | 0.839 | 0.834 | 0.839 | 0.399 | 0.308 | 0.399 | 0.544 | 0.447 | 0.544 | 0.565 | 0.482 | 0.565 | |

500 lines Wikitext-103 | yes | 0.864 | 0.859 | 0.864 | 0.466 | 0.343 | 0.466 | 0.540 | 0.433 | 0.540 | 0.617 | 0.513 | 0.617 |

no | 0.825 | 0.820 | 0.825 | 0.376 | 0.267 | 0.376 | 0.463 | 0.377 | 0.463 | 0.550 | 0.471 | 0.550 | |

1000 sentences nltk treebank corpus | yes | 0.844 | 0.838 | 0.844 | 0.425 | 0.269 | 0.425 | 0.523 | 0.425 | 0.523 | 0.579 | 0.425 | 0.579 |

no | 0.797 | 0.792 | 0.797 | 0.407 | 0.258 | 0.407 | 0.439 | 0.340 | 0.439 | 0.496 | 0.426 | 0.496 |

The best result obtained was an overall accuracy of 87%, which is not very impressive considering that other methods achieve around 97% (Manning, 2011). However, we were interested if with this method we could discover how different tags are represented in the network, which we observe by looking at the classifier weights. Consider, for instance, the classifier for the tag \(\text{VB}\), trained on the first 250 lines of the Wikitext-103 validation set without grouping tags (not_group_tags_250_lines). Inspecting the classifier weights for this concept (following figure on the left), we observe that the weight along dimension 1961 stands out from the others. We conjecture that this could be a neuron that encodes the information for the \(\text{VB}\) concept. The figure on the right shows the histogram of activation values for neuron 1961, where *match* are activations with true tag \(\text{VB}\), while *no-match* are activations with a tag different from \(\text{VB}\). We can see that, just like with the sentiment neuron, the dimension 1961 of the cell state alone could be a good indication for the POS tag of the current processed byte. Namely, in this particular case, if the activation is negative, the network might be processing a byte that is part of a verb.

In the following figure we use a sample of text to visualize the discovered neuron. Each byte is colored with the hyperbolic tangent (places the values in the range [-1, 1]) of the cell state value of neuron 1961, where -1 is red, +1 is blue. Under each byte we place the corresponding identified POS tag, and color in yellow the one we are analysing. By visual inspection, we observe that the neuron's value alone is at some extent able to distinguish if the current byte belongs to a tag \(\text{VB}\).

Unfortunatelly, this behavior observed for the \(\text{VB}\) tag is not observed for all tags. For instance, for the tag \(\text{RB}\) - adverb - there is no clear dimension which is larger then the rest. Also, with the help of the histogram, we see that the neuron alone cannot identify if the byte belongs to an adverb.

There are concepts that do not have a clear neuron that encodes them, but which can be predicted through a combination of neurons. For instance, the tag \(\text{SPACE}\) is one of such. Inspecting the classifier weights for this concept (following figure on the left), we observe that no weight clearly stands out. Nevertheless, we take the neuron 289, the one with largest absolute value, and plot its histogram of activations. We can see that, just like for the \(\text{RB}\) tag, the neuron alone cannot identify if the byte it is processing belongs to a space. This can be further visualized in the text right under the figures, where we color each byte with the hyperbolic tangent of the cell state value of neuron 289. By visual inspection, we observe that the neuron's value alone is not a good predictor.

However, when using the three dimensions corresponding to the top three largest weights of the trained classifier, we obtain very satisfying results. For the test set of the model group_tags_250_lines, when using a logistic regression classifier with three input dimensions we achieve an accuracy, precision and recall equal to 0.978, 0.936 and 0.944, respectively. In the following figure we see the same text as before, but each byte is now colored with the output probability of the three-dimensional classifier associated with the tag \(\text{SPACE}\).

In this section it is possible to experiment with the trained classifiers by following the steps (hit Submit after each step):

- Select a trained classifier. You can see at the end of the demo a table with the results for the individual classifiers;
- Select a POS tag from the list to analyse (consult its meaning in the Penn Treebank tagset);
- Visualize the activation of a particular neuron for the given text;
- Visualize the probability of the classifiers using all or up to three dimensions.

If the neuron is selected, each byte is colored with the hyperbolic tangent of the cell state value of the selected neuron, where -1 is red, +1 is blue. If the classifier is selected, each byte is colored with the output probability of the classifier associated with the tag (0 is white, 1 is blue). Under each byte we place the corresponding identified POS tag and color in yellow the selected one.

Concept | Data | k = All | k = 1 | k = 2 | k = 3 | |||
---|---|---|---|---|---|---|---|---|

Acc | Prec | Recall | Features used | Top 3 features | Acc | Prec | Recall | Acc | Prec | Recall | Acc | Prec | Recall |

The developed code can be found in the SVN repository https://ad-websvn.informatik.uni-freiburg.de/student-projects/joao-carvalho/

- Baisa, V. (2016). PhD Thesis - Byte Level Language Models. Masaryk University, Brno, Czech Republic
- Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research 3, 1137-1155
- Bengio, Y., Goodfellow, I., Courville, A. (2016). Deep Learning. Cambridge, Massachussets: MIT Press
- Domingos, P. (2015). The Master Algorithm. Basic Books
- He, R., McAuley, J. (2016). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW
- Hinton, G., Vinyals, O., Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS Deep Learning and Representation Learning Workshop
- Jurafsky, D., Martin, J. (2017). Speech and Language Processing. Third Edition draft
- Kingma, D., Jimmy, B. (2014). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR)
- Manning, C., Shuetze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, Massachussets: MIT Press
- Manning, C. (2011). Part-of-speech tagging from 97% to 100%: is it time for some linguistics? Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I, 171-189
- Merity, S., Xiong, C., Bradbury, J., Socher, R. (2016). Pointer Sentinel Mixture Models. arXiv:1609.07843
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26, 3111-3119
- Pascanu, R., Mikolov, T., Bengio, Y. (2013). On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, Pages III-1310-III-1318
- Radford, A., Jozefowicz, R., Sutskever, I. (2017). Learning to Generate Reviews and Discovering Sentiment. arXiv:1704.01444
- Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018) Improving Language Understanding by Generative Pre-Training. https://blog.openai.com/language-unsupervised/
- Schmidhuber, J., Hochreiter S., (1997). Long Short-Term Memory. Neural Computation, Volume 9 Issue 8, November 15, 1997, 1735-1780
- Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642