1 Introduction
We have recently seen great successes in using pretrained language models as encoders for a range of difficult natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Ruder and Howard, 2018; Devlin et al., 2018; Dong et al., 2019; Yang et al., 2019), often with little or no fine-tuning: Language models learn useful representations that allow them to serve as general-purpose encoders. A hypothetical general-purpose decoder would offer similar benefits: making it possible to both train models for text generation tasks with little annotated data and share parameters extensively across applications in environments where memory is limited. Is it possible to use a pretrained language model as a general-purpose decoder in a similar fashion?
For this to be possible, we would need a way of feeding some form of continuous sentence representation into a trained language model, and we would need a task-specific encoder that could convert some task input into a sentence representation that would cause the language model to produce the desired sentence. We are not aware of any work that has successfully produced an encoder that can interoperate in this way with a pretrained language model, and in this paper, we ask whether it is possible at all: Are typical trained neural network language models capable of recovering arbitrary sentences through conditioning of this kind?
We start by defining the sentence space of a recurrent language model and show how this model maps a given sentence to a trajectory in this space. We reparametrize this sentence space into a new space, the reparametrized sentence space, by mapping each trajectory in the original space to a point in the new space. To accomplish the reparametrization, we introduce two complementary methods to add additional bias terms to the previous hidden and cell state at each time step in the trained and frozen language model, and optimize those bias terms to maximize the likelihood of the sentence.
Recoverability inevitably depends on the size and quality of the underlying language model, so we vary both, along with the dimension of the reparametrized sentence space. We find that recoverability is quite sensitive to the choice of optimizer (we use nonlinear conjugate gradient rather than stochastic gradient descent) and to initialization, so it is unlikely that a simple encoder setup would work out of the box.
Our experiments reveal that recoverability decreases as sentence length increases and that models find it increasingly difficult to generate words later in a sentence. In other words, models rarely generate any correct words after generating an incorrect word when decoding a given sentence. We find that we can achieve full recoverability with a reparametrized sentence space with dimension equal to the dimension of the recurrent hidden state of the model, at least for large enough models: For nearly all sentences, there exists a single vector that can recover the sentence perfectly. We also observe that the smallest dimension able to achieve the greatest recoverability is approximately equal to the dimension of the recurrent hidden state of the model. These observations indicate that unconditional language models can indeed be conditioned to recover arbitrary sentences almost perfectly.
2 Related Work
Latent Variable Recurrent Language Models
The way we describe the sentence space of a language model can be thought of as performing inference over an implicit latent variable using a fixed decoder. This resembles prior work on sparse coding (Olshausen and Field, 1997) and generative latent optimization (Bojanowski et al., 2018). Under this perspective, it also relates to work on training latent variable language models, such as language models based on variational autoencoders by Bowman et al. (2016) and sequence generative adversarial networks by Yu et al. (2017). Our approach differs from these approaches in that we focus entirely on analyzing a fixed model that was trained unconditionally. Our formulation of the sentence space is also more general, and potentially applies to all of these models.
Pretrained Recurrent Language Models
Pretrained or separately trained language models have largely been used in two contexts: as a feature extractor for downstream tasks and as a scoring function for a task-specific decoder (Gulcehre et al., 2015; Li et al., 2016; Sriram et al., 2018). None of the above analyzes how a pretrained model represents sentences or investigates the potential of using a language model as a decoder. The work by Zoph et al. (2016) transfers a pretrained language model, as a part of a neural machine translation system, to another language pair and fine-tunes it. The positive result here is specific to machine translation as a downstream task, unlike the proposed framework, which is general and independent of the downstream task. Recently, there has been more work in pretraining the decoder using BERT (Devlin et al., 2018) for neural machine translation and abstractive summarization (Edunov et al., 2019; Lample and Conneau, 2019; Song et al., 2019).
3 The Sentence Space of a Recurrent Language Model
In this section, we first cover the background on recurrent language models. We then characterize the sentence space of such a model and show how we can reparametrize it for easier analysis. In this reparametrized sentence space, we define the recoverability of a sentence.
3.1 Recurrent Language Models
Model Description
We train a standard 2-layer recurrent language model over sentences in a standard autoregressive fashion:
(1) $\log p(x_1, \ldots, x_T) = \sum_{t=1}^{T} \log p(x_t \mid x_1, \ldots, x_{t-1})$
A neural network models each conditional distribution (right side) by taking as input all the previous tokens and producing as output the distribution over all possible next tokens. At every timestep, we update the internal hidden state $h_{t-1}$, which summarizes $x_1, \ldots, x_{t-1}$, with a new token $x_t$, resulting in $h_t$. This resulting hidden state, $h_t$, is used to compute $p(x_{t+1} \mid x_1, \ldots, x_t)$:
(2) $h_t = f_\theta(h_{t-1}, x_t)$
(3) $p(x_{t+1} \mid x_1, \ldots, x_t) = g_\theta(h_t)$
where $f_\theta$ is a recurrent transition function often implemented as an LSTM recurrent network (as in Hochreiter and Schmidhuber, 1997; Mikolov et al., 2010). The readout function $g_\theta$ is generally a softmax layer with dedicated parameters for each possible word. The incoming hidden state $h_0$ at the start of generation is generally an arbitrary constant vector. We use zeroes. For an LSTM language model with $l$ layers of $d$ LSTM units, its model dimension is $d^* = 2dl$ because LSTMs have two hidden state vectors (conventionally h and c) both of dimension $d$.
Training
We train the full model using stochastic gradient descent with a negative log-likelihood loss.
Inference
Once learning completes, a language model can be straightforwardly used in two ways: scoring and generation. To score, we compute the log-probability of a newly observed sentence according to Eq. (1). To generate, we use ancestral sampling by sampling tokens sequentially, conditioning on all previous tokens at each step via Eq. (1).
In addition, we can find the approximate most likely sequence using beam search (Graves, 2012). This procedure is generally used with language model variants like sequence-to-sequence models (Sutskever et al., 2014) that condition on additional context. We use this procedure in backward estimation to recover the sentence corresponding to a given point in the reparametrized space.
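To make scoring and ancestral sampling concrete, the following is a minimal sketch under stated assumptions: a tiny tanh recurrent cell stands in for the LSTM, the parameters are random rather than trained, and the vocabulary and all names (`transition`, `next_log_probs`, `score`, `sample`) are hypothetical and chosen only for illustration.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat"]   # hypothetical toy vocabulary
HIDDEN = 3                               # hypothetical hidden size

_rng = random.Random(0)
# Random stand-ins for trained parameters (a real model would learn these).
EMBED = [[_rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in VOCAB]
RECUR = [[_rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
READOUT = [[_rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in VOCAB]


def transition(h, token):
    """Recurrent update of the hidden state (a plain tanh cell, not an LSTM)."""
    return [
        math.tanh(EMBED[token][i] + sum(RECUR[i][j] * h[j] for j in range(HIDDEN)))
        for i in range(HIDDEN)
    ]


def next_log_probs(h):
    """Log-softmax readout: log-probability of every possible next token."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in READOUT]
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_norm for v in logits]


def score(tokens):
    """Sum of the conditional log-probabilities of a token sequence."""
    h, total = [0.0] * HIDDEN, 0.0
    for token in tokens:
        total += next_log_probs(h)[token]
        h = transition(h, token)
    return total


def sample(rng, max_len=10):
    """Ancestral sampling: draw each token given all previously drawn tokens."""
    h, out = [0.0] * HIDDEN, []
    for _ in range(max_len):
        probs = [math.exp(lp) for lp in next_log_probs(h)]
        token = rng.choices(range(len(VOCAB)), weights=probs)[0]
        out.append(token)
        if token == 0:          # stop once <eos> is drawn
            break
        h = transition(h, token)
    return out
```

Calling `score([1, 2, 3, 0])` returns the log-probability of the token ids for "the cat sat <eos>" under this toy model, and `sample(random.Random(1))` draws one sequence of token ids.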
3.2 Defining the Sentence Space
The recurrent transition function $f_\theta$ in Eq. (2) defines a dynamical system driven by the observations of tokens in a sentence. In this dynamical system, all trajectories start at the origin $h_0 = 0$ and evolve according to incoming tokens ($x_t$'s) over time. Any trajectory $(h_0, \ldots, h_T)$ is entirely embedded in a $d^*$-dimensional space, where $d^*$ is equal to the dimension of the hidden state and $h_t \in \mathcal{H}$, i.e., $\mathcal{H} \subseteq \mathbb{R}^{d^*}$. In other words, the language model embeds a sentence of length $T$ as a $T$-step trajectory in a $d^*$-dimensional space $\mathcal{H}$. In this paper, we refer to $\mathcal{H}$ as the sentence space of a language model.
Reparametrizing the Sentence Space
We want to recover sentences from semantic representations that do not encode sentence length symbolically. Since a single replacement of an intermediate token can drastically change the remaining trajectory in the sentence space, we want a flat-vector representation. To address this, we propose to (approximately) reparametrize the sentence space into a flat-vector space to characterize the sentence space of a language model. Under the proposed reparametrization, a trajectory of hidden states in the sentence space $\mathcal{H}$ maps to a vector $z$ of dimension $d'$ in the reparametrized sentence space $\mathcal{Z}$. To accomplish this, we add bias terms to the previous hidden and cell state at each time step in the model and optimize them to maximize the log probability of the sentence, as shown in Figure 1. We add this bias in two ways: (1) if $d' < d^*$, where $d^*$ is the model dimension, we use a random projection matrix to project our vector $z$ up to $d^*$ and (2) if $d' > d^*$, we use soft attention with the previous hidden state to adaptively project our vector $z$ down to $d^*$ (Bahdanau et al., 2015).
Our reparametrization must approximately allow us to go back (forward estimation) and forth (backward estimation) between a sequence of tokens and a point in this reparametrized space via the language model. We need this back-and-forth reparametrization to measure recoverability. Once this correspondence is preserved, we can inspect a set of points in the reparametrized sentence space instead of trajectories in the original sentence space. A vector in the reparametrized space resembles the output of an encoder acting as context for a conditional generation task. This makes analysis in the reparametrized space resemble analyses of context in sequence models, and thus helps us better understand the unconditional language model that we are trying to condition with such a vector.
We expect that our reparametrization will allow us to approximately go back and forth between a sequence and its corresponding point $z$ because we expect $z$ to contain all of the information of the sequence. Since we add $z$ at every timestep, the information preserved in $z$ will not degrade as quickly while the sequence is processed as it could if we added it only to the initial hidden and cell states. While there are other similar ways to integrate $z$, we choose to modify the recurrent connection.
Using the Sentence Space
In this paper, we describe the reparametrized sentence space of a language model as a set of $d'$-dimensional vectors that correspond to a set of sentences that were not used in training the underlying language model. This use of unseen sentences helps us understand the sentence space of a language model in terms of generalization rather than memorization, providing insight into the potential of using a pretrained language model as a fixed decoder/generator. Using our reparametrized sentence space framework, evaluation techniques designed for investigating word vectors become applicable. One such technique is interpolation between different sentences in our reparametrized sentence space (Table 1 in Choi et al., 2017; Bowman et al., 2016), but we do not explore this here.
Forward Estimation
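The goal of forward estimation is to find a point $z \in \mathcal{Z}$ that represents a sentence $x_1, \ldots, x_T$ via the trained language model (i.e., fixed $\theta$). When the dimension $d'$ of $z$ is smaller than the model dimension $d^*$, we use a random projection matrix to project it up to $d^*$, and when the dimension of $z$ is greater than the model dimension, we use soft attention to project it down to $d^*$. We modify the recurrent dynamics in Eq. (2) to be:
(4) $\tilde{h}_{t-1} = h_{t-1} + \begin{cases} W_z z, & \text{if } d' \le d^* \\ Z\,\mathrm{softmax}(Z^\top h_{t-1}), & \text{otherwise} \end{cases}$
(5) $h_t = f_\theta(\tilde{h}_{t-1}, x_t)$
where $z \in \mathbb{R}^{d'}$ and $Z \in \mathbb{R}^{d^* \times k}$ is just the unflattened matrix of $z$ consisting of $k = d'/d^*$ vectors of dimension $d^*$. We initialize the hidden state by $h_0 = 0$. $W_z \in \mathbb{R}^{d^* \times d'}$ is a random matrix with normalized rows, following Li et al. (2018), and is an identity matrix when $d' = d^*$. We then estimate $z$ by maximizing the log-probability of the given sentence under this modified model, while fixing the original parameters $\theta$:
(6) $\hat{z} = \arg\max_{z \in \mathcal{Z}} \sum_{t=1}^{T} \log p(x_t \mid x_1, \ldots, x_{t-1}, z)$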
We represent the entire sentence in a single $z$. To solve this optimization problem, we can use any off-the-shelf gradient-based optimization algorithm, such as gradient descent or nonlinear conjugate gradient. This objective function is highly nonconvex, potentially leading to multiple approximately optimal $z$'s. As a result, to estimate $z$ in forward estimation, we use nonlinear conjugate gradient (Wright and Nocedal, 1999) implemented in SciPy (Jones et al., 2014) with a limit of 10,000 iterations, although almost all runs converge much more quickly. Our experiments, however, reveal that many of these $z$'s lead to similar performance in recovering the original sentence.
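The forward-estimation idea can be sketched on a toy model. This is an illustration under several assumptions rather than the setup used in the experiments: a small tanh recurrent cell with random frozen parameters replaces the trained LSTM, a hypothetical start token `BOS` feeds the first step, only the random-projection case is shown, and plain gradient descent with finite-difference gradients is swapped in for nonlinear conjugate gradient so that the example needs only the standard library. All names are hypothetical.

```python
import math
import random

rng = random.Random(0)
V, D, DZ = 5, 6, 3   # toy vocabulary size, model dimension, dimension of z
BOS = 0              # hypothetical start-of-sentence token id


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Frozen stand-ins for the pretrained parameters (never updated below).
EMBED, RECUR, READOUT = rand_matrix(V, D), rand_matrix(D, D), rand_matrix(V, D)

# Random projection (D x DZ) with unit-norm rows, sampled once and then fixed.
W_Z = [[w / math.sqrt(sum(v * v for v in row)) for w in row]
       for row in rand_matrix(D, DZ)]


def neg_log_likelihood(z, sentence):
    """Negative log-probability of `sentence` when W_Z z biases every step."""
    bias = matvec(W_Z, z)
    h, prev, nll = [0.0] * D, BOS, 0.0
    for token in sentence:
        h_tilde = [a + b for a, b in zip(h, bias)]   # bias the previous state
        pre = matvec(RECUR, h_tilde)
        h = [math.tanh(p + e) for p, e in zip(pre, EMBED[prev])]
        logits = matvec(READOUT, h)
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
        nll -= logits[token] - log_norm
        prev = token
    return nll


def forward_estimate(sentence, z0, steps=200, lr=0.1, eps=1e-5):
    """Minimize the negative log-likelihood over z; return the best z seen."""
    z, best_z = list(z0), list(z0)
    best = neg_log_likelihood(best_z, sentence)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            hi = z[:i] + [z[i] + eps] + z[i + 1:]
            lo = z[:i] + [z[i] - eps] + z[i + 1:]
            grad.append((neg_log_likelihood(hi, sentence)
                         - neg_log_likelihood(lo, sentence)) / (2 * eps))
        z = [zi - lr * g for zi, g in zip(z, grad)]
        value = neg_log_likelihood(z, sentence)
        if value < best:
            best, best_z = value, list(z)
    return best_z
```

For example, `forward_estimate([1, 2, 3, 4], [0.0] * DZ)` returns a vector whose negative log-likelihood is no higher than that of the all-zero starting point. If SciPy is available, the same objective could instead be passed to `scipy.optimize.minimize` with `method="CG"`, which is closer to the optimizer named in the text.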
Backward Estimation
Backward estimation, an instance of sequence decoding, aims at recovering the original sentence given a point in the reparametrized sentence space, which we refer to as recovery. We use the same objective function as in Eq. (6), but we optimize over the tokens of the sentence rather than over the point in the reparametrized space. Unlike forward estimation, backward estimation is a combinatorial optimization problem and cannot be solved easily with a recurrent language model (Cho, 2016; Chen et al., 2018). To circumvent this, we use beam search, which is a standard approach in conditional language modeling applications such as machine translation.
3.3 Analyzing the Sentence Space through Recoverability
Under this formulation, we can investigate various properties of the sentence space of the underlying model. As a first step toward understanding the sentence space of a language model, we propose three round-trip recoverability metrics and describe how we use them to characterize the sentence space.
Recoverability
Recoverability measures how much information about the original sentence is preserved in the reparametrized sentence space $\mathcal{Z}$. We measure this by reconstructing the original sentence $x$. First, we forward-estimate the sentence vector $z$ from $x$ by Eq. (6). Then, we reconstruct the sentence $\hat{x}$ from the estimated $z$ via backward estimation. To evaluate the quality of reconstruction, we compare the original and reconstructed sentences, $x$ and $\hat{x}$, using the following three metrics:

Exact Match (EM):

BLEU (Papineni et al., 2002)

Prefix Match (PM):
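A minimal sketch of two of these metrics follows. It assumes that exact match is 1 only when the reconstructed token sequence is identical to the original, and that prefix match is the length of the longest correctly recovered prefix divided by the length of the (non-empty) original sentence. BLEU is omitted, and the function names are hypothetical.

```python
def exact_match(reference, hypothesis):
    """1.0 if the two token sequences are identical, else 0.0 (assumed definition)."""
    return 1.0 if list(reference) == list(hypothesis) else 0.0


def prefix_match(reference, hypothesis):
    """Longest correctly recovered prefix, as a fraction of the reference length."""
    matched = 0
    for ref_token, hyp_token in zip(reference, hypothesis):
        if ref_token != hyp_token:
            break
        matched += 1
    return matched / len(reference)


def mean_over_runs(metric, reference, hypotheses):
    """Average a metric over reconstructions from several optimization runs."""
    return sum(metric(reference, hyp) for hyp in hypotheses) / len(hypotheses)
```

For instance, with the reference `"the cat sat on the mat".split()` and the hypothesis `"the cat sat in the mat".split()`, `exact_match` gives 0.0 and `prefix_match` gives 0.5, since the first three of six tokens are recovered before the first error.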
Exact match gives information about the possibility of perfect recoverability. BLEU provides us with a smoother approximation to this, in which the hypothesis gets some reward for n-gram overlap, even if slightly inexact. Since BLEU is 0 for sentences with fewer than 4 tokens, we smooth these by only considering n-grams up to the sentence length if the sentence length is less than 4. Prefix match measures the longest consecutive sequence of tokens that are perfectly recovered from the beginning of the sentence, and we divide this by the sentence length. We use prefix match because early experiments show a very strong left-to-right falloff in quality of generation. In other words, candidate generations are better for shorter sentences and once an incorrect token is generated, future tokens are extremely unlikely to be correct. We compute each metric for each sentence by averaging over multiple optimization runs; we show exact match (EM) in the equations, but we do the same for BLEU and prefix match. To counter the effect of nonconvex optimization in Eq. (6), these runs vary by the initialization of $z$ and the random projection matrix in Eq. (4). That is,
$\mathrm{EM}(x) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{EM}(x, \hat{x}^{(n)})$, where $\hat{x}^{(n)}$ is the reconstruction obtained in the $n$-th of $N$ optimization runs.
Effective Dimension by Recoverability
These recoverability measures allow us to investigate the underlying properties of the proposed sentence space of a language model. If all sentences can be projected into a $d'$-dimensional sentence space $\mathcal{Z}$ and recovered perfectly, the effective dimension of $\mathcal{Z}$ must be no greater than $d'$. In this paper, when analyzing the effective dimension of a sentence space of a language model, we focus on the effective dimension given a target recoverability $\tau$:
(7) $\hat{d}'(\theta, \tau) = \min \{ d' : R(d', \theta) \ge \tau \}$
where $R(d', \theta)$ is the recoverability, averaged over the held-out sentences, obtained with a $d'$-dimensional reparametrized space under the model $\theta$. In other words, given a trained model ($\theta$), we find the smallest effective dimension $d'$ (the dimension of $\mathcal{Z}$) that satisfies the target recoverability ($\tau$). Using this, we can answer questions like what is the minimum dimension $d'$ needed to achieve a given recoverability under the model $\theta$. The unconstrained effective dimension, i.e., the smallest dimension that satisfies the best possible recoverability, is then:
(8) $\hat{d}'(\theta) = \hat{d}'(\theta, \max_{d'} R(d', \theta))$
We approximate the effective dimension by inspecting various values of $d'$. Since our forward estimation process uses nonconvex optimization and our backward estimation process uses beam search, our effective dimension estimates are upper-bound approximations.
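Given recoverability estimates for a handful of inspected dimensions, the two effective-dimension definitions reduce to simple searches. The sketch below assumes the estimates are stored in a dictionary from dimension to average recoverability; the function names and the numbers in the usage note are hypothetical and are not results from this paper.

```python
def effective_dimension(recoverability, target):
    """Smallest inspected dimension whose recoverability meets the target, or None."""
    feasible = [dim for dim, value in recoverability.items() if value >= target]
    return min(feasible) if feasible else None


def unconstrained_effective_dimension(recoverability):
    """Smallest inspected dimension that attains the best observed recoverability."""
    return effective_dimension(recoverability, max(recoverability.values()))
```

With the made-up table `{128: 0.10, 256: 0.35, 512: 0.80, 1024: 0.97, 2048: 0.97}`, a target of 0.8 yields 512, and the unconstrained effective dimension is 1024 because the larger 2048-dimensional space does not improve on it.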
4 Experimental Setup
Corpus
We use the fifth edition of the English Gigaword (Graff et al., 2003) news corpus. Our primary model is trained on 50M sentences from this corpus, and analysis experiments additionally include a weaker model trained on a subset of only 10M. Our training sentences are drawn from articles published before November 2010. We use a development set with 879k sentences from the articles published in November 2010 and a test set of 878k sentences from the articles published in December 2010. We lowercase the entire corpus, segment each article into sentences using NLTK (Bird and Loper, 2004), and tokenize each sentence using the Moses tokenizer (Koehn et al., 2007). We further segment the tokens using byte-pair encoding (BPE; following Sennrich et al., 2016) with 20,000 merges to obtain a vocabulary of 20,234 subword tokens.
Recurrent Language Models
The proposed framework is agnostic to the underlying architecture of a language model. We choose a 2-layer language model with LSTM units (Graves, 2013). We construct a small, medium, and large language model consisting of 256, 512, and 1024 LSTM units respectively in each layer. The input and output embedding matrices of 256-, 512-, and 1024-dimensional vectors respectively are shared (Press and Wolf, 2017). We use dropout (Srivastava et al., 2014) between the two recurrent layers and before the final linear layer with a drop rate of , , and respectively. We use stochastic gradient descent with Adam with a learning rate of on 100-sentence minibatches (Kingma and Ba, 2014), where sentences have a maximum length of 100.
We measure perplexity on the development set every 10k minibatches, halve the learning rate whenever the perplexity increases, and clip the norm of the gradient to 1 (Pascanu et al., 2013). For each training set (10M and 50M), we train for only one epoch. Because of the large size of the training sets, these models nonetheless achieve a good fit to the underlying distribution (Table 1).
Table 1: Perplexity (Ppl.) of each language model on the development and test sets.
Model   LSTM units  10M Dev Ppl.  10M Test Ppl.  50M Dev Ppl.  50M Test Ppl.
small   256         122.9         125.2          77.2          79.2
medium  512         89.6          91.3           62.1          63.5
large   1024        65.9          67.7           47.4          48.9
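The learning-rate halving and gradient-norm clipping used during training can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, gradients are represented as a flat list of floats, and an increase in perplexity is assumed to mean an increase relative to the previous development-set measurement.

```python
import math


def clip_gradient_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


def update_learning_rate(lr, dev_perplexities):
    """Halve the learning rate if the latest development perplexity went up."""
    if len(dev_perplexities) >= 2 and dev_perplexities[-1] > dev_perplexities[-2]:
        return lr / 2.0
    return lr
```

A training loop would call `clip_gradient_norm` on every minibatch and `update_learning_rate` after each development-set evaluation, appending the new perplexity to the history first.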
Reparametrized Sentence Spaces
We use a set of 100 randomly selected sentences from the development set in our analysis. We set the reparametrized sentence space to have 128, 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768 dimensions for each language model and measure its recoverability. For each sentence, we use ten random initializations. When the dimension of the reparametrized sentence space is smaller than the model dimension, we construct ten random projection matrices that are sampled once and fixed throughout the optimization procedure. We perform beam search with beam width 5.
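Beam search itself is independent of the model being decoded, so it can be sketched against a generic scoring callback. In this hedged example, `next_log_probs` is a hypothetical function that returns a dictionary of next-token log-probabilities for a given prefix; in backward estimation it would wrap the frozen language model conditioned on the estimated sentence vector.

```python
import math


def beam_search(next_log_probs, eos, beam_width=5, max_len=20):
    """Return the highest-scoring hypothesis found by beam search.

    next_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the tuple of tokens in `prefix`.
    """
    beams = [(0.0, ())]          # (cumulative log-probability, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for token, token_logp in next_log_probs(prefix).items():
                candidates.append((logp + token_logp, prefix + (token,)))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = []
        for logp, prefix in candidates[:beam_width]:
            # Hypotheses ending in eos are set aside; the rest keep growing.
            (finished if prefix[-1] == eos else beams).append((logp, prefix))
        if not beams:
            break
    pool = finished or beams     # fall back to unfinished hypotheses if needed
    return max(pool, key=lambda cand: cand[0])[1]
```

With `beam_width=1` this reduces to greedy decoding, which can commit to a locally likely first token and miss a sequence with higher overall probability; a wider beam keeps such alternatives alive.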
5 Results and Analysis
Recoverability Results
In Figure 2, we present the recoverability results of our experiments relative to sentence length using the three language models trained on 50M sentences. We observe that recoverability increases as the dimension of the reparametrized space increases until it reaches the model dimension. After this point, recoverability plateaus. The recoverability metrics for a single model are strongly positively correlated with one another. We also observe that recoverability is nearly perfect for the large model when achieving EM , and very high for the medium model when achieving EM .
We find that recoverability increases for a specific dimension of the reparametrized space as the language model is trained, although we cannot present the result due to space constraints. The figure corresponding to Figure 2 for the 10M setting and tables for both settings detailing overall performance are provided in the appendix. All these estimates have high confidence (small standard deviations). Since results were consistent with beam widths of 5, 10, and 20 in beam search decoding, we only report those for width 5.
Effective Dimension of the Sentence Space
From Figure 2, the large model’s unconstrained effective dimension is its model dimension, with a slight degradation in recoverability when the dimension of the reparametrized space increases beyond it. For the medium model, we notice that its unconstrained effective dimension is also its model dimension, with no real recoverability improvements beyond it. For the small model, however, the unconstrained effective dimension is much greater than its model dimension.
When , we can recover any sentence nearly perfectly, and for large sentences, the large model with achieves recoverability estimates . For other model sizes and other dimensions of the reparametrized space, we fail to perfectly recover some sentences. To ascertain which sentences we fail to recover, we look at the shape of each curve. We observe that the vast majority of these curves never increase, indicating that recoverability and sentence length have a strong negative correlation. Most curves decrease to 0 as sentence length exceeds 30, indicating that longer sentences are more difficult to recover. Earlier observations in using neural sequence-to-sequence models for machine translation concluded exactly this (Cho et al., 2014; Koehn and Knowles, 2017).
This suggests that a fixed-length representation lacks the capacity to represent a complex sentence and could sacrifice some important information in order to encode other information. The degradation in recoverability also implies that the unconstrained effective dimension of the sentence space could be strongly related to the length of the sentence and may not be related to the model dimension. The fact that the smaller model has an unconstrained effective dimension much larger than its model dimension supports this claim.
Sources of Randomness
There are two points of stochasticity in the proposed framework: the nonconvexity of the optimization procedure in forward estimation (Eq. 6) and the sampling of a random projection matrix. However, based on the small standard deviations in Figure 2, these have minimal impact on recoverability. Also, the observation of high-confidence (low-variance) upper-bound estimates for recoverability supports the usability of our recoverability metrics for investigating a language model’s sentence space.
Towards a GeneralPurpose Decoder
In this formulation, our vector $z$ can be considered as trainable context used to condition our unconditioned language models to generate arbitrary sentences. Since we find that well-trained language models of reasonable size have an unconstrained effective dimension with high recoverability that is approximately equal to the model dimension, our unconditional language models are able to utilize our context $z$. As a result, such a model could be used as a task-independent decoder given an encoder with the ability to generate an optimal context vector $z$.
We observe that recoverability is not perfect for either the small or the medium model, falling off dramatically for longer sentences, indicating that the minimum model size for high recoverability is fairly large. Since the sentence length distribution is heavily right-tailed, if we can increase the recoverability degradation cutoff point, the number of sentences we fail to recover perfectly would decrease exponentially. But since we find that larger and better-trained models can exhibit near-perfect recoverability and can more easily utilize our conditioning strategy, we think that this may only be a concern for lower-capacity models. A regularization mechanism to smooth the implicit sentence space may improve recoverability and reduce the unconstrained effective dimension, thereby increasing the applicability of an unconditional language model as a general-purpose decoder.
6 Conclusion
In an effort to answer whether unconditional language models can be conditioned to generate held-out sentences, we introduce the concept of the reparametrized sentence space for a frozen pretrained language model, in which every sentence is represented as a point vector that is optimized to reproduce that sentence when used as an added bias during decoding with that language model. We design optimization-based forward estimation and beam-search-based backward estimation procedures, allowing us to map a sentence to and from the reparametrized space. We then introduce and use recoverability metrics that allow us to measure the effective dimension of the reparametrized space and to discover the degree to which sentences can be recovered from fixed-sized representations by the model without further training.
We evaluate three language models trained on large English corpora under the recoverability lens, while varying the dimension of the reparametrized sentence space. We observe that we can indeed condition our unconditional language models to generate held-out sentences, as our large model with effective dimension equal to the model dimension achieves near-perfect recoverability across all metrics. Furthermore, we find that recoverability increases with the dimension of the reparametrized space until it reaches the model dimension. After this, it plateaus at near-perfect recoverability for well-trained, sufficiently large models.
These experiments reveal two properties of the sentence space of a language model. First, recoverability improves with the size and quality of the language model and is nearly perfect when the dimension of the reparametrized space equals that of the model. Second, recoverability is negatively correlated with sentence length, i.e., recoverability is more difficult for longer sentences. Our recoverability-based approach for analyzing the sentence space gives conservative estimates (upper bounds) of the effective dimension of the space and lower bounds for the associated recoverabilities. We observe that results are quite sensitive to the choice of optimizer (going from a nonlinear conjugate gradient method to a stochastic gradient method), which suggests that an off-the-shelf encoder setup may not work out of the box.
We see two avenues for further work, one more oriented toward analysis, another more practical. To complete the line of analysis that this work initiates, it would be valuable to measure the relationship between regularization, encouraging the reparametrized sentence space to be of a certain form, and nonlinearity. In addition, although our framework is independent of the downstream task and network architecture, we want to compare recoverability and downstream task performance and analyze the sentence space of architectures beyond a plain LSTM language model with another conditioning apparatus. Finally, it would be valuable to better understand the impact of the domain of the test corpus: Can these models uncover text in a substantially different style from that seen at training time? Looking to potential applications, a clear next step would be to evaluate methods for converting encoder representations into points in a reparametrized sentence space for some large language model, in the hope of using such a language model as a component in a data- and memory-efficient conditional generation model.
Acknowledgments
This work was supported by Samsung Electronics (Improving Deep Learning using Latent Structure). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research.
References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
 Bird and Loper (2004) Steven Bird and Edward Loper. 2004. Nltk: the natural language toolkit. In ACL, page 31. Association for Computational Linguistics.
 Bojanowski et al. (2018) Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. 2018. Optimizing the latent space of generative networks. In ICML, pages 599–608.
 Bowman et al. (2016) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. CoNLL 2016, page 10.
 Chen et al. (2018) Yun Chen, Victor OK Li, Kyunghyun Cho, and Samuel Bowman. 2018. A stable and effective learning strategy for trainable greedy decoding. In EMNLP, pages 380–390.
 Cho (2016) Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.
 Choi et al. (2017) Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. 2017. Context-dependent word representation for neural machine translation. Computer Speech & Language, 45:149–160.
 Dai and Le (2015) Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.
 Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
 Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
 Edunov et al. (2019) Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722.
 Graff et al. (2003) David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.
 Graves (2012) Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
 Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
 Gulcehre et al. (2015) Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
 Jones et al. (2014) Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2014. Scipy: Open source scientific tools for python, 2014. http://www.scipy.org, 4.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In ACL, pages 177–180. Association for Computational Linguistics.
 Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. ACL 2017, page 28.
 Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
 Li et al. (2018) Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. 2018. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838.
 Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL, pages 110–119.
 Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
 Olshausen and Field (1997) Bruno A Olshausen and David J Field. 1997. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318. Association for Computational Linguistics.
 Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In ICML, pages 1310–1318.
 Peters et al. (2017) Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
 Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.
 Press and Wolf (2017) Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In ACL, volume 2, pages 157–163.
 Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.
 Ruder and Howard (2018) Sebastian Ruder and Jeremy Howard. 2018. Universal language model fine-tuning for text classification. In ACL.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
 Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
 Sriram et al. (2018) Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Interspeech.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
 Wright and Nocedal (1999) Stephen Wright and Jorge Nocedal. 1999. Numerical optimization. Springer Science, 35(67-68):7.
 Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
 Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.
 Zoph et al. (2016) Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In EMNLP.
7 Appendix
small; 10M  128.0  4.5  0.532  6.62  0.708  6.38  0.563
small; 10M  256.0  9.0  1.060  14.1  0.627  12.3  0.854
small; 10M  512.0  16.0  0.000  22.9  0.443  22.3  0.660 
small; 10M  1024.0  28.5  1.410  42.2  0.710  40.1  0.816 
small; 10M  2048.0  23.0  1.060  34.4  0.939  33.6  0.869 
small; 10M  4096.0  34.5  1.600  46.9  1.160  45.7  0.961 
small; 10M  8192.0  38.5  1.190  47.3  1.090  46.4  0.928 
small; 10M  16384.0  33.0  1.510  41.0  1.120  40.1  0.749 
small; 10M  32768.0  29.5  1.190  36.8  0.766  35.1  0.953 
small; 50M  128.0  8.0  0.753  10.2  0.574  8.79  0.689 
small; 50M  256.0  22.0  1.510  28.5  1.120  26.  1.190 
small; 50M  512.0  33.5  1.600  40.0  0.960  37.1  1.180 
small; 50M  1024.0  64.5  1.190  71.3  0.821  69.3  0.801 
small; 50M  2048.0  66.0  1.840  73.1  1.360  72.0  1.100 
small; 50M  4096.0  66.0  1.990  74.0  1.240  72.2  1.250 
small; 50M  8192.0  73.0  1.680  81.1  1.010  79.8  0.945 
small; 50M  16384.0  70.5  1.920  76.6  1.270  74.1  1.100 
small; 50M  32768.0  65.0  1.510  72.8  1.140  68.7  0.964 
medium; 10M  128.0  6.5  0.532  9.26  0.619  7.56  0.634 
medium; 10M  256.0  13.0  0.753  19.9  0.598  15.0  0.941 
medium; 10M  512.0  28.0  1.060  35.0  0.993  30.3  0.975 
medium; 10M  1024.0  39.5  0.922  45.6  0.623  42.3  0.200 
medium; 10M  2048.0  71.0  1.060  76.6  0.660  75.9  0.874 
medium; 10M  4096.0  67.0  1.840  75.4  1.150  73.0  1.090 
medium; 10M  8192.0  71.5  1.600  79.0  0.813  77.1  1.050 
medium; 10M  16384.0  66.5  2.060  74.6  1.030  72.5  1.010 
medium; 10M  32768.0  67.0  1.680  76.1  0.812  71.2  0.920 
medium; 50M  128.0  6.0  0.753  10.9  0.933  7.71  0.911 
medium; 50M  256.0  24.5  0.922  28.0  0.742  26.6  0.661 
medium; 50M  512.0  36.5  0.922  41.1  0.698  37.8  0.806 
medium; 50M  1024.0  51.0  1.300  57.3  0.980  55.9  0.883 
medium; 50M  2048.0  87.0  1.510  91.2  0.544  89.8  0.807 
medium; 50M  4096.0  84.0  1.990  89.4  0.876  89.1  1.040 
medium; 50M  8192.0  85.0  1.680  89.1  0.985  89.2  0.989 
medium; 50M  16384.0  88.0  1.680  92.4  0.687  92.5  0.743 
medium; 50M  32768.0  84.5  1.600  90.6  0.596  89.3  0.646 
large; 10M  128.0  7.0  0.753  11.8  0.741  7.65  0.542 
large; 10M  256.0  21.0  1.300  27.5  0.853  22.7  1.080 
large; 10M  512.0  42.0  1.060  46.3  0.794  43.2  1.140 
large; 10M  1024.0  58.0  1.510  62.1  1.170  59.9  1.340 
large; 10M  2048.0  67.0  0.000  68.2  0.213  67.4  0.045 
large; 10M  4096.0  95.0  1.300  97.6  0.577  97.3  0.529 
large; 10M  8192.0  90.0  1.510  93.7  0.609  93.5  0.664 
large; 10M  16384.0  88.5  1.770  92.4  0.704  92.6  0.587 
large; 10M  32768.0  90.5  1.920  95.3  0.942  95.7  0.821 
large; 50M  128.0  12.5  0.922  15.2  1.000  13.3  0.873 
large; 50M  256.0  29.0  1.300  32.7  1.200  30.4  1.060 
large; 50M  512.0  51.5  1.190  54.8  1.040  54.1  1.110 
large; 50M  1024.0  67.5  0.922  69.5  0.717  68.4  0.633 
large; 50M  2048.0  75.0  1.300  77.0  0.883  76.2  1.050 
large; 50M  4096.0  99.0  0.753  99.8  0.204  99.8  0.189 
large; 50M  8192.0  94.5  1.190  96.3  0.407  96.2  0.427 
large; 50M  16384.0  88.5  1.770  93.8  0.363  93.8  0.351 
large; 50M  32768.0  94.5  1.190  96.5  0.303  96.5  0.316 