Applied NLP

Classification: CNN

Published:

Text or sequence classification aims to label a sentence or document based on its content. In this post, we use Convolutional Neural Networks and pre-trained embeddings to classify a novel corpus. This post provides a full treatment of the steps required to prepare data for NLP analysis and to analyze it with PyTorch. It also includes sample code to optimize the hyperparameters through state-of-the-art pruning and search algorithms via Optuna.
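As a rough sketch of the kind of model involved, the snippet below builds a small text CNN on top of a pre-trained embedding matrix. The vocabulary size, filter counts, kernel sizes, and class count are illustrative assumptions, not the post's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Convolutional text classifier over pre-trained word embeddings."""
    def __init__(self, embedding_matrix, num_classes, kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        # Load pre-trained vectors (e.g., GloVe) and keep fine-tuning them.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        embed_dim = embedding_matrix.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, embed_dim, seq_len) for Conv1d
        x = self.embedding(token_ids).transpose(1, 2)
        # Convolve, apply ReLU, then max-pool over time for each kernel size.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

# Illustrative example: 10k-word vocabulary, 300-d vectors, 4 target classes.
model = TextCNN(torch.randn(10_000, 300), num_classes=4)
logits = model(torch.randint(0, 10_000, (8, 50)))  # shape (8, 4)
```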

Classification: DistilBERT

Published:

Text or sequence classification aims to label a sentence or document based on its content. In this post, we use Transformers to classify a novel data set that I created based on insurgent propaganda messages. This post provides a full treatment of the steps required to prepare data for NLP analysis and to analyze it with PyTorch. It also includes sample code to optimize the hyperparameters through state-of-the-art pruning and search algorithms via Optuna.
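To give a flavor of the Optuna piece, the sketch below tunes the learning rate of a DistilBERT classifier with a median pruner. The search space, the dummy batch, and the number of trials are assumptions for illustration, not the post's exact training loop.

```python
import optuna
import torch
from transformers import DistilBertForSequenceClassification

def objective(trial):
    # Search the learning rate on a log scale; other hyperparameters could be added the same way.
    lr = trial.suggest_float("lr", 1e-5, 5e-5, log=True)
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Tiny dummy batch so the sketch runs end to end; replace with the real DataLoader.
    input_ids = torch.randint(0, model.config.vocab_size, (4, 16))
    labels = torch.randint(0, 2, (4,))

    for epoch in range(3):
        optimizer.zero_grad()
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        # Report intermediate results so the pruner can stop unpromising trials early.
        trial.report(loss.item(), step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return loss.item()

study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=10)
print(study.best_params)
```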

Generation: DistilGPT-2

Published:

Language models are trained to predict the probability of the next token given the tokens that came before it. A token can be a word, a letter, or a subcomponent of a word. In this guide, we use a decoder-only Transformer language model to predict text from our novel insurgent propaganda corpus.
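A minimal sketch of decoder-only generation with the off-the-shelf distilgpt2 checkpoint; the prompt and the sampling settings are illustrative assumptions rather than the post's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load DistilGPT-2 and generate a continuation of an illustrative prompt.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompt = "The statement released this morning claims"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,                        # sample from the predicted next-token distribution
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token by default
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```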

Classification: BERT-CNN

Published:

In this guide, we prepare a BERT-CNN ensemble that takes the embeddings generated by the BERT base model and feeds them into a CNN. The general logic from this guide can be used to replace the CNN with any other NN of your choice. While it is a fun task to explore, stacking what is technically an inferior model on top of a Transformer is not really necessary. Like other guides, this walkthrough provides a complete treatment of the data preparation and training of the BERT-CNN ensemble in PyTorch.
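A minimal sketch of the ensemble idea, assuming a bert-base-uncased backbone and a single convolutional block; the filter count and kernel size are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class BertCNN(nn.Module):
    """Feed BERT's token embeddings into a small 1-D CNN classifier."""
    def __init__(self, num_classes, num_filters=100, kernel_size=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size  # 768 for bert-base
        self.conv = nn.Conv1d(hidden, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) -> (batch, hidden, seq_len) for Conv1d
        hidden_states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x = F.relu(self.conv(hidden_states.transpose(1, 2)))
        x = x.max(dim=2).values  # max-pool over the sequence
        return self.fc(x)
```

Swapping in a different head is just a matter of replacing the Conv1d/Linear layers with whatever network you prefer, since the BERT output is an ordinary tensor.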

Summarization: T5

Published:

There are two types of summaries: (1) abstractive, or explaining in your own words, and (2) extractive, or building a summary from existing text. Humans are mostly abstractive, while NLP systems are mostly extractive. In this guide, we use T5, a pre-trained and very large (roughly twice the size of BERT-base) encoder-decoder Transformer model. T5, a model devised by Google, is an important advancement in the field of Transformers because it achieves near human-level performance on a variety of benchmarks like GLUE and SQuAD.
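A minimal sketch of summarization with a pre-trained t5-base checkpoint; the input text and the decoding settings are illustrative assumptions rather than the post's exact configuration.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# T5 is trained with task prefixes, so summarization inputs start with "summarize: ".
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

document = "Long article text goes here."  # stand-in for the text to summarize
inputs = tokenizer("summarize: " + document, return_tensors="pt", truncation=True)

summary_ids = model.generate(**inputs, max_new_tokens=60, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```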

Classification: Hierarchical Attention Networks

Published:

Hierarchical Attention Networks (HAN), as the name suggests, have a hierarchical structure that reflects the hierarchical nature of documents. A HAN has two levels of attention mechanisms, applied at the word and sentence level, which give it the ability to weight more and less important content differently when evaluating documents. In this guide, we walk through how to create a Hierarchical Attention Network in PyTorch, as well as how to create and structure our data appropriately.
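A minimal sketch of the word-level half of a HAN, assuming a bidirectional GRU encoder; a full HAN repeats the same attention pattern over the resulting sentence vectors, and the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Word-level encoder + attention; a sentence-level copy of the same idea sits on top in a full HAN."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.context = nn.Linear(2 * hidden_dim, 1, bias=False)  # word context vector

    def forward(self, token_ids):
        # token_ids: (batch_of_sentences, words_per_sentence)
        h, _ = self.gru(self.embedding(token_ids))      # (B, W, 2*hidden)
        u = torch.tanh(self.attn(h))                    # hidden representation of each word
        alpha = torch.softmax(self.context(u), dim=1)   # attention weight per word
        return (alpha * h).sum(dim=1)                   # weighted sentence vector

# Illustrative usage: 8 sentences of 20 words each, 5k-word vocabulary.
sentence_vectors = WordAttention(vocab_size=5_000)(torch.randint(0, 5_000, (8, 20)))
```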

Classification: T5

Published:

In this guide, we use T5, a pre-trained and very large (roughly twice the size of BERT-base) encoder-decoder Transformer model, for a classification task. T5, a model devised by Google, is an important advancement in the field of Transformers because it achieves near human-level performance on a variety of benchmarks like GLUE and SQuAD. Another important advancement is that it treats NLP as a text-to-text problem, whereby our inputs are text and our outputs are also text. In this universal framework, T5 can therefore handle any NLP task (in English). T5 was pre-trained on the C4 (Colossal Clean Crawled Corpus) corpus, which amounts to roughly 750GB of clean English text. For comparative purposes, BERT was trained on roughly 13GB of text and XLNet was trained on roughly 126GB of text. For these reasons, T5 is the state of the art, and its encoder-decoder architecture is likely the future of NLP models.
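A minimal sketch of the text-to-text framing for classification; the "classify:" prefix and the "propaganda" label string are assumptions for illustration, not the exact prompts or labels used in the post.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# In the text-to-text framing, the label itself is just text the model learns to emit.
text = "the attack was claimed by the group in a broadcast"      # illustrative input
inputs = tokenizer("classify: " + text, return_tensors="pt")     # assumed task prefix
labels = tokenizer("propaganda", return_tensors="pt").input_ids  # assumed label string

loss = model(**inputs, labels=labels).loss   # fine-tuning minimizes this loss
pred_ids = model.generate(**inputs, max_new_tokens=4)
print(tokenizer.decode(pred_ids[0], skip_special_tokens=True))
```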

Classification: Character CNN

Published:

In this post, we use character-level Convolutional Neural Networks (CNN) to classify a novel data set in PyTorch. CNNs are useful for extracting information from raw signals, in domains ranging from computer vision to speech and text. Character-level CNNs treat text characters as a kind of raw signal, thereby allowing CNNs to eschew the requirement to develop an understanding of words. While character-level CNNs work in many different situations, they are particularly well suited to Twitter or multilingual text, where advanced embeddings are usually unavailable to researchers. In this guide, we replicate the model specified in the paper Character-level Convolutional Networks for Text Classification.
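A minimal sketch of the character quantization step and the first convolutional block, loosely following the small model in that paper; the alphabet below only approximates the paper's and the downstream layers are omitted.

```python
import torch
import torch.nn as nn

# Characters are one-hot encoded over a fixed alphabet and the sequence is padded or
# truncated to a fixed length, so raw text becomes an (alphabet, length) "signal".
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
MAX_LEN = 1014  # sequence length used in the paper

def quantize(text):
    x = torch.zeros(len(ALPHABET), MAX_LEN)
    for i, ch in enumerate(text.lower()[:MAX_LEN]):
        j = ALPHABET.find(ch)
        if j >= 0:
            x[j, i] = 1.0   # characters outside the alphabet stay all-zero
    return x

# First convolutional block of the character-level CNN (simplified).
conv1 = nn.Sequential(
    nn.Conv1d(len(ALPHABET), 256, kernel_size=7),
    nn.ReLU(),
    nn.MaxPool1d(3),
)
features = conv1(quantize("sample tweet text").unsqueeze(0))  # (1, 256, 336)
```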

Classification: Capsule Routing

Published:

While capsule networks originated in computer vision alongside CNNs, recent work shows that they work well in Natural Language Processing (NLP) as well. “A capsule is a group of neurons whose outputs represent different properties of the same entity in different contexts. Routing by agreement is an iterative form of clustering in which a capsule detects an entity by looking for agreement among votes from input capsules that have already detected parts of the entity in a previous layer” (Heinsen, 2019). Capsule networks are a means for aggregating the importance of embeddings, akin to attention mechanisms.
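For a feel of the mechanism, the sketch below implements routing by agreement in the dynamic-routing style of Sabour et al. (2017); Heinsen (2019) proposes a different routing algorithm, so treat this only as an illustration of the general idea.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1):
    """Non-linearity that keeps a capsule's orientation but shrinks its length into [0, 1)."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / (norm_sq.sqrt() + 1e-8)

def routing_by_agreement(votes, num_iters=3):
    """votes: (batch, in_caps, out_caps, out_dim) predictions from lower-level capsules."""
    b = torch.zeros(votes.shape[:3], device=votes.device)    # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                              # coupling coefficients per input capsule
        s = (c.unsqueeze(-1) * votes).sum(dim=1)             # weighted sum of votes
        v = squash(s)                                        # candidate output capsules
        b = b + (votes * v.unsqueeze(1)).sum(dim=-1)         # reward votes that agree with the output
    return v

# Illustrative usage: 32 input capsules voting for 10 output capsules of dimension 16.
out_caps = routing_by_agreement(torch.randn(2, 32, 10, 16))  # (2, 10, 16)
```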

Classification: Entity Embeddings

Published:

In this guide, we will implement entity embeddings in two ways via PyTorch: (1) via nn.Embedding(), and (2) via transformers. We will also show how to load data in a more efficient manner through a custom PyTorch data set class. This style of data management is slightly more complicated to initialize, but it is the precise way we want to load our data when dealing with (1) big data or (2) a memory-constrained environment. Entity embeddings refer to the idea of transforming categorical variables into continuous embeddings to avoid one-hot encoding and sparse matrices.
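A minimal sketch of both pieces, a custom Dataset class and per-column nn.Embedding layers; the column cardinalities, embedding size, and random data are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class TabularDataset(Dataset):
    """Custom Dataset so rows are served one batch at a time instead of all at once."""
    def __init__(self, categorical, numeric, targets):
        self.categorical, self.numeric, self.targets = categorical, numeric, targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.categorical[idx], self.numeric[idx], self.targets[idx]

class EntityEmbeddingModel(nn.Module):
    """Each categorical column gets its own nn.Embedding instead of a one-hot vector."""
    def __init__(self, cardinalities, embed_dim=8, num_numeric=3):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(n, embed_dim) for n in cardinalities])
        self.fc = nn.Linear(embed_dim * len(cardinalities) + num_numeric, 1)

    def forward(self, categorical, numeric):
        embedded = [emb(categorical[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.fc(torch.cat(embedded + [numeric], dim=1))

# Illustrative example: two categorical columns (10 and 4 levels) plus three numeric columns.
cats = torch.stack([torch.randint(0, 10, (100,)), torch.randint(0, 4, (100,))], dim=1)
ds = TabularDataset(cats, torch.randn(100, 3), torch.randn(100, 1))
loader = DataLoader(ds, batch_size=32, shuffle=True)

model = EntityEmbeddingModel(cardinalities=[10, 4])
cat_batch, num_batch, y_batch = next(iter(loader))
preds = model(cat_batch, num_batch)
```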

Semantic Search

Published:

Natural language processing and computer vision methods generate high-dimensional vectors that represent text and images, yet traditional databases queried with languages like SQL are not adapted to these new representations. Given enough text and media, this information can quickly encompass billions of vectors. Finding similar entries means finding similar high-dimensional vectors, which is inefficient and likely impossible with standard query languages. Similarity search fills this void by searching for similar vectors, i.e., those nearby in Euclidean space. We can leverage similarity search algorithms once our vectors are generated by deep learning algorithms. In this post, we will use Faiss (Facebook AI Similarity Search).
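A minimal sketch using Faiss's exact L2 index; the dimensionality and the random vectors stand in for real embeddings produced by a deep learning model.

```python
import faiss
import numpy as np

d = 768                                                   # dimensionality of our embeddings
vectors = np.random.rand(10_000, d).astype("float32")     # stand-in for real document vectors

index = faiss.IndexFlatL2(d)       # exact search by Euclidean (L2) distance
index.add(vectors)                 # store all vectors in the index

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, k=5)   # the 5 nearest stored vectors
print(ids[0], distances[0])
```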