Posts by Tags

API

Nutritionalcart

How healthy is the average Instacart user? Are certain types of food buyers (e.g., vegetarians, carnivores) healthier than others? I bring new data to bear on these questions to better understand how healthy the average Instacart user is and what health benefits are afforded to Instacart users who choose some types of food (e.g., plant-based, meat-based) over others. To determine the relative health of Instacart users, I matched the top 10 most-ordered products in each aisle with USDA nutrient data, using the USDA’s API, which returns results as JavaScript Object Notation (JSON). To view this project, please click here. An upgraded algorithm that searches the USDA database more effectively can be found here.
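
The post doesn’t reproduce the matching code, but the lookup it describes might look something like this minimal sketch against the USDA FoodData Central search endpoint; the DEMO_KEY, helper name, and top-match heuristic are illustrative assumptions, not the project’s actual code:

```python
import requests

# USDA FoodData Central search endpoint (DEMO_KEY is a public, rate-limited key).
API_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"

def nutrient_profile(product_name: str, api_key: str = "DEMO_KEY") -> dict:
    """Return the nutrient list for the top USDA match of a product name."""
    resp = requests.get(API_URL, params={"api_key": api_key, "query": product_name})
    resp.raise_for_status()
    foods = resp.json().get("foods", [])
    if not foods:
        return {}
    top = foods[0]  # take the best-ranked match as the product's proxy
    return {n["nutrientName"]: n["value"] for n in top.get("foodNutrients", [])}

print(nutrient_profile("banana"))
```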

Applied Machine Learning

NLP: Natural Language Propaganda

Who are the targets of insurgent propaganda? I investigate the ability to classify the targets (e.g., the U.S. or Kabul) of insurgent propaganda messages using a novel corpus containing over 11,000 Taliban statements from 2014 to 2020. In experiments with Convolutional Neural Network (CNN) and transformer architectures, I demonstrate that the audiences of insurgent messages are best captured by transformers, likely owing to their encoder-decoder architecture. This paper’s contribution is twofold: First, it offers a novel data set with utility in machine learning classification and summarization tasks. Second, it suggests that since the audience of messaging can be reliably identified, analysts have new opportunities to look more closely at contrasts in language to better understand the targets of information.
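
A transformer audience classifier of the kind described can be sketched as follows; the checkpoint, label set, and inference helper are illustrative placeholders, not the paper’s actual configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example audience classes drawn from the abstract; the real label set may differ.
LABELS = ["U.S.", "Kabul"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

def classify_audience(statement: str) -> str:
    """Predict which audience a propaganda statement targets."""
    inputs = tokenizer(statement, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```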

BERT

BERT-Vision

What compression methods can extract regularity from BERT during fine-tuning? Drawing on research that demonstrates the utility of information found across all of BERT’s layers, we propose a compression method, BERT-Vision, that captures the regularities produced by BERT during fine-tuning. BERT-Vision’s contribution is twofold: First, we show that compression during fine-tuning can yield comparable, and sometimes better, performance than BERT; second, we show that this performance is realized with a model 209x smaller than BERT in parameter count. To view this project, please click here.
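
The core ingredient, collecting activations from every BERT layer rather than only the last one, can be sketched as follows; the checkpoint and the stacking step are illustrative, and this is not BERT-Vision’s actual compression head:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# output_hidden_states=True exposes every layer, not just the final one.
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("an example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (embeddings + 12 layers), each [batch, seq, 768].
# Stacking them gives a [batch, 13, seq, 768] tensor a small head can compress.
all_layers = torch.stack(outputs.hidden_states, dim=1)
print(all_layers.shape)
```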

Causal Inference

Typos: A Survey Experiment

Command of language is one of the most significant cognitive abilities we possess and is often the most pervasive signal we encounter in a social media setting. When we notice overt and unintentional grammatical errors in social media posts, do we make unconscious assumptions about the authors’ general intelligence? Do we attribute difficulty with written language to other traits, such as lower verbal acuity or lower overall intelligence? Further, are some categories of grammatical errors more injurious than others, or do we take all these trespasses in stride? To view this project, please click here.

Class Programming

Map Off

Map Off is a game designed to test your geography skills in the United States or around the world. The inspiration for this game comes from my wife, Hannah, because we often test our spatial skills against one another whenever a map is at hand. Now we have access to maps and competition anytime we want.

Databricks

PetaFlights

What accounts for flight delays in the U.S.? This project presents the machine learning end of a large data engineering effort that merged 630 million rows of weather data with 31 million rows of flight data. I use the state of the art in distributed deep learning, leveraging Petastorm, Horovod, and PyTorch to produce a multilayer perceptron model distributed across 8 workers in Databricks. Importantly, I use a novel approach to transform categorical data into continuous features through an embedding table. To view this project, please click here.
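
The embedding-table idea can be sketched in PyTorch as follows; the feature names, cardinalities, and layer sizes are illustrative assumptions, and the Petastorm/Horovod distribution across workers is omitted:

```python
import torch
import torch.nn as nn

class FlightDelayMLP(nn.Module):
    """Toy multilayer perceptron mixing embedded categorical and numeric features."""

    def __init__(self, n_carriers=20, n_airports=400, n_numeric=16, emb_dim=8):
        super().__init__()
        # Each category ID (e.g., carrier, origin airport) maps to a learned
        # dense vector, turning categorical inputs into continuous features.
        self.carrier_emb = nn.Embedding(n_carriers, emb_dim)
        self.airport_emb = nn.Embedding(n_airports, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim + n_numeric, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # delay logit; apply sigmoid in the loss
        )

    def forward(self, carrier_id, airport_id, numeric):
        x = torch.cat(
            [self.carrier_emb(carrier_id), self.airport_emb(airport_id), numeric],
            dim=-1,
        )
        return self.mlp(x)
```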

Experiments

Typos: A Survey Experiment

Command of language is one of the most significant cognitive abilities we possess and is often the most pervasive signal we encounter in a social media setting. When we notice overt and unintentional grammatical errors in social media posts, do we make unconscious assumptions about the authors’ general intelligence? Do we attribute difficulty with written language to other traits, such as lower verbal acuity or lower overall intelligence? Further, are some categories of grammatical errors more injurious than others, or do we take all these trespasses in stride? To view this project, please click here.

Geospatial Analysis

Latent Control: Hidden Markov Models

Who controls territory in civil war? This is a central variable in the research and analysis of civil wars, yet it is incredibly difficult to measure. In this post, I model territorial control as a latent variable: an unobserved variable presumed to cause its observable indicators. This project models the latent variable across the entire country of Afghanistan using sub-national event data, a Hidden Markov Model, Uber’s hexagonal spatial index, and logistic spatial and temporal decay functions to treat data that is serially correlated in time and space. To view this project, please click here.
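
Two of the ingredients, hexagonal indexing and logistic decay, can be sketched as follows; the h3 v4 API is assumed, and the resolution, midpoint, and steepness constants are illustrative stand-ins, not the project’s calibrated values:

```python
import math
import h3  # Uber's hexagonal spatial index (v4 API assumed)

def event_cell(lat: float, lng: float, resolution: int = 5) -> str:
    """Bin an event's coordinates into an H3 hexagon at the given resolution."""
    return h3.latlng_to_cell(lat, lng, resolution)

def logistic_decay(distance_km: float, midpoint_km: float = 50.0,
                   steepness: float = 0.1) -> float:
    """Weight an event's influence on a cell: ~1 nearby, ~0 far away.
    An analogous function handles decay over time."""
    return 1.0 / (1.0 + math.exp(steepness * (distance_km - midpoint_km)))

print(event_cell(34.52, 69.18))                    # a cell covering Kabul
print(logistic_decay(10.0), logistic_decay(120.0))  # near vs. distant event
```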

H3

Latent Control: Hidden Markov Models

Who controls territory in civil war? This is a central variable in the research and analysis of civil wars, yet it is incredibly difficult to measure. In this post, I model territorial control as a latent variable: an unobserved variable presumed to cause its observable indicators. This project models the latent variable across the entire country of Afghanistan using sub-national event data, a Hidden Markov Model, Uber’s hexagonal spatial index, and logistic spatial and temporal decay functions to treat data that is serially correlated in time and space. To view this project, please click here.

Hidden Markov Models

Latent Control: Hidden Markov Models

Who controls territory in civil war? This is a central variable in the research and analysis of civil wars, yet it is incredibly difficult to measure. In this post, I model territorial control as a latent variable: an unobserved variable presumed to cause its observable indicators. This project models the latent variable across the entire country of Afghanistan using sub-national event data, a Hidden Markov Model, Uber’s hexagonal spatial index, and logistic spatial and temporal decay functions to treat data that is serially correlated in time and space. To view this project, please click here.

Horovod

PetaFlights

What accounts for flight delays in the U.S.? This project presents the machine learning end of a large data engineering effort that merged 630 million rows of weather data with 31 million rows of flight data. I use the state of the art in distributed deep learning, leveraging Petastorm, Horovod, and PyTorch to produce a multilayer perceptron model distributed across 8 workers in Databricks. Importantly, I use a novel approach to transform categorical data into continuous features through an embedding table. To view this project, please click here.

Linear Regression

Reexamining Civilian Preferences in Civil War: A Survey in Afghanistan

How do civilians react to changing authority in civil war? I investigate this question in Afghanistan using survey data from The Asia Foundation following the end of U.S.-led combat operations in 2014. I demonstrate clear evidence that civilian attitudes are conditional on a three-way interaction among territorial control, ethnicity, and survival. For instance, there is a notable and statistically significant difference between Pashtuns and non-Pashtuns under Taliban control in their approval of the Afghan Government. I bring largely unused country-wide, individual-level data to bear on analyzing civilian wartime beliefs. To view this research project, please click here.
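
A three-way interaction of this kind is a one-liner in a regression framework; the sketch below uses illustrative variable names and a hypothetical file path, not the survey’s actual columns:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")  # hypothetical path to the survey data

# The formula control * pashtun * survival expands to all main effects,
# all two-way interactions, and the three-way interaction term.
model = smf.ols("approval ~ control * pashtun * survival", data=df).fit()
print(model.summary())
```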

NLP

NLP: Natural Language Propaganda

Who are the targets of insurgent propaganda? I investigate the ability to classify the targets (e.g., the U.S. or Kabul) of insurgent propaganda messages using a novel corpus containing over 11,000 Taliban statements from 2014 to 2020. In experiments with Convolutional Neural Network (CNN) and transformer architectures, I demonstrate that the audiences of insurgent messages are best captured by transformers, likely owing to their encoder-decoder architecture. This paper’s contribution is twofold: First, it offers a novel data set with utility in machine learning classification and summarization tasks. Second, it suggests that since the audience of messaging can be reliably identified, analysts have new opportunities to look more closely at contrasts in language to better understand the targets of information.

Natural Language Processing

BERT-Vision

What compression methods can extract regularity from BERT during fine-tuning? Drawing on research that demonstrates the utility of information found across all of BERT’s layers, we propose a compression method, BERT-Vision, that captures the regularities produced by BERT during fine-tuning. BERT-Vision’s contribution is twofold: First, we show that compression during fine-tuning can yield comparable, and sometimes better, performance than BERT; second, we show that this performance is realized with a model 209x smaller than BERT in parameter count. To view this project, please click here.

Petastorm

PetaFlights

What accounts for flight delays in the U.S.? This project presents the machine learning end of a large data engineering effort that merged 630 million rows of weather data with 31 million rows of flight data. I use the state of the art in distributed deep learning, leveraging Petastorm, Horovod, and PyTorch to produce a multilayer perceptron model distributed across 8 workers in Databricks. Importantly, I use a novel approach to transform categorical data into continuous features through an embedding table. To view this project, please click here.

PyTorch

BERT-Vision

What compression methods can extract regularity from BERT during fine-tuning? Drawing on research that demonstrates the utility of information found across all of BERT’s layers, we propose a compression method, BERT-Vision, that captures the regularities produced by BERT during fine-tuning. BERT-Vision’s contribution is twofold: First, we show that compression during fine-tuning can yield comparable, and sometimes better, performance than BERT; second, we show that this performance is realized with a model 209x smaller than BERT in parameter count. To view this project, please click here.

PetaFlights

What accounts for flight delays in the U.S.? This project presents the machine learning end of a large data engineering effort that merged 630 million rows of weather data with 31 million rows of flight data. I use the state of the art in distributed deep learning, leveraging Petastorm, Horovod, and PyTorch to produce a multilayer perceptron model distributed across 8 workers in Databricks. Importantly, I use a novel approach to transform categorical data into continuous features through an embedding table. To view this project, please click here.

Python

Nutritionalcart

How healthy is the average Instacart user? Are certain types of food buyers (e.g., vegetarians, carnivores) healthier than others? I bring new data to bear on these questions to better understand how healthy the average Instacart user is and what health benefits are afforded to Instacart users who choose some types of food (e.g., plant-based, meat-based) over others. To determine the relative health of Instacart users, I matched the top 10 most-ordered products in each aisle with USDA nutrient data, using the USDA’s API, which returns results as JavaScript Object Notation (JSON). To view this project, please click here. An upgraded algorithm that searches the USDA database more effectively can be found here.

Map Off

Map Off is a game designed to test your geography skills in the United States or around the world. The inspiration for this game comes from my wife, Hannah, because we often test our spatial skills against one another whenever a map is at hand. Now we have access to maps and competition anytime we want.

Spark

PetaFlights

What accounts for flight delays in the U.S.? This project presents the machine learning end of a large data engineering effort that merged 630 million rows of weather data with 31 million rows of flight data. I use the state of the art in distributed deep learning, leveraging Petastorm, Horovod, and PyTorch to produce a multilayer perceptron model distributed across 8 workers in Databricks. Importantly, I use a novel approach to transform categorical data into continuous features through an embedding table. To view this project, please click here.

Survey Data

Reexamining Civilian Preferences in Civil War: A Survey in Afghanistan

How do civilians react to changing authority in civil war? I investigate this question in Afghanistan using survey data from The Asia Foundation following the end of U.S.-led combat operations in 2014. I demonstrate clear evidence that civilian attitudes are conditional on a three-way interaction among territorial control, ethnicity, and survival. For instance, there is a notable and statistically significant difference between Pashtuns and non-Pashtuns under Taliban control in their approval of the Afghan Government. I bring largely unused country-wide, individual-level data to bear on analyzing civilian wartime beliefs. To view this research project, please click here.

Transformers

BERT-Vision

What compression methods can extract regularity from BERT during fine-tuning? Drawing on research that demonstrates the utility of information found across all of BERT’s layers, we propose a compression method, BERT-Vision, that captures the regularities produced by BERT during fine-tuning. BERT-Vision’s contribution is twofold: First, we show that compression during fine-tuning can yield comparable, and sometimes better, performance than BERT; second, we show that this performance is realized with a model 209x smaller than BERT in parameter count. To view this project, please click here.