Stuff (Data)
Word Embedding Visualizer
November 25, 2019
A website that takes user-supplied tokens, looks up their corresponding embeddings, performs PCA, and renders the principal components and their closest descriptions on the front end. Intended as a tool for introducing natural language embeddings to MBA students in MBA 261 (Graduate Marketing Research).
See the README of the GitHub repo for more details.
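As a rough illustration of the lookup and PCA steps described above, here is a minimal dependency-free sketch. It is not the site's actual code: the toy 4-dimensional vectors in `EMBEDDINGS` are made up, every function name is hypothetical, and power iteration with deflation stands in for whatever PCA routine the real backend uses (most likely a library call). The "closest descriptions" step is not covered.

```python
import math
import random

# Made-up toy vectors; real embedding models use hundreds of dimensions.
EMBEDDINGS = {
    "king":  [2.0, 1.0, 0.1, 0.0],
    "queen": [2.0, -1.0, 0.0, 0.1],
    "man":   [-2.0, 1.0, 0.0, 0.1],
    "woman": [-2.0, -1.0, 0.1, 0.0],
}

def lookup(tokens, table):
    """Return the known tokens and their vectors, skipping unknown tokens."""
    known = [t for t in tokens if t in table]
    return known, [table[t] for t in known]

def center(rows):
    """Subtract the per-dimension mean from every vector."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered vectors."""
    n = len(centered)
    dims = range(len(centered[0]))
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in dims]
            for i in dims]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def principal_components(cov, k, iters=1000, seed=0):
    """Top-k (eigenvalue, eigenvector) pairs via power iteration + deflation."""
    rng = random.Random(seed)
    cov = [row[:] for row in cov]  # work on a copy; deflation mutates it
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in cov]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:  # no variance left in the deflated matrix
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append((eigenvalue, v))
        # Remove the component just found so the next pass finds the next one.
        for i in range(len(cov)):
            for j in range(len(cov)):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return components

def project(tokens, table, k=2):
    """Map each known token to its coordinates on the top-k components."""
    known, vectors = lookup(tokens, table)
    if len(known) < 2:
        raise ValueError("need at least two known tokens for PCA")
    centered = center(vectors)
    comps = principal_components(covariance(centered), k)
    coords = [[sum(a * b for a, b in zip(row, v)) for _, v in comps]
              for row in centered]
    return dict(zip(known, coords)), [ev for ev, _ in comps]

coords, variances = project(["king", "queen", "man", "woman", "zzz"], EMBEDDINGS)
for token, (x, y) in coords.items():
    print(f"{token:>6}: ({x:+.2f}, {y:+.2f})")
```

The sign of each component depends on the random starting vector, so a plot built from these coordinates may appear mirrored from run to run with a different seed; distances between tokens are unaffected.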
Tags: NLP, Visualization
  Stuff (Data)
CMU Dict Grapheme to Phoneme Alignments
November 24, 2019
Grapheme-to-phoneme alignments for the Carnegie Mellon Pronouncing Dictionary (CMU Dict) data set. Alignments were produced using Phonetisaurus. I couldn’t find an open version of the most recent set of alignments anywhere, so I produced this one to save other people some time.
A grapheme-to-phoneme alignment maps letters (graphemes) to the sounds they make when spoken (phonemes). For example, the graphemes H, E, LL, O are aligned to the sounds HH, EH0, L, OW1, where the sounds are written in ARPAbet notation.
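To make the idea concrete, the example above can be held as a list of (grapheme chunk, phoneme) pairs. This is only an illustrative in-memory representation, not the file format of the dataset or of Phonetisaurus, and the helper name `strip_stress` is hypothetical.

```python
# The HELLO example from the text: one phoneme per grapheme chunk.
graphemes = ["H", "E", "LL", "O"]
phonemes = ["HH", "EH0", "L", "OW1"]

# Pair each grapheme chunk with the sound it is aligned to.
alignment = list(zip(graphemes, phonemes))

# Joining either side recovers the spelling or the pronunciation.
word = "".join(g for g, _ in alignment)
pronunciation = " ".join(p for _, p in alignment)

def strip_stress(phoneme):
    """Drop the trailing ARPAbet stress digit (0, 1, or 2) from a vowel."""
    return phoneme.rstrip("012")

for grapheme, phoneme in alignment:
    print(f"{grapheme:>2} -> {phoneme} (unstressed form: {strip_stress(phoneme)})")
```

Note that the double letter LL is treated as a single chunk, since both letters together produce one L sound.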
Tags: NLP, Dataset
  Stuff (Data)
Stormfront Posts
November 23, 2019
Stormfront is a white nationalist, white supremacist internet forum. For hopefully obvious reasons, the views expressed in the forum posts in this dataset are not representative of my own.
Compressed size: 1.9GB
Uncompressed size: 5.0GB
Tags: NLP, Dataset
  Stuff (Data)
News Scrapers
November 23, 2019
A handful of site-specific scraping scripts for news and social media websites, written mainly to collect data for word embedding models used in the Neuroeconomics lab.
Tags: NLP, Data Collection, Open Source
  Stuff (Data)
Fox News Articles
November 23, 2019
Fox News is an American pay television news channel. This dataset contains, in CSV format, the text content and metadata of all public web articles published on the Fox News and Fox Business websites between roughly 2008 and January 2019.
Compressed size: 1.6GB
Uncompressed size: 4.9GB
Tags: NLP, Dataset
  Stuff (Professional)
Data Engineer Intern @BOLD
June 2019 - August 2019
I interned at the career service company BOLD over the summer of 2019. This job was a really nice extension of some of the embedding projects I had been doing earlier in the year for research, and it was great to see the technique cropping up in an industry application!
Summary
- Built backend of semantic search and recommendation system for LiveCareer’s job matching service, incorporated into data pipeline
- Worked closely with data science team to build set of offline evaluation measures for information retrieval
- Benchmarked new system against existing one, verified improved relevance and ~80% reduction in preprocessing time
  Stuff (Personal)
VADER Sentiment Analyser
July 5, 2018
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License. This is a port of the original module, which was written in Python. If you’d like to make a contribution, please check out the original author’s work here.
Tags: Rust, Python, NLP, Open Source