Stuff (Data)

Word Embedding Visualizer

November 25, 2019

A website that takes user-supplied tokens, looks up their corresponding embeddings, performs PCA, and renders the principal components and their closest descriptions on the front end. Intended as a tool for introducing natural language embeddings to MBA students in MBA 261 (Graduate Marketing Research).

See the README of the GitHub repo for more details.
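
As a rough sketch of the lookup-then-PCA step described above (not the site's actual code), the snippet below projects a few token vectors onto their top principal components with NumPy. The toy vocabulary, the random vectors standing in for real embeddings, and the helper names `project_tokens` and `closest_words` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary: random vectors stand in for real embeddings.
rng = np.random.default_rng(0)
EMBEDDINGS = {w: rng.normal(size=50)
              for w in ["king", "queen", "apple", "banana", "car"]}


def project_tokens(tokens, embeddings, n_components=2):
    """Look up known tokens and project them onto the top principal axes."""
    known = [t for t in tokens if t in embeddings]  # silently drop unknown tokens
    if len(known) < 2:
        raise ValueError("need at least two known tokens for PCA")
    X = np.stack([embeddings[t] for t in known])
    X_centered = X - X.mean(axis=0)
    # PCA via SVD of the centered data: rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    axes = Vt[:min(n_components, Vt.shape[0])]
    coords = X_centered @ axes.T
    return known, coords, axes


def closest_words(axis, embeddings, top_n=3):
    """Rank vocabulary words by cosine similarity to one principal axis."""
    scores = {
        w: float(v @ axis / (np.linalg.norm(v) * np.linalg.norm(axis)))
        for w, v in embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


known, coords, axes = project_tokens(["king", "queen", "apple", "zzz"], EMBEDDINGS)
for token, xy in zip(known, coords):
    print(token, xy)
print("closest to first axis:", closest_words(axes[0], EMBEDDINGS))
```

Describing each axis by its nearest vocabulary words is one plausible reading of "closest descriptions"; the README is the place to check how the site actually does it.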

Source
Tags: NLP, Visualization

  Stuff (Data)

CMU Dict Grapheme to Phoneme Alignments

November 24, 2019

Grapheme-to-phoneme alignments for the Carnegie Mellon Pronouncing Dictionary data set. Alignments were produced using Phonetisaurus. I couldn’t find an open version of the most recent set of alignments anywhere, so I produced this one to save other people some time.

A grapheme-to-phoneme alignment maps letters (graphemes) to the sounds they make when spoken (phonemes). For example, the graphemes H, E, LL, O are aligned to the sounds HH, EH0, L, OW1, where the sounds are written in ARPAbet notation.
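
To make the HELLO example concrete, here is a minimal sketch of parsing one aligned entry into grapheme/phoneme pairs. It assumes the line uses `}` between graphemes and phonemes, `|` to join multi-symbol chunks, and `_` for skips, which I believe are the Phonetisaurus aligner's defaults; the files in this dataset may be formatted differently, and `parse_alignment` is a hypothetical helper.

```python
def parse_alignment(line, pair_sep="}", multi_sep="|", skip="_"):
    """Split one aligned entry into (grapheme chunk, phoneme list) pairs.

    The separator characters are assumptions about the file format.
    """
    pairs = []
    for chunk in line.split():
        graphemes, phonemes = chunk.split(pair_sep)
        # Join multi-letter chunks such as "L|L" into "LL", dropping skip symbols.
        g = "".join(x for x in graphemes.split(multi_sep) if x != skip)
        p = [x for x in phonemes.split(multi_sep) if x != skip]
        pairs.append((g, p))
    return pairs


pairs = parse_alignment("H}HH E}EH0 L|L}L O}OW1")
print(pairs)
# [('H', ['HH']), ('E', ['EH0']), ('LL', ['L']), ('O', ['OW1'])]
```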

Source
Tags: NLP, Dataset

  Stuff (Data)

Stormfront Posts

November 23, 2019

Stormfront is a white nationalist, white supremacist internet forum. For hopefully obvious reasons, the views expressed in the forum posts in this dataset are not representative of my own.

Compressed size: 1.9GB
Uncompressed size: 5.0GB

Source
Tags: NLP, Dataset

  Stuff (Data)

News Scrapers

November 23, 2019

A handful of site-specific scrapers for news and social media websites. Written mainly to collect data for word embedding models used in the Neuroeconomics lab.

Source
Tags: NLP, Data Collection, Open Source

  Stuff (Data)

Fox News Articles

November 23, 2019

Fox News is an American pay television news channel. This dataset consists of the text content and metadata of all public web articles published between roughly 2008 and January 2019 on the Fox News and Fox Business websites in CSV format.

Compressed size: 1.6GB
Uncompressed size: 4.9GB

Source
Tags: NLP, Dataset

  Stuff (Professional)

Data Engineer Intern @BOLD

June 2019 - August 2019

I interned at the career services company BOLD over the summer of 2019. This job was a really nice extension of some of the embedding projects I was doing earlier in the year for research, and it was great to see the technique cropping up in an industry application!

Summary

  • Built backend of semantic search and recommendation system for LiveCareer’s job matching service, incorporated into data pipeline
  • Worked closely with data science team to build set of offline evaluation measures for information retrieval
  • Benchmarked new system against existing one, verified improved relevance and ~80% reduction in preprocessing time
Tags: Web, NLP

  Stuff (Personal)