Content tagged as NLP

Word Embedding Visualizer

November 25, 2019

Website that takes user-supplied tokens, looks up their corresponding embeddings, performs PCA, and renders the principal components and their closest descriptions in the front end. Intended as a tool for introducing natural language embeddings to MBA students in MBA 261 (Graduate Marketing Research).
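
The lookup-then-PCA flow described above could be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the toy `EMBEDDINGS` table, the function names, and the reading of "closest descriptions" as the vocabulary tokens with the highest cosine similarity to each component are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy embedding table; a real site would load pretrained vectors.
EMBEDDINGS = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.7, 0.9, 0.1]),
    "man":   np.array([0.1, 0.9, 0.0, 0.0]),
    "woman": np.array([0.1, 0.8, 0.9, 0.1]),
    "apple": np.array([0.0, 0.0, 0.1, 0.9]),
}


def lookup(tokens):
    """Return the tokens found in the table and a matrix with one embedding per row."""
    known = [t for t in tokens if t in EMBEDDINGS]
    return known, np.stack([EMBEDDINGS[t] for t in known])


def pca(matrix, n_components=2):
    """Project the rows of `matrix` onto its top principal components using SVD."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components


def closest_tokens(direction, k=2):
    """Describe a component by the k vocabulary tokens most aligned with it (assumed interpretation)."""
    scores = {
        token: float(vec @ direction / (np.linalg.norm(vec) * np.linalg.norm(direction)))
        for token, vec in EMBEDDINGS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    known, matrix = lookup(["king", "queen", "man", "woman", "not-in-vocab"])
    coords, components = pca(matrix)
    for token, (x, y) in zip(known, coords):
        print(f"{token}: ({x:.3f}, {y:.3f})")
    for i, direction in enumerate(components):
        print(f"component {i}: closest tokens {closest_tokens(direction)}")
```

The 2-D coordinates are what a front end would plot; unknown tokens are simply skipped here, which is one of several reasonable choices.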

See the README of the GitHub repo for more details.

Source
Tags: NLP, Visualization

CMU Dict Grapheme to Phoneme Alignments

November 24, 2019

Grapheme-to-phoneme alignments for the Carnegie Mellon Pronouncing Dictionary (CMUdict) data set. Alignments were produced using Phonetisaurus. I couldn’t find an open version of the most recent set of alignments anywhere, so I produced this to save other people some time.

A grapheme-to-phoneme alignment maps letters (graphemes) to the sounds they make when spoken (phonemes). For example, the graphemes H, E, LL, O are aligned to the sounds HH, EH0, L, OW1, where the sounds are written in ARPAbet notation.
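
As a small sketch of what one alignment looks like as data, the HELLO example can be written as a list of (grapheme, phoneme) pairs. The in-memory representation and the helper names below are assumptions for illustration and do not reflect the file format of the published alignments.

```python
from typing import List, Tuple

# One alignment: an ordered list of (grapheme chunk, ARPAbet phoneme) pairs.
Alignment = List[Tuple[str, str]]

# The HELLO example from the text; note that the two-letter chunk "LL" maps to one sound.
HELLO: Alignment = [("H", "HH"), ("E", "EH0"), ("LL", "L"), ("O", "OW1")]


def spelling(alignment: Alignment) -> str:
    """Rebuild the written word by concatenating the grapheme chunks."""
    return "".join(grapheme for grapheme, _ in alignment)


def pronunciation(alignment: Alignment) -> List[str]:
    """Return the phoneme sequence in order."""
    return [phoneme for _, phoneme in alignment]


def strip_stress(phoneme: str) -> str:
    """Drop the trailing ARPAbet stress digit (0, 1, or 2) carried by vowels."""
    return phoneme.rstrip("012")


if __name__ == "__main__":
    print(spelling(HELLO))
    print(pronunciation(HELLO))
    print([strip_stress(p) for p in pronunciation(HELLO)])
```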

Source
Tags: NLP, Dataset

Stormfront Posts

November 23, 2019

Stormfront is a white nationalist, white supremacist internet forum. For hopefully obvious reasons, the views expressed by the forum posts in this dataset are not representative of my own.

Compressed size: 1.9GB
Uncompressed size: 5.0GB

Source
Tags: NLP, Dataset

News Scrapers

November 23, 2019

A handful of site-specific scrapers for news and social media websites. Written mainly to collect data for word embedding models used in the Neuroeconomics lab.

Source
Tags: NLP, Data Collection, Open Source

Fox News Articles

November 23, 2019

Fox News is an American pay television news channel. This dataset consists of the text content and metadata of all public web articles published between roughly 2008 and January 2019 on the Fox News and Fox Business websites in CSV format.

Compressed size: 1.6GB
Uncompressed size: 4.9GB

Source
Tags: NLP, Dataset

Data Engineer Intern @BOLD

June 2019 - August 2019

I interned at the career services company BOLD over the summer of 2019. This job was a really nice extension of some of the embedding projects I had been doing earlier in the year for research, and it was great to see the technique cropping up in an industry application!

Summary

Built backend of semantic search and recommendation system for LiveCareer’s job matching service, incorporated into data pipeline
Worked closely with data science team to build set of offline evaluation measures for information retrieval
Benchmarked new system against existing one, verified improved relevance and ~80% reduction in preprocessing time

Tags: Web, NLP

VADER Sentiment Analyser

July 5, 2018

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License. This is a port of the original module, which was written in Python. If you’d like to make a contribution, please check out the original author’s work here.

Source
Tags: Rust, Python, NLP, Open Source