Stuff (Data)

CMU Dict Grapheme to Phoneme Alignments

November 24, 2019

Grapheme-to-phoneme alignments for Carnegie Mellon’s pronouncing dictionary data set. Alignments were produced using Phonetisaurus. I couldn’t find an open version of the most recent set of alignments anywhere, so I produced this one to save other people some time.

A grapheme-to-phoneme alignment maps letters (graphemes) to the sounds they make when spoken (phonemes). For example, the graphemes H, E, LL, O are aligned to the sounds HH, EH0, L, OW1, where the sounds are written in ARPAbet notation.
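As a rough illustration of the HELLO example, here is a minimal Python sketch. The list-of-pairs representation and the variable names are my own assumptions for illustration; they are not the file format of this dataset or of Phonetisaurus output.

```python
# Hypothetical in-memory form of one alignment (not the dataset's file format).
graphemes = ["H", "E", "LL", "O"]
phonemes = ["HH", "EH0", "L", "OW1"]

# Pair each grapheme chunk with the phoneme it is aligned to.
alignment = list(zip(graphemes, phonemes))

# Joining each side back together recovers the spelling and the pronunciation.
word = "".join(g for g, _ in alignment)
pronunciation = " ".join(p for _, p in alignment)

for g, p in alignment:
    print(f"{g} -> {p}")
```

Note that a single alignment unit can span more than one letter: the two-letter chunk LL maps to the single phoneme L.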

Source
Tags: NLP, Dataset

Stormfront Posts

November 23, 2019

Stormfront is a white nationalist, white supremacist internet forum. For hopefully obvious reasons, the views expressed in the forum posts in this dataset are not representative of my own.

Compressed size: 1.9GB
Uncompressed size: 5.0GB

Source
Tags: NLP, Dataset

Fox News Articles

November 23, 2019

Fox News is an American pay television news channel. This dataset, in CSV format, consists of the text content and metadata of all public web articles published on the Fox News and Fox Business websites between roughly 2008 and January 2019.

Compressed size: 1.6GB
Uncompressed size: 4.9GB

Source
Tags: NLP, Dataset