Stuff (Data)
CMU Dict Grapheme to Phoneme Alignments
November 24, 2019
Grapheme-to-phoneme alignments for Carnegie Mellon’s pronouncing dictionary dataset. Alignments were produced using Phonetisaurus. I couldn’t find an open version of the most recent set of alignments anywhere, so I produced this one to save other people some time.
A grapheme-to-phoneme alignment maps letters (graphemes) to the sounds they make when spoken (phonemes). For example, the graphemes H,E,LL,O are aligned to the sounds HH,EH0,L,OW1, where the sounds are written in ARPAbet notation.
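To make the idea concrete, here is a minimal Python sketch that pairs up the example above. It assumes a one-to-one alignment between comma-separated grapheme chunks and phoneme chunks, as in the HELLO example; the helper name `pair_alignment` is made up for illustration, and the comma-separated strings are not necessarily the format of the files in the download.

```python
def pair_alignment(graphemes, phonemes):
    """Pair each grapheme chunk with its phoneme chunk.

    Assumes a one-to-one alignment, as in the HELLO example;
    this is an illustrative helper, not part of Phonetisaurus.
    """
    if len(graphemes) != len(phonemes):
        raise ValueError("expected equal numbers of grapheme and phoneme chunks")
    return list(zip(graphemes, phonemes))


# The example from the post: "LL" is a single chunk aligned to the sound "L".
graphemes = "H,E,LL,O".split(",")
phonemes = "HH,EH0,L,OW1".split(",")

alignment = pair_alignment(graphemes, phonemes)
for grapheme, phoneme in alignment:
    print(f"{grapheme} -> {phoneme}")
```

Joining the grapheme chunks back together recovers the spelling of the word, which is a handy sanity check when reading alignments.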
Tags: NLP, Dataset
Stormfront Posts
November 23, 2019
Stormfront is a white nationalist, white supremacist internet forum. For hopefully obvious reasons, the views expressed in the forum posts in this dataset are not representative of my own.
Compressed size: 1.9GB
Uncompressed size: 5.0GB
Tags: NLP, Dataset
Fox News Articles
November 23, 2019
Fox News is an American pay television news channel. This dataset consists of the text content and metadata, in CSV format, of all public web articles published between roughly 2008 and January 2019 on the Fox News and Fox Business websites.
Compressed size: 1.6GB
Uncompressed size: 4.9GB
Tags: NLP, Dataset