
Character vs. Word: Comparing and Training Deep Learning Text Generation Models in R
NLP, or Natural Language Processing: using language to generate text
Too $hort
For training data, I selected the lyrics from all the songs on five of Too $hort's albums: "Raw, Uncut And X-Rated" (1986), "Born To Mack" (1987), "Life Is... Too Short" (1988), "Short Dog's In The House" (1990), and "Shorty The Pimp" (1992). To give the model a stronger base, the training dataset was doubled, so each song's lyrics are represented twice. This gave the sample a corpus length of 169,798. It's important to note that this data-doubling technique is bad practice and should be avoided in other applications of machine learning. I am doing it here for two reasons: 1) string repetition is already prevalent across the data sample, with 28% of values being non-unique, and 2) the end goal is to emulate a rhetorical style, so standard statistical considerations are of secondary importance.
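
As a rough sketch of how that corpus could be assembled in R (the file name lyrics.txt is a hypothetical placeholder, not the actual source file, and this base-R approach is an assumption rather than the exact pipeline used here):

# a minimal sketch, assuming the lyrics have been collected into one plain-text file
lyrics <- readLines("lyrics.txt", warn = FALSE)   # "lyrics.txt" is a hypothetical file name
text <- paste(lyrics, collapse = "\n")            # collapse into a single string

# double the training data so each song's lyrics appear twice
text <- paste(text, text, sep = "\n")

nchar(text)   # total corpus length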
Given that this is a basic application of Natural Language Processing with a relatively small sample set, much of the data cleansing and management can be performed ad hoc, addressing individual concerns as they arise (a rough R sketch of these substitutions appears after the list). For example:
Changed all instances of "Too $hort" to "Too Short"
Changed all instances of “beyotch” to “bitch” (used 466 times in the sample)
Omitted all collaboration tags (e.g., "Feat. …")
Removed all instances of specific racial language
Standardized all references to Oakland as "Oakland", so no alternate spellings or slang (e.g., Oaktown, O-Town) are allowed
Removed all verse credits and directional lyrics [things in brackets]
Made all words lower case
After cleansing, the corpus contained 45,295 words.
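
In R, those substitutions might look roughly like the following; the gsub patterns are illustrative assumptions rather than the exact ones applied to the corpus, and the text object carries over from the sketch above.

# illustrative cleanup passes over the raw corpus (patterns are assumptions)
text <- gsub("Too \\$hort", "Too Short", text)               # normalize the artist name
text <- gsub("beyotch", "bitch", text, ignore.case = TRUE)   # standardize slang spelling
text <- gsub("Oaktown|O-Town", "Oakland", text)              # conform Oakland references
text <- gsub("\\[[^]]*\\]", "", text)                        # strip verse credits and directions in brackets
text <- tolower(text)                                        # make all words lower case

# rough word count after cleansing
length(strsplit(text, "\\s+")[[1]])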