Character vs. Word: Comparing and Training Deep Learning Text Generation Models in R

UNDER CONSTRUCTION

 

NLP, or Natural Language Processing

Using language to generate text

Too $hort

For training data, I selected the lyrics from all of the songs on five of Too $hort's albums: "Raw, Uncut And X-Rated" (1986), "Born To Mack" (1987), "Life Is... Too Short" (1988), "Short Dog's In The House" (1990), and "Shorty The Pimp" (1992). To provide a stronger base for the model, the training dataset was doubled (so each song's lyrics are represented twice), giving the sample a corpus length of 169,798. It's important to note that this data-doubling technique is bad practice and should be avoided in other machine learning applications. However, I am doing it here for two reasons: 1) string repetition is already prevalent across the data sample, with 28% non-unique values, and 2) the end goal is to emulate a rhetorical style, so standard statistical considerations are of secondary importance.
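A minimal sketch of how the corpus can be assembled and doubled in R. The post doesn't show the actual loading code, so the file path and the use of `stringr` here are assumptions:

```r
library(stringr)

# Assumed: the lyrics have already been gathered into one plain-text file;
# the path below is a placeholder, not the file used in the original project.
path <- "data/too_short_lyrics.txt"
lyrics_raw <- readChar(path, file.info(path)$size)

# Double the training data so each song's lyrics are represented twice
corpus <- str_c(lyrics_raw, "\n", lyrics_raw)

# Corpus length in characters (the post reports 169,798 after doubling)
str_length(corpus)
```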

Given that this is a basic application of Natural Language Processing with a relatively small sample set, much of the data cleansing and management can be performed ad hoc, addressing individual concerns as they arise (a rough R sketch of these substitutions follows the list). For example:

  • Changed all instances of "Too $hort" to "Too Short"

  • Changed all instances of “beyotch” to “bitch” (used 466 times in the sample)

  • Omitted all collaboration tags (e.g., no "Feat. …")

  • Removed all instances of specific racial language

  • Conformed all Oakland naming conventions to "Oakland", so no alternate spellings or slang (Oaktown, O-Town) are allowed

  • Removed all verse credits and directional lyrics [things in brackets]

  • Made all words lower case

  • Resulting corpus: 45,295 words
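A rough R sketch of the substitutions described above. The exact patterns and order used in the original script aren't shown in the post, so these `gsub()` calls are illustrative; the removal of collaboration tags and specific racial language would be handled with similar (unshown) replacements:

```r
# Apply the ad hoc cleaning steps listed above; patterns are assumptions
clean_lyrics <- function(text) {
  text <- gsub("Too \\$hort", "Too Short", text)   # normalize artist name
  text <- gsub("beyotch", "bitch", text)            # conform slang spelling
  text <- gsub("\\[[^]]*\\]", "", text)             # drop verse credits and [bracketed] directions
  text <- gsub("Oaktown|O-Town", "Oakland", text)   # conform Oakland naming
  tolower(text)                                     # lowercase everything
}

corpus <- clean_lyrics(corpus)
```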
