While many corpora have been analyzed to create word frequency tables for use in lexical and content analyses, little has been done in the realm of user-generated content (UGC) because of the significant variation in its prose. However, to create more accurate processes for determining contextual sentiment in UGC, we believe that one must spend the time to understand and create a UGC corpus. Moreover, to apply accurate analyses to UGC that is limited in length, such as that found among Twitter users, one must begin with the Twitter lexicon.
Having collected more than 30 million Twitter statuses related to the video gaming market, we decided to analyze a segment representing nearly two-thirds of our corpus: those tweets dealing with game titles as their primary topic.
Since UGC in general is characterized by enormous lexical variation, and microbloggers communicate in 140-character bursts with a proclivity to attach URLs and multiple hashtags, we analyzed several thousand individual statuses before proceeding with any data cleansing.
The first step was to allow for the use of single and double quotes in the escaped raw data, which we found were used quite frequently by our population. This affected 3.4 million of the statuses.
We ran several processes that targeted specific norms found in our user base by extracting all hashtags, at symbols (“@”), and URLs. This allowed us to segregate the conversational content from the normal clutter while providing valuable insight into how often this behavior occurs in the given population.
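A minimal sketch of what such an extraction pass might look like. The regular expressions and the function name `split_status` are our illustrative assumptions here, not the exact rules used in the original processing.

```python
import re

# Illustrative patterns only (assumed, not the original pipeline's rules).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def split_status(text):
    """Separate URLs, @mentions, and hashtags from the conversational text."""
    # URLs are removed first so that "@" or "#" inside a URL is not miscounted.
    urls = URL_RE.findall(text)
    text = URL_RE.sub(" ", text)
    mentions = MENTION_RE.findall(text)
    text = MENTION_RE.sub(" ", text)
    hashtags = HASHTAG_RE.findall(text)
    text = HASHTAG_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    content = " ".join(text.split())
    return {
        "content": content,
        "urls": urls,
        "mentions": mentions,
        "hashtags": hashtags,
    }

example = split_status("@bob loving #halo reach tonight http://example.com/x")
print(example["content"])   # loving reach tonight
print(example["hashtags"])  # ['#halo']
```

Keeping the extracted items alongside the cleaned text, rather than discarding them, is what makes it possible to measure how common each behavior is in the population.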
We then created a word candidate frequency hash set, applying several filters to further clean the data. This allowed us to eliminate many lengthy word combinations, as we found most of these were of little contextual value. These process steps reduced our working dataset from 18 GB to 608 MB.
Having created a raw frequency dataset of 4.8 million word candidates representing more than 259 million occurrences, we then removed all remaining non-alphanumeric characters. This exposed many duplicate words, since a given word may have been surrounded by any number of non-alphabetic characters. Upon inspecting the data and running numerous elimination samples, we also decided to eliminate all numeric data at this time, as we found its continued inclusion not statistically meaningful. This resulted in reducing our word candidate frequency data to around 1.5 million.
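The effect of this step can be sketched as follows. This is only an illustration under a few assumptions of ours: tokens are lowercased, only ASCII letters and digits are kept, and "numeric data" means tokens that are purely digits after stripping. The function name `normalize_counts` is hypothetical.

```python
import re
from collections import Counter

# Assumption: anything other than ASCII letters and digits is stripped.
NON_ALNUM_RE = re.compile(r"[^a-z0-9]+")

def normalize_counts(raw_counts):
    """Strip non-alphanumeric characters and merge the duplicates this exposes."""
    merged = Counter()
    for token, count in raw_counts.items():
        word = NON_ALNUM_RE.sub("", token.lower())
        # Skip tokens that are empty or purely numeric after stripping.
        if not word or word.isdigit():
            continue
        merged[word] += count
    return merged

raw = {"halo": 5, "halo!!": 2, "(Halo)": 1, "2010": 4, "***": 3, "xbox360": 2}
clean = normalize_counts(raw)
print(clean["halo"])  # 8
print(len(clean))     # 2
```

Three surface forms of "halo" collapse into one entry, which is the kind of duplicate exposure described above.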
We then manually inspected and processed the ~3600 candidates that had a frequency greater than 3000, combining like words, removing nonsensical strings (e.g., “abababa”), and combining obvious slang with non-slang equivalents (e.g., “willin” with “willing”). These combinations were only done for a handful of obvious words, which typically had ratios of proper spelling to slang in excess of 4:1. This process was completed in three steps: f > 12,500, 5,500 < f <= 12,500, and 3,000 < f <= 5,500.
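The review itself was manual, but the banding and the merge rule could be assisted by something like the sketch below. The band labels, the function names `band` and `merge_variants`, and the treatment of the 4:1 ratio as a hard cutoff are all assumptions made for illustration; the thresholds come from the text.

```python
from collections import Counter

def band(freq):
    """Assign a frequency to one of the three review bands from the text."""
    if freq > 12_500:
        return "high"   # f > 12,500
    if freq > 5_500:
        return "mid"    # 5,500 < f <= 12,500
    if freq > 3_000:
        return "low"    # 3,000 < f <= 5,500
    return None         # at or below 3,000: not manually reviewed

def merge_variants(counts, variant_map, min_ratio=4.0):
    """Fold a slang form into its proper spelling when proper:slang exceeds min_ratio."""
    merged = Counter(counts)
    for slang, proper in variant_map.items():
        if slang in merged and proper in merged:
            if merged[proper] > min_ratio * merged[slang]:
                merged[proper] += merged.pop(slang)
    return merged

# Made-up counts, purely for illustration.
counts = {"willing": 9000, "willin": 2000, "going": 8000, "gonna": 7000}
merged = merge_variants(counts, {"willin": "willing", "gonna": "going"})
print(merged["willing"])    # 11000
print("gonna" in merged)    # True
```

In this toy input "willin" is merged because the ratio is 4.5:1, while "gonna" is left alone because it is nearly as common as "going".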
These manually processed datasets represented more than 110 million of the 136 million occurrences of our word candidates. As expected, the remaining 26 million occurrences resided in more than 2.1 million remaining word candidates.
All of the manually pre-processed frequency candidates were then combined, forming a unique word set in which these ~3600 words represent a bit more than 80% of the total occurrences. A final process that accumulated all of the remaining word candidate frequencies into their respective unique words yielded our final word count of 73,006.
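The "bit more than 80%" figure follows from the rounded counts given above. As a quick check, here is a small sketch; the helper name `coverage` is hypothetical, and the 3,000 threshold is taken from the manual review step.

```python
def coverage(counts, threshold=3000):
    """Share of all occurrences held by words with frequency above threshold."""
    total = sum(counts.values())
    covered = sum(c for c in counts.values() if c > threshold)
    return covered / total if total else 0.0

# Rounded figures from the text: 110 million of 136 million occurrences.
reported_share = 110_000_000 / 136_000_000
print(round(reported_share, 3))  # 0.809

# Toy table showing the same calculation on word-level counts.
toy = {"a": 5000, "b": 4000, "c": 1000}
print(coverage(toy))  # 0.9
```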
We now have a very specific word frequency table for our corpus for use in our sentiment analysis. We were very pleased to find that our initial run against a common adjectives dataset yielded a 94.6% hit rate, showing that our user base is more verbose than not.
|Partial Word Frequency Table|
Some Simple Validation of Expected Values
In looking at the partial word frequency table above, we can walk through some examples that you would expect the data to support.
As all gamers know, Microsoft’s Halo franchise was and is a big hit. So, the word “halo” shows up 1,570,066 times. Well, is everyone still talking about the original title, or are they discussing Halo ODST or Halo Reach? If we search our table for the words “reach” and “odst”, we find 851,967 and 519,133 occurrences, respectively. Therefore, it is pretty safe to conclude that in nearly 1.4 million of the 1.57 million mentions of “halo” (87.3%), they were talking about one or the other. In addition, it would appear that Halo Reach was significantly more popular than Halo ODST.
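The arithmetic behind that estimate, using the three counts quoted above (the variable names are ours):

```python
# Counts quoted from the partial word frequency table.
halo = 1_570_066
reach = 851_967
odst = 519_133

combined = reach + odst
share = combined / halo
ratio = reach / odst

print(combined)               # 1371100
print(round(share * 100, 1))  # 87.3
print(round(ratio, 2))        # 1.64
```

Note that this treats every occurrence of "reach" and "odst" as a Halo mention. Single-word counts cannot confirm that on their own, so the 87.3% is best read as an upper estimate.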
Well, we did it, and finally got around to publishing it here. We think it's pretty cool. Have fun drawing your own conclusions.
Game on, my fellow gamesters!