Here's the link (now that I'm out of bed): http://tweetcloud.icodeforlove.com/
Excellent. I am generating my cloud. I tweet a lot so this is going to take some time but I will upload it when done, just to see. I am extremely curious. I think "I" and "fuck" are going to be popular words for me. Maybe. And probably some food-related ones. Here we go. I just did the last 6 months (March 1 - Sept 2, ok so 7 months) which apparently is about 2/3s of my tweets anyway (2599 out of 3949). I was right with "fuck." Apparently I love things a lot more than I think.
I need to make a correction on this submission. These are the top words that are correlated with, and identifying of, a user's gender, not the 100 most-used words. Therefore males may use the word "shopping" with the same or greater frequency than females, but the perception is that it is identifying of women.
Here's the entirety of the paper. It was a massive examination across different criteria including, but not limited to gender:
This was the description under this graph (there are some additionally interesting ones as well at that link):
And here is a good description of the correlational analysis with another one of their examples, this time with elevation and language as opposed to gender and language:
Similar to word categories, distinguishing open-vocabulary words, phrases, and topics can be identified using ordinary least squares regression. We again take the coefficient of the target explanatory variable as its correlation strength, and we include other variables (e.g. age and gender) as covariates to get the unique effect of the target explanatory variable. Since we explore many features at once, we consider coefficients significant if they are less than a Bonferroni-corrected [76] two-tailed p of 0.001. (I.e., when examining 20,000 features, a passing p-value is less than 0.001 divided by 20,000, which is 5 × 10^-8). Our correlational analysis produces a comprehensive list of the most distinguishing language features for any given attribute, words, phrases, or topics which maximally discriminate a given target variable. For example, when we correlate the target variable geographic elevation with language features (adjusted for gender and age), we find ‘beach’ the most distinguishing feature for low elevation localities, and ‘the mountains’ to be among the most distinguishing features for high elevation localities, (i.e., people in low elevations talk about the beach more, whereas people at high elevations talk about the mountains more). Similarly, we find the most distinguishing topics to be (beach, sand, sun, water, waves, ocean, surf, sea, toes, sandy, surfing, beaches, sunset, Florida, Virginia) for low elevations and (Colorado, heading, headed, leaving, Denver, Kansas, City, Springs, Oklahoma, trip, moving, Iowa, KC, Utah, bound) for high elevations. Others have looked at geographic location [77].
And finally, right below that, they provide a very good (and much needed) explanation for the readability of the graphs:
An analysis over tens of thousands of language features and multiple dimensions results in hundreds of thousands of statistically significant correlations. Visualization is thus critical for their interpretation. We use word clouds [78] to intuitively summarize our results. Unlike most word clouds, which scale word size by their frequency, we scale word size according to the strength of the correlation of the word with the demographic or psychological measurement of interest, and we use color to represent frequency over all subjects; that is, larger words indicate stronger correlations, and darker colors indicate more frequently used words. This provides a clear picture of which words and phrases are most discriminating while not losing track of which ones are the most frequent. Word clouds scaled by frequency are often used to summarize news, a practice that has been critiqued for inaccurately representing articles [79]. Here, we believe the word cloud is an appropriate visualization because the individual words and phrases we depict in it are the actual results we wish to summarize. Further, scaling by correlation coefficient rather than frequency gives clouds that distinguish a given outcome. Word clouds can also be used to represent distinguishing topics. In this case, the size of the word within the topic represents its prevalence among the cluster of words making up the topic. We use the 6 most distinguishing topics and place them on the perimeter of the word clouds for words and phrases. This way, a single figure gives a comprehensive view of the most distinguishing words, phrases, and topics for any given variables of interest. See Figure 3 for an example. To reduce the redundancy of results, we automatically prune language features containing information already provided by a feature with higher correlation. First, we sort language features in order of their correlation with a target variable (such as a personality trait).
Then, for phrases, we use frequency as a proxy for informative value [80], and only include additional phrases if they contain more informative words than previously included phrases with matching words. For example, consider the phrases ‘day’, ‘beautiful day’, and ‘the day’, listed in order of correlation from greatest to least; ‘Beautiful day’ would be kept, because ‘beautiful’ is less frequent than ‘day’ (i.e., it is adding informative value), while ‘the day’ would be dropped because ‘the’ is more frequent than ‘day’ (thus it is not contributing more information than we get from ‘day’). We do a similar pruning for topics: A lower-ranking topic is not displayed if more than 25% of its top 15 words are also contained in the top 15 words of a higher ranking topic. These discarded relationships are still statistically significant, but removing them provides more room in the visualizations for other significant results, making the visualization as a whole more meaningful. Word clouds allow one to easily view the features most correlated with polar outcomes; we use other visualizations to display the variation of correlation of language features with continuous or ordinal dependent variables such as age. A standard time-series plot works well, where the horizontal axis is the dependent variable and the vertical axis represents the standard score of the values produced from feature extraction. When plotting language as a function of age, we fit first-order LOESS regression lines [81] to the age as the x-axis data and standardized frequency as the y-axis data over all users. 
We are able to adjust for gender in the regression model by including it as a covariate when training the LOESS model and then using a neutral gender value when plotting.
We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase ‘sick of’ and the word ‘depressed’), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive ‘my’ when mentioning their ‘wife’ or ‘girlfriend’ more often than females use ‘my’ with ‘husband’ or ‘boyfriend’). To date, this represents the largest study, by an order of magnitude, of language and personality.
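For anyone who wants to poke at the pruning rules from the visualization excerpt, here is a rough Python sketch. It is only my reading of the quoted text, not the authors' code. The function names, the toy word counts, and the exact test for "more informative" (a new word that is rarer than the rarest word already covered by matching phrases) are assumptions. Topics are compared against every higher-ranking topic, which is how the excerpt is worded.

```python
def prune_phrases(ranked_phrases, word_freq):
    """Prune phrases listed from strongest to weakest correlation.

    Assumed reading of the excerpt: a phrase that shares words with
    already-kept phrases survives only if it adds a word that is less
    frequent (so more informative) than the rarest word those phrases
    already contain.
    """
    kept = []
    for phrase in ranked_phrases:
        words = set(phrase.split())
        matches = [k for k in kept if words & set(k.split())]
        if not matches:
            kept.append(phrase)
            continue
        known = set().union(*(set(k.split()) for k in matches))
        new_words = words - known
        rarest_known = min(word_freq[w] for w in known)
        if any(word_freq[w] < rarest_known for w in new_words):
            kept.append(phrase)
    return kept


def prune_topics(ranked_topics, max_overlap=0.25, top_n=15):
    """Drop a topic if more than max_overlap of its top_n words also
    appear in the top_n words of any higher-ranking topic."""
    kept = []
    seen = []  # top-word sets of all higher-ranking topics, kept or not
    for words in ranked_topics:
        top = list(words[:top_n])
        top_set = set(top)
        redundant = bool(top_set) and any(
            len(top_set & earlier) / len(top_set) > max_overlap
            for earlier in seen
        )
        if not redundant:
            kept.append(top)
        seen.append(top_set)
    return kept


# Made-up frequencies, chosen only to mirror the paper's own example.
toy_freq = {"day": 500, "beautiful": 120, "the": 9000}
print(prune_phrases(["day", "beautiful day", "the day"], toy_freq))
# prints ['day', 'beautiful day']
```

With these toy counts, 'the day' is dropped because 'the' is more frequent than anything already shown, which matches the example in the excerpt.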
Figure 3. Words, phrases, and topics most highly distinguishing females and males.
Female language features are shown on top while males below. Size of the word indicates the strength of the correlation; color indicates relative frequency of usage. Underscores (_) connect words of multiword phrases. Words and phrases are in the center; topics, represented as the 15 most prevalent words, surround. (Females and males; correlations adjusted for age; Bonferroni-corrected p < 0.001).
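To make "Bonferroni-corrected" and "adjusted for age" a bit more concrete, here is a small sketch of the two ideas from the correlational-analysis excerpt: dividing the significance level by the number of features tested, and reading off the coefficient of the target variable from a least squares fit that also includes covariates. This is a plain-Python illustration, not the paper's pipeline. The function names and the toy data are made up, standardizing the variables first is my assumption, and computing the p-value itself is left out.

```python
def bonferroni_threshold(alpha, n_features):
    """Per-feature cutoff: the overall alpha divided by the number of tests."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return alpha / n_features


def standardize(values):
    """Convert values to z-scores (assumed step, so coefficients are comparable)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / sd for v in values]


def ols_coefficients(columns, y):
    """Least squares fit of y on the given columns plus an intercept.

    Returns [intercept, b1, b2, ...], solved via the normal equations
    with Gaussian elimination and partial pivoting.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    c = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("columns are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    b = [0.0] * k
    for i in reversed(range(k)):
        s = sum(A[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (c[i] - s) / A[i][i]
    return b


# Threshold from the excerpt: 0.001 spread over 20,000 features.
print(bonferroni_threshold(0.001, 20000))

# Made-up users: how often each says 'beach', plus elevation, age, gender.
beach_rate = [0.9, 0.8, 0.5, 0.1, 0.2, 1.0, 0.1, 0.7]
elevation = [5, 10, 300, 1600, 1700, 20, 2100, 50]
age = [23, 35, 41, 29, 52, 19, 33, 47]
gender = [0, 1, 0, 1, 1, 0, 0, 1]
coeffs = ols_coefficients(
    [standardize(elevation), standardize(age), standardize(gender)],
    standardize(beach_rate),
)
# coeffs[1] is the elevation coefficient with age and gender held fixed.
print(coeffs[1])
```

The point of the covariates is that the elevation coefficient only reflects what elevation explains beyond age and gender, which is what "unique effect of the target explanatory variable" means in the excerpt.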
2. Correlational Analysis.
3. Visualization.
I think you are misinterpreting what they said. Correlation means that males or females are actually more likely to use the word (even if there are, say, many more women than men, so that women use the word more in total), not that they are merely perceived to do so.
I think maybe you forgot to put the link in the title?
http://www.reddit.com/r/dataisbeautiful/comments/1ngi8c/most.../ I'd wager this was the thread and image flagamuffin intended.