Visualizing 10,000 Tweets for #InternationalWomensDay

Saturday. May 09, 2015 (~4m read)

(This is an older post originally generated as RMarkdown and converted to HTML for this site. Formatting and styling may be a bit weird!)

In honor of International Women’s Day and because I’ve always wanted to play with Twitter data, I pulled 10,000 tweets during the day that used the hashtag #InternationalWomensDay. I cleaned things up a bit using pretty standard text mining analytics to generate the following stuff.

What are people saying?

ALL OF THE WORDS

Here’s a simple word cloud reflecting words that appear most often throughout these tweets. In addition to some cleaning, the hashtag itself has been filtered out so everything pictured here comes from the body of each tweet.

This is a handy big picture view of the kinds of things people are mentioning. It’s also indicative of how messy text data can be (e.g. abbreviations, word stems, spelling issues).
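For a sense of what the counting behind a word cloud involves, here is a minimal Python sketch. It is not the code behind this post (which was written in RMarkdown); the stopword list, the cleaning rules, and the sample tweets are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "on", "is", "are", "rt"}
HASHTAG = "#internationalwomensday"

def tokenize(tweet):
    """Lowercase a tweet, drop links and the target hashtag, and split into words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # strip links
    text = text.replace(HASHTAG, " ")          # filter out the hashtag itself
    words = re.findall(r"[a-z']+", text)       # keep alphabetic tokens only
    return [w for w in words if w not in STOPWORDS and len(w) > 1]

def word_frequencies(tweets):
    """Count how often each cleaned word appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts

# Made-up sample tweets, not taken from the actual data set.
sample = [
    "Celebrating the women in science today #InternationalWomensDay",
    "RT Women in science are badass #InternationalWomensDay http://example.com",
]
print(word_frequencies(sample).most_common(3))
```

The resulting counts are what a word cloud library would scale its font sizes by.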

What do people tweet when they tweet about Women?

Here’s a look at words that appear most often with the word “women” across all tweets. The metric is derived from how often each word appears across all tweets relative to “women,” in addition to how often each word appears along with “women” within a single tweet.

Nice to see that badass and science are among the most often mentioned terms. There have definitely been some badass female scientists.
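One metric that behaves this way is the correlation between whether a tweet contains “women” and whether it contains some other word. The Python sketch below assumes that is close to what was used here, which the post does not confirm; the function name and toy data are hypothetical, and tweets are assumed to be already tokenized into sets of words.

```python
import math

def presence_correlation(tweets_tokens, target, other):
    """Pearson correlation between the per-tweet presence (0/1) of two words."""
    n = len(tweets_tokens)
    x = [1 if target in toks else 0 for toks in tweets_tokens]
    y = [1 if other in toks else 0 for toks in tweets_tokens]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a word in every tweet (or none) carries no signal
    return cov / math.sqrt(var_x * var_y)

# Made-up tokenized tweets for illustration.
toy = [
    {"women", "science"},
    {"women", "science"},
    {"happy", "day"},
    {"women", "badass"},
]
for word in ("science", "badass", "happy"):
    print(word, round(presence_correlation(toy, "women", word), 2))
```

Words that mostly show up in the same tweets as “women” score near 1, and words that mostly show up without it score below 0.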

Words dancing with words

Here’s a look at relationships among words that appear very often: in at least 2% of all 10,000 tweets. While 2% doesn’t sound like a lot, only two words appear in more than 10% of all tweets (it should be obvious what they are from the word cloud above), meaning that the majority of words are shared by only a handful of tweets here and there. So this is really a range of 2-10%, which provides a reasonable distribution of words to work with.
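The 2% cutoff amounts to filtering words by the share of tweets they appear in. A rough Python sketch of that filter follows; the function names and the synthetic data are made up for illustration, and tweets are assumed to be already tokenized into lists of words.

```python
from collections import Counter

def document_frequency(tweets_tokens):
    """Share of tweets in which each word appears at least once."""
    df = Counter()
    for toks in tweets_tokens:
        df.update(set(toks))  # count each word once per tweet
    n = len(tweets_tokens)
    return {word: count / n for word, count in df.items()}

def frequent_terms(tweets_tokens, min_share=0.02):
    """Words that appear in at least `min_share` of all tweets."""
    shares = document_frequency(tweets_tokens)
    return sorted(w for w, share in shares.items() if share >= min_share)

# Synthetic example: 100 tweets, with "women" in 50, "day" in 3, "rare" in 1.
tweets = [["women"] for _ in range(50)] + [["other"] for _ in range(50)]
for i in range(3):
    tweets[i].append("day")
tweets[10].append("rare")
print(frequent_terms(tweets))
```

With 10,000 tweets, the same threshold keeps any word found in at least 200 of them.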

This matrix visualizes how related these commonly occurring words are. Brighter non-white colors indicate pairs of words (from the corresponding row and column) that tend to vary together in how frequently they appear across tweets. The outlined squares are particular (hierarchically determined) clusters of words that show similar patterns of frequency across tweets. Along the main diagonal are simply the perfect correlations of words with themselves.

While one or two of the clusters appear to reflect a common theme (e.g. the third being about celebrating International Women’s Day), the others are pretty open to interpretation. In particular, the cluster on the bottom right is more or less a catch-all for word pairs that weakly fit in elsewhere.
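To make the matrix and its clusters a bit more concrete, here is a small Python sketch that correlates per-tweet word counts and then groups words by average-linkage agglomerative clustering on a distance of 1 minus the correlation. The linkage method and distance are assumptions (the post only says the clusters were hierarchically determined), and the function names and toy tweets are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_matrix(tweets_tokens, words):
    """Correlate per-tweet counts for every pair of words."""
    counts = {w: [toks.count(w) for toks in tweets_tokens] for w in words}
    return {a: {b: pearson(counts[a], counts[b]) for b in words} for a in words}

def average_linkage(words, corr, n_clusters):
    """Repeatedly merge the two clusters with the smallest mean distance."""
    clusters = [[w] for w in words]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - corr[a][b] for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = min(candidates, key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Made-up tokenized tweets for illustration.
toy_tweets = [
    ["happy", "day"],
    ["happy", "day"],
    ["women", "science"],
    ["women", "science"],
    ["happy", "day", "women"],
]
vocab = ["happy", "day", "women", "science"]
corr = correlation_matrix(toy_tweets, vocab)
print(average_linkage(vocab, corr, n_clusters=2))
```

Reordering the rows and columns of the matrix so that clustered words sit next to each other is what makes the outlined squares line up along the diagonal.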

All the feels

And of course you can’t look at Twitter data without an obligatory sentiment analysis of some kind. This one is about as basic as it gets, and simply visualizes the distribution of positive and negative sentiments across all these tweets. One way to carry this out (and what’s done here) is to figure out the proportion of words within each tweet that are found in positive and negative word dictionaries (i.e. a sentiment lexicon). Using these proportions, tweets can be scored and the scores plotted as below.

While the distribution looks somewhat neutral overall, tweets actually express a small but meaningful positive sentiment on average, as indicated by the dashed golden line.
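The scoring step can be sketched in a few lines of Python. The two word sets below are tiny made-up stand-ins for a real sentiment lexicon, and scoring a tweet as its positive share minus its negative share is an assumption about how the proportions were combined.

```python
# Hypothetical mini-lexicons; a real analysis would load a published sentiment lexicon.
POSITIVE = {"happy", "celebrate", "love", "strong", "proud"}
NEGATIVE = {"sad", "hate", "unfair", "angry"}

def sentiment_score(tokens):
    """Share of positive words minus share of negative words in one tweet."""
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens) / len(tokens)
    neg = sum(t in NEGATIVE for t in tokens) / len(tokens)
    return pos - neg

# Made-up tokenized tweets for illustration.
toy_sentiment = [
    ["happy", "proud", "women", "today"],
    ["unfair", "pay"],
    ["women", "science"],
]
scores = [sentiment_score(t) for t in toy_sentiment]
average = sum(scores) / len(scores)  # the kind of value a mean line on the plot would mark
print(scores, average)
```

Scores fall between -1 and 1, and a histogram of them across all tweets gives a distribution like the one plotted above.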

Caveats:
This is a pretty small sample as far as tweets go, and perhaps more importantly it’s static. In other words, it only came from a single time range (roughly at the end of the day) and doesn’t reflect trends or changes over the entire day. Constantly querying Twitter’s API is a quick way to get rate-limited, but fortunately Twitter has a streaming API that’s set up (and easy to use) for exactly that purpose.