Sampling Effectively for Creating Word Clouds
Here at Media Cloud, we have over 800 million stories in our database — and we’re adding a million more each day. That’s great news for researchers, journalists, and advocates who want to ask questions about media coverage. But it also means that our database can sometimes be too large to analyze quickly! To make sure our webpages are both fast and useful, we use sampling in a handful of places to show a representative set of the data (rather than all of it). To help you have confidence in the results you are seeing, this blog post evaluates the sampling approach that drives our word clouds and provides evidence for its validity.
A Little Background on Our Word Clouds
One of the main ways we help people understand the media narrative about their issue is by showing them the literal language being employed, i.e. the words themselves. The simplest way to do this is to count the words and then visualize the most frequently used ones. Word clouds are a quick and easy way to see the top words for a topic or term. (To see the word cloud for a query, and download a list of word frequencies, click on the “language” tab on the search results page at explorer.mediacloud.org.)
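The counting step is conceptually very simple. Here's a minimal sketch of how top words can be tallied from a set of sentences; the function name and the toy sentences are our own illustration, not Media Cloud's actual pipeline (which also does things like stemming and stopword removal):

```python
from collections import Counter

def top_words(sentences, n=20):
    """Count word occurrences across sentences and return the n most frequent."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts.most_common(n)

sample = ["North Korea missile test", "missile test sparks debate"]
print(top_words(sample, 3))  # → [('missile', 2), ('test', 2), ('north', 1)]
```

The resulting (word, count) pairs are exactly what a word cloud visualizes: counts map to font size, and in our "ordered" variant, to position as well.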
To show these top words, many of our tools employ what we call an "ordered word cloud" visualization. Like traditional word clouds, each word is sized according to how frequently it is used. However, in these "ordered" word cloud visualizations, rather than laying them out randomly, we list them in order from most-used to least-used.
We find that encoding the frequency of use into both the order and size of the word makes it easier to read and understand. In our Explorer tool you can flip between these two views of the language data by using the "view options" menu that appears underneath each word cloud.
As mentioned, generating this word frequency data for the entire set of results can take too long. That’s why we rely on sampling to generate our word clouds. Instead of analyzing several million stories that contain a term like “North Korea,” we use a random sample of 1,000 sentences* to represent our data. Random sampling is a simple and popular statistical method that lets us survey data in an unbiased way. But how do we know if our 1,000 sentence sample is a good representation of our data?
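Drawing a simple random sample like this is a one-liner in most languages. As an illustration only (the function name and the fallback behavior for small result sets are our own assumptions, not Media Cloud's implementation), a sketch in Python might look like:

```python
import random

def sample_sentences(sentences, k=1000, seed=None):
    """Draw a simple random sample of up to k sentences, without replacement.

    If there are fewer than k sentences, just return them all -- no
    sampling is needed when the full result set is small enough.
    """
    rng = random.Random(seed)  # seed only for reproducible demos
    if len(sentences) <= k:
        return list(sentences)
    return rng.sample(sentences, k)
```

Because each sentence has an equal chance of being picked, the word frequencies in the sample are unbiased estimates of the frequencies in the full result set.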
Evaluating Our Sampling
To measure how good our samples are, we use a method called bootstrapping: we draw many random samples, compute the same statistic on each one, and look at the standard error of the results to see whether they are stable.
In our case, we took 100 random samples of 1,000 sentences for nine popular topics: climate change, deep state, ebola, gun violence, immigration, network neutrality, teen pregnancy, US election, and vaccines. We then repeated the analysis for 100 random samples of 10,000 sentences to see if there was a significant improvement in accuracy with a larger sample size.
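To make the procedure concrete, here is a hedged sketch of that evaluation for a single word, assuming a list of tokenized sentences. The function name, parameters, and toy corpus are our own illustration; following the usual bootstrap convention, the standard deviation of the statistic across the repeated samples serves as the standard-error estimate:

```python
import random
import statistics

def bootstrap_word_se(sentences, word, n_samples=100, sample_size=1000, seed=0):
    """Estimate how stable a word's sampled frequency is.

    Repeatedly draw random samples of `sample_size` sentences, count the
    word's occurrences in each, and report (mean count, standard error,
    standard error as a percent of the mean). The spread of the counts
    across samples is the bootstrap estimate of the standard error.
    """
    rng = random.Random(seed)  # seeded only so demos are reproducible
    counts = []
    for _ in range(n_samples):
        sample = rng.sample(sentences, sample_size)
        counts.append(sum(s.lower().split().count(word) for s in sample))
    mean = statistics.mean(counts)
    se = statistics.stdev(counts)
    return mean, se, 100 * se / mean

# Toy corpus: "apple" appears in 60% of sentences.
corpus = ["apple pie tastes great"] * 3000 + ["banana split"] * 2000
mean, se, pct = bootstrap_word_se(corpus, "apple", n_samples=20, sample_size=500, seed=1)
```

A small standard error relative to the mean (a few percent, as we found for our nine topics) means the word's rank and size in the cloud would barely change from one random sample to the next.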
Mean frequency of top 20 stems for five out of the nine topics tested, with error bars.
For all nine topics tested, the standard error for the top 20 words (in this case, word stems) was low for both the 1,000 and 10,000 sample sizes. On average, the standard error ranged from 2.7 to 7.6 percent of the mean values for the 1,000 sample size, and decreased to 0.4 to 7.5 percent of the mean values for the 10,000 sample size.
Sampling Works Well
Although the standard error decreases when the sample size is increased to 10,000 sentences, the bootstrap method shows that our 1,000 sentence sample is already stable and has relatively little error.
However, for researchers who want to take the time to generate word frequencies from a larger sample, we will soon include an option to use a 10,000 sentence sample for greater accuracy.
* Note that as of May 17th, Media Cloud has transitioned to story-based searching. At the time of analysis, word clouds were based on random samples of sentences that contained the search query. In our current system, word clouds are generated by first taking a random sample of 1,000 (or 10,000) stories that contain the search query, and then returning 1,000 (or 10,000) random sentences that match any term in the query. This change should not have an effect on sampling stability.