When you run a search on Media Cloud, it generates a list of sample stories that match your search query. You can download a CSV with more information about these stories and the media sources that publish them. This guide lists all the columns in the story list CSV and details about the data they hold.
We recommend downloading the sample story list CSV below to follow along.
Story List CSV Columns
This is the internal unique identification number that our system has assigned to the story. This is useful if you are trying to connect the same story across different datasets.
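As a minimal sketch of connecting the same story across datasets, the snippet below joins two small CSVs on the story ID using Python's standard library. The column names (`stories_id`, `title`, `note`) and the sample rows are assumptions made for illustration; check the header row of your own download for the actual names.

```python
import csv
import io

# Hypothetical data: column names and rows are invented for illustration.
STORY_LIST = "stories_id,title\n101,First story\n102,Second story\n"
ANNOTATIONS = "stories_id,note\n102,coded as relevant\n103,not in story list\n"


def index_by_story_id(csv_text, id_column="stories_id"):
    """Map each story ID to its row (a dict of column name -> value)."""
    return {row[id_column]: row for row in csv.DictReader(io.StringIO(csv_text))}


def join_on_story_id(left_text, right_text, id_column="stories_id"):
    """Inner join: keep only the stories that appear in both datasets."""
    left = index_by_story_id(left_text, id_column)
    right = index_by_story_id(right_text, id_column)
    return [{**left[sid], **right[sid]} for sid in left if sid in right]


joined = join_on_story_id(STORY_LIST, ANNOTATIONS)
```

Here only story 102 appears in both inputs, so `joined` holds a single merged row. The IDs are compared as strings, which avoids surprises if one dataset stores them with different numeric formatting.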
This is the date we think the story was published. This is based on a set of heuristics for parsing the HTML content of the webpage of the story to find and extract a date (the source code for this methodology is available in our date_guesser library). The format of the date is "yyyy-mm-dd hh:mm:ss" (e.g., "2018-03-11 03:00:42").
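The date format above maps directly onto a `strptime` pattern. The following sketch parses a cell in that format; the function name is hypothetical, and returning `None` for an empty or malformed cell is an assumption about how you may want to handle missing dates, not documented behavior of the export.

```python
from datetime import datetime

# "yyyy-mm-dd hh:mm:ss" expressed as a strptime pattern.
PUBLISH_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_publish_date(value):
    """Return a naive datetime, or None if the cell is empty or malformed."""
    text = (value or "").strip()
    if not text:
        return None
    try:
        return datetime.strptime(text, PUBLISH_DATE_FORMAT)
    except ValueError:
        return None


parsed = parse_publish_date("2018-03-11 03:00:42")
```

The resulting `datetime` carries no time zone, because the format itself does not include one.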
This is the title that we have extracted from the HTML content of the webpage for the story.
This is the URL that we extracted the story from. Note that sometimes media sources will publish the same story text at different URLs (due to syndication, revisions, or URL redirecting).
This is the language that we think the story is in, written in the form of a 2-letter ISO 639-1 standard code. We algorithmically determine the story’s language via our language detection system. If this column is empty or says "none," we do not have enough text to make a good judgement about the language of the story.
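Because an empty cell and the literal value "none" both mean that no language was detected, it can help to collapse them into a single missing value before analysis. This is a small sketch with a hypothetical helper name; matching "none" case-insensitively is an assumption.

```python
def normalize_language(value):
    """Return the language code in lowercase, or None when no language was detected."""
    text = (value or "").strip().lower()
    if text in ("", "none"):
        return None
    return text


codes = [normalize_language(v) for v in ["en", "none", "", "ES"]]
```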
This column indicates whether we think the story was ever published by the AP, or Associated Press, even if the story’s URL is from a different source. If it says “true,” our detection process indicates that the story was syndicated from the AP. If it says “false,” we think the story was not syndicated from the AP.
This is a comma-separated list of the theme(s) we have detected in the story. We run all our English stories through a set of trained models to detect what theme(s) they focus on. To build these models, we took a transfer learning approach: we started with the Google News word2vec models and then adapted them to produce descriptors based on the New York Times Annotated Corpus. We score each story against the 600 most common descriptors from the NYT corpus. Any descriptor that scores above 0.2 probability is listed as a theme for the story.
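The thresholding rule can be sketched as follows. This is not the production model code: the scores and descriptor names are invented for illustration, and the helper names are hypothetical. The second helper splits a themes cell back into a list, which assumes that no descriptor name itself contains a comma.

```python
THEME_THRESHOLD = 0.2  # probability cutoff described in the text


def select_themes(scores, threshold=THEME_THRESHOLD):
    """Keep descriptors scoring strictly above the threshold, highest score first."""
    kept = [(name, p) for name, p in scores.items() if p > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


def parse_themes_cell(cell):
    """Split a comma-separated themes cell into a list of trimmed names."""
    return [part.strip() for part in (cell or "").split(",") if part.strip()]


# Hypothetical model output for one story.
scores = {
    "elections": 0.35,
    "politics and government": 0.61,
    "medicine and health": 0.2,
    "sports": 0.05,
}
themes = select_themes(scores)
```

A descriptor scoring exactly 0.2 is excluded here because the rule says "above 0.2"; whether the real system treats the boundary the same way is an assumption.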
This is the unique internal identification number that our system has assigned to the media source that published the story.
This is the name of the media source that published the story.
This is the URL of the media source that published the story.
This is the country that the media source is published in. The country is written in the form of an ISO 3166-1 alpha-3 standard code.
This indicates the geographic subdivision (i.e., state/province/region) that the media source is published in. The subdivision is written in the form of an ISO 3166-2 standard code.
This is the main language the media source publishes in, written in the form of a 2-letter ISO 639-1 standard code. This is algorithmically determined by our language detection system. If this column is empty or says "none," we do not have enough stories to make a good judgement about the primary language.
This is the main country that the media source publishes content about. This is algorithmically determined by our geo-parsing and geo-location engine. Countries are represented by their full official name. If this column is empty or says "none," we do not have enough stories to make a good judgement about what the main country of focus is for the media source.
This indicates the type of the media source. This is a fixed taxonomy of types of media sources that we created in collaboration with the Media Cloud community. The values are:
print_native: This source is primarily a print publication. Newspapers and magazines are in this category. Examples: New York Times, The Economist.
digital_native: This source is internet based. News sources that began on the internet first, organizational websites, and blogs are in this category. Examples: CDC, Vox, Scroll.in.
video_broadcast: This source is primarily a broadcast TV station, so its text comes from video transcriptions or closed captions. Examples: CNN, Fox News.
audio_broadcast: This source is primarily a broadcast radio station or podcast (i.e., audio transcriptions). Examples: NPR.
other: This source doesn't fit in any of the other categories. Examples: AP, Reuters.
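Since the taxonomy is fixed, a quick tally by media type is also a convenient way to spot empty or unexpected values. The sketch below uses the five values listed above; the function name and the "unrecognized" bucket are assumptions for illustration.

```python
from collections import Counter

# The five values of the fixed media type taxonomy.
MEDIA_TYPES = (
    "print_native",
    "digital_native",
    "video_broadcast",
    "audio_broadcast",
    "other",
)


def count_media_types(values):
    """Tally media types, grouping empty or unknown values as 'unrecognized'."""
    counts = Counter()
    for value in values:
        label = (value or "").strip()
        counts[label if label in MEDIA_TYPES else "unrecognized"] += 1
    return counts


counts = count_media_types(
    ["print_native", "digital_native", "print_native", "", "podcast"]
)
```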
Additional Columns on Topic Mapper's Download
Downloading a story list from our Topic Mapper tool provides additional information about each story, namely link counts and Facebook share information. The additional columns are:
This is the number of inlinks that the story received from other stories in the corpus.
This is the number of Facebook shares that the story received.
This is the number of outlinks to other stories or media that are contained within the story.
This is the number of inlinks from unique media sources that the story received.
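To illustrate how the link columns differ, the sketch below derives them from a small, invented list of story-to-story links. The tuple layout and function name are hypothetical, and this is not how Topic Mapper computes the values internally. In particular, it only counts links between stories, whereas the outlink column described above also covers links to other media.

```python
from collections import defaultdict

# Hypothetical links: (linking story ID, linking story's media ID, linked story ID).
LINKS = [
    (1, "A", 3),
    (2, "A", 3),
    (4, "B", 3),
    (3, "C", 1),
]


def link_counts(links):
    """Return (inlinks, outlinks, media_inlinks) dicts keyed by story ID."""
    inlinks = defaultdict(int)
    outlinks = defaultdict(int)
    linking_media = defaultdict(set)
    for src_story, src_media, dst_story in links:
        outlinks[src_story] += 1
        inlinks[dst_story] += 1
        # A set keeps each linking media source only once per target story.
        linking_media[dst_story].add(src_media)
    media_inlinks = {story: len(media) for story, media in linking_media.items()}
    return dict(inlinks), dict(outlinks), media_inlinks


inlinks, outlinks, media_inlinks = link_counts(LINKS)
```

Story 3 receives three inlinks, but two of them come from stories published by the same media source, so its count of inlinks from unique media sources is two.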