Media Cloud

Story List CSV Download

About

When you run a search on Media Cloud, it generates a list of sample stories that match your search query. You can download a CSV with more information about these stories and the media sources that publish them. This guide lists all the columns in the story list CSV and details about the data they hold.

We recommend downloading the sample story list CSV below to follow along. 
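
As a rough way to follow along in Python, the sketch below reads a story list CSV with the standard csv module and reports any columns from this guide that are missing from the header. The small sample text is made up for illustration, and the order of EXPECTED_COLUMNS simply mirrors this guide; a real download may order its columns differently.

```python
import csv
import io

# Column names as listed in this guide (the order is just the guide's order).
EXPECTED_COLUMNS = [
    "stories_id", "publish_date", "title", "url", "language",
    "ap_syndication", "themes", "media_id", "media_name", "media_url",
    "media_pub_country", "media_pub_state", "media_language",
    "media_about_country", "media_media_type",
]

def missing_columns(csv_file):
    """Return the guide's column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in EXPECTED_COLUMNS if name not in header]

# Made-up sample with only three columns, standing in for a real download.
sample = io.StringIO("stories_id,title,url\n1,Example,https://example.com/a\n")
missing = missing_columns(sample)
```

With a downloaded file, an opened file object can be passed in place of the StringIO object; the file's path and encoding are not specified by this guide and would need to be checked.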

Story List CSV Columns

stories_id

This is the internal unique identification number that our system has assigned to the story. This is useful if you are trying to connect the same story across different datasets.
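
One way to connect stories across datasets is to index each CSV by stories_id and intersect the keys. The sketch below uses two tiny made-up exports; the helper name index_by_story_id is hypothetical, and the IDs are kept as strings exactly as they appear in the file.

```python
import csv
import io

def index_by_story_id(csv_text):
    """Map stories_id -> row dict for one story list CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["stories_id"]: row for row in reader}

# Two made-up exports that share one story (ID 102).
first = "stories_id,title\n101,Alpha\n102,Beta\n"
second = "stories_id,title\n102,Beta\n103,Gamma\n"

first_index = index_by_story_id(first)
second_index = index_by_story_id(second)

# Stories that appear in both exports.
shared_ids = sorted(first_index.keys() & second_index.keys())
```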

publish_date

This is our best estimate of the date the story was published. It is based on a set of heuristics that parse the HTML content of the story's webpage to find and extract a date (the source code for this methodology is available in our date_guesser library). The format of the date is "yyyy-mm-dd hh:mm:ss" (e.g., "2018-03-11 03:00:42").
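
The date format above maps directly onto a strptime pattern. The following is a minimal sketch; treating an empty cell as a missing date is an assumption, and the guide does not state a time zone, so the result is a naive datetime.

```python
from datetime import datetime

# "yyyy-mm-dd hh:mm:ss" expressed as a strptime pattern.
PUBLISH_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_publish_date(value):
    """Parse a publish_date cell; return None for an empty cell (assumption)."""
    cleaned = (value or "").strip()
    if not cleaned:
        return None
    return datetime.strptime(cleaned, PUBLISH_DATE_FORMAT)

parsed = parse_publish_date("2018-03-11 03:00:42")
```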

title

This is the title that we have extracted from the HTML content of the webpage for the story.

url

This is the URL that we extracted the story from. Note that sometimes media sources will publish the same story text at different URLs (due to syndication, revisions, or URL redirecting).

language

This is the language that we think the story is in, written in the form of a 2-letter ISO 639-1 standard code. We algorithmically determine the story’s language via our language detection system. If this column is empty or says "none," we do not have enough text to make a good judgement about the language of the story.
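
When processing the file, it can help to fold both "no value" spellings into one. The sketch below treats an empty cell and "none" the same way; the helper name is hypothetical, and matching "none" case-insensitively is an assumption.

```python
def normalize_optional(value):
    """Return None when a cell is empty or says "none"; else the trimmed text."""
    cleaned = (value or "").strip()
    if cleaned == "" or cleaned.lower() == "none":
        return None
    return cleaned

language = normalize_optional("en")
unknown = normalize_optional("none")
```

The same helper would fit the media_language and media_about_country columns, which use the same empty or "none" convention.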

ap_syndication

This column indicates whether we think the story was ever published by the AP, or Associated Press, even if the story’s URL is from a different source. If it says “true,” our detection process indicates that the story was syndicated from the AP. If it says “false,” we think the story was not syndicated from the AP.
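
Since the column holds the text "true" or "false", it needs converting before it can be used as a boolean. This is a minimal sketch; accepting any letter case and returning None for other values are assumptions.

```python
def parse_ap_syndication(value):
    """Convert an ap_syndication cell to True, False, or None if unrecognized."""
    cleaned = (value or "").strip().lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None  # Anything else is treated as unknown (assumption).

is_syndicated = parse_ap_syndication("true")
```

Note that bool("false") is True in Python, which is why the text is compared explicitly.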

themes

This is a comma-separated list of the theme(s) we have detected in the story. We run all our English stories through a set of trained models to detect which theme(s) they focus on. To build these models, we used transfer learning: we started with the Google News word2vec models and then adapted them based on the New York Times annotated corpus. We score each story against the 600 most common descriptors from the NYT corpus. Any descriptor that scores above a probability of 0.2 is listed as a theme for the story.
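
The final thresholding step, and reading the resulting column back, might be sketched as follows. The scores and theme names are made up, both function names are hypothetical, and the split assumes that no theme name itself contains a comma, which may not hold for every descriptor. This is not the models' actual code.

```python
THEME_THRESHOLD = 0.2  # Probability cutoff described in this guide.

def select_themes(scores, threshold=THEME_THRESHOLD):
    """Keep descriptors whose probability is strictly above the threshold."""
    return [name for name, probability in scores.items() if probability > threshold]

def split_themes(cell):
    """Split a themes cell into a list (assumes no commas inside theme names)."""
    return [theme.strip() for theme in (cell or "").split(",") if theme.strip()]

# Made-up scores for one story.
scores = {"politics": 0.61, "sports": 0.05, "elections": 0.2}
themes = select_themes(scores)
```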

media_id

This is the unique internal identification number that our system has assigned to the media source that published the story.

media_name

This is the name of the media source that published the story.

media_url

This is the URL of the media source that published the story.

media_pub_country

This is the country that the media source is published in. The country is written in the form of an ISO 3166-1 alpha-3 standard code.

media_pub_state

This indicates the geographic subdivision (i.e., state/province/region) that the media source is published in. The subdivision is written in the form of an ISO 3166-2 standard code.
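
A quick shape check can catch values that are not codes at all. The sketch below only tests the general form of the two standards (three letters for ISO 3166-1 alpha-3; a two-letter country code, a hyphen, and one to three letters or digits for ISO 3166-2). It does not confirm that a code is actually assigned, and it assumes the CSV stores codes in uppercase.

```python
import re

ALPHA3_PATTERN = re.compile(r"[A-Z]{3}")
SUBDIVISION_PATTERN = re.compile(r"[A-Z]{2}-[A-Z0-9]{1,3}")

def looks_like_alpha3(value):
    """True if the value has the shape of an ISO 3166-1 alpha-3 code."""
    return bool(ALPHA3_PATTERN.fullmatch((value or "").strip()))

def looks_like_subdivision(value):
    """True if the value has the shape of an ISO 3166-2 subdivision code."""
    return bool(SUBDIVISION_PATTERN.fullmatch((value or "").strip()))

country_ok = looks_like_alpha3("USA")
state_ok = looks_like_subdivision("US-MA")
```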

media_language

This is the main language the media source publishes in, written in the form of a 2-letter ISO 639-1 standard code. This is algorithmically determined by our language detection system. If this column is empty or says "none," we do not have enough stories to make a good judgement about the primary language.

media_about_country

This is the main country that the media source publishes content about. This is algorithmically determined by our geo-parsing and geo-location engine. Countries are represented by their full official name. If this column is empty or says "none," we do not have enough stories to make a good judgement about what the main country of focus is for the media source.

media_media_type

This indicates the type of the media source. This is a fixed taxonomy of types of media sources that we created in collaboration with the Media Cloud community. The values are:

  • print_native: This source is primarily a print publication. Newspapers and magazines are in this category. Examples: New York Times, The Economist.

  • digital_native: This source is internet based. News sources that began on the internet first, organizational websites, and blogs are in this category. Examples: CDC, Vox, Scroll.in.

  • video_broadcast: This source is primarily a broadcast TV station; its content comes from video transcriptions or closed captions. Examples: CNN, Fox News.

  • audio_broadcast: This source is primarily a broadcast radio station or podcast; its content comes from audio transcriptions. Examples: NPR.

  • other: This source doesn't fit in any of the other categories. Examples: AP, Reuters.
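
Because the taxonomy is fixed, it can be modeled as an enumeration when working with the CSV. This is a minimal sketch; the class and function names are hypothetical, and returning None for an empty or unrecognized value is an assumption.

```python
from enum import Enum

class MediaType(Enum):
    """The media source types listed in this guide."""
    PRINT_NATIVE = "print_native"
    DIGITAL_NATIVE = "digital_native"
    VIDEO_BROADCAST = "video_broadcast"
    AUDIO_BROADCAST = "audio_broadcast"
    OTHER = "other"

def parse_media_type(value):
    """Return the matching MediaType, or None if the cell is not in the taxonomy."""
    try:
        return MediaType((value or "").strip())
    except ValueError:
        return None

media_type = parse_media_type("digital_native")
```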