Media Cloud is a suite of technologies that allow researchers to answer quantitative questions about the content of online media. As an academic research project, Media Cloud is fully committed to being an open source project. This means all our software is written "out loud": in public for you to view, engage with, and contribute to. The source code for our core engine, web-based research support platforms, and many connected libraries is all on GitHub.
The Core Engine
Our core engine collects content and provides web-based tools for doing research on it. People have spun up their own installations of Media Cloud to do their own research, but it is far easier to just add your content and needs to our main hosted installation (so others can benefit as well).
Our core application is a pipeline that collects stories from across the web, processes them, stores them, and makes them available via an API. This is a large amount of Perl and Python code, connected to Postgres and Solr databases.
Online Web Applications
While we've developed our core engine, a number of smaller projects have spun off as useful utilities that others can use, with or without Media Cloud. We've published those back to the community.
We do entity-extraction and geoparsing via our CLIFF-CLAVIN tool. We built it to identify and disambiguate references to places in news articles. This is written in Java and builds on top of the CLAVIN project.
Media Cloud API Client
Researchers who want lower-level access to the data Media Cloud provides can use our Python API client library. This is the library all of our online web applications use, and what we use internally to drive research in Jupyter notebooks.
The main way Media Cloud ingests stories is by fetching RSS feeds. For each source we track a list of feeds to pull stories from. Feed Seeker is a Python library for discovering any RSS, Atom, XML, and RDF feeds that might be associated with an arbitrary web URL.
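One common way to discover feeds is to look for feed autodiscovery links in a page's HTML. The sketch below illustrates that general idea using only the Python standard library. It is not Feed Seeker's actual implementation or API; the names `FeedLinkParser`, `discover_feeds`, and `FEED_TYPES` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly advertised for feeds (an illustrative, assumed list).
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/rdf+xml",
    "application/xml",
    "text/xml",
}


class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate"> tags whose type looks like a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel_values = attr.get("rel", "").lower().split()
        is_feed_type = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel_values and is_feed_type and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            url = urljoin(self.base_url, attr["href"])
            if url not in self.feeds:
                self.feeds.append(url)


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML, in page order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

A caller would fetch the page itself and pass the HTML text along with the page URL, for example `discover_feeds(page_html, "https://example.com/news/")`. Pages that do not advertise their feeds this way need other heuristics, which this sketch does not attempt.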
Determining the date of content published on the web is a hard problem. This is a Python library to extract a publication date from a web page, along with a measure of the accuracy.
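To make the problem concrete, here is a minimal sketch of two simple signals a date guesser might combine: a publication-time meta tag in the HTML and a date embedded in the URL path. It is not the library's actual algorithm or API. The function `guess_date`, the helper names, and the confidence labels "high", "medium", and "none" are assumptions made for this example.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Matches paths such as /2018/12/31/ (assumed URL convention).
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")
# Matches the leading YYYY-MM-DD of an ISO 8601 timestamp.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


class _MetaDateParser(HTMLParser):
    """Records the first <meta property="article:published_time"> value."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.published is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "article:published_time":
            self.published = attr.get("content")


def _to_date(match):
    """Build a date from three regex groups, or None if it is not a real date."""
    try:
        return date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
    except ValueError:
        return None


def guess_date(url, html):
    """Return (date or None, confidence label) for a page."""
    parser = _MetaDateParser()
    parser.feed(html)
    if parser.published:
        match = ISO_DATE.match(parser.published)
        guessed = _to_date(match) if match else None
        if guessed:
            return guessed, "high"  # explicit metadata, treated as most reliable here
    match = URL_DATE.search(url)
    guessed = _to_date(match) if match else None
    if guessed:
        return guessed, "medium"  # URL dates can reflect edits or archive paths
    return None, "none"
```

Ranking the meta tag above the URL is a design choice of this sketch, not a claim about how the library weighs its signals.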
Media Cloud supports many languages. This Python library lets us stem content in the Hausa language. It is a reference implementation by Bimba et al., 2015.
Catalan Snowball Stemmer
Media Cloud supports many languages. This Perl library is an interface to the Snowball stemmer for the Catalan language.
Multilingual Sentence Splitter
Our system splits text content into sentences for analysis in multiple languages. This is a Python port of the Lingua::Sentence Perl module.
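As a rough illustration of rule-based sentence splitting, the sketch below breaks text at sentence-ending punctuation and then rejoins pieces that were split after a known abbreviation. It is a simplified, English-only example, not the ported module's actual rule set or interface. The `split_sentences` function and the `ABBREVIATIONS` set are hypothetical.

```python
import re

# Illustrative subset of abbreviations that should not end a sentence.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "vs", "etc"}

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase
# letter, quote, or opening parenthesis.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(])")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, rejoining splits made after abbreviations."""
    sentences = []
    for piece in _BOUNDARY.split(text.strip()):
        if sentences:
            previous = sentences[-1]
            last_word = previous.split()[-1].rstrip(".").lower()
            if previous.endswith(".") and last_word in abbreviations:
                # The split came after an abbreviation, so undo it.
                sentences[-1] = previous + " " + piece
                continue
        if piece:
            sentences.append(piece)
    return sentences
```

For example, `split_sentences("Dr. Smith arrived. He was late!")` keeps "Dr. Smith arrived." together as one sentence. Supporting multiple languages would need a separate abbreviation list and boundary rules for each language, which this sketch leaves out.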
NYT News Labeler
We run all our English stories through a set of trained models to detect what theme(s) they focus on. To build these models, we took a transfer learning approach: starting with the Google News word2vec models and then adapting them to produce theme labels based on the New York Times Annotated Corpus.