Evaluating Author Extraction for the Media Cloud Platform
by: Sands Fish & Rahul Bhargava
As we expand the capabilities of the Media Cloud platform, we are always looking for ways to detect and parse additional metadata to support our, and our collaborators', research questions. One type of question we often encounter is the desire to understand who is talking. If we can answer this question at a broad scale, we can begin to understand who speaks the most, who garners the most attention, and who uses certain language. With that information we could paint a clearer picture of the media landscape and the conversations that happen there.
We think about answering this question of who is speaking in online media reporting as two separate sub-questions:
Who is quoted in news online, and who quotes them?
Who are the authors that are writing news online?
This blog post summarizes our latest pass at investigating existing solutions for the second question: author detection in semi-structured web-based text. There are a few approaches we can pull from: using pattern matching, relying on structured metadata, or sourcing the job out to more complex algorithms and APIs.
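To make the pattern-matching and structured-metadata approaches concrete, here is a minimal sketch of the structured-metadata route using only Python's standard library. The `AuthorMetaParser` class and the sample HTML are our own hypothetical illustrations, not code from any of the tools discussed; a real extractor would handle many more hooks (JSON-LD, OpenGraph variants, rel="author" links) than the two shown here.

```python
from html.parser import HTMLParser

class AuthorMetaParser(HTMLParser):
    """Collect author candidates from two common structured-metadata hooks:
    <meta name="author"> / <meta property="article:author"> content
    attributes, and text inside elements tagged itemprop="author"."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_itemprop_author = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = attrs.get("name", attrs.get("property", "")).lower()
            if name in ("author", "article:author") and attrs.get("content"):
                self.authors.append(attrs["content"].strip())
        elif attrs.get("itemprop") == "author":
            self._in_itemprop_author = True

    def handle_data(self, data):
        # Capture the first text node inside an itemprop="author" element.
        if self._in_itemprop_author and data.strip():
            self.authors.append(data.strip())
            self._in_itemprop_author = False

parser = AuthorMetaParser()
parser.feed('<html><head><meta name="author" content="Jane Doe"></head>'
            '<body><span itemprop="author">John Smith</span></body></html>')
print(parser.authors)  # ['Jane Doe', 'John Smith']
```

When the markup is there, this approach is precise; the hard cases discussed below are the pages that put the byline only in unstructured visible text.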
Author detection is baked into many web-based tools, so it isn't surprising that a small ecosystem of libraries and platforms to help has sprung up. After some research online and with peers in the field, we narrowed in on testing three specific tools that exist already:
Goose - an article text and metadata extraction library (as implemented by Xavier Grangier in the Python programming language)
Newspaper - a library for "scraping and curating" articles online, which pulls from Goose (created in Python by Lucas Ou-Yang)
DiffBot - a commercial API for "web data extraction using artificial intelligence", spun out from Stanford (i.e., we used their API)
Representative Test Data
Because sites can vary widely in the way they represent this kind of data, we used our existing metadata to put together a selection of sites that would include a variety of platforms and sources. Our hope was to create a challenging training set that accurately models the global variety of content we ingest. We built our dataset by combining two approaches.
First, we pulled in US political content from across the right-left spectrum by sampling stories from media sources in the list of most retweeted sites from our Berkman Klein Center colleagues' 2016 US Election study. This gave us sources categorized as "right", "center right", "center", "center left", or "left".
Second, we randomly sampled a number of articles from other large Media Cloud platform collections.
As a rule, we attempted to represent a diversity of website platforms and CMS services, avoiding oversampling articles hosted on, for instance, Blogger or WordPress, where the author markup would be nearly identical across multiple instances of the same platform. We also built into the data some sites that have no author and should return an empty list.
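The per-platform cap described above can be sketched as a simple sampling routine. This is our own illustrative stand-in, not the actual Media Cloud sampling code; the function name, cap value, and example URLs are all assumptions.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

def sample_diverse(story_urls, per_domain_cap=2, k=100, seed=42):
    """Sample up to k story URLs while capping how many come from any one
    domain, so a single CMS (e.g. Blogger, WordPress) with identical
    byline markup doesn't dominate the test set."""
    rng = random.Random(seed)
    shuffled = story_urls[:]
    rng.shuffle(shuffled)
    counts = defaultdict(int)
    picked = []
    for url in shuffled:
        domain = urlparse(url).netloc
        if counts[domain] < per_domain_cap:
            counts[domain] += 1
            picked.append(url)
        if len(picked) == k:
            break
    return picked

urls = (["https://aaa.blogspot.com/p%d" % i for i in range(10)]
        + ["https://news.example.com/a%d" % i for i in range(10)])
sample = sample_diverse(urls, per_domain_cap=2, k=4)
```

A production version would likely bucket by CMS fingerprint rather than bare domain, since the same WordPress theme can appear across many domains.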
Having run the same set of articles through each of these tools, DiffBot clearly performs best. The table below summarizes results using the standard precision and recall metrics. In this context, "precision" is a measure of how many of the authors detected were actually authors, and "recall" is a measure of how many of the authors that should have been found were actually found.
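Per-article, these two metrics can be computed as below. This is a generic sketch of precision and recall over author sets, not the exact scoring script we used; the convention for no-author articles (counting an empty detection against an empty ground truth as a perfect score) is one reasonable choice among several.

```python
def precision_recall(detected, expected):
    """Precision: fraction of detected authors that are real authors.
    Recall: fraction of expected authors that were detected.
    An article with no authors counts as perfect when nothing is returned."""
    detected, expected = set(detected), set(expected)
    if not detected and not expected:
        return 1.0, 1.0
    true_pos = len(detected & expected)
    precision = true_pos / len(detected) if detected else 1.0
    recall = true_pos / len(expected) if expected else 1.0
    return precision, recall

# One of the real failure cases discussed below: extra junk hurts
# precision but not recall.
p, r = precision_recall(["Usa Today", "Charles Ventura", "Am Pdt November"],
                        ["Charles Ventura"])
print(p, r)  # precision = 1/3, recall = 1.0
```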
Generally, the Newspaper python library did a fairly decent job at detecting names but missed many, even when the answer was very obvious to human eyes. If the author was included at the head of the article, but was not encoded in any structured metadata way, it was rare that Newspaper was able to detect it. When there was some structured data related to the author, it typically got these right, but there were a number of edge cases where it was mistaken.
One common mistake is for the library to detect the by-lines of articles linked to on the same page as the current article's authors. Typically, you will see a series of related articles in either a right-hand sidebar or in a section just below the text of the main article. It is strange to mistake these for the current page's author given their location in the page's structure, but Newspaper got caught by this case more than once. In some cases, these authors were listed after the correctly detected author; in others, the correct author was not first in the list, making it difficult to rely on, for instance, the first entry in the list being the correct one.
In other cases, the name of the actual publication or website was substituted for, or appended to, the author name, or some adjacent page content was included. For instance, Newspaper returned the list of authors for one article as "[Usa Today, Charles Ventura, Am Pdt November]". "Charles Ventura" is the correct author, but again, the initial entry is unreliable. It is easy to see that "Am Pdt November" is not one of the authors, but part of the publish date that happened to be in the same section as the author name. We saw this again in a case where the only author listed was "Hours Ago". We would guess that adding a quick layer of entity detection to verify whether each candidate is a "person" named entity would help (if only in contexts where authors have Western names).
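The entity-detection layer we speculate about would in practice use a real NER model (spaCy's `en_core_web_sm` is a common choice). As a cheap stand-in, even a token blacklist catches the specific failure modes above; the token list and function below are our own hypothetical heuristic, not part of any of the tested libraries.

```python
import re

# Tokens that signal a candidate is really a timestamp fragment or
# site name rather than a person ("Am Pdt November", "Hours Ago", "Usa Today").
NON_PERSON_TOKENS = {
    "am", "pm", "pdt", "pst", "est", "edt", "gmt", "utc",
    "hours", "minutes", "ago", "today", "yesterday",
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}

def looks_like_person(candidate):
    """Cheap stand-in for real named-entity recognition: reject any
    candidate containing an obvious date/time or publication token."""
    tokens = re.findall(r"[a-z]+", candidate.lower())
    return bool(tokens) and not any(t in NON_PERSON_TOKENS for t in tokens)

raw = ["Usa Today", "Charles Ventura", "Am Pdt November", "Hours Ago"]
cleaned = [a for a in raw if looks_like_person(a)]
print(cleaned)  # ['Charles Ventura']
```

A blacklist like this will misfire on real names that collide with its tokens (an author surnamed "May", for example), which is exactly why a proper NER pass would be the better long-term fix.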
In general, there were few false positives. We were sure to include a number of articles without any author listed, and each of the options we tested was good at returning an error or empty list when this was the case, save for one case where "AP" (i.e., the Associated Press) was suggested as the author because of a separate meta tag included in the article.
The Goose library gave the worst results, mainly because of its limited sense of where authors might be located in the structure of the article, but also because it lacks any heuristics to understand what might be an author and what might be erroneous. Most of the positive matches Goose had in our testing were simply due to not finding any authors when we fed it an article that did not have any; hardly a compliment. Where it detected an author when the others didn't (only a small handful of cases) it is because the designer of the site happened to use the "itemprop" tag in an unusual location. In general, without significant additions to this library, it is not reliable for author detection and focuses more on other parsing tasks for articles.
Though Newspaper showed a reasonable degree of success in returning authors based on a richer sense of where by-lines might be in the structure of a page, it was the DiffBot API that had clearly been refined to handle edge cases and clean up its output. DiffBot was able to detect 83% of authors accurately, with only 8% of these containing some extraneous information such as other erroneous authors or content from the page.
This is compared with 60% correct author identification by Newspaper and a higher rate of erroneous or incorrect additions to the correct author of 26.6% (showing that Newspaper is more aggressive, but suffers for it in accuracy). For Goose, we see a 26% correct identification, with no erroneous results, due to its very conservative sense of where to detect authors and no intelligence beyond a simple set of tags to look in. Some work has been done by MIT's Laboratory for Social Machines to expand the list of structural locations that Goose looks for authors, which undoubtedly expands the abilities of this library.
DiffBot also clearly managed other languages better than the other two libraries we tested. At some points, a non-English character would split a single author into two, and in other situations, it would prevent the detection of the author altogether.
On top of the authors, DiffBot was frequently able to extract a property it calls authorUrl, which usually pointed accurately to the author's profile page.
So What Next?
It seems that this is one of the cases where open source libraries don't come close to matching the refined API from a private company. DiffBot specializes in extracting structured data for consumption by artificial intelligence, and for uses where understanding content on the web is key to the customer's designs. This is a core product, and they clearly have spent the time to beautify the output, correcting things like capitalization and supporting international character sets. If you are looking for an open source library for this task, like we are, it appears it has not been developed yet.
As an open-source, foundation-funded project, paying for DiffBot to extract authors from over 500,000 stories a day is impossible. So with this test data in hand, and a benchmark run against it, we hope to move forward by merging several of the approaches described above and seeing how they perform. If we can nudge up Newspaper's performance, it might be valuable enough to deploy author detection within Media Cloud, at least for certain collections where it performs better (e.g., US Top News).
Download our manually coded test data here.