Media Cloud

API Updated – word count languages



Posted on May 9, 2016, in updates

We have updated the API to better support languages in the wc/list word counting endpoint.

Before this update, the language used for stemming and stopwording had to be specified explicitly in the API call. Now, if the caller does not specify a language, the API detects the languages used in the returned text and uses those languages for stemming and stopwording, as specified in the updated API spec:

By default, the system stems and stopwords the list in English plus each of the supported languages it detects for either the entire block of text or for at least 5% of the individual sentences within the query. If you specify the ‘languages’ parameter, the system will stem and stopword the words by each of the listed languages plus English.
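The selection rule quoted above can be sketched roughly as follows. This is only an illustration of the rule as described, not the actual implementation: the function name `select_languages` and its arguments are hypothetical, and language detection itself is assumed to have already produced a code for the whole block and one code per sentence.

```python
from collections import Counter

# Supported two-letter codes, as listed in this post.
SUPPORTED = {"da", "de", "en", "es", "fi", "fr", "hu", "it",
             "lt", "nl", "no", "pt", "ro", "ru", "sv", "tr"}


def select_languages(block_language, sentence_languages,
                     requested=None, threshold=0.05):
    """Return the sorted language codes to stem and stopword with.

    Hypothetical helper: block_language is the code detected for the
    whole block of text, sentence_languages holds one detected code
    per sentence, requested mirrors the 'languages' parameter.
    """
    if requested:
        # Explicit 'languages' parameter: the listed languages plus English.
        return sorted(set(requested) | {"en"})

    chosen = {"en"}  # English is always included by default
    if block_language in SUPPORTED:
        chosen.add(block_language)

    total = len(sentence_languages)
    for lang, count in Counter(sentence_languages).items():
        # Keep supported languages found in at least 5% of sentences.
        if lang in SUPPORTED and total and count / total >= threshold:
            chosen.add(lang)
    return sorted(chosen)


# 90 Spanish, 6 French and 4 German sentences: German falls below 5%.
sentences = ["es"] * 90 + ["fr"] * 6 + ["de"] * 4
print(select_languages("es", sentences))
```

With this input, Spanish and French pass the 5% threshold, German does not, and English is added regardless.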

Stemming for multiple languages is done by stemming each returned term in each language sequentially, ordered by the language code for each language. This sequential stemming is likely to introduce some artifacts into the results. If you want results in only a single language, include a ‘language:<code>’ (for instance ‘language:en’) clause in your query to ensure only sentences of that language are returned.
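To see why sequential stemming can introduce artifacts, here is a minimal sketch. The suffix-stripping rules in `TOY_SUFFIXES` are made-up stand-ins for real stemmers, and `stem_sequentially` is a hypothetical name; only the ordering by language code follows the description above.

```python
# Toy suffix rules, NOT real stemmers; they exist only to show the effect.
TOY_SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "es": ("os", "as", "o", "a"),
}


def toy_stem(word, lang):
    """Strip the first matching toy suffix, keeping at least 3 letters."""
    for suffix in TOY_SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_sequentially(word, languages):
    """Apply each language's stemmer in turn, ordered by language code."""
    for lang in sorted(languages):
        word = toy_stem(word, lang)
    return word


print(stem_sequentially("sodas", ["en"]))        # English only
print(stem_sequentially("sodas", ["es", "en"]))  # English, then Spanish
```

With English alone, the toy rules reduce "sodas" to "soda". With both languages, the Spanish rule then strips the trailing "a" from the English result, giving "sod". That second cut is the kind of artifact that restricting the query to a single language avoids.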

The following languages are supported (by two-letter language code): ‘da’ (Danish), ‘de’ (German), ‘en’ (English), ‘es’ (Spanish), ‘fi’ (Finnish), ‘fr’ (French), ‘hu’ (Hungarian), ‘it’ (Italian), ‘lt’ (Lithuanian), ‘nl’ (Dutch), ‘no’ (Norwegian), ‘pt’ (Portuguese), ‘ro’ (Romanian), ‘ru’ (Russian), ‘sv’ (Swedish), ‘tr’ (Turkish).
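As a rough sketch of how a client might put this together, the helper below validates language codes against the list above and builds a wc/list request URL. Several details are assumptions and should be checked against the API spec: the base URL is a placeholder, the `q` parameter name, the space-separated format of `languages`, and the `AND` used to join the ‘language:<code>’ clause are all guesses, and `build_wc_list_url` is a hypothetical name. No request is actually sent.

```python
from urllib.parse import urlencode

# Supported two-letter codes, as listed in this post.
SUPPORTED_LANGUAGES = {"da", "de", "en", "es", "fi", "fr", "hu", "it",
                       "lt", "nl", "no", "pt", "ro", "ru", "sv", "tr"}

# Placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.org/api/v2/wc/list"


def build_wc_list_url(query, languages=None, only_language=None):
    """Build a wc/list URL (parameter names are assumptions)."""
    for code in list(languages or []) + ([only_language] if only_language else []):
        if code not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language code: {code}")

    if only_language:
        # Restrict returned sentences to one language, as the post suggests.
        query = f"{query} AND language:{only_language}"

    params = {"q": query}
    if languages:
        # Assumed format: space-separated list of codes.
        params["languages"] = " ".join(sorted(languages))
    return f"{BASE_URL}?{urlencode(params)}"


print(build_wc_list_url("election", languages=["fr", "de"]))
print(build_wc_list_url("election", only_language="en"))
```

Leaving out both `languages` and `only_language` corresponds to the new default, where the languages are detected from the returned text.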

The dashboard tool is implemented entirely through our API. So, in addition to applying to the API itself, this change adds stopwording and stemming support for the above languages to the dashboard word cloud.
