News Media Discourse on International Students in Canada
An analysis of the news media's reflection onto Reddit from January 2023 to March 2024 using Natural Language Processing
1. Background
In recent years, the population of international students in Canada has increased dramatically. By the end of 2023, the number of study permit holders was 84% higher than it was at the same time in 2018. The purported strain on public services alongside organizations taking advantage of Canada's immigration system for profit were cited as key reasons behind the multiple policy changes in the past year. Among the changes, financial requirements for prospective students were doubled and a significant cap was imposed on the number of study permits that will be processed over the next two years.
At the beginning of 2024, Immigration, Refugees and Citizenship Canada published a news release: "Rapid increases in the number of international students arriving in Canada also puts pressure on housing, health care and other services. As we work to better protect international students from bad actors and support sustainable population growth in Canada, the government is moving forward with measures to stabilize the number of international students in Canada."
Active study permits spiked following the initial downturn due to the COVID-19 pandemic, most notably so from citizens of India (Fig. 1). In total, 2020 to 2023 saw an average increase of 26% in study permit holders year over year. This can be compared with 2015 to 2018, for example, which saw an average increase of 17% each year.
During this period of growth, Canadian post-secondary schools have become increasingly reliant on the streams of high tuition fees that come with international students. On average, tuition for a domestic undergraduate student in 2023 was around $7,000, while for an international student, it was $38,000. This gap has been continuously widening for years. A stark example can be seen in Ontario, where international students contributed 45% of tuition revenue in the 2020/2021 fiscal year—up 16 percentage points since 2016/17 (see section 4.2.1).
A blue-ribbon panel created by the Government of Ontario in 2023 found that "Many colleges and universities have passed the point where they could survive financially with only domestic students. They are financially sustainable only because of international students".
Many schools pay recruitment agencies (which in turn pay on-the-ground agents) to attract international students (see section 4.2.4). Recruitment is a lucrative business with deeply embedded issues, ranging from misrepresentation of life in Canada to direct fraud.
As the government's failures to properly regulate the intersection between immigration and education have come to light, significant media coverage has followed. This inquiry seeks to understand the nature of that media coverage and how it is reflected in discussion on social media. Natural Language Processing is used at scale to see how the Canadian news media has approached the topic. Similar techniques are then applied to compare how users on Reddit filter from and discuss the same articles.
2. News Media
Research conducted in March 2024 captured news articles published between January 1, 2023 and March 9, 2024. Just under 7000 stories from 319 Canadian media sources were identified to be relevant using Media Cloud and web-scraping (Fig. 2). Stories were deemed relevant if they contained certain keywords (such as 'international student') and mentioned 'students' at least twice (see the Appendix for the full criteria).
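The relevance rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword list here holds a single example phrase (the full criteria are in the Appendix), and the names `KEYWORDS`, `MIN_STUDENT_MENTIONS`, and `is_relevant` are hypothetical rather than taken from the original pipeline.

```python
import re

# Hypothetical keyword list; the real criteria contain more terms (see Appendix).
KEYWORDS = ("international student",)
MIN_STUDENT_MENTIONS = 2


def is_relevant(text: str) -> bool:
    """Return True if a story contains a keyword and mentions 'students' at least twice."""
    lowered = text.lower()
    has_keyword = any(keyword in lowered for keyword in KEYWORDS)
    # \b stops 'students' from matching inside a longer word.
    student_mentions = len(re.findall(r"\bstudents\b", lowered))
    return has_keyword and student_mentions >= MIN_STUDENT_MENTIONS
```

Requiring a second mention of 'students' is a cheap way to drop stories that only reference the topic in passing.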
To maintain a focus on the media as a whole rather than the same publisher posting duplicate stories, text analysis was performed on a corpus made by de-duplicating the original by exact text. Since exact text content was tested for, this approach allowed for stories that were distributed by news agencies like The Canadian Press and posted by various publishers to be largely retained. If two stories had the same text content but contained any variation in formatting or layout, they were considered to be non-duplicates. A publisher which posted the same story across multiple websites with the exact same layout, however, had duplicates removed. The most prevalent of the high-duplicate publishers in the dataset was Black Press Media, which on certain days posted the same story over 50 times. This deduplication process resulted in a corpus of 3470 stories (Fig. 2).
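As a minimal sketch of exact-text de-duplication, the function below keeps the first story seen for each distinct text. It assumes each story is a dictionary with a `text` field, which is a hypothetical layout and not necessarily how the original data was stored. No normalization is applied, so any difference in the text (including whitespace) yields a non-duplicate, matching the behaviour described above.

```python
import hashlib


def deduplicate_exact(stories):
    """Keep the first story for each distinct text; the text is compared byte for byte."""
    seen = set()
    unique = []
    for story in stories:
        # Hashing keeps memory use low when story bodies are long.
        digest = hashlib.sha256(story["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(story)
    return unique
```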
In the cleaned corpus, 278 sources were present, the most prevalent of which was CTV News (Fig. 3).
To determine who was being talked about, what was being talked about, and how international students were being talked about, a few separate attributes of text in the coverage were chosen for examination:
- Entities such as locations, people, and organizations
- Words used in the same sentence as 'international student(s)'
- Noun chunks, which are nouns and the words which describe them, as determined through dependency parsing
- Words such as verbs and adjectives that are grammatically relevant to 'international student(s)', as determined through dependency parsing. Dependency parsing is the process of programmatically identifying the relationships between words in text. For example, a dependency parser could determine if an adjective modifies a noun.
SpaCy's English transformer pipeline was used for Natural Language Processing (NLP) tasks.
I. Entities
The frequency at which various entities were mentioned across the text of articles implies a ranking of believed importance. Figures 4-8 show how the prevalence of the top entities of a given type evolved over the past year.
While Canada and Ontario (which had 54% of Canada's international students as of the end of 2023) would be expected as the most frequently mentioned locations, India stands out. Both the diplomatic rifts in September as well as India contributing by far the most international students to Canada (Fig. 1) help to explain its prevalence. Interestingly, India is mentioned less in the data's later months than earlier in 2023. While there are many factors to the popularity of student recruitment in India, an element is the large youth population; a 2023 report from Pew Research stated that "roughly one-in-five people globally who are under the age of 25 live in India". Since 2018 India has also been part of Canada's Student Direct Stream (SDS) program, which streamlines study permit applications from certain countries. Although articles were only sourced from media sources with a focus on Canada, just about half of all articles explicitly mention 'Canada'.
The Indigenous population tended to get brought up in the same articles as international students when the topic related to discrimination or student elections.
The most commonly mentioned individuals were politicians. Marc Miller became the Minister of Immigration at the end of July 2023, taking over for Sean Fraser who went into a new role as Minister of Housing. On January 22, 2024, Marc Miller announced a cap on the number of study permit applications that would be processed, amounting to an estimated 35% reduction in approved study permits (which has since been subject to multiple clarifications).
Lagging behind Trudeau in apparent relevance is Doug Ford, the Premier of Ontario. Lastly, Jill Dunlop is the Minister of Colleges and Universities in Ontario.
The name Singh was also common, most often referring to the NDP leader, Jagmeet Singh, but was omitted due to the commonality of the name and difficulties in attribution.
The most frequently mentioned organizations were government organizations and post-secondary institutions. The McGill spikes in October and December relate to Quebec raising tuition for out-of-province students, as well as the province charging schools $20,000 for each admitted international student. While the University of Waterloo was not particularly prevalent in reporting, a stabbing in late June in which the perpetrator was a recently graduated international student received a lot of media coverage at the time of the event. The NDP appears above both the Liberal and Conservative parties. This is in part due to the media coverage it received from its criticism of the federal government's funding plans for universities and housing. It is also possible that articles tend to refer to the Liberal and Conservative parties as 'Liberals' or 'Conservatives' respectively, which would result in them being tagged as groups rather than organizations (Fig. 5). Statistics Canada was often cited by articles for the open data on study permits it provides.
II. Language
As touched on in the previous section, multiple methods were used to build a contextual understanding surrounding the key phrase "international student(s)".
The first approach marked sentences that contain the key phrase, and looked to find which words from each part of speech category were most common across the sentences. Figure 9 shows the results of this proximity-based method with the most common 15 nouns, adjectives, proper nouns, and verbs.
Most of these results follow what would be expected from looking at the headlines at the story volume peaks (Figure 2). By far the most common verb was 'say', implying a significant number of quotations. Importantly, words were counted by their lemmas, meaning 'said' and 'saying' would both fall under 'say'.
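The proximity-based count can be illustrated with a short sketch. It assumes sentences have already been tokenized, lemmatized, and tagged, and represents each token as a hypothetical `(text, lemma, pos)` tuple with Universal POS tags; the function names are illustrative only. In this sketch the words of the key phrase itself are counted too.

```python
from collections import Counter, defaultdict


def contains_key_phrase(sentence):
    """True if the lemmas 'international' and 'student' appear back to back."""
    lemmas = [lemma for _, lemma, _ in sentence]
    return any(a == "international" and b == "student" for a, b in zip(lemmas, lemmas[1:]))


def top_lemmas_by_pos(sentences, pos_tags=("NOUN", "ADJ", "PROPN", "VERB"), top_n=15):
    """Count lemmas per part of speech, using only sentences that contain the key phrase."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        if not contains_key_phrase(sentence):
            continue
        for _, lemma, pos in sentence:
            if pos in pos_tags:
                counts[pos][lemma] += 1
    return {pos: counts[pos].most_common(top_n) for pos in pos_tags}
```

Counting lemmas rather than surface forms is what lets 'said' and 'saying' accumulate under 'say'.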
While looking at the nearest words does show relevant information, the approach is naive to the grammatical relevance of each word. SpaCy's dependency parsing was used to fill this gap in context by looking at noun chunks and syntactically meaningful words.
To understand common terms, the top noun chunks (phrases containing a noun and the words that describe it) were retrieved (Table 1). To emphasize the chosen determiners and more precise phrases, noun chunks with more than 2 words are shown.
In the noun chunks which contain a reference to international students, the accompanying determiner was often related to quantity (e.g., many). Across all of the text, the most common noun chunk was 'the federal government'. The 'housing crisis' was also commonly mentioned– a topic which is expanded upon in Section 4.
News articles most commonly used 'international student(s)' as a prepositional object, a subject, part of a compound, or as a direct object. SpaCy's dependency parser was used to find the most common syntactically relevant verbs, prepositions, heads of compounds, and adjectives (Fig. 10).
When the students were the object of the sentence, the verbs used were fairly positive: 'support', 'protect', and 'welcome' are all in the top 10. The prepositions tended to be about the high number of international students, with the most commonly used being 'number of', 'cap on', and 'influx of [international students]'.
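To show how a dependency parse yields these syntactically related words, the sketch below walks over sentences that have already been parsed. The `Token` structure and the label names ('dobj', 'nsubj', 'pobj') are assumptions that imitate the kind of output a parser such as SpaCy's provides; the code does not call SpaCy itself, and it is not the original analysis code.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str
    dep: str   # dependency label (assumed, SpaCy-style names)
    head: int  # index of the syntactic head within the sentence


def is_key_phrase(tokens, i):
    """True if token i is 'student' and is modified by 'international'."""
    if tokens[i].lemma != "student":
        return False
    return any(t.lemma == "international" and t.head == i for t in tokens)


def related_words(sentences):
    """Count the governing verb or 'noun + preposition' for each key-phrase mention."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if not is_key_phrase(tokens, i):
                continue
            head = tokens[token.head]
            if token.dep in ("dobj", "nsubj"):
                # The head is the verb the students are the object or subject of.
                counts[(token.dep, head.lemma)] += 1
            elif token.dep == "pobj":
                # The head is a preposition; its own head supplies e.g. 'cap' in 'cap on'.
                governor = tokens[head.head]
                counts[("pobj", f"{governor.lemma} {head.lemma}")] += 1
    return counts
```

For a phrase like "a cap on international students", the students are the object of the preposition 'on', whose head is 'cap', so the sketch records 'cap on'.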
Overall, post-secondary institutions, housing, tuition, and Marc Miller seemed to be at the forefront of discussion in news media. Ontario was the most often mentioned province in Canada, which aligns with it having the highest international student population in Canada. While India remained a prevalent topic of discussion, its relevance in news media articles was shown to decline towards the end of 2023.
3. Reddit
Reddit at its core is a social media platform for users to aggregate information, like news articles, into relevant communities ('subreddits') and discuss in the comments. According to Semrush, reddit.com was the fourth most visited website in Canada for February 2024 (placing it above Twitter but below Facebook).
To understand how users on Reddit filtered from and discussed what the news media writes about, all posts across all subreddits that linked to one of the 6785 articles discovered in Section 2 were collected. Using PRAW (The Python Reddit API Wrapper), data from the 1073 posts was saved and collated (Fig. 11). Combined, these posts have over 59k comments, and the median comment count per post is 7 (or 26 when the 318 posts with 0 comments are excluded). Comments or posts that were deleted before data collection (March 17th) were not captured in this dataset. In other words, if an item broke platform or community rules and was consequently removed by moderators, it is absent from the data, and this omission may somewhat skew the results.
While more news articles typically led to more Reddit posts, this was not always the case as seen in late March and October. On March 19 and 20th, stories were published about a Sikh international student being assaulted and the Cape Breton University food bank having extremely high demand. While the significant discrepancy was in part due to the large number of duplicate stories published by Black Press Media and Glacier Media Group, neither story got any traction on Reddit regardless. On October 27th, Marc Miller announced new rules targeting fraud following an investigation into fake acceptance letters for international students. In many cases it was found that the students were unaware that their letters were not genuine. While a low number of Reddit posts were made relative to the number of articles, one post received over 300 upvotes and nearly 150 comments.
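The two medians reported above differ only in whether posts with zero comments are kept. A minimal sketch of that calculation, with a hypothetical function name and assuming the per-post comment counts are available as a list of integers:

```python
from statistics import median


def comment_medians(comment_counts):
    """Return (median over all posts, median over posts with at least one comment)."""
    with_comments = [count for count in comment_counts if count > 0]
    overall = median(comment_counts)
    # median() raises on an empty list, so guard the filtered case.
    commented = median(with_comments) if with_comments else None
    return overall, commented
```

Reporting both values is useful when a large share of posts received no interaction at all, because the overall median is then pulled sharply downward.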
On September 7, 2023, there were more Reddit posts created than news articles published. This was mainly due to two stories. Most frequently posted (8 times) was the CBC article "Store manager in Sydney says she's inundated by international students desperate for work". Following that was the National Post story "International students in Canada living on the side of a road come to a solution with local college", which was posted 5 times.
Across all posts, the most prevalent subreddit by multiple metrics was unsurprisingly r/canada (Table 2). Somewhat interesting is the popularity of r/CanadaHousing2 in this dataset. Despite having less than 1/5th the member-count of r/CanadaHousing at the time of writing, r/CanadaHousing2 consistently appears in the top 3-5 subreddits by all measured metrics, while the former has only 4 captured posts.
The r/CanadaHousing2 community is described by the page moderators as "Like r/CanadaHousing but without the censorship. This is a subreddit to discuss the housing crisis in Canada without banning posts for discussing supply *and* demand. Racism is still absolutely prohibited, but you are welcome to debate population growth, immigration rate, foreign home buyers, and the merits of single family homes or the green zone." The top 10 posts in the last year on r/CanadaHousing are largely about landlords, while on r/CanadaHousing2 they are nearly all about immigration and international students. This result highlights the often-discussed large effect that varying moderators have on subreddits.
Also highly visible in the distribution of subreddits by posts (Fig. 12) are r/100kinitiative and r/CanadaMassImmigration. Both are small subreddits that focus on the negatives of immigration.
While r/AnythingOntario was the fourth most posted-in subreddit, it did not appear in the top ranks for any other metric measured in Table 2. All the posts got virtually no interaction and were from a single account, which appeared to be automated to post Ontario-related news stories. Both r/AutoNewspaper and r/TORONTOSTARauto are also automated news streams, like the names imply.
To visualize where discussion happened, all comments that had a parent comment were put into Gephi with edges (replies) coloured by subreddit and node (user) size determined by the number of replies sent (Fig. 15). Although certain users were shown to be responsible for a disproportionately large number of replies, no significant widespread or coordinated bot activity was detected. Areas of examination included patterns in the times posts/comments were made, duplicate content, LLM error messages, and days in which the average cosine similarity between the TF-IDF vectors of comments (text similarity) was higher than normal.
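The text-similarity check can be illustrated with a small, dependency-free sketch. It is an assumption-laden stand-in rather than the original implementation: tokenization is a plain whitespace split, the IDF uses a common smoothed form, and the function names are hypothetical. The idea is to compute one average per day and flag days whose value is unusually high.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (dicts) with a smoothed inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(comments):
    """Average cosine similarity over every pair of comments (0.0 if fewer than two)."""
    vectors = tfidf_vectors(comments)
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Comparing every pair is quadratic in the number of comments, so for a busy day a sample of comments may be more practical.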
The main categories of subreddits where discussion happened were large, encompassing communities like r/canada, region-specific subreddits like r/vancouver, and topical groups like r/CanadaHousing2 or r/TorontoRealEstate.
The preferred media sources are immediately apparent from looking at the distribution of posts by article source domain. Certain domains, such as theglobeandmail.com, have a much higher share of Reddit posts than they do of stories.
Stories from over 300 unique sources were collected, resulting in a fairly diverse set of news article data. Regardless of this, the vast majority of Reddit posts came from a small subset of these domains (Fig. 16).
For instance, only 31 relevant articles were collected from CBC, but 129 Reddit posts were made linking to those articles (Fig. 17). While the distribution of posts by media sources is largely reflective of source popularity, it nonetheless paints a picture of selective influence.
Of the top 10 domains by post count, theglobeandmail.com had the most posts as well as the highest average comment-to-post ratio (Table 3). While domains such as toronto.ctvnews.ca and ctvnews.ca fall under the same source, they were treated as unique domains for this step of analysis so locality specific differences could be spotted. When all CTV News domains are grouped together, it has the highest number of posts (213).
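Grouping regional subdomains under one source, as done for CTV News above, can be sketched as a suffix match on the domain. The mapping below is a hypothetical two-entry example and not the full list of sources used in the analysis.

```python
from collections import Counter

# Hypothetical mapping from a registered domain to a source name.
SOURCE_BY_DOMAIN = {
    "ctvnews.ca": "CTV News",
    "theglobeandmail.com": "The Globe and Mail",
}


def source_for(domain):
    """Map a domain such as 'toronto.ctvnews.ca' to its source; unknown domains map to themselves."""
    domain = domain.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    for registered, source in SOURCE_BY_DOMAIN.items():
        # Match the registered domain itself or any of its subdomains.
        if domain == registered or domain.endswith("." + registered):
            return source
    return domain


def posts_by_source(post_domains):
    """Count posts per source after collapsing subdomains."""
    return Counter(source_for(domain) for domain in post_domains)
```

Keeping the per-domain counts alongside the grouped counts preserves the locality-specific view while still allowing a source-level total.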
I. Entities
Using an approach similar to that of Section 2, entities were extracted from all the comments and replies under each post. 59% of comments (39.1k) included at least one entity, and these comments were used as the denominator for entities. Although the top entities by average relevance across all measured months are shown, the percentages are small relative to those seen in Section 2. This implies a larger variety of entities were mentioned when entities were mentioned at all.
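The choice of denominator can be made concrete with a short sketch. It assumes each comment has already been reduced to the set of entity names found in it, and it interprets prevalence as the share of entity-bearing comments that mention a given entity at least once; both the data layout and that interpretation are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def entity_shares(comment_entities):
    """Share of entity-bearing comments that mention each entity at least once."""
    # Comments without any entity are excluded from the denominator.
    with_entities = [entities for entities in comment_entities if entities]
    if not with_entities:
        return {}
    counts = Counter(entity for entities in with_entities for entity in entities)
    return {entity: count / len(with_entities) for entity, count in counts.items()}
```

Under this definition a wider variety of entities spreads mentions more thinly, which is one way the smaller percentages noted above could arise.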
Regardless of the larger variety of entities, the top entities for regions and groups are quite similar to those seen in news articles. Toronto and Ottawa disappeared from the Top 5 locations, and the United States and China moved in. Indigenous is no longer as prevalent in the top groups, and 'Chinese' became the fourth most mentioned. The top 3 most mentioned groups (Canadian, Liberal, and Indian) did not shift at all between news articles and Reddit comments.
Although the top individuals remain politicians, those most mentioned in Reddit comments are drastically different from those seen in news articles. In contrast to what was previously found, well-known political figures were more emphasized in the data than relevant Ministers. Justin Trudeau and Stephen Harper are respectively the current and former Prime Ministers. Harper was often brought up in reference to his impact on immigration policy. Pierre Poilievre is the leader of the Conservative Party, and as the leader of the Opposition, users tended to compare his ideas with Trudeau's. Donald Trump was often talked about in the context of Canada's more open immigration policy relative to the United States'.
When looking at the top 5 organizations for both news articles and Reddit comments, the NDP was the only organization to be on both lists. As was mentioned in Section 2, while people tend to refer to the New Democratic Party as 'NDP', it is common to refer to the Liberal or Conservative Party as the 'Liberals' or 'Conservatives'. This likely led to their prevalence in the group entities (Figure 19) rather than organization entities. The second most mentioned organization on Reddit, Conestoga College, had the most study permits approved throughout Canada in 2023 (30,000). For reference, the school in second place on that list had fewer than half that many study permits approved in the same year.
II. Language
In the text of 59k collected Reddit comments, international students were directly mentioned 6.3k times.
At first glance, the words near 'international student(s)' in Reddit comments (Fig. 22) look remarkably similar to those found in news articles (Fig. 9). There are some key differences, which build upon themes seen in the frequently mentioned entities (previous section). Specific colleges and universities– namely Conestoga College, UofT, and UBC– were mentioned by name. While Marc Miller can be seen, he was nowhere near as prevalent as he was in news articles; heavily public-facing politicians like "[Justin] Trudeau" and "[Doug] Ford" were mentioned more frequently.
The top noun chunks show that discussions on Reddit placed high importance on the housing crisis– more so than news articles. The quantity of international students was once again commonly used as a determiner. Reddit comments also tended to see topics surrounding international students as an election issue, with frequent references to 'the next election'.
Following the same dependency parsing approach outlined in Section 2, verbs, adjectives, prepositions, and heads of compound nouns were found.
The positive verbs such as 'support' and 'protect' that were highly common when 'international student(s)' was used as an object in news articles (Fig. 10) are no longer present in the top 10. The verb 'pay' was once again the most common verb when the key phrase is used as a subject, often referencing the high tuition that international students pay. Although the most common adjectives are mostly the same, 'indian' moved up from the seventh most used adjective in articles to the fourth most used in Reddit comments.
The similarity of language used in Reddit comments when compared to articles shows that they somewhat mirrored the news, with a swing in weighting applied to certain names and topics. While an increase in specificity was seen when referring to certain post-secondary institutions in the same sentence as "international student(s)", the opposite effect was found when discussing politicians. Justin Trudeau, for example, was brought up more often than any more pertinent Minister. Neither the Minister of Immigration nor the Minister of Housing was in the top 5 most-mentioned individuals on Reddit despite being ranked 1 and 2 respectively for mentions in news media.
4. Housing Case Study
Across both news media coverage and Reddit, housing emerged as one of the most important issues discussed alongside international students. The noun 'housing' was often seen in the same sentence as 'international student(s)' (Fig. 9, Fig. 22), and 'the housing crisis' was one of the most common noun chunks (Table 1, Table 4). In essence, the coverage describes an influx of international students occurring alongside an ongoing housing shortage. This has led to dire living situations for many, and stories such as a student living in a house with 13 other people are not uncommon.
Of the 6785 news stories about international students, 10% of them have 'housing' in their title. This figure remains similar for the rate at which articles get posted to Reddit, with 10% of posts linking to housing stories, and 11.6% of all comments being made on those posts. When looking at the de-duplicated stories (3470), the percentage with 'housing' in the title drops slightly to 7.5%. As in previous sections, the corpus of stories without exact text duplicates was used for text analysis, while the original corpus of stories was used to acquire Reddit posts.
The seemingly unwarranted spike in Reddit posts late September can be attributed to a single (non-bot) user reposting the same article from The Varsity to various Canadian subreddits.
Noun chunks with more than 2 words were once again looked at to see key repeated phrases. The two most common terms were 'the housing crisis' and 'the federal government'. Of the top noun chunks that are 1 word or more, 'the housing crisis' lowers in ranking, but still remains in the top 10 for both news article text and Reddit comments.
To understand the context in which the housing crisis was talked about, manual coding was performed based on the 127 articles that mentioned the 'housing crisis' and have 'housing' in the title. Correspondingly, coding was carried out on the 210 Reddit comments that contained 'housing crisis' on 43 unique posts about articles with 'housing' in the title.
When looking at Reddit comments, the comment chains (parents and replies) were used to identify the context. If an article conveyed both sides of an argument in a neutral fashion, both sides were reflected in the coding.
An effort was made to differentiate between text that blamed immigration policy decisions and text that blamed the immigrants themselves, for example 'the amount of immigrants being let in is causing the housing crisis' versus 'the immigrants are causing the housing crisis'. This is an important distinction: immigration can be described as a factor contributing towards the housing crisis while (for example) solely blaming a government for its implementation of policy.
As expected from a set of news articles that contain references to international students, the increasing international student enrollment was cited as a leading factor behind the housing crisis (Fig. 25). However, it was uncommon for an article to directly blame international students themselves, and in 36% of articles coded they were discussed as primary victims. Often, this was due to predatory practices of landlords, recruitment agencies, and post-secondary institutions. Over 75% of articles voiced the desire for an increase in housing supply, typically through building more affordable housing. Another commonly proposed solution, targeted immigration, referred to the idea of an increase in skilled workers such as tradespeople to help aid in the construction of housing. The policy announced on January 22, 2024 that attempts to limit new international students through a cap on processed study permit applications was also highly discussed, and highly criticized. All levels of government– particularly federal– were regularly blamed by news media, but post-secondary institutions were assigned blame most often.
Although reducing immigration in some fashion was proposed at similar rates by news articles and Reddit comments (approximately 19%), it was the most common solution directly proposed on Reddit aside from no solution at all (54%). Conversely, it was the 5th most common solution in news articles, which have a no-solution-rate of only 5.5%. While an article typically provides background information on a topic, what caused it, who it impacts, and possible solutions, comments on social media tend to be less complete.
On Reddit, the federal government was blamed for the housing crisis by far the most often. Just under half of the collected Reddit comments that mentioned 'housing crisis' blamed the federal government to some degree. Reddit comments also tended to direct blame towards those who treat housing as an investment vehicle at a high rate (16%) compared to articles (6%).
When looking at international students specifically, exploitation of the group by post-secondary institutions was not brought up on Reddit nearly as often as it was by news articles. Relatedly, international students were discussed as victims in 5% of comments, significantly less than the 36% of articles. While it was more common to see instances of international students being blamed for the housing crisis on Reddit, it was still quite infrequent (5.2%).
5. Conclusion
Reddit comments about articles tend to be a funhouse mirror reflection of the news media. There was a clear preference on the platform for a small subset of Canadian media sources, and there were key differences in the choices of language used in the comments.
The top terms across both news and Reddit comments centered around the quantity of international students, housing, tuition, and governments/policies. The specifics around these topics, however, often differed. In news articles, multiple of the most common verbs, such as 'protect' and 'welcome', were used to refer to international students in a somewhat positive light. The regularity of these positive verbs did not carry over into the results of the text analysis for Reddit comments. Both Reddit comments and news articles talked about politicians the most out of all types of individuals, but the types of politicians they mentioned the most were quite different. Across the corpora, news articles highlighted Ministers as the foremost individuals, while Reddit comments placed an emphasis on political figureheads like Justin Trudeau.
Looking at the housing crisis in particular, which was shown to be treated as an issue interwoven with international students, Reddit comments overwhelmingly assigned blame to the federal government. Meanwhile, news articles spread blame somewhat evenly between post-secondary institutions, the federal government, and provincial governments. The proposed solutions also differed, with the most common solution on Reddit being an approach from the demand side (adjusting immigration targets) and news articles largely discussing solutions on the supply side (housing development).
The preference on Reddit towards a small subset of news sources was readily apparent. The 5 most posted sources on Reddit (CTV News, The Globe and Mail, The Toronto Star, CBC, and Global News) made up 67% of all detected posts. These same sources accounted for 12% of all collected articles. It is now common knowledge that social media platforms will recommend certain posts, but it is just as important to understand that the pool of posts might be weighted in the first place.
Activity on Reddit extends far beyond commenting on posts about news articles. Although the focus here was to understand Reddit posts and comments as they relate to the news ecosystem, the discussion under non-news posts may be quite different.
6. Appendix
Interesting next steps would be inter-ecosystem comparisons, media ownership analysis, and upvote correlations. Adding to the available media sources would also be an improvement.
In terms of data collection, the following boolean search string was used to get stories from Media Cloud's Canadian National and Provincial news source collections:
False positives from CTV News were common due to areas of the page like 'More Stories' being scraped by Media Cloud, so those stories were re-processed locally to improve accurate matches. Due to limited sources, stories from CBC, National Post, Ottawa Citizen, and The Globe and Mail were programmatically backfilled.
Sources for headlines used in Figures 2 and 24 (left to right)
Fig. 2:
https://bc.ctvnews.ca/absolutely-disgusting-b-c-councillor-speaks-out-after-sikh-international-student-swarmed-beaten-1.6320026
https://www.cbc.ca/news/politics/international-student-fraud-rules-1.7010427
https://nationalpost.com/news/canada/housing-immigration-federal-government
https://www.theglobeandmail.com/politics/article-international-student-visa-cap-miller-immigration/
Fig. 24:
Here's an Easter-egg for reading the Appendix, a visualization of the 75 largest comment threads in the collected Reddit data. Node size is determined by comment score and colour is determined by subreddit.