Our initial research, funded by the National Science Foundation, investigates how the characteristics of online communities shape the quality of their conversations, drawing on a large-scale multilevel dataset that covers all public comments across 1,058 political forums on Facebook. Since then, we have expanded our research team and agenda, working on various computational social science projects broadly related to political polarization, deliberation, offline/online social networks, and culture. Below, we present a brief overview of the projects we are actively working on.

Current Projects

This page will be updated at the beginning of each semester. Last updated: January 2024

Linguistic Polarization on Social Media

Shiyu Ji, Barum Park, BK Lee
A recent study shows that the accuracy with which machine-learning models can classify parliamentary speeches as liberal or conservative can serve as a reliable measure of political polarization. We extend this approach by fine-tuning BERT models to distinguish between liberal and conservative commenters using our Facebook forum data, which include both comment text and users' ideology scores.
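A minimal sketch of the separability idea (with toy data and a hypothetical function name; the actual project fine-tunes BERT rather than using a toy classifier): the held-out accuracy of an ideology classifier can be read directly as a polarization score.

```python
def separability(predicted_labels, true_labels):
    """Held-out accuracy of an ideology classifier, read as a
    polarization measure: 0.5 ~ indistinguishable groups,
    1.0 ~ perfectly separable (highly polarized) language."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Toy example: classifier predictions vs. commenters' true ideology.
preds = ["lib", "con", "con", "lib", "con", "lib", "lib", "con"]
truth = ["lib", "con", "con", "con", "con", "lib", "lib", "lib"]
score = separability(preds, truth)  # 6 of 8 correct -> 0.75
```

The same score computed on two corpora (e.g., two time periods) would indicate where liberal and conservative language is more easily told apart, i.e., more polarized.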

Leveraging Large Language Models for Analyzing Belief Space at Scale

Junsol Kim, BK Lee
In our previous work, we demonstrated that fine-tuning large language models (LLMs) on the General Social Survey enables us to predict individuals' opinions even on questions they were never asked. Building on this foundational work, we currently examine the structure and patterns of individual embeddings, in which individuals with similar belief systems are located close together. Our approach will demonstrate how to study changes and patterns in cultural belief spaces over time more effectively by comparing the positions of individuals who participated in the survey in different periods within the same belief dimensions.
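One simple way to compare positions in a shared belief space across survey waves (a sketch with toy 2-D embeddings; the real embeddings come from the fine-tuned LLM) is to measure the cosine similarity between the average positions of respondents in different periods:

```python
import math

def cosine(u, v):
    """Cosine similarity between two belief-space embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Mean position of a group of respondents in the belief space."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Toy 2-D belief embeddings for respondents in two survey waves.
wave_1990 = [[1.0, 0.2], [0.9, 0.1], [1.1, 0.3]]
wave_2020 = [[0.2, 1.0], [0.1, 0.9], [0.3, 1.1]]

# Low similarity between wave centroids indicates drift in the
# population's average position over time.
drift = cosine(centroid(wave_1990), centroid(wave_2020))
```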

Measuring Deliberation and Toxicity on Social Media

David Broska, Jack LaViolette, Junsol Kim, Daniel McFarland, Barum Park, BK Lee
What makes a conversation deliberative? What are the characteristics of productive or toxic conversations? How can we foster positive interactions on social media? To answer these and related questions, we created a stratified sample of 100,000 comments and distributed it to 5,000 workers for labeling. We currently use this data set in multiple projects, including (i) developing toxicity classifiers that take the context of conversations into account and (ii) examining partisan differences in how toxicity and productivity are understood.
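The stratified sampling step can be sketched as follows (toy data; the actual strata and sample sizes used in the project are not specified here):

```python
import random

def stratified_sample(comments, strata_key, per_stratum, seed=0):
    """Draw a fixed number of comments from each stratum so that
    rare strata are not swamped by common ones."""
    rng = random.Random(seed)
    strata = {}
    for c in comments:
        strata.setdefault(strata_key(c), []).append(c)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Toy corpus: each comment tagged with its source forum's ideology.
corpus = [{"text": f"comment {i}", "ideology": ideo}
          for i in range(100) for ideo in ("liberal", "conservative")]
sample = stratified_sample(corpus, lambda c: c["ideology"], per_stratum=5)
```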

Uncovering Gender-Based Toxicity in Public Discourse on Facebook

Sky Wang, Barum Park, BK Lee
The toxicity of online social media discussions has become a growing problem, potentially exacerbated by the dominance of men in online discussion forums. To date, no study has systematically examined how women and men are treated differently across online communities, and it remains challenging to identify the causal effect of an author's gender on the toxicity of the reactions their comments receive. In this study, we address these gaps using a unique large-scale dataset from Facebook encompassing 500 popular political media pages over two years. We employ text matching to identify the causal effect by comparing people's reactions to nearly identical comments written by women and men.
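The text-matching logic can be illustrated with a deliberately crude sketch (toy data and a whitespace/case normalizer standing in for the project's actual matching procedure): pair nearly identical comments written by women and men, then compare the toxicity of the reactions each received.

```python
from collections import defaultdict
from statistics import mean

def normalize(text):
    """Crude matching key; the actual study matches on richer text features."""
    return " ".join(text.lower().split())

def matched_toxicity_gap(comments):
    """Average (women - men) reaction toxicity within matched
    groups of near-identical comments."""
    buckets = defaultdict(lambda: {"F": [], "M": []})
    for c in comments:
        buckets[normalize(c["text"])][c["gender"]].append(c["reaction_toxicity"])
    gaps = [mean(b["F"]) - mean(b["M"])
            for b in buckets.values() if b["F"] and b["M"]]
    return mean(gaps) if gaps else None

toy = [
    {"text": "I disagree with this policy", "gender": "F", "reaction_toxicity": 0.6},
    {"text": "i disagree with this  policy", "gender": "M", "reaction_toxicity": 0.4},
    {"text": "Great article!", "gender": "F", "reaction_toxicity": 0.1},  # unmatched
]
gap = matched_toxicity_gap(toy)  # one matched pair: 0.6 - 0.4
```

Because matched comments say (almost) the same thing, any remaining gap in reaction toxicity is more plausibly attributable to the author's gender.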

Extracting Events from Short Conversations

Wanying Zhao, Shiyu Ji, BK Lee
Our goal is to develop machine learning models that extract "significant social events" from social media conversations. To this end, we have been collecting Google daily "top trends" data since March 2023 and have constructed a database of major social events across 48 countries. We use LLMs (LLaMA 2 / GPT-4) to summarize each event and to identify which events are similar or distinct. We will test our models' performance using New York Times public commenting data sets. Ultimately, we aim to examine longitudinal changes in public attention space across countries.
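As a minimal sketch of the "identify similar or different events" step (toy summaries and a simple token-overlap measure; the project itself uses LLM-based summaries and comparisons), near-duplicate event descriptions can be greedily merged:

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two summaries."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def group_events(summaries, threshold=0.5):
    """Greedily merge event summaries whose token overlap exceeds
    the threshold, treating them as the same underlying event."""
    groups = []
    for s in summaries:
        for g in groups:
            if jaccard(s, g[0]) >= threshold:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

events = [
    "earthquake strikes northern japan",
    "strong earthquake strikes japan",
    "parliament passes new budget bill",
]
groups = group_events(events)  # first two merge; third stands alone
```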

Sustaining Cross-Ideological Interaction

Austen Mack-Crane, Barum Park
While studies about exposure to opposing ideological content abound, we know less about what happens after users from different political camps start a conversation. In this project, we investigate the context and conversational dynamics associated with the continuation versus cessation of subcomment threads on Facebook that involve both liberal and conservative users.

In Search of Echo Chambers

Shiyu Ji, Barum Park
Echo chambers are often blamed for increasing polarization and incivility online. In this research, we directly measure whether echo chambers tend to be more uncivil and whether incivility generated within echo chambers diffuses outward. We introduce a new attention-flow metric to analyze how users shift their attention between Facebook pages, and we define echo chambers as online communities with high attention-retention rates and high ideological homogeneity. Contrary to common expectations, we find that echo chambers tend to be more civil than communities with lower retention rates and greater ideological heterogeneity; nor is it the case that users who are active in echo chambers are the main culprits in spreading incivility to other communities. These results point to the importance of opportunity structures in the spread of toxic discourse online.
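A minimal sketch of an attention-retention rate (toy activity sequences; the project's actual attention-flow metric may be defined differently): for each page, the share of user activities that are immediately followed by another activity on the same page.

```python
from collections import defaultdict

def retention_rates(user_sequences):
    """For each page, the fraction of activities followed by another
    activity on the same page (attention retention)."""
    stays = defaultdict(int)
    moves = defaultdict(int)
    for seq in user_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            if cur == nxt:
                stays[cur] += 1
            else:
                moves[cur] += 1
    return {page: stays[page] / (stays[page] + moves[page])
            for page in set(stays) | set(moves)}

# Toy sequences of pages each user commented on, in order.
sequences = [
    ["A", "A", "A", "B"],
    ["A", "A", "B", "B"],
]
rates = retention_rates(sequences)  # A retains 3 of its 5 transitions
```

Pages with high retention and high ideological homogeneity would then be flagged as candidate echo chambers.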

Differences in Likes Patterns and Toxic User Interactions

Zhonghao Wang, Shiyu Ji, Barum Park
To understand what drives users to engage in toxic online interactions, we examine whether overlap or difference in Liked pages (i.e., the page sources of Liked posts) or in Liked topics (i.e., the topics those posts discuss) between users leads to hateful language use in comments. Analyzing over 60 million comments posted on Facebook, we find that users are more likely to use hateful language targeting other users who like dissimilar pages as well as those who like dissimilar topics. Moreover, dissimilarity in Liked pages, which we interpret as reflecting users' ideology, is a stronger predictor of toxic comment interactions than dissimilarity in Liked topics, which we interpret as reflecting users' interests and lifestyles.
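One common way to quantify the overlap/difference between two users' Liked pages (a sketch with toy page sets; the project's exact dissimilarity measure is not specified here) is one minus the Jaccard overlap of the two sets:

```python
def liked_page_dissimilarity(pages_u, pages_v):
    """1 minus the Jaccard overlap of two users' Liked-page sets;
    higher values mean less shared page audience."""
    union = pages_u | pages_v
    if not union:
        return 0.0
    return 1 - len(pages_u & pages_v) / len(union)

u = {"PageA", "PageB", "PageC"}
v = {"PageB", "PageC", "PageD", "PageE"}
d = liked_page_dissimilarity(u, v)  # shares 2 of 5 pages -> 0.6
```

The same construction applied to sets of Liked topics gives the topic-based dissimilarity, allowing the two predictors to be compared on a common scale.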

Working Papers

  1. Kim, Junsol, and Byungkyu Lee. “AI-Augmented Surveys: Leveraging Large Language Models for Opinion Prediction in Nationally Representative Surveys.” arXiv, May 16, 2023.

Recent Publications

  1. Lee, Byungkyu, Kangsan Lee, and Benjamin Hartmann. “Transformation of Social Relationships in COVID-19 America: Remote Communication May Amplify Political Echo Chambers.” Science Advances 9, no. 51 (December 20, 2023): eadi1540.

Major Data Sources

Below, we briefly describe the main data sources we rely on. Once our papers are published in conference proceedings or academic journals, we will share data and code to replicate the results. In the meantime, we are willing to share data with potential collaborators. If you are interested in collaborating with us, please reach out!

We have identified 1,058 public forums that were active in March 2017 and subsequently collected all public comments and reaction data among three hundred million users across 7.5 million posts on Facebook from 2015 to 2017. Our data set consists of 1.2 billion comments, 8.2 billion reactions to posts, and 2.6 billion reactions to comments. In addition, we have been collecting and using a similar set of public comment data from the New York Times, YouTube, and Reddit.

In addition, we use network analysis, spatial analysis, and large language models to analyze data from social surveys, including ego-centric network data, text annotation surveys, and nationally representative surveys such as the GSS, ANES, and Pew surveys.