Ask a Caltech Expert: AI Tools for Addressing Online Misinformation and Harassment

Published on Friday, October 28, 2022 | 4:58 pm
 

As part of Conversations on Artificial Intelligence, a webinar series hosted by the Caltech Science Exchange, Professor of Political and Computational Social Science Michael Alvarez and Bren Professor of Computing and Mathematical Sciences Anima Anandkumar discuss how misinformation is amplified online and ways their artificial intelligence (AI) tools can help create a more trustworthy social media ecosystem.

As documented by the #MeToo movement and seen in the aftermath of the 2020 election, trolling and misleading information online lead to consequences in the offline world. In conversation with Caltech science writer Robert Perkins, the scientists describe AI methods that can parse millions of social media posts to identify and prevent online harassment and track the spread of misinformation and disinformation on platforms like Twitter and Facebook.

Highlights from the conversation are below; the questions and answers have been edited for clarity and length.

What have you learned so far about how misinformation spreads?

Michael Alvarez: We’ve been collecting Twitter data on tweets about the 2020 election and the information and misinformation that was being spread. For example: Were Twitter’s attempts to label and block some of the misinformation in 2020 successful? We found mixed evidence. The platforms are clearly having trouble policing the spread of misinformation.

Anima Anandkumar: In my view, the core of it is algorithmic amplification: automated amplification at such a large scale, combined with targeted attacks, many of them by bots, and platforms not having the ability to authenticate which users are real. Together, that makes these attacks much bigger in a very short amount of time.

Alvarez: Information flows very quickly, unlike in the past, when I first started participating in politics myself and things appeared mostly in the newspaper or on the evening news. These platforms, with their rapid spread of information and lack of real fact-checking, leave people susceptible to seeing lots of misleading information. It’s becoming, I think, one of the biggest problems we face in today’s world: How do we fact-check the information being spread on social media?

What is at stake here? What will happen if we don’t address it?

Alvarez: Well, that’s a big question. Recent polling from the Pew Research Center finds that four out of 10 Americans report some form of online harassment or bullying. That means many of the people watching this right now have experienced it themselves. Then there is the larger puzzle of policing misinformation and misleading information, which is very difficult to do at scale.

What’s at stake, on the one hand, is the harassment itself: people should not be harassed online, and that has to stop. In the macro scenario, we need to be very careful that our democracy isn’t eroded by the spread of misleading or factually incorrect information in this midterm election and, in particular, as we move toward the next presidential election. That is part of why we’re motivated to put a lot of time into this project, and I think it motivates a lot of our students to get involved as well.

What can AI do to help address this issue? Can you talk about the AI tools that you’ve been building and how they work?

Anandkumar: AI gives us an excellent array of tools to do this research at large scale. We can now analyze billions of tweets very quickly, which wouldn’t have been possible before.

One challenge here is that this is not a fixed vocabulary or a fixed set of topics. On social media, we have new topics that evolve all the time. COVID-19 wasn’t even in our vocabulary when the models were initially trained. Especially when it comes to harassment and trolling, bad actors are coming up with new keywords, new hashtags all the time. How do we handle this constantly evolving vocabulary?

For more than a decade, my research has focused on developing unsupervised learning tools for what we call topic modeling. Here, we automatically discover topics in data by looking at how patterns evolve. Think of it this way: if words occur together, they should have a strong relationship in representing a topic. If the words apple and orange occur together, the topic is likely fruit. What we call tensor methods look at such co-occurrence relationships. The AI then has to do this over hundreds of thousands of possible words and billions of documents to extract topics automatically. We propose scalable methods to extract hidden patterns, or topics, efficiently from such large-scale co-occurrence relationships between words.
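To make the co-occurrence idea concrete, here is a minimal sketch, not the authors' actual pipeline: it builds a tiny third-order word co-occurrence tensor from a hypothetical toy corpus and factorizes it with TensorLy's non-negative CP (PARAFAC) decomposition, so that each rank-1 component plays the role of one latent topic. The vocabulary, documents, and choice of non_negative_parafac are illustrative assumptions; the spectral methods in the papers cited below operate on moment tensors with additional whitening and scaling steps.

```python
# Illustrative sketch only: toy corpus, toy tensor, TensorLy CP decomposition.
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

# Hypothetical vocabulary and documents (bags of word indices).
vocab = ["apple", "orange", "fruit", "vote", "ballot", "election"]
docs = [
    [0, 1, 2], [1, 0, 2], [0, 2, 1],   # fruit-themed documents
    [3, 4, 5], [5, 3, 4], [4, 5, 3],   # election-themed documents
]

# Third-order co-occurrence tensor: T[i, j, k] counts how often the
# distinct words i, j, k appear together in the same document.
V = len(vocab)
T = np.zeros((V, V, V))
for doc in docs:
    for i in doc:
        for j in doc:
            for k in doc:
                if len({i, j, k}) == 3:   # distinct word triples only
                    T[i, j, k] += 1

# Non-negative CP (PARAFAC) decomposition, one rank-1 component per topic.
weights, factors = non_negative_parafac(tl.tensor(T), rank=2)

# The first factor matrix holds word loadings per topic; show the top words.
for t in range(2):
    top = np.argsort(np.asarray(factors[0][:, t]))[::-1][:3]
    print(f"topic {t}:", [vocab[i] for i in top])
```

Run on this toy input, the two recovered components separate the fruit words from the election words; the point of the tensor formulation is that the same decomposition scales to the hundreds of thousands of words and billions of documents described above.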

Alvarez: It’s not just that we can do this, it’s that with Anima’s contribution in this area, we can do this at scale very quickly. The techniques that she’s been developing allowed us to look at a data set of all of the tweets about #MeToo, about 9 million tweets. We can analyze that data set in a matter of minutes, whereas before it would take many days, maybe a month. We can do things in the computational social sciences that we could never do before.

Anandkumar: That’s the power of unsupervised learning. It doesn’t rely on humans labeling different topics, so you can do this online as the topics emerge instead of waiting for human annotators to collect data and discover new topics and then train AI.

Let’s turn to some potential concerns about this new technology. Is there any worry that it could be used to curtail free speech or could be misused in some way? Say, by an authoritarian government?

Anandkumar: Our tools are open source, and they’re really all about helping you understand different topic relationships in evolving text. It’s really helping people who are setting policies to handle the scale and the speed of data that is coming at them. It’s definitely not up to us to police or frame those questions. We are enabling researchers as well as social media companies to handle the scale and speed of data.

To me personally, algorithmic amplification is not about free speech. We are not curtailing anybody’s speech if we are saying how we should limit the amplification of misinformation. Somebody can shout from their rooftop anything they want; they’re free to do that. But if it reaches millions of people, and people tend to believe what is being engaged with at such a massive scale, that’s a different story. I think we should separate free speech from algorithmic amplification.

Alvarez: I think Anima really answered the question quite well in that, as scientists, our responsibility, first and foremost, is to the scientific community, to make sure that our colleagues can use these tools and verify that they work as we claim they do. We have a responsibility to share the materials that we’ve used to develop those claims.

But we do hope these tools are used by the social media platforms. We know they have very large data science teams, and we know they follow what we do, so we’re very optimistic that they will take a look at these tools and perhaps use them to mitigate, if not resolve, a lot of the problems that currently exist.

How will AI keep up with highly variable social mores over time?

Anandkumar: The idea is not for AI to make decisions about how we deal with amplification; there has to be a human in the loop, and I think that’s one of the problems today: amplification is automated. Unsupervised learning tools watch the patterns in the data itself as they evolve and surface them to human moderators. And there need to be consistent policies to guide those moderators in judging whether something is harassment.

The short answer is yes, we have to be adaptable to different social mores and changing conditions. If we provide agile AI tools that enable human moderators to adapt, I think that’s the best we can do.

Alvarez: And I’ll say that there are other people who study this at Caltech. For example, Frederick Eberhardt, a faculty member and philosopher, teaches a wonderful class here at Caltech on the ethics of AI. As our research continues to grow and progress, I’m pretty confident that we’ll reach out to scholars like Frederick here at Caltech and elsewhere to incorporate more of the underlying ethics of AI use and how AI responds as social mores around things like harassment change.

When and where should we be on the lookout for your tools to become open source?

Anandkumar: We are going to release this as part of an open-source library called TensorLy. Mike also maintains a blog about trustworthy social media, and we’ll hopefully announce it on Twitter and other social media.
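For readers who want to experiment now, TensorLy itself is already installable; the topic-modeling additions mentioned here may arrive in a later release, so the snippet below only verifies the base library and assumes no new module names.

```python
# Quick environment check for TensorLy; the interview's topic-modeling tools
# are announced for this library, so no new module names are assumed here.
# Install with: pip install tensorly
import tensorly as tl
from tensorly.decomposition import non_negative_parafac  # used in the sketch above

print(tl.__version__)     # installed TensorLy version
print(tl.get_backend())   # default backend is "numpy"
```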

You can also look at some of our papers on this topic:

• “Tensor Decompositions for Learning Latent Variable Models,” Journal of Machine Learning Research, 2014
• “Large Scale Cloud Deployment of Spectral Topic Modeling,” ParLearning Workshop at Knowledge Discovery and Data Mining (KDD) 2019, Anchorage, Alaska, USA

Here are some of the other questions addressed in the video linked above:

• Who gets to decide what the truth is and what is misinformation?
• How does monitoring for misinformation differ across social media platforms? For example, is it more difficult to track TikTok, which is video-based, than text-heavy platforms like Twitter or Facebook?
• How would we monitor misinformation in languages other than English?
• What are the most persuasive facts regarding the integrity of federal elections?

Learn more about artificial intelligence and the science behind voting and elections on the Caltech Science Exchange.
