LISS2117: Quantitative methods for text classification and topic detection

2025-03-28

Course description

This course introduces students to a range of methods for automated text classification in the social sciences. The methods covered allow researchers, among other things, to automatically annotate texts (e.g., identify sentences containing hate speech in a large corpus) or to detect which topics they are about (e.g., to what extent a newspaper article discusses the economy). The sessions cover approaches of varying sophistication, whilst also focusing on foundational and cross-cutting concepts relevant to any analysis relying on automated text classification. The course is structured in three sessions; beyond introducing various methods for automated text classification, each session also includes practical exercises that allow students to familiarise themselves with the methods discussed.

At the end of this course, students will be able to:

Convenor

Dr Michele Scotto di Vettimo, King’s College London

Registration

This course is currently offered in the context of the Summer Term 2025 training programme of the London Interdisciplinary Social Science Doctoral Training Partnership (LISS-DTP). Eligible students can register via SkillsForge.

Course structure

The course is structured in three sessions, each combining theoretical discussion of different methodologies with practical exercises. A full reading list is provided below.

Session 1:

Session 1 covers foundational concepts related to automated text classification and introduces some basic quantitative approaches. Firstly, it places text classification in the wider context of quantitative text analysis and clarifies its scope and its relationship with other research tasks (e.g., text scaling). Secondly, it introduces foundational elements of bag-of-words approaches (e.g., tokens, document-feature matrices). Thirdly, it covers basic methods for automated classification, such as keyword counting and dictionary methods. Finally, it focuses on general theoretical and practical aspects of validating results, a theme that is elaborated throughout the course and tailored to each text analysis method discussed.
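To give a flavour of the practical exercises, the dictionary approach discussed in this session can be sketched in a few lines of plain Python. The toy corpus, the simple tokeniser, and the "economy" keyword dictionary below are illustrative assumptions, not course materials:

```python
from collections import Counter

# Toy corpus: three short "articles" (illustrative only)
corpus = [
    "The economy grew as inflation fell and jobs returned.",
    "The striker scored twice in the final match.",
    "Rising unemployment and inflation worry voters.",
]

# A minimal "economy" dictionary of keywords (an illustrative assumption)
economy_dict = {"economy", "inflation", "jobs", "unemployment", "growth"}

def tokenize(text):
    """Very simple tokeniser: strip punctuation and lowercase."""
    return [tok.strip(".,!?").lower() for tok in text.split()]

def dictionary_score(text, dictionary):
    """Share of a document's tokens that match the dictionary."""
    tokens = tokenize(text)
    counts = Counter(tokens)  # feature counts for one document
    hits = sum(counts[w] for w in dictionary)
    return hits / len(tokens) if tokens else 0.0

scores = [dictionary_score(doc, economy_dict) for doc in corpus]
```

Here the first and third documents receive positive economy scores while the football document scores zero; a real analysis would of course use a validated dictionary and a proper document-feature matrix.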

Required readings

  • Grimmer & Stewart (2013).

  • Grimmer et al. (2022), Chapters 5 and 16.

Optional readings

  • Grimmer et al. (2022), Chapter 15.

  • Van Atteveldt et al. (2021).

Slides and other materials

To be uploaded closer to the day of the session.

Session 2:

Session 2 expands the discussion of bag-of-words approaches by covering topic models and machine-learning algorithms for text classification. Firstly, the session focuses on topic classification using semi-supervised topic models, which are compared both to unsupervised alternatives and to the simpler approaches covered in Session 1. Secondly, it introduces the logic of automated classification via machine learning and presents some widely used algorithms for text classification. In so doing, it expands on issues related to model training and the validation of results.
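The supervised-learning logic described here can be illustrated with a minimal multinomial Naive Bayes classifier written in plain Python, with add-one (Laplace) smoothing. The tiny labelled training set and all names are illustrative assumptions; the course exercises may rely on dedicated libraries instead:

```python
import math
from collections import Counter, defaultdict

# Tiny labelled training set (illustrative only)
train = [
    ("inflation and jobs dominate the budget debate", "economy"),
    ("markets react to the interest rate decision", "economy"),
    ("the team won the championship final", "sport"),
    ("a stunning goal decided the derby", "sport"),
]

def tokenize(text):
    return text.lower().split()

# Estimate per-class word counts and class priors from the training data
class_word_counts = defaultdict(Counter)
class_doc_counts = Counter()
vocab = set()
for text, label in train:
    toks = tokenize(text)
    class_word_counts[label].update(toks)
    class_doc_counts[label] += 1
    vocab.update(toks)

def predict(text):
    """Return the class with the highest log posterior, using add-one smoothing."""
    scores = {}
    for label, counts in class_word_counts.items():
        # log prior: share of training documents in this class
        logp = math.log(class_doc_counts[label] / sum(class_doc_counts.values()))
        total = sum(counts.values())
        for tok in tokenize(text):
            # Laplace-smoothed log likelihood of each token given the class
            logp += math.log((counts[tok] + 1) / (total + len(vocab)))
        scores[label] = logp
    return max(scores, key=scores.get)
```

With such tiny training data the classifier is only a demonstration of the training/prediction logic; validation against held-out, human-annotated documents (as discussed in the session) is what makes such a model usable in research.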

Required readings

  • Grimmer et al. (2022), Chapters 13, 19, and 20.

Optional readings

  • Anastasopoulos & Bertelli (2020).

  • Barberá et al. (2021).

  • Watanabe & Zhou (2022).

Slides and other materials

To be uploaded closer to the day of the session.

Session 3:

Session 3 moves away from bag-of-words approaches and introduces newer methodologies based on word embeddings and large language models. Firstly, it presents embedding representations of words and their key properties. It then revisits some of the machine-learning algorithms covered in the previous session to show how they can handle both bag-of-words and word-embedding representations. Secondly, the session focuses on large language models, particularly transformer models, covering the use and fine-tuning of pre-trained models for text classification. Thirdly, it introduces natural language inference as a strategy for text classification. Finally, the session gives an overview of models capable of classifying texts in multilingual contexts.
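The embedding-based logic introduced here can be sketched without any deep-learning library: represent a document as the average of its word vectors and compare it to each label's vector with cosine similarity. The three-dimensional vectors below are hand-made illustrative assumptions standing in for real pre-trained embeddings (which typically have hundreds of dimensions):

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative assumptions only)
embeddings = {
    "inflation": [0.9, 0.1, 0.0],
    "markets":   [0.8, 0.2, 0.1],
    "budget":    [0.7, 0.1, 0.2],
    "goal":      [0.1, 0.9, 0.1],
    "match":     [0.0, 0.8, 0.2],
    "team":      [0.1, 0.7, 0.1],
}

def doc_vector(tokens):
    """Average the embeddings of known tokens; out-of-vocabulary tokens are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Classify by similarity to a vector for each label (a crude stand-in for the
# zero-shot and NLI-style strategies discussed in this session)
labels = {"economy": [1.0, 0.0, 0.0], "sport": [0.0, 1.0, 0.0]}

def classify(text):
    dv = doc_vector(text.lower().split())
    return max(labels, key=lambda lab: cosine(dv, labels[lab]))
```

Averaged word vectors can feed the same classifiers used with document-feature matrices in Session 2; transformer models and natural language inference replace these static vectors with contextual representations.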

Required readings

  • Grimmer et al. (2022), Chapter 8.

  • Laurer et al. (2023).

  • Rodriguez & Spirling (2022).

Optional readings

  • Laurer et al. (2024).

  • Rodman (2020).

Slides and other materials

To be uploaded closer to the day of the session.

Full reading list

This is a general reading list (please see “Course structure” above for the specific preparation required for each session). The material below offers an overview of additional sources on the concepts, methodologies, and substantive applications covered in the course, roughly grouped by topic.

General readings

Baden, C., Pipal, C., Schoonvelde, M., & van der Velden, M. A. G. (2022). “Three gaps in computational text analysis methods for social sciences: A research agenda”. Communication Methods and Measures, 16(1), 1-18. DOI: 10.1080/19312458.2021.2015574.

Benoit, K. (2020). “Text as data: An overview”. In The SAGE Handbook of Research Methods in Political Science and International Relations. London: SAGE Publishing.

DiMaggio, P. (2015). “Adapting computational text analysis to social science (and vice versa)”. Big Data & Society, 2(2), 1-5. DOI: 10.1177/2053951715602908.

Grimmer, J., Roberts, M., & Stewart, B. (2022). Text As Data. A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press.

Grimmer, J., & Stewart, B. (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”, Political Analysis, 21(3), 267-297. DOI: 10.1093/pan/mps028.

Dictionary methods

Busuioc, M., & Rimkute, D. (2020). “Meeting expectations in the EU regulatory state? Regulatory communications amid conflicting institutional demands”. Journal of European Public Policy, 27(4), 547-568. DOI: 10.1080/13501763.2019.1603248.

Chinn, S., Hart, P., & Soroka, S. (2020). “Politicization and Polarization in Climate Change News Content, 1985-2017”. Science Communication, 42(1), 112-129. DOI: 10.1177/1075547019900290.

King, G., Lam, P., & Roberts, M. (2017). “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text”. American Journal of Political Science, 61(4), 971-988. DOI: 10.1111/ajps.12291.

Trubowitz, P., & Watanabe, K. (2021). “Geopolitical Threat Index: A text-based computational approach to identifying foreign threats”. International Studies Quarterly, 65(3), 852-865. DOI: 10.1093/isq/sqab029.

Young, L., & Soroka, S. (2012). “Affective news: The automated coding of sentiment in political texts”. Political Communication, 29(2), 205-231. DOI: 10.1080/10584609.2012.671234.

Topic models

Bernhard, J., Teuffenbach, M., & Boomgaarden, H. G. (2023). “Topic Model validation methods and their impact on Model selection and evaluation”. Computational Communication Research, 5(1), 1-26. DOI: 10.5117/CCR2023.1.13.BERN.

Chen, Y., Peng, Z., Kim, S. H., & Choi, C. W. (2023). “What we can do and cannot do with topic modeling: A systematic review”. Communication Methods and Measures, 17(2), 111-130. DOI: 10.1080/19312458.2023.2167965.

Eshima, S., Imai, K., & Sasaki, T. (2024). “Keyword-assisted topic models”. American Journal of Political Science, 68(2), 730-750. DOI: 10.1111/ajps.12779.

Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Haussler, T., Schmid-Petri, H., & Adam, S. (2018). “Applying LDA topic modeling in communication research: Toward a valid and reliable methodology”. Communication Methods and Measures, 12(2-3), 93-118. DOI: 10.1080/19312458.2018.1430754.

Machine learning algorithms

Anastasopoulos, L. J., & Bertelli, A. M. (2020). “Understanding delegation through machine learning: A method and application to the European Union”. American Political Science Review, 114(1), 291-301. DOI: 10.1017/S0003055419000522.

Large language models and embeddings methods

Laurer, M., van Atteveldt, W., Casas, A., & Welbers, K. (2023). “Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI”. Political Analysis, 32(1), 84-100. DOI: 10.1017/pan.2023.20.

Laurer, M., van Atteveldt, W., Casas, A., & Welbers, K. (2024). “On Measurement Validity and Language Models: Increasing Validity and Decreasing Bias with Instructions”. Communication Methods and Measures, 1-17. DOI: 10.1080/19312458.2024.2378690.

Licht, H. (2023). “Cross-lingual classification of political texts using multilingual sentence embeddings”. Political Analysis, 31(3), 366-379. DOI: 10.1017/pan.2022.29.

Rodriguez, P. L., & Spirling, A. (2022). “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research”. The Journal of Politics, 84(1), 101-115. DOI: 10.1086/715162.

Validation and method comparison

Birkenmaier, L., Lechner, C., & Wagner, C. (2024). “The search for solid ground in text as data: A systematic review of validation approaches”. Communication Methods and Measures, 18(3), 249-277. DOI: 10.1080/19312458.2023.2285765.

Hartmann, J., Huppertz, J., Schamp, C., & Heitmann, M. (2019). “Comparing automated text classification methods”. International Journal of Research in Marketing, 36(1), 20-38. DOI: 10.1016/j.ijresmar.2018.09.009.

Reveilhac, M., & Morselli, D. (2022). “Dictionary-based and machine learning classification approaches: a comparison for tonality and frame detection on Twitter data”. Political Research Exchange, 4(1), 2029217. DOI: 10.1080/2474736X.2022.2029217.

Song, H., Tolochko, P., Eberl, J. M., Eisele, O., Greussing, E., Heidenreich, T., Lind, F., Galyga, S., & Boomgaarden, H. G. (2020). “In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis”. Political Communication, 37(4), 550-572. DOI: 10.1080/10584609.2020.1723752.

Van Atteveldt, W., Van der Velden, M. A., & Boukes, M. (2021). “The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms”. Communication Methods and Measures, 15(2), 121-140. DOI: 10.1080/19312458.2020.1869198.

Widmann, T., & Wich, M. (2023). “Creating and comparing dictionary, word embedding, and transformer-based models to measure discrete emotions in German political text”. Political Analysis, 31(4), 626-641. DOI: 10.1017/pan.2022.15.

Ying, L., Montgomery, J. M., & Stewart, B. M. (2022). “Topics, concepts, and measurement: A crowdsourced procedure for validating topics as measures”. Political Analysis, 30(4), 570-589. DOI: 10.1017/pan.2021.33.