LISS2117: Quantitative methods for text classification and topic detection

2025-03-28

Course description

This course introduces students to a range of methods for automated text classification in the social sciences. The methods covered allow researchers, among other things, to automatically annotate texts (e.g., identify sentences containing hate speech in a large corpus) or to detect which topics they are about (e.g., to what extent a newspaper article discusses the economy). The sessions cover approaches of varying sophistication, whilst also focusing on foundational and cross-cutting concepts relevant to any analysis relying on automated text classification. The course is structured in three sessions; beyond introducing various methods for automated text classification, each session also includes practical exercises that allow students to familiarise themselves with the methods discussed.

At the end of this course, students will be able to:

Convenor

Dr Michele Scotto di Vettimo, King’s College London

Registration

This course is currently offered in the context of the Summer Term 2025 training programme of the London Interdisciplinary Social Science Doctoral Training Partnership (LISS-DTP). Eligible students can register via SkillsForge.

Course structure

The course is structured in three sessions, each combining theoretical discussion of different methodologies with practical exercises. A full reading list is provided below.

Session 1:

Session 1 covers foundational concepts related to automated text classification and introduces some basic quantitative approaches. Firstly, it places text classification in the wider context of quantitative text analysis and clarifies its scope and its relationship with other research tasks (e.g., text scaling). Secondly, it introduces foundational elements of bag-of-words approaches (e.g., tokens, document-feature matrices). Thirdly, it covers basic methods for automated classification, such as keyword counting and dictionary methods. Finally, it focuses on general theoretical and practical aspects of validating results, a theme that is elaborated throughout the course and tailored to each text analysis method discussed.
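To give a flavour of the practical exercises, the dictionary approach discussed in this session can be sketched in a few lines of plain Python. The toy corpus, the simple tokeniser, and the "economy" keyword dictionary below are illustrative assumptions, not course materials:

```python
from collections import Counter

# Toy corpus: three short "articles" (illustrative only)
corpus = [
    "The economy grew as inflation fell and jobs returned.",
    "The striker scored twice in the final match.",
    "Rising unemployment and inflation worry voters.",
]

# A minimal "economy" dictionary of keywords (an illustrative assumption)
economy_dict = {"economy", "inflation", "jobs", "unemployment", "growth"}

def tokenize(text):
    """Very simple tokeniser: strip punctuation and lowercase."""
    return [tok.strip(".,!?").lower() for tok in text.split()]

def dictionary_score(text, dictionary):
    """Share of a document's tokens that match the dictionary."""
    tokens = tokenize(text)
    counts = Counter(tokens)  # feature counts for one document
    hits = sum(counts[w] for w in dictionary)
    return hits / len(tokens) if tokens else 0.0

scores = [dictionary_score(doc, economy_dict) for doc in corpus]
```

Here the first and third documents receive positive economy scores while the football document scores zero; a real analysis would of course use a validated dictionary and a proper document-feature matrix.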

Required readings

  • Grimmer & Stewart (2013).

  • Grimmer et al. (2022), Chapters 5 and 16.

Optional readings

  • Grimmer et al. (2022), Chapter 15.

  • Van Atteveldt et al. (2021).

Slides and other materials

To be uploaded closer to the day of the session.

Session 2:

Session 2 expands the discussion of bag-of-words approaches by covering topic models and machine-learning algorithms for text classification. Firstly, the session focuses on topic classification using semi-supervised topic models, which are compared both to unsupervised alternatives and to the simpler approaches covered in Session 1. Secondly, it introduces the logic of automated classification via machine learning and presents some widely used algorithms for text classification. In so doing, it expands on issues related to model training and the validation of results.
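The supervised-learning logic described here can be illustrated with a minimal multinomial Naive Bayes classifier written in plain Python, with add-one (Laplace) smoothing. The tiny labelled training set and all names are illustrative assumptions; the course exercises may rely on dedicated libraries instead:

```python
import math
from collections import Counter, defaultdict

# Tiny labelled training set (illustrative only)
train = [
    ("inflation and jobs dominate the budget debate", "economy"),
    ("markets react to the interest rate decision", "economy"),
    ("the team won the championship final", "sport"),
    ("a stunning goal decided the derby", "sport"),
]

def tokenize(text):
    return text.lower().split()

# Estimate per-class word counts and class priors from the training data
class_word_counts = defaultdict(Counter)
class_doc_counts = Counter()
vocab = set()
for text, label in train:
    toks = tokenize(text)
    class_word_counts[label].update(toks)
    class_doc_counts[label] += 1
    vocab.update(toks)

def predict(text):
    """Return the class with the highest log posterior, using add-one smoothing."""
    scores = {}
    for label, counts in class_word_counts.items():
        # log prior: share of training documents in this class
        logp = math.log(class_doc_counts[label] / sum(class_doc_counts.values()))
        total = sum(counts.values())
        for tok in tokenize(text):
            # Laplace-smoothed log likelihood of each token given the class
            logp += math.log((counts[tok] + 1) / (total + len(vocab)))
        scores[label] = logp
    return max(scores, key=scores.get)
```

With such tiny training data the classifier is only a demonstration of the training/prediction logic; validation against held-out, human-annotated documents (as discussed in the session) is what makes such a model usable in research.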

Required readings

  • Grimmer et al. (2022), Chapters 13, 19, and 20.

Optional readings

  • Anastasopoulos & Bertelli (2020).

  • Barberá et al. (2021).

  • Watanabe & Zhou (2022).

Slides and other materials

To be uploaded closer to the day of the session.

Session 3:

Session 3 moves away from bag-of-words approaches and introduces newer methodologies based on word embeddings and large language models. Firstly, it presents embedding representations of words and their key properties. It then revisits some of the machine-learning algorithms covered in the previous session to show how they can handle both bag-of-words and word-embedding representations. Secondly, the session focuses on large language models, particularly transformer models, covering the use and fine-tuning of pre-trained models for text classification. Thirdly, it introduces natural language inference as a strategy for text classification. Finally, the session gives an overview of models capable of classifying texts in multilingual contexts.
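The embedding-based logic introduced here can be sketched without any deep-learning library: represent a document as the average of its word vectors and compare it to each label's vector with cosine similarity. The three-dimensional vectors below are hand-made illustrative assumptions standing in for real pre-trained embeddings (which typically have hundreds of dimensions):

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative assumptions only)
embeddings = {
    "inflation": [0.9, 0.1, 0.0],
    "markets":   [0.8, 0.2, 0.1],
    "budget":    [0.7, 0.1, 0.2],
    "goal":      [0.1, 0.9, 0.1],
    "match":     [0.0, 0.8, 0.2],
    "team":      [0.1, 0.7, 0.1],
}

def doc_vector(tokens):
    """Average the embeddings of known tokens; out-of-vocabulary tokens are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Classify by similarity to a vector for each label (a crude stand-in for the
# zero-shot and NLI-style strategies discussed in this session)
labels = {"economy": [1.0, 0.0, 0.0], "sport": [0.0, 1.0, 0.0]}

def classify(text):
    dv = doc_vector(text.lower().split())
    return max(labels, key=lambda lab: cosine(dv, labels[lab]))
```

Averaged word vectors can feed the same classifiers used with document-feature matrices in Session 2; transformer models and natural language inference replace these static vectors with contextual representations.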

Required readings

  • Grimmer et al. (2022), Chapter 8.

  • Laurer et al. (2023).

  • Rodriguez & Spirling (2022).

Optional readings

  • Laurer et al. (2024).

  • Rodman (2020).

Slides and other materials

To be uploaded closer to the day of the session.

Full reading list

This is a general reading list (please see “Course structure” above for the specific preparation required for each session). The material below offers an overview of additional sources on the concepts, methodologies, and substantive applications covered in the course, roughly grouped by topic.

General readings

Baden, C., Pipal, C., Schoonvelde, M., & van der Velden, M. A. G. (2022). “Three gaps in computational text analysis methods for social sciences: A research agenda”. Communication Methods and Measures, 16(1), 1-18. DOI: 10.1080/19312458.2021.2015574.

Benoit, K. (2020). “Text as data: An overview”. In The SAGE Handbook of Research Methods in Political Science and International Relations. London: SAGE Publishing.

DiMaggio, P. (2015). “Adapting computational text analysis to social science (and vice versa)”. Big Data & Society, 2(2), 1-5. DOI: 10.1177/2053951715602908.

Grimmer, J., Roberts, M., & Stewart, B. (2022). Text As Data. A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press.

Grimmer, J., & Stewart, B. (2013). “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”, Political Analysis, 21(3), 267-297. DOI: 10.1093/pan/mps028.

Dictionary methods

Busuioc, M., & Rimkute, D. (2020). “Meeting expectations in the EU regulatory state? Regulatory communications amid conflicting institutional demands”. Journal of European Public Policy, 27(4), 547-568. DOI: 10.1080/13501763.2019.1603248.

Chinn, S., Hart, P., & Soroka, S. (2020). “Politicization and Polarization in Climate Change News Content, 1985-2017”. Science Communication, 42(1), 112-129. DOI: 10.1177/1075547019900290.

King, G., Lam, P., & Roberts, M. (2017). “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text”. American Journal of Political Science, 61(4), 971-988. DOI: 10.1111/ajps.12291.

Trubowitz, P., & Watanabe, K. (2021). “Geopolitical Threat Index: A text-based computational approach to identifying foreign threats”. International Studies Quarterly, 65(3), 852-865. DOI: 10.1093/isq/sqab029.

Young, L., & Soroka, S. (2012). “Affective news: The automated coding of sentiment in political texts”. Political Communication, 29(2), 205-231. DOI: 10.1080/10584609.2012.671234.

Topic models

Bernhard, J., Teuffenbach, M., & Boomgaarden, H. G. (2023). “Topic Model validation methods and their impact on Model selection and evaluation”. Computational Communication Research, 5(1), 1-26. DOI: 10.5117/CCR2023.1.13.BERN.

Chen, Y., Peng, Z., Kim, S. H., & Choi, C. W. (2023). “What we can do and cannot do with topic modeling: A systematic review”. Communication Methods and Measures, 17(2), 111-130. DOI: 10.1080/19312458.2023.2167965.

Eshima, S., Imai, K., & Sasaki, T. (2024). “Keyword-assisted topic models”. American Journal of Political Science, 68(2), 730-750. DOI: 10.1111/ajps.12779.

Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Haussler, T., Schmid-Petri, H., & Adam, S. (2018). “Applying LDA topic modeling in communication research: Toward a valid and reliable methodology”. Communication Methods and Measures, 12(2-3), 93-118. DOI: 10.1080/19312458.2018.1430754.

Machine learning algorithms

Anastasopoulos, L. J., & Bertelli, A. M. (2020). “Understanding delegation through machine learning: A method and application to the European Union”. American Political Science Review, 114(1), 291-301. DOI: 10.1017/S0003055419000522.

Large language models and embeddings methods

Laurer, M., van Atteveldt, W., Casas, A., & Welbers, K. (2023). “Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI”. Political Analysis, 32(1), 84-100. DOI: 10.1017/pan.2023.20.

Laurer, M., van Atteveldt, W., Casas, A., & Welbers, K. (2024). “On Measurement Validity and Language Models: Increasing Validity and Decreasing Bias with Instructions”. Communication Methods and Measures, 1-17. DOI: 10.1080/19312458.2024.2378690.

Licht, H. (2023). “Cross-lingual classification of political texts using multilingual sentence embeddings”. Political Analysis, 31(3), 366-379. DOI: 10.1017/pan.2022.29.

Rodriguez, P. L., & Spirling, A. (2022). “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research”. The Journal of Politics, 84(1), 101-115. DOI: 10.1086/715162.

Validation and method comparison

Birkenmaier, L., Lechner, C., & Wagner, C. (2024). “The search for solid ground in text as data: A systematic review of validation approaches”. Communication Methods and Measures, 18(3), 249-277. DOI: 10.1080/19312458.2023.2285765.

Hartmann, J., Huppertz, J., Schamp, C., & Heitmann, M. (2019). “Comparing automated text classification methods”. International Journal of Research in Marketing, 36(1), 20-38. DOI: 10.1016/j.ijresmar.2018.09.009.

Reveilhac, M., & Morselli, D. (2022). “Dictionary-based and machine learning classification approaches: a comparison for tonality and frame detection on Twitter data”. Political Research Exchange, 4(1), 2029217. DOI: 10.1080/2474736X.2022.2029217.

Song, H., Tolochko, P., Eberl, J. M., Eisele, O., Greussing, E., Heidenreich, T., Lind, F., Galyga, S., & Boomgaarden, H. G. (2020). “In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis”. Political Communication, 37(4), 550-572. DOI: 10.1080/10584609.2020.1723752.

Van Atteveldt, W., Van der Velden, M. A., & Boukes, M. (2021). “The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms”. Communication Methods and Measures, 15(2), 121-140. DOI: 10.1080/19312458.2020.1869198.

Widmann, T., & Wich, M. (2023). “Creating and comparing dictionary, word embedding, and transformer-based models to measure discrete emotions in German political text”. Political Analysis, 31(4), 626-641. DOI: 10.1017/pan.2022.15.

Ying, L., Montgomery, J. M., & Stewart, B. M. (2022). “Topics, concepts, and measurement: A crowdsourced procedure for validating topics as measures”. Political Analysis, 30(4), 570-589. DOI: 10.1017/pan.2021.33.