[DHd-blog] Workshop Review: Expert Workshop on Topic Models and Corpus Analysis


Under the auspices of the DARIAH-EU Working Group on Text and Data Analytics, an expert colloquium was held at Dublin City University’s Adapt Centre on 14–15 December 2015, on the subject of topic models and corpus-analytic approaches in the humanities, with a special focus on literary studies and the philologies.

In light of increasingly available digital text resources and suitable quantitative methodologies – which are increasingly able to augment or even reframe research questions traditionally seen as exclusively qualitative – such approaches have found their way into a variety of humanities disciplines and demand a closer look at their domain-specific adaptations, both technically and epistemologically. The workshop was therefore addressed to experienced users, researchers, and developers working with corpus-analytic approaches, especially those geared towards the automatic analysis of semantic content. Contributions ranged from specific experience reports to opinion pieces discussing the broader implications of these approaches, and while the workshop was set up around a group of experts from the TDA Working Group, it was open to the public and well attended by master’s students and doctoral researchers working in the field.

In the course of the two-day programme, a wide variety of corpus-analytic approaches were discussed – ranging from topic models and other established methods of ‘distant reading’ to approaches using novel feature combinations. These contributions were framed by a unique perspective from translation studies, presented by Sharon O’Brien. Her keynote highlighted critical points in the adoption of computer-aided methodologies in translation studies – a discipline that came into contact with language technology early on – showing that machine translation has eventually become able to adapt the job to the person, rather than forcing the user to work for the machine. This development has led to human-in-the-loop approaches – where the initial machine output acts as a springboard for the actual translation, yielding substantial time savings – and to an integrated, cognitive perspective on machine translation.

Following this outlook from translation studies, a number of corpus-analytic studies closely aligned with the theme of the workshop were discussed. Maciej Piasecki and Maciej Maryl presented Applications of Text Clustering Methods in Literary Analysis of Weblog Genres, using a stylometric approach and reporting on language features that lend themselves to genre clustering, leading to a preliminary identification of more approachable (e.g. cooking recipes) as well as more challenging (e.g. short stories) text types. Maciej Piasecki furthermore showed a prototype of WebSty, an open web-based stylometric system. As part of a research design informed by political and sociological theory, Susan Leavy presented her work on Detecting Gender Bias in the Coverage of Politicians in Irish Newspapers Using Automated Text Classification. Her original classification approach leverages various linguistic and sociopolitical features as training material, making it possible to explore patterns of difference and gender bias in media representations. Another corpus-driven approach was put forward by Jan Rybicki, who reframed translation as text re-use and showed how anti-plagiarism and translation-memory tools can be applied to compare different translations of a text with each other and to explore translation as ‘a series of potential texts’.

A number of contributions specifically fostered the discussion of topic modeling techniques. Carsten Schnober presented an approach for the Extrinsic Evaluation of Topic Models on Unknown Corpora, which – based on the observation that the plethora of preprocessing options and parameter settings makes it difficult to compare topic models – proposes assessing model quality via an information retrieval use case built on the model. Another contribution in the direction of topic model evaluation was put forward by Gary Munnelly: Finding Meaning in the Chaos. Establishing Corpus Size Limits for Unsupervised Semantic Linking shows how measuring the stability of an increasingly reduced model against its gold-standard variant gives a lead on the smallest corpus size still viable for topic modeling. Coming from the angle of personal semantics and folk taxonomy, Gregory Grefenstette presented a novel (i.e. non-LDA) approach to topic modeling (Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access) and showed how topic-specific taxonomies can be extracted as cleanly as possible from domain texts (crawled via dmoz), as well as how the extracted taxonomies can be evaluated using thematically related subreddits. Bringing the topic modeling track to a close, Fotis Jannidis reported on the adoption of topic modeling techniques in the Digital Humanities, contrasting pivotal points in the technical development with its uptake in humanities contexts and discussing the epistemological implications of its application in literary theory.

Finally, as a missing link and an overarching contribution to various research designs in literary computing, Allen Riddell and Karina van Dalen-Oskam reported on a large survey asking people what it means for books to be literary, furthermore triangulating the responses with census data and other sociologically relevant information – thus approaching a gold standard for reader perceptions in literary studies.

The diversity and quality of the workshop contributions provided the participants with an equally varied and engaging overview of current approaches in automated text analysis, fostered discussion, and led to the identification of overlapping research agendas – all of which will be pursued further within the Text and Data Analytics Working Group.

Follow us on Twitter: @dariahtdawg