Compiled by: Shang Wenyi (尚闻一); reposted from the WeChat public account DH数字人文 (DH Digital Humanities)


Compiled by: Shang Wenyi (尚闻一) / PhD student, School of Information Sciences, University of Illinois at Urbana-Champaign


Brandon Walsh (University of Virginia Library) and Rebecca Draughon (Department of Religious Studies, University of Virginia) have created the research project "A Humanist's Cookbook for Natural Language Processing in Python," a series of Jupyter notebooks designed for humanities scholars that works through common data-manipulation problems in text analysis:
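The notebooks address operations of exactly this kind. As one representative example (not taken from the project itself), a minimal word-frequency count in Python might look like the following; the sample text and stopword list are invented for illustration:

```python
from collections import Counter
import re

def word_frequencies(text, stopwords=frozenset({"the", "of", "and", "a"})):
    """Lowercase the text, tokenize on letter runs, drop stopwords, count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

freqs = word_frequencies("The whale, the whale! A tale of the whale.")
print(freqs.most_common(2))  # → [('whale', 3), ('tale', 1)]
```

Real humanities corpora need more care (lemmatization, encoding, domain-specific stoplists), which is what makes a "cookbook" of worked notebooks useful.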



Ethan Fetaya (Bar-Ilan University, Israel) and colleagues reconstructed fragmentary Babylonian clay-tablet texts using recurrent neural networks, published in the September 2020 issue of PNAS (Proceedings of the National Academy of Sciences of the United States of America). This should count as an important breakthrough:



The main sources of information regarding ancient Mesopotamian history and culture are clay cuneiform tablets. Many of these tablets are damaged, leading to missing information. Currently, the missing text is manually reconstructed by experts. We investigate the possibility of assisting scholars, by modeling the language using recurrent neural networks and automatically completing the breaks in ancient Akkadian texts from Achaemenid period Babylonia.


The documentary sources for the political, economic, and social history of ancient Mesopotamia constitute hundreds of thousands of clay tablets inscribed in the cuneiform script. Most tablets are damaged, leaving gaps in the texts written on them, and the missing portions must be restored by experts. This paper uses available digitized texts for training advanced machine-learning algorithms to restore daily economic and administrative documents from the Persian empire (sixth to fourth centuries BCE). As the amount of digitized texts grows, the model can be trained to restore damaged texts belonging to other genres, such as scientific or literary texts. Therefore, this is a first step for a large-scale reconstruction of a lost ancient heritage.
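The paper trains a recurrent neural network as a language model over transliterated Akkadian and uses it to rank candidate restorations for broken spans. As a much simpler sketch of the same scoring idea, here is a bigram language model ranking candidate fill-ins for a gap; the toy corpus, the damaged line, and the candidate words are all invented, and the bigram model stands in for the paper's far more powerful RNN:

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count word bigrams over a (toy) training corpus."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def score_candidate(counts, left, candidate, right):
    """Add-one-smoothed probability of `candidate` filling the gap
    between the context words `left` and `right`."""
    vocab = {w for c in counts.values() for w in c} | set(counts)
    def p(prev, word):
        return (counts[prev][word] + 1) / (sum(counts[prev].values()) + len(vocab))
    return p(left, candidate) * p(candidate, right)

# Invented "corpus" of formulaic administrative lines.
corpus = [
    "barley for the house of the king",
    "silver for the house of the god",
    "barley for the temple of the god",
]
counts = train_bigrams(corpus)

# A damaged line: "barley for the [...] of the king" — rank candidates.
candidates = ["house", "temple", "silver"]
ranked = sorted(candidates,
                key=lambda w: score_candidate(counts, "the", w, "of"),
                reverse=True)
print(ranked)  # → ['house', 'temple', 'silver']
```

Formulaic genres such as the economic and administrative documents targeted in the paper are exactly where language-model restoration works best, because the missing text is highly predictable from its context.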


Richard Jean So (苏真) of the Department of English at McGill University, Canada, has published a new book with Columbia University Press, Redlining Culture: A Data History of Racial Inequality and Postwar Fiction, which uses digital methods to examine canon formation in postwar American fiction and the racial inequality within it:


The canon of postwar American fiction has changed over the past few decades to include far more writers of color. It would appear that we are making progress—recovering marginalized voices and including those who were for far too long ignored. However, is this celebratory narrative borne out in the data?

Richard Jean So draws on big data, literary history, and close readings to offer an unprecedented analysis of racial inequality in American publishing that reveals the persistence of an extreme bias toward white authors. In fact, a defining feature of the publishing industry is its vast whiteness, which has denied nonwhite authors, especially black writers, the coveted resources of publishing, reviews, prizes, and sales, with profound effects on the language, form, and content of the postwar novel. Rather than seeing the postwar period as the era of multiculturalism, So argues that we should understand it as the invention of a new form of racial inequality—one that continues to shape the arts and literature today.

Interweaving data analysis of large-scale patterns with a consideration of Toni Morrison’s career as an editor at Random House and readings of individual works by Octavia Butler, Henry Dumas, Amy Tan, and others, So develops a form of criticism that brings together qualitative and quantitative approaches to the study of literature. A vital and provocative work for American literary studies, critical race studies, and the digital humanities, Redlining Culture shows the importance of data and computational methods for understanding and challenging racial inequality.


Richard Jean So is assistant professor of English and cultural analytics at McGill University. He is the author of Transpacific Community: America, China, and the Rise and Fall of a Cultural Network (Columbia, 2016).


The Shakespeare and Company Project, in collaboration with the journals Journal of Cultural Analytics and Modernism/modernity, has issued a call for papers on Shakespeare and Company. Submissions must be based on the Shakespeare and Company Project's archives and data, and accepted articles will appear in one of the two journals:



David Bamman, professor at the School of Information, University of California, Berkeley, has released a BERT (Bidirectional Encoder Representations from Transformers) model for Latin. Trained on 642.7 million words of Latin text, the model was evaluated on POS tagging, word sense disambiguation, predicting emendations, and finding contextual nearest neighbors:
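One of the reported experiments retrieves a word's nearest neighbors in context, i.e. the token occurrences whose contextual embeddings are most similar. Assuming the embeddings have already been extracted from the model, the lookup itself is plain cosine similarity; the four-dimensional vectors and Latin labels below are invented stand-ins (a real BERT model produces 768-dimensional vectors, one per token occurrence):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbors(query, vectors, labels, k=3):
    """Return the k labels whose vectors are most cosine-similar to the query."""
    ranked = sorted(zip(labels, vectors),
                    key=lambda lv: cosine(query, lv[1]), reverse=True)
    return [label for label, _ in ranked[:k]]

# Stand-in "contextual embeddings" for three Latin word occurrences.
labels  = ["amor (love)", "bellum (war)", "amicitia (friendship)"]
vectors = [[1.0, 0.9, 0.1, 0.0],
           [0.0, 0.1, 1.0, 0.9],
           [0.8, 1.0, 0.1, 0.0]]
query = [1.0, 1.0, 0.0, 0.0]   # an occurrence semantically close to "amor"
print(nearest_neighbors(query, vectors, labels, k=2))
```

Because the vectors are contextual, the same Latin word in different sentences gets different neighbors, which is what makes this useful for studying sense variation.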



Andrew Piper, professor in the Department of Languages, Literatures, and Cultures at McGill University, Canada, has published a new book, Can We Be Wrong? The Problem of Textual Evidence in a Time of Data, which explores the relationship between data and textual interpretation, with particular attention to the problem of "generalization":



Andrew Piper is Professor in the Department of Languages, Literatures, and Cultures at McGill University. He directs .txtLAB, a laboratory for cultural analytics at McGill, and is editor of the Journal of Cultural Analytics.

His work focuses on applying the tools and techniques of data science to the study of literature and culture, with a particular emphasis on questions of cultural equality. He has on-going projects that address questions of cultural capital, academic publishing and power, and the visibility of knowledge.


The Carnegie Mellon University Libraries in the United States have released an index database of 489 digital humanities conferences held between 1960 and 2020, searchable by author, work, and conference:



Dan Sinykin, professor in the Department of English at Emory University, has released the syllabus for his course "Data Science with Text," an introduction to practical text-mining methods:



Ben Schmidt, professor in the Department of History at New York University, has released the course "History of Big Data," which traces the pre-computer origins of "big data" from a historian's perspective, offering a different angle on the digital humanities:



David Mimno, professor in the Department of Information Science at Cornell University, has built jsLDA, an online topic-modeling project that helps people without a technical background understand, learn, and explore topic modeling:
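jsLDA runs latent Dirichlet allocation in the browser. The standard inference algorithm behind such tools, collapsed Gibbs sampling, fits in a short Python sketch; the toy corpus and hyperparameter values below are invented for illustration:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns the per-topic word counts, from which top words can be read off."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})
    n_tw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_dt = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_t = [0] * n_topics                                # topic totals
    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            n_tw[t][w] += 1; n_dt[d][t] += 1; n_t[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the token's current assignment...
                n_tw[t][w] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
                # ...compute the conditional distribution over topics...
                weights = [(n_dt[d][k] + alpha) * (n_tw[k][w] + beta)
                           / (n_t[k] + V * beta) for k in range(n_topics)]
                # ...and resample.
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_tw[t][w] += 1; n_dt[d][t] += 1; n_t[t] += 1
    return n_tw

def top_words(n_tw, topic, k=3):
    return [w for w, _ in sorted(n_tw[topic].items(), key=lambda x: -x[1])[:k]]

# Two toy "themes": cooking and astronomy (invented corpus).
docs = [s.split() for s in [
    "flour sugar butter flour sugar",
    "butter sugar flour butter",
    "star planet orbit star planet",
    "orbit star planet orbit",
]]
n_tw = lda_gibbs(docs, n_topics=2)
for t in range(2):
    print(t, top_words(n_tw, t))
```

On a corpus this cleanly separated, the two topics typically split into cooking words and astronomy words; jsLDA performs the same kind of sampling interactively, which is what makes it a good teaching tool.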