The Productive Scholar: Tools for Text Analysis in the Humanities

Text Analysis with NLTK Cheatsheet

Topic: Tools for Text Analysis in the Humanities
Speaker: Ben Johnston

Time: Thursday, April 3, 12:00 PM – 1:00 PM
Location: New Media Center, 130 Lewis Library, First Floor


A sequel to last semester’s ‘Tools for Text Analysis in the Humanities’, this session will give participants a brief yet hands-on introduction to NLTK, the Natural Language Toolkit. This extension to the popular Python programming language is geared specifically toward computational work with written human language data. In this introduction, we will use tools from this library to tokenize a corpus into sentences, n-grams, and words, create word frequency lists, view concordances, and do part-of-speech tagging. In doing so, this session will also serve as a very gentle introduction to the Python programming language. Absolutely no experience with Python or with programming is expected or required.
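For readers who want a feel for these operations before the session, here is a minimal sketch in plain Python using only the standard library. It is not the presenter's code and it does not use NLTK itself. The sample text, the regular expressions, and the `concordance` helper are assumptions made for illustration. NLTK offers more robust counterparts (for example `nltk.sent_tokenize`, `nltk.word_tokenize`, `nltk.util.ngrams`, `nltk.FreqDist`, and `nltk.Text.concordance`), and its `nltk.pos_tag` handles part-of-speech tagging, which is left out here because it relies on a trained model.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for this illustration.
text = ("The whale surfaced near the ship. The crew watched the whale. "
        "Then the whale dived again.")

# Sentences: a naive split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Words: lowercase runs of letters and apostrophes (a crude tokenizer).
words = re.findall(r"[a-z']+", text.lower())

# N-grams: pair each word with its successor to form bigrams.
bigrams = list(zip(words, words[1:]))

# Word frequency list: count how often each word occurs.
freq = Counter(words)


def concordance(tokens, target, window=3):
    """Return each occurrence of `target` with `window` words of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


print(sentences)
print(bigrams[:3])
print(freq.most_common(2))
for line in concordance(words, "whale"):
    print(line)
```

The regular expressions here will mishandle abbreviations, hyphenated words, and punctuation inside quotations, which is one reason to reach for a dedicated toolkit on a real corpus.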

SESSION RECAP: Presenter Ben Johnston began by framing the session, which focused on the Natural Language Toolkit (NLTK) and Python. He emphasized that it is impossible to actually learn Python in an hour, and that those who have developed a sincere enthusiasm for the applications of digital tools they have become familiar with should engage in ‘knowledge sharing’ with peers and others. Knowledge sharing requires knowledge, but not at the expert level. Digital humanists should be encouraged to share knowledge even while they themselves are still learning (and you will likely never stop learning). Doing so reinforces learning and helps build community, both important aspects of gaining competency in the digital humanities. Here’s an excerpt from Ben’s introduction:

“We have an hour, and it’s obvious I’m not going to teach anyone Python in an hour, and I’m not going to teach you NLTK in an hour, there’s just no way. Also, I don’t think you should think of Python as something that you learn, it’s something you pick up when you need it and [then] accomplish the tasks that you need. My own knowledge of Python is actually very weak [this is Ben being modest; keep reading]. But I feel that we need people to step forward and lead workshops on things that they are enthusiastic about, and I’m extremely enthusiastic about what I’ve learned about NLTK just in the last few months, and about Python. I’ve used Python for years, but I’m really a web-developer-PHP kind of person, so Python–I’m not an expert. If any of you are experts in Python please feel free to correct me, but also please feel free to teach this workshop next time it’s offered. We have a lot to cover here today, so we’re only going to touch the tip of the iceberg. There’s so much that can be done with NLTK. I keep saying NLTK; it stands for Natural Language Toolkit. It’s a toolkit, it’s a module that’s installed into the programming language Python so that you can do things with natural language. You can do processing on natural language, [but] in an hour we’re only going to see a very small portion. It’s kind of like if we were standing in the middle of Home Depot and I held up a wrench and said, ‘This is a tool.’ And we’re surrounded by tools, right? We’re only going to see a little bit, so that’s a big disclaimer.” To hear and see more of Ben’s presentation, please watch the video!

Ben Johnston is manager of OIT’s Humanities Resource Center in East Pyne. Since 2005, Ben has worked with Princeton educators, students, and researchers across the Humanities and Social Sciences to facilitate the use of digital assets, technology tools, databases, and digital video in teaching and research. Ben is also an active member of the Princeton Digital Humanities Initiative.

Presentation co-sponsored with the Digital Humanities Initiative at Princeton (DHI).