Reading Gender in Children’s Literature Mathematically: An Award-winning Thesis

For her senior thesis, AnneMarie Caballero ’23 went through more than a thousand children’s books published during the 19th century and analyzed the pattern of topics in relation to the gender of protagonists. Titled “Gendered Topics: Boyhood and Girlhood in a Century of (Cotsen) Children’s Literature,” her project won Princeton’s Center for Digital Humanities (CDH) Senior Thesis Prize of 2023.

We are delighted that Caballero answered our interview questions about her research. An overview of her thesis, which gives us an opportunity to examine Cotsen’s collection from a new perspective, follows the interview. Caballero’s thesis will be made available via the repository of Princeton’s undergraduate senior theses and retrievable from the library online catalog in Fall 2023.



AnneMarie Caballero ’23, winner of the Center for Digital Humanities Senior Thesis Prize for her project titled “Gendered Topics: Boyhood and Girlhood in a Century of (Cotsen) Children’s Literature.” (photo courtesy of AnneMarie Caballero)

Hi AnneMarie, please tell us about yourself. What would you like our readers to know about you?

I recently graduated from Princeton’s computer science department, although I also focused on English in my coursework. In college, I was a part of our literary magazine, Nassau Literary Review, the Model United Nations team, and I worked for the computer science department as a grader. In my free time, I love to read, cook, and play volleyball (although I’m bad at the last one).

You have written a fascinating senior thesis, which applies computational literary analysis to a corpus of 19th-century children’s literature and delineates large patterns of gender and space representations in them. How did the idea of this project come to you?

The project sprang out of a desire to look at literary history with a more large-scale lens than I previously had. My junior research had examined the role of female authors in the early British novel through semantic vectorization but had a limited scope (seven authors, 35 novels). I hoped with my senior thesis to work with a larger dataset that could more comprehensively address the questions I wanted to ask.

Children’s literature was a good fit for my goal. I initially chose it because I wanted a dataset that I had requisite domain knowledge in, and also because I was hoping to work with Professor William Gleason, whose class on children’s literature I had taken. Further, this focus facilitated working with a larger dataset because children’s literature is defined by its audience. As attitudes towards children change, we see the emergence of the genre (within the English language) in the 1740s, after which works are consistently published for children. This choice of genre ensured that there would be a plethora of works available across the course of the nineteenth century.

My research question focused on the gendered nature of the nineteenth-century literary market, specifically its tendency to treat girls and boys as separate consumers. This question originated because multiple literary scholars that I had read discussed this phenomenon. I wanted to see if it could be detected by topic modeling, and, if so, what are the different topics discussed by girls’ vs. boys’ books?

Tell us about your research process. How long did it take from beginning to end? What part would you describe was the most challenging? Rewarding? Enjoyable?

I started thinking about my senior thesis and its direction the summer before my senior year. I spent the fall proposing a topic, doing initial research, and beginning the creation of my dataset. However, because of my lighter course load in the spring, much of the work was done then. I’m unsure exactly how long it took, but easily in the hundreds of hours.

The most challenging part was defining the methodology for my research question. I wanted it to be flexible enough to be explored through qualitative analysis, but for these observations to be supported by concrete quantitative metrics. Similar research I was looking at focused more heavily on the qualitative, so finding that balance was a struggle.

The most rewarding part was probably when I found out that 112 of my 125 topics were considered statistically significant by gender. It was the last step in the research process before writing my results, and it was incredibly validating to see that my hypothesis (that books written for boys vs. for girls fundamentally featured different topics) was so strongly supported by the dataset and methods.

The most enjoyable part was showing my friends the results, and getting to discuss with them the interesting gender differences that cropped up. There was a lot of joy in getting to share my research with the people who had been there for me throughout the process.

How has this project facilitated your professional growth? Aside from gaining valuable insights into 19th-century children’s literature, what else have you harvested from the project in terms of skills, experience, and understandings?

I can’t emphasize enough how much the project shaped my view of research. One of the major ways that the project facilitated my professional growth was the experience creating the dataset. While I had curated a very small dataset for my junior year research, working on this project entailed a months-long curation process that consistently caused me to question myself and my decisions. In my conclusion, I included some advice for first-time curators, such as documenting your decision process, which allows other users of the data to understand sources of bias.

I also feel much more rooted in the digital humanities as a whole. One portion of my thesis was writing a short history of computational literary analysis, and to a lesser extent, the digital humanities. Familiarizing myself with the debates around the field helped me avoid some of the shortcomings that critics have pointed out about digital humanities research, while also taking advantage of benefits like the ability to look at whole eras of literary history quantitatively.

Further, beyond project specifics, executing such a significant independent project offered so many lessons. I learned about how to scope projects properly, partially because I was certainly too ambitious at the start—I wanted to answer three research questions and only got to one. I had to become comfortable reaching out to anyone who might have the appropriate domain knowledge to answer my questions, resulting in a lengthy acknowledgments section. I learned more about data science and statistical methods, which I had not focused on as much in my coursework.

The impressive corpus of digital texts you have curated may benefit future researchers conducting digital humanities studies. Are there any sample questions you can think of the corpus may help address?

Very much so! As mentioned, I had other research questions I was hoping to get to, but, as the thesis was already 159 pages (before the appendix and references), I ended up cutting them due to scope. The question I was really hoping to explore, but didn’t get to, was about the value of children in the nineteenth century. In her book, Pricing the Priceless Child, Viviana Zelizer explores how, beginning in the nineteenth century, the child gains in sentimental value as their financial value decreases. I was hoping to explore that trend in literature, by linking it to literary trends like the cult of childhood, and examining how much the books in the dataset use a diction of sentiment/emotion vs. a diction of utility.

Beyond that question, which I explored fairly in depth but did not get to apply to the dataset, there are endless new avenues for questions. Especially in the curation process, I regularly stumbled across questions and topics that the dataset could significantly address from discussions of colonialism to the fairy tale subgenre.

Any other aspect of the project I have not asked about and you’d like to share with our readers?

I would be remiss not to talk about how critical a role the Cotsen Children’s Library played in the project. After I decided to look at children’s literature, I needed to find a collection of works that would suit my purpose. I explore this more in my thesis, but no existing corpus met the project requirements. Professor Gleason showed me the nineteenth-century catalogue of the Cotsen Children’s Library as a starting point, and that fundamentally shaped my project.

Out of the 1020 works included in my final dataset, 416 were directly from the catalogue and the other 604 were almost all added because of the collection—works by authors in the collection or that I found while searching for works in the collection. While curation could be exhausting (it required searching the 6000+ works from the catalogue in the HathiTrust Digital Library search bar), it also was an amazing introduction to the variety of children’s literature in the nineteenth century. I often found myself down research rabbit holes, or even at times, just being surprised by the books. In one of the books, Other Stories by E. H. Knatchbull-Hugessen, I read its very long dedication to his armchair. It felt like uncovering a secret history, although one that was often troubling, especially with its treatment of non-European cultures, race and ethnicity, and colonialism.

A one-page dedication to the author’s soothing, non-judgmental comfy chair. In Other Stories, by E.H. Knatchbull-Hugessen ; with illustrations by Ernest Griset. London: George Routledge and Sons, 1880. (Cotsen 2646)

Moreover, when I was feeling particularly tired in the final weeks of my thesis, I stopped by the library, and ended up talking with the staff about my thesis. That memory was hugely encouraging as I finished my thesis and is still one of my favorite memories from my senior year.

Your work won this year’s Senior Thesis Prize from the Center for Digital Humanities. Big congratulations! What is your future career plan like?

Next year, I’m working on the Atlas product, a database-as-a-service, for MongoDB, a tech company in New York. While I loved my research and was lucky enough to be accepted to Cambridge’s MPhil in the digital humanities, I ultimately wanted to take time off from school. I had a really wonderful time interning with the company last summer, and I wanted to experience working full-time for a tech company, especially as I decide if I want to go into tech long-term or explore one of my other interests. I definitely see myself returning to the digital humanities, or more generally to a job at the intersection of tech and culture.

Lastly, since you (distantly) read over a thousand children’s books to conduct your research, please tell us about your childhood reading. Did you have any favorite books or reading material? Any people or places you associate with your early reading?

I actually very recently reread several favorite children’s books! One of my all-time favorites that I think really holds up is Tamora Pierce, particularly her Wild Magic and Circle of Magic series. Her female protagonists are better-written than most of the ones I find in adult literature. There are so, so many other series I love (Little House on the Prairie, Shannon Hale’s books—which I mentioned briefly in the thesis, Nancy Drew, Cornelia Funke’s Igraine the Brave, etc.), but Tamora Pierce’s books are the ones I go back to the most.

For people, obviously my parents and my siblings played a huge part in my reading. Also, my librarians: my elementary school librarian even gave me my school’s copy of Pride and Prejudice when I left because I was the only one who ever checked it out. Oddly enough, the place I associate with early reading is the upstairs hallway in my house. There’s a bookshelf there, and I remember sitting there on the beige carpet for hours, reading a book, and when I was done, just picking another one off the shelf.

Reading Children’s Literature, Fast and Slow

Partly due to the relative scarcity of children’s literature corpora, Caballero’s project is a rare computational literary analysis (CLA) applied to children’s texts. In the field of digital humanities (DH), “corpus” refers to a digital collection of texts. Having curated a corpus of 19th-century English-language children’s literature herself, Caballero applies the method of topic modeling to tease out the statistical pattern of topics in relation to the gender of protagonists. The strength of Caballero’s outstanding research lies in multiple areas. First, she is not afraid of engaging with controversies about DH and with thorny challenges of children’s literature studies. Second, she makes an impressive contribution to DH by publishing a large corpus of digitized children’s literature, which will benefit future researchers. Third, by firmly grounding her statistical revelations in the concerns and findings of traditional literary criticism, the thesis carefully balances quantitative and qualitative methods, reaching nuanced conclusions that are both supported by large-scale analysis and informed by close reading of canonical works.

Chapter 1 reviews the history of CLA, which over time has succeeded in applying increasingly sophisticated computational tools such as natural language processing to literary studies, processing texts on a scale previously impossible for the solitary researcher. Caballero visits debates around the field of DH and examines, with a fine-tooth comb, critiques made by some of its harshest detractors. These debates would shape the design and process of her project.

Responding to flaws that have been raised about DH scholarship, in Chapter 2 Caballero defines the scope of her data with transparency, meticulously documents how the dataset has been constructed, and makes it readily available through the HathiTrust Digital Library collection system. Caballero used A Catalogue of the Cotsen Children’s Library: The Nineteenth Century (Princeton University Library, 2019) as a guide for building the digital corpus. The catalog, consisting of two tomes that stack up to four inches high, describes over 6300 titles published during the 19th century and collected by the late donor, Lloyd E. Cotsen ’50, Charter Trustee Emeritus. With what I can only imagine to be mighty Princetonian tenacity, Caballero has gone through all of them, selecting English-language titles that meet her criteria of narrative texts for children.

Caballero used A Catalogue of the Cotsen Children’s Library: The Nineteenth Century (Princeton University Library, 2019) as a guide for locating texts in HathiTrust Digital Library and building the Cotsen Children’s Literature Dataset.

The curation process forced Caballero to wrestle with some of the fundamental questions that are beguilingly simple but fraught with rule-defying exceptions: How do you define literature? (e.g., should primers, ABC books, stories in verse, etc. be included?) How do you define children’s literature? (e.g., is the presence of a child protagonist a necessary and sufficient criterion? Must children’s books be written with a young audience in mind? What about folk tales and fables, genres that were not produced only for children, but have morphed into classical children’s literature?) By sharing her challenging decision-making process, Caballero hopes to keep future users of her dataset fully informed of the limitations of the corpus.

Caballero was able to locate 416 titles of English-language children’s literature from the catalog that are available in full text in HathiTrust Digital Library (HTDL), a major repository of digital content from research libraries. By conducting author and keyword searches, she added another 604 titles to the Cotsen Children’s Literature (CCL) Dataset. Authors that appear most frequently in the dataset include Horatio Alger (1832-1899), Mary Martha Sherwood (1775-1851), Laura Elizabeth Howe Richards (1850-1943), Mrs. Molesworth (1839-1921), Oliver Optic (1822-1897), Louisa May Alcott (1832-1888), and A. L. O. E. (pseudonym of Charlotte Maria Tucker, 1821-1893) (68).

To prepare the dataset for computational analysis, Caballero then ran the 1000+ works through the BookNLP pipeline for nearly twenty-four hours of intensive analysis. An open-source natural language processing tool, the BookNLP pipeline tracks “all of the characters appearing within a work, the number of times they’re mentioned, the names and pronouns by which they are mentioned” (91), and other entities it is capable of recognizing and tagging at scale.

Chapter 3 describes the computational analysis of the texts in terms of gender and topics. First, based on the annotations generated by the BookNLP pipeline, Caballero determined that 613 titles met the mathematical threshold for having a central protagonist (92-95), all but eleven of them having an identifiable gender, which is treated as a proxy for the gender of the intended audience. However, Caballero is quick to point out that the intended audience does not equate to the actual audience: whereas boys tend to read books with male protagonists, girls tend to cross gender boundaries and read about boys and girls (89).
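To make the idea of a mention-based threshold concrete, here is a minimal sketch of how a central protagonist might be picked out from per-character mention counts. The rule, the function name `find_protagonist`, the `min_share` cutoff, and the sample counts are all hypothetical illustrations; the actual threshold Caballero used is defined in her thesis (92-95) and is not reproduced here.

```python
from collections import Counter

def find_protagonist(mention_counts, min_share=0.3):
    """Return the most-mentioned character if it clears the cutoff, else None.

    mention_counts: mapping of character name -> number of mentions.
    min_share: hypothetical cutoff, NOT the threshold used in the thesis.
    """
    total = sum(mention_counts.values())
    if total == 0:
        return None
    # Pick the single most-mentioned character and its count.
    name, count = Counter(mention_counts).most_common(1)[0]
    # Accept the character only if it dominates the mentions enough.
    return name if count / total >= min_share else None

# Made-up counts for illustration only (not drawn from the CCL dataset).
single_lead = {"Hero": 60, "Sibling": 10, "Parent": 30}
ensemble = {"A": 20, "B": 20, "C": 20, "D": 20, "E": 20}

print(find_protagonist(single_lead))  # Hero
print(find_protagonist(ensemble))     # None
```

Under a rule of this kind, a work with an ensemble cast yields no protagonist, which is one way a corpus of over a thousand works could shrink to a smaller subset with a central character.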

The male-to-female ratio of protagonists in this subset approaches 1.9:1–389 titles with male protagonists versus 206 with female protagonists (99)–an uncanny number that echoes the findings of studies about gender imbalance with other bodies of literature. For example, McCabe et al. (2011) analyzed gender representation in 5,618 children’s books published throughout the 20th century in the United States. Through manual coding, her team of five scholars found a male-to-female ratio of 1.9:1 in title characters, and that of 1.6:1 among central characters. Data also suggested that male protagonists receive more mentions than female ones in the CCL works (101), again a pattern that is consistent with existing scholarship. Underwood et al. (2018) traced 104,000 works of English-language fiction spread over three centuries, from 1703 to 2009. Similarly using BookNLP and relying on HathiTrust Digital Library, they calculated the proportion of words used in describing female characters, and found a steady decline from the 19th century through the early 1960s.
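For readers who want to check the arithmetic, the ratio quoted above can be reproduced from the two counts given in the text. The variable names are illustrative only.

```python
# Counts of titles by protagonist gender, as reported in the text (99).
male_titles = 389
female_titles = 206

# Male-to-female ratio of protagonists.
ratio = male_titles / female_titles
print(f"{ratio:.2f}:1")  # 1.89:1, i.e. roughly 1.9:1

# Share of gendered-protagonist titles that have a male protagonist.
male_share = male_titles / (male_titles + female_titles)
print(f"{male_share:.1%}")  # 65.4%
```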

Next, Caballero conducted topic modeling, which yielded 125 clusters of co-occurring words, from each of which a topic or a theme may emerge. Among them, 112 were found to be gendered: 64 topics appeared more often in stories with male protagonists, and 48 more often in those with female protagonists, suggesting that boys’ topics are 33% more varied than girls’. The remaining 13 topics were gender-neutral. Caballero presents both macro statistical revelations and in-depth analysis of selected topics.
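The topic counts in this paragraph fit together as follows. This is only a check of the figures quoted above, with illustrative variable names; it does not reproduce the topic modeling or the significance testing.

```python
# Topic counts as reported in the text.
total_topics = 125
boys_topics = 64    # more frequent with male protagonists
girls_topics = 48   # more frequent with female protagonists

gendered = boys_topics + girls_topics   # 112
neutral = total_topics - gendered       # 13

# How many more boys' topics there are, relative to girls' topics.
excess = (boys_topics - girls_topics) / girls_topics
print(gendered, neutral)                 # 112 13
print(f"{excess:.0%}")                   # 33%
print(f"{gendered / total_topics:.1%}")  # 89.6% of topics are gendered
```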

I thought it would be interesting to pick titles from the Cotsen catalog of the 19th century and test to what extent an individual work reflects the large patterns detected by machine. To build up the suspense further, of the five titles I selected from the catalog (largely based on the interest level of the illustrations highlighted in the tomes), only one is in the CCL dataset; the other four are not available in HathiTrust and thus have not been “read” mathematically by computer programs.

Topic: Violence/Combat

Word cloud produced by running a statistical natural language processing toolset called MALLET. The topic of this cluster of words is labeled as Violence/Combat. Image courtesy of AnneMarie Caballero.

It should come as no surprise that the topic of violence or combat is found more often in works with male protagonists. One of the Cotsen titles I selected, The Little Deserter; Or, Holiday Sports; An Amusing Tale Dedicated to All Good Boys, epitomizes the strong connection between the topic and an intended boy audience–from the unequivocal dedication to boy readers in its subtitle, to illustrations that portray boys playing soldiers with menacing-looking toys and props.

The Little Deserter; Or, Holiday Sports; An Amusing Tale Dedicated to All Good Boys. Edinburgh: Oliver and Boyd, [1807 or 1808]. (Cotsen 7108)

If you find the scene of execution–in a book published during the Napoleonic Wars–offensively violent for 21st-century sensibilities, you are justified in feeling so. Here is a spoiler that may be offered as a small solace: Julius, the boy protagonist who has been blindfolded and received the death penalty, bounces back in no time and calls dibs on playing the captain in tomorrow’s game.

Miss Johnston’s name was inscribed on the front pastedown and, as shown here, the front free endpaper of the copy of The Little Deserter. (Cotsen 7108)

What makes the Cotsen copy of The Little Deserter remarkable is that it carries evidence of girls’ expansive reading interests. A “Miss Elizabeth Johnston,” likely a former owner/reader of the book, inscribed her name twice in it. As quoted in Caballero (87), Kimberley Reynolds attributes the appeal of boys’ books for girl readers to the fact that books deemed suitable for young ladies were frequently unexciting tales for cultivating good behavior.

Topic: Island (Stranding)

The topic of this cluster of words is labeled as Island (Stranding). Word cloud courtesy of AnneMarie Caballero.

“Island” is a frequent word found in multiple topics that range from Island (Stranding) to Boats (Stranding, Shipwreck) and Nation (Nationalism), linking to the traditional boys’ adventure story as well as the historical subject of colonial conquest (139-140). Both literary criticism and Caballero’s computational study confirm a gendered landscape in children’s literature, which contrasts the feminine home with the masculine away, and excludes boy characters from the home and girl characters from the away (147). The dichotomy, however, is complicated by what is referred to as “adventurous domesticity” (142), whereby protagonists attempt to reconstruct domesticity while stranded.

“Adventures of Robinson Crusoe” in The Robinson Crusoe Picture Book. George Routledge and Sons, [not after 1873]. (Cotsen 152150)

The masculine pursuit of a “home away from home” is well reflected in “Adventures of Robinson Crusoe,” a short, illustrated verse story based on Daniel Defoe’s novel. The titular castaway builds a thatched and fenced house that he can call his little home, makes furniture and clothes (he is pictured putting the finishing touches on an umbrella), keeps company with his dog and cats, and domesticates a young goat and a parrot he has found on the island.

Johnny Headstrong’s Trip to Coney Island. McLoughlin Bros., 1882. (Cotsen 540)

Of the five titles I selected, Johnny Headstrong’s Trip to Coney Island is the only one that is included in the CCL dataset, thanks to digitized copies contributed by member university libraries to HathiTrust. In this verse story, Johnny’s family takes a trip to Coney Island beach. Even though his sister Sue has also joined the outing, she is rarely mentioned. At one point she is described as sitting on the wooden horse of a carousel “like a lady,” i.e., riding sidesaddle. Johnny is the protagonist and remains the center of attention (and chaos) by getting into a nonstop series of scrapes, departing the island with bandages over his nose and cheek at the end of the day. Johnny Headstrong’s adventure seems to be a quintessential bad boy’s tale, having packed into its 20 pages so many of the boys’ topics on Caballero’s list (115-6): Movement, Body of Water, Boats, Injury, Donkeys, Animal, to name a few.

“Painful Emotion (Death)” is high on the list of girls’ topics (117, 121). Word cloud courtesy of AnneMarie Caballero.

The pattern of gendered topics does not mean that a boys’ tale is devoid of all topics that are statistically prominent in girls’ stories, and vice versa. Caballero conducts a case study with two of the best-known girls’ adventure stories, Alice’s Adventures in Wonderland and Alice Through the Looking-Glass, and finds that a good portion (a quarter and nearly a half, respectively) of the top 20 topics in each work are boys’ topics, such as Injury, Water, and Animal (143). Likewise, both Robinson Crusoe and Johnny Headstrong have their emotionally vulnerable moments, described in words that are frequently found in the topic Painful Emotion (Death), which is statistically a girls’ topic. A forlorn Robinson Crusoe sometimes grows “very sad,” “cries aloud,” weeps “like any child,” thinks of his father and mother, and prays to God “with many tears” (“Adventures” 1-2). Johnny’s adventure begins as he tumbles overboard, is fished out of the water, and cries as he is sent to the engine-room to dry beside the furnace fire. In one episode, he slips away and loses his Papa and sister Sue, then begins “to cry,” “big tears” running down his chubby face (Johnny Headstrong’s). In another, he accidentally strikes a boy hard with a ball and, thinking the boy would surely die, sobs with “childish fright.” Towards the end he falls off a swing, and adults have to soothe “his sobs and groans.” It is tempting to ask if there might be any correlation between how broadly appealing a children’s story is and how inclusive the work is in encompassing gendered topics.

Girls, Domesticity, and Travel

“Confidential People” in The May Blossom, or, the Princess and Her People. Illustrations by H.H. Emmerson; verses by Marion M. Wingrave. London: Frederick Warne and Co., [1881]. (Cotsen 9380)

Caballero recognized that illustration is an essential element of the Cotsen collection, because of Lloyd Cotsen’s “passion for illustrated works that help children become independent readers” (Immel, quoted in Caballero 54). Her computational analysis handles only texts that have been OCRed; thus the machine has missed about half of the fun of perusing the Cotsen collection! The May Blossom, a collection of short verses, presents an intriguing case of what the machine manages not to miss in spite of its singular focus on text. In one of the entries, “Confidential People,” a first-person “I” shares a secret with a second-person “you”–there is no textual description of the setting of the story. In the accompanying illustration, the two characters are seated in an intimate, ornate space, surrounded by objects that closely match the most frequent keywords of the topic Domestic Space, one that is found more often in works with female protagonists.

The most frequent words in the cluster for the topic Domestic Space include room, table, chair, and sit (161). Word cloud courtesy of AnneMarie Caballero.

The narrator confides that she plans to marry “a sweet little beau” and to take a honeymoon by “a coach and six horses” to Lilliput Land next year. It is striking that a story hinting at an exciting trip to a faraway fantasy land is visually represented by two girls confined in a stuffy room, a setting that is mentioned nowhere in the text. Travel (Driving, Carriage) turns out to be one of the gender-neutral topics (114), meaning it is as likely to appear in stories with a male protagonist as in those with a female one. How does it square that domestic space is tied to girls’ stories, yet travel is not? In “Confidential People,” the girl’s narrative about travel is firmly grounded within approved gender roles. The endearingly amusing verse both adores the young narrator’s childish innocence and models an aspiration for marriage that leads to the fulfillment of traditional womanhood.

“Johnny’s First Motor Ride” in Little Tots Holiday Book: With Numerous Coloured Plates and Other Illustrations. London; New York : Frederick Warne & Co. (Cotsen 30357)

A close reading of another story that fits the topic of Travel (Driving, Carriage) invites us to consider what it means to be the central character of a story, and circles back to gender imbalance in terms of the count of female versus male protagonists as well as the proportion of words devoted to each gender. In “Johnny’s First Motor Ride,” the titular character receives a real little motor-car from his father and soon learns how to “control it with ease.” With a bonneted baby deposited in the passenger seat–possibly against the baby’s will, judging from his/her facial expression–he goes out for a ride[1]. After trying to abruptly avert a collision with Margery’s goat-chaise, however, Johnny finds his car stuck. It is at this point, where the story has run two-thirds of the way towards the end, that attention swerves to Margery. Described by her father as “a real clever little woman,” Margery is sympathetic, helpful, and resourceful. Even though it is not her fault that Johnny’s car malfunctioned, she does not abandon the stranded novice motorist. She sets to work “to harness the damaged motor-car to the goat-chaise,” which is pulled by “Nanny” the goat, and coaxes the hoofed “engine” to tow the modern vehicle home. “That was a real triumph for Nanny!”–the story concludes with the exclamation.

Whether by its title “Johnny’s First Motor Ride” or by the amount of text devoted to Johnny, the protagonist of the story is apparently a boy–to the machine’s mathematical “mind” at least. I can’t predict how a human reader would interpret who the central character of the story is. Margery clearly shines through what she has done, even though she doesn’t receive the most mentions in the story. That the credit for the successful rescue act should go to the goat implicitly imparts a self-effacing virtue expected from females. The girl character is sidelined even in a story where she is not the damsel in distress but the heroine who saves the day.

Caballero’s computational analysis of a sizeable body of 19th-century English-language children’s literature reveals a gendered landscape, tethering female characters to domesticity and the inward, freeing male characters to the wider world away from home, enlarging the gap between endorsed masculine and feminine behavior, and bundling implicit morals and values for each gender. She brings rich complexity into her project by tracing how a large-scale analysis of over a thousand works agrees with or departs from findings based on traditional literary criticism of a limited number of canonical works. It is a testament to the robustness of her study that, for the five titles from the Cotsen collection–only one of which is available in the dataset–the patterns still hold true and help us gain fresh insights into these dusty volumes.


[1] Johnny and his father break all the modern government regulations for driving an automobile. There is no publication date on the book. Let’s assume the story was written soon after the invention of the first automobile in 1886, before the driver’s license began to be implemented by the end of the 19th century, or before the age restriction was first introduced in Pennsylvania in 1909.


Caballero, AnneMarie. Gendered Topics: Boyhood and Girlhood in a Century of (Cotsen) Children’s Literature, Princeton University, 2023.

McCabe, Janice, et al. “Gender in Twentieth-Century Children’s Books: Patterns of Disparity in Titles and Central Characters.” Gender & Society, vol. 25, no. 2, 2011, pp. 197-226.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, vol. 3, no. 2, 2018.


Datasets Curated by AnneMarie Caballero (the exact scope of each dataset is detailed on page 65 of her thesis):

The Cotsen Children’s Literature (CCL) Dataset (1021 items as of July 2023) [URL]

  • The subset of titles as found in A Catalogue of the Cotsen Children’s Library: The Nineteenth Century (416 items) [URL]
  • The subset of titles as found outside A Catalogue of the Cotsen Children’s Library: The Nineteenth Century (605 items) [URL]

Titles as found in Cotsen’s catalog but excluded from the CCL Dataset (123 items) [URL]

GitHub repository of the cleaning script for the CCL Dataset [URL]

(Edited by Andrea Immel)

Opium, Gospel, and the Conquest of the Babel

Quiz: What is the language of the text on this page?

A) German
B) Latin
C) Chinese
D) Turkish

When Professor Matthew Grenby, a children’s literature scholar from the University of Newcastle, first brought our attention to the above book held in the Rare Books collection of the Princeton University Library, I was stumped by the scripts, which did not immediately spell into anything sounding familiar. The book surfaced when Professor Grenby was conducting a search for primers, catechisms and other kinds of ephemeral, educational texts for children, but we were not sure what we were looking at. The thin volume, printed on frail yellowed paper, carries no title page or publication statement. With the help of fading cursive text inscribed on its brown wrapper, a skeleton bibliographical record, several reference books, and multiple experts knowledgeable in different domain areas, I gained a clearer idea of who published the book when, where, why, and even how, but have just as many questions unanswered. With this post I am inviting more informants who can enrich our understanding of the unusual book.

A mid-nineteenth century primer of the Ningbo dialect


The answer to the quiz that opens this post is C–Chinese. To be precise, the text is in the dialect of Ningbo, China, transcribed in Roman letterforms. To explain the origin of the book, we may well begin with the First Opium War (1839-42), which broke out between the Qing dynasty and the British Empire after the former launched a campaign to ban the opium trade and the latter dispatched its Royal Navy to protect British opium dealers’ interests in China. Prior to the war, China was almost entirely closed to foreigners, confining foreign trade to an enclave on the outskirts of Guangzhou (Canton) and outlawing missionary activities due to conflicts–perceived as irreconcilable by both Pope Clement XI and the Kangxi 康熙 Emperor–between Christianity and the Chinese tradition of ancestral worship (Pruden 2009, 22). After the war, a defeated and waning Qing government signed a series of treaties first with Britain, then with other Western powers, and granted foreigners enormous freedom in port cities. Ningbo, a long-coveted city sitting at the midpoint of the Chinese coastline, was among the first five “treaty ports” opened by the “Treaty of Nanking” in 1842. By 1845, the Board of Foreign Missions of the Presbyterian Church in the U.S.A. (PCUSA) had set up mission stations in Canton and Ningbo (PCUSA 1879, 61).

Missionaries immediately devoted themselves to learning the dialects of local residents. As William Alexander Parsons Martin (丁韙良, 1827-1916), an Indiana-born Presbyterian minister stationed in Ningbo in the 1850s, described in his memoir, “The spoken language of China is divided into a babel of dialects” (1897, 53). Back in the 19th century, the rate of literacy was abysmal except among the elites, and Mandarin was not standardized or universally acquired as it is today. To spread the Gospel, missionaries first had to master the tongue of local heathens.

And master it they did. There were no written materials to guide the learning of the spoken language, which cannot be accurately represented by written Chinese characters on a one-to-one basis. Martin (1897) initially relied on a local teacher’s “object-lessons” and “mimicry” (52) to pick up words. To speed up learning, he hired a second tutor, enabling himself to study from morning to evening (53). Soon “the mists began to rise” and learning the Ningbo dialect became “a fascinating pastime” for him (53). By January 1851, Martin had come up with a phonetic system based on the “German, or rather Continental, vowels” (54) to record the sounds of the Ningbo dialect. He taught the notation system to his teacher Lu, who within a week composed a letter in the dialect written in Roman letterforms, which Martin easily decoded as an invitation to his family for lunch (55).

Martin and other missionaries worked on finalizing the notation system over the years, publishing somewhere between 42 and 52 titles of books written in the Romanized Ningbo dialect (Su 2018, 390). A bibliography of publications by Protestant missionaries in China shows that the contents of the works range from sacred scriptures to gospel harmony, theology, catechisms, prayers, hymns, the secular topics of geography and mathematics, as well as materials that teach the Ningbo dialect (Wylie 1867, 328-330). Among them are an eight-leaf spelling book (undated) and a 92-page Primer of the Ningpo Colloquial Dialect 宁波土话初学 (1857). Though credited to the Rev. Robert Henry Cobbold (哥伯播义, 1819-1893) and the Rev. Henry Van Vleck Rankin (兰金, 1825-1863) respectively, both titles were the fruition of collaborative, successive development by multiple missionaries (Wylie 1867, 183 & 194). Princeton’s copy was gifted by John Luther Rankin (1869-1959)–Henry V. Rankin’s nephew–in 1904. A member of the Princeton class of 1892, J.L. Rankin was a regular donor to the Princeton University Library.

Lacking access to Rankin’s 1857 edition of the primer, I have tentatively dated Princeton’s copy to between 1851 and 1857, hypothesizing that our leaner volume might be one of the earlier works that enriched the fuller version. On its wrapper is inscribed “A primer, the first Romanized colloquial Chinese book, prepared chiefly by Mr. W. Martin’s teacher. Cut in blocks.” The handwritten account differs noticeably from the Presbyterian bibliography, which neglects to mention the contribution of local Chinese to such dialect books. The aforementioned teacher Lu might well be the said chief author. By Martin’s (1897) account, Lu was converted and later became a preacher (even his mother, once a devout Buddhist, became a zealous Christian) (67-68).

Men and God’s creation

The volume, which I have tentatively entitled A Primer of the Ningbo Dialect, begins with monosyllabic words, then moves on to two-syllable ones, then three- and four-syllable words, followed by eight chapters titled “One Man,” “One Tree,” “One Ship,” and so on. I enlisted the help of two native Ningbo speakers, Fengming Lu and Lidong Xiang, with decoding the chapter “One Man.” With some patience and an English-Ningbo dialect dictionary compiled by the Rev. William T. Morrison (睦礼逊惠理, ca. 1835-1869), “the mists began to rise” as they did for Martin, and we were able to comprehend the gist of the chapter (the length of one paragraph).

A primer of the Ningbo dialect, likely printed by the Chinese and American Sacred Classic Book Establishment 华花圣经书房 in Ningbo, between 1851 and 1857. Chapter “One Man,” annotated with close semantic or phonetic equivalents of the Chinese characters, pp. 19-20. (Princeton Rare Books 2014-0211Q)

The chapter “One Man” read aloud in the Ningbo dialect by Lidong Xiang.

One Man

This is an educated man in a foreign place, with a book tucked under his arm. People differ by level and type. There are good people, bad people, smart ones, and stupid ones. In some places people’s skin is white; in some places their skin is yellowish; in some places people’s skin is truly inky dark. These differences are caused by water and soil in all kinds of places. Even though people may be black or white, with varying looks, they are all from one ancestor. Where do you think the very first ancestor came from? He was made by the True God. That’s why, regardless of where you are from, we are all of the same family, just like brothers. All these people have a body and a soul. The body will die, and the soul does not. The body is like a house, and the soul is like the people who live in it. When the body dies, it is like the house collapses, but the people in there are still intact and alright. They just change to a different dwelling, like how people move. If a person has done good, their soul will surely go to a very nice place and enjoy the blessing. If a person has done bad, their soul will then go to a most horrible place, where they suffer punishments. These are the definitive rules.

The text introduces the diverse skin colors of human beings, tracing them all to God’s creation but building the idea upon “ancestors,” which were both familiar to and revered by the Chinese. Similes further help to get across the concept of the body versus the soul. The idea of heaven and hell is alluded to but not articulated, and one cannot help noticing the resemblance between the overly simplified “definitive rules” of Christianity and Buddhist karma. The plain style of the vernacular language differs drastically from earlier translations of Bible texts, which were in classical Chinese (Tam 2020, 45) and accessible to an elite audience at best.

God is referred to as “Tsing-Jing” (真神, or True God) in the passage. The correct Chinese translation for “God” was once a fiercely contested topic among Roman Catholic missions, but Protestants in Ningbo apparently decided not to waste their energy in splitting hairs and instead embraced a range of terms (Martin 1897, 34-35). In W.T. Morrison’s (1876) An Anglo-Chinese Vocabulary of the Ningpo Dialect, the entry “God” lists multiple Chinese translations that include Jing-ming’ 神明 (Deity), T’in-cü’ 天主 (Lord of Heaven), and Zông-ti’ 上帝 (Supreme Ruler) (202).

Who read the primer?

Primers of the Ningbo dialect served two functions. First, they provided missionaries with a much-needed tool to speed up language acquisition. Second, missionaries soon realized that locals could be taught the Romanization system and read religious and secular text written in their own dialect. Martin proudly reported the advantage of his system in instruction for the young and the old, and especially among the lower-class men and women, who were otherwise denied schooling:

The Chinese saw with astonishment their children taught to read in a few days, instead of spending years in painful toil, as they must with the native characters. Old women of three-score and ten, and illiterate servants and laborers, on their conversion, found by this means their eyes opened to read in their own tongue… the wonderful works of God. (Martin 1897, 55-56)

Photo taken in Yuyao 余姚 by an unnamed Chinese photographer. In Old Ningpo: Bulletin of Ningpo Station, Central China Mission of the Presbyterian Church in the U.S.A. Ningpo, Chekiang, China, 1919. Courtesy of the Presbyterian Historical Society.

The usage of such dialect books was documented in a rare photo taken in Yuyao, Zhejiang (a town neighboring Ningbo) in the 1910s. A class of Chinese women students, ranging from youthful-looking ones to elderly ones, were learning the Bible and hymns from Romanized Ningbo dialect books in a courtyard. Both Martin’s bragging and the photo reflect how missionaries vigorously reached out to the poor and the marginalized population to propagate the gospel, bringing, in tandem, educational opportunities to girls and poor women (see, for example, Nimick 2020).

The teacher, dressed in a dark full-length gown and standing in front of the class, appeared to be a Westerner, most likely a missionary’s wife. She probably had two assistants, one standing next to a large pictorial chart, and the other seated by a pump organ, no doubt ready to play the next hymn melody. The students had brought along babies and toddlers, who were held in the women’s arms, or sleeping in cribs, or simply hanging out by the desks. Though it is hard to imagine how the class managed on rainy days, not to mention the burden of moving heavy furniture in and out frequently, the family-friendliness of the setting deserves serious respect from 21st-century institutions.

How was the book printed?

It so happened that Princeton University Library has a hymn book in the Ningbo dialect, one of five other titles in Romanized scripts compiled by the Rev. Henry V. Rankin and donated by his son, Henry William Rankin (1851-1937). Born in China, the son graduated from Princeton University with a degree in literature in 1873 (Princeton University 1914, 26), and gifted his father’s works to the library in 1921. A comparison between the hymn book, dated 1860, and the previous primer, helps us imagine how missionaries explored ways to print Roman letterforms in China.

Tsaen-me S. = 赞美诗 [A hymn book], compiled by Henry Van Vleck Rankin. Nying-Po, 1860. (Princeton Rare Books N-003610)

The Ningbo Mission set up a printing press, the Chinese and American Sacred Classic Book Establishment, in 1845, the same year the station was established, and published books in both Chinese characters and Roman letterforms (Su 2018, 328). Both traditional Chinese woodblock printing and movable type printing were used to publish Ningbo dialect books. Martin (1897) described how the first primer was printed:

Causing a set of letters to be engraved on separate pieces of horn, I taught a young man to use them in stamping the pages of a primer. This was roughly engraved on wood, in the Chinese manner, called “block-printing”… (55)

The Book Establishment published three Romanized books in 1851, apparently with movable type, because in its annual report, the Ningbo Station reported that the hired Chinese typesetters were slow due to unfamiliarity with the alphabet (Su 2018, 381). By 1853 the press had produced 25 titles in Romanized scripts, and over half of them were printed from woodblocks, because there were not enough printing presses (389).

The mixed use of printing technology explains why the primer and the hymn book appear different. Both were printed on folded leaves, and the hymn book in particular was stitched in thread in the elegant pre-modern Chinese style, but their resemblance stops there. The text of the primer is set in letters so large and bold that one short paragraph takes up four full pages of an oversized book! Noticeably, it follows the convention of woodblock-printed books, outlining a text box on each side of the folded leaves. (Illustrations placed at the top of the chapter titles occasionally bleed over the edge of the text box or overlap with the letters, suggesting that the illustrated pages were printed in a two-step process.) In contrast, the hymn book, with text in a small, neat movable type font and without boxes, could well blend in with any Western book printed in the Roman alphabet.

Conquerors of the Babel

W.A.P. Martin (1827-1916)

W.A.P. Martin

W.A.P. Martin served in Ningbo for ten years from 1850 to 1860 and went on to have an exciting life and an illustrious career in China. Guangxu 光绪, the penultimate emperor of the Qing dynasty, appointed Martin as the inaugural president of the Imperial University of Peking in 1898 as part of his reform to revive the dying empire. The short-lived reform failed, but China’s first modern national university survived and became the predecessor of the renowned Peking University. Martin was the translator-adapter of the first English version of the “Ballad of Mulan” (Dong 2011, 93), publishing it under the title “Mulan, The Maiden Chief” as an appendix to The Chinese: Their Education, Philosophy, and Letters (1881), a collection of his observations of formal and informal education in China.

The earliest English translation of the “Ballad of Mulan,” by W.A.P. Martin, published in a bilingual form in The Chinese: Their Education, Philosophy, and Letters (pp. 316-319). London: Trübner, 1881. Page image from Google Books. (See also Princeton University Library DS709 .M384 1881)

The Rankin Family

Henry Van Vleck Rankin

Henry V. Rankin (1825-1863) came from a New Jersey family intimately involved with the Presbyterian Church and Princeton University. He was born to William Rankin, Sr. (1785-1869), who ran a prosperous hat-manufacturing business in Newark, New Jersey (Wheeler 1907, 267). He graduated from Princeton Theological Seminary and, in 1848, was appointed by the Board of Foreign Missions as a missionary to China, where he compiled no fewer than eight titles of Romanized Ningbo dialect books. Rankin’s health deteriorated in the early 1860s, but, partly due to the ongoing American Civil War, he chose not to go home (Rankin 1895, 296). He visited Teng-chow 登州, Shandong Province for rehabilitation and died there on July 2, 1863 (297). Susan Rankin Janvier (1858-1943), one of his daughters, married into another big missionary family–to Caesar Augustus Rodney Janvier (1861-1928) of the Princeton Class of 1880–and with her husband served as a missionary in India (Presbyterian 2015; Ingram 1929, 513).

Henry’s eldest brother, William Rankin, Jr. (1810-1912), lived to be a centenarian and served as the treasurer of the Board of Foreign Missions of the PCUSA for 37 years. In that role he wrote an account, preserved in the University Archives, of the arrival of the first three Japanese students who enrolled at Princeton circa 1871 (“Meet” 2014). One of the sons of William Rankin, Jr. was Dr. Walter Mead Rankin (1858-1947), who received his Master of Science degree from Princeton in 1884 and taught biology at Princeton from 1889 until his retirement in 1923. He founded the YMCA Town Club in Princeton in 1908, the predecessor of the still-thriving Princeton Family YMCA in town (“Dr. Walter” 1947; Princeton YMCA 2017).

Another of Henry’s brothers was the Rev. Edward Erastus Rankin (1820-1889), whose son presented the primer to Princeton. Edward served in Presbyterian churches in New Jersey and New York for decades. He contributed a 12-page narrative detailing Henry’s life in Memorials of Foreign Missionaries of the Presbyterian Church U. S. A. (1895)–edited by the Treasurer of the Board, i.e., their elder brother William.

Unintended Fruit

The foreign missions’ endeavor in China lasted over a century after the Opium War, yet it failed to convert the “Middle Kingdom” into a Christian country. The work the missions put into spreading the gospel, however, had a profound impact on education, culture, medicine, and science communication in Chinese society. Thanks to the inspiration of the Romanization systems built by Martin and his fellow missionaries for spoken dialects, China standardized the pinyin scheme that represents the pronunciation of Mandarin Chinese, greatly easing the learning of written characters since the second half of the 20th century. The fact that people can bang on the same QWERTY keyboard to make Chinese characters pop up on a computer screen can be traced all the way back to the conquerors of the Babel hacking Chinese dialects with the Latin alphabet.

Children’s Epoch 儿童时代, October 1958, pages 14-15. (Cotsen 35519) A double spread in the popular children’s magazine Children’s Epoch uses visuals to teach the alphabet from A to Z. Each animal is “contorted” into the shape of the initial letter of its pinyin pronunciation. (Only two of the animals share the same initial letter in pinyin and in English.) Both pinyin and “Bopomofo,” an alternative system of phonetic symbols introduced during Republican China, are printed, but pinyin would prevail in mainland China, and Bopomofo remains in use in Taiwan.

Henry V. Rankin’s books form part of Princeton’s collection of late-Qing-dynasty dialect books from various regions, scriptures in ethnic minority languages of Southwest China (Heijdra 1998), as well as global mission publications in indigenous languages (such as this bilingual Luther’s Small Catechism in Munsee and Swedish prepared in Pennsylvania during the mid-17th century), preserving clues to the pronunciation and vocabulary of spoken languages before the advent of audio-recording technology.


[A Primer of the Ningbo Dialect]. 1851-1857. Edited by W. A. P. Martin, Henry Van Vleck Rankin. [Ningpo: Chinese and American Sacred Classic Book Establishment]

Dong, Lan. 2011. Mulan’s Legend and Legacy in China and the United States. Philadelphia: Temple University Press.

“Dr. Walter Mead Rankin, Princeton Ex-Professor.” 1947. New York Herald Tribune, May 26, 1947, 18.

Heijdra, Martin. 1998. “Who were the Laka? A Survey of Scriptures in the Minority Languages of Southwest China.” The East Asian Library Journal 8 (1): 150-198.

Ingram, George H. 1929. “Princeton in the Nation’s Service, VII: A Man Who made a Distinct Impress in His Every Work.” Princeton Alumni Weekly, February 1, 513-514.

Lowrie, Walter M. 1854. Memoirs of the Rev. Walter M. Lowrie, Missionary to China. Philadelphia: Presbyterian Board of Publication.

Martin, W. A. P. 1881. The Chinese: Their Education, Philosophy, and Letters. London: Trübner.

Martin, W. A. P. 1897. A Cycle of Cathay, or China, South and North. 2nd ed. Edinburgh: Oliphant Anderson and Ferrier.

“Meet Mudd’s Jarrett M. Drake.” 2014. Mudd Manuscript Library Blog. Last modified June 18, 2014.

Morrison, William T. 1876. An Anglo-Chinese Vocabulary of the Ningpo Dialect. Revised and enlarged ed. Shanghai: American Presbyterian Mission Press.

Nimick, Thomas G. 2020. “Missionary Women’s Outreach to Poor Women in China; Origins of the Industrial Class Strategy.” The Journal of Presbyterian History 98 (1): 4-17.

PCUSA. 1879. The Forty-Second Annual Report of the Board of Foreign Missions of the Presbyterian Church in the United States of America. New York: Mission House.

Presbyterian Historical Society. 2015. “Guide to the Janvier Family Papers.” Last modified [November 18, 2015]

Princeton University. 1914. Directory of Living Alumni of Princeton University. Princeton, New Jersey: The University.

Princeton YMCA. 2017. “History.” Last modified [December 2017].

Pruden, George B. 2009. “American Protestant Missions in Nineteenth-Century China.” Education about Asia 14 (2): 22-29.

Rankin, Edward Erastus. 1895. “Rev. Henry V. Rankin.” In Memorials of Foreign Missionaries of the Presbyterian Church U. S. A., edited by William Rankin, 288-299. Philadelphia: Presbyterian Board of Publication and Sabbath-Schoolwork.

Su, Jing 苏精. 2018. 铸以代刻: 十九世纪中文印刷变局. 北京: 中华书局.

Tam, Gina Anne. 2020. “A Chinese Language: Fangyan before the Twentieth Century.” In Dialect and Nationalism in China, 1860–1960, 35-71. Cambridge: Cambridge University Press.

Wheeler, W. O. 1907. The Ogden Family in America, Elizabethtown Branch, and their English Ancestry: John Ogden, the Pilgrim, and His Descendants, 1640-1906.

Wylie, Alexander. 1867. Memorials of Protestant Missionaries to the Chinese. Shanghae: American Presbyterian Mission Press.

Appendix: Bibliography of Ningbo dialect books compiled by Henry V. Rankin

Ah-lah Kyiu-cu Yiae-su-go Sing-yi Tsiao shu. S-du pao-lo-go Shü-sing = 阿拉救主耶稣的新遗诏书: 使徒保罗的书信 [New Testament: Epistles of Paul]. Nying-Po, 1859. 61 pages. (Rare Books N-003608)

C’ih Yiai gyih = 出埃及记 [Exodus]. Ningpo. 72 pages.

Foh-ing tsaen di = 福音赞帝 [Synopsis gospel harmony]. Ningpo. 6 pages.

Gyiu-yi tsiao-shü. Tsʻông-shü kyi = 旧遗诏书: 创世记 [Old Testament: The Book of Genesis]. Nying-po, 1859. 72 pages. (Rare Books N-003594)

Meh-z Loh = 默示录 [Book of Revelation]. Nying-Po? 1859? 34 pages. (Rare Books N-003609)

Nying-po t’u-wô ts’u-‘ôh = 宁波土话初学 [A primer of the Ningbo colloquial dialect]. Ningpo, 1857. 92 pages.

S-du Pao-lo sia-peh Lo-mo Nying-go Shü-sing = 使徒保罗写给罗马人的书信 [The Epistle to the Romans]. Nying-Po? 1859? 30 pages. (Rare Books N-003607)

Sing jah djün shü = 新约传书 [New Testament] by William Armstrong Russell and H. V. Rankin. Revised edition. Ningpo. 260 leaves.

Tsaen-me S. = 赞美诗 [Hymn book]. Nying-Po, 1860. 156 pages. (Rare Books N-003610)

Tsʻông-shü kyi = 创世记 [The Book of Genesis]. Ningpo. 86 pages.


Professor Fengming Lu of the Australian National University and Lidong Xiang, PhD candidate at Rutgers University–both natives of Ningbo, China–worked on puzzling out the 160-year-old Romanized Ningbo dialect. Thanks to Lidong Xiang for reading aloud a chapter of the primer in her native tongue for us.

Professor Thomas G. Nimick, Ph.D. *93, of the United States Military Academy, West Point generously shared his research on women missionaries’ work in Ningbo and brought my attention to the photo of women’s class using Romanized Ningbo dialect books in the 1910s. Natalie Shilstut, Director of Programs and Services at the Presbyterian Historical Society, kindly made the photo available for this blog post.

Professor Ling Yiming of the Academy of Rare Book Preservation, Tianjin Normal University, and Dr. Eric White, the Scheide Librarian, helped with discerning the printing technology of the primer.

Stephen Ferguson, a rare books expert, deftly tracked down the provenance of the primer after I hit more dead ends than I care to admit.

Last but not least, thanks to Professor Matthew Grenby of the University of Newcastle for discovering the mysterious scripts in Princeton’s collection!

(Edited by Stephen Ferguson)