For her senior thesis, AnneMarie Caballero ’23 went through more than a thousand children’s books published during the 19th century and analyzed the pattern of topics in relation to the gender of protagonists. Titled “Gendered Topics: Boyhood and Girlhood in a Century of (Cotsen) Children’s Literature,” her project won Princeton’s Center for Digital Humanities (CDH) Senior Thesis Prize of 2023.
We are delighted that Caballero answered our interview questions about her research. An overview of her thesis, which gives us an opportunity to examine Cotsen’s collection from a new perspective, follows the interview. Caballero’s thesis will be made available via the repository of Princeton’s undergraduate senior theses and retrievable from the library online catalog in Fall 2023.
Hi AnneMarie, please tell us about yourself. What would you like our readers to know about you?
I recently graduated from Princeton’s computer science department, although I also focused on English in my coursework. In college, I was a part of our literary magazine, Nassau Literary Review, the Model United Nations team, and I worked for the computer science department as a grader. In my free time, I love to read, cook, and play volleyball (although I’m bad at the last one).
You have written a fascinating senior thesis, which applies computational literary analysis to a corpus of 19th-century children’s literature and delineates large patterns of gender and space representations in them. How did the idea of this project come to you?
The project sprang out of a desire to look at literary history with a more large-scale lens than I previously had. My junior research had examined the role of female authors in the early British novel through semantic vectorization but had a limited scope (seven authors, 35 novels). I hoped with my senior thesis to work with a larger dataset that could more comprehensively address the questions I wanted to ask.
Children’s literature was a good fit for my goal. I initially chose it because I wanted a dataset that I had requisite domain knowledge in, and also because I was hoping to work with Professor William Gleason, whose class on children’s literature I had taken. Further, this focus facilitated working with a larger dataset because children’s literature is defined by its audience. As attitudes towards children change, we see the emergence of the genre (within the English language) in the 1740s, after which works are consistently published for children. This choice of genre ensured that there would be a plethora of works available across the course of the nineteenth century.
My research question focused on the gendered nature of the nineteenth-century literary market, specifically its tendency to treat girls and boys as separate consumers. This question originated because multiple literary scholars that I had read discussed this phenomenon. I wanted to see if it could be detected by topic modeling, and, if so, what are the different topics discussed by girls’ vs. boys’ books?
Tell us about your research process. How long did it take from beginning to end? What part would you describe was the most challenging? Rewarding? Enjoyable?
I started thinking about my senior thesis and its direction the summer before my senior year. I spent the fall proposing a topic, doing initial research, and beginning the creation of my dataset. However, because of my lighter course load in the spring, much of the work was done then. I’m unsure exactly how long it took, but easily in the hundreds of hours.
The most challenging part was defining the methodology for my research question. I wanted it to be flexible enough to be explored through qualitative analysis, but for these observations to be supported by concrete quantitative metrics. Similar research I was looking at focused more heavily on the qualitative, so finding that balance was a struggle.
The most rewarding part was probably when I found out that 112 of my 125 topics were considered statistically significant by gender. It was the last step in the research process before writing my results, and it was incredibly validating to see that my hypothesis (that books written for boys vs. for girls fundamentally featured different topics) was so strongly supported by the dataset and methods.
The most enjoyable part was showing my friends the results, and getting to discuss with them the interesting gender differences that cropped up. There was a lot of joy in getting to share my research with the people who had been there for me throughout the process.
How has this project facilitated your professional growth? Aside from gaining valuable insights into 19th-century children’s literature, what else have you harvested from the project in terms of skills, experience, and understandings?
I can’t emphasize enough how much the project shaped my view of research. One of the major ways that the project facilitated my professional growth was the experience creating the dataset. While I had curated a very small dataset for my junior year research, working on this project entailed a months-long curation process that consistently caused me to question myself and my decisions. In my conclusion, I included some advice for first-time curators, such as documenting your decision process, which allows other users of the data to understand sources of bias.
I also feel much more rooted in the digital humanities as a whole. One portion of my thesis was writing a short history of computational literary analysis, and to a lesser extent, the digital humanities. Familiarizing myself with the debates around the field helped me avoid some of the shortcomings that critics have pointed out about digital humanities research, while also taking advantage of benefits like the ability to look at whole eras of literary history quantitatively.
Further, beyond project specifics, executing such a significant independent project offered so many lessons. I learned about how to scope projects properly, partially because I was certainly too ambitious at the start—I wanted to answer three research questions and only got to one. I had to become comfortable reaching out to anyone who might have the appropriate domain knowledge to answer my questions, resulting in a lengthy acknowledgments section. I learned more about data science and statistical methods, which I had not focused on as much in my coursework.
The impressive corpus of digital texts you have curated may benefit future researchers conducting digital humanities studies. Are there any sample questions you can think of the corpus may help address?
Very much so! As mentioned, I had other research questions I was hoping to get to, but, as the thesis was already 159 pages (before the appendix and references), I ended up cutting them due to scope. The question I was really hoping to explore, but didn’t get to, was about the value of children in the nineteenth century. In her book, Pricing the Priceless Child, Viviana Zelizer explores how, beginning in the nineteenth century, the child gains in sentimental value as their financial value decreases. I was hoping to explore that trend in literature, by linking it to literary trends like the cult of childhood, and examining how much the books in the dataset use a diction of sentiment/emotion vs. a diction of utility.
Beyond that question, which I explored fairly in depth but did not get to apply to the dataset, there are endless new avenues for questions. Especially in the curation process, I regularly stumbled across questions and topics that the dataset could significantly address from discussions of colonialism to the fairy tale subgenre.
Any other aspect of the project I have not asked about and you’d like to share with our readers?
I would be remiss not to talk about how critical a role the Cotsen Children’s Library played in the project. After I decided to look at children’s literature, I needed to find a collection of works that would suit my purpose. I explore this more in my thesis, but no existing corpus met the project requirements. Professor Gleason showed me the nineteenth-century catalogue of the Cotsen Children’s Library as a starting point, and that fundamentally shaped my project.
Out of the 1020 works included in my final dataset, 416 were directly from the catalogue and the other 604 were almost all added because of the collection—works by authors in the collection or that I found while searching for works in the collection. While curation could be exhausting (it required searching the 6000+ works from the catalogue in the HathiTrust Digital Library search bar), it also was an amazing introduction to the variety of children’s literature in the nineteenth century. I often found myself down research rabbit holes, or even at times, just being surprised by the books. In one of the books, Other Stories by E. H. Knatchbull-Hugessen, I read its very long dedication to his armchair. It felt like uncovering a secret history, although one that was often troubling, especially with its treatment of non-European cultures, race and ethnicity, and colonialism.
Moreover, when I was feeling particularly tired in the final weeks of my thesis, I stopped by the library, and ended up talking with the staff about my thesis. That memory was hugely encouraging as I finished my thesis and is still one of my favorite memories from my senior year.
Your work won this year’s Senior Thesis Prize from the Center for Digital Humanities. Big congratulations! What is your future career plan like?
Next year, I’m working on the Atlas product, a database-as-a-service, for MongoDB, a tech company in New York. While I loved my research and was lucky enough to be accepted to Cambridge’s MPhil in the digital humanities, I ultimately wanted to take time off from school. I had a really wonderful time interning with the company last summer, and I wanted to experience working full-time for a tech company, especially as I decide if I want to go into tech long-term or explore one of my other interests. I definitely see myself returning to the digital humanities, or more generally to a job at the intersection of tech and culture.
Lastly, since you (distantly) read over a thousand children’s books to conduct your research, please tell us about your childhood reading. Did you have any favorite books or reading material? Any people or places you associate with your early reading?
I actually very recently reread several favorite children’s books! One of my all-time favorites that I think really holds up is Tamora Pierce, particularly her Wild Magic and Circle of Magic series. Her female protagonists are better-written than most of the ones I find in adult literature. There are so, so many other series I love (Little House on the Prairie, Shannon Hale’s books—which I mentioned briefly in the thesis, Nancy Drew, Cornelia Funke’s Igraine the Brave, etc.), but Tamora Pierce’s books are the ones I go back to the most.
For people, obviously my parents and my siblings played a huge part in my reading. Also, my librarians: my elementary school librarian even gave me my school’s copy of Pride and Prejudice when I left because I was the only one who ever checked it out. Oddly enough, the place I associate with early reading is the upstairs hallway in my house. There’s a bookshelf there, and I remember sitting there on the beige carpet for hours, reading a book, and when I was done, just picking another one off the shelf.
Reading Children’s Literature, Fast and Slow
Partly due to the relative scarcity of children’s literature corpora, Caballero’s project is a rare computational literary analysis (CLA) that is implemented upon children’s texts. In the field of digital humanities (DH), “corpus” refers to a digital collection of texts. Having curated a corpus of 19th-century English-language children’s literature herself, Caballero applies the method of topic modeling to tease out the statistical pattern of topics in relation to the gender of protagonists. The strength of Caballero’s outstanding research lies in multiple areas. First, she is not afraid of engaging in controversies about DH and in thorny challenges of children’s literature studies. Second, she makes an impressive contribution to DH by publishing a large corpus of digitized children’s literature, which will benefit future researchers. Third, by firmly grounding her statistical revelations in the concerns and findings of traditional literary criticism, the thesis carefully balances quantitative and qualitative methods, reaching nuanced conclusions that are both supported by large-scale analysis and informed by close reading of canons.
Chapter 1 reviews the history of CLA, which over time has succeeded in applying increasingly sophisticated computational tools such as natural language processing to literary studies, processing texts on a scale previously impossible for the solitary researcher. Caballero visits debates around the field of DH and examines, with a fine-tooth comb, the critiques made by its harshest detractors. These critiques would shape the design and process of her project.
Responding to flaws that have been raised about DH scholarship, in Chapter 2 Caballero defines the scope of her data with transparency, meticulously documents how the dataset has been constructed, and makes it readily available through the HathiTrust Digital Library collection system. Caballero used A Catalogue of the Cotsen Children’s Library: The Nineteenth Century (Princeton University Library, 2019) as a guide for building the digital corpus. The catalog, consisting of two tomes that stack up to four inches high, describes over 6300 titles published during the 19th century and collected by the late donor, Lloyd E. Cotsen, ’50 and Charter Trustee Emeritus. With what I can only imagine to be mighty Princetonian tenacity, Caballero has gone through all of them, selecting English-language titles that meet her criteria of narrative texts for children.
The curation process forced Caballero to wrestle with some of the fundamental questions that are beguilingly simple but fraught with rule-defying exceptions: How do you define literature? (e.g., should primers, ABC books, stories in verse, etc. be included?) How do you define children’s literature? (e.g., is the presence of a child protagonist a necessary and sufficient criterion? Must children’s books be written with a young audience in mind? What about folk tales and fables, genres that were not produced only for children, but have morphed into classical children’s literature?) By sharing her challenging decision-making process, Caballero hopes to keep future users of her dataset fully informed of the limitations of the corpus.
Caballero was able to locate 416 titles of English-language children’s literature from the catalog that are available in full text in HathiTrust Digital Library (HTDL), a major repository of digital content from research libraries. By conducting author and keyword searches, she added another 604 titles to the Cotsen Children’s Literature (CCL) Dataset. Authors that appear most frequently in the dataset include Horatio Alger (1832-1899), Mary Martha Sherwood (1775-1851), Laura Elizabeth Howe Richards (1850-1943), Mrs. Molesworth (1839-1921), Oliver Optic (1822-1897), Louisa May Alcott (1832-1888), and A. L. O. E. (pseudonym of Charlotte Maria Tucker, 1821-1893) (68).
To prepare the dataset for computational analysis, Caballero then ran the 1000+ works through the BookNLP pipeline for nearly twenty-four hours of intensive analysis. An open-source natural language processing tool, the BookNLP pipeline tracks “all of the characters appearing within a work, the number of times they’re mentioned, the names and pronouns by which they are mentioned” (91), and other entities it is capable of recognizing and tagging at scale.
Chapter 3 describes the computational analysis of the texts in terms of gender and topics. First, based on the annotations generated by the BookNLP pipeline, Caballero determined that 613 titles met the mathematical threshold for having a central protagonist (92-95), all but eleven of them having an identifiable gender, which is treated as a proxy for the gender of the intended audience. However, Caballero is quick to point out that the intended audience does not equate to the actual audience: whereas boys tend to read books with male protagonists, girls tend to cross gender boundaries and read about boys and girls (89).
The male-to-female ratio of protagonists in this subset approaches 1.9:1, with 389 titles featuring male protagonists versus 206 with female protagonists (99), an uncanny number that echoes the findings of studies of gender imbalance in other bodies of literature. For example, McCabe et al. (2011) analyzed gender representation in 5,618 children’s books published throughout the 20th century in the United States. Through manual coding, her team of five scholars found a male-to-female ratio of 1.9:1 among title characters, and of 1.6:1 among central characters. The data also suggest that male protagonists receive more mentions than female ones in the CCL works (101), again a pattern that is consistent with existing scholarship. Underwood et al. (2018) traced 104,000 works of English-language fiction spread over three centuries, from 1703 to 2009. Similarly using BookNLP and relying on HathiTrust Digital Library, they calculated the proportion of words used in describing female characters, and found a steady decline from the 19th century through the early 1960s.
Next, Caballero conducted topic modeling to sort the dataset into 125 clusters, each containing co-occurring words from which a topic or a theme may emerge. Among them, 112 were found to be gendered: 64 topics appeared more often in stories with male protagonists, and 48 more often in those with female protagonists, suggesting that boys’ topics are 33% more varied than girls’. The remaining 13 topics were gender-neutral. Caballero presents both macro statistical revelations and in-depth analysis of selected topics.
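The figures quoted in the last two paragraphs can be checked with a few lines of arithmetic. The Python sketch below is not taken from the thesis; the variable names are illustrative, and the counts are simply those reported in this post.

```python
from fractions import Fraction

# Protagonist counts as quoted in this post (thesis p. 99).
male_protagonists = 389
female_protagonists = 206

# Male-to-female ratio, expressed as "x : 1".
ratio = male_protagonists / female_protagonists
print(f"male-to-female protagonist ratio: {ratio:.2f}:1")

# Topic counts as quoted in this post.
total_topics = 125
boy_topics = 64
girl_topics = 48

gendered_topics = boy_topics + girl_topics
neutral_topics = total_topics - gendered_topics

# How many more topics lean toward boys' stories, relative to girls' stories.
extra_share = Fraction(boy_topics, girl_topics) - 1
print(f"gendered topics: {gendered_topics}, gender-neutral topics: {neutral_topics}")
print(f"boys' topics exceed girls' topics by {float(extra_share):.0%}")
```

The ratio comes out just under 1.9:1, and 64 is exactly one third more than 48, which is where the rounded figure of 33% comes from.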
I thought it would be interesting to pick titles from the Cotsen catalog of the 19th century and test to what extent an individual work reflects the large patterns detected by the machine. To build up the suspense further, of the five titles I selected from the catalog (largely based on the interest level of illustrations highlighted in the tomes), only one is in the CCL dataset; the other four are not available in HathiTrust and so have not been “read” mathematically by computer programs.
It should come as no surprise that the topic of violence or combat is found more often in works with male protagonists. One of the Cotsen titles I selected, The Little Deserter; Or, Holiday Sports; An Amusing Tale Dedicated to All Good Boys, epitomizes the strong connection between the topic and an intended boy audience–from the unequivocal dedication to boy readers in its subtitle, to illustrations that portray boys playing soldiers with menacing-looking toys and props. If you find the scene of execution–in a book published during the Napoleonic Wars–offensively violent by 21st-century sensibilities, you are justified in feeling so. Here is a spoiler that may be offered as a small solace: Julius, the boy protagonist who has been blindfolded and received the death penalty, bounces back in no time and puts dibs on playing the captain in tomorrow’s game.
What makes the Cotsen copy of The Little Deserter remarkable is that it carries evidence of girls’ expansive reading interests. A “Miss Elizabeth Johnston,” likely a former owner/reader of the book, inscribed her name twice in it. As quoted in Caballero (87), Kimberley Reynolds attributes the appeal of boys’ books for girl readers to the fact that books deemed suitable for young ladies were frequently unexciting tales for cultivating good behavior.
Topic: Island (Stranding)
“Island” is a frequent word found in multiple topics that range from Island (Stranding) to Boats (Stranding, Shipwreck) and Nation (Nationalism), linking to the traditional boys’ adventure story as well as the historical subject of colonial conquest (139-140). Both literary criticism and Caballero’s computational study confirm a gendered landscape in children’s literature, which contrasts the feminine home with the masculine away, and excludes boy characters from the home and girl characters from the away (147). The dichotomy, however, is complicated by what is referred to as “adventurous domesticity” (142), whereby protagonists attempt to reconstruct domesticity while stranded. The masculine pursuit of a “home away from home” is well reflected in “Adventures of Robinson Crusoe,” a short, illustrated verse story based on Daniel Defoe’s novel. The titular castaway builds a thatched and fenced house that he can call his little home, makes furniture and clothes (he is pictured as putting finishing touches to an umbrella), keeps company with his dog and cats, and domesticates a young goat and a parrot he has found on the island.
Of the five titles I selected, Johnny Headstrong’s Trip to Coney Island is the only one that is included in the CCL dataset, thanks to digitized copies contributed by member university libraries to HathiTrust. In this verse story, Johnny’s family takes a trip to Coney Island beach. Even though his sister Sue has also joined the outing, she is rarely mentioned. At one point she is described as sitting on the wooden horse of a carousel “like a lady,” i.e., riding sidesaddle. Johnny is the protagonist and remains the center of attention (and chaos) by getting into a nonstop series of scrapes, departing the island with bandages over his nose and cheek at the end of the day. Johnny Headstrong’s adventure seems to be a quintessential bad boy’s tale, having packed into its 20 pages so many of the boys’ topics on Caballero’s list (115-6): Movement, Body of Water, Boats, Injury, Donkeys, Animal, to name a few.
The pattern of gendered topics does not mean that a boys’ tale is devoid of all topics that are statistically prominent in girls’ stories, and vice versa. Caballero conducts a case study with two of the best-known girls’ adventure stories, Alice’s Adventures in Wonderland and Alice Through the Looking-Glass, and finds that a good portion (a quarter and nearly a half respectively) of the top 20 topics in each work are boys’ topics, such as Injury, Water, and Animal (143). Likewise, both Robinson Crusoe and Johnny Headstrong have their emotionally vulnerable moments, described in words that are frequently found in the topic Painful Emotion (Death), which is statistically a girls’ one. A forlorn Robinson Crusoe sometimes grows “very sad,” “cries aloud,” weeps “like any child,” thinks of his father and mother, and prays to God “with many tears” (“Adventures” 1-2). Johnny’s adventure begins as he tumbles overboard, is fished out of the water, and cries as he is sent to the engine-room to dry beside the furnace fire. In one episode, he slips away and loses his Papa and sister Sue, then begins “to cry,” “big tears” running down his chubby face (Johnny Headstrong’s). In another, he accidentally strikes a boy hard with a ball and, thinking the boy would surely die, sobs with “childish fright.” Towards the end he falls off a swing, and adults have to soothe “his sobs and groans.” It is tempting to ask if there might be any correlation between how broadly appealing a children’s story is and how inclusive the work is in encompassing gendered topics.
Girls, Domesticity, and Travel
Caballero recognized that illustration is an essential element of the Cotsen collection, because of Lloyd Cotsen’s “passion for illustrated works that help children become independent readers” (Immel, quoted in Caballero 54). Her computational analysis handles only texts that have been OCRed, thus the machine has missed about half of the fun of perusing the Cotsen collection! The May Blossom, a collection of short verses, presents an intriguing case of what the machine manages not to miss in spite of its singular focus on texts. In one of the entries, “Confidential People,” a first-person “I” shares a secret with a second-person “you”–there is no textual description of the setting of the story. In the accompanying illustration, the two characters are seated in an intimate, ornate space, surrounded by objects that well match the most frequent keywords of the topic Domestic Space, one that is found more often in works with female protagonists.
The narrator confides that she plans to marry “a sweet little beau” and to take a honeymoon by “a coach and six horses” to Lilliput Land next year. It is a striking contrast how a story that hints at an exciting trip to the faraway fantasy land is visually represented by two girls confined in a stuffy room, a setting that is mentioned nowhere in the text. Travel (Driving, Carriage) turns out to be one of the gender-neutral topics (114), meaning it is as likely to appear in stories with a male protagonist as a female one. How does it square that domestic space is tied to girls’ stories, yet travel is not? In “Confidential People,” the girl’s narrative about travel is firmly grounded within approved gender roles. The endearingly amusing verse both adores the young narrator’s childish innocence and models an aspiration for marriage that leads to the fulfillment of traditional womanhood.
A close reading of another story that fits the topic of Travel (Driving, Carriage) invites us to consider what it means to be the central character of a story, and circles back to gender imbalance in terms of the count of female versus male protagonists as well as the proportion of words devoted to each gender. In “Johnny’s First Motor Ride,” the titular character receives a real little motor-car from his father and soon learns how to “control it with ease.” With a bonneted baby deposited in the passenger seat–possibly against the baby’s will, judging from his/her facial expression–he goes out for a ride. After trying to abruptly avert a collision with Margery’s goat-chaise, however, Johnny finds his car stuck. It is at this point, where the story has run two-thirds of the way towards the end, that attention swerves to Margery. Described by her father as “a real clever little woman,” Margery is sympathetic, helpful, and resourceful. Even though it is not her fault that Johnny’s car malfunctioned, she does not abandon the stranded novice motorist. She sets to work “to harness the damaged motor-car to the goat-chaise,” which is pulled by “Nanny” the goat, and coaxes the hoofed “engine” to tow the modern vehicle home. “That was a real triumph for Nanny!”–the story concludes with the exclamation.
Whether by its title “Johnny’s First Motor Ride” or by the amount of text devoted to Johnny, the protagonist of the story is apparently a boy–to the machine’s mathematical “mind” at least. I can’t predict how a human reader would decide who the central character of the story is. Margery clearly shines with what she has done, even though she doesn’t receive the most mentions in the story. That the credit for the successful rescue act should go to the goat implicitly imparts a self-effacing virtue expected from females. The girl character is sidelined even in a story where she is not the damsel in distress but the heroine who saves the day.
Caballero’s computational analysis of a sizeable body of 19th-century English-language children’s literature reveals a gendered landscape, tethering female characters to domesticity and the inward, freeing male characters to the wider world away from home, enlarging the gap between endorsed masculine and feminine behavior, and bundling implicit morals and values for each gender. She brings rich complexity into her project by tracing how a large-scale analysis of over a thousand works agrees with or departs from findings based on traditional literary criticism of a limited number of canons. It is a testament to the robustness of her study that, for the five titles from the Cotsen collection–only one of which is available in the dataset–the patterns still hold true and help us gain fresh insights into these dusty volumes.
 Johnny and his father break all the modern government regulations for driving an automobile. There is no publication date on the book. Let’s assume the story was written soon after the invention of the first automobile in 1886, before the driver’s license began to be implemented by the end of the 19th century, or before the age restriction was first introduced in Pennsylvania in 1909.
Caballero, AnneMarie. Gendered Topics: Boyhood and Girlhood in a Century of (Cotsen) Children’s Literature, Princeton University, 2023.
McCabe, Janice, et al. “Gender in Twentieth-Century Children’s Books: Patterns of Disparity in Titles and Central Characters.” Gender & Society, vol. 25, no. 2, 2011, pp. 197-226.
Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, vol. 3, no. 2, 2018.
Datasets Curated by AnneMarie Caballero (the exact scope of each dataset is detailed on page 65 of her thesis):
The Cotsen Children’s Literature (CCL) Dataset (1021 items as of July 2023) [URL]
- The subset of titles as found in A Catalogue of the Cotsen Children’s Library: The Nineteenth Century (416 items) [URL]
- The subset of titles as found outside A Catalogue of the Cotsen Children’s Library: The Nineteenth Century (605 items) [URL]
Titles as found in Cotsen’s catalog but excluded from the CCL Dataset (123 items) [URL]
GitHub repository of the cleaning script for the CCL Dataset [URL]
(Edited by Andrea Immel)