Supporting All Children to Thrive: The Promise of a New Corpus Construction Protocol for Understudied Languages

Contributing author:  Kathleen Kupiec

This article was written as part of TalkTogether’s academic blog writing programme for early career researchers.

Based on: Nag, S., John, S., & Agrawal, A. (2024). NSP-SCD: A corpus construction protocol for child directed print in understudied languages. Behavior Research Methods. Advance online publication. https://doi.org/10.3758/s13428-024-02339-x

The Problem: Current Guidance and Tools are Limited to Certain Languages

Children sitting on world. Illustration by Kathleen Kupiec

Children’s language acquisition and development are immensely important, with far-reaching and wide-ranging benefits that include supporting their social skills, emotional life, academic success, and literacy development [1]. Considered a foundational skill, literacy is the bedrock for an individual’s future, with positive impacts on their educational achievement and quality of life. Unfortunately, without additional measures, “an estimated 300 million children and young people will still lack basic numeracy and literacy skills by 2030” [2, p. 20]. Though no child should “be denied the right . . . to use his or her own language” [3, Article 30], even if they belong to a linguistic minority, literacy rates tend to be lower in regions of understudied languages [4].

A child’s language and literacy development are greatly influenced by opportunity. Children’s books are an important real-world opportunity where children encounter language and its writing system. Such child-directed print is a particularly important opportunity because it offers children exposure to a greater variety and complexity of words and word types than child-directed speech [5, 6]. The individual words in books also provide a way to examine a language’s writing system.

One way to use children’s books as a tool for psycholinguistic research is by developing a print corpus, a collection of all the language used in a set of texts. Drawing on child-directed print corpora, researchers can determine patterns of vocabulary use, word lengths, sentence lengths, and spelling challenges across books; these measures are called corpus statistics. Informed by these statistics, researchers and teachers alike can better understand and support literacy development with more informed instruction. However, the bulk of research, and, thus, the current knowledge base and targeted guidance, such as for developing language and literacy assessments and interventions, has been limited to certain languages. Despite this guidance being laced with contextual and language-specific assumptions, languages with limited research currently have little choice but to adopt what is available. This is problematic because the guidance may not be representative of a language or applicable enough to create equally high-quality language and literacy assessments and interventions, which all children, not just those learning select languages, deserve.

A seemingly easy solution for understudied languages is to borrow and translate existing materials and tools from other cultures and languages. However, this comes with consequences. While research suggests that translators of children’s books do attempt to replace text and illustrations with alterations that reflect the target culture (e.g., [7]), translations still fall short at the language level. Not only do the meanings of individual words not always translate between cultures (e.g., emotions: [8]), but the psycholinguistic properties of translated child-directed print also differ from those in the original language [9]. To responsibly support children from understudied languages, tools informed directly by these languages are necessary.

Stickman climbing books image.  Illustration by Kathleen Kupiec

Another consideration with children’s books is that as the target age of books increases, so does their overall length and complexity. As such, stratifying corpora by book levels would help show how language-specific patterns change across these levels, producing what Nag, John, and Agrawal call ‘a developmental catalogue’ in their paper. This could help educators choose appropriate materials and determine best-fit interventions and assessments, which could enhance learning, enjoyment, motivation, and engagement by helping match the challenge of a text with the skill level of the student [10].


While child-directed print corpora offer immense and varied value, they are not yet available for most languages. Two reasons may be behind this:

  1. Though there are efforts to support an increase in the publishing of children’s literature in understudied languages, there are simply few, or even no, children’s reading books available in many understudied languages [4]. 
  2. Whereas software that can automatically parse and code text has been developed for some highly studied languages, such tools are not available for understudied languages. This means that even where print materials are becoming available, capitalising on them for corpora development might be held back by the time manual coding takes.

Under such constraints, one potential solution would be to reduce the workload by developing small-sized corpora.

The Concern: Accurately Representing a Language

The number of unique words in a text increases rapidly as the total size of a text increases before levelling off. This is because most words would have already been captured and will merely begin reappearing in later text. The number of unique words, or lexical diversity, can be represented as a type-token ratio (TTR). To calculate this ratio, the number of unique words in a text (i.e., ‘types’) is divided by the overall word count of that same text (i.e., ‘tokens’). For instance, the TTR of the first sentence of this paragraph is .85 and the TTR of the first and second sentences combined is .73, reflecting that as word count increases, the likelihood of words repeating increases, and the TTR decreases. The non-linear decrease in the TTR as the token count increases is represented by an empirical prediction termed the Herdan–Heaps law [11, 12]. The practical implication, and potential concern, is that small-sized corpora might not provide TTRs that accurately represent a language. This would limit their utility and could skew research findings so that a language is misunderstood and programmes and assessments are misinformed.
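As a rough illustration of the type-token ratio described above, the following minimal Python sketch computes it for a short text and for a longer text that extends it. The toy sentences, the function names, and the whitespace-based tokenisation are assumptions made for this example only; they are not taken from the study, and real corpora in any language would need more careful word segmentation.

```python
def tokenize(text: str) -> list[str]:
    """Split on whitespace and lowercase (a simplifying assumption for this toy text)."""
    return text.lower().split()


def type_token_ratio(tokens: list[str]) -> float:
    """Number of unique words (types) divided by the total word count (tokens)."""
    if not tokens:
        raise ValueError("cannot compute a TTR for an empty text")
    return len(set(tokens)) / len(tokens)


# Hypothetical toy text without punctuation; not drawn from the Kannada corpus.
short_text = "the cat sat on the mat"
longer_text = short_text + " and the dog sat on the cat"

short_ttr = type_token_ratio(tokenize(short_text))    # 5 types / 6 tokens
longer_ttr = type_token_ratio(tokenize(longer_text))  # 7 types / 13 tokens
print(f"{short_ttr:.2f} {longer_ttr:.2f}")  # prints "0.83 0.54"
```

The longer text repeats more words, so its ratio is lower. In the commonly cited form of the Herdan–Heaps law, the number of types grows roughly as K·N^β with 0 < β < 1 for N tokens, which is why the ratio of types to tokens keeps falling as a text grows.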

The Potential Solution: A New Protocol

If there were a way to limit the workload and use what resources are available, while still capturing and accurately representing the psycholinguistic features of a language, such an approach to corpus construction could be immensely valuable, particularly for understudied languages. This study explored the feasibility of a non-sequential sampling protocol for small corpus development (NSP-SCD). To examine the potential of this protocol, corpora developed with child-directed print materials in the understudied language of Kannada were used. 

Constructing a Sample with the NSP-SCD

A cross-corpora analysis was conducted comparing a larger corpus to one constructed through selective non-sequential sampling of material from the books in this corpus.

The larger corpus

The Promise Foundation Corpus of Child-Directed Print in Kannada was used as the larger corpus in this study. It contains 151,249 words from 24,375 sentences. The corpus was developed from the text of 402 books appropriate for 3–10-year-olds and included picture books, story collections, folktales, chapter books, non-fiction books, textbooks and translated works. Each book was also assigned to one of the following book levels: books appropriate for children 3–5 years old, 6–8 years old, or 9–10 years old.

The smaller corpus

The smaller corpus contains 17,431 words from 2,661 sentences. It was constructed from the 402 books that made up the larger corpus using the following selective non-sequential sampling protocol for small corpora development:

  • For books with fewer than 10 sentences, 1 sentence was randomly chosen.
  • For books with more than 10 sentences but fewer than 32 pages, every 10th sentence was included.
  • For books with more than 32 pages, a passage of 1500 words was randomly selected and then every 10th sentence from this was included.

To improve random selection, a random number generator determined which of the first three sentences to begin with before carrying out the every 10th sentence selection (i.e., sentences chosen were either position 1, 11, 21…, or 2, 12, 22…, or 3, 13, 23...). 
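The sampling rules above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `pick_passage` and `sample_sentences` are hypothetical, words are counted by splitting on whitespace, and the boundary cases the description leaves open (exactly 10 sentences, exactly 32 pages) are assigned to the "every 10th sentence" rule by assumption.

```python
import random


def pick_passage(sentences, passage_words, rng):
    """Return consecutive sentences from a random start totalling about `passage_words` words."""
    lengths = [len(s.split()) for s in sentences]
    # Keep only start positions that leave at least `passage_words` words before the book ends.
    remaining = sum(lengths)
    starts = []
    for i, n in enumerate(lengths):
        if remaining >= passage_words:
            starts.append(i)
        remaining -= n
    if not starts:
        return sentences  # the whole book is shorter than the passage size
    i = rng.choice(starts)
    passage, count = [], 0
    while count < passage_words:
        passage.append(sentences[i])
        count += lengths[i]
        i += 1
    return passage


def sample_sentences(sentences, pages, rng, passage_words=1500, step=10):
    """Select sentences from one book, given its sentences in reading order and its page count."""
    if not sentences:
        return []
    if len(sentences) < step:
        # Very short book: one randomly chosen sentence.
        return [rng.choice(sentences)]
    if pages > 32:
        # Long book: first restrict to a randomly placed passage of about 1500 words.
        sentences = pick_passage(sentences, passage_words, rng)
    # Random start among the first three sentences, then every 10th sentence.
    start = rng.randrange(min(3, len(sentences)))
    return sentences[start::step]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
book = [f"sentence {i}" for i in range(1, 41)]  # hypothetical 40-sentence, 20-page book
picked = sample_sentences(book, pages=20, rng=rng)
print(picked)
```

For the hypothetical 40-sentence book, the call returns four sentences spaced ten apart, starting from one of the first three. Passing the random number generator in as an argument keeps the selection repeatable, which matters if a corpus needs to be rebuilt or audited later.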

Is the Protocol Equivalent and Robust?

A cross-corpora analysis examined equivalence in word-level characteristics between the non-sequential small corpus and the larger corpus. 

  • The differences in lexical diversity between the larger corpus and the non-sequential small corpus were slight, showing that the small-sized corpus approximates its larger counterpart well. The non-sequential sampling protocol drawing from 402 books was also compared to a same-sized small corpus (as measured by word count) constructed using sequential sentences from 7 books; the lexical diversity was lower in the sequential approach, supporting the robustness of the non-sequential sampling protocol. 
  • Word length was marginally different between the non-sequential small corpus and the large corpus in phoneme count (i.e., phonological length), but equivalent by akshara count (i.e., orthographic length). 
  • The Kannada orthography uses a form of orthographic representation called akshara. As would be anticipated, the larger corpus had greater orthographic diversity, or more unique akshara types. When calculated as proportions (the % of each akshara type), the non-sequential small corpus was able to match the larger corpus in capturing akshara diversity, including an ability to capture rarely occurring akshara. Akshara diversity also increased with book level. Further analyses explored the proportion of the different akshara types across book level and found akshara-specific patterns as the book level increased. For example, the proportion of CV akshara decreased, and the Ca and CCa akshara increased with increase in book level.
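To make the idea of comparing corpora "as proportions" concrete, here is a minimal Python sketch. It assumes each akshara token has already been hand-labelled with a type such as CV, Ca, or CCa. The label lists below are invented for illustration and are not figures from the study, and the function name `type_proportions` is hypothetical.

```python
from collections import Counter


def type_proportions(labels):
    """Return each category's share of all tokens, as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Made-up labels for illustration only; these are not results from the study.
large = ["CV"] * 60 + ["Ca"] * 30 + ["CCa"] * 10
small = ["CV"] * 6 + ["Ca"] * 3 + ["CCa"] * 1

large_pct = type_proportions(large)
small_pct = type_proportions(small)
for label in sorted(large_pct):
    print(label, large_pct[label], small_pct.get(label, 0.0))
```

In this toy case the small list is a tenth of the size of the large one, yet the percentage of each type is the same. Raw counts would differ by a factor of ten, which is why proportions are the fairer basis for comparing corpora of different sizes.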

How Could the Smaller Corpus Help Understand a Language?

In addition to the cross-corpora analysis that found the NSP-SCD to be a viable alternative to a larger corpus, a within-corpus analysis examined language-specific characteristics and developmental trends by exploring whether corpus statistics of the non-sequentially sampled corpus changed across book levels. 

  • Books increased in length, both by number of words and number of sentences, as book level increased. At the word level, phonological and orthographic lengths also both increased with book level.
  • Examining lexical diversity, manual coding revealed that different parts of speech (PoS) (i.e., nouns, verbs, adjectives, adverbs, pronouns, and other) each had a unique pattern of total occurrence across the corpus. As a language-specific characteristic of Kannada, the number of unique nouns was highest, followed by verbs, pronouns, and adverbs. The lowest number of unique occurrences was with adjectives, suggesting that describing words in children’s books are a smaller and much-repeated set.
  • Within each PoS, each book level had a unique trend relative to the other book levels. For example, across PoS and particularly for nouns and verbs, the TTR values had a steeper decline for the lowest book level (3–5 years old). This signifies that books targeting the early years use more repetitive words while books for older children use more unique words, and, thus, that lexical diversity increases with book level.

Promise of the Protocol

Open door image.  Illustration by Kathleen Kupiec
Necessity might require small-sized corpora be considered for understudied languages rather than detrimentally dismissing corpora development altogether. Despite the total word count of the non-sequential small corpus being 11.5% of the larger corpus and concerns that a smaller corpus could not accurately represent the psycholinguistic properties of a language, this study has illustrated that employing the NSP-SCD satisfies the need for close enough approximations in determining the language and writing within children’s books. 

That the NSP-SCD is both viable in its estimates and robust is meaningful for understudied languages, as it will reduce the time necessary to develop corpora. This will aid in making important contributions to research and practice without privileging select languages, particularly those with automatised parsing software. As the comparison between the non-sequential protocol and a similarly small-sized corpus built from sequential sentences found the non-sequential protocol to be superior, corpora development still benefits from drawing on a diverse sample of books (e.g., a variety of authors, publishers, themes, book categories) within a language. As such, it remains important to continue efforts to encourage further publication of child-directed materials and to focus on supporting a wider pool of authors writing about a larger variety of topics in a range of styles.

This study further showed how these corpora can be used. For instance, stratifying child-directed print corpora by book level aids in understanding developmental trends and increasing text demands, which can be used to support children’s language development. In addition, by capturing well the psycholinguistic features of a language as it is in use, such corpora enable the examination of language-specific characteristics and cross-language comparisons. For example, this study found that, like Turkish but unlike English, child-directed print materials in Kannada have more adverbs than adjectives. 

Child-directed print corpora help map word-level exposure and writing system encounters. Utilising the NSP-SCD in understudied languages will benefit countless children by supporting the development of better language and literacy experiments, interventions, assessments, and instruction, as well as, in turn, the Sustainable Development Goal of Quality Education [4].

Where to Next? 

Signpost image.  Illustration by Kathleen Kupiec
While the NSP-SCD has shown promise for examining psycholinguistic features, such as lexical diversity and orthographic diversity, smaller corpora might not be as comparable to larger corpora for applications that rely on features of contextual diversity. For example, a small corpus reduces the number of times each individual word appears, and a selective non-sequential approach removes the broader context around each sentence. As a result, co-occurrences and the proximity of words to each other have fewer chances of being captured and mapped, and distributional semantic networks and latent semantic analysis might, in turn, prove to be less accurate. This difference in contextual diversity could impact work such as creativity assessments using corpora-informed software (i.e., to measure flexibility and originality), and future research is needed.
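The concern about co-occurrences can be illustrated with a small Python sketch. It counts how often pairs of words appear in the same sentence, first in a full toy text and then in a sampled subset. The sentences, the every-second-sentence sampling, and the function name `cooccurrence_counts` are assumptions made for this example; they stand in for the idea only and are not the method used in the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count unordered word pairs that appear together within the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs


# Hypothetical toy sentences, not drawn from the corpus.
full = ["the dog runs", "the dog sleeps", "a cat sleeps"]
sampled = full[::2]  # every 2nd sentence, as a stand-in for non-sequential sampling

full_pairs = cooccurrence_counts(full)
sampled_pairs = cooccurrence_counts(sampled)
print(full_pairs[("dog", "the")], sampled_pairs[("dog", "the")])        # prints "2 1"
print(full_pairs[("dog", "sleeps")], sampled_pairs[("dog", "sleeps")])  # prints "1 0"
```

Even in this tiny example, one word pair is seen half as often in the sample and another disappears entirely. Methods that build meaning from such pair counts would have less evidence to work with.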

About the Author

Kathleen Kupiec is a doctoral student in the Department of Education at the University of Oxford, where she explores promising practices to enhance family experiences of, and maximize child outcomes from, informal educational environments. Her current research focuses on examining the potential that children’s museums have in supporting child creativity, as well as how to optimally tap into it on the museum floor during family visits.

References

[1] TalkTogether. (2023, July 5). Assessing speaking and listening: What is important? (1/4) [Video]. YouTube. https://www.youtube.com/watch?v=iwxc21LlRuU

[2] United Nations. (2023). The sustainable development goals report 2023: Special edition. United Nations. Retrieved from https://unstats.un.org/sdgs/report/2023/The-Sustainable-Development-Goals-Report-2023.pdf 

[3] United Nations. (1989). Convention on the rights of the child. United Nations. Retrieved from https://www.ohchr.org/en/instruments-mechanisms/instruments/convention-rights-child

[4] Results for Development Institute. (2016). Global book fund feasibility study: Final report. Results for Development Institute. Retrieved from https://r4d.org/wp-content/uploads/R4D-IEP_GBF_Full-Report_-web.pdf

[5] Dawson, N., Hsiao, Y., Wei Ming Tan, A., Banerji, N., & Nation, K. (2021). Features of lexical richness in children's books: Comparisons with child-directed speech. Language Development Research. https://lps.library.cmu.edu/LDR/article/id/77/

[6] Massaro, D. W. (2015). Two different communication genres and implications for vocabulary development and learning to read. Journal of Literacy Research, 47(4), 505–527. https://doi.org/10.1177/1086296x15627528

[7] Coillie, J. V. (2020). Diversity can change the world: Children’s literature, translation, and images of childhood. In J. V. Coillie & J. McMartin (Eds.), Children’s literature in translation (pp. 141–156). Leuven University Press.

[8] Mesquita, B. (2022). Between us: How cultures create emotions. W. W. Norton & Company.

[9] Puurtinen, T. (2003). Nonfinite constructions in Finnish children’s literature: Features of translationese contradicting translation universals? In S. Granger, J. Lerot, & S. Petch-Tyson (Eds.), Corpus-based approaches to contrastive linguistics and translation studies (pp. 141–154). Rodopi.

[10] Csikszentmihalyi, M. (1990). Flow: The psychology of optimal experience. Harper Perennial.

[11] Heaps, H. S. (1978). Information retrieval: Computational and theoretical aspects. Academic Press. 

[12] Herdan, G. (1960). Type-token mathematics: A textbook of mathematical linguistics. Mouton & Co. 

Credits: Illustrations by Kathleen Kupiec
