Validity Checks in Language Testing

Contributing Author: Mehar Kahlon

This article was written for the 2021 Corpus Roundtable event hosted by the TalkTogether team.

Test validity typically refers to whether a test measures what it aims to measure. Phrases such as “a valid test” or “a validated test” imply that validity is a quality of the test itself and implicitly place the burden of achieving validity on test developers rather than users [1]. Because testing involves judgement of the underlying abilities of individuals based on their test performance, a consideration of their lived experiences is necessary to secure fairness in measurement. In this piece, I aim to highlight some of the important considerations in validating language tests in low- and middle-income countries, particularly multilingual India. A word definition test (also called an expressive vocabulary test) will be used to illustrate these considerations, and linkages will be made to child-directed print corpora as a potential resource for guiding the development of such language and literacy tests. The focus will be on assessments for research purposes; assessments that serve classroom teaching also require a similar discussion but will not be covered here.

A helpful starting point for conceptualizing test validity is the Bachman and Palmer framework of test usefulness [2]. Test usefulness is defined as the utility of a test in achieving its intended outcome. This framework provides six parameters for examining the usefulness of a test; namely:

(1-2) psychometric properties of reliability and construct validity,

(3) authenticity: overlap between tasks encountered in the test and non-test situation,

(4) interactiveness: overlap between the level of engagement exercised in the test and other contexts,

(5) impact: effect of the test on the individuals being tested, the communities they belong to, and society at large, and

(6) practicality: resource availability for conducting the test.

To bring the concept of test usefulness into validation research, Bachman and Palmer also suggest that researchers produce an ‘assessment use argument’ [3] to justify test use, involving two parts: one argument to link test scores to interpretations (the assessment validity argument), and another to connect the interpretations to the decisions made (the assessment utilization argument). Indeed, they propose that “the ways in which we use tests is at the heart of language assessment” [3]. Therefore, examining the usefulness of a test in achieving its intended outcome is central to the validity question.

Related to the idea of test use is the importance of considering contextual realities. Nag and Vagh [4], for example, consider context while choosing assessments to evaluate language and literacy skills in the Akshara languages of India. They consider factors such as the bilingual and multilingual identities of children, access to books, socioeconomic status, and level of family engagement in literacy-related activities. They also emphasize linguistic factors such as the phonological characteristics of the languages and the visual complexity of their writing systems. Put differently, ignoring context and language-specific characteristics yields invalid representations of the underlying language abilities of children, and this is particularly so for children in resource-deprived contexts where access to books and family engagement in literacy activities are limited. Culture, community and language must be honoured to ensure that interpretations made from the test are valid.

As an example of running a validity check of a child language assessment tool, I will apply the ‘assessment use argument’ to a word definition task developed by Sonali Nag [5]. This is a test for assessing Kannada expressive vocabulary, and the test-takers are a hypothetical sample of multilingual children aged 5 and 6 years from the Indian city of Bangalore [see Note]. The linked parameters that I will use to evaluate the validity of this test are: the degree of correspondence between the assessment task and the underlying construct, the degree of prior exposure to assessment content, the age appropriateness of the items, fairness in the scoring criteria for responses, and the possibility of bias seeping in at any of these levels for our sample who, hypothetically, are Hindi, English and Kannada speakers.

Firstly, there is the degree of correspondence needed between the assessment task and the underlying construct being measured, as explained by discussant Shelley Stagg Peterson in the 2021 Corpus Roundtable hosted by TalkTogether. Vocabulary refers to the entire body of words in a language that a child assimilates over time, including the set of words that are understood by the child (receptive vocabulary) and those that are actively used by them (expressive vocabulary) [6]. A commonly used test to measure expressive vocabulary is the word definition task. In this test, children are presented with a word and are asked to define it. Given that rote learning is predominant in Indian classrooms and students are often expected to reproduce the information they encounter in their textbooks (particularly definitions) during examinations, a word definition task may not provide an accurate representation of the child’s understanding of the underlying meaning of the word. The multilingual context also makes unique demands on item selection: similar-sounding words may have different meanings in different languages (for example, ‘daari’ in Kannada means way or path whereas ‘dari’ in Hindi means a mat). The Hindi-English-Kannada trilingual children in the hypothetical sample taking the Kannada vocabulary test may mix up the two similar-sounding words, with test scores reflecting the confusion caused by such items. By replacing words that are confusable across the languages known to a child, a test can display greater sensitivity to the multilingual identities of test users.


Sample items: Word definition task by Nag (2008) [Source]

Secondly, there is the level of prior exposure of the children to the words being assessed. In this regard, one useful resource is the child-directed print corpus. The Kannada print corpus, compiled by the TalkTogether team, is composed of 411 children’s books (picture books, chapter books and short story collections) commonly read by children aged 3-10 years. The analysis of such databases allows test developers to check the frequency of appearance of particular words, thereby indicating the potential exposure that children may have to that word in print. For example, the frequency counts of sample items from the Kannada word definition task range from 0 (‘gombe’ or doll, ‘guDugitu’ or thunder, ‘suKhavaagi’ or happy, ‘tuntu’ or mischief, and ‘niyantrisu’ or control) to over 300 (‘citra’ or picture and ‘hasivu’ or hunger). Testing children on words that they are not likely to encounter regularly in print provides misleading information about their capabilities. Similar to written language exposure are patterns of exposure to spoken language: children whose home language is not the test language may not know the meaning of the test word because conversation partners restrict labelling of the object or phenomenon to their own home language. Thus, in addition to ensuring that test items are related to labels found in the child’s surroundings, it is important to acknowledge that the language in which those labels are supplied to the child is perhaps what test scores capture.
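To make this exposure check concrete, here is a minimal sketch of how corpus frequency counts might be computed for candidate test items. It assumes the corpus is available as plain-text strings and uses a simple whitespace tokenizer (real Kannada text would need script-aware tokenization); the romanized toy data below is illustrative, not actual corpus content.

```python
from collections import Counter

def corpus_frequencies(documents, test_items):
    """Count how often each candidate test word appears in a
    child-directed print corpus (given as a list of book texts)."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())  # naive whitespace tokenization
    return {word: counts[word] for word in test_items}

# Toy stand-in for the 411-book Kannada print corpus (illustrative only)
books = [
    "citra citra hasivu gombe",
    "hasivu citra daari",
]
items = ["citra", "hasivu", "gombe", "niyantrisu"]

freqs = corpus_frequencies(books, items)
# Items that never appear are flagged as low-exposure candidates
low_exposure = [w for w, c in freqs.items() if c == 0]
```

A check like this would surface items such as ‘niyantrisu’, which has a frequency of 0 in the actual corpus, as candidates for review before the test is finalized.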

Another validity check would be to see if the test has a mix of earlier- and later-acquired words, which helps to capture individual differences in the expressive vocabulary development of children. A recent TalkTogether survey asking parents, teachers and language experts about the age at which children first understand particular words is helpful for conducting this validation check. Of the 28 words constituting the vocabulary words in the test we are analysing, the majority were acquired by children between the ages of 4-5 and 6-7 years. Four words had an age of acquisition rating of 2-3 years and one word had a rating of 8-9 years. The test therefore seems to be age-appropriate for the selected sample of 5- and 6-year-old children because the majority of the words included are expected to be acquired before and by that age.
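This age-of-acquisition check can also be sketched in code. The function below summarizes a set of hypothetical ratings (here, midpoints of survey age bands; the words and values are illustrative, not the survey’s actual data) by computing the share of items expected to be acquired by the target testing age and flagging later-acquired items.

```python
def aoa_profile(aoa_ratings, target_age):
    """Summarize age-of-acquisition (AoA) ratings for test items.

    aoa_ratings: dict mapping word -> midpoint of its AoA band in years.
    Returns the share of items expected to be acquired by target_age,
    plus the list of items rated as acquired later."""
    later_items = [w for w, age in aoa_ratings.items() if age > target_age]
    share = (len(aoa_ratings) - len(later_items)) / len(aoa_ratings)
    return {"share_acquired": share, "later_items": later_items}

# Illustrative ratings echoing the survey bands described above
ratings = {"gombe": 2.5, "citra": 4.5, "hasivu": 4.5,
           "daari": 6.5, "niyantrisu": 8.5}
profile = aoa_profile(ratings, target_age=6)
```

A profile in which most items fall at or below the target age, with a few earlier- and later-acquired words, would support the age-appropriateness claim made above.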

Thirdly, discussing the scoring scheme is imperative for evaluating the fairness of the test. In the test we are analysing, children were given one point for repeating the word, even if this was extended using an inflection or idiomatic phrase, two for a correct sentential use of the word, and three for providing synonyms or translations of the word or defining it completely [5]. The scoring scheme did not penalize students for answering in the non-test language. Collecting qualitative information on why children answer in a specific way could help further analyze whether responses arose from a lack of knowledge of the concept or from other factors such as a different meaning for the same word in another language (see challenges in child assessment in South Africa highlighted by Laher and Cockcroft [7]).
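The scoring scheme described above can be expressed as a small lookup, which makes the fairness property explicit: translations earn full marks, so answering in a non-test language is not penalized. The response-category labels below are my own shorthand, not terminology from the original test.

```python
def score_response(response_type):
    """Score a word-definition response following the scheme in [5]:
    1 = repetition of the word (even with inflection or idiom),
    2 = correct use of the word in a sentence,
    3 = synonym, translation, or complete definition.
    Unscorable responses earn 0."""
    scheme = {
        "repetition": 1,
        "sentential_use": 2,
        "synonym": 3,
        "translation": 3,  # full marks: non-test language not penalized
        "definition": 3,
    }
    return scheme.get(response_type, 0)
```

Encoding the rubric this way also makes it easy to audit: one can see at a glance that no response category is scored differently depending on the language used.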

Lastly, where appropriate and possible, assessments must be co-created by researchers and members of the community being tested. This ensures that the items selected are relevant for the child being tested. The development and psychometric validation of assessments that consider the phenomenon being studied from the perspective of members of a culture (emic approaches) take time and additional effort. Researchers must reflect on the assessments they use by balancing these practical considerations with their commitment to fair assessment. Ultimately, a valid assessment must serve the community being assessed by accurately showcasing the abilities of each child being tested. A child-directed print corpus has the potential to support the creation of valid assessments by guiding the selection of age-appropriate and culturally sensitive items.

 

About the Author

Mehar Kahlon is an MSc student at the Department of Education, University of Oxford and a research assistant for The Promise Foundation.

Note

Kannada is a Southern Indian language, and Bangalore a city in South India with a substantial multilingual child population.


References

[1] Chapelle, C. A. (2012). Conceptions of Validity. In G. Fulcher & F. Davidson (Eds.), The Routledge Handbook of Language Testing (pp. 21–33). Taylor & Francis Group.

[2] Bachman, L.F., & Palmer, A. (1996). Language testing in Practice: Designing and developing useful language tests. Oxford: Oxford University Press.

[3] Bachman, L. F. (2005). Building and Supporting a Case for Test Use. Language Assessment Quarterly, 2(1), 1–34. https://doi.org/10.1207/s15434311laq0201_1

[4] Vagh, S. B. & Nag, S. (2019). The Assessment of Emergent and Early Literacy Skills in the Akshara languages. In M. Joshi, & C. McBride, (Eds). Handbook of Literacy in Akshara Orthography. Springer.

[5] Nag, S. (2008). Kannada vocabulary test. Bangalore, India: The Promise Foundation.

[6] Nag, S. (2017). Assessment of literacy and foundational learning in developing countries: Final report. London Health and Education Advice and Resource Team, Department for International Development (DFID). https://assets.publishing.service.gov.uk/media/593e6e6240f0b63e0b000249/Nag_Final_Report_20170517.pdf

[7] Laher, S., & Cockcroft, K. (2017). Moving from culturally biased to culturally responsive assessment practices in low-resource, multicultural settings. Professional Psychology, Research and Practice, 48(2), 115-121.
