Differential Item Functioning Analysis on the NIH Toolbox Picture Vocabulary Test in Black and White Participants
Faculty Advisor Name
Deborah Bandalos
Department
Department of Graduate Psychology
Description
Racial disparities have been investigated across many cognitive domains, with tasks targeting episodic memory, semantic memory, vocabulary, and executive functioning. Researchers have found that racial disparities in cognition are mediated by education, income, physical health, and external locus of control (Zahodne et al., 2017).
The goal of the current study was to conduct a differential item functioning (DIF) analysis to determine whether White and Black respondents differed in their probability of answering an item correctly after being matched on overall ability. I conducted a secondary analysis of data from the ARMADA (Advancing Reliable Measurement in Alzheimer’s Disease and Cognitive Aging) study, focusing on performance on the NIH Toolbox Picture Vocabulary Test (TPVT). The goal of ARMADA is to validate the NIH Toolbox, which is used to assess neurological and behavioral function. The TPVT is a computer-adaptive test (CAT) that targets auditory comprehension to assess the presence of language disorders (e.g., aphasia) or neurodegenerative diseases (Gershon et al., 2014).
Vocabulary is moderated by the educational experience of an individual (Gershon et al., 2014), which is in turn influenced by cultural background. Differences in these variables may have contributed to the differential performance researchers have observed. Gershon and colleagues (2014) stated that the validity of the TPVT should continue to be evaluated, which I aimed to do in the current study. Like the TPVT, the Boston Naming Test (BNT) has been used to examine different aphasic syndromes (Na & King, 2019). Differential performance between Black and White participants on the BNT prompted Pedraza et al. (2009) to conduct a DIF analysis of the BNT in African American and Caucasian older adults. As defined by Bandalos (2016), DIF is a difference in the probability of answering an item correctly after controlling for levels of the construct of interest. The presence of DIF would indicate that something other than the construct of interest is contributing to the differential performance of a specific group.
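The definition above can be written compactly. The notation here is an assumption introduced for illustration and does not come from the cited sources: Y_i is the scored response to item i (1 = correct), theta is the level of the construct of interest, and G is group membership, with R the reference group and F the focal group.

```latex
% No DIF on item i: respondents matched on theta have the same
% probability of a correct response, whichever group they belong to.
\[
P(Y_i = 1 \mid \theta,\, G = \mathrm{R})
  \;=\;
P(Y_i = 1 \mid \theta,\, G = \mathrm{F})
\qquad \text{for all } \theta .
\]
% DIF is present on item i when this equality fails for at least one
% value of theta.
```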
In this study, I extend Pedraza and colleagues’ (2009) framework for assessing DIF in a neuropsychological measure of language. While the BNT targets vocabulary comprehension through naming ability, the TPVT targets vocabulary through the recognition of images. The current study used the Mantel-Haenszel (MH) and logistic regression (LR) procedures to examine the TPVT for DIF. Although Pedraza et al. used item response theory in their study, I used MH and LR because the sample sizes were quite small; Belzak (2020) found that LR is a powerful tool for detecting uniform DIF (differences in item difficulties) with samples as small as 25 per group.
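As a rough illustration of the MH procedure named above, the following Python sketch computes the standard Mantel-Haenszel common odds ratio, its ETS delta transformation, and the continuity-corrected MH chi-square for a single item. It is not the analysis code used in this study. The function name `mantel_haenszel_dif`, its arguments, the group labels, and the toy data are all hypothetical, and the sketch assumes that respondents have already been sorted into ability strata (for a CAT such as the TPVT, this might be binned ability estimates rather than raw total scores).

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(correct, group, stratum, reference="White"):
    """Mantel-Haenszel DIF statistics for one item (illustrative sketch).

    correct : iterable of 0/1 item scores
    group   : iterable of group labels; `reference` marks the reference group,
              every other label is treated as the focal group
    stratum : iterable of matching-variable levels (hypothetical ability bins)
    Returns (alpha_mh, delta_mh, chi_square).
    """
    # One 2x2 table per stratum: [ref correct, ref incorrect,
    #                             focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for y, g, s in zip(correct, group, stratum):
        if g == reference:
            tables[s][0 if y else 1] += 1
        else:
            tables[s][2 if y else 3] += 1

    num = den = 0.0                      # pieces of the common odds ratio
    sum_a = sum_exp = sum_var = 0.0      # pieces of the MH chi-square
    for a, b, c, d in tables.values():
        n = a + b + c + d
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d            # total correct / total incorrect
        if n < 2 or n_ref == 0 or n_foc == 0:
            continue                     # stratum carries no comparison
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_exp += n_ref * m1 / n
        sum_var += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))

    if num == 0 or den == 0 or sum_var == 0:
        raise ValueError("MH statistics are undefined for these data")

    alpha = num / den                    # > 1: item favors the reference group
    delta = -2.35 * math.log(alpha)      # ETS delta; negative favors reference
    chi_square = (abs(sum_a - sum_exp) - 0.5) ** 2 / sum_var
    return alpha, delta, chi_square


if __name__ == "__main__":
    # Toy data for one ability stratum, invented purely for illustration.
    scores = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6
    groups = ["White"] * 10 + ["Black"] * 10
    strata = ["mid"] * 20
    print(mantel_haenszel_dif(scores, groups, strata))
```

The chi-square is referred to a chi-square distribution with one degree of freedom. The LR procedure is not shown here; it would instead regress the item response on the matching variable and group membership.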
Forty-four items met the previously mentioned sample size requirement, and nine were flagged for DIF. Of these, three favored Black respondents and five favored White respondents. In the presentation, I will discuss the full results and possible reasons for DIF in these items.