The Accuracy of Online Testing: A Reliability and Validity Investigation

Faculty Advisor Name

Dr. John Hathcoat and Dr. Yu Bao

Department

Department of Graduate Psychology

Description

James Madison University (JMU) has conducted Assessment Day in person for over 30 years to assess student learning in general education and other university-wide initiatives (Pastor et al., 2019). A typical Assessment Day tests around 4,000 students in person across three two-hour proctored sessions. The assessments are low stakes because scores have no personal consequences for the students. Starting in the fall of 2020, Assessment Day moved online (Pastor & Love, 2020). The assessments remained unchanged in length and content, but the remote Assessment Days were asynchronous and unproctored, and the assessments remained open for three weeks in the fall (Pastor & Love, 2020) and 24 hours in the spring.

Higher education professionals raised concerns about the quality of data collected through online testing (Jankowski, 2020). These are, at their core, validity concerns. Valid scores accurately capture the intended construct (Messick, 1995), but several factors can compromise validity, such as construct-irrelevant variance. This systematic error can be introduced through low examinee effort, particularly when effort is related to test scores (Wise, Pastor, & Kong, 2009). Reliability may also be affected. Reliability captures the accuracy of measurements (Cronbach, 1951), or the consistency with which examinees are separated into high- and low-scoring groups. If students are not putting forth their best effort and begin to respond rapidly (i.e., guess), random error is introduced (Wise et al., 2009). Random error typically reduces reliability, but Wise (2006) noted that reliability can sometimes increase because of rapid responders, which runs counter to what reliability is meant to represent.
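To make the reliability mechanics above concrete, the sketch below computes coefficient alpha (Cronbach, 1951) for a matrix of 0/1 item scores and compares a simulated attentive sample with the same sample after a block of rapid guessers is mixed in. This is an illustrative sketch only, not the study's analysis code; the sample sizes, item counts, and data-generating assumptions are hypothetical.

```python
# Illustrative sketch: coefficient alpha on simulated 0/1 item scores.
# All data below are hypothetical; this is not the study's analysis code.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Coefficient alpha for rows = examinees, columns = scored items."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
# Attentive examinees: responses driven (noisily) by ability.
attentive = (ability + 3 * rng.normal(size=(200, 30)) > 0).astype(int)
# Rapid guessers: chance-level responses unrelated to ability.
guessers = rng.binomial(1, 0.2, size=(50, 30))

print(cronbach_alpha(attentive))                         # attentive sample only
print(cronbach_alpha(np.vstack([attentive, guessers])))  # with rapid guessers mixed in
# Comparing the two estimates shows how a block of uniformly low-scoring
# guessers changes the covariance structure; as Wise (2006) notes, this
# can push a reliability estimate up rather than down.
```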

This project investigates changes in the reliability and validity of the ERITXA (a 90-question ethical reasoning test) between the online administration in spring 2021 and previous in-person administrations. All comparisons are between sophomore cohorts (students with 45–70 credits). Relative to previous years, reliability was unexpectedly higher in spring 2021. Higher reliability may seem ideal, but rapid responders can falsely inflate reliability estimates (Wise, 2006). Because item-level timing information was not available, rapid responders were identified using a threshold on overall test time. Even so, when rapid responders were removed from the spring 2021 data, reliability decreased. This decrease has implications for the validity of the scores: rapid responders introduced construct-irrelevant variance, which artificially raised reliability while lowering validity. An item analysis also revealed differences in item difficulty and item discrimination, and these differences were reflected in the lower overall scores in spring 2021 compared to previous years. Together, these findings suggest that reliability and validity were somewhat compromised in the online administration of this assessment.
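The screening and item-analysis steps described above can be sketched as follows, assuming a data set with one row per examinee, a total-time column, and 0/1 item-score columns. This is a hypothetical illustration, not the study's code: the file name, column names, and the 900-second cutoff are placeholders, and the flag is based on total test time because item-level timing was unavailable.

```python
# Hypothetical sketch of rapid-responder screening and classical item analysis.
import pandas as pd

def flag_rapid_responders(df: pd.DataFrame, time_col: str, threshold_sec: float) -> pd.Series:
    """Flag examinees whose total testing time falls below a chosen threshold."""
    return df[time_col] < threshold_sec

def item_analysis(item_scores: pd.DataFrame) -> pd.DataFrame:
    """Item difficulty (proportion correct) and corrected item-total discrimination."""
    total = item_scores.sum(axis=1)
    rows = []
    for item in item_scores.columns:
        rest = total - item_scores[item]  # exclude the item from its own criterion
        rows.append({
            "item": item,
            "difficulty": item_scores[item].mean(),
            "discrimination": item_scores[item].corr(rest),
        })
    return pd.DataFrame(rows)

# Usage sketch (placeholder file and column names):
# df = pd.read_csv("eritxa_spring2021.csv")
# items = [c for c in df.columns if c.startswith("q")]
# keep = ~flag_rapid_responders(df, "total_seconds", threshold_sec=900)
# print(item_analysis(df.loc[keep, items]))
```

Comparing reliability and item statistics with and without the flagged examinees parallels the comparison described above, in which removing rapid responders decreased the reliability estimate.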

This project's results support previous research showing that rapid responding threatens the reliability and validity of online assessment. Efforts should be made to identify and reduce rapid responding (Wise, 2006). Information from the item difficulty and discrimination analyses can also inform future administrations of this test and help address reliability and validity concerns in online assessment.

References

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.

Jankowski, N. A. (2020, August). Assessment during a crisis: Responding to a global pandemic. Urbana, IL: University of Illinois and Indiana University, National Institute for Learning Outcomes Assessment.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741

Pastor, D. A., Foelber, K. J., Jacovidis, J. N., Fulcher, K. H., Sauder, D. C., & Love, P. D. (2019). University-wide assessment days: The James Madison University model. The Association for Institutional Research (AIR) Professional File, 144, 1-13.

Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes computer-based test. Applied Measurement in Education, 19(2), 95–114. https://doi.org/10.1207/s15324818ame1902_2

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1–17. https://doi.org/10.1207/s15326977ea1001_1

Wise, S. L., Pastor, D. A., & Kong, X. J. (2009). Correlates of rapid-guessing behavior in low-stakes testing: Implications for test development and measurement practice. Applied Measurement in Education, 22(2), 185–205. https://doi.org/10.1080/08957340902754650
