Evaluating halo effect in performance assessments: A Rasch measurement model simulation study

Presenter Information

Yelisey Shapovalov

Faculty Advisor Name

Dr. Christine DeMars

Department

Department of Graduate Psychology

Description

Performance assessments have grown in popularity but are prone to validity threats from rater effects because of the subjective nature of the scoring process. One reason educators prefer performance assessments is their greater fidelity to real-life scenarios compared with closed-form tests, such as those using multiple-choice or true-false questions. A performance assessment requires the examinee to engage in a process or produce a product, which is then scored according to agreed-upon criteria. The scoring is often completed by human raters, who are typically trained on the scoring rubric so that they can apply its criteria to student work systematically. However, because raters must subjectively apply their understanding of the scoring criteria to their interpretation of the student work, even trained raters exhibit tendencies that inadvertently influence scores. For instance, researchers have found that some raters form an overall impression of a work sample that then influences their scores across criteria intended to be independent, a prominent rater effect known as the halo effect. Scores carrying such influences reflect not only student ability but also rater effects, threatening score validity.
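To make the halo effect concrete, the following is a minimal simulation sketch, not drawn from the thesis data: a halo-prone rater blends a single overall impression into every criterion score, which inflates the correlations among criteria relative to an unbiased rater. All parameter values here are hypothetical.

```python
# Illustrative sketch of halo in ratings; all parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(42)
n_examinees, n_criteria = 200, 5

# Latent examinee proficiencies, independent across criteria
true_scores = rng.normal(0.0, 1.0, size=(n_examinees, n_criteria))

def rate(scores, halo_weight, rng):
    """Return 1-6 ratings; halo_weight pulls criteria toward one impression."""
    impression = scores.mean(axis=1, keepdims=True)  # rater's overall impression
    blended = (1 - halo_weight) * scores + halo_weight * impression
    noisy = blended + rng.normal(0.0, 0.3, size=blended.shape)
    return np.clip(np.round(noisy + 3.5), 1, 6)

unbiased = rate(true_scores, halo_weight=0.0, rng=rng)
halo = rate(true_scores, halo_weight=0.9, rng=rng)

def mean_inter_criterion_r(ratings):
    r = np.corrcoef(ratings.T)                     # criteria-by-criteria matrix
    return r[np.triu_indices_from(r, k=1)].mean()  # average off-diagonal r

# Halo inflates the correlations among supposedly independent criteria
print(f"mean inter-criterion r, unbiased rater: {mean_inter_criterion_r(unbiased):.2f}")
print(f"mean inter-criterion r, halo rater:     {mean_inter_criterion_r(halo):.2f}")
```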

The Many-Facet Rasch Model (MFRM) was developed to account for how raters apply scores to each rubric element. Based on the fit of individual raters to the MFRM, Myford and Wolfe (2004) used the results of a simulation study to recommend criteria for judging the degree of halo effect present. A simulation study predetermines certain conditions (such as the extent of halo effect) and then evaluates how accurately a proposed method (such as using measurement models to detect halo effect) recovers those conditions. However, all simulation studies are limited in their capacity to capture real-life nuances, and their cutoff criteria or recommendations may not generalize to empirical settings with different conditions.
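For reference, one common rating-scale formulation of the many-facet Rasch model (following Linacre's parameterization; the exact specification used in the thesis may differ) gives the log-odds that examinee n receives rating category k rather than k-1 on criterion i from rater j:

```latex
\[
  \log\frac{P_{nijk}}{P_{nij(k-1)}}
    = \theta_n - \beta_i - \lambda_j - \tau_k
\]
% \theta_n  : proficiency of examinee n
% \beta_i   : difficulty of criterion (rubric element) i
% \lambda_j : severity of rater j
% \tau_k    : difficulty of moving into rating category k
```

Fit statistics for individual raters estimated under a model of this kind are among the indicators to which Myford and Wolfe's recommendations apply.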

Building upon my thesis research, in which I applied recommendations derived from a simulation study to detect halo effects in an empirical dataset from a national organization, I will investigate the appropriateness of applying Myford and Wolfe's recommendations to simulated data modeled on the original thesis dataset. Ideally, these recommendations will identify the raters in my simulated dataset for whom halo effects were generated. The degree to which these recommendations identify the halo-effect raters, and correctly classify non-halo raters as not exhibiting halo effects, will indicate the suitability of applying them in different contexts.
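As a sketch of how that classification check could be summarized, assuming simulated raters with known halo status and a placeholder flagging rule (not Myford and Wolfe's actual criteria):

```python
# Hypothetical sketch: compare raters flagged by a detection rule against the
# raters simulated to have halo, summarizing sensitivity and specificity.
# The fit statistic, its distribution, and the cutoff below are placeholders.
import numpy as np

rng = np.random.default_rng(7)
n_raters = 50
true_halo = np.zeros(n_raters, dtype=bool)
true_halo[:10] = True  # first 10 raters simulated with halo

# Placeholder fit statistic: halo raters tend toward overfit (values below 1)
fit_stat = np.where(true_halo,
                    rng.normal(0.6, 0.1, n_raters),
                    rng.normal(1.0, 0.1, n_raters))

flagged = fit_stat < 0.8  # hypothetical cutoff, not a published criterion

sensitivity = (flagged & true_halo).sum() / true_halo.sum()
specificity = (~flagged & ~true_halo).sum() / (~true_halo).sum()
print(f"sensitivity (halo raters correctly flagged):     {sensitivity:.2f}")
print(f"specificity (non-halo raters correctly cleared): {specificity:.2f}")
```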

Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5, 189–227.
