An Item Response Tree Model for Validating Rubric Scoring Processes

Faculty Advisor

Dr. Allison Ames

Description

Performance assessments are considered a more authentic approach to measuring complex skills and knowledge than multiple-choice testing. Ideally, performance assessments are scored by trained raters using well-developed rubrics. However, many factors, such as rater fatigue, drift, leniency, harshness, and inconsistency, may lead raters to assign different scores to the same level of performance. Raters may also arrive at different scores for the same performance when their application of the rubric is not aligned with its intended application, although previous think-aloud studies have indicated that misaligned rubric application does not always produce different scores. Because the interpretation of performance assessment scores assumes raters apply rubrics as the rubric developers intended, misalignment between raters’ scoring processes and content experts’ intended scoring processes may lead to invalid inferences from performance assessment scores.

An alternative scoring method—the Diagnostic Rating System (DRS)—was developed to standardize the scoring processes used by raters. With this method, the rubric developers’ intended scoring processes are made explicit: raters respond to a series of nested, logic-based, selected-response statements resembling a decision tree. Accordingly, scoring via the DRS may standardize raters’ scoring processes and mitigate scoring subjectivity. The DRS was administered via an online survey system to ensure raters were exposed only to the statements relevant to their previous responses within the branching network of nested statements.
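To make the branching concrete, the sketch below shows how a nested, logic-based selected-response structure of this kind could be encoded so that each response determines which statement is presented next. The statement wording, node labels, and three-point rating scale are hypothetical placeholders, not the actual DRS content.

    # Hypothetical sketch of a DRS-style branching structure (not the actual DRS items).
    # Each internal node presents a yes/no statement; the rater's response determines
    # which node (or terminal rating) appears next.
    DRS_TREE = {
        "n1": {"statement": "The essay identifies an ethical dilemma.",
               "yes": "n2", "no": "rating_1"},
        "n2": {"statement": "The essay applies the reasoning framework to the dilemma.",
               "yes": "rating_3", "no": "rating_2"},
    }

    def score_essay(responses):
        """Traverse the tree using a dict of {node_id: 'yes'/'no'} responses
        and return the terminal rating that is reached."""
        node = "n1"
        while not node.startswith("rating"):
            node = DRS_TREE[node][responses[node]]
        return int(node.split("_")[1])

    # Example: a rater who endorses the first statement but not the second
    # reaches the middle rating category.
    print(score_essay({"n1": "yes", "n2": "no"}))  # -> 2

Because the survey system routes raters through exactly one path per essay, every rating is tied to an explicit sequence of decisions rather than a single holistic judgment.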

The purpose of the current study is twofold. The first purpose is to determine whether raters scoring essays via the DRS method are scoring the essays as intended by the rubric developers. The second purpose is to determine whether raters scoring essays via the traditional rubric use the same cognitive scoring processes as raters scoring essays via the DRS. Recall that applying the rubric in a manner not aligned with its intended application may not support the intended interpretation of scores.

Data were collected from a mid-Atlantic public university. Examinees were asked to compose an essay about a personal ethical dilemma using the 8 Key Question ethical reasoning framework. Independent groups of raters scored the essays using either the traditional rubric or the DRS. To determine whether raters scored essays as intended under both the traditional rubric and the DRS, an item response theory model with a tree-like structure (i.e., IRTree) was specified. The IRTree was specified to depict the raters’ hypothesized decision-making process made explicit by the DRS. The decision-making process is modeled as a series of sub-processes (decisions) represented by internal nodes that branch either to other internal nodes or to a terminal node (rating). Given adequate model-data fit, the IRTree model should elucidate whether the raters’ scoring processes adhere to the rubric developers’ intentions. Preliminary analysis suggests raters using the DRS are better able to rate analytically, as intended, rather than being influenced by the holistic quality of the essay. Because the validity of inferences made from performance assessment scores depends on consistent and intended use of rubrics, the DRS may be a viable alternative to traditional rubric scoring.
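As an illustration of how such a model is commonly specified, the sketch below recodes each observed rating into binary node-level pseudo-items using a mapping matrix, the usual data-preparation step for IRTree models; each pseudo-item can then be analyzed with a binary IRT or generalized linear mixed model. The two-node tree and the mapping from a three-category rating are hypothetical stand-ins for the hypothesized decision process, not the study’s actual specification.

    import numpy as np

    # Hypothetical IRTree mapping for a three-category rating (1, 2, 3) produced
    # by two sequential decision nodes:
    #   node 1: does the essay meet the first criterion? (no -> rating 1)
    #   node 2: does the essay meet the second criterion? (no -> rating 2, yes -> rating 3)
    # Rows are rating categories; columns are nodes. np.nan marks nodes a rater
    # never reaches for that rating, so they contribute no pseudo-item response.
    MAPPING = np.array([
        [0, np.nan],   # rating 1: failed node 1, node 2 not reached
        [1, 0],        # rating 2: passed node 1, failed node 2
        [1, 1],        # rating 3: passed node 1, passed node 2
    ])

    def to_pseudo_items(ratings):
        """Expand observed ratings (values 1-3) into an essays-by-nodes matrix
        of binary pseudo-item responses (NaN for unreached nodes)."""
        return MAPPING[np.asarray(ratings) - 1]

    # Example: three essays rated 2, 3, and 1.
    print(to_pseudo_items([2, 3, 1]))
    # [[ 1.  0.]
    #  [ 1.  1.]
    #  [ 0. nan]]

Under this kind of specification, each internal node receives its own parameters, so the fitted node-level estimates indicate whether raters’ decisions at each step behave as the intended scoring process predicts.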
