Item Analysis in Criterion-Referenced Testing: A Study of Differences in Pre, Post, and Pre-Post Judgements.
Faculty Advisor: John D. Hathcoat
Department: Department of Graduate Psychology
Description
Items on cognitive tests are often evaluated by examining their difficulty and discrimination. Difficulty, according to classical test theory, is defined as the proportion of individuals who answer an item correctly (the p-value), whereas discrimination indicates the degree to which an item can distinguish individuals who are high on a trait from those who are low on a trait (Haladyna, 2015). Discrimination can be examined by calculating a discrimination index (D), defined as the difference between the difficulty of high-achieving examinees (e.g., the upper 27%) and low-achieving examinees (e.g., the lower 27%). These indices help practitioners judge whether items are "good" or whether they are candidates for review by content experts.

There are numerous situations in which the results of an item analysis may be inadvertently influenced by characteristics of the population and the intended aim of the researcher. For example, in educational settings, instruments are often created to evaluate the effectiveness of interventions using a pre-post design. In such situations, one is more interested in making criterion-referenced decisions (i.e., mastery versus non-mastery) than norm-referenced decisions (i.e., distinguishing one student from another). Items should, therefore, be instructionally sensitive, in that they are capable of detecting differences in the quality of instruction received (Polikoff, 2010). Traditional item analysis at either the pretest or posttest would likely fail to distinguish masters (i.e., those who have received instruction) from non-masters (i.e., those who have not). Consequently, some have recommended calculating differences in p-values for an item before and after an intervention when examining item quality in educational interventions (e.g., Polikoff, 2010).

Our study examined the extent to which judgments about item quality differ depending on whether the analysis focuses on the pretest only, the posttest only, or pre-post differences in an applied educational setting. Specifically, we examined the scores of 288 students on a 30-item information literacy exam (α = .67) at pretest and posttest and obtained measures of item quality based on both traditional item analysis and instructional sensitivity. Preliminary results indicated important differences between these indices, leading to different judgments about item quality. This suggests that caution is warranted when developing an instrument to assess the effectiveness of an intervention, as a traditional item analysis, as defined above, may lead to inappropriate conclusions about item quality.
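To make the indices described above concrete, the sketch below computes item difficulty (p-value), the discrimination index (D), and a pre-post difference index from dichotomously scored (0/1) response matrices. It is a minimal illustration under stated assumptions: the arrays pre and post, the simulated responses, and the 27% grouping rule are hypothetical and do not reproduce the study's data or analysis.

    import numpy as np

    def p_value(scores):
        # Item difficulty: proportion of examinees answering each item correctly.
        return scores.mean(axis=0)

    def discrimination_index(scores, tail=0.27):
        # D: difference in p-values between the upper and lower groups
        # (here, 27% of examinees ranked by total test score).
        totals = scores.sum(axis=1)
        order = np.argsort(totals)
        n = int(round(tail * len(totals)))
        low, high = scores[order[:n]], scores[order[-n:]]
        return p_value(high) - p_value(low)

    def pre_post_difference(pre, post):
        # Instructional sensitivity: change in item difficulty after instruction.
        return p_value(post) - p_value(pre)

    # Illustrative simulated data only (288 students, 30 items), not the study data.
    rng = np.random.default_rng(0)
    pre = rng.binomial(1, 0.45, size=(288, 30))
    post = rng.binomial(1, 0.70, size=(288, 30))

    print(p_value(post)[:5])                   # posttest difficulty, first five items
    print(discrimination_index(post)[:5])      # D at posttest
    print(pre_post_difference(pre, post)[:5])  # pre-post difference index

An item could then be flagged for review when, for example, its posttest D is near zero while its pre-post difference is also small; the cut points used for such flags are a judgment call and are not specified here.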