JMU Scholarly Commons
Copyright (c) 2017 James Madison University. All rights reserved.
http://commons.lib.jmu.edu
Recent documents in JMU Scholarly Commons (en-us)
Wed, 26 Jul 2017 01:49:12 PDT

High Risk Drinking Concerns across College Campuses and a Look at JMU Programming
http://commons.lib.jmu.edu/gradpsych/56
Tue, 25 Jul 2017 09:24:34 PDT
The purpose of this project is to take a deeper look at excessive alcohol use in the college setting and to review the prevention and support programs and services available to this population for reducing the likelihood of ongoing high-risk drinking. The project contains a literature review of emerging adulthood and its developmental tasks, the impact of alcohol on the emerging adult brain, and gender differences that may shape attitudes and decisions about alcohol. It concludes with implications for counselors who may want to work in a college setting and provide substance abuse counseling.
Rachel C. Tysinger

Protecting the Protectors: Enhancing Emotional Well-Being in Law Enforcement
http://commons.lib.jmu.edu/gradpsych/55
Tue, 25 Jul 2017 09:24:30 PDT
Law enforcement officers face a myriad of stressors, both personal and professional, and regularly suffer serious outcomes that affect their physical health and psychological well-being. Fortunately, counselors have important skills that can be used to assist officers in building resilience, coping with stress, and managing negative outcomes such as posttraumatic stress disorder and interpersonal troubles. This project outlines the various difficulties that law enforcement officers may experience, explores current practices to manage these concerns, and discusses useful approaches counselors and law enforcement agencies can take in supporting their most valuable assets.
Olivia Gillies

Partially-compensatory multi-dimensional IRT models: Two alternate model forms
http://commons.lib.jmu.edu/gradpsych/54
Tue, 25 Jul 2017 09:24:26 PDT
Partially compensatory models may capture the cognitive skills needed to answer test items more realistically than compensatory models, but estimating the model parameters may be a challenge. Data were simulated to follow two different partially compensatory models, a model with an interaction term and a product model. The model parameters were then estimated for both models and for the compensatory model. Either the model used to simulate the data or the compensatory model generally had the best fit, as indexed by information criteria. Interfactor correlations were estimated well by both the correct model and the compensatory model. The predicted response probabilities were most accurate from the model used to simulate the data. Regarding item parameters, root mean square errors seemed reasonable for the interaction model but were quite large for some items for the product model. Thetas were recovered similarly by all models, regardless of the model used to simulate the data.
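The contrast between the model forms discussed in this abstract can be sketched numerically. In a compensatory model a deficit on one trait can be offset by the other, while in the product form overall success requires succeeding on each component. The functions and parameter values below are illustrative assumptions, not taken from the study:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_compensatory(theta1, theta2, a1, a2, d):
    # Compensatory M2PL: traits combine additively in the logit.
    return logistic(a1 * theta1 + a2 * theta2 + d)

def p_product(theta1, theta2, a1, b1, a2, b2):
    # Partially compensatory "product" model: the probability is the
    # product of component probabilities, so the weaker trait dominates.
    return logistic(a1 * (theta1 - b1)) * logistic(a2 * (theta2 - b2))

def p_interaction(theta1, theta2, a1, a2, a12, d):
    # Alternate form with an interaction term in the logit.
    return logistic(a1 * theta1 + a2 * theta2 + a12 * theta1 * theta2 + d)
```

For an examinee at (2, -2) versus one at (0, 0), the compensatory model gives the same probability when the slopes are equal, while the product model gives a much lower probability to the unbalanced profile.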
Christine E. DeMars

The interaction of ability differences and guessing when modeling DIF with the Rasch model: Conventional and tailored calibration.
http://commons.lib.jmu.edu/gradpsych/53
Tue, 25 Jul 2017 09:24:22 PDT
In educational testing, differential item functioning (DIF) statistics must be accurately estimated to ensure the appropriate items are flagged for inspection or removal. This study showed how using the Rasch model to estimate DIF may introduce considerable bias in the results when there are large group differences in ability (impact) and the data follow a three-parameter logistic model. With large group ability differences, difficult non-DIF items appeared to favor the focal group and easy non-DIF items appeared to favor the reference group. Correspondingly, the effect sizes for DIF items were biased. These effects were mitigated when data were coded as missing for item–examinee encounters in which the person measure was considerably lower than the item location. Explanation of these results is provided by illustrating how the item response function becomes differentially distorted by guessing depending on the groups’ ability distributions. In terms of practical implications, results suggest that measurement practitioners should not trust the DIF estimates from the Rasch model when there is a large difference in ability and examinees are potentially able to answer items correctly by guessing, unless data from examinees poorly matched to the item difficulty are coded as missing.
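The mechanism described here — a nonzero lower asymptote distorting the item response function relative to a Rasch fit — can be illustrated with a toy computation; the parameter values are hypothetical, not from the study:

```python
import math

def p_3pl(theta, a, b, c):
    # Three-parameter logistic: lower asymptote c models guessing.
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_rasch(theta, b):
    # Rasch model: discrimination fixed at 1, no guessing asymptote.
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

On a difficult item (b = 2) with guessing (c = 0.2), a low-ability examinee at theta = -2 succeeds roughly an order of magnitude more often than a guessing-free Rasch curve predicts, which a Rasch-based DIF analysis can read as the item favoring the lower-ability group.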
Christine E. DeMars et al.

DIF detection with the Mantel-Haenszel procedure: The effects of matching type and other factors
http://commons.lib.jmu.edu/gradpsych/52
Tue, 25 Jul 2017 09:24:18 PDT
The Mantel-Haenszel (MH) procedure is commonly used to detect items that function differentially for groups of examinees from various demographic and linguistic backgrounds—for example, in international assessments. As in some other DIF methods, the total score is used to match examinees on ability. In thin matching, each total score point is used as its own matching category, whereas in thick matching the total score is discretized into several score ranges. Evidence regarding how matching type affects the accuracy of the MH procedure is inconclusive. The current study investigated the effects of thin and thick matching in conjunction with sample size, purification, symmetric and asymmetric group sample sizes, test length, and differences in the ability distributions. Results suggest that, whenever feasible, purification should be used in conjunction with thin matching.
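A minimal sketch of the MH common odds ratio pooled over matching strata, showing that thin matching (one stratum per score point) and thick matching (pooled score ranges) can yield different estimates. The 2x2 counts below are invented for illustration:

```python
import math

def mh_odds_ratio(tables):
    # tables: one (ref_right, ref_wrong, foc_right, foc_wrong) tuple per
    # matching stratum (a score point under thin matching, a score range
    # under thick matching).
    num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den

def mh_delta(alpha):
    # ETS delta metric; 0 means no DIF, negative flags DIF against focal.
    return -2.35 * math.log(alpha)

# Thin matching: two separate score-point strata.
thin_alpha = mh_odds_ratio([(30, 10, 20, 20), (40, 5, 35, 10)])
# Thick matching: the same examinees pooled into one stratum.
thick_alpha = mh_odds_ratio([(70, 15, 55, 30)])
```

Pooling the strata changes the estimate here, which is the granularity effect the study investigates.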
Alan Socha et al.

Estimating Variance Components from Sparse Data Matrices in Large-Scale Educational Assessments
http://commons.lib.jmu.edu/gradpsych/51
Tue, 25 Jul 2017 09:24:14 PDT
In generalizability theory studies in large-scale testing contexts, sometimes a facet is very sparsely crossed with the object of measurement. For example, when assessments are scored by human raters, it may not be practical to have every rater score all students. Sometimes the scoring is systematically designed such that the raters are consistently grouped throughout the scoring, so that the data can be analyzed as raters nested within teams. Other times, rater pairs are randomly assigned for each student, such that each rater is paired with many other raters at different times. One possibility for this scenario is to treat the data as if raters were nested within students. Because the raters are not truly independent across all students, the resulting variance components could be somewhat biased. This study illustrates how the bias will tend to be small in large-scale studies.
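For the raters-nested-within-students treatment described above, the two variance components can be estimated from mean squares. A minimal sketch assuming a balanced design with the same number of raters per student (the function name and data are illustrative, not from the study):

```python
def variance_components_nested(scores):
    # scores: one list of ratings per student, each of length k
    # (raters nested within students, an r:p design).
    n = len(scores)               # number of students
    k = len(scores[0])            # raters per student
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    # Expected mean squares: MS_p = k*var_p + var_rp ; MS_rp = var_rp
    ms_p = k * sum((m - grand) ** 2 for m in person_means) / (n - 1)
    ms_rp = sum((x - m) ** 2
                for row, m in zip(scores, person_means)
                for x in row) / (n * (k - 1))
    var_rp = ms_rp
    var_p = max((ms_p - ms_rp) / k, 0.0)  # truncate negative estimates
    return var_p, var_rp
```

When rater pairs are randomly re-formed for each student, as in the abstract's second scenario, the same raters recur across "nested" slots, which is the source of the (small, per the study) bias in these estimates.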
Christine E. DeMars

An Illustration of the Effects of Ignoring a Secondary Factor
http://commons.lib.jmu.edu/gradpsych/50
Tue, 25 Jul 2017 09:24:10 PDT
The purpose of this brief report is to illustrate how a small proportion of items measuring a secondary factor can lead to substantial misestimation of the a-parameters for all items. In this real dataset, when the model was specified as unidimensional, the a-parameters for the items tapping the secondary construct were overestimated and the a-parameters for the other items tended to be underestimated.
Christine E. DeMars

The Role of Gender in Test-Taking Motivation under Low-Stakes Conditions
http://commons.lib.jmu.edu/gradpsych/49
Tue, 25 Jul 2017 09:24:06 PDT
Examinee effort can impact the validity of scores on higher education assessments. Many studies of examinee effort have briefly noted gender differences, but gender differences in test-taking effort have not been a primary focus of research. This review of the literature brings together gender-related findings regarding three measures of examinee motivation: attendance at the assigned testing session, time spent on each test item, and self-reported effort. Evidence from the literature is summarized, with some new results presented. Generally, female examinees exert more effort, with differences mostly at very low levels of effort—the levels at which effort is most likely to impact test scores. Examinee effort is positively correlated with conscientiousness and agreeableness, and negatively correlated with work-avoidance. The gender differences in these constructs may account for some of the gender differences in test-taking effort. Limitations and implications for higher education assessment practice are discussed.
Christine E. DeMars et al.

A Tutorial on Interpreting Bifactor Model Scores
http://commons.lib.jmu.edu/gradpsych/48
Tue, 25 Jul 2017 09:24:03 PDT
This tutorial addresses possible sources of confusion in interpreting trait scores from the bifactor model. The bifactor model may be used when subscores are desired, either for formative feedback on an achievement test or for theoretically different constructs on a psychological test. The bifactor model is often chosen because it requires fewer computational resources than other models for subscores. The bifactor model yields a score on the general or primary trait measured by the test overall, as well as specific or secondary traits measured by the subscales. Interpreting the general trait score is straightforward, but the specific traits must be interpreted as residuals relative to the general trait. Trait scores on the specific factors are contrasted with trait scores on a simple-structure model with correlated factors, using example data from one TIMSS test booklet and a civic responsibility measure. The correlated factors model was used for contrast because its scores correspond to a more intuitive interpretation of subscores, and thus it helps to illustrate how the bifactor scores should NOT be interpreted. Estimation details are covered in an appendix.
Christine E. DeMars

An Investigation of Sample Size Splitting on ATFIND and DIMTEST
http://commons.lib.jmu.edu/gradpsych/47
Tue, 25 Jul 2017 09:23:59 PDT
Modeling multidimensional test data with a unidimensional model can result in serious statistical errors, such as bias in item parameter estimates. Many methods exist for assessing the dimensionality of a test; the current study focused on DIMTEST. Using simulated data, the effects of splitting the sample between the ATFIND procedure, which empirically derives a subtest composed of items that potentially measure a second dimension, and DIMTEST, which assesses whether this subtest represents a second dimension, were investigated. Conditions explored included the proportion of the sample used for ATFIND, sample size, test length, interability correlations, test structure, and the distribution of item difficulties. Overall, it appears that DIMTEST has Type I error rates near the nominal rate and good power in detecting multidimensionality, although Type I error inflation is observed for larger sample sizes. Results suggest that a 50/50 split maximizes power and keeps the Type I error rate below the nominal level unless the test is short and the sample is large; a 75/25 split controls Type I error better for short tests and large samples.
Alan Socha et al.

A Note on Specifying the Guessing Parameter in ATFIND and DIMTEST
http://commons.lib.jmu.edu/gradpsych/46
Tue, 25 Jul 2017 09:23:55 PDT
The software program DIMTEST can be used to assess the unidimensionality of item scores. The software allows the user to specify a guessing parameter. Using simulated data, the effects of guessing parameter specification were investigated for the ATFIND procedure, which empirically derives the Assessment Subtest (AT; that is, a subtest composed of items that potentially measure a second dimension), and for DIMTEST, which assesses whether that AT represents a second dimension. Results suggest that specifying higher guessing parameters in ATFIND and DIMTEST results in higher Type I error rates.
Christine E. DeMars et al.

A Comparison of Limited-Information and Full-Information Methods in Mplus for Estimating Item Response Theory Parameters for Nonnormal Populations
http://commons.lib.jmu.edu/gradpsych/45
Tue, 25 Jul 2017 09:23:51 PDT
In structural equation modeling software, either limited-information (bivariate proportions) or full-information item parameter estimation routines could be used for the 2-parameter item response theory (IRT) model. Limited-information methods assume the continuous variable underlying an item response is normally distributed. For skewed and platykurtic latent variable distributions, 3 methods were compared in Mplus: limited information, full information integrating over a normal distribution, and full information integrating over the known underlying distribution. Interfactor correlation estimates were similar for all 3 estimation methods. For the platykurtic distribution, estimation method made little difference for the item parameter estimates. When the latent variable was negatively skewed, for the most discriminating easy or difficult items, limited-information estimates of both parameters were considerably biased. Full-information estimates obtained by marginalizing over a normal distribution were somewhat biased. Full-information estimates obtained by integrating over the true latent distribution were essentially unbiased. For the a parameters, standard errors were larger for the limited-information estimates when the bias was positive but smaller when the bias was negative. For the d parameters, standard errors were larger for the limited-information estimates of the easiest, most discriminating items. Otherwise, they were generally similar for the limited- and full-information estimates. Sample size did not substantially impact the differences between the estimation methods; limited information did not gain an advantage for smaller samples.
Christine E. DeMars

Investigating the Impact of Compromised Anchor Items on IRT Equating Under the Nonequivalent Anchor Test Design
http://commons.lib.jmu.edu/gradpsych/44
Tue, 25 Jul 2017 09:23:48 PDT
The prevalence of high-stakes test scores as a basis for significant decisions necessitates the dissemination of accurate and fair scores. However, the magnitude of these decisions has created an environment in which examinees may be prone to resort to cheating. To reduce the risk of cheating, multiple test forms are commonly administered. When multiple forms are employed, the forms must be equated to account for potential differences in form difficulty. If cheating occurs on one of the forms, the equating procedure may produce inaccurate results. A simulation study was conducted to examine the impact of cheating on item response theory (IRT) true score equating. Recovery of equated scores and scaling constants was assessed for the Stocking–Lord IRT scaling method under various conditions. Results indicated that cheating artificially increased the equated scores of the entire examinee group that was administered the compromised form. Future research should focus on the identification and removal of compromised items.
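The scaling step underlying the equating described here can be sketched as follows. Given scaling constants A and B (the quantities a Stocking-Lord procedure estimates), new-form item parameters are placed on the base scale, and the transformation leaves response probabilities invariant; compromised anchor items distort the estimated A and B, which then shifts every equated score. The parameter values below are invented for illustration:

```python
import math

def p_3pl(theta, a, b, c):
    # 3PL item response function.
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def rescale_item(a, b, c, A, B):
    # Place new-form parameters on the base scale via theta* = A*theta + B.
    return a / A, A * b + B, c

def true_score(theta, items):
    # Test characteristic curve: expected number-correct at theta,
    # the quantity matched in IRT true-score equating.
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)
```

Because b* = A·b + B and a* = a/A exactly undo the theta transformation, p_3pl is unchanged for a correctly linked item; biased A and B from compromised anchors break this invariance.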
Daniel P. Jurich et al.

Software Note: Using BILOG for Fixed-Anchor Item Calibration
http://commons.lib.jmu.edu/gradpsych/43
Tue, 25 Jul 2017 09:23:44 PDT
Using BILOG in the fixed-anchor method of parameter scaling can lead to poor results if the ability distributions for the new-form and old-form populations differ. This software note explains the options the user should choose to avoid these problems.
Christine E. DeMars et al.

Confirming Testlet Effects
http://commons.lib.jmu.edu/gradpsych/42
Tue, 25 Jul 2017 09:23:40 PDT
A testlet is a cluster of items that share a common passage, scenario, or other context. These items might measure something in common beyond the trait measured by the test as a whole; if so, the model for the item responses should allow for this testlet trait. But modeling testlet effects that are negligible makes the model unnecessarily complicated and risks capitalization on chance, increasing the error in parameter estimates. Checking each testlet to see if the items within the testlet share something beyond the primary trait could therefore be useful. This study included (a) a comparison between a model with no testlets and a model with testlet g, (b) a comparison between a model with all suspected testlets and a model with all suspected testlets except testlet g, and (c) a test of essential unidimensionality. Overall, Comparison b was most useful for detecting testlet effects. Model comparisons based on information criteria, specifically the sample-size-adjusted Bayesian Information Criterion (SSA-BIC) and the BIC, resulted in fewer false alarms than statistical significance tests. The test of essential unidimensionality had true hit rates and false alarm rates similar to the SSA-BIC when the testlet effect was zero for all testlets except the studied testlet. But the presence of additional testlet effects in the partitioning test led to higher false alarm rates for the test of essential unidimensionality.
Christine E. DeMars

An Analytic Comparison of Effect Sizes for Differential Item Functioning
http://commons.lib.jmu.edu/gradpsych/41
Tue, 25 Jul 2017 09:23:36 PDT
Three types of effect sizes for DIF are described in this exposition: the log of the odds ratio (differences in log-odds), differences in probability-correct, and proportion of variance accounted for. Using these indices involves conceptualizing the degree of DIF in different ways. This integrative review discusses how these measures are impacted in different ways by item difficulty, item discrimination, and the item lower asymptote. For example, for a fixed discrimination, the difference in probabilities decreases as the difference between the item difficulty and the mean ability increases. Under the same conditions, the log of the odds ratio remains constant if the lower asymptote is zero. A non-zero lower asymptote decreases the absolute value of the probability difference symmetrically for easy and hard items, but it decreases the absolute value of the log-odds difference much more for difficult items. Thus, one cannot set a criterion for defining a large effect size in one metric and find a corresponding criterion in another metric that is equivalent across all items or ability distributions. In choosing an effect size, these differences must be understood and considered.
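The metric dependence described above can be checked directly. A minimal sketch using a hypothetical 3PL item in which DIF is modeled as a difficulty shift between the reference and focal groups (all values illustrative):

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def prob_difference(theta, a, b_ref, b_foc, c):
    # DIF as a difference in probability-correct at a given theta.
    return p_3pl(theta, a, b_ref, c) - p_3pl(theta, a, b_foc, c)

def log_odds_difference(theta, a, b_ref, b_foc, c):
    # DIF as a difference in log-odds at a given theta.
    def logit(p):
        return math.log(p / (1.0 - p))
    return logit(p_3pl(theta, a, b_ref, c)) - logit(p_3pl(theta, a, b_foc, c))
```

With c = 0 the log-odds difference is constant in theta (it equals a times the difficulty shift) while the probability difference shrinks as theta moves away from the item difficulty; with c > 0 the log-odds difference is no longer constant, illustrating why a cutoff in one metric has no fixed counterpart in the other.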
Christine E. DeMars

Differential Item Functioning Detection With Latent Classes: How Accurately Can We Detect Who Is Responding Differentially?
http://commons.lib.jmu.edu/gradpsych/40
Tue, 25 Jul 2017 09:23:33 PDT
There is a long history of differential item functioning (DIF) detection methods for known, manifest grouping variables, such as sex or ethnicity. But if the experiences or cognitive processes leading to DIF are not perfectly correlated with the manifest groups, it would be more informative to uncover the latent groups underlying DIF. The use of item response theory (IRT) mixture models to detect latent groups and estimate the DIF caused by these latent groups has been explored and interpreted with real data sets, but the accuracy of model estimation has not been thoroughly examined. The purpose of this simulation research was to assess the accuracy of the recovery of classes, item parameters, and DIF effects in contexts where relatively small clusters of items showed DIF. Overall, the results from the study reveal that the use of IRT mixture models for latent DIF detection may be problematic. Class membership recovery was poor in all conditions tested. Discrimination parameters were estimated well for the invariant items, as well as for the DIF items when there was no group impact. But when there was group impact, discriminations for the DIF items were positively biased. When there was no group impact, DIF effect estimates tended to be positively biased. In general, having fewer items was associated with more biased estimates and larger standard errors.
Christine E. DeMars et al.

Type I Error Inflation for Detecting DIF in the Presence of Impact
http://commons.lib.jmu.edu/gradpsych/39
Tue, 25 Jul 2017 09:23:30 PDT
In this brief explication, two challenges for using differential item functioning (DIF) measures when there are large group differences in true proficiency are illustrated. Each of these difficulties may lead to inflated Type I error rates, for very different reasons. One problem is that groups matched on observed score are not necessarily well matched on true proficiency, which may result in the false detection of DIF due to inaccurate matching. The other problem is that a model that does not allow for a nonzero asymptote can produce what seems to be DIF. These issues have previously been discussed separately in the literature; this article brings them together in a nontechnical form.
Christine E. DeMars

Examinee Noneffort and the Validity of Program Assessment Results
http://commons.lib.jmu.edu/gradpsych/38
Tue, 25 Jul 2017 09:23:26 PDT
Educational program assessment studies often use data from low-stakes tests to provide evidence of program quality. The validity of scores from such tests, however, is potentially threatened by examinee noneffort. This study investigated the extent to which one type of noneffort—rapid-guessing behavior—distorted the results from three types of commonly used program assessment designs. It was found that, for each design, a modest amount of rapid guessing had a pronounced effect on the results. In addition, motivation filtering was found to be successful in mitigating the effects caused by rapid guessing. It is suggested that measurement practitioners routinely apply motivation filtering whenever the data from low-stakes tests are used to support program decisions.
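Motivation filtering as described here can be sketched in a few lines: flag rapid guesses by response time, compute each examinee's proportion of solution behavior (response time effort, RTE), and drop low-RTE examinees before aggregating scores. The thresholds, field names, and RTE cutoff below are illustrative assumptions, not the study's values:

```python
def flag_rapid_guesses(response_times, thresholds):
    # A response faster than its item's threshold (e.g., a few seconds)
    # is flagged as a rapid guess rather than solution behavior.
    return [t < thr for t, thr in zip(response_times, thresholds)]

def response_time_effort(flags):
    # RTE: proportion of items answered with solution behavior.
    return 1.0 - sum(flags) / len(flags)

def motivation_filter(examinees, min_rte=0.9):
    # Drop examinees whose RTE falls below the cutoff before computing
    # program-level score aggregates.
    return [e for e in examinees if e["rte"] >= min_rte]
```

The design choice is deliberate: filtering operates on examinees, not items, so item statistics are recomputed only on records that plausibly reflect proficiency.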
Steven L. Wise et al.

Can Differential Rapid-Guessing Behavior Lead to Differential Item Functioning?
http://commons.lib.jmu.edu/gradpsych/37
Tue, 25 Jul 2017 09:23:22 PDT
This investigation examined whether different rates of rapid guessing between groups could lead to detectable levels of differential item functioning (DIF) in situations where the item parameters were the same for both groups. Two simulation studies were designed to explore this possibility. The groups in Study 1 were simulated to reflect differences between high-stakes and low-stakes conditions, with no rapid guessing in the high-stakes condition. Easy, discriminating items with high rates of rapid guessing by the low-stakes group were detected as showing DIF favoring the high-stakes group when using the Mantel-Haenszel index. The groups in Study 2 were simulated to reflect gender differences in rapid guessing on a low-stakes test. Both groups had some rapid guessing, but the focal group guessed more. Easy items with greater differences in rapid guessing were more likely to be detected as showing DIF. When the group with more rapid guessing had lower mean proficiency, the overall proportion of flagged items was lower but the effect of difference in rapid guessing remained. Our results suggest that there likely are instances in which statistically identified DIF is observed due to the behavioral characteristics of the studied subgroups rather than the content of the items.
Christine E. DeMars et al.

Multilevel IRT: When is local independence violated?
http://commons.lib.jmu.edu/gradpsych/36
Tue, 25 Jul 2017 09:23:18 PDT
Calibration data are often collected within schools. This illustration shows that random school effects for ability do not bias IRT parameter estimates or their standard errors. However, random school effects for item difficulty lead to bias in item discrimination estimates and inflated standard errors for difficulty and ability.
Christine E. DeMars et al.

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing
http://commons.lib.jmu.edu/gradpsych/35
Tue, 25 Jul 2017 09:23:15 PDT
Concerns with using the Rasch model to estimate DIF when there are large group differences in ability (impact) and the data follow a 3PL model are discussed. This demonstration showed that, with large group ability differences, difficult non-DIF items appeared to favor the focal group and, to a smaller degree, easy non-DIF items appeared to favor the reference group. Correspondingly, the effect sizes for DIF items were biased. With equal ability distributions for the reference and focal groups, DIF effect sizes were unbiased for non-DIF items; effect sizes were somewhat overestimated in absolute values for difficult items and somewhat underestimated for easy items. These effects were explained by showing how the item response function was distorted differentially depending on the ability distribution. The practical implication is that measurement practitioners should not trust the DIF estimates from the Rasch model when there is large impact and examinees are potentially able to answer items correctly by guessing.
Christine E. DeMars et al.

A comparison of limited-information and full-information methods in Mplus for estimating IRT parameters for non-normal populations
http://commons.lib.jmu.edu/gradpsych/34
Tue, 25 Jul 2017 09:23:11 PDT
In structural equation modeling software, either limited-information (bivariate proportions) or full-information item parameter estimation routines could be used for the 2PL IRT model. Limited-information methods assume the continuous variable underlying an item response is normally distributed. For skewed and platykurtic latent variable distributions, three methods were compared in Mplus: limited-information, full-information integrating over a normal distribution, and full-information integrating over the known underlying distribution. For the most discriminating easy or difficult items, limited-information estimates of both parameters were considerably biased. Full-information estimates obtained by integrating over a normal distribution were somewhat biased. Full-information estimates obtained by integrating over the true latent distribution were essentially unbiased. For the a-parameters, standard errors were larger for the limited-information estimates when the bias was positive but smaller when the bias was negative. For the b-parameters, standard errors were generally similar for the limited- and full-information estimates. Sample size did not substantially impact the differences between the estimation methods; limited-information did not gain an advantage for smaller samples.
Christine E. DeMars

Individual score validity and student effort in higher education assessment
http://commons.lib.jmu.edu/gradpsych/33
Tue, 25 Jul 2017 09:23:07 PDT
This study explored the use of five invalidity flags plus a new sixth flag based on self-reported effort. Participants were 155 entering first-year university students who were measured during an orientation week and again 18 months later. The instruments were a faculty-developed test of oral communication skills with 40 four-option multiple-choice items and a self-report measure of test-taking motivation (the Student Opinion Scale; Sundre, 1999, adapted from Wolf & Smith, 1995). Results indicated that the flags explored in this study generalized well to university students. There was a moderate correlation between Response Time Effort and effort as measured by the Student Opinion Scale, suggesting a relationship not captured by the dichotomized flags.
Christine E. DeMars et al.

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods
http://commons.lib.jmu.edu/gradpsych/32
Tue, 25 Jul 2017 09:23:04 PDT
Four methods of scoring multiple-choice items were compared: Dichotomous classical (number-correct), polytomous classical (classical optimal scaling – COS), dichotomous IRT (3 parameter logistic – 3PL), and polytomous IRT (nominal response – NR). Data were generated to follow either a nominal response model or a non-parametric model, based on empirical data. The polytomous models, which weighted the distractors differentially, yielded small increases in reliability compared to their dichotomous counterparts. The polytomous IRT estimates were less biased than the dichotomous IRT estimates for lower scores. The classical polytomous scores were as reliable, sometimes more reliable, than the IRT polytomous scores. This was encouraging because the classical scores are easier to calculate and explain to users.
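The contrast between the dichotomous and polytomous classical scores can be sketched directly. The option weights below are invented for illustration; in practice classical optimal scaling would derive them from the data so that partially informative distractors earn partial credit:

```python
def number_correct(responses, keys):
    # Dichotomous classical score: one point per keyed response.
    return sum(r == k for r, k in zip(responses, keys))

def option_weighted_score(responses, weights):
    # Polytomous classical score: each option carries its own weight
    # (one dict of option -> weight per item), so distractors chosen by
    # more able examinees contribute more than implausible ones.
    return sum(w[r] for r, w in zip(responses, weights))
```

This is the sense in which the classical polytomous scores are "easier to calculate and explain": scoring is a table lookup and a sum, with no latent-trait estimation.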
Christine E. DeMars

Scoring Subscales using Multidimensional Item Response Theory Model
http://commons.lib.jmu.edu/gradpsych/31
Tue, 25 Jul 2017 09:23:00 PDT
Several methods for estimating item response theory scores for multiple subtests were compared. These methods included two multidimensional item response theory models: a bifactor model, in which each subtest score was a composite based on the primary trait measured by the set of tests and a secondary trait measured by the individual subtest, and a model in which the traits measured by the subtests were separate but correlated. Composite scores based on unidimensional item response theory, with each subtest borrowing information from the other subtests, as well as independent unidimensional scores for each subtest, were also considered. Correlations among scores from all methods were high, though somewhat lower for the independent unidimensional scores. Correlations between course grades and test scores, a measure of validity, were similar for all methods, though again slightly lower for the unidimensional scores. To assess bias and RMSE, data were simulated using the parameters estimated for the correlated factors model. The independent unidimensional scores showed the greatest bias and RMSE; the relative performance of the other three methods varied with the subscale.
Christine E. DeMars et al.

Neutral or unsure: Is there a difference?
http://commons.lib.jmu.edu/gradpsych/30
Tue, 25 Jul 2017 09:22:56 PDT
University students responded to a survey measuring identity development using a 4-point Likert-type scale with two additional options: neutral and unsure. The level of identity development of students who chose neutral was compared to that of students who chose unsure on the same item. On average, these two groups of students had similar scores. Neutral and unsure did not seem to be used to indicate different levels of the construct of interest. Often these two categories were used as a middle response, but on one scale they were used as a moderately high response.
Christine E. DeMars

Item Parameter Drift: The Impact of the Curricular Area
http://commons.lib.jmu.edu/gradpsych/29
Tue, 25 Jul 2017 09:22:53 PDT
The items from tests from two content areas, information literacy and global issues, were examined for item parameter drift across four years. The items on the information literacy test were expected to show more drift because the content of this field is changing more rapidly and because the test changed from low to high stakes for students while the other test remained low stakes. More items did show drift on the information literacy test, but the drift was not always readily explained. Further, some items did not fit the drift model available in BILOG-MG, either because the drift was a one-time shift rather than a gradual change or because both the discrimination and difficulty changed over time.
]]>
Christine E. DeMarsA Comparison of the Recovery of Parameters Using the Nominal Response and Generalized Partial Credit Models
http://commons.lib.jmu.edu/gradpsych/28
http://commons.lib.jmu.edu/gradpsych/28Tue, 25 Jul 2017 09:22:49 PDT
In this simulation study, data were generated such that some items fit the generalized partial credit model (GPCM) while other items fit the nominal response model (NRM) but not the constraints of the GPCM. The purpose was to explore (a) how the errors in parameter estimation were affected by using the GPCM when the constraints of the GPCM were inappropriate, and (b) how the errors were affected by using the less-constrained NRM when the constraints of the GPCM were appropriate. With large sample sizes, there were considerable gains in precision from using the NRM when the GPCM was inappropriate, and only small losses in precision from using the NRM when the GPCM would have been appropriate. With small samples, there were greater benefits due to applying the constraints of the GPCM when appropriate, and smaller benefits due to using the NRM when the GPCM was inappropriate.
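The constraint at issue can be made concrete. One standard way of writing the two models (notation illustrative): in the NRM, each response category has its own slope, while the GPCM is the special case in which those slopes are evenly spaced multiples of a single item slope.

```latex
% Nominal response model: category k has its own slope a_k and intercept c_k
P(X = k \mid \theta) = \frac{\exp(a_k \theta + c_k)}{\sum_{m=0}^{K-1} \exp(a_m \theta + c_m)}

% GPCM as a constrained NRM: category slopes are forced to be evenly
% spaced, a_k = k a, leaving a single slope parameter per item
a_k = k\,a
```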
]]>
Christine E. DeMarsRecovery of Graded Response and Partial Credit Parameters in MULTILOG and PARSCALE
http://commons.lib.jmu.edu/gradpsych/27
http://commons.lib.jmu.edu/gradpsych/27Tue, 25 Jul 2017 09:22:43 PDT
Using simulated data, MULTILOG and PARSCALE were compared on their recovery of item and trait parameters under the graded response and generalized partial credit item response theory models. The shape of the latent population distribution (normal, skewed, or uniform) and the sample size (250 or 500) were varied. Parameter estimates were essentially unbiased under all conditions, and the root mean square error was similar for both software packages. The choice between these packages can therefore be based on considerations other than the accuracy of parameter estimation.
]]>
Christine E. DeMarsMissing Data and IRT Item Parameter Estimation
http://commons.lib.jmu.edu/gradpsych/26
http://commons.lib.jmu.edu/gradpsych/26Tue, 25 Jul 2017 09:22:39 PDT
Non-randomly missing data has theoretically different implications for item parameter estimation depending on whether joint maximum likelihood or marginal maximum likelihood methods are used in the estimation. The objective of this paper is to illustrate what potentially can happen, under these estimation procedures, when there is an association between ability and the absence of response. In this example, data is missing because some students, particularly low-ability students, did not complete the test.
]]>
Christine E. DeMarsEquating Multiple Forms of a Competency Test: An Item Response Theory Approach
http://commons.lib.jmu.edu/gradpsych/25
http://commons.lib.jmu.edu/gradpsych/25Tue, 25 Jul 2017 09:22:35 PDT
A competency test was developed to assess students' skills in using electronic library resources. Because all students were required to pass the test, and had multiple opportunities to do so, multiple test forms were desired. Standards had been set on the original form, and minor differences in form difficulty needed to be taken into account. Students were randomly administered one of six new test forms; each form contained the original items and 12 pilot items which were different on each form. The pilot items were then calibrated to the metric of the original items and incorporated in two additional operational forms.
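The abstract does not specify the linking method used, so as an illustration only: a common approach for placing new item difficulties on a base metric is mean/sigma linking on the parameters of the common items (here, the original items that appear on every form). A minimal sketch:

```python
import statistics


def mean_sigma_link(b_common_new, b_common_base):
    """Mean/sigma linking: find A, B so that A*b + B maps the new
    calibration's metric onto the base metric, using the difficulty
    estimates of items common to both calibrations."""
    A = statistics.stdev(b_common_base) / statistics.stdev(b_common_new)
    B = statistics.mean(b_common_base) - A * statistics.mean(b_common_new)
    return A, B


# Hypothetical values: the same anchor items, calibrated on two runs
A, B = mean_sigma_link([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
# Pilot items estimated in the new run would then be transformed:
pilot_b_linked = [A * b + B for b in [-0.5, 0.8]]
```

All numbers above are hypothetical; the point is only the shape of the transformation, not the study's actual values.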
]]>
Christine E. DeMarsModeling Student Outcomes in a General Education Course with Hierarchical Linear Models
http://commons.lib.jmu.edu/gradpsych/24
http://commons.lib.jmu.edu/gradpsych/24Tue, 25 Jul 2017 09:22:32 PDT
When students are nested within course sections, the assumption of independence of residuals is unlikely to be met, unless the course section is explicitly included in the model. Hierarchical linear modeling (HLM) allows for modeling the course section as a random effect, leading to more accurate standard errors. In this study, students chose one of four themes for a communications course, with multiple sections and instructors within each theme. HLM was used to test for differences by theme in scores on a final exam; the differences were not significant when SAT scores were controlled.
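A generic two-level specification of the kind described might look like the following (variable names are illustrative, not the study's exact model):

```latex
% Level 1 (student i in section j): exam score as a function of SAT
Y_{ij} = \beta_{0j} + \beta_{1}\,\mathrm{SAT}_{ij} + r_{ij},
  \qquad r_{ij} \sim N(0, \sigma^2)

% Level 2 (section j): random intercept, with theme as a predictor
\beta_{0j} = \gamma_{00} + \gamma_{01}\,\mathrm{Theme}_{j} + u_{0j},
  \qquad u_{0j} \sim N(0, \tau^2)
```

The random intercept $u_{0j}$ is what absorbs the section-level dependence that an ordinary regression would ignore.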
]]>
Christine E. DeMarsDoes the Relationship between Motivation and Performance Differ with Ability?
http://commons.lib.jmu.edu/gradpsych/23
http://commons.lib.jmu.edu/gradpsych/23Tue, 25 Jul 2017 09:22:28 PDT
In this study of college students taking a science test or a social science test under non-consequential conditions, performance was positively correlated with self-reported motivation. The association, though, was smaller for students of lower ability (as measured by the SAT).
]]>
Christine E. DeMarsItem Estimates under Low-Stakes Conditions: How Should Omits Be Treated?
http://commons.lib.jmu.edu/gradpsych/22
http://commons.lib.jmu.edu/gradpsych/22Tue, 25 Jul 2017 09:22:25 PDT
Using data from a pilot test of science and math, item difficulties were estimated with a one-parameter model (a partial-credit model for the multi-point items). Some items were multiple-choice and others were constructed-response (open-ended). Four sets of estimates were obtained, crossing gender (males, females) with the treatment of omitted items (scored as incorrect, or treated as not-presented/not-reached). Then, using data from an operational test (high-stakes, for diploma endorsement), the fit of these item estimates was assessed. In science, the fit was quite good under all conditions. In math, the fit was better for girls than for boys, the fit was better when omitted items were treated as not-presented, and the gender difference in fit was smaller when the omitted items were treated as not-presented.
]]>
Christine E. DeMarsA clarification of the effects of rapid guessing on coefficient alpha: A note on Attali’s “Reliability of Speeded Number-Right Multiple-Choice Tests”
http://commons.lib.jmu.edu/gradpsych/21
http://commons.lib.jmu.edu/gradpsych/21Tue, 25 Jul 2017 09:22:21 PDT
Attali (2005) recently demonstrated that Cronbach's coefficient α estimate of reliability for number-right multiple-choice tests will tend to be deflated by speededness, rather than inflated as is commonly believed and taught. However, random responses on low-stakes tests may be due to lack of effort rather than speededness. In real data, we found that random responses tended to inflate more pairs of item covariances than they deflated, inflating estimates of reliability.
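The mechanism turns on item covariances: coefficient α rises when the inter-item covariances grow relative to the item variances. A minimal computation of α from a persons × items score matrix (an illustrative sketch, not the authors' code):

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-person item-score lists
    (persons x items): k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(scores[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([person[i] for person in scores]) for i in range(k)]
    total_var = var([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Appending random (uncorrelated) responses to such a matrix changes both the item variances and the covariance structure, which is why the direction of the effect on α is an empirical question.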
]]>
Steven L. Wise et al.Modification of the Mantel-Haenszel and Logistic Regression DIF Procedures to Incorporate the SIBTEST Regression Correction
http://commons.lib.jmu.edu/gradpsych/20
http://commons.lib.jmu.edu/gradpsych/20Tue, 25 Jul 2017 09:22:17 PDT
The Mantel-Haenszel (MH) and logistic regression (LR) differential item functioning (DIF) procedures have inflated Type I error rates when there are large mean group differences, short tests, and large sample sizes. When there are large group differences in mean score, groups matched on the observed number-correct score differ on true score, contributing to inflated Type I error rates. The simultaneous item bias test procedure has incorporated an adjustment for this difference, originally using a linear regression correction and later using a nonlinear correction. In this study, these adjustments are applied to the MH and LR procedures. They effectively reduce the Type I error inflation for the MH and the LR test of uniform DIF, but not the LR test of nonuniform DIF. For large samples and large group mean differences, the Δ effect size is estimated with greater accuracy using these adjustments.
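For reference, the Δ effect size mentioned above is the ETS delta transformation of the MH common odds ratio, computed over strata of examinees matched on the conditioning score. The correction studied in the paper changes the matching variable, not this computation. A minimal sketch:

```python
import math


def mh_delta(strata):
    """MH common odds ratio and ETS delta effect size.
    strata: list of (ref_correct, ref_wrong, focal_correct, focal_wrong),
    one tuple per matched score group."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha_mh = num / den               # common odds ratio
    return -2.35 * math.log(alpha_mh)  # ETS delta scale
```

A value of 0 indicates no DIF; with inflated Type I error rates, nonzero values appear too often for items that are actually DIF-free.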
]]>
Christine E. DeMarsPolytomous Differential Item Functioning and Violations of Ordering of the Expected Latent Trait by the Raw Score
http://commons.lib.jmu.edu/gradpsych/19
http://commons.lib.jmu.edu/gradpsych/19Tue, 25 Jul 2017 09:22:13 PDT
The graded response (GR) and generalized partial credit (GPC) models do not imply that examinees ordered by raw observed score will necessarily be ordered on the expected value of the latent trait (OEL). Factors were manipulated to assess whether increased violations of OEL also produced increased Type I error rates in differential item functioning (DIF) procedures conditioned on the raw score. Shorter tests and greater variance in item slope parameters increased OEL violations for the GR data but not for the GPC data. These same factors, combined with group mean differences between the reference and focal groups, increased the Type I error rate for the observed raw score DIF methods for both the GR and GPC data. A procedure conditioned on the classical test theory latent score estimate instead of the observed score helped reduce the Type I error in some of the conditions but not for the shortest tests.
]]>
Christine E. DeMars“Guessing" parameter estimates for multidimensional IRT models
http://commons.lib.jmu.edu/gradpsych/18
http://commons.lib.jmu.edu/gradpsych/18Tue, 25 Jul 2017 09:22:09 PDT
Two software packages commonly used for multidimensional item response theory (IRT) models require the user to input values for the lower asymptotes of the item response functions. One way of selecting these values is to estimate lower asymptotes with a one-dimensional IRT model and use those estimates as fixed values in the multidimensional model. This procedure was compared to simply setting the asymptotes to a reasonable value. For two-factor tests, the use of unidimensional asymptotes worked well, yielding results nearly comparable to setting the lower asymptotes to the true values. With four-factor tests, in contrast, the item parameter and item response surface estimates were less accurate when the lower asymptotes were estimated through a unidimensional model. The estimates of the lower asymptotes from the unidimensional model tended to be too high for the four-factor tests, which likely caused the decreased accuracy of this procedure.
]]>
Christine E. DeMarsChanges in rapid-guessing behavior over a series of assessments
http://commons.lib.jmu.edu/gradpsych/17
http://commons.lib.jmu.edu/gradpsych/17Tue, 25 Jul 2017 09:22:06 PDT
A series of 8 tests was administered to university students over 4 weeks for program assessment purposes. The stakes of these tests were low for students; they received course points based on test completion, not test performance. Tests were administered in a counterbalanced order across 2 administrations. Response time effort, a measure of the proportion of items on which solution behavior rather than rapid-guessing behavior was used, was higher when a test was administered in the 1st week. Test scores were also higher. Differences between Week 1 and Week 4 test scores decreased when the test was scored with an effort-moderated model that took into account whether the student used solution or rapid-guessing behavior. Differences further decreased when students who used rapid-guessing on 5 or more of the 30 items were filtered from the data set.
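Response time effort can be sketched as the share of items whose response time meets or exceeds an item-specific rapid-guess threshold. The thresholds below are hypothetical inputs; in practice they are derived from each item's response-time distribution.

```python
def response_time_effort(resp_times, thresholds):
    """Proportion of items answered with solution behavior, i.e. response
    time at or above the item's rapid-guessing threshold (in seconds)."""
    solution = sum(rt >= th for rt, th in zip(resp_times, thresholds))
    return solution / len(resp_times)


# Hypothetical examinee: one rapid guess out of three items
rte = response_time_effort([1.2, 10.0, 12.5], [5.0, 5.0, 5.0])
```

Filtering examinees whose RTE falls below a cutoff (or who rapid-guess on more than a set number of items, as in the abstract) is one way to clean low-stakes data before scoring.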
]]>
Christine E. DeMarsApplication of the bi-factor multidimensional item response theory model to testlet-based tests
http://commons.lib.jmu.edu/gradpsych/16
http://commons.lib.jmu.edu/gradpsych/16Tue, 25 Jul 2017 09:22:03 PDT
Four item response theory (IRT) models were compared using data from tests where multiple items were grouped into testlets focused on a common stimulus. In the bi-factor model each item was treated as a function of a primary trait plus a nuisance trait due to the testlet; in the testlet-effects model the slopes in the direction of the testlet traits were constrained within each testlet to be proportional to the slope in the direction of the primary trait; in the polytomous model the item scores were summed into a single score for each testlet; and in the independent-items model the testlet structure was ignored. Using the simulated data, reliability was overestimated somewhat by the independent-items model when the items were not independent within testlets. Under these nonindependent conditions, the independent-items model also yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes. When the items within testlets were instead generated to be independent, the bi-factor model yielded somewhat higher RMSE in difficulty and slope. Similar differences between the models were illustrated with real data.
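The bi-factor and testlet-effects structures described above can be written as follows (notation illustrative):

```latex
% Bi-factor: item i loads on the primary trait and on its testlet trait s(i)
\mathrm{logit}\,P_i(\theta) = a_{i0}\,\theta_0 + a_{is}\,\theta_{s(i)} + d_i

% Testlet-effects constraint: within a testlet, the testlet slope is
% proportional to the primary slope, with one constant per testlet
a_{is} = \lambda_{s(i)}\,a_{i0}
```

The polytomous and independent-items models are then the two extremes: collapsing each testlet to one score versus dropping $\theta_{s(i)}$ entirely.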
]]>
Christine E. DeMarsAn Application of Item Response Time: The Effort-Moderated IRT Model
http://commons.lib.jmu.edu/gradpsych/15
http://commons.lib.jmu.edu/gradpsych/15Tue, 25 Jul 2017 09:21:59 PDT
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
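The effort-moderated model can be sketched as a mixture over the examinee's observed behavior on each item: a 3PL response function under solution behavior, and a flat chance probability under rapid guessing. In one common notation (illustrative; $SB_{ij}$ is the solution-behavior indicator derived from response time, $k_i$ the number of response options):

```latex
P(x_{ij} = 1 \mid \theta_j) =
  SB_{ij}\left[c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}\right]
  + (1 - SB_{ij})\,\frac{1}{k_i}
```

Because rapid guesses contribute no information about $\theta_j$, they are effectively removed from proficiency and item parameter estimation rather than scored as ordinary incorrect responses.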
]]>
Steven L. Wise et al.Type I Error Rates for Parscale’s Fit Index
http://commons.lib.jmu.edu/gradpsych/14
http://commons.lib.jmu.edu/gradpsych/14Tue, 25 Jul 2017 09:21:55 PDT
Type I error rates for PARSCALE’s fit statistic were examined. Data were generated to fit the partial credit or graded response model, with test lengths of 10 or 20 items. The ability distribution was simulated to be either normal or uniform. Type I error rates were inflated for the shorter test length and, for the graded-response model, also for the longer test length when the ability distribution was uniform. In conditions in which α was inflated, it was particularly high when one or more response categories were used infrequently. Overall, PARSCALE’s fit index is not recommended for short tests.
]]>
Christine E. DeMarsLow examinee effort in low-stakes assessment: Problems and potential solutions
http://commons.lib.jmu.edu/gradpsych/13
http://commons.lib.jmu.edu/gradpsych/13Tue, 25 Jul 2017 09:21:51 PDT
Student test-taking motivation in low-stakes assessment testing is examined in terms of both its relationship to test performance and the implications of low student effort for test validity. A theoretical model of test-taking motivation is presented, with a synthesis of previous research indicating that low student motivation is associated with a substantial decrease in test performance. A number of assessment practices and data analytic procedures for managing the problems posed by low student motivation are discussed.
]]>
Steven L. Wise et al.Type I Error Rates for Generalized Graded Unfolding Model Fit Indices
http://commons.lib.jmu.edu/gradpsych/12
http://commons.lib.jmu.edu/gradpsych/12Tue, 25 Jul 2017 09:21:47 PDT
Type I error rates were examined for several fit indices available in GGUM2000: extensions of Infit, Outfit, Andrich's χ2, and the log-likelihood ratio χ2. Infit and Outfit had Type I error rates much lower than nominal α. Andrich's χ2 had Type I error rates much higher than nominal α, particularly for shorter tests or larger sample sizes. The log-likelihood χ2 had Type I error rates near or below nominal α for small samples or longer tests but had inflated error rates with large samples and shorter tests. For conditions in which the log-likelihood ratio χ2 did not perform well, alternative fit indices or modifications to these procedures should be considered in future studies.
]]>
Christine E. DeMarsSample Size and the Recovery of Nominal Response Model Item Parameters
http://commons.lib.jmu.edu/gradpsych/11
http://commons.lib.jmu.edu/gradpsych/11Tue, 25 Jul 2017 09:21:43 PDT
In this study of polytomous items, the number of items and categories per item were varied to explore the effects on estimation of item parameters in the nominal response model. De Ayala and Sava-Bolesta's (1999) work suggested that the ratio of the sample size to the total number of item parameters was a key factor. They varied the total number of item parameters by increasing the sample size or changing the number of categories per item while leaving the number of items constant. In this study, the total number of item parameters, the sample size, and the number of categories were manipulated as separate factors. Increasing the number of items had little effect on item parameter recovery, but increasing the number of categories increased the error variance of the parameter estimates. Error variance was also greater for more highly discriminating items and for skewed distributions of ability.
]]>
Christine E. DeMarsDetecting Multidimensionality Due to Curricular Differences
http://commons.lib.jmu.edu/gradpsych/10
http://commons.lib.jmu.edu/gradpsych/10Tue, 25 Jul 2017 09:21:39 PDT
Data were generated to simulate multidimensionality resulting from including two or four subtopics on a test. Each item was dependent on an ability trait due to instruction and learning, which was the same across all items, as well as an ability trait unique to the subtopic of the test (such as biology on a general science test). The eigenvalues of the item correlation matrix and Yen's Q3 were not greatly influenced by multidimensionality under conditions where the responses of a large proportion of students shared the influence of common instruction across subtopics. In contrast, Stout's T procedure was effective at detecting this type of multidimensionality, unless the subtopic abilities were correlated.
]]>
Christine E. DeMarsDetection of item parameter drift over multiple test administrations
http://commons.lib.jmu.edu/gradpsych/9
http://commons.lib.jmu.edu/gradpsych/9Tue, 25 Jul 2017 09:21:35 PDT
Three methods of detecting item drift were compared: the procedure in BILOG-MG for estimating linear trends in item difficulty, the CUSUM procedure that Veerkamp and Glas (2000) used to detect trends in difficulty or discrimination, and a modification of Kim, Cohen, and Park's (1995) χ2 test for multiple-group differential item functioning (DIF), using linear contrasts on the discrimination and difficulty parameters. Data were simulated as if collected over 3, 4, or 5 time points, with parameter drift in either a gradual, linear pattern, a less linear but still monotonic pattern, or as a sudden shift at the third time point. The BILOG-MG procedure and the modification of the Kim et al. procedure were more powerful than the CUSUM procedure, nearly always detecting drift. All three procedures had false alarm rates for nondrift items near the nominal alpha. The procedures were also illustrated on a real data set.
]]>
Christine E. DeMarsIncomplete data and item parameter estimates under JMLE and MML
http://commons.lib.jmu.edu/gradpsych/8
http://commons.lib.jmu.edu/gradpsych/8Tue, 25 Jul 2017 09:21:31 PDT
Although nonrandomly missing data is readily accommodated by joint maximum likelihood estimation (JMLE), it can theoretically be problematic for marginal maximum likelihood (MML) estimation. One situation of nonrandomly missing data, vertical equating using an anchor test, was simulated for this study under several conditions. The items from two test forms were calibrated simultaneously using JMLE and MML methods. Under MML, when the different ability distributions of the students taking the forms were not taken into account, the item difficulty parameters were overestimated for the items on the less difficult form and underestimated for the items on the more difficult form.
]]>
Christine E. DeMarsTest stakes and item format interactions
http://commons.lib.jmu.edu/gradpsych/7
http://commons.lib.jmu.edu/gradpsych/7Tue, 25 Jul 2017 09:21:27 PDT
The effects of test consequences, response formats (multiple choice or constructed response), gender, and ethnicity were studied for the math and science sections of a high school diploma endorsement test. There was an interaction between response format and test consequences: Under both response formats, students performed better under high stakes (diploma endorsement) than under low stakes (pilot test), but the difference was larger for the constructed-response items. Gender and ethnicity did not interact with test stakes; the means of all groups increased when the test had high stakes. Gender interacted with format: boys scored higher than girls on multiple-choice items, whereas girls scored higher than boys on constructed-response items.
]]>
Christine E. DeMarsGender differences in mathematics and science on a high school proficiency exam
http://commons.lib.jmu.edu/gradpsych/6
http://commons.lib.jmu.edu/gradpsych/6Tue, 25 Jul 2017 09:21:23 PDT
Scores from mathematics and science sections of pilot forms of the Michigan High School Proficiency Test (HSPT) were examined for evidence of an interaction between gender and response format (multiple choice or constructed response). When students of all ability levels were considered, the interaction was small in science and nonexistent in mathematics. When only the highest ability students were considered, male students scored higher on the multiple-choice section, whereas female students either scored higher on the constructed-response section or the degree to which the male students scored higher was less on the constructed-response section. Correlations between the formats were high and did not vary by gender. Standard errors of measurement were similar across gender.
]]>
Christine E. DeMarsInformation literacy as foundational: Determining competence
http://commons.lib.jmu.edu/gradpsych/5
http://commons.lib.jmu.edu/gradpsych/5Tue, 25 Jul 2017 09:21:19 PDT
This paper describes the development of an instrument to measure information literacy for university students. The assessment development process is detailed, followed by ways of presenting results and some of the instructional changes in response to the results.
]]>
Christine E. DeMars et al.Standard setting: A systematic approach to interpreting student learning
http://commons.lib.jmu.edu/gradpsych/4
http://commons.lib.jmu.edu/gradpsych/4Tue, 25 Jul 2017 09:21:15 PDT
This paper explains the need for standards in general education, describes the Bookmark procedure (a standard setting procedure), shows how the procedure was applied in one program, and discusses how standard setting could be used in other contexts.
]]>
Christine E. DeMars et al.Group differences based on IRT scores: Does the model matter?
http://commons.lib.jmu.edu/gradpsych/3
http://commons.lib.jmu.edu/gradpsych/3Tue, 25 Jul 2017 09:21:11 PDT
In this study, effect sizes based on simulated groups were compared for the one-parameter and three-parameter logistic IRT models. Data were generated based on a three-parameter model, and item estimates were obtained from the simulated data based on both the one-parameter and three-parameter models. Abilities were estimated using both maximum likelihood and expected a posteriori methods. The data fit the three-parameter model much better, but there were only minimal differences between the effect sizes based on different models.
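The two models compared differ only in their constraints. In standard form, for reference:

```latex
% Three-parameter logistic: slope a_i, difficulty b_i, lower asymptote c_i
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}

% One-parameter logistic: common slope, no lower asymptote (a_i = a, c_i = 0)
P_i(\theta) = \frac{1}{1 + e^{-a(\theta - b_i)}}
```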
]]>
Christine E. DeMarsRevising the scale of intellectual development: Application of an unfolding model
http://commons.lib.jmu.edu/gradpsych/2
http://commons.lib.jmu.edu/gradpsych/2Tue, 25 Jul 2017 09:21:08 PDT
An unfolding model was selected for the scores on the Scale of Intellectual Development to take into account that, for stage-based instruments, agreement with a statement first increases as the student approaches the stage represented by the statement, then decreases as the student progresses beyond that stage.
]]>
Christine E. DeMars et al.Scoring Neutral or Unsure on an identity development instrument for higher education
http://commons.lib.jmu.edu/gradpsych/1
http://commons.lib.jmu.edu/gradpsych/1Tue, 25 Jul 2017 09:21:04 PDT
The use of neutral or unsure on an instrument designed to measure identity development in college students was explored. The nominal response model from item response theory was used to evaluate whether neutral or unsure was used more frequently by those at low or middle levels of development; the results depended on the subscale and sometimes on the item within the subscale. Scoring based on the nominal response model allows for this category to be treated differently for different items.
]]>
Christine E. DeMars et al.