Differential item functioning
Differential item functioning (DIF) is a statistical property of a test item indicating that individuals from different groups who have the same underlying ability do not have the same probability of giving a particular response, such as answering the item correctly. There are two primary types of DIF: uniform DIF, in which one group has a consistent advantage over the other at every ability level, and nonuniform DIF, in which the size or direction of the advantage varies with ability level.[1]

The presence of DIF requires review and judgment, but it does not always signify bias; DIF analysis provides an indication of unexpected item behavior on a test. DIF is not established merely because groups differ in their overall probability of selecting a specific response. Rather, DIF is present when individuals from different groups who possess the same underlying true ability exhibit different probabilities of giving a certain response. Even when uniform bias is found, test developers sometimes assume that DIF effects across items will offset one another rather than undertake the extensive work required to address them, a practice that compromises test ethics and perpetuates systemic biases.[2] Common procedures for assessing DIF are the Mantel-Haenszel procedure, logistic regression, item response theory (IRT) based methods, and confirmatory factor analysis (CFA) based methods.[3]
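The distinction between uniform and nonuniform DIF can be illustrated with item characteristic curves from a two-parameter logistic (2PL) IRT model. The following sketch uses hypothetical, illustrative item parameters (a for discrimination, b for difficulty): when only b differs between groups the curves are parallel and one group is favored at every ability level (uniform DIF), whereas when a differs the curves cross and the advantage depends on ability (nonuniform DIF).

```python
import numpy as np
import matplotlib.pyplot as plt

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 200)   # ability continuum

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5), sharey=True)

# Uniform DIF: the groups differ only in difficulty (b), so the curves are
# shifted but parallel and the reference group is favored at every level.
axes[0].plot(theta, icc(theta, a=1.2, b=-0.3), label="reference")
axes[0].plot(theta, icc(theta, a=1.2, b=0.4), label="focal")
axes[0].set_title("Uniform DIF")

# Nonuniform DIF: the groups differ in discrimination (a), so the curves
# cross and the advantage depends on the examinee's ability.
axes[1].plot(theta, icc(theta, a=1.6, b=0.0), label="reference")
axes[1].plot(theta, icc(theta, a=0.7, b=0.0), label="focal")
axes[1].set_title("Nonuniform DIF")

for ax in axes:
    ax.set_xlabel("ability (theta)")
    ax.legend()
axes[0].set_ylabel("P(correct)")
plt.tight_layout()
plt.show()
```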
Procedures for detecting DIF
Mantel-Haenszel
A common procedure for detecting DIF is the Mantel-Haenszel (MH) approach.[13] The MH procedure is a chi-square contingency table based approach that examines differences between the reference and focal groups on each item of the test, one at a time.[14] The ability continuum, defined by total test scores, is divided into k intervals, which then serve as the basis for matching members of the two groups.[15] A 2 × 2 contingency table is constructed at each of the k intervals, comparing the two groups on an individual item. The rows of the contingency table correspond to group membership (reference or focal), while the columns correspond to correct or incorrect responses. The following table presents the general form for a single item at the kth ability interval.
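The standard layout denotes the cell counts by A_k, B_k, C_k, and D_k, with marginal totals n_Rk, n_Fk, m_1k, m_0k, and T_k:

                     Correct (1)    Incorrect (0)    Total
  Reference group        A_k             B_k          n_Rk
  Focal group            C_k             D_k          n_Fk
  Total                  m_1k            m_0k         T_k

From these tables, the MH common odds ratio and the continuity-corrected MH chi-square statistic are obtained by summing across the k intervals. The following is a minimal computational sketch in Python, assuming hypothetical arrays of item responses, group labels, and total scores (all names and data are illustrative, not a specific software implementation):

```python
import numpy as np

def mantel_haenszel_dif(item_correct, group, total_score, n_intervals=5):
    """Mantel-Haenszel DIF statistics for a single studied item.

    item_correct : 0/1 responses to the studied item
    group        : "R" (reference) or "F" (focal) for each examinee
    total_score  : total test score used as the matching criterion
    """
    item_correct = np.asarray(item_correct)
    group = np.asarray(group)
    total_score = np.asarray(total_score)

    # Divide the total-score continuum into k matching intervals.
    edges = np.quantile(total_score, np.linspace(0, 1, n_intervals + 1))
    strata = np.clip(np.searchsorted(edges, total_score, side="right") - 1,
                     0, n_intervals - 1)

    sum_A, sum_EA, sum_varA, num, den = 0.0, 0.0, 0.0, 0.0, 0.0
    for k in range(n_intervals):
        in_k = strata == k
        A = np.sum(in_k & (group == "R") & (item_correct == 1))
        B = np.sum(in_k & (group == "R") & (item_correct == 0))
        C = np.sum(in_k & (group == "F") & (item_correct == 1))
        D = np.sum(in_k & (group == "F") & (item_correct == 0))
        T = A + B + C + D
        if T < 2:
            continue  # skip intervals too sparse to contribute
        n_R, n_F = A + B, C + D
        m1, m0 = A + C, B + D
        sum_A += A
        sum_EA += n_R * m1 / T                       # E(A_k) under no DIF
        sum_varA += n_R * n_F * m1 * m0 / (T**2 * (T - 1))
        num += A * D / T                             # numerator of alpha_MH
        den += B * C / T                             # denominator of alpha_MH

    chi_square = (abs(sum_A - sum_EA) - 0.5) ** 2 / sum_varA  # continuity-corrected
    odds_ratio = num / den   # alpha_MH > 1 favors the reference group
    return chi_square, odds_ratio
```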
Considerations
Sample size
The first consideration pertains to sample size, specifically with regard to the reference and focal groups. Prior to any analysis, the number of people in each group is typically known, such as the number of males and females or the members of ethnic or racial groups. The more pressing question is whether the number of people per group is sufficient to provide enough statistical power to identify DIF. In some instances, such as ethnicity, group sizes may be markedly unequal; for example, White examinees may form a far larger sample than any individual ethnic group represented. In such instances it may be appropriate to modify or recode the data so that the groups being compared for DIF are equal or closer in size. Dummy coding or recoding is a common practice employed to adjust for disparities in the size of the reference and focal groups. In this case, all non-White ethnic groups can be combined in order to obtain a relatively equal sample size for the reference and focal groups, allowing a "majority/minority" comparison of item functioning. If such modifications are not made before DIF procedures are carried out, there may not be enough statistical power to identify DIF even when it exists between groups.

Another sample size issue relates directly to the statistical procedure used to detect DIF. Aside from the sizes of the reference and focal groups, certain characteristics of the sample itself must satisfy the assumptions of each statistical test used in DIF detection. For instance, IRT approaches may require larger samples than the Mantel-Haenszel procedure, so investigation of group size may direct the choice of one procedure over another. Within the logistic regression approach, high-leverage values and outliers are of particular concern and must be examined prior to DIF detection. Additionally, as with all analyses, the assumptions of the statistical tests must be met; some procedures are robust to minor violations while others are less so. The distributional nature of sample responses should therefore be investigated before implementing any DIF procedure.
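A minimal sketch of the dummy-coding step described above, using hypothetical data (the column names and categories are illustrative):

```python
import pandas as pd

# Hypothetical examinee records; column and category names are illustrative.
examinees = pd.DataFrame({
    "ethnicity": ["White", "Black", "Asian", "White", "Hispanic", "White", "Black"],
    "total_score": [17, 14, 18, 12, 15, 19, 13],
})

# Recode the multi-category variable into a dichotomous indicator so that
# all non-White examinees form a single, larger focal group, allowing a
# "majority/minority" comparison of item functioning.
examinees["dif_group"] = (examinees["ethnicity"] == "White").map(
    {True: "reference", False: "focal"}
)

print(examinees["dif_group"].value_counts())
```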
Items
The number of items to be tested for DIF must also be considered. There is no standard for how many items should be examined, as this varies from study to study. In some cases it may be appropriate to test all items for DIF, whereas in others it may not be necessary. If only certain items are suspected of DIF, with adequate reasoning, it may be more appropriate to test only those items rather than the entire set. However, it is often difficult to anticipate which items may be problematic, and for this reason it is generally recommended to examine all test items simultaneously. This provides information about every item, shedding light on problematic items as well as those that function similarly for the reference and focal groups. With regard to statistical tests, some procedures, such as IRT likelihood-ratio testing, require the use of anchor items: some items are constrained to be equal across groups while items suspected of DIF are allowed to vary freely. In this approach, only a subset of items is tested for DIF while the rest serve as a comparison group. Once DIF items are identified, the anchor items can in turn be analyzed by constraining the originally flagged items and allowing the original anchor items to vary freely. Testing all items simultaneously may therefore be the more efficient procedure, although, as noted, different procedures use different methods for selecting DIF items.

Aside from the number of items tested for DIF, the number of items on the entire test or measure is also important. The typical recommendation, as noted by Zumbo (1999), is a minimum of 20 items. The reasoning relates directly to the formation of the matching criterion: as noted in earlier sections, a total test score is typically used to match individuals on ability, and this score is usually divided into 3–5 ability levels (k) before the DIF analysis. Using a minimum of 20 items allows for greater variance in the score distribution, which results in more meaningful ability-level groups. Although the psychometric properties of the instrument should have been assessed before it is used, it is important that its validity and reliability be adequate. Test items need to accurately tap the construct of interest in order to derive meaningful ability-level groups, and reliability coefficients should not be inflated by simply adding redundant items. The key is a valid and reliable measure with sufficient items to develop meaningful matching groups. Gadermann et al. (2012),[30] Revelle and Zinbarg (2009),[31] and John and Soto (2007)[32] offer more information on modern approaches to structural validation and more precise and appropriate methods for assessing reliability.
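The matching step described above can be sketched by cutting the total-score distribution into a small number of ability levels. The following minimal example uses simulated scores on a hypothetical 20-item test; the number of levels and the score distribution are illustrative:

```python
import numpy as np
import pandas as pd

# Simulated total scores on a hypothetical 20-item test (illustrative only).
rng = np.random.default_rng(0)
total_score = rng.binomial(n=20, p=0.6, size=500)

# Cut the total-score distribution into k = 4 ability levels; examinees in
# the same level are treated as matched on ability in the DIF analysis.
ability_level = pd.qcut(total_score, q=4, labels=False, duplicates="drop")

print(pd.Series(ability_level).value_counts().sort_index())
```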
Balancing statistics and reasoning
As with all psychological research and psychometric evaluation, statistics play a vital role but should by no means be the sole basis for decisions and conclusions. Reasoned judgment is of critical importance when evaluating items for DIF. For instance, different statistical procedures for DIF detection can yield different results, because some procedures are more precise than others. The Mantel-Haenszel procedure, for example, requires the researcher to construct ability levels based on total test scores, whereas IRT places individuals along the latent trait or ability continuum more effectively. Thus, one procedure may indicate DIF for certain items while another does not.
Another issue is that DIF may sometimes be indicated without a clear reason why it exists. This is where reasoned judgment comes into play, especially in understanding why uniform or nonuniform DIF occurs.[33] The researcher must use common sense to derive meaning from DIF analyses. It is not enough to report that items function differently for different groups; there must be qualitative reasoning about why they do.
Uniform DIF occurs when there is a consistent advantage for one group over another across all levels of ability. This type of bias can often be addressed by using separate test norms for different groups to ensure fairness in assessment. Nonuniform DIF is more complex, as the advantage varies with the individual's ability level. Factors such as socioeconomic status, cultural differences, language barriers, and disparities in access to knowledge can contribute to nonuniform DIF. Identifying and addressing nonuniform DIF requires a deeper understanding of the underlying cognitive processes involved and may require tailored interventions to ensure fair assessment practices.
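One way to distinguish the two forms statistically is the logistic regression procedure mentioned earlier, in which the item response is regressed on the matching criterion (for example, total score), group membership, and their interaction: a significant group effect indicates uniform DIF, and a significant interaction indicates nonuniform DIF. The following is a minimal sketch of that model, using simulated, illustrative data rather than any real test:

```python
import numpy as np
import statsmodels.api as sm

# Simulated, illustrative data: 400 examinees, a matching criterion
# (standardized total score), a 0/1 group indicator, and item responses
# generated with a modest uniform-DIF effect against the focal group.
rng = np.random.default_rng(1)
n = 400
total_score = rng.normal(size=n)
group = rng.integers(0, 2, size=n)          # 0 = reference, 1 = focal
p_correct = 1.0 / (1.0 + np.exp(-(1.0 * total_score - 0.5 * group)))
item = rng.binomial(1, p_correct)

# Three nested logistic models:
#   M1: ability only
#   M2: ability + group                (group term tests uniform DIF)
#   M3: ability + group + interaction  (interaction tests nonuniform DIF)
X1 = sm.add_constant(np.column_stack([total_score]))
X2 = sm.add_constant(np.column_stack([total_score, group]))
X3 = sm.add_constant(np.column_stack([total_score, group, total_score * group]))

m1 = sm.Logit(item, X1).fit(disp=False)
m2 = sm.Logit(item, X2).fit(disp=False)
m3 = sm.Logit(item, X3).fit(disp=False)

# Likelihood-ratio statistics for the two nested comparisons; each is
# referred to a chi-square distribution with one degree of freedom.
lr_uniform = 2 * (m2.llf - m1.llf)
lr_nonuniform = 2 * (m3.llf - m2.llf)
print(f"uniform DIF LR = {lr_uniform:.2f}, nonuniform DIF LR = {lr_nonuniform:.2f}")
```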
It is common in DIF studies to find that some items exhibit DIF, indicating potential issues that need scrutiny. However, evidence of DIF does not automatically imply that the entire test is unfair; rather, it signals that specific items may be biased and require attention in order to maintain the integrity and fairness of the test for all examinees. Identifying items with DIF offers an opportunity to review and, where necessary, revise or remove problematic items, ensuring equitable assessment practices. DIF analysis therefore serves as a valuable tool for item analysis, particularly when supplemented with qualitative exploration of causal factors.
Statistical programs are available for each of the procedures discussed herein. See the list of statistical packages for a comprehensive list of open-source, public-domain, freeware, and proprietary statistical software.
Mantel-Haenszel procedure
IRT-based procedures
Logistic regression