Perhaps it is a character flaw, but I have never enjoyed reading long technical reports. In my early years as a practitioner in Human Resources, I was typically so busy that I did my best to gain the key information I needed from written materials by skimming them. This approach worked a large percentage of the time, and yet there was more than one occasion when I would have benefited from reading an entire document thoroughly.

My days of skimming ended when one of the jurisdictions I had just started working for was sued by the Department of Justice for patterns and practices of discrimination in their entry-level hiring and promotional processes. One of the important lessons that came out of defending that lawsuit was how critical it is to read all important documents thoroughly.

In that regard, if you are one of the jurisdictions that has chosen to use IPMA-HR’s public safety tests, you may be overlooking a wealth of information if you do not take the time to read the “Test Response Data Report” thoroughly; it is available to Test Security Agreement signers on request. Reviewing this report will give you key information about the test and about how your candidates performed in comparison with all test takers combined.

The beauty of this report is in its simplicity. Unlike typical research studies that are weighed down with technical jargon, this report comes complete with all the information necessary to understand it. While a background in statistics may be helpful in interpreting the significance of some of the information, it is not required to understand the report. To its credit, the document also provides a concise explanation of adverse impact and makes it clear that adverse impact does not equal discrimination.

There are four main sections contained within a Test Response Data Report and all of them are relatively easy to understand:

  1. Frequency Distribution
  2. Adverse Impact Thresholds by Race
  3. Adverse Impact Thresholds by Gender
  4. List of Agencies Administering the Test

Frequency Distribution

Several pieces of information contained in the Frequency Distribution provide valuable insight into test performance and the overall effectiveness of the test. In the report I reviewed, the information at the top of the first page identifies the test being studied and the number of questions appearing on the test. The report also indicates the total number of individuals reported to have taken the test from January 2008 to January 2013, along with the mean score and standard deviation obtained from all test administrations included in that period. While not specifically spelled out, you can also identify the range of scores for a particular test. In the report I reviewed, the top of the range was a raw score of 99 and the lowest raw score reported was 30. The report also shows that only one person earned a 30 and only one person earned a 99.

All of this information helps us to gain a picture of how the test takers performed as well as how the test performed. Looking at the high and low scores tells us that this test has an appropriate range. The fact that only one individual out of 2,618 earned a score of 99 tells us that the test is not too easy. Put another way, we have an appropriate ceiling and we are not losing variance in our test because everyone is topping out.

The point that the test has appropriate variance or dispersion is also supported by the standard deviation of 10.11. This tells us that the test spreads candidates out enough to distinguish differences in test performance, and in the underlying KSAs the test measures, without scattering everyone too widely.

This concept is also supported by the fact that when the mean (76.48) is inserted into the frequency distribution, it divides the candidates into 55.81% scoring at or above the mean and 44.19% scoring below it. This suggests that the test comes close to dividing the group in half and, therefore, approximates a normal curve.

The data also indicates that 115 test takers achieved a raw score of 78 and another 115 achieved a raw score of 77. With the mean at 76.48 and the two modes at 77 and 78, we can see that most scores on the test fall in the middle, which again supports the assertion that the results would come close to a normal curve if we plotted them on a histogram. The desirability of a normal curve is that it lends itself well to other statistical calculations.
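For readers who like to check figures such as these themselves, here is a minimal Python sketch of the idea, assuming the frequency distribution has been keyed in as a score-to-count table. The counts below are illustrative placeholders rather than the actual report data (apart from the two modes of 77 and 78 with 115 takers each).

    # Sketch only: the counts are illustrative placeholders, not the report's data.
    frequencies = {74: 98, 75: 104, 76: 110, 77: 115, 78: 115, 79: 107, 80: 96}

    mean_score = 76.48
    total = sum(frequencies.values())
    at_or_above = sum(n for score, n in frequencies.items() if score >= mean_score)

    print(f"At or above the mean: {at_or_above / total:.2%}")
    print(f"Below the mean:       {1 - at_or_above / total:.2%}")

    # The mode(s) are simply the raw score(s) with the highest count.
    peak = max(frequencies.values())
    modes = sorted(score for score, n in frequencies.items() if n == peak)
    print(f"Mode(s): {modes}")  # [77, 78] with these counts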

Combining the facts that the report I reviewed was based on 2,618 test takers and that the distribution approached a normal curve, we can also say that we have stable measures. As you may recall from statistics, the more cases our calculations are based on, the more stable or reliable they are. In this case we know our mean is reliable and not subject to large fluctuations, and our test itself is reliable, which supports its validity.

Applying the normal curve to our distribution also lets us determine how our candidates compare to the overall results as well as to each other. Because the results are reliable and we have a mean and standard deviation to plot on the distribution, we can apply the laws of probability to the differences we see in scores and determine whether they are statistically significant. Further, since the data we are basing our decisions on is valid and reliable, we can be comfortable knowing that significant differences in test performance should equate to significant differences in job performance.

Adverse Impact Thresholds

The second section of the report addresses the issue of adverse impact. It also does a very good job of explaining the four-fifths rule and how it is calculated. The explanation correctly points out that while adverse impact can be related to discrimination on an illegal basis, it is not, in itself, proof that such discrimination exists. While adding “on an illegal basis” to the word discrimination may be a minor point, it is a distinction I make, since all tests discriminate; that is what they are intended to do.

The key is that they should discriminate on the basis of test performance as it relates to the job and not on the basis of race, religion, gender, age, ethnic background or any other protected status. A test that does not discriminate when viewed in terms of its ability to differentiate test performance and subsequent job performance would be worthless.

Simply put, the four-fifths or eighty percent rule is based on calculating the pass rate for the majority group and then calculating eighty percent of that. So it is a percentage of a percentage. As indicated in the report, 80% of the pass rate for the majority group is the “Adverse Impact Threshold.” This threshold can then be compared to the pass rate for every protected group that has a representative sample taking the test. The Test Response Data Report takes the 80% concept a step further by noting that the cut score is typically related to the degree of adverse impact: as the cut score or pass point is raised, adverse impact tends to increase, and as the cut score is lowered, adverse impact tends to decrease. Whether this holds true will depend on the number of individuals in the majority group who move from below to above the cut score when the cut score is lowered.
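The calculation itself is simple enough to express in a few lines of code. The Python sketch below simply formalizes the rule as just described; the function names and the example pass rates are my own illustrations, not part of the report.

    # Four-fifths (80%) rule: the adverse impact threshold is 80% of the
    # majority group's pass rate; a group falls short if its pass rate is
    # below that threshold.
    def adverse_impact_threshold(majority_pass_rate: float) -> float:
        """Return 80% of the majority group's pass rate."""
        return 0.80 * majority_pass_rate

    def has_adverse_impact(group_pass_rate: float, majority_pass_rate: float) -> bool:
        """True if the group's pass rate falls below the adverse impact threshold."""
        return group_pass_rate < adverse_impact_threshold(majority_pass_rate)

    # Illustrative numbers only: a 60% majority pass rate yields a 48% threshold,
    # so a group passing at 45% would fall below it.
    print(f"{adverse_impact_threshold(0.60):.2%}")   # 48.00%
    print(has_adverse_impact(0.45, 0.60))            # True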

The report provides Adverse Impact Threshold Comparisons for gender and race, with males representing the majority in the gender comparison and whites representing the majority in the comparison by race. As indicated previously, since the cut score can influence adverse impact, it is important to take that into consideration when establishing cut scores. Other factors usually taken into consideration in establishing cut scores based on test performance are the mean and standard deviation. Again, the concept of test validity suggests that test performance predicts job performance, with average test performance equating to average job performance and below average test performance equating to below average job performance. Since most employers do not want to hire individuals who are expected to have below average job performance, it makes sense that they would not want to hire individuals with below average test performance.

However, there are confounding factors regarding test performance and job performance, such as the SEM (Standard Error of Measurement), which deals with observed scores versus true scores. It is generally advisable to take such factors into consideration when setting the cut score, and therefore cut scores are not typically set right at the mean. More often, a range of possible scores is developed based on the SEM, allowing the practitioner to view an observed score as fitting into a range of scores that has a higher probability of representing the candidate’s true score. For example, if a candidate achieves an observed score of 77 and we have an SEM of 3, we can add 3 points to 77 and subtract 3 points from 77 to create a range of scores (74 to 80) that has a greater probability of containing the candidate’s true score than the observed score taken alone.

Another way of looking at this is to treat the observed score as the mean of the distribution of scores one individual would produce by taking the test numerous times. Over successive administrations, the distribution of observed scores would approximate a normal curve, and the mean and standard deviation (the SEM) for that individual can be used to create confidence bounds around the candidate’s true score. If we add and subtract one standard deviation, we have about a 68% chance that the candidate’s true score falls in the range we have created. If we add and subtract two standard deviations, we have about a 95% chance that we have included the individual’s true score. Therefore, knowing some of the limitations of tests, it is also important to keep our basic understanding of adverse impact in mind and to do what we can to avoid it, if possible, when setting our cut scores.
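Continuing the worked example, here is a short Python sketch of those confidence bands, assuming an observed score of 77 and an SEM of 3; the 68% and 95% figures come from the familiar normal-curve rule of thumb.

    # Confidence bands around an observed score, using an assumed SEM of 3.
    observed_score = 77
    sem = 3

    for n_sems, coverage in [(1, "about 68%"), (2, "about 95%")]:
        lower = observed_score - n_sems * sem
        upper = observed_score + n_sems * sem
        print(f"{coverage} chance the true score falls between {lower} and {upper}")
    # -> about 68% chance the true score falls between 74 and 80
    # -> about 95% chance the true score falls between 71 and 83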

Applying what we have discussed thus far to the tables in the report, we can take a look at potential cut scores. Using the mean of 76.48 gives us a starting point for considering cut scores and examining the resulting adverse impact, if any. In the table comparing males and females, we can see that rounding our mean up to a raw score of 77 gives a 55.7624% pass rate for males, with a corresponding AIT (Adverse Impact Threshold) of 44.6099%. In other words, 80% of the male pass rate is 44.6099%. Females would need to pass at that rate or higher for the test not to have adverse impact on females. Since females have a 56.0773% pass rate when the cut score is 77, we can see that we do not have adverse impact. In fact, females pass at a slightly higher rate than males do when the cut score is 77. This means that if we wanted to use 77 as our cut score, we would not adversely affect females.
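Using the helper logic sketched earlier (or just the raw arithmetic), the gender comparison at a cut score of 77 works out as follows; the pass rates are the ones quoted from the report.

    # Gender comparison at a cut score of 77, using the pass rates quoted above.
    male_pass_rate = 0.557624
    female_pass_rate = 0.560773

    threshold = 0.80 * male_pass_rate
    print(f"Adverse Impact Threshold: {threshold:.4%}")                  # 44.6099%
    print(f"Adverse impact on females: {female_pass_rate < threshold}")  # False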

Using the same cut score of 77 in our table examining AIT by race, we can see that the pass rate is 59.94% for whites and 80% of that is 47.95%. Using an AIT of 47.95%, we can see that African Americans (39.65%), Hispanics (44.20%), and Asians (41.46%) would all fall below the threshold, indicating the test would have adverse impact if the cut score were set at 77. The only minority group with a pass rate above the threshold would be American Indians, with a pass rate of 61.9%. Using the table, we can see that the highest cut score we could use while eliminating all adverse impact is 66, with an AIT of 71.16. Since our test has a standard deviation of 10.11, a score of 66 is roughly one SD below our mean of 76.48 (76.48 - 10.11 = 66.37).
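The comparison by race at a cut score of 77 can be checked the same way, again using the pass rates quoted above, along with the observation that a cut score of 66 sits roughly one standard deviation below the mean.

    # Comparison by race at a cut score of 77, using the pass rates quoted above.
    white_pass_rate = 0.5994
    threshold = 0.80 * white_pass_rate
    print(f"Adverse Impact Threshold: {threshold:.2%}")   # 47.95%

    for group, rate in [("African American", 0.3965), ("Hispanic", 0.4420),
                        ("Asian", 0.4146), ("American Indian", 0.619)]:
        status = "below threshold" if rate < threshold else "above threshold"
        print(f"{group}: {rate:.2%} ({status})")

    # A cut score of 66 is roughly one standard deviation below the mean.
    mean, sd = 76.48, 10.11
    print(f"One SD below the mean: {mean - sd:.2f}")      # 66.37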

If we are comfortable with the impact on our jurisdiction that would result from using 66 as a cut score, we could eliminate adverse impact by doing so. On the other hand, a cut score of 70 could be chosen, since it eliminates adverse impact for every minority group except African Americans, and their pass rate of 64.21% is very close to the AIT, which is 65.30%. This is also where our SEM could come into play: if it is close to 3, a confidence band extending two SEMs above an observed score of 70 would reach into the neighborhood of our mean. Our purpose here, however, is not to establish or even recommend a pass point, but rather to show how the data provided in the report can be used to gain information about the performance of the test and to make important decisions regarding it.
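To show how this kind of review might be carried out against the report’s tables, here is one last Python sketch that scans a set of cut scores for the highest one producing no adverse impact. The pass rates in the table below are hypothetical placeholders I made up for illustration; in practice they would come straight from the Adverse Impact Threshold tables in the report.

    # Hypothetical pass rates by cut score; "Majority" is the reference group.
    pass_rates = {
        66: {"Majority": 0.89, "Group A": 0.75, "Group B": 0.78},
        70: {"Majority": 0.82, "Group A": 0.64, "Group B": 0.70},
        77: {"Majority": 0.60, "Group A": 0.40, "Group B": 0.44},
    }

    def no_adverse_impact(rates: dict) -> bool:
        threshold = 0.80 * rates["Majority"]
        return all(rate >= threshold for group, rate in rates.items() if group != "Majority")

    candidates = [cut for cut, rates in pass_rates.items() if no_adverse_impact(rates)]
    print(f"Highest cut score with no adverse impact: {max(candidates) if candidates else None}")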

Agency Listing

The final section of the report provides information regarding other jurisdictions that have used the test and provided data regarding its use.1 The data includes the number of candidates tested, the pass point used, whether or not the test was used for ranking, and the size of the department. This information is straightforward, and it gives agencies an idea of who is using the test, how it is being used, and the size of the departments that are using it.

This information can be useful because it allows jurisdictions using the test to share their experiences with it. The size and location of the departments also let users know which departments most like their own are using the test. In addition, it may provide opportunities to prevent duplication of effort by testing together and/or accepting scores from candidates who have taken the test when it was administered by another jurisdiction. Also of note is the fact that the most commonly used cut score is 70. As indicated earlier, this is the score closest to the mean that eliminates adverse impact for all groups except a slight amount for African Americans.

Hopefully, this information has shed some light on what you may be overlooking if you don’t take the time to read the Test Response Data Report, and will encourage you to read it and use the data effectively in the future.


  1. Please note that the list of agencies on a test response data report is not a comprehensive list of every agency that has used a particular test, but rather just those who have provided us with the data used to prepare the report. If you need a full list of other agencies using a particular test, you can contact the Assessment Services Department and we’ll be able to provide a security agreement signer with that information.