Successive Hurdles, Test Weighting and Certification Rules: Part 3

In the last article we focused on weighting the tests and subtests that make up the total selection process. We identified some instruments that should be used only as pass/fail and others that can be used to rank candidates. The tests and subtests suitable for ranking are those that the job analysis identifies as helping to differentiate potential job performance. We also identified a problem with weighting tests and subtests by simply multiplying test results by the percentage we want them to contribute to the total: tests with greater variance tend to affect the ranking more than their desired weight would suggest. Simply put, tests tend to self-weight based on their variance.

A simple illustration shows that tests that spread scores out (have greater variance) will have a greater impact on the final ranking of candidates than tests that tend to lump everyone together (have less variance). Taking this concept to its extreme, if a group of five people all got the same score on a multiple-choice exam but achieved widely divergent scores on a structured interview, the multiple-choice exam would carry zero weight in the final ranking and the interview would carry one hundred percent.

Multiple-Choice Test Scores    Structured Interview Scores
80                             95
80                             85
80                             80
80                             75
80                             70

This would be an undesirable result if you actually wanted each test to have a different weight or equal weights in the final ranking. Also note that simply multiplying each test score by the desired weight (for example, multiplying by 0.5 if you wanted each component to weigh 50%) will not achieve the desired result, because it does not change the ranking established by the structured interview.
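The point can be checked with a short Python sketch using the five candidates from the table above. The function name rank_order is a hypothetical helper introduced only for this illustration.

```python
# Scores from the table above: five candidates, two tests.
multiple_choice = [80, 80, 80, 80, 80]
interview = [95, 85, 80, 75, 70]


def rank_order(scores):
    """Return candidate positions ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Multiply each raw score by the desired 50% weight and add the two parts.
composite = [0.5 * mc + 0.5 * si for mc, si in zip(multiple_choice, interview)]

print(composite)
# The composite ranking is identical to the interview-only ranking.
print(rank_order(composite) == rank_order(interview))
```

Because the multiple-choice test adds the same 40 points to everyone, the order of the composite scores is decided entirely by the interview.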

Even though this may be an extreme example, it does help us picture what is happening within the tests we use for ranking with regard to their self-weighting. An extreme example also illustrates the decision a jurisdiction faces in determining whether taking the time and effort to adjust scores for the desired weighting is warranted. In most cases, if the means and the standard deviations for the tests within the battery do not vary widely, it may not be necessary to apply corrections.

In addition to the challenges posed by the variance of the selection instruments within our test battery, we can also have other sources of weighting error when we use multiple panels to conduct structured interviews or multiple panels of assessors to rate candidates in multiple assessment center exercises. Frequently, very large jurisdictions will have to employ both of these strategies in dealing with large numbers of candidates competing for positions at each rank. Although these rating differences should be avoided with sufficient training, agencies can use the corrections outlined below to address differences in the way panels score candidates. A review of the panels’ scores may indicate that, despite best efforts to train all raters, panels are applying the ratings differently, with some boards being very strict, some quite lenient and others conservative.

If you do have differences in the way panels are scoring candidates, you may want to consider some type of correction so that the panel that is considered the “easiest” and tends to give the highest scores will not end up determining who gets ranked the highest. As suggested previously, the issues related to combining scores in proportion to their desired weights can be resolved by using standardized scores.
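One way such a panel correction might look is sketched below in Python: each candidate's score is standardized against the mean and standard deviation of the panel that rated them, using the Z scores described in the rest of this article. This is only an illustration under stated assumptions. The panel names and scores are hypothetical, the population standard deviation (pstdev) is an arbitrary choice, and the approach assumes candidates were assigned to panels in a way that makes the panels' candidate pools comparable.

```python
from statistics import mean, pstdev

# Hypothetical raw interview scores from a lenient panel and a strict panel.
panel_scores = {
    "lenient_panel": [90, 85, 80],
    "strict_panel": [75, 70, 65],
}


def standardize_within_panel(scores):
    """Convert one panel's raw scores to Z scores using that panel's own
    mean and (population) standard deviation."""
    panel_mean = mean(scores)
    panel_sd = pstdev(scores)
    return [(score - panel_mean) / panel_sd for score in scores]


panel_z = {
    name: standardize_within_panel(scores)
    for name, scores in panel_scores.items()
}

for name, z_scores in panel_z.items():
    print(name, [round(z, 2) for z in z_scores])
```

In this made-up data the top candidate from each panel ends up with the same Z score, so the lenient panel's higher raw scores no longer decide who ranks first.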

All of us have been subjected to standardized scores in some fashion or another, since they are widely used in the educational system in our country. Their utility in the educational model is similar to that in HR selection: they provide a means of comparing candidate performance on different tests. Essentially, they allow us to use the test group to establish the norms for comparing candidates’ scores within that group. In the case of Z scores, the standardization process (i.e., the process used to convert a score) involves expressing an individual’s score in terms of its distance from the mean (arithmetic average) of the test, measured in units of the standard deviation (SD, a measure of how far scores vary from the mean). Other types of standardized scores use statistical methods to set the mean and the standard deviation to chosen values, with T scores and Deviation IQ scores being examples. T scores use a mean of 50 and an SD of 10, and IQ scores use a mean of 100 and an SD of 15. Thus someone who earned a Z score of 1, a T score of 60 and a standard IQ score of 115 would have had the same level of performance on each instrument when compared to the norm groups. Such comparisons are illustrated in the chart of the normal curve below, where the first line under the curve represents Z scores, which range from -4 to +4 as they divide the normal curve by standard deviations. That is, a Z score of +1 is equivalent to one standard deviation above the mean and a Z score of -1 is equivalent to one standard deviation below the mean.
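The equivalence among these scales can be sketched in a few lines of Python. The function name z_to_scale is a hypothetical helper; the means and standard deviations are the ones given above for T scores and Deviation IQ scores.

```python
def z_to_scale(z, scale_mean, scale_sd):
    """Express a Z score on a scale with the given mean and SD."""
    return scale_mean + z * scale_sd


z = 1.0
t_score = z_to_scale(z, scale_mean=50, scale_sd=10)     # T score scale
iq_score = z_to_scale(z, scale_mean=100, scale_sd=15)   # Deviation IQ scale

print(t_score, iq_score)  # 60.0 115.0
```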

The most commonly used standardized scores are Z scores. To compute a Z score, we obtain the difference between an individual’s raw score and the mean of the normative group and then divide the difference by the Standard Deviation (SD) of the normative group. That is, if an individual earned 85 on a test with a Mean of 75 and a Standard Deviation of 10 their Z score would be 1.

Z = (Raw Score – Mean) / Standard Deviation

Z = (85 – 75) / 10 = 10 / 10 = 1, so the Z score is 1
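The same arithmetic as a minimal Python sketch. The function name z_score is an assumption made for this illustration; the numbers are those of the worked example above.

```python
def z_score(raw_score, group_mean, group_sd):
    """Distance of a raw score from the group mean, in SD units."""
    if group_sd == 0:
        # Everyone earned the same score, so a Z score cannot be computed.
        raise ValueError("standard deviation is zero; Z score is undefined")
    return (raw_score - group_mean) / group_sd


example_z = z_score(85, 75, 10)
print(example_z)  # 1.0
```

The zero-SD check matters for cases like the five identical multiple-choice scores shown earlier, where there is no spread to standardize against.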

The utility of standardized scores, as indicated previously, is in the ease of comparison they provide. Once we have calculated Z scores (and, thankfully, there are computer programs that will do this for us), we can make comparisons with other candidates on other tests. In addition, we have a tool we can use to combine test scores to achieve their desired weights. The diagram below illustrates how Z scores and other standardized scores compare with each other, as well as how they compare to the normal curve, which represents the distribution of scores we hope to obtain from the administration of our tests. We can also see that if we calculate Z scores for our candidates on multiple exams, the point at which each score falls can be plotted along a line, giving us a picture of how a candidate performed on a test compared to the rest of the group.

Then, to combine these scores, we multiply each Z score by the proportion we want its test to contribute to the final score. So a Z score of 1.0 times a weighting of 0.5 becomes 0.5, and a Z score of 0.5 times a weighting of 0.5 becomes 0.25. Summing these two values gives us a combined Z score of 0.75.
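A minimal Python sketch of this weighted combination follows. The function name combine_z and the check that the weights sum to 1 are assumptions added for the illustration; the inputs are the two Z scores and the 50/50 weights from the example above.

```python
def combine_z(z_scores, weights):
    """Weighted sum of a candidate's Z scores across tests."""
    if len(z_scores) != len(weights):
        raise ValueError("need exactly one weight per Z score")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return sum(z * w for z, w in zip(z_scores, weights))


combined = combine_z([1.0, 0.5], [0.5, 0.5])
print(combined)  # 0.75
```

Because every test has already been put on the same scale, the weights here behave as intended rather than being distorted by differences in variance.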

Combining Z scores for final rankings should be applied only to passing scores. This builds an important assumption into the process: candidates who pass the exams scored at or above average, so combining scores will not be confounded by negative Z scores (remember that Z scores can range from -4 to +4). Using only passing scores means you have the option of using straight Z scores to establish your ranked list or transforming them into a more recognizable metric. To transform them, you establish a mean that represents the pass point (i.e., you set the mean to the same number as your pass point) and a standard deviation that approximates the average of the standard deviations achieved on the instruments themselves. Then, to compute the transformed scores, you use the mean as the starting point for a candidate’s score and add the product of the SD multiplied by the Z score.

That is, with a mean (pass point) of 70 and an SD of 10, an individual with a Z score of 1.0 would have a score of 70 (the mean) plus 10 (the SD of 10 multiplied by 1.0), which equals 80.
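The transformation can be sketched in Python as follows. The pass point of 70 and the resulting SD of 10 match the example above; the two instrument SDs (9 and 11) are hypothetical values chosen only so that their average comes to 10.

```python
from statistics import mean

pass_point = 70                 # the transformed scale's mean is set to the pass point
instrument_sds = [9.0, 11.0]    # hypothetical SDs observed on the two instruments
scale_sd = mean(instrument_sds)  # average of the instrument SDs


def transformed_score(z, scale_mean, sd):
    """Start at the mean (pass point) and add SD multiplied by the Z score."""
    return scale_mean + sd * z


final_score = transformed_score(1.0, pass_point, scale_sd)
print(final_score)  # 80.0
```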

As these transformations show, addressing the issues related to self-weighting can add several steps to determining scores, and that, along with the complexity of the computations, has deterred many jurisdictions from correcting for self-weighting. The purpose of this information is to make test developers and users aware that self-weighting is an issue, that there are methods for correcting self-weighting errors, and that individuals responsible for ranking candidates can apply those corrective methods should they choose.


This is part three of a four-part series on successive hurdles, test weighting and certification rules. If you’ve just joined us, we suggest you catch up on part 1 and part 2. Part 4 will discuss how certification rules fit into the process of successive hurdles and test weighting as we’ve discussed thus far. The conclusion to this series will be available to read on the Assessment Services Review on April 25, 2012. In case you missed it, check out Robert Burd’s previous series, Item Analysis In Public Safety.
