- We want test questions that are very detailed, highly complex, engage the test taker, reflect the job, and, oh yes, are at a 6th grade reading level.
- Grade level = 5.9. Reading level from paragraphs from a 3rd grade reader from 1960, as calculated by Word.
This is part two of a blog dealing with the measurement of readability and the establishment of appropriate reading levels. For purposes of this blog, readability can be defined as the ability of material to be comprehended by its intended audience.
In Part 1, we investigated approaches to readability based on:
- The measurement of grammatical features or readability formulas.
- The linguistic perspective.
- Job analysis.
In Part 2, we turn our attention to more practical issues such as:
- How are readability indices used by assessment professionals?
- What adjustments can or should be made when evaluating multiple-choice tests?
- How has the changing nature of jobs impacted readability?
How do assessment professionals use readability indices?
Organizations often request that a test meet some arbitrarily determined grade level, and test constructors attempt to comply. Some jurisdictions may be required by union contract or by state regulations to offer tests or other documents that fall below some pre-established reading level. Test constructors may also attempt to match the reading level of the test to that of the source materials used on the job, which requires measuring the reading level of both.
Beyond any internal pressures, the courts may require an analysis of reading levels. In Vulcan (United States v. City of New York, 637 F. Supp. 2d 77, E.D.N.Y. 2009), citing Guardians (Guardians Assoc. of New York City Police Dep’t, Inc. v. Civil Serv. Comm., 630 F.2d 79, 86, 2d Cir. 1980), the court criticized the City of New York for failing to perform a readability analysis. Thus, assessment professionals may evaluate the readability of their tests as a form of insurance against a legal challenge.
What adjustments can or should be made when evaluating multiple-choice tests?
Regardless of any deficiencies inherent in using one number to measure readability, the ability to represent readability in a single metric is immensely attractive to both test users and test producers. Furthermore, the widespread use of Microsoft Word makes its built-in Flesch Reading Ease test and Flesch-Kincaid Grade Level index the preferred choice for assessment practitioners.
For example, to measure the readability of my blog, all I have to do is:
- Bring the document up in Word.
- Click on Review.
- Click on Spelling and Grammar.
Limiting ourselves to multiple-choice items, the analysis of readability still involves a great many choices. In particular, when Word's readability statistics are applied to multiple-choice tests, the Flesch-Kincaid formula may give inaccurate results because items often contain partial or incomplete sentences; other features, such as the numbers on a math or budget test, will also skew the results.
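To see why sentence fragments matter, it helps to look at the published formulas. The Flesch-Kincaid Grade Level is 0.39 x (words per sentence) + 11.8 x (syllables per word) - 15.59, and the Flesch Reading Ease score is 206.835 - 1.015 x (words per sentence) - 84.6 x (syllables per word). The Python sketch below is only an illustration: the syllable counter and the sentence splitter are rough heuristics chosen for this example (assumptions, not Word's actual rules), so its numbers should not be expected to match Word's.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text):
    """Return (sentences, words, syllables) using simple regex-based splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_kincaid_grade(text):
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


def flesch_reading_ease(text):
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


# The same item, once with each fragment ending in a period and once
# with no end punctuation at all (hypothetical item text).
punctuated = "The officer should first. Call for backup. Secure the scene."
unpunctuated = "The officer should first call for backup secure the scene"

print(round(flesch_kincaid_grade(punctuated), 2))
print(round(flesch_kincaid_grade(unpunctuated), 2))
```

Because both formulas depend on words per sentence, the same words produce a different grade level depending on where the formula believes the sentences end. A stem and its alternatives treated as three short "sentences" score lower than the same words treated as one long sentence.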
In response, assessment professionals use a number of modifications or approaches including:
- Simply analyzing the whole test as it is written.
- Splitting the instructions and the test items, then performing a separate analysis on each.
- Dropping the alternatives and creating sentences out of the stem of the item plus the correct alternative.
- Analyzing only longer reading passages or longer items.
- Analyzing individual questions and finding an average, although this approach appears to lead to higher reading levels because it places too great a weight on words with large numbers of syllables.
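Two of the approaches listed above can be sketched in a few lines of Python: building a sentence from the stem plus the correct alternative, and comparing a pooled analysis with an average of per-item grade levels. The `Item` class, the function names, and the syllable heuristic are all hypothetical and are introduced only for illustration; this is not how Word or any particular testing program does it.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:  # hypothetical container for one multiple-choice item
    stem: str
    alternatives: list
    correct: int  # index of the keyed (correct) alternative


def item_to_sentence(item):
    """Join the stem and the correct alternative into sentence form."""
    stem = item.stem.strip()
    answer = item.alternatives[item.correct].strip().rstrip(".")
    if stem.endswith("?"):
        # Question stem: keep the question, add the answer as its own sentence.
        return f"{stem} {answer[:1].upper()}{answer[1:]}."
    # Incomplete-sentence stem: the answer completes it.
    return f"{stem.rstrip(':').rstrip()} {answer}."


def count_syllables(word):
    """Rough heuristic (an assumption): vowel groups, minus a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(text):
    """Flesch-Kincaid Grade Level with simple regex-based counting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def pooled_grade(items):
    """Analyze all rewritten items together as one passage."""
    return fk_grade(" ".join(item_to_sentence(i) for i in items))


def mean_item_grade(items):
    """Analyze each rewritten item separately, then average the grades."""
    grades = [fk_grade(item_to_sentence(i)) for i in items]
    return sum(grades) / len(grades)


items = [
    Item("The first step at a fire scene is to",
         ["leave", "assess the hazards", "wait"], 1),
    Item("Which tool measures voltage?", ["a voltmeter", "a ruler"], 0),
]
print(item_to_sentence(items[0]))
print(round(pooled_grade(items), 2), round(mean_item_grade(items), 2))
```

When every item reduces to a single sentence, the two summaries share the same words-per-sentence term and differ only in how syllables per word are averaged. The per-item mean gives a short item full of long words the same weight as a long item, which is consistent with the higher reading levels noted in the last bullet.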
Do these modifications make a difference? They do, although the size of the effect and the total impact may vary.
For example, the Flesch-Kincaid formula does not depend on or count paragraphs, so it should not matter whether you analyze the material in question format or in text format. In practice, however, it does matter: the result can fluctuate by as much as a grade level. In an experiment with a simple reading test section, the grade level varied between 6.2 and 6.9 depending on how I formatted the material.
Further complications can be introduced by considering factors such as question numbering, the lettering for alternatives, how you treat alternatives, how you treat numbers, and even how the page is formatted. For example, large numbers such as might be included in math, mechanical comprehension, or accounting tests, may seriously inflate the reading estimate, leading some to argue that all numbers should be removed from the test before calculating a reading index.
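For those who do decide to strip such features before calculating a reading index, the preprocessing might look something like the Python sketch below. The patterns are assumptions about how a test is laid out (question numbers such as "12." or "3)", alternative labels such as "A." or "(b)", and numerals with optional dollar signs, commas, decimals, or percent signs); a real test would likely need its own rules, and the function names are hypothetical.

```python
import re

# Assumed layout conventions; adjust to the test at hand.
ITEM_NUMBER = re.compile(r"^\s*\d+[.)]\s+")            # "12. " or "3) "
OPTION_LABEL = re.compile(r"^\s*\(?[A-Ea-e][.)]\s+")   # "A. ", "b) ", "(C) "
NUMBER = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")  # "$1,250,000", "4.5%"


def clean_line(line):
    """Strip numbering, alternative labels, and numerals from one line."""
    line = ITEM_NUMBER.sub("", line)
    line = OPTION_LABEL.sub("", line)
    line = NUMBER.sub("", line)
    line = re.sub(r"\s+([.,;:?!])", r"\1", line)  # no space before punctuation
    line = re.sub(r"\s{2,}", " ", line)           # collapse leftover gaps
    return line.strip()


def clean_test(text):
    """Clean every line and join what remains into one block of text."""
    cleaned = (clean_line(line) for line in text.splitlines())
    return " ".join(line for line in cleaned if line)


sample = "1. Pump pressure is 150 psi.\nA. Too high\nB. Too low"
print(clean_test(sample))
```

Note that removing numbers leaves some sentences ungrammatical, and that alternatives without end punctuation still run together. The cleaned text is meant only as input to a readability formula, not as something a test taker would ever see.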
How has the changing nature of jobs impacted readability?
I would argue vociferously for the position that tests developed by professionals or by testing companies achieve admirable levels of readability and, if anything, are too easy in terms of reading demands. This seems to be especially true of public sector testing in the realm of the safety forces. Due to homeland security concerns, both police work and firefighting now require greater training and preparation. Firefighters deal with a wide range of hazardous chemicals, which requires greater knowledge of chemistry, and emergency medical services are now moving toward community-based practice. Police officers deal with a wide range of crimes involving computers and the internet. The range of possible problems and the complexity of the world and of technology lead to increased pressure for greater education and more advanced training; the result is a dramatic increase in the reading requirements placed upon the safety forces. This should be reflected in our test content.
Conclusion and Recommendations
The use of any single index of readability, including the Flesch offered by Word, is far from an optimal solution. Although the levels of reliability and validity are probably adequate when applied to text, the application of reading indices to multiple-choice tests would appear to result in greater variability, due to the impact of the various choices made in preparing the test for analysis. Still, from a practical perspective, it is unlikely that we will drop the use of an easily calculated metric. Given the above, I offer the following opinions on what might be seen as best practices; I would recommend that:
- The content of the test should be linked to the job and to the source material; the best defense is to be able to show that the reading level of the test is a result of the linkages between the items and the necessary knowledge, skills, and abilities.
- The test should be written by trained item writers, based on well-formulated item writing rules, including having item writers attempt to minimize reading difficulty in the process of constructing the test.
- Items should be reviewed to ensure that the reading levels are acceptable.
- In applying a measure of readability, the instructions and body of the test should be assessed separately.
- In applying a measure of readability, the questions should be rewritten into paragraph form.
- Numbers and any other unique features in the questions or alternatives should be removed, again with the goal of the test looking like a standard text in terms of formatting before the application of a readability formula.
- Normal item analysis practices should be followed; the real issues should be item difficulty, equivalence, and the like.
In addition, the world of testing is rapidly evolving. New item types are being developed that can be delivered using a wide range of media. As a result, the question arises as to whether readability is even a relevant issue when considering a range of new item types including various kinds of job simulations. My prediction is that in the future we will see a greater reliance on usability testing and the test taker experience, with less attention paid to issues such as readability. One of the reasons I find usability attractive is that it tends to rely upon good, old-fashioned experimentation.
It is surprising to me that more research has not been conducted on the actual impact of differences in readability on item properties. The question we should be asking, but rarely seem to ask, is to what extent varying the reading level of our tests affects test takers' scores. For example, does increasing the readability level of an item make much of a difference in terms of item difficulty or equivalence across subgroups? Of course, the follow-up question is whether any change in test score has implications for actual job performance. In the final analysis, the real question is whether we have a good test, one that is reliable and valid.
I would like to thank Bruce Davey for providing me with information and several documents related to readability. Although it came in too late to include in last month’s blog, Bruce Davey’s research led to the suggestion that the SMOG index be used in constructing tests. Bruce Davey also produced an excellent guide to readability, which includes helpful hints on reducing the reading level of tests; the title of the guide is Readability: Why It’s Important, How To Measure It, How To Improve It, which was produced in 1975 for the Connecticut State Personnel Department. In writing this blog, Dennis Doverspike was assisted by Megan Nolan and Chelsea Whims. The blog was based on a longer, unpublished paper by Doverspike, Nolan, and Whims entitled Establishing Reading Levels in Employee Assessment.