Standardized Testing Essay

A common misconception equates the term standardized test only with those tests that use multiple-choice items and machine-readable (“bubble”) answer sheets. To the contrary, if any aspect of a test—format, procedures, or administration—is standardized across test takers, it can be considered a standardized test.

Thus, tests with written, oral, or otherwise “constructed” instructions or responses can be standardized. Physical tests, such as those conducted by engineers, can also be standardized, but most standardized tests measure latent (i.e., non-observable) mental traits.

Both surveys and tests measure mental traits. Surveys typically measure attitudes or opinions, whereas tests typically measure proficiencies or aptitudes. Like surveys, however, tests are composed of items (e.g., questions, prompts).

Educators and psychologists are the most frequent users of standardized mental tests. Educators measure academic achievement (i.e., mastery of knowledge or skills), diagnose learning problems or aptitudes, and select candidates for higher academic programs. Psychologists measure beliefs, attitudes, and preferences; diagnose psychological problems or strengths; and select candidates for employment.

Good standardized tests tend to be both reliable and valid. A reliable test produces consistent results across time, conditions, and test takers when the characteristics of the test takers are held equal. A valid test measures the traits that it is intended to measure. When used for selection (e.g., university admission, employment), an effective test has strong predictive validity (a high correlation between test scores and future performance) or allocative efficiency (a high correlation between test scores and the performance of a group as a whole when its individual members are organized in a particular structure of relationships).
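
Because predictive validity is described here as a correlation between test scores and future performance, a small numerical illustration may help. The sketch below computes a Pearson correlation coefficient in Python. The function name and the data (admission test scores paired with later grade averages) are hypothetical and chosen only for illustration; an actual validity study would use far larger samples and would typically correct for effects such as restriction of range.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: admission test scores and later first-year grade averages.
test_scores = [520, 580, 610, 650, 700, 720]
later_grades = [2.4, 2.9, 2.7, 3.2, 3.6, 3.5]

predictive_validity = pearson_r(test_scores, later_grades)
print(f"correlation between scores and later grades: {predictive_validity:.2f}")
```

A coefficient near 1 would indicate that higher test scores go with better later performance; a coefficient near 0 would indicate that the test tells a selector little about future performance.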

When tests have consequences, or stakes, they can induce beneficial washback effects. A plethora of research studies demonstrates that students and candidates study more and take their studies more seriously (and, thus, learn more) when facing tests with stakes. Moreover, when test takers wish to perform well, and their teachers wish to help them, they tend to focus their preparation on mastering that knowledge or those skills they judge will be covered by the test. This has the indirect, but usually beneficial, effect of increasing efficiency by aligning educational or training program curricula with known standards, benchmarks, and goals.

Education Tests

In most of the world, system-wide education tests are based on common standards and curricula. Teachers align their instruction, and students their study, to them. Most of these tests have consequences and, typically, are placed at the entry and/or exit points of levels of education. Most countries require both upper secondary exit and university entrance examinations as well as either lower secondary exit or upper secondary entrance exams.

In the United States, however, where education governance is more fragmented, common standards and curricula have been difficult to implement and enforce. In place of standards-based (criterion-referenced) tests, many U.S. states and school districts purchased norm-referenced tests from commercial test publishers. The “norms” were constructed through field tests with national samples of students on a generalized curriculum of the publishers’ own construction. In the absence of standards-based testing, aptitude tests were often used to make consequential decisions, such as assigning students to special programs, retaining them in grades, awarding them scholarships, or admitting them to selective schools or universities.

The courts, however, have since declared it unfair to impose negative consequences, such as retention in grade or diploma denial, on students who fail tests based on pseudo-curricula rather than on the curriculum to which they were actually exposed. This principle has been written into the Standards for Educational and Psychological Testing, which have become the de facto regulations governing legal uses of standardized tests in the United States. As a result, virtually all grade promotion and graduation examinations are now standards-based rather than norm-referenced, and they are achievement tests rather than aptitude tests.

Test Development

Standardized tests evolve through a demanding and time-consuming process based on either classical test theory or item response theory. In classical test theory, every test is custom designed and relevant to a particular population. First, one develops a test content framework, or outline, and then validates it with reviews by experts or current job holders. Next, one drafts test items and field tests them with a representative sample of the test-taking population, which can be difficult to do without exposing that population to the test content.

Test development according to item response theory (IRT) requires large populations of test takers to produce reliable test statistics, but these statistics are not then dependent on any particular choice of respondent samples. Moreover, respondent scores are independent of any particular choice of test items.
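
The essay does not name a specific IRT model, so the following Python sketch assumes the simplest common one, the one-parameter (Rasch) model, purely as an illustration. In that model the probability of a correct response depends only on the difference between a test taker's ability and an item's difficulty, both expressed on the same scale. The function name and the numerical values are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the one-parameter (Rasch) model.

    Both arguments are on the same (logit) scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical test taker and three hypothetical items of increasing difficulty.
ability = 0.5
for difficulty in (-1.0, 0.5, 2.0):
    p = rasch_probability(ability, difficulty)
    print(f"difficulty {difficulty:+.1f}: P(correct) = {p:.3f}")
```

Placing abilities and difficulties on one common scale is what allows, in principle, item statistics to be compared across respondent samples and respondent scores to be compared across different sets of items.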

Computer-adaptive testing (CAT) owes its existence and its increasing popularity to IRT. With CAT, test takers are presented with an item at a level of difficulty determined by their performance on the previous item. Correct responses yield more difficult subsequent items. Test takers who respond to all items correctly can finish the examination early because they never need to respond to the less difficult items.
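
The item-selection rule described above can be sketched in Python as a simple up/down procedure. This is a deliberately simplified illustration, not how operational CAT systems work: they typically re-estimate ability with an IRT model after each response and apply a statistical stopping rule. The function name, the number of difficulty levels, and the simulated test taker are all hypothetical.

```python
def run_adaptive_test(answers_correctly, num_levels=5, start_level=2, max_items=6):
    """Administer items with a simple up/down rule.

    answers_correctly: callable that takes a difficulty level (0 = easiest)
    and returns True if the test taker answers an item at that level correctly.
    Returns the list of (level, correct) pairs in the order administered.
    """
    level = start_level
    history = []
    for _ in range(max_items):
        correct = answers_correctly(level)
        history.append((level, correct))
        if correct:
            # A correct response yields a more difficult next item.
            level = min(level + 1, num_levels - 1)
        else:
            # An incorrect response yields an easier next item.
            level = max(level - 1, 0)
    return history

# Hypothetical test taker who can answer items up to level 3 but not level 4.
history = run_adaptive_test(lambda level: level <= 3)
for level, correct in history:
    print(f"level {level}: {'correct' if correct else 'incorrect'}")
```

In this run the simulated test taker is never shown items at levels 0 or 1, which mirrors the point that strong performers bypass the less difficult items.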


