
Psychological assessment contributes important information to the understanding of individual characteristics and capabilities, through the collection, integration, and interpretation of information about an individual (Groth-Marnat, 2009; Weiner, 2003). Such information is obtained through a variety of methods and measures, with relevant sources determined by the specific purposes of the evaluation. Sources of information may include

  • Records (e.g., medical, educational, occupational, legal) obtained from the referral source;

  • Records obtained from other organizations and agencies that have been identified as potentially relevant;

  • Interviews conducted with the person being examined;

  • Behavioral observations;

  • Interviews with corroborative sources such as family members, friends, teachers, and others; and

  • Formal psychological or neuropsychological testing.

Agreements across multiple measures and sources, as well as discrepant information, enable the creation of a more comprehensive understanding of the individual being assessed, ultimately leading to more accurate and appropriate clinical conclusions (e.g., diagnosis, recommendations for treatment planning).

The clinical interview remains the foundation of many psychological and neuropsychological assessments. Interviewing may be structured, semistructured, or open in nature, but the goal of the interview remains consistent—to identify the nature of the client's presenting issues, to obtain direct historical information from the examinee regarding such concerns, and to explore historical variables that may be related to the complaints being presented. In addition, the interview element of the assessment process allows for behavioral observations that may be useful in describing the client, as well as discerning the convergence with known diagnoses. Based on the information and observations gained in the interview, assessment instruments may be selected, corroborative informants identified, and other historical records recognized that may aid the clinician in reaching a diagnosis. Conceptually, clinical interviewing explores the presenting complaint(s) (i.e., referral question), informs the understanding of the case history, aids in the development of hypotheses to be examined in the assessment process, and assists in determination of methods to address the hypotheses through formal testing.

An important piece of the assessment process and the focus of this report, psychological testing consists of the administration of one or more standardized procedures under particular environmental conditions (e.g., quiet, good lighting) in order to obtain a representative sample of behavior. Such formal psychological testing may involve the administration of standardized interviews, questionnaires, surveys, and/or tests, selected with regard to the specific examinee and his or her circumstances, that offer information to respond to an assessment question. Assessments, then, serve to respond to questions through the use of tests and other procedures. It is important to note that the selection of appropriate tests requires an understanding of the specific circumstances of the individual being assessed, falling under the purview of clinical judgment. For this reason, the committee refrains from recommending the use of any specific test in this report. Any reference to a specific test is to provide an illustrative example, and should not be interpreted as an endorsement by the committee for use in any specific situation; such a determination is best left to a qualified assessor familiar with the specific circumstances surrounding the assessment.

To respond to questions regarding the use of psychological tests for the assessment of the presence and severity of disability due to mental disorders, this chapter provides an introductory review of psychological testing. The chapter is divided into three sections: (1) types of psychological tests, (2) psychometric properties of tests, and (3) test user qualifications and administration of tests. Where possible an effort has been made to address the context of disability determination; however, the chapter is primarily an introduction to psychological testing.


Types of Psychological Tests

There are many facets to the categorization of psychological tests, and even more if one includes educationally oriented tests; indeed, it is often difficult to differentiate many kinds of tests as purely psychological rather than educational. The ensuing discussion lays out some of the distinctions among such tests; however, it is important to note that there is no one correct cataloging of the types of tests because the different categorizations often overlap. Psychological tests can be categorized by the nature of the behavior they assess (what they measure), their administration, their scoring, and how they are used. Figure 3-1 illustrates the types of psychological measures as described in this report.


FIGURE 3-1 Components of psychological assessment. NOTE: Performance validity tests do not measure cognition, but are used in conjunction with performance-based cognitive tests to examine whether the examinee is exerting sufficient effort to perform well and responding to the best of his or her ability.

The Nature of Psychological Measures

One of the most common distinctions made among tests relates to whether they are measures of typical behavior (often non-cognitive measures) versus tests of maximal performance (often cognitive tests) (Cronbach, 1949, 1960). A measure of typical behavior asks those completing the instrument to describe what they would commonly do in a given situation. Measures of typical behavior, such as personality, interests, values, and attitudes, may be referred to as non-cognitive measures. A test of maximal performance, obviously enough, asks people to answer questions and solve problems as well as they possibly can. Because tests of maximal performance typically involve cognitive performance, they are often referred to as cognitive tests. Most intelligence and other ability tests would be considered cognitive tests; they can also be known as ability tests, but this would be a more limited category. Non-cognitive measures rarely have correct answers per se, although in some cases (e.g., employment tests) there may be preferred responses; cognitive tests almost always have items that have correct answers. It is through these two lenses—non-cognitive measures and cognitive tests—that the committee examines psychological testing for the purpose of disability evaluation in this report.

One distinction among non-cognitive measures is whether the stimuli composing the measure are structured or unstructured. A structured personality measure, for example, may ask true-or-false questions about whether people engage in various activities; these are highly structured questions. On the other hand, in administering some commonly used personality measures, the examiner presents an unstructured projective stimulus such as an inkblot or a picture, and the test-taker is asked to describe what he or she sees or imagines the stimulus to be depicting. The premise of these projective measures is that, when presented with ambiguous stimuli, an individual will project his or her underlying and unconscious motivations and attitudes. The scoring of these latter measures is often more complex than it is for structured measures.

There is great variety in cognitive tests and what they measure, thus requiring a lengthier explanation. Cognitive tests are often separated into tests of ability and tests of achievement; however, this distinction is not as clear-cut as some would portray it. Both kinds of tests involve what the test-taker has learned and can do. However, achievement tests typically involve learning from very specialized education and training experiences, whereas most ability tests assess learning that has occurred in one's environment. Some aspects of learning are clearly both; for example, vocabulary is learned at home, in one's social environment, and in school. Notably, the best predictor of intelligence test performance is one's vocabulary, which is why a vocabulary test is often given first during intelligence testing or in some cases constitutes the body of the intelligence test (e.g., the Peabody Picture Vocabulary Test). Conversely, one can also have a vocabulary test based on words one learns only in an academic setting. Intelligence tests are so prevalent in many clinical psychology and neuropsychology situations that we also consider them as neuropsychological measures. Some abilities are measured using subtests from intelligence tests; certain working memory tests, for example, are intelligence subtests that are also administered singly. There are also standalone tests of many kinds of specialized abilities.

Some ability tests are broken into verbal and performance tests. Verbal tests, obviously enough, use language to ask questions and demonstrate answers. Performance tests, on the other hand, minimize the use of language; they pose problems that do not depend on language, such as manipulating objects, tracing mazes, placing pictures in the proper order, and finishing patterns. This distinction is most commonly used in the case of intelligence tests, but it can be applied to other ability tests as well. Performance tests are also sometimes used when the test-taker lacks competence in the language of the testing, and many of them assess visual-spatial abilities. Historically, nonverbal measures were given as intelligence tests to non-English-speaking soldiers in the United States as early as World War I. These tests continue to be used in educational and clinical settings given their reduced language component.

Different cognitive tests are also considered to be speeded tests versus power tests. A truly speeded test is one on which everyone could answer every question correctly given enough time. Some tests of clerical skills are exactly like this; they may present two lists of paired numbers, for example, where some pairings contain two identical numbers and other pairings differ, and the test-taker simply circles the pairings that are identical. Pure power tests are measures in which the only factor influencing performance is how much the test-taker knows or can do; all test-takers have enough time to do their best, and the only question is what they can do. Obviously, few tests are purely speeded or purely power tests; most combine elements of both. For example, a testing company may use a rule of thumb that 90 percent of test-takers should complete 90 percent of the questions; however, the purpose of the testing affects rules of thumb such as this. Few teachers would wish to have many students unable to complete the tests that they take in classes, for example. When test-takers have disabilities that affect their ability to respond to questions quickly, some measures provide extra time, depending upon their purpose and the nature of the characteristics being assessed.

Questions on both achievement and ability tests can involve either recognition or free response in answering. In educational and intelligence tests, recognition tests typically include multiple-choice questions, where one can look for the correct answer among the options, recognize it as correct, and select it. A free-response question is analogous to a “fill-in-the-blank” or an essay question; one must recall or solve the question without choosing from among alternative responses. This distinction also holds for some non-cognitive tests, although for those measures the contrast concerns not recognizing correct answers but selecting preferred responses. For example, a recognition question on a non-cognitive test might ask someone whether they would rather go ice skating or to a movie; a free-recall question would ask the respondent what they like to do for enjoyment.

Cognitive tests of various types can be considered as process or product tests. Take, for example, mathematics tests in school. In some instances, only getting the correct answer leads to a correct response. In other cases, teachers may give partial credit when a student performs the proper operations but does not get the correct answer. Similarly, psychologists and clinical neuropsychologists often observe not only whether a person solves problems correctly (i.e., product), but how the client goes about attempting to solve the problem (i.e., process).

Test Administration

One of the most important distinctions relates to whether tests are group administered or are individually administered by a psychologist, physician, or technician. Tests that traditionally were group administered were paper-and-pencil measures. Often for these measures, the test-taker received both a test booklet and an answer sheet and was required, unless he or she had certain disabilities, to mark his or her responses on the answer sheet. In recent decades, some tests are administered using technology (i.e., computers and other electronic media). There may be some adaptive qualities to tests administered by computer, although not all computer-administered tests are adaptive (technology-administered tests are further discussed below). An individually administered measure is typically provided to the test-taker by a psychologist, physician, or technician. More confidence is often placed in the individually administered measure, because the trained professional administering the test can make judgments during the testing that affect the administration, scoring, and other observations related to the test.

Tests can be administered in an adaptive or linear fashion, whether by computer or individual administrator. A linear test is one in which questions are administered one after another in a pre-arranged order. An adaptive test is one in which the test-taker's performance on earlier items affects the questions he or she receives subsequently. Typically, if the test-taker answers the first questions correctly (or, for non-cognitive measures, in accordance with preset or expected response algorithms), the subsequent questions become progressively more difficult until the level appropriate to the examinee's performance is reached or the test is completed. If the test-taker does not answer the first questions correctly, or as typically expected in the case of a non-cognitive measure, easier questions are generally presented.
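The branching logic of an adaptive test can be sketched in a few lines of code. The following is an illustrative simplification with a hypothetical item bank of ten difficulty levels and a one-step movement rule; operational adaptive tests select items using far more sophisticated statistical models (e.g., item response theory):

```python
def next_difficulty(current, correct, step=1, lowest=1, highest=10):
    """Return the difficulty level of the next item.

    Moves one step harder after a correct response and one step
    easier after an incorrect one, clamped to the bank's range.
    """
    proposed = current + step if correct else current - step
    return max(lowest, min(highest, proposed))

def administer(responses, start=5):
    """Trace the difficulty levels presented across a session.

    `responses` is the sequence of correct/incorrect answers given;
    returns the difficulty of each item presented, in order.
    """
    levels = [start]
    for correct in responses:
        levels.append(next_difficulty(levels[-1], correct))
    return levels

# A test-taker who answers the first three items correctly and then
# misses two is routed upward, then back down:
print(administer([True, True, True, False, False]))  # [5, 6, 7, 8, 7, 6]
```

Real adaptive tests also decide when to stop (e.g., when the estimate of the test-taker's level is sufficiently precise), a detail omitted here.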

Tests can be administered in written (keyboard or paper-and-pencil) fashion, orally, using an assistive device (most typically for individuals with motor disabilities), or in performance format, as previously noted. It is generally difficult to administer oral or performance tests in a group situation; however, some electronic media are making it possible to administer such tests without human examiners.

Another distinction among measures relates to who the respondent is. In most cases, the test-taker him- or herself responds to the questions posed by the psychologist or physician. In the case of a young child, many individuals with autism, or an individual who has lost language ability, for example, the examiner may need to ask others who know the individual (parents, teachers, spouses, family members) how he or she behaves and to describe the individual's personality, typical behaviors, and so on.

Scoring Differences

Tests are categorized as objectively scored, subjectively scored, or, in some instances, both. An objectively scored instrument is one in which correct answers are counted and either constitute the final score or are converted into it. Such tests may be scored manually or using optical scanning machines, computerized software, software used by other electronic media, or even templates (keys) placed over answer sheets so that a person can count the number of correct answers. Subjectively scored measures, by contrast, rely on professional judgment: examiner ratings and self-report interpretations are determined by the professional using a rubric or scoring system to convert the examinee's responses to a score, whether numerical or not. Sometimes subjective scores may include both quantitative and qualitative summaries or narrative descriptions of the performance of an individual.

Scores on tests are often considered to be norm-referenced (or normative) or criterion-referenced. Norm-referenced cognitive measures (such as college and graduate school admissions measures) inform test-takers where they stand relative to others in the distribution. For example, an applicant to a college may learn that she is at the 60th percentile, meaning that she has scored better than 60 percent of those taking the test and less well than 40 percent of the same norm group. Likewise, most if not all intelligence tests are norm-referenced, and most other ability tests are as well. In recent years there has been more of a call for criterion-referenced tests, especially in education (Hambleton and Pitoniak, 2006). For criterion-referenced tests, one's score is compared not to those of the other members of the test-taking population but to a fixed standard. High school graduation tests, licensure tests, and other tests that decide whether test-takers have met minimal competency requirements are examples of criterion-referenced measures. When one takes a driving test to earn one's driver's license, for example, one does not find out where one's driving falls in the distribution of national or statewide drivers; one simply passes or fails.
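The percentile interpretation described above amounts to a simple counting operation against the norm group's distribution. A minimal sketch, using an invented norm group of ten scores:

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of ten scores; a score of 70 exceeds six
# of them, placing the test-taker at the 60th percentile:
norms = [50, 55, 58, 60, 62, 65, 70, 75, 80, 90]
print(percentile_rank(70, norms))  # 60.0
```

Published norms are, of course, derived from much larger standardization samples, and conventions differ on how ties at the obtained score are counted.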

Test Content

As noted previously, the most important distinction among most psychological tests is whether they are assessing cognitive versus non-cognitive qualities. In clinical psychological and neuropsychological settings such as are the concern of this volume, the most common cognitive tests are intelligence tests, other clinical neuropsychological measures, and performance validity measures. Many tests used by clinical neuropsychologists, psychiatrists, technicians, or others assess specific types of functioning, such as memory or problem solving. Performance validity measures are typically short assessments, sometimes interspersed among components of other assessments, that help the psychologist determine whether the examinee is exerting sufficient effort to perform well and responding to the best of his or her ability. The most common non-cognitive measures in clinical psychology and neuropsychology settings are personality measures and symptom validity measures. Some personality tests, such as the Minnesota Multiphasic Personality Inventory (MMPI), assess the degree to which someone expresses behaviors that are seen as atypical in relation to the norming sample. Other personality tests are more normative and try to provide information about the client to the therapist. Symptom validity measures are scales, like performance validity measures, that may be interspersed throughout a longer assessment to examine whether a person is portraying him- or herself in an honest and truthful manner. Somewhere between these two types of tests—cognitive and non-cognitive—are various measures of adaptive functioning that often include both cognitive and non-cognitive components.


Psychometric Properties of Tests

Psychometrics is the scientific study—including the development, interpretation, and evaluation—of psychological tests and measures used to assess variability in behavior and link such variability to psychological phenomena. In evaluating the quality of psychological measures, we are traditionally concerned primarily with test reliability (i.e., consistency), validity (i.e., accuracy of interpretations and use), and fairness (i.e., equivalence of usage across groups). This section provides a general overview of these concepts to help orient the reader for the ensuing discussions in Chapters 4 and 5. In addition, given the implications of applying psychological measures with subjects from diverse racial and ethnic backgrounds, issues of equivalence and fairness in psychological testing are also presented.


Reliability

Reliability refers to the degree to which scores from a test are stable and results are consistent. When constructs are not reliably measured, the obtained scores will not approximate a true value in relation to the psychological variable being measured. It is important to understand that observed or obtained test scores are considered to be composed of true and error elements. A standard error of measurement is often presented to describe, within a level of confidence (e.g., 95 percent), the range of test scores expected to contain a person's true score; this acknowledges that test scores contain some degree of error and that obtained scores are only estimates of true scores (Geisinger, 2013).
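The link between reliability and the standard error of measurement is commonly expressed as SEM = SD × √(1 − reliability), with a 95 percent confidence band formed as the obtained score ± 1.96 × SEM. A short illustration (the scale parameters and reliability value below are hypothetical):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(obtained, sd, reliability, z=1.96):
    """Approximate 95 percent confidence band around an obtained score."""
    e = sem(sd, reliability)
    return (obtained - z * e, obtained + z * e)

# An IQ-style scale (SD = 15) with reliability .91 has SEM = 4.5,
# so an obtained score of 100 carries a band of roughly 91 to 109:
low, high = confidence_interval(100, 15, 0.91)
print(round(sem(15, 0.91), 2), round(low, 1), round(high, 1))
```

The arithmetic makes the text's point concrete: even a highly reliable instrument yields an obtained score that is an estimate, not the true score itself.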

Reliability is generally assessed in four ways:


Test-retest: Consistency of test scores over time (stability, temporal consistency);


Inter-rater: Consistency of test scores among independent judges;


Parallel or alternate forms: Consistency of scores across different forms of the test (stability and equivalence); and


Internal consistency: Consistency of different items intended to measure the same thing within the test (homogeneity). A special case of internal consistency reliability is split-half where scores on two halves of a single test are compared and this comparison may be converted into an index of reliability.
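The split-half approach can be made concrete: score the odd-numbered and even-numbered items separately, correlate the two half-test scores, and apply the Spearman-Brown formula to estimate the reliability of the full-length test. A minimal sketch with invented response data:

```python
def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(item_matrix):
    """Odd-even split-half reliability, Spearman-Brown corrected.

    `item_matrix[p][i]` is person p's score (0/1) on item i.
    """
    odd = [sum(row[0::2]) for row in item_matrix]
    even = [sum(row[1::2]) for row in item_matrix]
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown step-up

# Three hypothetical test-takers answering four items:
data = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
print(split_half_reliability(data))  # 1.0 (the two halves agree perfectly)
```

The Spearman-Brown correction is needed because the raw half-test correlation understates the reliability of the full-length test.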

A number of factors can affect the reliability of a test's scores. These include the time between two testing administrations, which affects test-retest and alternate-forms reliability, and the similarity of content and of subjects' expectations regarding different elements of the test in alternate-forms, split-half, and internal consistency approaches. In addition, changes in subjects over time, whether introduced by physical ailments, emotional problems, or the subject's environment, as well as test-based factors such as poor test instructions, subjective scoring, and guessing, will also affect test reliability. It is important to note that a test can generate reliable scores in one context and not in another, and that inferences that can be made from different estimates of reliability are not interchangeable (Geisinger, 2013).


Validity

While the scores resulting from a test may be deemed reliable, this finding does not necessarily mean that scores from the test have validity. Validity is defined as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (AERA et al., 2014, p. 11). In discussing validity, it is important to highlight that validity refers not to the measure itself (i.e., a psychological test is not valid or invalid) or to the scores derived from the measure, but rather to the interpretation and use of the measure's scores. To be considered valid, the interpretation of test scores must be grounded in psychological theory and empirical evidence that demonstrates a relationship between the test and what it purports to measure (Furr and Bacharach, 2013; Sireci and Sukin, 2013). Historically, the fields of psychology and education have described three primary types of evidence related to validity (Sattler, 2014; Sireci and Sukin, 2013):


Construct evidence of validity: The degree to which an individual's test scores correlate with the theoretical concept the test is designed to measure (i.e., evidence that scores on a test correlate relatively highly with scores on theoretically similar measures and relatively poorly with scores on theoretically dissimilar measures);


Content evidence of validity: The degree to which the test content represents the targeted subject matter and supports a test's use for its intended purposes; and


Criterion-related evidence of validity: The degree to which the test's score correlates with other measurable, reliable, and relevant variables (i.e., criterion) thought to measure the same construct.

Other kinds of validity with relevance to SSA have been advanced in the literature, but are not completely accepted in professional standards as types of validity per se. These include


Diagnostic validity: The degree to which psychological tests are truly aiding in the formulation of an appropriate diagnosis.


Ecological validity: The degree to which test scores represent everyday levels of functioning (e.g., impact of disability on an individual's ability to function independently).


Cultural validity: The degree to which test content and procedures accurately reflect the sociocultural context of the subjects being tested.

Each of these forms of validity poses complex questions regarding the use of particular psychological measures with the SSA population. For example, ecological validity is especially critical in the use of psychological tests for SSA purposes, given that the focus of the assessment is on examining everyday levels of functioning. Measures like intelligence tests have sometimes been criticized for lacking ecological validity (Groth-Marnat, 2009; Groth-Marnat and Teal, 2000). At the same time, “research suggests that many neuropsychological tests have a moderate level of ecological validity when predicting everyday cognitive functioning” (Chaytor and Schmitter-Edgecombe, 2003, p. 181).

More recent discussions on validity have shifted toward an argument-based approach to validity, using a variety of evidence to build a case for validity of test score interpretation (Furr and Bacharach, 2013). In this approach, construct validity is viewed as an overarching paradigm under which evidence is gathered from multiple sources to build a case for validity of test score interpretation. Five key sources of validity evidence that affect the degree to which a test fulfills its purpose are generally considered (AERA et al., 2014; Furr and Bacharach, 2013; Sireci and Sukin, 2013):


Test content: Does the test content reflect the important facets of the construct being measured? Are the test items relevant and appropriate for measuring the construct and congruent with the purpose of testing?


Relation to other variables: Is there a relationship between test scores and other criterion or constructs that are expected to be related?


Internal structure: Does the actual structure of the test match the theoretically based structure of the construct?


Response processes: Are respondents applying the theoretical constructs or processes the test is designed to measure?


Consequences of testing: What are the intended and unintended consequences of testing?

Standardization and Testing Norms

As part of the development of any psychometrically sound measure, explicit methods and procedures by which tasks should be administered are determined and clearly spelled out. This is what is commonly known as standardization. Typical standardized administration procedures or expectations include (1) a quiet, relatively distraction-free environment, (2) precise reading of scripted instructions, and (3) provision of necessary tools or stimuli. All examiners use such methods and procedures during the process of collecting the normative data, and such procedures normally should be used in any other administration, which enables application of normative data to the individual being evaluated (Lezak et al., 2012).

Standardized tests provide a set of normative data (i.e., norms), or scores derived from groups of people for whom the measure is designed (i.e., the designated population) to which an individual's performance can be compared. Norms consist of transformed scores such as percentiles, cumulative percentiles, and standard scores (e.g., T-scores, Z-scores, stanines, IQs), allowing for comparison of an individual's test results with the designated population. Without standardized administration, the individual's performance may not accurately reflect his or her ability. For example, an individual's abilities may be overestimated if the examiner provides additional information or guidance than what is outlined in the test administration manual. Conversely, a claimant's abilities may be underestimated if appropriate instructions, examples, or prompts are not presented. When nonstandardized administration techniques must be used, norms should be used with caution due to the systematic error that may be introduced into the testing process; this topic is discussed in detail later in the chapter.
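The standard scores mentioned above are linear transformations of the raw score relative to the norm group's mean and standard deviation: z = (raw − mean) / SD, and T = 50 + 10z. A brief illustration with hypothetical norm statistics:

```python
def z_score(raw, norm_mean, norm_sd):
    """Standard (z) score: distance from the norm mean in SD units."""
    return (raw - norm_mean) / norm_sd

def t_score(raw, norm_mean, norm_sd):
    """T-score: the z score rescaled to mean 50, SD 10."""
    return 50 + 10 * z_score(raw, norm_mean, norm_sd)

# Hypothetical norm group with mean 25 and SD 5: a raw score of 30
# lies one standard deviation above the mean (z = 1.0, T = 60).
print(z_score(30, 25, 5), t_score(30, 25, 5))  # 1.0 60.0
```

Because these transformations assume the normative mean and standard deviation apply to the examinee, nonstandardized administration undermines exactly this step: the raw score no longer sits on the same scale as the norm group's scores.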

It is important to clearly understand the population for which a particular test is intended; the standardization sample is another name for the norm group. Norms enable one to make meaningful interpretations of obtained test scores, such as making predictions based on evidence. Developing appropriate norms depends on the size and representativeness of the sample. In general, the more people in the norm group, the closer the approximation to the population distribution, so long as the sample represents the group that will be taking the test.

Norms should be based upon representative samples of individuals from the intended test population, as each person should have an equal chance of being in the standardization sample. Stratified samples enable the test developer to identify particular demographic characteristics represented in the population and more closely approximate these features in proportion to the population. For example, intelligence test scores are often established based upon census-based norming with proportional representation of demographic features including race and ethnic group membership, parental education, socioeconomic status, and geographic region of the country.

When tests are applied to individuals for whom the test was not intended and, hence, were not included as part of the norm group, inaccurate scores and subsequent misinterpretations may result. Tests administered to persons with disabilities often raise complex issues. Test users sometimes use psychological tests that were not developed or normed for individuals with disabilities. It is critical that tests used with such persons (including SSA disability claimants) include attention to representative norming samples; when such norming samples are not available, it is important for the assessor to note that the test or tests used are not based on representative norming samples and the potential implications for interpretation (Turner et al., 2001).

Test Fairness in High-Stakes Testing Decisions

Performance on psychological tests often has significant implications (high stakes) in our society. Tests are in part the gatekeepers for educational and occupational opportunities and play a role in SSA determinations. As such, results of psychological testing may have positive or negative consequences for an individual. Often such consequences are intended; however, there is the possibility for unintended negative consequences. It is imperative that issues of test fairness be addressed so that no individual or group is disadvantaged in the testing process based upon factors unrelated to the areas measured by the test. Biases simply cannot be present in these kinds of professional determinations. Moreover, it is imperative that research demonstrate that measures can be fairly and equivalently used with members of the various subgroups in our population. It is important to note that there are people from many language and cultural groups for whom there are no available tests with appropriately representative norms. As noted above, in such cases it is important for assessors to include a statement about this situation whenever it applies, along with the potential implications for scores and the resultant interpretation.

While all tests reflect what is valued within a particular cultural context (i.e., cultural loading), bias refers to the presence of systematic error in the measurement of a psychological construct. Bias leads to inaccurate test results given that scores reflect either overestimations or underestimations of what is being measured. When bias occurs based upon culturally related variables (e.g., race, ethnicity, social class, gender, educational level) then there is evidence of cultural test bias (Suzuki et al., 2014).

Relevant considerations pertain to issues of equivalence in psychological testing as characterized by the following (Suzuki et al., 2014, p. 260):


  • Functional: Whether the construct being measured occurs with equal frequency across groups;

  • Conceptual: Whether the item information is familiar across groups and means the same thing in various cultures;

  • Scalar: Whether average score differences reflect the same degree, intensity, or magnitude for different cultural groups;

  • Linguistic: Whether the language used has similar meaning across groups; and

  • Metric: Whether the scale measures the same behavioral qualities or characteristics and the measure has similar psychometric properties in different cultures.

It must be established that the measure is operating appropriately in various cultural contexts. Test developers address issues of equivalence through procedures including

  • Expert panel reviews (i.e., professionals review item content and provide informed judgments regarding potential biases);

  • Examination of differential item functioning (DIF) among groups;

  • Statistical procedures allowing comparison of psychometric features of the test (e.g., reliability coefficients) based on different population samples;

  • Exploratory and confirmatory factor analysis, structural equation modeling (i.e., examination of the similarities and differences of the constructs structure), and measurement invariance; and

  • Mean score differences taking into consideration the spread of scores within particular racial and ethnic groups as well as among groups.
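
As one illustration of comparing psychometric features across population samples, the sketch below computes an internal-consistency reliability coefficient (Cronbach's alpha) for two simulated groups. All data are synthetic, and the group difference is built in by assumption; the point is only to show the mechanics of such a comparison, not any real group result.

```python
import random
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of per-item score lists
    (same respondents, in the same order, in each list)."""
    k = len(item_scores)
    item_vars = sum(statistics.pvariance(items) for items in item_scores)
    totals = [sum(resp) for resp in zip(*item_scores)]
    return k / (k - 1) * (1 - item_vars / statistics.pvariance(totals))

random.seed(2)

def simulate_group(n, loading):
    """Respondents share a latent trait; 'loading' controls how strongly
    each of 5 items reflects it (higher loading -> higher reliability).
    These are illustrative assumptions, not real test data."""
    traits = [random.gauss(0, 1) for _ in range(n)]
    return [[loading * t + random.gauss(0, 1) for t in traits]
            for _ in range(5)]

alpha_a = cronbach_alpha(simulate_group(2000, loading=1.0))
alpha_b = cronbach_alpha(simulate_group(2000, loading=0.4))
print(round(alpha_a, 2), round(alpha_b, 2))  # group A is noticeably higher
```

A marked gap between the two coefficients, as constructed here, is the kind of evidence that would prompt closer scrutiny of whether the measure operates equivalently in both groups.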

Cultural equivalence refers to whether “interpretations of psychological measurements, assessments, and observations are similar if not equal across different ethnocultural populations” (Trimble, 2010, p. 316). Cultural equivalence is a higher-order form of equivalence, dependent on a measure meeting specific criteria indicating that it may be appropriately used with cultural groups beyond the one for which it was originally developed. Trimble (2010) notes that there may be upward of 50 types of equivalence that affect interpretive and procedural practices in establishing cultural equivalence.

Item Response Theory and Tests2

For most of the 20th century, the dominant measurement model was called classical test theory. This model was based on the notion that all scores were composed of two components: true score and error. One can imagine a “true score” as a hypothetical value that would represent a person's actual score were there no error present in the assessment (and unfortunately, there is always some error, both random and systematic). The model further assumes that all error is random and that any correlation between error and some other variable, such as true scores, is effectively zero (Geisinger, 2013). The approach leans heavily on reliability theory, which is largely derived from the premises mentioned above.
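
The true-score model can be made concrete with a small simulation; the population values below (a true-score standard deviation of 15 and an error standard deviation of 5) are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

# Classical test theory sketch: each observed score X is modeled as
# X = T + E, where T is the unobservable true score and E is random
# error uncorrelated with T.  Reliability is then var(T) / var(X).
random.seed(0)

true_scores = [random.gauss(100, 15) for _ in range(10_000)]  # var(T) ~ 225
errors = [random.gauss(0, 5) for _ in range(10_000)]          # var(E) ~ 25
observed = [t + e for t, e in zip(true_scores, errors)]

reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(round(reliability, 2))  # near the theoretical 225 / (225 + 25) = 0.90
```

Because the error is independent of the true scores, the empirical ratio lands close to the theoretical reliability; in practice, of course, true scores are never observed and reliability must be estimated indirectly.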

Since the 1950s, and especially since the 1970s, a newer, mathematically sophisticated family of models has developed, called item response theory (IRT). The premise of IRT models is most easily understood in the context of cognitive tests, where each question has a correct answer. The simplest IRT model is based on the notion that the answering of a question depends on only two factors: the difficulty of the question and the ability level of the test-taker. Computer-adaptive testing estimates the score of the test-taker after each response to a question and adjusts the administration of the next question accordingly. For example, if a test-taker answers a question correctly, he or she is likely to receive a more difficult question next. If, on the other hand, one answers incorrectly, he or she is more likely to receive an easier question, with the “running score” held by the computer adjusted accordingly. Such computer-adaptive tests have been found to be very efficient.
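
A minimal sketch of this adaptive logic, assuming a one-parameter (Rasch) model and a toy five-item bank. The fixed step-size update is a deliberate simplification standing in for the maximum-likelihood or Bayesian scoring that operational computer-adaptive programs use.

```python
import math

def p_correct(theta, b):
    """One-parameter (Rasch) model: the probability of a correct answer
    depends only on ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def next_item(theta_hat, item_bank, used):
    """Pick the unused item whose difficulty is closest to the
    current ability estimate."""
    return min((b for b in item_bank if b not in used),
               key=lambda b: abs(b - theta_hat))

def update(theta_hat, correct, step=0.5):
    """Toy update rule: nudge the running estimate up after a correct
    answer, down after an incorrect one."""
    return theta_hat + step if correct else theta_hat - step

item_bank = [-2.0, -1.0, 0.0, 1.0, 2.0]   # item difficulties (illustrative)
theta_hat, used = 0.0, set()

item = next_item(theta_hat, item_bank, used)   # first item matches theta_hat: 0.0
used.add(item)
theta_hat = update(theta_hat, correct=True)    # correct answer -> estimate rises
print(next_item(theta_hat, item_bank, used))   # 1.0 (a harder item comes next)
```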

IRT models have made the equating of test forms far easier. Equating permits one to use different forms of the same examination, composed of different test items, to yield fully comparable scores despite slightly different item difficulties across forms. To place item difficulties and test-takers' ability scores on a common scale, one needs some items that appear on the various forms; these common items are known as anchor items. Using such items, one can essentially establish a fixed reference group and base judgments about other groups on these values.
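
One simple linking method consistent with this description is mean-sigma equating on the anchor items: the anchors' difficulty estimates from each form determine the linear transformation that places one form's scale onto the other's. The difficulty values below are invented purely for illustration.

```python
import statistics

# Anchor items appear on both Form A and Form B; each form yields its own
# difficulty estimates for them (made-up values on an IRT logit scale).
anchors_form_a = [-1.2, -0.4, 0.3, 1.1]   # anchor difficulties, Form A scale
anchors_form_b = [-0.9, -0.1, 0.6, 1.4]   # same items, Form B scale

# Mean-sigma linking: match the anchors' spread and center across forms.
slope = statistics.stdev(anchors_form_a) / statistics.stdev(anchors_form_b)
intercept = (statistics.mean(anchors_form_a)
             - slope * statistics.mean(anchors_form_b))

def to_form_a_scale(value):
    """Re-express a Form B difficulty (or ability) on Form A's scale."""
    return slope * value + intercept

# An anchor item should land back (approximately) on its own Form A value:
print(round(to_form_a_scale(-0.9), 2))  # -1.2
```

Once the transformation is fixed, every Form B parameter and ability estimate can be carried onto Form A's scale, which is what makes scores from the two forms comparable.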

As noted above, there are a number of common IRT models. Among the most common are the one-, two-, and three-parameter models. The one-parameter model is the one already described; the only item parameter is item difficulty. A two-parameter model adds a second parameter, item discrimination. Item discrimination is the ability of an item to differentiate those who possess the measured ability in high degree from those who lack it. Such two-parameter models are often used for tests like essay tests, where one cannot achieve a high score by guessing or other means of answering correctly. The three-parameter IRT model contains a third parameter, related to chance-level correct responding. This parameter is sometimes called the pseudo-guessing parameter, and this model is generally used for large-scale multiple-choice testing programs.
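
The three models differ only in which item parameters they include, which a single formula can show; the parameter values used below are illustrative.

```python
import math

def irt_probability(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic (3PL) IRT model.
    theta: test-taker ability
    b: item difficulty (the one-parameter model's only item parameter)
    a: item discrimination (added by the two-parameter model)
    c: pseudo-guessing parameter (added by the three-parameter model)
    Setting a=1 and c=0 recovers the one-parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With a four-option multiple-choice item (c = 0.25), even a very
# low-ability test-taker retains roughly a 25% chance of a correct answer:
print(round(irt_probability(theta=-5.0, a=1.5, b=0.0, c=0.25), 2))  # 0.25
# A test-taker whose ability equals the difficulty sits halfway between
# the guessing floor and certainty: 0.25 + 0.75/2 = 0.625
print(round(irt_probability(theta=0.0, a=1.5, b=0.0, c=0.25), 3))   # 0.625
```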

These models, because of their lessened reliance on the sampling of test-takers, are very useful in the equating of tests, that is, the setting of scores to be equivalent regardless of which form of the test one takes. In some high-stakes admissions tests such as the GRE, MCAT, and GMAT, for example, forms are scored and equated by means of IRT methods, which can perform such operations more efficiently and accurately than can be done with classical statistics.


Test User Qualifications

The test user is generally considered the person responsible for appropriate use of psychological tests, including selection, administration, interpretation, and use of results (AERA et al., 2014). Guidelines on test user qualifications address the purchase of psychological measures, specifying levels of training, educational degree, areas of knowledge within the domain of assessment (e.g., ethical administration, scoring, and interpretation of clinical assessments), certification, licensure, and membership in professional organizations. Test user qualifications require training in the responsible use of tests (e.g., ethics) and, in particular, psychometric and measurement knowledge and skills (i.e., descriptive statistics, reliability and measurement error, validity and the meaning of test scores, normative interpretation of test scores, selection of appropriate tests, and test administration procedures). In addition, test user guidelines highlight the importance of understanding the impact of ethnic, racial, cultural, gender, age, educational, and linguistic characteristics on the selection and use of psychological tests (Turner et al., 2001).

Test publishers provide detailed manuals regarding the operational definition of the construct being assessed, the norming sample, the reading level of test items, completion time, administration, and the scoring and interpretation of test scores. Directions presented to the examinee are provided verbatim, and sample responses are often provided to assist the examiner in determining a right or wrong response or in awarding points for a particular answer. Ethical and legal knowledge regarding assessment competencies, confidentiality of test information, test security, and the legal rights of test-takers is imperative. Resources like the Mental Measurements Yearbook (MMY) provide descriptive information and evaluative reviews of commercially available tests to promote and encourage informed test selection (Buros, 2015). To be included, tests must contain sufficient documentation regarding their psychometric quality (e.g., validity, reliability, norming).

Test Administration and Interpretation

In accordance with the Standards for Educational and Psychological Testing (AERA et al., 2014) and the APA's Guidelines for Test User Qualifications (Turner et al., 2001), many publishers of psychological tests employ a tiered system of qualification levels (generally A, B, C) required for the purchase, administration, and interpretation of such tests (e.g., PAR, n.d.; Pearson Education, 2015). Many instruments, such as those discussed throughout this report, would be considered qualification level C assessment methods, generally requiring an advanced degree, specialized psychometric and measurement knowledge, and formal training in administration, scoring, and interpretation. However, some may have less stringent requirements, for example, a bachelor's or master's degree in a related field and specialized training in psychometric assessment (often classified level B), or no special requirements (often classified level A) for purchase and use. While such categories serve as a general guide for necessary qualifications, individual test manuals provide additional detail and specific qualifications necessary for administration, scoring, and interpretation of the test or measure.

Given the need for the use of standardized procedures, any person administering cognitive or neuropsychological measures must be well trained in standardized administration protocols. He or she should possess the interpersonal skills necessary to build rapport with the individual being tested in order to foster cooperation and maximal effort during testing. Additionally, individuals administering tests should understand important psychometric properties, including validity and reliability, as well as factors that could emerge during testing to place either at risk. Many doctoral-level psychologists are well trained in test administration; in general, psychologists from clinical, counseling, school, or educational graduate psychology programs receive training in psychological test administration. For cases in which cognitive deficits are being evaluated, a neuropsychologist may be needed to most accurately evaluate cognitive functioning (see Chapter 5 for a more detailed discussion on administration and interpretation of cognitive tests). The use of non-doctoral-level psychometrists or technicians in psychological and neuropsychological test administration and scoring is also a widely accepted standard of practice (APA, 2010; Brandt and van Gorp, 1999; Pearson Education, 2015). Psychometrists are often bachelor's- or master's-level individuals who have received additional specialized training in standardized test administration and scoring. They do not practice independently or interpret test scores, but rather work under the close supervision and direction of doctoral-level clinical psychologists or neuropsychologists.

Interpretation of testing results requires a higher degree of clinical training than administration alone. Threats to the validity of any psychological measure of a self-report nature oblige the test interpreter to understand the test and principles of test construction. In fact, interpreting test results without such knowledge would violate the ethics code established for the profession of psychology (APA, 2010). SSA requires that psychological testing be “individually administered by a qualified specialist … currently licensed or certified in the state to administer, score, and interpret psychological tests and have the training and experience to perform the test” (SSA, n.d.). Most doctoral-level clinical psychologists who have been trained in psychometric test administration are also trained in test interpretation. SSA (n.d.) also requires that individuals who administer more specific cognitive or neuropsychological evaluations “be properly trained in this area of neuroscience.” As such, clinical neuropsychologists—individuals who have been specifically trained to interpret testing results within the framework of brain-behavior relationships and who have achieved certain educational and training benchmarks as delineated by national professional organizations—may be required to interpret tests of a cognitive nature (AACN, 2007; NAN, 2001).

Use of Interpreters and Other Nonstandardized Test Administration Techniques

Modification of procedures, including the use of interpreters and the administration of nonstandardized assessment procedures, may pose unique challenges to the psychologist by potentially introducing systematic error into the testing process. Such errors may be related to language, the use of translators, or examinee abilities (e.g., sensory, perceptual, and/or motor capacity). For example, if one uses a language interpreter, the potential for mistranslation may yield inaccurate scores. Use of translators is a nonpreferred option, and assessors need to be familiar with both the language and culture from which an individual comes to properly interpret test results, or even infer whether specific measures are appropriate. The adaptation of tests has become big business for testing companies, and many tests, most often measures developed in English for use in the United States, are being adapted for use in other countries. Such measures require changes in language, but translators must also be knowledgeable about culture and the environment of the region from which a person comes (ITC, 2005).

When procedures are modified to accommodate limitations in sensory, perceptual, or motor abilities, one may be altering the construct that the test is designed to measure. In both of these examples, one could be obtaining scores for which there is no referenced normative group to allow for accurate interpretation of results. While a thorough discussion of these concepts is beyond the scope of this report and is presented elsewhere, when a test is administered under a procedure that departs from the one established in the standardization process, the conclusions drawn must acknowledge the potential for error thereby introduced.


As noted in Chapter 2, SSA indicates that objective medical evidence may include the results of standardized psychological tests. Given the great variety of psychological tests, some are more objective than others. Whether a psychological test is appropriately considered objective has much to do with the process of scoring. For example, unstructured measures that call for open-ended responding rely on professional judgment and interpretation in scoring; thus, such measures are considered less than objective. In contrast, standardized psychological tests and measures, such as those discussed in the ensuing chapters, are structured and objectively scored. In the case of non-cognitive self-report measures, the respondent generally answers questions regarding typical behavior by choosing from a set of predetermined answers. With cognitive tests, the respondent answers questions or solves problems, which usually have correct answers, as well as he or she possibly can. Such measures generally provide a set of normative data (i.e., norms), or scores derived from groups of people for whom the measure is designed (i.e., the designated population), to which an individual's responses or performance can be compared. Therefore, standardized psychological tests and measures rely less on clinical judgment and are considered to be more objective than those that depend on subjective scoring. Unlike measurements such as weight or blood pressure, standardized psychological tests require the individual's cooperation with respect to self-report or performance on a task. The inclusion of validity testing in the test or test battery, discussed further in Chapters 4 and 5, allows for greater confidence in the test results. Standardized psychological tests that are appropriately administered and interpreted can be considered objective evidence.
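
The normative comparison described above amounts to converting a raw score to a standardized score against the norming sample. The norm mean, standard deviation, and raw score below are hypothetical, not drawn from any actual test.

```python
import math

# Illustrative normative comparison: a raw score is expressed as a z score
# relative to the norming sample, then as an approximate percentile
# (assuming the norm distribution is roughly normal).
norm_mean, norm_sd = 50.0, 10.0   # hypothetical norming-sample statistics
raw_score = 35.0                  # hypothetical individual score

z = (raw_score - norm_mean) / norm_sd
percentile = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

print(round(z, 1), f"{percentile:.0%}")  # -1.5, about the 7th percentile
```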

The use of psychological tests in disability determinations has critical implications for clients. As noted earlier, the issue of ecological validity (i.e., whether test performance accurately reflects real-world behavior) is of primary importance in SSA determinations. Two approaches have been identified in relation to the ecological validity of neuropsychological assessment. The first focuses on “how well the test captures the essence of everyday cognitive skills” in order to “identify people who have difficulty performing real-world tasks, regardless of the etiology of the problem” (i.e., verisimilitude), and the second “relates performance on traditional neuropsychological tests to measures of real-world functioning, such as employment status, questionnaires, or clinician ratings” (i.e., veridicality) (Chaytor and Schmitter-Edgecombe, 2003, pp. 182–183). Establishing ecological validity is a complicated endeavor given the potential effect of non-cognitive factors (e.g., emotional, physical, and environmental) on test and everyday performance. Specific concerns regarding test performance include that (1) the test environment is often not representative (i.e., artificial), (2) testing yields only samples of behavior that may fluctuate depending on context, and (3) clients may possess compensatory strategies that cannot be employed during the testing situation, so that obtained scores underestimate the test-taker's abilities.

Activities of daily living (ADLs) and the client's likelihood of returning to work are important considerations in disability determinations. Occupational status, however, is complex and often multidetermined, requiring that psychological test data be complemented by other sources of information in the evaluation process (e.g., observation, informant ratings, environmental assessments) (Chaytor and Schmitter-Edgecombe, 2003). Table 3-1 highlights major mental disorders, relevant types of psychological measures, and domains of functioning.


TABLE 3-1 Listings for Mental Disorders and Types of Psychological Tests.

Determination of disability is dependent on two key factors: the existence of a medically determinable impairment and associated limitations on functioning. As discussed in detail in Chapter 2, applications for disability follow a five-step sequential disability determination process. At Step 3 in the process, the applicant's reported impairments are evaluated to determine whether they meet or equal the medical criteria codified in SSA's Listing of Impairments. This includes specific symptoms, signs, and laboratory findings that substantiate the existence of an impairment (i.e., Paragraph A criteria) and evidence of associated functional limitations (i.e., Paragraph B criteria). If an applicant's impairments meet or equal the listing criteria, the claim is allowed. If not, residual functional capacity, including mental residual functional capacity, is assessed. This includes whether the applicant has the capacity for past work (Step 4) or any work in the national economy (Step 5).

SSA uses a standard assessment that examines functioning in four domains: understanding and memory, sustained concentration and persistence, social interaction, and adaptation. Psychological testing may play a key role in understanding a client's functioning in each of these areas. Box 3-1 describes ways in which these four areas of core mental residual functional capacity are assessed ecologically. Psychological assessments often address these areas in a more structured manner through interviews, standardized measures, checklists, observations, and other assessment procedures.

BOX 3-1

Descriptions of Tests by Four Areas of Core Mental Residual Functional Capacity (e.g., remembering locations and work-like procedures; understanding and remembering very short and simple instructions).

This chapter has identified some of the basic foundations underlying the use of psychological tests, including basic psychometric principles and issues regarding test fairness. Applications of tests can inform disability determinations. The next two chapters build on this overview, examining the types of psychological tests that may be useful in this process, including a review of selected individual tests that have been developed for measuring validity of presentation. Chapter 4 focuses on non-cognitive, self-report measures and symptom validity tests. Chapter 5 then focuses on cognitive tests and associated performance validity tests. Strengths and limitations of various instruments are offered in order to subsequently explore the relevance of different types of tests for different claims, by category of disorder, with a focus on establishing the validity of the client's claim.


  • AACN (American Academy of Clinical Neuropsychology). AACN practice guidelines for neuropsychological assessment and consultation. The Clinical Neuropsychologist. 2007;21(2):209–231. [PubMed: 17455014]

  • AERA (American Educational Research Association), APA (American Psychological Association), and NCME (National Council on Measurement in Education). Standards for educational and psychological testing. Washington, DC: AERA; 2014.

  • APA. Ethical principles of psychologists and code of conduct. 2010. [March 9, 2015]. http://www​ .

  • Brandt J, van Gorp W. American Academy of Clinical Neuropsychology policy on the use of non-doctoral-level personnel in conducting clinical neuropsychological evaluations. The Clinical Neuropsychologist. 1999;13(4):385.

  • Buros Center for Testing. Test reviews and information. 2015. [March 19, 2015]. http://buros.org/test-reviews-information .

  • Chaytor N, Schmitter-Edgecombe M. The ecological validity of neuropsychological tests: A review of the literature on everyday cognitive skills. Neuropsychology Review. 2003;13(4):181–197. [PubMed: 15000225]

  • Cronbach LJ. Essentials of psychological testing. New York: Harper; 1949.

  • Cronbach LJ. Essentials of psychological testing. 2nd. Oxford, England: Harper; 1960.

  • De Ayala RJ. Theory and practice of item response theory. New York: Guilford Publications; 2009.

  • DeMars C. Item response theory. New York: Oxford University Press; 2010.

  • Furr RM, Bacharach VR. Psychometrics: An introduction. Thousand Oaks, CA: Sage Publications, Inc.; 2013.

  • Geisinger KF. Reliability. In: Geisinger KF, Bracken BA, Carlson JF, Hansen JC, Kuncel NR, Reise SP, Rodriguez MC, editors. APA handbook of testing and assessment in psychology, Vol. 1. Washington, DC: APA; 2013.

  • Groth-Marnat G. Handbook of psychological assessment. Hoboken, NJ: John Wiley & Sons; 2009.

  • Groth-Marnat G, Teal M. Block design as a measure of everyday spatial ability: A study of ecological validity. Perceptual and Motor Skills. 2000;90(2):522–526. [PubMed: 10833749]

  • Hambleton RK, Pitoniak MJ. Setting performance standards. Educational Measurement. 2006;4:433–470.

  • ITC (International Test Commission). ITC guidelines for translating and adapting tests. Geneva, Switzerland: ITC; 2005.

  • Lezak M, Howieson D, Bigler E, Tranel D. Neuropsychological assessment. 5th. New York: Oxford University Press; 2012.

  • NAN (National Academy of Neuropsychology). NAN definition of a clinical neuropsychologist: Official position of the National Academy of Neuropsychology. 2001. [November 25, 2014]. https://www.nanonline.org/docs/PAIC/PDFs/NANPositionDefNeuro.pdf .

  • PAR (Psychological Assessment Resources). Qualifications levels. 2015. [January 5, 2015]. http://www4​​/Supp/Qualifications.aspx .

  • Pearson Education. Qualifications policy. 2015. [January 5, 2015]. http://www.pearsonclinical.com/psychology/qualifications.html .

  • Sattler JM. Foundations of behavioral, social, and clinical assessment of children. 6th. La Mesa, CA: Jerome M. Sattler, Publisher, Inc.; 2014.

  • Sireci SG, Sukin T. Test validity. In: Geisinger KF, Bracken BA, Carlson JF, Hansen JC, Kuncel NR, Reise SP, Rodriguez MC, editors. APA handbook of testing and assessment in psychology, Vol. 1. Washington, DC: APA; 2013.

  • SSA (Social Security Administration). Disability evaluation under social security—Part III: Listing of impairments—Adult listings (Part A)—section 12.00 mental disorders. n.d. [November 14, 2014]. http://www​​/professionals/bluebook/12​.00-MentalDisorders-Adult.htm .

  • Suzuki LA, Naqvi S, Hill JS. Assessing intelligence in a cultural context. In: Leong FTL, Comas-Diaz L, Nagayama Hall GC, McLoyd VC, Trimble JE, editors. APA handbook of multicultural psychology, Vol. 1. Washington, DC: APA; 2014.

  • Trimble JE. Encyclopedia of cross-cultural school psychology. New York: Springer; 2010. Cultural measurement equivalence; pp. 316–318.

  • Turner SM, DeMers ST, Fox HR, Reed G. APA's guidelines for test user qualifications: An executive summary. American Psychologist. 2001;56(12):1099.

  • Weiner IB. The assessment process. In: Weiner IB, editor. Handbook of psychology. Hoboken, NJ: John Wiley & Sons; 2003.


This may be in comparison to a nationally representative norming sample, or with certain tests or measures, such as the MMPI, particular clinically diagnostic samples.


The brief overview presented here draws on the works of De Ayala (2009) and DeMars (2010), to which the reader is directed for additional information.

By Eva L. Baker, Paul E. Barton, Linda Darling-Hammond, Edward Haertel, Helen F. Ladd, Robert L. Linn, Diane Ravitch, Richard Rothstein, Richard J. Shavelson, and Lorrie A. Shepard

Executive summary

Every classroom should have a well-educated, professional teacher, and school systems should recruit, prepare, and retain teachers who are qualified to do the job. Yet in practice, American public schools generally do a poor job of systematically developing and evaluating teachers.

Many policy makers have recently come to believe that this failure can be remedied by calculating the improvement in students’ scores on standardized tests in mathematics and reading, and then relying heavily on these calculations to evaluate, reward, and remove the teachers of these tested students.

While there are good reasons for concern about the current system of teacher evaluation, there are also good reasons to be concerned about claims that measuring teachers’ effectiveness largely by student test scores will lead to improved student achievement. If new laws or policies specifically require that teachers be fired if their students’ test scores do not rise by a certain amount, then more teachers might well be terminated than is now the case. But there is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones. There is also little or no evidence for the claim that teachers will be more motivated to improve student learning if teachers are evaluated or monetarily rewarded for student test score gains.

A review of the technical evidence leads us to conclude that, although standardized test scores of students are one piece of information for school leaders to use to make judgments about teacher effectiveness, such scores should be only a part of an overall comprehensive evaluation. Some states are now considering plans that would give as much as 50% of the weight in teacher evaluation and compensation decisions to scores on existing tests of basic skills in math and reading. Based on the evidence, we consider this unwise. Any sound evaluation will necessarily involve a balancing of many factors that provide a more accurate view of what teachers in fact do in the classroom and how that contributes to student learning.

Evidence about the use of test scores to evaluate teachers

Recent statistical advances have made it possible to look at student achievement gains after adjusting for some student and school characteristics. These approaches, which measure growth using “value-added modeling” (VAM), are fairer comparisons of teachers than judgments based on their students’ test scores at a single point in time or comparisons of student cohorts that involve different students at two points in time. VAM methods have also contributed to stronger analyses of school progress, program influences, and the validity of evaluation methods than were previously possible.

Nonetheless, there is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed.

For a variety of reasons, analyses of VAM results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. VAM estimates have proven to be unstable across statistical models, years, and classes that teachers teach. One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time and raises questions about whether what is measured is largely a “teacher effect” or the effect of a wide variety of other factors.
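
The kind of year-to-year churn described above can be reproduced with a toy simulation. All numbers below are assumptions, not estimates from any real district: each simulated teacher has a stable "true effect," but each year's value-added estimate adds noise whose size rivals the true differences between teachers.

```python
import random

random.seed(1)
n_teachers = 1000
n_top = n_teachers // 5  # top quintile = 200 teachers

# Stable true effects plus independent year-specific estimation noise.
true_effect = [random.gauss(0, 1) for _ in range(n_teachers)]
year1 = [t + random.gauss(0, 1.5) for t in true_effect]  # noisy estimate, year 1
year2 = [t + random.gauss(0, 1.5) for t in true_effect]  # noisy estimate, year 2

def top_quintile(scores):
    """Indices of the teachers ranked in the top 20% on these estimates."""
    cutoff = sorted(scores, reverse=True)[n_top - 1]
    return {i for i, s in enumerate(scores) if s >= cutoff}

stayers = top_quintile(year1) & top_quintile(year2)
print(f"{len(stayers) / n_top:.0%} of year-1 top-quintile teachers "
      "remain in the top quintile in year 2")
```

Even though every simulated teacher's true quality is perfectly constant, well under half of the year-1 top quintile stays on top in year 2, echoing the instability the empirical studies report.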

A study designed to test this question used VAM methods to assign effects to teachers after controlling for other factors, but applied the model backwards to see if credible results were obtained. Surprisingly, it found that students’ fifth grade teachers were good predictors of their fourth grade test scores. Inasmuch as a student’s later fifth grade teacher cannot possibly have influenced that student’s fourth grade performance, this curious result can only mean that VAM results are based on factors other than teachers’ actual effectiveness.

VAM’s instability can result from differences in the characteristics of students assigned to particular teachers in a particular year, from small samples of students (made even less representative in schools serving disadvantaged students by high rates of student mobility), from other influences on student learning both inside and outside school, and from tests that are poorly lined up with the curriculum teachers are expected to cover, or that do not measure the full range of achievement of students in the class.

For these and other reasons, the research community has cautioned against the heavy reliance on test scores, even when sophisticated VAM methods are used, for high stakes decisions such as pay, evaluation, or tenure. For instance, the Board on Testing and Assessment of the National Research Council of the National Academy of Sciences stated,

…VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.

A review of VAM research from the Educational Testing Service’s Policy Information Center concluded,

VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.

And RAND Corporation researchers reported that,

The estimates from VAM modeling of achievement will often be too imprecise to support some of the desired inferences…

and that

The research base is currently insufficient to support the use of VAM for high-stakes decisions about individual teachers or schools.

Factors that influence student test score gains attributed to individual teachers

A number of factors have been found to have strong influences on student learning gains, aside from the teachers to whom their scores would be attached. These include the influences of students’ other teachers—both previous teachers and, in secondary schools, current teachers of other subjects—as well as tutors or instructional specialists, who have often been found to have very large influences on achievement gains. These factors also include school conditions—such as the quality of curriculum materials, specialist or tutoring supports, class size, and other factors that affect learning. In schools that have adopted pull-out, team teaching, or block scheduling practices, individual teacher “effects” can be isolated only inaccurately for evaluation, pay, or disciplinary purposes.

Student test score gains are also strongly influenced by school attendance and a variety of out-of-school learning experiences at home, with peers, at museums and libraries, in summer programs, on-line, and in the community. Well-educated and supportive parents can help their children with homework and secure a wide variety of other advantages for them. Other children have parents who, for a variety of reasons, are unable to support their learning academically. Student test score gains are also influenced by family resources, student health, family mobility, and the influence of neighborhood peers and of classmates who may be relatively more advantaged or disadvantaged.

Teachers’ value-added evaluations in low-income communities can be further distorted by the summer learning loss their students experience between the time they are tested in the spring and the time they return to school in the fall. Research shows that summer gains and losses are quite substantial. A research summary concludes that while students overall lose an average of about one month in reading achievement over the summer, lower-income students lose significantly more, and middle-income students may actually gain in reading proficiency over the summer, creating a widening achievement gap. Indeed, researchers have found that three-fourths of schools identified as being in the bottom 20% of all schools, based on the scores of students during the school year, would not be so identified if differences in learning outside of school were taken into account. Similar conclusions apply to the bottom 5% of all schools.

For these and other reasons, even when methods are used to adjust statistically for student demographic factors and school differences, teachers have been found to receive lower “effectiveness” scores when they teach new English learners, special education students, and low-income students than when they teach more affluent and educationally advantaged students. The nonrandom assignment of students to classrooms and schools—and the wide variation in students’ experiences at home and at school—mean that teachers cannot be accurately judged against one another by their students’ test scores, even when efforts are made to control for student characteristics in statistical models.

Recognizing the technical and practical limitations of what test scores can accurately reflect, we conclude that changes in test scores should be used only as a modest part of a broader set of evidence about teacher practice.

The potential consequences of the inappropriate use of test-based teacher evaluation

Besides concerns about statistical methodology, other practical and policy considerations weigh against heavy reliance on student test scores to evaluate teachers. Research shows that an excessive focus on basic math and reading scores can lead to narrowing and over-simplifying the curriculum to only the subjects and formats that are tested, reducing the attention to science, history, the arts, civics, and foreign language, as well as to writing, research, and more complex problem-solving tasks.

Tying teacher evaluation and sanctions to test score results can discourage teachers from wanting to work in schools with the neediest students, while the large, unpredictable variation in the results and their perceived unfairness can undermine teacher morale. Surveys have found that teacher attrition and demoralization have been associated with test-based accountability efforts, particularly in high-need schools.

Individual teacher rewards based on comparative student test results can also create disincentives for teacher collaboration. Better schools are collaborative institutions where teachers work across classroom and grade-level boundaries toward the common goal of educating all children to their maximum potential. A school will be more effective if its teachers are more knowledgeable about all students and can coordinate efforts to meet students’ needs.

Some other approaches, with less reliance on test scores, have been found to improve teachers’ practice while identifying differences in teachers’ effectiveness. They use systematic observation protocols with well-developed, research-based criteria to examine teaching, including observations or videotapes of classroom practice, teacher interviews, and artifacts such as lesson plans, assignments, and samples of student work. Quite often, these approaches incorporate several ways of looking at student learning over time in relation to a teacher’s instruction.

Evaluation by competent supervisors and peers, employing such approaches, should form the foundation of teacher evaluation systems, with a supplemental role played by multiple measures of student learning gains that, where appropriate, could include test scores. Some districts have found ways to identify, improve, and as necessary, dismiss teachers using strategies like peer assistance and evaluation that offer intensive mentoring and review panels. These and other approaches should be the focus of experimentation by states and districts.

Adopting an invalid teacher evaluation system and tying it to rewards and sanctions is likely to lead to inaccurate personnel decisions and to demoralize teachers, causing talented teachers to avoid high-needs students and schools, or to leave the profession entirely, and discouraging potentially effective teachers from entering it. Legislatures should not mandate a test-based approach to teacher evaluation that is unproven and likely to harm not only teachers, but also the children they instruct.


Every classroom should have a well-educated, professional teacher. For that to happen, school systems should recruit, prepare, and retain teachers who are qualified to do the job. Once in the classroom, teachers should be evaluated on a regular basis in a fair and systematic way. Effective teachers should be retained, and those with remediable shortcomings should be guided and trained further. Ineffective teachers who do not improve should be removed.

In practice, American public schools generally do a poor job of systematically developing and evaluating teachers. School districts often fall short in efforts to improve the performance of less effective teachers, and failing that, of removing them. Principals typically have too broad a span of control (frequently supervising as many as 30 teachers), and too little time and training to do an adequate job of assessing and supporting teachers. Many principals are themselves unprepared to evaluate the teachers they supervise. Due process requirements in state law and union contracts are sometimes so cumbersome that terminating ineffective teachers can be quite difficult, except in the most extreme cases. In addition, some critics believe that typical teacher compensation systems provide teachers with insufficient incentives to improve their performance.

In response to these perceived failures of current teacher policies, the Obama administration encourages states to make greater use of students’ test results to determine a teacher’s pay and job tenure. Some advocates of this approach expect the provision of performance-based financial rewards to induce teachers to work harder and thereby increase their effectiveness in raising student achievement. Others expect that the apparent objectivity of test-based measures of teacher performance will permit the expeditious removal of ineffective teachers from the profession and will encourage less effective teachers to resign if their pay stagnates. Some believe that the prospect of higher pay for better performance will attract more effective teachers to the profession and that a flexible pay scale, based in part on test-based measures of effectiveness, will reduce the attrition of more qualified teachers whose commitment to teaching will be strengthened by the prospect of greater financial rewards for success.

Encouragement from the administration and pressure from advocates have already led some states to adopt laws that require greater reliance on student test scores in the evaluation, discipline, and compensation of teachers. Other states are considering doing so.

Reasons for skepticism

While there are many reasons for concern about the current system of teacher evaluation, there are also reasons to be skeptical of claims that measuring teachers’ effectiveness by student test scores will lead to the desired outcomes. To be sure, if new laws or district policies specifically require that teachers be fired if their students’ test scores do not rise by a certain amount or reach a certain threshold, then more teachers might well be terminated than is now the case. But there is no current evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones. Nor is there empirical verification for the claim that teachers will improve student learning if teachers are evaluated based on test score gains or are monetarily rewarded for raising scores.

The limited existing indirect evidence on this point, which emerges from the country’s experience with the No Child Left Behind (NCLB) law, does not provide a very promising picture of the power of test-based accountability to improve student learning. NCLB has used student test scores to evaluate schools, with clear negative sanctions for schools (and, sometimes, their teachers) whose students fail to meet expected performance standards. We can judge the success (or failure) of this policy by examining results on the National Assessment of Educational Progress (NAEP), a federally administered test with low stakes, given to a small (but statistically representative) sample of students in each state.

The NCLB approach of test-based accountability promised to close achievement gaps, particularly for minority students. Yet although there has been some improvement in NAEP scores for African Americans since the implementation of NCLB, the rate of improvement was not much better in the post-NCLB period than in the pre-NCLB period, and in half the available cases it was worse. Scores rose at a much more rapid rate before NCLB in fourth grade math and in eighth grade reading, and rose faster after NCLB in fourth grade reading and slightly faster in eighth grade math. Furthermore, in fourth and eighth grade reading and math, white students’ annual achievement gains were lower after NCLB than before, in some cases considerably lower. Table 1 displays rates of NAEP test score improvement for African American and white students both before and after the enactment of NCLB. These data do not support the view that test-based accountability increases learning gains.

[Table 1: Annual rates of NAEP test score improvement for African American and white students, before and after the enactment of NCLB]

Table 1 shows only simple annual rates of growth, without statistical controls. A recent careful econometric study of the causal effects of NCLB concluded that during the NCLB years, there were noticeable gains for students overall in fourth grade math achievement, smaller gains in eighth grade math achievement, but no gains at all in fourth or eighth grade reading achievement. The study did not compare pre- and post-NCLB gains. The study concludes, “The lack of any effect in reading, and the fact that the policy appears to have generated only modestly larger impacts among disadvantaged subgroups in math (and thus only made minimal headway in closing achievement gaps), suggests that, to date, the impact of NCLB has fallen short of its extraordinarily ambitious, eponymous goals.”1

Such findings provide little support for the view that test-based incentives for schools or individual teachers are likely to improve achievement, or for the expectation that such incentives for individual teachers will suffice to produce gains in student learning. As we show in what follows, research and experience indicate that approaches to teacher evaluation that rely heavily on test scores can lead to narrowing and over-simplifying the curriculum, and to misidentifying both successful and unsuccessful teachers. These and other problems can undermine teacher morale, as well as provide disincentives for teachers to take on the neediest students. When attached to individual merit pay plans, such approaches may also create disincentives for teacher collaboration. These negative effects can result from both the statistical and the practical difficulties of evaluating teachers by their students’ test scores.

A second reason to be wary of evaluating teachers by their students’ test scores is that so much of the promotion of such approaches is based on a faulty analogy—the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education. Rather, private-sector managers almost always evaluate their professional and lower-management employees based on qualitative reviews by supervisors; quantitative indicators are used sparingly and in tandem with other evidence. Management experts warn against significant use of quantitative measures for making salary or bonus decisions.2 The national economic catastrophe that resulted from tying Wall Street employees’ compensation to short-term gains rather than to longer-term (but more difficult-to-measure) goals is a particularly stark example of a system design to be avoided.

Other human service sectors, public and private, have also experimented with rewarding professional employees by simple measures of performance, with comparably unfortunate results.3 In both the United States and Great Britain, governments have attempted to rank cardiac surgeons by their patients’ survival rates, only to find that they had created incentives for surgeons to turn away the sickest patients. When the U.S. Department of Labor rewarded local employment offices for their success in finding jobs for displaced workers, counselors shifted their efforts from training programs leading to good jobs, to more easily found unskilled jobs that might not endure, but that would inflate the counselors’ success data. The counselors also began to concentrate on those unemployed workers who were most able to find jobs on their own, diminishing their attention to those whom the employment programs were primarily designed to help.

A third reason for skepticism is that in practice, and especially in the current tight fiscal environment, performance rewards are likely to come mostly from the redistribution of already-appropriated teacher compensation funds, and thus are not likely to be accompanied by a significant increase in average teacher salaries (unless public funds are supplemented by substantial new money from foundations, as is currently the situation in Washington, D.C.). If performance rewards do not raise average teacher salaries, the potential for them to improve the average effectiveness of recruited teachers is limited and will result only if the more talented of prospective teachers are more likely than the less talented to accept the risks that come with an uncertain salary. Once again, there is no evidence on this point.

And finally, it is important for the public to recognize that the standardized tests now in use are not perfect, and do not provide unerring measurements of student achievement. Not only are they subject to errors of various kinds—we describe these in more detail below—but they are narrow measures of what students know and can do, relying largely on multiple-choice items that do not evaluate students’ communication skills, depth of knowledge and understanding, or critical thinking and performance abilities. These tests are unlike the more challenging open-ended examinations used in high-achieving nations in the world.4 Indeed, U.S. scores on international exams that assess more complex skills dropped from 2000 to 2006,5 even while state and local test scores were climbing, driven upward by the pressures of test-based accountability.

This seemingly paradoxical situation can occur because drilling students on narrow tests does not necessarily translate into broader skills that students will use outside of test-taking situations. Furthermore, educators can be incentivized by high-stakes testing to inflate test results. At the extreme, numerous cheating scandals have now raised questions about the validity of high-stakes student test scores. Without going that far, the now widespread practice of giving students intense preparation for state tests—often to the neglect of knowledge and skills that are important aspects of the curriculum but beyond what tests cover—has in many cases invalidated the tests as accurate measures of the broader domain of knowledge that the tests are supposed to measure. We see this phenomenon reflected in the continuing need for remedial courses in universities for high school graduates who scored well on standardized tests, yet still cannot read, write or calculate well enough for first-year college courses. As policy makers attach more incentives and sanctions to the tests, scores are more likely to increase without actually improving students’ broader knowledge and understanding.6

The research community consensus

Statisticians, psychometricians, and economists who have studied the use of test scores for high-stakes teacher evaluation, including its most sophisticated form, value-added modeling (VAM), mostly concur that such use should be pursued only with great caution. Donald Rubin, a leading statistician in the area of causal inference, reviewed a range of leading VAM techniques and concluded:

We do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions.7

A research team at RAND has cautioned that:

The estimates from VAM modeling of achievement will often be too imprecise to support some of the desired inferences.8


The research base is currently insufficient to support the use of VAM for high-stakes decisions about individual teachers or schools.9

Henry Braun, then of the Educational Testing Service, concluded in his review of VAM research:

VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.10

In a letter to the Department of Education, commenting on the Department’s proposal to use student achievement to evaluate teachers, the Board on Testing and Assessment of the National Research Council of the National Academy of Sciences wrote:

…VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.11

And a recent report of a workshop conducted jointly by the National Research Council and the National Academy of Education concluded:

Value-added methods involve complex statistical models applied to test data of varying quality. Accordingly, there are many technical challenges to ascertaining the degree to which the output of these models provides the desired estimates. Despite a substantial amount of research over the last decade and a half, overcoming these challenges has proven to be very difficult, and many questions remain unanswered…12

Among the concerns raised by researchers are the prospects that value-added methods can misidentify both successful and unsuccessful teachers and, because of their instability and failure to disentangle other influences on learning, can create confusion about the relative sources of influence on student achievement. If used for high-stakes purposes, such as individual personnel decisions or merit pay, extensive use of test-based metrics could create disincentives for teachers to take on the neediest students, to collaborate with one another, or even to stay in the profession.

Statistical misidentification of effective teachers

Basing teacher evaluation primarily on student test scores does not accurately distinguish more from less effective teachers because even relatively sophisticated approaches cannot adequately address the full range of statistical problems that arise in estimating a teacher’s effectiveness. Efforts to address one statistical problem often introduce new ones. These challenges arise because of the influence of student socioeconomic advantage or disadvantage on learning, measurement error and instability, the nonrandom sorting of teachers across schools and of students to teachers in classrooms within schools, and the difficulty of disentangling the contributions of multiple teachers over time to students’ learning. As a result, reliance on student test scores for evaluating teachers is likely to misidentify many teachers as either poor or successful.

The influence of student background on learning

Social scientists have long recognized that student test scores are heavily influenced by socioeconomic factors such as parents’ education and home literacy environment, family resources, student health, family mobility, and the influence of neighborhood peers, and of classmates who may be relatively more advantaged or disadvantaged. Thus, teachers working in affluent suburban districts would almost always look more effective than teachers in urban districts if the achievement scores of their students were interpreted directly as a measure of effectiveness.13

New statistical techniques, called value-added modeling (VAM), are intended to resolve the problem of socio-economic (and other) differences by adjusting for students’ prior achievement and demographic characteristics (usually only their income-based eligibility for the subsidized lunch program, and their race or Hispanic ethnicity).14 These techniques measure the gains that students make and then compare these gains to those of students whose measured background characteristics and initial test scores were similar, concluding that those who made greater gains must have had more effective teachers.

Value-added approaches are a clear improvement over status test-score comparisons (that simply compare the average student scores of one teacher to the average student scores of another); over change measures (that simply compare the average student scores of a teacher in one year to her average student scores in the previous year); and over growth measures (that simply compare the average student scores of a teacher in one year to the same students’ scores when they were in an earlier grade the previous year).15

Status measures primarily reflect the higher or lower achievement with which students entered a teacher’s classroom at the beginning of the year rather than the contribution of the teacher in the current year. Change measures are flawed because they may reflect differences from one year to the next in the various characteristics of students in a teacher’s classroom, as well as other school or classroom-related variables (e.g., the quality of curriculum materials, specialist or tutoring supports, class size, and other factors that affect learning). Growth measures implicitly assume, without justification, that students who begin at different achievement levels should be expected to gain at the same rate, and that all gains are due solely to the individual teacher to whom student scores are attached; growth measures do not control for students’ socioeconomic advantages or disadvantages that may affect not only their initial levels but their learning rates.
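The contrast among these comparison methods can be made concrete with a toy numerical sketch. The scores and the expected-gain rule below are invented purely for illustration and do not represent any actual district's model:

```python
# Hypothetical scores for two teachers' classrooms (the same students
# tracked from last year to this year). Teacher A's students entered
# with higher achievement than Teacher B's.
last_year = {"A": [70, 75, 80], "B": [40, 45, 50]}
this_year = {"A": [76, 81, 86], "B": [48, 53, 58]}

def mean(xs):
    return sum(xs) / len(xs)

# Status measure: this year's average level. It mostly reflects who
# walked in the door, not this year's teaching.
status = {t: mean(this_year[t]) for t in this_year}

# Growth measure: the same students' average gain, implicitly assuming
# equal expected gains regardless of starting point.
growth = {t: mean(this_year[t]) - mean(last_year[t]) for t in this_year}

# A bare-bones value-added idea: compare each student's gain with the
# typical gain of students who started at a similar level. Here a
# crude invented rule stands in for that comparison group: students
# starting below 60 are expected to gain 7 points, others 5.
def expected_gain(prior):
    return 7 if prior < 60 else 5

def value_added(teacher):
    gains = [y - x for x, y in zip(last_year[teacher], this_year[teacher])]
    expected = [expected_gain(x) for x in last_year[teacher]]
    return mean(gains) - mean(expected)

va = {t: value_added(t) for t in last_year}
# Status favors Teacher A (81 vs. 53) and growth favors Teacher B
# (6 vs. 8), but the value-added comparison (+1 each) judges the two
# teachers alike once different starting points are accounted for.
```

The sketch shows why value-added is an improvement over status and growth measures; the sections that follow explain why even this adjustment falls short for high-stakes comparisons of individual teachers.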

Although value-added approaches improve over these other methods, the claim that they can “level the playing field” and provide reliable, valid, and fair comparisons of individual teachers is overstated. Even when student demographic characteristics are taken into account, the value-added measures are too unstable (i.e., vary widely) across time, across the classes that teachers teach, and across tests that are used to evaluate instruction, to be used for the high-stakes purposes of evaluating teachers.16

Multiple influences on student learning

Because education is both a cumulative and a complex process, it is impossible to fully disentangle the influences of students’ other teachers and of school conditions on their apparent learning, let alone their out-of-school learning experiences at home, with peers, at museums and libraries, in summer programs, on-line, and in the community.

No single teacher accounts for all of a student’s achievement. Prior teachers have lasting effects, for good or ill, on students’ later learning, and several current teachers can also interact to produce students’ knowledge and skills. For example, with VAM, the essay-writing a student learns from his history teacher may be credited to his English teacher, even if the English teacher assigns no writing; the mathematics a student learns in her physics class may be credited to her math teacher. Some students receive tutoring, as well as homework help from well-educated parents. Even among parents who are similarly well- or poorly educated, some will press their children to study and complete homework more than others. Class sizes vary both between and within schools, a factor influencing achievement growth, particularly for disadvantaged children in the early grades.17 In some schools, counselors or social workers are available to address serious behavior or family problems, and in others they are not. A teacher who works in a well-resourced school with specialist supports may appear to be more effective than one whose students do not receive these supports.18 Each of these resource differences may have a small impact on a teacher’s apparent effectiveness, but cumulatively they have greater significance.

Validity and the insufficiency of statistical controls

Although value-added methods can support stronger inferences about the influences of schools and programs on student growth than less sophisticated approaches, the research reports cited above have consistently cautioned that the contributions of VAM are not sufficient to support high-stakes inferences about individual teachers. Despite the hopes of many, even the most highly developed value-added models fall short of their goal of adequately adjusting for the backgrounds of students and the context of teachers’ classrooms. And less sophisticated models do even less well. The difficulty arises largely because of the nonrandom sorting of teachers to students across schools, as well as the nonrandom sorting of students to teachers within schools.

Nonrandom sorting of teachers to students across schools: Some schools and districts have students who are more socioeconomically disadvantaged than others. Several studies show that VAM results are correlated with the socioeconomic characteristics of the students.19 This means that some of the biases that VAM was intended to correct may still be operating. Of course, it could also be that affluent schools or districts are able to recruit the best teachers. This possibility cannot be ruled out entirely, but some studies control for cross-school variability and at least one study has examined the same teachers with different populations of students, showing that these teachers consistently appeared to be more effective when they taught more academically advanced students, fewer English language learners, and fewer low-income students.20 This finding suggests that VAM cannot control completely for differences in students’ characteristics or starting points.21

Teachers who have chosen to teach in schools serving more affluent students may appear to be more effective simply because they have students with more home and school supports for their prior and current learning, and not because they are better teachers. Although VAM attempts to address the differences in student populations in different schools and classrooms by controlling statistically for students’ prior achievement and demographic characteristics, this “solution” assumes that the socioeconomic disadvantages that affect children’s test scores do not also affect the rates at which they show progress—or the validity with which traditional tests measure their learning gains (a particular issue for English language learners and students with disabilities).

Some policy makers assert that it should be easier for students at the bottom of the achievement distribution to make gains because they have more of a gap to overcome. This assumption is not confirmed by research. Indeed, it is just as reasonable to expect that “learning begets learning”: students at the top of the distribution could find it easier to make gains, because they have more knowledge and skills they can utilize to acquire additional knowledge and skills and, because they are independent learners, they may be able to learn as easily from less effective teachers as from more effective ones.

The pattern of results on any given test could also be affected by whether the test has a high “ceiling”—that is, whether there is considerable room at the top of the scale for tests to detect the growth of students who are already high-achievers—or whether it has a low “floor”—that is, whether skills are assessed along a sufficiently long continuum for low-achieving students’ abilities to be measured accurately in order to show gains that may occur below the grade-level standard.22

Furthermore, students who have fewer out-of-school supports for their learning have been found to experience significant summer learning loss between the time they leave school in June and the time they return in the fall. We discuss this problem in detail below. For now, suffice it to say that teachers who teach large numbers of low-income students will be noticeably disadvantaged in spring-to-spring test gain analyses, because their students will start the fall further behind than more affluent students who were scoring at the same level in the previous spring.

The most acceptable statistical method to address the problems arising from the non-random sorting of students across schools is to include indicator variables (so-called school fixed effects) for every school in the data set. This approach, however, limits the usefulness of the results because teachers can then be compared only to other teachers in the same school and not to other teachers throughout the district. For example, a teacher in a school with exceptionally talented teachers may not appear to add as much value to her students as others in the school, but if compared to all the teachers in the district, she might fall well above average. In any event, teacher effectiveness measures continue to be highly unstable, whether or not they are estimated using school fixed effects.23
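In computational terms, including a school fixed effect amounts to demeaning each teacher's result by the school average, so teachers are ranked only against colleagues in the same building. A minimal sketch with made-up average gains illustrates why cross-school comparisons are lost:

```python
# Hypothetical average student gains for teachers, grouped by school.
gains = {
    "School 1": {"Ms. X": 10.0, "Mr. Y": 12.0, "Ms. Z": 14.0},  # strong school
    "School 2": {"Mr. P": 4.0, "Ms. Q": 6.0, "Mr. R": 8.0},     # struggling school
}

def fixed_effect_scores(gains_by_school):
    """Subtract each school's mean gain from its teachers' gains: the
    arithmetic analogue of including a dummy variable (fixed effect)
    for every school in the model."""
    scores = {}
    for school, teachers in gains_by_school.items():
        school_mean = sum(teachers.values()) / len(teachers)
        for teacher, gain in teachers.items():
            scores[teacher] = gain - school_mean
    return scores

scores = fixed_effect_scores(gains)
# Ms. X (-2) looks "below average" even though her students gained more
# than those of every teacher in School 2, while Mr. R (+2) looks
# "above average" despite smaller gains than anyone in School 1.
```

This is the trade-off described above: the fixed effect removes cross-school bias, but at the cost of making a teacher's score meaningful only relative to her own school's staff.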

Nonrandom sorting of students to teachers within schools: A comparable statistical problem arises for teachers within schools, in that teachers’ value-added scores are affected by differences in the types of students who happen to be in their classrooms. It is commonplace for teachers to report that this year they had a “better” or “worse” class than last, even if prior achievement or superficial socioeconomic characteristics are similar.

Statistical models cannot fully adjust for the fact that some teachers will have a disproportionate number of students who may be exceptionally difficult to teach (students with poorer attendance, who have become homeless, who have severe problems at home, who come into or leave the classroom during the year due to family moves, etc.) or whose scores on traditional tests are frequently not valid (e.g., those who have special education needs or who are English language learners). In any school, a grade cohort is too small to expect each of these many characteristics to be represented in the same proportion in each classroom.

Another recent study documents the consequences of students (in this case, apparently purposefully) not being randomly assigned to teachers within a school. It uses a VAM to assign effects to teachers after controlling for other factors, but applies the model backwards to see if credible results obtain. Surprisingly, it finds that students’ fifth grade teachers appear to be good predictors of students’ fourth grade test scores.24 Inasmuch as a student’s later fifth grade teacher cannot possibly have influenced that student’s fourth grade performance, this curious result can only mean that students are systematically grouped into fifth grade classrooms based on their fourth grade performance. For example, students who do well in fourth grade may tend to be assigned to one fifth grade teacher while those who do poorly are assigned to another. The usefulness of value-added modeling requires the assumption that teachers whose performance is being compared have classrooms with students of similar ability (or that the analyst has been able to control statistically for all the relevant characteristics of students that differ across classrooms). But in practice, teachers’ estimated value-added effect necessarily reflects in part the nonrandom differences between the students they are assigned and not just their own effectiveness.

Purposeful, nonrandom assignment of students to teachers can be a function of either good or bad educational policy. Some grouping schemes deliberately place more special education students in selected inclusion classrooms or organize separate classes for English language learners. Skilled principals often try to assign students with the greatest difficulties to teachers they consider more effective. Also, principals often attempt to make assignments that match students’ particular learning needs to the instructional strengths of individual teachers. Some teachers are more effective with students with particular characteristics, and principals with experience come to identify these variations and consider them in making classroom assignments.

In contrast, some less conscientious principals may purposefully assign students with the greatest difficulties to teachers who are inexperienced, perhaps to avoid conflict with senior staff who resist such assignments. Furthermore, traditional tracking often sorts students by prior achievement. Regardless of whether the distribution of students among classrooms is motivated by good or bad educational policy, it has the same effect on the integrity of VAM analyses: the nonrandom pattern makes it extremely difficult to make valid comparisons of the value-added of the various teachers within a school.

In sum, teachers’ value-added effects can be compared only where teachers have the same mix of struggling and successful students, something that almost never occurs, or when statistical measures of effectiveness fully adjust for the differing mix of students, something that is exceedingly hard to do.

Imprecision and instability

Unlike school, district, and state test score results, which are based on larger aggregations of students, individual classroom results rest on small numbers of students and therefore show much more dramatic year-to-year fluctuations. Even the most sophisticated analyses of student test score gains generate estimates of teacher quality that vary considerably from one year to the next. This instability reflects not only changes in the characteristics of students assigned to teachers but also the small number of students whose scores are relevant for any particular teacher.

Small sample sizes can provide misleading results for many reasons. No student produces an identical score on tests given at different times. A student may do less well than her expected score on a specific test if she comes to school having had a bad night’s sleep, and may do better than her expected score if she comes to school exceptionally well-rested. A student who is not certain of the correct answers may make more lucky guesses on multiple-choice questions on one test, and more unlucky guesses on another. Researchers studying year-to-year fluctuations in teacher and school averages have also noted sources of variation that affect the entire group of students, especially the effects of particularly cooperative or particularly disruptive class members.

Analysts must average test scores over large numbers of students to get reasonably stable estimates of average learning. The larger the number of students in a tested group, the smaller will be the average error because positive errors will tend to cancel out negative errors. But the sampling error associated with small classes of, say, 20-30 students could well be too large to generate reliable results. Most teachers, particularly those teaching elementary or middle school students, do not teach enough students in any year for average test scores to be highly reliable.
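The arithmetic behind this point is simple. A minimal sketch, assuming (purely for illustration) that student-level gains have a standard deviation of 1.0 in test SD units:

```python
import numpy as np

# The standard error of a class-average gain shrinks only with sqrt(n).
# Assume, for illustration, that student-level gains have SD = 1.0
# (in test SD units); the sample sizes below are typical examples.
sigma = 1.0
for n in (25, 100, 1000):
    se = sigma / np.sqrt(n)
    print(f"n = {n:4d}: standard error of class mean = {se:.2f}")
```

Under this assumption, a class of 25 yields a standard error of 0.20 SD for the class average, which is of the same order as the entire spread of teacher effects reported in many studies, so a single class's average gain cannot reliably rank its teacher.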

In schools with high mobility, the number of students with scores at more than one point in time—necessary if gains are to be measured at all—is smaller still. When there are small numbers of test-takers, a few students who are distracted during the test, or who are having a “bad” day when tests are administered, can skew the average score considerably. Making matters worse, because most VAM techniques rely on growth calculations from one year to the next, each teacher’s value-added score is affected by the measurement error in two different tests. In this respect, VAM results are even less reliable indicators of teacher contributions to learning than a single test score. VAM approaches that incorporate multiple prior years of data suffer similar problems.
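The compounding of measurement error in gain scores follows from the fact that error variances add when two independently noisy scores are differenced. A short simulation sketch, with an assumed (hypothetical) per-test error SD of 0.3 and a uniform true gain of 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scores: observed = true + independent measurement error.
# The error SD of 0.3 and the true growth of 0.5 are illustrative only.
n = 100_000
true_pre = rng.normal(0.0, 1.0, n)
err_pre  = rng.normal(0.0, 0.3, n)
err_post = rng.normal(0.0, 0.3, n)

obs_pre  = true_pre + err_pre
obs_post = true_pre + 0.5 + err_post   # every student truly gains 0.5
gain     = obs_post - obs_pre

# Error variances ADD when two tests are differenced:
# var(gain error) = 0.3**2 + 0.3**2 = 0.18, so SD(gain) is about 0.42,
# noisier than the 0.3 error of either test on its own.
print(round(float(gain.std()), 2))
```

Even though every simulated student gains exactly 0.5, the observed gains scatter with roughly 40% more noise than either test alone carries.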

In addition to the size of the sample, a number of other factors also affect the magnitude of the errors that are likely to emerge from value-added models of teacher effectiveness. In a careful modeling exercise designed to account for the various factors, a recent study by researchers at Mathematica Policy Research, commissioned and published by the Institute of Education Sciences of the U.S. Department of Education, concludes that the errors are sufficiently large to lead to the misclassification of many teachers.25

The Mathematica models, which apply to teachers in the upper elementary grades, are based on two standard approaches to value-added modeling, with the key elements of each calibrated with data on typical test score gains, class sizes, and the number of teachers in a typical school or district. Specifically, the authors find that if the goal is to distinguish relatively high or relatively low performing teachers from those with average performance within a district, the error rate is about 26% when three years of data are used for each teacher. This means that in a typical performance measurement system, more than one in four teachers who are in fact teachers of average quality would be misclassified as either outstanding or poor teachers, and more than one in four teachers who should be singled out for special treatment would be misclassified as teachers of average quality. If only one year of data is available, the error rate increases to 36%. To reduce it to 12% would require 10 years of data for each teacher.

Despite the large magnitude of these error rates, the Mathematica researchers are careful to point out that the resulting misclassification of teachers that would emerge from value-added models is still most likely understated because their analysis focuses on imprecision error alone. The failure of policy makers to address some of the validity issues, such as those associated with the nonrandom sorting of students across schools, discussed above, would lead to even greater misclassification of teachers.

Measurement error also renders the estimates of teacher quality that emerge from value-added models highly unstable. Researchers have found that teachers’ effectiveness ratings differ from class to class, from year to year, and from test to test, even when these are within the same content area.26 Teachers also look very different in their measured effectiveness when different statistical methods are used.27 Teachers’ value-added scores and rankings are most unstable at the upper and lower ends of the scale, where they are most likely to be used to allocate performance pay or to dismiss teachers believed to be ineffective.28

Because of the range of influences on student learning, many studies have confirmed that estimates of teacher effectiveness are highly unstable. One study examining two consecutive years of data showed, for example, that across five large urban districts, among teachers who were ranked in the bottom 20% of effectiveness in the first year, fewer than a third were in that bottom group the next year, and another third moved all the way up to the top 40%. There was similar movement for teachers who were highly ranked in the first year. Among those who were ranked in the top 20% in the first year, only a third were similarly ranked a year later, while a comparable proportion had moved to the bottom 40%.29

Another study confirmed that big changes from one year to the next are quite likely, with year-to-year correlations of estimated teacher quality ranging from only 0.2 to 0.4.30 This means that only about 4% to 16% of the variation in a teacher’s value-added ranking in one year can be predicted from his or her rating in the previous year.
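The relationship between a correlation and the share of variance it explains is just r². A minimal simulation sketch, with an assumed 30/70 split between a teacher's stable "true" effect and year-specific noise (the split is hypothetical, chosen only so the resulting correlation lands inside the 0.2-to-0.4 range cited above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 10,000 teachers whose measured effect each year is a stable
# "true" component plus independent year-specific noise.  The variance
# split (30% stable signal, 70% noise) is illustrative only.
n = 10_000
true_eff = rng.normal(0.0, np.sqrt(0.3), n)
year1 = true_eff + rng.normal(0.0, np.sqrt(0.7), n)
year2 = true_eff + rng.normal(0.0, np.sqrt(0.7), n)

r = np.corrcoef(year1, year2)[0, 1]
print(f"year-to-year correlation r = {r:.2f}")
print(f"share of variance predictable, r**2 = {r * r:.0%}")
```

Under these assumptions the correlation comes out near 0.3, so only about 9% of the variation in one year's ranking is predictable from the prior year, even though each teacher's underlying quality never changes.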

These patterns, which held true in every district and state under study, suggest that there is not a stable construct measured by value-added measures that can readily be called “teacher effectiveness.”

That a teacher who appears to be very effective (or ineffective) in one year might have a dramatically different result the following year runs counter to most people’s notion that the true quality of a teacher changes very little over time. Such instability from year to year renders single-year estimates unsuitable for high-stakes decisions about teachers, and is likely to erode confidence in the validity of the approach among teachers and the public alike.

Perverse and unintended consequences of statistical flaws

The problems of measurement error and other sources of year-to-year variability are especially serious because many policy makers are particularly concerned with removing ineffective teachers in schools serving the lowest-performing, disadvantaged students. Yet students in these schools tend to be more mobile than students in more affluent communities. In highly mobile communities, if two years of data are unavailable for many students, or if teachers are not to be held accountable for students who have been present for less than the full year, the sample is even smaller than the already small samples for a single typical teacher, and the problem of misestimation is exacerbated.

Yet the failure or inability to include data on mobile students also distorts estimates because, on average, more mobile students are likely to differ from less mobile students in other ways not accounted for by the model, so that the students with complete data are not representative of the class as a whole. Even if state data systems permit tracking of students who change schools, measured growth for these students will be distorted, and attributing their progress (or lack of progress) to different schools and teachers will be problematic.

If policy makers persist in attempting to use VAM to evaluate teachers serving highly mobile student populations, perverse consequences can result. Once teachers in schools or classrooms with more transient student populations understand that their VAM estimates will be based only on the subset of students for whom complete data are available and usable, they will have incentives to spend disproportionately more time with students who have prior-year data or who pass a longevity threshold, and less time with students who arrive mid-year and who may be more in need of individualized instruction. Such a response to incentives is not unprecedented: an unintended incentive created by NCLB caused many schools and teachers to focus greater effort on children whose test scores were just below proficiency cutoffs and whose small improvements would have great consequences for describing a school’s progress, while paying less attention to children who were either far above or far below those cutoffs.31

As noted above, even in a more stable community, the number of students in a given teacher’s class is often too small to support reliable conclusions about teacher effectiveness. The most frequently proposed solution to this problem is to limit VAM to teachers who have been teaching for many years, so their performance can be estimated using multiple years of data, and so that instability in VAM measures over time can be averaged out. This statistical solution means that states or districts only beginning to implement appropriate data systems must wait several years for sufficient data to accumulate. More critically, the solution does not solve the problem of nonrandom assignment, and it necessarily excludes beginning teachers with insufficient historical data and teachers serving the most disadvantaged (and most mobile) populations, thus undermining the ability of the system to address the goals policy makers seek.

The statistical problems we have identified here are not of interest only to technical experts. Rather, they are directly relevant to policy makers and to the desirability of efforts to evaluate teachers by their students’ scores. To the extent that this policy results in the incorrect categorization of particular teachers, it can harm teacher morale and fail in its goal of changing behavior in desired directions.

For example, if teachers perceive the system to be generating incorrect or arbitrary evaluations, perhaps because the evaluation of a specific teacher varies widely from year to year for no explicable reason, teachers could well be demoralized, with adverse effects on their teaching and increased desire to leave the profession. In addition, if teachers see little or no relationship between what they are doing in the classroom and how they are evaluated, their incentives to improve their teaching will be weakened.

Practical limitations

The statistical concerns we have described are accompanied by a number of practical problems with evaluating teachers based on student scores on state tests.

Availability of appropriate tests

Most secondary school teachers, all teachers in kindergarten, first, and second grades, and some teachers in grades three through eight do not teach courses in which students are subject to the kind of external tests needed to evaluate test score gains. And even in the grades where such gains could, in principle, be measured, the tests are not designed to do so.

Value-added measurement of growth from one grade to the next should ideally utilize vertically scaled tests, which most states (including large states like New York and California) do not use. In order to be vertically scaled, tests must evaluate content that is measured along a continuum from year to year. Following an NCLB mandate, most states now use tests that measure grade-level standards only and, at the high school level, end-of-course examinations, neither of which is designed to measure such a continuum. These test design constraints make accurate vertical scaling extremely difficult. Without vertically scaled tests, VAM can estimate changes in the relative distribution, or ranking, of students from last year to this, but cannot do so across the full breadth of curriculum content in a particular course or grade level, because many topics are not covered in consecutive years. For example, if multiplication is taught in fourth but not in fifth grade, while fractions and decimals are taught in fifth but not in fourth grade, measuring math “growth” from fourth to fifth grade has little meaning if tests measure only the grade-level expectations. Furthermore, the tests will not be able to evaluate student achievement and progress that occurs well below or above the grade-level standards.

Similarly, if probability, but not algebra, is expected to be taught in seventh grade, but algebra and probability are both taught in eighth grade, it might be possible to measure growth in students’ knowledge of probability, but not in algebra. Teachers, however, vary in their skills. Some teachers might be relatively stronger in teaching probability, and others in teaching algebra. Overall, such teachers might be equally effective, but VAM would arbitrarily identify the former teacher as more effective, and the latter as less so. In addition, if probability is tested only in eighth grade, a student’s success may be attributed to the eighth grade teacher even if it is largely a function of instruction received from his seventh grade teacher. And finally, if high school students take end-of-course exams in biology, chemistry, and physics in different years, for example, there is no way to calculate gains on tests that measure entirely different content from year to year.

Thus, testing expert Daniel Koretz concludes that “because of the need for vertically scaled tests, value-added systems may be even more incomplete than some status or cohort-to-cohort systems.”32

Problems of attribution

It is often quite difficult to match particular students to individual teachers, even if data systems eventually permit such matching, and to attribute student achievement unerringly to a specific teacher. In some cases, students may be pulled out of classes for special programs or instruction, thereby altering the influence of classroom teachers. Some schools expect, and train, teachers of all subjects to integrate reading and writing instruction into their curricula. Many classes, especially those at the middle-school level, are team-taught in a language arts and history block or a science and math block, or in various other ways. In schools with certain kinds of block schedules, courses are taught for only a semester, or even in nine- or 10-week rotations, giving students two to four teachers over the course of a year in a given class period, even without considering unplanned teacher turnover. Schools that have adopted pull-out, team teaching, or block scheduling practices will have additional difficulties in isolating individual teacher “effects” for pay or disciplinary purposes.

Similarly, NCLB requires low-scoring schools to offer extra tutoring to students, provided by the school district or contracted from an outside tutoring service. High quality tutoring can have a substantial effect on student achievement gains.33 If test scores subsequently improve, should a specific teacher or the tutoring service be given the credit?

Summer learning loss

Teachers should not be held responsible for learning gains or losses during the summer, as they would be if they were evaluated by spring-to-spring test scores. These summer gains and losses are quite substantial. Indeed, researchers have found that three-fourths of schools identified as being in the bottom 20% of all schools, based on the scores of students during the school year, would not be so identified if differences in learning outside of school were taken into account.34 Similar conclusions apply to the bottom 5% of all schools.35

Another recent study showed that two-thirds of the difference between the ninth grade test scores of high and low socioeconomic status students can be traced to summer learning differences over the elementary years.36 A research summary concluded that while students overall lose an average of about one month in reading achievement over the summer, lower-income students lose significantly more, and middle-income students may actually gain in reading proficiency over the summer, creating a widening achievement gap.37 Teachers who teach a greater share of lower-income students are disadvantaged by summer learning loss in estimates of their effectiveness that are calculated in terms of gains in their students’ test scores from the previous year.

To overcome the obstacles to value-added measurement presented both by the absence of vertical scaling and by differences in summer learning, schools would have to measure student growth within a single school year, not from one year to the next. To do so, schools would have to administer high-stakes tests twice a year, once in the fall and once in the spring.38 While this approach would be preferable in some ways to attempting to measure value-added from one year to the next, fall and spring testing would force schools to devote even more time to testing for accountability purposes, and would create incentives for teachers to game the value-added measures. However commonplace it might be under current systems for teachers to respond rationally to incentives by artificially inflating end-of-year scores through drill, test preparation activities, or teaching to the test, it would be even easier for teachers to inflate their value-added ratings by discouraging students’ high performance on a September test, if only by not making the same extraordinary efforts to boost scores in the fall that they make in the spring.

The need, mentioned above, to have test results ready early enough in the year to influence not only instruction but also teacher personnel decisions is inconsistent with fall to spring testing, because the two tests must be spaced far enough apart in the year to produce plausibly meaningful information about teacher effects. A test given late in the spring, with results not available until the summer, is too late for this purpose. Most teachers will already have had their contracts renewed and received their classroom assignments by this time.39

Unintended negative effects

Although the various reasons to be skeptical about the use of student test scores to evaluate teachers, along with the many conceptual and practical limitations of empirical value added measures, might suffice by themselves to make one wary of the move to test-based evaluation of teachers, they take on even greater significance in light of the potential for large negative effects of such an approach.

Disincentives for teachers to work with the neediest students

Using test scores to evaluate teachers unfairly disadvantages teachers of the neediest students. Because of the inability of value-added methods to fully account for the differences in student characteristics and in school supports, as well as the effects of summer learning loss, teachers who teach students with the greatest educational needs will appear to be less effective than they are. This could lead to the inappropriate dismissal of teachers of low-income and minority students, as well as of students with special educational needs. The success of such teachers is not
accurately captured by relative value-added metrics, and the use of VAM to evaluate such teachers could exacerbate disincentives to teach students with high levels of need. Teachers are also likely to be aware of personal circumstances (a move, an illness, a divorce) that are likely to affect individual students’ learning gains but are not captured by value-added models. Within a school, teachers will have incentives to avoid working with the students most likely to pull down their measured effectiveness scores.

Narrowing the curriculum

Narrowing of the curriculum to increase time on what is tested is another negative consequence of high-stakes uses of value-added measures for evaluating teachers. This narrowing takes the form both of reallocations of effort between the subject areas covered in a full grade-level curriculum, and of reallocations of effort within subject areas themselves.40

The tests most likely to be used in any test-based teacher evaluation program are those that are currently required under NCLB, or that will be required under its reauthorized version. The current law requires that all students take standardized tests in math and reading each year in grades three through eight, and once while in high school. Although NCLB also requires tests in general science, this subject is tested only once in the elementary and middle grades, and the law does not count the results of these tests in its identification of inadequate schools. In practice, therefore, evaluating teachers by their students’ test scores means evaluating teachers only by students’ basic math and/or reading skills, to the detriment of other knowledge, skills, and experiences that young people need to become effective participants in a democratic society and contributors to a productive economy.

Thus, for elementary (and some middle-school) teachers who are responsible for all (or most) curricular areas, evaluation by student test scores creates incentives to diminish instruction in history, the sciences, the arts, music, foreign language, health and physical education, civics, ethics and character, all of which we expect children to learn. Survey data confirm that even with the relatively mild school-wide sanctions for low test scores provided by NCLB, schools have diminished time devoted to curricular areas other than math and reading. This shift was most pronounced in districts where schools were most likely to face sanctions—districts with schools serving low-income and minority children.41 Such pressures to narrow the curriculum will certainly increase if sanctions for low test scores are toughened to include the loss of pay or employment for individual teachers.

Another kind of narrowing takes place within the math and reading instructional programs themselves. There are two reasons for this outcome.

First, it is less expensive to grade exams that include only, or primarily, multiple-choice questions, because such questions can be graded by machine inexpensively, without employing trained professional scorers. Machine grading is also faster, an increasingly necessary requirement if results are to be delivered in time to categorize schools for sanctions and interventions, make instructional changes, and notify families entitled to transfer out under the rules created by No Child Left Behind. And scores are also needed quickly if test results are to be used for timely teacher evaluation. (If teachers are found wanting, administrators should know this before designing staff development programs or renewing teacher contracts for the following school year.)

As a result, standardized annual exams, if usable for high-stakes teacher or school evaluation purposes, typically include no or very few extended-writing or problem-solving items, and therefore do not measure conceptual understanding, communication, scientific investigation, technology and real-world applications, or a host of other critically important skills. Not surprisingly, several states have eliminated or reduced the number of writing and problem-solving items from their standardized exams since the implementation of NCLB.42 Although some reasoning and other advanced skills can be tested with multiple-choice questions, most cannot be, so teachers who are evaluated by students’ scores on multiple-choice exams have incentives to teach only lower level, procedural skills that can easily be tested.

Second, an emphasis on test results for individual teachers exacerbates the well-documented incentives for teachers to focus on narrow test-taking skills, repetitive drill, and other undesirable instructional practices. In mathematics, a brief exam can only sample a few of the many topics that teachers are expected to cover in the course of a year.43 After the first few years of an exam’s use, teachers can anticipate which of these topics are more likely to appear, and focus their instruction on these likely-to-be-tested topics, taught in the format of common test questions. Although specific questions may vary from year to year, great variation in the format of test questions is not practical, because developing and field-testing significantly different exams each year would be prohibitively expensive and would undermine the statistical equating procedures used to ensure the comparability of tests from one year to the next. As a result, increasing scores on students’ mathematics exams may reflect, in part, greater skill by their teachers in predicting the topics and types of questions, if not necessarily the precise questions, likely to be covered by the exam. This practice is commonly called “teaching to the test.” It is a rational response to incentives and is not unlawful, provided teachers do not gain illicit access to specific forthcoming test questions and prepare students for them.

Such test preparation has become conventional in American education and is reported without embarrassment by educators. A recent New York Times report, for example, described how teachers prepare students for state high school history exams:

As at many schools…teachers and administrators…prepare students for the tests. They analyze tests from previous years, which are made public, looking for which topics are asked about again and again. They say, for example, that the history tests inevitably include several questions about industrialization and the causes of the two world wars.44

A teacher who prepares students for questions about the causes of the two world wars may not adequately be teaching students to understand the consequences of these wars, although both are important parts of a
history curriculum. Similarly, if teachers know they will be evaluated by their students’ scores on a test that predictably asks questions about triangles and rectangles, teachers skilled in preparing students for calculations involving these shapes may fail to devote much time to polygons, an equally important but somewhat more difficult topic in the overall math curriculum.

In English, state standards typically include skills such as learning how to use a library and select appropriate books, give an oral presentation, use multiple sources of information to research a question and prepare a written argument, or write a letter to the editor in response to a newspaper article. However, these standards are not generally tested, and teachers evaluated by student scores on standardized tests have little incentive to develop student skills in these areas.45

A different kind of narrowing also takes place in reading instruction. Reading proficiency includes the ability to interpret written words by placing them in the context of broader background knowledge.46 Because children come to school with such wide variation in their background knowledge, test developers attempt to avoid unfairness by developing standardized exams using short, highly simplified texts.47 Test questions call for literal meaning—identifying the main idea, picking out details, getting events in the right order—but without requiring the inferential or critical reading abilities that are an essential part of proficient reading. It is relatively easy for teachers to prepare students for such tests by drilling them in the mechanics of reading, but this behavior does not necessarily make them good readers.48 Children prepared for tests that sample only small parts of the curriculum and that focus excessively on mechanics are likely to learn test-taking skills in place of mathematical reasoning and reading for comprehension. Scores on such tests will then be “inflated,” because they suggest better mathematical and reading ability than is in fact the case.

We can confirm that some score inflation has systematically taken place because the improvement in test scores of students reported by states on their high-stakes tests used for NCLB or state accountability typically far exceeds the improvement in test scores in math and reading on the NAEP.49 Because no school can anticipate far in advance that it will be asked to participate in the NAEP sample, nor which students in the school will be tested, and because no consequences for the school or teachers follow from high or low NAEP scores, teachers have neither the ability nor the incentive to teach narrowly to expected test topics. In addition, because there is no time pressure to produce results with fast electronic scoring, NAEP can use a variety of question formats including multiple-choice, constructed response, and extended open-ended responses.50 NAEP also is able to sample many more topics from a grade’s usual curriculum because in any subject it assesses, NAEP uses several test booklets that cover different aspects of the curriculum, with overall results calculated by combining scores of students who have been given different booklets. Thus, when scores on state tests used for accountability rise rapidly (as has typically been the case), while scores on NAEP exams for the same subjects and grades rise slowly or not at all, we can be reasonably certain that instruction was focused on the fewer topics and item types covered by the state tests, while topics and formats not covered on state tests, but covered on NAEP, were shortchanged.51

Another confirmation of score inflation comes from the Programme for International Student Assessment (PISA), a set of exams given to samples of 15-year-old students in over 60 industrialized and developing nations. PISA is highly regarded because, like national exams in high-achieving nations, it does not rely largely upon multiple-choice items. Instead, it evaluates students’ communication and critical thinking skills, and their ability to demonstrate that they can use the skills they have learned. U.S. scores and rankings on the international PISA exams dropped from 2000 to 2006, even while state and local test scores were climbing, driven upward by the pressures of test-based accountability. The contrast confirms that drilling students for narrow tests such as those used for accountability purposes in the United States does not necessarily translate into broader skills that students will use outside of test-taking situations.

A number of U.S. experiments are underway to determine if offers to teachers of higher pay, conditional on their students having higher test scores in math and reading, actually lead to higher student test scores in these subjects. We await the results of these experiments with interest. Even if they show that monetary incentives for teachers lead to higher scores in reading and math, we will still not know whether the higher scores were achieved by superior instruction or by more drill and test preparation, and whether the students of these teachers would perform equally well on tests for which they did not have specific preparation. Until such questions have been explored, we should be cautious about claims that experiments prove the value of pay-for-performance plans.

Less teacher collaboration

Better schools are collaborative institutions where teachers work across classroom and grade-level boundaries towards the common goal of educating all children to their maximum potential.52 A school will be more effective if its teachers are more knowledgeable about all students and can coordinate efforts to meet students’ needs. Collaborative work among teachers with different levels and areas of skill and different types of experience can capitalize on the strengths of some, compensate for the weaknesses of others, increase shared knowledge and skill, and thus increase their school’s overall professional capacity.

In one recent study, economists found that peer learning among small groups of teachers was the most powerful predictor of improved student achievement over time.53 Another recent study found that students achieve more in mathematics and reading when they attend schools characterized by higher levels of teacher collaboration for school improvement.54 To the extent that teachers are given incentives to pursue individual monetary rewards by posting greater test score gains than their peers, teachers may also have incentives to cease collaborating. Their interest becomes self-interest, not the interest of students, and their instructional strategies may distort and undermine their school’s broader goals.55

To enhance productive collaboration among all of a school’s staff for the purpose of raising overall student scores, group (school-wide) incentives are preferred to incentives that attempt to distinguish among teachers.

Individual incentives, even if they could be based on accurate signals from student test scores, would be unlikely to have a positive impact on overall student achievement for another reason. Except at the very bottom of the teacher quality distribution where test-based evaluation could result in termination, individual incentives will have little impact on teachers who are aware they are less effective (and who therefore expect they will have little chance of getting a bonus) or teachers who are aware they are stronger (and who therefore expect to get a bonus without additional effort). Studies in fields outside education have also documented that when incentive systems require employees to compete with one another for a fixed pot of monetary reward, collaboration declines and client outcomes suffer.56 On the other hand, with group incentives, everyone has a stronger incentive to be productive and to help others to be productive as well.57

A commonplace objection to a group incentive system is that it permits free riding—teachers who share in rewards without contributing additional effort. If the main goal, however, is student welfare, group incentives are still preferred, even if some free-riding were to occur.

Group incentives also avoid some of the problems of statistical instability we noted above, because a full school generates a larger sample of students than an individual classroom. The measurement of average achievement for all of a school’s students, though still not perfectly reliable, is therefore more stable than the measurement of achievement attributable to a specific teacher.
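The sample-size point can be illustrated with a short simulation. This is not from the source; the score distribution, class size of 25, and school size of 500 are hypothetical values chosen only to show that averages over larger groups fluctuate less from sample to sample.

```python
# Illustrative sketch (assumed numbers, not from the paper): the standard
# error of a mean shrinks as the group grows, so a school-wide average is a
# more stable signal than a single classroom's average.
import random
import statistics

random.seed(0)

def mean_of_sample(n):
    # Hypothetical student scores: normally distributed, mean 250, sd 40
    return statistics.fmean(random.gauss(250, 40) for _ in range(n))

# Simulate many independent cohorts for a classroom (25 students)
# and a school (500 students), then compare cohort-to-cohort variability.
classroom_means = [mean_of_sample(25) for _ in range(1000)]
school_means = [mean_of_sample(500) for _ in range(1000)]

print(statistics.stdev(classroom_means))  # near 40 / sqrt(25)  = 8
print(statistics.stdev(school_means))     # near 40 / sqrt(500) ≈ 1.8
```

Under these assumptions the classroom average swings several points from one cohort of students to the next, while the school average barely moves, which is the instability argument in statistical form.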

Yet group incentives, however preferable to individual incentives, retain other problems characteristic of individual incentives. We noted above that an individual incentive system that rewards teachers for their students’ mathematics and reading scores can result in narrowing the curriculum, both by reducing attention paid to non-tested curricular areas, and by focusing attention on the specific math and reading topics and skills most likely to be tested. A group incentive system can exacerbate this narrowing, if teachers press their colleagues to concentrate effort on those activities most likely to result in higher test scores and thus in group bonuses.

Teacher demoralization

Pressure to raise student test scores, to the exclusion of other important goals, can demoralize good teachers and, in some cases, provoke them to leave the profession entirely.

Recent survey data reveal that accountability pressures are associated with higher attrition and reduced morale, especially among teachers in high-need schools.58 Although such survey data are limited, anecdotes abound regarding the demoralization of apparently dedicated and talented teachers as test-based accountability intensifies. Here, we reproduce two such stories, one from a St. Louis teacher and another from a Los Angeles teacher:

No Child Left Behind has completely destroyed everything I ever worked for… We now have an enforced 90-minute reading block. Before, we always had that much reading in our schedule, but the difference now is that it’s 90 minutes of uninterrupted time. It’s impossible to schedule a lot of the things that we had been able to do before… If you take 90 minutes of time, and say no kids can come out at that time, you can’t fit the drama, band, and other specialized programs in… There is a ridiculous emphasis on fluency—reading is now about who can talk the fastest. Even the gifted kids don’t read for meaning; they just go as fast as they possibly can. Their vocabulary is nothing like it used to be. We used to do Shakespeare, and half the words were unknown, but they could figure it out from the context. They are now very focused on phonics of the words and the mechanics of the words, even the very bright kids are… Teachers feel isolated. It used to be different. There was more team teaching. They would say, “Can you take so-and-so for reading because he is lower?” That’s not happening… Teachers are as frustrated as I’ve ever seen them. The kids haven’t stopped wetting pants, or coming to school with no socks, or having arguments and fights at recess. They haven’t stopped doing what children do but the teachers don’t have time to deal with it. They don’t have time to talk to their class, and help the children figure out how to resolve things without violence. Teachable moments to help the schools and children function are gone. But the kids need this kind of teaching, especially inner-city kids and especially at the elementary levels.59
