1. Practicality
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student five hours to complete is impractical; it consumes more time (and money) than necessary to accomplish its objective. A
test that requires individual one-on-one proctoring is impractical for a group
of several hundred test-takers and only a handful of examiners. A test that
takes a few minutes for a student to take and several hours for an examiner to
evaluate is impractical for most classroom situations. A test that can be
scored only by computer is impractical if the test takes place a thousand miles
away from the nearest computer. The value and quality of a test sometimes hinge
on such nitty-gritty, practical considerations. Here's a little horror story
about practicality gone awry. An administrator of a six-week summertime short
course needed to place the 50 or so students who had enrolled in the program. A
quick search yielded a copy of an old English Placement Test from the University
of Michigan. It had 20 listening items based on an audio tape and 80 items on
grammar, vocabulary, and reading comprehension, all in multiple-choice format. A scoring grid accompanied the test. On the day of the test, the
required number of test booklets had been secured, a proctor had been assigned
to monitor the process, and the administrator and proctor had planned to have
the scoring completed by later that afternoon so students could begin classes
the next day. Sounds simple, right? Wrong. The students arrived, test booklets were distributed, directions were given, and the proctor started the tape. Soon
students began to look puzzled. By the time the tenth item played, everyone
looked bewildered. Finally, the proctor checked a test booklet and was horrified
to discover that the wrong tape was playing; it was a tape for another form of
the same test! Now what? She decided to randomly select a short passage from a
textbook that was in the room and give the students a dictation. The students
responded reasonably well. The next 80 non-tape-based items proceeded without
incident, and the students handed in their score sheets and dictation papers.
2. Reliability
A reliable test is consistent and dependable. If you give the same test to the
same student or matched students on two different occasions, the test should
yield similar results. The issue of reliability of a test may best be addressed
by considering a number of factors that may contribute to the unreliability of
a test. Consider the following possibilities (adapted from Mousavi, 2002, p.
804): fluctuations in the student, in scoring, in test administration, and in the test itself.
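The notion of consistency can be made concrete with a simple statistic. The short Python sketch below is my own illustration, not Brown's, and all scores are invented: it estimates test-retest reliability as the correlation between two administrations of the same test to the same students, where a coefficient near 1.0 reflects the consistent, dependable results described above.

# Illustrative sketch: test-retest reliability as a correlation between
# two administrations of the same test. All scores are invented.
from statistics import correlation  # Python 3.10+

first_sitting  = [78, 85, 62, 90, 71, 88]  # six students, first administration
second_sitting = [80, 83, 65, 92, 69, 85]  # same students, two weeks later

r = correlation(first_sitting, second_sitting)
print(f"test-retest reliability estimate: r = {r:.2f}")
# r close to 1.0: the test yields similar results on both occasions;
# a much lower r would point to one of the sources of unreliability below.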
a. Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an observed score deviate from one's "true" score. Also included in this category are such factors as a test-taker's "test-wiseness" or strategies for efficient test taking (Mousavi, 2002, p. 804).
b. Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores of the same test, possibly because of lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the placement test, the initial scoring plan for the dictations was found to be unreliable; that is, the two scorers were not applying the same standards.
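To see how a teacher might detect that problem, here is a minimal sketch (mine, not Brown's; the ratings are invented) that checks two raters' scores on the same set of dictations for exact agreement and for a systematic gap in standards:

# Illustrative sketch: checking inter-rater reliability on invented data.
# Two raters score the same ten dictations on a 0-10 scale.
rater_a = [7, 5, 9, 6, 8, 4, 7, 6, 9, 5]
rater_b = [6, 5, 8, 4, 8, 3, 5, 6, 7, 4]

pairs = list(zip(rater_a, rater_b))

# Exact-agreement rate: how often the raters assign the same score.
agreement = sum(a == b for a, b in pairs) / len(pairs)

# Mean gap: a consistent difference suggests the raters are not
# applying the same standards, as in the placement-test story above.
mean_gap = sum(a - b for a, b in pairs) / len(pairs)

print(f"exact agreement: {agreement:.0%}")    # 30% on these data
print(f"mean gap (A - B): {mean_gap:+.1f}")   # +1.0: rater A is more lenient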
c. Test Administration Reliability
Unreliability may also
result from the conditions in which the test is administered. I once witnessed
the administration of a test of aural comprehension in which a tape recorder
played items for comprehension, but because of street noise outside the
building, students sitting next to windows could not hear the tape accurately.
d. Test Reliability
Sometimes the nature of
the test itself can cause measurement errors. If a test is too long,
test-takers may become fatigued by the time they reach the later items and
hastily respond incorrectly. Timed tests may discriminate against students who
do not perform well on a test with a time limit.
3. Validity
By far the most complex criterion of an effective test, and arguably the most important principle, is validity: "the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment" (Gronlund, 1998, p. 226). A valid test of reading ability actually measures reading ability, not 20/20 vision, nor
previous knowledge in a subject, nor some other variable of questionable
relevance. To measure writing ability, one might ask students to write as many
words as they can in 15 minutes, then simply count the words for the final
score. Such a test would be easy to administer (practical), and the scoring
quite dependable (reliable). But it would not constitute a valid test of
writing ability without some consideration of comprehensibility, rhetorical
discourse elements, and the organization of ideas, among other factors.
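To make the point concrete, here is a small sketch (my own illustration of the word-count test just described, not Brown's code). The scoring is perfectly dependable, yet a repetitive, empty essay outscores a coherent one, which is exactly why the test is not valid:

# Illustrative sketch: the word-count "writing test" described above.
# Easy to administer (practical) and trivially consistent to score
# (reliable), but invalid: it ignores comprehensibility, rhetorical
# discourse elements, and the organization of ideas.

def score_writing(essay: str) -> int:
    """Score = number of words produced in the 15-minute period."""
    return len(essay.split())

essay_a = "The cat sat. The cat sat. The cat sat."         # repetitive, says nothing
essay_b = "Tourism reshapes coastal towns in subtle ways."  # coherent prose

print(score_writing(essay_a))  # 9: the higher "writing" score
print(score_writing(essay_b))  # 7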
a. Content-Related Evidence
If a test actually
samples the subject matter about which conclusions are to be drawn, and if it
requires the test-taker to perform the behavior that is being measured, it can
claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly define the
achievement that you are measuring. A test of tennis competency that asks
someone to run a 100-yard dash obviously lacks content validity. If you are
trying to assess a person's ability to speak a second language in a
conversational setting, asking the learner to answer paper-and-pencil multiple-choice questions requiring grammatical judgments does not achieve content
validity.
b. Criterion-Related Evidence
A second form of
evidence of the validity of a test may be found in what is called
criterion-related evidence, also referred to as criterion-related validity, or
the extent to which the "criterion" of the test has actually been
reached. You will recall that in Chapter 1 it was noted that most
classroom-based assessment with teacher-designed tests fits the concept of criterion-referenced assessment. In such tests, specified classroom objectives are measured, and implied predetermined levels of performance are expected to be reached (80 percent is considered a minimal passing grade). In the case of
teacher-made classroom assessments, criterion-related evidence is best demonstrated
through a comparison of results of an assessment with results of some other
measure of the same criterion.
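In practice, that comparison can be as simple as correlating the two sets of results. The sketch below is my own invented example, not Brown's: scores on a teacher-made unit test are compared with ratings from an independent measure of the same objective, such as a later oral interview.

# Illustrative sketch: criterion-related evidence via a comparison of a
# classroom test with another measure of the same criterion.
# All scores are invented for demonstration.
from statistics import correlation  # Python 3.10+

unit_test   = [82, 65, 91, 58, 77, 88, 70]  # unit test, percent correct
oral_rating = [80, 60, 94, 55, 72, 90, 68]  # oral interview on the same objective

r = correlation(unit_test, oral_rating)
print(f"criterion-related evidence: r = {r:.2f}")
# A high positive r supports the claim that the test measures its
# criterion; a low r would call the test's validity into question.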
c. Construct-Related Evidence
A third kind of
evidence that can support validity, but one that does not play as large a role
for classroom teachers, is construct-related validity, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions. Constructs may or may not be directly or empirically measured; their verification often requires inferential data. "Proficiency" and "communicative competence" are linguistic constructs; "self-esteem" and "motivation" are psychological constructs. Virtually every issue in
language learning and teaching involves theoretical constructs.
d. Consequential Validity
As well as the above
three widely accepted forms of evidence that may be introduced to support the
validity of an assessment, two other categories may be of some interest and
utility in your own quest for validating classroom tests. Messick (1989), Gronlund
(1998), McNamara (2000), and Brindley (2001), among others, underscore the
potential importance of the consequences of using an assessment. Consequential
validity encompasses all the consequences of a test, including such
considerations as its accuracy in measuring intended criteria, its impact on
the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test's interpretation and use.
e. Face Validity
An important facet of consequential validity is the extent to which "students view the assessment as fair, relevant, and useful for improving learning" (Gronlund, 1998, p. 210), or what is popularly known as face validity. "Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers" (Mousavi, 2002, p. 244).
Sometimes students don't know what is being tested when they tackle a test.
4. Authenticity
A fourth major principle of language testing is authenticity, a concept that is a
little slippery to define, especially within the art and science of evaluating
and designing tests. Bachman and Palmer (1996, p. 23) define authenticity as
"the degree of correspondence of the characteristics of a given language
test task to the features of a target language task," and then suggest an
agenda for identifying those target language tasks and for transforming them
into valid test items. Essentially, when you make a claim for authenticity in a
test task, you are saying that this task is likely to be enacted in the real
world. Many test item types fail to simulate real-world tasks. They may be
contrived or artificial in their attempt to target a grammatical form or a
lexical item. The sequencing of items that bear no relationship to one another
lacks authenticity. One does not have to look very long to find reading
comprehension passages in proficiency tests that do not reflect a real-world passage. In a test, authenticity may be present in the following ways:
• The language in the test is as natural as possible.
• Items are contextualized rather than isolated.
• Topics are meaningful (relevant, interesting) for the learner.
• Some thematic organization to items is provided, such as through a story line or episode.
• Tasks represent, or closely approximate, real-world tasks.
5. Washback
A facet of consequential validity, discussed above, is the effect of testing on teaching and learning (Hughes, 2003, p. 1), otherwise known among language testing specialists as washback. In large-scale assessment, washback generally refers to the effects the tests have on instruction in terms of how students prepare for the test. "Cram" courses and "teaching to the test" are examples of such washback. Another form of washback that occurs more in classroom assessment is the information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes
the effects of an assessment on teaching and learning prior to the assessment
itself, that is, on preparation for the assessment. Informal performance
assessment is by nature more likely to have built-in washback effects because the
teacher is usually providing interactive feedback. Formal tests can also have
positive washback, but they provide no washback if the students receive a
simple letter grade or a single overall numerical score.
A little bit of washback may also help students through a specification of the numerical scores on the various subsections of the test. A subsection on verb tenses, for example, that yields a relatively low score may serve the diagnostic purpose of showing the student an area of challenge (see the sketch after this paragraph). Another viewpoint on washback is achieved by a quick consideration of the differences between formative and summative tests. Formative tests, by definition, provide washback in the form of information to the learner on progress toward goals. But teachers might
be tempted to feel that summative tests, which provide assessment at the end of
a course or program, do not need to offer much in the way of washback. Such an
attitude is unfortunate because the end of every language course or program is
always the beginning of further pursuits, more learning, more goals, and more
challenges to face.
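A minimal sketch of that diagnostic breakdown (the subsections, scores, and 70 percent threshold are my invented examples, not Brown's) might look like this:

# Illustrative sketch: turning subsection scores into diagnostic washback.
# Subsection names, scores, and the threshold are invented examples.
subsections = {
    "verb tenses": 55,  # percent correct
    "vocabulary":  85,
    "reading":     78,
    "listening":   90,
}
THRESHOLD = 70  # below this, flag the area as a challenge

overall = sum(subsections.values()) / len(subsections)
print(f"overall score: {overall:.0f}%")

for area, score in subsections.items():
    note = "area of challenge: review recommended" if score < THRESHOLD else "on track"
    print(f"  {area:11s} {score:3d}%  {note}")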
Even a final examination in a course should carry with it some means for giving
washback to students. In my courses I never give a final examination as the
last scheduled classroom session. I always administer a final exam during the
penultimate session, then complete the evaluation of the exams in order to
return them to students during the last class. At this time, the students
receive scores, grades, and comments on their work, and I spend some of the
class session addressing material on which the students were not completely
clear. My summative assessment is thereby enhanced by some beneficial washback
that is usually not expected of final examinations. Finally, washback also
implies that students have ready access to you to discuss the feedback and
evaluation you have given. While you almost certainly have known teachers with whom you wouldn't care to argue about a grade, an interactive, cooperative,
collaborative classroom nevertheless can promote an atmosphere of dialogue
between students and teachers regarding evaluative judgments. For learning to
continue, students need to have a chance to feed back on your feedback, to seek
clarification of any issues that are fuzzy, and to set new and appropriate
goals for themselves for the days and weeks ahead.
Source: Brown, H. Douglas. 2003. Language Assessment: Principles and Classroom Practices. San Francisco, California.