Reliability is one of the most important characteristics of all tests in general, and language tests in particular. In fact, an unreliable test is worth nothing. To understand the concept of reliability, an example may prove helpful. Suppose a student took a grammar test comprising one hundred items and received a score of 90. Suppose further that the same student took the same test two days later and got a score of 45. Finally, suppose that the same student took the same test a third time and received a score of 70. What would you think of these scores? What would you think of the student?
Assuming that the student’s knowledge of English cannot undergo drastic changes within such a short period, the best explanation would be that there must have been something wrong with the test. How could you rely on a test that does not produce consistent scores? How could you make a sound decision on the basis of such test scores? This is the essence of the concept of reliability, i.e., producing consistent scores.
Reliability, then, can be technically defined as “the extent to which a test produces consistent scores at different administrations to the same or similar group of examinees”. If a test produced exactly the same scores at different administrations to the same group, that test would be perfectly reliable. Perfect reliability, however, does not exist in practice. There are many factors influencing test score reliability.
These factors are:
1. the sampling task
2. poor student motivation
3. the test writer’s or the examiner’s control
4. a range of conditions, from examinees' differing mental and physical states to the precision of the test items and the administration and scoring procedures.
Therefore, reliability is “the extent to which a test produces consistent scores.” This means that the more consistent the scores, the more reliable the test.
Statistically speaking, reliability is represented by the letter “r”, whose magnitude ranges between zero and one; zero indicates the minimum and one the maximum degree of test score reliability. It should be mentioned that “r” is an independent statistical concept. It does not have anything to do with the content or the form of the test. It deals solely with the scores produced by a test. In fact, one can estimate “r” without having any information about the content of the test. Thus, when one talks about the reliability of a test, one refers to the scores and not to the content or the form of the test.
Having understood the concept of reliability, one should next estimate “r”, which requires some statistical competence. In the following section, an attempt is made to explain the procedures for estimating “r” in as non-technical terms as possible. Four methods of estimating reliability are presented: (1) test-retest, (2) parallel forms, (3) split-half, and (4) KR-21.
1. Test-Retest Method
As the name implies, in this method a single test is administered to a single group of examinees twice. The first administration is called “test” and the second administration is referred to as “retest”. The correlation between the two sets of scores, obtained from testing and retesting, would determine the magnitude of reliability. Since there is a time interval (usually more than two weeks) between the two administrations, this kind of reliability estimate is also known as “stability of scores over time”.
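The correlation step can be sketched in a few lines of Python. This is only an illustration: the text does not say which correlation coefficient is meant, so the sketch assumes the Pearson product-moment coefficient, and the two score lists are hypothetical data invented for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of the deviations from each mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores of the same six examinees (not taken from the text).
test_scores = [90, 85, 70, 65, 50, 40]    # first administration ("test")
retest_scores = [88, 80, 75, 60, 55, 42]  # second administration ("retest")

reliability = pearson_r(test_scores, retest_scores)
print(f"test-retest reliability estimate: r = {reliability:.2f}")
```

The closer the examinees' rank order and relative distances stay across the two administrations, the closer the estimate comes to 1.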
Although obtaining reliability estimates through the test-retest method seems very easy, it has some practical disadvantages. First, it is not very easy to have the same group of examinees available for two different administrations. Second, the time interval creates two obstacles. On the one hand, if it is very short, there might be a practice effect as well as a memorization effect carried over from the first administration. On the other hand, if the interval is too long, there will be a learning effect, i.e., the examinees’ state of knowledge will not be the same as it was at the first administration. To avoid these problems, other methods of estimating reliability have been developed.
2. Parallel Forms Method
In order to remove some of the problems inherent in the test-retest method, experts have developed the parallel forms method. In this method, two parallel forms of a single test are given to one group of examinees. The correlation between the scores obtained from the two forms is computed to indicate the reliability of the scores. This method has an advantage over the test-retest method in that there is no need to administer the same test twice. Thus, the problem of examinees’ knowledge undergoing changes does not exist in this method. Nevertheless, this method has a major shortcoming: constructing two parallel forms of a test is not an easy task.
There are certain logical and statistical criteria that a pair of parallel forms must meet. Therefore, most teachers and test developers avoid this method. Due to the complexity of the task, they prefer to use other methods of estimating reliability.
3. Split-Half Method
In the test-retest method, one group of examinees was needed for two administrations. In the parallel forms method, on the other hand, two forms of a single test were needed. Each of these requirements is considered a disadvantage. To obviate these shortcomings, the split-half method has been developed. In this method, a single form of a test is given to a single group of examinees. Then each examinee’s test is split (divided) into two halves. The correlation between the scores of the examinees on the first half and the second half will determine the reliability of the test scores. The only problem with this method is how to divide the test items into two halves. The best way is to use odd and even items to form each half, i.e., items numbered 1, 3, 5, 7, etc. will constitute the first half, and items numbered 2, 4, 6, 8, etc. will form the second half.
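The odd-even split described above might be sketched as follows. The response matrix is hypothetical (five examinees, eight items scored 1 for correct and 0 for incorrect), and the Pearson coefficient is assumed as the correlation measure, since the text does not name one.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


def split_half_scores(responses):
    """Return each examinee's score on the odd items and on the even items.

    `responses` holds one list of 0/1 item scores per examinee, in item order.
    """
    odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return odd_half, even_half


# Hypothetical item-level results, invented for illustration only.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
]

odd_half, even_half = split_half_scores(responses)
half_correlation = pearson_r(odd_half, even_half)
print(odd_half, even_half)
print(f"correlation between the two halves: r = {half_correlation:.2f}")
```

Splitting by odd and even item numbers keeps both halves spread across the whole test, so neither half is dominated by the easier early items or the harder late ones.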
4. The KR-21 Method
The previously mentioned methods of estimating test score reliability require a statistical procedure called ‘correlation’. The majority of teachers and non-professional test developers, however, are not very familiar with statistics. Thus, they may have some problems in using statistical formulas and interpreting the outcome of statistical analyses. To overcome these problems, two statisticians, Kuder and Richardson, developed a series of formulas. One of these formulas is used to estimate test score reliability through simple mathematical operations. The formula is called KR-21, in which K and R are the initials of the two statisticians' surnames and 21 refers to the number of the formula in the series. This formula is used to estimate the reliability of a single test given to one group of examinees in a single administration. The method only requires testers and teachers to be able to calculate two simple statistics: (1) the mean and (2) the variance. The methods of computing the mean and variance are explained in almost all introductory statistics books. However, for the purposes of clarification, a brief explanation of each will be given here. For further information, interested readers should consult statistics books.
(1) The Mean: The mean, commonly known as the average, is the most frequently used concept in statistics. It simply refers to a single score that best represents the scores of a group. If each score is symbolized as X, then the mean (represented by X̄ and read “X bar”) is computed by adding up all Xs and dividing the sum by the number of scores (represented by N). To represent the sum of scores, the Greek letter Σ, read “sigma”, is used in statistics. Thus the statistical formula to compute the mean would be:
$$\bar{X} = \frac{\sum X}{N}$$
This simply means: add up all scores (ΣX) and divide the sum by the number of scores (N). A numerical example may be helpful. Consider the scores of fifteen students who took a language test:
98 89 78
97 89 73
95 84 70
93 82 60
90 82 50
To determine the mean score of the test, add the fifteen scores, that is, ΣX = 1230. Then divide the sum by N = 15 to give X̄ = 82.
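The same calculation can be checked with a short Python sketch. The fifteen scores are the ones listed above; the variable names are chosen only for this illustration.

```python
# The fifteen language-test scores from the example above.
scores = [98, 89, 78, 97, 89, 73, 95, 84, 70, 93, 82, 60, 90, 82, 50]

total = sum(scores)  # sum of all scores (sigma X)
n = len(scores)      # number of scores (N)
mean = total / n     # the mean (X bar)

print(total, n, mean)  # prints: 1230 15 82.0
```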
(2) The Variance: The variance, represented by the letter V, refers to the variation of scores around the mean. Although the formula for computing variance may seem cumbersome, it is not actually difficult:
$$V = \frac{\sum (X - \bar{X})^2}{N - 1}$$
To avoid complexities, the formula can be read as the following sequence of operations:
1. Compute the mean (X̄).
2. Compute the deviation scores by subtracting the mean from each single score (X - X̄).
3. Square every deviation score: (X - X̄)².
4. Add up all squared deviation scores: Σ(X - X̄)².
5. Divide the result of step 4 by N - 1.
In order to clarify the computational procedures, a numerical example is given below.
Consider the scores of ten subjects on a short grammar test: 3, 2, 3, 4, 5, 5, 5, 6, 6, 8.
To compute the variance we follow the instructions given before:
1. Compute the mean.
2. Compute the deviation scores.
3. Square each deviation score.
X     X̄      X - X̄    (X - X̄)²
3     4.7    -1.7     2.89
2     4.7    -2.7     7.29
3     4.7    -1.7     2.89
4     4.7    -0.7     0.49
5     4.7    +0.3     0.09
5     4.7    +0.3     0.09
5     4.7    +0.3     0.09
6     4.7    +1.3     1.69
6     4.7    +1.3     1.69
8     4.7    +3.3     10.89
4. Add up all squared deviation scores: Σ(X - X̄)² = 28.10
5. Divide the result of step 4 by N - 1: V = 28.10 / 9 ≈ 3.12
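The five steps can be followed in code as a check on the hand calculation. The sketch below uses the ten grammar-test scores from the example and divides by N - 1, as in step 5; the variable names are illustrative only.

```python
# The ten grammar-test scores from the example above.
scores = [3, 2, 3, 4, 5, 5, 5, 6, 6, 8]
n = len(scores)

# Step 1: compute the mean.
mean = sum(scores) / n

# Step 2: compute the deviation of each score from the mean.
deviations = [x - mean for x in scores]

# Step 3: square every deviation score.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: add up the squared deviation scores.
sum_of_squares = sum(squared_deviations)

# Step 5: divide by N - 1 to obtain the variance.
variance = sum_of_squares / (n - 1)

print(round(mean, 2), round(sum_of_squares, 2), round(variance, 2))
# prints: 4.7 28.1 3.12
```

Note that scores above the mean (5, 6 and 8 here) produce positive deviations; the sign disappears once the deviations are squared.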
Having computed the mean and variance, we are now ready to put these values in the KR-21 formula and get the reliability of the test scores. The formula is as follows:
$$r = \frac{K}{K - 1}\left[1 - \frac{\bar{X}(K - \bar{X})}{KV}\right]$$
In this formula, K refers to the number of items in the test, X̄ represents the mean of test scores, and V is the variance of test scores. Again, a numerical example follows:
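Suppose a one-hundred-item test is administered to a group of students. The mean and variance are computed to be 65 and 100, respectively. The reliability of the scores is computed using the KR-21 formula:
$$r = \frac{100}{99}\left[1 - \frac{65(100 - 65)}{100 \times 100}\right] = \frac{100}{99}(1 - 0.2275) \approx 0.78$$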
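For readers who prefer to check the arithmetic by machine, the calculation might be sketched as below. The function name `kr21` and its parameter names are chosen for this illustration, and the code assumes the standard form of KR-21, r = K/(K - 1) × [1 - X̄(K - X̄)/(K × V)], with K, X̄ and V as defined in the text.

```python
def kr21(k, mean, variance):
    """Estimate reliability with KR-21 (standard form assumed).

    k        -- number of items in the test (K)
    mean     -- mean of the test scores (X bar)
    variance -- variance of the test scores (V)
    """
    if k < 2:
        raise ValueError("the test needs at least two items")
    if variance <= 0:
        raise ValueError("the variance must be positive")
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))


# Values from the numerical example: 100 items, mean 65, variance 100.
r = kr21(k=100, mean=65, variance=100)
print(f"{r:.2f}")  # prints: 0.78
```

Only three numbers are needed: the number of items, the mean and the variance. No item-level data and no correlation have to be computed.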
The procedure may seem a little complex, but with some practice, it will prove easy and very useful. This method is especially valuable for those who do not have a strong statistical background. Of the four methods of estimating reliability, the KR-21 method is the most practical and most commonly used. Therefore, it is recommended that teachers and administrators use this method. Having covered the first characteristic of a good test, i.e., reliability, and the ways of estimating it, we devote the next section to the second characteristic, namely, validity.