The word association / proficiency test: Can it still work?


Brent Wolter

Hokkaido University



For quite some time now, researchers have assumed that as learners gain in proficiency the structure of their mental lexicon comes to resemble that of native speakers. Early studies, while far from conclusive, at least hinted that this was indeed the case (Lambert, 1956; Politzer, 1978; Meara, 1982). This eventually led to a series of studies that attempted to devise multiple response word association tests as a quick and easy means of assessing proficiency (Randall, 1980; den Dulk, 1985; Kruse, Pankhurst & Sharwood Smith, 1987). The results of these studies were not particularly convincing, though, and the Kruse, Pankhurst & Sharwood Smith (1987) study, which drew largely on the previous two studies for its theoretical support, seemed to effectively put an end to attempts to develop a word association / proficiency test. There are, however, some theoretical and methodological problems with these studies that limit the conclusions that can be drawn from them. To be precise, a) little consideration was given to the selection of prompt words, and b) complex, theoretically unjustified scoring systems were used. This study was therefore undertaken in an attempt to devise a new word association / proficiency test that addresses these two issues.



For an effective word association / proficiency test to be developed, I felt that careful consideration had to be given to the selection of the prompt words. For this reason, I decided to depart from the tradition of using words and normative data from the well-known Postman-Keppel lists (1970). Instead, I opted to use normative data from the Edinburgh Associative Thesaurus (EAT; Kiss et al., 1973). The main advantage of the EAT is that although only about 100 native speaker responses were collected for each prompt word (as opposed to 1000 in the Postman-Keppel data), the number of prompt words for which data was collected was far greater (8400 as opposed to 100 in the Postman-Keppel data). This allowed for much greater flexibility when choosing the prompt words. In selecting the prompt words, I established criteria designed to avoid two types of response patterns.


The first type of prompt word I wanted to avoid was one that tended to elicit a high proportion of primary responses (e.g. man, black, etc.). Such words are not well suited to a multiple response test format: beyond the primary response, it is often difficult to produce many responses that bear an overt semantic connection to the prompt word. The second type of prompt word I excluded from the test was one that tended to produce a high proportion of idiosyncratic responses (i.e. responses that were produced by only one respondent). The reason for this was that piloting indicated that these prompt words were often difficult to respond to, even for native speakers. I needed objective criteria for screening prompt words, however, and this led to the formation of the so-called "15-60" rule. The "15" part of this rule states that prompt words whose primary response made up more than 15% of the total number of responses were not included. Similarly, the "60" part of the rule requires non-idiosyncratic responses to account for at least 60% of the total. Thus, prompt words such as man were not included because 67% of the respondents produced the primary response woman, and prompt words such as ignore were ruled out because about half of the responses were idiosyncratic.
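The screening rule described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study, only one way the two thresholds could be checked. It assumes the normative data for a prompt word are available as a mapping from each native speaker response to the number of respondents who produced it. The function names, the placeholder response labels, and all counts are hypothetical, apart from the 67% figure for woman, which is taken from the text.

```python
def response_shares(norms):
    """Return (primary_share, shared_share) for one prompt word.

    `norms` maps each normative response to the number of native
    speakers who produced it (an assumed data layout).
    """
    total = sum(norms.values())
    if total == 0:
        raise ValueError("no normative responses for this prompt word")
    # Share of all responses taken up by the single most common response.
    primary_share = max(norms.values()) / total
    # Share taken up by non-idiosyncratic responses, i.e. responses
    # produced by more than one respondent.
    shared_share = sum(n for n in norms.values() if n > 1) / total
    return primary_share, shared_share


def passes_15_60(norms, max_primary=0.15, min_shared=0.60):
    """Keep a prompt word only if it satisfies both parts of the rule."""
    primary_share, shared_share = response_shares(norms)
    return primary_share <= max_primary and shared_share >= min_shared


def with_singletons(shared, n_single):
    """Helper for the invented examples: add n_single one-off responses."""
    norms = dict(shared)
    norms.update({f"idiosyncratic_{i}": 1 for i in range(n_single)})
    return norms


# Invented response profiles, 100 respondents each.
man_like = with_singletons({"woman": 67, "r2": 12, "r3": 8, "r4": 5}, 8)
ignore_like = with_singletons(
    {"r1": 14, "r2": 12, "r3": 10, "r4": 8, "r5": 6}, 50
)
balanced = with_singletons(
    {"r1": 14, "r2": 12, "r3": 12, "r4": 10, "r5": 8, "r6": 6, "r7": 4}, 34
)

for name, norms in [("man-like", man_like),
                    ("ignore-like", ignore_like),
                    ("balanced", balanced)]:
    print(name, passes_15_60(norms))
```

In this sketch the man-like profile fails the "15" part, the ignore-like profile fails the "60" part, and only the balanced profile is retained.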


Another issue I wanted to address was the scoring system. Past studies had often used complex scoring systems which awarded points by multiplying a number assigned to responses based on their order in the normative data by the order in which the subject produced them. For various reasons I felt it was necessary to introduce a simpler and more theoretically motivated scoring system (see Wolter, 2002 for a more detailed discussion of this issue). Therefore, in this study I assigned scores in two ways by using a) non-weighted scores and b) a straight weighting system for weighted stereotypy scores. For the non-weighted scoring system, I simply awarded 1 point to each response that appeared in the normative data, while the weighted stereotypy scores were calculated based on the number of native speakers who had produced the same response. It should be noted here that the implementation of the 15-60 rule ensured that no single response would receive an extremely high number of points, as no response was produced by more than 15 native speakers in the normative data.
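The two scoring systems can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the scoring procedure used in the study: it assumes that "straight weighting" means a response earns as many points as the number of native speakers who produced it, that repeated responses to the same prompt are counted once, and that matching ignores case and surrounding spaces. The function name and the normative counts for keep are invented for the example.

```python
def score_responses(responses, norms):
    """Return (non_weighted, weighted) scores for one prompt word.

    `responses` is the list of responses a test taker gave to the prompt.
    `norms` maps each normative response to the number of native
    speakers who produced it (an assumed data layout).
    """
    seen = set()
    non_weighted = 0
    weighted = 0
    for response in responses:
        key = response.strip().lower()
        if key in seen:
            # Assumption: a repeated response is scored only once.
            continue
        seen.add(key)
        count = norms.get(key, 0)
        if count > 0:
            non_weighted += 1   # 1 point for any response found in the norms
            weighted += count   # points equal to the native speaker count
    return non_weighted, weighted


# Invented normative counts for the prompt word "keep".
keep_norms = {"hold": 12, "save": 9, "stay": 3}

# "calm" does not appear in the norms, so it earns no points.
print(score_responses(["hold", "calm", "stay"], keep_norms))
```

Because both scores are simple sums over matched responses, a test taker's total for the whole test would be obtained by adding the per-prompt scores together.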


These tests were given to a group of ESL learners (the NNS group, n=31) and a group of native speakers (the NS group, n=42). In addition, a C-test was given to the NNS group to serve as a means of assessing proficiency. Finally, as a follow-up analysis, comprehensive test scores for 17 of the subjects in the NNS group were also included.



Although the results were significant, they were not as good as I had hoped. The NS group did perform better than the NNS group, as shown in Table 1.



Table 1. Comparison of non-native speaker (NNS) and native speaker (NS) performance on the WAT (Wolter, 2002).

                    NNS (n = 31)          NS (n = 42)
                    Mean      SD          Mean      SD          t-value
Non-weighted        22.2      5.9         33.1      6.3         7.5*
Weighted            105.6     32.8        168.1     45.9        6.7*

* p < .001


However, the correlations between the WAT scores and the two proficiency measures were not consistently high, as shown in Table 2.


Table 2. Correlation coefficients for scores on the WAT, C-test and comprehensive test.

                        C-test      Comprehensive test (n=17)
Non-weighted scores     .50**       .78**
Weighted scores         .53**       .51*





In view of these results, it seems that we still have some way to go before we can develop an effective word association / proficiency test. However, a number of interesting findings emerged from this study which should help us in reformulating our approach to the issue. To begin with, some prompt words seemed to function better than others. In particular, delexicalized verbs resulted in a number of responses which seemed intuitively "good", but were nonetheless non-scoring (e.g. responses of calm, promise, and quiet to the prompt word keep were all non-scoring). A second problematic group consisted of prompt words, such as travel and visit, that resulted in a high number of individualized responses. It may be better to screen out and eliminate such prompt words from future tests.


Another interesting finding has to do with the correlations between the responses and the comprehensive test scores. To be exact, a surprising pattern emerged when primary, secondary, and tertiary response scores were compared with comprehensive exam scores. The results are shown in Table 3.



Table 3. Correlation coefficients for primary, secondary, & tertiary responses and comprehensive examinations scores (n=17; R1=primary response, R2=secondary response, R3=tertiary response).

                    R1        R2        R3        Total
Non-weighted        .05       .55*      .80**     .78**
Weighted            -.24      .38       .72**     .51*





What's noteworthy about the correlations in Table 3 is that they increase so smoothly from the first to the last response, with the tertiary response scores yielding some fairly impressive correlation coefficients. This suggests that most learners can produce at least one "good" response to a prompt word regardless of their proficiency, but that clearer distinctions can be made by requiring further responses. It would be worth designing research to establish what the optimum number of responses might be.


In conclusion, I have to say that the results of this study do not fully support the notion that a word association test can be designed to assess proficiency. However, I do feel that given enough thought, testing, and patience, an effective test is still within our reach.



Den Dulk, J.J., 1985. Productive vocabulary and the word association test. Unpublished master's thesis. University of Utrecht, Utrecht, The Netherlands.

Kiss, G.R., Armstrong, C., Milroy, R., & Piper, J., 1973. An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W., & Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. University Press, Edinburgh.

Kruse, H., Pankhurst, J., & Sharwood Smith, M., 1987. A multiple word association probe in second language acquisition research. Studies in Second Language Acquisition 9, 141-154.

Lambert, W.E., 1956. Developmental Aspects of Second Language Acquisition. In Dil, A.S. (Ed.), Language, Psychology, and Culture. Stanford University Press, Stanford, California, pp. 9-31.

Meara, P., 1982. Word association in a foreign language: A report on the Birkbeck Vocabulary Project. Nottingham Linguistic Circular 11 (2), 29-37.

Postman, L., & Keppel, G. (Eds.), 1970. Norms of Word Association. Academic Press, New York.

Randall, M., 1980. Word association behavior in learners of English as a foreign language. Polyglot 2, fiche 2.

Wolter, B., 2002. Assessing proficiency through word associations: is there still hope? System 30 (3), 315-329.