UPSI Digital Repository (UDRep)

Type: Article
Subject: LB Theory and practice of education
ISSN: 2232-1926
Main Author: Brown, James Dean
Title: What do the L2 generalizability studies tell us?
Place of Production: Tanjong Malim
Publisher: Fakulti Teknikal dan Vokasional
Year of Publication: 2011
Notes: Vol. 1 (2011): International Journal of Assessment and Evaluation in Education
Corporate Name: Perpustakaan Tuanku Bainun

Abstract:
This research synthesis examines the relative magnitudes of the variance components found in 44 generalizability (G) theory studies in L2 testing. I begin by explaining what G theory is and how it works. In the process, I explain the differences between relative and absolute decisions, between crossed and nested facets, and between random and fixed facets, as well as what variance components (VCs) are and how VCs are calculated. Next, I provide an overview of G-theory studies in L2 testing and discuss the purposes of this research synthesis. In the methods section, I describe the materials used in this research synthesis in terms of the samples of students, the tests, and the G-study designs used. I also present the analyses in terms of how the data were compiled and analyzed. The results are sorted and displayed to reveal patterns in the relative contributions to test variance of various individual facets as well as interactions between and among facets for different types of tests. I next discuss these patterns and put them into perspective. I conclude by exploring what I think the results mean for L2 testing in general.

Keywords: generalizability theory, norm-referenced relative decisions, measurement facets, variance components
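For readers unfamiliar with how the variance components the abstract mentions are obtained, the following is a minimal sketch (not the author's actual analysis) of the standard ANOVA-based estimation for the simplest fully crossed design, persons x items (p x i), together with a D-study computing the generalizability coefficient for relative decisions and the phi coefficient for absolute decisions. The function names and the example score matrix are illustrative assumptions.

```python
import numpy as np

def g_study_p_by_i(scores):
    """Variance components for a fully crossed persons x items (p x i) design.

    scores: 2-D array, rows = persons, columns = items (one score per cell).
    Returns (vc_p, vc_i, vc_pi): persons, items, and the p x i interaction
    (confounded with error in this design), estimated from expected mean squares.
    """
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    ss_p = n_i * ((person_means - grand) ** 2).sum()
    ss_i = n_p * ((item_means - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_res / ((n_p - 1) * (n_i - 1))

    vc_pi = ms_pi                            # interaction + error
    vc_p = max((ms_p - ms_pi) / n_i, 0.0)    # "true score" variance (negative estimates set to 0)
    vc_i = max((ms_i - ms_pi) / n_p, 0.0)    # item difficulty; matters only for absolute decisions
    return vc_p, vc_i, vc_pi

def d_study(vc_p, vc_i, vc_pi, n_items):
    """D-study: generalizability (relative) and phi (absolute) coefficients
    for a hypothetical test of n_items items."""
    rel_err = vc_pi / n_items                # items don't enter relative error
    abs_err = (vc_i + vc_pi) / n_items       # items do enter absolute error
    return vc_p / (vc_p + rel_err), vc_p / (vc_p + abs_err)

# Illustrative data: 4 persons x 3 items
scores = np.array([[5, 4, 5], [3, 2, 2], [4, 4, 3], [2, 1, 2]], float)
vc_p, vc_i, vc_pi = g_study_p_by_i(scores)
g, phi = d_study(vc_p, vc_i, vc_pi, n_items=3)
```

Note that phi can never exceed the relative G coefficient, since absolute error adds the item component; and for the p x i design at the observed test length, the relative G coefficient reduces to Cronbach's alpha.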

References

Abeywickrama, P. S. (2007). Measuring the knowledge of textual cohesion and coherence in

learners of English as a second language (ESL). (Unpublished PhD dissertation).

University of California at Los Angeles.

 

Alharby, E. R. (2006). A comparison between two scoring methods, holistic vs analytic, using

two measurement models, the generalizability theory and many-facet Rasch measurement,

within the context of performance assessment. (Unpublished PhD dissertation).

Pennsylvania State University, State College, PA.

 

Bachman, L. F. (1997). Generalizability theory. In C. Clapham & D. Corson (Eds.),

Encyclopedia of languages and education Volume 7: Language testing and assessment (pp.

255 ‒ 262). Dordrecht, Netherlands: Kluwer Academic.

 

Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge

University Press.

 

Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater

judgments in a performance test of foreign language speaking. Language Testing, 12(2),

239 ‒ 257.

 

Banno, E. (2008). Investigating an oral placement test for learners of Japanese as a second

language. (Unpublished PhD dissertation). Temple University, Philadelphia, PA.

 

Blok, H. (1999). Reading to young children in educational settings: A meta-analysis of recent

research. Language Learning, 49(2), 343 ‒ 371.

 

Bolus, R. E., Hinofotis, F. B., & Bailey, K. M. (1982). An introduction to generalizability theory

in second language research. Language Learning, 32, 245 ‒ 258.

 

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College

Testing Program.

 

Brennan, R. L. (2001). Generalizability theory. New York: Springer.

 

Brown, J. D. (1982). Testing EFL reading comprehension in engineering English. (Unpublished

PhD dissertation). University of California at Los Angeles.

 

Brown, J. D. (1984). A norm-referenced engineering reading test. In A.K. Pugh & J.M. Ulijn

(Eds.), Reading for professional purposes: studies and practices in native and foreign

languages. London: Heinemann Educational Books.

 

Brown, J. D. (1988). 1987 Manoa Writing Placement Examination: Technical Report #1.

Honolulu, HI: Manoa Writing Program, University of Hawai‘i at Manoa.

 

Brown, J. D. (1989). 1988 Manoa Writing Placement Examination: Technical Report #2.

Honolulu, HI: Manoa Writing Program, University of Hawai‘i at Manoa.

 

Brown, J. D. (1990a). 1989 Manoa Writing Placement Examination: Technical Report #5.

Honolulu, HI: Manoa Writing Program, University of Hawai‘i at Manoa.

 

Brown, J. D. (1990b). Short-cut estimates of criterion-referenced test consistency. Language

Testing, 7(1), 77 ‒ 97.

 

Brown, J. D. (1991). 1990 Manoa Writing Placement Examination: Technical Report #11.

Honolulu, HI: Manoa Writing Program, University of Hawai‘i at Manoa.

 

Brown, J. D. (1993). A comprehensive criterion-referenced language testing project. In D.

Douglas & C. Chapelle (Eds.), A New Decade of Language Testing Research (pp. 163 ‒

184). Washington, DC: TESOL.

 

Brown, J. D. (1999). Relative importance of persons, items, subtests and languages to TOEFL

test variance. Language Testing, 16(2), 216 ‒ 237.

 

Brown, J. D. (2005a). Testing in language programs: A comprehensive guide to English

language assessment (New edition). New York: McGraw-Hill.

 

Brown, J. D. (2005b). Statistics corner ‒ Questions and answers about language testing

statistics: Generalizability and decision studies. Shiken: JALT Testing & Evaluation SIG

Newsletter, 9(1), 12 – 16. Retrieved from http://jalt.org/test/bro_21.htm. [accessed Dec. 10,

2006].

 

Brown, J. D. (2007). Multiple views of L1 writing score reliability. Second Language Studies

(Working Papers), 25(2), 1-31.

 

Brown, J. D. (2008). Raters, functions, item types, and the dependability of L2 pragmatic tests.

In E. Alcón Soler & A. Martínez-Flor (Eds.), Investigating pragmatics in foreign language

learning, teaching and testing (pp. 224 ‒ 248). Clevedon, UK: Multilingual Matters.

 

Brown, J. D., & Bailey, K. M. (1984). A categorical instrument for scoring second language

writing skills. Language Learning, 34, 21 ‒ 42.

 

Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge:

Cambridge University.

 

Brown, J. D., & Ross, J. A. (1996). Decision dependability of item types, sections, tests, and

the overall TOEFL test battery. In M. Milanovic & N. Saville (Eds.), Performance testing,

cognition and assessment (pp. 231 ‒ 265). Cambridge: Cambridge University.

 

Chiu, C. W.T. (2001). Scoring performance assessments based on judgments: Generalizability

theory. Boston: Kluwer Academic.

 

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of

behavioral measurements: Theory of generalizability for scores and profiles. New York:

Wiley.

 

Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A

liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137 ‒ 163.

 

Gao, L., & Rodgers, T. (2007). Cognitive-psychometric modeling of the MELAB reading items.

Paper presented at the National Council of Measurement in Education Conference,

Chicago, IL.

 

Gebril, A. (2009). Score generalizability of academic writing tasks: Does one test method fit

all? Language Testing, 26, 507 ‒ 531.

 

Gebril, A. (2010). Bringing reading-to-writing and writing-only assessment tasks together: A

generalizability analysis. Assessing Writing, 15, 100 ‒ 117.

 

Glass, G. V. (1976). Primary, secondary, and meta-analysis. Educational Researcher, 5, 3 ‒ 8.

Goldschneider, J., & DeKeyser, R. M. (2001). Explaining the “natural order of L2 morpheme

acquisition” in English: A meta-analysis of multiple determinants. Language Learning, 51,

1–50.

 

Jeon, E., & Kaya, T. (2006). Effects of L2 instruction on interlanguage pragmatic development:

A meta-analysis. In J. Norris & L. Ortega (Eds.), Synthesizing Research on Language

Learning and Teaching (pp. 165 ‒ 211). Philadelphia: John Benjamins.

 

Kim, Y.-H. (2009). A G-theory analysis of rater effect in ESL speaking assessment. Applied

Linguistics, 30(3), 435 ‒ 440.

 

Kirk, R. E. (1968). Experimental design: Procedures for the behavioral sciences. Belmont, CA:

Brooks/Cole.

 

Kozaki, Y. (2004). Using GENOVA and FACETS to set multiple standards on performance

assessment for certification in medical translation of Japanese into English. Language

Testing, 21(1), 1 ‒ 27.

 

Kunnan, A. J. (1992). An investigation of a criterion-referenced test using G-theory, and factor

and cluster analysis. Language Testing, 9(1), 30-49.

 

Lane, S., & Sabers, D. (1989). Use of generalizability theory for estimating the dependability

of a scoring system for sample essays. Applied measurement in education, 2(3), 195 ‒ 205.

 

Lee, Y.-W. (2005) Dependability of scores for a new ESL speaking test: Evaluating prototype

tasks. TOEFL Monograph MS-28. Princeton, NJ: ETS.

 

Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting of

integrated and independent tasks. Language Testing, 23(2), 131 ‒ 166.

 

Lee, Y.-W, Gentile, C., & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scores

from humans and e-rater. TOEFL Research Report RR-81. Princeton, NJ: ETS.

 

Lee, Y.-W, & Kantor, R. (2005). Dependability of ESL writing test scores: Evaluating prototype

tasks and alternative rating schemes. TOEFL Monograph MS-31. Princeton, NJ: ETS.

 

Lee, Y.-W, & Kantor, R. (2007). Evaluating prototype tasks and alternative rating schemes for

a new ESL writing test through G-theory. International Journal of Testing, 7(4), 353 ‒ 385.

 

Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement

in the development of performance assessments of the ESL speaking skills of immigrants.

Language Testing, 15, 158 ‒ 180.

 

Mackey, A., & Goo, J. (2007). Interaction research in SLA: A meta-analysis and research

synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition:

A series of empirical studies (pp. 407 – 452). Oxford: Oxford University.

 

Masgoret, A.-M., & Gardner, R. C. (2003). Attitudes, motivation, and second language learning:

A meta-analysis of studies conducted by Gardner and associates. Language Learning, 53,

123 – 163.

 

McNamara, T. F. (1996). Measuring second language performance. New York: Longman.

 

Molloy, H., & Shimura, M. (2005). An examination of situational sensitivity in medium-scale

interlanguage pragmatics research. In T. Newfields, Y. Ishida, M. Chapman, & M. Fujioka

(Eds.), Proceedings of the May 22 ‒ 23, 2004 JALT Pan-SIG Conference (pp. 16 ‒ 32).

Tokyo: JALT Pan-SIG Committee. Available online at www.jalt.org/pansig/2004/HTML/

ShimMoll.htm. [accessed Dec. 10, 2006].

 

Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and

quantitative meta-analysis. Language Learning, 50, 417 – 528.

 

Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language

learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on

language learning and teaching (pp. 3 ‒ 50). Philadelphia: John Benjamins.

 

Norris, J. M., & Ortega, L. (2007). The future of research synthesis in applied linguistics:

Beyond art or science. TESOL Quarterly, 41(4), 805 ‒ 815.

 

Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and

challenges. Annual Review of Applied Linguistics, 30, 85 ‒ 110.

 

Park, T. (2007). Investigating the construct validity of the Community Language Program

(CLP) English Writing Test. (Unpublished PhD dissertation). Teachers College, Columbia

University, New York, NY.

 

Rolstad, K., Mahoney, K., & Glass, G. (2005). Weighing the evidence: A meta-analysis of

bilingual education in Arizona. Bilingual Research Journal, 29, 43 ‒ 67.

 

Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of

experiential factors. Language Testing, 15(1), 1 ‒ 20.

 

Russell, J., & Spada, N. (2006). The effectiveness of corrective feedback for the acquisition of

L2 grammar: A meta-analysis of the research. In J. M. Norris & L. Ortega (Eds.),

Synthesizing research on language learning and teaching (pp. 133 ‒ 164). Philadelphia:

John Benjamins.

 

Sahari, M. (1997). Elaboration as a text-processing strategy: A meta-analytic review. RELC

Journal, 28(1), 15 ‒ 27.

 

Sawaki, Y. (2003). A comparison of summarization and free recall as reading comprehension

tasks in web-based assessment of Japanese as a foreign language. (Unpublished PhD

dissertation). University of California at Los Angeles.

 

Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment:

Reporting a score profile and a composite. Language Testing, 24(3), 355-390.

 

Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation

modeling. Language Testing, 22(1), 1-30.

 

Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973-1980. British Journal

of Mathematical and Statistical Psychology, 34, 133-166.

 

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA:

Sage.

 

Shin, S. (2002). Effects of subskills and text types on Korean EFL reading scores. Second

Language Studies (Working Papers), 20(2), 107-130. Retrieved from http://www.hawaii.

edu/sls/uhwpesl/on-line_cat.html. [accessed Dec. 10, 2006].

 

Solano-Flores, G., & Li, M. (2006). The use of generalizability (G) theory in testing of linguistic

minorities. Educational Measurement: Issues and Practice, Spring, 13-22.

 

Stansfield, C. W., & Kenyon, D. M. (1992). Research of the comparability of the oral

proficiency interview and the simulated oral proficiency interview. System, 20, 347-364.

 

Sudweeks, R. R., Reeve, S., & Bradshaw, W. S. (2005). A comparison of generalizability theory

and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing

Writing, 9, 239 ‒ 261.

 

Suen, H. K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum.

 

Tang, X. (2006). Investigating the score reliability of the English as a Foreign Language

Performance Test. (Unpublished PhD dissertation). Queen’s University, Kingston, Ontario,

Canada.

 

Taylor, A., Stevens, J., & Asher, W. (2006). The effects of explicit reading strategy training on

L2 reading comprehension: A meta-analysis. In J. M. Norris & L. Ortega (Eds.),

Synthesizing research on second language learning and teaching (pp. 3-50). Philadelphia:

John Benjamins.

 

Van Moere, A. (2006). Validity evidence in a university group oral test. Language Testing,

23(4), 411 ‒ 440.

 

Van Weeren, J., & Theunissen, T. J. J. M. (1987). Testing pronunciation: An Application of

generalizability theory. Language Learning, 37(1), 109 – 122.

 

Xi, X. (2003). Investigating language performance on the graph description task in a semidirect

oral test. (Unpublished PhD dissertation). University of California at Los Angeles.

 

Xi, X. (2007). Evaluating analytic scoring for the TOEFL® Academic Speaking Test (TAST)

for operational use. Language Testing, 24(2) 251 ‒ 286.

 

Xi, X., & Mollaun, P. (2006). Investigating the utility of analytic scoring for the TOEFL

Academic Speaking Test (TAST). TOEFL iBT Research Report, TOEFLiBT-01. Princeton,

NJ: ETS.

 

Yamamori, K. (2003). Evaluation of students’ interest, willingness, and attitude toward English

lessons: Multivariate generalizability theory. The Japanese Journal of Educational

Psychology, 51(2), 195 ‒ 204.

 

Yamanaka, H. (2005). Using generalizability theory in the evaluation of L2 writing. JALT

Journal, 27(2), 169-185.

 

Yoshida, H. (2004). An analytic instrument for assessing EFL pronunciation. (Unpublished

Ed.D. dissertation). Temple University, Philadelphia, PA.

 

Yoshida, H. (2006). Using generalizability theory to evaluate reliability of a performance-based

pronunciation measurement. (Unpublished ms). Osaka Jogakuin College.

 

Zhang, S. (2004). Investigating the relative effects of persons, items, sections, and languages on

TOEIC score dependability. (Unpublished MA thesis). Ontario Institute for Studies in

Education of the University of Toronto.

 

Zhang, S. (2006). Investigating the relative effects of persons, items, sections, and languages on

TOEIC score dependability. Language Testing, 23(3), 351 – 369.

 

Zhang, Y. (2003). Effects of persons, items, and subtests on UH ELIPT reading test scores.

Second Language Studies, 21(2), 107-128. Retrieved from http://www.hawaii.edu/sls/

uhwpesl/on-line_cat.html. [accessed Dec. 10, 2006].

