ELT'oSpHere

Reflections on “Language Assessment Course”: Part 2 Assessing Writing

March 17

Last week we had a “timed essay writing” practice with my intermediate level students, which went really badly.

We usually write essays in a process that involves multiple drafts and ongoing feedback from the teacher and peers. After reading and listening to some input materials that give my students ideas for their outlines, we write in class, and if they can’t finish within the allotted time, they also work at home. Occasionally we have timed writing as well. However, this time only two of them were able to finish within the given time frame (70 minutes, as in their exam), and I thought, “Well, they couldn’t do it because they didn’t want to… because they are not under exam conditions and they are not motivating themselves… etc.” But I have to admit that many students commented on the difficulty of the topic, which was gender inequality. Following this experience, last week in our “Language Assessment” course, we focused on assessing writing. Our discussions and Sara Cushing Weigle’s book “Assessing Writing” helped me to see issues related to writing assessment in a different light. Here come the highlights…

Designing writing assessment tasks

According to Weigle (2002), the development process for a test of writing involves stages such as 1) design, 2) operationalization and 3) administration. In the table below I would like to summarize the points that Weigle (2002, p.78-82) suggests considering at each stage to avoid potential problems with the test at a later stage.

When Ece (a very dear classmate and friend) said, “The stimulus material should be picked with respect to the construct definition of writing. Choosing a textual, a pictorial or a personal experience as a prompt in writing tasks should be in accordance with the construct definition and test takers’ characteristics,” it rang a loud bell in my mind, explaining the unsuccessful timed writing experience I told you about at the beginning.

I have to admit that I may have overlooked some of the points listed above. For instance, as a teacher, when I give my students a writing task to assess their language abilities, I often skip pre-testing the items or the writing prompt. But I have learned my lesson, and you will see that in the coming Metamorphosis section 🙂

Importance of having test specifications

Test specifications are blueprints or guidelines that give brief information about a test, so that a group of educators who have them in hand can design assessment tasks that assess the constructs in a standard way. Test specifications also provide a means of evaluating the finished test and its authenticity. There are many suggested formats for specifications, but according to Douglas (2000, cited in Weigle, 2002), at a minimum they should contain:

A description of test content (how the test is organised, description of the number and type of test tasks, time given to each task, & description of items)
The criteria for correctness
Sample task items

I should also say that Weigle’s book presents a particular format for test specifications, originally developed by Popham (1978), which includes a detailed description and examples that can help in developing writing tests (2002, p.84-85).

Grading the writing papers

Weigle defines the “score in a writing assessment” as the outcome of an interaction between test takers, the test (the prompt or task), the written outcome, the rater(s), and the rating scale. She categorises three types of scales based on whether the scale is intended to be specific to a single writing task (primary trait scoring) or generalized to a class of tasks (holistic or analytic scoring), and on whether a single score (primary trait or holistic) or multiple scores (analytic) are given to each written outcome.

In addition, we discussed the advantages and disadvantages of holistic and analytic scales in our class meeting. It was an interesting discussion, reflecting the real-life difficulties that we all encounter as teachers who need to score students’ written work.

Holistic Scoring

Weigle argues that the advantages of holistic scales include 1) faster grading, since a single score is assigned rather than separate scores for different aspects of writing, 2) focusing the reader’s attention on the strengths of the writer rather than on deficiencies in the writing, and 3) being more authentic and valid than analytic scoring, because it better reflects the reader’s natural reaction to the text. On the other hand, one disadvantage of holistic scales is that a single score may not provide useful diagnostic information about weaknesses in particular aspects of writing ability.

Analytic scoring

The advantages are that it provides useful diagnostic information about students’ writing abilities and a higher level of reliability, because the criteria are more detailed and comprise more items. As for the disadvantages, it is argued that it takes longer than holistic scoring, and that raters may read holistically and then adjust their analytic scores to fit the criteria.
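To make the contrast concrete, here is a minimal sketch of how the two kinds of score might be recorded. The criterion names, the 0-5 band range and the sample bands are my own illustrative assumptions, not taken from Weigle (2002) or from any real rubric.

```python
from dataclasses import dataclass

# Hypothetical analytic rubric: criterion names and the 0-5 band range are
# illustrative assumptions, not taken from Weigle (2002).
CRITERIA = ("content", "organisation", "vocabulary", "grammar", "mechanics")
MAX_BAND = 5


@dataclass
class AnalyticScore:
    bands: dict  # criterion name -> band awarded by the rater

    def total(self) -> int:
        # Overall score is the sum of the bands for all criteria.
        return sum(self.bands[c] for c in CRITERIA)

    def weakest(self) -> list:
        # Diagnostic information: the criteria with the lowest band.
        lowest = min(self.bands[c] for c in CRITERIA)
        return [c for c in CRITERIA if self.bands[c] == lowest]


# Made-up bands for one essay, scored analytically.
essay = AnalyticScore(
    {"content": 4, "organisation": 3, "vocabulary": 4, "grammar": 2, "mechanics": 3}
)

# The same essay scored holistically: one overall impression, no breakdown.
holistic_band = 3

print(essay.total(), "out of", MAX_BAND * len(CRITERIA))
print("Weakest area(s):", essay.weakest())
print("Holistic band:", holistic_band)
```

The analytic record can tell a student where to work next, while the holistic band is quicker to assign but carries no such profile.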

Standardisation

After our Thursday evening classes of ‘Language Assessment’ with Prof. Farhady and classmates (Ece, Volkan, Ece, Merve and Jerry) focusing on writing assessment, I thought about how we deal with this issue at my school.

This is a picture of me and my lovely colleagues just before the writing standardisation session.

Before grading the papers we come together in a standardisation session and go over our criteria. Then, we grade papers together within groups and assign grades and discuss the rationale behind our grading.

Although they can be time-consuming, I really think that standardisation sessions help me because they refresh my understanding of the scoring and criteria and set the scene.

In standardisation sessions we have the opportunity to talk about how raters should arrive at their decisions independently and then compare and discuss their scores, how to treat students who responded to the writing question partially or fully off topic, and what to do about memorized and/or incomplete responses.
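When two raters score the same papers independently, their scores can be compared with a simple agreement check. The sketch below computes the proportion of exact agreement and Cohen's kappa, which corrects for agreement expected by chance. The scores are made-up data, and this is only one possible check, not the procedure we use at my school. For ordered bands a weighted kappa or a correlation may suit better.

```python
from collections import Counter


def exact_agreement(r1, r2):
    # Share of papers on which both raters gave exactly the same band.
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1, r2):
    # Kappa = (observed - expected) / (1 - expected), where "expected" is the
    # agreement we would see by chance given each rater's band frequencies.
    # Note: undefined (division by zero) if both raters use one identical band
    # for every paper.
    n = len(r1)
    observed = exact_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up bands (1-5) given by two raters to the same eight papers.
rater_a = [3, 4, 2, 5, 3, 4, 2, 3]
rater_b = [3, 4, 3, 5, 3, 3, 2, 3]

print("Exact agreement:", exact_agreement(rater_a, rater_b))
print("Cohen's kappa:", round(cohens_kappa(rater_a, rater_b), 2))
```

Papers on which the raters disagree are natural candidates for discussion in the session.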

Metamorphosis: lessons to be taken

Piloting and pre-testing items with a sample group that represents the target group will become my routine in the future.

I will be much more careful about clarity, validity (the “potential of the writing prompt for eliciting written products that span the range of ability of interest among test-takers” (Weigle, 2002, p.90)), reliability of scoring, and the potential of the task to interest the test-takers.

While choosing the writing topic (personal or general), it’s always a good idea to take into consideration the homogeneity or heterogeneity of the test takers, the test purpose (general or academic writing), and the test takers’ interests, abilities and background knowledge.

In order to sustain fair practice, scoring procedures should be evaluated: this involves assessing the reliability of scores, the validity of the scoring procedures and their practicality. Scoring and issues related to the procedures should be revisited frequently.

I will definitely work on having a user-oriented scoring rubric and familiarising students with these criteria. I really believe that such an understanding will guide them in their writing.

How do you deal with assessing writing at your institution? How do your students react to writing tests? Please feel free to comment.

Next week we will deal with Assessment in ESP and I am looking forward to our Thursday class with Prof. Farhady …

Reference: Weigle, S. C. (2002). Assessing Writing. Cambridge University Press.


Reflections on “Language Assessment” Course: Part 1 Assessing Speaking

March 16

Lessons in Language Assessment

For the last 3 weeks I have been auditing Prof. Hossein Farhady’s Language Assessment course, given as part of the Yeditepe University PhD program in English Language Teaching. Though I have finished taking classes and am on the verge of writing the proposal for my doctoral thesis, I still enjoy participating in Prof. Farhady’s class for 4 hours on Thursday evenings: asking and answering questions, reading articles and books, and reflecting on issues related to fundamental concepts and principles of second language assessment with a lovely group of classmates.

The professor is also my thesis advisor and I believe that our Thursday classes and discussions will help me to develop a critical view on a variety of existing assessment procedures, establish a better understanding of fair practice, forms, functions, uses, and psychometric characteristics of language assessment procedures, paving the way to my future thesis.

“Surrender is easy but don’t”

Thought-provoking questions are flagged, real-life scenarios are suggested, and Prof. Farhady often plays the devil’s advocate when he corners us with his questions, requiring us to analyse the course content and to screen it against our experience as teachers who give tests to their students. When coming up with an intelligent and satisfying answer becomes hard and he sees the question marks in our eyes, he says: “surrender is easy but don’t”. So we promise him that we will keep our discussions in mind, always cast a critical eye on our practices and pursue validity and reliability. After all, changing the world starts with changing yourself, doesn’t it?

This week we had assessing speaking and writing on our agenda, and I would like to reflect on the lessons I took from it.

Assessing Speaking

Do you have a speaking test in your institution? I think that designing a speaking test, coming up with tasks to be used in the test and devising a scoring scale for the test-takers’ performance require a lot of hard work. In “Assessing Speaking”, Sari Luoma (2004) suggests using Hymes’s (1972) SPEAKING framework for the initial planning of a speaking test.

Situation (consideration of the physical setting and the nature of the test: is it an end-of-term speaking test?)
Participants (How many examinees will take the test? Will they work in pairs or in groups? What are the specifications for the interlocutor and the assessor?)
Ends (considerations about the outcomes of the test, including formative or summative use, how to provide feedback, test scores and fair assessment)
Act Sequence (the form and content of speech acts that will be elicited through the test)
Key (How are examiners supposed to conduct themselves in assessment situations? Will any scripts accompany the test? Is there an assessor’s guide on how supportive or impersonal they need to be?)
Instrumentalities (Which channels or modes (spoken, written, pre-recorded) and forms of speech (dialects, accents and varieties) will be used?)
Norms (Which norms of interaction, such as initiating conversation, asking clarification questions, elaborating, and (dis)agreeing, will be involved in the test?)
Genre

This framework can help in designing a speaking test because it raises questions about the linguistic, physical, psychological and social dimensions of the situation in which language is used. Consequently, the task designer has to take input, goals, roles and settings into consideration.

Prof. Farhady also presented the following types of speaking assessment:

Imitative (focus on repetition and pronunciation, e.g. the PhonePass test)
Intensive (production of controlled language and short phrases with minimal interaction)
Responsive (interaction in short conversations)
Interactive (transactional and interpersonal)
Extensive (oral presentations, story telling…)

He stressed that differentiating and understanding these types will help us gear our speaking test to better cater for the needs of our students.

Types of speaking tasks

Luoma (2004) provides a comprehensive summary of what speakers are asked to do in assessment situations. According to Brown and Yule, types of informational talk encompass description, instruction, storytelling and opinion expressing/justification (cited in Luoma, 2004, p.31). Bygate divides speaking tasks into factually oriented talk (description, narration, instruction, comparison) and evaluative talk (explanation, justification, prediction and decision). In addition to informational talk, there are also communicative speaking tasks. The Common European Framework (2001) divides functional competence into macrofunctions (description, narration, commentary, explanation and demonstration) and microfunctions (giving and asking for factual information, expressing and asking about attitudes, suasion (suggesting, requesting, warning), socialising, structuring discourse and communication repair). There is a variety of task types that could be used in assessing speaking. How, then, can task designers for a speaking exam decide which one(s) to use? Luoma (2004) suggests that task designers should make the organising principle for the assessment and the teaching curriculum coherent (p.35).

Other considerations when designing speaking assessment tasks

This week in our testing class we saw once more that the task designer’s burden is heavy. In addition to types of talk and communicative functions, designers need to plan how to operationalize these tasks. They need to decide, with a clear rationale, whether individual, pair or group tasks will be used. Assessment developers will also choose whether to use real-life or pedagogical tasks and tape-based or live testing, and decide between construct-based and task-based assessment. They also need to manipulate the difficulty of the speaking task with regard to the complexity of task materials, task familiarity, cognitive complexity and planning time (Luoma, 2004, p.46).

Examples of Speaking Scales

One of the highlights of this week’s classes was having the chance to discuss a variety of examples of both analytic and holistic speaking scales, as well as rating checklists. Luoma outlined:

The Finnish National Certificate Scale
The American Council on the Teaching of Foreign Languages (ACTFL) speaking scale
The Test of Spoken English Scale
The Common European Framework speaking scales
The Melbourne medical students’ diagnostic speaking scales (2004, p.60)

Metamorphosis: lessons to be taken

At the end of each week I reflect on our class discussions and ask myself: “How will your future conduct change?” Here are the points I want to keep in mind to change for the better:

It’s important to prepare various versions of speaking scoring rubrics and scales catering for the needs of raters, teachers and examinees.

Holistic and analytic scales have their pros and cons, and therefore their use should be considered carefully. Holistic ones can be accompanied by rating checklists (detailed lists of features describing successful performances on a task) for feedback purposes.

Developing good and clear level descriptors starts with examining performances of test-takers at different levels and describing the features that place them at a certain level.

Differences between levels should be clear on the speaking scales and should not be blurred by too much dependence on quantifiers such as many, few, adequately, etc.

I feel that being able to talk about the questions on my mind and the assessment-related issues we encounter in real life, and hearing about different perspectives and settings, enriches my personal understanding of assessment. I really learn a lot…

Thursday testing classes and reflections will continue. Please stay tuned 🙂

Reference: Luoma, S. (2004). Assessing Speaking. The Cambridge Language Assessment Series. Cambridge University Press.


EALTA Summer School 2012: Goteborg Diaries Day 5

August 13

On this last day of the EALTA Summer School we started to work on real-life data that some of the course participants provided.

Thanks to them, we had the opportunity to take a look at authentic data sets which portrayed students’ performance on a Maths test. Norman guided us by making us reflect on the data analysis and showed us some shortcuts to use when we want to work with our own data sets (instead of typing the whole data from scratch).

I met lovely people, had great fun and learned a lot. I would like to thank Gudrun, Marianne, the course tutors and all participants of the 2nd EALTA Summer School for making this course such a memorable event for me.

Some of the course content and list of references are shared on EALTA’s Website.

Happy testing everyone…


EALTA Summer School 2012: Goteborg Diaries Day 3

August 13

What’s the relationship between storytelling and testing?

Norman (Verhelst) says that when testing we have a narrative, but we need to be sceptical and critical towards the story and check whether it has any fallacies. In other words, we need to check whether it is trustworthy, regardless of how beautiful the story is. Therefore, in order to check the narrative, testers have to collect information.
So, on the 3rd day of the course we focused on ways of collecting information via one-dimensional and multi-dimensional models, likelihood and probability, Pascal’s triangle, joint maximum likelihood, conditional probability, and independence of probability.

But I would like to tell you another story here 🙂
The dinner we had at Pensionat Styrso Skaret…
It was such a lovely break after a hard day’s work.



EALTA Summer School 2012: Goteborg Diaries Day 2

August 12

I think the most difficult days were Day 2 and the following day, Day 3, because Jan-Eric Gustafsson carried on with classical measurement theory and we were introduced to ‘Item Response Theory’ by Norman Verhelst, and both made me regret the days back at school when I tried (and unfortunately managed) to escape from algebra lessons.

Jan-Eric focused on Cronbach’s alpha as a means of assessing the reliability of scores and outlined the assumptions of this measure: all components measure the same underlying dimension, have the same relation to the underlying dimension, and supposedly have the same residual error variance. If these assumptions are violated when constructing the items of a test, there could be a loss of reliability. It was then suggested that statistical checks, e.g. conducting a confirmatory factor analysis and inspecting the inter-item correlation matrix and covariance matrix, might act as a solution. We were also introduced to the congeneric latent variable model, path analysis and structural equation models (SEM), Analysis of Moment Structures (AMOS) and the chi-square goodness-of-fit test. Another point that Jan-Eric focused on was the possibility of measuring a potential discrepancy between your data (what you observed in terms of test scores) and your model, taking model complexity into consideration. Apparently the Root Mean Square Error of Approximation (RMSEA) indicates whether your model fits well (a value below 0.05 is taken as a good fit).
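For readers who, like me, find a formula easier to digest once they see it run, here is a minimal sketch of Cronbach's alpha using the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The score matrix is made-up data for illustration and is not from the summer school materials.

```python
from statistics import variance


def cronbach_alpha(scores):
    # scores: one row per test taker, one column per item.
    k = len(scores[0])
    # Sample variance of each item across test takers.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the total scores.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up data: 5 test takers, 4 items, each scored 1-5.
scores = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
]

alpha = cronbach_alpha(scores)
print("Cronbach's alpha:", round(alpha, 3))
```

In this toy data set the items rise and fall together across test takers, so alpha comes out high. Items that do not measure the same underlying dimension would pull it down.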

As for validity, Jan-Eric referred to Messick (1989) while defining and exemplifying the 3 classical forms of validity (content, criterion-related and construct) and conceptualising the facets of validity as a “progressive matrix” that takes the evidential and consequential bases as well as test interpretation and test use into consideration. Then, as a final point that summed up the morning session, we discussed sources that can give information about construct validity and potential threats to it. It was a very informative and intense session, and I am glad that we had the chance to be introduced to these analysis approaches and their underlying rationale. I felt that I would love to have more hands-on tasks in the coming summer courses :) so that in the future we will be able to apply and transfer the course content fully in our local contexts.
I may have shied away from Math classes all through my education, but there I was in our EALTA summer school class, very happily and willingly pursuing my professional development. Therefore I will give myself a bright star filled with the buzzwords of the session.

In the afternoon, Norman gave us a battery of programs, including OPLM, that we used for Rasch analysis and item response analysis. OPLM is a free, non-commercial program that can be downloaded from the internet.

Let me show you what it looks like:

Then, when you run the program, it gives you the probability that a student with a certain level of skill will get an item with a certain level of difficulty right, together with an analysis of the items.
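The idea behind such probabilities can be sketched with the basic Rasch model, in which the chance of a correct answer depends only on the difference between the person's ability and the item's difficulty, both on the same logit scale. This is the textbook formula, not a reproduction of what OPLM computes, and the ability and difficulty values are made up.

```python
import math


def rasch_probability(ability, difficulty):
    # Basic Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty))).
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Made-up values on the logit scale.
item_difficulty = 0.5
for ability in (-1.0, 0.5, 2.0):
    p = rasch_probability(ability, item_difficulty)
    print(f"ability {ability:+.1f} -> P(correct) = {p:.2f}")
```

When ability equals difficulty the probability is exactly one half, and it rises towards 1 as ability exceeds difficulty.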

At the end of the second day I was quite confused, but I also felt comfortable because I knew that there would be more support, both internal (Norman and other participants) and external (e.g. the free manual of the free program OPLM). This was just an introduction…


EALTA Summer School 2012: Goteborg Diaries Day 1

August 7


EALTA’s Testing and Assessment summer school kicked off yesterday with 24 participants of various nationalities. Most of the participants got wet in the pouring rain, but none of us minded because the coordinators of the event, Gudrun (Ericson) and Marianne (Demarret), and the course tutors, Professors Norman Verhelst, John de Jong and Jan-Eric Gustafsson, gave us a warm welcome.

After a short introduction and orientation to the course we were introduced to Classical Measurement Theory and we sought answers to essential questions for fair practice in testing including:

“Why should one measure?”
“What are the differences between modern theories of measurement (item response theory, IRT) and classical theory?”
“How to interpret correlation between items of a test?”
“How to maintain reliability of a test?”
“How to measure reliability of a test?”
“What are the reasons for reliability loss?”
“What are the factors which may lead to sources of variance in test scores?”
“What’s the relationship between text length and reliability?”

Professor Gustafsson illustrated the answers by describing the constructs, instruments, research design and efforts made to maintain the reliability and validity of a large-scale research study: the IEA Reading Literacy Study, which was conducted in 1991 with 4500 Swedish students. Thanks to real-life examples, statistical tables and figures, it was easier to grasp the answers to the questions listed above.

In addition to the rich content of the course, the backgrounds and profiles of the participants also contributed to the summer school. Some of the colleagues work for their countries’ ministries of education, some are involved in EU projects that aim to portray language competencies across Europe, some are conducting research into testing and assessment, and all of them are eager to talk about their experiences.

In short, I feel lucky and amazed, maybe because of the collegial support and professional development that the EALTA network offers, or maybe because of the gorgeous “Welcome Reception” that we were treated to at the end of a trying but fruitful day.
