On a beautiful and sunny Goteborg morning, the 2nd EALTA Summer Course on testing started off with a very interesting presentation by John de Jong about setting standards. There are many commercial exams that claim to measure the same construct in language development, and when we compare these exams with one another it’s often assumed that their scores are exactly equivalent.
To illustrate, my school accepts a couple of external exams to exempt students who are enrolled in the language preparatory school. Do these exams apply the same standards in terms of task difficulty and what is expected from the students? I found this point very interesting, and you can follow John de Jong’s point of view and data collection in detail in the presentation on EALTA’s website. We also talked about the standard setting procedures used in the SurveyLang project, and luckily we had Neus, Norman, Gudrun and other colleagues who were involved in the project. As a result, we had the chance to hear about their real-life experience of setting standards in a large-scale EU project like SurveyLang.
I would like to summarize the points that I took away from these sessions:
- Governments, universities and test developers strive to set standards and screen exams accordingly, and the success of a standard setting procedure depends on planning, training and rigor. The more the judges are guided and trained, the better and more smoothly the process runs.
- A variety of techniques, including the Angoff method, the Basket procedure and Van den Schoot, aid standard setting.
- Language development is conceptualized in 2 dimensions:
1) Quantity (How much can a person do in a test? How many different tasks?)
2) Quality (How well can a person do these tasks? Efficiency?)
Furthermore, in the course of language development, the combination of these two dimensions (quality and quantity) is not a ladder. Instead, it’s a slippery slope, which results in profiled development of students: for example, a student may be better at reading but relatively weaker in speaking or listening.
- Self-assessment can be unreliable in determining quality and quantity due to the Dunning-Kruger Effect.
- Instead, plotting Rasch difficulty (theta values) against judgments (what people think about item difficulty) could give better estimates. A variety of approaches were mentioned, including odd and even ability estimates, split-half estimates, and considering multi- versus unidimensionality per skill.
- What does “a B1 Exam” mean?
As far as I understood, for an exam to be at B1 level we need to sample from all the possible tasks that could be done at B1 level (“sampling from a wide universe of tasks”) and consider a student to be at B1 level if that student can master 50% of the sampled tasks.
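To make the 50% idea concrete, here is a minimal sketch in Python. The task results are invented, and the function names `is_at_level` and `rasch_probability` are hypothetical helpers, not anything shown in the session. The link to the Rasch model is an assumption added for illustration: under that model, a person whose ability equals an item's difficulty has exactly a 50% chance of success, which is one way to read a "50% mastery" criterion.

```python
import math


def rasch_probability(theta, b):
    """Rasch model: chance that a person with ability theta succeeds
    on an item with difficulty b (both expressed in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def is_at_level(task_results, threshold=0.5):
    """Return True if the share of sampled tasks mastered reaches the threshold.

    task_results is a list of booleans, one per sampled task.
    The default threshold of 0.5 mirrors the 50% criterion described above.
    """
    if not task_results:
        raise ValueError("need at least one sampled task")
    return sum(task_results) / len(task_results) >= threshold


# Hypothetical results on ten tasks sampled from the B1 "universe of tasks".
results = [True, True, False, True, False, True, False, True, False, False]

print(is_at_level(results))          # 5 of 10 tasks mastered
print(rasch_probability(0.0, 0.0))   # ability equal to difficulty
```

The more tasks are sampled, the less the decision depends on which particular tasks happened to be drawn, which is probably why sampling widely matters so much here.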
- Other Sources that you will not regret checking out 🙂
In short, setting standards requires the meticulous work of considering what is expected from students at a certain level, item difficulty and ratings of ability. It was presented as a long, tiring and challenging but valuable process that contributes to fair practice in testing and assessment.
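For readers who like to see numbers, the Angoff method and the idea of checking judgments against Rasch difficulty can be sketched roughly as follows. This is only an illustration under stated assumptions: the judge ratings and the Rasch difficulties are invented, the helper names `angoff_cut_score` and `pearson` are hypothetical, and the basic Angoff variant shown (each judge estimates the probability that a borderline candidate answers each item correctly, and the cut score is the average of the judges' summed ratings) may differ from the exact procedure used in the projects discussed.

```python
import math
from statistics import mean

# Hypothetical Angoff ratings: one row per judge, one column per item.
# Each value is the judge's estimate of the probability that a
# borderline (minimally competent) candidate answers the item correctly.
ratings = [
    [0.6, 0.4, 0.8, 0.5],
    [0.7, 0.5, 0.7, 0.4],
    [0.5, 0.3, 0.9, 0.6],
]

# Hypothetical Rasch item difficulties in logits (higher = harder).
rasch_difficulty = [-0.2, 1.1, -1.3, 0.4]


def angoff_cut_score(judge_ratings):
    """Cut score = mean over judges of each judge's summed item ratings."""
    if not judge_ratings:
        raise ValueError("need at least one judge")
    return mean(sum(judge) for judge in judge_ratings)


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


cut = angoff_cut_score(ratings)

# Mean rating per item across judges (zip(*ratings) walks the columns).
item_means = [mean(column) for column in zip(*ratings)]

# Items judged easier (higher rating) should be empirically easier
# (lower Rasch difficulty), so a negative correlation is expected.
r = pearson(item_means, rasch_difficulty)

print(f"Angoff cut score: {cut:.2f} out of {len(ratings[0])} items")
print(f"Judged vs. empirical difficulty: r = {r:.2f}")
```

A strongly negative correlation would suggest the judges' sense of item difficulty lines up with the data; a weak one would be a signal that more training or discussion is needed before trusting the cut score.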