
ITMELT '99 Conference



 

Evaluation of an on-line rater training and monitoring process

Jan Hamilton & Sue Reddel

English Language Centre
Hong Kong Polytechnic University

Hong Kong

 

Introduction

Recent developments in multimedia technology have made it possible to consider new forms of communication between students in the language learning context and also communication between teachers in academic discussion and professional development processes. The application of technology in these areas is in the early stages with research findings only beginning to follow as new applications emerge. This paper presents findings from a project set up to evaluate a pilot assessor (rater) training website developed at the Hong Kong Polytechnic University's English Language Centre (ELC).

The paper will firstly describe and discuss recent research into computer-mediated-communication by students and teachers, and the design factors that may influence users' perception of multimedia products. It will then describe the context in which the Centre's assessment training package was introduced and outline the components of the website. The project's research questions, methodology, findings and implications for further development of the package will then be described.

 

Background

Multimedia - the great leap forward?

Debski (1997), in his discussion of the current state of technology in language education, claims that attempts to make CALL communicative have been somewhat disappointing to date. He describes the "conventional" CALL applications such as games, simulations, text manipulation programs and hypermedia resources on CD-ROM as being "linguistically impoverished" learning environments, since the language presented through the technology is essentially fixed. He is, however, very enthusiastic about recent developments in network-based multimedia technology, namely local area networks, e-mail and MOO servers. In these applications, the technology plays a substantially different role, and it is this difference in process and product that, he claims, will finally put the communicative into CALL.

The capacity of this type of media to operate in both asynchronous time (that is, delayed-response communication such as e-mail) and synchronous time (that is, 'real-time' simultaneous discussions, usually on a networked system) offers learners and teachers the opportunity to engage in interactions that are in essence conversations in written rather than spoken form. The computer becomes part of the process of communication (like a telephone or fax machine) rather than a product, a fixed data storage device.

What makes this kind of technology different is that it takes the computer, as an interactant, out of the interaction. The interaction is once again human-to-human rather than human-to-computer, and because of this, an infinite variety of exchanges can now take place.

 

Computer-mediated communication by students

Research into computer-mediated communication (CMC) through the type of interactive channels mentioned above has found that it bears some resemblance to face-to-face (FTF) communication. However, important differences have been noted. Warschauer (1996) found that in classroom discussion conducted through CMC, more students "took the floor" than in comparable FTF contexts: students volunteered and responded to others' comments twice as often in writing as they did in spoken discussion. Warschauer (1995) suggested that CMC gives shyer students time to wait, to reflect and to articulate their thoughts more fully. Similarly, Ortega (1997), in her review of research into computer-assisted classroom discussion (CACD - real-time, synchronous computer-based discussion sessions), claimed that students perceived CACD as less threatening and inhibiting than oral interaction.

However, negative perceptions have also been reported. A lack of familiarity with computers made some students less disposed towards computer-mediated communication than towards face-to-face communication (Schiefelbein et al., 1995). Other students felt that the absence of visual cues in computer-mediated communication was a disadvantage (Harasim, 1987), and some perceived that they were somehow "missing out" on classroom interaction (Irani, 1998). While some students were pleased with the greater access CMC offers (a 24-hour-a-day classroom), others claimed that this produced more stress and that their workload had actually increased as a result (Irani, 1998).

 

Computer-mediated communication by teachers

While there is a growing body of research into students' responses to computer-mediated communication, there is less known about teachers' reactions to their own learning through computers. Nor is it clear to what degree findings on students' perceptions could be applied to teachers.

Electronic mail has become a medium of discussion and debate in language teaching and testing circles through lists such as TESL-L, and electronic journals such as TESL-EJ and Language Learning and Technology are proliferating. In Hong Kong, the TeleNex interactive teacher support system at Hong Kong University is one example of a purpose-built mechanism for controlled exchange between teachers, in contrast to the essentially formless nature of electronic lists. It has been reported as valuable in that it encourages teachers to be reflective and provides a means by which they can contribute to the local education system's pedagogical debates (Tsui et al., 1996).

 

Multimedia design factors

Numerous factors may influence a user's perception of the value of an interactive multimedia process or package. While there may be agreement on the perceived need for the package and approval of the conceptualisation of the product itself, more fundamental aspects of the design process may produce a negative evaluation. Rajani & Rosenberg (1999) warn that designing a website, for example, is not a simple process. They claim that, technical factors (such as slow downloading times, incompatible browsers and machines) and computer literacy aside, the most unpredictable factor, and the one most difficult to deal with, is what they describe as the 'human' factor. They argue that there is considerable variability in how people respond to and feel about design features such as colour, light, contrast, illusory effects, the positioning of information, and font and image size, and that these will influence the site's use.

 

The Hong Kong Polytechnic University English Language Centre Common Assessment System

The ELC Common Assessment System was developed to assess the two credit-based English language subjects which all students in degree and higher diploma programs take at the Polytechnic University: English for Academic Purposes and English in the Workplace. These subjects have a core curriculum with additional field-specific materials to suit each department within the University's six faculties.

All students are assessed according to the same criterion-referenced assessment scale. The assessment tasks are superficially field-specific but aim to be of equal difficulty, being derived from a standard set of task specifications. Before each writing task is assessed by teachers, there are two types of assessment training and on-going monitoring: face-to-face and on-line. Essentially these perform the same function: to help teachers become familiar with the ELC assessment system, that is, its assessment scale and the expected standards of student performance. Teachers have the opportunity to discuss samples of student writing and share their reasons for the scores given. This process of discussion and debate is regarded by staff as important in helping teachers to articulate aspects not only of the assessment tasks and performances but also of the curriculum itself. The process is facilitated by ELC assessment committee members, who guide the discussion towards consensus and common understanding, so that when teachers eventually mark students' writing, they will make judgements that are consistent with those of all other teachers in the program.

Research into assessor behaviour has shown that assessors may vary in their assessment of students' work (Diederich et al., 1961; Hong Kong Polytechnic University ELC, 1999; Milanovich et al., 1996). Assessor training of the type described above has been found to be effective in facilitating consistency and reliability of assessment (Shohamy et al., 1992; Weigle, 1994; Hong Kong Polytechnic University English Department, 1996).
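To make the notion of rater consistency concrete, the minimal sketch below (in Python, purely for illustration) computes two simple indicators for a pair of raters who have marked the same set of scripts: the proportion of exact agreement and the correlation between their scores. The scores are invented, and this is not the statistical procedure used in the studies cited or in the ELC system.

```python
# Illustrative only: two simple indicators of rater consistency for a pair of
# raters who have scored the same scripts on a banded scale.
# The scores below are invented for the example.

rater_a = [3, 4, 2, 5, 3, 4, 2, 3]
rater_b = [3, 3, 2, 5, 4, 4, 2, 3]

# Proportion of scripts on which the two raters award exactly the same band.
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Pearson correlation: do the raters at least rank the scripts similarly?
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(f"Exact agreement: {exact_agreement:.2f}")
print(f"Score correlation: {pearson(rater_a, rater_b):.2f}")
```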

While this assessor training process is an ambitious undertaking, it is necessary to ensure that each of the four and a half thousand students taking this one subject receives consistent assessment and therefore a fair grade. Without reliability of assessment, the integrity of the credential given to students can be questioned.

 

ELCassess – an on-line assessor training and monitoring system

ELCassess is the on-line assessor training website developed at the English Language Centre of the Hong Kong Polytechnic University to support the Centre's Common Assessment System.

The on-line system comprises four sections that take the assessor (rater) through the assessment training and monitoring system.

 

Section 1: What is ELCassess?

This option provides general information to the assessors (raters).

 

Section 2: How to use ELCassess

This is a step-by-step set of instructions which can be printed in hard-copy by raters. It takes the rater through the on-line familiarisation, assessment and discussion process contained in this package.

 

 

Section 3: Go to ELCassess

This section has three parts:

 

View assessment tasks

This is the component of the system where raters view sample tasks and exemplar student writing performances at each proficiency level of the assessment system. These student performances were selected by a group of experienced raters as representing the agreed proficiency standards within the ELC assessment community.

Attached to each student performance is a button that takes the rater to the assessment scale and enables her/him to view the student writing in relation to this scale. There is also a prompt which directs the rater to a set of annotations written by experienced raters explaining why this performance demonstrated the particular proficiency level in terms of each of the seven assessment criteria included in the scale.
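As an indication of how the exemplar material described above might be organised, the following hypothetical sketch models a task, an exemplar performance at an agreed proficiency level, and its annotations keyed to the seven assessment criteria. The names and structure are ours, for illustration only; they do not describe the actual implementation of ELCassess.

```python
# Hypothetical data model for the "View assessment tasks" component.
# Names and structure are illustrative, not the actual ELCassess implementation.
from dataclasses import dataclass, field
from typing import Dict, List

CRITERIA = [
    "Content",
    "Organisation",
    "Register",
    "Grammatical structure and vocabulary",
    "Range of grammatical structures and vocabulary",
    "Adherence to academic writing conventions",
    "Length of writing",
]

@dataclass
class ExemplarPerformance:
    student_text: str            # the sample student writing
    proficiency_level: int       # agreed benchmark level on the assessment scale
    annotations: Dict[str, str]  # one explanation per criterion, keyed by CRITERIA

@dataclass
class AssessmentTask:
    title: str
    prompt: str
    exemplars: List[ExemplarPerformance] = field(default_factory=list)

# A rater viewing an exemplar would see the student text, the benchmark level on
# the scale, and one annotation per criterion explaining why that level was awarded.
```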

 

Take part in assessing

There are a number of tasks and student performances available for assessing. Raters read one performance at a time, make a provisional assessment and submit this score to a result sheet. They are then directed to the discussion forum.

 

Take part in discussion

The discussion forum is divided into separate discussion pages for each task/performance. Each page contains several threads. One is for discussing the overall student performance (General Comments). The rest are for discussing each performance in terms of each of the seven assessment criteria: Content; Organisation; Register; Grammatical structure and vocabulary; Range of grammatical structures and vocabulary; Adherence to academic writing conventions; and Length of writing. Within each discussion thread, raters are able to read other raters' comments and add their own. Raters are encouraged to return to the discussion forum over time and add further comments.
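The organisation of the forum just described (one discussion page per task/performance, one thread for general comments plus one per criterion) might be represented roughly as in the sketch below. Again, this is a hypothetical illustration rather than a description of the actual forum software.

```python
# Hypothetical sketch of the discussion forum layout: one page per
# task/performance, with one thread for general comments and one per criterion.

THREAD_NAMES = [
    "General Comments",
    "Content",
    "Organisation",
    "Register",
    "Grammatical structure and vocabulary",
    "Range of grammatical structures and vocabulary",
    "Adherence to academic writing conventions",
    "Length of writing",
]

class DiscussionPage:
    def __init__(self, task_id: str):
        self.task_id = task_id
        # each thread is an ordered list of (rater, comment) pairs
        self.threads = {name: [] for name in THREAD_NAMES}

    def add_comment(self, thread: str, rater: str, comment: str) -> None:
        self.threads[thread].append((rater, comment))

    def read_thread(self, thread: str):
        # raters can reread earlier comments before adding their own
        return list(self.threads[thread])
```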

This section of the website aims to replicate a face-to-face rater training meeting. The interaction, however, is in written rather than spoken form. The exchanges are asynchronous, so the time between comments and responses is clearly longer than in a single face-to-face meeting. Since these exchanges take place at raters' computers at any time, they do not require all raters to be in the same place at the same time.

The discussion forum takes place over a specified period, usually about three weeks. During this time, a discussion facilitator inserts guiding and summarising comments into the discussion threads in a way that would be done in a face-to-face rater training meeting. The aim of this, as in a face-to-face meeting, is to guide raters towards a consensus on the proficiency level of the performance.

Raters then make a final assessment decision, which should be a considered judgement emerging from reading the text, referring to the assessment scale and the exemplar standards, and taking part in discussion with other raters and the on-line facilitator.

Either prior to or following the final assessment, raters may view the benchmark proficiency level scores and annotations given by experienced raters for each of the practice tasks/performances.
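Taken together, the practice cycle a rater moves through (provisional score, discussion, final score, comparison with the benchmark) can be summarised as in the hypothetical outline below; the names are illustrative placeholders and do not correspond to components of the real system.

```python
# Hypothetical outline of the practice cycle a rater follows in ELCassess.
# All names are illustrative placeholders; the stubs stand in for the site's pages.

result_sheet = {}   # (rater, performance_id) -> latest score
discussion = {}     # performance_id -> list of (rater, comment)

def submit_score(rater, performance_id, score):
    result_sheet[(rater, performance_id)] = score

def add_comment(rater, performance_id, comment):
    discussion.setdefault(performance_id, []).append((rater, comment))

def practice_cycle(rater, performance_id, provisional_score, final_score):
    # 1. read the performance and submit a provisional assessment to the result sheet
    submit_score(rater, performance_id, provisional_score)
    # 2. take part in the asynchronous discussion (roughly three weeks),
    #    rereading other raters' and the facilitator's comments as needed
    add_comment(rater, performance_id, "Reasons for my provisional score ...")
    # 3. make a considered final assessment after the discussion
    submit_score(rater, performance_id, final_score)
    # 4. before or after the final decision, compare with the benchmark level and
    #    annotations given by experienced raters (not modelled here)

practice_cycle("rater_01", "task3_scriptB", provisional_score=3, final_score=4)
```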

 

Evaluation

This is a questionnaire for raters to evaluate this on-line system.

 

The Project

With funding from a University Development Grant, a project team was formed to evaluate the ELC's pilot on-line rater training and monitoring system.

 

Methodology of the Research

Two research questions were posed:

  1. Do raters perceive the on-line rater training system to be accessible in terms of the system’s components, set of instructions and their interface?
  2. Do raters perceive an on-line system to be useful for rater training and monitoring?

The project team originally proposed as research instruments one questionnaire (both paper-based and on-line), a guided interview schedule and a ‘think-aloud’ protocol procedure.

There was little difficulty with the second and third methods: seven interviews and five protocols were conducted, each taped by a member of the project team or a research assistant and then transcribed. However, substantial changes were made to the planned questionnaire procedure. Firstly, the team decided that it was necessary to investigate staff attitudes towards both assessment and computer usage prior to the trialling of the on-line system, in order to identify the possible influence of these attitudes on attitudes towards the on-line system. Accordingly, an additional pre-system questionnaire was devised, comprising eight Likert scale items and four open-ended response items. Then, because of the poor level of response to the post-system questionnaire attached to the on-line system (comprising nineteen Likert scale items and eighteen open-ended response items), it was decided to administer a second post-system-use questionnaire, which was sent to respondents by electronic mail and in hard copy.

Data from the four research instruments were synthesised to arrive at the findings below. It must be noted, however, that although all interviewees had tried the system, only nine of the 22 respondents to the final questionnaire had used ELCassess, so the opinions of the remaining respondents were relevant only to research question 2.

 

Findings

Research question 1

The first question focused on raters’ perceptions of the accessibility of the website.

Technical difficulties in gaining access to the site were experienced by many respondents, whatever their level of computer literacy. These were teething problems with individual computers, and once access had been gained, computer-literate respondents generally found the information, presentation and instructions clear and easy to follow. However, it was apparent from most of the research instruments, especially the protocols, that many did have some difficulty operating the program and that all would have preferred some changes to be made.

Technical difficulties experienced while using the site included difficulty in accessing pages from the menu bar, insufficient signposting, and the lack of a cursor for typing comments in the discussion section. Some less experienced users had trouble remembering the password and had problems understanding the language and following the instructions, while two people thought more instructions should be given throughout the site. However, other responses, both positive and negative, seemed to be based on the ‘human’ factor (Rajani & Rosenberg, 1999) as well as on the level of computer literacy. The various print forms used, for example a larger font to highlight important information, met with overall approval, but most respondents disliked the use of colour in general and singled out the colour blue as being particularly distracting and causing strain on the eyes.

The ‘human factor’ also seems to be a component of attitudes to the size of the site. Three raters identified the overall site as being too large. However, this probably reflected judgements about the familiarisation section rather than the site as a whole, with negative comments ranging from "too many pages to read through to get to the page I want" to suggestions that this section be re-designed and presented in diagrammatic or flow-chart form.

Again, technical difficulties, level of computer literacy and the ‘human’ factor all influenced reactions to the sample texts and annotations. The main technical difficulty was the absence of many of the advertised scripts and corresponding annotations. (This resulted from the perceived need at the time to use only authentic texts from the current assessment items, and from the difficulty of benchmarking and annotating sufficient texts over the full range of grades in time to be of use to raters.) While all respondents regarded the annotations as very useful, some preferred to study them, and the texts, in hard copy, and found clicking through to a text that was not present very confusing. The need to scroll up and down while reading, and the impossibility of viewing an entire page on the screen, were considered to detract from general accessibility. It was also suggested that the text and the assessment criteria could be displayed together on a split screen, to eliminate moving backwards and forwards between two different pages.

 

Research question 2

The second research question focused on perceptions of, and attitudes towards, on-line systems in general. Overall, the attitudes of the respondents in the pre-system questionnaire seemed highly favourable towards the introduction of an on-line system of rater training. They had all received previous rater training, and they overwhelmingly supported the need for assessment of student language proficiency and the ELC’s present rater training system. However, the majority found the meetings, a key component of the existing system, exhausting and extremely time-consuming. Responses in the interviews and final questionnaire revealed that many respondents considered that an on-line system would be extremely useful and in many cases would have advantages over a more conventional rater training system.

A practical consideration was that an on-line system would limit the need for hard-copy storage space, as well as reduce paper consumption and the need to carry heavy bundles between office and home. In terms of human interaction, it was seen to have considerable advantages. It was thought that raters would feel they had more independence and control if the duration, times and places of their training were more flexible and suited to individual needs. Many also found face-to-face meetings either time-wasting or intimidating, or both, and indicated they would prefer to communicate their views in writing if some, or all, parity meetings were replaced by on-line discussion. It was seen as easier to argue a point on a person-to-person basis, without taking up the time of a large group. Some also thought they would be able to arrive at more independent judgements than when subjected to often conflicting opinions during meetings. Another point made was that it was easy to miss or forget key items during meetings, but that if the replacement discussion took place on-line, it could be reread as necessary.

A final major advantage seen by the majority of respondents was the potential for using the on-line system to improve both individual assessment practices and the current assessment system, as well as to allow its use to spread beyond its current confines. The provision of annotated texts, which were introduced into the Centre's assessment system as part of ELCassess, was seen by a majority of respondents as of immediate benefit for rater training. New raters could use the site to develop an understanding of the current system, and all could improve their understanding of the relationship between criteria and band descriptors by studying the annotated texts. Raters would be able to compare different assessment items and their annotations, and to return to these as necessary if there were points of doubt or misunderstanding. It was also thought that raters could keep their assessed items on-line to form a history of their development as assessors, thereby allowing for further improvement. Potential use of the site to improve the whole assessment system was also mentioned by a number of respondents, with suggestions that included using the discussion facility for general suggestions and using the practice grading facility for benchmarking new scripts.

Not surprisingly, the findings also revealed negative opinions about on-line assessment, with both potential and actual disadvantages being perceived. The theoretical problems concerned general technical difficulties, computer literacy and supervision of the individual. An obvious potential technical difficulty was computer breakdown and loss of data. Another potential technical hitch related to the feasibility of on-line discussion by a large group; it was suggested that this might be addressed by using ICQ, while face-to-face group discussion was also seen to have advantages over on-line discussion because of its dynamism, body language and immediacy of feedback. Yet another perceived difficulty was the use of the system as a universal and compulsory form of rater training, given the disparate levels of computer literacy among staff; at the same time, it was considered that the system would only function effectively if its use were enforced.

Some of the immediate and tangible difficulties experienced by users of the trial system have already been discussed. However, a major problem not directly related to the first research question was the experience of prohibitive time constraints. The underlying perception appeared to be that there were so many demands on raters’ time that non-compulsory training sessions were not a consideration during the marking period, and that afterwards there was not the same pressure to improve their proficiency as raters. Even those who had used the system usually preferred to bypass the practice assessment tasks in order to study the annotated texts, seeing the site as an instructive rather than a developmental tool.

 

Implications for further development

In terms of user-friendliness, colour appears to be important: it was often the first aspect mentioned by users, both while using the package and when evaluating it. A plain, non-intrusive background colour is less likely to cause 'strain' for the reader, and this should be taken into account in future development.

Access to the system was a predictable but significant problem with this package, as with any application of computer technology. Technical problems, such as incompatible browsers and the need for a dedicated browser to use the discussion forum, human memory factors (forgetting the website address, the user code or the password), and problems navigating the system all cause user frustration and ultimately limit the usefulness of the package. These clearly need to be addressed, where possible, before implementation of the training 'session'. An on-line mode of information transmission or communication has to be seen to be 'easier' than the conventional mode, such as a hard-copy booklet or face-to-face discussion. If not, the 'Why bother?' factor, a key criterion for any alternative system, will come into play. A more detailed user guide may alleviate some access and navigation problems.

There is a second aspect to the issue of user uptake of this package. While there appeared to be overall support for the notion of ELCassess as a mode of rater training and monitoring, there did not appear to be a corresponding degree of user uptake. Lack of time was identified as the over-riding constraint on people using the package, rather than issues of computer literacy or lack of support for the Centre's Common Assessment System. The familiarisation component, where teachers could view different proficiency standards and expert rater annotations, was used and found useful, but the discussion forum (which involved a greater time commitment) was not used, although interviewees supported the notion in theory. Logistics (in this case the timing of the 'session') were clearly more important than in-principle support. The training period should perhaps start prior to the teaching cycle, or extend over a longer period during the semester, to allow teachers time to participate.

Related to this is the issue of whether use of an on-line system of rater training should be voluntary or mandatory. Should teachers be required to participate, particularly if the on-line mode is replacing face-to-face meetings? If a particular assessment community were committed to the notion of consistent assessment by all raters, participation would have to be mandatory. However, if the use of the package is perceived to be an added burden to raters, the commitment to the community's assessment system may well be adversely affected.

 

Conclusion

These issues of timing and teacher acceptance could, along with technical considerations, be regarded as part of the 'human factor' which influences a user's reaction to, and use of, an on-line rater training and monitoring package. These factors are integral to the context of the situation in which the technology is used. It appears that a piece of technology does not 'stand alone'; it should be viewed from within this context.

 

References

Debski, R. (1997). Supporting creativity & collaboration in the learning classroom. In R. Debski et al. (Eds.) Language and Learning Through Social Computing. Occasional Papers 16. Melbourne: ALAA & Horwood Language Centre, University of Melbourne.

Diederich, P.B., French, J.E. & Carlton, S.T. (1961). Factors in judgements of writing ability. Research Bulletin 61-15. Princeton, NJ: Educational Testing Service (ERIC Document Reproduction Service ED 002 172).

Harasim, L. (1987). Teaching and learning on-line: issues in computer-mediated graduate courses. Canadian Journal of Educational Communication 16, 117-35.

Hong Kong Polytechnic University English Department. (1996). Graduating Students Language Proficiency Assessment (GSLPA) Report 1996. Hong Kong: Hong Kong Polytechnic University.

Hong Kong Polytechnic University English Language Centre. (1999). Quality Assurance Committee Feedback Report 1999. Hong Kong: Hong Kong Polytechnic University.

Irani, T. (1998). Communication potential, information richness and attitude: a study of computer-mediated communication in the ALN classroom. ALN Magazine 1(2), March. Available:

http://www.aln.org/alnweb/magazine/vol2_issue1/irani.htm   [1999, October 7].

Milanovich, M., Saville, N. & Shen, S. (1996). A study of decision-making behaviour of composition markers. In M. Milanovich & N. Saville (Eds.) Studies in Language Testing 3. Cambridge: Cambridge University Press. pp. 92-114.

Ortega, L. (1997). Processes and outcomes in networked classroom interaction: defining the research agenda for L2 computer-assisted classroom discussion. Language Learning and Technology 1(1), 82-93. Available:

http://polyglot.cal.msu.edu//llt/vol1num1/ortega/default.html   [1999, October 7].

Rajani, R. & Rosenberg, D. (1999). Usable…or not?… factors affecting usability of websites. CMC Magazine, January 1999. Available:

http://www.december.com/cmc/mag/1999/jan/rakros.html   [1999, October 7].

Schiefelbein, S., Imamura, H. & Ozeki, S. (1995). Using an on-campus e-mail list to motivate students to communicate in English. In M. Warschauer (Ed.) Virtual Connections. Hawaii: University of Hawaii. pp. 24-6.

Shohamy, E., Gordon, C.M. & Kramer, R. (1992). The effects of raters’ background and training on the reliability of direct writing tests. Modern Language Journal 76(1), 27-33.

Tsui, A., Wu, K. & Sengupta, S. (1996). Enhancing teacher development through TeleNex - a computer network for English language teachers. System 24(4), 461-76.

Warschauer, M. (1995). Diad debate. In M. Warschauer (Ed.) Virtual Connections. Hawaii: University of Hawaii. pp. 77-9.

Warschauer, M. (1996). Comparing face-to-face and electronic discussion in the second language classroom. CALICO Journal 13(2), 7-26.

Weigle, S. (1994). Effects of training on raters of ESL compositions. Language Testing 11(2), 197-223.