From the Chronicle of Higher Education, November 20, 1998, p. A56
POINT OF VIEW
When Professors Get A's and Machines Get F's
By DENNIS BARON
It's time to put a stop to talk about new computer programs that can grade
students' essays with less effort and more accuracy than any teacher could --
and do so more cheaply and quickly, to boot. Such programs are neither
necessary nor desirable. I'm not worried that computers will replace teachers.
Technology, used well, can enhance the human aspects of learning. But I am
concerned that essay-grading programs may give both students and teachers a
false picture of how readers evaluate what they read, and that in turn will
lead to false ideas about how we should write.
According to its promoters, the Intelligent Essay Assessor, this year's entry
in the grading-software sweepstakes, can scan essays in seconds and reliably
identify what students have learned. Like older grading programs, it can measure
the length of words and sentences and analyze punctuation. But it also does
something its developers call "latent semantic analysis." By
searching for keywords and their synonyms, it can tell whether a student is on
the right topic and how much information the essay contains compared to stored
samples of essays on the topic. It can tell students what they've left out, and
it can signal instructors that a student may have plagiarized. The program
looks not just for single words, but for word patterns and phrases, so students
can't fool it by parroting lists of concepts. And the program can learn what's
good by being fed model essays, written by experts and professionals, on the
topic in question.
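The "latent semantic analysis" the developers invoke is a published technique: model essays are turned into a term-document matrix, a truncated singular-value decomposition projects them into a low-dimensional "semantic" space, and a student essay is scored by its similarity to the models in that space. The sketch below assumes the textbook form of the method; the toy essays, the variable names, and the scoring step are invented for illustration and are not the Assessor's actual code.

```python
# A minimal, hypothetical sketch of textbook latent semantic analysis (LSA),
# not the Intelligent Essay Assessor's implementation.
import numpy as np

model_essays = [
    "the heart pumps blood through arteries and veins",
    "arteries carry blood away from the heart to the body",
    "veins return blood to the heart from the body",
]
student_essay = "blood moves from the heart through arteries"

# Build a vocabulary and a term-document count matrix from the model essays.
vocab = sorted({w for doc in model_essays for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

def vectorize(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:          # words outside the vocabulary are ignored
            v[index[w]] += 1
    return v

A = np.column_stack([vectorize(d) for d in model_essays])

# Truncated SVD maps words and documents into a low-dimensional space in
# which related terms (and essays that use them) cluster together.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
project = lambda v: (U[:, :k].T @ v) / s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

student = project(vectorize(student_essay))
scores = [cosine(student, project(vectorize(d))) for d in model_essays]
print(max(scores))  # similarity to the closest model essay
```

Because the comparison happens in the reduced space rather than on raw word counts, an essay can score well by using synonyms and related phrasing of the model essays, which is why, as noted above, students cannot fool such a program simply by parroting a list of keywords.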
Pitted against human graders, we are told, this software program gives grades
that mirror those of real teachers. The program won't do away with teachers,
but it will allow them to escape the drudgery of reading essays and to spend
more time with students. The program also promises to be more consistent than
human readers, who are notoriously inconsistent.
Teachers often insist that an objective standard of reading does indeed exist,
that they can reward good writing in the same way that they can recognize a
literary masterpiece. But in spite of such assurances, students know how erratic
their teachers can be. From the students' point of view, every teacher seems to
have a different set of expectations when it comes to written work. Students
spend half their lives trying to figure out what each new teacher wants from a
writing assignment, and the rest of their time trying to figure out how to
adapt their writing to those ever-shifting expectations. The Intelligent Essay
Assessor would assure students that they are being held to the same standard --
time after time, class after class. In theory, the program could rank the 100
best student essays much as a Random House panel recently ranked the 100 best
English-language novels of the 20th century.
I'm not convinced the program can perform consistently, and I don't think
consistency, if it could be achieved, would improve the assessment of written
assignments. Teachers believe in standards for grading, but they don't all
share the same standard. Every teacher can tell a good essay from a bad one, a
C paper from an A or an F. They can do so blindfolded, upside down, underwater,
with both hands tied behind their backs. What they can't do is agree
unanimously on which paper deserves the A, the C, or the F. Give a group of
teachers an essay to grade, and, among them, they will manage to assign it
every grade on the scale, citing plenty of reasons to back up their decisions.
The consistent grades that the computer promises to award may be comforting to
students, to their parents, and to deans, but consistency in reading is not a
human trait. That is why teachers don't share a common standard for evaluating
writing. Such a standard is not something we should either expect or strive for
in education, since it's not something students will encounter in their
out-of-school writing.
Students may complain that their roommate's freshman-composition teacher is an
easier grader than the one they're stuck with, or that their mother thinks the
essay they got a C-minus on is really worth an A. But both the history of
reading and our everyday experience confirm that even the most rigorous and
attentive of readers will disagree over texts. If we've learned anything about
reading and writing in the past 20 years, it is that each person brings to the
activities such individual perspectives that it's amazing readers can agree on
any aspect of the meaning or value of texts at all. Heraclitus, commenting on
the vagaries of life in Greece in the 6th and 5th centuries B.C.E., said you
can't put your foot into the same river twice, since it is always changing. Some
wag amended this to claim you can't even put your foot into the same river
once.
No two readers approach a text in exactly the same way; even the same person
reading a text repeatedly will come away with different feelings each time.
Like it or not, students must learn that their human readers will be
inconsistent, a hard and frustrating lesson.
Writers learn to take their best shot, but they all know it is impossible to
predict a reader's response with complete accuracy. Professional readers are no
more consistent or predictable than any others. As a writer, I know I can send
a manuscript to one editor and get it back by return mail -- by return e-mail
these days -- with a polite, "Not for us, thanks," while another
editor will pounce on the same piece and rush it into print. As a manuscript
reader, I know I've accepted essays for publication that others have rejected,
while ones I've turned down have found their way into print, where they have
been well received.
Writers and editors learn that this is part of the natural give and take of the
word game. It is harder, and often more frustrating, for teachers to face up to
the fact that they too disagree over the relative value of different texts,
whether they are student-generated essays or lists of great books produced by
the cultural elite. That realization may not change how teachers grade, but
acknowledging it would help them teach their students to expect real, rather
than ideal, readings of their work.
Teachers want to grade less, but I doubt many will want computers to do their
reading for them. We don't always relish grading student essays, but neither do
we want to give up the insights that reading students' work gives us into how
they are thinking. I don't think student writers will be happy with
computer-graded essays either. Students may be content filling in
machine-gradeable sheets for multiple-choice tests, but they don't want their
written work read and graded by machines. After years of being reminded by
teachers to consider their audience, students don't want that audience to be a
silicon chip. They want to know that there's a real person out there responding
to them, even if that reader isn't predictable or knowable.
I asked students in my "Literacy and Technology" class, who are more
committed than not to the new technologies, how they felt about the Intelligent
Essay Assessor, and they were not encouraging. One student pointed out that
computers won't be able to tell if their essays have improved over time, while
another wondered if the program could detect and reward students who go way
beyond what the question calls for, or who answer a better question that they
have posed themselves.
The World-Wide Web site for the Intelligent Essay Assessor allows writers to
submit their own answers to several set essay questions. To test the efficacy
of the program, I submitted the essay you are reading (concluding with the
paragraph above) for evaluation. None of the essay questions on the site
matched my topic very well, but that did not prove a great obstacle for the
virtual grader. I tried submitting my work as a response to a question in the
Graduate Management Admission Test that asked for an evaluation of a particular
kind of business practice. The program crunched my essay in a couple of
seconds, rating it a 3.36 on a six-point scale, and assured me its score was
valid, adding, "None of the confidence measures detected anything inappropriate
with this essay."
Knowing my essay was completely inappropriate, since it didn't address the required
topic at all, I resubmitted it, this time as a response to another set GMAT
topic -- an analysis of a marketing campaign. The program quickly slapped on a
grade of 4.82. This time, the program acknowledged that it had low confidence
in the assigned grade, presumably because my essay was once again completely
off the topic. Nonetheless, it did give me a fairly high mark (I actually think
it deserves more), and added this message: "In a real grading situation,
this essay would likely be forwarded to a human grader for verification."
Note that referral to a human grader would be likely, not certain.
Interestingly, my off-topic essay got a high score when submitted as a response
to one topic and a lower score as a response to a different topic. The program showed
an inconsistency approaching that of human readers after all. So much for a
gold standard of consistency in grading.
When I submitted the essay a third time, as the answer to a question for a
freshman class on the function of the heart and circulatory system, the
computer finally realized something was amiss and responded to my writing
sample with the wry comment: "A very ZEN essay. Your emptiness impresses
me." Apparently grading machines can get testy, too.
Perhaps it is wrong for me to stand in the way of technological progress. After
all, technology serves students as well as teachers; Web sites now offer
students the ability to download term papers free. If we're going to eliminate
the human reader, why not eliminate the writer as well, and let the computer
grading program interface directly with the computer-generated essay? In a way,
they deserve each other. Besides, eliminating the drudgery of writing would
allow students more free time that they could then spend with their newly
leisured instructors.
In any case, until the virtual writing and reading technology is perfected, my
advice to teachers, students, and people about to take the GMAT is, Don't quit
your day job.
_______________
Dennis Baron is a professor of English and linguistics and head of the English
department at the University of Illinois at
Urbana-Champaign.
Copyright 1998 by The Chronicle of Higher Education