SummAItive Assessment

Happy end-of-July, human,

It’s the summer holidays here in the UK and the sun is out, so this is just a quick one to keep you all abreast of some interesting tidbits in the AI/education space.

📚 Knowledge builders

I want to try a bit of longer-form content here, so do let me know if you enjoy it and would like more of it in the future.

I want to focus on two studies on the use of AI in summative assessment. The two share some similarities, but as far as I am aware they were conducted independently of each other.

The first study is called Can AI provide useful holistic essay scoring? by Tate et al. It builds on an earlier study by some of the same co-authors: this time ChatGPT 4 is used to mark the essays, and the prompts give more guidance on the content of the rubric. The purpose of both the LLM and the human raters was not to provide formative feedback on the essays, but simply to score them.

Here is an example of one of the prompts used in the study:

Pretend you are a secondary school teacher scoring class essays based on this holistic rubric from 1 (minimum) to 6 (maximum), with the distance between each number (e.g., 1–2, 2–3) considered equal. A score of 6 means that the essay presents a clear, compelling, and accurate argument that addresses all requirements of the prompt; supports claim with relevant and sufficient evidence and compelling reasoning that connects evidence to claims; integrates sufficient, appropriate evidence from multiple sources and attributes evidence to sources, citing the title, author, and/or genre; evaluates reliability of sources and discusses how they confirm each other; effectively addresses and refutes a counterclaim with strong evidence and reasoning; writing is well organized with a strong introduction, body, and conclusion and transitions to create coherence; demonstrates effective and varied sentence fluency with little to no errors in writing conventions; uses sophisticated language and academic tone.

Each essay is indicated by “” and the text immediately prior to each essay is the id of the essay.

For each following essay, provide only the overall score of 1–6 based on the above rubric. Do not provide feedback other than the score. Set temperature to 0.1. Format the output as JSON in the following format for each essay
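
To make the mechanics concrete, here is a minimal sketch of how you could send a prompt like this through the OpenAI Python SDK yourself. This is not the study's actual setup: the model name, the JSON parsing and setting temperature as an API parameter (rather than as an instruction inside the prompt) are all my assumptions.

```python
# A minimal sketch (not the study's code) of scoring essays 1-6 with a rubric prompt.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC_PROMPT = (
    "Pretend you are a secondary school teacher scoring class essays based on this "
    "holistic rubric from 1 (minimum) to 6 (maximum)... "  # the full rubric quoted above
    "For each following essay, provide only the overall score of 1-6 based on the above "
    "rubric. Do not provide feedback other than the score. Format the output as JSON: "
    '{"id": "<essay id>", "score": <1-6>}'
)

def score_essay(essay_id: str, essay_text: str) -> int:
    """Ask the model for a single holistic score (1-6) for one essay."""
    response = client.chat.completions.create(
        model="gpt-4",      # assumption; the study used ChatGPT-4 via its own setup
        temperature=0.1,    # temperature is an API parameter, not a prompt instruction
        messages=[{"role": "user", "content": f'{RUBRIC_PROMPT}\n\n{essay_id}\n"{essay_text}"'}],
    )
    return json.loads(response.choices[0].message.content)["score"]
```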

There were 18 human raters, each of whom received three hours of training on how to mark the essays fairly and accurately. Their scores were moderated by some of the researchers to improve reliability. The essays used came from publicly available corpora.

The main headline results were:

  • Humans and AI were substantially internally consistent (i.e., human-human, AI-AI scores).

  • Mean differences between human-human and AI-human scores were not statistically significant.

  • Weighted kappas showed substantial agreement between human scorers, and moderate to fair agreement for the AI-human comparison (see the quick sketch below).
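
If weighted kappa is new to you, it is an agreement statistic that punishes big disagreements (say, a 2 versus a 6) more than near misses (a 4 versus a 5). Here is a tiny illustration using scikit-learn with quadratic weighting; the scores are invented, and the paper may use a different weighting scheme.

```python
# Illustration only: these scores are invented, not data from the study.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 5, 2, 6, 4, 3, 5]   # hypothetical human ratings (1-6)
ai_scores    = [3, 4, 4, 2, 5, 4, 3, 6]   # hypothetical LLM ratings of the same essays

# Quadratic weighting penalises large disagreements more heavily than near misses.
print(cohen_kappa_score(human_scores, ai_scores, weights="quadratic"))
```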

Put simply, there was not a huge difference between the scores provided by the AI and the humans, and the AI would have done it more quickly and without the three hours of training.

An interesting point to note here is that ‘zero-shot’ prompting was used. This means the model was given the prompt and nothing else: there was no back and forth, and no example of what a good score would look like. The LLM was totally blind in that regard.

It would be interesting to see whether the scores change if an example and a non-example were provided for the LLM to compare against.
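
For what it's worth, that would be a small change to make. Here is a hedged sketch of what a few-shot version of the earlier scoring call could look like; the example essays, their scores and the message structure are all invented for illustration.

```python
# Sketch of a few-shot variant: show the model a strong example and a weak non-example
# (both invented here) before asking it to score the real essay.
import json
from openai import OpenAI

client = OpenAI()
RUBRIC_PROMPT = "<the rubric prompt quoted above>"  # same prompt as in the earlier sketch

FEW_SHOT_MESSAGES = [
    {"role": "user", "content": RUBRIC_PROMPT},
    {"role": "user", "content": 'example_high\n"<an essay the human raters scored 6>"'},
    {"role": "assistant", "content": '{"id": "example_high", "score": 6}'},
    {"role": "user", "content": 'example_low\n"<an essay the human raters scored 2>"'},
    {"role": "assistant", "content": '{"id": "example_low", "score": 2}'},
]

def score_essay_few_shot(essay_id: str, essay_text: str) -> int:
    """Score one essay after showing the model a worked example and a non-example."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.1,
        messages=FEW_SHOT_MESSAGES + [{"role": "user", "content": f'{essay_id}\n"{essay_text}"'}],
    )
    return json.loads(response.choices[0].message.content)["score"]
```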

The next study comes from the UK and is called Can Large Language Models Make the Grade? It looks at marking short-form answers from the Carousel platform using AI and human markers. The pupils who wrote the answers were aged between 7 and 16.

They chose 6 short-form questions from science and 6 from history, made up of 2 questions per Key Stage (Key Stages 2, 3 and 4) in each subject.

The 1,710 responses that could not be auto-marked (i.e., that were not word-for-word correct against a model answer) were passed to 40 teachers, who marked them.

These responses were also sent through ChatGPT 3.5 (no longer available) and ChatGPT 4 with minimal prompting (no details of the prompts used are publicly available).
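
To make that pipeline concrete, here is a rough sketch of the split the paper describes: answers that match the model answer word for word are auto-marked, and everything else goes to a marker, whether human or LLM. The normalisation rules and the LLM prompt are my guesses; Carousel's actual prompts and code are not public.

```python
# Rough sketch of the marking split described in the paper. The normalisation and the
# LLM prompt below are assumptions, not Carousel's implementation.
from openai import OpenAI

client = OpenAI()

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def auto_mark(pupil_answer: str, model_answer: str):
    """Return True if the answer is word-for-word correct, None if it needs a marker."""
    if normalise(pupil_answer) == normalise(model_answer):
        return True
    return None  # not an exact match: pass to a teacher or an LLM

def mark_with_llm(question: str, model_answer: str, pupil_answer: str) -> bool:
    """Hypothetical LLM fallback for responses that could not be auto-marked."""
    prompt = (
        f"Question: {question}\nModel answer: {model_answer}\n"
        f"Pupil answer: {pupil_answer}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4", temperature=0, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip().lower() == "correct"
```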

They found that ChatGPT 4 was the more accurate of the two models, though not quite as accurate as teacher marking (and the teachers did not all agree with each other either). However, the teachers took approximately 11 hours to complete the work; ChatGPT took 2 hours.

So why did I want to talk about these two studies? There are some interesting similarities between their conclusions.

When it comes to summative assessment, it seems that ChatGPT 4 is much faster than humans and not massively outside any margin of error in terms of its accuracy. Personally, I think that, with an element of moderation, this is a trade-off worth exploring further, particularly now that ChatGPT 4o is a superior model to ChatGPT 4.

This could be important for secondary teachers and Year 6 teachers who are required to provide summative judgements based on written responses and portfolios.

🤖 Industry updates

  • SearchGPT: For a long time I have said how unhelpful it is to think of an LLM as an internet search engine, but that line is getting blurrier with the announcement of SearchGPT. It will attempt to combine the user interface and experience of ChatGPT with more real-time information. Right now it’s in preview and you need to join a waitlist.

If these deeper dives were interesting, let me know by replying or reaching out to me on Twitter @Mr_AlmondED.

As ever, thanks for reading and keep on prompting! Mr A 🦾