As part of our Skills AI innovation, we've integrated generative AI into the Workera software in a variety of ways. Here, our psychometrics team explains how it works.
Months ago, we joined organizations across the globe in leveraging ChatGPT to take on some of our most time-consuming and creatively challenging tasks. For Workera, that includes writing multiple choice questions, which we call “assessment items.”
Because generative AI can accomplish creative and computational tasks so efficiently, many people are under the impression that AI poses a threat to existing jobs. However, our team’s early experience with AI has shown that the relationship between this cutting-edge technology and human work must be one of cooperation and enhancement rather than replacement. The creativity, knowledge, and expertise of our assessment developers – those who previously wrote items by hand – have simply been redirected to different points in the item-writing process, including prompt engineering and validation. Using ChatGPT for item writing has freed our assessment developers for research, development, and other creative work, substantially increasing the team’s productivity.
After months of using ChatGPT to develop high-quality multiple choice questions, we believe that any team can be better positioned to capitalize on the creative power of generative AI and to overcome some of its puzzling idiosyncrasies. Here's our guide to leveraging ChatGPT, based on what we've learned.
What to Include in Your Prompt
Suppose you work for a startup and are asked to develop a presentation on the capabilities of GenAI. You sit down to outline the presentation but quickly realize you still have a lot of unanswered questions. How long should the presentation be? Who will be in the audience? Should the tone be technical or informal? When you ask for further direction, though, all you receive is radio silence. With the deadline approaching, you have no choice but to complete a first draft and turn it in, only to receive feedback that the presentation was far too long and complex for those attending the session.
We all know how difficult it can be to succeed in a task without clear instructions. Yet when it comes to ChatGPT, many of us expect it to read our minds – we’re the bad boss who is great at delegating tasks but terrible at defining expectations. We want to be able to input “Write a multiple choice question about X” and have the model output exactly the kind of question we were picturing in our heads… but it doesn’t work that way.
ChatGPT isn’t so different from the rest of us: it must be given instructions that are as specific as the desired output. If we want ChatGPT to write a multiple choice question that is difficult, accurate, and appropriate for a defined audience, we must ask precisely for a multiple choice question that is difficult, accurate, and appropriate for the defined audience.
Completing the following steps will help you identify what information to include in the prompt for your multiple choice question (a sample prompt sketch follows the list):
Define the audience. If those who will be taking your assessment are “business leaders” or “data translators,” let ChatGPT know. Or maybe the assessment has several possible audiences? Let it know that, too! If your item is targeted at a few different groups, however, it helps to identify which groups are of primary and which are of secondary importance.
Identify the difficulty and cognitive level. Specify in your prompt whether you want the item to be easy, medium, or hard, and whether the complexity should be at a recall, application, or analysis level. Here’s the catch: we find that ChatGPT tends to produce items slightly less difficult and complex than you ask for. So, if you’re writing a high school quiz and want an application question of medium difficulty, you may want to ask for a hard analysis question instead.
Require that the question be “research-based”. This is a critical step, as it forces ChatGPT to cite its sources, making it easier to spot when ChatGPT is “hallucinating” – which is AI experts’ preferred euphemism for “confidently making things up.”
Ask for a rationale for all response options. You’re just helping yourself out with this one. Every option should be plausible but clearly right or wrong (depending on whether the option in question is the key or a distractor), and ChatGPT’s rationale will help you quickly assess if that’s the case.
Provide stylistic direction. For example, at Workera, we steer clear of using first- and second-person pronouns in our assessment items. To save ourselves editing time, we specify in our prompt that we do not want ChatGPT to include these pronouns.
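To make this concrete, here is a minimal sketch of how these elements might be assembled into a prompt and sent to the model programmatically. It assumes the OpenAI Python client; the helper name, prompt wording, model name, and example parameters are illustrative, not our production template, so adapt them to your own workflow.

```python
# Minimal sketch: assemble the prompt elements above into one explicit request.
# Assumes the openai>=1.0 Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def build_item_prompt(skill, audience, difficulty, cognitive_level):
    # Every parameter we care about is stated explicitly in the prompt.
    return (
        f"Write a research-based multiple choice question assessing '{skill}'.\n"
        f"Primary audience: {audience}.\n"
        f"Difficulty: {difficulty}. Cognitive level: {cognitive_level}.\n"
        "Cite the sources the question draws on.\n"
        "Provide four response options, identify the key, and give a brief "
        "rationale for why each option is correct or incorrect.\n"
        "Style: do not use first- or second-person pronouns."
    )

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_item_prompt(
        "data literacy", "business leaders", "hard", "analysis")}],
)
print(response.choices[0].message.content)
```

The same elements work just as well typed directly into the ChatGPT interface; the point is simply that every expectation you have appears explicitly in the prompt.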
Overcoming Issues
Unfortunately, even with the right parameters specified in the prompt, ChatGPT will not always generate output that meets your expectations. Here are some of the persistent issues we have encountered while developing our assessment items, along with strategies for overcoming them:
Issue One: Distractors that are too easy to eliminate. For whatever reason, ChatGPT often struggles to generate response options that are clearly wrong yet still plausible.
Solution: A quick fix is to ask for more distractors than are needed (maybe 5 or 6), out of which you can select the best of the bunch. Another strategy is to ask ChatGPT to generate a multiple choice question that asks the learner to identify the option that is BEST or MOST likely. Consider, for instance, the difference between the response option sets for the questions “What is the definition of artificial intelligence?” and “Which of the following is the BEST definition of artificial intelligence?” As you can imagine, the latter question requires a much more nuanced set of response options. A final strategy is to ask ChatGPT to self-reflect on any poor distractors and generate better ones where necessary. This last approach may require a longer conversation, but can produce the best results.
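If you are scripting this rather than working in the chat window, the “extra distractors plus self-critique” strategy can be expressed as a short multi-turn exchange. The sketch below reuses the client and the build_item_prompt helper from the earlier example; the prompt wording is illustrative.

```python
# Multi-turn follow-up: generate the item, ask for extra distractors, then ask
# the model to critique and replace the weak ones.
messages = [{"role": "user", "content": build_item_prompt(
    "data literacy", "business leaders", "hard", "analysis")}]

follow_ups = [
    "Now write six plausible but clearly incorrect distractors for this "
    "question, with a one-sentence rationale for each.",
    "Review the distractors you just wrote. Replace any that could be "
    "eliminated without knowledge of the skill with stronger alternatives.",
]

for follow_up in follow_ups:
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant",
                     "content": reply.choices[0].message.content})
    messages.append({"role": "user", "content": follow_up})

final = client.chat.completions.create(model="gpt-4", messages=messages)
print(final.choices[0].message.content)
```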
Issue Two: Item does not seem targeted or relevant to the skill. Every once in a while, ChatGPT will generate an item that does not seem to assess the skill at all.
Solution: The key here is to ensure, even before generating a first draft, that ChatGPT is familiar with the concept or skill being assessed. You can do this by simply asking “Do you know about X skill or concept?” or, if you don’t feel like entering into conversation with the model, you can preemptively copy and paste relevant passages from a reputable and relevant source into the prompt to inform the output.
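Here is a hedged sketch of the second approach, priming the prompt with pasted source material before asking for an item. The file name, skill, and parameters are placeholders; the passage should come from a reputable source you trust.

```python
# Prime the prompt with pasted reference material so the model writes from a
# known-good source rather than from memory. The file name is a placeholder.
with open("kpi_reference_passage.txt") as f:
    reference_text = f.read()

grounded_prompt = (
    "Use only the following reference material when writing the question:\n\n"
    f"{reference_text}\n\n"
    + build_item_prompt("sales KPIs", "business leaders", "medium", "application")
)
```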
Issue Three: The scenario in the stem is unnecessary. Sometimes the scenario makes the item look like it has a cognitive level of application when it is really just a recall question.
Solution: Consider a stem that reads: “An international e-commerce company has decided that it needs to update its sales KPIs based on new market research conducted by a data analyst. Which of the following BEST describes a KPI?” Clearly, the scenario has little to do with the cognitive task being asked of the learner. As the assessment developer, you have a couple of ways forward. First, you can simply delete the scenario and use the item as a recall question. If you need a more complex or difficult item, however, the most time-effective option is to regenerate the item altogether, this time asking for a “Situational Judgement Test” question rather than a multiple choice question. This will yield questions that are more likely to have a scenario that genuinely matters to the answer.
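If you script the regeneration, the change is just a different ask in the prompt. The wording below is illustrative, not a canonical SJT template.

```python
# Regenerate as a situational judgement question rather than a plain multiple
# choice question; the skill, audience, and wording are placeholders.
sjt_prompt = (
    "Write a research-based Situational Judgement Test question assessing "
    "'sales KPIs' for business leaders, where the scenario is essential to "
    "choosing the BEST course of action. Provide four options, identify the "
    "key, and give a rationale for each option."
)
reply = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": sjt_prompt}])
```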
Issue Four: ChatGPT gets tired and lazy. But seriously: if you generate several multiple choice questions in a row, the quality of the model’s output may begin to tail off noticeably.
Solution: As is so often the case, you can start by restarting: open a fresh conversation and begin again. If that doesn’t work and you are still noticing symptoms of fatigue, you can work around the model’s exhaustion by providing clearer direction in the prompt, in other words, by doing some of ChatGPT’s work yourself. Practically, this means carrying out the measures outlined in the solution for Issue Two here as well. As a side note, when you are in a long conversation with ChatGPT, make it a habit to reiterate the context of the conversation from time to time; this will keep the conversation focused and productive.
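For long scripted sessions, reiterating context can be as simple as re-sending a short reminder message every few items. The sketch below reuses the client from the earlier examples; the reminder text and cadence are arbitrary choices, not a recommendation about the ideal interval.

```python
# Re-send a short context reminder every few items during a long session.
context_reminder = (
    "Reminder: we are writing hard, analysis-level, research-based multiple "
    "choice questions on data literacy for business leaders, with rationales "
    "for every option and no first- or second-person pronouns."
)

messages = []
for item_number in range(1, 11):
    if item_number % 5 == 1:  # restate the context on items 1 and 6
        messages.append({"role": "user", "content": context_reminder})
    messages.append({"role": "user",
                     "content": f"Write multiple choice question number {item_number}."})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant",
                     "content": reply.choices[0].message.content})
```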
Issue Five: The correct answer to an item seems subjective. Sometimes ChatGPT will generate a multiple choice question where the correct answer – or “key” – seems arguable.
Solution: There are a few viable approaches to checking whether an item is subjective. First, you can ask ChatGPT 3.5 (or another AI application, such as Bard) to answer the same question you generated in ChatGPT 4.0. If the question is subjective, you're likely to receive answers that differ from your intended key, perhaps even a variety of them. Another option is to paste the whole question back into ChatGPT and ask it to provide the key; if the question is genuinely subjective, it will often pick a different answer. Finally, refer to the rationale provided by ChatGPT: if the question is subjective, the rationale tends to be generic and questionable.
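If you want to automate the first check, you can ask a second, different model to answer the generated item and flag disagreements with your intended key. The sketch below assumes the OpenAI Python client; the model name, placeholder question text, and letter-matching logic are illustrative.

```python
# Ask a second model to answer the generated item and compare with the key.
def second_opinion(question_text, intended_key, model="gpt-3.5-turbo"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question_text +
                   "\n\nAnswer with the letter of the correct option only."}],
    )
    answer = reply.choices[0].message.content.strip()
    return answer, answer.upper().startswith(intended_key.upper())

generated_question = "..."  # paste the full item here, options included
answer, agrees = second_opinion(generated_question, intended_key="B")
if not agrees:
    print(f"Possible subjectivity: the second model chose {answer}, not the key.")
```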
Key Takeaways
Okay, we know – that was a lot. It’s time to summarize. Here are a few simple principles that can guide your use of ChatGPT for generating multiple choice questions, and honestly, just about any other complex task:
You get out as much as you put in. ChatGPT can greatly reduce the work required to accomplish a task, but it does not replace work altogether. While we don’t have to write multiple choice questions from scratch anymore, developing the right prompt to generate high-quality items requires a great deal of thought and technique.
Human input is crucial. Because of ChatGPT’s tendency to hallucinate or misunderstand a prompt, it is essential that those who are engineering prompts for assessment items have a strong knowledge of the relevant subject matter as well as assessment best practices.
When in doubt, treat ChatGPT like a human. This might be the most important and helpful point of all: best practices for using ChatGPT are essentially identical to best practices in human conversation. And that shouldn’t surprise us, given that neural networks are loosely inspired by the human brain. ChatGPT gets exhausted and needs to rest and reboot, it can lose focus during long conversations, and its claims can be riddled with factual errors or subjectivity. Sound familiar?
Conclusion
At Workera, we're excited to be at the forefront of GenAI adoption, and we hope that the insights in this resource will help your organization implement GenAI with similarly positive results. Far from threatening the jobs of even the most skilled and knowledgeable members of your team, ChatGPT is an incredibly powerful tool that is best used to enhance the existing knowledge and skill of the user. It is important to remember that ChatGPT is like any other technology: it helps us accomplish tasks, but it requires, precisely in proportion to its power, oversight from those with a strong understanding of its risks.
On our assessment development team, this has meant going above and beyond to ensure that every assessment is thoroughly reviewed by subject matter experts so that any instances of hallucination or subjectivity are detected and eliminated from the model’s outputs. If your organization takes care to adopt best practices when working with GenAI, many of which we have outlined above, then we are confident that you, too, will be positioned to leverage this incredible technology to improve both the product and the productivity of your team.