Talking parrots and biased outputs in large language models

[Image: a parrot reading a book]

As businesses integrate large language models (LLMs) like ChatGPT into their processes, they will need to consider the hidden, non-obvious biases that these LLMs produce. Such biases can be difficult to identify, anticipate, and remedy.

I discovered a non-obvious bias in ChatGPT 4’s outputs when creating bedtime stories for my son. “Dad stories,” as my son likes to call them, are choose-your-own-adventure-style stories that feature my son, our dog, and a challenge to overcome. I start with the same basic prompt to generate the beginning of a story:

Please draft an epic choose-your-own-adventure bedtime story for a five-year-old boy named [my son]. At least three times during the course of the story, you should provide a choice of what should happen next. The choice should be offered as either choice A or choice B. You should wait for my response — either A or B. Then you should continue the story based on the answer that I choose.

The story should include [my son] and his trusty sidekick, a brown-and-white dog named Henry. The story should have a beginning, middle, and end, and it should have a challenge that [my son] has to overcome. The story should have a few colorful and entertaining characters who have speaking roles and help to move the plot along. The story should be about [my son] and Henry…[enter plot here].

Prompt for ChatGPT

After a few weeks of nightly storytelling, I started noticing a recurring character: a talking parrot! Was there something in my prompt that kept calling for parrots? Reviewing my prompt, I thought I had found the culprit: I asked ChatGPT to supply “a few colorful and entertaining characters.” Parrots are often colorful in both personality and appearance, so it makes sense that a parrot would be a statistically probable output. Going forward, I changed “colorful and entertaining” to “interesting and entertaining,” hoping to give ChatGPT more room to choose non-parrot characters.

But the parrots kept coming. I write this blog post after having just finished a story about my son introducing his baby sister to daycare…which featured the help of none other than a talking parrot! Out of 46 “Dad stories,” talking parrots have appeared in nine, including stories about creating a garden, visiting the Bronx Zoo, visiting a tortoise at the Bronx Zoo (a separate story), visiting the USS Intrepid on the Upper West Side, taking Henry on a walk around the neighborhood, participating in a dog agility contest, taking trash to the waste transfer center, and going on a sailboat.

Talking parrots aren’t the only hidden bias in these stories. An “enchanted forest” has appeared in eight stories, including one story that was supposed to be about my son and his dog “going to the grocery store to get eggs and English muffins” (my son chooses what the stories will be about). A talking owl has also appeared in eight stories, including one story about “going to see something cool and really large” (again, my son chooses what the stories will be about).

While it might not be surprising for a parrot to appear in a story about going to the zoo or going sailing, it is puzzling when a parrot appears in a story about a dog agility contest. And when an owl shows up in a story that could literally be about anything (“something cool and really large”), it’s hard to deny that some sort of bias is at work. There must be something else in my prompt that is causing ChatGPT to create stories with parrots, enchanted forests, and owls: maybe it’s the request for an “epic” story; maybe it’s the request for a children’s story with talking characters; or maybe it’s the inclusion of a dog as the first animal character. More likely, the confluence of all the requests in the prompt makes ChatGPT more likely to choose parrots, enchanted forests, and owls than most other substitutes.

Businesses have already started using LLMs to interface with customers, generate advertising content, and summarize reports from lengthy data sources. If we view each LLM output in isolation, we might have trouble detecting these biases. But by running an LLM on the same prompt over and over, or on slightly different prompts, we may be able to uncover biases that we didn’t recognize and didn’t anticipate.
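One way to surface this kind of bias is simply to sample the same prompt many times and count how often a suspect motif appears. Below is a minimal sketch of that idea in Python, assuming the official OpenAI client; the prompt text, motif list, model name, and sample count are illustrative placeholders rather than details from my actual setup.

```python
# Minimal sketch: sample the same story prompt many times and count how often
# suspect motifs ("parrot", "owl", "enchanted forest") appear in the outputs.
# Assumes the official OpenAI Python client and an API key in the environment;
# PROMPT, MOTIFS, the model name, and N_SAMPLES are illustrative placeholders.
from collections import Counter

from openai import OpenAI

client = OpenAI()

PROMPT = "Please draft an epic choose-your-own-adventure bedtime story..."  # placeholder
MOTIFS = ["parrot", "owl", "enchanted forest"]
N_SAMPLES = 50

counts = Counter()
for _ in range(N_SAMPLES):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT}],
    )
    story = response.choices[0].message.content.lower()
    for motif in MOTIFS:
        if motif in story:
            counts[motif] += 1

# Report how many of the sampled stories contained each motif.
for motif in MOTIFS:
    print(f"{motif}: {counts[motif]}/{N_SAMPLES} stories")
```

Run over enough samples, a tally like this turns an anecdotal hunch (“lots of parrots lately”) into a number you can track before and after changing the prompt.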

Biased outputs can be more insidious, too. One can imagine how an LLM-powered customer-service chatbot might consistently steer customers who use one dialect toward a more favorable outcome than customers who use another. One can likewise imagine customers with large vocabularies being accommodated more readily than those who are marginally literate, which has broader implications for how we treat people in society.

It is imperative, then, for businesses to closely examine how they use LLMs, and what hidden biases might be contributing to their outputs. And after undertaking such an examination, businesses must have a testable plan to remedy any bias in the outputs. Otherwise, we face the prospect of drowning in talking parrots…or worse.