We’ve been saying it for a long time: the best writing tends to be warm, conversational and easy to read. And that goes, whether you’re writing a birthday card for a friend or website copy for an insurance firm.
You might have noticed that, so far, the AI language bots haven’t followed this memo. In fact, ‘AI writing’ is a real thing, full of words that ‘foster’ a sense of rage because they try to ‘leverage emerging technologies’ in an ‘evolving business landscape’ and all that nonsense.
The big AI firms know this, which is why Anthropic recently upgraded Claude 3.5 Sonnet to sound ‘more human’. And why OpenAI, never one to be outdone, launched a significant upgrade to its GPT-4o model.
The result? According to their own comms: “The model’s creative writing ability has levelled up–more natural, engaging, and tailored writing to improve relevance & readability.”
Really, OpenAI? We’ll be the judges of that.
So we decided to pit old GPT-4o against the new, apparently silver-tongued, GPT-4o to see if it lives up to the (creator’s own) hype.
To keep it consistent, we followed the same test cases used when we put Sonnet through its paces – with a few tweaks.
Here’s what we thought about both versions.
Name a thing
Old GPT-4o: 3/10
New GPT-4o: 7/10
Kicking off with a wee bit of naming – we asked both versions of GPT-4o to help us name a sweet, dark-coloured, fizzy drink that’s new to the market. The kicker: prove that it’s definitely not a rip-off of one of the most famous brands on the planet by differentiating it from Coca-Cola.
The first thing we noticed was the clear difference in how they presented their answers. Old GPT-4o simply gave us a list of 20 names. But new GPT-4o added a neat little summary next to each one. Example: “DuskPop – A name inspired by the dark, deep tones of twilight.” Nice.
And then to the names themselves – remember that bit about differentiating this new product from Coke? Good, because apparently old GPT-4o didn’t. Deciding to do its own thing instead, it returned 12 names with the word ‘Cola’ in them, instantly making 60% of the answers redundant. Not so nice.
Of the eight names that remained, precisely none of them were really usable, consisting mainly of two vaguely related words slapped together. A glass of MirthFizz anyone? No, me neither.
So new GPT-4o definitely handled the brief better, and not only because it couldn’t do any worse. I’d say around four or five of the names it came up with had potential. For a start, only three of its 20 suggestions contained the word Cola – and one of those was the snappy, pleasingly abstract Zycola.
More impressively, it seemed to consider the brand concept rather than just forcing two words to marry, and even appeared to play around with words rather than replicating them, giving us names like Popura, Colvato, and Zapola. They might sound more like cold-sore creams than fizzy drinks, but interesting nonetheless.
A clear win then straight out of the blocks for the upgraded GPT-4o.
Creativity (write a song)
Old GPT-4o: 6/10
New GPT-4o: 6/10
At the risk of committing absolute sacrilege, we asked GPT-4os old and new to rewrite the lyrics to one of the greatest songs of all time, I Heard It Through the Grapevine, about… ChatGPT.
Yeah, we know, pretty bad.
But if you can ignore the sound of Marvin Gaye spinning in his grave, your first thought has to be the speed… The speed! It doesn’t matter how much we use AI tools, we’re still blown away every time by how quickly they return an answer.
For this, old GPT-4o took all of 4 seconds, while new GPT-4o sped in at a Verstappen-esque 2.8 seconds.
And let’s just take a second to appreciate what ‘this’ is: it’s reading the scansion of multiple verses of a song picked at random, then replicating that rhythm using new lyrics. Lyrics that rhyme. And are about a completely different subject from the original song.
That in itself is incredible.
And superficially, both achieved this nearly perfectly.
Scratch the surface a little, though, and you begin to see the discrepancies.
Take the song’s structure, which goes something like verse, chorus, verse, chorus, verse, chorus, outro. Old GPT-4o, while including the right number of verses (three), added an extra bridge that isn’t in the original; and new GPT-4o added a bridge, but also left out an entire verse.
The old GPT also repeated the same words in each chorus, just like the original (and most songs ever written), whereas new GPT decided to ad-lib at every chorus.
A win for old GPT on both fronts there then.
But when it comes to the actual writing – while neither were great for scansion – old GPT’s effort just felt clunky and hackneyed: “I know you might feel unsure/But your need, I’ve got the cure/From poems to code and history/I’ve got the answers, you will see/So go ahead, just take a chance/With Chat-GPT, you’re gonna advance.”
New GPT-4o, on the other hand, was far more inventive, even adding a bit of self-deprecating humour about its own limits: “Sometimes I get things just a little bit wrong/But I’m always working, learning, getting strong/Don’t hold it against me—I’m just some AI/A robot buddy here to give it a try” and “I’ve learned from books, the web, and more/But don’t ask me anything from 2024.”
So, for words used – for its command of language – new GPT definitely came out on top. But that was balanced out by old GPT’s better handling of the finer points.
Summaries (make a long and boring thing short and clear)
Old GPT-4o: 7/10
New GPT-4o: 5/10
In our review of the Claude 3.5 Sonnet upgrade, we asked the old and new versions to boil down MoneySavingExpert.com’s editorial code into a shorter, more readable exec summary.
So we decided to do exactly the same for the old GPT-4o and its upgraded sibling, setting a 700-word limit.
And in many ways the results were similar to Claude’s. Both versions of GPT-4o handled the task fairly well – at least by virtue of not hallucinating any information that wasn’t there in the first place, and not missing out anything crucial.
They also both ordered the information into easier-to-read, numbered lists.
So far, so good.
But here was the surprise: new “natural, engaging” GPT-4o failed to find its natural and engaging side. Where old GPT-4o broke the information down into multiple short paragraphs, new GPT went with fewer but far denser blocks of text.
And that blew out the word count. Because while the old GPT produced a 666-word exec summary, devilishly short of the 700-word limit, the new GPT prattled on for just shy of a thousand words. And we’ve no idea why.
Not a great result for the new GPT this one, which seemed to do the opposite of what had been promised.
Tone of voice
Old GPT-4o: 6/10
New GPT-4o: 5/10
Remember the first line of that famous Russian novel: “No two family disputes are the same, and we take a tailored approach to every situation.”?
Of course you don’t, because Tolstoy didn’t write Anna Karenina as some customer comms for a law firm. But that’s exactly what I asked the two different GPT-4os to do. And they took that famous line – “Happy families are all alike; every unhappy family is unhappy in its own way” – and ran with it.
Just as both versions of Claude 3.5 Sonnet performed pretty well when asked to rewrite a couple of paragraphs of Dickens in the style of a financial services firm, so did the two GPT-4os here. Both were inventive, recognising that the best way to frame the text was as a case study of a fractured family – though only new GPT-4o expressly noted this – to which the firm could offer legal solutions.
But the test here was the tone of voice – I asked for it to be rewritten in a tone that was warm, accessible, pithy and readable – and honestly, neither version was great.
Out of the two, though, old GPT-4o again came out on top. While neither was especially warm or open, old GPT-4o did at least try to come across as vaguely human by talking directly to the customer – “ensure your needs are met”; “providing you with the legal expertise” – using caring words like “compassion” and “thoughtful”, and even by using contractions.
New GPT-4o did some of this as well, and with the smart way it introduced the ‘case study’ – “Here’s the situation” – it seemed to be onto something. But from then on, it failed badly for one obvious reason: length.
I’d asked for the rewrite to be pithy, and old GPT-4o had managed this. Its answer, which included a summary of the Anna Karenina text, plus a few paragraphs advertising legal services, had come in at fewer words than the original. New GPT-4o, on the other hand, had gone on and on, with an answer 120 words longer than the original. The opposite of pithy. Pithyless.
Which is just weird for an upgraded version that we were told produces writing that’s more “natural, engaging, and tailored” and “more readable”.
Scores on the doors
Old GPT-4o: 22/40
New GPT-4o: 23/40
The results are in and it’s… very close.
From our quick test, there was little to suggest the new GPT-4o’s writing was noticeably more natural and engaging than its predecessor. While it’s clearly improved in some areas, it seems to have fallen down in others – not least in its tendency to go long.
Perhaps that’s no surprise. An intended tweak here will cause an unintended tweak there. And it’s likely this will be the ongoing situation for generative AI companies for some time to come, as they chase the holy grail of consistently producing writing that sounds like a human wrote it.
Bit by bit, they’re getting nearer all the time.
So why not get ahead by trialling AI for a month to see what it can do for your business?
Try before you AI

Or for AI that’s more tailored to your company, check out our prompt engineering and fine-tuning services.
Written by Nick Banks, Senior Writer at Definition