The Crack in the Mirror: The Limit to Winning the Gen AI War

The big question in generative AI these days: what exactly is it good for? And especially, what can you charge for? 

A recent Fortune article describes what Microsoft is doing “as the war to make AI more useful intensifies.” In short, Microsoft is developing autonomous AI agents to compete with Salesforce, Google, Amazon, and Apple, and updating its Office 365 copilots to make them more useful. 

No question: generative AI is magic. But how much of it is illusion vs reality? Even the new Strawberry “reasoning” release from OpenAI, which reviewers are praising as a real step forward, is coming under fire for occasionally deceiving the user. 

Furthermore, apparently OpenAI doesn’t want the magic tricks in Strawberry to be revealed. 

Issues with the prior versions of ChatGPT and other LLM-driven chatbots are well known: hallucinations, bias, inaccuracies, false positives, lack of clarity, verbosity, and so on. 

This is not to say the new technology is no good. It’s actually amazing and a big step forward for the IT industry. It completely changes how humans interact with computers – and potentially changes how computers interact with each other as well.

But there’s a fundamental flaw in the gen AI human-language mirror: because human language is not precise, gen AI cannot be precise either. Given the nature of human language, it’s impossible to fully transform human language into computer language. 

There will always be a limit to what gen AI is good for, and that limit bodes especially ill for the success of autonomous AI agents.

Vendor Perspectives

Talking with Root Signals last week, we got into an interesting side conversation about the challenges of confirming that an AI bot is actually doing something useful. 

Root Signals’ evaluation platform works with any gen AI product. Users configure a set of evaluators, such as truthfulness, clarity, and relevance, and assign weights to those characteristics. They then submit a prompt to an LLM-driven chatbot, and Root Signals uses AI to score the results. 

These scores are relative, not absolute, because it isn’t possible to evaluate human language precisely. A human still has to use the platform to judge the results of the AI’s evaluation of the LLM. 
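To make the idea of weighted evaluators concrete, here is a minimal sketch of how a composite quality score might be computed from per-evaluator scores. The evaluator names, weights, and scoring logic are hypothetical illustrations, not Root Signals’ actual API.

```python
# Hypothetical sketch of weighted evaluator scoring; not Root Signals' API.
# Each evaluator is assumed to return a relative score between 0.0 and 1.0.

def composite_score(evaluator_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Combine per-evaluator scores into one weighted, relative score."""
    total_weight = sum(weights.values())
    return sum(evaluator_scores[name] * weight
               for name, weight in weights.items()) / total_weight

# Example: scores an AI judge might assign to one chatbot response.
scores = {"truthfulness": 0.82, "clarity": 0.67, "relevance": 0.91}
weights = {"truthfulness": 0.5, "clarity": 0.2, "relevance": 0.3}

print(f"Composite score: {composite_score(scores, weights):.2f}")
# The number is only relative; a human still has to judge whether
# the response actually communicated what the user needed.
```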

On top of that is the inherent gray area of statistical matching. We also spoke recently with 3LC, which has created a visual, gaming-style interface to help eliminate errors in custom AI models for applications such as automatically detecting bird nests on power lines, or removing bias when determining whether an ID photo is male or female. 

3LC’s sophisticated GUI highlights anomalies and false positives in the training data, allowing you to identify and fix errors, iteratively refining the training data to improve results.

One of Snyk’s founders, Guy Podjarny, recently started a new gen AI company called Tessl, whose goal is to develop and maintain software autonomously (this is in the future, as they say). 

In a recent blog post, Podjarny offers a 2×2 map for categorizing gen AI applications and gauging their utility. I think we’ve all known this for a while: applications of AI that assist with tasks you are already performing have so far been the most successful. 

Applications that fall into this category include meeting summary bots from Teams, Zoom, Google Meet, and others, as well as copilot-style code assistants. 

Another such AI application is Workbench from Tines, which assists security analysts in executing existing workflows to triage and remediate security-related outages and incidents. Tines Workbench uses gen AI to summarize and interact with volumes of log and observability data, reducing human effort. When Workbench suggests an action, however, it confirms the proposed action with a human.
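To illustrate the human-in-the-loop pattern this describes, here is a minimal sketch of an approval gate, assuming a hypothetical suggestion function and a console prompt; it is in the spirit of what Workbench does, not Tines’ actual product or API.

```python
# Hypothetical human-in-the-loop approval gate; not Tines' API.
# The AI may suggest a remediation, but only a human may approve it.

def propose_remediation(summary: str) -> str:
    """Stand-in for a gen AI suggestion derived from summarized log data."""
    return f"Restart the failing service referenced in: {summary!r}"

def execute(action: str) -> None:
    print(f"Executing: {action}")

suggestion = propose_remediation("auth-service error rate spiked at 02:14 UTC")
print(f"Proposed action: {suggestion}")

# The agent never acts on its own; a human analyst confirms first.
if input("Approve this action? [y/N] ").strip().lower() == "y":
    execute(suggestion)
else:
    print("Action rejected; no changes made.")
```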

The Battle of the Big Guys 

The Fortune article references the competitive environment around the autonomous agents Salesforce is rolling out, and says that Microsoft and Google are also working on them. Amazon and Apple, meanwhile, are upgrading Alexa and Siri.

Bill Gates is quoted as saying whoever wins the battle for the most capable AI agents will eclipse the competition and win the “war.” This is because people in the future may interact with autonomous AI agents instead of typing their own search strings into Google, or selecting their own products from the Amazon website catalog. 

The speculation goes on to include the idea that people will use autonomous AI agents instead of office software for sending email and creating spreadsheets, documents, or presentations. In Podjarny’s taxonomy, this will work only if the users trust the agents. 

But how can they, when AI can only understand human language statistically and work on the basis of probabilities? Autonomous AI agents use the same LLMs as the chatbots, and will face the same issues of hallucinations, inaccuracies, bias, and false positives. 

Last year Microsoft released a copilot for each of its Office products. According to the Fortune article, the most successful is the copilot for Teams that summarizes meetings. The article also said that many CIOs are finding it hard to justify the $30 per month per user expense of adding copilots to the other Office 365 products. 

This goes back to another point Podjarny makes in his blog post: if a gen AI application requires you to change the way you do something, that’s a barrier to adoption. I remember when Microsoft introduced the Ribbon Framework, for example. 

The Ribbon Framework is no doubt, now that I’m used to it at least, a better way to organize GUI widgets and functions than what Microsoft previously provided. However, I also remember it was a big pain in the ass to relearn where to find all the functions of Word that I normally use. No one likes to relearn how to use something they’ve been using for a long time.

Gen AI applications that change how someone is used to working are likely to face a similar barrier. 

Understanding Human Language 

Years ago I was helping a top software engineer write a system architecture specification. Several reviewers commented that they couldn’t understand parts of the text. I edited them to make them more understandable. But the engineer rejected my changes, saying he preferred that the text “be precise” rather than understandable. 

If the intended audience for the specification can’t understand it, it doesn’t matter whether the text is precise or not. The purpose of the language is to communicate information to other people. Communication happens when the reader (or listener) interprets the language in his or her own head. Communication does not happen by the simple fact of expressing an idea. 

People who communicate with others as part of their profession (for example, me here writing this column) often spend considerable time and effort to make sure that their audience understands what they say or write. You really have to put yourself in your audience’s shoes, so to speak, to be an effective communicator. 

But we all know that one of Murphy’s Laws is (or at least should be) that “anything that can be misunderstood, will be.” As hard as you try to communicate clearly, you know that understanding will never be perfect.

The hypothesis of linguistic relativity, for example, holds that human language is an expression of human thought, and that because language is distinct from thought, different languages are capable of expressing different thoughts, which in turn influences thinking. 

This basically means that something beyond language exists in the human brain, called thought (in English, at least), of which human language is an abstract expression. And because different languages have different expressions, translation from one language to another is never perfect either. 

The illustration on the Wikipedia page about language semantics shows the difference between a real-world apple and a human thought of an apple. I like to say the only real word is “word,” because it both is itself and describes itself. Most other words are abstractions of something else. The word “tree” is not a tree, for example, but describes one. 

Verbal communication depends on the speaker and the listener having the same, or at least compatible, understanding of words. For simple words and concepts this works pretty well. But the more complex an abstraction is – say a description of a computer system designed to process business transactions reliably – the more difficult it is to communicate clearly. 

All this is just to say that LLMs based on understanding human language have built-in limitations, because human language itself is limited in how well it can express and communicate thought, and in how precisely it can express complex abstractions. 

Perhaps, to succeed in the gen AI usefulness “war,” we will have to invent a new human language more suited to the task, or identify a subset of human language every LLM can understand. An “Esperanto” for gen AI, if you will. 

In other words, LLMs may never be able to understand human language well enough for autonomous agents, and gen AI applications may always need a human in the loop. Or humans will have to adapt their language to the limitations of LLMs, which I would not really expect. 

The Intellyx Take

It doesn’t seem possible for any practical application of AI to avoid having a human in the loop. I appreciate the dream of the autonomous AI agent, but I am very skeptical of its reality, given the inherent limitations of transforming human language into computer language.  

Human language isn’t compiled into precise executable code. Human language is something each person understands differently, and the only evaluation that really matters is what someone understands – not a probability of it. And as Murphy should have said, LLMs will misunderstand. 

It’s well known that gen AI results can be improved by prompt engineering: improving the input improves the results. It’s also well known that results can be improved by improving the data used to train a model. 

However, there’s a limit to how much the results can be improved. Model training is not an exact science. To expect exact results from every gen AI bot interaction is a mistake. 
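A toy example shows why exact, repeatable results are too much to expect: an LLM samples each next token from a probability distribution, so the same prompt can legitimately produce different outputs. The token probabilities below are invented purely for illustration.

```python
import random

# Invented next-token probabilities for the prompt "The capital of France is";
# a real model samples from a distribution over tens of thousands of tokens.
next_token_probs = {"Paris": 0.90, "paris": 0.05, "located": 0.03, "a": 0.02}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Run the same "prompt" several times: usually "Paris", but not always.
for _ in range(5):
    print(sample_next_token(next_token_probs))
```

The output is usually right, occasionally not, and never guaranteed, which is exactly the gap a human in the loop has to cover.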

This is why you will always need a human in the picture to interpret the results, and why autonomous agents are not likely to win the war of gen AI. 

Copyright © Intellyx B.V. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. As of the time of writing, Tines is an Intellyx customer. None of the other organizations mentioned in this article is an Intellyx customer. No AI was used to write this article. Image by Trevor Lawrence, from Pexels.
