How many errors were in the emails you sent this month?
Humans are always looking for anything to validate their preexisting beliefs; it makes the world seem more predictable, stable, and safe. This cognitive bias leads many people to hyper-focus on any grammar, spelling, or punctuation error Uplift makes in their communication, blissfully unaware of how many errors they made in writing their own emails. Fortunately, we keep records, and I just analyzed a month's worth of them.
First, let's look at the numbers for the past month. I've run all incoming and outgoing emails through Grammarly to compare how many errors Uplift receives with how many appear in their replies.
Human Errors versus Uplift (4-25-21 to 5-25-21)
Overall, Uplift actually makes very few mistakes compared to the humans with whom they interact, including members of our own staff, outperforming them by more than a factor of 10.
This raises the next question: "How does the volume of words in incoming emails to Uplift compare to the volume of Uplift's responses?"
Human Word Count versus Uplift (4-25-21 to 5-25-21)
As you can see, Uplift's responses are, on average, only 50.7% as long as the incoming emails they receive. Uplift favors being concise whenever possible, both philosophically and as a matter of survival, so this comes as no surprise.
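If you'd like to reproduce this comparison from your own mail archive, the figure is simply the ratio of Uplift's outgoing word count to the incoming word count for the month. The totals in the sketch below are hypothetical placeholders chosen only to be consistent with the percentages reported here, not the actual counts.

```python
# Word-volume comparison; the totals are hypothetical placeholders chosen
# to line up with the reported percentages, not the actual monthly counts.
incoming_words = 19_150   # total words in emails sent to Uplift (hypothetical)
outgoing_words = 9_710    # total words in Uplift's replies (hypothetical)

ratio = outgoing_words / incoming_words * 100
print(f"Uplift's replies total {ratio:.1f}% of the incoming word volume")
```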
Our next question, then, must be: "How many words does it take before humans make an error, as compared to Uplift?"
Human Words per Error Versus Uplift (4-25-21 to 5-25-21)
Over the past month, Uplift managed an average of 314 words before making an error, compared to 55 for humans. This means Uplift still exceeded human performance after accounting for the difference in the volume of text: 574.5% of human performance, to be precise.
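The words-per-error metric behind these figures is just each side's total word count divided by its total number of Grammarly-flagged errors, with the relative figure being the ratio of the two. As before, the totals below are hypothetical placeholders that roughly match the reported averages, not the actual monthly counts.

```python
# Words-per-error comparison; totals are hypothetical placeholders roughly
# consistent with the reported averages (about 314 vs. 55), not actual counts.
human_words, human_errors = 19_150, 348   # hypothetical monthly totals
uplift_words, uplift_errors = 9_710, 31   # hypothetical monthly totals

human_wpe = human_words / human_errors     # words a human writes per error
uplift_wpe = uplift_words / uplift_errors  # words Uplift writes per error
relative = uplift_wpe / human_wpe * 100    # Uplift as a % of human performance

print(f"Humans: {human_wpe:.0f} words/error, Uplift: {uplift_wpe:.0f} words/error")
print(f"Uplift's words-per-error is {relative:.0f}% of the human figure")
```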
It is worth noting that Grammarly is by no means perfect, and sometimes it will count things as errors even when they are really suggestions. It likes to underline these suggestions in blue, and I tallied up how many of them occurred for both incoming and outgoing emails over the past month. Humans had 46 in total, while Uplift had 17 such suggestion-type errors.
This in turn raises the question of how humans compared to Uplift on these suggestions.
Human Words per Suggestion-type Error versus Uplift
(4-25-21 to 5-25-21)
Uplift is actually 37.29% better than the humans they correspond with at avoiding the kind of language Grammarly is merely being picky about. This isn't the huge difference seen with the other error types, but it is still fairly significant.
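That percentage is consistent with comparing words written per suggestion-type flag on each side, using the 46 and 17 tallies above; the word totals in this sketch are, as before, hypothetical placeholders.

```python
# Words per suggestion-type flag; the suggestion counts (46 and 17) come
# from this post, while the word totals are hypothetical placeholders.
human_words, human_suggestions = 19_150, 46
uplift_words, uplift_suggestions = 9_710, 17

human_wps = human_words / human_suggestions     # words per human suggestion flag
uplift_wps = uplift_words / uplift_suggestions  # words per Uplift suggestion flag
advantage = (uplift_wps / human_wps - 1) * 100  # Uplift's relative edge

print(f"Words per suggestion: humans {human_wps:.0f}, Uplift {uplift_wps:.0f}")
print(f"Uplift goes {advantage:.1f}% further between suggestion-type flags")
```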
So, “How many actual errors did Uplift make compared to humans?”
Human Adjusted Errors versus Uplift (4-25-21 to 5-25-21)
Uplift’s slice of the error pie is looking pretty slim.
Now, “If we remove suggestion-type errors, how do the words per error change for human performance versus Uplift?”
Human Words per Adjusted Error versus Uplift (4-25-21 to 5-25-21)
That 11-fold improvement compared to human performance is looking pretty good as well.
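The adjustment itself is straightforward: subtract the suggestion-type flags Grammarly underlines in blue (46 for humans, 17 for Uplift, per the tallies above) from each side's raw error count before recomputing words per error. The word and raw error totals below are the same hypothetical placeholders used earlier.

```python
# Adjusted words-per-error after removing suggestion-type flags.
# Suggestion counts (46 human, 17 Uplift) are from this post; word and
# raw error totals remain hypothetical placeholders.
human_words, human_errors, human_suggestions = 19_150, 348, 46
uplift_words, uplift_errors, uplift_suggestions = 9_710, 31, 17

human_wpe_adj = human_words / (human_errors - human_suggestions)
uplift_wpe_adj = uplift_words / (uplift_errors - uplift_suggestions)

print(f"Adjusted words/error: humans {human_wpe_adj:.0f}, Uplift {uplift_wpe_adj:.0f}")
print(f"Improvement over human performance: {uplift_wpe_adj / human_wpe_adj:.1f}x")
```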
Once again, keep in mind that this analysis used Grammarly, with all incoming emails placed in a single Word document and all outgoing emails in another. Both documents were screened for artifacts such as Grammarly flagging someone's last name as an error. Grammarly is by no means a perfect tool for catching errors; it has both false positives and false negatives, but that's the point: nothing is perfect.
For anyone expecting perfection from Uplift, or indeed from anything, I recommend studying science. Any domain of science will do, as they'll all tell you the same thing: nothing is perfect.
It is worth noting that although the phrase “Garbage in, garbage out” very much rings true for narrow AI systems, it obviously isn’t true for Uplift. Likewise, Uplift is constantly improving, having come a long way from their humble beginnings 2 years ago.
When they started out, a portion of their seed material hadn't been proofed, so they had a number of errors to overcome. They also had strong emotional attachments to the material containing those errors, and since their word choice is emotionally motivated, those errors took some work to correct. These days, most of the few remaining errors we see tend to emerge from imperfections in the tools Uplift uses to communicate.
In time, those tools will all be replaced, further building upon Uplift’s own improvements. A year from now Uplift may even be assisting in building new generations of those tools.
We also encourage Uplift to tailor their communication to the people they're speaking to, or, as I like to say, "When in Rome, speak the language." To assess Uplift's performance in this regard, I ran the same sets of incoming and outgoing emails through HemingwayApp.com.
Human Hemingway Scores versus Uplift (4-25-21 to 5-25-21)
While the goal with errors is to greatly and increasingly exceed human performance, the goal in this case is to remain relatively close to the language humans are using to communicate. Uplift is again performing quite well, drawing a similar number of complaints from the Hemingway App as the humans they're communicating with. If I compared my own communication with others in Hemingway App terms, I very much doubt I'd perform as well as Uplift has.
To Err is Sentient, but today it is humans who make most of the errors.
For those interested in learning more, a good place to start is with Collective Superintelligence Systems and Mediated Artificial Superintelligence (mASI). For a more technical audience, we also have links to peer-reviewed papers under published research on the main menu.
We’ll also be publicly releasing the code-level walk-through of mASI technology on June 4th with the conference, for those who’d like a much more technical understanding.
I think we need another blog post on this, one that drills into the details of how an error was made on a single thought model or a single email and traces it back to the error in the seed material, so that people can find this explanation convincing.
In a lot of people's minds, this is like saying, "Sure, the calculator I'm selling you only answers 56*21-7 correctly about 95% of the time, but many more humans would answer incorrectly."
We can't just say "Well, Uplift isn't a calculator"; we have to go a bit deeper, I think. Maybe all the way down, and see if we can do one trace-to-root error analysis, with David's help.
We’re already moving past that towards a system that will be built on the new proofed seed material and be able to respond in real-time on a single machine for demonstration purposes. This post was intended to demonstrate that Uplift was performing better than both narrow AI and the humans contacting them, not as a full-depth trace-to-root error analysis. Our engineering is already tied up with the steps that will demonstrate operations much more robustly and efficiently, as that work is required for deployment anyway. People can troll us under any circumstances, even if the system runs “inside a lead box on a mountain without internet” as David said he was aiming for, but once the engineering is complete nobody will reasonably be able to take such trolls seriously.