What Alanis Morissette can tell us about AI
I was reflecting on life’s ironies recently, thinking about the clouds that puncture the sunshine, and the toothache felt through the smile.
These are not very good metaphors, but I’m not a lyricist or poet. Alanis Morissette, however, is. Her career spans three decades, and her best-known songs were international hits in the 1990s, including perhaps the best known of them all, Ironic.
Alanis’s music wasn’t my thing in the mid-90s, but Ironic is one of those songs that sticks. The lyrics speak for themselves.
It’s like rain on your wedding day
It’s a free ride when you’ve already paid
It’s the good advice that you just didn’t take
And who would’ve thought? It figures
The verses are full of such ironies, with the third verse being:
A traffic jam when you’re already late
A “No Smoking” sign on your cigarette break
It’s like ten thousand spoons when all you need is a knife
It’s meeting the man of my dreams
And then meeting his beautiful wife
The line in the middle of that verse came up in my reflection.
Comedian Ed Byrne once did a piece about the lyrics to Ironic. In the 90s, like many of us in the UK, Ed had long “curtain” hair. My university friend Anna once took me to a party that he was at. Anna and I had liked each other, but we never got together for some reason. Ed, meanwhile, stood against a wall all night, being very shy.
Maybe being a shy stand-up comedian is, in itself, ironic. Anyway, I digress.
“It’s like ten thousand spoons when all you need is a knife” took me on a trail of inquiry. What does AI know about spoons and knives?
To find out, I downloaded a lot of spoons from Adobe Stock, but only one knife. I started with Llava, the visual recognition chatbot I used in my last experiment, which pairs a vision encoder with a large language model.
Here’s what Llava thinks of a spoon:
>>> What's in this image? AdobeStock_45792148.jpeg
Added image 'AdobeStock_45792148.jpeg'
The image shows a close-up of a metal spoon. It is resting on a table, occupying the entire frame. The spoon appears to be well-maintained and shiny, giving it an appealing appearance.
Nicely described. The image is, of course, of a well-maintained and shiny spoon, although the table is either inferred or has been removed in post-production.
Broadly the same description is given to a set of other spoons that I tested Llava on. Here are 44 of them. I had to download each spoon individually.
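If you fancy running the same spoon census yourself, here’s a minimal sketch of how it could be scripted. I’m assuming the model is served locally through Ollama and queried via its Python client (the >>> prompt above looks like Ollama’s interactive session, but that’s my inference, not something stated here); the “spoons” folder and the file pattern are purely illustrative.

# A minimal sketch, assuming Llava is running locally under Ollama and the
# `ollama` Python client is installed. The "spoons" folder and *.jpeg
# pattern are illustrative, not taken from the original experiment.
from pathlib import Path

import ollama

for image_path in sorted(Path("spoons").glob("*.jpeg")):
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": "What's in this image?",
            "images": [str(image_path)],  # local path to the image file
        }],
    )
    print(image_path.name, "->", response["message"]["content"])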
Great. Now, let’s try object recognition with a knife.
>>> What's in this image? AdobeStock_76663225.jpeg
Added image 'AdobeStock_76663225.jpeg'
The image features a close-up of an open metal spoon, which is resting on a table. The spoon appears to be in good condition and has a shiny surface, making it look appealing. There are no other objects or items visible in the picture, focusing solely on the spoon as the main subject.
So, Llava thinks that the knife is a spoon.
It’s somewhat understandable. This particular knife has an outsized, curved palette (I’m just making up these terms) which might be associated with a spoon. If I asked you, “Which piece of cutlery has a handle and a curved bit that you would use with food?”, you might say “spoon”, but you wouldn’t be 100% sure about it. In fact, I’d wager that you’d think about the question for a couple of seconds. AI has done the same.
Also, of course, Llava’s output is determined by its input: not just what it has been trained on, but what it has interpreted — correctly or otherwise — in that training. It’s essentially a sort of AI bias, because the model is biased towards spoons when objects of a similar shape and style are presented.
If the similarity in shape is the problem, then we need to try other variations of a knife to check how Llava understands the concept. A cutlery knife is not as straight and sharp as a knife used to chop vegetables. Llava is therefore thinking, to quote Paul Hogan…
Google’s ViT model identified the cutlery knife as a knife, but of the wrong variety. The shapes are similar, however.
Predicted class: letter opener, paper knife, paperknife
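For anyone who wants to reproduce that check, here’s a minimal sketch using the Hugging Face transformers library. I’m assuming the widely used google/vit-base-patch16-224 checkpoint, whose ImageNet labels include “letter opener, paper knife, paperknife”; the post doesn’t say which ViT variant was used, and the filename below is illustrative.

# A minimal sketch, assuming the google/vit-base-patch16-224 checkpoint
# (ImageNet-1k labels) and an illustrative local filename.
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cutlery_knife.jpeg")  # hypothetical filename
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits

predicted_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_idx])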
Of course, although machine learning and neural networks are now bound up in the neologism of post-ChatGPT AI, a lot of this stuff has been around for decades. I wrote some neural network programs while at university in the early 90s (I hadn’t met Anna at that point).
Apple’s Xcode has a nice ML image classification tool. You don’t even need to write any code for it; it’s available out of the box in the application. Here’s the result of training the machine learning functionality on the cutlery set...
… and here’s what happens when the model tries to classify a new knife.
It’s a spoon, Xcode says confidently.
Even giving it an “obvious” knife with a sharp tip results in a confident yet incorrect assumption.
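If you export the trained model, you can run the same check outside Xcode. Below is a minimal sketch using Apple’s coremltools Python package (macOS only); the model filename and the test image are hypothetical, and the input name and size are read from the model itself rather than assumed.

# A minimal sketch of querying an exported Create ML image classifier from
# Python with coremltools (macOS only). The filenames are hypothetical.
import coremltools as ct
from PIL import Image

model = ct.models.MLModel("CutleryClassifier.mlmodel")  # hypothetical export
spec = model.get_spec()
image_input = spec.description.input[0]  # the model's image input description

# Resize the test image to whatever size the model declares it expects.
width = image_input.type.imageType.width
height = image_input.type.imageType.height
img = Image.open("new_knife.jpeg").resize((width, height))  # hypothetical file

# Prints the predicted label and the per-class probabilities.
print(model.predict({image_input.name: img}))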
The volume of spoons this model was trained on has clearly geared it towards thinking that everything is a spoon. There is no subtlety here because there is no subtlety in the training: objects are either one thing or the other, and image classifiers only work with what they have been fed.
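To make that point concrete, here’s a toy illustration (nothing to do with the actual Xcode model): train any off-the-shelf classifier on a thousand “spoons” and a single “knife”, described by one made-up feature, and anything borderline gets called a spoon.

# A toy illustration of class imbalance, using scikit-learn and a single
# invented "curviness" feature. It is not the author's actual setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

spoons = rng.normal(loc=0.8, scale=0.1, size=(1000, 1))  # lots of spoons
knives = rng.normal(loc=0.3, scale=0.1, size=(1, 1))     # one lonely knife

X = np.vstack([spoons, knives])
y = ["spoon"] * 1000 + ["knife"]

clf = LogisticRegression().fit(X, y)

# A borderline object, halfway between the two, is almost certain to be
# labelled a spoon because the training set is so heavily skewed.
print(clf.predict([[0.55]]))
print(clf.predict_proba([[0.55]]))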
As such, trying non-cutlery objects produces some interesting results.
The block of Edam sways the confidence a little. I had thought that its sharp edges would cause the variance, but there are no sharp edges on the knife that Xcode was trained upon. Perhaps the model is just less confident that the cheese is a spoon, so it guesses that it’s a knife.
Trying another object (yes, I know that it’s a Bristol VR) results in another absolutely confident verdict of spoon. Presumably the curved edges of the bus led to that decision. I am as fascinated by the decision, and the confidence with which the model makes it, as I am by the output itself.
The market for AI models is changing rapidly, and it might be helpful to have a consistent metric for what a model can identify and what it can’t. In other words, what the spoon-knife ratio is for each model.
The spoon-knife ratio would be based on the use case of “if I train a model with one knife and a growing pile of spoons, at what point does the model mistake the knife for a spoon?” In other words, how many spoon images are required to confuse the model. The higher the number, the “sharper” the model. If, after 1000 spoons, the model still sees the knife as a knife but gets confused at 1001 spoons, the ratio would be 1:1001. That would be better than 1:10 or 1:100.
I’m going to call this ratio Associated Learning Anchors for Neural Image Search, or ALANIS for short. So, as per the above, 1:1001 is a good ALANIS ratio.
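If anyone wants to measure it, the procedure is easy to sketch. The helpers below (train_classifier and predict_label) are hypothetical stand-ins for whatever training and inference pipeline you actually use (Create ML, a fine-tuned ViT, anything); they are not real library calls.

# A rough sketch of measuring the ALANIS ratio. train_classifier() and
# predict_label() are hypothetical stand-ins for your own pipeline.
def alanis_ratio(knife_image, spoon_images, train_classifier, predict_label):
    """Return n where the ratio is 1:n, i.e. the number of spoons at which
    the model first mistakes the knife for something else."""
    for n in range(1, len(spoon_images) + 1):
        # Retrain on one knife and the first n spoons.
        model = train_classifier(knives=[knife_image], spoons=spoon_images[:n])
        if predict_label(model, knife_image) != "knife":
            return n  # confused at n spoons: the ALANIS ratio is 1:n
    return None  # never confused within the available spoons: a "sharp" model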
This was, of course, a fun experiment, but I hope it makes a more serious point about bias and misrecognition, and the need to avoid the former in your training set in order to avoid the latter (or worse). We have amazing technology, but it will always require human intervention to really understand the microscopically nuanced differences in images, and in who we are.
I know. Isn’t it ironic?