Pipemania: what René Magritte tells us about AI

Paul Squires
6 min read · Feb 25, 2024


René Magritte, “La Trahison des Images”, 1929

You have probably seen this image before. La Trahison des Images, or “The Treachery of Images”, was painted by the Belgian surrealist René Magritte at around the same time as he wrote a piece in the publication La révolution surréaliste, entitled Les mots et les images.

In both publication and painting, Magritte focuses on the problems of image representation and the paradox of the medium. The pipe is not a pipe, because it is a painting of one. His commentary in Les mots et les images takes this thinking further. For example, here, Magritte is saying that a word can take the place of an object, and/or vice versa.

When it comes to AI, Magritte’s theories are nowhere near as surreal as he may have initially considered. Even before any AI intervention, the medium of La Trahison des Images has changed in a way that the painter may never have envisaged.

What you see at the top of this article is not even a painting of a pipe. It is a digital representation of a painting of a pipe, with the representation delivered according to a number of international technical standards for both delivery and presentation.

This presents an existential problem for AI image recognition. Which of the three options would it recognise? A pipe, a painting of a pipe, or a digital representation of a painting of a pipe?

Let’s try it out. Blip, the image/text encoder and decoder from Salesforce, is an easy-to-use model that happily deals with image interpretation and captioning. Running the painting through Blip gives two suggestions:
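As a sketch of what “running the painting through Blip” looks like in practice, here is a minimal captioning example using the Hugging Face transformers API. The `Salesforce/blip-image-captioning-base` checkpoint and the blank placeholder image are my assumptions, not necessarily the exact setup used for this article.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a BLIP captioning checkpoint (assumed; any BLIP captioning
# checkpoint from Salesforce would work the same way).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Placeholder image; swap in Image.open("la_trahison.jpg") for the painting.
image = Image.new("RGB", (384, 384), "white")

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
```

The model returns its single best caption here; the two suggestions shown below come from sampling or beam search producing multiple candidates.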

Well done Blip. If we strip the third layer away from La Trahison des Images (the digital representation of the painting), then it correctly identifies a photo or painting of a pipe, not just “a pipe”. The second guess even gets Magritte’s caption correct in terms of its language.

Time for some fun. Let’s take the aforementioned theories from Les mots et les images and create an image, using those theories, for Blip to work out.

a photography of a pipe with the word pipe written in black / a close up of a pipe with a white background and a black text

Nice. Blip has not associated the two objects (eg “a pipe with an associated caption”) but simply said what is on the screen — a classic AI image-to-text response.

Let’s spice up the Magritte-ness.

a photography of a pipe with the word banana written on it / a close up of a pipe with a banana on it

Well, neither of these descriptions is true. The first does not have BANANA written on the pipe, and the second might have Magritte laughing from beyond the grave — a banana does not exist in the image, but the word BANANA does. This second response is the closest that Blip gets to being beaten by a Magritte theory.

a photography of a pipe with a caption that reads the object on this screen is a banana / a close up of a pipe with a caption on it

When this came back, I actually LOLed.

The first response is beautifully cheeky. Blip isn’t going to be fooled that the pipe is a banana, even though the instructions in the image clearly state that this is the case. It overcomes one of Magritte’s Les mots et les images observations that, sometimes, a word can replace an object…

… because Blip has decided that a word cannot replace an object. This is interesting because of the way that AI “understands” words. Image-to-text models like Blip understand images as a digital stream. There is no difference in medium between a picture of a pipe and the word “banana”. Both come in as parts of an image file. A word has no physicality until it is expressed in form. With AI, the word is already expressed in form, and the model has to be trained to treat it as if it had physicality.

A similar observation from Magritte is that of association: “The names written on a painting designate clear things, while images designate undefined things.”

This is more complex for AI, because the model needs to combine elements of both a conversational model and an image-to-text one. This is where Visual Question Answering (VQA) comes in. VQA is a type of model that can provide responses to natural-language questions about an image.

Here, I’m trying the Dandelin ViLT model. For an AI model to understand the relationship between two objects, it needs to understand that two objects exist in a frame. To try it, I grabbed a picture (ironically AI-generated) from Adobe Stock that contains two cups. I wanted something simple where the two objects in the mise en scène are clearly identifiable and identical, or almost identical.
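A minimal sketch of the VQA step, using the `dandelin/vilt-b32-finetuned-vqa` checkpoint via transformers. The blank placeholder image and the question text are my assumptions; substitute the two-cup stock photo.

```python
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Placeholder; swap in Image.open("two_cups.jpg") for the stock photo.
image = Image.new("RGB", (384, 384), "white")
question = "What is in the cups?"

inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the answer vocabulary gives the "strength" percentages
# referred to below; print the five strongest answers.
probs = logits.softmax(-1)[0]
top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(f"{model.config.id2label[idx.item()]}: {p.item():.0%}")
```

ViLT treats VQA as classification over a fixed set of answer strings, which is why its responses come back as single words with a confidence score rather than free text.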

The result below suggests that while VQA is promising, it has a long way to go in terms of visual and art theory. I was rather surprised by the response — that “coffee” is the strongest, even though the image doesn’t contain any coffee beans or powder, and the contents of the cups are not visible.

The second experiment made me LOL again.

In asking ViLT what my image-plus-text-pipe contained, it was very unsure, to a strength of only 8%. It assumed either a phone, cat, or mouse. Hilariously, though, and by complete coincidence, one of its assumptions was a banana.

So, if it can assume a banana, let’s try with the pipe image containing the word “banana”.

A banana is also in the list of suggestions here, but if we go back to Magritte’s aforementioned thoughts on images and their verbal references, then ViLT has perhaps assumed “knife” because of the sharpness of the A and N letters in the word “banana”, as, again, it has read the word as a binary stream rather than as a word.

For this final part of the fun, let’s try some text-to-image generation using Magrittian principles. For this, I’m using Stable Diffusion 1.5.

The input “A pipe” produces:

Well, it’s like a pipe. Let’s try “A painting of a pipe”.

Again, it’s like a pipe, but this is where AI deviates from Magritte’s observations. Magritte believed that there is little relationship between an object and the thing that represents it (the two examples in the drawing being “real” and “representation”).

What’s acutely interesting here is that AI has completely changed that relationship. Stable Diffusion has considered “a pipe” to be a digital photograph of a pipe, but “a painting of a pipe” to have a different form, as if the pipe were painted in a modernist or abstract style, even though the prompt had asked for nothing of the sort. There is no fundamental reason why a pipe and a painting of a pipe should produce different images, other than what the model has been trained on. If, for example, the model had been trained to assume that “painting” really means “photograph” (ie photorealistic), then there would be no difference. But, while one is a pipe, the other is a painting of a pipe.

Magritte would love AI.

(Footnote: the title of this article was influenced by this. My previous piece on AI, calculating the number of marshmallows that can fit into a building, is here.)
