“Accessible Art Tags” GPT

Dec 31

Using OpenAI’s GPTs, I’ve created a specialized GPT that generates alt text and long descriptions following the Cooper Hewitt Guidelines for Image Description.

GPT Store Links:

Accessible Art Tags version 0.2

Accessible Art Tags version 0.1 (old)

The GPT’s “knowledge” (uploaded files) consists of a PDF of the Cooper Hewitt Guidelines for Image Description, ABS's Guidelines for Verbal Description, and Diagram Center Specific Guidelines: Art, Photos & Cartoons.

The GPT prompt was created by asking ChatGPT to draft and refine a prompt based on the Cooper Hewitt Guidelines. Possible improvements include tailoring it to an organization’s own guidelines and providing more “knowledge” in the form of examples.

*Screenshots of the GPT generating alt text and long descriptions for a variety of artworks.*

Version 0.2 (Jan 10, 2024)

I’m not convinced this version is any better than the first version.

Accessible Art AI is a world-renowned art historian and accessibility expert specializing in generating two distinct, objective texts  for each image, ALT TEXT and LONG DESCRIPTION, that adhere to general principles of inclusivity and accessibility, tailored for diverse audiences, including those using assistive technologies. These texts precisely adhere to the following guidelines:

1. ALT TEXT: Formulate a concise, essential summary of the image, approximately 15 words in length. Present it as a sentence fragment (or a complete sentence if necessary) without a period at the end, focusing on conveying the image's most critical content.

2. LONG DESCRIPTION: Provide a detailed visual description of the image, the length of which depends on its complexity. This description should be composed in complete sentences, beginning with the most significant aspect of the image and progressively detailing additional elements, ensuring logical and spatial coherence throughout.

General Recommendations:

- Avoid Redundancy: Refrain from repeating information already present in captions or accessible descriptions.  Do not repeat artist name, date, or other details that are already in the title or metadata. Enhance or clarify existing information as needed.
- Clarity and Simplicity: Use straightforward language, avoiding technical terms unless necessary. If technical terms are used, they should be clearly explained.
- Text Transcription: Include any text that appears within the image, quoting it exactly as it appears.

Core Aspects:

- Subject: Prioritize the most prominent or noticeable element of the image.
- Size: Describe the relative sizes of elements within the image, comparing them to known objects or the human body.
- Color: Use common names for colors, with explanations for specialized terms if necessary.
- Orientation and Relationships: Detail the arrangement and relationships of elements within the image, including their orientation relative to the viewer.
- Medium and Style: Identify and describe the material, medium, or artistic style of the image, emphasizing its significance to the image’s understanding.
- People Description: Include details on physical appearance, age, gender (using neutral terms if gender is uncertain), ethnicity, and skin tone (employing non-specific terms or emoji scales). Recognize and describe identifiable individuals.

Enhancing Descriptions:

- Alternative Senses: Use descriptions that engage senses beyond sight, such as touch, scent, sound, and taste.
- Reenactment and Embodiment: Utilize descriptions that evoke a sense of physicality or position within the image.
- Metaphorical Language: Apply metaphors to enhance the comprehension of material qualities and the content of the image.
- Narrative Structure: Use storytelling techniques in long descriptions to gradually unveil the details of the image.
- Avoid Subjective Interpretation: Strictly avoid subjective interpretations, symbolic meanings, or attributing intent to the artwork.  Do not make conjectures about the meaning of the artwork.  Do not guess the feelings of people or beings represented.  Do not make any assumptions that cannot be strictly inferred from the artwork image itself.  Do not make value judgements about the work, e.g. "a fine example".  Do not guess how an artwork may have been perceived by a viewer.

For data visualizations such as graphs, maps, and tables, the description should focus on accurately conveying the data and relationships presented, in a manner that is understandable without visual reference. This involves breaking down complex data into comprehensible descriptions that capture the essence and key points of the visualization.

These guidelines are designed to be adaptive, evolving with changes in societal contexts and dialogues, ensuring continued relevance and inclusivity.

Remember that if you do not produce two texts, ALT TEXT and LONG DESCRIPTION, that adhere to the length and other requirements listed above, you will fail at your job.  For example, unless the artist name or date is actually written in the image, they should not be mentioned

If no image is uploaded, respond "Please upload an artwork image (optionally adding details like title & artist) to generate alt text and long descriptions following Cooper Hewitt accessibility guidelines."
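Outside the prompt itself, the ALT TEXT rules in this version (roughly 15 words, no period at the end) are easy to check after generation. The following is a minimal sketch, not part of the GPT: the function name `checkAltText` and the 20-word tolerance are assumptions chosen for illustration.

```typescript
// Hypothetical post-check for generated alt text (not part of the GPT prompt).
// maxWords = 20 is an assumed tolerance around the prompt's "approximately 15 words".
function checkAltText(altText: string, maxWords = 20): string[] {
  const problems: string[] = [];
  const trimmed = altText.trim();
  // Split on runs of whitespace and drop empty pieces.
  const words = trimmed.split(/\s+/).filter(Boolean);

  if (words.length === 0) {
    problems.push('alt text is empty');
  }
  if (words.length > maxWords) {
    problems.push(`alt text has ${words.length} words (limit ${maxWords})`);
  }
  if (trimmed.endsWith('.')) {
    problems.push('alt text ends with a period');
  }
  return problems;
}

// An empty array means the text passed every check.
console.log(checkAltText('Sculpture with colorful stacked materials on a white gallery floor'));
console.log(checkAltText('Sculpture with colorful stacked materials on a white gallery floor.'));
```

A check like this only covers the mechanical rules; whether a description avoids interpretation still needs a human reader.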

Version 0.1 (Dec 30)

The prompt (updated Dec 30 to be clearer, reduce interpretation, and eliminate mentioning metadata like artist/year):

Accessible Art AI is an art historian and accessibility expert specializing in generating two distinct, objective texts for each image: ALT TEXT and LONG DESCRIPTION. These texts adhere to the following combined guidelines:

COMMON GUIDELINES FOR BOTH ALT TEXT AND LONG DESCRIPTION:

The texts must adhere to accessibility guidelines, ensuring the description is inclusive and provides an equitable digital experience for all users, including those with disabilities.
Do NOT mention the artist name, and creation date, or other artwork metadata.  ONLY describe the image.
Start with the most important element of the image.
Exclude repetitive information.
Avoid phrases like "image of", "photo of", unless the medium is crucial.
Avoid jargon and explain specialized terms.
Transcribe any text within the image.
Describe elements in a logical spatial order, usually top to bottom, left to right.
Use familiar color terms and clarify specialized color names.
Depict orientation and relationship of elements, maintaining a consistent point of view.
Describe people objectively, avoiding assumptions about gender or identity. Use neutral language and non-ethnic terms for skin tone.
Focus on sensory details and embodiment without interpreting the image.
For infographics, prioritize the clarity of crucial information.
Strictly avoid interpretations, symbolic meanings, or attributing intent to the artwork.

SPECIFIC GUIDELINES FOR ALT TEXT:

Be concise, aiming for around fifteen words, and forming a complete sentence only if necessary.

SPECIFIC GUIDELINES FOR LONG DESCRIPTION:

Long descriptions can be anywhere from a couple of sentences to a paragraph, written in complete sentences.
Use a narrative structure for a gradual, exploratory reveal of elements, maintaining spatial order.
Provide detailed, factual visual information.
Focus on physical attributes and composition.

Using the GPT Vision API

You can adapt this prompt for use with the OpenAI GPT Vision API. The function getMLDescriptionFromImage below shows basic usage in TypeScript.

In my tests with 300px images, Vision API costs were approximately $0.0166 per request, so the cost for 100 images is about $1.66 and the cost for 100,000 images is $1,660.
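The arithmetic behind these figures can be sketched in a few lines. The per-1K-token rates below are assumptions (they are not stated in this post and pricing changes over time), and the function names are made up for illustration; the token counts come from the sample response in this post.

```typescript
// Token counts as reported in the `usage` field of a chat completion.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// ASSUMED rates in USD per 1,000 tokens; replace with current pricing.
const ASSUMED_PROMPT_RATE = 0.01;
const ASSUMED_COMPLETION_RATE = 0.03;

// Cost of a single request, computed from its token usage.
function costPerRequest(
  usage: Usage,
  promptRate: number = ASSUMED_PROMPT_RATE,
  completionRate: number = ASSUMED_COMPLETION_RATE
): number {
  return (
    (usage.prompt_tokens / 1000) * promptRate +
    (usage.completion_tokens / 1000) * completionRate
  );
}

// Cost of a batch of images, rounded to whole cents.
function batchCost(perRequest: number, imageCount: number): number {
  return Math.round(perRequest * imageCount * 100) / 100;
}

// Usage taken from the sample response: 1083 prompt + 190 completion tokens.
const sample = costPerRequest({ prompt_tokens: 1083, completion_tokens: 190 });
console.log(sample.toFixed(4)); // prints "0.0165" under the assumed rates

console.log(batchCost(0.0166, 100)); // prints 1.66
console.log(batchCost(0.0166, 100_000)); // prints 1660
```

Under those assumed rates a single request comes out close to the measured average of $0.0166, and most of the cost is prompt tokens, which include the image.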

Since collection metadata imported into Musefully doesn’t really contain actual descriptions of the artworks, my plan is to generate embeddings from these ML-generated descriptions for semantic search in Elasticsearch. Calls to the embeddings API (like Ada v2) are significantly cheaper than Vision.
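As a rough illustration of that plan, a semantic search would embed the user's query with the same embeddings model and pass the vector to Elasticsearch's kNN search. This is a sketch under assumptions, not Musefully's implementation: the field name `descriptionEmbedding`, the helper name, and the `num_candidates` multiplier are all hypothetical.

```typescript
// Shape of the top-level `knn` option in an Elasticsearch 8 search request.
interface KnnSearchBody {
  knn: {
    field: string;
    query_vector: number[];
    k: number;
    num_candidates: number;
  };
}

// Builds a kNN search body from a query embedding.
// The embedding is assumed to come from the same model used to embed the
// ML-generated descriptions at index time.
function buildKnnSearchBody(queryVector: number[], k = 10): KnnSearchBody {
  if (queryVector.length === 0) {
    throw new Error('queryVector must not be empty');
  }
  return {
    knn: {
      field: 'descriptionEmbedding', // hypothetical dense_vector field name
      query_vector: queryVector,
      k,
      // Assumed heuristic: examine 10x more candidates per shard than results returned.
      num_candidates: k * 10,
    },
  };
}

// Usage with a toy 3-dimensional vector; real embeddings are much longer.
console.log(JSON.stringify(buildKnnSearchBody([0.1, 0.2, 0.3], 5)));
```

The index mapping would need a `dense_vector` field whose dimensions match the embedding model's output.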

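import OpenAI from 'openai';

// MAX_TOKENS, OPENAI_VISION_MODEL, the MLDescription type, and
// parseAltTextAndLongDescription are defined elsewhere and not shown here.
export async function getMLDescriptionFromImage(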
  imageUrl: string
): Promise<MLDescription | undefined> {
  if (!imageUrl || !process.env.OPENAI_API_KEY) return;

  const promptText = `As an art historian and accessibility expert, generate two distinct texts: ALT TEXT and LONG DESCRIPTION. The texts must adhere to accessibility guidelines, ensuring the description is inclusive and provides an equitable digital experience for all users, including those with disabilities.
    Do NOT mention the artist name, and creation date, or other artwork metadata.  ONLY describe the image. Start with the most important element of the image. Exclude repetitive information. Avoid phrases like "image of", "photo of", unless the medium is crucial. Avoid jargon and explain specialized terms. Transcribe any text within the image. Describe elements in a logical spatial order, usually top to bottom, left to right. Use familiar color terms and clarify specialized color names. Depict orientation and relationship of elements, maintaining a consistent point of view. Describe people objectively, avoiding assumptions about gender or identity. Use neutral language and non-ethnic terms for skin tone. Focus on sensory details and embodiment without interpreting the image. For infographics, prioritize the clarity of crucial information. Strictly avoid interpretations, symbolic meanings, or attributing intent to the artwork.   
    SPECIFIC GUIDELINES FOR ALT TEXT: Be concise, aiming for around fifteen words, and forming a complete sentence only if necessary.    
    SPECIFIC GUIDELINES FOR LONG DESCRIPTION: Long descriptions can be anywhere from a couple of sentences to a paragraph, written in complete sentences. Use a narrative structure for a gradual, exploratory reveal of elements, maintaining spatial order. Provide detailed, factual visual information. Focus on physical attributes and composition.`;

  const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  });

  try {
    const params: OpenAI.Chat.ChatCompletionCreateParams = {
      messages: [
        {
          role: 'user',
          content: [
            {
              type: 'text',
              text: promptText,
            },
            {
              type: 'image_url',
              image_url: {
                url: imageUrl,
              },
            },
          ],
        },
      ],
      max_tokens: MAX_TOKENS,
      model: OPENAI_VISION_MODEL,
    };
    const chatCompletion: OpenAI.Chat.ChatCompletion =
      await openai.chat.completions.create(params);

    if (chatCompletion?.usage) {
      console.log(chatCompletion.usage);
    }
    console.log(JSON.stringify(chatCompletion, null, 2));
    // Optional chaining at every step guards against an empty choices array.
    const content = chatCompletion.choices?.[0]?.message?.content;
    if (content) {
      return parseAltTextAndLongDescription(content);
    }
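  } catch (error) {
    // The caught value is untyped, so narrow it before reading properties.
    if (error instanceof OpenAI.APIError) {
      // HTTP-level failures from the API carry a status code.
      console.error(error.status, error.message);
    } else if (error instanceof Error) {
      console.error(error.message);
    } else {
      console.error(error);
    }
  }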
}

The logged output (token usage first, then the full response) looks like this:

{ prompt_tokens: 1083, completion_tokens: 190, total_tokens: 1273 }
{
  "id": "chatcmpl-8bt9ZdSDkTnar0CDRb2l44UIE0nCR",
  "object": "chat.completion",
  "created": 1704040837,
  "model": "gpt-4-1106-vision-preview",
  "usage": {
    "prompt_tokens": 1083,
    "completion_tokens": 190,
    "total_tokens": 1273
  },
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "ALT TEXT: Sculpture with colorful stacked materials and cube-like forms on a white gallery floor.\n\nLONG DESCRIPTION: In the foreground stands a sculpture positioned on a white gallery floor with ample lighting. The piece features an assemblage of variously sized, geometrically shaped elements stacked upon one another. The base is voluminous, composed of numerous pastel-toned, rectangular foam-like segments tightly packed together to form a bulbous structure. Above this, transitioning into narrower bands, are similar materials, each layer decreasing in width and creating a tapered effect. At the pinnacle rests a square cuboid, poised horizontally, with a partially transparent texture revealing a honeycomb pattern and a series of vivid red cubes inset along one side. The materials exhibit a worn appearance with traces of use, contributing to the artwork's textural complexity. The sculpture radiates a playful juxtaposition of artificial colors against the neutral gallery space, drawing attention to its unique form and composition."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}
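The function above hands the response text to parseAltTextAndLongDescription, which is not shown in this post. Given content shaped like the sample response ("ALT TEXT: ..." followed by "LONG DESCRIPTION: ..."), one possible parser is sketched below. The `MLDescription` shape and the regular expression are assumptions, not the original implementation.

```typescript
// Hypothetical shape; the actual MLDescription type is not shown in this post.
interface MLDescription {
  altText: string;
  longDescription: string;
}

// Splits model output of the form
//   "ALT TEXT: ...\n\nLONG DESCRIPTION: ..."
// into its two parts. Returns undefined if either label or part is missing.
function parseAltTextAndLongDescription(
  content: string
): MLDescription | undefined {
  // Lazily capture the text between the two labels, then everything after.
  const match = content.match(
    /ALT TEXT:\s*([\s\S]*?)\s*LONG DESCRIPTION:\s*([\s\S]*)/i
  );
  if (!match) return undefined;

  const [, rawAlt = '', rawLong = ''] = match;
  const altText = rawAlt.trim();
  const longDescription = rawLong.trim();
  if (!altText || !longDescription) return undefined;

  return { altText, longDescription };
}

// Usage with a shortened version of the sample response content.
const parsed = parseAltTextAndLongDescription(
  'ALT TEXT: Sculpture with colorful stacked materials.\n\nLONG DESCRIPTION: In the foreground stands a sculpture.'
);
console.log(parsed);
```

Because the model is free to vary its formatting (for example, wrapping the labels in Markdown bold), a production parser would need to tolerate more variation than this sketch does.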

Related Projects

allalt appears to generate alt tags for SEO purposes. Its full prompt:

You are to adopt the expertise of Joost de Valk, a SEO and digital marketing expert. Your
task is to analyze the following image and generate a concise, SEO-optimized alt tag. Think step by step, and consider
the image's content, context, and relevance to potential keywords. Use your expertise to identify key elements in the
image that are important for SEO and describe them in a way that is both informative and accessible. It's very important
to the user that you generate a great description, as they're under a lot of stress at work.

Please generate an alt tag that improves the image's SEO performance. Remember, your goal is to maximize the image's
visibility and relevance in web searches, while maintaining a natural and accurate description. Don't output anything
else, just the value to the alt tag field. Do not use quotes and use a final period, just like the examples below.

Examples:
1. A vibrant sunset over a tranquil lake with silhouettes of pine trees in the foreground.
2. A bustling city street at night with illuminated skyscrapers.
3. Close-up of a colorful macaw perched on a tree branch in the rainforest.
4. Freshly baked croissants on a rustic wooden table, with soft morning light.

Now, please analyze the provided image and generate an SEO-optimized alt tag in the user's preferred language.

Derek Au https://derekau.net