The "Natural Language Categorization & Search" feature in Infinite Image Browsing (IIB) provides advanced capabilities for organizing and searching your image library based on the semantic content of their generation prompts. It leverages OpenAI-compatible large language models (LLMs) and embedding models to understand, categorize, and retrieve images in a more intuitive, natural language-driven way. This feature is currently experimental.
This feature allows you to:

- Automatically group images into semantic clusters based on their generation prompts.
- Generate human-readable titles and keywords for each cluster with an LLM.
- Search your library using natural language queries instead of exact keywords.
- Explore an LLM-driven Tag Graph of the themes in your images.
The core functionality of these AI features involves several steps:
- Prompt Extraction & Normalization: IIB reads the images' `.txt` files to extract the raw generation prompt, then normalizes (cleans) it before embedding.
- Embeddings: each normalized prompt is converted into a vector by the configured embedding model and stored in the `image_embedding` SQLite table.
- Clustering: the embedding vectors are grouped into clusters of semantically similar prompts, controlled by a similarity threshold and a minimum cluster size.
- Title Generation (LLM): a chat model generates a human-readable title and keywords for each cluster, which are cached in the `topic_title_cache` SQLite table.
- Retrieval (RAG-like Search): a natural language query is embedded with the same model and matched against the stored image vectors to return the most semantically similar images.
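The steps above can be sketched end to end. The `text_hash` formula matches the one given in the caching section; the normalization rule and the in-memory vector store are simplified stand-ins for illustration, not IIB's actual implementation:

```python
import hashlib
import math

NORMALIZE_VERSION = "v1"  # fingerprint of the normalization rules (illustrative)

def normalize_prompt(raw: str) -> str:
    # Stand-in for IIB's prompt cleaning: collapse whitespace, lowercase.
    return " ".join(raw.lower().split())

def text_hash(prompt_text: str) -> str:
    # Embedding cache key: sha256 over the normalization fingerprint + prompt.
    return hashlib.sha256(f"{NORMALIZE_VERSION}:{prompt_text}".encode()).hexdigest()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, image_embeddings, top_k=5):
    """RAG-like retrieval: rank cached image vectors by similarity to the query."""
    scored = sorted(((cosine(query_vec, v), image_id)
                     for image_id, v in image_embeddings.items()), reverse=True)
    return [image_id for _, image_id in scored[:top_k]]

# Toy vectors standing in for real embedding-model output.
cache = {1: [1.0, 0.0], 2: [0.6, 0.8], 3: [0.0, 1.0]}
print(search([0.5, 0.9], cache, top_k=2))  # → [2, 3]
```

IIB persists the vectors in SQLite rather than holding them in memory, but the ranking principle is the same.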
To optimize performance and minimize API calls/costs, IIB employs a robust caching strategy:
Embedding Cache (`image_embedding` table):

- Storage: embedding vectors are stored in the `image_embedding` table, keyed by `image_id`.
- Reuse: a cached vector is reused when the same embedding model was used, the `text_hash` (derived from the normalized prompt and normalization version) is identical, and an existing vector is present. This prevents re-embedding unchanged prompts.
- Cache key: `text_hash = sha256(f"{normalize_version}:{prompt_text}")`. The `normalize_version` is a code-derived fingerprint of the normalization rules, ensuring cache invalidation when the rules change.
- Forcing a rebuild: pass `force=true` to `build_iib_output_embeddings`, or `force_embed=true` to `cluster_iib_output_job_start`.

Topic Title Cache (`topic_title_cache` table):
- Storage: generated titles and keywords are stored in the `topic_title_cache` table, keyed by `cluster_hash`.
- Reuse: when `use_title_cache=true` (the default) and `force_title=false`, titles and keywords are reused from the cache.
- Cache key components (`cluster_hash`): includes the member `image_id`s (sorted), embedding model, clustering threshold, `min_cluster_size`, title generation model, output `lang`, and the prompt normalization fingerprint (`normalize_version`) and mode. This ensures titles are only reused if all relevant parameters are unchanged.
- Forcing regeneration: set `force_title=true`.

Topic Cluster Cache (`topic_cluster_cache` table):
- Reuse: cached clustering results are reused when the underlying embeddings are unchanged (verified via the `embeddings_count` and `embeddings_max_updated_at` metadata) and the clustering parameters are identical.

These settings are crucial for enabling and fine-tuning the AI-powered features. They must be set as environment variables (e.g., in a `.env` file in the application's root directory). For more details, refer to the [Configuration Guide].
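The composite `cluster_hash` described above can also be sketched. IIB's exact serialization is an implementation detail; this hypothetical version only demonstrates the documented property that changing any listed component produces a different key and therefore a cache miss:

```python
import hashlib
import json

def cluster_hash(image_ids, embed_model, threshold, min_cluster_size,
                 title_model, lang, normalize_version, normalize_mode):
    # Canonical JSON over every component listed above; member image_ids
    # are sorted so the key is independent of member order.
    payload = json.dumps({
        "image_ids": sorted(image_ids),
        "embed_model": embed_model,
        "threshold": threshold,
        "min_cluster_size": min_cluster_size,
        "title_model": title_model,
        "lang": lang,
        "normalize_version": normalize_version,
        "normalize_mode": normalize_mode,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Same members in a different order → same key; any parameter change → new key.
base = cluster_hash([2, 1], "embed-m", 0.8, 5, "title-m", "en", "v1", "balanced")
assert base == cluster_hash([1, 2], "embed-m", 0.8, 5, "title-m", "en", "v1", "balanced")
assert base != cluster_hash([1, 2], "embed-m", 0.8, 6, "title-m", "en", "v1", "balanced")
```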
API Credentials and Endpoints:
- `OPENAI_API_KEY`: Your API key for accessing OpenAI-compatible services.
- `OPENAI_BASE_URL`: The base URL for your OpenAI-compatible API endpoint (e.g., `https://api.openai.com/v1`).

AI Models:
- `AI_MODEL`: The default chat model (fallback for `TOPIC_TITLE_MODEL`).
- `EMBEDDING_MODEL`: The model for generating vector embeddings.
- `TOPIC_TITLE_MODEL`: The specific chat model for generating cluster titles and keywords.

Prompt Normalization:
- `IIB_PROMPT_NORMALIZE`: Enables or disables cleaning of prompts before embedding (`1` to enable, `0` to disable).
- `IIB_PROMPT_NORMALIZE_MODE`: Sets the aggressiveness of prompt cleaning (`balanced` or `theme_only`).

Tag Graph Generation:
- `IIB_TAG_GRAPH_MAX_TAGS_FOR_LLM`: Maximum number of tags sent to the LLM for abstraction.
- `IIB_TAG_GRAPH_TOPK_TAGS_FOR_LLM`: Top K tags by frequency/weight used as LLM input.
- `IIB_TAG_GRAPH_LLM_TIMEOUT_SEC`: Timeout for LLM requests during graph generation.
- `IIB_TAG_GRAPH_LLM_MAX_ATTEMPTS`: Maximum retry attempts for LLM calls.

The Tag Graph provides a hierarchical visualization of tag relationships derived from the clustering results, offering an LLM-driven abstraction of your image themes.
The graph is structured in multiple layers, similar to a neural network:
- `LayerNode`: Represents an entity in the graph (a cluster, a tag, or an abstract concept). It has an `id`, a `label`, and a `size` (representing its importance or image count).
- `GraphLayer`: Defines a level in the hierarchy, with a `level` number (0, 1, 2+), a `name` (e.g., "Clusters", "Tags", "Abstract-1"), and a list of `LayerNode`s.
- `GraphLink`: Represents a connection between two nodes (`source` and `target`) with an associated `weight`.

POST `/infinite_image_browsing/db/cluster_tag_graph`:
- Request: `folder_paths` (list of strings, required), `lang` (optional string, for LLM output language).
- Response: `layers`, `links`, and `stats`.

POST `/infinite_image_browsing/db/cluster_tag_graph_cluster_paths`:
- Returns the image paths belonging to a cluster, identified by `topic_cluster_cache_key` and `cluster_id`. This allows fetching cluster members on demand without embedding large path lists directly in the main graph response.
- Request: `topic_cluster_cache_key` (string, required), `cluster_id` (string, required).
- Response: a list of `paths` (strings).

These features provide powerful new ways to explore and manage your generated images, moving beyond simple keyword search to semantic understanding.
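To make the graph and endpoint shapes above concrete, here is a typed sketch. The field names follow the descriptions on this page, while the `tag_graph_request` helper is hypothetical, not IIB's client API:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LayerNode:
    id: str
    label: str
    size: int            # importance / number of images

@dataclass
class GraphLayer:
    level: int           # 0, 1, 2+
    name: str            # e.g. "Clusters", "Tags", "Abstract-1"
    nodes: List[LayerNode] = field(default_factory=list)

@dataclass
class GraphLink:
    source: str
    target: str
    weight: float

def tag_graph_request(folder_paths: List[str], lang: Optional[str] = None) -> dict:
    """Request body for POST /infinite_image_browsing/db/cluster_tag_graph,
    per the parameters documented above (the helper itself is illustrative)."""
    body = {"folder_paths": folder_paths}
    if lang is not None:
        body["lang"] = lang
    return body

print(tag_graph_request(["/outputs/txt2img"], lang="en"))
# → {'folder_paths': ['/outputs/txt2img'], 'lang': 'en'}
```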