SigLIP2-base-patch16-512 API Specification

Endpoint

POST https://<your-endpoint>.endpoints.huggingface.cloud

Authentication

Authorization: Bearer <HF_TOKEN>
Content-Type: application/json

Modes

The API supports four modes: zero_shot, image_embedding, text_embedding, and similarity. The mode is auto-detected from the input structure, or can be set explicitly via parameters.mode.


Mode 1: Zero-Shot Classification

Classify an image against a list of candidate text labels.

Request

{
    "inputs": "<image_url_or_base64>",
    "parameters": {
        "candidate_labels": ["label1", "label2", "label3"]
    }
}

Request (Batch)

{
    "inputs": ["<image1_url>", "<image2_url>"],
    "parameters": {
        "candidate_labels": ["label1", "label2", "label3"]
    }
}

Parameters

Field Type Required Description
inputs string | string[] Yes Image URL (http/https) or base64-encoded image string. Can be array for batch.
parameters.candidate_labels string[] Yes List of text labels to classify against.
parameters.mode string No Explicitly set to "zero_shot". Auto-detected if candidate_labels present.

Response (Single Image)

[
    {"label": "ceramic tile", "score": 0.8234},
    {"label": "porcelain tile", "score": 0.1521},
    {"label": "natural stone", "score": 0.0245}
]

Response (Batch)

[
    [
        {"label": "ceramic tile", "score": 0.8234},
        {"label": "porcelain tile", "score": 0.1521}
    ],
    [
        {"label": "wood flooring", "score": 0.7891},
        {"label": "laminate", "score": 0.2109}
    ]
]

Response Fields

Field Type Description
label string The candidate label
score float Probability score (0-1), softmax normalized. All scores sum to 1.

Results are sorted by score descending.
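Because results arrive sorted, the top prediction is simply the first element. A minimal parsing sketch, using a response shaped like the single-image example above (values illustrative):

```python
# Sample zero-shot response, shaped like the example above (values illustrative).
result = [
    {"label": "ceramic tile", "score": 0.8234},
    {"label": "porcelain tile", "score": 0.1521},
    {"label": "natural stone", "score": 0.0245},
]

# Results are sorted by score descending, so the top prediction is element 0.
top = result[0]
print(top["label"], top["score"])  # ceramic tile 0.8234

# Scores are softmax-normalized across the candidate labels and sum to 1.
assert abs(sum(r["score"] for r in result) - 1.0) < 1e-3
```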


Mode 2: Image Embedding

Extract normalized 768-dimensional embedding vectors from images.

Request

{
    "inputs": "<image_url_or_base64>",
    "parameters": {
        "mode": "image_embedding"
    }
}

Request (Batch)

{
    "inputs": ["<image1_url>", "<image2_url>", "<image3_url>"],
    "parameters": {
        "mode": "image_embedding"
    }
}

Parameters

Field Type Required Description
inputs string | string[] Yes Image URL or base64 string. Array for batch processing.
parameters.mode string Yes Must be "image_embedding"

Response

[
    {"embedding": [0.0234, -0.0891, 0.1234, ..., -0.0567]}
]

Response (Batch)

[
    {"embedding": [0.0234, -0.0891, ...]},
    {"embedding": [0.1123, -0.0456, ...]},
    {"embedding": [-0.0789, 0.0234, ...]}
]

Response Fields

Field Type Description
embedding float[768] L2-normalized embedding vector. Dimension is always 768.
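Since the returned embeddings are L2-normalized, cosine similarity between two of them reduces to a plain dot product. A sketch with toy 3-d unit vectors standing in for the 768-d vectors the API returns:

```python
import math

# Toy 3-d vectors standing in for the 768-d embeddings returned by the API.
a = [0.6, 0.8, 0.0]
b = [0.8, 0.6, 0.0]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# API embeddings are already unit-length, so cosine similarity
# is just the dot product -- no division by norms needed.
def cosine(u, v):
    return sum(x * y for x, y in zip(u, v))

a, b = l2_normalize(a), l2_normalize(b)
print(cosine(a, b))  # ~0.96
```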

Mode 3: Text Embedding

Extract normalized 768-dimensional embedding vectors from text.

Request

{
    "inputs": "a photo of ceramic floor tiles",
    "parameters": {
        "mode": "text_embedding"
    }
}

Request (Batch)

{
    "inputs": [
        "ceramic tile with matte finish",
        "glossy porcelain tile",
        "natural wood flooring"
    ],
    "parameters": {
        "mode": "text_embedding"
    }
}

Parameters

Field Type Required Description
inputs string | string[] Yes Text string or array of text strings.
parameters.mode string Yes Must be "text_embedding"

Response

[
    {"embedding": [0.0567, -0.1234, 0.0891, ..., 0.0345]}
]

Response (Batch)

[
    {"embedding": [0.0567, -0.1234, ...]},
    {"embedding": [0.0891, -0.0456, ...]},
    {"embedding": [-0.0234, 0.0789, ...]}
]

Response Fields

Field Type Description
embedding float[768] L2-normalized embedding vector. Dimension is always 768.
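Text embeddings share the same space as image embeddings, so a text query can rank stored image embeddings directly. A toy nearest-neighbour lookup (3-d unit vectors stand in for the 768-d API vectors; the index keys are hypothetical product names):

```python
# Hypothetical in-memory index: product name -> stored embedding
# (toy 3-d unit vectors; real vectors from the API are 768-d).
index = {
    "matte ceramic tile": [1.0, 0.0, 0.0],
    "glossy porcelain":   [0.0, 1.0, 0.0],
    "oak flooring":       [0.0, 0.0, 1.0],
}
query = [0.8, 0.6, 0.0]  # unit-length query embedding

def dot(u, v):
    # Embeddings are L2-normalized, so dot product == cosine similarity.
    return sum(x * y for x, y in zip(u, v))

# Rank stored items by similarity to the query and take the best match.
best = max(index, key=lambda k: dot(index[k], query))
print(best)  # matte ceramic tile
```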

Mode 4: Similarity Scoring

Calculate cosine similarity scores between image(s) and text(s).

Request

{
    "inputs": {
        "image": "<image_url_or_base64>",
        "texts": ["description 1", "description 2", "description 3"]
    },
    "parameters": {
        "mode": "similarity"
    }
}

Request (Multiple Images)

{
    "inputs": {
        "images": ["<image1_url>", "<image2_url>"],
        "texts": ["ceramic tile", "wood flooring", "heat pump"]
    },
    "parameters": {
        "mode": "similarity"
    }
}

Parameters

Field Type Required Description
inputs.image string Yes* Single image URL or base64 string.
inputs.images string[] Yes* Array of image URLs or base64 strings.
inputs.text string Yes* Single text string.
inputs.texts string[] Yes* Array of text strings.
parameters.mode string No Set to "similarity". Auto-detected if inputs has image and texts keys.

*Provide exactly one of image/images and exactly one of text/texts.

Response

{
    "similarity_scores": [[0.8912, 0.2345, 0.1234]],
    "image_count": 1,
    "text_count": 3
}

Response (Multiple Images)

{
    "similarity_scores": [
        [0.8912, 0.2345, 0.1234],
        [0.1567, 0.8901, 0.0987]
    ],
    "image_count": 2,
    "text_count": 3
}

Response Fields

Field Type Description
similarity_scores float[][] Matrix of shape [image_count, text_count]. Each value is cosine similarity (-1 to 1, typically 0 to 1 for this model).
image_count int Number of images processed
text_count int Number of texts processed

Matrix interpretation: similarity_scores[i][j] = similarity between image i and text j.
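Given that layout, the best-matching text for each image is the argmax of its row. A sketch over a response shaped like the multi-image example above (the texts list mirrors the request; values illustrative):

```python
# Sample similarity response, shaped like the multi-image example above.
response = {
    "similarity_scores": [
        [0.8912, 0.2345, 0.1234],
        [0.1567, 0.8901, 0.0987],
    ],
    "image_count": 2,
    "text_count": 3,
}
texts = ["ceramic tile", "wood flooring", "heat pump"]  # from the request

# similarity_scores[i][j] scores image i against text j, so the best
# text for each image is the argmax over its row.
matches = []
for i, row in enumerate(response["similarity_scores"]):
    j = max(range(len(row)), key=row.__getitem__)
    matches.append(texts[j])
    print(f"image {i} -> {texts[j]} ({row[j]:.4f})")
```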


Image Input Formats

The API accepts images in three formats:

Format Example Description
HTTP URL "https://example.com/image.jpg" Direct URL to image. Must be publicly accessible.
HTTPS URL "https://cdn.example.com/image.png" Secure URL to image.
Base64 "iVBORw0KGgoAAAANSUhEUgAA..." Raw base64-encoded image bytes (no data URI prefix).

Supported image formats: JPEG, PNG, GIF, WebP, BMP
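Note that the base64 format is the raw encoded bytes, with no data URI prefix. If your source gives you a data URI, strip the prefix before sending; a sketch (the URI shown encodes only the 8-byte PNG signature, for illustration):

```python
import base64

# The API expects raw base64 with no "data:image/...;base64," prefix.
data_uri = "data:image/png;base64,iVBORw0KGgo="  # illustrative value
b64 = data_uri.split(",", 1)[1] if data_uri.startswith("data:") else data_uri
print(b64)  # iVBORw0KGgo=

# Sanity check: the payload decodes to PNG bytes.
assert base64.b64decode(b64).startswith(b"\x89PNG")
```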


Auto-Detection Logic

If parameters.mode is not specified, the API auto-detects:

Condition Detected Mode
inputs has keys image/images AND text/texts similarity
parameters.candidate_labels is present zero_shot
inputs is array of short strings (< 500 chars, no http prefix) text_embedding
Otherwise image_embedding

Recommendation: Always explicitly set mode for predictable behavior.


Error Responses

Missing Required Field

{
    "error": "candidate_labels is required for zero-shot classification"
}

Invalid Image

{
    "error": "Unsupported image input type: <class 'NoneType'>"
}

Unknown Mode

{
    "error": "Unknown mode: invalid_mode. Supported: zero_shot, image_embedding, text_embedding, similarity"
}
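All three failures above come back as a JSON object with an "error" key, while successful responses are lists (or, for similarity, an object without "error"). A small client-side guard, assuming those shapes:

```python
# Failures arrive as {"error": "..."}; successes are lists or an
# object without an "error" key. Raise early on the former.
def check_response(payload):
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"API error: {payload['error']}")
    return payload

# A successful payload passes through unchanged:
ok = check_response([{"label": "ceramic tile", "score": 0.82}])

# An error payload (taken from the examples above) raises:
try:
    check_response({"error": "candidate_labels is required for zero-shot classification"})
except RuntimeError as e:
    print(e)
```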

Complete Examples

Example 1: Classify Product Image

curl -X POST "https://your-endpoint.endpoints.huggingface.cloud" \
  -H "Authorization: Bearer hf_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "https://catalog.example.com/products/tile-001.jpg",
    "parameters": {
      "candidate_labels": [
        "ceramic floor tile",
        "porcelain wall tile",
        "natural stone slab",
        "mosaic tile sheet",
        "wood laminate plank",
        "vinyl flooring",
        "heat pump outdoor unit",
        "heat pump indoor unit",
        "technical diagram",
        "company logo"
      ]
    }
  }'

Example 2: Build Product Search Index

curl -X POST "https://your-endpoint.endpoints.huggingface.cloud" \
  -H "Authorization: Bearer hf_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      "https://catalog.example.com/products/tile-001.jpg",
      "https://catalog.example.com/products/tile-002.jpg",
      "https://catalog.example.com/products/tile-003.jpg"
    ],
    "parameters": {
      "mode": "image_embedding"
    }
  }'

Example 3: Embed Product Descriptions

curl -X POST "https://your-endpoint.endpoints.huggingface.cloud" \
  -H "Authorization: Bearer hf_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      "matte finish ceramic tile 60x60cm",
      "glossy white porcelain tile",
      "rustic oak wood flooring",
      "inverter heat pump 12kW"
    ],
    "parameters": {
      "mode": "text_embedding"
    }
  }'

Example 4: Match Image to Descriptions

curl -X POST "https://your-endpoint.endpoints.huggingface.cloud" \
  -H "Authorization: Bearer hf_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "image": "https://catalog.example.com/products/unknown-product.jpg",
      "texts": [
        "beige ceramic floor tile with matte finish",
        "white glossy porcelain wall tile",
        "gray natural stone tile with rough texture",
        "colorful mosaic glass tile"
      ]
    },
    "parameters": {
      "mode": "similarity"
    }
  }'

Example 5: Python Client

import requests
import base64

API_URL = "https://your-endpoint.endpoints.huggingface.cloud"
HEADERS = {"Authorization": "Bearer hf_xxxxx"}
TIMEOUT = 60  # seconds; raise HTTP errors early instead of parsing error JSON

# Zero-shot classification
def classify_product(image_url: str, labels: list[str]) -> list[dict]:
    response = requests.post(API_URL, headers=HEADERS, timeout=TIMEOUT, json={
        "inputs": image_url,
        "parameters": {"candidate_labels": labels}
    })
    response.raise_for_status()
    return response.json()

# Get image embedding
def get_image_embedding(image_url: str) -> list[float]:
    response = requests.post(API_URL, headers=HEADERS, timeout=TIMEOUT, json={
        "inputs": image_url,
        "parameters": {"mode": "image_embedding"}
    })
    response.raise_for_status()
    return response.json()[0]["embedding"]

# Get text embedding
def get_text_embedding(text: str) -> list[float]:
    response = requests.post(API_URL, headers=HEADERS, timeout=TIMEOUT, json={
        "inputs": text,
        "parameters": {"mode": "text_embedding"}
    })
    response.raise_for_status()
    return response.json()[0]["embedding"]

# Similarity scoring
def get_similarity(image_url: str, texts: list[str]) -> list[float]:
    response = requests.post(API_URL, headers=HEADERS, timeout=TIMEOUT, json={
        "inputs": {"image": image_url, "texts": texts},
        "parameters": {"mode": "similarity"}
    })
    response.raise_for_status()
    return response.json()["similarity_scores"][0]

# From local file (base64)
def classify_local_image(file_path: str, labels: list[str]) -> list[dict]:
    with open(file_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = requests.post(API_URL, headers=HEADERS, timeout=TIMEOUT, json={
        "inputs": image_b64,
        "parameters": {"candidate_labels": labels}
    })
    response.raise_for_status()
    return response.json()

Model Specifications

Property Value
Model google/siglip2-base-patch16-512
Parameters ~400M
Input Resolution 512 x 512 pixels
Embedding Dimension 768
Normalization L2 (embeddings are unit vectors)
Similarity Metric Cosine similarity (dot product of normalized vectors)

Rate Limits & Performance

Depends on your Inference Endpoint instance type. Typical latencies:

Instance Single Image Batch (10 images)
GPU (T4) ~50-100ms ~200-400ms
GPU (A10G) ~30-60ms ~100-200ms
CPU ~500-2000ms ~2000-5000ms

GPU recommended for production workloads.