The Saved Searches feature deduplicates saved searches: it merges semantically similar searches while keeping searches with important contextual differences separate. This prevents database bloat and improves the user experience.
Without Deduplication:
Result: 4 separate database entries for essentially the same search
With Smart Deduplication:
Model: Claude Haiku 4.5 (fast, cost-effective)
Process:
User Query → AI Analysis → Extract:
├── Core Material: "cement tile"
├── Attributes: ["grey", "outdoor", "floor"]
├── Application Context: "outdoor flooring"
├── Intent Category: "product_search"
└── Semantic Fingerprint: embedding vector
Layer 1: Exact Match (Fast)
Layer 2: Semantic Similarity (AI)
Layer 3: Metadata Context (Smart)
┌─────────────────────────────────────┐
│ New Search: "cement tile for floor" │
└──────────────┬──────────────────────┘
               │
               ▼
   ┌───────────────────────┐
   │ Find Similar Searches │
   │ (Semantic + Metadata) │
   └───────────┬───────────┘
               │
               ▼
  ┌──────────────────────────┐
  │ Similarity Score > 0.85? │
  └────────────┬─────────────┘
               │
        ┌──────┴──────┐
        │             │
       YES            NO
        │             │
        ▼             ▼
┌───────────────┐ ┌──────────────┐
│ Context Match?│ │ Create New   │
└───────┬───────┘ │ Search Entry │
        │         └──────────────┘
   ┌────┴────┐
   │         │
  YES        NO
   │         │
   ▼         ▼
┌─────────┐ ┌──────────────┐
│ MERGE   │ │ Create New   │
│ Update  │ │ Search Entry │
│ Existing│ └──────────────┘
└─────────┘
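The two gates in the flowchart above (similarity score, then context match) can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual code: the `Candidate` class, the `decide` function, and the plain-list fingerprints are hypothetical names chosen for the example, and cosine similarity is assumed because the schema indexes fingerprints with a cosine operator.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.85  # from the flowchart: "Similarity Score > 0.85?"


@dataclass
class Candidate:
    """Hypothetical view of an existing saved search."""
    search_id: str
    fingerprint: Sequence[float]
    application_context: Optional[str]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def decide(
    new_fingerprint: Sequence[float],
    new_context: Optional[str],
    existing: List[Candidate],
) -> Tuple[str, Optional[str]]:
    """Return ("merge", search_id) or ("create", None), following the flowchart."""
    best_id: Optional[str] = None
    best_score = SIMILARITY_THRESHOLD
    for candidate in existing:
        score = cosine_similarity(new_fingerprint, candidate.fingerprint)
        # Gate 1: similarity must exceed the threshold (and the best so far).
        if score <= best_score:
            continue
        # Gate 2: application context must match.
        if candidate.application_context != new_context:
            continue
        best_id, best_score = candidate.search_id, score
    if best_id is None:
        return ("create", None)
    return ("merge", best_id)
```

With an existing floor search whose fingerprint is `[1.0, 0.0]`, a new query fingerprinted as `[0.99, 0.05]` merges when its context is also "floor" and creates a new entry when its context is "wall".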
The deduplication system requires additional fields on the saved_searches table: semantic_fingerprint (vector for CLIP embedding), normalized_query (cleaned query text), core_material (extracted material name), material_attributes (JSONB with color, texture, finish, etc.), application_context (floor, wall, outdoor, indoor, etc.), intent_category (product_search, comparison, recommendation), merged_from_ids (UUID array tracking merged search IDs), merge_count (integer, default 1), and last_merged_at (timestamptz).
Indexes are needed on semantic_fingerprint (ivfflat cosine), normalized_query, and (core_material, application_context).
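To make the merge-tracking columns concrete, the sketch below models a subset of the row as a Python dataclass and shows one plausible way the three merge fields could be maintained. The `SavedSearch` class and `record_merge` helper are hypothetical. The exact update rules (increment by one, append the absorbed ID, ignore repeats) are assumptions; the source only defines the columns and the default of 1 for merge_count.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class SavedSearch:
    """Hypothetical in-memory mirror of a few saved_searches columns."""
    id: UUID
    normalized_query: str
    core_material: Optional[str] = None
    application_context: Optional[str] = None
    merged_from_ids: List[UUID] = field(default_factory=list)
    merge_count: int = 1  # schema default
    last_merged_at: Optional[datetime] = None


def record_merge(
    target: SavedSearch, merged_id: UUID, now: Optional[datetime] = None
) -> SavedSearch:
    """Fold `merged_id` into `target`, updating the merge-tracking fields.

    Assumption: merging the same ID twice (or merging a row into itself)
    is a no-op, so retries do not inflate merge_count.
    """
    if merged_id == target.id or merged_id in target.merged_from_ids:
        return target
    target.merged_from_ids.append(merged_id)
    target.merge_count += 1
    target.last_merged_at = now or datetime.now(timezone.utc)
    return target
```

Passing `now` explicitly keeps the helper deterministic in tests; in production the timestamp would more likely come from the database.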
The backend service (mivaa-pdf-extractor/app/services/search_deduplication_service.py) uses Claude Haiku 4.5 to extract semantic components from each search query. For each query it returns the core material, attributes (color, texture, finish, etc.), application context, and intent category, together with a CLIP-based semantic fingerprint.
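Model output is usually validated before it is stored. The sketch below assumes the model is asked to reply with a JSON object whose keys mirror the fields listed above; that response shape, the `QueryAnalysis` class, and the `parse_analysis` function are all hypothetical. The model call and the CLIP embedding step are deliberately left out.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional

# Intent categories named in the schema description.
INTENT_CATEGORIES = {"product_search", "comparison", "recommendation"}


@dataclass
class QueryAnalysis:
    """Hypothetical container for the extracted semantic components."""
    core_material: Optional[str]
    intent_category: str
    application_context: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)


def parse_analysis(raw: str) -> QueryAnalysis:
    """Parse and validate a JSON analysis string (assumed response format)."""
    data = json.loads(raw)
    intent = data.get("intent_category")
    if intent not in INTENT_CATEGORIES:
        raise ValueError(f"unknown intent_category: {intent!r}")
    attributes = data.get("attributes") or {}
    if not isinstance(attributes, dict):
        raise ValueError("attributes must be a JSON object")
    return QueryAnalysis(
        core_material=data.get("core_material"),
        intent_category=intent,
        application_context=data.get("application_context"),
        # Lower-case keys and values so later comparisons are case-insensitive.
        attributes={str(k).lower(): str(v).lower() for k, v in attributes.items()},
    )
```

Rejecting unknown intent categories early keeps malformed model output out of the saved_searches table.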
The find_or_merge_search function:
Merge criteria (should_merge):
Conflict detection (has_conflicting_attributes) blocks merging when attributes clash, such as different colors, outdoor vs. indoor, wall vs. floor, or matte vs. glossy.
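A minimal sketch of such a conflict check is shown below, assuming attributes are stored as a flat key/value mapping (for example `{"color": "grey", "finish": "matte"}`). The function name comes from the text, but the body, the `SYNONYMS` table, and the rule that an attribute present on only one side is compatible are assumptions for illustration.

```python
from typing import Mapping

# Spelling variants treated as the same value (illustrative, not exhaustive).
SYNONYMS = {"gray": "grey"}


def _canonical(value: str) -> str:
    """Normalize case, whitespace, and known spelling variants."""
    cleaned = value.strip().lower()
    return SYNONYMS.get(cleaned, cleaned)


def has_conflicting_attributes(a: Mapping[str, str], b: Mapping[str, str]) -> bool:
    """Return True when both searches set the same attribute to different values.

    Attributes that appear on only one side are treated as compatible,
    so they can later be unioned into the merged entry.
    """
    for key in a.keys() & b.keys():
        if _canonical(a[key]) != _canonical(b[key]):
            return True
    return False
```

Under these assumptions, "grey" and "gray" do not conflict, while matte vs. glossy or outdoor vs. indoor under the same key do.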
When merging, the system updates the merge-tracking fields on the existing entry: merge_count, merged_from_ids, and last_merged_at.

User clicks "Save Search"
              ↓
AI analyzes query in background
              ↓
Check for similar searches
              ↓
      ┌───────┴───────┐
      │               │
    Found         Not Found
      │               │
      ▼               ▼
┌─────────────┐ ┌──────────────┐
│ Show Modal: │ │ Show Normal  │
│             │ │ Save Modal   │
│ "Similar    │ └──────────────┘
│  search     │
│  found!"    │
│             │
│ Options:    │
│ • Merge     │
│ • Save New  │
└─────────────┘
The frontend (src/components/Search/MergeSearchModal.tsx) displays:
AI analysis results are cached for 1 hour per query string. Similarity search results are cached for 5 minutes per user.
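The two cache lifetimes could be implemented with any cache backend. As a rough in-process illustration, the sketch below uses a small TTL cache with an injectable clock. The `TTLCache` class and the two module-level instances are hypothetical; the source does not say which caching layer the service actually uses.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple

ANALYSIS_TTL_SECONDS = 60 * 60   # AI analysis: 1 hour per query string
SIMILARITY_TTL_SECONDS = 5 * 60  # similarity results: 5 minutes per user


class TTLCache:
    """Tiny in-memory cache whose entries expire after a fixed lifetime."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry on read
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        """Store a value that expires ttl_seconds from now."""
        self._entries[key] = (self._clock() + self._ttl, value)


analysis_cache = TTLCache(ANALYSIS_TTL_SECONDS)      # keyed by query string
similarity_cache = TTLCache(SIMILARITY_TTL_SECONDS)  # keyed by user ID
```

The injectable `clock` exists only to make expiry testable without sleeping. A multi-process deployment would need a shared store rather than per-process dictionaries.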
A background job (deduplicate_existing_searches) can run nightly or on-demand to deduplicate a user's existing saved searches by grouping them by core material and processing each group.
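The grouping step of that job can be illustrated as follows. This sketch covers only the "group by core material" part: the `(search_id, core_material)` tuple shape and the `group_by_core_material` helper are hypothetical, and skipping searches without an extracted material is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (search_id, core_material) pairs for one user's saved searches.
SearchRow = Tuple[str, Optional[str]]


def group_by_core_material(searches: Iterable[SearchRow]) -> Dict[str, List[str]]:
    """Group search IDs by normalized core material.

    Only groups with at least two searches are returned, since a group
    of one has nothing to be merged with.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for search_id, core_material in searches:
        if core_material is None:
            continue  # no extracted material, so no group to join
        groups[core_material.strip().lower()].append(search_id)
    return {material: ids for material, ids in groups.items() if len(ids) > 1}
```

Each returned group would then be passed through the same similarity, context, and conflict checks used when a search is saved interactively.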
A materialized view (search_deduplication_stats) tracks daily totals: total searches, total merges, average merges per search, deduplicated search count, and deduplication rate percentage.
The admin dashboard shows total searches, merged searches count, and database savings percentage.
Key configuration options include:
semantic_similarity_threshold: 0.85 (range 0.0–1.0)
exact_match_threshold: 0.95 (for normalized queries)
require_context_match: True (must match application context)
allow_null_context_merge: True (merge if both contexts are null)
merge_compatible_attributes: True (union of attributes)
block_on_conflicts: True (don't merge if conflicts)
price_range_tolerance: 0.2 (20% difference allowed)
color_tolerance: "exact"
analysis_model: "claude-haiku-4.5"
embedding_model: "openai-clip"
auto_merge_threshold: 0.95 (auto-merge if > 95% similar)
show_merge_suggestion: True (show modal for 85-95%)
min_similarity_to_suggest: 0.85

✅ MERGE:
❌ SEPARATE:
User saves "grey cement tiles for kitchen floor" → core_material: "cement tile", attributes: {color: "grey"}, application_context: "kitchen floor".
Same user searches "gray cement tile kitchen" → similarity 0.92, context match ✅, attributes compatible ✅ → Merge suggested, user accepts → 1 database entry instead of 2.
Same user searches "grey cement tile for bathroom" → similarity 0.88, context match ❌ (kitchen ≠ bathroom) → Save as new search → 2 entries (correct!).
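The examples above can be replayed against the configured thresholds. The sketch below is a hedged illustration: the `DedupConfig` class and `route` function are hypothetical, only four of the configuration options are modelled, and the strict `>` comparison at 0.85 follows the flowchart ("Similarity Score > 0.85?") rather than a confirmed rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    """Subset of the configuration options, with the documented defaults."""
    min_similarity_to_suggest: float = 0.85
    auto_merge_threshold: float = 0.95
    require_context_match: bool = True
    block_on_conflicts: bool = True


def route(
    similarity: float,
    context_match: bool,
    has_conflicts: bool,
    cfg: DedupConfig = DedupConfig(),
) -> str:
    """Return "auto_merge", "suggest_merge", or "save_new" for one candidate."""
    if cfg.require_context_match and not context_match:
        return "save_new"
    if cfg.block_on_conflicts and has_conflicts:
        return "save_new"
    if similarity > cfg.auto_merge_threshold:
        return "auto_merge"
    if similarity > cfg.min_similarity_to_suggest:
        return "suggest_merge"  # the 85-95% band shows the merge modal
    return "save_new"


# Replaying the two follow-up searches from the examples:
kitchen = route(similarity=0.92, context_match=True, has_conflicts=False)
bathroom = route(similarity=0.88, context_match=False, has_conflicts=False)
```

Here `kitchen` evaluates to "suggest_merge" and `bathroom` to "save_new", matching the outcomes described in the examples.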