Complete documentation for the unified data import system supporting XML files and web scraping.
📚 Related Documentation:
- Async Processing & Limits - Concurrency limits and async architecture
- Web Scraping Integration - Web scraping details
- Product Discovery Architecture - AI-powered product extraction
The Data Import System enables ingesting products from multiple sources including XML files, web scraping, and PDF processing through a unified data import hub. It provides dynamic field mapping, AI-assisted configuration, batch processing, and real-time progress tracking.
XML Import uses fully async processing with unified concurrency limits:
| Feature | Limit | Purpose |
|---|---|---|
| Product Batch Size | 10 products | Memory optimization |
| Image Downloads | 5 concurrent | Network optimization |
| Image Upload | 10 concurrent | Supabase Storage limit |
| Qwen Vision | 5 concurrent | AI classification |
| Claude Validation | 2 concurrent | Validation |
| CLIP Batch | 20 images | Embedding generation |
| Download Timeout | 30 seconds | Per-image timeout |
| Max File Size | 10 MB | Image size limit |
See Async Processing & Limits for complete details.
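As a rough sketch, these limits map naturally onto `asyncio` semaphores. The variable names and wiring below are illustrative assumptions, not the service's actual configuration:

```python
import asyncio
import aiohttp

# Illustrative: the unified limits above expressed as asyncio semaphores.
# Names are hypothetical; see the actual service for the real configuration.
IMAGE_DOWNLOAD_SEM = asyncio.Semaphore(5)     # Image Downloads: 5 concurrent
IMAGE_UPLOAD_SEM = asyncio.Semaphore(10)      # Image Upload: 10 concurrent
DOWNLOAD_TIMEOUT_S = 30                       # Download Timeout: 30 s per image
MAX_FILE_SIZE = 10 * 1024 * 1024              # Max File Size: 10 MB

async def fetch_image(session: aiohttp.ClientSession, url: str) -> bytes | None:
    async with IMAGE_DOWNLOAD_SEM:  # at most 5 downloads in flight at once
        timeout = aiohttp.ClientTimeout(total=DOWNLOAD_TIMEOUT_S)
        async with session.get(url, timeout=timeout) as resp:
            data = await resp.read()
            return data if len(data) <= MAX_FILE_SIZE else None
```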
```
┌─────────────────────────────────────────────────────────────┐
│ FRONTEND (DataImportHub)                                     │
│ ├─ XML Import Tab                                            │
│ ├─ Web Scraping Tab                                          │
│ └─ Import History Tab                                        │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ EDGE FUNCTION (xml-import-orchestrator)                      │
│ ├─ Parse XML and detect fields                               │
│ ├─ AI-powered field mapping (Claude Sonnet 4.5)              │
│ ├─ Create data_import_jobs record                            │
│ └─ Call Python API (non-blocking)                            │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ PYTHON API (DataImportService)                               │
│ ├─ Batch processing (10 products at a time)                  │
│ ├─ Image downloads (5 concurrent)                            │
│ ├─ Product creation with metadata                            │
│ ├─ Image linking via document_images                         │
│ ├─ Async text processing queue                               │
│ └─ Real-time progress updates                                │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ ASYNC PROCESSING (Background)                                │
│ ├─ Chunking (UnifiedChunkingService)                         │
│ ├─ Text Embeddings (RealEmbeddingsService)                   │
│ └─ Product enrichment (optional)                             │
└─────────────────────────────────────────────────────────────┘
```
Main hub component (`src/components/Admin/DataImportHub.tsx`) with three tabs:
XMLFieldMappingModal (`src/components/Admin/DataImport/XMLFieldMappingModal.tsx`)

Interactive UI for reviewing AI-suggested field mappings:
ImportHistoryTab (`src/components/Admin/DataImport/ImportHistoryTab.tsx`)

Displays past import jobs with:
ScheduleImportModal (`src/components/Admin/DataImport/ScheduleImportModal.tsx`)

Configure cron schedules for recurring imports:
xml-import-orchestrator (`supabase/functions/xml-import-orchestrator/index.ts`)

Purpose: Parse XML, detect fields, suggest mappings, create import jobs
Endpoints:
POST /xml-import-orchestrator - Upload XML and create import job

Features:
Request parameters: workspace_id, category, xml_content (base64 encoded), optional preview_only flag, optional field_mappings, optional mapping_template_id, and optional parent_job_id.
Response (Preview Mode): success, detected_fields array, total_products count.
Response (Import Mode): success, job_id, total_products count.
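For illustration, a preview-mode call might look like the following. The project URL, key, and workspace UUID are placeholders; Supabase Edge Functions are served under `/functions/v1/`:

```python
import base64
import requests

# Placeholder values; substitute your project URL, anon key, and workspace UUID
SUPABASE_URL = "https://YOUR_PROJECT.supabase.co"
headers = {"Authorization": "Bearer YOUR_ANON_KEY"}

with open("products.xml", "rb") as f:
    xml_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{SUPABASE_URL}/functions/v1/xml-import-orchestrator",
    json={
        "workspace_id": "00000000-0000-0000-0000-000000000000",
        "category": "materials",
        "xml_content": xml_b64,
        "preview_only": True,  # return detected_fields without creating a job
    },
    headers=headers,
)
print(resp.json())  # {"success": ..., "detected_fields": [...], "total_products": ...}
```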
scheduled-import-runner (`supabase/functions/scheduled-import-runner/index.ts`)

Purpose: Run scheduled imports via Supabase Cron
Trigger: Supabase Cron (every 15 minutes)
Features:
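A simplified sketch of one runner pass, assuming a `supabase-py` client and two hypothetical helpers (`compute_next_run`, `start_import`); column names come from the `data_import_jobs` schema described below:

```python
from datetime import datetime, timezone

def run_due_imports(supabase, compute_next_run, start_import) -> None:
    """Find scheduled jobs that are due and re-run their stored XML."""
    now = datetime.now(timezone.utc).isoformat()
    due = (
        supabase.table("data_import_jobs")
        .select("*")
        .eq("is_scheduled", True)
        .lte("next_run_at", now)
        .execute()
    )
    for job in due.data:
        # Re-run the stored XML under a child job linked via parent_job_id
        start_import(
            parent_job_id=job["id"],
            xml_content=job["original_xml_content"],
            field_mappings=job["field_mappings"],
        )
        supabase.table("data_import_jobs").update(
            {"last_run_at": now, "next_run_at": compute_next_run(job["cron_schedule"])}
        ).eq("id", job["id"]).execute()
```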
DataImportService (`mivaa-pdf-extractor/app/services/data_import_service.py`)

Main orchestrator for processing import jobs.
Key Methods:
- `process_import_job()` - Process a complete import job
- `_process_batch()` - Process a batch of 10 products
- `_normalize_product()` - Apply field mappings
- `_download_images()` - Download images concurrently
- `_queue_product_processing()` - Create products in the database
- `_link_images_to_product()` - Link images to products
- `_queue_text_processing()` - Queue async text processing

Features:
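A minimal sketch of how these methods fit together; the `_load_job` and `_update_progress` helpers are assumptions for illustration, not the actual implementation:

```python
# Simplified orchestration sketch (batch size of 10 from the limits table)
async def process_import_job(self, job_id: str) -> None:
    job = await self._load_job(job_id)            # hypothetical helper
    products = job["metadata"]["products"]        # products staged in job metadata
    for i in range(0, len(products), 10):
        await self._process_batch(job, products[i : i + 10])
        done = min(i + 10, len(products))
        await self._update_progress(job_id, processed=done)  # hypothetical helper

async def _process_batch(self, job: dict, batch: list[dict]) -> None:
    for raw in batch:
        normalized = self._normalize_product(raw, job["field_mappings"])
        images = await self._download_images(normalized.get("image_urls", []))
        product_id = await self._queue_product_processing(normalized)
        await self._link_images_to_product(product_id, images)
        await self._queue_text_processing(product_id)
```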
ImageDownloadService (`mivaa-pdf-extractor/app/services/image_download_service.py`)

Handles concurrent image downloads with validation and retry logic.
Key Methods:
- `download_images()` - Download multiple images concurrently
- `_download_single_image()` - Download a single image with retry
- `validate_image_url()` - Validate URL format
- `store_image_in_storage()` - Upload to Supabase Storage

Features:
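For example, the single-image path could be wrapped in a retry loop like this sketch; the retry count and backoff policy are assumptions, while the 30-second timeout comes from the limits table above:

```python
import asyncio
import aiohttp

async def download_single_image(session: aiohttp.ClientSession, url: str,
                                retries: int = 3) -> bytes:
    """Download one image, retrying transient failures with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            timeout = aiohttp.ClientTimeout(total=30)  # 30 s per-image timeout
            async with session.get(url, timeout=timeout) as resp:
                resp.raise_for_status()
                return await resp.read()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == retries:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
```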
Data Import Routes (`mivaa-pdf-extractor/app/api/data_import_routes.py`)

Endpoints:
See API Reference for detailed documentation.
Upload an XML file and create an import job.
Request Body parameters: workspace_id (UUID), category (e.g., "materials"), xml_content (base64-encoded XML), optional preview_only flag (default false), optional field_mappings object mapping XML fields to platform fields, optional mapping_template_id, and optional parent_job_id.
Response: success, job_id, total_products count.
Start processing an import job (called by the Edge Function).
Request Body: job_id and workspace_id.
Response: success, message, job_id.
Get import job status and progress.
Response: job_id, status, import_type, source_name, total_products, processed_products, failed_products, progress_percentage, current_stage, started_at, completed_at, error_message, and estimated_time_remaining.
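A small polling helper built on these fields might look like the following; the route path is an assumption, so adjust it to match the actual API:

```python
import time
import requests

def wait_for_import(base_url: str, job_id: str, poll_s: int = 5) -> dict:
    """Poll the job status endpoint until the import completes or fails."""
    while True:
        # Route path is illustrative; see the API Reference for the real one
        status = requests.get(f"{base_url}/api/data-import/jobs/{job_id}/status").json()
        print(f"{status['current_stage']}: {status['progress_percentage']}%")
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(poll_s)
```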
Get import history for a workspace.
Query Parameters:
- `workspace_id` (required) - Workspace ID
- `page` (optional, default: 1) - Page number
- `page_size` (optional, default: 20) - Items per page
- `status` (optional) - Filter by status
- `import_type` (optional) - Filter by import type

Response: imports array (each with job_id, import_type, source_name, status, total_products, processed_products, failed_products, created_at, completed_at), total_count, page, and page_size.
Health check for data import API.
Response: status, service name, version, and a features object indicating which capabilities are enabled (xml_import, web_scraping, batch_processing, concurrent_image_downloads, checkpoint_recovery, real_time_progress).
The `data_import_jobs` table tracks import jobs with status and progress. Key fields include: id, workspace_id, import_type ('xml' or 'web_scraping'), source_name, source_url, status ('pending', 'processing', 'completed', 'failed'), total_products, processed_products, failed_products, category, original_xml_content (for re-runs), field_mappings (JSONB), mapping_template_id, parent_job_id (for re-runs and scheduled runs), is_scheduled, cron_schedule, last_run_at, next_run_at, started_at, completed_at, error_message, and metadata (stores products for processing).
The `data_import_history` table tracks individual product imports for an audit trail. Key fields include: id, job_id (references data_import_jobs), source_data (JSONB with original product data from XML), normalized_data (JSONB with normalized product data after field mapping), and processing_status ('pending', 'success', or 'failed').
Stores reusable field mapping templates. Key fields include: id, workspace_id, name, description, field_mappings (JSONB mapping XML fields to platform fields), created_by, created_at, and updated_at. A unique constraint applies on (workspace_id, name).
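For illustration, a `field_mappings` value maps XML source fields to platform fields; the specific field names below are hypothetical:

```python
# Hypothetical field_mappings payload (stored as JSONB on the template)
field_mappings = {
    "ProductName": "name",            # XML field -> platform field
    "ProductCode": "sku",
    "LongDescription": "description",
    "MainImageUrl": "image_urls",
    "Category": "category",
}
```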
Location: scripts/testing/test-xml-import-phase2.js
Usage: `node scripts/testing/test-xml-import-phase2.js`
Test Flow:
- `pdf-tiles` bucket
- `products` table
- `document_images` table
- `chunks` table
- `data_import_history` table

The Data Import System implements complete production hardening across all import methods (PDF, XML, Web Scraping):
All imported data is tagged with source information for complete traceability:
| Field | Purpose | Example Values |
|---|---|---|
| source_type | Import method | 'pdf_processing', 'xml_import', 'web_scraping' |
| source_job_id | Originating job | Job UUID from background_jobs or data_import_jobs |
Applied to:
Benefits:
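In practice, tagging amounts to writing two extra columns on every imported row. A sketch, with the column names taken from the table above and everything else hypothetical:

```python
def tag_with_provenance(row: dict, source_type: str, source_job_id: str) -> dict:
    """Attach source tracking columns before inserting an imported record."""
    row["source_type"] = source_type      # 'pdf_processing' | 'xml_import' | 'web_scraping'
    row["source_job_id"] = source_job_id  # originating job UUID
    return row

# Example: a product created by an XML import job
product_row = tag_with_provenance(
    {"name": "Oak Veneer Panel"}, "xml_import", "11111111-2222-3333-4444-555555555555"
)
```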
All import methods update heartbeat timestamps to detect stuck/crashed jobs:
| Method | Heartbeat Field | Update Frequency | Stuck Threshold |
|---|---|---|---|
| PDF Processing | `last_heartbeat` | Every stage | >10 minutes |
| XML Import | `last_heartbeat` | Every batch (10 products) | >30 minutes |
| Web Scraping | `last_heartbeat_at` | Every 30 seconds | >5 minutes |
Features:
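A minimal heartbeat loop, assuming a `supabase-py` client and the `last_heartbeat` column shown in the table above; the 30-second interval matches the web-scraping cadence and is otherwise an assumption:

```python
import asyncio
from datetime import datetime, timezone

async def heartbeat_loop(supabase, job_id: str, interval_s: int = 30) -> None:
    """Refresh last_heartbeat so a watchdog can flag jobs past their stuck threshold."""
    while True:
        supabase.table("data_import_jobs").update(
            {"last_heartbeat": datetime.now(timezone.utc).isoformat()}
        ).eq("id", job_id).execute()
        await asyncio.sleep(interval_s)
```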
Comprehensive error tracking and performance monitoring across all import methods:
| Feature | PDF | XML | Web Scraping |
|---|---|---|---|
| Transaction Tracking | ✅ | ✅ | ✅ |
| Breadcrumbs | ✅ | ✅ | ✅ |
| Exception Capture | ✅ | ✅ | ✅ |
| Performance Monitoring | ✅ | ✅ | ✅ |
| Error Context | ✅ | ✅ | ✅ |
Benefits:
| Feature | PDF | XML | Web Scraping | Status |
|---|---|---|---|---|
| Source Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Heartbeat Monitoring | ✅ | ✅ | ✅ | COMPLETE |
| Sentry Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Error Handling | ✅ | ✅ | ✅ | COMPLETE |
| Progress Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Checkpoint Recovery | ✅ | ✅ | ✅ | COMPLETE |
| Auto-Recovery | ✅ | ✅ | ✅ | COMPLETE |
For detailed implementation, see:
- `data_import_jobs` pipeline