Data Import System

Complete documentation for the unified data import system supporting XML files and web scraping.

📚 Related Documentation:


📋 Table of Contents

  1. Overview
  2. Architecture
  3. Features
  4. API Reference
  5. Database Schema
  6. Usage Guide
  7. Testing
  8. Performance

Overview

The Data Import System enables ingesting products from multiple sources including XML files, web scraping, and PDF processing through a unified data import hub. It provides dynamic field mapping, AI-assisted configuration, batch processing, and real-time progress tracking.

Key Features

Async Processing & Limits

XML Import uses fully async processing with unified concurrency limits:

| Feature | Limit | Purpose |
|---|---|---|
| Product Batch Size | 10 products | Memory optimization |
| Image Downloads | 5 concurrent | Network optimization |
| Image Upload | 10 concurrent | Supabase Storage limit |
| Qwen Vision | 5 concurrent | AI classification |
| Claude Validation | 2 concurrent | Validation |
| CLIP Batch | 20 images | Embedding generation |
| Download Timeout | 30 seconds | Per-image timeout |
| Max File Size | 10 MB | Image size limit |
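These limits can be pictured as a small configuration plus an asyncio semaphore. The sketch below is illustrative only, not the actual DataImportService code: the constant names, the `fetch` stub, and the `chunked` helper are assumptions.

```python
import asyncio

# Illustrative limits mirroring the table above; the real service's
# configuration names may differ.
PRODUCT_BATCH_SIZE = 10
IMAGE_DOWNLOAD_CONCURRENCY = 5
DOWNLOAD_TIMEOUT_S = 30
MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024

download_semaphore = asyncio.Semaphore(IMAGE_DOWNLOAD_CONCURRENCY)

async def download_image(url: str) -> bytes:
    """Download one image while respecting the concurrency limit and timeout."""
    async with download_semaphore:
        # Enforce the 30 s per-image timeout from the table.
        return await asyncio.wait_for(fetch(url), timeout=DOWNLOAD_TIMEOUT_S)

async def fetch(url: str) -> bytes:
    # Stub standing in for a real HTTP request (aiohttp/httpx).
    await asyncio.sleep(0)
    return b"image-bytes"

def chunked(items, size=PRODUCT_BATCH_SIZE):
    """Yield fixed-size product batches for memory-bounded processing."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

The semaphore caps in-flight downloads at 5 regardless of how many tasks are scheduled, which is why the batch size and concurrency limits can be tuned independently.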

See Async Processing & Limits for complete details.

Use Cases

  1. Supplier Catalog Imports - Import products from supplier XML catalogs
  2. Recurring Updates - Schedule automatic imports from supplier URLs
  3. Manual Re-runs - Re-import catalogs with one click
  4. Multi-source Integration - Combine XML, web scraping, and PDF sources

Architecture

System Overview

```
┌──────────────────────────────────────────────────┐
│ FRONTEND (DataImportHub)                         │
│  ├─ XML Import Tab                               │
│  ├─ Web Scraping Tab                             │
│  └─ Import History Tab                           │
└──────────────────────────────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ EDGE FUNCTION (xml-import-orchestrator)          │
│  ├─ Parse XML and detect fields                  │
│  ├─ AI-powered field mapping (Claude Sonnet 4.5) │
│  ├─ Create data_import_jobs record               │
│  └─ Call Python API (non-blocking)               │
└──────────────────────────────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ PYTHON API (DataImportService)                   │
│  ├─ Batch processing (10 products at a time)     │
│  ├─ Image downloads (5 concurrent)               │
│  ├─ Product creation with metadata               │
│  ├─ Image linking via document_images            │
│  ├─ Async text processing queue                  │
│  └─ Real-time progress updates                   │
└──────────────────────────────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ ASYNC PROCESSING (Background)                    │
│  ├─ Chunking (UnifiedChunkingService)            │
│  ├─ Text Embeddings (RealEmbeddingsService)      │
│  └─ Product enrichment (optional)                │
└──────────────────────────────────────────────────┘
```

Data Flow

  1. User uploads XML file
  2. Edge Function parses XML and detects fields
  3. AI suggests field mappings (Claude Sonnet 4.5)
  4. User reviews and confirms mappings
  5. Edge Function creates import job
  6. Python API processes job in batches
  7. Images downloaded concurrently
  8. Products created in database
  9. Images linked to products
  10. Text processing queued (async)
  11. Job marked as completed
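The batch-oriented middle of this flow (steps 6 through 11) can be sketched as a single loop. This is a hypothetical simplification, not the real DataImportService, which runs these stages asynchronously; the field names here simply mirror the job-status response described later.

```python
def run_import_job(job: dict) -> dict:
    """Simplified sketch of the batch lifecycle in the data flow above."""
    job["status"] = "processing"
    job["processed_products"] = 0
    total = len(job["products"])
    for start in range(0, total, 10):            # step 6: batches of 10
        batch = job["products"][start:start + 10]
        # Steps 7-9 would happen here: download images concurrently,
        # create products, link images via document_images (all stubbed).
        job["processed_products"] += len(batch)
        job["progress_percentage"] = round(100 * job["processed_products"] / total)
    # Step 10: text processing would be queued asynchronously here.
    job["status"] = "completed"                  # step 11
    return job
```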

Frontend Components

1. DataImportHub (src/components/Admin/DataImportHub.tsx)

Main hub component with three tabs:

2. XMLFieldMappingModal (src/components/Admin/DataImport/XMLFieldMappingModal.tsx)

Interactive UI for reviewing AI-suggested field mappings:

3. ImportHistoryTab (src/components/Admin/DataImport/ImportHistoryTab.tsx)

Displays past import jobs with:

4. ScheduleImportModal (src/components/Admin/DataImport/ScheduleImportModal.tsx)

Configure cron schedules for recurring imports:

Edge Functions

xml-import-orchestrator (supabase/functions/xml-import-orchestrator/index.ts)

Purpose: Parse XML, detect fields, suggest mappings, create import jobs

Endpoints:

Features:

Request parameters: workspace_id, category, xml_content (base64-encoded), optional preview_only flag, optional field_mappings, optional mapping_template_id, and optional parent_job_id.

Response (Preview Mode): success, detected_fields array, total_products count.

Response (Import Mode): success, job_id, total_products count.

scheduled-import-runner (supabase/functions/scheduled-import-runner/index.ts)

Purpose: Run scheduled imports via Supabase Cron

Trigger: Supabase Cron (every 15 minutes)

Features:


Backend Data Processing

Services

1. DataImportService (mivaa-pdf-extractor/app/services/data_import_service.py)

Main orchestrator for processing import jobs.

Key Methods:

Features:

2. ImageDownloadService (mivaa-pdf-extractor/app/services/image_download_service.py)

Handles concurrent image downloads with validation and retry logic.

Key Methods:

Features:
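As a hedged sketch of the retry behavior described above (the function name, signature, and linear backoff are hypothetical, not the service's real API):

```python
import asyncio

async def download_with_retry(fetch, url: str, attempts: int = 3, backoff_s: float = 1.0) -> bytes:
    """Retry a transient-failure-prone download up to `attempts` times.

    `fetch` stands in for the real HTTP call; validation (size, content
    type) would happen on the returned bytes before they are accepted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # exhausted retries: surface the last error
            await asyncio.sleep(backoff_s * attempt)  # linear backoff
```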

API Routes

Data Import Routes (mivaa-pdf-extractor/app/api/data_import_routes.py)

Endpoints:

  1. POST /api/import/process - Start processing an import job
  2. GET /api/import/jobs/{job_id} - Get import job status
  3. GET /api/import/history - Get import history
  4. GET /api/import/health - Health check

See API Reference for detailed documentation.


API Reference

Edge Function API

POST /xml-import-orchestrator

Upload XML file and create import job.

Request Body parameters: workspace_id (UUID), category (e.g., "materials"), xml_content (base64-encoded XML), optional preview_only flag (default false), optional field_mappings object mapping XML fields to platform fields, optional mapping_template_id, and optional parent_job_id.

Response: success, job_id, total_products count.
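As an illustration of the two request modes, a client might assemble the body like this. The workspace UUID, XML snippet, and mapping are placeholders:

```python
import base64

xml = "<catalog><product><name>Oak Panel</name></product></catalog>"
encoded = base64.b64encode(xml.encode("utf-8")).decode("ascii")

# Preview mode: parse and detect fields without creating a job.
preview_request = {
    "workspace_id": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
    "category": "materials",
    "xml_content": encoded,
    "preview_only": True,
}

# Import mode: confirmed mappings create a data_import_jobs record.
import_request = {
    **preview_request,
    "preview_only": False,
    "field_mappings": {"name": "name"},  # hypothetical mapping
}
```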

Python API

POST /api/import/process

Start processing an import job (called by Edge Function).

Request Body: job_id and workspace_id.

Response: success, message, job_id.

GET /api/import/jobs/{job_id}

Get import job status and progress.

Response: job_id, status, import_type, source_name, total_products, processed_products, failed_products, progress_percentage, current_stage, started_at, completed_at, error_message, and estimated_time_remaining.
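A caller can poll this endpoint until the job reaches a terminal state. The sketch below is a generic polling loop, not project code; `get_status` stands in for a real HTTP client call to GET /api/import/jobs/{job_id}.

```python
import time

def poll_job(get_status, job_id: str, interval_s: float = 5, timeout_s: float = 300) -> dict:
    """Poll the job-status endpoint until 'completed' or 'failed'."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_status(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
```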

GET /api/import/history

Get import history for a workspace.

Query Parameters:

Response: imports array (each with job_id, import_type, source_name, status, total_products, processed_products, failed_products, created_at, completed_at), total_count, page, and page_size.

GET /api/import/health

Health check for data import API.

Response: status, service name, version, and a features object indicating which capabilities are enabled (xml_import, web_scraping, batch_processing, concurrent_image_downloads, checkpoint_recovery, real_time_progress).


Database Schema

data_import_jobs

Tracks import jobs with status and progress. Key fields include: id, workspace_id, import_type ('xml' or 'web_scraping'), source_name, source_url, status ('pending', 'processing', 'completed', 'failed'), total_products, processed_products, failed_products, category, original_xml_content (for re-runs), field_mappings (JSONB), mapping_template_id, parent_job_id (for re-runs and scheduled runs), is_scheduled, cron_schedule, last_run_at, next_run_at, started_at, completed_at, error_message, and metadata (stores products for processing).

data_import_history

Tracks individual product imports for audit trail. Key fields include: id, job_id (references data_import_jobs), source_data (JSONB with original product data from XML), normalized_data (JSONB with normalized product data after field mapping), and processing_status ('pending', 'success', or 'failed').

xml_mapping_templates

Stores reusable field mapping templates. Key fields include: id, workspace_id, name, description, field_mappings (JSONB mapping XML fields to platform fields), created_by, created_at, and updated_at. A unique constraint applies on (workspace_id, name).
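To make the template shape concrete: a hypothetical field_mappings value and the normalization it drives might look like the following (the XML field names and the `normalize` helper are assumptions, not the schema's documented contents).

```python
# Hypothetical shape of the field_mappings JSONB:
# XML source field -> platform field.
field_mappings = {
    "ProductName": "name",
    "SKU": "sku",
    "ImageURL": "image_url",
    "Price": "price",
}

def normalize(source: dict, mappings: dict) -> dict:
    """Apply a mapping template to one raw XML product record,
    skipping XML fields absent from this record."""
    return {platform: source[xml] for xml, platform in mappings.items() if xml in source}
```

The result of `normalize` corresponds to the normalized_data column in data_import_history, with the raw record preserved in source_data.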


Usage Guide

1. Upload XML File

  1. Navigate to Admin Dashboard → Data Import Hub
  2. Click "XML Import" tab
  3. Select category (e.g., "materials")
  4. Upload XML file
  5. Review AI-suggested field mappings
  6. Adjust mappings if needed
  7. Optionally save as template
  8. Click "Import"

2. Schedule Recurring Import

  1. Go to Import History tab
  2. Find completed import
  3. Click "Schedule Cron" button
  4. Enter source URL
  5. Select schedule (hourly, daily, weekly, custom)
  6. Click "Schedule"

3. Manual Re-run

  1. Go to Import History tab
  2. Find completed import
  3. Click "Manual Re-run" button
  4. Confirm re-run
  5. New job created with same mappings

Testing

Integration Test Script

Location: scripts/testing/test-xml-import-phase2.js

Usage: Run with node scripts/testing/test-xml-import-phase2.js.

Test Flow:

  1. Upload XML with 3 sample products
  2. Monitor job progress (polls every 5s, max 5 min)
  3. Verify products created in database
  4. Verify images downloaded and linked
  5. Verify import history records
  6. Display comprehensive summary

Performance

Batch Processing

Image Downloads

Database Operations


🛡️ Production Hardening

The Data Import System implements complete production hardening across all import methods (PDF, XML, Web Scraping):

Source Tracking ✅

All imported data is tagged with source information for complete traceability:

| Field | Purpose | Example Values |
|---|---|---|
| source_type | Import method | 'pdf_processing', 'xml_import', 'web_scraping' |
| source_job_id | Originating job | Job UUID from background_jobs or data_import_jobs |

Applied to:

Benefits:


Heartbeat Monitoring ✅

All import methods update heartbeat timestamps to detect stuck/crashed jobs:

| Method | Heartbeat Field | Update Frequency | Stuck Threshold |
|---|---|---|---|
| PDF Processing | last_heartbeat | Every stage | >10 minutes |
| XML Import | last_heartbeat | Every batch (10 products) | >30 minutes |
| Web Scraping | last_heartbeat_at | Every 30 seconds | >5 minutes |

Features:


Sentry Error Tracking ✅

Comprehensive error tracking and performance monitoring across all import methods:

| Feature | PDF | XML | Web Scraping |
|---|---|---|---|
| Transaction Tracking | ✅ | ✅ | ✅ |
| Breadcrumbs | ✅ | ✅ | ✅ |
| Exception Capture | ✅ | ✅ | ✅ |
| Performance Monitoring | ✅ | ✅ | ✅ |
| Error Context | ✅ | ✅ | ✅ |

Benefits:


Production Hardening Status

| Feature | PDF | XML | Web Scraping | Status |
|---|---|---|---|---|
| Source Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Heartbeat Monitoring | ✅ | ✅ | ✅ | COMPLETE |
| Sentry Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Error Handling | ✅ | ✅ | ✅ | COMPLETE |
| Progress Tracking | ✅ | ✅ | ✅ | COMPLETE |
| Checkpoint Recovery | ✅ | ✅ | ✅ | COMPLETE |
| Auto-Recovery | ✅ | ✅ | ✅ | COMPLETE |

For detailed implementation, see:


Future Enhancements

Frontend Improvements

Web Scraping Expansion