Web Scraping Integration Guide

🌐 Overview

The Material Kai Vision Platform now supports automatic product discovery from web scraping using Firecrawl integration. This feature allows you to scrape product catalogs from manufacturer websites and automatically create products with AI-powered metadata extraction.

Async Processing

Web scraping uses fully asynchronous processing with the same concurrency limits as PDF processing.

See Async Processing & Limits for complete details.

🎯 Key Features

📊 How It Works

```
┌─────────────────────┐
│   User Triggers     │
│   Web Scraping      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Firecrawl API     │
│  Scrapes Website    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Edge Function     │
│  (scrape-session-   │
│      manager)       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│     Python API      │
│ (WebScrapingService)│
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    AI Discovery     │
│    (Claude/GPT)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Products Created   │
│    in Database      │
└─────────────────────┘
```

🚀 Getting Started

1. Trigger Web Scraping

Invoke the scrape-session-manager Supabase Edge Function with a request body containing url, workspace_id, scraping_service ('firecrawl'), and an optional max_pages limit.
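A minimal TypeScript sketch of this call, assuming the standard Supabase Functions HTTP endpoint shape (`/functions/v1/<function-name>`); the helper names and the project URL/key parameters are illustrative, not part of the platform API:

```typescript
// Shape of the scrape-session-manager request body described above.
interface ScrapeRequest {
  url: string;
  workspace_id: string;
  scraping_service: 'firecrawl';
  max_pages?: number;
}

// Illustrative helper: builds the request body with the documented default page limit.
function buildScrapeRequest(url: string, workspaceId: string, maxPages = 10): ScrapeRequest {
  return { url, workspace_id: workspaceId, scraping_service: 'firecrawl', max_pages: maxPages };
}

// Invoke the Edge Function over HTTPS (endpoint shape assumed from Supabase conventions).
async function triggerScraping(projectUrl: string, apiKey: string, body: ScrapeRequest) {
  const res = await fetch(`${projectUrl}/functions/v1/scrape-session-manager`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`scrape-session-manager failed: ${res.status}`);
  return res.json(); // expected to include the new session_id
}
```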

2. Monitor Progress

Subscribe to the scraping_sessions table via Supabase real-time, filtering on the specific session_id, to receive progress updates including the progress_percentage field.
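A sketch of the subscription, assuming an already-initialized supabase-js v2 client is passed in (typing loosened for brevity); `sessionFilter` and `watchSession` are illustrative names:

```typescript
// PostgREST-style filter string used by the real-time subscription.
function sessionFilter(sessionId: string): string {
  return `session_id=eq.${sessionId}`;
}

// Subscribe to UPDATE events on scraping_sessions for one session
// and surface the progress_percentage field to a callback.
function watchSession(supabase: any, sessionId: string, onProgress: (pct: number) => void) {
  return supabase
    .channel(`scraping_session_${sessionId}`)
    .on(
      'postgres_changes',
      { event: 'UPDATE', schema: 'public', table: 'scraping_sessions', filter: sessionFilter(sessionId) },
      (payload: any) => onProgress(payload.new.progress_percentage),
    )
    .subscribe();
}
```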

3. View Results

Query the products table filtering by source_type = 'web_scraping' and source_id = sessionId to fetch all products created from a specific scraping session.
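The same query, sketched against Supabase's PostgREST interface (`/rest/v1/<table>`); the helper names are illustrative:

```typescript
// Build the PostgREST query URL for products created by one scraping session.
function productsFromSessionUrl(projectUrl: string, sessionId: string): string {
  const params = new URLSearchParams({
    source_type: 'eq.web_scraping',
    source_id: `eq.${sessionId}`,
    select: '*',
  });
  return `${projectUrl}/rest/v1/products?${params}`;
}

// Fetch the products (apikey + Authorization headers per Supabase REST conventions).
async function fetchSessionProducts(projectUrl: string, apiKey: string, sessionId: string) {
  const res = await fetch(productsFromSessionUrl(projectUrl, sessionId), {
    headers: { apikey: apiKey, Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`products query failed: ${res.status}`);
  return res.json();
}
```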

🔄 Processing Pipeline

Stage 1: Web Scraping (0-50%)

Firecrawl Edge Function

Progress Updates:

Stage 2: Product Discovery (50-100%)

Python API Processing

Progress Updates:

🤖 AI Models

Claude Sonnet 4.5 (Default)

GPT-5

Claude Haiku 4.5

📋 Comparison with Other Methods

| Feature | Web Scraping | PDF Processing | XML Import |
|---|---|---|---|
| AI Discovery | ✅ Yes | ✅ Yes | ❌ No (direct mapping) |
| Image Extraction | ✅ Automatic | ✅ Automatic | ⚠️ Manual URLs |
| Metadata Quality | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Processing Speed | Fast (2-5 min) | Medium (5-15 min) | Very Fast (<1 min) |
| Cost per Product | $0.02-0.05 | $0.05-0.15 | $0.00 |
| Best For | Websites | PDF Catalogs | Structured Data |

🔧 Configuration

Scraping Options

The scraping configuration accepts: url (website URL to scrape), workspace_id, scraping_service (currently only 'firecrawl'), optional max_pages (default: 10), optional categories array (default: ['products']), and optional model ('claude', 'gpt', or 'haiku', default: 'claude').

Discovery Options

The discovery configuration accepts: categories array (['products', 'certificates', 'logos']), model string ('claude', 'gpt', or 'haiku'), and workspace_id.
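The two option shapes above can be sketched as TypeScript interfaces; the field names and defaults follow the documented configuration, while the interface names and the `withScrapingDefaults` helper are illustrative:

```typescript
type Model = 'claude' | 'gpt' | 'haiku';

// Scraping options, as documented above.
interface ScrapingOptions {
  url: string;                  // website URL to scrape
  workspace_id: string;
  scraping_service: 'firecrawl'; // currently the only supported service
  max_pages?: number;           // default: 10
  categories?: string[];        // default: ['products']
  model?: Model;                // default: 'claude'
}

// Discovery options, as documented above.
interface DiscoveryOptions {
  categories: string[];         // e.g. ['products', 'certificates', 'logos']
  model: Model;
  workspace_id: string;
}

// Illustrative helper applying the documented defaults.
function withScrapingDefaults(opts: ScrapingOptions) {
  const { max_pages = 10, categories = ['products'], model = 'claude' as Model, ...rest } = opts;
  return { ...rest, max_pages, categories, model };
}
```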

📊 Monitoring & Debugging

Check Session Status

Send a GET request to https://v1api.materialshub.gr/api/scraping/session/{session_id}/status with your authorization token.
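A sketch of the status check against the documented endpoint; the helper names are illustrative and the bearer-token header is assumed from the "authorization token" wording above:

```typescript
const API_BASE = 'https://v1api.materialshub.gr';

// Build the documented status URL for a session.
function statusUrl(sessionId: string): string {
  return `${API_BASE}/api/scraping/session/${sessionId}/status`;
}

// GET the session status with a bearer token.
async function getSessionStatus(sessionId: string, token: string) {
  const res = await fetch(statusUrl(sessionId), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`status check failed: ${res.status}`);
  return res.json();
}
```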

View Scraping Logs

Query the scraping_sessions table by session ID to check session status. Query scraping_pages filtering by session_id to inspect individual page statuses and markdown lengths. Query products filtering by source_type = 'web_scraping' and source_id to see created products with their source URLs.

Common Issues

Issue: "Session not found"

Cause: Invalid session ID or session deleted
Solution: Verify the session ID exists in the scraping_sessions table

Issue: "No products discovered"

Cause: Website content doesn't contain product information
Solution:

Issue: "Webhook failed after 3 retries"

Cause: Python API unreachable or authentication failed
Solution:

Issue: "AI analysis timeout"

Cause: Too much content or slow AI API response
Solution:

🎯 Best Practices

1. Start Small

2. Choose Right Model

3. Monitor Costs

4. Handle Failures

5. Optimize Performance

🔐 Security

Authentication

Data Privacy

Rate Limiting

📈 Performance Metrics

Typical Processing Times

| Pages | Products | Time | Cost |
|---|---|---|---|
| 1-5 | 1-10 | 1-2 min | $0.10-0.50 |
| 5-10 | 10-25 | 2-5 min | $0.50-1.25 |
| 10-20 | 25-50 | 5-10 min | $1.25-2.50 |
| 20-50 | 50-100 | 10-20 min | $2.50-5.00 |

Success Rates

🚨 Troubleshooting

Enable Debug Logging

Add debug console.log statements in the Edge Function to output session data and markdown lengths during processing.

Check Database State

Query scraping_sessions by session ID to check status, progress percentage, and error message. Query sessions with non-null scraping_config->>'webhook_retry_count' to find sessions that have experienced webhook retries.
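The retry-count query can be expressed as a PostgREST filter using the `->>` JSON operator; the helper name is illustrative and the endpoint shape is assumed from Supabase REST conventions:

```typescript
// Sessions whose scraping_config JSON records a non-null webhook_retry_count,
// i.e. sessions that have experienced webhook retries.
function retriedSessionsUrl(projectUrl: string): string {
  return (
    `${projectUrl}/rest/v1/scraping_sessions` +
    `?scraping_config->>webhook_retry_count=not.is.null` +
    `&select=id,status,progress_percentage,error_message`
  );
}
```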

Manual Retry

Send a POST request to https://v1api.materialshub.gr/api/scraping/session/{session_id}/retry with your authorization token.
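The retry call, sketched in TypeScript against the documented endpoint (helper names illustrative, bearer-token header assumed):

```typescript
// Build the documented manual-retry URL for a session.
function retryUrl(sessionId: string): string {
  return `https://v1api.materialshub.gr/api/scraping/session/${sessionId}/retry`;
}

// POST to retry a failed or stuck session.
async function retrySession(sessionId: string, token: string) {
  const res = await fetch(retryUrl(sessionId), {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`retry failed: ${res.status}`);
  return res.json();
}
```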


🛡️ Production Hardening

Web Scraping implements complete production hardening for reliability and monitoring:

Source Tracking ✅

Every product, chunk, and image is tagged with its origin: each record receives source_type: 'web_scraping' and source_job_id: session_id fields.
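A minimal sketch of this tagging: a generic helper (illustrative name, not part of the platform API) that attaches the documented source fields to any record before insertion:

```typescript
// The documented source fields shared by products, chunks, and images.
interface SourceTag {
  source_type: 'web_scraping';
  source_job_id: string; // the scraping session ID
}

// Attach the source fields to any record before it is written to the database.
function tagWithSource<T extends object>(record: T, sessionId: string) {
  return { ...record, source_type: 'web_scraping' as const, source_job_id: sessionId };
}
```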

Benefits:


Heartbeat Monitoring ✅

Updates the last_heartbeat_at field every 30 seconds to detect stuck jobs. Each update writes the current timestamp, the current status, and session metadata (pages scraped, products found) to the scraping_sessions table.

Implementation:
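A sketch of this loop, assuming a PostgREST-style PATCH on scraping_sessions; the helper names, the exact column names for the metadata fields, and the endpoint shape are assumptions, while the 30-second interval and heartbeat field follow the description above:

```typescript
// Heartbeat payload: current timestamp, status, and session metadata.
function heartbeatPayload(status: string, pagesScraped: number, productsFound: number) {
  return {
    last_heartbeat_at: new Date().toISOString(),
    status,
    pages_scraped: pagesScraped,
    products_found: productsFound,
  };
}

// Write a heartbeat every 30 seconds (as documented) so a watchdog can
// flag sessions whose last_heartbeat_at is older than the stuck threshold.
function startHeartbeat(
  projectUrl: string,
  apiKey: string,
  sessionId: string,
  getState: () => { status: string; pages: number; products: number },
) {
  return setInterval(async () => {
    const s = getState();
    await fetch(`${projectUrl}/rest/v1/scraping_sessions?id=eq.${sessionId}`, {
      method: 'PATCH',
      headers: { apikey: apiKey, Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
      body: JSON.stringify(heartbeatPayload(s.status, s.pages, s.products)),
    });
  }, 30_000);
}
```

Remember to clearInterval the returned handle when the session finishes or fails.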


Sentry Error Tracking ✅

Comprehensive error tracking and performance monitoring. The implementation uses:

- A Sentry transaction with op: 'web_scraping', tagged with the session and workspace IDs
- A breadcrumb for each page scraped
- Exception capture on error
- Transaction status set to 'ok' on success or 'internal_error' on failure
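A sketch of that instrumentation, with the Sentry client passed in and typed loosely; the transaction-style API (`startTransaction`, `setTag`, `setStatus`, `finish`) follows older @sentry/node releases, so adapt to your SDK version, and `scrapeWithSentry`/`scrapeAll` are illustrative names:

```typescript
// Wrap a scrape run in a Sentry transaction: tags for session/workspace,
// a breadcrumb per scraped page, exception capture, and a final status.
async function scrapeWithSentry(
  Sentry: any,
  sessionId: string,
  workspaceId: string,
  scrapeAll: () => Promise<string[]>, // returns the list of scraped page URLs
) {
  const tx = Sentry.startTransaction({ op: 'web_scraping', name: 'scrape_session' });
  tx.setTag('session_id', sessionId);
  tx.setTag('workspace_id', workspaceId);
  try {
    const pages = await scrapeAll();
    for (const url of pages) {
      Sentry.addBreadcrumb({ category: 'scraping', message: `Scraped ${url}` });
    }
    tx.setStatus('ok');
    return pages;
  } catch (err) {
    Sentry.captureException(err);
    tx.setStatus('internal_error');
    throw err;
  } finally {
    tx.finish();
  }
}
```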

Features:


Production Hardening Status

| Feature | Status | Details |
|---|---|---|
| Source Tracking | ✅ COMPLETE | All tables have source_type='web_scraping' and source_job_id |
| Heartbeat Monitoring | ✅ COMPLETE | Updates every 30s, 5-minute stuck threshold |
| Sentry Tracking | ✅ COMPLETE | Transactions, breadcrumbs, exception capture |
| Error Handling | ✅ COMPLETE | Comprehensive try-catch with Sentry integration |
| Progress Tracking | ✅ COMPLETE | Real-time progress updates via scraping_sessions table |
| Checkpoint Recovery | ✅ COMPLETE | Resume from last scraped page |
| Auto-Recovery | ✅ COMPLETE | Automatic retry of stuck/failed sessions |

📚 Related Documentation

🆘 Support

For issues or questions:

  1. Check this guide first
  2. Review Edge Function logs
  3. Check Python API logs
  4. Contact support with session ID