📚 Related Documentation:
- Async Processing & Limits - Concurrency limits and async architecture
- Product Discovery Architecture - AI-powered product extraction
- Data Import System - Unified import hub
The Material Kai Vision Platform now supports automatic product discovery from web scraping using Firecrawl integration. This feature allows you to scrape product catalogs from manufacturer websites and automatically create products with AI-powered metadata extraction.
Web scraping uses fully async processing with the same concurrency limits as PDF processing. See Async Processing & Limits for complete details.
```
┌─────────────────────┐
│    User Triggers    │
│    Web Scraping     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    Firecrawl API    │
│   Scrapes Website   │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    Edge Function    │
│  (scrape-session-   │
│      manager)       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│     Python API      │
│ (WebScrapingService)│
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    AI Discovery     │
│    (Claude/GPT)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Products Created   │
│    in Database      │
└─────────────────────┘
```
Invoke the scrape-session-manager Supabase Edge Function with a request body containing url, workspace_id, scraping_service ('firecrawl'), and an optional max_pages limit.
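As a sketch, the request body could be assembled with a small helper before invoking the function via supabase-js. The `buildScrapePayload` helper is hypothetical; its defaults mirror the configuration documented below (max_pages: 10, categories: ['products'], model: 'claude').

```typescript
// Hypothetical helper: assembles the request body for the
// scrape-session-manager Edge Function, applying the documented defaults.
interface ScrapeRequest {
  url: string;
  workspace_id: string;
  scraping_service: 'firecrawl';
  max_pages: number;
  categories: string[];
  model: 'claude' | 'gpt' | 'haiku';
}

function buildScrapePayload(
  url: string,
  workspaceId: string,
  overrides: Partial<Pick<ScrapeRequest, 'max_pages' | 'categories' | 'model'>> = {},
): ScrapeRequest {
  return {
    url,
    workspace_id: workspaceId,
    scraping_service: 'firecrawl', // currently the only supported service
    max_pages: overrides.max_pages ?? 10,
    categories: overrides.categories ?? ['products'],
    model: overrides.model ?? 'claude',
  };
}

// With supabase-js the invocation would then look roughly like:
//   const { data, error } = await supabase.functions.invoke(
//     'scrape-session-manager',
//     { body: buildScrapePayload('https://example.com/catalog', workspaceId) },
//   );
```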
Subscribe to the scraping_sessions table via Supabase real-time, filtering on the specific session_id, to receive progress updates including the progress_percentage field.
Query the products table filtering by source_type = 'web_scraping' and source_id = sessionId to fetch all products created from a specific scraping session.
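The same filter can be expressed over plain rows; the two field names come from the documentation, while the row shape here is otherwise illustrative:

```typescript
// Sketch: select only products created by a given scraping session.
interface ProductRow {
  id: string;
  source_type: string;
  source_id: string;
}

function productsFromSession(rows: ProductRow[], sessionId: string): ProductRow[] {
  return rows.filter(
    (r) => r.source_type === 'web_scraping' && r.source_id === sessionId,
  );
}

// Equivalent supabase-js query (sketch):
//   supabase.from('products')
//     .select('*')
//     .eq('source_type', 'web_scraping')
//     .eq('source_id', sessionId);
```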
Firecrawl Edge Function

- Scraped page content is stored in the scraping_pages table
- Progress Updates: written to the scraping_sessions table in real time

Python API Processing

- Product discovery runs through ProductDiscoveryService.discover_products_from_text()
- Progress Updates: written to the scraping_sessions table in real time
| Feature | Web Scraping | PDF Processing | XML Import |
|---|---|---|---|
| AI Discovery | ✅ Yes | ✅ Yes | ❌ No (direct mapping) |
| Image Extraction | ✅ Automatic | ✅ Automatic | ⚠️ Manual URLs |
| Metadata Quality | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Processing Speed | Fast (2-5 min) | Medium (5-15 min) | Very Fast (<1 min) |
| Cost per Product | $0.02-0.05 | $0.05-0.15 | $0.00 |
| Best For | Websites | PDF Catalogs | Structured Data |
The scraping configuration accepts: url (website URL to scrape), workspace_id, scraping_service (currently only 'firecrawl'), optional max_pages (default: 10), optional categories array (default: ['products']), and optional model ('claude', 'gpt', or 'haiku', default: 'claude').
The discovery configuration accepts: categories array (['products', 'certificates', 'logos']), model string ('claude', 'gpt', or 'haiku'), and workspace_id.
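A minimal validation sketch for the discovery configuration; the `isValidDiscoveryConfig` helper is hypothetical, but the allowed categories and models are taken directly from the description above.

```typescript
// Allowed values per the documented discovery configuration.
const ALLOWED_CATEGORIES = ['products', 'certificates', 'logos'] as const;
const ALLOWED_MODELS = ['claude', 'gpt', 'haiku'] as const;

function isValidDiscoveryConfig(cfg: {
  categories: string[];
  model: string;
  workspace_id: string;
}): boolean {
  return (
    cfg.workspace_id.length > 0 &&
    (ALLOWED_MODELS as readonly string[]).includes(cfg.model) &&
    cfg.categories.length > 0 &&
    cfg.categories.every((c) =>
      (ALLOWED_CATEGORIES as readonly string[]).includes(c),
    )
  );
}
```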
Send a GET request to https://v1api.materialshub.gr/api/scraping/session/{session_id}/status with your authorization token.
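A sketch of assembling that status request; only the endpoint shape is from the documentation, and the Bearer token format is an assumption.

```typescript
// Hypothetical helper: builds the status request for a scraping session.
function buildStatusRequest(sessionId: string, token: string) {
  return {
    url: `https://v1api.materialshub.gr/api/scraping/session/${sessionId}/status`,
    method: 'GET' as const,
    // Assumed header format; adjust to your API's auth scheme.
    headers: { Authorization: `Bearer ${token}` },
  };
}

// Usage with fetch (not executed here):
//   const res = await fetch(req.url, { method: req.method, headers: req.headers });
```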
Query the scraping_sessions table by session ID to check session status. Query scraping_pages filtering by session_id to inspect individual page statuses and markdown lengths. Query products filtering by source_type = 'web_scraping' and source_id to see created products with their source URLs.
Cause: Invalid session ID or session deleted
Solution: Verify session ID exists in scraping_sessions table
Cause: Website content doesn't contain product information
Solution: Inspect scraping_pages.markdown_content to confirm what was actually scraped

Cause: Python API unreachable or authentication failed
Solution: Verify the Python API is reachable and the API key (mk_*) is valid

Cause: Too much content or AI API slow
Solution: Reduce max_pages to scrape fewer pages

| Pages | Products | Time | Cost |
|---|---|---|---|
| 1-5 | 1-10 | 1-2 min | $0.10-0.50 |
| 5-10 | 10-25 | 2-5 min | $0.50-1.25 |
| 10-20 | 25-50 | 5-10 min | $1.25-2.50 |
| 20-50 | 50-100 | 10-20 min | $2.50-5.00 |
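The table above can be turned into a simple lookup; this `estimatedCost` helper is illustrative and just returns the matching cost band from the table.

```typescript
// Cost bands copied from the performance table above.
interface CostBand { maxPages: number; low: number; high: number }

const COST_BANDS: CostBand[] = [
  { maxPages: 5, low: 0.10, high: 0.50 },
  { maxPages: 10, low: 0.50, high: 1.25 },
  { maxPages: 20, low: 1.25, high: 2.50 },
  { maxPages: 50, low: 2.50, high: 5.00 },
];

// Returns the expected cost range for a page count, or undefined
// when the count falls outside the documented bands.
function estimatedCost(pages: number): CostBand | undefined {
  return COST_BANDS.find((b) => pages <= b.maxPages);
}
```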
Add debug console.log statements in the Edge Function to output session data and markdown lengths during processing.
Query scraping_sessions by session ID to check status, progress percentage, and error message. Query sessions with non-null scraping_config->>'webhook_retry_count' to find sessions that have experienced webhook retries.
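The retry query's predicate, expressed over plain session rows as a sketch (the row shape is illustrative; the `webhook_retry_count` key inside scraping_config is from the documentation):

```typescript
// Sketch: a session has experienced webhook retries when
// scraping_config.webhook_retry_count is present (non-null).
interface SessionRow {
  id: string;
  scraping_config: Record<string, unknown>;
}

function sessionsWithWebhookRetries(rows: SessionRow[]): SessionRow[] {
  return rows.filter((r) => r.scraping_config['webhook_retry_count'] != null);
}
```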
Send a POST request to https://v1api.materialshub.gr/api/scraping/session/{session_id}/retry with your authorization token.
Web Scraping implements complete production hardening for reliability and monitoring:
Every product, chunk, and image is tagged with source information: each record receives source_type: 'web_scraping' and source_job_id: session_id fields.
Benefits:
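The tagging described above can be sketched as a small helper that stamps both source fields onto a record before insert (the helper name is hypothetical; the field names are from the documentation):

```typescript
// Sketch: stamp a record with the web-scraping source fields.
function tagWithSource<T extends object>(record: T, sessionId: string) {
  return {
    ...record,
    source_type: 'web_scraping' as const,
    source_job_id: sessionId,
  };
}
```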
The Edge Function updates the last_heartbeat_at field in the scraping_sessions table every 30 seconds so stuck jobs can be detected. Each update writes the current timestamp and status, plus session metadata (pages scraped, products found).
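The stuck-job check this enables can be sketched as a simple timestamp comparison, using the 5-minute threshold noted in the status table below (the helper itself is illustrative):

```typescript
// Sketch: a session counts as stuck when its heartbeat is older than
// the 5-minute threshold. Timestamps are ISO strings, as stored in
// scraping_sessions.last_heartbeat_at.
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

function isStuck(lastHeartbeatAt: string, now: Date = new Date()): boolean {
  return now.getTime() - new Date(lastHeartbeatAt).getTime() > STUCK_THRESHOLD_MS;
}
```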
Implementation:

- scrape-session-manager Edge Function

Sentry Tracking

Comprehensive error tracking and performance monitoring. The implementation uses Sentry transaction tracking with op: 'web_scraping' and tags for session and workspace IDs, breadcrumbs for each page scraped, exception capture on error, and transaction status set to 'ok' on success or 'internal_error' on failure.
Features:
| Feature | Status | Details |
|---|---|---|
| Source Tracking | ✅ COMPLETE | All tables have source_type='web_scraping' and source_job_id |
| Heartbeat Monitoring | ✅ COMPLETE | Updates every 30s, 5-minute stuck threshold |
| Sentry Tracking | ✅ COMPLETE | Transactions, breadcrumbs, exception capture |
| Error Handling | ✅ COMPLETE | Comprehensive try-catch with Sentry integration |
| Progress Tracking | ✅ COMPLETE | Real-time progress updates via scraping_sessions table |
| Checkpoint Recovery | ✅ COMPLETE | Resume from last scraped page |
| Auto-Recovery | ✅ COMPLETE | Automatic retry of stuck/failed sessions |
For issues or questions: