Material Scraping Workflow Guide
Quick Start
Step 1: Enter Your URL
Start by entering the website URL or search query you want to scrape.
Examples:
- Single product page: https://example.com/products/ceramic-tile-white
- Sitemap: https://example.com/sitemap.xml
- Website to crawl: https://example.com
- Search query: ceramic tiles suppliers Italy
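To make the distinction between these input types concrete, here is a minimal sketch of how an entry could be classified before a mode is chosen. The function name `classify_input` and the rule "an http(s) URL ending in .xml is a sitemap" are assumptions for illustration, not the tool's actual logic.

```python
from urllib.parse import urlparse


def classify_input(text: str) -> str:
    """Return 'sitemap', 'url', or 'search' for a raw user entry (hypothetical helper)."""
    value = text.strip()
    parsed = urlparse(value)
    # Treat the entry as a URL only if it has an http(s) scheme and a host.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # Assumption: a path ending in .xml points at a sitemap.
        if parsed.path.lower().endswith(".xml"):
            return "sitemap"
        return "url"
    # Anything else is handled as a free-text search query.
    return "search"


print(classify_input("https://example.com/sitemap.xml"))
print(classify_input("ceramic tiles suppliers Italy"))
```

A plain URL still leaves the choice between Single Page, Crawl, and Map to you; the sketch only separates URLs from sitemaps and search queries.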
Step 2: Choose Scraping Mode
Single Page
When to use: Testing extraction on one specific page
- Perfect for: Product detail pages, sample testing
- Speed: Fastest (1 page)
- Best for: Initial testing before bulk scraping
Example: Scrape one product page to test field mappings
Sitemap
When to use: You have a sitemap.xml with product URLs
- Perfect for: E-commerce sites with sitemaps
- Speed: Fast (parallel processing)
- Best for: Bulk scraping known URLs
Example: https://example.com/sitemap.xml → scrapes all product URLs
Crawl
When to use: Auto-discover all pages on a website
- Perfect for: Unknown site structure
- Speed: Slower (sequential discovery)
- Best for: Comprehensive site scraping
Example: Start at homepage, find all product pages automatically
Search
When to use: Find pages via search engines
- Perfect for: Finding suppliers across the web
- Speed: Medium (search + scrape)
- Best for: Market research, supplier discovery
Example: "marble suppliers Greece" → finds and scrapes relevant pages
Map
When to use: Get URL list without scraping content
- Perfect for: Planning, URL discovery
- Speed: Fastest (no content extraction)
- Best for: Site mapping before scraping
Example: Get all URLs from a website to review before scraping
Step 3: Configure Field Mappings
Define what data to extract from each page:
Standard Fields:
- Name (required): Material/product name
- Description: Product description
- Price: Price with currency
- Images: Product images (URLs)
- Category: Material category (tiles, stone, wood, etc.)
- Properties: Dimensions, color, finish, etc.
- Supplier: Manufacturer/supplier name
Custom Fields:
You can add custom fields based on your needs.
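One way to picture a field mapping is as a record type with the standard fields plus a catch-all for custom ones. The sketch below is an assumption about shape only: the `Material` class, the `from_extracted` helper, and the lowercase key names are hypothetical and not taken from the tool.

```python
from dataclasses import dataclass, field


@dataclass
class Material:
    """Hypothetical record mirroring the standard fields listed above."""
    name: str                      # required
    description: str = ""
    price: str = ""                # price with currency, kept as text
    images: list[str] = field(default_factory=list)
    category: str = ""
    properties: dict[str, str] = field(default_factory=dict)
    supplier: str = ""
    custom: dict[str, str] = field(default_factory=dict)


STANDARD_KEYS = {"name", "description", "price", "images",
                 "category", "properties", "supplier"}


def from_extracted(raw: dict) -> Material:
    """Build a Material from one extracted record, enforcing the required name."""
    name = str(raw.get("name") or "").strip()
    if not name:
        raise ValueError("Name is required for every material")
    return Material(
        name=name,
        description=str(raw.get("description", "")),
        price=str(raw.get("price", "")),
        images=list(raw.get("images", [])),
        category=str(raw.get("category", "")),
        properties=dict(raw.get("properties", {})),
        supplier=str(raw.get("supplier", "")),
        # Any key outside the standard set is treated as a custom field.
        custom={k: str(v) for k, v in raw.items() if k not in STANDARD_KEYS},
    )
```

With this shape, a record such as `{"name": "White Ceramic Tile", "slip_rating": "R10"}` would keep `slip_rating` as a custom field, while a record without a name is rejected.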
Step 4: Preview Extraction
Before running the full scrape:
- System scrapes one sample page
- Shows extracted materials
- You review the data quality
- Adjust field mappings if needed
- Confirm to proceed
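When reviewing the sample, a quick check for empty fields can show which mappings need adjusting. This is a small sketch under the assumption that the preview result is available as a dictionary; `missing_fields` is a hypothetical helper.

```python
def missing_fields(record: dict, expected: list[str]) -> list[str]:
    """Return the expected fields that are absent or empty in a sample record."""
    return [name for name in expected if not record.get(name)]


# Illustrative sample record; the values are made up.
sample = {"name": "White Ceramic Tile", "price": "", "images": []}
print(missing_fields(sample, ["name", "price", "images", "supplier"]))
```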
Step 5: Run Full Scrape
Once confirmed:
- System processes all pages
- Extracts materials using AI
- Creates embeddings for search
- Chunks data for AI processing
- Stores in database
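The guide does not say how chunking is done, so the following is only a sketch of one common approach: fixed-size character windows with a small overlap. The function name and the default sizes are assumptions.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.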
Scraping Mode Comparison
| Feature | Single Page | Sitemap | Crawl | Search | Map |
|---|---|---|---|---|---|
| Speed | ⚡⚡⚡ | ⚡⚡ | ⚡ | ⚡⚡ | ⚡⚡⚡ |
| Pages | 1 | 10-1000 | 10-1000 | 10-100 | 100-10000 |
| Discovery | Manual | Sitemap | Auto | Search | Auto |
| Best For | Testing | Bulk | Unknown | Research | Planning |
| Complexity | Simple | Medium | High | Medium | Low |
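The Pages row of the comparison can be read as a lookup: given an expected page count, which modes cover it? The sketch below encodes only the ranges from the comparison; the identifiers `PAGE_RANGES` and `modes_for` are hypothetical.

```python
# Typical page ranges per mode, copied from the Pages row of the comparison.
PAGE_RANGES = {
    "single_page": (1, 1),
    "sitemap": (10, 1000),
    "crawl": (10, 1000),
    "search": (10, 100),
    "map": (100, 10000),
}


def modes_for(page_count: int) -> list[str]:
    """Return the modes whose typical page range includes page_count."""
    return [mode for mode, (low, high) in PAGE_RANGES.items()
            if low <= page_count <= high]


print(modes_for(500))
```

Page count alone does not settle the choice; how the pages are discovered (manual, sitemap, auto, search) still matters.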
Configuration Tips
Firecrawl Options
Essential Settings:
- LLM Extraction: ✅ Always ON (best accuracy)
- Remove Base64 Images: ❌ Always OFF (we need images!)
- Main Content Only: ✅ ON (cleaner extraction)
- Wait For: 2000ms (dynamic content)
- Block Ads: ✅ ON (cleaner data)
Output Formats:
- Markdown: ✅ Recommended (structured text)
- HTML: ✅ Recommended (preserves structure)
- Links: Optional (for crawling)
- Screenshot: Optional (visual verification)
Performance Settings:
- Timeout: 30000ms (30 seconds per page)
- Retry Count: 3 (automatic retries)
- Concurrent Pages: 5 (parallel processing)
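The recommended values above can be collected in one settings object so that deviations are easy to spot. This is a sketch only: the class and its field names are hypothetical labels for the options in this guide, not Firecrawl's actual parameter names, and the rule that Wait For must be shorter than Timeout is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeSettings:
    """Hypothetical container for the settings recommended in this guide."""
    llm_extraction: bool = True
    remove_base64_images: bool = False
    main_content_only: bool = True
    wait_for_ms: int = 2000
    block_ads: bool = True
    formats: tuple[str, ...] = ("markdown", "html")
    timeout_ms: int = 30000
    retry_count: int = 3
    concurrent_pages: int = 5

    def problems(self) -> list[str]:
        """List settings that differ from the guide's essential recommendations."""
        issues = []
        if not self.llm_extraction:
            issues.append("LLM Extraction should be ON")
        if self.remove_base64_images:
            issues.append("Remove Base64 Images should be OFF")
        # Assumption: waiting longer than the timeout can never succeed.
        if self.wait_for_ms >= self.timeout_ms:
            issues.append("Wait For must be shorter than Timeout")
        if self.concurrent_pages < 1:
            issues.append("Concurrent Pages must be at least 1")
        return issues


print(ScrapeSettings().problems())
```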
Common Workflows
Workflow 1: Test Single Product
- Mode: Single Page
- URL: One product page
- Preview: Review extraction
- Adjust: Fix field mappings
- Scale: Switch to Sitemap/Crawl mode
Workflow 2: Bulk Scrape E-commerce
- Mode: Sitemap
- URL: https://example.com/sitemap.xml
- Max Pages: 100
- Preview: Test on first page
- Run: Process all pages
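To show what the Max Pages limit means in this workflow, the sketch below reads the `<loc>` entries from a sitemap document and caps the list. It uses the standard sitemap namespace from sitemaps.org; the function name and the sample document are illustrative assumptions, and fetching the file is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str, max_pages: int = 100) -> list[str]:
    """Return up to max_pages <loc> values from a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip()
            for loc in root.findall(".//sm:loc", SITEMAP_NS)
            if loc.text]
    return urls[:max_pages]


# Made-up sample sitemap for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/ceramic-tile-white</loc></url>
  <url><loc>https://example.com/products/marble-slab-grey</loc></url>
</urlset>"""

print(sitemap_urls(SAMPLE, max_pages=1))
```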
Workflow 3: Discover Suppliers
- Mode: Search
- Query: "ceramic tile suppliers Spain"
- Max Results: 20
- Preview: Review found pages
- Run: Scrape all results
Workflow 4: Map Then Scrape
- Mode: Map
- URL: https://example.com
- Get: All URLs
- Review: Filter product URLs
- Switch: Use Sitemap mode with filtered URLs
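The review step in this workflow amounts to filtering the mapped URL list down to product pages. A minimal sketch, assuming product pages live under a `/products/` path as in the Step 1 example; the helper name and the prefix default are hypothetical.

```python
from urllib.parse import urlparse


def filter_product_urls(urls: list[str], path_prefix: str = "/products/") -> list[str]:
    """Keep URLs whose path starts with path_prefix, dropping duplicates in order."""
    seen = set()
    kept = []
    for url in urls:
        if urlparse(url).path.startswith(path_prefix) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Made-up result of a Map run.
mapped = [
    "https://example.com/",
    "https://example.com/products/ceramic-tile-white",
    "https://example.com/about",
    "https://example.com/products/ceramic-tile-white",
]
print(filter_product_urls(mapped))
```

Sites that use a different URL scheme need a different prefix or a more specific rule.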
Troubleshooting
No Materials Found
- ✅ Check if page has product data
- ✅ Verify LLM extraction is enabled
- ✅ Adjust extraction prompt
- ✅ Review field mappings
Missing Images
- ✅ Ensure "Remove Base64 Images" is OFF
- ✅ Check if images are in HTML
- ✅ Verify image URLs are valid
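For the last check, one simple approach is to resolve each image source against the page URL and separate absolute http(s) links from everything else. The helper below is a hypothetical sketch; it checks form only and does not download anything.

```python
from urllib.parse import urljoin, urlparse


def check_image_sources(page_url: str, sources: list[str]) -> tuple[list[str], list[str]]:
    """Return (absolute http(s) image URLs, other entries such as data: URIs or blanks)."""
    valid, other = [], []
    for src in sources:
        cleaned = src.strip()
        if not cleaned:
            other.append(src)
            continue
        # Relative paths are resolved against the page they were found on.
        absolute = urljoin(page_url, cleaned)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            valid.append(absolute)
        else:
            other.append(src)
    return valid, other
```

Entries in the second list are reported rather than dropped, so inline images and malformed values can be reviewed by hand.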
Timeout Errors
- ✅ Increase timeout setting
- ✅ Reduce concurrent pages
- ✅ Check website speed
Rate Limit Errors
- ✅ Reduce concurrent pages
- ✅ Add delays between requests
- ✅ Check Firecrawl plan limits
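"Add delays between requests" is often implemented as exponential backoff: wait longer after each rate-limit response before retrying. The sketch below illustrates the idea with a caller-supplied fetch function; `RateLimited`, `backoff_delays`, and `fetch_with_retries` are hypothetical names, and how a rate limit is actually signalled depends on the client you use.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error raised by a fetch function when it is rate limited."""


def backoff_delays(retries: int = 3, base_seconds: float = 1.0,
                   cap_seconds: float = 30.0) -> list[float]:
    """Delays that double on each retry, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** attempt) for attempt in range(retries)]


def fetch_with_retries(fetch: Callable[[str], str], url: str, retries: int = 3,
                       base_seconds: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch(url), sleeping with backoff after each rate-limit error."""
    for delay in backoff_delays(retries, base_seconds):
        try:
            return fetch(url)
        except RateLimited:
            sleep(delay)
    # Final attempt after the last delay; any error now propagates to the caller.
    return fetch(url)
```

The default of three retries matches the Retry Count in the performance settings; the one-second base delay is an arbitrary starting point.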
Best Practices
Before Scraping:
- ✅ Test with single page first
- ✅ Review website's robots.txt
- ✅ Set reasonable page limits
- ✅ Configure field mappings
- ✅ Preview before full scrape
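The robots.txt review can be partly automated with Python's standard `urllib.robotparser`. The sketch below parses robots.txt text you have already downloaded and keeps only the URLs it allows; the `allowed_urls` helper and the sample rules are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs that the given robots.txt text allows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Made-up robots.txt content for illustration.
ROBOTS = "User-agent: *\nDisallow: /checkout/\n"

print(allowed_urls(ROBOTS, [
    "https://example.com/products/ceramic-tile-white",
    "https://example.com/checkout/cart",
]))
```

A permissive robots.txt is not the whole picture; a site's terms of use may still restrict scraping.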
During Scraping:
- ✅ Monitor progress
- ✅ Check for errors
- ✅ Review extracted data
- ✅ Adjust if needed
After Scraping:
- ✅ Verify data quality
- ✅ Check image URLs
- ✅ Review embeddings
- ✅ Test search functionality
Next Steps
After successful scraping:
- Materials Created: View in Materials section
- Embeddings Generated: Ready for AI search
- Chunks Created: Optimized for AI processing
- Search Enabled: Find materials semantically
Support
For issues or questions:
- Check Firecrawl documentation
- Review error messages
- Test with single page mode
- Adjust configuration settings