Material Scraping Workflow Guide

Quick Start

Step 1: Enter Your URL

Start by entering the URL you want to scrape, or a search query to find pages to scrape.

Examples:

Single product page: https://example.com/products/ceramic-tile-white
Sitemap: https://example.com/sitemap.xml
Website to crawl: https://example.com
Search query: ceramic tiles suppliers Italy
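
As a rough illustration of how the four input types above differ, the sketch below guesses the input kind from the text alone. The function name `classify_input` and its return labels are hypothetical and not part of the tool. The rules are assumptions (a path ending in `.xml` means a sitemap, an empty path means a site to crawl) and will misclassify some inputs.

```python
from urllib.parse import urlparse


def classify_input(text: str) -> str:
    """Guess the input kind: 'sitemap', 'website', 'page', or 'search'.

    Hypothetical helper for illustration only; the labels are made up here.
    """
    text = text.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        path = parsed.path.lower()
        if path.endswith(".xml"):
            return "sitemap"   # e.g. https://example.com/sitemap.xml
        if path in ("", "/"):
            return "website"   # site root, a candidate for Crawl or Map
        return "page"          # a specific page, a candidate for Single Page
    return "search"            # anything that is not an http(s) URL


for example in (
    "https://example.com/products/ceramic-tile-white",
    "https://example.com/sitemap.xml",
    "https://example.com",
    "ceramic tiles suppliers Italy",
):
    print(example, "->", classify_input(example))
```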

Step 2: Choose Scraping Mode

🎯 Single Page

When to use: Testing extraction on one specific page

Perfect for: Product detail pages, sample testing
Speed: Fastest (1 page)
Best for: Initial testing before bulk scraping

Example: Scrape one product page to test field mappings

🗺️ Sitemap

When to use: You have a sitemap.xml with product URLs

Perfect for: E-commerce sites with sitemaps
Speed: Fast (parallel processing)
Best for: Bulk scraping known URLs

Example: https://example.com/sitemap.xml → scrapes every product URL listed in the sitemap
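
To make "URLs listed in the sitemap" concrete, here is a minimal sketch that pulls the `<loc>` entries out of a standard sitemap.xml using only the Python standard library. The tool does this step for you; the function name `extract_sitemap_urls` is hypothetical. It assumes the standard sitemaps.org namespace and does not follow sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every non-empty <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/products/ceramic-tile-white</loc></url>"
    "<url><loc> https://example.com/products/marble-slab </loc></url>"
    "</urlset>"
)
print(extract_sitemap_urls(sample))
```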

🕷️ Crawl

When to use: You want to auto-discover all pages on a website

Perfect for: Unknown site structure
Speed: Slower (sequential discovery)
Best for: Comprehensive site scraping

Example: Start at homepage, find all product pages automatically

🔍 Search

When to use: You want to find pages via search engines

Perfect for: Finding suppliers across the web
Speed: Medium (search + scrape)
Best for: Market research, supplier discovery

Example: "marble suppliers Greece" → finds and scrapes relevant pages

📋 Map

When to use: You want a list of URLs without scraping their content

Perfect for: Planning, URL discovery
Speed: Fastest (no content extraction)
Best for: Site mapping before scraping

Example: Get all URLs from a website to review before scraping

Step 3: Configure Field Mappings

Define what data to extract from each page:

Standard Fields:

Name (required): Material/product name
Description: Product description
Price: Price with currency
Images: Product images (URLs)
Category: Material category (tiles, stone, wood, etc.)
Properties: Dimensions, color, finish, etc.
Supplier: Manufacturer/supplier name

Custom Fields: You can add custom fields based on your needs.
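
One way to picture the field mappings is as a small schema: the seven standard fields, of which only Name is required, plus any custom fields you add. The sketch below is illustrative only. The `FieldMapping` class, the lowercase keys, and the `fire_rating` custom field are assumptions, not the tool's actual data model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FieldMapping:
    key: str            # hypothetical machine-readable key
    description: str    # what to extract, taken from the list above
    required: bool = False


STANDARD_FIELDS = (
    FieldMapping("name", "Material/product name", required=True),
    FieldMapping("description", "Product description"),
    FieldMapping("price", "Price with currency"),
    FieldMapping("images", "Product images (URLs)"),
    FieldMapping("category", "Material category (tiles, stone, wood, etc.)"),
    FieldMapping("properties", "Dimensions, color, finish, etc."),
    FieldMapping("supplier", "Manufacturer/supplier name"),
)


def build_mappings(custom: Iterable[FieldMapping] = ()) -> dict[str, FieldMapping]:
    """Combine the standard fields with custom ones, rejecting duplicate keys."""
    mappings = {f.key: f for f in STANDARD_FIELDS}
    for f in custom:
        if f.key in mappings:
            raise ValueError(f"duplicate field key: {f.key}")
        mappings[f.key] = f
    return mappings


# Example: add one made-up custom field.
mappings = build_mappings([FieldMapping("fire_rating", "Fire resistance class")])
print(sorted(mappings))
```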

Step 4: Preview Extraction

Before running the full scrape:

The system scrapes one sample page
The system shows the extracted materials
You review the data quality
You adjust field mappings if needed
You confirm to proceed
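
The "review the data quality" step can be partly mechanical. The sketch below checks one extracted record for the required Name field and lists which optional fields came back empty. The `review_sample` function, the record shape (a plain dict with lowercase keys), and the sample values are assumptions made for illustration.

```python
def review_sample(
    record: dict,
    required: tuple = ("name",),
    optional: tuple = ("description", "price", "images",
                       "category", "properties", "supplier"),
) -> dict:
    """Report missing required fields and empty optional fields for one record."""

    def is_empty(value) -> bool:
        return value is None or value == "" or value == [] or value == {}

    missing_required = [k for k in required if is_empty(record.get(k))]
    empty_optional = [k for k in optional if is_empty(record.get(k))]
    return {
        "ok": not missing_required,          # safe to proceed only if True
        "missing_required": missing_required,
        "empty_optional": empty_optional,    # hints for adjusting mappings
    }


# Made-up sample record, as a preview might return it.
sample_record = {
    "name": "Ceramic Tile White",
    "price": "",
    "images": ["https://example.com/img/tile.jpg"],
}
print(review_sample(sample_record))
```

If many optional fields come back empty on the sample page, that is usually a sign to adjust the field mappings before confirming.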

Step 5: Run Full Scrape

Once confirmed:

System processes all pages
Extracts materials using AI
Creates embeddings for search
Chunks data for AI processing
Stores the results in the database
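
This guide does not say how the chunking step works internally. As a sketch of one common approach (fixed-size character chunks with overlap, which may differ from what the system does), the following shows the general idea. The function name and the `size` and `overlap` values are assumptions.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in one of the two neighbouring chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


description = "White glazed ceramic tile, 30x60 cm, matte finish. " * 10
for i, chunk in enumerate(chunk_text(description)):
    print(i, len(chunk))
```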

Scraping Mode Comparison

Feature	Single Page	Sitemap	Crawl	Search	Map
Speed	⚡⚡⚡	⚡⚡	⚡	⚡⚡	⚡⚡⚡
Pages	1	10-1000	10-1000	10-100	100-10000
Discovery	Manual	Sitemap	Auto	Search	Auto
Best For	Testing	Bulk	Unknown	Research	Planning
Complexity	Simple	Medium	High	Medium	Low

Configuration Tips

Firecrawl Options

Essential Settings:

LLM Extraction: ✅ Always ON (best accuracy)
Remove Base64 Images: ❌ Always OFF (keeps inline base64-encoded images in the output)
Main Content Only: ✅ ON (cleaner extraction)
Wait For: 2000ms (gives dynamic content time to load)
Block Ads: ✅ ON (cleaner data)

Output Formats:

Markdown: ✅ Recommended (structured text)
HTML: ✅ Recommended (preserves structure)
Links: Optional (for crawling)
Screenshot: Optional (visual verification)

Performance Settings:

Timeout: 30000ms (30 seconds per page)
Retry Count: 3 (automatic retries)
Concurrent Pages: 5 (parallel processing)
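
The recommended values above can be collected in one place, which also makes it easy to estimate how long a run might take in the worst case. The sketch below is an assumption-laden illustration. The `ScrapeSettings` class and its field names are hypothetical and are not Firecrawl parameter names. The estimate assumes "Retry Count: 3" means three retries after the first attempt, that pages run in fixed batches, and that every attempt uses the full timeout.

```python
import math
from dataclasses import dataclass


@dataclass
class ScrapeSettings:
    # Essential settings (values from this guide; names are hypothetical)
    llm_extraction: bool = True
    remove_base64_images: bool = False
    main_content_only: bool = True
    wait_for_ms: int = 2000
    block_ads: bool = True
    # Output formats recommended above
    formats: tuple = ("markdown", "html")
    # Performance settings
    timeout_ms: int = 30000
    retry_count: int = 3
    concurrent_pages: int = 5

    def worst_case_seconds(self, pages: int) -> float:
        """Rough upper bound: every attempt on every page times out."""
        batches = math.ceil(pages / self.concurrent_pages)
        attempts = 1 + self.retry_count
        return batches * attempts * self.timeout_ms / 1000


settings = ScrapeSettings()
print(settings.worst_case_seconds(100))  # 2400.0 seconds, i.e. 40 minutes
```

Real runs should finish far sooner than this bound. It is mainly useful for seeing how raising the timeout or lowering concurrency stretches the worst case.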

Common Workflows

Workflow 1: Test Single Product

Mode: Single Page
URL: One product page
Preview: Review extraction
Adjust: Fix field mappings
Scale: Switch to Sitemap/Crawl mode

Workflow 2: Bulk Scrape E-commerce

Mode: Sitemap
URL: https://example.com/sitemap.xml
Max Pages: 100
Preview: Test on first page
Run: Process all pages

Workflow 3: Discover Suppliers

Mode: Search
Query: "ceramic tile suppliers Spain"
Max Results: 20
Preview: Review found pages
Run: Scrape all results

Workflow 4: Map Then Scrape

Mode: Map
URL: https://example.com
Get: All URLs
Review: Filter product URLs
Switch: Use Sitemap mode with filtered URLs
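
The "Review: Filter product URLs" step in Workflow 4 can be sketched as a simple path filter over the URL list that Map mode returns. The function name and the `/products/` prefix are assumptions based on the example URLs in this guide; a real site may use a different path pattern.

```python
from urllib.parse import urlparse


def filter_product_urls(urls: list[str], path_prefix: str = "/products/") -> list[str]:
    """Keep URLs whose path lies under `path_prefix`, dropping duplicates.

    The bare listing page (path equal to the prefix) is excluded, and the
    original order of the remaining URLs is preserved.
    """
    seen = set()
    kept = []
    for url in urls:
        path = urlparse(url).path
        is_product = path.startswith(path_prefix) and len(path) > len(path_prefix)
        if is_product and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


mapped = [
    "https://example.com/",
    "https://example.com/products/ceramic-tile-white",
    "https://example.com/products/",
    "https://example.com/about",
    "https://example.com/products/ceramic-tile-white",
    "https://example.com/products/marble-slab?ref=nav",
]
print(filter_product_urls(mapped))
```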

Troubleshooting

No Materials Found

✅ Check if page has product data
✅ Verify LLM extraction is enabled
✅ Adjust extraction prompt
✅ Review field mappings

Missing Images

✅ Ensure "Remove Base64 Images" is OFF
✅ Check if images are in HTML
✅ Verify image URLs are valid

Timeout Errors

✅ Increase timeout setting
✅ Reduce concurrent pages
✅ Check website speed

Rate Limit Errors

✅ Reduce concurrent pages
✅ Add delays between requests
✅ Check Firecrawl plan limits
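
If you call a scraping API from your own scripts, "add delays between requests" is often implemented as exponential backoff: wait longer after each rate-limited attempt. The sketch below shows the pattern in isolation. `RateLimitError` and `call_with_backoff` are hypothetical names, and the delays are assumptions; replace the exception with whatever your client raises on a rate-limit response.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the error a client raises when it is rate limited."""


def call_with_backoff(fn, retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait base_delay * 2**attempt and retry.

    Makes at most `retries` extra attempts, then re-raises the last error.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demo with a fake request that is rate limited twice, then succeeds.
state = {"calls": 0}

def fake_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RateLimitError("too many requests")
    return "ok"

waits = []
print(call_with_backoff(fake_request, sleep=waits.append), waits)
```

Passing `sleep` as a parameter keeps the demo instant and makes the delays easy to inspect; in real use the default `time.sleep` applies.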

Best Practices

Before Scraping:

✅ Test with single page first
✅ Review website's robots.txt
✅ Set reasonable page limits
✅ Configure field mappings
✅ Preview before full scrape

During Scraping:

✅ Monitor progress
✅ Check for errors
✅ Review extracted data
✅ Adjust if needed

After Scraping:

✅ Verify data quality
✅ Check image URLs
✅ Review embeddings
✅ Test search functionality

Next Steps

After successful scraping:

Materials Created: View in Materials section
Embeddings Generated: Ready for AI search
Chunks Created: Optimized for AI processing
Search Enabled: Find materials semantically

Support

For issues or questions:

Check Firecrawl documentation
Review error messages
Test with single page mode
Adjust configuration settings