Build and maintain your own product index using Catalog’s crawl and products endpoints. The typical workflow is: crawl a vendor to discover their product listings, browse the discovered data to curate which products you want, then extract full product data for your selected listings.
This guide covers the core workflow: discovering product listings via POST /v1/crawl, browsing discovered data via GET /v1/vendors, POST /v1/collections, and POST /v1/listings, and extracting products via POST /v1/products.

Step 1: Crawl a Vendor

Start by crawling a vendor website to discover all their collections and product listings. Use POST /v1/crawl with the vendor URL:
{
  "url": "https://www.example.com"
}
The crawl process:
  • Discovers collections from the vendor (e.g., “New In”, “Sale”, “Shoes”)
  • Extracts product listings from each collection
Typical flow:
  1. Start a crawl by calling POST /v1/crawl with the vendor URL
  2. Receive an execution_id immediately (format: crawl-{hostname}-{uuid})
  3. Poll GET /v1/crawl/{execution_id} to check progress (a polling sketch follows this list):
    • Status will be "pending", "running", "completed", or "failed"
    • When status is "completed", you’ll see total_listings_found indicating how many products were discovered
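A minimal version of this loop in Python, assuming a placeholder base URL and a Bearer-token Authorization header (substitute the real values from your Catalog account):

import os
import time
import requests

# Placeholder base URL and assumed auth scheme; check your Catalog account
# for the real values.
BASE_URL = "https://api.catalog.example"
HEADERS = {"Authorization": f"Bearer {os.environ['CATALOG_API_KEY']}"}

# Start the crawl; the execution_id comes back immediately
# (format: crawl-{hostname}-{uuid}).
resp = requests.post(f"{BASE_URL}/v1/crawl",
                     json={"url": "https://www.example.com"},
                     headers=HEADERS)
resp.raise_for_status()
execution_id = resp.json()["execution_id"]

# Poll until the crawl reaches a terminal status.
while True:
    state = requests.get(f"{BASE_URL}/v1/crawl/{execution_id}",
                         headers=HEADERS).json()
    if state["status"] in ("completed", "failed"):
        break
    time.sleep(10)  # polling interval is a judgment call

if state["status"] == "completed":
    print("Listings discovered:", state["total_listings_found"])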
Billing Requirement: Crawl requests require auto top-up to be enabled in your billing settings. This ensures you have sufficient credits to complete the crawl operation.

Step 2: Browse Discovered Data

After crawling, use the three browse endpoints to explore and curate which products you want to extract. These endpoints return data from vendors you have crawled.

List your vendors

Use GET /v1/vendors to see all vendors you have crawled and their product counts:
  • Review which vendors you have indexed
  • Check product_count to see how many listings were discovered
  • Use latest_product_update_by_catalog to see when data was last refreshed
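For example, a quick sketch reusing the BASE_URL and HEADERS from the Step 1 sketch; the response shape and the vendor identifier field are assumptions, so adjust to match the actual payload:

import requests  # BASE_URL and HEADERS as defined in the Step 1 sketch

resp = requests.get(f"{BASE_URL}/v1/vendors", headers=HEADERS)
resp.raise_for_status()
for vendor in resp.json().get("vendors", []):  # top-level "vendors" key is assumed
    print(vendor.get("url"),                   # vendor identifier field is assumed
          vendor["product_count"],
          vendor["latest_product_update_by_catalog"])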

Browse collections

Use POST /v1/collections to explore a vendor’s collections:
  • Retrieve collections (e.g., “New In”, “Sale”, “Shoes”)
  • Decide which collections to include in your index (e.g., only “New Arrivals” or “Top Sellers”)
This helps you build a more structured index (vendor → collection → products).
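For instance, a sketch that assumes the endpoint takes the vendor URL you crawled (check the API reference for the exact request shape):

import requests  # BASE_URL and HEADERS as defined in the Step 1 sketch

resp = requests.post(f"{BASE_URL}/v1/collections",
                     json={"url": "https://www.example.com"},  # request shape assumed
                     headers=HEADERS)
resp.raise_for_status()
for collection in resp.json().get("collections", []):  # response shape assumed
    print(collection.get("name"))  # e.g., “New In”, “Sale”, “Shoes”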

Curate product listings

Use POST /v1/listings to page through product listings for a vendor or collection. As you browse:
  • Review the lightweight listing data (title, URL, collection, timestamps)
  • Curate which products you want to extract full data for
  • Store the canonical product URLs for listings you want to index
This step lets you select a subset of products rather than extracting everything—useful when you only need certain collections, price ranges, or product types.
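One way this curation step might look in Python, with an assumed request/response shape and an illustrative rule that keeps only the “Sale” collection:

import requests  # BASE_URL and HEADERS as defined in the Step 1 sketch

# Request shape is assumed; check the API reference for collection filters
# and pagination parameters.
resp = requests.post(f"{BASE_URL}/v1/listings",
                     json={"url": "https://www.example.com"},
                     headers=HEADERS)
resp.raise_for_status()

# Example curation rule: keep only “Sale” listings and store their
# canonical product URLs for Step 3.
curated_urls = [
    listing["url"]
    for listing in resp.json().get("listings", [])  # response shape assumed
    if (listing.get("collection") or "").lower() == "sale"
]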

Step 3: Extract Full Product Data

Once you have curated your list of product URLs, use POST /v1/products to get full product data with AI enrichment, reviews, and image tags. Typical flow:
  1. Pull a batch of URLs from your curated list (up to 1000 URLs per request)
  2. Call POST /v1/products with your URLs:
    {
      "urls": [
        "https://www.example.com/product/1",
        "https://www.example.com/product/2"
      ],
      "enable_enrichment": true,
      "enable_reviews": true,
      "enable_image_tags": true,
      "country_code": "us"
    }
    
  3. Receive an execution_id (format: products-batch-{uuid}) and poll GET /v1/products/{execution_id}
  4. Upsert the extracted products into your index (search engine, DB, vector store, etc.), as in the sketch below
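Putting the flow together, a minimal sketch that batches, polls, and upserts; the results shape and the my_index client are illustrative stand-ins, not part of the API:

import time
import requests  # BASE_URL, HEADERS, and curated_urls as in the earlier sketches

for i in range(0, len(curated_urls), 1000):  # up to 1000 URLs per request
    resp = requests.post(f"{BASE_URL}/v1/products", json={
        "urls": curated_urls[i:i + 1000],
        "enable_enrichment": True,
        "enable_reviews": True,
        "enable_image_tags": True,
        "country_code": "us",
    }, headers=HEADERS)
    resp.raise_for_status()
    execution_id = resp.json()["execution_id"]  # format: products-batch-{uuid}

    # Poll for results; status values are assumed to mirror the crawl endpoint.
    while True:
        result = requests.get(f"{BASE_URL}/v1/products/{execution_id}",
                              headers=HEADERS).json()
        if result["status"] in ("completed", "failed"):
            break
        time.sleep(10)

    for product in result.get("products", []):  # response shape assumed
        my_index.upsert(product)  # stand-in for your search engine, DB, or vector store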

Extracting from Any URL Source

The POST /v1/products endpoint accepts product URLs from any source—not just URLs discovered through crawling. This gives you the flexibility to build your index from multiple sources. Common use cases:
  • Affiliate feeds: Extract products from affiliate network URLs
  • Merchant feeds: Process product URLs from partner data feeds
  • Internal catalogs: Index products from your own product database
  • Hand-curated lists: Extract specific products you’ve manually selected
  • Competitor monitoring: Track products from URLs you’ve collected
Example:
{
  "urls": [
    "https://www.nike.com/t/air-force-1-07-mens-shoes-5QFp5Z/CW2288-111",
    "https://www.adidas.com/us/gazelle-shoes/BB5476.html",
    "https://www.newbalance.com/pd/574-core/ML574EVG.html"
  ],
  "enable_enrichment": true,
  "country_code": "us"
}
This works the same as extracting crawled URLs—you receive an execution_id and poll for results.

Keeping Your Index Fresh

To maintain a high-quality product index:
  • Schedule re-crawling: Periodically re-crawl vendor websites to discover new product listings and collections.
  • Schedule re-extraction: Re-run extraction to capture price, availability, and content changes for existing products.
  • Monitor failures: Use the success and outcome fields (see the sketch after this list) to detect:
    • Non-product URLs
    • Unsupported vendors
    • Products that have been removed
  • Prune stale products: Remove (or downgrade) products that consistently fail to process or are no longer available.
  • Use URLs for targeted updates: For specific products that need frequent updates (e.g., featured items), use the Products endpoint to refresh them more frequently than a full re-extraction.
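One way to wire the success and outcome checks into a pruning pass; failure_counts, the retry threshold, and my_index are illustrative stand-ins:

from collections import defaultdict

failure_counts = defaultdict(int)  # persist this between runs in practice

for product in result.get("products", []):  # result from a Step 3 poll; shape assumed
    url = product["url"]
    if product.get("success"):
        failure_counts.pop(url, None)
        my_index.upsert(product)       # stand-in index client
    else:
        failure_counts[url] += 1
        # The outcome field explains the failure: a non-product URL, an
        # unsupported vendor, or a product that has been removed.
        if failure_counts[url] >= 3:   # threshold is a judgment call
            my_index.remove(url)       # or downgrade instead of deleting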