POST /v2/extract

cURL
curl -X POST https://api.getcatalog.ai/v2/extract \
  -H "Content-Type: application/json" \
  -H "x-api-key: $CATALOG_API_KEY" \
  -d '{
    "urls": [
      "https://www.nike.com/t/air-force-1-07-mens-shoes-5QFp5Z/CW2288-111",
      "https://www.adidas.com/us/gazelle-shoes/BB5476.html"
    ],
    "enable_enrichment": false,
    "country_code": "us",
    "enable_reviews": false,
    "enable_image_tags": false
  }'
{
  "execution_id": "extract-urls-965b1912-6af0-4ed8-b7e3-184b85e788b7",
  "status": "pending",
  "meta": {
    "credits_used": 10,
    "url_count": 2,
    "urls_to_process": 2
  }
}
You can extract products using one of two input sources:
  • URLs: Provide an array of product URLs to extract (up to 1000 URLs per request)
  • Vendor: Provide a vendor to extract products from previously discovered listings
Async Processing: This endpoint starts processing asynchronously and returns an execution_id immediately. Use GET /v2/extract/{execution_id} to check progress and retrieve results when processing completes.
Managing Executions:
  • View All Executions: You can view all your extraction executions using GET /v2/extract to see all processing jobs, their statuses, and when they were created.
  • Cancel Running Executions: If you need to stop a running execution, use DELETE /v2/extract/{execution_id}.
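
The two management endpoints above can be called directly; a sketch (requires a valid API key in CATALOG_API_KEY, and EXEC_ID is a placeholder for an execution_id you already hold):

```shell
# List all extraction executions, with their statuses and creation times
curl -s https://api.getcatalog.ai/v2/extract \
  -H "x-api-key: $CATALOG_API_KEY"

# Cancel a running execution by its ID
curl -s -X DELETE "https://api.getcatalog.ai/v2/extract/$EXEC_ID" \
  -H "x-api-key: $CATALOG_API_KEY"
```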

Request

x-api-key
string
required
Your API key for authentication

Request Body

Choose one of the following input sources:

Input Source: URLs

urls
string[]
Array of product URLs to extract (up to 1000 URLs per request). Use this when you have specific product URLs to extract.
Requirements:
  • Must be a non-empty array
  • Each entry must be a non-empty string
  • Maximum 1000 URLs per request
Example:
[
  "https://www.nike.com/t/air-force-1-07-mens-shoes-5QFp5Z/CW2288-111",
  "https://www.adidas.com/us/gazelle-shoes/BB5476.html",
  "https://www.example-store.com/product/123"
]
Note: You can send a single URL in an array: ["https://example.com/product"]
URL sanitization: URLs are sanitized and normalized on the backend; client-side sanitization is not required.
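
Because each request accepts at most 1000 URLs, a larger list has to be split into batches client-side. A minimal sketch using jq (jq is an assumption here; any JSON tooling works):

```shell
# batch_urls N — read a JSON array of URLs on stdin and emit one JSON
# array per line, each holding at most N URLs (N=1000 for this endpoint).
batch_urls() {
  jq -c --argjson n "$1" \
    'def chunks(n): range(0; length; n) as $i | .[$i:$i+n]; chunks($n)'
}
```

For example, `printf '["a","b","c"]' | batch_urls 2` emits `["a","b"]` and `["c"]` on separate lines, each ready to wrap into a request body.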

Input Source: Vendor

vendor
string
The vendor to extract products from (e.g., “example.com” or “https://example.com”). Use this when you want to extract products from a vendor’s previously discovered listings.
Examples:
  • "nike.com"
  • "https://www.adidas.com"
  • "example-store.com"
Note: The domain will be normalized automatically. You can include or omit the protocol and www prefix.
Important: This vendor must have been crawled previously using POST /v2/crawl, or you must provide a crawl_id to wait for an active crawl to complete. Extraction requires product listings to have been discovered via crawl.
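
A vendor-based request looks like this (parameter names from this page; the vendor and max_products values are placeholders):

```shell
# Extract from a previously crawled vendor instead of explicit URLs
curl -X POST https://api.getcatalog.ai/v2/extract \
  -H "Content-Type: application/json" \
  -H "x-api-key: $CATALOG_API_KEY" \
  -d '{
    "vendor": "example-store.com",
    "max_products": 100,
    "country_code": "us"
  }'
```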

Vendor-Specific Parameters

These parameters only apply when using the vendor input source:
max_products
number
Maximum number of products to process. If not provided, all available products will be processed.
crawl_id
string
Optional crawl execution ID to wait for before starting extraction. If provided, extraction will be queued and will start automatically when the crawl execution completes.
Use Case: Use this when you want to ensure product listings are discovered via crawl before extraction begins. The extraction will remain in pending status with waiting_for in the meta until the crawl completes.
Format: crawl-{vendor}-{uuid}
Note: If you haven’t started a crawl yet, use POST /v2/crawl first, then use the returned execution_id as the crawl_id parameter here.
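
Chaining the two endpoints can be sketched as follows. This assumes jq is available, and the POST /v2/crawl request body shown here is an assumption (see that endpoint's documentation for the actual shape); only the crawl_id wiring is specified on this page:

```shell
# Start a crawl and capture its execution_id (request body is assumed)
CRAWL_ID=$(curl -s -X POST https://api.getcatalog.ai/v2/crawl \
  -H "Content-Type: application/json" \
  -H "x-api-key: $CATALOG_API_KEY" \
  -d '{"vendor": "example-store.com"}' | jq -r '.execution_id')

# Queue an extraction that starts automatically once the crawl completes
curl -X POST https://api.getcatalog.ai/v2/extract \
  -H "Content-Type: application/json" \
  -H "x-api-key: $CATALOG_API_KEY" \
  -d "{\"vendor\": \"example-store.com\", \"crawl_id\": \"$CRAWL_ID\"}"
```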

Shared Parameters

These parameters apply to both input sources:
country_code
string
default:"us"
ISO 2-letter country code for localization (e.g., “us”, “ca”, “gb”, “de”)
Supported Country Codes: nl, ca, si, co, by, ee, no, br, us, de, es, vn, il, th, gb, ph, kg, in, ru, fr, hk, ua, jp, at, mm, my, pl, au, nz, ro, tw, mx, id, dk, ng, ch, hu, sg, sa, ae, cn, ar, cl, it, tr, lv, gh, sk, gr, eg, lu, bg, se, lt, lk, kz, hr, bd, kr, cz, fi, ie, be, pt, za
enable_enrichment
boolean
default:"false"
Whether to enable AI enrichment for products. When enabled, products are enhanced with additional attributes and categorization.
enable_reviews
boolean
default:"false"
Whether to enable product reviews processing
enable_image_tags
boolean
default:"false"
Whether to enable AI-generated image tags and descriptions
enable_brand_pdp
boolean
default:"false"
Whether to enable brand PDP (Product Detail Page) discovery for additional product information
enable_similar_products
boolean
default:"false"
Whether to enable similar products discovery to find related items
enable_reddit_insights
boolean
default:"false"
Whether to enable Reddit insights gathering for product context and discussions

Response

execution_id
string
Unique execution identifier for this extraction job. Use this ID with GET /v2/extract/{execution_id} to check progress and retrieve results.
Format:
  • When using urls: extract-urls-{uuid}
  • When using vendor: extract-{vendor}-{uuid} (dots in vendor replaced with dashes)
status
string
Execution status. Will be "pending" in the POST response. When using vendor with crawl_id, status will be "pending" with waiting_for in meta until the crawl completes.
meta
object
Response metadata
HTTP Status Code: This endpoint returns 202 Accepted to indicate the request has been accepted for processing. The response body contains the execution details you need to track the extraction progress.

Response Schema and Enable Flags

The /v2/extract endpoint maintains a consistent response schema regardless of enable_* flag values. All fields are always present in product objects returned in the data array, but will be null when the corresponding flag is false.
Field Mappings:
  • enable_reviews: reviews
  • enable_enrichment: attributes, product_type, google_product_category_id, google_product_category_path
  • enable_image_tags: images[].attributes
  • enable_brand_pdp: brand_pdp
  • enable_similar_products: similar_products
  • enable_reddit_insights: reddit_insights
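
Because flag-gated fields are always present but may be null, downstream code should handle null explicitly rather than testing for a missing key. A sketch of inspecting them (assumes jq and a completed execution; EXEC_ID is a placeholder):

```shell
# With enable_reviews/enable_enrichment off, these fields come back null
curl -s "https://api.getcatalog.ai/v2/extract/$EXEC_ID" \
  -H "x-api-key: $CATALOG_API_KEY" \
  | jq '.data[] | {reviews, attributes}'
```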

Workflow

Using URLs

  1. Start Extraction: Send a POST request with your urls array
  2. Get Execution ID: Receive an execution_id immediately in the response (HTTP 202)
  3. Check Status: Poll GET /v2/extract/{execution_id} using the execution_id
  4. Retrieve Results: When status is "completed", results are available with pagination support
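
The four steps above can be sketched as a polling loop. This assumes jq; the 10-second interval is an arbitrary choice, and a production loop should also break on a failed or cancelled status rather than polling forever:

```shell
# 1-2. Start the extraction and capture the execution_id (HTTP 202)
EXEC_ID=$(curl -s -X POST https://api.getcatalog.ai/v2/extract \
  -H "Content-Type: application/json" \
  -H "x-api-key: $CATALOG_API_KEY" \
  -d '{"urls": ["https://example.com/product"]}' | jq -r '.execution_id')

# 3. Poll the status endpoint until processing finishes
while :; do
  STATUS=$(curl -s "https://api.getcatalog.ai/v2/extract/$EXEC_ID" \
    -H "x-api-key: $CATALOG_API_KEY" | jq -r '.status')
  [ "$STATUS" = "completed" ] && break
  sleep 10
done

# 4. Retrieve results (paginated once completed)
curl -s "https://api.getcatalog.ai/v2/extract/$EXEC_ID" \
  -H "x-api-key: $CATALOG_API_KEY"
```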

Using Vendor

  1. Crawl the Domain (if not already done): First, use POST /v2/crawl to discover product listings from the domain. Wait for the crawl to complete or use the crawl_id parameter in step 2.
  2. Start Extraction: Send a POST request with your vendor domain. If the domain hasn’t been crawled yet, provide the crawl_id from step 1 to wait for crawl completion.
  3. Get Execution ID: Receive an execution_id immediately in the response
  4. Check Status: Poll GET /v2/extract/{execution_id}. If waiting for a crawl, the status will show pending with waiting_for in the meta.
  5. Retrieve Results: When status is "completed", results are available with pagination support