The SitemapEntity helps you extract all URLs from any website, even if the website doesn't have a traditional sitemap.xml file. Think of it as a smart crawler that finds every page on a website and organizes them into a clean, easy-to-use list. Whether you need to audit a website's structure, analyze competitors, or build a content inventory, this entity does the heavy lifting for you. It works fast and reliably, even on websites with thousands of pages.

Definition

Here are all the options you can use when making a sitemap request:
SitemapEntity
interface SitemapEntity {
  sort: "asc" | "desc";
  type: "absolute" | "relative";
  urls: string[];
  pathCase?: "lowercase" | "uppercase" | "capitalize" | "as-is";
  validateUrls?: boolean;
  includeDeadUrls?: boolean;
  include?: MetadataEntity[];
  maxResults?: number | undefined;
  filters?: FilterEntity[];
  responseType?: "json" | "cdn";
}
| Property | Type | Default | Description |
| --- | --- | --- | --- |
| sort | 'asc' or 'desc' | 'asc' | Specifies the sorting order of URLs in the sitemap. Use 'asc' for ascending order and 'desc' for descending order. |
| type | 'absolute' or 'relative' | 'absolute' | Determines whether URLs are returned as absolute (complete URLs) or relative (domain-relative paths). |
| urls | string[] | - | An array of URLs from which to extract sitemap data. |
| pathCase | 'lowercase', 'uppercase', 'capitalize', or 'as-is' | 'as-is' | Specifies the casing for the sitemap URLs: convert to lowercase, uppercase, or capitalized, or retain the original casing. |
| validateUrls | boolean | false | If set to true, the system validates each extracted URL to ensure it is reachable, such as detecting broken links or dead ends. |
| includeDeadUrls | boolean | false | When validateUrls is enabled, setting this to true includes URLs found to be dead or unreachable in the sitemap results. |
| include | MetadataEntity[] | - | An optional array specifying additional metadata fields to include for each sitemap URL. See MetadataEntity for available fields. |
| maxResults | number | - | Specifies the maximum number of URLs to include in the sitemap results. |
| filters | FilterEntity[] | - | An optional array of filter objects to refine the sitemap results based on specific criteria. |
| responseType | 'json' or 'cdn' | 'json' | Specifies the response type. Use 'json' for a standard JSON response or 'cdn' to utilize CDN Workers for efficient handling of large sitemaps. |
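For example, a basic request that extracts absolute URLs in ascending order and caps the output at 500 results could look like this (example.com is a placeholder domain):
{
  "urls": ["https://example.com"],
  "sort": "asc",
  "type": "absolute",
  "maxResults": 500
}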

Filters

Filters help you narrow down the results to only the URLs you care about. For example, you might want only blog posts, product pages, or URLs from a specific year. Here’s how filters work:
interface FilterEntity extends Filter, Operator {}

Filter Fields

| Field | Description | Example |
| --- | --- | --- |
| url | Matches against the complete URL, including the domain | https://example.com/blog/post-1 |
| path | Matches against only the path portion of the URL | /blog/post-1 |

Filter Operators

| Operator | Description | Example Use Case |
| --- | --- | --- |
| equals | Exact match | Find a specific URL |
| notEquals | Excludes exact matches | Exclude a specific page |
| contains | Partial match anywhere in the string | Find all URLs containing "blog" |
| notContains | Excludes partial matches | Exclude URLs containing "admin" |
| startsWith | Matches the beginning of the string | Find URLs starting with "/products/" |
| endsWith | Matches the end of the string | Find URLs ending with ".html" |

Combining Filters

You can use multiple filters together:
  • AND: The URL must match ALL filters (this is the default)
  • OR: The URL must match ANY of the filters
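As a sketch only (the exact filter shape comes from the Filter and Operator definitions, so treat the property names below as assumptions), two filters that must both match under the default AND behavior might look like this:
{
  "filters": [
    { "field": "path", "operator": "startsWith", "value": "/blog/" },
    { "field": "path", "operator": "notContains", "value": "draft" }
  ]
}
Here, only URLs whose path starts with /blog/ and does not contain "draft" are kept.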

Dead URLs

Websites change over time: pages get deleted, URLs get restructured, or content moves to new locations. When website owners forget to update their sitemaps, they end up with links pointing to pages that no longer exist (also known as "broken links" or "404 errors"). Our system can find these dead URLs for you when any of the following applies:
  • The validateUrls property is set to true
  • The includeDeadUrls option is enabled
  • The include property is used (metadata extraction requires URL validation)
This feature is particularly useful for:
  • SEO audits: Identify and fix broken links that may harm your search rankings
  • Website maintenance: Keep your sitemap clean and accurate
  • Migration projects: Verify that all old URLs redirect properly to new locations
Using the validateUrls option may increase processing time, especially for large sitemaps.
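For instance, to validate every discovered URL and keep the unreachable ones in the output, a request could look like this (example.com is a placeholder domain):
{
  "urls": ["https://example.com"],
  "validateUrls": true,
  "includeDeadUrls": true
}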
When validateUrls is enabled, you will receive a response like this:
{
  "https://example.com": {
    "urls": ["https://example.com/page1", "https://example.com/page2"],
    "deadUrls": ["https://example.com/old-page"],
    "metrics": {...}
  }
}

Metadata

By default, you only get a list of URLs. But with the include option, you can also get extra information about each page, like its title, description, and images. This turns a simple URL list into a complete content inventory.

Available Metadata Fields

| Field | Description |
| --- | --- |
| title | The page's HTML title tag |
| description | The meta description of the page |
| favicon | The website's favicon URL |
| url | The full URL of the page |
| sitename | The name of the website (from Open Graph or meta tags) |
| images | Images found on the page, usually the cover images used on social media platforms |
| domain | The domain name of the URL |
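As a sketch, assuming the include array accepts the field names listed above (see MetadataEntity for the exact shape), a request that collects titles and descriptions alongside the URLs might look like this:
{
  "urls": ["https://example.com"],
  "include": ["title", "description"]
}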

Use Cases

This feature is useful for:
  • Content audits: Quickly review titles and descriptions across your site
  • SEO analysis: Check for missing or duplicate meta tags
  • Site inventory: Build a comprehensive overview of your website’s content with all relevant metadata
  • Competitive analysis: Analyze how competitors structure their page metadata
Enabling metadata increases processing time.
For more details, see MetadataEntity.

Metrics

Every response includes helpful statistics about the extraction. Here’s what each number means:
Metrics
interface Metrics {
  discovered: number;
  live: number;
  dead: number;
  afterFilters?: number;
  afterLimit?: number;
  processingTimeMs: number;
}
| Metric | Type | Description |
| --- | --- | --- |
| discovered | number | The total number of URLs discovered during the sitemap extraction. |
| live | number | The number of live (reachable) URLs found if URL validation was enabled. |
| dead | number | The number of dead (unreachable) URLs found if URL validation was enabled. |
| afterFilters | number | The number of URLs remaining after applying any specified filters. |
| afterLimit | number | The number of URLs returned after applying the maxResults limit. |
| processingTimeMs | number | The total time taken to process the sitemap extraction, in milliseconds. |
The dead count will only be populated when validateUrls or includeDeadUrls is set to true, or the include property is used in the request; otherwise, it will be 0.
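As a small illustration, a TypeScript helper could read these numbers to flag sites with broken links; the code assumes the response shape shown in the Response Structure section below, where each requested domain maps to an object carrying a metrics field:
// Sketch: summarize extraction metrics per domain.
// Assumes a response of the form { [domain]: { metrics: Metrics, ... } }.
interface Metrics {
  discovered: number;
  live: number;
  dead: number;
  afterFilters?: number;
  afterLimit?: number;
  processingTimeMs: number;
}

function summarizeMetrics(response: Record<string, { metrics: Metrics }>): void {
  for (const [domain, { metrics }] of Object.entries(response)) {
    const seconds = (metrics.processingTimeMs / 1000).toFixed(1);
    console.log(`${domain}: ${metrics.discovered} URLs discovered in ${seconds}s`);
    if (metrics.dead > 0) {
      console.log(`  ${metrics.dead} dead URLs to review`);
    }
  }
}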

Performance

How long does it take to extract a sitemap? It depends on a few things:

Factors Affecting Performance

| Factor | Impact | Recommendation |
| --- | --- | --- |
| URL Validation | High | Only enable validateUrls when you need to identify dead links |
| Metadata Extraction | High | Limit the include array to only the fields you need |
| Number of URLs | Scales with site size | Use maxResults to limit results if you don't need the full sitemap |
| Anti-bot Measures | Variable | Some websites may slow down or block requests; results may be incomplete |
| Missing Sitemap | Medium-High | Sites without sitemap.xml require crawling, which takes longer |

Typical Processing Times

| Scenario | Expected Time |
| --- | --- |
| Basic extraction (URLs only) | A few seconds to 2 minutes |
| With filters applied | A few seconds to 2 minutes |
| With metadata extraction | 2–10 minutes, depending on site size |
| With URL validation | 5–20 minutes, depending on site size |
| Large sites (100k+ URLs) without validation | 2–5 minutes |

Tips for Faster Results

  1. Use filters: Narrow down results to only the URLs you need
  2. Limit results: Set maxResults if you only need a sample
  3. Avoid validation for large sites: Skip validateUrls for initial exploration
  4. Use CDN for large extractions: Set responseType: "cdn" to avoid download bottlenecks
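For example, an initial exploration of a very large site might skip validation entirely and request CDN delivery up front (example.com is a placeholder domain):
{
  "urls": ["https://example.com"],
  "responseType": "cdn"
}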

CDN

For large websites with hundreds of thousands (or even millions) of URLs, we use a CDN (Content Delivery Network) to deliver your results. Instead of trying to send all that data in one big response, we upload the files to fast servers around the world and give you download links.

Why Use CDN?

Imagine trying to download a file with 500,000 URLs all at once: it would take forever, might time out, or could even crash your browser or app. The CDN solves this by:
  • Splitting data into manageable files: No more crashes or timeouts
  • Faster downloads: Files are served from servers close to you
  • Multiple formats: Get your data as TXT or CSV files
To use CDN mode, simply set responseType: "cdn" in your request.

Automatic CDN Routing

Good news: you don’t always need to set responseType: "cdn" manually. If your sitemap has more than 100,000 URLs, we automatically switch to CDN mode. This way, you always get your results without worrying about crashes or timeouts.

Parts

For very large sitemaps, we split the results into multiple parts. Each part contains up to 100,000 URLs.

Why Parts Matter

Imagine a website with 2 million URLs. Loading all of them at once would cause serious problems:
  • Automation tools crash: Tools like n8n, Make, or Zapier can’t handle files with millions of URLs. They’ll freeze or run out of memory.
  • Apps struggle too: Even if you’re building your own app, loading millions of URLs into memory at once is a recipe for crashes.
  • Downloads fail: Huge files take forever to download and often time out.
By splitting into smaller parts (100,000 URLs each), you can:
  • Download files one at a time without crashing
  • Process each part separately
  • Retry failed downloads easily

When Parts Are Used

Parts are used in two situations:
  1. Large sitemaps: When the website has more than 100,000 URLs, we automatically switch to parts.
  2. You request CDN mode: When you set responseType: "cdn", you’ll get parts even for smaller sitemaps.

How Parts Work

When using parts, the response structure changes slightly:
  • The urls array will be empty (since URLs are in the part files)
  • The parts array contains download links to each part file
  • totalParts tells you how many files to download
  • partSize is always 100,000 (max URLs per file)

Parts Response Example

{
  "https://example.com": {
    "live": {
      "totalUrls": 250000,
      "totalParts": 3,
      "partSize": 100000,
      "parts": [
        "https://tedi-cdn.evergreens.ai/.../sitemap-live-part-1.txt",
        "https://tedi-cdn.evergreens.ai/.../sitemap-live-part-2.txt",
        "https://tedi-cdn.evergreens.ai/.../sitemap-live-part-3.txt"
      ],
      "urls": []  // Empty when using parts
    },
    "dead": {
      "totalUrls": 0,
      "urls": []
    },
    "metrics": {
      "discovered": 250000,
      "live": 250000,
      "dead": 0,
      "validated": false,
      "processingTimeMs": 45000
    },
    "fileTypes": ["txt", "csv"],
    "expiresIn": "24 hours",
    "expiresInDate": "2025-12-23T05:35:06.226Z"
  }
}
Download each part file separately; you can process the parts one at a time or combine them later.
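As a rough sketch in TypeScript, assuming the TXT part files contain one URL per line (see Available Formats below), you could download the parts one at a time and merge them like this:
// Sketch: fetch every part file for one category and merge the URLs.
// Assumes TXT parts with one URL per line.
async function downloadParts(partUrls: string[]): Promise<string[]> {
  const urls: string[] = [];
  for (const partUrl of partUrls) {
    // Fetch parts one at a time so a single failed download can be retried.
    const res = await fetch(partUrl);
    if (!res.ok) {
      throw new Error(`Failed to download ${partUrl}: ${res.status}`);
    }
    const text = await res.text();
    urls.push(...text.split("\n").filter((line) => line.trim().length > 0));
  }
  return urls;
}

// Example usage with the response shown above:
// const liveUrls = await downloadParts(response["https://example.com"].live.parts);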

Available Formats

You can download your sitemap data in two formats:
| Format | Best For |
| --- | --- |
| TXT | A simple list with one URL per line; great for scripts and automation |
| CSV | Opens directly in Excel or Google Sheets for easy analysis |

URL Categories

When you enable URL validation, your results are organized into categories:
| Category | What It Contains |
| --- | --- |
| live | URLs that work (pages that load successfully) |
| dead | URLs that are broken or return errors |
This makes it easy to download only what you need: live URLs for your project, or dead URLs for fixing broken links.
If you only need working URLs, download from the live category. Use the dead category when you want to find and fix broken links.

CDN File Expiration

CDN files are available for 24 hours after creation. After that, they’re automatically deleted. Make sure to download your files before they expire!
Files are deleted after 24 hours. If you need to keep the data longer, download and save the files to your computer or cloud storage.

Benefits of CDN Caching

Once we upload your sitemap to the CDN, you get direct download links that work for 24 hours. Here’s why this is great:
  • Share with your team: Send the download links to colleagues; no extra API calls needed
  • Download multiple times: Re-download as many times as you want within 24 hours
  • Choose your format: Download as TXT for automation or CSV for spreadsheets
  • Fast downloads: Files are served from servers near you for quick access
Save the download links from your response; you can use them as many times as you want within 24 hours without making new API requests.

Response Structure

The response format depends on what options you use. Below are examples showing different scenarios.
All responses follow the same basic structure: live URLs, dead URLs (if validation is enabled), and metrics with statistics about the extraction.
{
  "https://example.com": {
    "live": {
      "totalUrls": 2,
      "totalParts": 0,
      "partSize": 100000,
      "parts": [],
      "urls": ["https://example.com/page1", "https://example.com/page2"]
    },
    "dead": {
      "totalUrls": 0,
      "urls": []
    },
    "metrics": {
      "discovered": 2,
      "live": 2,
      "dead": 0,
      "validated": false,
      "processingTimeMs": 1200
    },
    "fileTypes": ["txt"],
    "expiresIn": null,
    "expiresInDate": null
  }
}

Examples

A minimal request that extracts all URLs using the default options:
{
  // No filters, no validation, no metadata ...
  "urls": ["https://example.com"]
}

Limits

While we can handle very large websites, there are some limits to keep in mind:
| Limit | Value | What It Means |
| --- | --- | --- |
| Execution Time | 30 minutes | Each request can run for up to 30 minutes. Very large websites might not complete in time. |
| Maximum URLs | Unlimited (currently) | No limit on how many URLs we can find, but the 30-minute time limit still applies. |
| Maximum Filters | 6 | You can use up to 6 filters per request. Need more? Download all results and filter them yourself. |
| Validation Limit | 6,000 | When checking for dead URLs or getting metadata, we can only process the first 6,000 URLs. Use filters to work around this. |
Need to validate more than 6,000 URLs? Use filters to split your request into smaller batches (e.g., filter by year or section of the website).
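As a sketch of that batching approach (the filter property names are assumptions; see the Filter and Operator definitions for the exact shape), each batch could target one section of the site with a startsWith filter:
{
  "urls": ["https://example.com"],
  "validateUrls": true,
  "filters": [
    { "field": "path", "operator": "startsWith", "value": "/blog/" }
  ]
}
Repeat the request with a different value (for example, /products/) until every section has been validated.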

Edge Cases

Some websites won’t give complete results. Here’s why:
  • Login-protected pages: Sites like Facebook, LinkedIn, and Amazon require you to log in to see certain pages. We can only find publicly visible URLs.
  • JavaScript-heavy sites: Some modern websites load content dynamically with JavaScript, which can make pages harder to discover.
  • Aggressive blocking: Some sites actively block automated tools, which may result in incomplete results.
We are continuously improving our methods to handle these scenarios more effectively in future updates.

Anti-Bot

The Tedi Browser Network component automatically handles common anti-bot protections, including rate limiting. When such measures are detected, Tedi Browser attempts to bypass them to extract the requested content. In rare cases where anti-bot defenses are highly advanced, manual intervention may be necessary.
Internal testing shows a success rate of nearly 98% against standard anti-bot protections.
This ensures that any Browser entity can reliably retrieve data, even from websites with anti-bot measures in place.
You do not need to worry about anti-bot protections when using Tedi Browser, as these are handled automatically. This is one of the key reasons we provide this entity through our API.
It may sometimes take slightly longer to extract content from such protected sites, but Tedi Browser will make every effort to get you the data you need. This protection is built into SitemapEntity and every other browser entity.

Conclusion

The Sitemap API makes it easy to get a complete list of URLs from any website, even sites without a traditional sitemap file. Whether you're doing SEO audits, competitor research, or building a content inventory, you'll get fast, reliable results. Key features:
  • Works on any website, with or without a sitemap.xml file
  • Handles massive sites with millions of URLs
  • Finds broken links automatically
  • Delivers results in easy-to-use formats (TXT and CSV)