The SitemapEntity helps you extract all URLs from any website, even if the website doesn't have a traditional sitemap.xml file. Think of it as a smart crawler that finds every page on a website and organizes them into a clean, easy-to-use list. Whether you need to audit a website's structure, analyze competitors, or build a content inventory, this entity does the heavy lifting for you. It works fast and reliably, even on websites with thousands of pages.

Definition

Here are all the options you can use when making a sitemap request:
SitemapEntity
interface SitemapEntity {
  sort: "asc" | "desc";
  type: "absolute" | "relative";
  urls: string[];
  pathCase?: "lowercase" | "uppercase" | "capitalize" | "as-is";
  validateUrls?: boolean;
  includeDeadUrls?: boolean;
  include?: MetadataEntity[];
  maxResults?: number | undefined;
  filters?: FilterEntity[];
  responseType?: "json" | "cdn";
}
| Property | Type | Default | Description |
| --- | --- | --- | --- |
| sort | 'asc' or 'desc' | 'asc' | Specifies the sorting order of URLs in the sitemap. Use 'asc' for ascending order and 'desc' for descending order. |
| type | 'absolute' or 'relative' | 'absolute' | Determines whether URLs are returned as absolute (complete URLs) or relative (domain-relative paths). |
| urls | string[] | - | An array of URLs from which to extract sitemap data. |
| pathCase | 'lowercase', 'uppercase', 'capitalize', or 'as-is' | 'as-is' | Specifies the casing for the sitemap URLs: convert to lowercase, uppercase, or capitalized, or retain the original casing. |
| validateUrls | boolean | false | If set to true, the system validates each extracted URL to ensure it is reachable, such as detecting broken links or dead ends. |
| includeDeadUrls | boolean | false | When validateUrls is enabled, setting this to true includes URLs found to be dead or unreachable in the sitemap results. |
| include | MetadataEntity[] | - | An optional array specifying additional metadata fields to include for each sitemap URL. See MetadataEntity for available fields. |
| maxResults | number | - | Specifies the maximum number of URLs to include in the sitemap results. |
| filters | FilterEntity[] | - | An optional array of filter objects to refine the sitemap results based on specific criteria. |
| responseType | 'json' or 'cdn' | 'json' | Specifies the response type. Use 'json' for a standard JSON response or 'cdn' to utilize CDN Workers for efficient handling of large sitemaps. |
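For example, a basic request that extracts absolute URLs in ascending order and caps the output at 500 results could look like this (example.com is a placeholder domain):
{
  "urls": ["https://example.com"],
  "sort": "asc",
  "type": "absolute",
  "maxResults": 500
}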

Filters

Filters help you narrow down the results to only the URLs you care about. For example, you might want only blog posts, product pages, or URLs from a specific year. Here’s how filters work:
interface FilterEntity extends Filter, Operator {}

Filter Fields

| Field | Description | Example |
| --- | --- | --- |
| url | Matches against the complete URL, including the domain | https://example.com/blog/post-1 |
| path | Matches against only the path portion of the URL | /blog/post-1 |

Filter Operators

| Operator | Description | Example Use Case |
| --- | --- | --- |
| equals | Exact match | Find a specific URL |
| notEquals | Excludes exact matches | Exclude a specific page |
| contains | Partial match anywhere in the string | Find all URLs containing "blog" |
| notContains | Excludes partial matches | Exclude URLs containing "admin" |
| startsWith | Matches the beginning of the string | Find URLs starting with "/products/" |
| endsWith | Matches the end of the string | Find URLs ending with ".html" |

Combining Filters

You can use multiple filters together:
  • AND: The URL must match ALL filters (this is the default)
  • OR: The URL must match ANY of the filters
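As a sketch only (the exact filter shape comes from the Filter and Operator definitions, so treat the property names below as assumptions), two filters that must both match under the default AND behavior might look like this:
{
  "filters": [
    { "field": "path", "operator": "startsWith", "value": "/blog/" },
    { "field": "path", "operator": "notContains", "value": "draft" }
  ]
}
Here, only URLs whose path starts with /blog/ and does not contain "draft" are kept.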

Dead URLs

Websites change over time: pages get deleted, URLs get restructured, or content moves to new locations. When website owners forget to update their sitemaps, they end up with links pointing to pages that no longer exist (also known as "broken links" or "404 errors"). Our system can find these dead URLs for you when any of the following applies:
  • The validateUrls property is set to true
  • The includeDeadUrls option is enabled
  • The include property is used (metadata extraction requires URL validation)
This feature is particularly useful for:
  • SEO audits: Identify and fix broken links that may harm your search rankings
  • Website maintenance: Keep your sitemap clean and accurate
  • Migration projects: Verify that all old URLs redirect properly to new locations
Using the validateUrls option may increase processing time, especially for large sitemaps.
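For instance, to validate every discovered URL and keep the unreachable ones in the output, a request could look like this (example.com is a placeholder domain):
{
  "urls": ["https://example.com"],
  "validateUrls": true,
  "includeDeadUrls": true
}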
When validateUrls is enabled, you will receive a response like this:
{
  "https://example.com": {
    "urls": ["https://example.com/page1", "https://example.com/page2"],
    "deadUrls": ["https://example.com/old-page"],
    "metrics": {...}
  }
}

Metadata

By default, you only get a list of URLs. But with the include option, you can also get extra information about each page, like its title, description, and images. This turns a simple URL list into a complete content inventory.

Available Metadata Fields

| Field | Description |
| --- | --- |
| title | The page's HTML title tag |
| description | The meta description of the page |
| favicon | The website's favicon URL |
| url | The full URL of the page |
| sitename | The name of the website (from Open Graph or meta tags) |
| images | Images found on the page, usually the cover images used on social media platforms |
| domain | The domain name of the URL |
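As a sketch, assuming the include array accepts the field names listed above (see MetadataEntity for the exact shape), a request that collects titles and descriptions alongside the URLs might look like this:
{
  "urls": ["https://example.com"],
  "include": ["title", "description"]
}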

Use Cases

This feature is useful for:
  • Content audits: Quickly review titles and descriptions across your site
  • SEO analysis: Check for missing or duplicate meta tags
  • Site inventory: Build a comprehensive overview of your website’s content with all relevant metadata
  • Competitive analysis: Analyze how competitors structure their page metadata
Enabling metadata increases processing time.
For more details, see MetadataEntity.

Metrics

Every response includes helpful statistics about the extraction. Here’s what each number means:
Metrics
interface Metrics {
  discovered: number;
  live: number;
  dead: number;
  afterFilters?: number;
  afterLimit?: number;
  processingTimeMs: number;
}
| Metric | Type | Description |
| --- | --- | --- |
| discovered | number | The total number of URLs discovered during the sitemap extraction. |
| live | number | The number of live (reachable) URLs found if URL validation was enabled. |
| dead | number | The number of dead (unreachable) URLs found if URL validation was enabled. |
| afterFilters | number | The number of URLs remaining after applying any specified filters. |
| afterLimit | number | The number of URLs returned after applying the maxResults limit. |
| processingTimeMs | number | The total time taken to process the sitemap extraction, in milliseconds. |
The dead count will only be populated when validateUrls or includeDeadUrls is set to true, or the include property is used in the request; otherwise, it will be 0.
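As a small illustration, a TypeScript helper could read these numbers to flag sites with broken links; the code assumes the response shape shown in the Response Structure section below, where each requested domain maps to an object carrying a metrics field:
// Sketch: summarize extraction metrics per domain.
// Assumes a response of the form { [domain]: { metrics: Metrics, ... } }.
interface Metrics {
  discovered: number;
  live: number;
  dead: number;
  afterFilters?: number;
  afterLimit?: number;
  processingTimeMs: number;
}

function summarizeMetrics(response: Record<string, { metrics: Metrics }>): void {
  for (const [domain, { metrics }] of Object.entries(response)) {
    const seconds = (metrics.processingTimeMs / 1000).toFixed(1);
    console.log(`${domain}: ${metrics.discovered} URLs discovered in ${seconds}s`);
    if (metrics.dead > 0) {
      console.log(`  ${metrics.dead} dead URLs to review`);
    }
  }
}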

Performance

How long does it take to extract a sitemap? It depends on a few things:

Factors Affecting Performance

| Factor | Impact | Recommendation |
| --- | --- | --- |
| URL Validation | High | Only enable validateUrls when you need to identify dead links |
| Metadata Extraction | High | Limit the include array to only the fields you need |
| Number of URLs | Scales with site size | Use maxResults to limit results if you don't need the full sitemap |
| Anti-bot Measures | Variable | Some websites may slow down or block requests; results may be incomplete |
| Missing Sitemap | Medium-High | Sites without sitemap.xml require crawling, which takes longer |

Typical Processing Times

| Scenario | Expected Time |
| --- | --- |
| Basic extraction (URLs only) | A few seconds to 2 minutes |
| With filters applied | A few seconds to 2 minutes |
| With metadata extraction | 2–10 minutes, depending on site size |
| With URL validation | 5–20 minutes, depending on site size |
| Large sites (100k+ URLs) without validation | 2–5 minutes |

Tips for Faster Results

  1. Use filters: Narrow down results to only the URLs you need
  2. Limit results: Set maxResults if you only need a sample
  3. Avoid validation for large sites: Skip validateUrls for initial exploration
  4. Use CDN for large extractions: Set responseType: "cdn" to avoid download bottlenecks
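For example, an initial exploration of a very large site might skip validation entirely and request CDN delivery up front (example.com is a placeholder domain):
{
  "urls": ["https://example.com"],
  "responseType": "cdn"
}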

CDN

For large websites with hundreds of thousands (or even millions) of URLs, we use a CDN (Content Delivery Network) to deliver your results. Instead of trying to send all that data in one big response, we upload the files to fast servers around the world and give you download links.

Why Use CDN?

Imagine trying to download a file with 500,000 URLs all at once: it would take forever, might time out, or could even crash your browser or app. The CDN solves this by:
  • Splitting data into manageable files: No more crashes or timeouts
  • Faster downloads: Files are served from servers close to you
  • Multiple formats: Get your data as TXT or CSV files
To use CDN mode, simply set responseType: "cdn" in your request.

Automatic CDN Routing

Good news: you don’t always need to set responseType: "cdn" manually. If your sitemap has more than 100,000 URLs, we automatically switch to CDN mode. This way, you always get your results without worrying about crashes or timeouts.

Parts

For very large sitemaps, we split the results into multiple parts. Each part contains up to 100,000 URLs.

Why Parts Matter

Imagine a website with 2 million URLs. Loading all of them at once would cause serious problems:
  • Automation tools crash: Tools like n8n, Make, or Zapier can’t handle files with millions of URLs. They’ll freeze or run out of memory.
  • Apps struggle too: Even if you’re building your own app, loading millions of URLs into memory at once is a recipe for crashes.
  • Downloads fail: Huge files take forever to download and often time out.
By splitting into smaller parts (100,000 URLs each), you can:
  • Download files one at a time without crashing
  • Process each part separately
  • Retry failed downloads easily

When Parts Are Used

Parts are used in two situations:
  1. Large sitemaps: When the website has more than 100,000 URLs, we automatically switch to parts.
  2. You request CDN mode: When you set responseType: "cdn", you’ll get parts even for smaller sitemaps.

How Parts Work

When using parts, the response structure changes slightly:
  • The urls array will be empty (since URLs are in the part files)
  • The parts array contains download links to each part file
  • totalParts tells you how many files to download
  • partSize is always 100,000 (max URLs per file)

Parts Response Example

{
  "https://example.com": {
    "live": {
      "totalUrls": 250000,
      "totalParts": 3,
      "partSize": 100000,
      "parts": [
        "https://tedi-cdn.evergreens.ai/.../sitemap-live-part-1.txt",
        "https://tedi-cdn.evergreens.ai/.../sitemap-live-part-2.txt",
        "https://tedi-cdn.evergreens.ai/.../sitemap-live-part-3.txt"
      ],
      "urls": []  // Empty when using parts
    },
    "dead": {
      "totalUrls": 0,
      "urls": []
    },
    "metrics": {
      "discovered": 250000,
      "live": 250000,
      "dead": 0,
      "validated": false,
      "processingTimeMs": 45000
    },
    "fileTypes": ["txt", "csv"],
    "expiresIn": "24 hours",
    "expiresInDate": "2025-12-23T05:35:06.226Z"
  }
}
Download each part file separately; you can process the parts one at a time or combine them later.
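As a rough sketch in TypeScript, assuming the TXT part files contain one URL per line (see Available Formats below), you could download the parts one at a time and merge them like this:
// Sketch: fetch every part file for one category and merge the URLs.
// Assumes TXT parts with one URL per line.
async function downloadParts(partUrls: string[]): Promise<string[]> {
  const urls: string[] = [];
  for (const partUrl of partUrls) {
    // Fetch parts one at a time so a single failed download can be retried.
    const res = await fetch(partUrl);
    if (!res.ok) {
      throw new Error(`Failed to download ${partUrl}: ${res.status}`);
    }
    const text = await res.text();
    urls.push(...text.split("\n").filter((line) => line.trim().length > 0));
  }
  return urls;
}

// Example usage with the response shown above:
// const liveUrls = await downloadParts(response["https://example.com"].live.parts);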

Available Formats

You can download your sitemap data in two formats:
| Format | Best For |
| --- | --- |
| TXT | A simple list with one URL per line; great for scripts and automation |
| CSV | Opens directly in Excel or Google Sheets for easy analysis |

URL Categories

When you enable URL validation, your results are organized into categories:
| Category | What It Contains |
| --- | --- |
| live | URLs that work (pages that load successfully) |
| dead | URLs that are broken or return errors |
This makes it easy to download only what you need: live URLs for your project, or dead URLs for fixing broken links.
If you only need working URLs, download from the live category. Use the dead category when you want to find and fix broken links.

CDN File Expiration

CDN files are available for 24 hours after creation. After that, they’re automatically deleted. Make sure to download your files before they expire!
Files are deleted after 24 hours. If you need to keep the data longer, download and save the files to your computer or cloud storage.

Benefits of CDN Caching

Once we upload your sitemap to the CDN, you get direct download links that work for 24 hours. Here’s why this is great:
  • Share with your team: Send the download links to colleagues; no extra API calls needed
  • Download multiple times: Re-download as many times as you want within 24 hours
  • Choose your format: Download as TXT for automation or CSV for spreadsheets
  • Fast downloads: Files are served from servers near you for quick access
Save the download links from your response; you can use them as many times as you want within 24 hours without making new API requests.

Response Structure

The response format depends on what options you use. Below are examples showing different scenarios.
All responses follow the same basic structure: live URLs, dead URLs (if validation is enabled), and metrics with statistics about the extraction.
{
  "https://example.com": {
    "live": {
      "totalUrls": 2,
      "totalParts": 0,
      "partSize": 100000,
      "parts": [],
      "urls": ["https://example.com/page1", "https://example.com/page2"]
    },
    "dead": {
      "totalUrls": 0,
      "urls": []
    },
    "metrics": {
      "discovered": 2,
      "live": 2,
      "dead": 0,
      "validated": false,
      "processingTimeMs": 1200
    },
    "fileTypes": ["txt"],
    "expiresIn": null,
    "expiresInDate": null
  }
}

Examples

A minimal request that extracts all URLs using the default options:
{
  // No filters, no validation, no metadata ...
  "urls": ["https://example.com"]
}

Limits

While we can handle very large websites, there are some limits to keep in mind:
| Limit | Value | What It Means |
| --- | --- | --- |
| Execution Time | 30 minutes | Each request can run for up to 30 minutes. Very large websites might not complete in time. |
| Maximum URLs | Unlimited (currently) | No limit on how many URLs we can find, but the 30-minute time limit still applies. |
| Maximum Filters | 6 | You can use up to 6 filters per request. Need more? Download all results and filter them yourself. |
| Validation Limit | 6,000 | When checking for dead URLs or getting metadata, we can only process the first 6,000 URLs. Use filters to work around this. |
Need to validate more than 6,000 URLs? Use filters to split your request into smaller batches (e.g., filter by year or section of the website).
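As a sketch of that batching approach (the filter property names are assumptions; see the Filter and Operator definitions for the exact shape), each batch could target one section of the site with a startsWith filter:
{
  "urls": ["https://example.com"],
  "validateUrls": true,
  "filters": [
    { "field": "path", "operator": "startsWith", "value": "/blog/" }
  ]
}
Repeat the request with a different value (for example, /products/) until every section has been validated.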

Edge Cases

Some websites won’t give complete results. Here’s why:
  • Login-protected pages: Sites like Facebook, LinkedIn, and Amazon require you to log in to see certain pages. We can only find publicly visible URLs.
  • JavaScript-heavy sites: Some modern websites load content dynamically with JavaScript, which can make pages harder to discover.
  • Aggressive blocking: Some sites actively block automated tools, which may result in incomplete results.
We are continuously improving our methods to handle these scenarios more effectively in future updates.

Anti-Bot

The Tedi Browser Network component automatically handles common anti-bot protections, including rate limiting. When such measures are detected, Tedi Browser attempts to bypass them to extract the requested content. In rare cases where anti-bot defenses are highly advanced, manual intervention may be necessary.
Internal testing shows a success rate of nearly 98% against standard anti-bot protections.
This ensures that any Browser entity can reliably retrieve data, even from websites with anti-bot measures in place.
You do not need to worry about anti-bot protections when using Tedi Browser, as these are handled automatically. This is one of the key reasons we provide this entity through our API.
It may sometimes take slightly longer to extract content from such protected sites, but Tedi Browser will make every effort to get you the data you need. This protection is built into SitemapEntity and every other browser entity.

Conclusion

The Sitemap API makes it easy to get a complete list of URLs from any website, even sites without a traditional sitemap file. Whether you're doing SEO audits, competitor research, or building a content inventory, you'll get fast, reliable results. Key features:
  • Works on any website, with or without a sitemap.xml file
  • Handles massive sites with millions of URLs
  • Finds broken links automatically
  • Delivers results in easy-to-use formats (TXT and CSV)