The ContentEntity defines the structure for specifying how to extract complete content from web pages using the Tedi Browser network component.

Definition

Below is the definition of the ContentEntity interface:
interface ContentEntity {
  urls: string[];
  format?: ('html' | 'markdown')[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
}
Property        Type                       Description
urls            string[]                   An array of URLs from which to extract content.
format          ('html' | 'markdown')[]    An optional array specifying the desired content formats to be returned for each URL.
excludeHeader   boolean                    An optional flag to exclude headers from the extracted content. Defaults to false.
excludeFooter   boolean                    An optional flag to exclude footers from the extracted content. Defaults to false.
The default format is markdown if not specified.
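The defaults described above can be made explicit in client code. The following is a minimal sketch, not part of the Tedi API: `withDefaults` is a hypothetical helper, and it assumes that only an omitted property (not an empty `format` array) falls back to its default.

```typescript
// Shape copied from the interface definition above.
interface ContentEntity {
  urls: string[];
  format?: ('html' | 'markdown')[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
}

// Hypothetical helper: fills in the documented defaults for omitted properties.
// Assumption: an explicitly provided value, including an empty array, is kept as-is.
function withDefaults(entity: ContentEntity): Required<ContentEntity> {
  return {
    urls: entity.urls,
    format: entity.format ?? ['markdown'],        // documented default format
    excludeHeader: entity.excludeHeader ?? false, // documented default
    excludeFooter: entity.excludeFooter ?? false, // documented default
  };
}

// Only `urls` is given, so `format` resolves to ['markdown'] and both flags to false.
const resolved = withDefaults({ urls: ['https://www.evergreen.media'] });
```

Resolving the defaults on the client is optional; it mainly helps when logging or comparing requests, since the resolved object always has all four properties.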

Examples

Extract content in html format:
{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["html"]
}
Extract content in markdown format:
{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["markdown"]
}
Extract content in both html and markdown formats:
{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["html", "markdown"]
}
Content extraction with headers and footers excluded:
{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["markdown"],
  "excludeHeader": true,
  "excludeFooter": true
}
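Payloads like the ones above can be checked before they are sent. The sketch below is an illustration only: `validateContentEntity` is a hypothetical helper, and the rules that `urls` must be non-empty and that every URL must start with `http://` or `https://` are assumptions, not documented requirements.

```typescript
// Formats taken from the interface definition.
const ALLOWED_FORMATS = ['html', 'markdown'];

// Hypothetical helper: returns a list of problems; an empty list means the payload looks valid.
function validateContentEntity(input: unknown): string[] {
  if (typeof input !== 'object' || input === null) {
    return ['payload must be an object'];
  }
  const entity = input as Record<string, unknown>;
  const errors: string[] = [];

  // Assumption: at least one URL is required and each must be an http(s) URL.
  const urls = entity['urls'];
  if (!Array.isArray(urls) || urls.length === 0) {
    errors.push('urls must be a non-empty array');
  } else if (!urls.every((u) => typeof u === 'string' && /^https?:\/\//i.test(u))) {
    errors.push('every url must be a string starting with http:// or https://');
  }

  // format is optional; when present it may only contain the known formats.
  const format = entity['format'];
  if (format !== undefined) {
    const ok =
      Array.isArray(format) &&
      format.every((f) => typeof f === 'string' && ALLOWED_FORMATS.indexOf(f) !== -1);
    if (!ok) {
      errors.push("format must be an array containing only 'html' or 'markdown'");
    }
  }

  // Both flags are optional booleans.
  for (const flag of ['excludeHeader', 'excludeFooter']) {
    if (entity[flag] !== undefined && typeof entity[flag] !== 'boolean') {
      errors.push(`${flag} must be a boolean`);
    }
  }
  return errors;
}

// The last example above passes the check and yields an empty error list.
const problems = validateContentEntity({
  urls: ['https://www.evergreen.media', 'https://www.tirol.gv.at'],
  format: ['markdown'],
  excludeHeader: true,
  excludeFooter: true,
});
```

Returning a list of messages rather than throwing on the first problem lets a caller report every issue in a payload at once.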

Anti-Bot

The Tedi Browser network component automatically handles common anti-bot protections, including rate limiting. When such measures are detected, Tedi Browser attempts to bypass them and extract the requested content. In rare cases where anti-bot defenses are highly advanced, manual intervention may be necessary.
Internal testing shows a success rate of nearly 98% against standard anti-bot protections.
This ensures that any browser entity can reliably retrieve data even from websites with anti-bot measures in place.
You do not need to handle anti-bot protections yourself when using Tedi Browser, as they are handled automatically. This is one of the key reasons we provide this entity through our API.
Extracting content from protected sites may sometimes take slightly longer, but Tedi Browser will make every effort to return the data you need. Anti-bot handling is included automatically in this entity and in all other browser entities.

Conclusion

The ContentEntity provides a flexible way to specify which web pages to extract content from and in what formats, making it easy to integrate web content extraction into your applications using the Tedi Browser network component.