ContentEntity

The ContentEntity defines the structure for specifying how to extract complete content from web pages using the Tedi Browser network component.

Definition

Below is the definition of the ContentEntity interface:


interface ContentEntity {
  urls: string[];
  format?: ('html' | 'markdown')[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeSelectors?: string[];
}

| Property | Type | Description |
| --- | --- | --- |
| `urls` | `string[]` | An array of URLs from which to extract content. |
| `format` | `('html' \| 'markdown')[]` | An optional array specifying the desired content formats to be returned for each URL. |
| `excludeHeader` | `boolean` | An optional flag to exclude headers from the extracted content. Defaults to `false`. |
| `excludeFooter` | `boolean` | An optional flag to exclude footers from the extracted content. Defaults to `false`. |
| `excludeSelectors` | `string[]` | An optional array of CSS selectors specifying elements to exclude from the extracted content. |

If `format` is not specified, content is returned as markdown.
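As a minimal sketch of how the documented defaults could be made explicit on the client side, the helper below fills in any omitted optional fields. The names `ResolvedContentEntity` and `resolveDefaults` are hypothetical and not part of the Tedi API, and the empty-array default for `excludeSelectors` is an assumption, since no default is documented for it.

```typescript
type ContentFormat = 'html' | 'markdown';

interface ContentEntity {
  urls: string[];
  format?: ContentFormat[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeSelectors?: string[];
}

// Hypothetical helper type: the same fields, with every optional one filled in.
type ResolvedContentEntity = Required<ContentEntity>;

// Hypothetical helper: applies the defaults stated in the table above.
function resolveDefaults(entity: ContentEntity): ResolvedContentEntity {
  return {
    urls: [...entity.urls],
    // Documented default: markdown when no format is specified.
    format: entity.format ?? ['markdown'],
    // Documented defaults: headers and footers are kept unless excluded.
    excludeHeader: entity.excludeHeader ?? false,
    excludeFooter: entity.excludeFooter ?? false,
    // Assumption: no selectors are excluded when the field is omitted.
    excludeSelectors: entity.excludeSelectors ?? [],
  };
}
```

Calling `resolveDefaults({ urls: ['https://example.com'] })` yields an object whose `format` is `['markdown']` and whose two exclusion flags are `false`.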

Examples

Extract content in html format:

{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["html"]
}

Extract content in markdown format:

{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["markdown"]
}

Extract content in both html and markdown formats:

{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["html", "markdown"]
}

Content extraction with headers and footers excluded:

{
  "urls": [
    "https://www.evergreen.media",
    "https://www.tirol.gv.at"
  ],
  "format": ["markdown"],
  "excludeHeader": true,
  "excludeFooter": true
}

Content extraction with custom CSS selectors excluded:

{
  "urls": ["https://example.com"],
  "format": ["markdown"],
  "excludeSelectors": [
    ".header",
    "footer",
    "#someid",
    ".someclass",
    "[data-test=\"value\"]",
    "nav.main-navigation",
    ".sidebar, .ads"
  ]
}
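Before sending a payload like the ones above, it can help to catch obvious mistakes locally. The following is a rough sketch of such a client-side check; `validateContentEntity` is a hypothetical helper, and the rules it enforces (at least one URL, absolute http or https URLs only, no blank selectors) are assumptions that may differ from what the service itself validates.

```typescript
interface ContentEntity {
  urls: string[];
  format?: ('html' | 'markdown')[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
  excludeSelectors?: string[];
}

// Hypothetical client-side check; returns a list of human-readable problems.
// An empty list means no problem was found by these (assumed) rules.
function validateContentEntity(entity: ContentEntity): string[] {
  const problems: string[] = [];

  if (entity.urls.length === 0) {
    problems.push('urls must contain at least one URL');
  }

  for (const raw of entity.urls) {
    try {
      // The URL constructor throws on strings that are not absolute URLs.
      const parsed = new URL(raw);
      if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        problems.push(`unsupported protocol in URL: ${raw}`);
      }
    } catch {
      problems.push(`not an absolute URL: ${raw}`);
    }
  }

  for (const selector of entity.excludeSelectors ?? []) {
    if (selector.trim() === '') {
      problems.push('excludeSelectors contains an empty selector');
    }
  }

  return problems;
}
```

This sketch does not attempt to parse the CSS selectors themselves; a selector list such as `.sidebar, .ads` is passed through unchanged.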

Anti-Bot

The Tedi Browser network component automatically handles common anti-bot protections, including rate limiting. When it detects such measures, it attempts to bypass them and extract the requested content. In rare cases where the defenses are highly advanced, manual intervention may be necessary.

Internal testing shows a success rate of nearly 98% against standard anti-bot protections.

This ensures that any browser entity can reliably retrieve data, even from websites with anti-bot measures in place.

You do not need to worry about anti-bot protections when using Tedi Browser, as these are handled automatically. This is one of the key reasons we provide this entity through our API.

Extracting content from protected sites may take slightly longer, but Tedi Browser will make every effort to retrieve the data you need. Anti-bot handling is included automatically in every browser entity, including this one.

Conclusion

The ContentEntity provides a flexible way to specify which web pages to extract content from and in what formats, making it easy to integrate web content extraction into your applications using the Tedi Browser network component.
