ContentEntity defines the structure for specifying how to extract complete content from web pages using the Tedi Browser network component.
Definition
Below is the definition of the ContentEntity interface:
ContentEntity
| Property | Type | Description |
|---|---|---|
| urls | string[] | An array of URLs from which to extract content. |
| format | ('html' \| 'markdown')[] | An optional array specifying the content formats to return for each URL. Defaults to markdown if not specified. |
| excludeHeader | boolean | An optional flag to exclude headers from the extracted content. Defaults to false. |
| excludeFooter | boolean | An optional flag to exclude footers from the extracted content. Defaults to false. |
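The properties above can be read as a TypeScript interface. The following is a sketch reconstructed from the table, not the official type declaration; the `ContentFormat` alias and the `withDefaults` helper are hypothetical names introduced here only to make the documented defaults explicit.

```typescript
// Reconstructed from the property table; not the official declaration.
type ContentFormat = 'html' | 'markdown'; // hypothetical alias

interface ContentEntity {
  /** URLs from which to extract content. */
  urls: string[];
  /** Formats to return for each URL; markdown if omitted. */
  format?: ContentFormat[];
  /** Exclude headers from the extracted content; false if omitted. */
  excludeHeader?: boolean;
  /** Exclude footers from the extracted content; false if omitted. */
  excludeFooter?: boolean;
}

// Hypothetical helper (not part of the API): fills in the documented defaults.
function withDefaults(entity: ContentEntity): Required<ContentEntity> {
  return {
    urls: entity.urls,
    format: entity.format ?? ['markdown'],
    excludeHeader: entity.excludeHeader ?? false,
    excludeFooter: entity.excludeFooter ?? false,
  };
}
```

Only `urls` is required; the three remaining properties are marked optional because the table describes each of them as optional.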
Examples
Extract content in html format:
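A minimal sketch of an html-only request object. The `ContentEntity` type is re-declared locally from the property table so the snippet stands alone, and `https://example.com` is a placeholder URL; how the object is submitted to Tedi Browser is not shown here.

```typescript
// Local re-declaration based on the property table (an assumption, not the official type).
interface ContentEntity {
  urls: string[];
  format?: ('html' | 'markdown')[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
}

// Request the content of a single page as HTML only.
const htmlRequest: ContentEntity = {
  urls: ['https://example.com'], // placeholder URL
  format: ['html'],
};
```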
Extract content in markdown format:
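For markdown output, a comparable sketch follows. As before, the locally declared `ContentEntity` type and the placeholder URLs are assumptions made for illustration. Because markdown is the documented default, leaving out `format` should request the same output.

```typescript
// Local re-declaration based on the property table (an assumption, not the official type).
interface ContentEntity {
  urls: string[];
  format?: ('html' | 'markdown')[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
}

// Request markdown explicitly for two pages.
const markdownRequest: ContentEntity = {
  urls: ['https://example.com', 'https://example.org'], // placeholder URLs
  format: ['markdown'],
};

// Same pages with format omitted, relying on the documented markdown default.
const defaultFormatRequest: ContentEntity = {
  urls: markdownRequest.urls,
};
```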
Extract content in both html and markdown formats:
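A sketch that requests both formats at once. The local `ContentEntity` declaration and the placeholder URL are assumptions. The two exclude flags are set here only to show where they go and are not required for a two-format request.

```typescript
// Local re-declaration based on the property table (an assumption, not the official type).
interface ContentEntity {
  urls: string[];
  format?: ('html' | 'markdown')[];
  excludeHeader?: boolean;
  excludeFooter?: boolean;
}

// Request both HTML and markdown for the same page.
const bothFormatsRequest: ContentEntity = {
  urls: ['https://example.com'], // placeholder URL
  format: ['html', 'markdown'],
  excludeHeader: true, // optional: drop page headers from the result
  excludeFooter: true, // optional: drop page footers from the result
};
```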
Anti-Bot
The Tedi Browser network component automatically handles common anti-bot protections, including rate limiting. When such measures are detected, Tedi Browser attempts to bypass them to extract the requested content. In rare cases where anti-bot defenses are highly advanced, manual intervention may be necessary. This ensures that any browser entity can reliably retrieve data even from websites with anti-bot measures in place.
You do not need to handle anti-bot protections yourself when using Tedi Browser; this is one of the key reasons we provide this entity through our API. Extraction from protected sites may sometimes take slightly longer, but Tedi Browser will make every effort to get you the data you need. This feature is included automatically in this entity and in all other browser entities.
Conclusion
The ContentEntity provides a flexible way to specify which web pages to extract content from and in what formats, making it easy to integrate web content extraction into your applications using the Tedi Browser network component.
