Config Files

For each DocSearch request we receive, we create a custom JSON configuration file that defines how the crawler should behave. You can find all the configs in this repository.

A DocSearch config looks like this:

{
"index_name": "example",
"start_urls": ["https://www.example.com/docs"],
"selectors": {
"lvl0": "#content header h1",
"lvl1": "#content article h1",
"lvl2": "#content section h3",
"lvl3": "#content section h4",
"lvl4": "#content section h5",
"lvl5": "#content section h6",
"text": "#content header p,#content section p,#content section ol"
}
}

index_name

This is the name of the Algolia index where your records will be pushed. The apiKey we will share with you is restricted to work with this index and is a search-only key.

When using the free DocSearch crawler, the indexName will always be the name of the configuration file. If you're running DocSearch yourself, you can use any name you'd like.

{
"index_name": "example"
}

When the DocSearch scraper runs, it builds a temporary index. Once scraping is complete, it moves that index to the name specified by index_name (replacing the existing index).

By default, the name of the temporary index is the value of index_name + _tmp.

To use a different name, set the INDEX_NAME_TMP environment variable to a different value. This variable can be set in the .env file alongside APPLICATION_ID and API_KEY.
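The naming rule above can be sketched in a few lines of Python. This is only an illustration, not the scraper's actual code: the function name `temporary_index_name` is hypothetical, and only `INDEX_NAME_TMP` and the `_tmp` suffix come from the text.

```python
import os


def temporary_index_name(index_name: str) -> str:
    """Return the name of the temporary index used while scraping."""
    # INDEX_NAME_TMP, when set and non-empty, overrides "<index_name>_tmp".
    return os.environ.get("INDEX_NAME_TMP") or f"{index_name}_tmp"
```

With no override set, a config named example would scrape into example_tmp before the index is moved to example.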

start_urls

This array contains the list of URLs that will be used to start crawling your website. The crawler will recursively follow any links (<a> tags) from those pages. It will not follow links that are on another domain, and it will never follow links matching stop_urls.

{
"start_urls": ["https://www.example.com/docs"]
}

selectors_key, tailor your selectors

You can define finer sets of selectors depending on the URL. To do so, set the selectors_key parameter on the relevant start_urls items.

{
"start_urls": [
{
"url": "http://www.example.com/docs/faq/",
"selectors_key": "faq"
},
{
"url": "http://www.example.com/docs/"
}
],
"selectors": {
"default": {
"lvl0": ".docs h1",
"lvl1": ".docs h2",
"lvl2": ".docs h3",
"lvl3": ".docs h4",
"lvl4": ".docs h5",
"text": ".docs p, .docs li"
},
"faq": {
"lvl0": ".faq h1",
"lvl1": ".faq h2",
"lvl2": ".faq h3",
"lvl3": ".faq h4",
"lvl4": ".faq h5",
"text": ".faq p, .faq li"
}
}
}

To find the right subset to use based on the URL, the scraper iterates over these start_urls items. Only the first one to match is applied.

Consider the URL http://www.example.com/doc/faq/ with the following configuration:

{
"start_urls": [
{
"url": "http://www.example.com/doc/",
"selectors_key": "doc"
},
{
"url": "http://www.example.com/doc/faq/",
"selectors_key": "faq"
}
]
}

Only the set of selectors related to doc will be applied to the URL, because the doc item is listed first and already matches. The correct configuration lists the more specific item first, as in the first example of this section.

If one start_urls item has no selectors_key defined, the default set will be used. Do not forget to set this fallback set of selectors.
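The first-match rule can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `pick_selectors_key` is hypothetical, and matching each url with `re.search` is an assumption about how the scraper compares URLs.

```python
import re


def pick_selectors_key(page_url: str, start_urls: list) -> str:
    """Return the selectors_key of the first start_urls item matching page_url."""
    for item in start_urls:
        # Items may be plain strings or objects with a "url" key.
        pattern = item if isinstance(item, str) else item["url"]
        if re.search(pattern, page_url):
            if isinstance(item, str):
                return "default"
            # A matching item without selectors_key falls back to "default".
            return item.get("selectors_key", "default")
    return "default"


wrong_order = [
    {"url": "http://www.example.com/doc/", "selectors_key": "doc"},
    {"url": "http://www.example.com/doc/faq/", "selectors_key": "faq"},
]
right_order = list(reversed(wrong_order))
page = "http://www.example.com/doc/faq/billing.html"  # hypothetical page

print(pick_selectors_key(page, wrong_order))  # doc
print(pick_selectors_key(page, right_order))  # faq
```

Listing the most specific URL first is what makes the faq selectors reachable.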

Using regular expressions

The start_urls and stop_urls options also accept regular expressions, which let you express more complex patterns. When a start_urls item is written as an object, it must contain at least a url key targeting a reachable page.

You can also define a variables key that will be injected into your specific URL pattern. The following example makes this variable feature clearer:

{
"start_urls": [
{
"url": "http://www.example.com/docs/(?P<lang>.*?)/(?P<version>.*?)/",
"variables": {
"lang": ["en", "fr"],
"version": ["latest", "3.3", "3.2"]
}
}
]
}

The beneficial side effect of using this syntax is that every record extracted from pages matching http://www.example.com/docs/en/latest will have the attributes lang: en and version: latest. This lets you filter on these attributes with facetFilters.
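To make the mechanics concrete, here is a rough Python sketch of how such a pattern could be expanded into concrete start URLs and how the named groups could be read back from a crawled URL. The helper names `expand_start_urls` and `extract_attributes` are hypothetical, and the scraper's real implementation may differ.

```python
import itertools
import re

PATTERN = r"http://www.example.com/docs/(?P<lang>.*?)/(?P<version>.*?)/"
VARIABLES = {"lang": ["en", "fr"], "version": ["latest", "3.3", "3.2"]}


def expand_start_urls(pattern: str, variables: dict) -> list:
    """Build one concrete URL per combination of variable values."""
    names = list(variables)
    urls = []
    for values in itertools.product(*(variables[name] for name in names)):
        url = pattern
        for name, value in zip(names, values):
            # Replace the named group, e.g. (?P<lang>.*?), with its value.
            url = url.replace(f"(?P<{name}>.*?)", value)
        urls.append(url)
    return urls


def extract_attributes(pattern: str, page_url: str) -> dict:
    """Return the named groups captured from a page URL, or {} if no match."""
    match = re.match(pattern, page_url)
    return match.groupdict() if match else {}


print(expand_start_urls(PATTERN, VARIABLES)[0])
print(extract_attributes(PATTERN, "http://www.example.com/docs/en/latest/intro.html"))
```

Two languages and three versions give six start URLs, and each crawled page yields a lang and version pair that can be stored on its records.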

The following example shows how the UI filters results matching a specific language and version.

docsearch({
// ...
algoliaOptions: {
'facetFilters': ["lang:en", "version:latest"]
},
// ...
});

Using custom tags

You can also apply custom tags to some pages without the need to use regular expressions. In that case, add the list of tags to the tags key. Note that those tags will be automatically added as facets in Algolia, allowing you to filter based on their values as well.

{
"start_urls": [
{
"url": "http://www.example.com/docs/concepts/",
"tags": ["concepts", "terminology"]
}
]
}

From the JS snippet:

docsearch({
// ...
algoliaOptions: {
'facetFilters': ["tags:concepts"]
},
});

Using Page Rank

Use page_rank to give more weight to some pages. This parameter boosts the records built from the page: pages with a higher page_rank are returned before pages with a lower page_rank. Note that you can pass any numeric value, including negative values.

{
"start_urls": [
{
"url": "http://www.example.com/docs/concepts/",
"page_rank": 5
},
{
"url": "http://www.example.com/docs/contributors/",
"page_rank": 1
}
]
}

In this example, records built from the Concepts page will be ranked higher than results extracted from the Contributors page.
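As a rough illustration of the ordering only (the actual ranking is performed by Algolia alongside its other criteria), the effect of page_rank on otherwise equally relevant records can be sketched like this. The record shape, the changelog URL, and the default of 0 for a missing page_rank are assumptions made for the example.

```python
records = [
    {"url": "http://www.example.com/docs/contributors/", "page_rank": 1},
    {"url": "http://www.example.com/docs/concepts/", "page_rank": 5},
    {"url": "http://www.example.com/docs/changelog/", "page_rank": -2},
]


def order_by_page_rank(records: list) -> list:
    """Sort records so that higher page_rank values come first."""
    # Records without page_rank are treated as 0 (an assumption for this sketch).
    return sorted(records, key=lambda record: record.get("page_rank", 0), reverse=True)


for record in order_by_page_rank(records):
    print(record["page_rank"], record["url"])
```

A negative value pushes a page below pages that have no explicit page_rank in this sketch.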

Using custom selectors per page

If the markup of your website is so different from one page to another that you can't have generic selectors, you can namespace your selectors and specify which set of selectors should be applied to specific pages.

{
"start_urls": [
"http://www.example.com/docs/",
{
"url": "http://www.example.com/docs/concepts/",
"selectors_key": "concepts"
},
{
"url": "http://www.example.com/docs/contributors/",
"selectors_key": "contributors"
}
],
"selectors": {
"default": {
"lvl0": ".main h1",
"lvl1": ".main h2",
"lvl2": ".main h3",
"lvl3": ".main h4",
"lvl4": ".main h5",
"text": ".main p"
},
"concepts": {
"lvl0": ".header h2",
"lvl1": ".main h1.title",
"lvl2": ".main h2.title",
"lvl3": ".main h3.title",
"lvl4": ".main h5.title",
"text": ".main p"
},
"contributors": {
"lvl0": ".main h1",
"lvl1": ".contributors .name",
"lvl2": ".contributors .title",
"text": ".contributors .description"
}
}
}

Here, all documentation pages will use the selectors defined in selectors.default, while the pages under ./concepts will use selectors.concepts and those under ./contributors will use selectors.contributors.

selectors

This object contains all the CSS selectors that will be used to create the record hierarchy. It can contain up to 6 levels (lvl0, lvl1, lvl2, lvl3, lvl4, lvl5) and text.

A default config would be to target the page title or h1 as lvl0, the h2 as lvl1, h3 as lvl2, and p as text, but this is highly dependent on the markup.

The text key is mandatory, but we highly recommend also setting lvl0, lvl1 and lvl2 to have a decent depth of relevance.

{
"selectors": {
"lvl0": "#content header h1",
"lvl1": "#content article h1",
"lvl2": "#content section h3",
"lvl3": "#content section h4",
"lvl4": "#content section h5",
"lvl5": "#content section h6",
"text": "#content header p,#content section p,#content section ol"
}
}

Selectors can be passed as strings, or as objects containing a selector key. Other special keys can be set, as documented below.

{
"selectors": {
"lvl0": {
"selector": "#content header h1"
}
}
}
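Since both forms are accepted, a consumer of the config would typically normalize them to one shape first. The sketch below is a hypothetical helper, not the scraper's code: the name `normalize_selector` and the defaults for type and global are assumptions.

```python
def normalize_selector(value):
    """Return a selector definition as a dict with explicit keys."""
    if isinstance(value, str):
        # A plain string is shorthand for {"selector": "<string>"}.
        value = {"selector": value}
    # Defaults are assumptions for this sketch: CSS selectors, not global.
    normalized = {"type": "css", "global": False}
    normalized.update(value)
    return normalized


print(normalize_selector("#content header h1"))
print(normalize_selector({"selector": "#content header h1", "global": True}))
```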

Using global selectors

The default way of extracting content through selectors is to read the HTML markup from top to bottom. This works well with semi-structured content, like a hierarchy of headers. This breaks when the relevant information is not part of the same flow, for example when the title is not part of a header or sidebar.

For that reason, you can set a selector as global, meaning that it will match on the whole page and will be the same for all records extracted from this page.

{
"selectors": {
"lvl0": {
"selector": "#content header h1",
"global": true
}
}
}

We do not recommend text selectors to be global.

Setting a default value

If a selector doesn't match a valid element on the page, you can define a default_value as a fallback.

{
"selectors": {
"lvl0": {
"selector": "#content header h1",
"default_value": "Documentation"
}
}
}

Removing unnecessary characters

Some documentation adds special characters to headings, like # or ›. Those characters have a stylistic value but no meaning and shouldn't be indexed in the search results.

You can define a list of characters you want to exclude from the final indexed value by setting the strip_chars key.

{
"selectors": {
"lvl0": {
"selector": "#content header h1",
"strip_chars": "#›"
}
}
}

Note that you can also define strip_chars directly at the root of the configuration and it will be applied to all selectors.

{
"strip_chars": "#›"
}
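As a rough sketch of the effect, the cleanup could look like the following. It assumes that the characters are trimmed from both ends of the extracted text together with surrounding whitespace; the scraper's exact behavior may differ, and `clean_heading` is a hypothetical name.

```python
import string


def clean_heading(text: str, strip_chars: str) -> str:
    """Trim strip_chars and whitespace from both ends of a heading."""
    # Assumption: only leading and trailing characters are removed,
    # so a "#" inside the heading (as in "C#") is preserved.
    return text.strip(strip_chars + string.whitespace)


print(clean_heading("# Getting started ›", "#›"))
print(clean_heading("C# basics", "#›"))
```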

Targeting elements using XPath instead of CSS

CSS selectors are a clear and concise way to target elements of a page, but they have limitations. For example, you cannot go up the cascade with CSS.

If you need a more powerful selector mechanism, you can write your selectors using XPath by setting type: xpath.

The following example looks for a li.chapter.active.done element, goes up two levels in the DOM, and then selects the a element found there. The content of this a is then used as the value of the lvl0 selector.

{
"selectors": {
"lvl0": {
"selector": "//li[@class=\"chapter active done\"]/../../a",
"type": "xpath",
"global": true
}
}
}

XPath selectors can be hard to read. We highly encourage you to test them in your browser first, making sure they match what you're expecting.
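Besides the browser, a quick offline check is possible with Python's standard library, which supports a limited subset of XPath. The sketch below uses a made-up HTML fragment and rewrites the leading // as .// because ElementTree paths are relative; it is not how the scraper evaluates selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a real page.
HTML = """
<ul>
  <li>
    <a>Guide</a>
    <ul>
      <li class="chapter active done">Introduction</li>
    </ul>
  </li>
</ul>
""".strip()

root = ET.fromstring(HTML)
# Find the active chapter, go up two levels, then select the <a> child.
matches = root.findall(".//li[@class='chapter active done']/../../a")
print([a.text for a in matches])
```

The attribute test compares the whole class value, so the class names must appear in exactly this order for the element to match.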

custom_settings Optional

This key can be used to overwrite your Algolia index settings. We don't recommend changing it as the default settings are meant to work for all websites.

custom_settings.separatorsToIndex Optional

One use case would be to configure the separatorsToIndex setting. By default, Algolia considers all special characters to be word separators. In some contexts, like for method names, you might want _, / or # to keep their meaning.

{
"custom_settings": {
"separatorsToIndex": "_/"
}
}

Check the Algolia documentation for more information about the Algolia settings.

custom_settings.synonyms Optional

custom_settings can include a synonyms key that is an array of synonym groups. Each element is an array of one-word synonyms. These words are interchangeable.

For example:

{
"custom_settings": {
"synonyms": [
[
"js",
"javascript"
],
[
"es6",
"ECMAScript6",
"ECMAScript2015"
]
]
}
}

Note that Algolia also offers more advanced synonym types, but our scraper only supports regular one-word synonyms.

scrape_start_urls Optional

By default, the crawler will extract content from the pages defined in start_urls. If you do not have any valuable content on your start_urls or if it's a duplicate of another page, you should set this to false.

{
"scrape_start_urls": false
}

selectors_exclude Optional

This expects an array of CSS selectors. Any element matching one of those selectors will be removed from the page before any data is extracted from it.

This can be used to remove a table of contents, a sidebar, or a footer, to make other selectors easier to write.

{
"selectors_exclude": [".footer", "ul.deprecated"]
}

stop_urls Optional

This is an array of strings or regular expressions. Whenever the crawler is about to visit a link, it will first check if the link matches something in the array. If it does, it will not follow the link. This should be used to restrict pages the crawler should visit.

Note that this is often used to avoid duplicate content, by adding http://www.example.com/docs/index.html if you already have http://www.example.com/docs/ in your start_urls.

{
"stop_urls": ["https://www.example.com/docs/index.html", "license.html"]
}
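The link check described above, combined with the same-domain rule for start_urls, can be sketched roughly as follows. The function name `should_follow` is hypothetical, start_urls are taken as plain strings for simplicity, and treating every stop_urls entry as a pattern passed to `re.search` is an assumption.

```python
import re
from urllib.parse import urlparse


def should_follow(link: str, start_urls: list, stop_urls: list) -> bool:
    """Decide whether a discovered link should be crawled."""
    allowed_domains = {urlparse(url).netloc for url in start_urls}
    if urlparse(link).netloc not in allowed_domains:
        return False  # The link points to another domain.
    # Skip the link as soon as one stop_urls pattern is found in it.
    return not any(re.search(pattern, link) for pattern in stop_urls)


start = ["https://www.example.com/docs"]
stop = ["https://www.example.com/docs/index.html", "license.html"]
print(should_follow("https://www.example.com/docs/guide.html", start, stop))
print(should_follow("https://www.example.com/docs/license.html", start, stop))
```

A short entry such as license.html blocks that file name anywhere on the site, while a full URL blocks a single page.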

min_indexed_level Optional

The default value is 0. By increasing it, you can choose not to index some records if they don't have enough lvlX matching. For example, with a min_indexed_level: 2, the scraper indexes temporary records having at least lvl0, lvl1 and lvl2 set. You can find out more details about this strategy in this section.

This is useful when your documentation has pages that share the same lvl0 and lvl1 for example. In that case, you don't want to index all the shared records, but want to keep the content different across pages.

{
"min_indexed_level": 2
}
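A minimal sketch of the filter, assuming a flat record shape with lvl0 to lvl5 keys (the real record structure may differ, and `has_min_level` is a hypothetical name):

```python
def has_min_level(record: dict, min_indexed_level: int) -> bool:
    """True when lvl0 through lvl<min_indexed_level> are all set."""
    return all(record.get(f"lvl{i}") for i in range(min_indexed_level + 1))


records = [
    {"lvl0": "Docs", "lvl1": "API", "lvl2": None, "text": "Shared intro"},
    {"lvl0": "Docs", "lvl1": "API", "lvl2": "search()", "text": "Runs a query"},
]

# With min_indexed_level set to 2, only the second record is kept.
kept = [record for record in records if has_min_level(record, 2)]
print([record["text"] for record in kept])
```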

only_content_level Optional

When only_content_level is set to true, then the crawler won't create records for the lvlX selectors.

If used, min_indexed_level is ignored.

{
"only_content_level": true
}

nb_hits Special

The number of records that were extracted and indexed by DocSearch. We check this key internally to keep track of any unintended spike or drop that could reveal a misconfiguration.

nb_hits is updated automatically each time you run DocSearch on your config. If the terminal is a TTY, DocSearch will prompt you before updating the field. To avoid being prompted, set the UPDATE_NB_HITS environment variable to true (to enable) or false (to disable). This variable can be set in the .env file alongside APPLICATION_ID and API_KEY.

You don't have to edit this field. We're documenting it here in case you were wondering what it's all about.

Sitemaps

If your website has a sitemap.xml file, you can let DocSearch know and it will use it to define which pages to crawl.

sitemap_urls Optional

You can pass an array of URLs pointing to your sitemap file(s). If this value is set, DocSearch will try to read URLs from your sitemap(s) instead of following every link of your start_urls.

{
"sitemap_urls": ["http://www.example.com/docs/sitemap.xml"]
}

You must explicitly define this parameter; our scraper doesn't follow robots.txt.

sitemap_alternate_links Optional

Sitemaps can contain alternative links for URLs. Those are other versions of the same page, in a different language, or with a different URL. By default DocSearch will ignore those URLs.

Set this to true if you want those other versions to be crawled as well.

{
"sitemap_urls": ["http://www.example.com/docs/sitemap.xml"],
"sitemap_alternate_links": true
}

With the above configuration and the sitemap.xml below, both http://www.example.com/docs/ and http://www.example.com/docs/de/ will be crawled.

<url>
<loc>http://www.example.com/docs/</loc>
<xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/docs/de/"/>
</url>
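To illustrate what reading alternate links involves, here is a small standard-library sketch that collects the loc URLs and, optionally, the alternate links from a complete sitemap. The function name `urls_from_sitemap` is hypothetical, and the urlset wrapper with its namespace declarations is added so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def urls_from_sitemap(xml_text: str, alternate_links: bool = False) -> list:
    """Collect <loc> URLs and, optionally, their alternate links."""
    urls = []
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        loc = url.find("sm:loc", NS)
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
        if alternate_links:
            # Alternate versions are declared as xhtml:link elements.
            for link in url.findall("xhtml:link[@rel='alternate']", NS):
                urls.append(link.get("href"))
    return urls


SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/docs/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/docs/de/"/>
  </url>
</urlset>"""

print(urls_from_sitemap(SITEMAP))
print(urls_from_sitemap(SITEMAP, alternate_links=True))
```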

JavaScript rendering

By default DocSearch expects websites to have server-side rendering, meaning that HTML source is returned directly by the server. If your content is generated by the front-end, you have to tell DocSearch to emulate a browser through Selenium.

Because client-side crawling is much slower than server-side crawling, we highly encourage you to update your website to enable server-side rendering.

js_render Optional

Set this value to true if your website requires client-side rendering. This will make DocSearch spawn a Selenium proxy to fetch all your web pages.

{
"js_render": true
}

js_wait Optional

If your website is slow to load, you can use js_wait to tell DocSearch to wait a specific amount of time (in seconds) for the page to load before extracting its content.

Note that this option might have a large impact on the time required to crawl your website and we would encourage you to enable server-side rendering on your website instead.

This option has no impact if js_render is set to false.

{
"js_render": true,
"js_wait": 2
}

use_anchors Optional

Websites using client-side rendering often don't use full URLs, but instead take advantage of the URL hash (the part after the #).

If your website is using such URLs, you should set use_anchors to true for DocSearch to index all your content.

{
"js_render": true,
"use_anchors": true
}

user_agent Optional

You can override the user agent used to crawl your website. By default, this value is:

Algolia DocSearch Crawler

However, if the crawl of your website requires a browser emulation (i.e. js_render=true), our user_agent is:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36

To override it, set user_agent in the configuration:

{
"user_agent": "Custom Bot"
}