# Config Files
For each DocSearch request we receive, we create a custom JSON configuration file that defines how the crawler should behave. You can find all the configs in this repository.
A DocSearch configuration looks like this (a simplified sketch; the index name, URLs, and selectors are illustrative and tailored per site):
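```json
{
  "index_name": "example",
  "start_urls": ["http://www.example.com/docs/"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}
```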
## `index_name`

This is the name of the Algolia index where your records will be pushed. The `apiKey` we will share with you is restricted to work with this index and is a search-only key.
When using the free DocSearch crawler, the `index_name` will always be the name of the configuration file. If you're running DocSearch yourself, you can use any name you'd like.
When the DocSearch scraper runs, it builds a temporary index. Once scraping is complete, it moves that index to the name specified by `index_name` (replacing the existing index).

By default, the name of the temporary index is the value of `index_name` + `_tmp`. To use a different name, set the `INDEX_NAME_TMP` environment variable to a different value. This variable can be set in the `.env` file alongside `APPLICATION_ID` and `API_KEY`.
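For example, a `.env` file could contain (placeholder values; `INDEX_NAME_TMP` is optional):

```
APPLICATION_ID=YOUR_APP_ID
API_KEY=YOUR_API_KEY
INDEX_NAME_TMP=example_tmp
```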
## `start_urls`

This array contains the list of URLs that will be used to start crawling your website. The crawler will recursively follow any links (`<a>` tags) from those pages. It will not follow links that are on another domain, and it never follows links matching `stop_urls`.
### `selectors_key`, tailor your selectors

You can define finer sets of selectors depending on the URL. To do so, set the `selectors_key` parameter on your `start_urls` items; each key points to a dedicated set under `selectors` (a sketch follows, with illustrative URLs and keys):
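```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/faq/",
      "selectors_key": "faq"
    },
    {
      "url": "http://www.example.com/docs/"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "text": "p"
    },
    "faq": {
      "lvl0": ".faq h1",
      "lvl1": ".faq .question",
      "text": ".faq .answer"
    }
  }
}
```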
To find the right subset to use based on the URL, the scraper iterates over the `start_urls` items; only the first one to match is applied.

Consider the URL `http://www.example.com/en/api/` with a configuration along these lines (a sketch with illustrative URLs):
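```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/en/",
      "selectors_key": "doc"
    },
    {
      "url": "http://www.example.com/en/api/",
      "selectors_key": "api"
    }
  ]
}
```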
Only the set of selectors related to `doc` will be applied to that URL, because its `start_urls` entry matches first. The correct configuration should be built the other way around, listing the most specific URLs first (as in the primary example above).
If a `start_urls` item has no `selectors_key` defined, the `default` set will be used. Do not forget to set this fallback set of selectors.
### Using regular expressions

The `start_urls` and `stop_urls` options also enable you to use regular expressions to express more complex patterns. In that case, the `start_urls` item must be an object that at least contains a `url` key targeting a reachable page.
You can also define a `variables` key that will be injected into your specific URL pattern. The following example makes this variable feature clearer (a sketch; URLs and values are illustrative):
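```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/(?P<lang>.*?)/(?P<version>.*?)/",
      "variables": {
        "lang": ["en", "fr"],
        "version": ["latest", "3.3", "3.2"]
      }
    }
  ]
}
```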
The beneficial side effect of using this syntax is that every record extracted from pages matching `http://www.example.com/docs/en/latest` will have the attributes `lang: en` and `version: latest`. This enables you to filter on these attributes with `facetFilters`.
The following example shows how the UI filters results matching a specific language and version.
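```js
// Sketch for docsearch.js v2; assumes records carry the
// `lang` and `version` attributes from the example above.
docsearch({
  // ... your usual options ...
  algoliaOptions: {
    facetFilters: ["lang:en", "version:latest"]
  }
});
```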
### Using custom tags

You can also apply custom tags to some pages without the need to use regular expressions. In that case, add the list of tags to the `tags` key. Note that those tags will be automatically added as facets in Algolia, allowing you to filter based on their values as well.
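For example (URL and tags are illustrative):

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "tags": ["concepts", "terminology"]
    }
  ]
}
```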
From the JS snippet:
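```js
// Sketch for docsearch.js v2, filtering on the `concepts` tag above.
docsearch({
  // ... your usual options ...
  algoliaOptions: {
    facetFilters: ["tags:concepts"]
  }
});
```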
### Using Page Rank

This parameter lets you give more weight to some pages by boosting the records built from them. Pages with the highest `page_rank` will be returned before pages with a lower `page_rank`. Note that you can pass any numeric value, including negative values.
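For example (URLs and values are illustrative):

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "page_rank": 5
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "page_rank": 1
    }
  ]
}
```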
In this example, records built from the Concepts page will be ranked higher than results extracted from the Contributors page.
### Using custom selectors per page

If the markup of your website is so different from one page to another that you can't have generic selectors, you can namespace your selectors and specify which set of selectors should be applied to specific pages.
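A sketch of such a configuration (URLs and selector sets are illustrative):

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "selectors_key": "concepts"
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "selectors_key": "contributors"
    },
    {
      "url": "http://www.example.com/docs/"
    }
  ],
  "selectors": {
    "default": { "lvl0": "h1", "lvl1": "h2", "text": "p" },
    "concepts": { "lvl0": ".concepts h1", "text": ".concepts p" },
    "contributors": { "lvl0": ".contributors h1", "text": ".contributors p" }
  }
}
```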
Here, all documentation pages will use the selectors defined in `selectors.default`, while pages under `./concepts` will use `selectors.concepts` and those under `./contributors` will use `selectors.contributors`.
## `selectors`

This object contains all the CSS selectors that will be used to create the record hierarchy. It can contain up to six levels (`lvl0`, `lvl1`, `lvl2`, `lvl3`, `lvl4`, `lvl5`) and `text`.
A default config would be to target the page `title` or `h1` as `lvl0`, the `h2` as `lvl1`, `h3` as `lvl2`, and `p` as `text`, but this is highly dependent on the markup.
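Expressed as a configuration, that default looks like:

```json
{
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}
```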
The `text` key is mandatory, but we highly recommend also setting `lvl0`, `lvl1`, and `lvl2` to get a decent depth of relevance.

Selectors can be passed as strings, or as objects containing a `selector` key. Other special keys can be set, as documented below.
### Using global selectors

The default way of extracting content through selectors is to read the HTML markup from top to bottom. This works well with semi-structured content, like a hierarchy of headings, but it breaks when the relevant information is not part of the same flow, for example when the title is not part of a header, or sits in a sidebar.

For that reason, you can set a selector as global, meaning that it will match on the whole page and will be the same for all records extracted from that page.
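For example (illustrative selectors; this sketch assumes the flag is the `global` key on a selector object):

```json
{
  "selectors": {
    "lvl0": {
      "selector": ".sidebar .active-category",
      "global": true
    },
    "lvl1": "h1",
    "text": "p"
  }
}
```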
We do not recommend making `text` selectors global.
### Setting a default value

If a selector doesn't match a valid element on the page, you can define a `default_value` as a fallback.
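For example (selector and value are illustrative):

```json
{
  "selectors": {
    "lvl0": {
      "selector": "#page-title",
      "default_value": "Documentation"
    }
  }
}
```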
### Removing unnecessary characters

Some documentation adds special characters to headings, like `#` or `›`. Those characters have a stylistic value but no meaning, and shouldn't be indexed in the search results.

You can define the list of characters you want to exclude from the final indexed value by setting the `strip_chars` key:
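```json
{
  "selectors": {
    "lvl0": {
      "selector": "h1",
      "strip_chars": "#›"
    }
  }
}
```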
Note that you can also define `strip_chars` directly at the root of the configuration, in which case it is applied to all selectors, for example:
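```json
{
  "strip_chars": " .,;:"
}
```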
### Targeting elements using XPath instead of CSS

CSS selectors are a clear and concise way to target elements of a page, but they have limitations. For example, you cannot go up the cascade with CSS.

If you need a more powerful selector mechanism, you can write your selectors using XPath by setting `type: xpath`.
The following example will look for an `li.chapter.active.done` and then go up two levels in the DOM until it finds an `a`. The content of this `a` will then be used as the value of the `lvl0` selector (a sketch; the exact XPath depends on your markup):
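```json
{
  "selectors": {
    "lvl0": {
      "selector": "//li[@class='chapter active done']/../../a",
      "type": "xpath"
    }
  }
}
```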
XPath selectors can be hard to read. We highly encourage you to test them in your browser first, making sure they match what you're expecting.
## `custom_settings` (Optional)

This key can be used to overwrite your Algolia index settings. We don't recommend changing them, as the default settings are meant to work for all websites.
## `custom_settings.separatorsToIndex` (Optional)

One use case is configuring the `separatorsToIndex` setting. By default, Algolia considers all special characters as word separators. In some contexts, like method names, you might want `_`, `/` or `#` to keep their meaning:
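```json
{
  "custom_settings": {
    "separatorsToIndex": "_/#"
  }
}
```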
Check the Algolia documentation for more information about the Algolia settings.
## `custom_settings.synonyms` (Optional)

`custom_settings` can include a `synonyms` key, which is an array of synonym groups. Each group is an array of one-word synonyms; the words within a group are interchangeable. For example (illustrative values):
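```json
{
  "custom_settings": {
    "synonyms": [
      ["js", "javascript"],
      ["es6", "es2015"]
    ]
  }
}
```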
Note that Algolia supports more advanced synonym types, but our scraper only supports regular one-word synonyms.
## `scrape_start_urls` (Optional)

By default, the crawler will extract content from the pages defined in `start_urls`. If those pages have no valuable content, or if they duplicate other pages, set this to `false`:
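```json
{
  "scrape_start_urls": false
}
```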
## `selectors_exclude` (Optional)
This expects an array of CSS selectors. Any element matching one of those selectors will be removed from the page before any data is extracted from it.
This can be used to remove a table of contents, a sidebar, or a footer, to make other selectors easier to write, for example:
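```json
{
  "selectors_exclude": [".table-of-contents", "footer"]
}
```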
## `stop_urls` (Optional)
This is an array of strings or regular expressions. Whenever the crawler is about to visit a link, it will first check if the link matches something in the array. If it does, it will not follow the link. This should be used to restrict pages the crawler should visit.
Note that this is often used to avoid duplicate content, for example by adding `http://www.example.com/docs/index.html` to `stop_urls` when `http://www.example.com/docs/` is already in your `start_urls`:
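```json
{
  "start_urls": ["http://www.example.com/docs/"],
  "stop_urls": ["http://www.example.com/docs/index.html"]
}
```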
## `min_indexed_level` (Optional)

The default value is `0`. By increasing it, you can choose not to index records that don't have enough `lvlX` fields set. For example, with `min_indexed_level: 2`, the scraper only indexes records that have at least `lvl0`, `lvl1`, and `lvl2` set. You can find more details about this strategy in this section.
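In the configuration:

```json
{
  "min_indexed_level": 2
}
```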
This is useful when your documentation has pages that share the same `lvl0` and `lvl1`, for example. In that case, you don't want to index all the shared records, but only keep the content that differs across pages.
## `only_content_level` (Optional)

When `only_content_level` is set to `true`, the crawler won't create records for the `lvlX` selectors.

If used, `min_indexed_level` is ignored.
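For example:

```json
{
  "only_content_level": true
}
```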
## `nb_hits` (Special)

The number of records that were extracted and indexed by DocSearch. We check this key internally to keep track of any unintended spike or drop that could reveal a misconfiguration.

`nb_hits` is updated automatically each time you run DocSearch on your config.
If the terminal is a TTY, DocSearch will prompt you before updating the field. To avoid being prompted, set the `UPDATE_NB_HITS` environment variable to `true` (to enable the update) or `false` (to disable it). This variable can be set in the `.env` file alongside `APPLICATION_ID` and `API_KEY`.
You don't have to edit this field. We're documenting it here in case you were wondering what it's all about.
## Sitemaps

If your website has a `sitemap.xml` file, you can let DocSearch know, and it will use it to define which pages to crawl.
## `sitemap_urls` (Optional)

You can pass an array of URLs pointing to your sitemap file(s). If this value is set, DocSearch will try to read URLs from your sitemap(s) instead of following every link of your `start_urls`:
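```json
{
  "sitemap_urls": ["http://www.example.com/sitemap.xml"]
}
```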
You must explicitly define this parameter; our scraper doesn't discover sitemaps through `robots.txt`.
## `sitemap_alternate_links` (Optional)

Sitemaps can contain alternate links for URLs: other versions of the same page, in a different language or with a different URL. By default, DocSearch will ignore those URLs.

Set this to `true` if you want those other versions to be crawled as well:
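```json
{
  "sitemap_urls": ["http://www.example.com/sitemap.xml"],
  "sitemap_alternate_links": true
}
```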
With the above configuration and the `sitemap.xml` below, both `http://www.example.com/docs/` and `http://www.example.com/docs/de/` will be crawled.
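An illustrative sitemap entry carrying such an alternate link:

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/docs/</loc>
    <xhtml:link rel="alternate" hreflang="de"
                href="http://www.example.com/docs/de/"/>
  </url>
</urlset>
```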
## JavaScript rendering

By default, DocSearch expects websites to be server-side rendered, meaning that the HTML source is returned directly by the server. If your content is generated by the front end, you have to tell DocSearch to emulate a browser through Selenium.

As a client-side crawl is much slower than a server-side crawl, we highly encourage you to update your website to enable server-side rendering.
## `js_render` (Optional)

Set this value to `true` if your website requires client-side rendering. This will make DocSearch spawn a Selenium proxy to fetch all your web pages:
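```json
{
  "js_render": true
}
```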
## `js_wait` (Optional)

If your website is slow to load, you can use `js_wait` to tell DocSearch to wait a specific amount of time (in seconds) for the page to load before extracting its content. For example, to wait two seconds:
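```json
{
  "js_render": true,
  "js_wait": 2
}
```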
Note that this option might have a large impact on the time required to crawl your website and we would encourage you to enable server-side rendering on your website instead.
This option has no impact if `js_render` is set to `false`.
## `use_anchors` (Optional)

Websites using client-side rendering often don't use full URLs, but instead take advantage of the URL hash (the part after the `#`).

If your website uses such URLs, you should set `use_anchors` to `true` for DocSearch to index all your content:
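```json
{
  "use_anchors": true
}
```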
## `user_agent` (Optional)

You can override the user agent used to crawl your website; a default value is used otherwise. Note that when the crawl of your website requires browser emulation (i.e. `js_render=true`), a different default `user_agent` is used.

To override it, set `user_agent` in the configuration (the value below is an arbitrary example, not the actual default):
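```json
{
  "user_agent": "Custom Bot"
}
```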