This content is archived.

Configuring Find connectors

This page describes configuration options for search connectors in EPiServer Find. A search connector lets website visitors search indexed content on other websites directly from the search interface, thereby accessing content that is related to but not stored in your website. EPiServer Find has two predefined search connector types: Crawler and RSS/Atom. See also: External content.

Fine-tuning crawling and indexing

You can fine-tune indexing by excluding internet media types, and excluding or including parts of a website to be crawled and indexed. You do so from the EPiServer Find administrative interface. Refer to the EPiServer User Guide for more information.

Excluding media types

When excluding media types, follow the standard method of classifying internet file types. See also: Media Types. The following media types are excluded by default when indexing:

  • text/css
  • text/javascript
  • text/ecmascript
  • application/x-pointplus
  • application/x-javascript
  • application/javascript
  • application/ecmascript
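
As an illustration of how such an exclusion list might be applied, the following minimal sketch checks a Content-Type header value against the defaults listed above. The function name and the normalization step (dropping parameters such as charset and lowercasing the type) are assumptions made for this example, not part of EPiServer Find.

```python
# Default exclusions copied from the list above.
DEFAULT_EXCLUDED_MEDIA_TYPES = frozenset({
    "text/css",
    "text/javascript",
    "text/ecmascript",
    "application/x-pointplus",
    "application/x-javascript",
    "application/javascript",
    "application/ecmascript",
})


def is_excluded_media_type(content_type, excluded=DEFAULT_EXCLUDED_MEDIA_TYPES):
    """Return True if a Content-Type header value names an excluded media type.

    Hypothetical helper: parameters such as "; charset=utf-8" are dropped and
    the type is lowercased before the lookup.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in excluded


print(is_excluded_media_type("text/css; charset=utf-8"))  # True
print(is_excluded_media_type("text/html"))                # False
```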

Excluding query strings

You can exclude any query string parameter. A typical use case is excluding known tracking parameters. For example, exclude utm_source so that crawling does not unintentionally increment a campaign counter.

Common exclusions of this type:

  • sid
  • zenid

Note that all strings are case sensitive. Do not include wildcards or whitespace.
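
The sketch below shows one plausible reading of a query string exclusion: the named parameters are dropped from a URL before it is requested. Whether the connector rewrites URLs this way is an assumption, and strip_query_parameters is a hypothetical helper built on Python's standard urllib.parse module. Names are compared exactly, which mirrors the case-sensitivity rule above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_parameters(url, excluded):
    """Return url without the query parameters named in excluded.

    Hypothetical helper. Names are compared case-sensitively and without
    wildcard expansion.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in excluded
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


excluded = {"utm_source", "sid", "zenid"}
print(strip_query_parameters(
    "http://example.com/page?id=7&utm_source=mail&sid=abc", excluded))
# http://example.com/page?id=7
```

Because the comparison is exact, a parameter named SID would not be removed by an exclusion for sid.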

Patterns and globbing

Globbing expands a non-specific name that contains wildcard characters into the set of specific names that match it. All excluded fields support glob patterns. The crawler connector uses patterns similar to those in robots.txt.

Pattern                                        Example               Corresponding regex
'*'                                            */abc/                .*/abc/.*
'?'                                            */???/                .*/.../.*
'{', '}', ','                                  {abc,def}             .*(abc|def).*
'[', ']', '!', ','                             [0-9,xyz][!abc]       .*[0-9xyz][^abc].*
','                                            abc,def               .*abc,def.*
'\'                                            \*\?\,\{\}\[\]\\      .*\*\?\,\{\}\[\]\\.*
'.', '(', ')', '+', '|', '^', '$', '@', '%'    .()+|^$@%             .*\.\+\|\^\$\@\%.*
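
To make the mapping concrete, here is a minimal sketch of a glob-to-regex translator written to reproduce the regexes shown for the '?', brace, bracket, comma, and backslash rows. It is not the connector's implementation: glob_to_regex and glob_matches are hypothetical names, the translator escapes parentheses as well as the other special characters, and it does no validation of unbalanced braces or brackets.

```python
import re

# Characters escaped so that they match literally; a stray backslash is
# treated the same way.
REGEX_SPECIALS = set(".()+|^$@%\\")


def glob_to_regex(pattern):
    """Translate a glob pattern into a regex string (illustrative only)."""
    out = []
    in_braces = False
    in_brackets = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash keeps the next character literal.
            out.append("\\" + pattern[i + 1])
            i += 2
            continue
        if in_brackets:
            if ch == "]":
                in_brackets = False
                out.append("]")
            elif ch == "!" and out[-1] == "[":
                out.append("^")  # '!' right after '[' negates the class
            elif ch != ",":
                out.append(ch)   # commas only separate class members
        elif ch == "[":
            in_brackets = True
            out.append("[")
        elif ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch == "{":
            in_braces = True
            out.append("(")
        elif ch == "}" and in_braces:
            in_braces = False
            out.append(")")
        elif ch == "," and in_braces:
            out.append("|")      # alternatives inside braces
        elif ch in REGEX_SPECIALS:
            out.append("\\" + ch)
        else:
            out.append(ch)
        i += 1
    body = "".join(out)
    # Wrap in '.*' on both sides unless the pattern already supplies it.
    prefix = "" if body.startswith(".*") else ".*"
    suffix = "" if body.endswith(".*") else ".*"
    return prefix + body + suffix


def glob_matches(pattern, url):
    """Return True if the whole URL matches the translated pattern."""
    return re.fullmatch(glob_to_regex(pattern), url) is not None


print(glob_to_regex("{abc,def}"))        # .*(abc|def).*
print(glob_to_regex("[0-9,xyz][!abc]"))  # .*[0-9xyz][^abc].*
```

Because every translated pattern is wrapped in '.*', a pattern matches when it occurs anywhere in the URL.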

Include patterns

Parameter name: 'included_crawl_patterns'. The value can be a single globbing pattern as a string or an array of globbing patterns.
Default: Seed base URLs.

Exclude patterns

Parameter name: 'excluded_crawl_patterns'. The value can be a single globbing pattern as a string or an array of globbing patterns. Exclude patterns override include patterns.
Default: '.{avi,bmp,css,gif,gz,ico,jpeg,jpg,js,m4v,mid,mov,mp2,mp3,mp4,mpeg,png,ram,rar,rm,smil,swf,tif,tiff,wav,wma,wmv,zip}'
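
The precedence rule can be sketched as follows: a URL that matches any exclude pattern is skipped even if it also matches an include pattern. This is a simplified illustration, not the connector's code. should_crawl and as_list are hypothetical helpers, and to stay self-contained the example takes patterns in their regex form rather than as globs. The shortened exclude pattern is only a stand-in for the default above.

```python
import re


def as_list(value):
    """Accept a single pattern string or an array (list) of patterns."""
    return [value] if isinstance(value, str) else list(value)


def should_crawl(url, included_crawl_patterns, excluded_crawl_patterns):
    """Decide whether a URL is crawled; exclusions win over inclusions."""
    if any(re.fullmatch(rx, url) for rx in as_list(excluded_crawl_patterns)):
        return False
    return any(re.fullmatch(rx, url) for rx in as_list(included_crawl_patterns))


included = r"http://example\.com/.*"   # single pattern given as a string
excluded = [r".*\.(css|js|png).*"]     # patterns given as an array

print(should_crawl("http://example.com/news/1", included, excluded))    # True
print(should_crawl("http://example.com/site.css", included, excluded))  # False
```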

No index patterns

Parameter name: 'excluded_index_patterns'. The value can be a single globbing pattern as a string or an array of globbing patterns.

Last updated: Sep 21, 2015