Configuring link validation
In Episerver CMS, you can track broken links for a website using the Link Validator scheduled job. The job checks the links in tblContentSoftLink, performs a HEAD request against each one, and saves each link's status back to tblContentSoftLink. The result of the validation job is available in Report Center as a report called Link Status report.
The scheduled job first gets a batch of up to 1000 links from tblContentSoftLink. The batch contains only unchecked links or links that were last checked before the job started. The job uses the date the link was last checked and the recheck interval to determine whether the link should be checked again.
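The selection rule described above can be sketched as follows. This is an illustrative model only, not Episerver's implementation: the `Link` type, the `is_due` and `next_batch` names, and the in-memory list standing in for tblContentSoftLink are all hypothetical, and the treatment of broken links (rechecked without waiting for the interval) is an assumption based on the recheckInterval description.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional

BATCH_SIZE = 1000  # the job fetches up to 1000 links per batch


class Link(NamedTuple):
    """Hypothetical stand-in for a row in tblContentSoftLink."""
    url: str
    last_checked: Optional[datetime]  # None means the link was never checked
    working: bool = True              # result of the previous check, if any


def is_due(link: Link, job_start: datetime, recheck_interval: timedelta) -> bool:
    """Return True if the link should be validated in this run."""
    if link.last_checked is None:
        return True                   # unchecked links are always due
    if link.last_checked >= job_start:
        return False                  # already handled during this run
    if link.working:
        # A working link waits until the recheck interval has elapsed.
        return job_start - link.last_checked >= recheck_interval
    return True                       # assumption: broken links are rechecked


def next_batch(links: List[Link], job_start: datetime,
               recheck_interval: timedelta,
               batch_size: int = BATCH_SIZE) -> List[Link]:
    """Return at most batch_size links that are due for validation."""
    due = [link for link in links if is_due(link, job_start, recheck_interval)]
    return due[:batch_size]
```

Because a checked link gets a new last-checked date at or after the job's start time, it drops out of later batches in the same run, which is what lets the job eventually run out of links.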
Each link in the batch is checked using a HEAD request, provided that the server's robots.txt allows it. No host is checked more than once every five seconds. If a link points to a host that was checked in the last five seconds, the job waits five seconds and then checks the link.
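The per-host delay can be modeled with a small throttle that remembers when each host was last contacted. This is a minimal sketch, not the job's actual code: `HostThrottle` and its parameters are hypothetical names, and the clock and sleep functions are injected so the behavior can be exercised without real waiting. The robots.txt check is left out.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

MIN_HOST_INTERVAL = 5.0  # seconds between two requests to the same host


class HostThrottle:
    """Ensure that no host is contacted more than once per interval."""

    def __init__(self, clock: Callable[[], float],
                 sleep: Callable[[float], None],
                 min_interval: float = MIN_HOST_INTERVAL) -> None:
        self._clock = clock
        self._sleep = sleep
        self._min_interval = min_interval
        self._last_hit: Dict[str, float] = {}

    def wait_for(self, url: str) -> str:
        """Block if the URL's host was contacted recently; return the host."""
        host = urlsplit(url).hostname or ""
        last = self._last_hit.get(host)
        if last is not None and self._clock() - last < self._min_interval:
            # As described above, the job waits the full five seconds.
            self._sleep(self._min_interval)
        self._last_hit[host] = self._clock()
        return host
```

In real use the throttle would be created as `HostThrottle(time.monotonic, time.sleep)` and `wait_for(url)` called immediately before each HEAD request.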
The job saves the status of each link, the date the link was checked, and (if available) the HTTP status code to tblContentSoftLink. The job also saves information about when a link was first found broken. After the first batch of links is checked, a new batch is fetched from the database.
The job continues until it cannot get any more unchecked links from the database, or until the job's runtime exceeds the value set in maximumRunTime. The job also stops if a large number of consecutive errors occur on external links, because this may indicate a general network problem on the server running the site.
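The three stop conditions can be summarized in a loop like the one below. It is a simplified sketch under stated assumptions: `run_job` and its callback parameters are hypothetical, every link is treated as external, and the point at which the run time is tested (before each link) is a guess rather than documented behavior.

```python
from typing import Callable, List


def run_job(fetch_batch: Callable[[], List[str]],
            check_link: Callable[[str], bool],
            save_result: Callable[[str, bool], None],
            clock: Callable[[], float],
            maximum_run_time: float,
            external_link_error_threshold: int) -> str:
    """Process batches until a stop condition is met; return the reason."""
    started = clock()
    consecutive_errors = 0
    while True:
        batch = fetch_batch()
        if not batch:
            return "no more links"
        for url in batch:
            if clock() - started > maximum_run_time:
                return "maximum run time exceeded"
            ok = check_link(url)
            save_result(url, ok)
            # A successful check resets the run of consecutive errors.
            consecutive_errors = 0 if ok else consecutive_errors + 1
            if consecutive_errors > external_link_error_threshold:
                return "too many consecutive errors"
```

Counting only consecutive failures means that a few broken links scattered among working ones never abort the job, while an unbroken run of failures (as during a network outage) does.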
Configuring the Link Validator
None of the settings are required, but you can use them to customize the behavior of the link validation job. Add the <linkValidator> node as a child of the <episerver> node in the web.config file. Example:
<linkValidator externalLinkErrorThreshold="10"
               maximumRunTime="4:00:00"
               recheckInterval="30.00:00:00"
               userAgent="EPiServer LinkValidator"
               proxyAddress="http://myproxy.mysite.com"
               proxyUser="myUserName"
               proxyPassword="secretPassword"
               proxyDomain=".mysite.com"
               internalLinkValidation="Api">
  <excludePatterns>
    <add regex=".*doc"/>
    <add regex=".*pdf"/>
  </excludePatterns>
</linkValidator>
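The maximumRunTime and recheckInterval values in the example appear to follow the .NET TimeSpan string format ([d.]hh:mm:ss); that is an assumption here, because the format is not stated in this topic. Under that assumption, the following simplified parser (a hypothetical helper that ignores fractional seconds and negative values) shows which durations the sample values represent.

```python
import re
from datetime import timedelta

# Optional day count followed by a dot, then hours:minutes:seconds.
_TIMESPAN = re.compile(
    r"^(?:(?P<days>\d+)\.)?(?P<hours>\d{1,2}):(?P<minutes>\d{2}):(?P<seconds>\d{2})$"
)


def parse_timespan(value: str) -> timedelta:
    """Parse a '[d.]hh:mm:ss' string into a timedelta."""
    match = _TIMESPAN.match(value)
    if match is None:
        raise ValueError(f"not a [d.]hh:mm:ss value: {value!r}")
    return timedelta(
        days=int(match["days"] or 0),   # the day part is optional
        hours=int(match["hours"]),
        minutes=int(match["minutes"]),
        seconds=int(match["seconds"]),
    )
```

With this reading, "4:00:00" is a run time limit of four hours and "30.00:00:00" is a recheck interval of 30 days.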
To configure the behavior of the link validation job, you have the following options:
- externalLinkErrorThreshold. If the number of consecutive errors on external links exceeds the configured value, the job aborts.
- maximumRunTime. The maximum time the scheduled job executes.
- recheckInterval. A link that was validated as working is not rechecked until the configured time span has elapsed.
- userAgent. The user agent string to use when validating a link.
- proxyAddress. Web proxy address for the link checker to use when validating links.
- proxyUser. Web proxy user for authenticating proxy connection.
- proxyPassword. Web proxy password to authenticate the proxy connection.
- proxyDomain. Web proxy domain to authenticate the proxy connection.
- internalLinkValidation. How the link validator handles internal links. Possible values:
  - Off. Internal links are ignored.
  - Api. The internal API is used to validate that the referenced page exists. [default]
  - Request. Internal links are validated the same way as external links, using a HEAD request.
- excludePatterns. A list of patterns for links that the link validation job skips. Use the regex attribute to identify what links to skip.
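To see which links the sample excludePatterns entries would skip, the patterns can be tried against a few URLs. The sketch below uses Python's `re` module and assumes unanchored matching (a pattern matches if it is found anywhere in the URL); whether the link validator matches this way is not stated in this topic, and `is_excluded` and the example URLs are hypothetical.

```python
import re
from typing import Iterable

# The two patterns from the configuration example.
EXCLUDE_PATTERNS = [r".*doc", r".*pdf"]


def is_excluded(url: str, patterns: Iterable[str] = EXCLUDE_PATTERNS) -> bool:
    """Return True if any exclude pattern is found in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


if __name__ == "__main__":
    for candidate in (
        "http://example.com/files/report.pdf",
        "http://example.com/docs/intro",
        "http://example.com/about",
    ):
        print(candidate, is_excluded(candidate))
```

Under this assumption, a pattern such as `.*doc` also matches URLs that merely contain "doc", for example a path like /docs/intro. If only a file extension should be excluded, a stricter pattern such as `\.pdf$` may be more appropriate.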
The link validator does not handle private resources, with the exception of pages. Private resources include documents and images stored on a local file system that does not allow anonymous access. If you use forms authentication, these links are not validated and do not appear in the link report. If you use basic or Windows authentication, links to these resources result in 401 (access denied) in the link report. This may be the case for an intranet site with Windows authentication and anonymous access disabled.
- Configuration. Describes the syntax used in the descriptions of the configuration elements.