You can create a set of rules for each session to follow. These rules can allow or disallow actions on URL resources, based on criteria you specify. For example, you might want to prevent certain files from being downloaded.
Rules override the Crawling Scope setting, so you can explicitly allow access to URLs that are not in the scope. Rules are evaluated in the order they are arranged, and rules further down take precedence. You can re-order rules by dragging them.
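As a mental model, the rule list behaves like a last-match-wins sequence: the scope decides the default, and every matching rule overrides the decision so far. The sketch below only illustrates that ordering; the `Rule` type and `allowsAccess` function are made up for this example and are not SiteCrawler's implementation.

```swift
import Foundation

// Hypothetical model: each rule allows (Always) or denies (Never) access to
// URLs containing a pattern. Rules are checked in order, and every matching
// rule overrides the decision so far, so later rules take precedence.
struct Rule {
    let pattern: String   // substring to look for in the URL
    let allow: Bool       // true = Always, false = Never
}

func allowsAccess(_ url: String, rules: [Rule], inScope: Bool) -> Bool {
    var allowed = inScope                 // the crawling scope decides the default
    for rule in rules where url.contains(rule.pattern) {
        allowed = rule.allow              // a later match overrides an earlier one
    }
    return allowed
}

let rules = [
    Rule(pattern: ".zip", allow: false),        // never access archives...
    Rule(pattern: "/downloads/", allow: true)   // ...unless they live under /downloads/
]
print(allowsAccess("http://www.example.com/downloads/kit.zip", rules: rules, inScope: true))  // true
```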
Before starting the session, you can use the Test button to make sure the rules work as you intend.
There are three types of actions to choose from, each with an Always and a Never entry. Always means the rule allows the action and performs it when its conditions are met; Never is the opposite: it prevents the action from being taken when the conditions are met.
The access action allows or disallows any handling of the resource in question. This is the most common action type, and a simple way to keep certain files from being accessed. All URLs within the crawling scope are allowed access by default.
The follow action controls whether a resource is searched for links. This applies to file types such as HTML, CSS and RSS feeds, in which external references can be found. All compatible files have this action turned on by default.
The save action controls whether the file is saved to disk. Even if it isn't saved, the file can still be searched for links (see the follow action). Saving is allowed by default for all accessed files.
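As a recap of the defaults described above, here is a small hypothetical sketch; the enum and function names are illustration only and not part of SiteCrawler.

```swift
// The three action types and their default behaviour, purely illustrative.
enum RuleAction {
    case access   // may the resource be fetched at all?
    case follow   // should the resource be scanned for links?
    case save     // should the downloaded file be written to disk?
}

func defaultDecision(for action: RuleAction,
                     inScope: Bool,
                     canContainLinks: Bool) -> Bool {
    switch action {
    case .access: return inScope          // everything in the crawling scope is accessed
    case .follow: return canContainLinks  // HTML, CSS, RSS and similar types are followed
    case .save:   return true             // accessed files are saved
    }
}
```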
The most common use for rules is to disallow access to unwanted files that do not need to be downloaded. It's also common to explicitly allow some URLs in domains outside the crawled one; remember that rules can override the crawling scope.
Each rule has a set of conditions that decide when the action should be performed. You can choose whether all conditions must be met ("all"), or if at least one is enough ("any"). Conditions compare a property of each resource against a value you specify. You can also specify how the comparisons are to be performed.
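One way to picture the "all"/"any" choice is as checking every condition versus checking for at least one match. The sketch below uses made-up types (`Resource`, `Condition`) purely to illustrate the idea; it is not SiteCrawler's data model.

```swift
// A hypothetical condition reads one property of a resource and compares it
// against a value. "all" requires every condition to match; "any" needs one.
struct Resource {
    let fileExtension: String
    let host: String
}

struct Condition {
    let property: (Resource) -> String   // which property to look at
    let matches: (String) -> Bool        // the comparison to perform
}

func conditionsAreMet(_ conditions: [Condition], for resource: Resource,
                      requireAll: Bool) -> Bool {
    if requireAll {
        return conditions.allSatisfy { $0.matches($0.property(resource)) }
    } else {
        return conditions.contains { $0.matches($0.property(resource)) }
    }
}

// Example: file extension is "pdf" AND host is "www.example.com".
let conditions = [
    Condition(property: { $0.fileExtension }, matches: { $0 == "pdf" }),
    Condition(property: { $0.host },          matches: { $0 == "www.example.com" })
]
let resource = Resource(fileExtension: "pdf", host: "www.example.com")
print(conditionsAreMet(conditions, for: resource, requireAll: true))   // true
```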
There are eight different properties. The first six take their information from parts of the web address:
| Property | Description | Example |
| --- | --- | --- |
| URL | The entire web address of the resource. | http://www.example.com/directory/page.html?query |
| File name | The last path component of the URL. | page.html |
| Path | The entire path part of the URL. | /directory/page.html |
| Query string | The part following the ? in the URL, or empty if there is no query. | query |
| File extension | The part after the last dot of the file name. | html |
| Host name | The domain name (or IP address). | www.example.com |
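To see how these six parts map onto a concrete address, the Foundation snippet below pulls them out of the example URL from the table. This is just an illustration of the terminology; SiteCrawler does its own URL handling.

```swift
import Foundation

let url = URL(string: "http://www.example.com/directory/page.html?query")!

print(url.absoluteString)     // URL:            http://www.example.com/directory/page.html?query
print(url.lastPathComponent)  // File name:      page.html
print(url.path)               // Path:           /directory/page.html
print(url.query ?? "")        // Query string:   query
print(url.pathExtension)      // File extension: html
print(url.host ?? "")         // Host name:      www.example.com
```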
The last two properties are special in that their information is taken from the HTTP response header. This means their values cannot be determined until the resource's header has been fetched.
| Property | Description | Example |
| --- | --- | --- |
| Content type | The MIME content type in the response header. | text/html |
| Response code | The HTTP response code. | 200 |
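For reference, this is roughly where those two values come from in an HTTP exchange. The sketch uses URLSession just to show the header fields; it is not how SiteCrawler fetches resources.

```swift
import Foundation

let url = URL(string: "http://www.example.com/")!
let task = URLSession.shared.dataTask(with: url) { _, response, _ in
    if let http = response as? HTTPURLResponse {
        print("Content type:", http.mimeType ?? "unknown")   // e.g. text/html
        print("Response code:", http.statusCode)             // e.g. 200
    }
}
task.resume()   // in a real tool you would wait for the request to finish
```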
SiteCrawler supports powerful regular expressions in rules, so you can match properties against patterns. Use the Matches pattern and Does not match pattern comparison types to accomplish this. SiteCrawler uses ICU regular expression syntax; see the ICU documentation for the full syntax. A few examples:
- The property must end with a 3-digit number: `.*[0-9]{3}$`
- The property must be at least 10 characters long: `^.{10,}$`
- The property must either contain foo or begin with bar: `(.*foo.*)|(^bar.*)`
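If you want to experiment with patterns outside SiteCrawler, Foundation's NSRegularExpression uses the same ICU syntax. The `matches` helper below is just a quick way to try the examples above; it is not part of SiteCrawler.

```swift
import Foundation

func matches(_ pattern: String, _ value: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return false }
    let range = NSRange(value.startIndex..., in: value)
    return regex.firstMatch(in: value, options: [], range: range) != nil
}

print(matches(".*[0-9]{3}$", "report-001"))            // true: ends with three digits
print(matches("^.{10,}$", "short.css"))                // false: only 9 characters
print(matches("(.*foo.*)|(^bar.*)", "barista.html"))   // true: begins with "bar"
```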