URL filters allow you to easily control Project downloads by
setting which pictures/pages should be loaded and which should be
skipped.
URL filters are divided into four parts:
· Page URL Include Filters - determine which HTML pages should be
accessed and analyzed so that their links can be followed.
· Page URL Exclude Filters - determine which HTML pages should be
skipped.
· Picture URL Include Filters - determine which pictures should be
downloaded.
· Picture URL Exclude Filters - determine which pictures should be
skipped.
You may enter several keywords into each of these filter lists,
using a semicolon (;) to separate keywords.
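As a rough illustration of how an include list and an exclude list could work together, the following Python sketch splits each semicolon-separated list into keywords and tests a URL against them. The function names, the sample URLs, the trimming of spaces around keywords, and the rule that an exclude match overrides an include match (with an empty include list accepting everything) are assumptions made for this example, not PicaLoader's documented behavior.

```python
import re

def parse_filter_list(text):
    """Split a semicolon-separated filter list into compiled keywords."""
    # Assumption: surrounding spaces and empty entries are ignored.
    keywords = [k.strip() for k in text.split(";")]
    return [re.compile(k, re.IGNORECASE) for k in keywords if k]

def passes(url, include, exclude):
    """Return True if the URL is included and no exclude keyword matches."""
    # Assumption: an empty include list accepts every URL.
    included = not include or any(p.search(url) for p in include)
    excluded = any(p.search(url) for p in exclude)
    return included and not excluded

include = parse_filter_list("wallpaper;photo")
exclude = parse_filter_list("thumb")

for url in ["xxx.site.xxx/wallpaper01.jpg",
            "xxx.site.xxx/wallpaper01_thumb.jpg",
            "xxx.site.xxx/logo.gif"]:
    print(url, "->", "download" if passes(url, include, exclude) else "skip")
```

Only the first sample URL is downloaded in this sketch: the second contains the excluded keyword "thumb", and the third matches no include keyword.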
Each keyword may be a Perl-like regular expression. A regular
expression is a string of characters that tells PicaLoader which
URL (or URLs) you are looking for. The following explains the
format of regular expressions in detail. If you are familiar with
Perl, you already know the syntax.
1. Simple Regular Expressions: In its simplest form, a regular
expression is just a word or phrase to search for. For example,
beatles
would match any URL that contains the string "beatles" anywhere in
it. Thus, URLs like "xxx.beatles.xxx", "xxx.music.xxx/beatles.htm"
or "xxx.animal.xxx/beatleswild.htm" would all be matched.
2. Metacharacters: Some characters have a special meaning to the
filter. These characters are called metacharacters. Although they
may seem confusing at first, they add a great deal of flexibility
and convenience to the filter.
The period (.) is a
commonly used metacharacter. It matches exactly one character,
regardless of what the character is. For example, the regular
expression:
pic.01
will match "pic001", "pic101" and so on. Note that the period
matches exactly one character; it will not match a string of
characters, nor will it match the null string. Thus, "picture01"
and "pic01" will not be matched by the above regular expression.
But what if you want to match a URL containing a period? For
example,
pic001.jpg
This would indeed match "pic001.jpg", but it would also match
"pic001ajpg", "pic0011jpg" and so on. In short, any string of the
form "pic001xjpg", where x is any character, would be matched by
the regular expression above.
To get around this, we introduce a second metacharacter, the
backslash (\). The backslash indicates that the character
immediately to its right is to be taken literally. Thus, to match
the string "pic001.jpg", we would use:
pic001\.jpg
This is called "quoting". We would say that the period in the
regular expression above has been quoted. In general, whenever the
backslash is placed before a metacharacter, the filter treats the
metacharacter literally rather than invoking its special meaning.
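The plain keyword, the period, and the quoted period described above can be tried out with a short Python sketch. Python's re module is used here as a stand-in for PicaLoader's Perl-like engine, which is an assumption, and the helper name found is made up for this example.

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# (pattern, text) pairs taken from the explanations above
checks = [
    (r"beatles", "xxx.music.xxx/beatles.htm"),  # plain keyword, found inside the URL
    (r"pic.01", "pic001"),         # the period stands for the middle "0"
    (r"pic.01", "pic01"),          # no character left for the period
    (r"pic001.jpg", "pic001ajpg"), # an unquoted period accepts "a"
    (r"pic001\.jpg", "pic001ajpg"),  # a quoted period does not
    (r"pic001\.jpg", "pic001.jpg"),  # only a literal period fits
]
results = {pair: found(*pair) for pair in checks}
for (pattern, text), ok in results.items():
    print(f"{pattern!r} on {text!r}: {'match' if ok else 'no match'}")
```

In this sketch the second and fifth pairs do not match; the other four do.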
The question mark (?):
indicates that the character immediately preceding it may occur
either zero times or one time. Thus
pic0?1
will match "pic1" and "pic01".
The star (*): indicates
that the character immediately to its left may be repeated any
number of times, including zero. Thus
pic0*1
will match "pic1", "pic01", "pic001", "pic0001", and any string
that starts with "pic", is followed by a sequence of "0"s, and
ends with a "1".
The plus (+): indicates
that the character immediately preceding it may be repeated one or
more times. It is just like the star metacharacter, except it
doesn't match the null string. Thus
pic0+1
would not match "pic1", but it would match "pic01", "pic001",
"pic0001" and so on.
Metacharacters may be combined. A common combination includes
the period and star metacharacters, with the star immediately
following the period. This is used to match an arbitrary string of
any length, including the null string. For example:
pic.*1
would match "pic1", "pic01" and even "picture_001". Any string
that starts with "pic", is followed by an arbitrary string, and
ends with "1" will be matched. Note that the null string will be
matched by the period-star pair; thus, "pic1" would be matched by
the above expression.
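The question mark, star, plus, and period-star combinations above can be compared side by side. The sketch below uses Python's re module as an approximation of the filter's Perl-like syntax (an assumption); the helper name matching and the list of sample names are made up for this example.

```python
import re

def matching(pattern, names):
    """Return the names in which the pattern is found."""
    return [n for n in names if re.search(pattern, n, re.IGNORECASE)]

names = ["pic1", "pic01", "pic001", "pic0001", "picture_001", "pic2"]

optional = matching(r"pic0?1", names)      # zero or one "0"
any_count = matching(r"pic0*1", names)     # zero or more "0"s
at_least_one = matching(r"pic0+1", names)  # one or more "0"s
anything = matching(r"pic.*1", names)      # any string, even an empty one

for label, hits in [("pic0?1", optional), ("pic0*1", any_count),
                    ("pic0+1", at_least_one), ("pic.*1", anything)]:
    print(label, "->", hits)
```

Only the period-star pattern picks up "picture_001", and none of the four patterns matches "pic2", which does not end in "1".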
3. The backslash can also turn certain ordinary characters into
metacharacters, as well as the other way around.
The digit metacharacter:
is invoked by following a backslash with a lower-case "d",
like this: "\d". The "d" must be lower case. The digit
metacharacter matches exactly one digit; that is, exactly one
occurrence of "0", "1", "2", "3", "4", "5", "6", "7", "8" or "9".
For example, the regular expression:
pic\d\.jpg
would match "pic0.jpg", "pic1.jpg" and so forth. Similarly,
pic\d\d\.jpg
would match "pic00.jpg", "pic01.jpg" and so on through "pic99.jpg".
We could combine the digit metacharacter with other
metacharacters; for instance,
pic\d+\.jpg
matches any string starting with "pic", followed by one or more
digits, followed by ".jpg". (Note that the plus is used, and
thus "pic.jpg" is not matched.)
The non-digit
metacharacter: uses the uppercase "D". It is written as "\D" and
matches any one character except a digit. Thus,
pic\D\.jpg
would match "pica.jpg", "picZ.jpg" or "pic+.jpg", but would not
match "pic1.jpg", "pic5.jpg" or "pic9.jpg". Similarly,
\D+
matches any non-null string which contains no numeric
characters.
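The digit and non-digit metacharacters above can be checked against a handful of file names. This is a sketch using Python's re module, assumed here to behave like the filter's Perl-like syntax for these patterns; the helper name matching and the sample names are illustrative only.

```python
import re

def matching(pattern, names):
    """Return the names in which the pattern is found."""
    return [n for n in names if re.search(pattern, n, re.IGNORECASE)]

names = ["pic0.jpg", "pic7.jpg", "pic42.jpg", "pic12345.jpg",
         "pic.jpg", "pica.jpg", "picZ.jpg", "pic+.jpg"]

one_digit = matching(r"pic\d\.jpg", names)     # exactly one digit
two_digits = matching(r"pic\d\d\.jpg", names)  # exactly two digits
some_digits = matching(r"pic\d+\.jpg", names)  # one or more digits
non_digit = matching(r"pic\D\.jpg", names)     # one character that is not a digit

# \D+ picks up a run of non-digit characters; here the run before "123".
first_run = re.search(r"\D+", "abc123").group()

print(one_digit, two_digits, some_digits, non_digit, first_run, sep="\n")
```

Note that "pic.jpg" appears in none of the four lists: the digit patterns need at least one digit, and the non-digit pattern needs one extra character before the ".jpg".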
The word metacharacter:
matches exactly one letter, one number, or the underscore
character (_). It is written as "\w". Its opposite, "\W", matches
any one character except a letter, a number or the underscore.
Thus,
a\wz
would match "abz", "aTz", "a5z", "a_z", or any three-character
string starting with "a", ending with "z", and whose second
character was either a letter (upper- or lower-case), a number, or
the underscore. Similarly,
a\Wz
would not match "abz", "aTz", "a5z", or "a_z". It would match
"a%z", "a{z", "a?z" or any three-character string starting with "a"
and ending with "z" and whose second character was not a letter,
number, or underscore. (This means the second character must either
be a symbol or a whitespace character.)
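The split between "\w" and "\W" can be seen by running both patterns over the same set of three-character strings. The sketch uses Python's re module as a stand-in for the filter's engine (an assumption), and the helper name matching is made up for this example.

```python
import re

def matching(pattern, names):
    """Return the names in which the pattern is found."""
    return [n for n in names if re.search(pattern, n, re.IGNORECASE)]

samples = ["abz", "aTz", "a5z", "a_z", "a%z", "a{z", "a?z", "a z"]

word_hits = matching(r"a\wz", samples)      # letter, number or underscore in the middle
non_word_hits = matching(r"a\Wz", samples)  # symbol or whitespace in the middle

print(word_hits)
print(non_word_hits)
```

Every sample lands in exactly one of the two lists, since a character is either a word character or it is not.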
The braces
metacharacter: follows a normal character and contains two
numbers separated by a comma (,) and surrounded by braces ({}).
It is like the star metacharacter, except that the number of
repetitions must be within the minimum and maximum specified by
the two numbers in braces. Thus,
pic0{3,5}\.jpg
will match "pic000.jpg", "pic0000.jpg" and "pic00000.jpg". No
string with fewer than three or more than five "0"s between "pic"
and ".jpg" is matched. Likewise,
pic.{3,5}\.jpg
will match "pic000.jpg", "pic99999.jpg" or "picabc.jpg", but not
"pic00.jpg", since "00" is only two characters long.
The alternative
metacharacter: is represented by a vertical bar (|). It
indicates an either/or behavior by separating two or more possible
choices. For example:
beatles|u2
will match any URL containing the strings "beatles" or "u2"
or both.
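The braces and alternative metacharacters described above can be exercised in the same way. The following is a sketch with Python's re module, assumed to treat these patterns like the filter's Perl-like engine; the helper name matching, the sample file names and the sample URLs are illustrative.

```python
import re

def matching(pattern, names):
    """Return the names in which the pattern is found."""
    return [n for n in names if re.search(pattern, n, re.IGNORECASE)]

names = ["pic00.jpg", "pic000.jpg", "pic0000.jpg", "pic00000.jpg",
         "pic000000.jpg", "pic99999.jpg", "picabc.jpg"]

zeros = matching(r"pic0{3,5}\.jpg", names)     # three to five "0"s
any_chars = matching(r"pic.{3,5}\.jpg", names)  # three to five characters of any kind

urls = ["xxx.beatles.xxx", "xxx.u2.xxx/tour.htm",
        "xxx.music.xxx/beatles_and_u2.htm", "xxx.stones.xxx"]
either = matching(r"beatles|u2", urls)         # "beatles" or "u2" or both

print(zeros, any_chars, either, sep="\n")
```

The name with four zeros falls inside the 3-to-5 range and is matched, while the names with two and six zeros fall outside it.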
The bracket
metacharacter: matches one occurrence of any character inside
the brackets ([]). For example,
pic_[abf]\.jpg
will match "pic_a.jpg", "pic_b.jpg" and "pic_f.jpg", but not
"pic_0.jpg", "pic_c.jpg" or "pic_e.jpg".
Ranges of characters can be used by using the dash (-) within
the brackets. For example,
pic[a-d]\.jpg
will match "pica.jpg", "picb.jpg", "picc.jpg" or "picd.jpg", and
nothing else. Likewise,
wallpaper[3-5]\d\.jpg
will match "wallpaper30.jpg" through "wallpaper59.jpg".
If you wish to include a dash within brackets as one of the
characters to match, instead of to denote a range, put the dash
immediately before the right bracket. Thus:
a[1234-]z
and
a[1-4-]z
both do the same thing. They both match "a1z", "a2z", "a3z",
"a4z" or "a-z", and nothing else.
The bracket metacharacter can also be inverted by placing a
caret (^) immediately after the left bracket. Thus,
wallpaper[^02468]\.jpg
matches "wallpaper" followed by any single character other than an
even digit, followed by ".jpg". Inversion and ranges can be
combined, so that
\W[^f-h]ood\W
matches any four-letter word ending in "ood" that is surrounded by
non-word characters, except for "food", "good" or "hood". (Thus
"mood" and "wood" would both be matched.)
Note that within brackets, ordinary quoting rules do not apply
and other metacharacters are not available. The only characters
that can be quoted in brackets are "[", "]", and "\". Thus,
[\[\\\]]abc
matches any four-character string ending with "abc" and starting
with "[", "]", or "\".
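The bracket rules above (sets, ranges, a trailing dash, inversion, and quoting inside brackets) can be verified together. The sketch below uses Python's re module, assumed to follow the same bracket rules as the filter's Perl-like syntax; the helper name found, the CASES list and the sample strings are made up for this example.

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# (pattern, text, result expected from the descriptions above)
CASES = [
    (r"pic_[abf]\.jpg", "pic_a.jpg", True),
    (r"pic_[abf]\.jpg", "pic_c.jpg", False),
    (r"pic[a-d]\.jpg", "picd.jpg", True),
    (r"pic[a-d]\.jpg", "pice.jpg", False),
    (r"wallpaper[3-5]\d\.jpg", "wallpaper59.jpg", True),
    (r"wallpaper[3-5]\d\.jpg", "wallpaper60.jpg", False),
    (r"a[1234-]z", "a-z", True),   # trailing dash is a literal dash
    (r"a[1-4-]z", "a-z", True),    # range 1-4 plus a literal dash
    (r"a[1-4-]z", "a5z", False),
    (r"wallpaper[^02468]\.jpg", "wallpaper1.jpg", True),
    (r"wallpaper[^02468]\.jpg", "wallpaper2.jpg", False),
    (r"\W[^f-h]ood\W", "/mood.jpg", True),
    (r"\W[^f-h]ood\W", "/food.jpg", False),
    (r"[\[\\\]]abc", "[abc", True),
    (r"[\[\\\]]abc", "\\abc", True),  # the text is a backslash followed by "abc"
    (r"[\[\\\]]abc", "xabc", False),
]

# Collect every case whose actual result differs from the expected one.
mismatches = [(p, t) for p, t, expected in CASES if found(p, t) != expected]
print("all examples behave as described" if not mismatches else mismatches)
```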
4. The table below lists some of the more useful special (meta)
characters.
Reg-expr     Description
.            Matches any character (except newline)
x?           Matches 0 or 1 x's, where x is any regular expression
x*           Matches 0 or more x's
x+           Matches 1 or more x's
foo|bar      Matches one of foo or bar
[xyz]        Matches any character in the set xyz, specify ranges with a -
[^xyz]       Matches any single character not in the set xyz
\w           Matches an alpha-numeric character, i.e., [a-zA-Z0-9_]
(x)          Brackets a regular expression
\metachar    Matches the metacharacter (takes away its special meaning)
5. The search is case insensitive; thus
picture
and
Picture
and
PICTURE
all search for the same set of strings. Each will match
"picture", "PICTURE", "Picture", "PicTure" and so forth. Thus you
need not worry about capitalization. (Note, however, that
metacharacters must still have the proper case. This is especially
important for metacharacters such as "\d" and "\D", whose case
determines whether their meaning is reversed or not.)
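The difference between ordinary letters, whose case is ignored, and metacharacters, whose case still matters, can be sketched as follows. Python's re.IGNORECASE flag is used here to imitate the filter's case-insensitive search (an assumption); the helper name found and the sample URLs are illustrative.

```python
import re

def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

keywords = ["picture", "Picture", "PICTURE"]
urls = ["xxx.site.xxx/picture.htm", "xxx.site.xxx/PicTure.htm"]

# Every spelling of the keyword finds every spelling in the URLs.
ordinary = all(found(k, u) for k in keywords for u in urls)

# The case of a metacharacter is not ignored: \d and \D stay opposites.
digit = found(r"pic\d", "PIC7")      # a digit must follow "pic"
non_digit = found(r"pic\D", "PIC7")  # a non-digit must follow "pic"

print(ordinary, digit, non_digit)
```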