%# use Data::Dumper; %# $Data::Dumper::Indent=1;
<% $MESSAGE %>% } elsif(defined $CGI->param('editfiltering')) { % } elsif(defined $CGI->param('editsites')) { % my $site = $CGI->param('editsites'); % my $escsite = $site; % $escsite =~ s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg; % my %siteconfig;
Back to FilterProxy main configuration.
% foreach my $sitere (keys %{$FilterProxy::CONFIG->{'filters'}}) {
% if($site =~ $sitere && defined $FilterProxy::CONFIG->{'filters'}->{$sitere}->{'Rewrite'}) {
% $siteconfig{$sitere} = $FilterProxy::CONFIG->{'filters'}->{$sitere}->{'Rewrite'};
% }
% }
% foreach my $sitere (keys %siteconfig) {
% my $escsitere = $sitere;
% $escsitere =~ s/([^a-zA-Z0-9_.-])/uc sprintf("%%%02x",ord($1))/eg;
% $SITECONFIG = $siteconfig{$sitere};
Name | Action | Finder Operation | Submit |
---|
Back to FilterProxy main configuration. % if(defined $SITECONFIG) {
Name | Action | Finder Operation | Submit |
---|
A "Matcher" (currently either tag, attrib, or regex) is applied to the file
to find the content desired. So a
Matcher like
tag <img src>
regex /blue chickens/
tag </(a|img)/ /(src|href)/> add tagblock <script>
add
will then expand the match to include a <script> block that follows the
initial <a href> or <img src>. You can use the encloser matcher instead of
tag to cause it to grow to a <script> block that encloses the initial
<a href> or <img src>.
add alternate adds "alternate content". In other words, if you match a
<script> block, its alternate content is a <noscript> block. A <noscript>
block is usually used to show banner ads to browsers which don't support
javascript, or don't have it turned on. Often it's easy to match the ad
inside <noscript> but almost impossible to match a javascript ad. Since these
are often right next to each other in a page, alternate will consider them
one block. alternate also knows about <layer>, <ilayer> and their alternate
content <nolayer>, and <frame> and its alternate content <noframe>.
add balanced adds "balanced enclosers". In other words, if you match
<img src=...> and it has a <center> preceding it and a </center> trailing it,
balanced will consider the center tags part of the match. It continues adding
balanced enclosers until it reaches a leading tag that does not have a
corresponding closing tag trailing the match. balanced ignores whitespace,
comments, and a few other tags like <br> and <p>.
Clever combinations of add, balanced, alternate, and encloser can make most pages look like they never had an ad.
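The way balanced grows a match can be sketched in a few lines of Python. This is only an illustration of the idea, not FilterProxy's Perl implementation: the function name grow_balanced and the sample page are hypothetical, and the sketch skips only whitespace, whereas balanced also skips comments and tags such as <br> and <p>.

```python
import re

# An opening tag sitting directly before the match (whitespace allowed between).
OPEN_BEFORE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^<>]*>\s*$")
# A closing tag sitting directly after the match (whitespace allowed between).
CLOSE_AFTER = re.compile(r"\s*</([a-zA-Z][a-zA-Z0-9]*)\s*>")

def grow_balanced(html, start, end):
    """Widen html[start:end] while it is wrapped in a matching tag pair."""
    while True:
        before = OPEN_BEFORE.search(html, 0, start)
        after = CLOSE_AFTER.match(html, end)
        if before is None or after is None:
            break
        if before.group(1).lower() != after.group(1).lower():
            break  # leading tag has no corresponding closing tag: stop growing
        start, end = before.start(), after.end()
    return start, end

html = '<td><center> <img src="ad.gif"> </center></td><p>text</p>'
start = html.index("<img")
end = html.index(">", start) + 1
new_start, new_end = grow_balanced(html, start, end)
grown = html[new_start:new_end]
```

Here the match starts as the <img> tag alone and grows to take in the surrounding <center> and <td> pairs, stopping before the unrelated <p> block.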
Once your match is found it is either stripped or rewritten. Strip should
be obvious (it removes the match from the page). Rewrite requires the matcher
to be followed by as [block]. The match will be replaced by the text
following the as keyword. The as part is not interpreted in any way; it
replaces the match verbatim.
Each rule can be named, so that if a rule BADRULE destroys the layout of
one page, you can create a site regex for it (on the FilterProxy main page)
which will contain the rule ignore BADRULE. This will cause BADRULE to not be
applied to sites matching that site regex. You don't have to name your rules
if you don't want to. You can even name ignore rules, so that you can ignore
your ignore rules. But that is probably a little silly. Rules and ignores are
processed in alphabetical order, so if you want one rule or ignore to be
processed before another, you can precede the name with a number
(e.g. 1_MYRULE), or just name it something that comes before it
alphabetically.
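A quick way to check where a name will land is to sort the names yourself. The Python sketch below assumes that "alphabetical order" means a plain string comparison, which the text above does not spell out; the rule names other than 1_MYRULE are made up for the example.

```python
# Hypothetical rule names; only 1_MYRULE comes from the text above.
rule_names = ["MYRULE", "ADS", "1_MYRULE"]

# Assumption: rules are processed in plain string-sorted order.
processing_order = sorted(rule_names)

for position, name in enumerate(processing_order, start=1):
    print(position, name)
```

Under that assumption the digit prefix puts 1_MYRULE ahead of both ADS and MYRULE.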
add
to expand the match to include
any javascript, tables, or forms. Do not, for instance, write a rule
like:
strip tag <img width=468 height=60>
since many sites have 468x60 images that are not ads.
strip regex /<!-- +Begin Ad +-->/ add regex /<!-- +End Ad +-->/
perl FilterProxy/Rewrite.pm <html file> '<rule>'
More than one rule may be specified. You must have saved the html file to a file in your directory (and specify it as the second argument above). <rule> must be in quotes so that it gets treated as a single argument by the shell. If the ' character appears in the rule, escape it with a backslash. Some default rules are defined when doing this (look at the end of Rewrite.pm). This also dumps a bunch of timing information, and the HTML that is actually stripped/rewritten.
strip tag <script src=/(?x)      # This is a comment
    (blah.com|blahblah.com)      # match some hostnames
    # ... etc
    />
Note that comments are only allowed inside the regex this way. Consult 'man perlre' for more information.
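If you want to try out a commented pattern before putting it in a rule, the same (?x) extended mode exists in other regex engines. The sketch below uses Python's re module purely for illustration (FilterProxy itself uses Perl regexes); the name AD_HOSTS is hypothetical, and unlike the rule above the dots are escaped so that they match only a literal dot.

```python
import re

# (?x) switches on extended mode: whitespace in the pattern is ignored
# and everything from '#' to the end of the line is a comment.
AD_HOSTS = re.compile(r"""(?x)
    (blah\.com|blahblah\.com)    # match some hostnames
""")

hit = AD_HOSTS.search("http://blahblah.com/banner.js")
miss = AD_HOSTS.search("http://example.org/script.js")
print(hit.group(1) if hit else None, miss)
```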
[25163 Mon Mar 12 14:48:27 2001] Rewrite: ADS took 0.48920 seconds, 381 failed, 2 successful
FilterProxy should be roughly O(n) in the number of "failed" matches listed. FilterProxy is also roughly O(n) in the number of "successful" matches listed (but we don't care how long that takes, right?). When using the tag matcher, a failed match is counted each time the tag name matched but the attributes did not. (Now you see why the ADS rule is so slow: it tries to find ads by looking at the <a> tag.)
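To compare rules, it can help to pull the numbers out of these timing lines. The Python sketch below is a hedged example built only from the single log line shown above; the function name parse_timing and the field names are hypothetical, and the meaning of the leading number in the brackets is not stated in the text, so it is matched but not interpreted.

```python
import re

# Pattern derived from the one sample line above; other log lines may differ.
LOG_LINE = re.compile(
    r"\[\d+ [^\]]+\] Rewrite: (?P<rule>\S+) took "
    r"(?P<seconds>[\d.]+) seconds, (?P<failed>\d+) failed, "
    r"(?P<successful>\d+) successful"
)

def parse_timing(line):
    """Return the rule name, time and match counts, or None if no match."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    return {
        "rule": m.group("rule"),
        "seconds": float(m.group("seconds")),
        "failed": int(m.group("failed")),
        "successful": int(m.group("successful")),
    }

sample = ("[25163 Mon Mar 12 14:48:27 2001] Rewrite: ADS took "
          "0.48920 seconds, 381 failed, 2 successful")
info = parse_timing(sample)
print(info)
```

A rule with a high failed count relative to its successful count is the first candidate for rewriting around a rarer tag.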
strip tag <a> containing attrib <a href=~/doubleclick.net/>
would be very slow (for most documents that contain lots of <a> tags), but:
strip tag <a href=~/doubleclick.net/>
would be much faster (by a factor of 3, by my tests). But an even faster way is:
strip tag <img src=~/doubleclick.net/> add tagblock <a>
Since most documents contain many <a> tags, both of the first two examples will be pretty slow. Assuming the document contains more <a> tags than <img> tags, the last example will be fastest.
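Before choosing which tag to anchor a rule on, you can estimate how many failed matches each choice would produce on a saved page. The Python sketch below is a rough approximation using regexes, not FilterProxy's own parser; the function name count_tag_candidates and the sample page are hypothetical.

```python
import re

def count_tag_candidates(html, tagname, attr, value_pattern):
    """Return (failed, successful): tags whose name matches, split by
    whether the given attribute's value matches value_pattern."""
    tag_re = re.compile(r"<%s\b([^>]*)>" % re.escape(tagname), re.IGNORECASE)
    attr_re = re.compile(
        r"""\b%s\s*=\s*["']?([^"'\s>]*)""" % re.escape(attr), re.IGNORECASE
    )
    value_re = re.compile(value_pattern)
    failed = successful = 0
    for tag in tag_re.finditer(html):
        found = attr_re.search(tag.group(1))
        if found and value_re.search(found.group(1)):
            successful += 1
        else:
            failed += 1  # tag name matched, attributes did not
    return failed, successful

page = ('<a href="http://x.com/">x</a>'
        '<a href="http://ad.doubleclick.net/c">'
        '<img src="http://ad.doubleclick.net/b.gif"></a>'
        '<a href="/y">y</a>')
a_counts = count_tag_candidates(page, "a", "href", r"doubleclick\.net")
img_counts = count_tag_candidates(page, "img", "src", r"doubleclick\.net")
print(a_counts, img_counts)
```

On this sample page, anchoring on <a> gives two failed matches for one success, while anchoring on <img> gives none, which is the same effect described above on a much smaller scale.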
regex finder instead of the tag finder, the string matched by the regex
matcher must be unique in the document, and an equivalent tag matcher would
have very many false matches (where the tag name matches, but the attributes
of the tag do not).
The basic syntax is:

    [NAME:] command matcher [[qualifying predicate] [expanding predicate] ...]

Note: [] means "optional", {} means "mandatory", ... means "more than one".
<> are literal, and must be included as part of the rule.

Commands:
    strip
        remove from file
    rewrite {matcher} as {html}
        change matched text to something else
    ignore {NAME} [...]
        ignore a named rule (can specify more than one)

Expanding Predicates: (modifiers that expand an existing match, but will not
cause the match to fail if not found)
    add {matcher}
        grow match to include text matched by [matcher] (use a matcher
        below) (can apply more than once, order matters)
        if [matcher] is one of (tag, tagblock, regex, attrib), the match
        will grow forward from the previous point until it finds [matcher].

Qualifying Predicates: (modifiers that also must match in order to consider
it a match)
    inside {matcher}
        like encloser, except that the match that precedes it must be
        *inside* the match that follows it. This does not change the
        original match (use add encloser instead if you want to strip the
        thing it's inside). Note that currently, the only matcher that will
        succeed is 'encloser', since none of the other matchers can search
        backwards.
    containing {matcher}
        the matched block must contain {matcher}
    preceeding {matcher}
        the matched block must come before {matcher}. {matcher} is not
        considered a part of the match.

Matchers: Each matcher "finds" a block of text that gets passed to the
predicates that follow it.
    tag [options] <{tagname} [attrib[=value]] [...]>
        Will grab a single tag (without corresponding closing tag). Any of
        tagname, attrib, or value can be a regular expression by enclosing
        them in one of these regex delimiters: [/#%&!,=:]. An empty regex
        (//) is valid (and will match everything). If a value is not
        enclosed by regex delimiters, it will match all valid means of
        specifying that value, regardless of the quote characters
        surrounding the value in the rule, or the matched HTML.
        (e.g. "tag <img width=1>" will match <img width="1">,
        <img width='1'> and <img width=1>) If more attribs are present in
        the HTML than are specified in the rule, it will still match.
        'tag', 'tagblock', 'attrib', and 'encloser' all use this method of
        specifying the tag to be matched.
    tagblock <{tagname} [attrib[=value]] [...]>
        Matches the block starting at the specified tag, and ending at the
        corresponding closing tag. (like old 'tag -tagblock') (see 'tag')
    attrib <{tagname} attrib[=value] [attrib[=value]] [...]>
        Will grab the attribute specified. Note that you can specify more
        than one attribute, and the *first* one is the one that will be
        stripped/rewritten, but the tagname must match and other attribs
        are required to be present. (see 'tag')
    regex /regex/
        Match any (perl) regex. Regex must be delimited by one of:
        [/#%&!,=:]. Note that this does matching (m//), not s///, tr/// or
        y///. (yet)

Expanding Matchers: These matchers must be given as an argument to 'add'.
    encloser <{tagname} [attrib[=value]] [...]>
        Like tagblock, except that the block must enclose the previous
        match. (only makes sense as argument to 'add', and should really be
        named "enclosing tag block" but that's too long) (see 'tag')
    leader <{tagname} [attrib[=value]] [...]>
        Like tag, except that it searches backwards from the current match.
        (only makes sense as argument to 'add')
        Note: There is no 'trailer' or 'follower' matcher. Use
        'add tag ...' to search forward in a similar manner.
    balanced
        Grow match to include "balanced" tags that have the tag preceding
        the match, and the corresponding closing tag trailing the match
        (with nothing in between). Only makes sense as argument to 'add'.
    alternate
        Grow the match to include "alternate content", i.e. script/noscript,
        frame/noframe, layer/nolayer etc. Only makes sense as argument to
        'add'. "alternate content" may precede or trail the original match.
    whitespace
        Grow the match to include whitespace (' ', '\n', '\t') as well as
        whitespace-like tags such as <p>, <br>, <hr>, and entities like
        &nbsp;. Note that only tags that precede the match will be added
        for speed. (searching backwards is hard)

In all cases more than one attrib can be specified. You may chain as many
matchers and predicates as you like, but if it starts to get too long it will
probably be ambiguous and not do what you might expect. (I need a BNF form
grammar for this syntax...see this discussion on perlmonks.org.)
% } else { # if(defined $SITECONFIG)
Rewrite has no global configuration. Please select a site from the main FilterProxy Config page. % }