Quick Start
Introduction
Installation
Configuration
Running News Clipper
How It Works
The News Clipper Tag Language
<!-- newsclipper <input name=date> -->
News Clipper is a Perl program that allows people to integrate dynamic information into their web pages. This information might be something simple, like the date, or complex, like a set of links to recent Usenet postings. News Clipper allows the user to specify, using an HTML-like syntax, the source of data, how that data should be filtered, and how that data should be output.
By separating acquisition of data, filtering of data, and output of data, News Clipper gives web designers more freedom to control the presentation of data. For example, you can specify that all headlines from Yahoo Tech News that have to do with Microsoft, Linux, or Y2K should be printed in three columns, with the matching words highlighted. Here's how the HTML might look:
<!--newsclipper <input name=yahootopstories source=tech> <filter name=grep words="microsoft,linux,y2k"> <filter name=map filter=highlight words="microsoft,linux,y2k"> <output name=array numcols=3> -->
Originally News Clipper was designed for a single user (me), but some effort has been spent to make it more generally useful. DOS/Windows installation is supported, as are system-wide installation as Perl modules and global HTML and image caches. Time zones are now supported for people whose time zone doesn't correspond to the server's.
News Clipper is a Perl program. If you are on a Unix-derived operating system, you should have Perl installed already. If you are on a DOS or Windows system, you are likely to have more difficulties than average users. If you are a Windows user and have never heard of Perl, you might be in for even more difficulty (i.e., find someone who can help you).
Instructions:
Read the README for more detailed instructions. Also check the FAQ if you run into problems.
For a complete description of all configuration options, run "perldoc NewsClipper.pl". Here are a few notes regarding the various configuration parameters. More description can be found in the NewsClipper.cfg file itself.
When News Clipper is run with the -c switch, the specified file is used as the configuration file. Otherwise, News Clipper looks for a configuration file in ~user/.NewsClipper, then in SYSCONFIGDIR, which is set in NewsClipper.pl at installation time.
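The lookup order described above can be sketched roughly as follows. This is illustrative Python, not News Clipper's own Perl code. The function name find_config, the assumption that the file is called NewsClipper.cfg in both directories, and the injectable exists check are all hypothetical.

```python
import posixpath

def find_config(cli_path, home_dir, sysconfigdir, exists=posixpath.exists):
    """Return the configuration file to use, or None if none is found."""
    # A file given with -c always wins.
    if cli_path is not None:
        return cli_path
    # Otherwise try the per-user directory, then the system-wide one.
    for directory in (posixpath.join(home_dir, ".NewsClipper"), sysconfigdir):
        candidate = posixpath.join(directory, "NewsClipper.cfg")  # assumed file name
        if exists(candidate):
            return candidate
    return None
```

Passing the exists check as a parameter only makes the sketch easy to try out without touching the file system.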
On Windows systems, the TZ environment variable is set in the configuration file during installation.
For single user installations, the News Clipper modules will not be in the standard Perl locations. In this case, modulepath in the configuration file is set to point to them. (This means you don't have to change your PERL5LIB environment variable or run perl with the -I flag.)
Timeouts are used to prevent News Clipper from running too long, and to prevent unresponsive remote servers from slowing things down. Set sockettimeout to the maximum amount of time that you want News Clipper to wait for a response from a server. Set scripttimeout to the maximum time that you want News Clipper to run. (Note that scripttimeout should be about equal to sockettimeout times the number of News Clipper tags in your input files.)
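As a worked example of the rule of thumb in the note above, the following Python sketch computes a starting value for scripttimeout. The helper name and the sample numbers (a sockettimeout of 10 and six tags) are assumptions chosen for illustration; use the same time unit for both settings.

```python
def suggested_script_timeout(socket_timeout, tag_count):
    """Worst case: every tag waits the full socket timeout on a slow server."""
    return socket_timeout * tag_count

# Six News Clipper tags in the input files, sockettimeout of 10:
print(suggested_script_timeout(10, 6))  # prints 60
```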
News Clipper can handle multiple input and output files. Be sure that the number of input files equals the number of output files.
News Clipper caches remote web pages internally. This means that an ISP with 100 users using the "lycosweather" handler won't hit the Lycos server 100 times. Also, authors of handlers specify the times that data is updated on remote servers, and News Clipper will only fetch data if it has been updated since the last time it was fetched. This is useful for things like comics, which only update once a day.
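The freshness check can be pictured with the minimal Python sketch below. It is not the actual implementation: how handler authors express update times is not described here, so the sketch assumes a hypothetical list of hours of the day at which the remote data changes.

```python
from datetime import datetime, timedelta

def needs_fetch(last_fetch, now, update_hours):
    """True if the remote data was scheduled to change after last_fetch."""
    if last_fetch is None:
        return True  # nothing cached yet
    # Walk over each day from the last fetch up to now and look for a
    # scheduled update time that falls inside (last_fetch, now].
    day = last_fetch.replace(hour=0, minute=0, second=0, microsecond=0)
    while day <= now:
        for hour in update_hours:
            update = day + timedelta(hours=hour)
            if last_fetch < update <= now:
                return True
        day += timedelta(days=1)
    return False
```

For a comic that updates once a day at 07:00, a page fetched at 08:00 would be served from the cache until 07:00 the next morning.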
The "cacheimages" handler caches remote images locally. When given a bit of HTML with <img src="URLx">, it caches the image pointed to by URLx, and substitutes a local URLy in the place of URLx. cacheimages also deletes old images from the cache after a specified time. These options can be given default values in the configuration file, which lets system maintainers provide a global image cache for all users.
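The URLx to URLy substitution can be sketched in Python as shown below. The sketch only rewrites the links and reports which remote images would need to be downloaded; it skips the download itself and the deletion of old images. The function name and the choice to name cached files after the last path component of the remote URL are assumptions, not the handler's documented behavior.

```python
import posixpath
import re

IMG_SRC = re.compile(r'(<img\s[^>]*?src=")([^"]+)(")', re.IGNORECASE)

def rewrite_image_urls(html, local_url):
    """Point each <img src> at local_url and list the remote URLs seen."""
    remote_urls = []

    def substitute(match):
        remote = match.group(2)
        remote_urls.append(remote)
        # Assumed naming scheme: reuse the remote file name in the cache.
        filename = posixpath.basename(remote)
        local = local_url.rstrip("/") + "/" + filename
        return match.group(1) + local + match.group(3)

    return IMG_SRC.sub(substitute, html), remote_urls
```

A real cache would also have to cope with two remote images that share a file name, which this sketch ignores.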
News Clipper also allows the user to specify the location where handlers should be stored. System maintainers can point this value to a globally accessible directory. Otherwise, it defaults to ~user/.NewsClipper/NewsClipper/Handler, where ~user is the user's home directory.
There are different ways of using the script:
If called as a CGI program, the output file is echoed to standard output with the text "Content-type: text/html" preceding it. This allows it to be called dynamically over the net via cgiwrap. For example:
<a href="http://www.host.com/cgi-bin/cgiwrap?user=you&script=NewsClipper.pl">
The first time you run the script each day, it may take a half-minute or so to collect the information (depending on network load and the amount of data to acquire). After that, the script is very fast because it will only pull data from the net if it needs to.
NewsClipper.pl processes command line options, the configuration file, and input and output files. Each input file is parsed, and when a comment of the form <!-- newsclipper...--> is found, the comment is parsed for commands to be executed.
If there is only one command to be executed (an input command), News Clipper determines the default filter and output handlers from the input handler. The resulting (expanded) command list will be composed of an input command, zero or more filter commands, and an output command.
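The expansion step can be illustrated with the Python sketch below. The defaults table is hypothetical: real handlers declare their own default filter and output commands, and the values shown for slashdot are placeholders rather than the handler's actual defaults.

```python
# Hypothetical defaults; each real handler supplies its own.
HANDLER_DEFAULTS = {
    "slashdot": {
        "filters": [("slashdot", {"type": "LinksAndText"})],
        "output": ("array", {}),
    },
}

def expand_commands(commands):
    """Expand a lone input command; commands are (kind, name, attrs) tuples."""
    if len(commands) != 1 or commands[0][0] != "input":
        return commands  # the tag already spells out its own pipeline
    _, name, _ = commands[0]
    defaults = HANDLER_DEFAULTS[name]
    expanded = list(commands)
    expanded += [("filter", n, attrs) for n, attrs in defaults["filters"]]
    out_name, out_attrs = defaults["output"]
    expanded.append(("output", out_name, out_attrs))
    return expanded
```

The result always has the shape described above: one input command, zero or more filter commands, and one output command.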
During input commands, the cache is checked to see if fresh data still exists. If not, the data is grabbed from the net, stored in the cache, and then used by the handler.
Each command is executed, and the results are fed into the next command. If anything goes wrong, News Clipper inserts a comment in the output file describing the problem.
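In outline, this chaining behaves like the Python sketch below, where each command is modeled as a function that receives the previous command's result. The wording of the inserted comment is an assumption; the actual message text is not given in this document.

```python
def run_pipeline(steps, data=None):
    """Feed each step's result into the next; steps are plain callables."""
    try:
        for step in steps:
            data = step(data)
        return data
    except Exception as error:
        # Report the problem inside the page instead of aborting the run.
        return "<!-- News Clipper message: %s -->" % error
```

Because the failure is reported as an HTML comment, the rest of the page is still produced and the problem is visible only in the page source.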
If, at any time, a handler cannot be found, News Clipper prompts the user to download it. The -n flag can be used to tell News Clipper to check for new versions of handlers, and the -a flag can be used to automatically download them.
With the release of News Clipper 7.0, users have much more flexibility when it comes to choosing how data should be displayed on their web pages. This is achieved by separating data acquisition, modification, and output into distinct steps.
A newsclipper tag is composed of three types of commands: <input name=...>, <filter name=...>, and <output name=...>. The first part of the command (input, filter, or output) tells News Clipper how to execute the command. The name attribute tells News Clipper which handler to use for the command. Additional attributes can also be specified for the command, and are passed on to the handler. Each handler has a set of default filter and output handler commands, so if you only specify the input command, the defaults are used.
First off, terminology: a string is a sequence of characters, possibly containing newlines. Strings can be HTML or regular text; it doesn't matter to News Clipper. An array is an ordered list of items. The items can be anything, even another array. A hash is an unordered list with named entries. For example, a hash might hold three strings named "author", "URL", and "description". The names in a hash are called the keys.
One important thing to note is the type of data that is input and output from each command. For example, if you use an input command that generates a list of items, and you then try to filter this list with a filter that expects a single string of data, an error will occur. The input and output types are documented in the comments of the handler.pm file located in your handlers directory, and also at the handler webpage.
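To make the three data types and the type mismatch concrete, here is a small Python sketch. Python's str, list, and dict stand in for News Clipper's string, array, and hash. The sample values and the expect_type helper are made up for illustration; real handlers document their accepted types in their own comments.

```python
def expect_type(data, expected, handler):
    """Raise TypeError when data is not the type a handler accepts."""
    if not isinstance(data, expected):
        raise TypeError("%s expects %s, got %s"
                        % (handler, expected.__name__, type(data).__name__))
    return data

headline = "Example headline"                      # string
headlines = [headline, "Another headline"]         # array of strings
article = {                                        # hash with three keys
    "author": "A. Writer",
    "URL": "http://www.example.com/story.html",
    "description": "Example story",
}

expect_type(headline, str, "highlight")            # fine: a single string
try:
    expect_type(headlines, str, "highlight")       # an array where a string is expected
except TypeError as error:
    print(error)  # prints: highlight expects str, got list
```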
There are over 100 handlers that can be used in input commands. Some handlers also perform filtering and output commands if the data that they generate is very specific to the handler. The majority of handlers, however, generate strings, lists, and hashes that can be manipulated using generic filters and output using generic output handlers.
Below is an example tag:
<!-- newsclipper <input name=slashdot type=articles> <filter name=slashdot type=LinksAndText> <filter name=limit number=4> <filter name=map filter=limit number=200 chars> <output name=array numcols=2 prefix="<p>-->" suffix="</p>"> -->
This tag specifies nearly everything, including values that already have defaults. The first command results in an array of hashes containing information about the current Slashdot articles. The next command is a filter, which uses one of the filters in the slashdot handler. The slashdot filter returns an array of strings, which is then sent to the generic "limit" filter to reduce the number of strings to four.
At this point, we have an array of four (or fewer) strings containing Slashdot links and text. The next command is a "map" filter, which applies another filter to the contents of a data structure. In this case, the map filter is applying the limit filter to the text in each item of our array. ("number=200 chars" tells the limit filter that we want to limit the number of characters, not the number of lines, which is the default behavior for strings.)
The final step is to print the array of shortened strings, so we send the data to the "array" handler, and tell it to print in two columns using our own special bullets and spacing.
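The two limit steps in this walkthrough can be approximated by the Python sketch below. It is a simplified model, not the handlers' code: the function names, the default depth of 1 for map, and the choice of which hash entries survive a limit are assumptions.

```python
def limit(data, number, chars=False):
    """Trim a string (by lines or characters), an array, or a hash."""
    if isinstance(data, str):
        if chars:
            return data[:number]
        return "\n".join(data.split("\n")[:number])
    if isinstance(data, dict):
        # Assumed: keep the first entries in iteration order.
        return dict(list(data.items())[:number])
    return data[:number]

def map_filter(data, filter_func, depth=1, **attrs):
    """Apply filter_func to the items found depth levels into data."""
    if depth == 0:
        return filter_func(data, **attrs)
    return [map_filter(item, filter_func, depth - 1, **attrs) for item in data]

items = ["x" * 300 for _ in range(6)]      # stand-in for six long article strings
four = limit(items, 4)                      # <filter name=limit number=4>
short = map_filter(four, limit, number=200, chars=True)
print(len(short), len(short[0]))            # prints: 4 200
```

Applying limit directly to the array trims the number of items, while wrapping it in map trims each string inside the array.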
The output might look something like this:
->Is Red Hat the Next Microsoft?    ->Mozilla M3 Release Available Now
->Wired on Kipling                  ->CeBIT Tidbits
If all of this seems too complicated, you can just settle for the default filters and output of the handlers. In the case of Slashdot, you would do this:
<!-- newsclipper <input name=slashdot> -->
And the default output would look like this:
The filters and output handlers described below come pre-installed with News Clipper. They will not be located in your .NewsClipper directory, but in the same location as the other News Clipper modules. (This location depends on your system configuration, and whether or not you did a site-wide installation.)
<filter name=grep words=X invert>
grep is named after the Unix command for finding lines in a file that contain a pattern. It takes a string, array, or hash, and returns the data that contains one of a set of words. The "invert" attribute can be used to return the data that does *not* contain any of the words. (Note that in the case of a hash, it is the values, not the keys, that are searched.)
<filter name=selectkeys keys=X invert>
Takes a hash and returns a smaller hash with the given keys. "invert" returns the hash that does not contain the keys.
<filter name=highlight style=X words=Y>
highlight surrounds the specified words with HTML tags. The style is "strong" by default.
<filter name=limit number=X chars>
Accepts a string, array, or hash, and returns the same. This filter trims the number of characters, lines, items, or keys to the number specified. "chars" must be specified if you want to treat strings as sequences of characters instead of lines.
<filter name=hash2array order=X>
hash2array takes a hash and a given key ordering, and returns an array whose items are the hash values in the specified order.
<filter name=map depth=X filter=Y [...]>
Suppose you have an array of strings, and want to apply the highlight filter to the strings. Unfortunately, highlight doesn't take arrays of strings. That's what this filter is for. "depth" tells map how many levels into your data structure to go before applying the filter given by "filter". Any additional arguments are passed on to the filter.
<filter name=cacheimages maxage=X dir=Y url=Z>
Suppose you have an array of HTML image links, and you want to cache them locally, and translate the links to point to the local images. Give cacheimages the "dir" to store the images in, the "url" that corresponds to that dir on the web, and it will download the images and store them for you. "maxage" tells the filter that it can delete images older than a certain number of seconds.
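The behavior of grep, selectkeys, and hash2array described above can be modeled with the short Python sketch below. It is an approximation for illustration only: case-insensitive substring matching and line-by-line filtering of strings are assumptions, and the real filters are Perl modules with their own rules.

```python
def grep(data, words, invert=False):
    """Keep the lines, items, or hash values that mention any of the words."""
    words = [w.lower() for w in words]

    def keep(text):
        found = any(w in str(text).lower() for w in words)
        return found != invert

    if isinstance(data, str):
        return "\n".join(line for line in data.split("\n") if keep(line))
    if isinstance(data, dict):
        return {k: v for k, v in data.items() if keep(v)}  # values, not keys
    return [item for item in data if keep(item)]

def selectkeys(data, keys, invert=False):
    """Return the part of a hash with (or, inverted, without) the given keys."""
    return {k: v for k, v in data.items() if (k in keys) != invert}

def hash2array(data, order):
    """Return the hash values as an array in the given key order."""
    return [data[key] for key in order]
```

For instance, hash2array applied to a hash with "author" and "URL" entries and the order URL, author yields a two-item array with the URL first.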
<output name=string>
Prints a string.
<output name=table header=X border=Y>
Takes a two-dimensional array and outputs a table having a border size as given by "border". "header" allows you to specify whether the top and/or left sides of the table should be headers.
<output name=array numcols=W prefix=X suffix=Y separator=Z>
Outputs an array of strings. "numcols" is the number of columns. "prefix", "separator" and "suffix" are strings to print before, between, and after each item. If prefix is "ul" or "ol", a bulleted or numbered list is created.
<output name=thread style=X>
Takes a "thread" data type, like you would see in discussion lists. Outputs using numbered or unnumbered lists, depending on whether the style is "ol" or "ul". See the handler's comments for a description of the thread data type.
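As a rough illustration of how the array output handler's prefix, suffix, and separator attributes combine, consider the Python sketch below. It leaves out numcols, since how items are distributed across columns is not specified here, and the exact HTML the real handler emits for "ul" and "ol" is an assumption.

```python
def output_array(items, prefix="", suffix="", separator=""):
    """Join strings with prefix/suffix around and separator between items."""
    if prefix in ("ul", "ol"):
        # Special case: build a bulleted or numbered HTML list.
        body = "".join("<li>%s</li>" % item for item in items)
        return "<%s>%s</%s>" % (prefix, body, prefix)
    return separator.join(prefix + item + suffix for item in items)

print(output_array(["a", "b"], prefix="<p>", suffix="</p>"))  # prints <p>a</p><p>b</p>
print(output_array(["a", "b"], prefix="ul"))  # prints <ul><li>a</li><li>b</li></ul>
```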