Command Line Syntax
The generic syntax is:
webbot [ options ] [ URI [ keywords ] ]
The order of the options is not important and options can in fact be specified
on either side of any URI. Currently available
options are:
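For illustration, here are a couple of command lines built from the options documented below (the URI is a placeholder; the commands are only echoed here, not executed):

```shell
# Illustrative invocations only; webbot itself is not run in this sketch.
cmd1='webbot -depth 1 http://www.w3.org/'
cmd2='webbot -v -timeout 30 http://www.w3.org/'
echo "$cmd1"
echo "$cmd2"
```

Options may appear before or after the URI, as noted above.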
Getting Help
-
-v [ a | b | c | g | p | s | t | u ]
-
Verbose mode: Gives a running commentary on the program's attempts to read
data in various ways. As the amount of verbose output is substantial, the
-v option can now be followed by zero, one or more of the following
flags (without space) in order to differentiate the verbose output generated:
-
a: Anchor relevant information
-
b: Bindings to local file system
-
c: Cache trace
-
g: SGML trace
-
p: Protocol module information
-
s: SGML/HTML relevant information
-
t: Thread trace
-
u: URI relevant information
The -v option without any appended options shows all trace messages.
An example is "-vpt", which shows protocol and thread
trace messages.
-
-version
-
Prints out the version number of the robot and the version number of libwww
and exits.
-
-depth [ n ]
-
Limit jumps to n hops from the start page; links at the last level (n-1)
are checked using a HEAD request. The default value is 0, which means that
only the start page is searched. A value of 1 causes the start page and
all pages directly linked from the start page to be checked.
-
-prefix [ URI ]
-
Define a URI prefix for all URIs - if they do not match the prefix then they
are not checked. The rejected URIs can be logged to a
separate file.
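The prefix test is effectively a leading-string comparison; a minimal shell sketch of the idea (the helper function, prefix, and URIs are illustrative, not part of webbot):

```shell
# Hypothetical sketch of a -prefix style check: a URI is traversed only
# if it begins with the configured prefix (values below are made up).
prefix='http://www.w3.org/Library/'

check_prefix () {
    case "$1" in
        "$prefix"*) echo pass ;;
        *)          echo reject ;;
    esac
}

a=$(check_prefix 'http://www.w3.org/Library/User/')
b=$(check_prefix 'http://www.example.com/other.html')
echo "$a $b"
```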
The following options are only available if you have linked against a
regex library handling regular expressions.
-
-exclude [ regex ]
-
Allows you to define a
regular expression
of which URIs should be excluded from the traversal. The rejected URIs can
be logged to a separate file. This can be used to
exclude specific parts of the URI space, for example all URIs containing
"/old/":
-exclude "/old/"
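The effect of such an exclusion can be sketched with grep -E over a made-up URI list (the URIs are placeholders for illustration):

```shell
# URIs matching the exclusion regex "/old/" are rejected; everything
# else remains eligible for traversal.
uris='http://www.w3.org/new/page.html
http://www.w3.org/old/page.html
http://www.w3.org/docs/index.html'

kept=$(printf '%s\n' "$uris" | grep -Ev '/old/')
rejected=$(printf '%s\n' "$uris" | grep -E '/old/')
echo "$rejected"
```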
-
-check [ regex ]
-
Check all URIs that match this regular expression with a HEAD method
instead of a GET method. This can be used to verify links while avoiding
downloading large distribution files, for example:
-check "\.gz$|\.Z$|\.zip$"
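A shell sketch of how such a suffix pattern separates HEAD candidates from GET candidates (the helper function and URIs are illustrative, not part of webbot):

```shell
# URIs matching the archive-suffix regex get a cheap HEAD request;
# everything else gets a full GET (sketch only; values are made up).
pick_method () {
    if printf '%s\n' "$1" | grep -qE '\.gz$|\.Z$|\.zip$'; then
        echo HEAD
    else
        echo GET
    fi
}

m1=$(pick_method 'http://www.w3.org/dist/libwww.tar.gz')
m2=$(pick_method 'http://www.w3.org/Overview.html')
echo "$m1 $m2"
```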
-
-include [ regex ]
-
Allows you to define a
regular expression
of which URIs should be included in the traversal.
-
-img
-
Test inlined images as well, using a HEAD request.
-
-saveimg
-
Save the inlined images to local disk, or pump them into a black hole.
This is primarily to emulate a GUI client's behavior using
the robot.
-
-alt [ file ]
-
Specifies a
Referer Log
Format style log file of all inlined images with a missing or
empty ALT tag.
-
-prefix [ URI ]
-
Define a URI prefix for all inlined image URIs - if they do not match the
prefix then they are not checked. The rejected URIs can be
logged to a separate file.
-
-404 [ file ]
-
Specifies a
Referer Log
Format style log file of all links resulting in a 404 (Not Found) status
code.
-
-l [ file ]
-
Specifies a
Common
Log File Format style log file with a list of visited documents and the
result codes obtained.
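For post-processing such a log, the status code in a Common Log Format line is the ninth whitespace-delimited field; a sketch with awk over a made-up sample line:

```shell
# A made-up Common Log Format line; splitting on whitespace, the HTTP
# status code lands in field 9.
line='www.example.com - - [07/Feb/1998:23:59:36 +0000] "GET /Robot/ HTTP/1.0" 200 2326'
status=$(printf '%s\n' "$line" | awk '{print $9}')
echo "$status"
```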
-
-negotiated [ file ]
-
Specifies a log file of all URIs that were subject to content negotiation.
-
-referer [ file ]
-
Specifies a
Referer Log
Format style log file of which documents point to which documents.
-
-reject [ file ]
-
Specifies a log file of all the URIs encountered that didn't fulfill the
constraints for traversal.
Distribution and Statistics Features
-
-format [ file ]
-
Specifies a log file of which media types (content types)
were encountered in the run and their distribution.
-
-charset [ file ]
-
Specifies a log file of which charsets (content type parameter)
were encountered in the run and their distribution.
-
-hitcount [ file ]
-
Specifies a log file of URIs sorted by how many times
they were referenced in the run.
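An equivalent hit count can be computed from any list of visited URIs with standard tools; a sketch over made-up data (not webbot's actual implementation):

```shell
# One line per reference; sort | uniq -c counts repeats, and sort -rn
# puts the most-referenced URI first (sample data is made up).
visits='http://www.w3.org/
http://www.w3.org/Library/
http://www.w3.org/
http://www.w3.org/Robot/
http://www.w3.org/'

top=$(printf '%s\n' "$visits" | sort | uniq -c | sort -rn | head -1)
echo "$top"
```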
-
-lm [ file ]
-
Specifies a log file of URIs sorted by last-modified date.
This gives a good overview of the dynamics of the web site that you are checking.
-
-rellog [ file ]
-
Specifies a log file of any link relationship found in the HTML LINK
tag (either the REL or the REV
attribute) that has the relation specified in the -relation parameter
(all relations are modelled by libwww as "forward"). For example "-rellog
stylesheets-logfile.txt -relation stylesheet" will produce a log file
of all link relationships of type "stylesheet". The format of the log file
is
"<relationship> <media type> <from-URI> -->
<to-URI>"
meaning that the from-URI has the forward relationship
with to-URI.
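Given that documented format, a log line can be pulled apart with awk (the sample line below is made up for illustration):

```shell
# A sample line in the documented -rellog format:
# <relationship> <media type> <from-URI> --> <to-URI>
line='stylesheet text/css http://www.w3.org/Overview.html --> http://www.w3.org/style.css'
rel=$(printf '%s\n' "$line" | awk '{print $1}')
to=$(printf '%s\n' "$line" | awk '{print $5}')
echo "$rel $to"
```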
-
-title [ file ]
-
Specifies a log file of URIs sorted by any title found
either in an HTTP header or in the HTML.
Persistent Cache
-
-cache
-
Enable the libwww persistent
cache
-
-cacheroot [ dir ]
-
Where should the cache be located? The default is /tmp/w3c-cache
-
-validate
-
Force validation using either the etag or the last-modified
date provided by the server
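In HTTP terms, validation means replaying the stored validators as conditional request headers; a sketch with made-up values (this shows the headers involved, not webbot's actual code):

```shell
# A validating request reuses the cached entity's ETag and
# Last-Modified values as conditional headers (values are made up).
etag='"abc123"'
lastmod='Sat, 07 Feb 1998 23:59:36 GMT'
hdrs=$(printf 'If-None-Match: %s\nIf-Modified-Since: %s\n' "$etag" "$lastmod")
echo "$hdrs"
```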
-
-endvalidate
-
Force end-to-end validation by adding a max-age=0 cache control
directive
Other Options
-
-delay [ n ]
-
Specify the write delay in milliseconds for how long we can wait until we
flush the output buffer when using pipelining. The default value is 50 ms.
The longer the delay, the bigger the TCP packets, but also the longer the
response time.
-
-n
-
Non-interactive mode.
-
-nopipe
-
Do not use HTTP/1.1 pipelining. The default for this option can be
set using the configure script under
installation.
-
-o [ file ]
-
Redirects output to the specified file. This option forces non-interactive mode.
-
-q
-
Somewhat quiet mode.
-
-Q
-
Really quiet mode.
-
-r <file>
-
Rule file, a.k.a. configuration
file. If this is specified, a rule file may be used to map URLs, and
to set up other aspects of the behavior of the browser. Many rule files may
be given with successive -r options, and a default rule file name may be
given using the WWW_CONFIG environment variable.
-
-single
-
Single threaded mode. If this flag is set then the browser uses blocking,
non interruptible I/O in interactive mode. Non-interactive mode always uses
blocking I/O.
-
-ss
-
Print out date and time for start and stop for the job.
-
-timeout <n>
-
Timeout in seconds on sockets.
The URI is the hypertext
address of the document at which you want to start the robot.
Any further command line arguments are taken as keywords. The first argument
must refer to an index in this case. The index is searched for entries matching
the keywords, and a list of matching entries is displayed.
Henrik Frystyk Nielsen
@(#) $Id: CommandLine.html,v 1.20 1998/02/07 23:59:36 frystyk Exp $