pavuk - HTTP, HTTP over SSL, FTP, FTP over SSL and Gopher recursive document retrieval program
pavuk [-X] [-x] [-with_gui] [-runX] [-[no]bg] [-[no]prefs] [-h] [-help] [-v] [-version]
pavuk [-mode {normal | resumeregets | singlepage | singlereget | sync | dontstore | ftpdir | mirror}] [-X] [-x] [-with_gui] [-runX] [-[no]bg] [-[no]prefs] [-[no]progress] [-[no]stime] [-xmaxlog $nr ] [-logfile $file ] [-slogfile $file ] [-auth_file $file ] [-msgcat $dir ] [-language $str ] [-gui_font $font ] [-quiet/-verbose] [-[no]read_css] [-[no]read_msie_cc] [-[no]read_cdata] [-[no]read_comments] [-cdir $dir ] [-scndir $dir ] [-scenario $str ] [-dumpscn $filename ] [-dumpdir $dir ] [-dumpcmd $filename ] [-l $nr ] [-lmax $nr ] [-dmax $nr ] [-leave_level $nr ] [-maxsize $nr ] [-minsize $nr ] [-asite $list ] [-dsite $list ] [-adomain $list ] [-ddomain $list ] [-asfx $list ] [-dsfx $list ] [-aprefix $list ] [-dprefix $list ] [-amimet $list ] [-dmimet $list ] [-pattern $pattern ] [-url_pattern $pattern ] [-rpattern $regexp ] [-url_rpattern $regexp ] [-skip_pattern $pattern ] [-skip_url_pattern $pattern ] [-skip_rpattern $regexp ] [-skip_url_rpattern $regexp ] [-newer_than $time ] [-older_than $time ] [-schedule $time ] [-reschedule $nr ] [-[dont_]leave_site] [-[dont_]leave_dir] [-http_proxy $site[:$port] ] [-ftp_proxy $site[:$port] ] [-ssl_proxy $site[:$port] ] [-gopher_proxy $site[:$port] ] [-[no]ftp_httpgw] [-[no]ftp_dirtyproxy] [-[no]gopher_httpgw] [-[no]FTP] [-[no]HTTP] [-[no]SSL] [-[no]Gopher] [-[no]FTPdir] [-[no]CGI] [-[no]FTPlist] [-[no]FTPhtml] [-[no]Relocate] [-[no]force_reget] [-[no]cache] [-[no]check_size] [-[no]Robots] [-[no]Enc] [-auth_name $user ] [-auth_passwd $pass ] [-auth_scheme {1/2/3/4/user/Basic/Digest/NTLM} ] [-[no_]auth_reuse_nonce] [-http_proxy_user $user ] [-http_proxy_pass $pass ] [-http_proxy_auth {1/2/3/4/user/Basic/Digest/NTLM} ] [-[no_]auth_reuse_proxy_nonce] [-ssl_key_file $file ] [-ssl_cert_file $file ] [-ssl_cert_passwd $pass ] [-from $email ] [-[no]send_from] [-identity $str ] [-[no]auto_referer] [-[no]referer] [-[no]persistent] [-alang $list ] [-acharset $list ] [-retry $nr ] [-nregets $nr ] [-nredirs $nr ] [-rollback $nr ] [-sleep 
$nr ] [-[no]rsleep ] [-timeout $nr ] [-rtimeout $nr ] [-wtimeout $nr ] [-[no]preserve_time] [-[no]preserve_perm] [-[no]preserve_slinks] [-bufsize $nr ] [-maxrate $nr ] [-minrate $nr ] [-user_condition $str ] [-cookie_file $file ] [-[no]cookie_send] [-[no]cookie_recv] [-[no]cookie_update] [-cookies_max $nr ] [-disabled_cookie_domains $list ] [-disable_html_tag $TAG[,$ATTRIB[,$ATTRIB...]][;$TAG[,$ATTRIB[,$ATTRIB...]][;...]] ] [-enable_html_tag $TAG[,$ATTRIB[,$ATTRIB...]][;$TAG[,$ATTRIB[,$ATTRIB...]][;...]] ] [-tr_del_chr $str ] [-tr_str_str $str1 $str2 ] [-tr_chr_chr $chrset1 $chrset2 ] [-index_name $str ] [-[no]store_index] [-store_name $str ] [-[no]debug] [-debug_level $level ] [-browser $str ] [-urls_file $file ] [-file_quota $nr ] [-trans_quota $nr ] [-fs_quota $nr ] [-enable_js/-disable_js] [-fnrules $t $m $r ] [-mime_type_file $file ] [-[no]store_info] [-[no]all_to_local] [-[no]sel_to_local] [-[no]all_to_remote] [-url_strategy $strategy ] [-[no]remove_adv] [-adv_re $RE ] [-[no]check_bg] [-[no]send_if_range] [-sched_cmd $str ] [-[no]unique_log] [-post_cmd $str ] [-ssl_version $v ] [-[no]unique_sslid] [-aip_pattern $re ] [-dip_pattern $re ] [-[no]use_http11] [-local_ip $addr ] [-request $req ] [-formdata $req ] [-httpad $str ] [-nthreads $nr ] [-[no]immesg] [-dumpfd {$nr | @[@]$filepath }] [-dump_urlfd {$nr | @[@]$filepath }] [-[no]unique_name] [-[dont_]leave_site_enter_dir] [-max_time $nr ] [-[no]del_after] [-[no]singlepage] [-[no]dump_after] [-[no]dump_response] [-[no]dump_request] [-auth_ntlm_domain $str ] [-auth_proxy_ntlm_domain $str ] [-js_pattern $re ] [-follow_cmd $str ] [-[no]retrieve_symlink] [-js_transform $p $t $h $a ] [-js_transform2 $p $t $h $a ] [-ftp_proxy_user $str ] [-ftp_proxy_pass $str ] [-[dont_]limit_inlines] [-ftp_list_options $str ] [-[no]fix_wuftpd_list] [-[no]post_update] [-info_dir $dir ] [-mozcache_dir $dir ] [-aport $list ] [-dport $list ] [-[no]hack_add_index] [-default_prefix $str ] [-ftp_login_handshake $host $handshake ] 
[-js_script_file $file ] [-dont_touch_url_pattern $pat ] [-dont_touch_url_rpattern $pat ] [-dont_touch_tag_rpattern $pat ] [-tag_pattern $tag $attrib $url ] [-tag_rpattern $tag $attrib $url ] [-nss_cert_dir $dir ] [-[no]nss_accept_unknown_cert] [-nss_domestic_policy/-nss_export_policy] [-[no]verify] [-tlogfile $file ] [-trelative {object | program} ] [-tp FQDN[:port] ] [-transparent_proxy FQDN[:port] ] [-tsp FQDN[:port] ] [-transparent_ssl_proxy FQDN[:port] ] [-[not]sdemo] [-noencode] [-[no]ignore_chunk_bug] [-hammer_mode $nr ] [-hammer_threads $nr ] [-hammer_flags $nr ] [-hammer_ease $nr ] [-hammer_rtimeout $nr ] [-hammer_repeat $nr ] [-[no]log_hammering] [-hammer_recdump {$nr | @[@]$filepath }] [URLs ]
pavuk [-mode {normal | singlepage | singlereget}] [-base_level $nr ]
pavuk [-mode sync] [-ddays $nr ] [-subdir $dir ] [-[no]remove_old]
pavuk [-mode resumeregets] [-subdir $dir ]
pavuk [-mode linkupdate] [-cdir $dir ] [-subdir $dir ] [-scndir $dir ] [-scenario $str ]
pavuk [-mode reminder] [-remind_cmd $str ]
pavuk [-mode mirror] [-subdir $dir ] [-[no]remove_old] [-[no]remove_before_store] [-[no]always_mdtm]
This manual page describes how to use pavuk.
Pavuk can be used to mirror the contents of Internet/intranet servers and to maintain copies in a local tree of documents. Pavuk stores retrieved documents in locally mapped disk space. The structure of the local tree is the same as the one on the remote server. Each supported service (protocol) has its own sub-directory in the local tree. Each referenced server has its own sub-directory inside these protocol sub-directories, followed by the port number on which the service resides, separated by a delimiter character (which can be changed). With the option -fnrules you can change the default layout of the local document tree without losing link consistency.
With pavuk it is possible to have up-to-date copies of remote documents in the local disk space.
As of version 0.3pl2, pavuk can automatically restart broken connections and reget partial content from an FTP server (which must support the REST command), from a properly configured HTTP/1.1 server, or from an HTTP/1.0 server which supports Ranges.
As of version 0.6 it is possible to handle configurations via so-called scenarios. The best way to create such a configuration file is to use the X Window interface and simply save the created configuration. The other way is to use the -dumpscn switch.
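As an illustration (the scenario name, directory and URL below are hypothetical), a configuration could be saved with -dumpscn and reused later with -scenario :

```shell
# Save the current set of options as a scenario named 'mymirror'
# (~/.pavuk/scenarios is an assumed, user-chosen scenario directory):
pavuk -scndir ~/.pavuk/scenarios -dumpscn mymirror -lmax 2 http://example.com/
# Later, run the saved scenario again:
pavuk -scndir ~/.pavuk/scenarios -scenario mymirror
```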
As of version 0.7pl1 it is possible to store authentication information into an authinfo file, which pavuk can then parse and use.
As of version 0.8pl4 pavuk can fetch documents for use in a local proxy/cache server without storing them to local documents tree.
As of version 0.9pl4 pavuk supports SOCKS (4/5) proxies if you have the required libraries.
As of version 0.9pl12 pavuk can preserve permissions of remote files and symbolic links, so it can be used for powerful FTP mirroring.
The pavuk releases starting at 0.9.36 support dumping commands to a specific file (see the -dumpdir and -dumpcmd arguments).
Pavuk supports SSL connections to FTP servers, if you specify ftps:// URL instead of ftp://.
Pavuk can automatically handle file names containing characters that are unsafe for the file system. This is currently implemented only on the Win32 platform, and the behavior is hard-coded.
Pavuk can now use the HTTP/1.1 protocol for communication with HTTP servers. It can use persistent connections, so a single TCP connection can be used to transfer several documents without closing it. This feature saves network bandwidth and also speeds up network communication.
Pavuk can issue configurable POST requests to HTTP servers and also supports file uploading via HTTP POST requests.
Pavuk can automatically fill in HTML forms it finds, if the user supplies data for their fields beforehand with the -formdata option.
Pavuk can run a configurable number of concurrent download threads when compiled with multithreading support.
Pavuk 0.9pl128 introduced the use of JavaScript bindings for doing some complicated tasks (e.g. decision making, filename transformation) which need some more computing complexity than may be achieved with a regular, non-scriptable program.
pavuk 0.9.36 introduced the optional multiplier suffixes K, M or G for numeric parameter values of command line options. These multipliers represent the ISO multipliers Kilo(1000), Mega(1000000) and Giga(1000000000), unless otherwise specified: some command line options relate to memory or disk sizes in either bytes or kBytes, and for those the multipliers are instead processed as the nearest power of 2: K(1024), M(1048576) or G(1073741824).
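The same suffix therefore resolves to different byte counts depending on the option's domain. A quick sketch of the two interpretations of 2M :

```shell
# '2M' read as an ISO (decimal) multiplier, e.g. for transfer rates:
decimal=$((2 * 1000 * 1000))
# '2M' read as a binary multiplier, e.g. for memory/disk-size options:
binary=$((2 * 1024 * 1024))
echo "$decimal $binary"
```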
http://[[user][:password]@]host[:port][/document]
[[user][:password]@]host[:port][/document]
https://[[user][:password]@]host[:port][/document]
ssl[.domain][:port][/document]
ftp://[[user][:password]@]host[:port][/relative_path][;type=x]
ftp://[[user][:password]@]host[:port][//absolute_path][;type=x]
ftp[.domain][:port][/document][;type=x]
ftps://[[user][:password]@]host[:port][/relative_path][;type=x]
ftps://[[user][:password]@]host[:port][//absolute_path][;type=x]
ftps[.domain][:port][/document][;type=x]
gopher://host[:port][/type[document]]
gopher[.domain][:port][/type[document]]
http://[[user][:password]@]host[:port][/document][?query]
to
http/host_port/[document][?query]
https://[[user][:password]@]host[:port][/document][?query]
to
https/host_port/[document][?query]
ftp://[[user][:password]@]host[:port][/path]
to
ftp/host_port/[path]
ftps://[[user][:password]@]host[:port][/path]
to
ftps/host_port/[path]
gopher://host[:port][/type[document]]
to
gopher/host_port/[type[document]]
Note
Pavuk will use the string with which it queries the target server as the name of the results file. This file name may, in some cases, contain punctuation characters such as $ , ? , = , & etc. Such punctuation can cause problems when you try to browse downloaded files with your browser, process them with shell scripts, or view them with file management utilities that reference the name of the results file. If you believe this may be causing problems for you, you can remove any punctuation or other undesirable characters from the result file name with the option -tr_del_chr [:punct:] , or with the other options for adjusting file names (-tr_str_str and -tr_chr_chr ).
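The effect of stripping [:punct:] characters from a generated file name can be previewed with the standard tr utility (the sample name below is made up):

```shell
# A query-style result name full of shell-unfriendly punctuation:
name='search.php?q=pavuk&page=2'
# Remove every [:punct:] character, as -tr_del_chr [:punct:] would:
stripped=$(printf '%s' "$name" | tr -d '[:punct:]')
echo "$stripped"
```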
The order in which these URL to file name conversions are applied is as follows:
each of the -fnrules specified on the pavuk commandline or in the GUI, in the order in which they were listed.
-tr_str_str is applied after that.
-tr_del_chr follows.
-tr_chr_chr is the last transformation, before ...
... the cache directory prefix (as specified through -cdir ) is prepended to the generated file path. In this last step, any missing ’/’ directory separator is included as glue between the two parts: the cache prefix and the generated local file path, which may or may not contain subdirectories of its own.
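A rough sketch of that ordering, using standard sed / tr as stand-ins for pavuk's internal transformations (the sample name and the /var/mirror cache prefix are illustrative):

```shell
name='http/server_80/dir/page.php?id=1'
# 1. -fnrules would run first (skipped in this sketch)
# 2. -tr_str_str page index
name=$(printf '%s' "$name" | sed 's/page/index/')
# 3. -tr_del_chr '?'
name=$(printf '%s' "$name" | tr -d '?')
# 4. -tr_chr_chr '=' '_'
name=$(printf '%s' "$name" | tr '=' '_')
# 5. finally the cache directory prefix (-cdir) is glued on with a '/':
path="/var/mirror/$name"
echo "$path"
```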
All options are case insensitive.
Mode
Help
Indicate/Logging/Interface options
Netli options
Special start
Scenario/Task options
Directory options
Preserve options
Proxy options
Proxy authentication
Protocol/Download Option
Authentication
Site/Domain/Port Limitation Options
Limitation Document properties
Limitation Document name
Limitation Protocol Option
Other Limitation Options
JavaScript support
Cookie
HTML rewriting engine tuning options
File name / URL Conversion Option
Hammer mode options: load testing web sites
Other Options
Set operation mode.
recursively retrieves documents
update remote URLs in local HTML documents to local URLs if these URLs exist in the local tree
synchronize remote documents with local tree (if a local copy of a document is older than remote, the document is retrieved again, otherwise nothing happens)
URL is retrieved as one page with all inline objects (pictures, sounds, ...); this mode is now obsoleted by the -singlepage option.
pavuk scans the local tree for files that were not retrieved fully and retrieves them again (uses partial get if possible)
get URL until it is retrieved in full
transfer page from server, but don’t store it to the local tree. This mode is suitable for fetching pages that are held in a local proxy/cache server.
used to inform the user about changed documents
similar to the ’sync’ mode, but will automatically remove local documents which do not exist anymore on the remote site. This mode will make an exact copy of the remote site, including keeping the file names intact as much as possible.
used to list the contents of FTP directories
The default operation mode is normal.
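A few illustrative invocations showing how a mode is selected (the URLs are placeholders):

```shell
# Recursive retrieval (the default mode):
pavuk -mode normal http://example.com/
# Synchronize a previously downloaded tree, removing files gone remotely:
pavuk -mode mirror -remove_old http://example.com/
# Finish any partially retrieved files in the local tree:
pavuk -mode resumeregets
```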
Print long verbose help message
Show version information and feature set configuration at compilation time.
Feature : Developer Debug Build
Short description : Identifies this pavuk binary as compiled with debug
features enabled (-DDEBUG), such as extra run-time checks.
Affects : all
Feature : Debug features
Short description : This pavuk binary can show very detailed debug /
diagnostic information about the grabbing process, including message dumps,
etc.
Affects : -debug/-nodebug , -debug_level $level
Feature : GNU gettext internationalization of messages
Short description : Important messages can be shown in the local
language.
Affects : -language , -msgcat
Feature : flock() / fcntl() document locking
Short description : When you do not have this built in, you should refrain
from running multiple pavuk binaries and/or multithreaded sessions. Depending on
the built-in locking type (’flock()’, ’Win32 flock()’ or
’fcntl()’) you may or may not be able to safely use network-shared storage
for the results of your session: fcntl() locking is assumed to be capable of
locking files on NFS shares, while flock() very probably won’t be able to do
that.
Affects : file I/O
Feature : Gtk GUI interface
Short description : You can use the built-in GUI.
Affects : -X , -with_gui , -runX , -prefs ,
-noprefs , -xmaxlog , -gui_font
Feature : GUI with URL tree preview
Short description : You can use the built-in GUI URL tree views.
Affects : -browser
Feature : HTTP and FTP over SSL; SSL layer implemented with OpenSSL /
SSLeay / NSS library
Short description : You can access SSL secured URLs / sites and proxies.
pavuk may have been built with either OpenSSL, SSLeay or Netscape SSL support. Some
features are only available with the one, some only with another
implementation.
Affects : -noSSL , -SSL , -verify , -noverify ,
-noFTPS , -FTPS , -ssl_cert_passwd , -ssl_cert_file ,
-ssl_key_file , -ssl_cipher_list , -ssl_proxy ,
-ssl_version , -unique_sslid , -nounique_sslid ,
-nss_cert_dir , -nss_accept_unknown_cert ,
-nonss_accept_unknown_cert , -nss_domestic_policy ,
-nss_export_policy
Feature : Socks proxy support
Short description : You can use SOCKS4 and/or SOCKS5 proxies.
Affects :
Feature : file-system free space checking
Short description : You can use quotas to prevent your local storage from
filling up / overflowing.
Affects : -file_quota
Feature : optional regex patterns in -fnrules and -*rpattern options
Short description : You can use regular expressions to help pavuk select and
filter content. pavuk also mentions which regex engine has been built in: POSIX,
Bell V8, BSD, GNU, PCRE or TRE
Affects : -rpattern , -skip_rpattern , -url_rpattern ,
-skip_url_rpattern , -remove_adv , -noremove_adv ,
-adv_re , -aip_pattern , -dip_pattern , -js_pattern ,
-js_transform , -js_transform2 , -dont_touch_url_rpattern ,
-dont_touch_tag_rpattern , -tag_rpattern
Feature : support for loading files from Netscape browser cache
Short description : You can access the private browser cache of Netscape
browsers.
Affects : -nscache_dir
Feature : support for loading files from Microsoft Internet Explorer
browser cache
Short description : You can access the private browser cache of Microsoft
Internet Explorer browsers.
Affects : -ie_cache
Feature : support for detecting whether pavuk is running as background
job
Short description : Progress reports, etc. will be disabled when pavuk is
running as a background task
Affects : -check_bg , -nocheck_bg , -progress_mode ,
-verbose , -noverbose , -noquiet , -debug_level ,
-nodebug , -debug , ...
Feature : multithreading support
Short description : Allows pavuk to perform multiple tasks
simultaneously.
Affects : -hammer_threads , -nthreads , -immesg ,
-noimmesg
Feature : NTLM authorization support
Short description : You can access web servers which use NTLM-base access
security.
Affects : -auth_ntlm_domain , -auth_proxy_ntlm_domain
Feature : JavaScript bindings
Short description : You can use JavaScript-based filters and patterns.
Affects : -js_script_file
Feature : IPv6 support
Short description : Pavuk incorporates basic IPv6 support.
Affects :
Feature : HTTP compressed data transfer (gzip/compress/deflate
Content-Encoding)
Short description : pavuk supports compressed transmission formats (HTTP
Accept-Encoding) to reduce network traffic load.
Affects : -noEnc , -Enc
Feature : DoS support (a.k.a. ’chunky’ a.k.a. ’hammer
modes’)
Short description : this pavuk binary can be used to test
(’hammer’) your sites
Affects : -hammer_recdump , -log_hammering ,
-nolog_hammering , -hammer_threads , -hammer_mode ,
-hammer_flags , -hammer_ease , -hammer_rtimeout ,
-hammer_repeat
Don’t show any messages on the screen.
Force output messages to be shown on the screen (default)
Show retrieving progress while running in the terminal (default is progress off). When turned on, progress will be shown in the format specified by the -progress_mode setting.
Note
This option only has effect when pavuk is run in a console window.
Specify how progress (see -progress ) will be shown to the user. Several modes $nr are supported:
Report every run (-hammer_mode ) and URL fetched on a separate line. Also show the download progress (bytes and percentage downloaded) while fetching a document from the remote site. This is the most verbose progress display. (default)
Example output:
URL[ 1]: 35(0) of 56 http://hobbelt.com/CAT-tuts/panther-l2-50pct.jpg S: 10138 / 10138 B [100.0%] [R: 187.8 kB/s] [ET: 0:00:00] [RT: 0:00:00]
URL[ 1]: 38(0) of 56 http://hobbelt.com/CAT-tuts/get-started-cat-50pct.jpg S: 5868 / 5868 B [100.0%] [R: 114.8 kB/s] [ET: 0:00:00] [RT: 0:00:00]
URL[ 2]: 34(0) of 56 http://hobbelt.com/CAT-tuts/CAT_Panther_CM2.avi S: 8311 / 8311 kB [100.0%] [R: 4.7 MB/s] [ET: 0:00:01] [RT: 0:00:00]
URL[ 2]: 40(0) of 56 http://hobbelt.com/icons/knowspam-teeny-9.gif S: 817 / 817 B [100.0%] [R: 20.3 kB/s] [ET: 0:00:00] [RT: 0:00:00]
Report every run (-hammer_mode ) in a concise format (’=RUN=’) and display each URL fetched as a separate dot ’.’.
Example output:
............................................[URL] download: ERROR: HTTP document not found
These are identical to mode 1 , except in hammer mode while hammering a site. Increase the number to see less progress info during a hammer operation.
Show the start and end time of each transfer. (by default this information is not shown)
Maximum number of log lines in the Log widget. 0 means unlimited. This option is available only when compiled with the GTK+ GUI. (default value is 0)
$nr specifies the size in bytes, unless postfixed with one of the characters K, M or G, which imply the multipliers K(1024), M(1048576) or G(1073741824).
File where all produced messages are stored.
When the log file specified with the option -logfile is already in use by another process, try to generate a new unique name for the log file. (default: this option is turned off)
File to store short logs in. This file contains one line of information per processed document. It is meant to be used with scripts to produce statistics, to validate links on your website, or to generate simple site maps. Multiple pavuk processes can use this file concurrently without overwriting each other’s entries. Record structure:
process id of pavuk process
current time
in the format current/total number of URLs
contains the type of the error: FATAL, ERR, WARN or OK
is the number code of the error (see errcode.h in pavuk sources)
of the document
first parent document of this URL ([none] when it doesn’t have a parent)
is the name of the local file the document is saved under
size of requested document if known
time taken to download this document, in the format seconds.milli_seconds
contains the first line of the HTTP server response
Native language that pavuk should use for communication with its user (works only when a message catalog for that language exists). GNU gettext support (for message internationalization) must also be compiled in. The default language is taken from your NLS environment variables.
Font used in the GUI interface. To list available X fonts use the xlsfonts command. This option is available only when compiled with GTK+ GUI support.
Enable or disable fetching objects mentioned in inline and external CSS style sheets.
Enable or disable fetching objects mentioned in Microsoft Internet Explorer Conditional Comments (a.k.a. MSIE CC’s).
Enable or disable fetching objects mentioned in <![CDATA[...]]> sections.
Enable or disable fetching objects mentioned in HTML <!-- ... --> Comment sections.
Enable or disable verifying server CERTS in SSL mode.
Turn on Netli logging with output to specified file.
Make Netli timings relative to the start of the first object or the program.
When processing a URL, send the original request, but send it to the IP address of FQDN
When processing an HTTPS URL, send the original request, but send it to the IP address of FQDN
Output in sdemo compatible format. This is only used by sdemo . (For now it simply means output ’-1’ rather than ’*’ when measurements are invalid.)
Do / do not escape characters that are "unsafe" in URLs. Default behavior is to escape unsafe characters.
Start program with X Window interface (if compiled with support for GTK+). By default pavuk starts without GUI and behaves like a regular command-line tool.
When used together with the -X option, pavuk starts processing of URLs immediately after the GUI window is launched. Without the -X given, this option doesn’t have any effect. Only available when compiled with GTK+ support.
This option allows pavuk to detach from its terminal and run in background mode. Pavuk will then not output any messages to the terminal. If you want to see messages, you have to use the -logfile option to specify a file where messages will be written. By default pavuk executes in the foreground.
Normally, programs sent into the background after being run in the foreground continue to output messages to the terminal. If this option is activated, pavuk checks whether it is running as a background job and will not write any messages to the terminal in that case. After it becomes a foreground job again, it will resume writing messages to the terminal in the normal way. This option is available only when your system supports retrieving terminal info via the tc*() functions.
When you turn this option on, pavuk will preserve all settings when exiting, and when you run pavuk with the GUI interface again, all settings will be restored. The settings are stored in the ~/.pavuk_prefs file. By default pavuk doesn’t restore its options when started. This option is available only when compiled with GTK+.
Execute pavuk at the time specified as parameter. The format of the $time parameter is YYYY.MM.DD.hh.mm . You need properly configured scheduling with the at command on your system to use this option. If the default scheduling command configuration (at -f %f %t %d.%m.%Y ) doesn’t work on your system, try adjusting it with the -sched_cmd option.
$time must be specified as local (a.k.a. ’wall clock’) time.
Execute pavuk periodically, with a period of $nr hours. You need properly configured scheduling with the at command on your system to use this option.
Command to use for scheduling. Pavuk explicitly supports scheduling with at . $str should contain regular characters and macros, escaped by the % character. Supported macros are:
for script filename
for time (in format HH:MM)
all macros supported by the strftime (3) function
If you use this option, pavuk will read URLs from $file before it starts processing. In this file, each URL needs to be on a separate line. After the last URL, a single dot . followed by a LF (line-feed) character denotes the end. Pavuk will start processing right after all URLs have been read. If $file is given as the - character, standard input will be read.
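The expected file layout can be produced like this (the urls.txt name and the URLs are examples); note the terminating . line:

```shell
# One URL per line, closed by a single '.' line:
printf '%s\n' \
    'http://example.com/a.html' \
    'http://example.com/b.html' \
    '.' > urls.txt
# pavuk -urls_file urls.txt        # (illustrative invocation)
tail -n 1 urls.txt
```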
This option causes pavuk to store information about each document in a separate file in the .pavuk_info directory. This file is used to store the original URL from which the document was downloaded. For files downloaded via the HTTP or HTTPS protocols, the whole HTTP response header is stored there as well. It is recommended to use this option when you are using options that change the default layout of the local document tree, because the info file helps pavuk map the local filename back to the URL. This option is also very useful when different URLs map to the same filename in the local tree; when this occurs, pavuk detects it using the info files and prefixes the local name with numbers. By default, storing of this extra information is disabled.
With this option you can set the location of a separate directory for storing the info files created when the -store_info option is used. This is useful when you don’t want to mix info files with regular document files in the destination directory. The structure of the info files is preserved; they are just stored in a different directory.
With this option you can specify extended information for starting URLs, such as query data for POST or GET requests. The current syntax of this option is:
URL:["]$url["] [METHOD:["]{GET|POST}["]] [ENCODING:["]{u|m}["]] [FIELD:["]variable=value["]] [COOKIE:["][variable=value;[...]]variable=value[;]["]] [FILE:["]variable=filename["]] [LNAME:["]local_filename["]]
specifies request URL
specifies request method for URL and is one of GET or POST.
specifies encoding for request body data.
is for multipart/form-data encoding
is for application/x-www-form-urlencoded encoding
specifies a field of the request data, in the format variable=value. To encode special characters in variable and value you can use the same encoding as is used in the application/x-www-form-urlencoded encoding.
specifies one or more cookies that are related to the specified URL. These cookies will be used/transmitted by pavuk when this URL is accessed, thus enabling pavuk to access URLs which require the use of specific cookies for a proper response.
Note
The setting of the command-line option -disabled_cookie_domains also applies here.
See the Cookie chapter for more info.
specifies a special query field, used to specify a file for POST-based file upload.
specifies the local name for this request
When you need to use special characters inside the FIELD: and FILE: parts of a request specification, you should use the application/x-www-form-urlencoded encoding. This means all non-ASCII characters, the quote character ("), the space character ( ), the ampersand character (&), the percent character (%) and the equals character (=) should be encoded in the form %xx , where xx is the hexadecimal representation of the character’s ASCII value. For example, the % character should be encoded as %25 .
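For example, to send a field value of 50% off (with the percent and space characters encoded as %25 and %20 ), an invocation might look like this (the URL and field name are made up):

```shell
pavuk -request 'URL:"http://example.com/order" METHOD:POST FIELD:"discount=50%25%20off"'
```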
This option gives you the chance to specify contents for HTML forms found while traversing the document tree. The syntax of this option is the same as for the -request option, but ENCODING: and METHOD: are meaningless here. In URL: you have to specify the HTML form action URL, which will be matched against action URLs found in processed HTML documents. If pavuk finds an action URL which matches the one supplied in a -formdata option, it will construct a GET or POST request from the data supplied in this option and from the default form field values supplied in the HTML document. Values supplied on the command line take precedence over those supplied in the HTML file.
With this option you can specify how many concurrent threads will download documents. By default pavuk executes 3 concurrent downloading threads.
This option is available only when pavuk is compiled to support multithreading.
Pavuk’s default behavior when running multiple downloading threads is to buffer all output messages in a memory buffer and flush the buffered data only when a thread finishes processing a document. With this option you can change this behavior and see messages immediately as they are produced. This is mainly useful when you want to debug multithreading issues.
This option is available only when pavuk is compiled to support multithreading.
For scripting it is sometimes useful to download a document directly to a pipe or variable instead of storing it in a regular file. In such a case you can use this option to dump the data, for example to stdout ( $nr = 1 ).
Note
pavuk 0.9.36 and later releases also support the @$file argument, where you can specify a file to dump the data to. The file path must be prefixed by an ’@’ character. If you prefix the file path with a second ’@’, pavuk will assume you wish to append to an already existing file. Otherwise the file will be created/erased when pavuk starts.
When using the -dumpfd option with a multithreaded pavuk, each document must be dumped in one piece, because output from documents downloaded in multiple threads could otherwise overlap. This option is also useful when you want to dump documents after pavuk adjusts the links inside HTML documents.
This option has effect only when used with the -dumpfd option. It is used to dump HTTP requests.
This option has effect only when used with the -dumpfd option. It is used to dump HTTP response headers.
When you use this option, pavuk will output all URLs found in HTML documents to the file descriptor $nr . You can use this option, for example, to extract all URLs, convert them to absolute URLs and write them to stdout.
Note
pavuk 0.9.36 and later releases also support the @$file argument, where you can specify a file to dump the data to. The file path must be prefixed by an ’@’ character. If you prefix the file path with a second ’@’, pavuk will assume you wish to append to an already existing file. Otherwise the file will be created/erased when pavuk starts.
Name of the scenario to load and/or run. Scenarios are files with a structure similar to the .pavukrc file, containing saved configurations. You can use them for periodic mirroring. Parameters from a scenario specified on the command line can be overridden by command line parameters. To use this option, you need to specify the scenario base directory with the option -scndir .
Store the current configuration into a scenario file named $filename . This is useful to quickly create pre-configured scenarios for manual editing.
File name where the command will be ’dumped’. To be able to use this option, you need to specify the dump base directory with option -dumpdir .
Directory which contains the message catalog for pavuk. If you do not have permission to store a pavuk message catalog in the system directory, simply create a directory structure in your home directory similar to the one on your system.
For example:
Your native language is German, and your home directory is /home/jano .
You should first create the directory /home/jano/locales/de/LC_MESSAGES/ , then put the German pavuk.mo there and set -msgcat to /home/jano/locales/ . If you have properly set locale environment values, you will see pavuk speaking German. This option is available only when compiled with support for GNU gettext message internationalization.
Directory where all retrieved documents are stored. If not specified, the current directory is used. If the specified directory doesn’t exist, it will be created.
Directory in which your scenarios are stored. You must use this option when you are loading or storing scenario files.
Directory in which your command dumps are stored. You must use this option when you are storing command dump files using the -dumpcmd command.
Store downloaded documents with the same modification time as on the remote site. The modification time will be set only when such information is available (some FTP servers do not support the MDTM command, and some documents on HTTP servers are generated on the fly, so pavuk can’t retrieve their modification time). By default, modification times are not preserved.
Store downloaded documents with the same permissions as on the remote site. This option has effect only when downloading files through the FTP protocol and assumes that the -ftplist option is used. By default, permissions are not preserved.
Make symbolic links point to exactly the same location as on the remote server; don’t do any relocations. This option has effect only when downloading files through the FTP protocol and assumes that the -ftplist option is used. By default, symbolic links are not preserved and are retrieved as regular documents with the full contents of the linked file.
For example, assume that on the FTP server ftp.xx.org there is a symbolic link /pub/pavuk/pavuk-current.tgz , which points to /tmp/pub/pavuk-0.9pl11.tgz . Pavuk will create the symbolic link ftp/ftp.xx.org_21/pub/pavuk/pavuk-current.tgz
If the option -preserve_slinks is used, this symbolic link will point to /tmp/pub/pavuk-0.9pl11.tgz
If the option -nopreserve_slinks is used, this symbolic link will point to ../../tmp/pub/pavuk-0.9pl11.tgz
Retrieve the files behind symbolic links instead of replicating the symlinks in the local tree.
If this parameter is used, all HTTP requests go through this proxy server. This is useful if your site resides behind a firewall, or if you want to use an HTTP proxy cache server. The default port number is 8080. Pavuk allows you to specify multiple HTTP proxies (using multiple -http_proxy options) and will rotate between them round-robin, disabling proxies that return errors.
Use this option whenever you want to get documents directly from the site and not from your HTTP proxy cache server. By default, pavuk allows transfer of document copies from the cache.
If this parameter is used, all FTP requests go through this proxy server. This is useful when your site resides behind a firewall, or if you want to use an FTP proxy cache server. The default port number is 22. Pavuk supports three different types of proxies for FTP; see the options -ftp_httpgw and -ftp_dirtyproxy . If neither of these options is used, pavuk assumes a regular FTP proxy that connects to the remote FTP server with USER user@host .
The specified FTP proxy is an HTTP gateway for the FTP protocol. By default, the FTP proxy is assumed to be a regular FTP proxy.
The specified FTP proxy is an HTTP proxy which supports the CONNECT request (pavuk then uses the full FTP protocol, except for active data connections). By default, the FTP proxy is assumed to be a regular FTP proxy. If both -ftp_dirtyproxy and -ftp_httpgw are specified, -ftp_dirtyproxy is preferred.
Gopher gateway or proxy/cache server.
The specified Gopher proxy server is an HTTP gateway for the Gopher protocol. When -gopher_proxy is set and this -gopher_httpgw option isn’t used, pavuk uses the proxy as an HTTP tunnel, issuing a CONNECT request to open connections to Gopher servers.
SSL proxy (tunneling) server [such as CERN httpd + patch, or Squid] with the CONNECT request enabled (at least on port 443). This option is available only when compiled with SSL support (you need the SSLeay or OpenSSL libraries with development headers).
User name for HTTP proxy authentication.
Password for HTTP proxy authentication.
Authentication scheme for proxy access. Similar meaning as the -auth_scheme option (see help for this option for more details). Default is 2 (Basic scheme).
NT or LM domain used for authorization against an HTTP proxy server when the NTLM authentication scheme is required. This option is available only when compiled with the OpenSSL or libdes libraries.
When using the HTTP proxy Digest access authentication scheme, reuse the first received nonce value in multiple subsequent requests.
User name for FTP proxy authentication.
Password for FTP proxy authentication.
Use passive FTP when downloading via FTP.
Use active FTP when downloading via FTP.
This option lets you specify the ports used for active FTP. This permits easier firewall configuration, since the range of ports can be restricted.
Pavuk will randomly choose a number from within the specified range until an open port is found. Should no open ports be found within the given range, pavuk will default to a normal kernel-assigned port, and a message (debug level net ) is output.
The selected port range must be in the non-privileged range (i.e. greater than or equal to 1024); it is STRONGLY RECOMMENDED that the chosen range be large enough to handle many simultaneous active connections (for example, 49152-65534, the IANA-registered ephemeral port range).
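The port-selection behavior described above can be sketched as follows (an illustrative Python sketch of the documented algorithm, not pavuk source code, which is written in C):

```python
import random
import socket

def bind_active_port(low, high):
    """Pick a random port in [low, high] for an active-FTP data socket;
    fall back to a kernel-assigned port (bind to 0) if none is free."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for _ in range(high - low + 1):
        port = random.randint(low, high)
        try:
            sock.bind(("127.0.0.1", port))
            return sock, port
        except OSError:
            continue  # port already in use, try another
    sock.bind(("127.0.0.1", 0))  # kernel assigns an ephemeral port
    return sock, sock.getsockname()[1]
```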
Force pavuk to always use "MDTM" to determine the file modification time and never use cached times determined when listing the remote files.
Force unlink’ing of files before new content is stored to a file. This is helpful if the local files are hardlinked to some other directory and after mirroring the hardlinks are checked. All "broken" hardlinks indicate a file update.
Set the number of attempts to transfer each processed document. The default is 1, which means pavuk will retry once to get documents that failed on the first attempt.
Set the number of allowed regets on a single document, after a broken transfer. Default value for this option is 2.
This option is discarded when running pavuk in singlereget mode as pavuk will then keep on trying to reget the URL until successful or a fatal error occurs. If the server is found to not support reget’ing content and -force_reget has not been specified, this will be regarded as a fatal error.
Set the number of allowed HTTP redirects (use this to prevent redirect loops). The default value is 5, which conforms to the HTTP specification.
Set the number of bytes to discard from the already locally available content (counted from the end of the file) if regetting. Default value for this option is 0.
Force reget’ing of the whole document after a broken transfer when the server doesn’t support retrieving partial content. Pavuk’s default behavior is to stop getting documents which don’t allow restarting the transfer from a specified position.
When forced reget’ing is turned on, pavuk will still start fetching each URL by requesting a partial content download when (part of) the URL content is already available locally. However, when such an attempt fails, pavuk will discard the notion of requesting a partial content download (i.e. HTTP Range specification) entirely for this URL only and attempt to download the content as a whole instead.
Hence, in order for ’-force_reget’ to work as expected, each URL must be spidered at least twice, i.e. the -nregets command-line option should have a value of at least 1 (the default is 2 if this option is not specified explicitly).
Timeout for stalled connection attempts in milliseconds. Default timeout is 0, and that means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of the characters S, M, H or D (either in upper or lower case), which select the alternative time units S = seconds, M = minutes, H = hours or D = days.
Timeout for data read operations in milliseconds: the connection is closed with an error when no further data is received within this time limit. The default timeout is 0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of the characters S, M, H or D (either in upper or lower case), which select the alternative time units S = seconds, M = minutes, H = hours or D = days.
Timeout for data write operations in milliseconds: the connection is closed with an error when no further data could be transmitted within this time limit. The default timeout is 0, which means timeout checking is disabled.
$nr specifies the timeout in milliseconds, unless postfixed with one of the characters S, M, H or D (either in upper or lower case), which select the alternative time units S = seconds, M = minutes, H = hours or D = days.
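The suffix rule is the same for all three timeout options and can be sketched like this (illustrative Python, not pavuk source):

```python
def parse_timeout(spec):
    """Convert a timeout spec to milliseconds: plain numbers are ms,
    while the case-insensitive suffixes S/M/H/D select seconds,
    minutes, hours or days."""
    units = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
    spec = spec.strip()
    if spec and spec[-1].lower() in units:
        return int(spec[:-1]) * units[spec[-1].lower()]
    return int(spec)
```

So, for example, a timeout of ’30S’ and one of ’30000’ mean the same thing.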
This switch suppresses / enables the use of the robots.txt standard, which is used to restrict access of Web robots to some locations on the web server. (The default setting is to enable its use on all HTTP/HTTPS servers.)
Always enable this option when you are downloading huge sets of pages with an unpredictable or yet unknown layout. This prevents you from upsetting server administrators :-).
Note
pavuk only applies the instructions in robots.txt to collected URLs. Manual URLs, on the other hand, are not subject to any robots.txt limitations.
Collected URLs, i.e. all URLs which were collected from spidered entities.
Manual URLs, i.e. the ones you pass on to pavuk through its command line or GUI interfaces, e.g. when using command line options like -scenario $str , -urls_file $file , -request $req or pavuk URLs .
This switch suppresses / enables using the gzip , compress or deflate encoding in transfer.
Some servers are broken: they report files with the MIME type application/gzip or application/compress as gzip or compress encoded, when the encoding should have been reported as ’untouched’, defined by the keyword ’identity’ in the HTTP standards. See, for example, the HTTP/1.1 standard RFC 2616, section 14.3 (Accept-Encoding) and its counterpart, section 14.11 (Content-Encoding).
Turn this option off (-noEnc) when you don’t want to allow the server to compress content for transmission: in that case, the server will transmit all content as is, which, in the case of faulty servers mentioned above, means you will receive the compressed file types exactly as they are stored on the server and no undesirable decompression attempts will be made by pavuk.
By default, the option ’-Enc’ is enabled, as this allows for often significant data transfer savings, resulting in reduced transmission costs and consequently faster web responses.
Note
When your pavuk binary was compiled without libz support, pavuk will never request content compression, as it would not be able to decompress the results. In that case, ’-Enc’ is identical to ’-noEnc’.
For improved functionality, make sure your pavuk binary comes with libz support. Check your pavuk --version output for a mention of this feature (’Content-Encoding’).
The -nocheck_size option should be used if you are trying to download pages from an HTTP server which sends a wrong Content-Length: field in the MIME header of the response. The default pavuk behavior is to check this field and complain when something is wrong.
If you don’t want to give all your transfer bandwidth to pavuk, use this option to set pavuk’s maximum transfer rate. This option accepts a floating point number specifying the transfer rate in kB/s. For optimal settings, you may also have to tune the size of the read buffer (option -bufsize ), because pavuk does flow control only at the application level. By default, pavuk uses the full bandwidth.
If you hate slow transfer rates, this option allows you to abort transfers that are too slow. You can set the minimum transfer rate, and if the connection gets slower than the given rate, the transfer will be stopped. The minimum transfer rate is given in kB/s. By default, pavuk doesn’t check this limit.
This option is used to specify the size of the read buffer (default size: 32kB). If you have a very fast connection, you may increase the size of the buffer to get a better read performance. If you need to decrease the transfer rate, you may need to decrease the size of the buffer and set the maximum transfer rate with the -maxrate option. This option accepts the size of the buffer in kB.
$nr specifies the size in kiloBytes, unless postfixed with one of the characters K or M, which imply the corresponding (power-of-2) multipliers. That means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is a whopping 1 GigaByte.
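Because the base unit is already kilobytes, the K and M suffixes shift everything up by one step. A sketch of the rule (illustrative Python, not pavuk source; assumes the suffix is accepted in either case, as with the timeout suffixes):

```python
def parse_kb_size(spec):
    """Convert a kilobyte-based size spec to kilobytes: a K suffix
    multiplies by 1024 (so '1K' means 1 megabyte) and an M suffix by
    1024 * 1024 (so '1M' means 1 gigabyte)."""
    multipliers = {"k": 1024, "m": 1024 * 1024}
    spec = spec.strip()
    if spec and spec[-1].lower() in multipliers:
        return int(spec[:-1]) * multipliers[spec[-1].lower()]
    return int(spec)
```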
If you are running pavuk on a multiuser system, you may need to avoid filling up your file system. This option lets you specify how much space must remain free. If pavuk detects that free space has dropped below this limit, it will stop downloading files. Specify this quota in kB. The default value is 0, which means the quota is not checked.
$nr specifies the size in kiloBytes, unless postfixed with one of the characters K or M, which imply the corresponding (power-of-2) multipliers. That means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is a whopping 1 GigaByte.
This option is useful when you want to limit the downloading of big files, but still want to download at least $nr kilobytes of each big file. A big file will start transferring, and when it reaches the specified size, the transfer will be cut off. Such a document will be processed as properly downloaded, so be careful when using this option. By default, pavuk transfers documents at full size.
$nr specifies the size in kiloBytes, unless postfixed with one of the characters K or M, which imply the corresponding (power-of-2) multipliers. That means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is a whopping 1 GigaByte.
If you expect your selection to address a large amount of data, you can use this option to limit the total amount of transferred data. The default is unlimited transfer.
$nr specifies the size in kiloBytes, unless postfixed with one of the characters K or M, which imply the corresponding (power-of-2) multipliers. That means that the $nr value ’1K’ is 1 MegaByte, ’1M’ is a whopping 1 GigaByte.
Set the maximum running time of the program. After this time is exceeded, pavuk will stop downloading. The time is specified in minutes. The default value is 0, which means the downloading time is not limited.
This option allows you to specify a downloading order for URLs in document tree. This option accepts the following strings as parameters:
orders URLs as they are loaded from HTML files (default)
as above, but URLs of inline objects come first
inserts URLs from the current HTML document at the start, before others
as above, but URLs of inline objects come first
Send an If-Range: header in HTTP requests. I found out that some HTTP servers (greetings, MS :-)) send different ETag: fields in different responses for the same, unchanged document. This causes problems when pavuk attempts to reget a document from such a server: pavuk remembers the old ETag value and uses it in subsequent requests for this document. If the server compares it with the new ETag value and they differ, it will refuse to send only part of the document and will start the download from scratch.
Set required SSL protocol version for SSL communication. $v is one of:
ssl2
ssl23
ssl3
tls1
This option is available only when compiled with SSL support. Default is ssl23.
This option can be used if you want to use a unique SSL ID for all SSL sessions. The default pavuk behavior is to negotiate a new session ID for each connection. This option is available only when compiled with SSL support.
This option is used to switch between the HTTP/1.0 and HTTP/1.1 protocols used with HTTP servers. Using HTTP/1.1 is recommended, because it is faster than HTTP/1.0 and uses less network bandwidth for initiating connections. Pavuk uses HTTP/1.1 by default.
You can use this option when you want to use a specific network interface for communication with other hosts. This option is suitable for multihomed hosts with several network interfaces. The address can be entered as a regular IP address or as a host name.
This option allows you to specify the content of the User-Agent: field of HTTP requests. This is useful when scripts on a remote server return different documents for the same URL depending on the browser, or when an HTTP server refuses to serve documents to web robots like pavuk. By default, pavuk sends the string pavuk/$VERSION in the User-Agent: field.
This option forces pavuk to send the HTTP Referer: header field with starting URLs. The content of this field will be the URL itself. Using this option is required when the remote server checks the Referer: field. By default, pavuk won’t send the Referer: field with starting URLs.
This option enables or disables transmission of the HTTP Referer: header field. By default, pavuk sends the Referer: field.
This option enables or disables the use of persistent HTTP connections. The default is to use persistent HTTP connections. Some servers have problems with this type of connection, and this option also allows getting data from such servers.
In some cases you may want to add user-defined fields to HTTP/HTTPS requests; this option serves exactly that purpose. In $str you directly specify the content of the additional header. If you specify only the raw header, it will be used only for starting requests. When you want the header to be sent with each request while crawling, prefix it with a + character.
To add multiple additional headers, you can repeatedly specify this command-line option, once for each additional header.
Specify a collection of filename / web page extensions which are to be treated as HTML pages, which is useful when scanning / hammering web sites which present unusual mime types with their pages (see also: -hammer_mode ). $list must contain a comma separated list of web page endings. The default set is .html, .htm, .asp, .aspx, .php, .php3, .php4, .pl, .shtml
Note
When pavuk includes the chunky/hammer feature (see -hammer_mode ), any web page which matches the endings specified in $list will be registered in the hammering recording buffer and marked as a page starter (’[STARTER]’): hammer time measurements are collected and reported on a ’total page’ basis (see -tlogfile ). This means that pavuk assumes a user or web browser which loads a page will also load any style sheets, scripts and images needed to properly display that page. All those items are part of a ’total page’, but each page has only a single ’starting point’: the page itself.
To approximate ’total page’ timings instead of ’per item’ timings, pavuk will mark the URLs which act as web page ’starting points’ as [STARTER]. Here pavuk assumes that each web page is simple (i.e. does not use iframes, etc.), hence it is assumed that recognizing the web page URL ending is sufficient.
Please note also that the ’endings’ in $list do not have to be ’filename extensions’ per se: the ’endings’ are simply matched against the URL (with any ’?xxx=yyy’ query elements removed) using a simple, case-insensitive comparison. Hence you may also specify:
-page_sfx "index.html,index.htm"
when you only want any URLs which end with ’index.html’ or ’index.htm’ to be treated as ’page starters’ for timing purposes.
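The matching rule can be summarized in a few lines (illustrative Python, not pavuk source):

```python
def is_page_starter(url, endings):
    """True when the URL, with any '?query' part removed, ends with one
    of the comma-separated endings (case-insensitive comparison)."""
    base = url.split("?", 1)[0].lower()
    return any(base.endswith(e.strip().lower()) for e in endings.split(","))
```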
This option allows you to delete FILES from the REMOTE server when the download has properly finished. By default, this option is off.
When the -FTPlist option is used, pavuk retrieves the content of FTP directories with the FTP command LIST instead of NLST , so the same listing is retrieved as with the UNIX command "ls -l ".
This option is required if you need to preserve the permissions of remote files or preserve symbolic links. Pavuk supports wide listings on FTP servers with regular BSD- or SYSV-style "ls -l" directory listings, as well as servers using the EPLF listing format, VMS-style listings, DOS/Windows-style listings and the Novell listing format. The default pavuk behavior is to use NLST for FTP directory listings.
Some FTP servers require extra options to the LIST or NLST FTP commands to show all files and directories properly. Be sure not to use any extra options that reformat the listing output. The -a option is especially useful: it forces the FTP server to also show dot files and directories, and with broken WuFTPd servers it also helps produce full directory listings, not just files.
This option is the result of several attempts to get the -remove_old option working properly with WuFTPd servers when the -ftplist option is used. The problem is that the FTP command LIST on WuFTPd doesn’t mind being asked to list a non-existent directory and indicates success in the FTP response code. When you activate this option, pavuk uses an extra FTP command ( STAT -d dir ) to check whether the directory really exists. Don’t use this option unless you are sure that you really need it!
Ignore the IIS 5/6 RFC 2616 chunked transfer mode server bug, which would otherwise make pavuk fail and report downloads as ’possibly truncated’. When pavuk reports this, you should specify this option and retry the operation.
File where you have stored authentication information for access to some service. For file structure see below in FILES section.
If you are using this parameter, pavuk will transmit your authentication details with each HTTP access for grabbing a document. For security reasons, use this option only if you know that only one HTTP server could be accessed or use the -asite option to specify the sites for which you want to use authentication. Otherwise your auth parameters will be sent to each accessed HTTP server.
The value of this parameter is used as the password for authentication.
This parameter specifies the authentication scheme used.
means the user authentication scheme is used, as defined in HTTP/1.0 or HTTP/1.1. The password and user name are sent in plaintext (unencrypted).
means the Basic authentication scheme is used, as defined in HTTP/1.0. The password and user name are sent BASE64-encoded.
This is the default setting.
means the Digest access authentication scheme based on MD5 checksums, as defined in RFC 2069.
means the proprietary NTLM access authentication scheme used by Microsoft IIS or proxy servers. When you use this scheme, you must also specify the NT or LM domain with the option -auth_ntlm_domain .
This scheme is supported only when compiled with OpenSSL or libdes libraries.
NT or LM domain used for authorization against an HTTP server when the NTLM authentication scheme is required.
This option is available only when compiled with OpenSSL or libdes libraries.
While using the HTTP Digest access authentication scheme, reuse the first received nonce value in subsequent requests. By default, pavuk negotiates a nonce for each request.
File with public key for SSL certificate (learn more from SSLeay or OpenSSL documentation).
This option is available only when compiled with SSL support (you need the SSLeay or OpenSSL libraries and development headers).
Certificate file in PEM format (learn more from SSLeay or OpenSSL documentation).
This option is available only when compiled with SSL support (you need the SSLeay or OpenSSL libraries and development headers).
Password used to generate certificate (learn more from SSLeay or OpenSSL documentation).
This option is available only when compiled with SSL support (you need SSLeay or OpenSSL libraries and development headers).
Config directory for NSS (Netscape SSL implementation) certificates. Usually ~/.netscape (created by Netscape communicator/navigator) or profile directory below ~/.mozilla (created by Mozilla browser). The directory should contain cert7.db and key3.db files.
If you use neither Mozilla nor Netscape, you must create these files with the utilities distributed with the NSS libraries. Pavuk opens the certificate database read-only.
This option is available only when pavuk is compiled with SSL support provided by Netscape NSS SSL implementation.
By default, pavuk rejects connections to SSL servers whose certificate is not stored in the local certificate database (set by the -nss_cert_dir option). You must explicitly force pavuk to allow connections to servers with unknown certificates.
This option is available only when pavuk is compiled with SSL support provided by Netscape NSS SSL implementation.
Selects sets of ciphers allowed/disabled by USA export rules.
This option is available only when pavuk is compiled with SSL support provided by Netscape NSS SSL implementation.
This parameter is used as the password when accessing anonymous FTP servers, and is optionally inserted into the From: field of HTTP requests. If not specified, pavuk derives it from the USER environment variable and the site hostname.
This option enables or disables sending of the user identification entered with the -from option as the FTP anonymous user password and in the From: field of HTTP requests. By default, this option is off.
When you need to use a nonstandard login procedure for some FTP servers, you can use this option to change the default pavuk login procedure. To allow more flexibility, you can assign a login procedure to a particular server or to all servers. When $host is specified as an empty string ("" ), the attached login procedure is assigned to all FTP servers except those having their own login procedures assigned. In the $handshake parameter you specify the exact login procedure as FTP commands followed by expected FTP response codes, delimited with backslash (\ ) characters.
For example, this is the default login procedure when logging in to a regular FTP server without going through a proxy server:
USER %u\331\PASS %p\230
There are two commands, each followed by a response code. After the USER command pavuk expects FTP response code 331, and after the PASS command pavuk expects response code 230 from the server. In FTP commands you can use the following macros, which will be replaced by their respective values:
user name used to access FTP server
password used to access FTP server
user name used to access FTP proxy server
password used to access FTP proxy server
hostname of FTP server
port number on which the FTP server listens
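How a handshake string is interpreted can be sketched as follows (illustrative Python, not pavuk source; the expand_handshake helper is hypothetical):

```python
def expand_handshake(handshake, values):
    """Split a handshake string into (command, expected response code)
    pairs, replacing %-macros with their values."""
    parts = handshake.split("\\")
    pairs = zip(parts[0::2], parts[1::2])
    result = []
    for cmd, code in pairs:
        for macro, value in values.items():
            cmd = cmd.replace(macro, value)
        result.append((cmd, int(code)))
    return result
```

For the default handshake above, this yields the pairs ("USER <user>", 331) and ("PASS <password>", 230).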
Specify comma separated list of allowed sites on which referenced documents are stored. When this option is specified, pavuk will only follow links which point to servers in this list.
The -dsite parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
Specify a comma separated list of disallowed sites.
The -asite parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
Specify a comma separated list of allowed domains on which referenced documents are stored. When this option is specified, pavuk will only follow links which point to domains in this list.
The -ddomain parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
Specify a comma separated list of disallowed domains.
The -adomain parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
In $list , specify a comma separated list of ports from which you allow documents to be downloaded.
The -dport parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
This option is used to specify denied ports. When this option is specified, pavuk will not follow links which point to servers listening on ports in this list.
The -aport parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
List of comma separated allowed MIME types. You can also use wildcard patterns with this option.
The -dmimet parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
List of comma separated disallowed MIME types. You can also use wildcard patterns with this option.
The -amimet parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
Maximum allowed size of a document. This option is applied only when pavuk is able to detect the document size before starting the transfer. The default value is 0, which means this limit isn’t applied.
$nr specifies the size in bytes, unless postfixed with one of the characters K, M or G, which imply the multipliers K (1024), M (1048576) or G (1073741824).
Minimal allowed size of a document. This option is applied only when pavuk is able to detect the document size before starting the transfer. The default value is 0, which means this limit isn’t applied.
$nr specifies the size in bytes, unless postfixed with one of the characters K, M or G, which imply the multipliers K (1024), M (1048576) or G (1073741824).
Allow only transfer of documents with a modification time newer than specified in the parameter $time . The format of $time is: YYYY.MM.DD.hh:mm . To apply this option, pavuk must be able to detect the modification time of the document.
$time must be specified as local (a.k.a. ’wall clock’) time.
Allow only transfer of documents with a modification time older than specified in the parameter $time . The format of $time is: YYYY.MM.DD.hh:mm . To apply this option, pavuk must be able to detect the modification time of the document.
$time must be specified as local (a.k.a. ’wall clock’) time.
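In Python terms, the $time format corresponds to the strptime pattern below (an illustrative sketch, not pavuk source):

```python
from datetime import datetime

def parse_pavuk_time(spec):
    """Parse the YYYY.MM.DD.hh:mm format as naive local (wall clock) time."""
    return datetime.strptime(spec, "%Y.%m.%d.%H:%M")
```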
This switch prevents transferring dynamically generated parametric documents through a CGI interface. This is detected by the occurrence of a ? character inside the URL. The default pavuk behavior is to allow transfer of URLs with query strings.
This option allows you to specify an ordered, comma separated list of preferred natural languages. It works only with the HTTP and HTTPS protocols, using the Accept-Language: MIME field.
This option allows you to specify a comma separated list of preferred encodings (character sets) for transferred documents. This works only with HTTP and HTTPS URLs, and only if such document encodings are available on the destination server.
An example:
-acharset iso-8859-2,windows-1250,utf8
This parameter allows you to specify a set of comma separated suffixes used to restrict the selection of documents which will be processed.
The -dsfx parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
A set of comma separated suffixes that are used to specify which documents will not be processed.
The -asfx parameter is the opposite of this one. If both are used the last occurrence of them is used and all previous occurrences are discarded.
These two options allow you to specify set of allowed or disallowed prefixes of documents. They are mutually exclusive: when these options occur multiple times in your configuration file and/or command line, the last occurrence will be used and all previous ones discarded.
This option allows you to specify a wildcard pattern for documents. All documents are tested against this pattern.
This option is the same as the previous one, but uses regular expressions. Available only on platforms which have a supported RE implementation.
This option allows you to specify a wildcard pattern for documents that should be skipped. All documents are tested against this pattern.
This option is the same as the previous one, but uses regular expressions. Available only on platforms which have a supported RE implementation.
This option allows you to specify a wildcard pattern for URLs. All URLs are tested against this pattern.
Example:
-url_pattern http://\*.idata.sk:\*/~ondrej/\*
this option allows all HTTP URLs in the .idata.sk domain, on any port, which are located under /~ondrej/ .
This option is the same as the previous one, but uses regular expressions. Available only on platforms which have a supported RE implementation.
This option allows you to specify a wildcard pattern for URLs that should be skipped. All URLs are tested against this pattern.
Example:
-skip_url_pattern ’*home*’
this option will force pavuk to skip all HTTP URLs which have ’home’ anywhere in their URL. This of course includes the query string part of the URL,
hence
-skip_url_pattern ’*&action=edit*’ will direct pavuk to skip any HTTP URL whose query section has ’action=edit ’ as any but the first query element (as the first query element would match ’*?action=edit* ’ instead).
This option is the same as the previous one, but uses regular expressions. Available only on platforms which have a supported RE implementation.
This option allows you to limit the set of transferred documents by server IP address. IP addresses can be specified as regular expressions, so a single expression can cover a whole set of addresses. Available only on platforms with a supported RE implementation.
This option is similar to the previous one, but specifies the set of disallowed IP addresses. Available only on platforms with a supported RE implementation.
A more powerful version of the -url_pattern option, for more precise matching of allowed URLs based on an HTML tag name pattern, an HTML tag attribute name pattern and a URL pattern. Wildcard patterns can be used in all three parameters, so something like -tag_pattern ’*’ ’*’ url_pattern is equal to -url_pattern url_pattern . The $tag and $attrib parameters are always matched against uppercase strings. For example, if you want pavuk to follow only regular links, ignoring any style sheets, images, etc., use the option -tag_pattern A HREF ’*’ .
This is a variation on -tag_pattern . It uses regular expression patterns in its parameters instead of the wildcard patterns used by the -tag_pattern option.
This switch suppresses all transfers through the HTTP protocol. By default, transfer through HTTP is enabled.
This switch suppresses all transfers through the HTTPS protocol (HTTP over SSL). By default, transfer through HTTPS is enabled.
This option is available only when compiled with SSL support (you need SSleay or OpenSSL libraries and development headers).
Suppress all transfers through the Gopher protocol. By default, transfer through Gopher is enabled.
This switch prevents processing of documents located on FTP servers. By default, transfer through FTP is enabled.
This switch prevents processing of documents located on FTP servers accessed through SSL. By default, transfer through FTPS is enabled.
This option is available only when compiled with SSL support (you need SSleay or OpenSSL libraries and development headers).
The -FTPhtml option forces pavuk to process HTML files downloaded via the FTP protocol. By default pavuk won’t parse HTML files from FTP servers.
Force recursive processing of FTP directories too. The default setting is to deny recursive downloading from FTP servers, i.e. FTP directory trees will not be traversed.
Enable or disable processing of particular HTML tags or attributes. By default all supported HTML tags are enabled.
For example, if you don’t want to process any images, use the option -disable_html_tag ’IMG,SRC;INPUT,SRC;BODY,BACKGROUND’ and note that tags and attributes are case insensitive.
Tags ($TAG ) do not have to include an attribute ($ATTRIB ), in which case all attributes are assumed applicable. An attribute may come without a tag, in which case all tags are assumed applicable.
When you want to disable all tags (and their attributes), you can specify the shorthand ’all’, i.e. -disable_html_tag ’all’ . Both tags and attributes may contain wildcards (’*’, ’?’, etc.), similar to filenames, which will result in all matching tags or attributes being treated.
An (artificial) example may clarify the above: -disable_html_tag ’,on*;img;style,;s*,link,src;a,href’ disables all attributes starting with ’on’ (onload, onclick, ...) for all tags, as the tag before the ’,’ is empty, while also disabling the ’img’ tag itself and all attributes for ’img’ as well (as no attributes were specified for this tag). The same is done to ’style’ (tag and attribs disabled), it’s just that this one comes with a (superfluous) ’,’ comma. Next, all tags which start with ’s’ (style, ...) each get both their ’link’ and ’src’ attributes disabled. Last, the ’href’ attribute of the ’a’ tag will be disabled as well (which in the real world would prevent pavuk from detecting and following any web page hyperlinks).
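The spec string is a ’;’-separated list of TAG,ATTRIB items; how such a string decomposes can be sketched in Python (an illustration only, not pavuk’s actual parser):

```python
def parse_tag_spec(spec):
    """Split a -disable_html_tag argument into (tag, [attribs]) pairs.
    An empty tag means 'all tags'; an empty attrib list means
    'the tag itself and all its attributes'."""
    pairs = []
    for item in spec.split(";"):
        tag, _, rest = item.partition(",")
        attribs = [a.lower() for a in rest.split(",") if a]
        pairs.append((tag.lower(), attribs))
    return pairs

print(parse_tag_spec("IMG,SRC;INPUT,SRC;BODY,BACKGROUND"))
# -> [('img', ['src']), ('input', ['src']), ('body', ['background'])]
```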
Sub-directory of the local tree directory, used to limit the tree scan in some of the modes {sync, resumeregets, linkupdate}.
(Don’t) leave the starting site. By default pavuk can span hosts when recursing through a WWW tree.
(Don’t) leave the starting directory. If the -dont_leave_dir option is used, pavuk will stay within the starting directory (including its sub-directories). By default pavuk may leave the starting directory.
If you are downloading a WWW tree which spans multiple hosts with huge trees, you may want to restrict downloading to documents in the directory hierarchy below the directory first visited on each site. To achieve this, use the option -dont_leave_site_enter_dir . By default pavuk will also go to higher directory levels on each site.
Set the maximum allowed level of tree traversal. The default is 0, which means pavuk can traverse ad infinitum. As of version 0.8pl1, inline objects of HTML pages are placed at the same level as the parent HTML page.
Maximum level of documents outside the site of the starting URL. The default is 0, which means this check is not applied.
Note that pavuk will grab ’inline objects’ (such as images) from one level further away, i.e. pavuk will still grab inline objects residing at level $nr but not from $nr+1 , while regular pages will only be grabbed up to and including level $nr-1 .
Maximum level of sites outside the site of the starting URL. The default is 0, which means this check is not applied.
Set the maximum allowed number of processed documents. The default value is 0, which means no restriction on the number of processed documents.
The -singlepage option transfers just HTML pages with all their inline objects (pictures, sounds, frame documents, ...). By default single-page transfer is disabled.
Note
This option renders the -mode singlepage option obsolete.
With this option you can control whether limiting options also apply to inline objects (pictures, sounds, ...). This is useful when you want to download a specified set of HTML pages with all their inline objects without any restrictions.
Script or program name for user-defined conditions. You can write any script which decides by its exit value whether a URL should be downloaded or not. The script receives the following parameters from pavuk:
processed URL
any number of parent URLs
level of this URL from starting URL
size of requested URL
modification time of the requested URL in the format YYYYMMDDhhmmss
An exit status of 0 means that the current URL should be rejected; a nonzero exit status means that the URL should be accepted.
Warning
Use user conditions only when absolutely necessary because forking scripts for each checked URL will result in a significant slowdown.
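A minimal condition script might look like the following sketch (Python used for illustration; the name cond.py and the rejection rule are hypothetical; the argument order follows the list above). Note the inverted convention: exit status 0 rejects the URL, nonzero accepts it.

```python
# cond.py (hypothetical example): reject any URL containing '/ads/'
def decide(args):
    # args: URL, parent URLs..., level, size, mtime (as passed by pavuk)
    url = args[0] if args else ""
    return 0 if "/ads/" in url else 1   # 0 = reject, nonzero = accept

# as a real script you would finish with:
#   import sys; sys.exit(decide(sys.argv[1:]))
print(decide(["http://example.com/ads/banner.gif"]))   # rejected -> 0
```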
This option allows you to specify a script or program which decides by its exit status whether to follow URLs from the current HTML document. This script is called after the download of each HTML document and receives the following parameters:
URL of current HTML document
local file where the HTML document is stored
An exit status of 0 means that URLs from the current document are disallowed; any other exit status means that pavuk may follow links from the current HTML document.
Warning
Use -follow_cmd only when absolutely necessary because forking scripts for each checked URL will result in a significant slowdown.
Support for scripting languages like JavaScript or VBScript in pavuk is done in a somewhat hacky way. There is no interpreter for these languages, so not everything will work. All the support pavuk has for these scripting languages is based on regular expression patterns specified by the user. Pavuk searches for these patterns in DOM event attributes of HTML tags, in javascript:... URLs, in inline scripts enclosed between <script></script> tags and in separate JavaScript files. Support for scripting languages is only available when pavuk is compiled with a proper regular expression library (POSIX/GNU/PCRE/TRE).
These options are used to enable or disable processing of the JavaScript parts of HTML documents. You must enable this option to be able to use javascript pattern processing.
With this option you specify which patterns match the interesting parts of JavaScript for extracting URLs. The parameter must be an RE pattern with exactly one subpattern, which matches the URL part precisely. For example, to match the URL in the following type of javascript expression:
document.b1.src=’pics/button1_pre.jpg’
you can use this pattern
^document.[a-zA-Z0-9_]*.src[ \t]*=[ \t]*’(.*)’$
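To illustrate, the example pattern captures the URL via its single subpattern; here expressed in Python’s re syntax (an analogy, pavuk uses its compiled-in RE library):

```python
import re

# the pattern from above; the unescaped dots also match a literal '.'
pat = re.compile(r"^document.[a-zA-Z0-9_]*.src[ \t]*=[ \t]*'(.*)'$")

m = pat.match("document.b1.src='pics/button1_pre.jpg'")
print(m.group(1))   # -> pics/button1_pre.jpg
```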
This option is similar to the previous one, but you can use custom transform rules for the URL parts of patterns and also specify the exact HTML tag and attribute where to look for this pattern. The $p is the pattern matching the relevant part of the script. The $t is a transform rule for the URL; in this parameter the $x parts will be replaced by the x -th subpattern of the $p pattern. The $h parameter is either the exact HTML tag, or "*" when the rule applies to javascript: URLs or DOM event attributes, or "" (empty string) when it applies to the javascript body of an HTML document or a separate JavaScript file. The $a parameter is either the exact HTML attribute of the tag or "" (empty string) when the rule applies to the javascript body.
This option is very similar to the previous one. The meaning of all parameters is the same, except that the pattern $p may have only one substring, which will be used in the transform rule $t . This is required to allow rewriting of the URL parts of tags and scripts. This option can also be used to force pavuk to recognize HTML tag/attribute pairs which pavuk does not support.
Use this option instead of -js_transform when you want to make sure pavuk ’rewrites’ the transformed URL in the content grabbed from a site and stored on your local disc.
In other words: -js_transform is good enough when you only want to direct pavuk to grab a specific URL which is not literally present in the already-downloaded content, while -js_transform2 does just that little bit more: it also makes sure this newly created URL ends up in the content saved to disc, by replacing the text matched by the first sub-expression.
Note
Make sure that the first sub-expression always matches some content, because otherwise pavuk will display a warning and not rewrite the content, as it could not detect where you wanted the replacement URL to go.
Note
Additional caveat: when your pavuk binary was built using a RE library which does not support sub-expressions, pavuk will report an error and abort when any of the -js_pattern , -js_transform or -js_transform2 command-line options were specified.
The file where cookie info is stored. This file must be in the Netscape cookie file format (as generated by Netscape Navigator or Communicator ...).
Use collected cookies in HTTP/HTTPS requests. By default pavuk will not send cookies.
Store cookies received in HTTP/HTTPS responses into the in-memory cookie cache. By default pavuk will not remember received cookies.
Update the cookie file on disk and synchronize it with changes made by any concurrent processes. By default pavuk will not update the cookie file on disk.
Maximum number of cookies in the in-memory cookie cache. The default value is 0, which means no restriction on the number of cookies.
Comma-separated list of cookie domains which are permitted to store cookies into the cookie cache.
When receiving a cookie, check whether the cookie domain is equal to the domain of the server which sends it. By default pavuk checks whether a server is setting cookies for its own domain; if it tries to set a cookie for a foreign domain, pavuk will complain and reject the cookie.
This switch prevents the program from rewriting relative URLs to absolute URLs after an HTML document has been transferred. The default pavuk behavior is to maintain link consistency of HTML documents: whenever an HTML document is downloaded, pavuk rewrites each URL to point to the local document if it is available, and to the remote document otherwise. After a document has been properly downloaded, pavuk updates all links in any HTML documents which point at it.
This option forces pavuk to change all URLs inside an HTML document to local URLs immediately after the document is downloaded. This option is disabled by default.
This option forces pavuk to change all URLs which fulfil the conditions for download to local URLs inside the HTML document immediately after the document is downloaded. I recommend using this option when you are sure that the transfer will complete without any problems; it can save a lot of processor time. This option is disabled by default.
This option forces pavuk to change all URLs inside an HTML document to remote URLs immediately after the document is downloaded. This option is disabled by default.
This option is especially designed to allow -fnrules rules based on the MIME type of a document. It forces pavuk to generate local names for documents only once pavuk knows the document’s MIME type. This has a big impact on the engine that rewrites links inside HTML documents, and causes other options controlling the link rewriting engine to stop working. Use this option only when you know what you are doing :-)
Note
Since release 0.9.36 this option is no longer mandatory for the -fnrules instructions %M / %B / %A / %E / %X / %Y to work correctly with MIME types. See also the -fnrules documentation.
This option serves to deny rewriting and processing of particular URLs in HTML documents by pavuk’s HTML rewriting engine. It accepts wildcard patterns to specify such URLs. Matching is done against the untouched URL: when the URL is relative, you must use a pattern which matches the relative URL; when it is absolute, you must use an absolute pattern.
This option is a variation on the previous one. It uses regular expression patterns for matching URLs instead of the wildcard patterns used by the -dont_touch_url_pattern option. It is available only when pavuk is compiled with support for regular expression patterns.
This option is a variation on the previous one, except that matching is done on the full HTML tag, including <>. It accepts regular expression patterns and is available only when pavuk is compiled with support for regular expression patterns.
All characters found in $str will be deleted from the local name of the document. $str may contain escape sequences similar to those of the UNIX tr (1) command:
\n  newline (ASCII LF: 10(dec))
\r  carriage return (ASCII CR: 13(dec))
\t  horizontal tab (ASCII TAB: 9(dec))
\xXX  hexadecimal ASCII value (1-byte range, but you can never specify ASCII NUL (0(dec)), i.e. XX can be in the range ’01’ to ’FF’)
[:upper:]  all uppercase letters (ASCII ’A’..’Z’)
[:lower:]  all lowercase letters (ASCII ’a’..’z’)
[:alpha:]  all letters (ASCII ’A’..’Z’ + ’a’..’z’)
[:alnum:]  all letters and digits (ASCII ’A’..’Z’ + ’a’..’z’ + ’0’..’9’)
[:digit:]  all digits (ASCII ’0’..’9’)
[:xdigit:]  all hexadecimal digits (ASCII ’0’..’9’ + ’A’..’F’ + ’a’..’f’)
[:space:]  all horizontal and vertical white-space (ASCII SPACE(’ ’, 32(dec)), TAB(9(dec)), LF(10(dec)), CR(13(dec)), FF(12(dec)), VT(11(dec)))
[:blank:]  all horizontal white-space (ASCII SPACE(’ ’, 32(dec)), TAB(9(dec)))
[:cntrl:]  all control characters (ASCII 1(dec)..31(dec) + 127(dec))
[:print:]  all printable characters including space (ASCII 32(dec)..126(dec))
[:nprint:]  all non-printable characters (ASCII 1(dec)..31(dec) + 127(dec)..255(dec))
[:punct:]  all punctuation characters (ASCII 33(dec)..47(dec) + 58(dec)..64(dec) + 91(dec)..96(dec) + 123(dec)..126(dec)), in other words these characters:
! (Exclamation mark),
" (Quotation mark; &quot; in HTML),
# (Cross hatch a.k.a. number sign),
$ (Dollar sign),
% (Percent sign),
& (Ampersand),
’ (Closing single quote a.k.a. apostrophe),
( (Opening parentheses),
) (Closing parentheses),
* (Asterisk a.k.a. star, multiply),
+ (Plus),
, (Comma),
- (Hyphen, dash, minus),
. (Period),
/ (Slant a.k.a. forward slash, divide),
: (Colon),
; (Semicolon),
< (Less than sign; &lt; in HTML),
= (Equals sign),
> (Greater than sign; &gt; in HTML),
? (Question mark),
@ (At-sign),
[ (Opening square bracket),
\ (Reverse slant a.k.a. Backslash),
] (Closing square bracket),
^ (Caret a.k.a. Circumflex),
_ (Underscore),
‘ (Opening single quote),
{ (Opening curly brace),
| (Vertical line),
} (Closing curly brace),
~ (Tilde a.k.a. approximate)
[:graph:]  all printable characters excluding space (ASCII 33(dec)..126(dec))
-X  a range: expands to a character series starting with the last expanded character (or ASCII 1(dec) when the ’-’ minus character is positioned at the start of this string/specification) and ending with the character specified by X , where X may also be a ’\’-escaped character, e.g. ’\n’ or ’\x7E’. Hence you can specify ranges like ’\x20-\x39’ and get what you’d expect.
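The range expansion can be sketched in Python (a simplified illustration: [:class:] entities and ’\’-escapes are assumed to have been resolved already, and the leading-’-’ special case is omitted):

```python
def expand_range(spec):
    """Expand 'a-z' style ranges, as in -tr_del_chr set strings."""
    out, i = [], 0
    while i < len(spec):
        if spec[i] == "-" and out and i + 1 < len(spec):
            # continue from the character after the last expanded one
            start, end = ord(out[-1]) + 1, ord(spec[i + 1])
            out.extend(chr(c) for c in range(start, end + 1))
            i += 2
        else:
            out.append(spec[i])
            i += 1
    return "".join(out)

print(expand_range("a-e"))              # -> abcde
print(len(expand_range("\x20-\x39")))   # 26 characters: ' ' .. '9'
```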
The string $str1 in the local name of the document will be replaced with $str2 .
Characters from $charset1 in the local name of the document will be replaced with the corresponding character from $charset2 . $charset1 and $charset2 have the same syntax as $str in the -tr_del_chr option: both will be expanded to a character set using the rules described above. The characters in the expanded sets $charset1 and $charset2 have a 1:1 relationship, e.g. the second character in $charset1 will be replaced by the second character in $charset2 .
Caution
If the set $charset2 is smaller than the set $charset1 , any characters in $charset1 at positions at or beyond the size of $charset2 will be replaced by the last character in $charset2 . For example, tr_chr_chr(’abcd’, ’AB’, ’abcde’) will produce the result ’ABBBe’, as ’c’ and ’d’ in $charset1 are beyond the range of $charset2 and are hence replaced by its last character: ’B’. With the above example this may seem rather obvious, but be reminded that elements like ’[:punct:]’, while deterministic (they do not depend on your ’locale’), can still be hard to use, as you must determine which and how many characters they produce upon expansion. See the description of -tr_del_chr above for additional info to help you with this.
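This padding behaviour can be sketched in Python (an illustration of the rule above, not pavuk’s C implementation):

```python
def tr_chr_chr(s, set1, set2):
    """Replace characters of set1 by their positional counterparts in
    set2; when set2 is shorter, its last character is reused."""
    table = {}
    for i, ch in enumerate(set1):
        table[ord(ch)] = set2[min(i, len(set2) - 1)]
    return s.translate(table)

print(tr_chr_chr("abcde", "abcd", "AB"))   # -> ABBBe
```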
Define the local filename to use for the very first file downloaded. This option is most useful when running pavuk in ’singlepage’ mode, but it works for any mode.
With this option you can change the directory index name. By default the filename _._.html is used, which is assumed to be a filename not usually occurring on web/ftp/... sites.
The -nostore_index option disables storing of directory indexes as HTML files (which are named according to the -index_name setting). The default is to store all directory URLs as HTML index files (i.e. -store_index ).
This is a very powerful option! This option is used to flexibly change the layout of the local document tree. It accepts three parameters.
The first parameter $t is used to say what type the following pattern is:
is used for a wildcard pattern (uses fnmatch (3) ), while
is used for a regular expression pattern (using any supported RE implementation).
The second parameter is the matching pattern used to select URLs for this rule. If a URL matches this pattern, then the local name for this URL is computed using the rule specified in the third parameter.
And the third parameter is the local name building rule. Pavuk supports two kinds of local name building rules. One is based only on simple rule macros; the other is a more complicated, extended rule, which also enables you to perform several functions in a LISP-like micro language.
Pavuk differentiates between these two kinds of rules by looking at the first character of the rule. When the first character is a ’(’ open bracket character, the rule is assumed to be of the extended sort, while in all other cases it is assumed to be a simple rule.
A Simple rule should contain a mix of literals and escaped macros. Macros are escaped by the % character or the $ character.
Note
If you want to place a literal % or $ character in the generated string, you can escape it with a \ backslash prefix, so pavuk will not treat it as a macro escape character.
Note
-fnrules always performs additional cleanup for file paths produced by both matching simple and extended rules: multiple consecutive occurrences of / slashes in the path are replaced by a single / slash, while any directory and/or file names which end with a . dot have that dot removed.
Note
-fnrules statements are processed in the order they occur on the command line. If a rule matches the current URL, that rule is applied and any subsequent rules are skipped. This allows you to specify multiple -fnrules options on the command line; by ordering them from specific to generic, you can apply different rules to subsets of the URL collection (e.g. by putting the -fnrules F ’*’ ’%some%macros%’ statement last).
Note
When an -fnrules statement matches the current URL, any specified -base_level path processing will not be applied to the -fnrules generated path.
Here is a list of recognized macros:
where x is any positive number. This macro is replaced with the x -th substring matched by the RE pattern specified in the second -fnrules argument $m . (To use this you need to understand RE sub-matches!)
is replaced with protocol id string:
(http,https,ftp,ftps,file,gopher)
is replaced with password. (use this only where applicable)
is replaced with user name. (use this only where applicable)
is replaced with the fully qualified host name.
is replaced with the fully qualified domain name.
is replaced with port number.
is replaced with default absolute local path to document.
is replaced with path to document.
is replaced with document name (including the extension).
is replaced with base name of document (without the extension).
is replaced with the URL filename extension.
is replaced with the URL searchstring.
is replaced with the full MIME type of document as transmitted in the MIME header. For example:
text/html; charset=utf-8
As of v0.9.36, you do not need to specify the -post_update option to make this option work.
is replaced with basic MIME type of the document, i.e. the MIME type without any attributes. For example:
text/html
As of v0.9.36, you do not need to specify the -post_update option to make this option work.
is replaced with MIME type attributes of the document, i.e. all the stuff following the initial ’;’ semicolon as specified in the MIME type header which was sent to us by the server. For example:
charset=utf-8
As of v0.9.36, you do not need to specify the -post_update option to make this option work.
is replaced with default extension assigned to the MIME type of the document.
As of v0.9.36, you do not need to specify the -post_update option to make this option work.
You may want to specify the additional command line option -mime_type_file $file to override the rather limited set of built-in MIME types and default file extensions.
is replaced with the default extension assigned to the MIME type of the document, if one exists. Otherwise, the existing file extension is used instead.
You may want to specify the additional command line option -mime_type_file $file to override the rather limited set of built-in MIME types and default file extensions.
is replaced with file extension if one is available. Otherwise, the default extension assigned to the MIME type of the document is used instead.
You may want to specify the additional command line option -mime_type_file $file to override the rather limited set of built-in MIME types and default file extensions.
where x is a positive decimal number. This macro is replaced with the x -th directory from the path of the document, starting with 1 for the first sub-directory.
where x is a positive decimal number. This macro is replaced with the x -th directory from the path of the document, counting down from the end; the value 1 indicates the last sub-directory in the path.
default local filename for the URL
Here is an example. If you want to place documents into one directory per extension, use the following -fnrules option:
-fnrules F ’*’ ’/%e/%n’
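Assuming, as in the example’s intent, that %e expands to the URL filename extension and %n to the document name, the resulting layout can be sketched in Python (an illustration only, not pavuk’s rule engine):

```python
from urllib.parse import urlparse
import posixpath

def simple_rule(url):
    """Sketch of the '/%e/%n' layout: one directory per extension."""
    path = urlparse(url).path
    name = posixpath.basename(path)
    ext = posixpath.splitext(name)[1].lstrip(".")
    return "/%s/%s" % (ext, name)

print(simple_rule("http://www.example.com/docs/manual.html"))
# -> /html/manual.html
```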
Extended rules always begin with a ’(’ character. These rules use a syntax much like LISP syntax.
Here are the basic rules for writing extended rules:
the complete rule statement must return the local filename as a string return value
each function/operation is enclosed inside round braces ()
the first token right after the opening brace is the function name/operator
each function has a nonzero fixed number of parameters
each function returns a numeric or string value
function parameters are separated by one or more space characters
any parameter of a function may be a string, number, macro or another function
a literal string parameter must always be quoted using " double quotes. When you need to include a " double quote as part of the literal string itself, escape it by prefixing it with a \ backslash character.
a literal numeric parameter can be presented in any encoding supported by the strtol (3) function (octal, decimal, hexadecimal, ...)
there is no implicit conversion from number to string
each macro is prefixed by % character and is one character long
each macro is replaced by its string representation from current URL
function parameters are typed strictly
the top-level function must return a string value
Extended rules support the full set of %-escaped macros supported by simple rules, plus one additional macro:
the URL string
Here is a description of all supported functions/operators:
concatenate two string parameters
accepts two string parameters
returns string value
substring of a string
accepts three parameters
first is the string from which we want to cut a sub-part
second is a number which represents the starting position in the string
third is a number which represents the ending position in the string
returns string value
compute a modulo hash value from a string with a specified base
accepts two parameters
first is the string for which we are computing the hash value
second is the numeric base of the modulo hash
returns numeric value
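Such a modulo hash is typically used to spread many local files over a fixed number of subdirectories. One plausible implementation in Python (the exact hash pavuk computes may differ, so this is an assumption for illustration):

```python
def mod_hash(s, base):
    # sum the character codes of the string, reduced modulo the base
    return sum(s.encode()) % base

# e.g. shard documents into 16 bucket directories by name
bucket = mod_hash("index.html", 16)
print(bucket)
```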
compute the MD5 checksum of a string
accepts one string value
returns a string which represents the MD5 checksum
convert all characters inside a string to lower case
accepts one string value
returns string value
convert all characters inside a string to upper case
accepts one string value
returns string value
encode unsafe characters in a string with the same encoding used for unsafe characters inside URLs ( %xx ). By default all non-ASCII values are encoded when this function is used.
accepts two string values
first is the string we want to encode
second is a string which contains the unsafe characters
returns string value
decode any URL entities in the string and replace them with the actual characters
accepts one string value
returns string value
delete unwanted characters from a string (similar functionality to the -tr_del_chr option)
accepts two string values
first is the string from which we want to delete
second is a string which contains the characters we want to delete
returns string value
replace characters with other characters in a string (similar functionality to the -tr_chr_chr option)
accepts three string values
first is the string inside which we want to replace characters
second is the set of characters we want to replace
third is the set of characters to replace them with
returns string value
replace a substring inside a string with another string (similar functionality to the -tr_str_str option)
accepts three string values
first is the string we want to process
second is the from string
third is the to string
returns string value
Find series of identical characters and replace each series with a single occurrence.
Accepts two string values.
First is the string we want to process.
Second is the set of characters which should be processed.
Returns the processed string value
Expand the given character set. Accepts most of the entities you’d expect within a regex ’[...]’ set definition, with the notable exception of the ’^’ set-inversion character at the start of the set. (You can create the same result by wrapping this command inside an inv instruction.) Here is the list of supported features for the input set description string:
the predefined sets [:upper:], [:lower:], [:alpha:], [:alnum:], [:digit:], [:xdigit:], [:space:], [:blank:], [:cntrl:], [:print:], [:nprint:], [:punct:], [:graph:] (where [:nprint:] is the exact opposite of [:print:] )
’-’ representing a range from the previous starting character to the subsequent end character (where a ’-’ at the start of the set definition implies a start at ASCII character 0x01 (which is not the number ’1’), and a ’-’ at the end of the set implies an end-of-range character 0xFF (which is, admittedly, beyond the ASCII range, but is the highest value representable in a single 8-bit byte))
’\’-escaped characters ’\n’, ’\r’, ’\t’ representing ASCII LF, CR and TAB respectively (all other ’escaped’ characters are copied as-is, though without the ’\’ escape character)
’\0xnn’ or ’\0Xnn’, where ’n’ is any hexadecimal digit, can be used to represent any character value (except NUL of course, as that character serves as a string sentinel throughout the system)
literal characters, which are copied verbatim.
Accepts one string: the string describing the set using the features listed above, e.g. [:alnum:].,~()_\0x27/
returns the expanded set as a string value
Inverts the given character set. ’Inverting’ means this instruction produces the set of all characters which do not occur in the given input set.
Accepts one string value.
The input set may contain duplicate entries, e.g. ’abb’ is as valid a set as ’ba’ or ’ab’.
The input set is assumed to have been expanded already. You may wish to use the ecs instruction to do so first.
Returns the complementing set as a string value
Converts the input string to a ’path safe’ form, producing a string which may represent a ’safe’ file path on both UNIX and Windows systems. This means ’unsafe’ characters, such as ’:’ and others, are deleted or replaced. To help create Windows-safe file paths on all systems, the processing includes the removal/replacement of a leading ’.’ dot in file names (which is generally used on UNIX to create ’hidden’ files).
Removal or replacement of such ’unsafe’ characters depends on the second argument, which is the replacement string. When it is empty, the ’unsafe’ characters will be removed (i.e. replaced by this empty string); when the replacement string is not empty, each occurrence of an ’unsafe’ character will be replaced by this string.
The full set of characters considered ’unsafe’ is: ’\:*?"<>|&^%’ (excluding the surrounding quotes).
Accepts two string values.
First is the string we want to process.
Second is the replacement string which will be applied to each occurrence of an ’unsafe’ character in the first argument string.
Returns the (now ’safe’) string value
calculate the initial length of the string which contains only the specified set of characters (same functionality as the strspn (3) libc function)
accepts two string values
first is the input string
second is the set of acceptable characters
returns numeric value
calculate the initial length of the string which doesn’t contain the specified set of characters (same functionality as the strcspn (3) libc function)
accepts two string values
first is the input string
second is the set of unacceptable characters
returns numeric value
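Both length calculations mirror the libc functions; a Python sketch of the same semantics:

```python
from itertools import takewhile

def strspn(s, accept):
    """Length of the initial run of characters drawn from 'accept'."""
    return len(list(takewhile(lambda c: c in accept, s)))

def strcspn(s, reject):
    """Length of the initial run of characters NOT in 'reject'."""
    return len(list(takewhile(lambda c: c not in reject, s)))

print(strspn("2021-01 report", "0123456789"))   # -> 4
print(strcspn("report.txt", "."))               # -> 6
```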
calculate the length of a string
accepts one string value
returns numeric value
convert number to string by Hstate=1 format
accepts two parameters
first parameter is format string same as for printf (3) function
second is number which we want to convert
returns string value
convert string to number by radix
accepts two parameters
first parameter is string which we want to convert using the strtol (3) function
second is radix number to use for conversion; specify radix ’0’ zero if the strtol (3) function should auto-discover the radix used
returns numeric value
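Python’s int() with an explicit base behaves like strtol (3) here; a minimal sketch (hypothetical helper name; note that, unlike strtol, Python’s base 0 rejects a bare leading-zero octal such as ’010’):

```python
# Radix-based string-to-number conversion, as described above.
def to_number(s, radix=0):
    # radix 0 auto-detects '0x' (hex) and '0o' (octal) prefixes,
    # analogous to strtol(3) radix auto-discovery
    return int(s, radix)
```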
return position of last occurrence of specified character inside string
accepts two string parameters
first string which we are searching in
Second string contains character for which we are looking (only the first character of the string is used).
Returns a numeric value (0 if the character could not be found).
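Since 0 denotes ’not found’, the returned position is presumably 1-based; a Python sketch under that assumption (hypothetical helper name):

```python
# Position of the last occurrence of needle[0] in haystack,
# assuming 1-based positions with 0 meaning 'not found'.
def pos_last(haystack, needle):
    if not needle:
        return 0
    return haystack.rfind(needle[0]) + 1  # rfind returns -1 when absent
```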
add two numeric values
accepts two numeric values
returns numeric value
subtract two numeric values
accepts two numeric values
returns numeric value
calculate modulo remainder
accepts two numeric values
returns numeric value
multiply two numeric values
accepts two numeric values
returns numeric value
divide two numeric values
accepts two numeric values
returns numeric value; 0 if division by zero
remove parameter from query string
accepts two strings
first parameter is the string which we are adjusting
second parameter is the name of parameter which should be removed
returns adjusted string
get query string parameter value
accepts two strings
first parameter is query string from which to get the parameter value (usually %s )
second string is name of parameter for which we want to get the value
returns value of the parameter or empty string when the parameter doesn’t exist
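A minimal Python sketch of the assumed semantics of both query-string helpers (no URL-decoding is attempted; hypothetical helper names):

```python
# Sketch of the two query-string helpers described above.
def remove_param(query, name):
    """Drop every 'name=...' pair from an &-separated query string."""
    pairs = [p for p in query.split("&") if p.split("=", 1)[0] != name]
    return "&".join(pairs)

def get_param(query, name):
    """Return the value of 'name', or "" when the parameter doesn't exist."""
    for p in query.split("&"):
        key, _, value = p.partition("=")
        if key == name:
            return value
    return ""
```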
logical decision
accepts three parameters
first is numeric and when its value is nonzero, the result of this decision is the result of the second parameter, otherwise it is the result of the third parameter
second parameter is string (returned when condition is nonzero/true)
third parameter is string (returned when condition is zero/false)
returns string result of decision
logical not
accepts one numeric parameter
returns negation of parameter
logical and
accepts two numeric parameters
returns logical and of parameters
logical or
accepts two numeric parameters
returns logical or of parameters
get file extension
accepts one string (filename or path)
returns string containing the extension of the parameter
compare two strings
accepts two strings for comparison
returns a numeric value: one value when the strings are different, another when they are equal
compare a wildcard pattern and a string (has the same functionality as the fnmatch (3) libc function)
accepts two strings for comparison
first string is a wildcard pattern
second string is the data which should match the pattern
returns a numeric value: one value when the string doesn’t match the pattern, another when it does
return URL sub-part from the matching -fnrules ’R’ regex
accepts one number, which references the corresponding sub-expression in the -fnrules ’R’ regex
returns the URL substring which matched the specified sub-expression
This function is available only when pavuk is compiled with regex support, including sub-expressions (POSIX/PCRE/TRE/...).
Return the result of the given ’simple’ expression, as applied to the current URL.
Accepts one string, which is parsed as a ’simple’ -fnrules expression. As such, expr can be used to mix the ease of ’simple expressions’ with the enhanced capabilities available with our LISP-like complex expression constructs, giving you the best of both worlds, all at the same time.
For example, you could create a local path, based on the hash, followed by the base name, search string and extension from the URL like this:
-fnrules F ’*’ ’(sc (nc "%02d/" (hsh %h 100)) (sexpr "%b%s.%Y"))’
Returns the string which is produced by the evaluated ’simple expression’.
Execute JavaScript function
Accepts one string parameter which holds the name of a JavaScript function specified in the script loaded with the -js_script_file option.
Returns a string value equal to the return value of the JavaScript function. See the -js_script_file command line option for further details.
This function is available only when pavuk is compiled with support for JavaScript bindings.
For example, if you are mirroring a very large number of Internet sites into the same local directory, too many entries in one directory will cause performance problems. You may use, for example, the hsh or md5 functions to generate one additional level of hash directories based on the hostname, with one of the following options:
-fnrules F ’*’ ’(sc (nc "%02d/" (hsh %h 100)) %o)’
-fnrules F ’*’ ’(sc (ss (md5 %h) 0 2) %o)’
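The md5-based example above derives the extra directory level from the first two hex digits of the MD5 of the host name %h; that part can be mimicked in Python (a sketch; pavuk’s own hsh hash function is not reproduced here):

```python
import hashlib

# Extra hash-directory level from a host name, as in the md5 example above:
# first two hex digits of md5(hostname), followed by '/'.
def hash_dir(hostname):
    return hashlib.md5(hostname.encode()).hexdigest()[:2] + "/"
```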
Number of directory levels to omit in local tree.
For example, when downloading the URL ftp://ftp.idata.sk/pub/unix/www/pavuk-0.7pl1.tgz and you enter -base_level 4 on the command line, www/pavuk-0.7pl1.tgz will be created in the local tree instead of ftp/ftp.idata.sk_21/pub/unix/www/pavuk-0.7pl1.tgz as normally.
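The effect of -base_level on the local path can be sketched as a simple component strip (hypothetical helper name):

```python
# -base_level N drops the first N components of the normally generated
# local path, as in the ftp.idata.sk example above.
def apply_base_level(local_path, base_level):
    return "/".join(local_path.split("/")[base_level:])
```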
Default prefix of the mirrored directory. This option is used only when you are trying to synchronize the content of a remote directory which was downloaded using the -base_level option. You must also use the directory based synchronization method, not the URL based synchronization method. This is especially useful when used in conjunction with the -remove_old option.
This option turns on/off the removal of HTML tags which contain advertisement banners. The banners are not removed from the HTML file, but are commented out. Such URLs will also not be downloaded. This option has effect only when used with option -adv_re . Default is turned off. This option is available only when your system has support for one of the supported regular expression implementations.
This option is used to specify regular expressions for matching URLs of advertisement banners. For example:
-adv_re http://ad.doubleclick.net/.*
is used to match all files from server ad.doubleclick.net. This option is available only when your system has any supported regular expressions implementation.
Pavuk by default always attempts to assign a unique local filename to each unique URL. If this behavior is not wanted, you can use option -nounique_name to disable this.
define the hammer mode:
0 = old fashioned: keep on running until all URLs have been accessed -hammer_repeat times.
1 = record activity on first run; burst transmit recorded activity -hammer_repeat times. This is an extremely fast mode suitable for loadtesting medium and large servers (assuming you are running pavuk on similar hardware).
define the number of threads to use for the replay hammer attack (hammer mode 1)
define hammer mode flags: see the man page for more info
delay for network communications (msec). 0 == no delay, default = 0.
$nr specifies the delay in milliseconds, unless postfixed with one of the characters S, M, H or D (either in upper or lower case), which imply the alternative time units S = seconds, M = minutes, H = hours or D = days.
timeout for network communications (msec). 0 == no timeout, default = 0.
$nr specifies the timeout in milliseconds, unless postfixed with one of the characters S, M, H or D (either in upper or lower case), which imply the alternative time units S = seconds, M = minutes, H = hours or D = days.
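The suffix handling described above can be sketched like this (hypothetical helper name, assuming the stated semantics; the result is always in milliseconds):

```python
# Parse a $nr duration spec: a plain number is milliseconds; a trailing
# S/M/H/D (case-insensitive) selects seconds, minutes, hours or days.
UNITS_MS = {"s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}

def parse_duration_ms(spec):
    spec = spec.strip()
    unit = spec[-1].lower()
    if unit in UNITS_MS:
        return int(spec[:-1]) * UNITS_MS[unit]
    return int(spec)
```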
number of times the requests should be executed again (load test by hammering the same stuff over and over).
log all activity during a ’hammer’ run.
Note: this only applies to hammer_modes >= 1, as hammer_mode == 0 is simply a re-execution of all the requests, using the regular code and processing by pavuk, and as such the regular pavuk logging is used for that mode.
number of the file descriptor to which recorded activity is output.
Note: pavuk 0.9.36 and later releases also support the @$file argument, where you can specify a file to dump the data to. The file path must be prefixed by an ’@’ character. If you prefix the file path with a second ’@’, pavuk will assume you wish to append to an already existing file. Otherwise the file will be created/erased when pavuk starts.
This option allows you to specify the number of seconds for which the program will be suspended between two transfers. Useful to avoid overloading the server. The default value for this option is 0.
When this option is active, pavuk randomizes the sleep time between transfers in the interval between zero and the value specified with the -sleep option. By default this option is inactive.
If a document has a modification time later than $nr days before today, then in sync mode pavuk attempts to retrieve a newer copy of the document from the remote server. The default value is 0.
Remove improper documents (those which don’t exist on the remote site). This option has effect only when used in directory based sync mode. When used with URL based sync mode, pavuk will not remove any old files which were excluded from the document tree and are not referenced in any HTML document. You must also use the option -subdir to let pavuk find the files which belong to the current mirror. By default pavuk won’t remove any old files.
is used to set your browser command (in the URL tree dialog you can use a right click to raise a menu, from which you can start the browser on the currently selected URL). This option is available only when compiled with the GTK GUI and with support for URL tree preview.
turns on displaying of debug messages. This option is available only when compiled with -DDEBUG, i.e. when having executed ./configure --enable-debug to set up the pavuk source code. If the -debug option is used, pavuk will output verbose information about documents, whole protocol level information, file locking information and much more (the amount and types of information depends on the -debug_level command-line arguments). This option is used as a trigger to enable output of debug messages selected by the -debug_level option. Default is debug mode turned off. To check if your pavuk binary supports -debug , you can run pavuk with the -version option.
Set the level of required debug information. $level can be a numeric value which represents a binary mask for the requested debug levels, or a comma separated list of supported debug level identifiers.
The debug level identifiers (as listed below) can be prefixed with a ! exclamation mark to turn them off. For example, this $level specification:
all,!html,!limits
will turn ’all’ debug levels ON, except ’html’ and ’limits’.
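The left-to-right evaluation of such a list can be sketched in Python (hypothetical helper; the level-name set is a small illustrative subset):

```python
# Parse a -debug_level identifier list like "all,!html,!limits":
# identifiers turn levels on, a '!' prefix turns them off,
# evaluated left to right.
ALL_LEVELS = {"html", "limits", "cookie", "net", "locks", "mtlock"}

def parse_debug_level(spec):
    enabled = set()
    for token in spec.split(","):
        token = token.strip()
        negate = token.startswith("!")
        name = token.lstrip("!")
        levels = ALL_LEVELS if name == "all" else {name}
        enabled = enabled - levels if negate else enabled | levels
    return enabled
```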
Currently pavuk supports the following debug level identifiers:
request all currently supported debug levels
for watching the pavuk I/O buffering layer at work - this layer is positioned on top of all file I/O and network traffic for improved performance.
for monitoring HTTP ’cookies’ processing.
for additional ’developer’ debug info. This generally produces more debug info across the board.
for watching events while running in -hammer_mode >= 1 replay mode
for HTML parser debugging
for monitoring HTML web form processing, such as recognizing and (automatically) filling in web form fields.
to see server side protocol messages
to see client side protocol messages
to see some special procedure calls
for debugging of document locking
for debugging some low level network stuff
for miscellaneous unsorted debug messages
for verbose user level messages
locking of resources in multithreading environment
launching/waking/sleeping/stopping of threads in multithreaded environment
for debugging of POST requests
for debugging limiting options, you will see the reason why particular URLs are rejected by pavuk and which option caused this.
for debugging -fnrules and JavaScript-based filters.
to enable verbose reporting about SSL related things.
to enable verbose reporting of development related things.
for debugging the -js_pattern , -js_transform and -js_transform2 filter processing.
This option has effect only when running pavuk in reminder mode. Pavuk sends the result of the reminder run to the command specified with this option; the result lists the URLs which have changed and the URLs which produced errors. The default remind command is "mailx user@server -s \"pavuk reminder result\"" .
Path to the Netscape browser cache directory. If you specify this path, pavuk attempts to find out whether a URL is in this cache. If the URL is there it will be fetched from the cache, otherwise pavuk will download it from the net. The cache directory index file must be named index.db and must be located in the cache directory. To support this feature, pavuk has to be linked with BerkeleyDB 1.8x .
Path to the Mozilla browser cache directory. Same functionality as the previous option, just for a different browser with different cache formats. Pavuk supports both formats of the Mozilla browser disk cache (old for versions <0.9 and new used in 0.9=<). The old format cache directory must contain a cache directory index database named cache.db . The new format cache directory must contain the map file _CACHE_MAP_ and three block files _CACHE_001_ , _CACHE_002_ , _CACHE_003_ . To support the old Mozilla cache format, pavuk has to be linked with BerkeleyDB 1.8x. The new Mozilla cache format doesn’t require any external library.
Post-processing command which will be executed after a successful download of a document. This command may process the document in some way. While this command runs, pavuk keeps the actual document locked, so there is no chance that some other pavuk process will modify the document. This post-processing command will get three additional parameters from pavuk.
local name of document
1 -- if the document is an HTML document, 0 -- if not
original URL of this document
This is a bit of a hacky option. It forces pavuk to also add the directory indexes of all queued documents to the URL queue. This allows pavuk to download more documents from a site than it can achieve by normal traversal of HTML documents. A bit dirty, but useful in some cases.
Pavuk optionally has a built-in JavaScript interpreter to allow high level customization of some internal procedures. Currently you are allowed to customize two things with your own JavaScript functions. You can use it to set precise limiting options, or you can write your own functions which can be used inside rules of the -fnrules option. With this option you can load a JavaScript script with functions into pavuk’s internal JavaScript interpreter. This option is available only when you have compiled pavuk with support for JavaScript bindings.
Specify an alternative MIME type and file extensions definition file $file to override the rather limited set of built-in MIME types and default file extensions. The file must be of a UNIX mime.types(5) compatible format.
If you do not specify this command line option, these MIME types and extensions are known to pavuk by default:
MIME types and default file extensions
MIME type | Default File Extension |
text/html* | html |
text/js | js |
text/plain | txt |
image/jpeg | jpg |
image/pjpeg | jpg |
image/gif | gif |
image/png | png |
image/tiff | tiff |
application/pdf | |
application/msword | doc |
application/postscript | ps |
application/rtf | rtf |
application/wordperfect5.1 | wps |
application/zip | zip |
video/mpeg | mpg |
Note that the source distribution of pavuk already includes a full fledged mime.types file for your convenience. You may point -mime_type_file at this file to make pavuk aware of (almost) all MIME types available out there!
You may want to use the JavaScript bindings built into pavuk for performing tasks which need more complexity than can be achieved with a regular, non-scriptable program.
You can load one JavaScript file into pavuk using the command line option -js_script_file . Currently there are two exits in pavuk where the user can insert their own JavaScript functions.
One is inside the routine which decides whether a particular URL should be downloaded or not. If you want to insert your own JavaScript decision function, you must name it pavuk_url_cond_check . The prototype of this function looks as follows:
function pavuk_url_cond_check (url, level) { ... }
where the function return value is used by pavuk. Any return value which evaluates to a boolean ’false’ or integer ’0’ (zero) will be considered a ’NO’ answer, i.e. skip the given URL. Any other boolean or integer return value constitutes a ’YES’ answer. (Note that return values are cast to an integer value before evaluation.)
is an integer number and indicates from which place in the pavuk code the pavuk_url_cond_check function is currently called:
condition checking is called from HTML parsing routine. At this point you can use all conditions besides -dmax , -newer_than , -older_than , -max_size , -min_size , -amimet , -dmimet and -user_condition when calling the pavuk url.check_cond(name, ....) URL class method from this JavaScript function script code. Calling url.check_cond(name, ....) with any of the conditions listed above will be processed as a no-op, i.e. it will return the boolean value ’TRUE’.
condition checking is called from routine which is performing queueing of URLs into download queue. These URLs have been collected from another HTML page before. At this point you can only use the conditions -dmax and -user_condition .
condition checking is called when a URL is taken from the download queue; it will be transferred if this check is successful. At this point you can use the same set of conditions as in level == 0, except -tag_pattern and -tag_rpattern . Additionally, you can use the condition -dmax here.
condition checking is called after pavuk sent request for download and detected document size, modification time and mime type. In this level you can only use the conditions -newer_than , -older_than , -max_size , -min_size , -amimet , -dmimet and -user_condition . As with the other levels, using any other conditions is identical to a no-op check.
is an object instance of the PavukUrl class. It contains all information about a particular URL and is a wrapper for parsed URLs, defined in pavuk as a structure of url type.
It has the following attributes:
read-write attributes
(int32, always defined) holds bitfields with different info (look in url.h to see more)
read-only attributes defined always
one of "http" "https" "ftp" "ftps" "file" "gopher" "unknown" , indicating the kind of URL
level in document tree at which this URL lies
number of parent documents which reference this URL
full URL string
read-only attributes defined when protocol == "http" or "https"
host name or IP address
port number
HTTP document
query string when available (the part of URL after ?)
anchor name when available (the part of URL after #)
user name for authorization when available
password for authorization when available
read-only attributes defined when protocol == "ftp" or "ftps"
host name or IP address
port number
user name for authorization when available
password for authorization when available
path to file or directory
anchor name when available (the part of URL after #)
flag whether this URL points to a directory
read-only attributes defined when protocol == "file"
path to file or directory
query string when available (the part of URL after ?)
anchor name when available (the part of URL after #)
read-only attributes defined when protocol == "gopher"
host name or IP address
port number
selector string
read-only attributes defined when protocol is unidentified
full URL string
read-only attributes available when performing checking of conditions
equivalent to level parameter of pavuk_url_cond_check function
MIME type of this URL (defined when available)
size of document (defined when available)
modification time of document (defined when available)
number of document in download queue (defined when available)
full content of parent document of current URL (defined when level == 0)
offset of current HTML tag in parent document of URL (defined when level == 0)
URL to which this URL was moved (defined when available)
full HTML tag (including the <> delimiter characters) from which the current URL is taken (defined when level == 0)
name of the HTML tag from which the current URL is taken (defined when level == 0)
name of the HTML tag attribute from which the current URL is taken (defined when level == 0)
And the following methods:
get URL of n-th parent document
check the condition whose option name is "name" . When you do not provide additional parameters, pavuk will use the parameters from the command line or scenario file for condition checking; otherwise it will use the listed parameters.
The following condition names are recognized (note that the use of other names is considered an error here):
-noFTP
-noHTTP
-noSSL
-noGopher
-noFTPS
-noCGI
-lmax
-asite
-dsite
-adomain
-ddomain
-aprefix
-dprefix
-asfx
-dsfx
-dont_leave_site
-dont_leave_dir
-site_level
-leave_level
-dont_leave_site_enter_dir
-aport
-dport
-aip_pattern
-dip_pattern
-pattern
-rpattern
-skip_pattern
-skip_rpattern
-url_pattern
-url_rpattern
-skip_url_pattern
-skip_url_rpattern
-tag_pattern
-tag_rpattern
-dmax
-user_condition
-max_size
-min_size
-amimet
-dmimet
-newer_than
-older_than
Next to that, pavuk also offers a global print(...) function which will print each of the parameters passed to it, separating them by a single space. The text is terminated by a newline. Note that each of the print(...) parameters is cast to a string before being printed.
Here is an example of what a pavuk_url_cond_check function can look like:
function pavuk_url_cond_check (url, level)
{
  if (level == 0)
  {
    if (url.level > 3 && url.check_cond("-asite", "www.host.com"))
      return false;
    if (url.check_cond("-url_rpattern",
                       "http://www.idata.sk/~ondrej/",
                       "http://www.idata.sk/~robo/")
        && url.check_cond("-dsfx", ".jar", ".tgz", ".png"))
      return false;
  }
  if (level == 2)
  {
    par = url.get_parent();
    if (par && par.get_moved())
      return false;
  }
  return true;
}
This example is rather useless, but shows you how to use this feature.
The second possible use of JavaScript with pavuk is in the -fnrules option for generating local names. In this case it is done by a special function of the extended -fnrules option syntax called "jsf " , which has one parameter: the name of the JavaScript function which will be called. The function must return a string and its prototype is something like the following:
function some_jsf_func(fnrule) { ... }
The fnrule parameter is an object instance of the PavukFnrules class.
It has three read-only attributes:
url - which is of PavukUrl type described above
pattern - which is the -fnrules provided pattern string
pattern_type - which is the -fnrules provided pattern type ID (an integer number): when called by a -fnrules ... ’F’ option, pattern_type == 2, when called by a -fnrules ... ’R’ (regex) option, pattern_type == 1, otherwise pattern_type == 0 (unknown).
and also has two methods
get_macro(macro) - returns the value of the ’%’ macros used in the -fnrules option, where the (string type) macro argument may be any of ’%i’, ’%p’, ’%u’, ’%h’, ’%m’, ’%r’, ’%d’, ’%n’, ’%b’, ’%e’, ’%s’, ’%q’, ’%U’, ’%o’, ’%M’, ’%B’, ’%A’, ’%E’, ’%Y’ or ’%X’. Any other macro argument value will not be processed and is passed as-is, i.e. will be returned by get_macro(macro) untouched.
get_sub(nr) - returns the substring of ’urlstr’ matched by the regex sub-expression ’nr’ when the -fnrules R statement was matched.
You can do something like:
-fnrules F "*" ’(jsf "some_fnrules_func")’
As of version 0.9pl29 pavuk has changed how it indicates status via exit codes. In earlier versions, exit status 0 meant no error and a nonzero exit status was something like the count of failed documents. In all versions after 0.9pl29 the following exit codes are defined:
no error, everything is OK
error in configuration of pavuk options or error in config files
some error occurred while downloading documents
a signal was caught while downloading documents; transfer was aborted
an internal check failed while downloading documents; transfer was aborted
variable is used to construct email address from user and hostname
used to set internationalized environment
with this variable you can specify alternative location for your .pavukrc configuration file.
is used for scheduling.
is used to decode gzip or compress encoded documents. Note that since pavuk release 0.9.36 gunzip is only used when pavuk has been built without built-in zlib support. You can check if your pavuk binary comes with built-in zlib support by running pavuk -v which should report ’gzip/compress/deflate Content-Encoding’ as one of the optional features available.
If you find any, please let me know.
---
---
These files are used as default configuration files. You may specify some constant values there, like your proxy server or your preferred WWW browser. Configuration options reflect command line options. Not all parameters are suitable for use in a default configuration file; you should select only those which you really need.
The file ~/.pavuk_prefs is a special file which contains automatically stored configuration. This file is used only when running the GUI interface of pavuk and the option -prefs is active.
The file $file should contain as many authentication records as you need. Records are separated by any number of empty lines. Parameter names are case insensitive.
Structure of record:
Field: Proto: <proto ID>
Description: identification of protocol (ftp/http/https/..)
Reqd: required

Field: Host: <host:[port]>
Description: host name
Reqd: required

Field: User: <user>
Description: name of user
Reqd: optional

Field: Pass: <password>
Description: password for user
Reqd: optional

Field: Base: <path>
Description: base prefix of document path
Reqd: optional

Field: Realm: <name>
Description: realm for HTTP authorization
Reqd: optional

Field: NTLMDomain: <domain>
Description: NTLM domain for NTLM authorization
Reqd: optional

Field: Type: <type>
Description: HTTP authentication scheme. Accepted values: {1/2/3/4/user/Basic/Digest/NTLM}. Similar meaning as the -auth_scheme option (see the help for this option for more details). Default is 2 (Basic scheme).
Reqd: optional
See pavuk_authinfo.sample file for an example.
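Based on the field descriptions above, a complete record might look like this (all values hypothetical; refer to pavuk_authinfo.sample for the authoritative format):

```
Proto: http
Host: www.example.com:80
User: jdoe
Pass: secret
Realm: members
Type: Basic
```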
this is the file where information about configurable menu option shortcuts is stored. This is available only when compiled with GTK+ 1.2 and higher.
this file contains information about URLs for running in reminder mode. The structure of this file is very simple. Each line contains information about one URL. The first entry on a line is the last known modification time of the URL (stored in time_t format - number of seconds since 1.1.1970 GMT), and the second entry is the URL itself.
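Parsing one line of this file is trivial; a Python sketch (hypothetical helper name):

```python
# Parse one reminder-file line of the form "<time_t> <URL>".
def parse_reminder_line(line):
    stamp, url = line.split(None, 1)
    return int(stamp), url.strip()
```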
The first parsed file (if present) is /usr/local/etc/pavukrc , then ~/.pavukrc (if present), then ~/.pavuk_prefs (if present). Last, the command line is parsed.
The precedence of configuration settings is as follows (ordered from highest to lowest precedence):
Entered in user interface
Entered in command line
~/.pavuk_prefs
~/.pavukrc
/usr/local/etc/pavukrc
Here is a table of config file option / command line option pairs:
Config file options vs. command line option equivalents
Config file option | command line option |
ActiveFTPData: | -ftp_active / -ftp_passive |
ActiveFTPPortRange: | -active_ftp_port_range |
AddHTTPHeader: | -httpad |
AdvBannerRE: | -adv_re |
AllLinksToLocal: | -all_to_local / -noall_to_local |
AllLinksToRemote: | -all_to_remote / -noall_to_remote |
AllowCGI: | -CGI / -noCGI |
AllowedDomains: | -adomain |
AllowedIPAdrressPattern: | -aip_pattern |
AllowedMIMETypes: | -amimet |
AllowedPorts: | -aport |
AllowedPrefixes: | -aprefix |
AllowedSites: | -asite |
AllowedSuffixes: | -asfx |
AllowFTP: | -FTP / -noFTP |
AllowFTPRecursion: | -FTPdir |
AllowFTPS: | -FTPS / -noFTPS |
AllowGopher: | -Gopher / -noGopher |
AllowGZEncoding: | -Enc / -noEnc |
AllowHTTP: | -HTTP / -noHTTP |
AllowRelocation: | -Relocate / -noRelocate |
AllowSSL: | -SSL / -noSSL |
AlwaysMDTM: | -always_mdtm / -noalways_mdtm |
AuthFile: | -auth_file |
AuthReuseDigestNonce: | -auth_reuse_nonce |
AuthReuseProxyDigestNonce: | -auth_reuse_proxy_nonce |
AutoReferer: | -auto_referer / -noauto_referer |
BaseLevel: | -base_level |
BgMode: | -bg / -nobg |
Browser: | -browser |
CheckIfRunnigAtBackground: | -check_bg / -nocheck_bg |
CheckSize: | -check_size / -nocheck_size |
CommTimeout: | -timeout |
CookieCheckDomain: | -cookie_check / -nocookie_check |
CookieFile: | -cookie_file |
CookieRecv: | -cookie_recv / -nocookie_recv |
CookieSend: | -cookie_send / -nocookie_send |
CookiesMax: | -cookies_max |
CookieUpdate: | -cookie_update / -nocookie_update |
Debug: | -debug / -nodebug |
DebugLevel: | -debug_level |
DefaultMode: | -mode |
DeleteAfterTransfer: | -del_after / -nodel_after |
DisabledCookieDomains: | -disabled_cookie_domains |
DisableHTMLTag: | -disable_html_tag |
DisallowedDomains: | -ddomain |
DisallowedIPAdrressPattern: | -dip_pattern |
DisallowedMIMETypes: | -dmimet |
DisallowedPorts: | -dport |
DisallowedPrefixes: | -dprefix |
DisallowedSites: | -dsite |
DisallowedSuffixes: | -dsfx |
DocExpiration: | -ddays |
DontLeaveDir: | -leave_dir / -dont_leave_dir |
DontLeaveSite: | -leave_site / -dont_leave_site |
DontTouchTagREPattern: | -dont_touch_tag_rpattern |
DontTouchUrlPattern: | -dont_touch_url_pattern |
DontTouchUrlREPattern: | -dont_touch_url_rpattern |
DumpFD: | -dumpfd |
DumpUrlFD: | -dump_urlfd |
EmailAddress: | -from |
EnableHTMLTag: | -enable_html_tag |
EnableJS: | -enable_js / -disable_js |
FileSizeQuota: | -file_quota |
FixWuFTPDBrokenLISTcmd: | -fix_wuftpd_list / -nofix_wuftpd_list |
FnameRules: | -fnrules |
FollowCommand: | -follow_cmd |
ForceReget: | -force_reget |
FSQuota: | -fs_quota |
FTPDirtyProxy: | -ftp_dirtyproxy |
FTPhtml: | -FTPhtml / -noFTPhtml |
FTPListCMD: | -FTPlist / -noFTPlist |
FTPListOptions: | -ftp_list_options |
FtpLoginHandshake: | -ftp_login_handshake |
FTPProxy: | -ftp_proxy |
FTPProxyPassword: | -ftp_proxy_pass |
FTPProxyUser: | -ftp_proxy_user |
FTPViaHTTPProxy: | -ftp_httpgw |
GopherProxy: | -gopher_proxy |
GopherViaHTTPProxy: | -gopher_httpgw |
GUIFont: | -gui_font |
HackAddIndex: | -hack_add_index / -nohack_add_index |
HammerEaseOffDelay: | -hammer_ease |
HammerFlags: | -hammer_flags |
HammerMode: | -hammer_mode |
HammerReadTimeout: | -hammer_rtimeout |
HammerRecorderDumpFD: | -hammer_recdump |
HammerRepeatCount: | -hammer_repeat |
HammerThreadCount: | -hammer_threads |
HashSize: | -hash_size |
HTMLFormData: | -formdata |
HTMLTagPattern: | -tag_pattern |
HTMLTagREPattern: | -tag_rpattern |
HTTPAuthorizationName: | -auth_name |
HTTPAuthorizationPassword: | -auth_passwd |
HTTPAuthorizationScheme: | -auth_scheme |
HTTPProxy: | -http_proxy |
HTTPProxyAuth: | -http_proxy_auth |
HTTPProxyPass: | -http_proxy_pass |
HTTPProxyUser: | -http_proxy_user |
Identity: | -identity |
IgnoreChunkServerBug | -ignore_chunk_bug / -noignore_chunk_bug |
ImmediateMessages: | -immesg / -noimmsg |
IndexName: | -index_name |
JavaScriptFile: | -js_script_file |
JavascriptPattern: | -js_pattern |
JSTransform2: | -js_transform2 |
JSTransform: | -js_transform |
Language: | -language |
LeaveLevel: | -leave_level |
LeaveSiteEnterDirectory: | -leave_site_enter_dir / -dont_leave_site_enter_dir |
LimitInlineObjects: | -limit_inlines / -dont_limit_inlines |
LocalIP: | -local_ip |
LogFile: | -logfile |
LogHammerAction: | -log_hammering / -nolog_hammering |
MatchPattern: | -pattern |
MaxDocs: | -dmax |
MaxLevel: | -lmax / -l |
MaxRate: | -maxrate |
MaxRedirections: | -nredirs |
MaxRegets: | -nregets |
MaxRetry: | -retry |
MaxRunTime: | -max_time |
MaxSize: | -maxsize |
MinRate: | -minrate |
MinSize: | -minsize |
MozillaCacheDir: | -mozcache_dir |
NetscapeCacheDir: | -nscache_dir |
NewerThan: | -newer_than |
NLSMessageCatalogDir: | -msgcat |
NSSAcceptUnknownCert: | -nss_accept_unknown_cert / -nonss_accept_unknown_cert |
NSSCertDir: | -nss_cert_dir |
NSSDomesticPolicy: | -nss_domestic_policy / -nss_export_policy |
NTLMAuthorizationDomain: | -auth_ntlm_domain |
NTLMProxyAuthorizationDomain: | -auth_proxy_ntlm_domain |
NumberOfThreads: | -nthreads |
OlderThan: | -older_than |
PageSuffixes: | -page_sfx |
PostCommand: | -post_cmd |
PostUpdate: | -post_update / -nopost_update |
PreferredCharset: | -acharset |
PreferredLanguages: | -alang |
PreserveAbsoluteSymlinks: | -preserve_slinks / -nopreserve_slinks |
PreservePermisions: | -preserve_perm / -nopreserve_perm |
PreserveTime: | -preserve_time / -nopreserve_time |
Quiet: | -quiet / -verbose |
RandomizeSleepPeriod: | -rsleep / -norsleep |
ReadBufferSize: | -bufsize |
ReadCSS: | -read_css / -noread_css |
ReadHtmlComment: | -noread_comments / -read_comments |
Read_MSIE_ConditionalComments: | -noread_msie_cc / -read_msie_cc |
Read_XML_CDATA_Content: | -noread_cdata / -read_cdata |
RegetRollbackAmount: | -rollback |
REMatchPattern: | -rpattern |
ReminderCMD: | -remind_cmd |
RemoveAdvertisement: | -remove_adv / -noremove_adv |
RemoveBeforeStore: | -remove_before_store / -noremove_before_store |
RemoveOldDocuments: | -remove_old |
RequestInfo: | -request |
Reschedule: | -reschedule |
RetrieveSymlinks: | -retrieve_symlink / -noretrieve_symlink |
RunX: | -runX |
ScenarioDir: | -scndir |
SchedulingCommand: | -sched_cmd |
SelectedLinksToLocal: | -sel_to_local / -nosel_to_local |
SendFromHeader: | -send_from / -nosend_from |
SendIfRange: | -send_if_range / -nosend_if_range |
SeparateInfoDir: | -info_dir |
ShowDownloadTime: | -stime |
ShowProgress: | -progress |
SinglePage: | -singlepage / -nosinglepage |
SiteLevel: | -site_level |
SkipMatchPattern: | -skip_pattern |
SkipREMatchPattern: | -skip_rpattern |
SkipURLMatchPattern: | -skip_url_pattern |
SkipURLREMatchPattern: | -skip_url_rpattern |
SleepBetween: | -sleep |
SLogFile: | -slogfile |
SSLCertFile: | -ssl_cert_file |
SSLCertPassword: | -ssl_cert_passwd |
SSLKeyFile: | -ssl_key_file |
SSLProxy: | -ssl_proxy |
SSLVersion: | -ssl_version |
StatisticsFile: | -statfile |
StoreDirIndexFile: | -store_index / -nostore_index |
StoreDocInfoFiles: | -store_info / -nostore_info |
StoreName: | -store_name |
TransferQuota: | -trans_quota |
TrChrToChr: | -tr_chr_chr |
TrDeleteChar: | -tr_del_chr |
TrStrToStr: | -tr_str_str |
UniqueDocName: | -unique_name / -nounique_name |
UniqueLogName: | -unique_log / -nounique_log |
UniqueSSLID: | -unique_sslid / -nounique_sslid |
URLMatchPattern: | -url_pattern |
URLREMatchPattern: | -url_rpattern |
UrlSchedulingStrategy: | -url_strategy |
URLsFile: | -urls_file |
UseCache: | -cache / -nocache |
UseHTTP11: | -use_http11 |
UsePreferences: | -prefs / -noprefs |
UserCondition: | -user_condition |
UseRobots: | -Robots / -noRobots |
VerifyCERT: | -verify / -noverify |
WaitOnExit: | -ewait |
WorkingDir: | -cdir |
WorkingSubDir: | -subdir |
XMaxLogSize: | -xmaxlog |
URL: | one URL (more lines with URL: ... means more URLs) |
Some config file entries are not available as command-line options:
Extra config file options for the GTK GUI
Config file option | Description |
BtnConfigureIcon: | accepts a path argument |
BtnConfigureIcon_s: | accepts a path argument |
BtnLimitsIcon: | accepts a path argument |
BtnLimitsIcon_s: | accepts a path argument |
BtnGoBgIcon: | accepts a path argument |
BtnGoBgIcon_s: | accepts a path argument |
BtnRestartIcon: | accepts a path argument |
BtnRestartIcon_s: | accepts a path argument |
BtnContinueIcon: | accepts a path argument |
BtnContinueIcon_s: | accepts a path argument |
BtnStopIcon: | accepts a path argument |
BtnStopIcon_s: | accepts a path argument |
BtnBreakIcon: | accepts a path argument |
BtnBreakIcon_s: | accepts a path argument |
BtnExitIcon: | accepts a path argument |
BtnExitIcon_s: | accepts a path argument |
BtnMinimizeIcon: | accepts a path argument |
BtnMaximizeIcon: | accepts a path argument |
A line beginning with '#' is a comment.
TrStrToStr: and TrChrToChr: must contain two quoted strings. All parameter names are case-insensitive. If an option is missing from this list, look inside the config.c source file.
See pavukrc.sample file for example.
The simplest incantation:
pavuk http://<my_host>/doc/
Mirroring a site into a specific local directory tree, rejecting big files (> 16 MB), plus lots of extra options covering, among other things, active FTP sessions and passive FTP (for when you're behind a firewall). As such, this is a rather mixed example:
pavuk -mode mirror -nobg -store_info -info_dir /mirror/info -nthreads 1 -cdir /mirror/incoming -subdir /mirror/incoming -preserve_time -nopreserve_perm -nopreserve_slinks -noretrieve_symlink -force_reget -noRobots -trans_quota 16384 -maxsize 16777216 -max_time 28 -nodel_after -remove_before_store -ftpdir -ftplist -ftp_list_options -a -dont_leave_site -dont_leave_dir -all_to_local -remove_old -nostore_index -active_ftp_port_range 57344:65535 -always_mdtm -ftp_passive -base_level 2 http://<my_host>/doc/
Note
This is a writeup for a bit of extra pavuk documentation. Comments are welcome; I hope this is useful for those who are looking for some prime examples of pavuk use (intermediate complexity).
Author: Ger Hobbelt
<ger@hobbelt.com>
Anyone for whom
'pavuk http://www.da-url-to-spider.com/'
doesn't entirely suit their needs.
Anyone who feels an itch coming up when their current spider software croaks again, merely because they were only interested in spidering part of the pages.
This example text assumes you’ve had your first few trial runs using pavuk already. We take off at the point where you knew you should really read the manual but didn’t dare do so. Yet. ... Or you did and got that look upon your face, where your relatives start to laugh and your kids yell: “Mom! Dad is doing that look again!”
We're going to cover a hard-case example for any spider: a Mediawiki-driven documentation website.
The goal: Get some easily readable pages in your local (off-line) storage.
I wished to have the documentation for a tool, which I purchased a while ago, available off-net, since I'm not always connected when I'm somewhere where I find time to work with that particular tool. And the company that sells the product doesn't include a paper manual.
Their documentation is stored in a Mediawiki web site, i.e. a website driven by the same software that was written for the well-known Wikipedia.
There are several issues with such sites, at least from an 'off-net copy' and 'spider' perspective:
The web pages don’t come with proper file extensions, e.g. ’.HTML’. Sometimes even no filename extensions at all, such as is the case with Mediawiki sites. For a web site, this is not an issue, as the web server and your browser will work as a perfect tandem as long as the server sends along the correct MIME type with that content, and Mediawiki does a splendid job there.
As each page has quite a few links to:
edit sections
view this page’s history / revisions
etc.etc.
your spider will really love to dig in and go there.
Unfortunately this is the Road To Hell (tm) as:
any site of sufficient age, i.e. a large enough number of edits to its pages, will have your spider go... and go... and go... and then some more.
To put it mildly, you may not be particularly interested in those historic edits / revisions / etc. -- I know I wasn’t, I just wanted to have the latest documentation along when I open up my laptop next where there’d be no Net. And I didn’t like my disc flooded with - to me - garbage.
If you are really lucky with these highly dynamic sites, they’ll provide reporting and other facilities on a day to day basis: when the spider hits those calendars and the site is set up to, for example, show the state of the union, pardon, website for any given day back till the dawn of civilization, you’re in for a real treat as the spider will request those dynamic pages for every day in that lovely calendar.
ETA on this process? Somewhere around this Saturday next year. If you're lucky and your IP doesn't get banned before that day for abuse.
So the key to this type of spider activity is to be able to restrict the spider to the ’main pages’, i.e. that part of the content you are interested in.
Which leaves only one 'minor' issue: local files don't come with a 'MIME type', so you're in real need of some fitting filename extensions to help your HTML browser/viewer decide how to show that particular bit of content. After all, both a .HTML and a .JPG file are just a bunch of bytes, but, heck, does a JPG look wicked when you try to view it as if it were an HTML page. And vice versa.
pavuk is perfectly able to help you out with this challenge as it comes with quite a few features to selectively grab and discard pages during the spider process.
And it has something extra, which is not to be sneezed at when you are trying to convert dynamically generated content into some sort of static HTML pages for off-net use: FILENAME REWRITING. This allows you to tell pavuk exactly how you would like those pages to be filed, under what filenames and, very important if your web browser is to cooperate when you feed it these pages from your local disc, with the appropriate filename extensions.
Let’s have a look at the pavuk commandline which does all of that - and then some:
Note
(This is pavuk tests/ example script no. 2a, by the way.) The pavuk commandline has been broken across multiple lines to improve its readability.
We are going to grab the documentation for a 3D animation plugin called CAT, available at http://cat.wiki.avid.com/
Special notes for this spider run:
We are also interested in the ’RecentChanges’ report/overview, as I edit my local copy of this documentation and like to know which pages have changed since the last time I visited the site.
Remove the single spaces before each of those '&' in those URLs if you want the real URL; these were inserted only to simplify this document's formatting.
For the same reason, remove the single spaces following each ’,’ comma in several of the commandline option arguments down there.
#! /bin/sh
# single thread 'mirror' mode web grab.
# accept these Special:xxxxxx pages !only! :
#
# http://cat.wiki.avid.com/index.php/Special:Lonelypages
# http://cat.wiki.avid.com/index.php/Special:Unusedimages
# http://cat.wiki.avid.com/index.php/Special:Allpages
# http://cat.wiki.avid.com/index.php?title=Special:Recentchanges &hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500
#
#
# -fnrules:
#
# -fnrules F "*/index.php/*:*" "%h:%r/%d/%n%s.%X"
##   -- convert 'index.php/Images:xyz.png' HTML pages to
##      'index.php/Images:xyz.png.html'
# -fnrules F "*/index.php[/?]*" "%h:%r/%d/%b%s.%X"
##   -- convert 'index.php/YadaYada' HTML pages to
##      'index.php/YadaYada.html'
# -fnrules F "*" "%h:%r/%d/%b%s.%Y"
##   -- keep extensions on any other URL: favorite.ico,
##      style.css, et al
#
../src/pavuk -verbose -dumpdir pavuk_data/ -noRobots -cdir pavuk_cache/ \
  -cookie_send -cookie_recv -cookie_check -cookie_update \
  -cookie_file pavuk_data/chunky-cookies3.txt \
  -read_css -auto_referer -enable_js -info_dir pavuk_info/ \
  -mode mirror -index_name chunky-index.html \
  -request 'URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges& hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500 METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/index.php/Special:Allpages METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/ METHOD:GET' \
  -scndir pavuk_scenarios/ -dumpscn TestScenario.txt \
  -nthreads 1 -progress_mode 6 -referer -nodump_after \
  -rtimeout 10s -wtimeout 10s -timeout 60s \
  -dumpcmd test_cmd_dumped.txt \
  -debug -debug_level 'all, !locks, !mtlock, !cookie, !trace, !dev, !net, !html, !htmlform, !procs, !mtthr, !user, !limits, !hammer, !protos, !protoc, !protod, !bufio, !rules, !js' \
  -store_info -report_url_on_err \
  -tlogfile pavuk_log_timing.txt \
  -dump_urlfd @pavuk_urlfd_dump.txt -dumpfd @pavuk_fd_dump.txt \
  -dump_request -dump_response \
  -logfile pavuk_log_all.txt -slogfile pavuk_log_short.txt \
  -test_id T002 \
  -adomain cat.wiki.avid.com -use_http11 \
  -skip_url_pattern '*oldid=*, *action=edit*, *action=history*, *diff=*, *limit=*, *[/=]User:*, *[/=]User_talk:*, *[^p]/Special:*, *=Special:[^R]*, *.php/Special:[^LUA][^onl][^nul]*, *MediaWiki:*, *Search:*, *Help:*' \
  -tr_str_str 'Image:' '' \
  -tr_chr_chr ':\\!&=?' '_' \
  -mime_types_file ../../../mime.types \
  -fnrules F '*/index.php/*:*' '%h:%r/%d/%n%s.%X' \
  -fnrules F '*/index.php[/?]*' '%h:%r/%d/%b%s.%X' \
  -fnrules F '*' '%h:%r/%d/%b%s.%Y'
Whew, that’s some commandline you’ve got there! Well, I always start out with the same set of options, which are not really relevant here (we’re not all that concerned with tracking cookies on this one, for one), but it has grown into a habit which is hard to get rid of.
A bit of a toned down version looks like this:
Note
removed are:
logging features (the -dump_whathaveyou commandline options / -store_info/-[ts]logfile)
cookie tracking and handling options
storage directory configuration (-dumpdir/-cdir/-info_dir/-scndir)
multithreading configuration (-nthreads)
verbosity and progress info aids (-verbose/-progress_mode/-report_url_on_err)
diagnostics features: there's a whole slew of flags there that are really helpful when you are setting up this sort of thing for the first time: without those it can be really hard to find the proper incantations for some of the remaining options (-debug/-debug_level)
miscellaneous options for administrative purposes (-test_id)
leaving us:
../src/pavuk -noRobots -read_css -auto_referer -enable_js \
  -mode mirror -index_name chunky-index.html \
  -request 'URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges& hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500 METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/index.php/Special:Allpages METHOD:GET' \
  -request 'URL:http://cat.wiki.avid.com/ METHOD:GET' \
  -referer -adomain cat.wiki.avid.com -use_http11 \
  -skip_url_pattern '*oldid=*, *action=edit*, *action=history*, *diff=*, *limit=*, *[/=]User:*, *[/=]User_talk:*, *[^p]/Special:*, *=Special:[^R]*, *.php/Special:[^LUA][^onl][^nul]*, *MediaWiki:*, *Search:*, *Help:*' \
  -tr_str_str 'Image:' '' \
  -tr_chr_chr ':\\!&=?' '_' \
  -mime_types_file ../../../mime.types \
  -fnrules F '*/index.php/*:*' '%h:%r/%d/%n%s.%X' \
  -fnrules F '*/index.php[/?]*' '%h:%r/%d/%b%s.%X' \
  -fnrules F '*' '%h:%r/%d/%b%s.%Y'
which tells pavuk to:
skip the ’robots.txt’, if available from this web site (-noRobots)
load and interpret any CSS files, i.e. see if there are additional URLs available in there (-read_css)
play nice with the web server and tell the box which path it is traveling, just like a regular web browser would do when a human would click on the links shown on screen (-auto_referer/-referer)
look at any JavaScript code for extra URLs (-enable_js). Yes, we’re that desperate for URLs to spider. Well, this option is in my ’standard set’ to use with pavuk, and if it (he? she?) doesn’t find any, it doesn’t hurt to have it here with us anyway.
operate in ’mirror’ mode. Pavuk has several modes of operation available for you, but I find I use ’mirror’ most, probably because I’ve become really used to it. In a moment of weakness, I might concede that it’s more probable that I have found that often almost any problem can be turned into a nail if you find yourself holding a large and powerful hammer. And the ’mirror’ mode might just be my hammer there.
directory index content should be stored in the 'chunky-index.html' file for each such directory (-index_name). Simply put: the content sent by the server when we request URLs that end with a '/'. This is not the whole truth, but it'll do for now.
spider starting at several URLs (-request ...). Now this is interesting, in that, at least theoretically, I could have done with specifying a single start URL there:
-request 'URL:http://cat.wiki.avid.com/ METHOD:GET'
as the other URLs shown up there can be reached from that page.
In practice though, I often find it a better approach to explicitly specify each of the major sections of the site that you want to be sure your pavuk run covers. Besides, practice shows that some of those extra URLs can only be reached by spidering and interpreting otherwise uninteresting revision/edit Mediawiki system pages. And since we're doing our darnedest best to make sure pavuk does NOT grab nor process any of _those_ pages, we'll miss a few bits, e.g. these ones:
-request 'URL:http://cat.wiki.avid.com/index.php/Special:Lonelypages METHOD:GET'
-request 'URL:http://cat.wiki.avid.com/index.php/Special:Unusedimages METHOD:GET'
would have been completely missed if I hadn't specified them explicitly here, while keeping all the restrictions (-skip_url_pattern et al) as strict and restrictive as they are now.
restrict any spidering to the specified domain and any of its subdomains (-adomain). In this particular case, there’s only one domain to spider, but you can spider several locations in a single run, by specifying multiple ’acceptable domains’ using -adomain.
to use the HTTP 1.1 protocol when talking to the web server (-use_http11). This is another one of those 'standard options' which I tend to copy&paste into every pavuk command set. This one comes in handy when your web site is hosted on a 'virtual host', i.e. when several domains share the same server and IP address (such as is the case with my own web sites, e.g. 'www.hebbut.net' and 'www.hobbelt.com'). Though this option's use dates back to older pavuk releases, I still tend to include it, despite the fact that the latest pavuk versions default to using HTTP 1.1 instead of the older HTTP 1.0.
And now some of the real meat of this animal:
-skip_url_pattern comes with a huge set of comma-separated wildcard expressions. When part of a URL matches any one of these expressions, pavuk will ignore that URL and hence skip grabbing that particular page.
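To see the matching in action, here is a small Python illustration (not pavuk's own code; Python's fnmatch module happens to use the same shell-wildcard style, so it makes a handy test bench — and the 'CAT_Basics' page name is made up for the example):

```python
from fnmatch import fnmatchcase

# Illustration only: pavuk does its own fnmatch()-style matching in C.
# The skip patterns are tried against the complete URL, query part included.
url = "http://cat.wiki.avid.com/index.php?title=Foo&diff=123&oldid=99"

print(fnmatchcase(url, "*diff=*"))    # True: matches inside the query part
print(fnmatchcase(url, "*oldid=*"))   # True: this URL would be skipped
print(fnmatchcase("http://cat.wiki.avid.com/index.php/CAT_Basics", "*diff=*"))  # False: kept
```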
'*oldid=*' is kind of trivial: if we somehow end up attempting to spider a 'historic' (older) copy of a given web page, we are NOT interested. This forces pavuk to skip any older versions of any Mediawiki pages.
'*action=edit*' is another trivial one: we are not going to log in and edit the page, as we are interested only in grabbing the current content. No editing pages with web forms for us, then.
'*action=history*' is a variant of the 'oldid' expression above, with the same intent. Note that all this is - of course - web site and Mediawiki specific, so web sites serviced by different brands of CMS/Wiki software require their own set of skip patterns.
Nevertheless, the set above should work out nicely for most if not all Mediawiki sites.
Also note that the complete URL is matched against these patterns, i.e. including the '?xxx&xxx&xxx' URL query part of that URL. (Bookmarks, encoded as a hash-delimited last part of a client-side URL like this: '...#jump_here', are NOT included in the match. The server should never get to see those anyway, as hash bookmarks are a purely client-side thing.)
'*diff=*': we don't want to know what the changes to page X are compared to, say, the previous version of said page.
'*limit=*': there are several report/system pages in any Mediawiki site where lists of items are split into chunks to reduce page size and user strain. This is a quick & dirty way to get rid of any of those.
And then there are the pages we do like to see (UnusedImages + LonelyPages), but for which we are not interested in paging through to the end of the list if it is that large for this site.
'*[/=]User:*' and '*[/=]User_talk:*': two more which are irrelevant from our perspective: we're going offline with this material, so there's no way to discuss matters with the editors.
'*[^p]/Special:*': this one rejects any 'Special:' pages at first glance, but is a little more wicked than that, as we do want those 'LonelyPages', 'UnusedImages' and 'AllPages', thank you very much. See, this pattern is restricted to 'Special:' pages which are not located in a (virtual) directory ending in a 'p'. Due to the way the Mediawiki software operates and presents its web pages, this basically means this pattern will ONLY match 'Special:' pages which are not directly following the 'index.php' processing page, which, in Mediawiki's case, presents itself as if it were a directory, such as in this URL:
http://cat.wiki.avid.com/index.php/Special:Lonelypages
Unfortunately, the above pattern is not restrictive enough, as we'll still be treated to a whole slew of 'Special:' pages. And that wasn't what we wanted, was it?
Additional patterns to the rescue!
Remember that we are only interested in three of them:
’LonelyPages’,
’UnusedImages’ and
’AllPages’
So the next pattern:
’*=Special:[^R]*’
may seem kind of weird right now. Let's file that one away for later, and first have a look at the next one after that:
'*.php/Special:[^LUA][^onl][^nul]*': Now this baby looks just like the supplement we were looking for: skip any 'Special:'s which do not start their name with one of the characters 'L', 'U' or 'A'. Compare that to the three 'Special:'s we actually _do_ want to download, listed above, and the method should quickly become apparent: the second letter is declared ba-a-a-a-a-d and evil when it's not one of these: 'o', 'n' or 'l', and just to top it off, the third letter in the name is checked too: if it's not 'n', 'u' or 'l', the page at hand is _out_.
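You can convince yourself that these two 'Special:' patterns behave as described with a few lines of Python (an illustration only; note that Python's fnmatch module spells a negated set '[!...]' where pavuk's patterns use '[^...]', so we translate first):

```python
from fnmatch import fnmatchcase

def skipped(url, pattern):
    # Translate pavuk's '[^...]' negated sets to Python's '[!...]' spelling.
    return fnmatchcase(url, pattern.replace("[^", "[!"))

# '*[^p]/Special:*' hits Special: pages NOT sitting behind 'index.php':
print(skipped("http://cat.wiki.avid.com/Special:Watchlist", "*[^p]/Special:*"))              # True: skipped
print(skipped("http://cat.wiki.avid.com/index.php/Special:Lonelypages", "*[^p]/Special:*"))  # False: kept

# '*.php/Special:[^LUA][^onl][^nul]*' hits index.php Special: pages whose
# name does not start like Lonelypages / Unusedimages / Allpages:
p = "*.php/Special:[^LUA][^onl][^nul]*"
print(skipped("http://cat.wiki.avid.com/index.php/Special:Watchlist", p))  # True: skipped
print(skipped("http://cat.wiki.avid.com/index.php/Special:Allpages", p))   # False: kept
```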
So this should do it regarding those ’Special:’s, right?
Not Entirely, No. Because there’s still that fourth one we’d love to see:
-request 'URL:http://cat.wiki.avid.com/index.php? title=Special:Recentchanges& hideminor=0 &hideliu=0 &hidebots=0 &hidepatrolled=0 &limit=500 &days=30 &limit=500 METHOD:GET'
which has a bit of a different form around the ’Special’ text:
index.php?title=Special:Recentchanges
Note the '=' in there. So that's why we had that other pattern, the one we had filed for later discussion:
’*=Special:[^R]*’
i.e. discard any page containing the string '=Special:' which is not immediately followed by the character 'R' of 'RecentChanges'.
So far, so good.
Mediawiki comes with another heap of system pages, which are categorically rejected using this set of three patterns:
’*MediaWiki:*, *Search:*, *Help:*’
NOW we’re done. At least as far as filtering/restricting the spider is concerned.
Note
A last note before we continue on with the next section: note that each of the '-skip_url_pattern' patterns is handled as if it were a UNIX filesystem/shell wildcard: MSDOS/Windows people will recognize '?' (any single character) and '*' (zero or more characters), but UNIX wildcard patterns also accept 'sets', such as '[a-z]' (any one of the letters of our alphabet, but only the lowercase ones) or '[^0-9]' (any one character, but it may NOT be a digit!). pavuk calls these 'fnmatch()' patterns and if you google the Net, you'll be sure to find some very thorough descriptions of those. They live next to the 'regexes' (a.k.a. 'regular expressions') which are commonly used in Perl and other languages.
pavuk - of course - comes with those too: if you like to use regexes, you should specify your restrictive patterns using the ’-skip_url_rpattern’ commandline option instead. Note that subtle extra ’r’ in the commandline option there.
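If you're curious how the two styles relate, Python's fnmatch.translate() will show you the regular expression a given wildcard pattern boils down to (an illustration only; with '-skip_url_rpattern' you would write such a regex by hand, and the host/page names below are made up):

```python
import fnmatch
import re

# fnmatch.translate() turns a shell-style wildcard into an equivalent regex.
# The exact decoration varies per Python version, but the core is '.*action=edit.*'.
regex = fnmatch.translate("*action=edit*")
print(regex)

# A hand-written regex doing the same job, as you might pass to -skip_url_rpattern:
print(bool(re.search(r"action=edit", "http://host/index.php?title=X&action=edit")))  # True
print(bool(re.search(r"action=edit", "http://host/index.php/SomePage")))             # False
```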
Still, if you grab a Mediawiki site’s content just like that, you’ll end up with a horrible mess of files with all sorts of funny characters in their filenames.
This might not be too bothersome on a UNIX box (apart from the glaring difficulty of properly viewing each filetype, as the filename extensions are the browser/viewer's only help once these files end up on your local storage), but I wished to view the downloaded content on a laptop with Windows XP installed.
So there's a bit more work to do here: knead the filenames into a form that is palatable to both me and my Windows web page viewing tools.
This is where some of the serious power of pavuk shows. It might not be the simplest tool around, but if you were looking for that Turbo Piledriver to devastate those 9 inch nail-shaped challenges, here you are.
We’ll start off easy: Images.
They should at least have decent filenames and more importantly: suitable filename extensions.
So we add these commandline options as filename ’transformation’ instructions:
-tr_str_str 'Image:' '' will simply discard any 'Image:' string in the URL while converting said URL to a matching filename.
Windows does NOT like ’:’ colons (and a few other characters), so we’ll have those replaced by a ’programmers’ space’, a.k.a. the ’_’ underscore.
This ’-tr_chr_chr’ will convert those long URLs which include ’?xxx&yyy&etc’ URL query sections into something without any of those darned characters: ’:’, ’\’ (note the UNIX shell escape there, hence ’\\’), ’!’, ’&’, ’=’ and ’?’.
Of course, if you find other characters in your grabbed URLs offend you, you can add them to this list.
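The combined effect of these two transforms can be sketched in a few lines of Python (an illustration of the idea only, not pavuk's actual code; the image name is made up):

```python
# Sketch of the two filename transforms (illustration, not pavuk's code).
name = "index.php?title=Image:walkcycle.png&action=view"   # hypothetical URL tail

# -tr_str_str 'Image:' ''  -- drop the 'Image:' string outright
name = name.replace("Image:", "")

# -tr_chr_chr ':\\!&=?' '_'  -- map each offending character to a '_'
name = name.translate(str.maketrans(":\\!&=?", "______"))

print(name)  # index.php_title_walkcycle.png_action_view
```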
Then we’re on to the last and most interesting part of the filename transformation act. But for that, we’ll need to help pavuk convert those MIME types to filename extensions.
That we do by providing a nicely formatted mime.types file (see the online UNIX man pages for a format description):
-mime_types_file ../../../mime.types
Of course, I manipulated this file a bit so pavuk would choose ’.html’ over ’.htm’, etc. as several MIME types come with a set of possible filename extensions: MIME types and filename extensions come from quite disparate worlds and are not 1-on-1 exchangeable. But we try.
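For reference, the mime.types format is one line per MIME type: the type first, then the filename extensions mapped to it. A minimal hand-tuned excerpt could look like this (with 'html' deliberately listed before 'htm', assuming pavuk takes the first extension listed, as the tweak described above suggests):

```
text/html       html htm
text/css        css
image/png       png
image/jpeg      jpeg jpg
image/gif       gif
```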
-fnrules F '*/index.php/*:*' '%h:%r/%d/%n%s.%X'
will take any URL which contains the string '/index.php/' and comes with a ':' a little further down the road, and convert it to a filename using the '%h:%r/%d/%n%s.%X' template.
The 'F' tells pavuk that what follows is a 'fnmatch()' type pattern: like the '-skip_url_pattern' patterns above, these are very similar to UNIX filesystem wildcards. If you wish to use real perl(5)-alike regexes instead, you should specify 'R' here.
The template ’%h:%r/%d/%n%s.%X’ instructs pavuk to construct the filename for the given URL like this:
’%h’ is replaced with the fully qualified host name, i.e. ’cat.wiki.avid.com’.
’%r’ is replaced with the port number, i.e. ’80’ for your average vanilla web site/server.
’%d’ is replaced with the path to the document.
’%n’ is replaced with the document name (including the extension).
’%s’ is replaced with the URL searchstring, i.e. the ’...?xxx&yyy&whatever’ section of the URL.
and 'la pièce de résistance':
’%X’ is replaced with the default extension assigned to the MIME type of the document, if one exists. Otherwise, the existing file extension is used instead.
Note
And the manual also says this: “You may want to specify the additional command line option ’-mime_type_file’ to override the rather limited set of built-in MIME types and default file extensions.” Good! We did that already!
But what is that about “Otherwise, the existing file extension is used instead”? Well, if the webserver somehow feeds you a MIME type with document X and your list/file does not show a filename extension for said MIME type, pavuk will try to deduce a filename extension from the URL itself. Basically this comes down to pavuk looking for the bit of the non-query part of the URL following the last '.' dot pavuk can find in there. In our case, that would imply the extension would end up being '.php' if we aren't careful, so it is imperative to have your '-mime_types_file' mime.types file properly filled with all the filename extensions for each of the MIME types you are to encounter on the website under scrutiny.
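To make the template mechanics tangible, here is a toy Python expansion of the macros discussed so far (a simplified sketch only: real pavuk knows many more macros, applies the -tr_* transforms as well, and its exact %d/%s semantics may differ in detail):

```python
from urllib.parse import urlsplit
import posixpath

def expand(template, url, mime_ext=""):
    # Toy expansion of a few -fnrules macros (illustration, not pavuk's code).
    parts = urlsplit(url)
    name = posixpath.basename(parts.path)
    if "." in name:
        base, url_ext = name.rsplit(".", 1)
    else:
        base, url_ext = name, ""
    subs = {
        "%h": parts.hostname or "",                       # fully qualified host name
        "%r": str(parts.port or 80),                      # port number, default 80
        "%d": posixpath.dirname(parts.path).lstrip("/"),  # path to the document
        "%n": name,                                       # document name, extension included
        "%b": base,                                       # document name, extension stripped
        "%s": parts.query,                                # URL searchstring
        "%X": mime_ext or url_ext,                        # MIME extension wins, URL extension as fallback
    }
    out = template
    for macro, value in subs.items():
        out = out.replace(macro, value)
    return out

print(expand("%h:%r/%d/%n%s.%X",
             "http://cat.wiki.avid.com/index.php/Special:Allpages",
             mime_ext="html"))
# cat.wiki.avid.com:80/index.php/Special:Allpages.html
```

The ':' characters in the result are exactly the kind of thing the -tr_chr_chr transform above later maps to '_'.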
Since you’ve come this far, you might like to know that a large part of the pavuk manual has been devoted to the ’-fnrules’ option alone. And let me tell you: these ’-fnrules’ shown here barely scratch the surface of the capabilities of the ’-fnrules’ commandline option: we did not use any of the ’Extended Functions’ in the transformation templates here...
As we have covered the first ’-fnrules’ of the set shown in the example:
-fnrules F '*/index.php/*:*' '%h:%r/%d/%n%s.%X' \
-fnrules F '*/index.php[/?]*' '%h:%r/%d/%b%s.%X' \
-fnrules F '*' '%h:%r/%d/%b%s.%Y'
you may wonder what the others are for and about.
The second one
-fnrules F '*/index.php[/?]*' '%h:%r/%d/%b%s.%X'
makes immediate sense as it is the equivalent of the first, but now only for those URLs which have a ’/’ slash or a ’?’ question mark following the string ’/index.php’ immediately.
But wait! Wouldn't its transform template execute on the same URLs as the first '-fnrules' statement? In other words: what's the use of the first '-fnrules' if we have the second one too?
Well, there's a little detail you need to know regarding '-fnrules': every URL only gets to use ONE. What this is saying is that once a URL matches one of the '-fnrules', that template will be applied and no further '-fnrules' processing will be applied to that URL. This gives us the option to process several URLs in different ways, though we must take care about the order in which we specify these '-fnrules': starting from the strictest matching pattern and ending with the most generic one. That is why the '-fnrules' with matching pattern '*' (= simply anything will do) comes last.
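The first-match-wins rule is easy to picture as a loop over (pattern, template) pairs; a Python sketch (illustration only, using Python's fnmatch module, which understands the same wildcard style these three patterns use):

```python
from fnmatch import fnmatchcase

# First match wins: stop at the first pattern the URL satisfies, so the
# rules must run from most specific to most generic (illustration only).
FNRULES = [
    ("*/index.php/*:*",  "%h:%r/%d/%n%s.%X"),
    ("*/index.php[/?]*", "%h:%r/%d/%b%s.%X"),
    ("*",                "%h:%r/%d/%b%s.%Y"),
]

def pick_template(url):
    for pattern, template in FNRULES:
        if fnmatchcase(url, pattern):
            return template

print(pick_template("http://cat.wiki.avid.com/index.php/Image:x.png"))  # 1st template
print(pick_template("http://cat.wiki.avid.com/index.php?title=Foo"))    # 2nd template
print(pick_template("http://cat.wiki.avid.com/style.css"))              # 3rd template
```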
The second ’-fnrules’ line has only a few changes to its template, compared to the first:
(1st) -fnrules F '*/index.php/*:*' '%h:%r/%d/%n%s.%X'
(2nd) -fnrules F '*/index.php[/?]*' '%h:%r/%d/%b%s.%X'
where '%b' is replaced with the basename of the document (without the extension), so that the extension embedded in the URL will be discarded from the filename for URLs matching the 2nd '-fnrules' (the MIME-derived '.%X' extension is appended instead), while the 1st '-fnrules' will include that original extension as part of the name.
The third ’-fnrules’ option:
-fnrules F '*' '%h:%r/%d/%b%s.%Y'
is also interesting, because its template includes '%Y' instead of '%X', where the manual tells us this about '%Y': “%Y is replaced with the file extension if one is available. Otherwise, the default extension assigned to the MIME type of the document is used instead.” Which means '%Y' is the opposite of '%X' in terms of precedence between the URL-derived filename extension and the MIME type derived filename extension: '%X' will have a MIME type related filename extension 'win' over the extension ripped from the URL string, while '%Y' will act just the other way around. Hence, '%Y' will only use the MIME type filename extension if there's no '.' dot in the filename section of the URL:
site.com/index.php
would keep its '.php', while
site.com/dir-with-extension.ext/no-extension-here
would cause pavuk to look up the related MIME type filename extension instead (notice that the filename section of the URL does not come with a ’.’ dot!).
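That precedence difference boils down to two tiny fallback rules; in Python terms (a sketch of the rule only, assuming the MIME type maps to 'html' in the mime.types file):

```python
def ext_X(url_ext, mime_ext):
    # %X: the MIME-type extension wins; the URL's own extension is the fallback.
    return mime_ext or url_ext

def ext_Y(url_ext, mime_ext):
    # %Y: the URL's own extension wins; the MIME-type extension is the fallback.
    return url_ext or mime_ext

# site.com/index.php served as text/html ('html' per the mime.types file):
print(ext_X("php", "html"))  # html
print(ext_Y("php", "html"))  # php
# site.com/dir-with-extension.ext/no-extension-here (no '.' in the name part):
print(ext_X("", "html"))     # html
print(ext_Y("", "html"))     # html
```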
... you’ve travelled far, but now we have covered all the commandline options which were relevant to the case at hand: spider a Mediawiki-based website for off-line perusal.
Along the way, you've had a whiff of the power of pavuk, and I hope you've found several bits that may be handy in your own usage of pavuk. I suggest you check out the other sections of the manual, forgive it its few grammatical errors as it was originally written by a non-native speaker, and enjoy pavuk for its worth: a darn powerful web spider and test machine. (Yes, I have used it to perform performance and coverage analysis on web sites. Check out the Gatling gun of web access: the -hammer mode. But that's a whole different story.)
I intentionally did not cover the very important diagnostics commandline options in this example, as that would have stretched your endurance as a reader beyond the limit. Perusing the '-debug / -debug_level' log output is subject matter enough to fill a book. Maybe another time.
Take care and enjoy the darndest best web spider out there. And it’s Open Source, so do as I did: grab the source if the tool doesn’t completely fit your needs already, and improve it yet further!
Look into the ChangeLog file for more information about new features in particular versions of pavuk.
Main development: Ondrejicka Stefan
Look into the CREDITS file of the sources for additional information.
pavuk is available from http://pavuk.sourceforge.net/