hglogstat 1.13
----------------------------------------------------------------------------
Name

hglogstat - create statistics about WWW-access to Hyper-G servers

Synopsis

hglogstat parameters

For a short description of the parameters try hglogstat -h. For more
details see below.
----------------------------------------------------------------------------
Description

Based on the logfiles produced by the WWW-gateway, hglogstat produces
access-statistics by collecting various information (see Modes). Depending
on the user's selection, the tool may put the results into an HTML-document
(which may immediately be inserted into a defined Hyper-G collection,
optionally along with some graphic representation), or it may write
detailed information about requested objects, searches and failed searches
to a file, or, as a third possibility, it may produce overall statistics
based on information gathered during either of the previous modes. (The
first two actions may be combined in a single run; the third one needs an
extra run.)

Modes

hglogstat may execute in three different modes, the first two of which can
be combined into a single run.

* Produce Exhaustive Statistics

  In this mode, information of different categories is collected from the
  logfiles and presented as an HTML document. If the user defines a
  destination collection for the document (parameter -pname), the document
  is immediately inserted there; otherwise, the HTML text goes to standard
  output, and no graphics are produced at all.

  Categories:

  o General Information lists the total number of sessions, user requests,
    robot requests, redirected requests, failed requests, successful search
    requests, failed search requests, requesting hosts, bytes transferred
    etc.

  o Details about Requested Objects include how many distinct objects have
    been requested how many times within the relevant period of time, and
    which objects have been requested but could not be delivered, for
    whatever reason.
    (The reasons are listed in an extra domain.)

  o Details about Searches list the objects that have successfully been
    searched for, and also those for which a search did not yield any
    result.

  o Details about Actions show the most frequent successful and failed
    actions.

  o Details about User Access show the most frequent referring pages,
    entry points, user agents and domain names of the requesting hosts.

  o Time Information finally shows requests and sessions per hour and day.
    Since this information does not say much when presented in tabular
    form, it is also presented in graphic form, accessible via hyperlinks.
    (Gnuplot is used to create these graphics.)

  The parameter -top specifies how many of these items shall be listed in
  the report; the default value is 20. Certain items can also be kept out
  of the report by adding them to a list in the rc-file (for more details,
  see below).

* Save Detailed Information

  In this mode, all requested objects, searches and failed searches (each
  along with its number of occurrences) are written to a file. From there,
  this information may be processed further by other tools. The script
  hgcollstat, for example, uses these files to produce statistics about
  single collections instead of the whole server.

* Produce Overall Statistics

  When hglogstat executes in the first or second mode, it writes the number
  of sessions per day (for each day within the analyzed period) to the file
  sessions.log, and the number of requests per day to the file
  requests.log. (These files are located in the current directory.) From
  this information, mode three generates daily and monthly summaries
  covering the whole period that has been examined by hglogstat so far, and
  produces an HTML document, which is inserted into the given collection.
----------------------------------------------------------------------------
Parameters

Most parameters may be abbreviated by the first four characters.
Exceptions are -lastse[ven] and -lastmo[nth]. (<...> defines the type of
data required)

-html
    Mode 1. The script produces detailed statistics and outputs an HTML
    document; the user may choose to have this document inserted into a
    specified collection.

-details
    Mode 2. Output all requested objects, searches and failed searches in
    plain ASCII (for further use).

-overall
    Mode 3. Produce overall statistics using results from previous runs.

-dir
    Defines the directory the logfiles are stored in. Only the logfiles in
    this directory will be examined (e.g. ~hgsystem/logs).

-file
    Gives the name of the current logfile. The default name is wwwlog, so
    this parameter may be omitted. The names of old logfiles are expected
    to consist of the given filename followed by a timestamp (e.g.
    wwwlog.30703723). Optionally, these files may also be gzipped; in this
    case, the tool temporarily expands them (using gzip -c). So, giving
    'wwwlog' as filename actually means all files matching
    wwwlog[.timestamp[.gz]].

-hghost
    Name of Hyper-G host ...

-pname
    ... and name of collection to put the HTML document into. If an HTML
    output is desired but this parameter is omitted, the HTML text goes to
    the standard output.

-imgcoll
    Name of collection to put images into (by default, equals the
    collection defined by -pname).

-hname
    Hostname that shall appear in the summary's title. This option may be
    used in Mode 1, when an alias name shall be used instead of the host's
    domain name within the report. When the script is executed in Mode 2
    only, there is no need to define -hghost, -pname and -imgcoll, since an
    ASCII-file is the only thing that will be output. So -hname may be used
    to still give the host a name.

-from
    First day to analyze. Should be in the form yy/mm/dd.

-to
    Last day to analyze. By default, yesterday's date is assumed. Format as
    above.

-lastseven
    Analyzes the last seven days (may be used instead of -from and -to).

-lastmonth
    Analyzes the last month (may be used instead of -from and -to).

-top
    Specifies the top n items to be listed (20 by default).

-regex
    Tells the script to treat the entries in the rc-file as regular
    expressions (in perl-fashion); without this option the entries are
    taken to be object titles.

-cmd
    Take the parameters defined in the given file. These parameters can
    still be overridden by those given on the command line.

-test
    Output current settings.

-v
    Verbose mode.
----------------------------------------------------------------------------
The rc-file

It has been mentioned above that unwanted items may be excluded from the
summaries by describing them in an rc-file. By default, the script looks
for hglogstat.rc in the current directory, but an alternative filename may
be defined by the -rc parameter.

The list of items in this file may be divided into several categories,
each headed by a line identifying the type of objects to follow. So far,
requested objects and entry pages may be skipped; the corresponding heading
lines are _SKIP_OBJECTS_, _SKIP_ENTRIES_ and _SKIP_HOSTS_. Lines starting
with # are considered to be comments.

The objects may be described in one of two ways:

* Simply list their titles

  Although this may be a bit bothersome, this method has the advantage of
  speed - the lists are kept in structures that can be searched very fast.
  Example:

      # unwanted objects
      _SKIP_OBJECTS_
      coll_open.gif
      coll_clos.gif

      # unwanted entry pages
      _SKIP_ENTRIES_
      /
      identify.gif
      text.gif
      info.gif

* Describe the objects with perl's regular expressions

  While whole classes of objects may be described by just a few
  expressions, performance is reduced, since these expressions have to be
  evaluated for each object discovered in the logfile. An example:

      # unwanted objects
      _SKIP_OBJECTS_
      \.gif$
      statistics

  The first expression describes all objects ending in ".gif", the second
  one describes all objects containing "statistics" at any position.
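
  The difference between the two ways of describing objects can be
  illustrated with a small sketch. It is written in Python rather than
  perl and is not taken from hglogstat itself: the function names parse_rc
  and make_skip_test are hypothetical, the set lookup is only an assumed
  stand-in for the "structures that can be searched very fast", and
  Python's re module stands in for perl's regular expressions. The
  one-entry-per-line layout of the rc-file is assumed as well.

```python
import re


def parse_rc(text):
    """Split rc-file text into {heading: [entries]} (hypothetical helper).

    Lines starting with '#' are comments; a line such as _SKIP_OBJECTS_
    starts a new category. One entry per line is an assumption.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("_SKIP_") and line.endswith("_"):
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def make_skip_test(entries, use_regex=False):
    """Return a function telling whether a title should be left out."""
    if use_regex:
        if not entries:
            return lambda title: False
        # All expressions are joined with "|" into one pattern,
        # here \.gif$|statistics, and searched for anywhere in the title.
        pattern = re.compile("|".join(entries))
        return lambda title: pattern.search(title) is not None
    # Plain titles: exact match against a set (fast lookup).
    titles = set(entries)
    return lambda title: title in titles


RC_REGEX = r"""# unwanted objects
_SKIP_OBJECTS_
\.gif$
statistics
"""

entries = parse_rc(RC_REGEX)["_SKIP_OBJECTS_"]
skip = make_skip_test(entries, use_regex=True)
for title in ["coll_open.gif", "server statistics 1996", "welcome.html"]:
    print(title, "->", "skipped" if skip(title) else "listed")
```

  With plain titles, every unwanted object has to be named individually,
  but each check is a single lookup. With expressions, two lines cover
  whole classes of objects, at the cost of running the pattern against
  every title found in the logfile.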
  (For perl hackers: the above descriptions are all combined into a single
  expression by joining them with "|" and putting the result between the
  matching operators. For the above example this would yield
  /\.gif$|statistics/)

It shall be emphasized, however, that the items that appear in the rc-file
are excluded from the top-n lists only; they still count as requested
objects or entry pages!
----------------------------------------------------------------------------
What is necessary to run hglogstat?

hglogstat is a perl script and takes advantage of the features new in perl
5. So, the first prerequisite is that perl 5 is installed on your system.

The graphics are produced by Gnuplot, which is called by the script, so it
has to be installed, too.

Since Gnuplot does not produce gif output (at least my version 3.5 (pre
3.6) does not), ppmtogif is called to do the translation. So, this, too,
should be installed.

Finally, insertion of the HTML document into the database is done by
hginstext. If you have this installed on your system, too, nothing can keep
you from working with hglogstat.
----------------------------------------------------------------------------
History

Changes since hglogstat 1.12

* additional parameter to specify an rc-file
* -cmd for commandfile
* -test to show settings

Changes since hglogstat 1.11

* If no parent collection is given, HTML output goes to stdout.
* top-n objects: hyperlink is inserted if GOid is available.
* Graphics with lines instead of impulses.
* Timeinfo headers have been changed.
* Corrected regex for entry pages.
* Skipping entry pages and referers for uninteresting objects.
* Failed actions are sorted, too.
----------------------------------------------------------------------------
Known Bugs

Of course, there are some minor bugs, but none of them is really serious.

* POST Requests

  These requests sometimes are the start of a new session, sometimes they
  are not. In the logfile, however, they are simply declared as POST
  Requests.
  As a consequence, the exact number of sessions cannot be determined; the
  result diverges slightly from the result obtained by analyzing the
  dbserver's logfiles. In numbers, the deviation within a month is a few
  hundred, which is less than 0.5% and usually may be neglected. To
  eliminate this bug, the logfile's format must be changed, which will
  happen soon anyway.

* Gnuplot

  It is a great graphics tool, but sometimes it behaves a bit strangely. I
  place two plots on one screen, and although they start at the same
  x-position, the second plot is moved one unit to the right - but only on
  some architectures. There is a simple remedy for this - forcing an
  invisible plot at (0,0) - but this produces faulty behaviour on other
  architectures. So far, I have not found an elegant solution.
----------------------------------------------------------------------------
Author

Alfons Schmid (aschmid@iicm.edu) - September 25, 1996