=========== | VBSTATS | =========== Experimental Windows Web Server Statistics Reporter =================================================== By: Bob Denny July 15, 1994 I hereby place these programs into the public domain without restriction and WITHOUT ANY WARRANTY OF ANY KIND. All I ask is that if you distribute copies of the kit, PLEASE mark any changes CLEARLY, and distribute it as a complete kit. *** PRESERVE DIRECTORIES WHEN DECOMPRESSING! *** ============= INTRODUCTION: ============= This kit contains Visual Basic files for three programs that together provide the ability to analyze Web server logfiles (in Common Log Format ONLY), and produce HTML pages (and a usage graph) describing the server usage. The basic idea came from the "wusage" program developed by Tom Boutell, however this program works entirely differently. If you aren't familiar with VB and SQL, or don't want to be, stop here. At this stage, the package is for dedicated tinkerers only. Also, if you don't have VB Professional, stop here. The package requires the Professional Edition, as it uses a relational database and SQL to do the work, and requires you to edit the source and build the executables. No VB components are included with this kit. On the good side, the programs are really quite small, hopefully well documented, and hopefully easy to follow. The database is in Access/VB-native format, so you don't need SQL Server or anything like that! The kit is designed to work automatically with Windows httpd, it defaults to checking to see if the server is running and if so, sending the "cycle logfiles" signal to the server, then sucking the cycled out log data into the database. Command line options permit the kit to be used offline to analyze and report on data from other kinds of servers (NCSA, CERN, etc.). There are three programs: * logtodb - Captures common log format files and loads the database (creating it if needed) * restrict - A GUI "object restriction" editor * reporter - The program that generates the HTML and graph "Restriction" is the process of hiding objects (users, URL targets, and sites) from consideration in statistics summaries. Objects can be hidden from top-10 lists (but still count in the totals) or completely hidden (not even count in the totals). ================ GETTING STARTED: ================ BEFORE DOING ANYTHING ELSE, read this document clear through and poke around in the code in each of the programs. Then make any changes in lib/config.bas (common to all three programs' MAK files). You will have to at least change the server name and the server admin's name. Now make all three executables in their subdirectories, and move them to the parent (whtpstat), which we'll call the "work area". Next get a copy of a Common Log Format Web server log, preferably one that has lots of data extending over several weeks, and put it into the work area. Assuming it is called access.log, run the following: logtodb \access.log \access.mdb This will create the database and fill in the data from the server log. The command argumetns override the automatic behavior used for routine operation. TIP: You'll need either Access 1.1 (not 2.0!) or the "Visual Data" sample that comes with VB Professional. If you don't have Access 1.1, use the modified modules I supplied in the ./visdata subdirectory to replace the originals that came with Visual Data, then build the VISDATA executable. Use it to examine the tables and QueryDefs in the database after it is built by the logtodb program. The modifications permit you to see the QueryDef SQL by simply clicking on the QueryDef name in the Tables/Queries list. One more thing: If you are using the Windows httpd server, add a directory ./htdocs/svrstats, the default location into which the reporter puts the HTML and graph files. Once you have a loaded database, you're ready to play around. ============= THE DATABASE: ============= The database is the heart of this package. It contains both tables and QueryDefs, the latter are stored SQL used to produce the summary report data, and perform maintenance tasks. Here's a brief description of them. Table Description ------------- --------------------------------------------------- Accesses Each access to the server, from a line in the log AuthUsers Each unique user that authenticated via the server Objects Each unique URL target that was accessed PastTotals Weekly totals generated by previous report runs RestrictPats Contains patterns used for visiblity restriction RestrictKey (not used, see below) Sites Each unique client hostname or IP address TableKey (not used, see below) The RestrictKey table was supposed to contain a textual description of the restriction codes (more on this later). The TableKey table was supposed to contain a textual description of each table, listed by "table code" used to identify tables during restriction processing. (Baffled? Don't worry... this is fun!) There is always a User with ID=1, called , which links with unauthenticated accesses. Most of the SQL queries are in the database as QueryDefs. Many take parameters ("parameter queries"), most of which are the start and end date/times for the query. Doing this allows changes to be made to the database structure (and the query) without changing the code in the applications. Cool, eh? Well, there are a few SQL statements in the code anyway, so it ain't pure. Maybe I'll move them into querydefs some day. Here are brief descriptions of the QueryDefs: QueryDef Resulting dynaset (D) or action (A) ------------- ------------------------------------------- AuthUsersByAccessCount (D) Total accesses for each user AuthUsersByByteCount (D) Total byte count for each user MarkNewAuthUsers (A) Init restrict code for newly added users MarkNewObjects (A) Init restrict code for newly added targets MarkNewSites (A) Init restrict code for newly added sites ObjectsByAccessCount (Q) Total accesses for each URL target PatListSelector (Q) Restrict pattern lists for each table PurgeAccesses (A) Remove logged accesses for a date range RemoveAllRestrictions (A) Unrestrict all users, targets and sites RestrictAuthUsers (A) Apply restriction patterns to users RestrictNewAuthUsers (A) Apply restrictions to newly added users RestrictNewObjects (A) Apply restrictions to newly added targets RestrictNewSites (A) Apply restrictions to newly added sites RestrictObjects (A) Apply restriction patterns to targets RestrictSites (A) Apply restriction patterns to sites SitesByAccessCount (Q) Total accesses for each site hostname/IP SitesByByteCount (Q) Total byte count for each site hostname/IP TotalAccessesByMethod (Q) Total accesses for each method (GET, etc.) TotalAccesses (Q) Grand total of accesses TotalByteCount (Q) Grand total of byte count TotalIndexQueries (Q) Grand total ISINDEX requests Some QueryDefs may be obsolete. They will be removed later. Here is the ObjectsByAccessCount SQL: PARAMETERS pStart DateTime, pEnd DateTime; SELECT DISTINCTROW Count(*) AS Accesses, Objects.ObjectName AS Object FROM ((Accesses INNER JOIN Objects ON Accesses.ObjectID = Objects.ObjectID) INNER JOIN AuthUsers ON Accesses.AuthUserID = AuthUsers.AuthUserID) INNER JOIN Sites ON Accesses.SiteID = Sites.SiteID WHERE ((Accesses.ObjectID = [Objects].[ObjectID]) AND (Accesses.Time >= [pStart] And Accesses.Time < [pEnd]) AND (Objects.Restrict = 0) AND (Sites.Restrict < 2) AND (AuthUsers.Restrict < 2)) GROUP BY Objects.ObjectName ORDER BY Count(*) DESC; The "ID"s are counter-generated links between the tables. The complexity arises out of the restriction capability. First note the PARAMETERS are start and end time. The code sets these to control the time interval over which the totals are accumulated. The SELECT part returns the access total for each "object" (URL target), GROUPed BY the target name and ORDERed BY the access count, DESCending. So the dynaset is a list of access counts for each URL target, sorted by access count, in descending order. The first 10 rows of this are the "Top-10" accessed URLs and the number of accesses for each. The first INNER JOIN links to the Objects table so we can get the URL target name instead of the internal ID which is in the Accesses table. It also lets us get the restriction code for the object. The second and third INNER JOINs are used to access the AuthUser and Site tables in order to get the restriction codes for the site and the user involved in the access records that make up the total for each of the objects. Here's the trick on restrictions. If the object's restriction code is 0, it always counts. If it is 1, the object is hidden from top-10 lists. If it is 2, the object is ignored altogether. The implications of this policy are subtle. For example, given an access to /foo/bar.html by user farkle from site dorfle.com, this access will be counted in the totals only if all three objects (URL target, user, and site) have restriction codes of less than 2. If either user farkle or site dorfle.com have restriction codes of 2, the access will be ignored. In addition, if the target /foo/bar.html has a restriction code of greater than 0, it will not be counted in the top-10 URL LIST. However, that access will be counted in the top-10 SITE LIST and top-10 USER LIST unless either farkle or dorfle.com are restricted greater than 0. Look at the WHERE clause (after the linking phrase): First, the date range is applied. Next the top-10 restriction is applied to the primary object (URL target), then the total-hiding restriction is applied to the secondary objects (user and site). The Restriction List Editor handles the application and removal of restrictions so that the user needn't be concerned with all of this gory detail. Thank God. Meanwhile, you will get probably confused by all of this in the beginning. Note that ALL accesses are kept in the database. Restrictions may be changed at any time. Subsequent reports will reflect the new restrictions. Consider the alternative where access records are filtered at the time they are brought into the database.. What if you change your mind later??? Maybe you can see why SQL is so cool. Consider doing this in C or perl! I'm out of breath, so let's press on to the description of the 3 programs: ======== LOGTODB: ======== This program handles the importing of Web server Common Log Format files into the Access database. It will create the empty database if it doean't exist. If nothing is put on its command line, it expects to be run on the same machine as an NCSA Windows httpd web server, with the server running. It first signals the server to cycle its logs using the SIGHTTPD.DLL supplied with the server kit. Then it waits for the cycled-out log to appear in the server's logs directory as ACCESS.001. Then it opens the text logfile and imports the data into the database ACCESS.MDB, which it will create if needed in the server's logs directory. Logtodb may be run at any time, as often as desired. It will simply rotate the data out of the server's textual log into the database, and then delete the cycled-out text log. It is designed to be run from a scheduler so it needs no human interaction. If logtodb encounters a corrupted log record, the record is skipped. An icon is displayed while the program runs, indicating the number of records processed, in multiples of 10. No effort is made to detect duplicate access records, however unique sites (hostname or IP address), URL targets and authenticated users are kept in separate tables, and the access records link to these other tables via an internally generated 32-bit ID. You can use this program outside of the Windows httpd environment by adding the pathname of the server log and the pathname of the database (in that order) to the logtodb command line. That's what you did in the "getting started" section. The BUILD_DB module was generated from a working copy of the database that I originally created with Access 1.1. I used a program called DB2BAS, which I hacked to generate BASIC that creates the QueryDefs as well as the tables and fields. The hacked DB2BAS did much of the job, but I still had to edit the result by hand. Any changes you make to the database will have to be reflected in BUILD_DB by hand. Sorry. The hacked DB2BAS is a mess. I'm not sure it really saved me all that much time. ========= RESTRICT: ========= This is a GUI program that is used to maintain the restriction specifications for the database. There is online help available. The match patterns used in the restrictions are those of the Access/VB SQL language, and are documented in the help. A small help button next to the pattern entry field pops up a pattern reference window. The patterns are kept in the RestrictPatterns table in the database. Once you have made the changes to restrictions, you must choose Apply Restrictions in the Database menu. This removes old restrictions and uses the patterns to set the restriction codes appropriate to the changed patterns and/or restriction levels. You can also remove restrictions completely by choosing Remove Restrictions in the database menu. Normally, restrict expects to find the database in the Windows httpd server's logs directory. You can override this by putting the pathname of the database on the commmand line. ========= REPORTER: ========= This program generates the statistics HTML pages, and a Windows BMP graph of total accesses by week. You'll have to create a directory httpd\htdocs\svrstats or edit CONFIG.BAS to change that, so the program knows where to put its output. The program may be run at any time. It will produce reports for all complete weeks from the first day for which information appears in the database through the just-completed week. Nothing occurring after last Sunday is reported because the week isn't over yet. In order to prevent duplication of effort, the reporter records summary infrmation for each report it generates in the PastTotals table. The record number is used as the "tag" on the HTML filenames. Next time the reporter runs, it looks at PastTotals and skips data for which reports have already been generated. NOTE: If you want to regenerate reports (perhaps you changed the restrictions and want past reports re-run), delete all of the PastTotals records and re-run the reporter. All past reports will be regenerated. Note that the report numbers will not start back at 1 unless you compress the database after deleting the PastTotals data. The reporting process is straightforward. It uses the QueryDefs to produce the statistics information, then takes that and formats it into HTML reports, one per week. At the end, it produces an "index" page. Finally, the reporter uses the data in the PastTotals table to produce a column chart showing the total accesses by week, using the VB Graph control. The image is automatically written to disk in Windows DIB format (".BMP" file). The program may be run from a scheduler, as it requires no human interaction. As it runs, an icon displays the week-starting date of the report being produced. At present, I haven't found a program that will convert BMP to GIF without human interaction, so this part of the process must be done manually. I use WinGif. It's only once a week. With nothing on the commmand line, reporter expects the database to be in the Windows httpd server's logs directory, and will put its HTML pages and graph into htdocs\svrstats. You'll have to create the latter directory. ===================== IDEAS FOR THE FUTURE: ===================== "You have the sources, DO IT!" * Add a program to resolve IP addresses into hostnames, including PTR-spoof detection, on the sites listed in the database. Then disable DNS reverse lookups in the Web server to improve its response time. * Add a GUI admin program so you don't have to edit CONFIG.BAS. * Add some more graphs. It's easy! Thumbnails, other statistics... If you want a challenge, write a querydef that will produce a list of total accesses per day for the last 30 days and graph that. You could run this one every day.