Text only version (see inttxt.sit for MacBinary Word with figures)

Hypertext and Text Indexing Engines Overview

By Josh Rabinowitz
joshr@kronos.arc.nasa.gov
Friday, March 11, 1994

Table Of Contents:
Introduction
Tools
Applications
Glossary
Bibliography

Introduction

This document presents an overview of some of the publicly available hypertext strategies. It is written for the reader who is familiar with hypertext, but perhaps not with specific hypertext implementations. It covers tools and programs, protocols, and common indexing strategies, and includes a glossary of terms commonly used in networking and information retrieval.

Applications

World Wide Web (see also glossary entry)

The WWW is made up of a series of servers and clients. There are several examples of each. Some common servers are the CERN server, which is the basic WWW daemon program, and the NCSA server, which is written in C for Unix and is in the public domain. The most widely used WWW clients are the three platform versions of Mosaic. There also exist several tools for writing HTML files and converting other file types to HTML files.

Figure 1. WWW Generic Browser Architecture Control Flow. Adapted from "www control diagram" at info.cern.ch. Modules to the left of the grey line are platform dependent; modules to the right of the grey line are platform independent.

WAIS [1]

WAIS is a database system that exploits two recently popularised computer science concepts: the client-server model, and full-text databases. It lets users search existing databases of articles, books, references, abstracts and specialist information (such as genome databases, usenet group archives, ftp-site listings, etc.), and lets people with information publish it over the Internet at little expense and effort.

The fundamental concepts in WAIS are the database, the document, the source and the hit.
A source is a short text file that describes how a client can access a database that is provided by a server. It typically lists the database name, the machine the server program is running on, a brief description of the database, the name of the maintainer of the database, and the cost (if any).

A document is the basic unit: when you perform a search and look at results, you will be looking at documents. Databases hold lots of documents, and the server will search all the documents in the database.

When the server finishes the search, it sends the client a list of hits, the names of documents that looked like they matched what you were searching for. A hit is one document name.

Indexing Methods:

Essence

Essence exploits file semantics to index both textual and binary files. By exploiting semantics, Essence extracts keywords that summarize a file, and generates a compact yet representative index. Essence understands nested file structures (such as uuencoded, compressed, "tar" files), and recursively unravels such files to generate summaries for them. Essence generates indexes that are ten times smaller than WAIS indexes, but retain the fine-grained information access that WAIS's full-text indexes provide. Furthermore, Essence generates WAIS-compatible indexes, allowing WAIS users to make use of Essence's indexing capabilities. This is one of the ways that the Networked Resource Discovery Project at the University of Colorado has extended the types of information that WAIS can handle.

If you would like to learn more about Essence, you can obtain the source to the Essence prototype and a paper which appears in the 1993 Winter USENIX Technical Conference, San Diego, CA, January 1993, pp. 361-374. Both the paper and the prototype are available via anonymous ftp from ftp.cs.colorado.edu in /pub/cs/distribs/essence.
Alternatively, search for the keyword 'Essence' using a WAIS server to find all of the files on ftp.cs.colorado.edu that are related to Essence; you will find the files for both the paper and the prototype.

Essence exports its indexes through WAIS's search and retrieval interface, allowing users to use tools such as waissearch and the X Windows-based graphical user interface xwais. In order to generate WAIS-compatible indexes, Essence uses WAIS's indexing software to index the Essence summary files. This mechanism generates full-text WAIS indexes from the Essence summary files. The WAIS indexing mechanism has been modified to understand the format of the Essence summary files, so that it generates meaningful WAIS headlines. These headlines provide users with a short description of a single file, usually a filename. In Essence, headlines represent the filename, its file type, and the file's core filename.

To support additional file types, WAIS must be recompiled with new procedures that understand these file types. With Essence, one need only write a new summarizer, add its name to a configuration file, and add new heuristics for identifying the file type; no recompilation is necessary. In this sense, Essence modularizes the typed-file indexing extensions that WAIS can use, because it removes the keyword extraction process from WAIS and places it instead in Essence. Essence is better suited to incorporating new file types, and can be quickly adapted to become a comprehensive indexing system.

glimpse

Glimpse is a system written by Udi Manber at the University of Arizona. Its name stands for "GLobal IMPlicit SEarch". Glimpse uses a very small index, typically 2-4% of the size of the original file. Whereas most indexing systems keep track of the offset of each occurrence of a word in a document, glimpse uses indexes which simply record which fraction of a document (in 256ths) a word appears in.
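
As a rough illustration of the coarse index just described, the following Python sketch records, for each word, only which 256th of a text it starts in. It is not Glimpse's actual code or file format: the function names, the partitioning by character offset, and the simple word tokenizer are all assumptions made for this example.

```python
import re
from collections import defaultdict

BLOCKS = 256  # the text is divided into 256 equal fractions


def build_block_index(text, blocks=BLOCKS):
    """Map each lowercased word to the set of block numbers it starts in.

    Hypothetical helper: only the block number is stored,
    never the exact offset of the word.
    """
    index = defaultdict(set)
    length = max(len(text), 1)  # avoid dividing by zero on empty text
    for match in re.finditer(r"\w+", text):
        block = match.start() * blocks // length
        index[match.group().lower()].add(block)
    return index


def candidate_blocks(index, word):
    """Return the sorted block numbers that may contain the word."""
    return sorted(index.get(word.lower(), ()))


def block_span(text_length, block, blocks=BLOCKS):
    """Return the (start, end) character range of word starts in a block."""
    start = -(-block * text_length // blocks)      # ceiling division
    end = -(-(block + 1) * text_length // blocks)  # ceiling division
    return start, end


text = "a " * 510 + "rose"             # 1024 characters in total
index = build_block_index(text)
blocks_for_rose = candidate_blocks(index, "rose")
span = block_span(len(text), blocks_for_rose[0])
```

In this sketch a posting is a single number from 0 to 255, which is one way to see why such an index can stay small: it says where to look, not exactly where the word is.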
To find the exact location of the word in that section, and how many occurrences there are in that section, the file must be parsed individually (by Glimpse or another program). Glimpse attempts to make an intelligent compromise in the speed/size tradeoff in indexing.

lq-text

Lq-Text uses the tried and true "inverse index" (inverted index) method, in which the offset and filename of every occurrence of every indexed word is noted in the index. Not very glamorous, but fast for retrieval (although fairly slow for indexing).

Donna Harman's (similar to lq-text)

To be written

waisindex

Serial Indexer Overview [2]

The serial indexer is a simple inverted file system that is not very different from existing IR systems.

DATABASE FILES

The serial indexer parses files and creates an inverted file index made up of 7 files. For a database named "index" the files would be:

* index.inv -- the "postings file", in which each term is followed by a list of entries, each of which describes where that word occurs in the original files. A posting is a weight, docnumber (see the doc file), and character position. This file is indexed with the dictionary (dct) file. The terms are in alphabetical order.
* index.doc -- this is a linear list of document entries, one for each document. A document can be a complete file or a piece of a file (such as mail files that are the concatenation of many messages, where each message is a document). The information kept in each entry is:
    filenameid: position into the filename file (fn) of the filename for this document.
    headlineid: position in the headline file.
    startcharacter: position in the file where this document starts.
    endcharacter: end position. 0 if complete file.
    documentlength: in characters.
    numberoflines
    date: time_t
* index.fn -- list of the filenames in the database with the write-date of the file and the type of the file. Type is a string. Indexed by the position in the file, so this file cannot be edited after the index is built.
* index.hl -- list of the headlines.
  Indexed by position.
* index.dct -- dictionary file which is a 2 level b-tree. The first block is pointers to every 1000th entry in the rest of the dictionary file. Each entry is a fixed-length record of the word with the position into the rest of the file. The rest of the file consists of blocks just like the first block, but each entry is the word plus its position in the inverted file (inv). The whole dictionary is in alphabetical order.
* index.src -- source description that is used to access the database. It is also returned as a response to the "help" query. This file is not overwritten once it is created. Therefore database maintainers should edit this file to add a good description of what that database contains.
* index.status -- only the RAM-based seeker uses this to describe itself and get parameters from the user.

Glossary

Daemon
A program that is not invoked explicitly, but lies dormant waiting for some condition(s) to occur. Daemons may perform various management tasks such as building indexes, overviews, and back-links. Under Unix, "daemon" is used for "server".

DTD
"Document Type Definition." Used by markup languages (see "SGML") to map types of paragraphs to the style they should appear in. Defines how browsing applications should display marked up parts of a document.

EPS
"Encapsulated PostScript." A subset of the PostScript language, this is used as the native file format for illustration programs such as FreeHand and Illustrator. Stores all information device independently. See "PostScript."

Gopher
A program designed to allow easy access to FTP sites. Supports index search and document retrieval. Allows the user to easily browse FTP files and their summaries, and insulates users from text ftp commands. Mosaic includes Gopher functionality. The Gopher distributed information system uses a lightweight protocol very similar to HTTP. Therefore, support for it is now included in every WWW client, so that the Gopher world can be browsed as part of the web.
Gopher menus are easily mapped onto hypertext links. It may be that future versions of the Gopher and HTTP protocols will converge.

HTML
"HyperText Markup Language," based on SGML. HTML usually refers to both the document type and the markup language for representing instances of that document type. All HTML documents share the same SGML declaration and prologue, thus HTML is sometimes referred to as "trimmed SGML". An HTML file is a specific type of URL object, namely a file written in the HTML formatting language used by WWW. Every HTML file begins with a Title, which hopefully gives useful information about its contents.

HTTP
"HyperText Transfer Protocol." Used by Mosaic, it is also WWW's search and retrieval protocol. It operates in the fast, stateless way needed for hypertext jumps. It is designed to operate in a client-server mode: the client submits a document request in the form of a line of ASCII characters. This request consists of the word "GET", a space, and an abbreviation of the document address. The response is a message in HTML.

HyTime
A standardized hyperdocument structuring language for representing hypertext linking, time scheduling, and synchronization.

Markup Language
Generic name for a language which "marks up" text to give information about formatting, style, etc., which may not appear in the content of the data. A method of adding information to the text indicating the logical components of a document, or instructions for layout of the text on the page.

PDF
"Portable Document Format." A format designed by Adobe for display of printable data created on any platform or software. Very similar to PostScript, but much simpler; does not allow programming constructs such as loops and variables as PostScript does.

PostScript
A format designed by Adobe for printing of information. PostScript is really a language which allows branching, loops, variables, etc.

Protocol
An agreed upon method of sending information between systems.
Any two systems that understand a given protocol can communicate what the protocol allows, regardless of the actual implementation of data on each system.

RTF
"Rich Text Format." A document format designed by Microsoft to interchange documents between applications. More like PostScript than HTML.

Source Description Structure
Used by WAIS clients to contact a database on a server. Includes the file version, name, the site's IP name and number, and other information. Similar to but more extensive than Mosaic's URLs.

SGML
"Standard Generalized Markup Language." ISO standardised derivative of an earlier IBM "GML". A high level language which makes text independent of any word processor. Used currently to mark up ASCII files with semantic data about which part of a document is what, i.e. Header, Title, Body text, etc. "Marked up" paragraphs are styled to meaningful fonts and sizes by the client (see DTD). SGML alone says nothing about the representation of the document on paper or a screen.

TCP/IP
"Transmission Control Protocol/Internet Protocol." A set of protocols developed to allow cooperating computers to share resources across a network. Any real application will use several of these protocols. Information is transferred in "packets" which are sent through the network individually. TCP is responsible for breaking up the message into packets, reassembling them at the other end, resending anything that gets lost, and putting things back in the right order. IP is responsible for routing individual packets.

URI

URL
"Uniform Resource Locator." Brief description of a file that contains the filename, the internet location address, and the file type. Used extensively in Mosaic. (WAIS uses a more extensive format; see "Source Description Structure".)

A URL object is any resource accessible on the WWW, such as an image file, gopher object or news archive. A URL is the WWW name of a URL object, for example: "http://www.cs.colorado.edu/home/test/smallecot.gif".
Usually we try to never deal directly with URLs.

URN
The Uniform Resource Name. The premise must be accepted that a data item should be assigned a unique permanent identifier that is completely disassociated from its URL, as well as a description bound to that identifier (the URC).

Why is this so? Consider a document located on a host somewhere on the network. This document is made accessible through a variety of information distribution mechanisms and its availability is widely known on the network. At some point, the document is moved to another location. When this occurs, the URL for this document becomes invalid. If this URL has been widely propagated throughout the network, particularly if it has been committed to print, the reaction from users can be quite unpleasant, and the situation is at the very least confusing.

We can solve this problem with a Uniform Resource Name, or URN. The URN is simply a serial number-like identifier uniquely associated with a publisher identifier. The serial number has no predefined format; it is simply a string of characters, possibly without meaning, that is unique within the domain of the publisher. Once a URN is obtained, it must be resolved into a URL by means of a URL/URN Resolution Service. The resulting URL(s) may then be used to retrieve the data object for local use.

By binding a URN to any data object, its location on the network and preferred access mechanism can be established with a simple query to a central resource locator server. Given this information, properly constructed software can immediately retrieve the data object from its current storage location using any suitable data retrieval mechanism.

WAIS
"Wide Area Information Servers". An architecture for distributed information retrieval systems. Developed by Thinking Machines as a protocol and set of clients and servers that allow searching and viewing of documents at diverse sites.
Allows sophisticated English queries to be asked and refined by site and similar documents (such as "find me all poems about roses at anonymous ftp sites"). Using an extension of the Z39.50 protocol, WAIS itself is abstractly a protocol which operates with the document being the smallest addressable object.

WWW
"World Wide Web," originated at CERN. Refers to a loosely connected network of servers that make their information available. WAIS and Mosaic can both act as WWW clients. WAIS uses indexes which usually cover a certain domain of information. Since WWW indexes are documents like any other, indexes can point to other indexes. Each index has a set of keywords associated with it which define its context. WWW users can interrogate WAIS indexes and gopher servers. Between May '91 and May '93, the load on CERN's WWW server doubled every four months or less.

X.500
An international directory service standard, a companion to the X.400 electronic mail standard. It defines an abstract name space which is hierarchical, allowing objects such as organizations, people, and documents to be arranged in a tree.

Z39.50
"Information Retrieval Service Definition and Protocol Specification for Library Applications", as defined by the National Information Standards Organization. Protocol used by WAIS for information search and retrieval between servers and clients. It is an evolving protocol, with small but significant differences between Z39.50-88 and Z39.50-92.

Bibliography

Various HTML documents as accessed through Mac Mosaic, 12/93 and 1/94

Adobe Systems Incorporated, Portable Document Format Reference Manual; Addison Wesley, California 1993

Author Unknown, WAIS for Macintosh: A Macintosh User Interface for Wide Area Information Servers, User Guide for Release 1.1, February 1993

Nathan Torkington,
how.to.swais, Victoria University of Wellington, September 1993

Jim Fullton, Network Information Dissemination Standards and Z39.50, http://cnidr.org/cnidrpapers/info.html

[1] Taken from how.to.swais, included in the WAIS source code distribution, available at ftp.think.com

[2] Lifted straight from freeWAIS distribution, /freeWais/ir/DESIGN, by Brewster Kahle