Looking for Information.... Search and Ye Shall Find.....Maybe!
Copyright 1995 by Peter Neuendorffer

What it means to search for information on computers and the Internet can be illustrated by using real-life comparisons. Much of this article is about common sense that is rooted in experience and has been incorporated into computer programming theory and practice.

A central problem of the information explosion, new or old, is how to find specific information. I have a consultant friend who had to uninstall a program that he couldn't find because the user had put every single file into a single large directory on the hard disk. Liken this to having a big warehouse with no shelves. As files proliferate on computers, file names and extensions don't give a clue as to what's inside an individual file. So what if Windows 95 has come up with the long file name? It is just another elusive title and not much else!

There are a number of important reference points for a search and for search tools: knowing what you are looking for, which ballpark it is probably in, and how much time and work space you have. How much stuff you will have to search (the size of the search space) and what you know about how it is organized are also relevant. That's a lot for most people to keep track of and to understand.

The noted information scientist Norbert Wiener is quoted as asking, "Am I walking to lunch or coming from lunch? I don't know!" Not only did he not know what he was looking for, he probably didn't even know where he was! Finding one's sense of direction when undertaking a search can be crucial.

The other day I mislaid my house keys. I knew what I was looking for. Surely they were in one of the usual, obvious places. I ransacked the apartment, to no avail. This called for logic. Could I reconstruct exactly what I did when I last came in? It came to me that as I was unloading my bundle of bathroom sundries the telephone rang.
Sure enough, my keys were behind the toilet paper in the broom closet. A likely place. I had the What, and the time and space, but not the Where, nor had I accounted for human error.

Indexing of information is common, from your local phone tickler to the largest mainframes. One technique, hashing, stores each item not wherever it happens to land, one after another, but at a location computed mathematically from the item itself, which makes retrieval faster.

Searching keywords on the Internet can be a pain, because you don't have the What. If you do, you don't have both the time and the space, as a field of 200,000 entries pops up for the word computer. How do you find what you're looking for if the search system doesn't understand what you want? Even these crude Internet searches are indexed. The search engine knows something about the data it is searching and how it is organized, in addition to just how to search it.

A simple tool in computer science for searching lists, the Binary Search, is taken from how we search for a name in a paper phone book. The Binary Search only works on a sorted list, because it relies on knowing that the list is in alphabetical order. It uses this logic: When we look for Jones in the phone book, we unconsciously turn to the middle (actually to the left of middle) of the book. If we come up with Friendly, we unconsciously turn halfway further into the remaining pages. And so on, back and forth by ever smaller chunks, till we get to the Jones page. This search is a lot faster than starting at the first name on page one and proceeding name by name. We are cutting the book in half each time, so the number of steps grows only with the logarithm of the book's size: in the usual notation the algorithm is O(log n), and a list of a million names takes about 20 looks.

Many problems in computer science are intractable in practice: for large cases it would take longer to solve them than there is time in the universe. For example, because it is so hard to find very large prime numbers, one company was able to patent two of them. Having devised a search method, they own the rights to the numbers.
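Returning to the phone book for a moment, the Binary Search described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name binary_search, the looks counter, and the seven sample names are invented for the example and do not come from any particular program.

```python
def binary_search(names, target):
    """Return (index, looks) for target in the sorted list names.

    index is None when the target is absent; looks counts how many
    entries were examined along the way.
    """
    low, high = 0, len(names) - 1
    looks = 0
    while low <= high:
        middle = (low + high) // 2      # open the remaining pages in the middle
        looks += 1
        if names[middle] == target:
            return middle, looks
        if names[middle] < target:
            low = middle + 1            # target is further on in the book
        else:
            high = middle - 1           # target is earlier in the book
    return None, looks


# Hypothetical phone book; it must already be in alphabetical order.
phone_book = ["Adams", "Baker", "Cohen", "Friendly", "Jones", "Miller", "Smith"]

index, looks = binary_search(phone_book, "Jones")
print(index, looks)  # prints: 4 3
```

With this sample list the first look lands on Friendly, the second on Miller, and the third on Jones, much like the page turning described above. Each look discards half of what is left, which is why even a very long sorted list needs only a handful of looks.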
Finding out the exact location of Earth relative to Mars in 1000 years is a three-body problem, which is deemed not soluble unless we wait 1000 years. Fortunately, we are not usually looking for such weighty information, but rather for something more like our Aunt Martha's phone number in our personal information manager (PIM). With such a small field to search on our computer, it is fast. We know the where and the what, we have the time and work space, and the software knows a lot about how the PIM warehouse is organized.

However, when we get out into the bigger world, it is not simply who owns the information but who is able to find it that is the important key. Just talking to the IRS or Social Security can be a trial as you wait endlessly in a holding pattern with recorded messages like "Please do not hang up, or your call will be further delayed" or "Don't give up, we'll be with you momentarily." Maybe!

If we attempt to index large amounts of information, that's OK, but we will have to be prepared to update the index constantly. When I worked for a department store, we were constantly counting inventory as shipments came and went and merchandise was sold. The stock numbers were set up differently for every type of item and vendor. Certainly the point-of-sale (POS) system, which indexes the lists, is a vast improvement. If only discount coupons didn't bog down the supermarket checkout line, violating all of my rules about time, space, and knowing what you are looking for.

People question the computer's accuracy and discount the answer, and this is another story. "That can't be right, check again." Or "I hate computers, they're always wrong." If people get their information and then misread it, don't use it, or reject it, what is the use of it all? The famous case in point in science is the University of Utah cold fusion experiment, which astounded scientists because it would mean the ancient alchemists were right about turning lead to gold.
Unfortunately, the certainty of their results was deemed to be in the noise zone, or not much better than fiction.

I have a friend who works as an order picker in a warehouse. He punches stock changes into a computer when he physically moves items around. But what if the store's information gets out of kilter with reality? Sales go up, inventory goes down some on paper, but actually there is much less in the warehouse than anyone realizes. Although the information was wrong, it was assumed to be right. The cozy computer system was then consistent, but only with itself, not with the real stockroom it was supposed to reflect! That essential match was going to pot. Somewhat like the famous remark attributed to President Hoover as the Depression was setting in: "Prosperity is just around the corner."

According to Newsweek magazine, a state-of-the-art automated warehouse for running shoes ground to a halt. The workers found that they could not move anything into or out of the facility, even though conveyor belts kept spewing out merchandise for non-existent orders.

We live in the real world, not inside a computer. Information that does not match reality is garbage, or at best a theory or a model. That method of organization was described in Lewis Carroll's Alice in Wonderland: at the tea party, the Mad Hatter ordered everyone to move down the table when the dishes were dirty. Garbage in, garbage out.

But still, if you are not playing with a full deck, you can pose a search question which is perfectly reasonable and still get garbage for an answer. It's like dealing with one of those Bostonians who are noted for firmly giving patently wrong street directions to passers-by who have lost their way.

What if the search system doesn't have the foggiest idea of where the item that you're looking for is located in your search space? It knows nothing about its warehouse except that it contains files which are in text format. You are looking for the word computer on your hard drive.
It has an equal probability of being in the first line of the first file or the last line of the last file. You could index the drive, but that takes a lot of time, and some space. Without an index to use, you start out with no certainty of finding a match, but as you move ahead, you become more sure of your answer. By the end, you have a 100% probability of having found it, if it's there at all. If you skip to the middle to start, this is just like rearranging your warehouse, and it will still take the same time (actually longer, since you have to rearrange the warehouse).

The fruitless search takes the longest. If you ask a salesperson to look for an item in the color and size you want, he may come back 15 minutes later to report, "We don't have it." It takes so long precisely because they do not have it. It's either there or it isn't, but he had to search the entire stockroom to find the answer. One assumes that most stockrooms are well organized and is surprised when this is not the case.

Let's get back to searching your PC for text, and presume you do not have an index. You are starting from scratch. Your work space and time are fixed. You only have till three o'clock, and the computer only has so much free disk and memory space to work with. You can narrow down the size of the search space by looking only at DOC files and ignoring whole categories of files, like programs. Or perhaps start the search in a subdirectory that is likely to have your information, like \DOC. The file names do tell you something about how the data is organized, but they are like labels on packages: they describe the contents only imprecisely. Somewhat informative, but not detailed, as that would take up much more space. Unfortunately, searching for a lower case a is different from searching for an upper case A, because words like alice and ALICE are stored differently on the computer.

Another way you can narrow down the search is by picking what you are looking for carefully.
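The narrowing steps described so far (starting in a likely directory, looking only at one category of file, and not letting upper and lower case get in the way) can be sketched in Python roughly as follows. This is a sketch under assumptions of my own: the function name search_files and its parameters are invented for illustration, and the files are assumed to be plain text, as in the scenario above.

```python
import os


def search_files(root, word, extension=".doc"):
    """Yield (path, line_number) for each line that contains word.

    Only files whose names end in extension are opened, and the
    comparison ignores upper and lower case.
    """
    wanted = word.lower()
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extension.lower()):
                continue  # skip programs and other categories of files
            path = os.path.join(folder, name)
            # Assumption: the file is plain text; unreadable bytes are ignored.
            with open(path, "r", errors="ignore") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if wanted in line.lower():
                        yield path, line_number
```

A call such as search_files("\\DOC", "alice") would then report every line mentioning alice, Alice, or ALICE in the DOC files under that directory. Note that this is still a start-to-finish scan with no index, so a fruitless search reads every candidate file before it can report that nothing was found.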
Computer is not going to be a very distinctive search word on a computer. In the middle of a lake, looking for a computer would be very helpful, because it is a rarer item there than on land. A rarer item on the computer might be EISA motherboard.

We can also search for more than one thing or field at once, using Boolean logic. Boolean logic involves combining conditions with AND, OR, and NOT, much as arithmetic combines numbers. A Boolean expression is either true or false. For example, we search for ALICE AND COMPUTER. Both words must be found near each other in our files for the expression to evaluate as true: IF Search_it (ALICE AND COMPUTER) THEN "we have a match". It turns out that you can string together combinations of the operators AND, OR, and NOT. Of course, on the computer you have to have software that does this.

In real life you use these conditions all the time without realizing it. "If it's lunch time and I'm hungry, then I think I'll eat." One of the fundamental aspects of computers is being able to perform different actions based on the result of a condition. IF BEFORE LUNCH THEN EAT BREAKFAST ELSE EAT LUNCH. Or, IF THE_COMPANY_SHOWED_A_PROFIT THEN PAY_STOCKHOLDERS ELSE FILE_BANKRUPTCY. Each of the actions above could be the name of a procedure, module, or program that does all the necessary processing.

A Boolean condition could be ALICE AND (COMPUTER OR GROCERY). A match would be a mention of the word ALICE together with at least one of the others, either COMPUTER or GROCERY. Structured Query Language (SQL) searches utilize this type of condition; they also extract records that have common fields.

But remember, in our search we know very little about how our information warehouse is organized. It turns out, luckily, that Boolean conditions can be chopped in two, and each half treated separately. This is like cutting the cards, and then cutting them again, and can produce the same kind of speed increase as the phone book example above.
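The Boolean conditions above can be sketched in Python as small building blocks that are strung together. The helper names contains, and_, or_, and not_ are invented for this illustration, and I am assuming that "near each other" simply means "on the same line" and that a word counts as found when it appears anywhere in the line, ignoring case. A real search program may define both quite differently.

```python
def contains(word):
    """Build a test that is true when word appears in a line, ignoring case."""
    wanted = word.lower()
    return lambda line: wanted in line.lower()


def and_(left, right):
    # Python's "and" does not evaluate right(line) when left(line) is false.
    return lambda line: left(line) and right(line)


def or_(left, right):
    return lambda line: left(line) or right(line)


def not_(test):
    return lambda line: not test(line)


# ALICE AND (COMPUTER OR GROCERY)
query = and_(contains("ALICE"), or_(contains("COMPUTER"), contains("GROCERY")))

# Hypothetical lines of text standing in for the contents of a file.
lines = [
    "Alice went to the grocery store.",
    "The computer was turned off.",
    "Alice fixed the computer.",
    "Alice went home.",
]

matches = [line for line in lines if query(line)]
print(matches)
# prints: ['Alice went to the grocery store.', 'Alice fixed the computer.']
```

Because each condition is just a function of one line, the two halves of an AND or an OR can be built and tested separately and then combined, and conditions can be nested as deeply as needed.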
If we are searching for two things at once, say ALICE and COMPUTER, then as soon as we know that ALICE isn't there, we don't have to check for COMPUTER.

In searching, time and space are at a premium. You give up one for the other and must compromise. The ready availability of a great variety of knowledge bases opens up information, if one has the ability to use search tools and learns to use those tools efficiently and at minimum cost. Although it is currently gauche to say, not all of us will live forever. It would be nice to know that we will find what we're looking for before the end!

Peter Neuendorffer is a regular WindoWatch contributor. He is the creator of Alice and a DOS and Windows programmer. Peter has very recently released a text search program for Windows he calls Bool Text Searcher, which can be retrieved as ABOOL11.ZIP.