From: david@cn.net.au (David Novak)
Newsgroups: alt.internet.research,sci.research,alt.answers,sci.answers,news.answers
Subject: Information Research FAQ v.4.0 (Part 9/9)
Followup-To: poster
Approved: news-answers-request@MIT.EDU
Summary: Information Research FAQ: Resources, Tools & Training
Date: 07 Feb 2000 10:52:17 GMT
Expires: 22 Mar 2000 10:45:17 GMT
X-Last-Updated: 1999/12/07

Archive-name: internet/info-research-faq/part9
Posting-Frequency: monthly
Last-modified: Dec 05 1999
URL: http://cn.net.au
Copyright: (c) 1999 David Novak
Maintainer: David Novak

Information Research FAQ (Part 9/9)

This part of the FAQ highlights other aspects of information research. See also the full faq (http://cn.net.au/faq.txt) and the spire project (http://cn.net.au). Please note the disclaimer statement on Part 1 of this FAQ.

Contents

----- Part 9 -----

32. Internet Information Theory
   32.1 three definitions of the Internet
   32.2 information, transaction, entertainment
   32.3 information formats
   32.4 information preparation
   32.5 publishing motivation
   32.6 promoting information
   32.7 information clumps
   32.8 bringing this together
33. More on the Commercial Information Sphere
   33.1 structure of the database industry
   33.2 advanced search technologies
   33.3 the art of searching
34. More on the Information Service Industry
   34.1 judging information value
35. Emerging Trends in the information sphere
   35.1 developing an informative internet?
36. Question and Answer Section
   36.1 How do I find information on the Internet?
37. Acknowledgements

___________________________________________________
32. Internet Information Theory

Let's agree the Internet is great fun to surf, but less valuable when you have a specific question in mind. To improve our search skills, we begin by understanding how information is arranged on the Internet.

Contrary to myth, information is not disorganized but rather organized very carefully along clear patterns. Many patterns are specific to the information format (text document, webpage, email message, printed article). Further patterns match the way we become aware of information, or are specific to the information systems (mailing list, faq, peer-reviewed journal). Your understanding of the strengths and weaknesses of each pattern, each format, each system, guides your search for information.

We shall start by shattering the Internet, and commenting on the many pieces.

__ 32.1 three definitions of the Internet

Let us be careful when we use the word 'Internet'.

1_ The Internet is a physical network; more than a million computers continuously exchanging information. The Internet allows us to transfer information around the world.

2_ The Internet is a landscape of information available on almost every topic imaginable. This information appears almost chaotically distributed to the world, but holds clear patterns. For instance, linking information together are various structures like government web links, search engines and FAQ documents.

3_ The Internet is a community of 100+ million individuals. These are real people who choose to interact, discuss and share information online.

What we learn here is not so important as the technique - break the large, seemingly chaotic system into smaller pieces: pieces that hopefully make more sense. Eventually, when we've made sense of the little bits, perhaps we can comment astutely on the big picture.

In this example, let me just draw your attention to the way most of our research effort focuses on the second definition: a landscape of information.
Much of the best information originates in the third definition: the Internet is a community. Sometimes it is far more effective to ask real people than search the information cyberspace.

Let us now illuminate more important facets of the Internet.

__ 32.2 information, transaction, entertainment

There is a triad of functions to all online activity:

Function      - Activity - Unit
----------------------------------------
Information   - Research - The Fact or Conclusion
Exchange      - Business - The Transaction
Entertainment - Play     - The Experience

Each Internet function grows at a different rate and moves in a different direction. The development of forums is firmly in the smallest segment dealing with information. This segment is quite poorly organized and confusing. The entertainment function in contrast is well financed and graphically innovative with clear, profitable opportunities.

Much of the web is prepared with Exchange or Entertainment in mind. "Brochureware" (purely promotional webpages) is rarely required for research, but is critical to securing a transaction. Entertainment related, or just entertaining, websites abound. Let us recognize just how few webpages are information & research related.

My own experience suggests we are just beginning to see the movement towards profiting from providing information. Direct sales of information are still chaotic and unrewarding.

__ 32.3 information formats

The way information is packaged has a great bearing on the content, quality and use of the information. This theme is evident throughout the work of the spire project, and is particularly applicable to Internet information.

Webpages, text files, software, email and database entries each have particular qualities. Each shapes, constrains and restricts the informative content. These particular qualities apply irrespective of the information involved.

Books are dense, factual, a little old. Articles are short, sharp, more recent. News is puff, introductory, immediate.
Each way the information is packaged, each format, presents the information to set standards.

Information formats on the Internet are the same. Webpages are graphical, technical to produce, and not easily updated. FAQs are easier to maintain, text only, and attract more peer review. Mailing lists are simpler still, text, short, immediate, very peer-reviewed, characterized by discussion and resource discovery. Newsgroups are characterized by extremely low costs, vulnerable to trashing, poorly managed. Email is simple to use and suited to one-to-one discussion.

Let's look at books more closely. Books are created by authors who have something to write. Books are printed and marketed by publishers to the bookstores that then provide them to the readers. Each facet of this process defines the resource. Books have quality, editorial vetting but minimal peer-review, marketable value and a potentially lengthy preparation time.

When it comes to research, why look for a book when investigating digital money? Books would just have the wrong qualities - would present the information poorly. We need a more current format (digital money is a fast moving topic), and a more peer-reviewed format (books have editorial vetting, but not intrinsic peer-review). Why not search for a mailing list, an FAQ, or an association website? These formats have qualities more appropriate to our question.

__ 32.4 information preparation

Information flows also impress patterns on Internet information. Most information is transplanted to the web - first created elsewhere. The source of information imparts as much pattern as the eventual format the information takes. Information may appear as a webpage, and conform to our expectations for all webpages, but the information may have been prepared from the discussion on a mailing list - and thus enjoy a more topical, specific, timely and peer-reviewed quality.

Let's look at FAQs.
The best resource in the world on copyright law is the musings of a group of copyright lawyers who form the copyright mailing list. The copyright FAQ supported by this group is a logical document summarizing much of the discussion of this mailing list. FAQs are vetted by the news.answers team, then automatically mirrored around the world. From its origins in the mailing list, the FAQ is a peer-reviewed document, often full of links to further resources, topical, knowledgeable and factual. As an FAQ, the document is not immediate, graphical or financially rewarding (some FAQs stagnate).

Only some Internet information is created within the Internet environment. The concept of 'brochureware' describes the common traits of promotional webpages directly prepared from paper promotional brochures.

One of the more exciting trends is the movement of information from the dusty shelves of government offices and association libraries to their more accessible websites. The quality of information retained in your average government agency, from quality research reports, to detailed studies, to current industry monitoring, is very high. These qualities are then brought over to the web format. Such web-documents tend to be isolated (not linked to other related resources) and perhaps a little behind the time line, but of a generally high quality.

An exciting holistic view of the Internet information landscape is based on these descriptions. Imagine, for a moment, information flowing through a collection of systems. At certain points, information groups together, and generates new, perhaps higher quality information, which then flows in a different system, a different direction, to different people.

The flow of information from one person to another, from one format to another, imprints qualities on the information along the way. Each organization, or subsequent re-organization, imparts specific styles and conventions and quality to the result.
__ 32.5 publishing motivation

Let us proceed to a third set of patterns. Information appears on the Internet for one very specific reason. Someone Publishes (DUH). The motivation behind publishing colours the information. These are patterns we will use to better search for answers on the web. Ask yourself who is publishing, and why.

One of the biggest publishing segments a year ago was individuals publishing documents derived from their personal expertise. A typical document would be one with minimal peer review, a list of aging links to further resources, simple graphics, variable to short length, prone to bias, but moderately reliable because the publisher knows their topic well. These pages are often located in private sub-directories (usually starting /~name/).

Commercial sites publish mainly for the promotional value. Their secondary purpose is to provide sales information to prospective clients. Rarely do commercial sites go beyond this. Commercial webpages often reside on their own domain name, as a .com, or in sub-directories - without the tilde symbol. Commercial sites also tend to age badly. They are very noticeable from their front page.

Government agencies are emerging as valued publishers. Slowly their dormant information becomes available through this new medium. Currently almost all government documents on the Internet also appear in print, meaning they are factual, exhaustively reviewed, tend to be a little old (but age well), and come from highly paid knowledgeable people who believe it is their duty to inform others. Such documents are lengthy and appear on .gov domains. These patterns are simple to see.

Grant-funded projects create brilliant research resources and hold much promise in pushing the limits of this technology. I am eager to see the results of the US Patents project, and appreciate the value of having Supreme Court rulings on the Internet. Often such projects are short on money but deeply focused on content.
Most projects reside on educational servers and are widely discussed within knowledgeable groups.

Associations publish association-kind-of-things. Most are initially just like the commercial webpages, but with time become much more factual and research-worthy. Most associations are dedicated to developing awareness of their chosen topic, albeit coloured by their chosen bias. Few associations are significant publishers yet, but this segment will begin to liberate dormant information within associations.

Let's summarize. The key is to always watch who is the publisher. We can assume a great deal, quickly. We are unlikely to find the latest changes to patent law from government or commercial publishers. Such organizations are simply not motivated to present such information.

__ 32.6 promoting information

Publishing is one achievement, but you and I will never read any information until we learn it exists. This simple fact creates even more patterns to Internet information. Knowledge of information moves through set routes on its way from writer to reader.

Promotion is not simple. It is a process that takes time, effort and perhaps money. Information without serious promotion tends not to be promoted far from the source. Another way to phrase this: you must search close to the source to find poorly promoted information.

A search engine indexes pages relatively indiscriminately. This also means a site of quality is not likely to reach your attention. The odds are not good, and from a promotion point of view, search engines generate minimal traffic to your webpage. Search engines drop you rather randomly into a website. It is often necessary to move up a directory to understand the purpose and motivation of a site you find interesting.

Information published through advertising tends to have a financial payoff for the promoter. This kind of information tends to be promotional information. Brochureware.
The alternatives are to promote a webpage or website through one of the referral tools. Each such tool accepts links on some criterion. Each tool you use to locate information also selects particular types of information for your attention.

If you arrive at a document by recommendation through a mailing list, the document is likely to be recent, on-topic, and specific to the purpose of the mailing list. These are the qualities most acceptable to the mailing list environment. Alternatively, (for poor mailing lists) it will be wildly off topic and trash. You are unlikely to see referrals to old documents or documents of historical importance.

Directory trees, FAQs, guidebooks and related promotion tools all work as historically important documents. In the past, such resources listed, described and alerted people to relevant information for the field. Slowly, over time, this function becomes acknowledged, reinforced and promoted. Time is the essence of this fame.

Webpages or websites found through historically important documents, by their nature, tend to be long lasting websites with lasting importance in the field. Such documents point to other similar documents or websites that have achieved a long-lasting importance. You are unlikely to find specific documents, but rather sites that focus or bring together information. In short, there is little motivation to link to specific webpages, when a link to important websites is considered just as good.

Similar generalizations can be made of each type of promotional tool. They become important in rapidly seeking out information which matches our intention, as well as summarizing the likely motivation - and bias - of webpages we are interested in.

__ 32.7 information clumps

Information Clumps. Information is created, nurtured, develops, gets transplanted, gets arranged and then becomes visible through a process which brings similar information together.
As we have discussed, there are factors deeply affecting all information on the Internet. Motivation, Preparation, Format and Promotion define the quality and content of any given item of information. With so many influences, we should not be surprised to learn information naturally groups together. In reality, there is nothing natural involved - it is a social phenomenon reinforced each time you and I visit or read one resource but not another.

History can explain some aspects of Internet development. As a small collection of sites become dominant in particular fields, by collecting and delivering better content to more people, new sites find it progressively more difficult to capture attention. This dynamic works for websites reaching out for visitors, and discussion groups reaching out for subscribers. In each case, seniority counts.

Seniority counts in several ways too. Promotion is directly related to quality, interest, traffic and time. The longer a site is active, the better the footpath develops, the more people visit. Secondly, quality content is directly related to access to quality content, peer review, and time/money. Important existing sites gain in every way.

This results in a grand system where the first-in, best-dressed, can capture the high ground and secure a grand lead in awareness and footpath over competitors who follow. Yahoo is a prime example of a directory tree, not even the best in most areas, which has achieved unparalleled traffic & awareness.

This competition is equally evident where no money is involved. Perhaps your association wishes to create a new referral website, or an open mailing list, or an informative guide. All sound concepts, effective projects. However, if older, established resources exist, the work will be long and arduous.

Despite the marketing message, the Internet is not a world where the best information floats to the top. The Internet will not let you reach millions.
You must compete for the attention, participation, devotion and assistance in a manner very similar to building a business.

In concrete terms, information clumps on the Internet. The best resource could appear on any Internet system (webpages, email mailing lists, ftp-archives, faqs, online databases, newsgroups...) but we can be fairly certain the best information will congregate in just one or two.

Consider our article "Searching the Web" (http://cn.net.au/webpage.htm). We progressively search different web tools, looking for the most worthy. Searching the Internet is the same. You must touch each system to see which system is dominant, where the information is congregating for your topic.

__ 32.8 bringing this together

In summary, we have broken down and discussed various qualities of published information and promoted information. We have made sweeping generalizations and educated guesses about information on the Internet. Now what?

When a painter begins to paint, they have already visualized some of the image. They already have a concept of the finished result. Internet research is no different. We start by building a vision of the information we seek. Who would publish it? Where would I find it? What is its motivation? How would we find it? We now have a practical vision.

The address is the key. The URL for any item of information gives us a surprising amount of information - particularly now we are making generalizations about information patterns. We can guess if information resides on a personal webpage, a funded university project, or a commercial project. The information resides on a .gov website? - the quality is likely to be higher and conform to our expectations of government resources.

We use this new-found experience in three ways. First, we restrict our searches to the most likely sources. Second, we quickly jump through lists of resources (such as those generated by search engines) to the sources that match our expectations.
Third, your understanding of the relative qualities of information guides your judgement of information value.

Internet newcomers often expect to have instant access to the latest information at the touch of the button in beautiful colour and peer reviewed quality prose. Who is publishing this? Where is this information coming from? Who would help us find this? Such a vision is fantasy.

If we were instead to look for an association website, dedicated to a certain type of research, or an informed newsgroup, maintained by people passionate about sharing this technology, then we have made four steps forward. We are clear about where to look for the answers we seek, and we will know quickly if the answers are online.

___________________________________________________
33. More on the Commercial Information Sphere

__ 33.1 structure of the database industry

The commercial information sphere existed in the 1970s and earlier. It is far more developed, far better organized, far better funded, almost always far more valuable and expensive than every other research resource.

For the most part, commercial information is arranged reasonably uniformly in large databases of full-text or bibliographic information. Some databases are small, single source documents, while others are vast unfocused collections of, for example, all the news from the last 15 years.

Most directories and journals can be made into a database, but single-source databases do not enjoy much financial success. The market is too limited and the cost of promotion too high (except in a local market with newspapers). To overcome this difficulty, single sources are grouped together into larger collections of databases on a particular topic. These large database groups have become primary tools in commercial research.

Developing these databases requires considerable expertise and expense.
Sometimes data requires abstracting, interpreting, and as with some Lexis-Nexis and WestLaw databases, even expert legal interpretation. Sometimes firms develop a portfolio of databases. Sometimes firms build just one.

The marketing and consumer billing of such databases is then provided by a relatively small collection of large database retailers. A list can be found in our "Commercial Databases" article. As an indication of the size of this market, Knight-Ridder sold Dialog & Datastar for a figure approaching half a billion dollars.

This industry consists of a wide collection of players, each improving and developing the information from individual periodicals, journals, news items... All very confusing for the end user. This is elegantly illustrated by the database descriptions for Lexis-Nexis databases (their preferred term is libraries). See http://www.lexis-nexis.com/lncc/sources/ as an example of specific databases. In particular, see their library on patents.

Many single-sources appear in different commercial databases. Further, different databases sometimes include different information from the same single-source. One database may include just abstracts, another may include fulltext, chemical indexing and more. As a result, most researchers are unfamiliar with what exactly is being searched.

This state of affairs is not unproductive. Searching a 'Database about Patents' is uncomplicated. You receive information on Patents. It is simple, informative and incomplete. Of course, researchers are busy people. Time is critical. Results matter.

This system also gives rise to great customer loyalty to database retailers. Comparative information is dropped in favour of simplicity. (There is too much complexity for researchers anyway.) Unfortunately, I am hard pressed to compare prices let alone describe the differences between information products.

Pricing actually resembles that of many a developed industry, remarkably similar to the telephone or banking industry.
As one friend commented, "bullshit baffles the brains". The prices are complex on purpose. It becomes very unrewarding to compare prices, and any conclusions are only valid in specific circumstances - and will not hold in others. This trend, familiar to us as a multitude of banking changes and telephone pricing schedules, reinforces our need to stop price hunting and trust our favoured information retailers.

This is not to say we should not try to compare prices - but for the most part, you will find comparing prices a most unrewarding experience. It really requires you to search and retrieve the same information on different systems - and this does not even begin to touch different databases, or database groupings, or variables that change over time like download speeds.

Optimistically, there are actually very few important databases in each field. It may be simple to browse each of the databases in your field and compare directly. You may never need to know more than a few databases intimately. Realistically, you will yearn for a simpler solution.

The commercial information industry has distributed information this way for several decades. It is both sophisticated and quite difficult. You will need to become experienced with inverted indexes, search techniques (Boolean, truncation, proximity, field limits ...) and properly phrasing the question in a way that will be answered by a database search. I have always found the value of a database search directly proportional to the length of the search query.

If you are incompletely skilled at database research, you will take longer, pay more and locate far more information (or unwisely discard more) than desired. This is very different from searching Altavista and Webcrawler.

Doing your own research offers an opportunity to more closely influence the research process. Sometimes only you understand the topic and sometimes you can more quickly discard unimportant details.
Certainly it is becoming simpler to undertake some work yourself.

Many of the commercial databases are also available in a CD format. Substantial subscription costs limit their availability to large research institutions and libraries, but exceptions exist. I believe world books in print costs AU$5000+. Provided you can find casual access, it will cost you far less. Keep an eye on the age, though. Sometimes (and only sometimes) online information is more recent.

The decision between undertaking research on your own or seeking external help is really a decision based on your research expertise, your budget, your access to information, your time, and the importance of finding all the information available. It also depends on your access to some decent research assistance. I will soon be able to help with this.

What I do know is a newcomer to the commercial information sphere will seriously underestimate the difficulty involved in searching, and underestimate both the cost of research and the cost of research assistance. Keep in mind this same system serves the needs of large commercial conglomerates, professional legal research, and well financed government studies. The commercial information sphere contains far more valuable information than you need. Often the Internet is just an interesting sneeze in comparison.

# Article: The State of Databases Today: 2000 by Martha E Williams tracks the development of this industry with survey results. Found in the foreword of the Gale Directory of Databases.

__ 33.2 advanced search technologies

Searching is both science and art.

The science is a range of improvements to the blunt system of simply asking for a word. The good news is an experienced searcher can accomplish wonders - regularly collecting articles of 70%+ interest on expensive databases. The bad news is most of the best of search technology is not implemented on all the databases you will search and only occasionally on databases free on the Internet.
The art is a kind of magic, of choosing just the right words at the right times, and in phrasing your request for information in a way that tightly describes your interest without removing information that should interest you. The art of searching relies heavily on an understanding of what is possible within a given system. Much of this, you guessed it, involves creative visualizing. (See section 3.1)

Current search technology allows us several ways to refine our search:

Straight Word Searches: All search situations allow you to ask for the presence of words in a block of text. If you ask for the right words, then you will quickly locate the information you desire. For best results, you obviously search the desired text several times with different terms, and you consider the possibility of different spellings for the same words. I use this frequently to locate information in web pages, in large documents like online directories or the archives of past discussion on forums.

Text Fragments: The simplest refinement to straight searching involves searching for parts of a word - if you are interested in surfing, search for surf. Better yet, search for " surf" with the space in front of the word.

Truncation: Some search engines don't allow searches for text fragments, and you must explain your intention by adding a truncation mark (usually * or ?) to the ends of words. On most professional research systems, alga? will include both algae and algal. I was once badly lost because of the spelling difference between aging and ageing. There are a number of improvements on this concept too. Sometimes there are special symbols for a single non-space character (car?a), sometimes there is automatic awareness of multiple spellings (colour & color). Sometimes there is even automatic awareness of synonyms. Often you are initially unaware important information is indexed under slightly different spelling, so truncation is strongly suggested for most searching.
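The text-fragment and truncation ideas above can be tried out on your own machine before you pay for database time. The following is only a minimal sketch in Python: the sample text is invented, the function names are hypothetical, and the truncation convention ('*' for any run of letters, '?' for at most one letter) is an assumption, since each real search system defines its own marks.

```python
import re

# Invented sample text, not taken from any real database.
TEXT = ("Windsurfing news: surfers met at the surf club. "
        "Algae and algal blooms; aging or ageing reefs.")


def raw_hits(text, fragment):
    """Count plain text-fragment hits, ignoring case."""
    return text.lower().count(fragment.lower())


def truncation_to_regex(term):
    """Turn a truncated term into a whole-word pattern.

    Assumed convention: '*' = any run of letters, '?' = zero or one letter.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"[A-Za-z]*")
        elif ch == "?":
            parts.append(r"[A-Za-z]?")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def truncation_search(text, term):
    """Return every whole word in the text that matches the truncated term."""
    return truncation_to_regex(term).findall(text)


# "surf" also hits inside "Windsurfing"; " surf" (leading space) does not.
print(raw_hits(TEXT, "surf"), raw_hits(TEXT, " surf"))   # 3 2
print(truncation_search(TEXT, "alga?"))    # ['Algae', 'algal']
print(truncation_search(TEXT, "ag?ing"))   # ['aging', 'ageing']
```

Note how the leading space in " surf" does the same job as the word boundary in the truncation pattern: both stop the search from matching in the middle of a longer word.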
Thesaurus: An improvement on truncation is the opportunity to look directly at a list of words, either keywords, or descriptors. This allows you to see the range of spellings before you search. This is also ideal for searches of company names or proper places so you can select only the words you are interested in. In a simple way, some library catalogues present subject searches in this way: a list of subject categories arranged alphabetically.

Boolean operators: Changing tack, searching for multiple words calls for "and, or, not" concepts. I want this word and that word, but not another word. It is simple enough. Many of the search engines allow for this with the - sign, and commercial databases often add brackets. Use of the not symbol is frowned upon in textbooks (too easy to dismiss information you are interested in, it is said), but the 'and & or' is absolutely necessary for complex questions like I want [(spaghetti or noodle) and pasta] or (Italian and cuisine). With most internet search engines, but not all commercial searches, you will find 'and' is assumed.

Proximity operators: The next dramatic improvement fixes the position of words relative to one another. In this category we have adjacent (often written as adj, next, or "inserted in quotes"), near (by how many words), or in the same sentence. Often it is wise to stretch the distance a little (within two), but where available, proximity is the best way to remove the dross without affecting the value of information. "Patent near Research" is much more precise than "Patent and Research".

Fields: By separating information into different fields, we can selectively search different portions of the information. I want the title to show the word "Patent" and the abstract to include the words "Patent Research".
Field searching is a common way to refine a search, but be aware searching titles is very likely to remove some desired information, whereas searching descriptors and not abstracts may dramatically improve the content.

Date Field: Are you really interested in information more than 15 years old? Library catalogues frequently have many aging books, and date limiting is very wise.

Further Enhancements: There are some special techniques available on a few systems that bear discussing. Sorting allows you to shape the presentation of the information. When applied to financial information, this is particularly valuable. Alerts allow you to automatically repeat a previous search and have the information sent to you. Multiple database searching allows you to search a collection of databases concurrently. Ranking positions certain information at the top and is valuable when your search is not time or price limited.

__ 33.3 the art of searching

The artistic side to this deals with two fields. Firstly, the selection of accurate words is not automated. The searcher needs to approach the information beast fully recognizing he or she is likely to get either tons of information... or far too little. Knowing when to expand, when to get more in-depth and how to handle fields in which you may be poorly experienced are talents.

The search technology itself is simple. The trouble lies in retrieving from databases with far too much information for simple word selection. It also flares when you are dealing with databases charging upwards of $2 a minute and an additional cost per item retrieved. You decide very quickly to get good at searching once you receive a bill for $200 of irrelevant information.

The simplest solution to this difficulty is to practice. You will find all Research Libraries provide access to slightly older articles through CD-rom databases. Search these to hone your skills. I saw a small book on search techniques from an early course in my state library - but it is very basic.
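One cheap way to practise the Boolean and proximity operators from section 33.2 is to build a toy inverted index yourself. What follows is only a minimal sketch in Python under stated assumptions: the three records are invented, the function names are hypothetical, and real database systems implement these operators in their own ways.

```python
import re
from collections import defaultdict

# Invented three-record "database", for illustration only.
RECORDS = {
    1: "Patent research methods for small firms",
    2: "Research funding and the patent office backlog in government agencies",
    3: "Italian cuisine: spaghetti and other pasta",
}


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(records):
    """Inverted index: word -> {record id -> list of word positions}."""
    index = defaultdict(dict)
    for rec_id, text in records.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(rec_id, []).append(pos)
    return index


def ids(index, word):
    """Set of record ids containing the word (empty set if absent)."""
    return set(index.get(word, {}))


def near(index, a, b, distance):
    """Records where a and b occur within `distance` words of each other."""
    hits = set()
    for rec_id in ids(index, a) & ids(index, b):
        if any(abs(pa - pb) <= distance
               for pa in index[a][rec_id]
               for pb in index[b][rec_id]):
            hits.add(rec_id)
    return hits


index = build_index(RECORDS)

# Boolean AND is set intersection, OR is union, NOT is difference.
print(sorted(ids(index, "patent") & ids(index, "research")))     # [1, 2]
print(sorted(near(index, "patent", "research", 2)))              # [1]
print(sorted(ids(index, "research") - ids(index, "government"))) # [1]
```

On this toy data "patent and research" returns two records while "patent near research" (within two words) returns only the first, which is the gain in precision described under Proximity operators. The set difference on the last line also shows why the not operator deserves care: it silently drops record 2 even though that record mentions patents.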
Most librarians build experience in using search systems either internally, or through a series of courses given by travelling database officers, like the periodic training by Dialog-Insearch. These are expensive, but include some free time searching the expensive databases (no, they don't let you take information back with you).

Now, there must be something else I can share with you on this topic.

First, learn something about how the databases are built in the first place. It helps if you know what an inverted text database looks like.

Second, something personal about technique... I always find the uglier the search query, the better the result. Honestly. A search combining numerous elements improves your chances of getting it right.

Third, I always try to change my search techniques to match the medium. I am likely to be more careful with broad searches of expensive databases, whereas free databases often lead me to gather 50 articles, then weed them out by hand (most CD-ROMs allow you to select only the ones you want). Always bring a 3.5'' floppy with you when visiting a library on the off-chance you want to download and look at results another time.

Fourth, I almost always find the initial challenge is in locating those specific terms that appear in 80% of the documents that interest you. When searching the Internet for information about government use of the web, the specific terms required were government and publishing (not even government publish was close). All other search terms gave far too much garbage.

Yes, of course, being an expert in a particular field is an edge, since you already know these special terms. There are two escape hatches here. If you can find one or two articles that interest you, often you can browse these articles for those special words. Sometimes even, the descriptors of an interesting article will give you a specific subject heading.
I've heard this technique called the "Pearl Development Technique" but I just think of it as a good idea. The second escape hatch is the use of free databases to prepare you for going online. If you have ready access to a CD-ROM database, search this first - get the right search words on the free databases, then go online.

Oh, of course, there is also the option of just asking someone involved for the proper words. I like to ask my clients if they know what words are likely to be used. Being asked is not the mark of an amateur, by the way.

A couple of side issues

1) Keep an eye on the type of document you are searching. If you want full text - don't go looking in bibliography databases. More to the point, don't start word searching databases with really big files without using the proximity indicators and descriptive fields. I hated paying for that 20-page document which included all the words I was interested in - but on different pages.

2) Also, keep an eye on the quality of the documents you are retrieving. I know a search of newspapers sounds impressive, but they are rarely capable of explaining anything in depth and are notorious for running advertorials. I try to keep newsprint for locating experts - not for information. I have also been trapped by obscure magazines with appealing articles, only to learn the magazine is one of a large number of very basic business mags which like to use fillers, or just don't like to pay for good journalism. A single article of 5 pages from Scientific American blows 20 small fillers out of the water. In fact, the length of an article is a hint of depth.

Oh, if you are looking for some really good books on this issue, try the manuals Dialog sends you to start, look for text databases in your library, then proceed to one of the search books recommended at the end of our 'research as a discipline' article.

___________________________________________________
34. More on the Information Service Industry

Private Detectives, Professional Database Researchers, Library Researchers, Legal Researchers, Commercial Database Producers, Commercial Database Retailers, Magazines, News Organizations, Libraries: this is a big industry. Information Research is just a process linking together people seeking information with people who provide it.

__ 34.1 judging information value

Information has value. It also has other qualities that will assist you to judge information you may consider buying.

Accuracy: the factual nature of the information presented. If the statistics purport to show a particular trend - how large is the margin of error? How large is the sample size? How likely are there to have been factual errors in their development? The measurement of statistical error is now a refined science in some fields. A statistical result can be inaccurate when the sample size is too small, when the margin of error is too large, when the sample collection procedure is incorrect, or in a number of other situations.

Reliability: the support for trusting the solutions, both from additional resources and from being able to duplicate the conclusions. This includes the reputation of the researchers. No matter how inaccurate and biased you may believe certain facts to be, successful independent support of a suggested fact does improve its value.

Bias: conscious or subconscious influences that affect information. Bias can occur in collection, preparation and presentation of information. Most information you find will be tainted. Secondary information is deeply affected. Statistics are not necessarily less biased. We counter bias in several ways. Firstly, we try to be aware of bias. Where is bias likely? In which direction would the bias affect the information? Secondly, we try to collect information with different bias. This is why research based solely on government research, no matter how accurate and reliable, is less valuable.
Often information from different countries can counter bias. Thirdly, we need to accept bias is likely to exist. This is why primary sources are often more valuable than secondary sources. This is why tertiary sources, like experts, can rarely stand alone.

Age: The date information was created or compiled will feature prominently in the value of information. Dates given sometimes mean the date information was created, or the date information was compiled. How old is a book compiled in 1995, which took the author 10 years to finish? I find statistics often forecast information, prominently displaying recent compilation dates but still using old census data or the like to draw their conclusions. Information on the Internet typically has no date, and can be severely challenged because of this.

Purpose: purpose merits further discussion. When you are uncertain about potential bias, you can look for reasons to distrust the information instead. Suspicion is not equivalent to bias, but it can be thought provoking. Privately, I have heard repeated rumours that important national statistics have been fudged in different countries. A government research report investigating the price of books in Australia would have a political purpose, a purpose that provides the climate for some potentially significant bias. A tell-all book by industry experts often includes a tremendous quality of insider experience difficult to find elsewhere. While there may be a purpose of self-aggrandizement, the purpose is less a climate for significant bias. Medical research has perhaps the greatest climate for significant bias, and this suggests the greatest standard of proof and external, reliable support.

Accuracy, reliability, bias, age and purpose are very important in research. This is what leads us to an appraisal of value. For years, the tobacco industry funded 'independent' research finding smoking minimally harmful to health.
It is now likely there were errors of both accuracy and bias. Certainly, purpose was in doubt. As new studies show smoking is harmful, we can also say the original research lacked reliability. In some topics, like the Internet, research is perpetually suspect because it also ages so quickly.

I have seen further discussions that add 'Coverage' and 'Authority' to this checklist. Both have bearing on the value of the information contained. By coverage, we mean how much detail is invested in covering a specific topic. Sparse or shallow coverage is closely tied to missing critical aspects of information. News stories frequently have limited coverage.

Once you are acclimatized to these elements, you begin to see potential for error in a whole range of information. Real-estate association figures, expert opinions, toothpaste advertisements and national GDP figures all occasionally display some degree of warping and manipulation, clouding the truth. The solution is awareness, comparison and careful analysis.

As a personal aside, this is part of the reason for my dislike of market research: it is often taken far more seriously than warranted and means far less than suggested.

___________________________________________________
35. Emerging Trends in the information sphere

For the past few years, individual database owners/maintainers have been flirting with the idea of making paid access available through the Internet, rather than the existing system of allowing database retailing firms to promote and market their databases. I have heard rumours most database producers earn up to 30% of retail price when delivered through database retailing firms. The Internet is not a commercially viable alternative...yet, but some have emerged with alternative funding despite this (Library of Congress, ERIC, see section 13). Others are creeping in around the edges by offering subscribers access at a much reduced flat annual fee (Computer Select at one time).
I expect to see much more of this once a meaningful way to charge by the page emerges. Digital money holds the key but despite the hype, practical use appears to be a medium to long-term reality.

A second trend is Internet publishing itself. Gradually, the information is getting easier to locate (don't laugh please - it's undignified). We are also getting better at using the Internet as a tool to disseminate information. We have the very visible, if perhaps short-lived, search engines, but also other efforts like archives of FAQs, archives of guidebooks, applying the Dewey Decimal system to the Internet, specialist directories, subject guides, specialist search engines. This will be a lively field for several years to come. As it gets easier to locate the good information, perhaps the lines between commercial quality and Internet quality will begin to merge in places.

The third trend is the very promising prospect of paying for information by the page through the Internet - viewing the results in a web page immediately. There are some technical hurdles yet, but certain elements are already appearing in ventures like DialogWeb. This step may prove profitable for ATM vendors and owners of Internet cafes, pubs and kiosks. It will also herald a dramatic drop in the cost of information.

__ 35.1 developing an informative internet?

Several serious glitches have delayed the further improvement of the internet as an effective information resource. Oh, sure it is the world's largest library and thousands of new webpages are published every hour. But this trite statement disguises how slowly the informative value of the internet is developing.

The internet holds so very much promise - far more than popularly acknowledged. The marketing mantras endlessly hound us on the internet transforming the marketplace, but few of us grasp this technology promises to rewrite the rules of community, government and the exchange of intellectually valuable information too.
This article shall illustrate how the internet, as an effective information resource, is not yet delivering the information pertaining to community, government and the exchange of intellectually valuable (improved) information. We are only proceeding quickly with market information and computer-related information. We should have achieved more by now.

Organization

Let's start by looking at information itself. Information passes from producer, to organizer, to consumer. It travels many paths in this journey. Superficially, we can observe internet communication travels via email, newsgroups, and webpages (and others). Let's call these tools.

Looking deeper, we observe information emerges from just a few generalized groups: knowledgeable individuals, informed government employees, grant funded educational projects, commercial organizations and a few others. Each group produces a particular type of information, distributes (publishes & promotes) in particular channels, and hopes to pay for (or justify) their effort in a particular way. Efficient internet research is infused with an understanding of who publishes, where and why.

Before information reaches the consumer, it passes through a vetting which organizes and filters both the quality and the presentation style of the information. Let us call these systems. The FAQ is a pivotal piece of a system that may start with a post to a mailing list or newsgroup, involves the vetting of the faq maintainer, then proceeds to a faq archive and then to the end consumer. The webpage is published by someone who has justified their time and expense, is indexed by a search engine or definitive-topic-website or webring or what have you, and then is found and read by the end consumer. The internet has many such systems.

Each system again defines many of the traits of the resulting information. Faqs are semi-authoritative, collaborative pieces, often dense and factual.
Private mailing lists are sometimes more informative and discursive, as well as serving as a notice board. Newsgroups involve far less natural vetting and quality control, but excel in distributing popular volume resources like graphics. Search engines don't vet, but can be searched. Each system reinforces the uniqueness it brings to the whole internet.

When I blindly declare Information Clumps in section 32.7 of this faq, I am really describing a trend whereby certain information accumulates in a particular location, others out of self-interest add to the pile, and further information reinforces both the logic and uniqueness of that pile of information. It is just a short jump from this to understanding how faq archives grow but maintain a good quality, how the grand internet search engines began to lose value about 15 months ago, and how ftp archives still exist for many computer topics.

The internal logic to the organization of information is based on simple principles. It defines the environment within which we strive to improve the internet as an effective information resource.

Further Reading:
Searching the Web: Strategy (http://cn.net.au/webpage.htm#5)

Publishing

As mentioned, thinking about who is publishing assists research. Apply this to where information is emerging and we learn much of the best information is not reaching the internet. Certainly, the commercially generated information is not reaching the internet (covered below). The large research studies paid for by public funds and slowly aging on the shelves of government and non-government organizations are also not coming online. (Even offering to publish such documents freely does not appreciably affect this trend, as the restrictions are not financial but a matter of mindset.) Instead, government, institutional and commercial organizations primarily publish brochure-ware - as befitting the presentation of market information. Similar logic applies to much of the best information in the world.
We should recognize few of the more valuable documents emerge online.

Further Reading:
Socially Responsible Publishing on the Internet ('97) (http://cn.net.au/cn/past/docs/publish.html)
A census of Regionally Important Documents on the Web ('96) (http://cn.net.au/cn/past/docs/webscan4.html)

Discussion

The internet is so exciting to me, precisely because of a promise of a real community rebirth arising from this technology. For the first time in history, we should be able to discuss in an informed manner any number of issues from crime to taxation. Tied into this are issues of government transparency, international assistance, anti-corporate market reform, and community involvement.

Unfortunately, my experience with mailing lists and more recently with a newsgroup confirms the difficulties in developing discussion. Discussion groups function as noticeboards, but the difficulties in developing participation, and in moderation, are just a little too cumbersome for them to be successful. For many discussion groups, the chaff overwhelms the wheat, and the information content is far from considerable. The financial rewards are also minimal for establishing and maintaining discussion groups. Dramatic improvement to the informative value of the internet is unlikely to emerge from discussion groups.

Further Reading:
How to build a discussion on the Internet (http://cn.net.au/cn/past/docs/forums.html)

Rewards

We have alluded to the importance of editorial and organization on the internet. There are several severe limitations to this - first and foremost the difficulty in gathering financial rewards for meaningful work improving and organizing information. I am being circumspect here. There is money available - but not where it is needed.

The most important resource in professional research is the contents of the commercial information sphere. This sphere existed decades before the internet, is far better funded, and is far larger.
To compare commercial and internet information is almost heresy.

The bridge between these two, internet and commercial, is emerging slowly. Digital money should grease the exchange of information by dropping the cost of exchange considerably. Today, credit cards provide this service. This works, at times, but digital money would allow for small amounts of money to change hands. This appears to be a critical threshold for bringing much of the commercial information onto the net.

About 5 years ago I was introduced to the Theseus Model - an economic model to pay the intellectual investment in publishing and organizing interactive multimedia. Years ago there was Xanadu. While I have serious reservations about both, they both illustrate the intellectual foundations for effective use of a tool for exchanging small amounts of money. It opens the doors to direct delivery of copyright work - which in turn opens an effective economic model for publishing improved information on the internet.

Without digital money, proprietary information can only be exchanged digitally by gift (that is, free - the initial driving force of the internet information sphere), or by credit-card purchase of access to passwords to external networks - the current method of accessing database retailers. This has the unfortunate effect of limiting the interest both of internet users in the commercial information sphere and of the commercial information retailers in the internet. Oh, there is movement in both directions, but not at the scale experienced in other ways.

Further Reading:
The UWA Theseus Project (http://www.arts.uwa.edu.au/TheseusWWW/)
The Xanadu project (http://www.xanadu.com or concise summary - http://www.sfc.keio.ac.jp/~ted/XU/XuPageKeio.html)

Understanding

Finding information on the internet is a skill. Finding information on the commercial information sphere is also a skill. There is a great degree of overlap.
The awareness of the general public in how to effectively search the internet is very limited. This can be seen both through the abundance of simple web search engines - a research tool well past its prime - and the competition I am experiencing in developing the spire project.

Let's take as an example the search engine. Most searches end in "1000s of results, here are the first 10". Do you really think the first 10 or 20 or 100 sites listed are particularly better than the next? No - you have a random selection of resources, a selection generated by computer based on the simplest of criteria. To further denigrate the popular search engine, we should also mention how some search engines sell placement in search results, and how recent surveys suggest search engines record only 20% of the web.

And yet, the search engine is the much vaunted entryway to the world of information!?! Clearly this is not an avenue that will develop further. We need to look to other resources like topic-specific search engines, meta-sites and discussion groups.

Research

My own speciality is in assisting electronic research. I publish the spire project, the information research faq, shareware, mirror sites, and further resources which assist people with information research. Much of this work involves building bridges between commercial information resources and the internet. Oh, information retailers are very familiar with selling to an internet audience - but few of the structures are in place to achieve any meaningful promotion.

I also find considerable confusion about what resources already exist on the internet which are useful in serious research. Most of us are aware the Library of Congress and the British Library catalogues are online - but few are aware AGIP, MOCAT and the UK Stationery Office catalogue are also online.

Multiplication of Information

One complication of poor information organization is an inflation of overlapping information nuggets.
Information on the internet is so difficult to locate we have almost a continual need for more publishing. Information must exist in numerous locations to reach an intended audience. Promotion of the simplest nature - recognition for the best for a given topic - becomes exceedingly difficult. Only when 20 sites publish or report a given fact does it become accessible.

Curiously, this is the state of affairs in the wider community. Promotion is an expensive speciality. Numerous copies, distributors and references are required to generate any kind of significant awareness. Why should the internet be different?

Actually, why should the internet be the same? This model certainly breaks down when a definitive resource is found and recognized (and funded). Websites like the US Census Bureau have no need for alternative descriptions or presentations. Such sites are the exception.

Consider a search for the best resources for patent research. We are greeted with 699 websites (Altavista search for "patent research" Aug 20 '99), which do not include my patent research faq (certainly one of the top 10). There is no technical or theoretical need for such confusion.

Justification

It is relatively difficult to earn money from publishing improved information, or organizing information already on the internet. Given the intense interest in this technology, a collection of models have emerged. A brief tour of these models will highlight the financial limitations to improving the internet as an informative resource.

Working for fame (but not payment)
This model works well in open source software programming, and some of this ethic certainly extends to publishing information.

Simple altruism/complete lack of justification
School students and internet novices in particular may not need to justify anything. Unfortunately, such work is usually neither consistent nor persistent.

Commercial promotion
Promotional funds can be used to publish information.
Most promotion is short-sighted, limited to presenting market information (like product information), but in time government and associations will fund publishing in-house information for purely promotional reasons.

Invested commercial businesses
There are certain commercial opportunities to earn money through banner advertising and sponsorship.

Direct payment for improved information (perhaps with digital money), direct payment to authors (Theseus model, royalty systems), and direct state sponsorship need not be necessary to fundamentally improve the internet as an information resource. This is not the premise of this article. Academic peer-reviewed journals do not pay for articles. Commercial periodicals are supported by advertising, and the token subscription costs usually just cover distribution costs. Fame motivates many efforts, not just online, and we do not feel the need to habitually justify everything we do. In no small way, as more people become adept at publishing quickly, important information will move on the net faster. A similar story can be created for organizing the internet.

However, all these economic models will not improve the informative value of the internet like direct payment. Most current limitations have economic roots.

Conclusion

We know something of how the information was published, and how there is a serious lack of important documents coming on to the internet. We have described how information is organized on the internet - how limited editorial vetting and organization have given rise to a range of traits which give rise to the need for research skills. Financial rewards and financial tools are not yet going to solve this difficulty, as we have described. We can only hope for a gradual growing out of our current difficulties.

This leaves us with a simple analysis of the limited public awareness of how to research the internet.
I assume it goes without saying how a greater understanding of the various systems of the internet will assist you in judging the worth, likely source and likely venues of the information you seek. The same is true in the larger world... database, book or article? Each has different traits and qualities, reinforced over time.

Let's consider instead the development of this website. Would the development of a site focused on information research work better than the well established alternatives? On the plus side, there are certain qualities of all internet communication that should assist such a shift. Internet communication is inexpensive, relatively rapid, and increasingly accessible. On the negative side, the internet is badly vetted, potentially very time consuming, and up against very well entrenched systems that have been running for either decades or millennia (considering databases or books). Elements like a promised but functionally absent digital money, and the lack of a meaningful way to recoup the costs of vetting online information, only make matters worse.

Sacred to the early users of the internet was an attitude of give and take over the internet. This is the most distressing trend I have seen, with the gradual submergence of the focus on sharing information freely towards a more capitalist instinct of effort and returns. And this has hit the revolution at its weakest point.

Further Reading:
Theory and Past Projects of Community Networking (http://cn.net.au/cn/past/)

___________________________________________________
36. Question and Answer Section

__ 36.1 How do I find information on the Internet?

A search for information on the Internet is not essentially different from the standard information search process. You still need to start by outlining carefully just what you are hoping to locate. You also need to be aware of the peculiarities of the Internet as a researchable resource (or rather a collection of resources).
If you expect instant delivery of exactly what you require, free, then you need a reality check (and I am sure you will get one real soon). Sadly, the printed media tends to overlook this.

As with all resources, the more familiar you are with a given resource, the more efficiently you will work. Get to know the Internet for a time first. Understand how it works. Then re-adjust your expectations and file it as just another collection of resources, perhaps preferable in certain circumstances.

A more complete answer to this question starts with a great deal of reading and is a primary purpose of the spire project.

___________________________________________________
37. Acknowledgements

I would like to thank my wife Fiona, whom I love and cherish dearly.

The spire project is the culmination of several years bridging information research and internet development. The information research industry is on the verge of a radical transformation set to add meaning to the oft-used saying "Information Revolution". The development of the Internet is currently delayed by many factors, but to grow further, we need to radically improve the middle ground of content-rich resource-linked webpages. I feel this is the most beautiful form information can take in this emerging information landscape. It is also a most effortful area to work in. The spire project is the most advanced information guide today.

Thanks to the many readers who assist in building and refining this information. Your help is appreciated.

___________________________________________________
Copyright (c) 1998 by David Novak, all rights reserved. This FAQ may be posted to any USENET newsgroup, on-line service, website, or BBS as long as it is posted unaltered in its entirety including this copyright statement. This FAQ may not be included in commercial collections or compilations without express permission from the author.
Please post permission requests to david@cn.net.au
-----------------------------------
David Novak - david@cn.net.au