Remote Search Services
Volume Number: 15 (1999)
Issue Number: 12
Column Tag: Web Site Design
Remote Search Services for Web Sites
by Avi Rappoport
Adding search to your site - even if you don't own the server
What is Remote Site Searching?
When you want to add search to your site, you may run into some technical difficulties. Perhaps your site is hosted on a large server somewhere, or you have an uncooperative web administrator, or the challenges of adding a CGI are too daunting. Never fear! You can outsource your search to a remote site search service and let someone else worry about the gory details.
The indexer and search engine run on the remote server: they use a web indexing robot, or spider, to follow links on your site and read the pages, then store every word in an index file on that server. When it comes time to search, the form on your local Web page sends a message to the remote search engine. Although the request goes through the Web, the process doesn't change: it just has to travel a little farther. The remote search engine takes the search terms, matches them against the words in the index, sorts the matching pages according to relevance, and creates an HTML page with the results. When a searcher clicks on a result link, they see the page from your site, just as though the search came from there. It's easy and painless for practically everyone.
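To make the index-match-sort cycle concrete, here is a minimal Python sketch of a word index with a crude relevance score. It is an illustration only, not how any of the reviewed services works: the page texts, the function names and the "count the matching words" ranking are all assumptions chosen for brevity.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to a Counter of {url: number of occurrences}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index

def search(index, query):
    """Return matching URLs, best first, scored by total word occurrences."""
    scores = Counter()
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # most_common() sorts by descending score
    return [url for url, _ in scores.most_common()]

# Hypothetical page texts standing in for what a spider would read
pages = {
    "/index.html": "Welcome to our search tools site",
    "/robots.html": "Robots and spiders: how search robots index a site",
    "/contact.html": "Contact us by email",
}
index = build_index(pages)
results = search(index, "search robots")
```

Here `/robots.html` ranks above `/index.html` because it contains more occurrences of the query words. Real engines use far more elaborate ranking.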
This review covers the range of remote search services, their features and their drawbacks. It will teach you to prepare your site, try indexing it, test the search, customize the results, keep the search up to date, and choose the right program for your long-term needs.
What you Get With Remote Search Services
- No need for server access: Even if your site is hosted and you have FTP access only, you can run a search engine.
- No need to learn CGIs or server systems: You never need to install any software, worry about version compatibility, or learn about permissions and paths (or pay someone else to do so).
- Easy administration: The remote search service will provide a set of Web pages for administration, rather than making you learn about command lines or config files.
- No load on your server: Search engines require significant resources, such as CPU time during searching and retrieval, as well as disk space. Outsourcing to a remote server moves the load away from you. In addition, these servers are usually in data centers with excellent connectivity and 24/7 administration.
- Minimal initial investment: Instead of paying for a search engine up front, you can pay a small monthly fee. Some services are free, showing advertising with the search results.
- Easy to switch: If you aren't happy with your search service, it's easy to switch to another.
Drawbacks of Remote Search Services
- Advertising or continuing costs: You must pay every month or allow your searchers to see other people's advertising.
- Less control over the indexing: If your data changes frequently (hourly or daily), most of these services will not index that often.
- Dependent on outside service: If the service's search engine gets busy, it may delay responses for your site, and there's not much you can do.
- Less capacity: The remote search services have a page limit, usually somewhere between 200 and 5000 pages. While many can go higher than that, they can't handle hundreds of thousands of pages.
- Fewer special features: Each search engine has its own special features, but you have more choices if you plan to run your own engine: for example, indexing password-protected areas or word processing file formats, or adding a thesaurus or a spellchecker.
- Intranet privacy: The owners of intranets (internal networks using standard software) usually want to keep control of all their data, rather than allowing access to external systems.
- Multi-site indexing: Most remote services allow you to index just the sites you control. With a local search engine, you can index other sites and create a public search portal.
Remote Site Search Services Covered
The following services are covered in this review, and also have pages and examples on this site.
- free for 500 pages, fewer than 5,000 searches per month, (no ads, just a logo)
- paid version: 250 pages & 2.5K searches/month @ $75 per year; 500 pages & 5K searches/month @ $150 per year; 1,000 pages & 10K searches/month @ $300 per year; 2,500 pages & 25K searches/month @ $600 per year; 5,000 pages & 50K searches/month @ $1,200 per year
- free (with advertising), can handle up to 32 MB of HTML (flexible), will "sample" sites if they get large.
intraSearch (WhatUSeek) <http://www.whatUseek.com/intraSearch/>
- free (with advertising) to at least 10,000 pages
MondoSearch (remote version) <http://www.mondosearch.com/>
- paid version only: 1 - 1,000 pages: $144; to 5,000 pages: $585; to 10,000 pages: $990; above: contact email@example.com
- local server version also available
- free (with advertising), to 5,000 pages
- paid version: $6.99 per month (12 month commitment); $9.99 per month (3 month commitment)
- free (with advertising) to 5,000 pages
- free (with advertising), for up to 5,000 pages, 30,000 searches per month
- paid version: up to 1,000 pages: $300 per year; up to 5,000 pages: $600 per year (limit of 30,000 searches per month); for more pages, contact company
- free (with advertising), to 10,000 + pages
Webinator (remote version) <http://www.thunderstone.com/texis/indexsite>
- free (with Thunderstone logo), to 5,000 pages
- local server version also available, can do thousands and millions of pages
Checking Links and Pages
Before you install any search engine with an indexing spider, you must make sure it can find the pages on your site. The good news is that cleaning up your links will improve your accessibility to the large public search engines (such as AltaVista, Google, HotBot and Infoseek), and make it easier for you to run an automated site mapper.
Robot Spider Compatibility
The indexing spiders follow links from a starting page, so use a home page if you have good text links, or a site map page.
Whole sites: Robots.txt
The first thing is to check the "robots.txt" file. This is a standard file for web servers that sits at the root of your site, and excludes robots that are not welcome on the site, or in certain specific directories (though this is voluntary). If you run your own server, you control this file: otherwise your host server administrator controls it.
You want to make sure that this file exists, and that it allows at least your indexing spider to access your directories. You may need to negotiate with your web hosting provider on this point, as this file must be stored in the root folder of the web host.
For more information on this topic, see Search Indexing Robots and Robots.txt <http://www.searchtools.com/info/robots/robots-txt.html> and the WebMasters Guide to the Robots Exclusion Protocol <http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html>.
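If you want to check what a robots.txt file permits before signing up, the Python standard library includes a parser for the exclusion format. The sketch below parses a made-up file from a string, so it needs no network access. The file contents and the spider names "ExampleSearchBot" and "BadBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

def allowed(robots_txt, user_agent, url):
    """Return True if the given spider may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

home_ok = allowed(ROBOTS_TXT, "ExampleSearchBot", "http://www.example.com/index.html")
private_ok = allowed(ROBOTS_TXT, "ExampleSearchBot", "http://www.example.com/private/notes.html")
badbot_ok = allowed(ROBOTS_TXT, "BadBot", "http://www.example.com/index.html")
```

With these rules, the ordinary spider may read the home page but not the /private/ directory, and "BadBot" is shut out entirely. Remember that compliance is voluntary on the robot's part.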
Individual Pages: META ROBOTS tag
The other way that page designers can control robots and spiders is by using the META ROBOTS tags. These are particularly useful if you have a hosted site and don't want to bother your server administrator.
For example, if you have a directory listing or site map page, you can tell the spiders to follow the links but not index the text on the page by placing the following information into the HTML header: <meta name="robots" content="noindex,follow">. If you have pages with useful data but inappropriate links, such as a web calendar page with duplicate links to other calendar pages, use <meta name="robots" content="index,nofollow">.
For more information, see Search Indexing Robots and the Robots Meta Tags <http://www.searchtools.com/info/robots/robots-meta.html> and the Webmaster guide above.
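As a rough illustration of how a spider might interpret these tags, the following Python sketch reads the META ROBOTS tag from a page and reports whether the page may be indexed and whether its links may be followed. The class and function names are invented for this example, and real spiders may handle edge cases differently.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())

def robots_policy(page_html):
    """Return (may_index, may_follow); both default to True when no tag is present."""
    parser = RobotsMetaParser()
    parser.feed(page_html)
    blocked_all = "none" in parser.directives
    may_index = not blocked_all and "noindex" not in parser.directives
    may_follow = not blocked_all and "nofollow" not in parser.directives
    return may_index, may_follow

site_map = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
calendar = '<html><head><meta name="robots" content="index,nofollow"></head></html>'
```

For the site map page above the result is (False, True): skip the text, follow the links. For the calendar page it is (True, False).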
Good Links and Bad Links
Indexing spiders tend to be pretty dumb. They know about the simple HREF links, but just get lost on anything more complex. Spiders and robots may not follow links in:
- image maps (especially server-side image maps)
- redirect and META Refresh tags
- DHTML layers
- ActiveX controls
- Java pages and site maps
- Flash or Shockwave (unless you use the AfterShock options to generate HTML text and links!)
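To see why such links get lost, consider a deliberately simple link extractor of the kind a basic spider might use. This Python sketch only looks at HREF attributes on anchor tags, so the destinations hidden in a META Refresh tag or an applet parameter never reach its list. The sample page and file names are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HrefCollector(HTMLParser):
    """Collect absolute URLs from plain <a href="..."> links only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

# Hypothetical page: one plain link, one META Refresh, one Java applet menu
PAGE = """
<a href="products.html">Products</a>
<meta http-equiv="refresh" content="0; url=moved.html">
<applet code="Menu.class"><param name="target" value="hidden.html"></applet>
"""

collector = HrefCollector("http://www.example.com/index.html")
collector.feed(PAGE)
found = collector.links
```

Only products.html is found. The pages moved.html and hidden.html would stay out of the index unless something else links to them.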
Check Your Links
Don't rely on your content-management system to check local links: it knows too much about the structure of your site and the special formats you use!
To make sure all your local links work, run a link-checking robot such as Big Brother for Mac & Unix <http://pauillac.inria.fr/%7Efpottier/bb.html.en>, or use a service such as NetMechanic <http://www.netmechanic.com/>. If these services can follow the links, there's a good chance that your search indexing robot can do the same.
Solution: Supplement Complex Links
If you find you have problems, there are two ways around bad links: both require work, but they will make the indexing spiders happy.
- Alternate Navigation: add alternate links in <NOSCRIPT> and <NOFRAMES> tags, lists of the links from image maps, simple alternate pages for DHTML and Java pages, etc. This should work for all kinds of robots and spiders.
- Site Page Listing: make a page or sitemap with links to every page on your site. This is hard to maintain and synchronize with your other changes. You can't use a site mapper application that uses a link-following robot, because it will have the same problems that the search engine spiders have.
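One way to ease the maintenance burden of a site page listing is to generate it from the files themselves rather than from links. The Python sketch below assumes you keep a local copy of your site in a folder and that pages end in .html or .htm. Both are assumptions, as are the function names. It also marks the listing "noindex,follow" so that spiders follow its links without indexing the listing itself.

```python
import html
import os
from urllib.parse import quote

def find_pages(site_root):
    """Return sorted site-relative paths of all .html/.htm files under site_root."""
    pages = []
    for folder, _subfolders, files in os.walk(site_root):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                relative = os.path.relpath(os.path.join(folder, name), site_root)
                pages.append(relative.replace(os.sep, "/"))
    return sorted(pages)

def render_listing(paths):
    """Build a plain HTML page with one simple text link per path."""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(quote(p), html.escape(p))
        for p in paths
    )
    return (
        "<html><head><title>Site Page Listing</title>\n"
        '<meta name="robots" content="noindex,follow">\n'
        "</head><body>\n<ul>\n" + items + "\n</ul>\n</body></html>\n"
    )
```

A typical use would be to call `render_listing(find_pages("/path/to/local/copy"))`, save the result as a page on your site, and give that page to the search service as the starting URL.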
Five for the Price of One
The good news is that all this work will pay off in five ways:
- Your search engine's robot spider can find your pages.
- The robot spiders for the webwide public search engines, such as HotBot, Infoseek and AltaVista, can find your pages.
- Robot-based link checkers can check your links.
- Robot-based site map creators can find your pages to make a map.
- Your site is now accessible to blind and visually-disabled web surfers (as described in the W3C Web Accessibility Initiative), and to those using text-only browsers or devices such as PDAs.
Test the Index
Many of the search services require minimal commitment on your part. All you have to do is go to the service Web site, register with a user ID or email address and password, then give them the home page URL. The search service will send their indexing spider to follow links on your site very quickly, so try to do this during a quiet time.
Once you have signed up, you'll see all the setup and configuration options in the browser interface. Some are more elaborate than others: Atomz has a bunch of tabs and subpanes within the tabs, while FreeFind has a nice Wizard interface. Webinator has a fairly elaborate mail-back access control: you must have an email address on the server to index that server.
If your server is slow, you are charged by the byte, or you have long files, choose a service that will do smart updating, and only get the contents of pages if they have changed.
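One common mechanism for this kind of smart updating is the HTTP conditional GET: the spider sends an If-Modified-Since header, and the server answers "304 Not Modified" with no page body if nothing has changed. Whether a particular service works this way is an assumption. The Python sketch below only shows the idea and does not contact any server, and its function names are invented.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
import urllib.request

def conditional_request(url, last_indexed):
    """Build (but do not send) a request that asks only for newer content."""
    request = urllib.request.Request(url)
    # HTTP dates are expressed in GMT, e.g. "Thu, 16 Sep 1999 12:00:00 GMT"
    request.add_header("If-Modified-Since", format_datetime(last_indexed, usegmt=True))
    return request

def has_changed(last_modified_header, last_indexed):
    """Compare a page's Last-Modified header with the time of the last index run."""
    if not last_modified_header:
        return True  # no date available, so fetch the page to be safe
    return parsedate_to_datetime(last_modified_header) > last_indexed

last_run = datetime(1999, 9, 16, 12, 0, 0, tzinfo=timezone.utc)
request = conditional_request("http://www.example.com/index.html", last_run)
```

A page last modified an hour after `last_run` would be fetched again, while one modified the day before would be skipped.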
If you have access to your web site log or monitor window, you can watch the spider as it follows links throughout your site. Otherwise, or in addition, choose a service that provides reports on the indexing process.
Try Searching Your Site
Remote search services provide almost-instant gratification: you can test them as soon as they're done with the indexing. Most of them have a test search form on their site: if not, copy their form to your local page and try it out.
There are two basic kinds of search queries: those that match only pages containing every search term, and those that match pages containing any search term, though they may not show you every matching page. A few services let searchers choose the best approach.
Figure 1. PicoSearch result showing all pages which have any of the search terms.
Figure 2. SiteMiner result for the same search, showing all three pages with all search terms.
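The difference between the two query styles comes down to a simple set comparison. This Python sketch, using made-up page texts and a hypothetical `match_pages` function, shows "all words" versus "any words" matching. It ignores ranking and punctuation for the sake of brevity.

```python
def match_pages(pages, query, mode="all"):
    """Return URLs of pages containing all (or any) of the query words."""
    terms = set(query.lower().split())
    hits = []
    for url, text in pages.items():
        words = set(text.lower().split())
        if mode == "all":
            matched = terms <= words        # every term must appear
        else:
            matched = bool(terms & words)   # at least one term appears
        if matched:
            hits.append(url)
    return sorted(hits)

# Hypothetical page texts
pages = {
    "/a.html": "remote search services",
    "/b.html": "remote control cars",
    "/c.html": "local search engine",
}
all_hits = match_pages(pages, "remote search", mode="all")
any_hits = match_pages(pages, "remote search", mode="any")
```

The "all" search returns one page, while the "any" search returns all three, including the page about remote control cars.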
If your site contains text in other languages, you need to watch out for letter matching issues. Some search engines can only match the 26 letters of the English alphabet, while others can match letters with diacritical marks (such as î and á) and special characters (ø and ß). PicoSearch and MondoSearch also offer multilingual interfaces. Languages written in non-Roman scripts, such as Arabic, Russian and Japanese, are even harder, although PicoSearch offers results in Chinese.
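To get a feel for the letter matching problem, here is a Python sketch of one possible approach: folding accented letters to their base letters so that a search for "ile" also finds "Île". This is an assumption about how an engine might normalize text, not a description of any service reviewed here, and the `fold` function is invented for the example.

```python
import unicodedata

def fold(text):
    """Lowercase the text and strip combining accent marks."""
    # casefold() also expands the German sharp s to "ss"
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

folded_island = fold("Île")
folded_street = fold("Straße")
folded_o_slash = fold("ø")
```

Letters such as î and á fold neatly to i and a, and ß becomes ss. The letter ø has no accent to strip, so it stays as it is: an engine needs an explicit rule if it wants ø to match o.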
When you do a search, and the engine locates a set of pages that match your search, it has to sort them as best it can. This is particularly difficult with one- and two-word searches: it's hard to tell which is the most relevant page (the best match).
Like hairstyles and music, success in relevance ranking is a matter of taste. You should do a number of searches to see what you think of any search engine you choose. Try searches with just one word, others with two, and still others with four or five. This should give you a feeling for the kinds of relevance ranking that a search engine will do.
Search forms are the user interfaces to the search engine, so you can have several different forms, for your various needs.
- Search Field: this is a very small form with a text field and Search button: it can go on your front page or even in the navigation bar on every page.
Figure 3. MondoSearch Simple Search Form.
- Advanced Search: lets the searcher have more control over the search, with options for date ranges or special zones. Only SearchButton, MondoSearch and Webinator include advanced search forms, though they have slightly quirky options.
When the remote search server gets the form command, it looks in the index, matches the search words, and organizes the results. The URL of the results page is that of the search service, not of your server, because that's where the results page is coming from, but the URLs for the found pages themselves include your server name.
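The form command is, in the simplest case, just a URL on the search service's host with the search terms attached. The Python sketch below builds such a URL. The host name, the path and the parameter names "site" and "q" are hypothetical, since each service defines its own.

```python
from urllib.parse import urlencode, urlparse

def search_url(service_endpoint, site_id, query):
    """Build the URL a simple GET search form would request."""
    return service_endpoint + "?" + urlencode({"site": site_id, "q": query})

# Hypothetical service endpoint and account number
url = search_url("http://search.example.com/query", "12345", "remote search")
results_host = urlparse(url).netloc
```

The host of the results page is the search service's, not yours, which is why the results URL in the browser belongs to the service even though every link in the list points back to your own server.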
Customizing Results Pages
Everyone is familiar with webwide public search engines and their lists of results. A local search results page is very similar, although for the best user experience, the search results page should look and feel like the rest of your site. If you are using a remote search service as a permanent part of your site, be sure that you choose a service that lets you customize the page design enough for your comfort level.
Simple Customization features
- page color: lets you select the page color, so it matches the rest of your site design
- background: set the background graphic
- text and link colors: keep the text, link, active link and visited link colors consistent with the rest of your site
Figure 4. FreeFind Results Page Options Wizard.
Some services let you lay out your results page, including the page sections above, to the left, to the right, and below the results list. This allows you to include your normal navigation and site structure links, showing searchers more about the scope of your site. This usually includes fields for you to paste in your HTML code, and you will probably have to try this a couple of times to get the right relationship with the results list, so this is only accessible to those who have some HTML tag experience.
Figure 5. Webinator field to insert page header in HTML.
Several of the free site search services will display banner advertising on the search results, although none of the paid versions will do so. For many sites, it's a fair trade for searching services, but for others, such as libraries and public schools, advertising is inappropriate, so they should choose a version without advertising.
Figure 6. PinPoint default result page, showing banner advertising.
Search Result List Items
As with the results page, the list of pages which match the search is familiar. Some search engines let you customize the elements of the items on this list, which lets you match the layout to the data you have. For example, some sites have useful URLs which give some context to the page, while others are just confusing.
Other features may include:
- a ranking number or graphic indicator of how well the engine thinks the page matches the search terms
- a file modification date (in two-digit US date format: 09/16/99, four-digit year-first format: 1999-09-16, or some other format)
- a file size (best in K, Kilobytes, rather than bytes)
Figure 7. IntraSearch result showing items with URL, size and update date.
If you have carefully written META DESCRIPTION tag contents for each of your pages, so they'll rank well and look great in webwide search engines, you will probably want your site search to display them as well. Be sure to choose a remote search service that will show these.
Otherwise, some services do a good job of extracting useful text, while others just grab the text from the top of the page.
Figure 8. SearchButton result showing selected text extracted from pages.
Some services extract lines containing the search terms and/or highlight the words which match the search terms.
Figure 9. Atomz result showing items with top text and matching text.
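Extracting matching text and highlighting the search terms can be sketched in a few lines of Python. The version below is a simplification with invented names: it shows the text around the first match, bolds every matching word in that window, and falls back to the top of the page when nothing matches. A production version would also need to escape HTML in the page text.

```python
import re

def snippet(text, terms, width=40):
    """Return text around the first matching term, with matches in <b> tags."""
    if not terms:
        return text[:2 * width]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text[:2 * width]  # no match: fall back to the top of the page
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text[start:end])

excerpt = snippet("The spider follows links on your site.", ["spider"])
```

Note that this simple pattern also matches inside longer words, so "spider" would highlight part of "spiders".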
Care and Feeding of Your Site Search
Although the remote search service takes care of the server side of things, you still have to keep track of its status, even if only to make sure it's still running (these services have been fairly reliable so far). You should also perform test searches: some that you repeat every time, others that check new information on your site. And, as you change the layout and design of your site, make sure that the search form and results page reflect these changes.
Updating the Index
To keep your search index synchronized with the content on your site, you'll need to set up some kind of update schedule. If your site changes rarely, you can tell the service when to re-index. However, if your site changes more often, you will want to set up a scheduled update.
Watching the Searches
Analyzing your search log or report can teach you what your visitors are looking for - it's like having a free, automated market research survey. For example, if you have a movie site and everyone starts searching for the Blair Witch Project, you know it's hot, and can make sure you have good information so they don't go somewhere else.
Figure 10. SearchButton Report Options.
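If your service gives you access to the raw search log, a few lines of Python can turn it into a list of the most popular searches. The log format assumed here, a timestamp and the query separated by a tab, is hypothetical. Check what your service actually provides.

```python
from collections import Counter

def top_queries(log_lines, n=3):
    """Count queries (ignoring case and extra spaces) and return the n most common."""
    counts = Counter()
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip lines that don't fit the assumed format
        query = " ".join(parts[1].lower().split())
        if query:
            counts[query] += 1
    return counts.most_common(n)

# Hypothetical log lines: timestamp, tab, query
log = [
    "1999-09-16 10:01\tBlair Witch Project",
    "1999-09-16 10:02\tblair witch  project",
    "1999-09-16 10:05\tshowtimes",
    "malformed line",
]
popular = top_queries(log, 2)
```

Normalizing case and spacing keeps "Blair Witch Project" and "blair witch  project" from being counted as different searches.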
How to Choose a Remote Search Service
Read through the listings above, and try out the search engines in the SearchTools search page <http://www.searchtools.com/search/>. Think about which of the features we describe is vital, and which you can live without (it's like buying a car). Then try out two or three that have the most important elements, and see how well they fit with your site.
Product Special Features and Issues
- Advantages: A very powerful and configurable service, provides lots of control over indexing, follows complex links nicely, indexes PDF files, has many search options, and provides good schedules for updating. The results page and listing layout is entirely configurable using HTML and a simple tag-based scripting language, and there are no banner ads on the results page, just an Atomz logo. Free to 500 pages, paid version for more pages.
- Disadvantages: free version will only index 500 pages; indexer ignores the "NOFOLLOW" tag; search finds every possible match, including plurals and other word forms, and by default retrieves synonyms and soundalike words: it just finds too much!
- Advantages: Allows you to index multiple sites (up to 32 MB of HTML data), update indexes often, handles complex links nicely. The search form lets users choose to match "any words" or "all words". Nice administration wizard interface walks through the options. Free, with advertising, to 32 MB of HTML text.
- Disadvantages: Banner ads on result page, not many useful options for customizing the results page or matched items, sometimes the server is slow to respond, no indexing reports, limited search reports.
- Advantages: Allows you to index multiple sites, good indexing of complex URLs, search finds only pages which match all search words, will update once a week. Free, with advertising, to 500 pages.
- Disadvantages: Ignores robots.txt, which is usually an uncooperative move; not much results page customizing possible; index report is the "site map"; there is only minimal search reporting.
- Advantages: good with complex links such as redirects and framesets: shows frame page results in context; lets the admin control the speed of indexing; search form can include choice of "any words" or "all words"; handles extended Roman characters; marks pages by language; unusual results format that shows pages in categories, otherwise results page layout is very flexible; can be customized for any language; local server version available.
Figure 11. MondoSearch Category Results.
- Disadvantages: no built-in update schedule, browser administration somewhat disorganized, very modal (you must click OK before changing pages). Paid only.
- Advantages: good with complex links such as redirects and framesets, tends to follow many links. Excellent index reporting, especially the live online version. Recognizes extended Roman characters and can be customized to show results in many languages including Chinese. Free, with advertising, to 5,000 pages.
- Disadvantages: No update scheduling: you must do it interactively each time. Search finds every possible match, which is usually too many pages. Free version is not very customizable for results pages or match items. Minimal search reporting. We also saw a distressing error when trying to search a site that is not yet indexed.
- Advantages: Very configurable results page design and results listing options, thorough index report by email. Free, with advertising, to 5,000 pages.
- Disadvantages: Some problems following complex links, no update scheduling so you must do it interactively every time. Search finds every possible match, which is often too many pages. Problems with extracting text for page description. Almost no search reporting, which makes it hard to track usage. Server can be slow.
- Advantages: It's good with complex links such as redirects, and indexes automatically once a month. The search form offers a link to the Advanced form, for power users. The search reports are excellent and the service also provides access to the search log for even more detail. Free, with advertising, to 5,000 pages. Can request no-ads version for small public-service sites, no-advertising paid version available.
- Disadvantages: May want to index more frequently than free version allows. Search finds every possible match (even when that's not wanted). Search form HTML can be hard to locate, and the results page and result item customization is very limited.
- Advantages: good with complex links such as redirects, search finds only pages which match all search words. Free, with advertising, to 10,000+ pages.
- Advantages: good with complex links such as client image maps, search finds only pages which match all search words, great results-page layout customization options, automatically updates every two weeks and on demand. Free, with logo, to 5,000 pages: local server version available.
- Disadvantages: Complex access system for logging into administration site, title problem with server redirects, no result item customization, no update scheduling, minimal search reporting.
As you can see, there's no one search engine that has all the advantages. Which one you should choose depends on your site, and your particular needs. You won't know what you like until you take a couple of test drives!
Avi Rappoport is the Principal Consultant for Search Tools Consulting, specializing in Web Site, Intranet and Portal search engines. She reports and analyzes the industry for SearchTools.com (which runs on a PowerMac 6100). You can contact her at firstname.lastname@example.org.
Disclaimer: Search Tools Consulting has consulting relationships with MondoSearch and SearchButton, but we do not allow our customers to influence our reviews.