It may also become one of the most hotly contested standards issues ever to hit the networking and information industries. At least five distinct groups stand ready to defend their particular approach to on-line metadata.
One way to understand metadata is to think of it simply as an index or template that describes information on a server--perhaps an abstract of the information and descriptions of its format, multimedia content, precise location and access fees. A library card catalog is one familiar form of metadata.
Today, search engines based on robots--programs that drill from link to link across the WorldWide Web--are the Internet's primary search tools. But with information doubling every year, robots alone will quickly become impractical, because users won't have the time to wade through all the information they gather.
The Internet community also discourages introducing client-based robots--also known as worms, spiders and Web crawlers--because they tend to chew up massive bandwidth and server resources in their indiscriminate information quests. It's simply too costly to support teenagers who send out robots to look for sex in all the wrong places.
Standards and Control
Such problems have led many to conclude that the Internet needs a catalog approach, based on metadata, to organize network-based information. The first step in such an approach is to determine a common structure and content for metadata. The stakes in how this information is defined are quite high. For many vendors, control of the Internet's metadata is key to this high-stakes marketplace. That message hasn't been lost on companies like InfoSeek Corp., IBM or Microsoft. Microsoft is developing a protocol-independent decentralized storage system involving metadata.
However, metadata is the lifeblood of other groups, too. Stuart Weibel, senior research scientist at OCLC Online Computer Library Center, says libraries use the rich and intricate machine-readable cataloging record (USMARC) to describe their resources. With millions of documents already indexed using USMARC, there is an interest in preserving this legacy system. Members of the Internet community have been working on a distributed WHOIS++ architecture to complement the simplified USMARC effort.
An overlapping and already deployed--but still evolving--architecture, called Harvest, is also a major contender. Harvest was developed at the University of Colorado, Boulder, and has major funding from the Advanced Research Projects Agency.
Metadata is also a "big" issue for the WorldWide Web Consortium. W3C director Tim Berners-Lee says the consortium is working to define a metadata architectural model that would allow various approaches to coexist. The work doesn't yet have a timetable and Berners-Lee cautions that "quite a complex language is required."
Finally, various scientists are working on new ways to tackle the metadata problem--especially methods that move away from the notion of maintaining a central metadata repository. For example, research is underway at the Massachusetts Institute of Technology to treat documents as objects surrounded by a metadata wrapper. A more primitive approach is to include metadata in the text format header of HTML documents.
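To make the HTML-header idea concrete, here is a minimal sketch of how a program might read metadata embedded in a page's head section. It assumes the metadata is carried in `<meta name=... content=...>` tags alongside the `<title>`; the sample page, the field names and the `extract_metadata` helper are illustrative assumptions, not part of any proposal described here.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "meta" and name:
            # Store each named field; a missing content attribute becomes "".
            self.metadata[name.lower()] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html_text):
    """Return a dict of the metadata found in an HTML document (hypothetical helper)."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.metadata


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<html><head>
<title>Quarterly Report</title>
<meta name="author" content="Example Author">
<meta name="keywords" content="finance, report">
</head><body><p>Body text.</p></body></html>"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_PAGE))
```

The appeal of this approach is that the description travels with the document. The drawback is that an indexer still has to fetch every page to read it.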
How metadata develops will have broad consequences for content providers and users searching for it. Metadata may not even get off the ground if indexing data becomes a complex manual task left to content server administrators. If the index is too simplistic, however, finding information on the Internet will become increasingly unmanageable.
What can users do to protect their interests in this time of flux? Eddie Correia, editorial development analyst for CMP Publications, says CMP chose a search engine with open application programming interfaces (APIs). "Even if some search engines disappear, we can write to the APIs," he says. CMP decided, after examining 25 search engines, to use a product from Open Text Corp.
Of course, how metadata will work with today's search engines is still being determined. The most likely scenarios put search engines to work sifting through catalog information or searching text only after an appropriate catalog entry has been found. Described below are some emerging metadata approaches and their pros and cons.
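The second scenario, searching text only after a matching catalog entry has been found, can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's design: the `CatalogEntry` record, its subject terms and the sample documents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record: a document id plus its subject terms."""
    doc_id: str
    subjects: frozenset


def catalog_then_text_search(catalog, documents, subject, keyword):
    """Two-stage search: filter by catalog subject, then scan only those texts."""
    subject = subject.lower()
    keyword = keyword.lower()
    # Stage 1: the cheap catalog lookup narrows the candidate set.
    candidates = [entry.doc_id for entry in catalog if subject in entry.subjects]
    # Stage 2: full-text matching runs only on the surviving candidates.
    return [doc_id for doc_id in candidates if keyword in documents[doc_id].lower()]


# Illustrative data only.
CATALOG = [
    CatalogEntry("doc1", frozenset({"libraries", "cataloging"})),
    CatalogEntry("doc2", frozenset({"networking"})),
    CatalogEntry("doc3", frozenset({"cataloging"})),
]
DOCUMENTS = {
    "doc1": "USMARC records describe library holdings.",
    "doc2": "Routers forward packets; USMARC is not involved.",
    "doc3": "A short note on subject headings.",
}

if __name__ == "__main__":
    print(catalog_then_text_search(CATALOG, DOCUMENTS, "cataloging", "usmarc"))
```

Here the text of `doc2` mentions the keyword but is never scanned, because its catalog entry does not match the subject.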
Links to the Library
This month, when the Internet Engineering Task Force meets in Stockholm, there is likely to be a "birds of a feather" session to examine a proposal calling for a system relying on a simplified version of USMARC. The approach grew out of an effort sponsored by nonprofit OCLC and NCSA, the OCLC/NCSA Metadata Workshop. Among the workshop participants were representatives of the Library of Congress, a number of universities, Bunyip Information Systems, Xerox Parc, SoftQuad, Mitre, the National Science Foundation and other government agencies and national laboratories.
Weibel sees the effort as a compromise--one that would let librarians expand on relevant records, while keeping content providers from having to deal with an existing format that is "complex, esoteric and, in many cases, impenetrable" to those without training and experience. Weibel hopes that a simplified cataloging of Internet resources will influence standards in both the Internet and the library communities. Details are to be solidified by this month (see the Conferences button at http://www.oclc.org:5046/~weibel).
Today, OCLC's on-line catalog contains about 33 million bibliographic records, and the organization serves some 18,000 libraries. By this summer, OCLC is expected to make 50,000 Internet bibliographic records available to librarians through an offering called NetFirst.
By July, Weibel expects to see several prototypes based on the simplified USMARC approach. Bunyip will implement a version of the workshop semantics in its syntactical architecture, and Weibel is open to other groups, such as Harvest and supporters of Z39.50, doing the same.
WHOIS++ Bunyip?
Among those refining the OCLC approach is Bunyip. Bunyip Information Systems is a 24-person company, best known for its Archie index and the WHOIS++ protocol for directory service.
Bunyip has been working to extend WHOIS++ into an Internet search architecture and expects to be able to let users experiment with client and server implementations by this month. An 11-server beta network was scheduled to be in place by early June. The architecture relies on a search protocol intended to be compatible with X.500 but to also provide greater scalability. That protocol works with a structured data interface to those databases to be searched. In May, plans were underway to support OCLC's cataloging approach as well as the IETF-defined Internet Anonymous FTP Archives (IAFA) and Bunyip's better-known White Pages directory services.
Chris Weider, author of the proposed indexing standard for WHOIS++, says what Bunyip is trying to do is establish a partially structured and simple set of metadata that can easily be searched. Bunyip, he says, helped shape the data set to be presented by OCLC and plans to implement technology for searching that data set.
Weider believes an important advantage of Bunyip's approach is that it is "a lot simpler" than the ANSI Z39.50 client/server search standard--which is being promoted by some as an entry to cataloging. Z39.50 is the foundation for WAIS' search approach, although experts say WAIS has customized the standard, which in some instances proved too complex and in others lacking. One of the key advantages of Z39.50 is that it lets users retain their search history as they continue to search a server.
Weider also believes WHOIS++'s query routing protocol is better than Z39.50. It lets content providers build indexes for their databases and expand those indexes in a hierarchical tree structure when they become too large for a single server to handle. The approach relies on pointers, so that a user searching for particular information will be expeditiously directed through the tree structure to the information being sought, rather than having to visit every server on a network. "Your network traffic is, therefore, enormously reduced," says Weider, "and there is a lot less stress on the lower servers."
Weider says the architecture, which supports parallel searches, should be able to speed the delivery of search results since, unlike Z39.50, every server does not have to be searched. The intermediate indexes also tend to be small, he says, "where with WAIS, the index can be two to three times the size of the underlying data."
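The routing idea Weider describes can be sketched as a tree of index servers in which each parent keeps a summary of the terms its children can answer, so a query descends only into branches that can hold a match. The sketch below is a simplified illustration, not the WHOIS++ protocol itself: the `IndexServer` class, the server names and the set-of-terms summary are all assumptions made for the example.

```python
class IndexServer:
    """One node in a hypothetical index tree.

    Leaf servers hold records (term -> document names). Every server also
    keeps a summary: the set of terms reachable through it, which stands in
    for whatever index information a parent would hold about a child.
    """

    def __init__(self, name, records=None, children=None):
        self.name = name
        self.records = records or {}
        self.children = children or []
        self.summary = set(self.records)
        for child in self.children:
            self.summary |= child.summary

    def search(self, term, visited):
        """Return matching documents, recording each server contacted."""
        visited.append(self.name)
        hits = list(self.records.get(term, []))
        for child in self.children:
            # Follow the pointer only if the child's summary lists the term.
            if term in child.summary:
                hits.extend(child.search(term, visited))
        return hits


# Illustrative three-level tree; children must be built before parents.
lib_a = IndexServer("lib-a", {"metadata": ["doc1"], "usmarc": ["doc2"]})
lib_b = IndexServer("lib-b", {"harvest": ["doc3"]})
lib_c = IndexServer("lib-c", {"metadata": ["doc4"]})
region_1 = IndexServer("region-1", children=[lib_a, lib_b])
region_2 = IndexServer("region-2", children=[lib_c])
root = IndexServer("root", children=[region_1, region_2])

if __name__ == "__main__":
    contacted = []
    print(root.search("harvest", contacted), contacted)
```

A search for "harvest" touches only `root`, `region-1` and `lib-b`. The other servers are never contacted, which is the traffic saving Weider points to.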
Initially, the WHOIS++ implementation will support only full-text keyword searches with Boolean operators and will later move into catalog-type information. Still, Weider admits to "open issues" in making this sort of approach work: "How consistent do vocabularies have to be to get an efficient and meaningful search? What sort of data formats support the searches people want to make? People still have problems with Boolean searches. If this turns out to be more complicated, people will scream a lot, and then, if they are told they have no choice, they simply won't use it."
There are also key questions about which architectural approaches will win out in the marketplace. Weider, for example, admits that Bunyip's approach overlaps with the Harvest approach, which is already being deployed on the Internet.
Harvest, he says, also has tools to make it easier to pull information out of files automatically to create an index. "It can take a reasonable guess as to a format," he says. But Weider's feeling is that humans will have to become involved in indexing at some point because the kind of tools needed to truly do that job well would require a semantic understanding close to human intelligence.
Harvesting Information
So, what is Harvest? Michael Schwartz, Harvest's principal investigator and an associate professor at the University of Colorado, describes Harvest as an architecture of server-based gatherers, plus brokers [each with a query interface] that collect information from the gatherers. Brokers can communicate with other brokers, and gatherers can communicate with many brokers. Harvest gatherers can run at a content provider's site or can support native FTP, Gopher or HTTP protocols across the network. The technology includes its own search engine as well as hooks to WAIS and Verity search engines. Harvest has its own search protocol, although Schwartz was debating whether to add Z39.50 to that support in May.
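The division of labor Schwartz describes can be pictured with a toy model: gatherers summarize documents at a site, and brokers collect those summaries (from gatherers or from other brokers) and answer queries. This is only a sketch of the architecture's shape, not the Harvest software or its record format; the `Gatherer` and `Broker` classes, the site names and the keyword-set summaries are hypothetical.

```python
class Gatherer:
    """Runs near the content and produces one summary record per document."""

    def __init__(self, site, documents):
        self.site = site
        self.documents = documents  # document name -> text

    def summarize(self):
        return [
            {"site": self.site, "name": name, "keywords": set(text.lower().split())}
            for name, text in self.documents.items()
        ]


class Broker:
    """Collects summaries from gatherers or other brokers and answers queries."""

    def __init__(self):
        self.records = []

    def collect_from(self, source):
        if isinstance(source, Gatherer):
            self.records.extend(source.summarize())
        else:
            # Another broker: reuse the records it has already collected.
            self.records.extend(source.records)

    def query(self, keyword):
        keyword = keyword.lower()
        return sorted(
            (r["site"], r["name"]) for r in self.records if keyword in r["keywords"]
        )


# Illustrative sites and documents.
ftp_gatherer = Gatherer("ftp.example.org", {"readme": "Harvest index notes"})
web_gatherer = Gatherer("www.example.com", {"intro": "metadata index overview"})

topic_broker = Broker()
topic_broker.collect_from(ftp_gatherer)
topic_broker.collect_from(web_gatherer)

top_broker = Broker()
top_broker.collect_from(topic_broker)  # broker-to-broker collection

if __name__ == "__main__":
    print(top_broker.query("index"))
```

Because only the compact summaries move between sites, the full documents stay on their home servers until a user actually requests one.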
Harvest is also intended to be flexible enough to work with a variety of metadata types, although it is now limited to keyword indexing rather than a cataloging role. "Over time, we want an architecture that lets you plug in different approaches," says Schwartz. Today, the Harvest software is offered free to users and licensed to vendors building products with it.
Schwartz believes catalogs will ultimately reflect various industry groups--with one type of catalog used for scientific research and others for vertical markets. On-line services, such as Lexis-Nexis, will join the Internet as secure payment methods are put in place. Users will pay for whoever does the best job of organizing these sectors. His vision is that Harvest will be the technology pulling this all together.
Schwartz estimates 1,000 gatherers are now deployed. He adds that Harvest can reduce network traffic by a factor of 60, index space requirements by a factor of 43 and FTP/HTTP/Gopher server requirements by a factor of four. Parallel search capabilities are also under development.
In May, Schwartz was negotiating with ARPA to develop an agile information integration system to address legacy systems and protocols better. A new Harvest release was also expected.
InfoSeek: What's Relevant
InfoSeek has another approach to metadata. Today, InfoSeek provides a search service said to bring back the most relevant information for a given search request. It processes 500,000 to a million queries per day.
InfoSeek president Steve Kirsch says InfoSeek relies on a proprietary combination of human judgment and technology resources. The service is subscriber- or transaction-fee-based and identifies duplicate documents. Kirsch believes such a relevance-based approach is superior to Harvest.
He's also actively working to foster a standard for metadata with other search engine vendors--one that would perhaps accommodate the USMARC language as an extension. Overall, what Kirsch would like to see happen is for cataloging to be handled as an extension of Z39.50.
"From my perspective," he says, "getting everyone to agree on standards for representation of collections and standards for allowing people to combine search results from a distributed search are the two top items on the agenda to make a world search possible."
IBM is also moving to establish a metadata architecture with a far-reaching plan and technology that supports multiple search engines, applications, encryption and digital cash schemes. That "container" technology plan was described in our June issue, "Selling Knowledge on the 'Net," page 102.
Finally, Martijn Koster, a software engineer and Webmaster for NEXOR, is working to index the Internet with the ALIWEB distributed indexing approach. While Koster believes that the work of OCLC is very similar to his own, he says that ALIWEB, one of his spare-time projects, simply hasn't had the resources of other approaches.
Koster sees pros and cons to most of the cataloging schemes now being proposed. For one, he doesn't want to see OCLC or any other organization or vendor in "a sole position of power." He'd also like to see more information on how OCLC will collect metadata from various servers.
Koster says that Harvest still lacks tools to create metadata automatically and its design for moving information among networks hasn't been tested extensively. He also fears that Harvest could evolve into "a humongously big database where you can't do a sensible search and get a sensible result."
On the positive side, he believes Harvest could be used as a broker with many metadata schemes, such as OCLC's. For now, he believes the answer is most likely to lie with well-trained people and approaches being taken by companies--such as TelTech Resource Network--who know the advantages of various search methods and where to find the experts and databases they need.
Christine Hudgins-Bonafield can be reached at cbonafield@nwc.com.