De-duplication is a hot topic in federated searching, and there is considerable confusion about what is fact and what is fiction regarding this holy grail of searching. This document is intended to offer a reality check.

How does de-duplication work? It’s simple: you download all the citations from every database you’re searching, compare them all, and delete duplicates. So what’s the problem? Try this exercise yourself:

Keyword search = “Internet”

ProQuest® New York Times results: 10,000+ hits (ProQuest ceiling for results is 10,000)

Citation #10: “As Broadband Gains, The Internet's Snails, Like AOL, Fall Back; Saul Hansell; New York Times, New York, N.Y.; Feb 3, 2003; Late Edition (East Coast); pg. C.1”

ProQuest results are downloaded in sets of 10. This citation appears in results set 1.

EBSCO Academic Search Elite™ results: 107,477 hits

Citation #49: “As Broadband Gains, The Internet's Snails, Like AOL, Fall Back.; By: Hansell, Saul., New York Times, 2/3/2003, Vol. 152 Issue 52383, pC1, 0p, 2 graphs”

EBSCO results are downloaded in sets of 10. This citation appears in results set 5.

Gale® InfoTrac® Expanded Academic Index™ results: 106,135 hits

Citation #66: “As broadband gains, the Internet's snails, like AOL, fall back. (fewer subscribers to slow Internet-access dial-up services like AOL Time Warner's America Online and others) Saul Hansell. The New York Times Feb 3, 2003 pC1(N) pC1(L) col 2 (35 col in)”

InfoTrac results are downloaded in sets of 20. This citation appears in results set 4.
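The three records above describe the same article, yet no two are formatted alike. As a minimal sketch of the "compare them all" step, the Python below reduces each citation to a normalized title plus publication date and keeps one record per key. The function names, the title-splitting rule, and the month/day/year reading of the EBSCO date are illustrative assumptions, not how any particular federated search product works. Note that the comparison can only run on records that have already been downloaded.

```python
import re

# The three citation strings quoted above, keyed by source database.
CITATIONS = {
    "ProQuest": (
        "As Broadband Gains, The Internet's Snails, Like AOL, Fall Back; "
        "Saul Hansell; New York Times, New York, N.Y.; Feb 3, 2003; "
        "Late Edition (East Coast); pg. C.1"
    ),
    "EBSCO": (
        "As Broadband Gains, The Internet's Snails, Like AOL, Fall Back.; "
        "By: Hansell, Saul., New York Times, 2/3/2003, Vol. 152 Issue 52383, "
        "pC1, 0p, 2 graphs"
    ),
    "InfoTrac": (
        "As broadband gains, the Internet's snails, like AOL, fall back. "
        "(fewer subscribers to slow Internet-access dial-up services like "
        "AOL Time Warner's America Online and others) Saul Hansell. "
        "The New York Times Feb 3, 2003 pC1(N) pC1(L) col 2 (35 col in)"
    ),
}

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def extract_date(text):
    """Return (year, month, day) from '2/3/2003' or 'Feb 3, 2003', else None."""
    numeric = re.search(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", text)
    if numeric:
        # Assumption: numeric dates are written month/day/year.
        month, day, year = map(int, numeric.groups())
        return (year, month, day)
    named = re.search(r"\b([A-Z][a-z]{2}) (\d{1,2}), (\d{4})\b", text)
    if named and named.group(1) in MONTHS:
        return (int(named.group(3)), MONTHS[named.group(1)], int(named.group(2)))
    return None


def citation_key(text):
    """Build a comparison key: normalized title plus publication date."""
    # Assumption: the title is everything before the first '.' or ';'.
    title = re.split(r"[.;]", text, maxsplit=1)[0]
    normalized_title = re.sub(r"[^a-z0-9]", "", title.lower())
    return (normalized_title, extract_date(text))


def deduplicate(records):
    """Keep the first (source, text) pair seen for each citation key."""
    seen = {}
    for source, text in records:
        seen.setdefault(citation_key(text), (source, text))
    return list(seen.values())


unique = deduplicate(CITATIONS.items())
print(len(unique), "unique record(s) from", len(CITATIONS), "downloaded")
# prints: 1 unique record(s) from 3 downloaded
```

The matching itself is cheap. The cost lies in getting citation #49 and citation #66 into memory in the first place, since each sits several result sets deep in its own database.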

Because the same citation appears at a different position in each database's native results, de-duplicating these results requires downloading all results from all databases:

Largest results set: EBSCO Academic Search Elite: 107,477 hits

Results downloaded in sets of 10 @ 5 seconds per set = 10,748 sets × 5 seconds ≈ 14.9 hours to download
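The estimate is simple arithmetic and can be rerun for other databases or response times. The sketch below assumes the sets are fetched strictly one after another from a single database; the function name and parameters are illustrative only.

```python
import math


def download_hours(total_hits, set_size, seconds_per_set):
    """Hours needed to fetch every result set sequentially."""
    sets = math.ceil(total_hits / set_size)  # a partial final set still costs a request
    return sets * seconds_per_set / 3600


# EBSCO Academic Search Elite figures from the exercise above.
print(f"{download_hours(107_477, 10, 5):.1f} hours at 5 seconds per set")
print(f"{download_hours(107_477, 10, 2):.1f} hours at 2 seconds per set")
```

For 107,477 hits in sets of 10, this works out to roughly 14.9 hours at 5 seconds per set, or roughly 6 hours at 2 seconds per set.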

How long are your users willing to wait?

 

 

Copyright © 1998 - 2003 WebFeat, Inc. All rights reserved.