De-duplication is a hot topic in federated searching, and there is
considerable confusion about what is fact and what is fiction regarding
this holy grail of searching. This document is intended to offer a
reality check.
How does de-duplication work? It’s simple: you download
all the citations from every database you’re searching, compare
them all, and delete duplicates. So what’s the problem? Try this
exercise yourself:
Keyword search = “Internet”
ProQuest® New York Times results: 10,000+ hits (ProQuest ceiling for results is 10,000)
Citation #10: “As Broadband Gains, The Internet's Snails, Like AOL, Fall Back; Saul Hansell; New York Times, New York, N.Y.; Feb 3, 2003; Late Edition (East Coast); pg. C.1”
ProQuest results are downloaded in sets of 10. This citation appears in results set 1.

EBSCO Academic Search Elite™ results: 107,477 hits
Citation #49: “As Broadband Gains, The Internet's Snails, Like AOL, Fall Back.; By: Hansell, Saul., New York Times, 2/3/2003, Vol. 152 Issue 52383, pC1, 0p, 2 graphs”
EBSCO results are downloaded in sets of 10. This citation appears in results set 5.

Gale® InfoTrac® Expanded Academic Index™ results: 106,135 hits
Citation #66: “As broadband gains, the Internet's snails, like AOL, fall back. (fewer subscribers to slow Internet-access dial-up services like AOL Time Warner's America Online and others) Saul Hansell. The New York Times Feb 3, 2003 pC1(N) pC1(L) col 2 (35 col in)”
InfoTrac results are downloaded in sets of 20. This citation appears in results set 4.
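
The three citations above describe the same article, but no two are formatted alike, so the "compare them all" step cannot be a plain string comparison. The following is only a minimal sketch of one possible comparison, assuming the title and publication date have already been parsed out of each vendor's citation string (itself a non-trivial step; the InfoTrac entry, for instance, carries a parenthetical annotation after the title). The function names `match_key` and `deduplicate` and the dictionary layout are hypothetical, chosen for illustration, and do not represent any vendor's or product's actual method.

```python
import re
from datetime import date

def match_key(title, pub_date):
    """Build a comparison key from a citation's title and publication date."""
    # Lowercase the title and turn every run of punctuation or
    # whitespace into a single space, so case and trailing periods
    # do not prevent a match.
    normalized = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return (normalized, pub_date.isoformat())

def deduplicate(citations):
    """Keep the first citation seen for each match key."""
    seen = set()
    unique = []
    for citation in citations:
        key = match_key(citation["title"], citation["date"])
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique

# Titles and dates as they appear in the three citations above
# (assumed to be already extracted from the vendor formats).
citations = [
    {"source": "ProQuest",
     "title": "As Broadband Gains, The Internet's Snails, Like AOL, Fall Back",
     "date": date(2003, 2, 3)},
    {"source": "EBSCO",
     "title": "As Broadband Gains, The Internet's Snails, Like AOL, Fall Back.",
     "date": date(2003, 2, 3)},
    {"source": "InfoTrac",
     "title": "As broadband gains, the Internet's snails, like AOL, fall back.",
     "date": date(2003, 2, 3)},
]

unique = deduplicate(citations)
print(len(unique), "unique citation(s); kept the copy from", unique[0]["source"])
```

Note that a comparison like this only works on citations that have actually been retrieved: a duplicate sitting in a results set that was never downloaded cannot be detected.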
Because the same citation appears at a different position in each native results set,
de-duplicating these results requires downloading all results from all databases:
Largest results set: EBSCO Academic Search Elite: 107,477 hits
Results downloaded in sets of 10 @ 5 seconds per set = 10,748 sets x 5 seconds = about 14.9 hours to
download
How long are your users willing to wait?
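
The download estimate can be checked with a few lines of arithmetic. This sketch assumes the sets are fetched strictly one after another at a constant time per set; the function name `download_hours` is hypothetical, and the 2-second rate is included only as a more optimistic comparison point, not as a measured figure.

```python
import math

def download_hours(hits, per_set, seconds_per_set):
    """Hours needed to fetch every results set sequentially."""
    # A partially filled final set still costs one full request.
    sets = math.ceil(hits / per_set)
    return sets * seconds_per_set / 3600

# Largest results set in the exercise: EBSCO, 107,477 hits in sets of 10.
hours_at_5s = download_hours(107_477, 10, 5)
hours_at_2s = download_hours(107_477, 10, 2)  # hypothetical faster rate

print(f"{hours_at_5s:.2f} hours at 5 seconds per set")
print(f"{hours_at_2s:.2f} hours at 2 seconds per set")
```

With 107,477 hits in sets of 10 this comes to 10,748 requests: roughly 14.9 hours at 5 seconds per set, and still close to 6 hours at 2 seconds per set.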