Merging Result Sets


Question from: ZIG, January 1998 meeting


Question:
When a client wishes that a number of result sets (more than one) be merged, or merged and sorted, how may this be accomplished? How may the size of the resultant set be determined? How may "degenerate duplicates" be treated?

Response:
Result sets may be merged by a search that ORs the result sets together. If this is the mechanism used to merge, then when a client desires that a number of result sets be merged and sorted, this would be accomplished via a two-step process, where the first step is a search (as described) and the second step is to sort the resulting set.
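
For illustration only, the two-step approach might look like the sketch below. The ZClient class and its method names are hypothetical stand-ins for whatever client API is in use, and the query is shown in PQF-style notation for readability; the point is only the sequence of operations: a Search whose Type-1 query ORs references to the existing result sets, followed by a Sort of the set that the Search created.

    # Hypothetical sketch: merge result sets R1 and R2 by ORing them in a
    # Search, then sort the merged set.  ZClient and its methods are
    # illustrative stand-ins, not part of any real toolkit.

    class ZClient:
        def search(self, query, result_set_name):
            """Send a SearchRequest whose query references existing result
            sets (shown here in PQF-style notation for readability)."""
            print("SearchRequest:", query, "->", result_set_name)
            # A real client would return the result count reported in the
            # SearchResponse, which is why a merge via Search yields a size.
            return {"resultCount": 0, "resultSetName": result_set_name}

        def sort(self, input_sets, output_set, sort_keys):
            """Send a SortRequest over the named input result set(s)."""
            print("SortRequest:", input_sets, "->", output_set, "by", sort_keys)
            return {"sortStatus": "success"}

    client = ZClient()

    # Step 1: merge by searching on the OR of the two result set references.
    merged = client.search(query="@or @set R1 @set R2", result_set_name="MERGED")

    # Step 2: sort the merged set (here by title, ascending).
    client.sort(input_sets=["MERGED"], output_set="MERGED",
                sort_keys=[("title", "ascending")])

    # Note: merged["resultCount"] need not equal size(R1) + size(R2); see
    # the discussion of duplicates below.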

Alternatively, a merge may be accomplished via the merge function of the Sort Service. However, when the merge is done via Sort, the client is not told the size of the output result set, and that size might not be the sum of the sizes of the input sets, because the merge might produce duplicates which the server might unilaterally remove.
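
Again for illustration only, a merge done directly through the Sort Service could be parameterized roughly as in the sketch below. The field names loosely follow the SortRequest parameters; the dictionary is only a sketch, and the actual request is ASN.1 encoded.

    # Hypothetical sketch: merging two result sets by naming both as
    # inputs to a single Sort.  Field names loosely follow the
    # SortRequest parameters; the real request is ASN.1/BER encoded and
    # this dictionary is only an illustration.

    sort_request = {
        "inputResultSetNames": ["R1", "R2"],   # more than one input set implies a merge
        "sortedResultSetName": "MERGED",       # the merged-and-sorted output set
        "sortSequence": [
            {"sortField": "title", "order": "ascending"},
        ],
    }
    print(sort_request)

    # The corresponding Sort Response reports a sort status and possibly
    # diagnostics, but not the size of MERGED, so the client cannot learn
    # from this exchange alone how many items the merged set contains.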

It should be noted that the unilateral removal of duplicates is a controversial issue, which this commentary does not attempt to resolve. When sorting a single result set, it appears to be the consensus within the ZIG (though not unanimous) that the server should not unilaterally remove duplicates. (It may be further observed that some people base this belief on the assumption that a single result set should be normalized, that is, that it should not contain any degenerate duplicates; see below.) Similarly, when several result sets are merged (via the Sort service) and sorted, the server should not attempt to perform any meaningful duplicate detection/removal (a separate service is being developed for the purpose of duplicate detection and removal).

However, when several sets are merged, it is indeed possible that the same physical record is identified by more than one result set, that is, that the same result set item exists in two or more sets (where a result set item is defined to be a database name and a pointer to a record within the database). This category of duplication has been referred to informally as degenerate duplication. While it appears to be ZIG consensus that a simple search should not produce degenerate duplicates, it is clear that a merge of multiple result sets (whether via Search or Sort) may do so. There is no consensus within the ZIG as to whether the server should unilaterally remove degenerate duplicates created as a result of a merge, but there is consensus that there cannot be any rule stating either that a server must, or may not, do so.

Now, there is a fundamental inconsistency in this logic: while it appears reasonable to assume that result sets are "normalized", in the sense that they do not include degenerate duplicates, it does not appear reasonable to compel a server to unilaterally remove degenerate duplicates on either a merge via Sort or a boolean Search. (This does not arise for a simple Search, because the server should simply not include degenerate duplicates to begin with, so there is nothing to remove.)

It may be concluded that when result sets are merged, the size of the resultant result set is not guaranteed to be the sum of the sizes of the input result sets. When the merge is performed via Sort, the result set size is not provided (because it is not a parameter of the Sort Response). When the merge is performed via Search, the result set size is provided (because it is a parameter of the Search Response).

It should also be concluded that when a result set has been created as the result of a merge (either via Sort or Search), the result set may not be assumed to be normalized; that is, it may contain degenerate duplicates. There is currently no mechanism provided by Z39.50 to remove degenerate duplicates; however, the Duplicate Detection Service, currently under development, will provide this capability.
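
Because a result set item is, by the definition above, a pair consisting of a database name and a record pointer, a client that retrieves the items of a merged set can at least detect degenerate duplicates on its own. The sketch below assumes the client has some record-level identifier available for each item; that assumption, and all of the names shown, are illustrative only.

    # Hypothetical sketch: client-side detection of degenerate duplicates
    # in a merged result set.  Each item is modeled as a
    # (database, record_id) pair, matching the definition of a result set
    # item given above; the identifiers are made up for illustration.

    from collections import Counter

    merged_items = [
        ("BOOKS", "000123"),
        ("BOOKS", "000456"),
        ("BOOKS", "000123"),    # same database, same record: degenerate duplicate
        ("SERIALS", "000123"),  # different database: not a degenerate duplicate
    ]

    duplicates = [item for item, n in Counter(merged_items).items() if n > 1]
    print("degenerate duplicates:", duplicates)   # [('BOOKS', '000123')]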


Status: Withdrawn (6/98). Will be re-written to address the ambiguities described here.
Library of Congress