Unknown

Dataset Information

0

HyperMinHash: MinHash in LogLog space.


ABSTRACT: In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(logn) to buckets of size O(loglogn) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ϵ on a Jaccard index t, given a random oracle, HyperMinHash needs O(ϵ-2(loglogn+log1ϵ)) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 1019 with relative error of around 10% using 2MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 1010 with the same memory consumption.

SUBMITTER: Yu YW 

PROVIDER: S-EPMC10824537 | biostudies-literature | 2022 Jan

REPOSITORIES: biostudies-literature

altmetric image

Publications

HyperMinHash: MinHash in LogLog space.

Yu Yun William YW   Weber Griffin M GM  

IEEE transactions on knowledge and data engineering 20200317 1


In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(logn) to buckets of size O(loglogn) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, u  ...[more]

Similar Datasets

| S-EPMC6626348 | biostudies-literature
| S-EPMC4915045 | biostudies-literature
| S-EPMC7423146 | biostudies-literature
| S-EPMC10190105 | biostudies-literature
| S-EPMC5440850 | biostudies-literature
| S-EPMC7713896 | biostudies-literature
| S-EPMC8450716 | biostudies-literature
| S-EPMC4734808 | biostudies-literature
| S-EPMC7168699 | biostudies-literature
2010-05-16 | E-GEOD-14265 | biostudies-arrayexpress