Book Notes: Algorithms and Data Structures for Massive Datasets
5 min read · Sep 23, 2024
This is a great intermediate algorithms and data structures book focusing on processing massive data/big data. It covers lots of interesting and practical problems. Highly recommended to anyone who is interested in data processing.
You can get your copy from Amazon.
Hash:
- How to dedup a big file: break into smaller chunks, compare the hashes
- How to detect plagiarism: Rabin-Karp string matching, compute hashes for small chunks. The Rabin-Karp string comparison algorithm uses a rolling hash to quickly check substrings. A rolling hash can calculate the hash of the next substring in constant time: hash(i+1) = hash(i) + char_added - char_removed (see the rolling-hash sketch at the end of this section)
- Collision resolution:
- chaining: linked list or balanced tree to hold the overflow for the bucket
- open addressing: linear probing, cuckoo hashing (two candidate locations for a key; if one is available, use it; if both are occupied, kick one out and move it to its alternative location, and so on; a small sketch follows at the end of this section)
- MurmurHash
- consistent hashing
- Chord distributed protocol, uses a finger table to speed up node lookup. If the node does not contain the key, it forwards the request to the node that has the smallest distance to the key.
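As a rough illustration of the rolling-hash idea behind Rabin-Karp, here is a minimal Python sketch. The function name, BASE, and MOD are my own illustrative choices, not values from the book; Rabin-Karp's hash is polynomial rather than purely additive, but the constant-time roll is the same idea.

```python
# Minimal polynomial rolling hash, as used by Rabin-Karp substring search.
# BASE and MOD are arbitrary illustrative choices, not values from the book.
BASE = 256
MOD = 1_000_000_007

def find_substring(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    high = pow(BASE, m - 1, MOD)  # weight of the character that rolls out
    target = window = 0
    for i in range(m):
        target = (target * BASE + ord(pattern[i])) % MOD
        window = (window * BASE + ord(text[i])) % MOD
    for i in range(n - m + 1):
        # Compare actual strings only when the hashes match (collision check).
        if window == target and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll in O(1): drop text[i], append text[i + m].
            window = ((window - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return -1

print(find_substring("the quick brown fox", "brown"))  # 10
```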
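And a toy sketch of cuckoo-hashing insertion, assuming a single flat table with two hash functions. The table size, the eviction limit, and the use of Python's built-in hash() are simplifications of mine, not the book's implementation.

```python
import random

# Toy cuckoo hash set: every key has two candidate slots. If both are occupied,
# evict the resident key and reinsert it at its alternative slot, repeatedly.
# SIZE, MAX_KICKS and the use of Python's built-in hash() are illustrative.
SIZE = 11
MAX_KICKS = 50
table = [None] * SIZE

def h1(key):
    return hash(key) % SIZE

def h2(key):
    return (hash(key) // SIZE) % SIZE

def insert(key) -> bool:
    for _ in range(MAX_KICKS):
        for slot in (h1(key), h2(key)):
            if table[slot] is None:
                table[slot] = key
                return True
        # Both candidate slots occupied: kick one resident out and retry with it.
        slot = random.choice((h1(key), h2(key)))
        table[slot], key = key, table[slot]
    return False  # too many kicks; a real implementation would rehash or resize

def contains(key) -> bool:
    return table[h1(key)] == key or table[h2(key)] == key

for fruit in ["apple", "pear", "plum", "fig"]:
    insert(fruit)
print(contains("plum"), contains("kiwi"))  # True False
```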
Membership:
- bloom filter (a code sketch appears at the end of this section)
- squid cache
- formula to calculate the false positive rate: approximately (1 - e^(-kn/m))^k for k hash functions, m bits, and n inserted elements
- problems
- cannot handle deletion; a variant called the counting bloom filter supports it
- cannot resize (doesn't store the original keys, so the hashes cannot be recalculated)
- hot queries may increase false positives (use more hash functions for hot elements)
- quotient filter
- uses a longer hash to reduce false positives
- stores the fingerprints
- supports resize, merge, delete
- divides a hash into two parts: quotient, remainder
- metadata bits: bucket_occupied, run_continued, is_shifted
- resize: steal one bit from the remainder and give it to the quotient, which doubles the number of slots
- comparing bloom filter and quotient filter
- quotient filter
- overall performance drops as it gets full
- inserts are significantly faster
- throughput for successful lookups is significantly higher
- more complicated to implement
- bloom filter
- read is slightly faster
- Merkle tree (sketched in code at the end of this section)
- a hash tree with branching factor of 2
- the hash of a node: hash(left_child_hash + right_child_hash), where + is concatenation
- used in blockchain for Bitcoin
- allows a transaction to be verified very quickly (only the hashes on the path from the transaction to the root are needed)
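Here is a minimal bloom filter sketch. Deriving the k hash positions from two halves of a single SHA-256 digest (double hashing) is a simplification of mine, not necessarily what the book uses; m and k in the usage example are arbitrary.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: m bits, k hash functions derived from one SHA-256."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # Derive k indices from two base hashes (double hashing).
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))

bf = BloomFilter(m=1024, k=3)
bf.add("alice")
print(bf.might_contain("alice"), bf.might_contain("bob"))  # True False (very likely)
```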
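And a small Merkle-root sketch that hashes pairs of child hashes up to the root. It uses single SHA-256 for simplicity (Bitcoin actually applies SHA-256 twice), and duplicating the last node on odd-sized levels is just one common convention.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash the leaves, then repeatedly hash pairs of child hashes up to the root."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        # Parent hash = hash(left_child_hash + right_child_hash); '+' is concatenation.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

transactions = [b"tx1", b"tx2", b"tx3", b"tx4"]
print(merkle_root(transactions).hex())
```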
Frequency estimation (given an element, what’s the frequency/count for it):
- zipf’s law: https://www.geeksforgeeks.org/zipfs-law/
- the second most used word appears half as often as the most used word
- the third most used word appears one-third as often as the most used word, etc.
- Boyer-Moore majority vote algorithm: given an unordered sequence in which one element is known to appear more than half the time (the majority element), find that element as efficiently as possible (see the sketch at the end of this section)
- keep a count and a candidate element; if the next element matches the candidate, increment the count, otherwise decrement it. When the count reaches 0, set the candidate to the next element and continue the process.
- Count-min sketch (a code sketch follows at the end of this section)
- a 2D array/matrix, d rows and w columns
- d hash functions, the hash value is within the range of [1..w]
- adding an entry needs to calculate d hashes, then update the corresponding C[i][j]
- estimate the count of an entry: locate the d cells the same way as when adding the entry, and take the minimum of the d values as the estimate
- top k problem
- use Count-min sketch and a min heap
- update Count-min sketch, after adding one entry, get the estimate for that entry
- add the entry and its estimate to the min heap, using the estimate as the key; this effectively maintains the top k entries
- range query
- instead of estimate for a single entry, estimate for a range of entries
- divide the range into several levels (similar to a skip list): the top level contains a single range [1..2^n] that covers the query range; the bottom level contains many ranges, each covering only one entry
- each level has its own Count-min sketch, we add the ranges as the entries into the Count-min sketch
- when we estimate for the range, e.g. [3..13], we break it into [3..4], [5..8], [9..12], [13], then get the estimate for each range and sum up
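A minimal sketch of the Boyer-Moore majority vote described above; the function and variable names are my own.

```python
def majority_candidate(items):
    """Boyer-Moore majority vote: O(n) time, O(1) extra space.

    Assumes some element occurs more than half the time; returns that element."""
    count, candidate = 0, None
    for item in items:
        if count == 0:
            candidate = item       # count hit 0: reset the candidate
            count = 1
        elif item == candidate:
            count += 1             # same as the candidate: increment
        else:
            count -= 1             # different element: decrement
    return candidate

print(majority_candidate(["a", "b", "a", "c", "a", "a"]))  # a
```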
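And a toy Count-min sketch combined with the top-k heap idea. Seeding Python's built-in hash() per row stands in for proper hash functions (e.g. MurmurHash), and d, w, k and the sample stream are illustrative choices of mine.

```python
import heapq
import random

class CountMinSketch:
    """d rows x w columns; each row uses its own (seeded) hash function."""

    def __init__(self, d: int = 4, w: int = 1024, seed: int = 42):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(32) for _ in range(d)]

    def _columns(self, item: str):
        return [hash((s, item)) % self.w for s in self.seeds]

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Minimum over rows: hash collisions only ever inflate a cell, never deflate it.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

# Top-k on a stream: a small min heap of (estimate, item) pairs next to the sketch.
cms, k, heap = CountMinSketch(), 2, []
for word in ["cat", "dog", "cat", "bird", "cat", "dog"]:
    cms.add(word)
    est = cms.estimate(word)
    existing = next((e for e in heap if e[1] == word), None)
    if existing is not None:
        heap.remove(existing)          # refresh this word's estimate in the heap
        heapq.heapify(heap)
        heapq.heappush(heap, (est, word))
    elif len(heap) < k:
        heapq.heappush(heap, (est, word))
    elif est > heap[0][0]:
        heapq.heapreplace(heap, (est, word))  # evict the current smallest entry
print(sorted(heap, reverse=True))  # [(3, 'cat'), (2, 'dog')]
```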
Cardinality estimation (how many distinct elements):
- traditional way: sort, then count
- DBs support approximate distinct counts, e.g. BigQuery by default uses an estimated distinct count
- HyperLogLog (HLL)
- compute fixed size hash values, 32 bit, 64 bit
- probabilistic counting
- a: num of trailing zeroes in hash + 1, e.g. 1100, a=3
- for multiple hashes, find a for each hash, then take the max
- 2 ^ max(a) is the estimate
- problems: the estimate has big errors, and it can only ever be a power of 2
- Stochastic averaging
- divide the hashes into m = 2^b buckets, using the first b bits of the hash
- do probabilistic counting within each bucket to get its max rank a_i, then calculate Avg(a_i)
- the estimate is m * 2^Avg(a_i)
- this smooths out the outlier problem, and the estimate is no longer restricted to a power of 2 (see the sketch at the end of this section)
- LogLog
- on top of stochastic averaging, multiply by a normalization factor
- normalization factor is a function of m, has a formula to calculate
- HyperLogLog
- similar to stochastic averaging, but uses the harmonic mean instead of the arithmetic mean
- the normalization factor is again a function of m; when the cardinality is small, switch to linear counting
- HyperLogLog++
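A toy sketch of probabilistic counting with stochastic averaging. The bucket count (m = 64), the use of SHA-256, and the function names are my choices; 0.39701 is the asymptotic value of the LogLog normalization factor mentioned above, assumed here to be adequate for this m.

```python
import hashlib

B = 6            # first B bits of the hash pick the bucket, so m = 2^B buckets
M = 1 << B

def rank(x: int) -> int:
    """Number of trailing zero bits plus one (the 'a' value above)."""
    a = 1
    while x and x & 1 == 0:
        a += 1
        x >>= 1
    return a

def estimate_cardinality(items) -> float:
    buckets = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - B)               # first B bits choose the bucket
        rest = h & ((1 << (64 - B)) - 1)     # remaining bits feed probabilistic counting
        buckets[bucket] = max(buckets[bucket], rank(rest))
    avg = sum(buckets) / M                   # arithmetic mean of the per-bucket ranks
    raw = M * 2 ** avg                       # stochastic-averaging estimate
    # LogLog multiplies by a normalization factor that depends on m;
    # 0.39701 is its asymptotic value for large m.
    return 0.39701 * raw

print(round(estimate_cardinality(range(10_000))))  # roughly 10_000 (error ~ 1.3/sqrt(m))
```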
Sampling (this chapter is quite difficult)
- Bernoulli sampling
- naive: use a pseudo-random number generator (PRNG), generate a uniformly distributed random number r between 0 and 1, if r < p, then include the element into the sample
- Poisson sampling
- Reservoir sampling
- Bernoulli and Poisson sampling have the drawback of a variable sample size
- reservoir size k
- include first k elements
- process each element after that: for the element with index i, include it with probability k/i, replacing a randomly chosen element already in the reservoir (see the sketch at the end of this section)
- Sliding window sampling
- chain sampling
- priority sampling
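A minimal reservoir-sampling sketch matching the k/i inclusion rule above (this is the classic Algorithm R); the function name is my own.

```python
import random

def reservoir_sample(stream, k: int):
    """Uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)        # always keep the first k items
        else:
            j = random.randint(1, i)      # include this item with probability k/i
            if j <= k:
                reservoir[j - 1] = item   # replace a uniformly chosen existing element
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```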
Quantiles (another difficult chapter)
- naive: sorting and selection problem, if the input is static
- quickselect: BFPRT gives a high-quality pivot by dividing the data into groups of 5, taking the median of each group, then recursively taking medians of the medians until only one is left (the median-of-medians algorithm); see the sketch at the end of this section
- digest: cluster data points and generate metadata like mean, count, etc. for each cluster
- T-digest
- Q-digest
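A quickselect sketch using the groups-of-5 pivot described above (medians taken repeatedly until one is left). This is a simplified illustration with my own function names, not the book's code.

```python
def select(items, k: int):
    """Return the k-th smallest element (1-based) via quickselect with a BFPRT-style pivot."""
    items = list(items)
    if len(items) == 1:
        return items[0]
    pivot = median_of_medians(items)
    lows = [x for x in items if x < pivot]
    pivots = [x for x in items if x == pivot]
    if k <= len(lows):
        return select(lows, k)
    if k <= len(lows) + len(pivots):
        return pivot
    highs = [x for x in items if x > pivot]
    return select(highs, k - len(lows) - len(pivots))

def median_of_medians(items):
    # Split into groups of 5, take each group's median, then repeat on the medians.
    groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
    medians = [g[len(g) // 2] for g in groups]
    if len(medians) <= 5:
        return sorted(medians)[len(medians) // 2]
    return median_of_medians(medians)

data = [7, 1, 9, 4, 3, 8, 2, 6, 5]
print(select(data, k=5))  # 5, the median of the list
```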
External Sort
- B-tree: intermediate nodes store the values
- B+ tree: only leaf nodes store the values; leaf nodes are also linked together in a linked list, optimizing sequential access for range queries
- Bε-tree: allows tuning and tradeoff between inserts and lookups in external memory
- each internal node has a buffer: temporarily store inserts and deletes
- LSM-tree
- Cassandra, LevelDB, RocksDB, etc.
- External merge sort (see the sketch below)
- External quick sort
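A sketch of external merge sort on a line-oriented text file: sort memory-sized chunks into runs on disk, then k-way merge the runs with a min-heap (heapq.merge). The chunk size, file paths, and function name are illustrative choices of mine.

```python
import heapq
import os
import tempfile

# External merge sort sketch: read the input in runs of CHUNK_SIZE lines,
# sort each run in memory, spill it to a temp file, then k-way merge the
# sorted runs. CHUNK_SIZE is an illustrative in-memory budget, not a value
# from the book.
CHUNK_SIZE = 100_000

def external_sort(input_path: str, output_path: str) -> None:
    run_paths = []
    with open(input_path) as f:
        while True:
            # zip(range, f) takes at most CHUNK_SIZE lines without reading ahead.
            chunk = [line for _, line in zip(range(CHUNK_SIZE), f)]
            if not chunk:
                break
            chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            # heapq.merge lazily merges the already-sorted runs (k-way merge).
            out.writelines(heapq.merge(*runs))
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)

# Usage (hypothetical file names):
# external_sort("huge_unsorted.txt", "sorted.txt")
```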