Book Notes: Algorithms and Data Structures for Massive Datasets
5 min read · Sep 23, 2024
This is a great intermediate algorithms and data structures book focusing on processing massive data/big data. It covers lots of interesting and practical problems. Highly recommended to anyone who is interested in data processing.
You can get your copy from Amazon.
Hash:
- How to dedup a big file: break into smaller chunks, compare the hashes
- How to detect plagiarism: Rabin-Karp string matching, compute hashes for small chunks. The Rabin-Karp string comparison algorithm uses a rolling hash to quickly check substrings. A rolling hash can calculate the hash of the next substring in constant time: hash(i+1) = hash(i) + char_added - char_removed (see the rolling-hash sketch at the end of this section)
- Collision resolution:
- chaining: linked list or balanced tree to hold the overflow for the bucket
- open addressing: linear probing, cuckoo hashing (two candidate locations for a key; if one is available, use it; if both are occupied, kick one out and move it to its alternative location, and so on; a small sketch follows at the end of this section)
- MurmurHash
- consistent hashing
- Chord distributed protocol, uses a finger table to speed up node lookup. If the node does not contain the key, it forwards the request to the node that has the smallest distance to the key.
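As a rough illustration of the rolling-hash idea behind Rabin-Karp, here is a minimal Python sketch. The function name, BASE, and MOD are my own illustrative choices, not values from the book; Rabin-Karp's hash is polynomial rather than purely additive, but the constant-time roll is the same idea.

```python
# Minimal polynomial rolling hash, as used by Rabin-Karp substring search.
# BASE and MOD are arbitrary illustrative choices, not values from the book.
BASE = 256
MOD = 1_000_000_007

def find_substring(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    high = pow(BASE, m - 1, MOD)  # weight of the character that rolls out
    target = window = 0
    for i in range(m):
        target = (target * BASE + ord(pattern[i])) % MOD
        window = (window * BASE + ord(text[i])) % MOD
    for i in range(n - m + 1):
        # Compare actual strings only when the hashes match (collision check).
        if window == target and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll in O(1): drop text[i], append text[i + m].
            window = ((window - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return -1

print(find_substring("the quick brown fox", "brown"))  # 10
```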
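And a toy sketch of cuckoo-hashing insertion, assuming a single flat table with two hash functions. The table size, the eviction limit, and the use of Python's built-in hash() are simplifications of mine, not the book's implementation.

```python
import random

# Toy cuckoo hash set: every key has two candidate slots. If both are occupied,
# evict the resident key and reinsert it at its alternative slot, repeatedly.
# SIZE, MAX_KICKS and the use of Python's built-in hash() are illustrative.
SIZE = 11
MAX_KICKS = 50
table = [None] * SIZE

def h1(key):
    return hash(key) % SIZE

def h2(key):
    return (hash(key) // SIZE) % SIZE

def insert(key) -> bool:
    for _ in range(MAX_KICKS):
        for slot in (h1(key), h2(key)):
            if table[slot] is None:
                table[slot] = key
                return True
        # Both candidate slots occupied: kick one resident out and retry with it.
        slot = random.choice((h1(key), h2(key)))
        table[slot], key = key, table[slot]
    return False  # too many kicks; a real implementation would rehash or resize

def contains(key) -> bool:
    return table[h1(key)] == key or table[h2(key)] == key

for fruit in ["apple", "pear", "plum", "fig"]:
    insert(fruit)
print(contains("plum"), contains("kiwi"))  # True False
```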
Membership:
- bloom filter (a code sketch appears at the end of this section)
- squid cache
- formula to calculate the false positive rate: approximately (1 - e^(-kn/m))^k for k hash functions, m bits, and n inserted elements
- problems
- cannot handle deletion; a variant called the counting bloom filter supports it
- cannot resize (doesn't store the original keys, so the hashes cannot be recalculated)
- hot queries may increase false positives (use more hash functions for hot elements)
- quotient filter
- uses a longer hash to reduce false positives
- stores the fingerprints
- supports resize, merge, delete
- divides a hash into two parts: quotient, remainder
- metadata bits: bucket_occupied, run_continued, is_shifted
- resize: steal one bit from the remainder and give it to the quotient, which doubles the number of slots
- comparing bloom filter and quotient filter
- quotient filter
- overall performance drops as it gets full
- inserts are significantly faster
- throughput for successful lookups is significantly higher
- more complicated to implement
- bloom filter
- read is slightly faster
- Merkle tree (sketched in code at the end of this section)
- a hash tree with branching factor of 2
- the hash of a node: hash(left_child_hash + right_child_hash), where + is concatenation
- used in blockchain for Bitcoin
- allows a transaction to be verified very quickly (only the hashes on the path from the transaction to the root are needed)
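Here is a minimal bloom filter sketch. Deriving the k hash positions from two halves of a single SHA-256 digest (double hashing) is a simplification of mine, not necessarily what the book uses; m and k in the usage example are arbitrary.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: m bits, k hash functions derived from one SHA-256."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # Derive k indices from two base hashes (double hashing).
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))

bf = BloomFilter(m=1024, k=3)
bf.add("alice")
print(bf.might_contain("alice"), bf.might_contain("bob"))  # True False (very likely)
```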
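And a small Merkle-root sketch that hashes pairs of child hashes up to the root. It uses single SHA-256 for simplicity (Bitcoin actually applies SHA-256 twice), and duplicating the last node on odd-sized levels is just one common convention.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash the leaves, then repeatedly hash pairs of child hashes up to the root."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        # Parent hash = hash(left_child_hash + right_child_hash); '+' is concatenation.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

transactions = [b"tx1", b"tx2", b"tx3", b"tx4"]
print(merkle_root(transactions).hex())
```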
Frequency estimation (given an element, what’s the frequency/count for it):
- zipf’s law: https://www.geeksforgeeks.org/zipfs-law/
- the second most used word appears half as often as the most used word
- the third most used word appears one-third as often as the most used word, etc.
- Boyer-Moore majority vote algorithm: given an unordered sequence in which one element is known to appear more than half the time (the majority element), find that element as efficiently as possible (see the sketch at the end of this section)
- keep a count and a candidate element; if the next element matches the candidate, increment the count, otherwise decrement it. When the count reaches 0, set the candidate to the next element and continue the process.
- Count-min sketch (a code sketch follows at the end of this section)
- a 2D array/matrix, d rows and w columns
- d hash functions, the hash value is within the range of [1..w]
- adding an entry needs to calculate d hashes, then update the corresponding C[i][j]
- estimate the count of an entry: locate the d cells the same way as when adding the entry, and take the minimum of the d values as the estimate
- top k problem
- use Count-min sketch and a min heap
- update Count-min sketch, after adding one entry, get the estimate for that entry
- add the entry and its estimate to the min heap, using the estimate as the key; this effectively maintains the top k entries
- range query
- instead of estimate for a single entry, estimate for a range of entries
- divide the range into several levels (similar to a skip list): the top level contains a single range [1..2^n] that covers the query range; the bottom level contains many ranges, each covering only one entry
- each level has its own Count-min sketch, we add the ranges as the entries into the Count-min sketch
- when we estimate for the range, e.g. [3..13], we break it into [3..4], [5..8], [9..12], [13], then get the estimate for each range and sum up
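A minimal sketch of the Boyer-Moore majority vote described above; the function and variable names are my own.

```python
def majority_candidate(items):
    """Boyer-Moore majority vote: O(n) time, O(1) extra space.

    Assumes some element occurs more than half the time; returns that element."""
    count, candidate = 0, None
    for item in items:
        if count == 0:
            candidate = item       # count hit 0: reset the candidate
            count = 1
        elif item == candidate:
            count += 1             # same as the candidate: increment
        else:
            count -= 1             # different element: decrement
    return candidate

print(majority_candidate(["a", "b", "a", "c", "a", "a"]))  # a
```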
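And a toy Count-min sketch combined with the top-k heap idea. Seeding Python's built-in hash() per row stands in for proper hash functions (e.g. MurmurHash), and d, w, k and the sample stream are illustrative choices of mine.

```python
import heapq
import random

class CountMinSketch:
    """d rows x w columns; each row uses its own (seeded) hash function."""

    def __init__(self, d: int = 4, w: int = 1024, seed: int = 42):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(32) for _ in range(d)]

    def _columns(self, item: str):
        return [hash((s, item)) % self.w for s in self.seeds]

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Minimum over rows: hash collisions only ever inflate a cell, never deflate it.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

# Top-k on a stream: a small min heap of (estimate, item) pairs next to the sketch.
cms, k, heap = CountMinSketch(), 2, []
for word in ["cat", "dog", "cat", "bird", "cat", "dog"]:
    cms.add(word)
    est = cms.estimate(word)
    existing = next((e for e in heap if e[1] == word), None)
    if existing is not None:
        heap.remove(existing)          # refresh this word's estimate in the heap
        heapq.heapify(heap)
        heapq.heappush(heap, (est, word))
    elif len(heap) < k:
        heapq.heappush(heap, (est, word))
    elif est > heap[0][0]:
        heapq.heapreplace(heap, (est, word))  # evict the current smallest entry
print(sorted(heap, reverse=True))  # [(3, 'cat'), (2, 'dog')]
```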
Cardinality estimation (how many distinct elements):
- traditional way: sort, then count
- DBs support approximate distinct counts, e.g. BigQuery by default uses an estimated distinct count
- HyperLogLog (HLL)
- compute fixed size hash values, 32 bit, 64 bit
- probabilistic counting
- a: num of trailing zeroes in hash + 1, e.g. 1100, a=3
- for multiple hashes, find a for each hash, then take the max
- 2 ^ max(a) is the estimate
- problems: the estimate has big errors, and it can only ever be a power of 2
- Stochastic averaging
- divide the hashes into m = 2^b buckets, using the first b bits of the hash
- do probabilistic counting within each bucket to get its max rank a_i, then calculate Avg(a_i)
- the estimate is m * 2^Avg(a_i)
- this smooths out the outlier problem, and the estimate is no longer restricted to a power of 2 (see the sketch at the end of this section)
- LogLog
- on top of stochastic averaging, multiply by a normalization factor
- normalization factor is a function of m, has a formula to calculate
- HyperLogLog
- similar to stochastic averaging, but uses the harmonic mean instead of the arithmetic mean
- the normalization factor is again a function of m; when the cardinality is small, switch to linear counting
- HyperLogLog++
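A toy sketch of probabilistic counting with stochastic averaging. The bucket count (m = 64), the use of SHA-256, and the function names are my choices; 0.39701 is the asymptotic value of the LogLog normalization factor mentioned above, assumed here to be adequate for this m.

```python
import hashlib

B = 6            # first B bits of the hash pick the bucket, so m = 2^B buckets
M = 1 << B

def rank(x: int) -> int:
    """Number of trailing zero bits plus one (the 'a' value above)."""
    a = 1
    while x and x & 1 == 0:
        a += 1
        x >>= 1
    return a

def estimate_cardinality(items) -> float:
    buckets = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - B)               # first B bits choose the bucket
        rest = h & ((1 << (64 - B)) - 1)     # remaining bits feed probabilistic counting
        buckets[bucket] = max(buckets[bucket], rank(rest))
    avg = sum(buckets) / M                   # arithmetic mean of the per-bucket ranks
    raw = M * 2 ** avg                       # stochastic-averaging estimate
    # LogLog multiplies by a normalization factor that depends on m;
    # 0.39701 is its asymptotic value for large m.
    return 0.39701 * raw

print(round(estimate_cardinality(range(10_000))))  # roughly 10_000 (error ~ 1.3/sqrt(m))
```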
Sampling (this chapter is quite difficult)
- Bernoulli sampling
- naive: use a pseudo-random number generator (PRNG), generate a uniformly distributed random number r between 0 and 1, if r < p, then include the element into the sample
- Poisson sampling
- Reservoir sampling
- Bernoulli and Poisson sampling have the drawback of a variable sample size
- reservoir size k
- include first k elements
- process each element after that: for the element with index i, include it with probability k/i, replacing a randomly chosen element already in the reservoir (see the sketch at the end of this section)
- Sliding window sampling
- chain sampling
- priority sampling
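A minimal reservoir-sampling sketch matching the k/i inclusion rule above (this is the classic Algorithm R); the function name is my own.

```python
import random

def reservoir_sample(stream, k: int):
    """Uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)        # always keep the first k items
        else:
            j = random.randint(1, i)      # include this item with probability k/i
            if j <= k:
                reservoir[j - 1] = item   # replace a uniformly chosen existing element
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```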
Quantiles (another difficult chapter)
- naive: sorting and selection problem, if the input is static
- quickselect: BFPRT gives a high-quality pivot by dividing the data into groups of 5, taking the median of each group, then recursively taking medians of the medians until only one is left (the median-of-medians algorithm); see the sketch at the end of this section
- digest: cluster data points and generate metadata like mean, count, etc. for each cluster
- T-digest
- Q-digest
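A quickselect sketch using the groups-of-5 pivot described above (medians taken repeatedly until one is left). This is a simplified illustration with my own function names, not the book's code.

```python
def select(items, k: int):
    """Return the k-th smallest element (1-based) via quickselect with a BFPRT-style pivot."""
    items = list(items)
    if len(items) == 1:
        return items[0]
    pivot = median_of_medians(items)
    lows = [x for x in items if x < pivot]
    pivots = [x for x in items if x == pivot]
    if k <= len(lows):
        return select(lows, k)
    if k <= len(lows) + len(pivots):
        return pivot
    highs = [x for x in items if x > pivot]
    return select(highs, k - len(lows) - len(pivots))

def median_of_medians(items):
    # Split into groups of 5, take each group's median, then repeat on the medians.
    groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
    medians = [g[len(g) // 2] for g in groups]
    if len(medians) <= 5:
        return sorted(medians)[len(medians) // 2]
    return median_of_medians(medians)

data = [7, 1, 9, 4, 3, 8, 2, 6, 5]
print(select(data, k=5))  # 5, the median of the list
```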
External Sort
- B-tree: intermediate nodes store the values
- B+ tree: only leaf nodes store the values; leaf nodes are also linked together in a linked list, optimizing sequential access for range queries
- Bε-tree: allows tuning and tradeoff between inserts and lookups in external memory
- each internal node has a buffer: temporarily store inserts and deletes
- LSM-tree
- Cassandra, LevelDB, RocksDB, etc.
- External merge sort (see the sketch below)
- External quick sort
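A sketch of external merge sort on a line-oriented text file: sort memory-sized chunks into runs on disk, then k-way merge the runs with a min-heap (heapq.merge). The chunk size, file paths, and function name are illustrative choices of mine.

```python
import heapq
import os
import tempfile

# External merge sort sketch: read the input in runs of CHUNK_SIZE lines,
# sort each run in memory, spill it to a temp file, then k-way merge the
# sorted runs. CHUNK_SIZE is an illustrative in-memory budget, not a value
# from the book.
CHUNK_SIZE = 100_000

def external_sort(input_path: str, output_path: str) -> None:
    run_paths = []
    with open(input_path) as f:
        while True:
            # zip(range, f) takes at most CHUNK_SIZE lines without reading ahead.
            chunk = [line for _, line in zip(range(CHUNK_SIZE), f)]
            if not chunk:
                break
            chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            # heapq.merge lazily merges the already-sorted runs (k-way merge).
            out.writelines(heapq.merge(*runs))
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)

# Usage (hypothetical file names):
# external_sort("huge_unsorted.txt", "sorted.txt")
```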