Learning Notes #6 Bloom Filters – A Probabilistic Data Structure

By: Mr.ParottaSalna

23 December 2024 at 14:24

I first came across Bloom filters when I wanted to implement a username availability check like the one on Instagram. Today I came back to refresh my understanding of Bloom filters and write notes for my future self.

What is a Bloom Filter?

A Bloom filter is a space-efficient, probabilistic data structure designed to test whether an element is part of a set. It can return two types of results:

True: The element is probably in the set.
False: The element is definitely not in the set.

Notably, Bloom filters do not store the actual elements themselves, and there is a chance of false positives, but never false negatives.

If it says a given word is not present, we can be 100% sure about it. This is the benefit we get from Bloom filters.

But setting up a Bloom filter is not an easy task, as you will soon see.

How Does a Bloom Filter Work?

A Bloom filter uses a bit array of size m and k independent hash functions. Here’s how it operates:

Adding an Element
- Compute a hash value for the element with each of the k hash functions.
- Map these hash values to positions in the bit array.
- Set the corresponding bits to 1.
Querying an Element
- Compute a hash value for the element with each of the k hash functions.
- Check the corresponding bits in the bit array.
- If all bits are 1, the element is probably in the set. If any bit is 0, the element is definitely not in the set.
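
The add and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method `might_contain`, and the choice of salted SHA-256 digests to stand in for k independent hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal sketch: a bit array of m bits and k salted SHA-256 hashes (assumed)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item: str):
        # Derive k positions by prefixing the item with the hash index (illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at every position computed for the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # All bits set: probably present. Any bit clear: definitely absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("parotta")
print(bf.might_contain("parotta"))  # an added element is never reported as absent
print(bf.might_contain("salna"))    # usually False, but a false positive is possible
```

Note that the filter stores only bits, so there is no way to list or remove the elements that were added.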

As you can imagine, as we keep adding elements to a small bit array, more and more bits get set to 1 and the false positive rate increases. Choosing the right number of hash functions matters too.
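
To see this effect, here is a small simulation sketch. It assumes idealized hash functions, modeled by picking random bit positions with Python's `random` module, and the function name `simulated_fpr` and the values m = 1000 and k = 3 are chosen only for illustration.

```python
import random


def simulated_fpr(m: int, k: int, n: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate the false positive rate with random positions standing in for hashes."""
    rng = random.Random(seed)
    bits = [False] * m

    # Insert n elements: each one sets k (randomly chosen) bits.
    for _ in range(n):
        for _ in range(k):
            bits[rng.randrange(m)] = True

    # Query elements that were never added: each one checks k random bits.
    false_hits = 0
    for _ in range(trials):
        if all(bits[rng.randrange(m)] for _ in range(k)):
            false_hits += 1
    return false_hits / trials


# Same bit array size and hash count, increasing number of stored elements.
for n in (50, 100, 200, 400):
    print(f"n={n:>3}  estimated false positive rate: {simulated_fpr(m=1000, k=3, n=n):.4f}")
```

The estimated rate climbs as n grows because a larger fraction of the array is already set to 1 when a query arrives.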

Setting Parameters

To use a Bloom filter effectively, it’s important to set the parameters appropriately:

Bit Array Size (m):
- The size of the bit array determines the capacity and accuracy of the filter.
- A larger m reduces the false positive rate but requires more memory.
Number of Hash Functions (k):
- The number of hash functions affects the distribution of bits set to 1.
- An optimal k minimizes the false positive rate for a given m and number of elements (n).
Number of Elements (n):
- Estimate the number of elements to be stored to configure m and k appropriately.

Fortunately, there are well-known formulas that relate these parameters.

Bit Array Size

The false positive rate is the probability that an element that was never added is incorrectly reported as present in the Bloom filter. It depends on the size of the bit array (m), the number of hash functions (k), and the number of elements inserted (n). To achieve a desired false positive rate, we can calculate the optimal bit array size using the formula

$$m = -\frac{n \ln p}{(\ln 2)^2}$$

Here, p denotes the desired false positive rate.

Optimal Number of Hash Functions

The optimal number of hash functions (k) is determined by the size of the bit array and the number of elements to be inserted. It can be calculated using the formula

$$k = \frac{m}{n} \ln 2$$

At this value roughly half of the bits in the array end up set to 1, which minimizes the false positive rate for the given m and n.

Probability of False Positives

The probability of false positives (P_fp) is influenced by the number of hash functions (k), the bit array size (m), and the number of elements inserted (n). It can be estimated using the formula

$$P_{fp} \approx \left(1 - e^{-kn/m}\right)^{k}$$
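
As a quick check of how k affects this probability, the sketch below evaluates the standard approximation (1 - e^(-kn/m))^k for a range of k values. The function name `false_positive_rate` and the sample values m = 10,000 and n = 1,000 are assumptions chosen for illustration.

```python
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation of the false positive probability."""
    return (1 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000  # illustrative values: 10 bits per element

for k in range(1, 15):
    print(f"k={k:>2}  P_fp ~ {false_positive_rate(m, k, n):.5f}")

# The k with the lowest estimated rate should sit near (m / n) * ln(2).
best_k = min(range(1, 15), key=lambda k: false_positive_rate(m, k, n))
print(f"best integer k: {best_k}, (m/n) * ln(2) = {(m / n) * math.log(2):.2f}")
```

The rate falls as k increases toward (m / n) * ln(2) and rises again beyond it, which is why both too few and too many hash functions hurt accuracy.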

Putting It All Together (Python Code)

Setting the fpr (false positive rate) to 1%, let’s calculate the bit array size and the number of hash functions.


import math

# Expected number of items in the collection
n = 300_000

# Acceptable false-positive rate (0.01 = 1%)
fpr = 0.01

# Optimal size (number of bits in the bit array)
# m = -((n * ln(p)) / (ln(2)^2))
m = -(n * math.log(fpr)) / (math.log(2) ** 2)

# Optimal number of hash functions
# k = (m / n) * ln(2)
k = (m / n) * math.log(2)

print(f"Optimal Bloom filter size: {math.ceil(m)} bits")
print(f"Optimal number of hash functions: {math.ceil(k)}")
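
To put the resulting size in perspective, the bit count can be converted into bits per element and into memory. This is a small follow-up sketch using the same n and fpr as above; the variable names `bits_per_element` and `size_kib` are introduced here only for illustration.

```python
import math

n = 300_000   # expected number of items (same as above)
fpr = 0.01    # acceptable false-positive rate (1%)

# Same sizing formula as above, rounded up to a whole number of bits.
m = math.ceil(-(n * math.log(fpr)) / (math.log(2) ** 2))

bits_per_element = m / n      # how many bits the filter spends per stored item
size_kib = m / 8 / 1024       # bits -> bytes -> KiB

print(f"Bits per element: {bits_per_element:.2f}")
print(f"Approximate memory: {size_kib:.0f} KiB")
```

With these inputs the filter needs a bit under 10 bits per element, regardless of how long the usernames themselves are.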

Practical Considerations

Hash Functions:
- Choose independent and uniformly distributed hash functions to minimize collisions.
- Common choices include MurmurHash and FNV.
Performance:
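- Every insert and lookup computes k hashes, so more hash functions increase computational cost. Up to the optimal k they reduce the false positive rate; beyond it they fill the bit array faster and the rate rises again.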
- Balance the number of hash functions to achieve acceptable performance.
Capacity Planning:
- Overestimating n leads to wasted space; underestimating increases the false positive rate.
- Plan for future growth to maintain efficiency.
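
The capacity planning trade-off can be illustrated by sizing a filter for a planned n and then estimating the false positive rate when the actual number of elements differs. This is a sketch under assumptions: the helper names `optimal_parameters` and `false_positive_rate` are hypothetical, and the rate is estimated with the standard approximation (1 - e^(-kn/m))^k.

```python
import math


def optimal_parameters(n: int, p: float):
    """Return (m, k) sized for n elements at a target false positive rate p."""
    m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
    k = math.ceil((m / n) * math.log(2))
    return m, k


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation of the false positive probability."""
    return (1 - math.exp(-k * n / m)) ** k


planned_n = 300_000
m, k = optimal_parameters(planned_n, 0.01)

# Compare half the planned load, the planned load, and double the planned load.
for actual_n in (150_000, 300_000, 600_000):
    rate = false_positive_rate(m, k, actual_n)
    print(f"actual n={actual_n:>7}  estimated false positive rate: {rate:.4%}")
```

At half the planned load most of the array is unused space, while at double the load the estimated rate is far above the 1% target.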

Online Calculator : https://hur.st/bloomfilter/?utm_source=parottasalna.com
