A bloom filter is a form of hash table that is used to identify whether or not an object exists within a given set. This filter is designed to filter out known negatives rather than being the absolute confirmation of the existance of the object, making it useful to speed up access for cached elements.
A bloom filter is a large bit array, with all entries initially set to zero.
To store an entry in a bloom filter, produce a set number of hash entries, and set the element associated with each hash to 1; the value is unchanged if it is already set. To search for an entry, perform the same hashing algorithm, and confirm that all associated elements are non-zero.
There is no set algorithm for the hash functions, as long as they appear to be sufficiently random.
O(1) in storing or retrieving information. It is not possible to iterate through a bloom filter, nor it is possible to delete items from said filter.
There are no false negatives, but false positives can appear as the table fills.
False positive rate
Given that m is the size of the table in bits, n is the number of active keys inserted into the table, and k is the number of hash functions used, the probability of a given bit remaining as zero is:
The chance of a false positive is:
To minimize the number of false positives, k needs to be high enough to provide enough bits to prove non-existance of an element, but not too high to fill up the table. Estimating the expected load will help determine the optimal number of hashes.
If deletion from a bloom filter is desired, you can attempt to use a counter array rather tha a bit array. However, this greatly reduces the efficiency of the table but allows you to safely delete elements.
See The Bloom filter from Google Teck talks for other common variations.
- Bloom Filters - the math - Table 3; False positive rate under various m/n and k combinations