Bloom Filter
algo.data.bloom
provides implementations of Bloom filter, which can store a large amount of data by allowing small errors. There are the two following implementations:
Creating a BloomFilter instance
BloomFilter.apply
method creates an instance of BloomFilter
as the following:
- Immutable
-
import HashFunctionFactory.Implicits.default val bloomFilter: immutable.BloomFilter[String] = immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
- Mutable
-
import HashFunctionFactory.Implicits.default val bloomFilter: mutable.BloomFilter[String] = mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
BloomFilter.apply
method takes two parameters capacity and tolerance. Further, BloomFilter
also takes an instance of HashFunctionFactory as a context (implicit) parameter. These parameters determine the characteristics of BloomFilter
.
Characteristics
Parameters the capacity and tolerance are tunable for a false positive probability. The false positive probability is less than the tolerance if the number of items BloomFilter
contains is less than or equal to the capacity. If a lower false positive probability is required, use a higher capacity or a lower tolerance.
Based on the capacity and tolerance, BloomFilter
determines hash space (the number of bits) and the number of hash functions to use. BloomFilter
requires a larger hash space and more hash functions if a higher capacity or lower tolerance is requested. In other words, BloomFilter
requires a larger hash space if it should achieve a lower false positive probability.
Note that the capacity is not the hard limit of the number of items BloomFilter
can contain. While BloomFilter
can contain more items than the capacity, it cannot achieve a lower false positive probability than the tolerance.
Getting parameters
BloomFilter
provides functions returning parameters the following:
- The capacity
- The tolerance
- The number of bits
- The number of hash functions
- Immutable
-
val bloomFilter = immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) // Getting the capacity assert(bloomFilter.capacity == 1_000_000) // Getting the tolerance assert(bloomFilter.tolerance == 0.001) // Getting the number of bits assert(bloomFilter.numOfBits == 14_377_588) // Getting the number of hash functions assert(bloomFilter.numOfHashFunctions == 10)
- Mutable
-
val bloomFilter = mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) // Getting the capacity assert(bloomFilter.capacity == 1_000_000) // Getting the tolerance assert(bloomFilter.tolerance == 0.001) // Getting the number of bits assert(bloomFilter.numOfBits == 14_377_588) // Getting the number of hash functions assert(bloomFilter.numOfHashFunctions == 10)
Adding an item
BloomFilter
provides add
method to add the given item to itself:
- Immutable
-
val bloomFilter = immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) val newBloomFilter: immutable.BloomFilter[String] = bloomFilter.add("item1")
- Mutable
-
val bloomFilter = mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) bloomFilter.add("item1")
add
method of immutable.BloomFilter
returns a new BloomFilter having the given item. add
method of mutable.BloomFilter
updates itself to contain the given item. In both, BloomFilter
increments its size by one if it doesn’t have the given item.
Testing membership of an item
BloomFilter
provides contains
method to test whether BloomFilter
contains the given item or not:
- Immutable
-
val bloomFilter = immutable .BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) assert(!bloomFilter.contains("item1")) val newBloomFilter = bloomFilter.add("item1") assert(newBloomFilter.contains("item1"))
- Mutable
-
val bloomFilter = mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) assert(!bloomFilter.contains("item1")) bloomFilter.add("item1") assert(bloomFilter.contains("item1"))
contains
method returns true if BloomFilter
contains the given item. Note that BloomFilter
can return true if it doesn’t have the given item but has the other item with the same hash value, which is a false positive case. In contrast, BloomFilter
doesn’t return false if it has the given item.
Getting the size and false positive probability
BloomFilter
provides size
method to get the number of items it has:
- Immutable
-
val bloomFilter = immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) assert(bloomFilter.size == 0) val newBloomFilter = bloomFilter.add("item1") assert(newBloomFilter.size == 1)
- Mutable
-
val bloomFilter = mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) assert(bloomFilter.size == 0) bloomFilter.add("item1") assert(bloomFilter.size == 1)
BloomFilter
also provides falsePositiveProbability
method to get the current false positive probability:
- Immutable
-
val bloomFilter = immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) assert(bloomFilter.falsePositiveProbability == 0.0) val newBloomFilter = bloomFilter.add("item1") println(newBloomFilter.falsePositiveProbability == 2.6493427380062913e-62)
- Mutable
-
val bloomFilter = mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001) assert(bloomFilter.falsePositiveProbability == 0.0) bloomFilter.add("item1") assert(bloomFilter.falsePositiveProbability == 2.6493427380062913e-62)
Hash Function Factory
BloomFilter
, a probabilistic collection, hashes the given item using hash functions. The hash functions are crucial for a false positive probability. HashFunctionFactory
creates these hash functions.
HashFunctionFactory.Implicits.default
provides the default implementation of HashFunctionFactory
. This default is easier to use but might not be a suitable performance. It is possible to implement a custom HashFunctionFactory
as the following:
implicit def customHashFunctionFactory[T]: HashFunctionFactory[T] =
new HashFunctionFactory[T] {
override def apply(index: Int): T => Int = {
// Return a hash function for the given index.
???
}
}