Bloom Filter

algo.data.bloom provides implementations of Bloom filter, which can store a large amount of data by allowing small errors. There are the two following implementations:

Creating a BloomFilter instance

BloomFilter.apply method creates an instance of BloomFilter as the following:

Immutable
import HashFunctionFactory.Implicits.default

val bloomFilter: immutable.BloomFilter[String] =
  immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
Mutable
import HashFunctionFactory.Implicits.default

val bloomFilter: mutable.BloomFilter[String] =
  mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)

BloomFilter.apply method takes two parameters capacity and tolerance. Further, BloomFilter also takes an instance of HashFunctionFactory as a context (implicit) parameter. These parameters determine the characteristics of BloomFilter.

Characteristics

Parameters the capacity and tolerance are tunable for a false positive probability. The false positive probability is less than the tolerance if the number of items BloomFilter contains is less than or equal to the capacity. If a lower false positive probability is required, use a higher capacity or a lower tolerance.

Based on the capacity and tolerance, BloomFilter determines hash space (the number of bits) and the number of hash functions to use. BloomFilter requires a larger hash space and more hash functions if a higher capacity or lower tolerance is requested. In other words, BloomFilter requires a larger hash space if it should achieve a lower false positive probability.

Note that the capacity is not the hard limit of the number of items BloomFilter can contain. While BloomFilter can contain more items than the capacity, it cannot achieve a lower false positive probability than the tolerance.

Getting parameters

BloomFilter provides functions returning parameters the following:

  • The capacity
  • The tolerance
  • The number of bits
  • The number of hash functions
Immutable
val bloomFilter =
  immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)

// Getting the capacity
assert(bloomFilter.capacity == 1_000_000)

// Getting the tolerance
assert(bloomFilter.tolerance == 0.001)

// Getting the number of bits
assert(bloomFilter.numOfBits == 14_377_588)

// Getting the number of hash functions
assert(bloomFilter.numOfHashFunctions == 10)
Mutable
val bloomFilter =
  mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)

// Getting the capacity
assert(bloomFilter.capacity == 1_000_000)

// Getting the tolerance
assert(bloomFilter.tolerance == 0.001)

// Getting the number of bits
assert(bloomFilter.numOfBits == 14_377_588)

// Getting the number of hash functions
assert(bloomFilter.numOfHashFunctions == 10)

Adding an item

BloomFilter provides add method to add the given item to itself:

Immutable
val bloomFilter =
  immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)

val newBloomFilter: immutable.BloomFilter[String] =
  bloomFilter.add("item1")
Mutable
val bloomFilter =
  mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)

bloomFilter.add("item1")

add method of immutable.BloomFilter returns a new BloomFilter having the given item. add method of mutable.BloomFilter updates itself to contain the given item. In both, BloomFilter increments its size by one if it doesn’t have the given item.

Testing membership of an item

BloomFilter provides contains method to test whether BloomFilter contains the given item or not:

Immutable
val bloomFilter =
  immutable
    .BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
assert(!bloomFilter.contains("item1"))

val newBloomFilter = bloomFilter.add("item1")
assert(newBloomFilter.contains("item1"))
Mutable
val bloomFilter =
  mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
assert(!bloomFilter.contains("item1"))

bloomFilter.add("item1")
assert(bloomFilter.contains("item1"))

contains method returns true if BloomFilter contains the given item. Note that BloomFilter can return true if it doesn’t have the given item but has the other item with the same hash value, which is a false positive case. In contrast, BloomFilter doesn’t return false if it has the given item.

Getting the size and false positive probability

BloomFilter provides size method to get the number of items it has:

Immutable
val bloomFilter =
  immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
assert(bloomFilter.size == 0)

val newBloomFilter = bloomFilter.add("item1")
assert(newBloomFilter.size == 1)
Mutable
val bloomFilter =
  mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
assert(bloomFilter.size == 0)

bloomFilter.add("item1")
assert(bloomFilter.size == 1)

BloomFilter also provides falsePositiveProbability method to get the current false positive probability:

Immutable
val bloomFilter =
  immutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
assert(bloomFilter.falsePositiveProbability == 0.0)

val newBloomFilter = bloomFilter.add("item1")
println(newBloomFilter.falsePositiveProbability == 2.6493427380062913e-62)
Mutable
val bloomFilter =
  mutable.BloomFilter[String](capacity = 1_000_000, tolerance = 0.001)
assert(bloomFilter.falsePositiveProbability == 0.0)

bloomFilter.add("item1")
assert(bloomFilter.falsePositiveProbability == 2.6493427380062913e-62)

Hash Function Factory

BloomFilter, a probabilistic collection, hashes the given item using hash functions. The hash functions are crucial for a false positive probability. HashFunctionFactory creates these hash functions.

HashFunctionFactory.Implicits.default provides the default implementation of HashFunctionFactory. This default is easier to use but might not be a suitable performance. It is possible to implement a custom HashFunctionFactory as the following:

implicit def customHashFunctionFactory[T]: HashFunctionFactory[T] =
  new HashFunctionFactory[T] {
    override def apply(index: Int): T => Int = {
      // Return a hash function for the given index.
      ???
    }
  }