ADS1_: Problems for practicals (hashing)

Compute frequencies of words in a text (sperated by spaces). Output them sorted by number of occurrences.
Design algorithm that given a string and value m find all duplicate aperances of subwords of length m.
Bloom filter is a data-structure for approximate representation of sets. It has a bit array B[1...p] and a hash function $h$ which assigns keys to indexes in array. Insert(x) sets B[h(x)] to 1.Find(x) tests if B[h(x)]=1.

Lets insert some set S to the filter. Put n=|S|. It is possible that Find(x) returns true also for an element x which is not in S. This is the case that S contains an element y, h(y)=h(x). Given $p$ and $n$ compute the probability of such false positive.
Bloom filter reliability can be improved by using k filters with different hash functions. Insert adds the element into all filters while Find test that all filters have corresponding bit set. If the probability of an error of one filter is p, the combined error probabilitz is p^k. Figure out how to set p and k if we want to represent set with 10⁶ elements with error probability at most 10⁹. Minimise the memory use.

Naposledy změněno: středa, 15. dubna 2020, 19.50