BloomFilter 布隆过滤 - 瀚海星空

2012-12-20

周海汉 2012.12.20 听说世界末日狂欢要到了？

Bloom 1970年发明了布隆过滤器，对大数据尤其有用。

Bloom过滤原理：

如果要判断一个数是否在一个集合中，最自然的想法是直接遍历，如果找到，则存在;没有找到，则不存在。如果数据是有序集合，则可以用折半查找。

Hash方法是一种很好的映射方式，方便快速查找。我们可以先用Hash做索引，如果数据Hash后在索引表中存在，则数据可能存在。因为Hash可能存在碰撞的可能。

Bloom过滤是对Hash的进一步优化，采用一组Hash，获得数据不同的位。如果每个位都置1了，则可能存在数据，这是考虑到很少的碰撞情况。如果有一个位没有置1，则该数据必然不存在。由于一个很大的数据在Hash中仅占几个位，所以极大的减少了对磁盘的访问和加快了查找数据速度。

Bloom过滤步骤：

假设Bloom数位列表长度n=20，每个数据D，采用k=3个Hash函数，获取到3个不同位置,并将相应的位置置1.

如下图所示HASHSET(D1)={Hash1(D1),Hash2(D1),Hash3(D1)}={2,8,16},则将2，8，16三个位置置1，红色表示。

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
	1			1			1				1				1

HASHSET(D2)={Hash1(D2),Hash2(D2),Hash3(D2)}={2,5,12},则将2，5，12三个位置置1，黑色表示。注意到2位置D2和D1重复了。没关系，仍然置1，这不妨碍找到D1和D2.

现在如果有D3，HASHSET(D3)={Hash1(D3),Hash2(D3),Hash3(D3)}={2,5,12}，那么我们认为D3是存在的。但也可能因为Hash碰撞原因不存在。所以需要读完磁盘的真实数据才知道。

现在如果有D4，HASHSET(D4)={Hash1(D4),Hash2(D4),Hash3(D4)}={5,12,19}，则该数据D4必然不存在。

Hadoop有对Bloom Filter的实现 代码 [zhouhh@Hadoop48 hadoop-1.0.4]$ cd ./src/core/org/apache/hadoop/util/bloom/ [zhouhh@Hadoop48 bloom]$ ls BloomFilter.java DynamicBloomFilter.java HashFunction.java RemoveScheme.java CountingBloomFilter.java Filter.java Key.java RetouchedBloomFilter.java [zhouhh@Hadoop48 bloom]$ cat BloomFilter.java

public class BloomFilter extends Filter {
  private static final byte[] bitvalues = new byte[] {
    (byte)0x01,
    (byte)0x02,
    (byte)0x04,
    (byte)0x08,
    (byte)0x10,
    (byte)0x20,
    (byte)0x40,
    (byte)0x80
  };

  /** The bit vector. */
  BitSet bits;

  /** Default constructor - use with readFields */
  public BloomFilter() {
    super();
  }

  /**
   * Constructor
   * @param vectorSize The vector size of <em>this</em> filter.数组大小
   * @param nbHash The number of hash function to consider.多少个Hash函数
   * @param hashType type of the hashing function 什么hash函数(see
   * {@link org.apache.hadoop.util.hash.Hash}).
   */
  public BloomFilter(int vectorSize, int nbHash, int hashType) {
    super(vectorSize, nbHash, hashType);

    bits = new BitSet(this.vectorSize);
  }

  @Override
  public void add(Key key) { //将key 添加到Bloom位数组
    if(key == null) {
      throw new NullPointerException("key cannot be null");
    }

    int[] h = hash.hash(key); //这返回一组位置
    hash.clear();

    for(int i = 0; i < nbHash; i++) {
      bits.set(h[i]);
    }
  }

  @Override
  public void and(Filter filter) {
    if(filter == null
        || !(filter instanceof BloomFilter)
        || filter.vectorSize != this.vectorSize
        || filter.nbHash != this.nbHash) {
      throw new IllegalArgumentException("filters cannot be and-ed");
    }

    this.bits.and(((BloomFilter) filter).bits);
  }

  @Override
  public boolean membershipTest(Key key) { //检查是否在集合内
    if(key == null) {
      throw new NullPointerException("key cannot be null");
    }

    int[] h = hash.hash(key);
    hash.clear();
    for(int i = 0; i < nbHash; i++) {
      if(!bits.get(h[i])) { //比较每个Hash位置，如果有一个不同，则返回失败
        return false;
      }
    }
    return true; //返回真，但可能会有碰撞误报
  }

  @Override
  public void not() {
    bits.flip(0, vectorSize - 1);
  }

  @Override
  public void or(Filter filter) {
    if(filter == null
        || !(filter instanceof BloomFilter)
        || filter.vectorSize != this.vectorSize
        || filter.nbHash != this.nbHash) {
      throw new IllegalArgumentException("filters cannot be or-ed");
    }
    bits.or(((BloomFilter) filter).bits);
  }

  @Override
  public void xor(Filter filter) {
    if(filter == null
        || !(filter instanceof BloomFilter)
        || filter.vectorSize != this.vectorSize
        || filter.nbHash != this.nbHash) {
      throw new IllegalArgumentException("filters cannot be xor-ed");
    }
    bits.xor(((BloomFilter) filter).bits);
  }

  @Override
  public String toString() {
    return bits.toString();
  }

  /**
   * @return size of the the bloomfilter
   */
  public int getVectorSize() {
    return this.vectorSize;
  }

  // Writable

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);
    byte[] bytes = new byte[getNBytes()];
    for(int i = 0, byteIndex = 0, bitIndex = 0; i < vectorSize; i++, bitIndex++) {
      if (bitIndex == 8) {
        bitIndex = 0;
        byteIndex++;
      }
      if (bitIndex == 0) {
        bytes[byteIndex] = 0;
      }
      if (bits.get(i)) {
        bytes[byteIndex] |= bitvalues[bitIndex];
      }
    }
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    bits = new BitSet(this.vectorSize); //位集合
    byte[] bytes = new byte[getNBytes()];
    in.readFully(bytes);
    for(int i = 0, byteIndex = 0, bitIndex = 0; i < vectorSize; i++, bitIndex++) {
      if (bitIndex == 8) {
        bitIndex = 0;
        byteIndex++;
      }
      if ((bytes[byteIndex] & bitvalues[bitIndex]) != 0) {
        bits.set(i);
      }
    }
  }

  /* @return number of bytes needed to hold bit vector */
  private int getNBytes() {
    return (vectorSize + 7) / 8;
  }
}//end class

[zhouhh@Hadoop48 bloom]$ cat HashFunction.java 这是Hash函数

package org.apache.hadoop.util.bloom;

import org.apache.hadoop.util.hash.Hash;

/**
 * Implements a hash object that returns a certain number of hashed values.
 *
 * @see Key The general behavior of a key being stored in a filter
 * @see Filter The general behavior of a filter
 */
public final class HashFunction {
  /** The number of hashed values. */
  private int nbHash;

  /** The maximum highest returned value. */
  private int maxValue;

  /** Hashing algorithm to use. */
  private Hash hashFunction;

  /**
   * Constructor.
   *
* /
public HashFunction(int maxValue, int nbHash, int hashType) {
    if (maxValue  0");
    }

    if (nbHash  0");
    }

    this.maxValue = maxValue;
    this.nbHash = nbHash;
    this.hashFunction = Hash.getInstance(hashType);
    if (this.hashFunction == null)
      throw new IllegalArgumentException("hashType must be known");
  }

  /** Clears thishash function. A NOOP */
  public void clear() {
  }

  /**
   * Hashes a specified key into several integers.
   * @param k The specified key.
   * @return The array of hashed values.
   */
  public int[] hash(Key k){
      byte[] b = k.getBytes();
      if (b == null) {
        throw new NullPointerException("buffer reference is null");
      }
      if (b.length == 0) {
        throw new IllegalArgumentException("key length must be > 0");
      }
      int[] result = new int[nbHash];
      for (int i = 0, initval = 0; i < nbHash; i++) {
          initval = hashFunction.hash(b, initval);//每个Hash函数是用初始值来区分的。初始值又作为下次输入。
          result[i] = Math.abs(initval % maxValue);//对数组长度，求余，以免溢出
      }
      return result;
  }
}

参考： http://www.cnblogs.com/yuyijq/archive/2012/02/08/2343374.html http://en.wikipedia.org/wiki/Bloom_filter Hbase 官方文档最新中文版http://abloz.com/hbase/book.html#perf.reading

如非注明转载, 均为原创. 本站遵循知识共享CC协议,转载请注明来源

FEATURED TAGS

css vc6 http automake linux make makefile voip 乱码 awk flash vista vi vim javascript pietty putty ssh posix subversion svn windows 删除编译多线程 wxwidgets ie ubuntu 开源 c python bash 备份性能 scp 汉字 log ruby 中文 bug msn nginx php shell wordpress mqueue android eclipse java mac ios html5 js mysql protobuf apache hadoop install iocp twisted centos mapreduce hbase thrift tutorial hive erlang lucene hdfs sqoop utf8 filter 草原 yarn ganglia 恢复 scrapy django fsimage flume tail flume-ng mining scala go kafka gradle cassandra baas spring postgres maven mybatis mongodb https nodejs 镜像心理学机器学习 Keras theano anaconda docker spark akka-http json 群论区块链加密抽象代数离散对数同余欧拉函数扩展欧几里德算法 ES6 node-inspect debug win10 vscode 挖矿

FEATURED TAGS

FRIENDS