什么情況下需要布隆過(guò)濾器?
先來(lái)看幾個(gè)比較常見(jiàn)的例子
字處理軟件中,需要檢查一個(gè)英語(yǔ)單詞是否拼寫正確
在 FBI,一個(gè)嫌疑人的名字是否已經(jīng)在嫌疑名單上
在網(wǎng)絡(luò)爬蟲(chóng)里,一個(gè)網(wǎng)址是否被訪問(wèn)過(guò)
yahoo, gmail等郵箱垃圾郵件過(guò)濾功能
這幾個(gè)例子有一個(gè)共同的特點(diǎn):如何判斷一個(gè)元素是否存在一個(gè)集合中?
常規(guī)思路
數(shù)組
鏈表
樹(shù)、平衡二叉樹(shù)、Trie
Map (紅黑樹(shù))
哈希表
雖然上面描述的這幾種數(shù)據(jù)結(jié)構(gòu)配合常見(jiàn)的排序、二分搜索可以快速高效的處理絕大部分判斷元素是否存在集合中的需求。但是當(dāng)集合里面的元素?cái)?shù)量足夠大,如果有500萬(wàn)條記錄甚至1億條記錄呢?這個(gè)時(shí)候常規(guī)的數(shù)據(jù)結(jié)構(gòu)的問(wèn)題就凸顯出來(lái)了。數(shù)組、鏈表、樹(shù)等數(shù)據(jù)結(jié)構(gòu)會(huì)存儲(chǔ)元素的內(nèi)容,一旦數(shù)據(jù)量過(guò)大,消耗的內(nèi)存也會(huì)呈現(xiàn)線性增長(zhǎng),最終達(dá)到瓶頸。有的同學(xué)可能會(huì)問(wèn),哈希表不是效率很高嗎?查詢效率可以達(dá)到O(1)。但是哈希表需要消耗的內(nèi)存依然很高。使用哈希表存儲(chǔ)一億 個(gè)垃圾 email 地址的消耗?哈希表的做法:首先,哈希函數(shù)將一個(gè)email地址映射成8字節(jié)信息指紋;考慮到哈希表存儲(chǔ)效率通常小于50%(哈希沖突);因此消耗的內(nèi)存:8 * 2 * 1億 字節(jié) = 1.6G 內(nèi)存,普通計(jì)算機(jī)是無(wú)法提供如此大的內(nèi)存。這個(gè)時(shí)候,布隆過(guò)濾器(Bloom Filter)就應(yīng)運(yùn)而生。在繼續(xù)介紹布隆過(guò)濾器的原理時(shí),先講解下關(guān)于哈希函數(shù)的預(yù)備知識(shí)。
哈希函數(shù)
哈希函數(shù)的概念是:將任意大小的數(shù)據(jù)轉(zhuǎn)換成特定大小的數(shù)據(jù)的函數(shù),轉(zhuǎn)換后的數(shù)據(jù)稱為哈希值或哈希編碼。下面是一幅示意圖:
什么是布隆過(guò)濾器?本質(zhì)上布隆過(guò)濾器( BloomFilter )是一種數(shù)據(jù)結(jié)構(gòu),比較巧妙的概率型數(shù)據(jù)結(jié)構(gòu)(probabilistic data structure),特點(diǎn)是高效地插入和查詢,可以用來(lái)告訴你 “某樣?xùn)|西一定不存在或者可能存在”。
相比于傳統(tǒng)的 Set、Map 等數(shù)據(jù)結(jié)構(gòu),它更高效、占用空間更少,但是缺點(diǎn)是其返回的結(jié)果是概率性的,而不是確切的。
布隆過(guò)濾器原理
布隆過(guò)濾器內(nèi)部維護(hù)一個(gè)bitArray(位數(shù)組), 開(kāi)始所有數(shù)據(jù)全部置 0 。當(dāng)一個(gè)元素過(guò)來(lái)時(shí),能過(guò)多個(gè)哈希函數(shù)(hash1,hash2,hash3…)計(jì)算不同的在哈希值,并通過(guò)哈希值找到對(duì)應(yīng)的bitArray下標(biāo)處,將里面的值 0 置為 1 。需要說(shuō)明的是,布隆過(guò)濾器有一個(gè)誤判率的概念,誤判率越低,則數(shù)組越長(zhǎng),所占空間越大。誤判率越高則數(shù)組越小,所占的空間越小。
以上圖為例,具體的操作流程:假設(shè)集合里面有3個(gè)元素{x, y, z},哈希函數(shù)的個(gè)數(shù)為3。首先將位數(shù)組進(jìn)行初始化,將里面每個(gè)位都設(shè)置位0。對(duì)于集合里面的每一個(gè)元素,將元素依次通過(guò)3個(gè)哈希函數(shù)進(jìn)行映射,每次映射都會(huì)產(chǎn)生一個(gè)哈希值,這個(gè)值對(duì)應(yīng)位數(shù)組上面的一個(gè)點(diǎn),然后將位數(shù)組對(duì)應(yīng)的位置標(biāo)記為1。查詢W元素是否存在集合中的時(shí)候,同樣的方法將W通過(guò)哈希映射到位數(shù)組上的3個(gè)點(diǎn)。如果3個(gè)點(diǎn)的其中有一個(gè)點(diǎn)不為1,則可以判斷該元素一定不存在集合中。反之,如果3個(gè)點(diǎn)都為1,則該元素可能存在集合中。注意:此處不能判斷該元素是否一定存在集合中,可能存在一定的誤判率??梢詮膱D中可以看到:假設(shè)某個(gè)元素通過(guò)映射對(duì)應(yīng)下標(biāo)為4,5,6這3個(gè)點(diǎn)。雖然這3個(gè)點(diǎn)都為1,但是很明顯這3個(gè)點(diǎn)是不同元素經(jīng)過(guò)哈希得到的位置,因此這種情況說(shuō)明元素雖然不在集合中,也可能對(duì)應(yīng)的都是1,這是誤判率存在的原因。## 為什么不直接使用hashtable
Hash table的存儲(chǔ)效率一般只有50%,為了避免碰撞的問(wèn)題,一般哈希存儲(chǔ)到一半的時(shí)候都采取內(nèi)存翻倍或者其他措施,所以很耗費(fèi)內(nèi)存。
Hash面臨的問(wèn)題就是沖突。假設(shè) Hash 函數(shù)是良好的,如果我們的位陣列長(zhǎng)度為 m個(gè)點(diǎn),那么如果我們想將沖突率降低到例如 1%, 這個(gè)散列表就只能容納 m/100 個(gè)元素。解決方法較簡(jiǎn)單, 使用k>1的布隆過(guò)濾器,即k個(gè)函數(shù)將每個(gè)元素改為對(duì)應(yīng)于k個(gè)bits,因?yàn)檎`判度會(huì)降低很多,并且如果參數(shù)k和m選取得好,一半的m可被置為1。
布隆過(guò)濾器的設(shè)計(jì)一般會(huì)由用戶決定增加元素的個(gè)數(shù)n和誤差率p,后面的所有參數(shù)都由系統(tǒng)來(lái)設(shè)計(jì)。
1.首先根據(jù)傳入的n和p計(jì)算需要的內(nèi)存大小m bits:
2.再由m,n得到hash function個(gè)數(shù)k:
布隆過(guò)濾器添加元素
將要添加的元素給k個(gè)哈希函數(shù)
得到對(duì)應(yīng)于位數(shù)組上的k個(gè)位置
將這k個(gè)位置設(shè)為1
布隆過(guò)濾器查詢?cè)?/h3>
將要查詢的元素給k個(gè)哈希函數(shù)
得到對(duì)應(yīng)于位數(shù)組上的k個(gè)位置
如果k個(gè)位置有一個(gè)為0,則肯定不在集合中
如果k個(gè)位置全部為1,則可能在集合中
下圖是Bloomfilter誤判率表:
為每個(gè)URL分配兩個(gè)字節(jié)就可以達(dá)到千分之幾的沖突。比較保守的實(shí)現(xiàn)是,為每個(gè)URL分配4個(gè)字節(jié),項(xiàng)目和位數(shù)比是1∶32,誤判率是0.00000021167340。對(duì)于5000萬(wàn)數(shù)量級(jí)的URL,布隆過(guò)濾器只占用200MB的空間。
c++代碼:
定義布隆濾波器的結(jié)構(gòu)體
uint8_t cInitFlag; //初始化標(biāo)志
uint8_t cResv[3];
uint32_t dwMaxItems; //n - BloomFilter中最大元素個(gè)數(shù) (輸入量)
double dProbFalse; //p - 假陽(yáng)概率(誤判率) (輸入量,比如萬(wàn)分之一:0.00001)
uint32_t dwFilterBits; //m = ceil((n * log(p)) / log(1.0 / (pow(2.0, log(2.0))))); - BloomFilter的比特?cái)?shù)
uint32_t dwHashFuncs; // k = round(log(2.0) * m / n); - 哈希函數(shù)個(gè)數(shù)
uint32_t dwSeed; // MurmurHash的種子偏移量
uint32_t dwCount; // Add()的計(jì)數(shù),超過(guò)MAX_BLOOMFILTER_N則返回失敗
uint32_t dwFilterSize; // dwFilterBits / BYTE_BITS
unsigned char *pstFilter; // BloomFilter存儲(chǔ)指針,使用malloc分配
uint32_t *pdwHashPos; // 存儲(chǔ)上次hash得到的K個(gè)bit位置數(shù)組(由bloom_hash填充)
}BaseBloomFilter;
定義寫入文件的結(jié)構(gòu)體
uint32_t dwMagicCode; // 文件頭部標(biāo)識(shí),填充 __MGAIC_CODE__
uint32_t dwSeed;
uint32_t dwCount;
uint32_t dwMaxItems; // n - BloomFilter中最大元素個(gè)數(shù) (輸入量)
double dProbFalse; // p - 假陽(yáng)概率 (輸入量,比如萬(wàn)分之一:0.00001)
uint32_t dwFilterBits; // m = ceil((n * log(p)) / log(1.0 / (pow(2.0, log(2.0))))); - BloomFilter的比特?cái)?shù)
uint32_t dwHashFuncs; // k = round(log(2.0) * m / n); - 哈希函數(shù)個(gè)數(shù)
uint32_t dwResv[6];
uint32_t dwFileCrc; // (未使用)整個(gè)文件的校驗(yàn)和
uint32_t dwFilterSize; // 后面Filter的Buffer長(zhǎng)度
}BloomFileHead;
根據(jù)傳入的元素個(gè)數(shù)n和誤差率p, 計(jì)算布隆濾波器的內(nèi)存大小m bits和hash function個(gè)數(shù)k:
/**
* n - Number of items in the filter
* p - Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
* m - Number of bits in the filter
* k - Number of hash functions
*
* f = ln(2) × ln(1/2) × m / n = (0.6185) ^ (m/n)
* m = -1*n*ln(p)/((ln(2))^2) = -1*n*ln(p)/(ln(2)*ln(2)) = -1*n*ln(p)/(0.69314718055995*0.69314718055995))
* = -1*n*ln(p)/0.4804530139182079271955440025
* k = ln(2)*m/n
**/
uint32_t m, k, m2;
m =(uint32_t) ceil(-1.0 * n * log(p) / 0.480453);
m = (m - m % 64) + 64; // 8字節(jié)對(duì)齊
double double_k = (0.69314 *m /n);
k = round(double_k);
*pm = m;
*pk = k;
return;
}
初始化bloomfilter, 根據(jù)size分配內(nèi)存:
* @brief 初始化布隆過(guò)濾器
* @param pstBloomfilter 布隆過(guò)濾器實(shí)例
* @param dwSeed hash種子
* @param dwMaxItems 存儲(chǔ)容量
* @param dProbFalse 允許的誤判率
* @return 返回值
* -1 傳入的布隆過(guò)濾器為空
* -2 hash種子錯(cuò)誤或誤差>=1
*/
inline int InitBloomFilter(BaseBloomFilter *pstBloomfilter, uint32_t dwSeed, uint32_t dwMaxItems, double dProbFalse){
if(pstBloomfilter == NULL)
return -1;
if(dProbFalse <=0 || dProbFalse >= 1)
return -2;
//檢查是否重復(fù)Init, 釋放內(nèi)存
if(pstBloomfilter->pstFilter != NULL)
free(pstBloomfilter->pstFilter);
if(pstBloomfilter->pdwHashPos != NULL)
free(pstBloomfilter->pdwHashPos);
memset(pstBloomfilter, 0, sizeof(pstBloomfilter));
//初始化內(nèi)存結(jié)構(gòu),并計(jì)算BloomFilter需要的空間
pstBloomfilter->dwMaxItems = dwMaxItems;
pstBloomfilter->dProbFalse = dProbFalse;
pstBloomfilter->dwSeed = dwSeed;
//計(jì)算 m, k
_CalcBloomFilterParam(pstBloomfilter->dwMaxItems, pstBloomfilter->dProbFalse,
&pstBloomfilter->dwFilterBits, &pstBloomfilter->dwHashFuncs);
//分配BloomFilter的存儲(chǔ)空間
pstBloomfilter->dwFilterSize = pstBloomfilter->dwFilterBits / BYTE_BITS;
pstBloomfilter->pstFilter = (unsigned char *)malloc(pstBloomfilter->dwFilterSize);
if(NULL == pstBloomfilter->pstFilter)
return -100;
//哈希結(jié)果數(shù)組,每個(gè)哈希函數(shù)一個(gè)
pstBloomfilter->pdwHashPos = (uint32_t*)malloc(pstBloomfilter->dwHashFuncs * sizeof(uint32_t));
if(NULL == pstBloomfilter->pdwHashPos)
return -200;
printf("Init BloomFilter(n=%u, p=%e, m=%u, k=%d), malloc() size=%.6fMB, items:bits=1:%0.1lfn",
pstBloomfilter->dwMaxItems, pstBloomfilter->dProbFalse, pstBloomfilter->dwFilterBits,
pstBloomfilter->dwHashFuncs, (double)pstBloomfilter->dwFilterSize/1024/1024,
pstBloomfilter->dwFilterBits*1.0/pstBloomfilter->dwMaxItems);
// 初始化BloomFilter的內(nèi)存
memset(pstBloomfilter->pstFilter, 0, pstBloomfilter->dwFilterSize);
pstBloomfilter->cInitFlag = 1;
return 0;
}
釋放內(nèi)存:
inline int FreeBloomFilter(BaseBloomFilter * pstBloomfilter){
if(pstBloomfilter == NULL)
return -1;
pstBloomfilter->cInitFlag = 0;
pstBloomfilter->dwCount=0;
free(pstBloomfilter->pstFilter);
pstBloomfilter->pstFilter = NULL;
free(pstBloomfilter->pdwHashPos);
pstBloomfilter->pdwHashPos = NULL;
return 0;
}
ResetBloom filter:
inline int RealResetBloomFilter(BaseBloomFilter *pstBloomfilter){
if (pstBloomfilter == NULL)
return -1;
memset(pstBloomfilter->pstFilter, 0, pstBloomfilter->dwFilterSize);
pstBloomfilter->cInitFlag = 1;
pstBloomfilter->dwCount = 0;
return 0;
}
計(jì)算hash函數(shù),這里使用的是MurmurHash2:
{
const uint64_t m = 0xc6a4a7935bd1e995;
const int r = 47;
uint64_t h = seed ^ (len * m);
const uint64_t * data = (const uint64_t *)key;
const uint64_t * end = data + (len/8);
while(data != end)
{
uint64_t k = *data++;
k *= m;
k ^= k >> r;
k *= m;
h ^= k;
h *= m;
}
const uint8_t * data2 = (const uint8_t*)data;
switch(len & 7)
{
case 7: h ^= ((uint64_t)data2[6]) << 48;
case 6: h ^= ((uint64_t)data2[5]) << 40;
case 5: h ^= ((uint64_t)data2[4]) << 32;
case 4: h ^= ((uint64_t)data2[3]) << 24;
case 3: h ^= ((uint64_t)data2[2]) << 16;
case 2: h ^= ((uint64_t)data2[1]) << 8;
case 1: h ^= ((uint64_t)data2[0]);
h *= m;
};
h ^= h >> r;
h *= m;
h ^= h >> r;
return h;
}
計(jì)算多少個(gè)hash函數(shù):
FORCE_INLINE void bloom_hash(BaseBloomFilter *pstBloomfilter, const void * key, int len){
int i;
uint32_t dwFilterBits = pstBloomfilter->dwFilterBits;
uint64_t hash1 = MurmurHash2_x64(key, len, pstBloomfilter->dwSeed);
uint64_t hash2 = MurmurHash2_x64(key, len, MIX_UINT64(hash1));
for (i = 0; i < (int)pstBloomfilter->dwHashFuncs; i++)
{
pstBloomfilter->pdwHashPos[i] = (hash1 + i*hash2) % dwFilterBits;
}
return;
}
向Bloomfilter新增元素:
// 成功返回0,當(dāng)添加數(shù)據(jù)超過(guò)限制值時(shí)返回1提示用戶
FORCE_INLINE int BloomFilter_Add(BaseBloomFilter *pstBloomfilter, const void * key, int len){
if((pstBloomfilter == NULL) || (key == NULL) || (len<=0))
return -1;
int i;
if(pstBloomfilter->cInitFlag != 1){
memset(pstBloomfilter->pstFilter,0 , pstBloomfilter->dwFilterSize);
pstBloomfilter->cInitFlag =1;
}
bloom_hash(pstBloomfilter, key, len);
for(i=0;i<(int)pstBloomfilter->dwHashFuncs; ++i){
SETBIT(pstBloomfilter, pstBloomfilter->pdwHashPos[i]);
}
pstBloomfilter->dwCount++;
if(pstBloomfilter->dwCount <= pstBloomfilter->dwMaxItems)
return 0;
else
return 1;
}
查詢?cè)?
if((pstBloomfilter == NULL) || (key == NULL) || (len<=0))
return -1;
int i;
bloom_hash(pstBloomfilter, key, len);
for (i = 0; i < (int)pstBloomfilter->dwHashFuncs; i++)
{
// 如果有任意bit不為1,說(shuō)明key不在bloomfilter中
// 注意: GETBIT()返回不是0|1,高位可能出現(xiàn)128之類的情況
if (GETBIT(pstBloomfilter, pstBloomfilter->pdwHashPos[i]) == 0)
return 1;
}
return 0;
}
把bloomfilter保存在文件里:
if ((pstBloomfilter == NULL) || (szFileName == NULL))
return -1;
int iRet;
FILE *pFile;
static BloomFileHead stFileHeader = {0};
pFile = fopen(szFileName, "wb");
if (pFile == NULL)
{
perror("fopen");
return -11;
}
// 先寫入文件頭
stFileHeader.dwMagicCode = __MGAIC_CODE__;
stFileHeader.dwSeed = pstBloomfilter->dwSeed;
stFileHeader.dwCount = pstBloomfilter->dwCount;
stFileHeader.dwMaxItems = pstBloomfilter->dwMaxItems;
stFileHeader.dProbFalse = pstBloomfilter->dProbFalse;
stFileHeader.dwFilterBits = pstBloomfilter->dwFilterBits;
stFileHeader.dwHashFuncs = pstBloomfilter->dwHashFuncs;
stFileHeader.dwFilterSize = pstBloomfilter->dwFilterSize;
iRet = fwrite((const void*)&stFileHeader, sizeof(stFileHeader), 1, pFile);
if (iRet != 1)
{
perror("fwrite(head)");
return -21;
}
// 接著寫入BloomFilter的內(nèi)容
iRet = fwrite(pstBloomfilter->pstFilter, 1, pstBloomfilter->dwFilterSize, pFile);
if ((uint32_t)iRet != pstBloomfilter->dwFilterSize)
{
perror("fwrite(data)");
return -31;
}
fclose(pFile);
return 0;
}
讀取文件的bloomfilter:
inline int LoadBloomFilterFromFile(BaseBloomFilter *pstBloomfilter, char *szFileName)
{
if ((pstBloomfilter == NULL) || (szFileName == NULL))
return -1;
int iRet;
FILE *pFile;
static BloomFileHead stFileHeader = {0};
if (pstBloomfilter->pstFilter != NULL)
free(pstBloomfilter->pstFilter);
if (pstBloomfilter->pdwHashPos != NULL)
free(pstBloomfilter->pdwHashPos);
//
pFile = fopen(szFileName, "rb");
if (pFile == NULL)
{
perror("fopen");
return -11;
}
// 讀取并檢查文件頭
iRet = fread((void*)&stFileHeader, sizeof(stFileHeader), 1, pFile);
if (iRet != 1)
{
perror("fread(head)");
return -21;
}
if ((stFileHeader.dwMagicCode != __MGAIC_CODE__)
|| (stFileHeader.dwFilterBits != stFileHeader.dwFilterSize*BYTE_BITS))
return -50;
// 初始化傳入的 BaseBloomFilter 結(jié)構(gòu)
pstBloomfilter->dwMaxItems = stFileHeader.dwMaxItems;
pstBloomfilter->dProbFalse = stFileHeader.dProbFalse;
pstBloomfilter->dwFilterBits = stFileHeader.dwFilterBits;
pstBloomfilter->dwHashFuncs = stFileHeader.dwHashFuncs;
pstBloomfilter->dwSeed = stFileHeader.dwSeed;
pstBloomfilter->dwCount = stFileHeader.dwCount;
pstBloomfilter->dwFilterSize = stFileHeader.dwFilterSize;
pstBloomfilter->pstFilter = (unsigned char *) malloc(pstBloomfilter->dwFilterSize);
if (NULL == pstBloomfilter->pstFilter)
return -100;
pstBloomfilter->pdwHashPos = (uint32_t*) malloc(pstBloomfilter->dwHashFuncs * sizeof(uint32_t));
if (NULL == pstBloomfilter->pdwHashPos)
return -200;
// 將后面的Data部分讀入 pstFilter
iRet = fread((void*)(pstBloomfilter->pstFilter), 1, pstBloomfilter->dwFilterSize, pFile);
if ((uint32_t)iRet != pstBloomfilter->dwFilterSize)
{
perror("fread(data)");
return -31;
}
pstBloomfilter->cInitFlag = 1;
printf("Load BloomFilter(n=%u, p=%f, m=%u, k=%d), malloc() size=%.2fMBn",
pstBloomfilter->dwMaxItems, pstBloomfilter->dProbFalse, pstBloomfilter->dwFilterBits,
pstBloomfilter->dwHashFuncs, (double)pstBloomfilter->dwFilterSize/1024/1024);
fclose(pFile);
return 0;
}
完整的代碼:
#include
#include
#include
#include
#include
#define MAX_ITEMS 6000000 // 設(shè)置最大元素個(gè)數(shù)
#define ADD_ITEMS 10000 // 添加測(cè)試元素
#define P_ERROR 0.0001// 設(shè)置誤差
#define __BLOOMFILTER_VERSION__ "1.1"
#define __MGAIC_CODE__ (0x01464C42)
/**
* BloomFilter使用例子:
* static BaseBloomFilter stBloomFilter = {0};
*
* 初始化BloomFilter(最大100000元素,不超過(guò)0.00001的錯(cuò)誤率):
* InitBloomFilter(&stBloomFilter, 0, 100000, 0.00001);
* 重置BloomFilter:
* ResetBloomFilter(&stBloomFilter);
* 釋放BloomFilter:
* FreeBloomFilter(&stBloomFilter);
*
* 向BloomFilter中新增一個(gè)數(shù)值(0-正常,1-加入數(shù)值過(guò)多):
* uint32_t dwValue;
* iRet = BloomFilter_Add(&stBloomFilter, &dwValue, sizeof(uint32_t));
* 檢查數(shù)值是否在BloomFilter內(nèi)(0-存在,1-不存在):
* iRet = BloomFilter_Check(&stBloomFilter, &dwValue, sizeof(uint32_t));
*
* (1.1新增) 將生成好的BloomFilter寫入文件:
* iRet = SaveBloomFilterToFile(&stBloomFilter, "dump.bin")
* (1.1新增) 從文件讀取生成好的BloomFilter:
* iRet = LoadBloomFilterFromFile(&stBloomFilter, "dump.bin")
**/
#define FORCE_INLINE __attribute__((always_inline))
#define BYTE_BITS (8)
#define MIX_UINT64(v) ((uint32_t)((v>>32)^(v)))
#define SETBIT(filter, n) (filter->pstFilter[n/BYTE_BITS] |= (1<<(n%BYTE_BITS)))
#define GETBIT(filter, n) (filter->pstFilter[n/BYTE_BITS] &= (1<<(n%BYTE_BITS)))
using namespace std;
#pragma pack(1)
typedef struct{
uint8_t cInitFlag; //初始化標(biāo)志
uint8_t cResv[3];
uint32_t dwMaxItems; //n - BloomFilter中最大元素個(gè)數(shù) (輸入量)
double dProbFalse; //p - 假陽(yáng)概率(誤判率) (輸入量,比如萬(wàn)分之一:0.00001)
uint32_t dwFilterBits; //m = ceil((n * log(p)) / log(1.0 / (pow(2.0, log(2.0))))); - BloomFilter的比特?cái)?shù)
uint32_t dwHashFuncs; // k = round(log(2.0) * m / n); - 哈希函數(shù)個(gè)數(shù)
uint32_t dwSeed; // MurmurHash的種子偏移量
uint32_t dwCount; // Add()的計(jì)數(shù),超過(guò)MAX_BLOOMFILTER_N則返回失敗
uint32_t dwFilterSize; // dwFilterBits / BYTE_BITS
unsigned char *pstFilter; // BloomFilter存儲(chǔ)指針,使用malloc分配
uint32_t *pdwHashPos; // 存儲(chǔ)上次hash得到的K個(gè)bit位置數(shù)組(由bloom_hash填充)
}BaseBloomFilter;
//BloomFilter 文件頭部定義
typedef struct{
uint32_t dwMagicCode; // 文件頭部標(biāo)識(shí),填充 __MGAIC_CODE__
uint32_t dwSeed;
uint32_t dwCount;
uint32_t dwMaxItems; // n - BloomFilter中最大元素個(gè)數(shù) (輸入量)
double dProbFalse; // p - 假陽(yáng)概率 (輸入量,比如萬(wàn)分之一:0.00001)
uint32_t dwFilterBits; // m = ceil((n * log(p)) / log(1.0 / (pow(2.0, log(2.0))))); - BloomFilter的比特?cái)?shù)
uint32_t dwHashFuncs; // k = round(log(2.0) * m / n); - 哈希函數(shù)個(gè)數(shù)
uint32_t dwResv[6];
uint32_t dwFileCrc; // (未使用)整個(gè)文件的校驗(yàn)和
uint32_t dwFilterSize; // 后面Filter的Buffer長(zhǎng)度
}BloomFileHead;
#pragma pack()
// 計(jì)算BloomFilter的參數(shù)m,k
static inline void _CalcBloomFilterParam(uint32_t n, double p, uint32_t *pm, uint32_t *pk){
/**
* n - Number of items in the filter
* p - Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
* m - Number of bits in the filter
* k - Number of hash functions
*
* f = ln(2) × ln(1/2) × m / n = (0.6185) ^ (m/n)
* m = -1*n*ln(p)/((ln(2))^2) = -1*n*ln(p)/(ln(2)*ln(2)) = -1*n*ln(p)/(0.69314718055995*0.69314718055995))
* = -1*n*ln(p)/0.4804530139182079271955440025
* k = ln(2)*m/n
**/
uint32_t m, k, m2;
m =(uint32_t) ceil(-1.0 * n * log(p) / 0.480453);
m = (m - m % 64) + 64; // 8字節(jié)對(duì)齊
double double_k = (0.69314 *m /n);
k = round(double_k);
*pm = m;
*pk = k;
return;
}
// 根據(jù)目標(biāo)精度和數(shù)據(jù)個(gè)數(shù),初始化BloomFilter結(jié)構(gòu)
/**
* @brief 初始化布隆過(guò)濾器
* @param pstBloomfilter 布隆過(guò)濾器實(shí)例
* @param dwSeed hash種子
* @param dwMaxItems 存儲(chǔ)容量
* @param dProbFalse 允許的誤判率
* @return 返回值
* -1 傳入的布隆過(guò)濾器為空
* -2 hash種子錯(cuò)誤或誤差>=1
*/
inline int InitBloomFilter(BaseBloomFilter *pstBloomfilter, uint32_t dwSeed, uint32_t dwMaxItems, double dProbFalse){
if(pstBloomfilter == NULL)
return -1;
if(dProbFalse <=0 || dProbFalse >= 1)
return -2;
//檢查是否重復(fù)Init, 釋放內(nèi)存
if(pstBloomfilter->pstFilter != NULL)
free(pstBloomfilter->pstFilter);
if(pstBloomfilter->pdwHashPos != NULL)
free(pstBloomfilter->pdwHashPos);
memset(pstBloomfilter, 0, sizeof(pstBloomfilter));
//初始化內(nèi)存結(jié)構(gòu),并計(jì)算BloomFilter需要的空間
pstBloomfilter->dwMaxItems = dwMaxItems;
pstBloomfilter->dProbFalse = dProbFalse;
pstBloomfilter->dwSeed = dwSeed;
//計(jì)算 m, k
_CalcBloomFilterParam(pstBloomfilter->dwMaxItems, pstBloomfilter->dProbFalse,
&pstBloomfilter->dwFilterBits, &pstBloomfilter->dwHashFuncs);
//分配BloomFilter的存儲(chǔ)空間
pstBloomfilter->dwFilterSize = pstBloomfilter->dwFilterBits / BYTE_BITS;
pstBloomfilter->pstFilter = (unsigned char *)malloc(pstBloomfilter->dwFilterSize);
if(NULL == pstBloomfilter->pstFilter)
return -100;
//哈希結(jié)果數(shù)組,每個(gè)哈希函數(shù)一個(gè)
pstBloomfilter->pdwHashPos = (uint32_t*)malloc(pstBloomfilter->dwHashFuncs * sizeof(uint32_t));
if(NULL == pstBloomfilter->pdwHashPos)
return -200;
printf("Init BloomFilter(n=%u, p=%e, m=%u, k=%d), malloc() size=%.6fMB, items:bits=1:%0.1lfn",
pstBloomfilter->dwMaxItems, pstBloomfilter->dProbFalse, pstBloomfilter->dwFilterBits,
pstBloomfilter->dwHashFuncs, (double)pstBloomfilter->dwFilterSize/1024/1024,
pstBloomfilter->dwFilterBits*1.0/pstBloomfilter->dwMaxItems);
// 初始化BloomFilter的內(nèi)存
memset(pstBloomfilter->pstFilter, 0, pstBloomfilter->dwFilterSize);
pstBloomfilter->cInitFlag = 1;
return 0;
}
//釋放bloomfilter
inline int FreeBloomFilter(BaseBloomFilter * pstBloomfilter){
if(pstBloomfilter == NULL)
return -1;
pstBloomfilter->cInitFlag = 0;
pstBloomfilter->dwCount=0;
free(pstBloomfilter->pstFilter);
pstBloomfilter->pstFilter = NULL;
free(pstBloomfilter->pdwHashPos);
pstBloomfilter->pdwHashPos = NULL;
return 0;
}
//重置bloomfilter
inline int ResetBloomFilter(BaseBloomFilter *pstBloomfilter){
if(pstBloomfilter == NULL)
return -1;
pstBloomfilter->cInitFlag =0;
pstBloomfilter->dwCount =0;
return 0;
}
inline int RealResetBloomFilter(BaseBloomFilter *pstBloomfilter){
if (pstBloomfilter == NULL)
return -1;
memset(pstBloomfilter->pstFilter, 0, pstBloomfilter->dwFilterSize);
pstBloomfilter->cInitFlag = 1;
pstBloomfilter->dwCount = 0;
return 0;
}
FORCE_INLINE uint64_t MurmurHash2_x64 ( const void * key, int len, uint32_t seed )
{
const uint64_t m = 0xc6a4a7935bd1e995;
const int r = 47;
uint64_t h = seed ^ (len * m);
const uint64_t * data = (const uint64_t *)key;
const uint64_t * end = data + (len/8);
while(data != end)
{
uint64_t k = *data++;
k *= m;
k ^= k >> r;
k *= m;
h ^= k;
h *= m;
}
const uint8_t * data2 = (const uint8_t*)data;
switch(len & 7)
{
case 7: h ^= ((uint64_t)data2[6]) << 48;
case 6: h ^= ((uint64_t)data2[5]) << 40;
case 5: h ^= ((uint64_t)data2[4]) << 32;
case 4: h ^= ((uint64_t)data2[3]) << 24;
case 3: h ^= ((uint64_t)data2[2]) << 16;
case 2: h ^= ((uint64_t)data2[1]) << 8;
case 1: h ^= ((uint64_t)data2[0]);
h *= m;
};
h ^= h >> r;
h *= m;
h ^= h >> r;
return h;
}
// 雙重散列封裝,k個(gè)函數(shù)函數(shù), 比如要20個(gè)
FORCE_INLINE void bloom_hash(BaseBloomFilter *pstBloomfilter, const void * key, int len){
int i;
uint32_t dwFilterBits = pstBloomfilter->dwFilterBits;
uint64_t hash1 = MurmurHash2_x64(key, len, pstBloomfilter->dwSeed);
uint64_t hash2 = MurmurHash2_x64(key, len, MIX_UINT64(hash1));
for (i = 0; i < (int)pstBloomfilter->dwHashFuncs; i++)
{
// k0 = (hash1 + 0*hash2) % dwFilterBits; // dwFilterBits bit向量的長(zhǎng)度
// k1 = (hash1 + 1*hash2) % dwFilterBits;
pstBloomfilter->pdwHashPos[i] = (hash1 + i*hash2) % dwFilterBits;
}
return;
}
// 向BloomFilter中新增一個(gè)元素
// 成功返回0,當(dāng)添加數(shù)據(jù)超過(guò)限制值時(shí)返回1提示用戶
FORCE_INLINE int BloomFilter_Add(BaseBloomFilter *pstBloomfilter, const void * key, int len){
if((pstBloomfilter == NULL) || (key == NULL) || (len<=0))
return -1;
int i;
if(pstBloomfilter->cInitFlag != 1){
memset(pstBloomfilter->pstFilter,0 , pstBloomfilter->dwFilterSize);
pstBloomfilter->cInitFlag =1;
}
bloom_hash(pstBloomfilter, key, len);
for(i=0;i<(int)pstBloomfilter->dwHashFuncs; ++i){
SETBIT(pstBloomfilter, pstBloomfilter->pdwHashPos[i]);
}
pstBloomfilter->dwCount++;
if(pstBloomfilter->dwCount <= pstBloomfilter->dwMaxItems)
return 0;
else
return 1;
}
FORCE_INLINE int BloomFilter_Check(BaseBloomFilter *pstBloomfilter, const void * key, int len){
if((pstBloomfilter == NULL) || (key == NULL) || (len<=0))
return -1;
int i;
bloom_hash(pstBloomfilter, key, len);
for (i = 0; i < (int)pstBloomfilter->dwHashFuncs; i++)
{
// 如果有任意bit不為1,說(shuō)明key不在bloomfilter中
// 注意: GETBIT()返回不是0|1,高位可能出現(xiàn)128之類的情況
if (GETBIT(pstBloomfilter, pstBloomfilter->pdwHashPos[i]) == 0)
return 1;
}
return 0;
}
inline int SaveBloomFilterToFile(BaseBloomFilter *pstBloomfilter, char *szFileName){
if ((pstBloomfilter == NULL) || (szFileName == NULL))
return -1;
int iRet;
FILE *pFile;
static BloomFileHead stFileHeader = {0};
pFile = fopen(szFileName, "wb");
if (pFile == NULL)
{
perror("fopen");
return -11;
}
// 先寫入文件頭
stFileHeader.dwMagicCode = __MGAIC_CODE__;
stFileHeader.dwSeed = pstBloomfilter->dwSeed;
stFileHeader.dwCount = pstBloomfilter->dwCount;
stFileHeader.dwMaxItems = pstBloomfilter->dwMaxItems;
stFileHeader.dProbFalse = pstBloomfilter->dProbFalse;
stFileHeader.dwFilterBits = pstBloomfilter->dwFilterBits;
stFileHeader.dwHashFuncs = pstBloomfilter->dwHashFuncs;
stFileHeader.dwFilterSize = pstBloomfilter->dwFilterSize;
iRet = fwrite((const void*)&stFileHeader, sizeof(stFileHeader), 1, pFile);
if (iRet != 1)
{
perror("fwrite(head)");
return -21;
}
// 接著寫入BloomFilter的內(nèi)容
iRet = fwrite(pstBloomfilter->pstFilter, 1, pstBloomfilter->dwFilterSize, pFile);
if ((uint32_t)iRet != pstBloomfilter->dwFilterSize)
{
perror("fwrite(data)");
return -31;
}
fclose(pFile);
return 0;
}
// 從文件讀取生成好的BloomFilter
inline int LoadBloomFilterFromFile(BaseBloomFilter *pstBloomfilter, char *szFileName)
{
if ((pstBloomfilter == NULL) || (szFileName == NULL))
return -1;
int iRet;
FILE *pFile;
static BloomFileHead stFileHeader = {0};
if (pstBloomfilter->pstFilter != NULL)
free(pstBloomfilter->pstFilter);
if (pstBloomfilter->pdwHashPos != NULL)
free(pstBloomfilter->pdwHashPos);
//
pFile = fopen(szFileName, "rb");
if (pFile == NULL)
{
perror("fopen");
return -11;
}
// 讀取并檢查文件頭
iRet = fread((void*)&stFileHeader, sizeof(stFileHeader), 1, pFile);
if (iRet != 1)
{
perror("fread(head)");
return -21;
}
if ((stFileHeader.dwMagicCode != __MGAIC_CODE__)
|| (stFileHeader.dwFilterBits != stFileHeader.dwFilterSize*BYTE_BITS))
return -50;
// 初始化傳入的 BaseBloomFilter 結(jié)構(gòu)
pstBloomfilter->dwMaxItems = stFileHeader.dwMaxItems;
pstBloomfilter->dProbFalse = stFileHeader.dProbFalse;
pstBloomfilter->dwFilterBits = stFileHeader.dwFilterBits;
pstBloomfilter->dwHashFuncs = stFileHeader.dwHashFuncs;
pstBloomfilter->dwSeed = stFileHeader.dwSeed;
pstBloomfilter->dwCount = stFileHeader.dwCount;
pstBloomfilter->dwFilterSize = stFileHeader.dwFilterSize;
pstBloomfilter->pstFilter = (unsigned char *) malloc(pstBloomfilter->dwFilterSize);
if (NULL == pstBloomfilter->pstFilter)
return -100;
pstBloomfilter->pdwHashPos = (uint32_t*) malloc(pstBloomfilter->dwHashFuncs * sizeof(uint32_t));
if (NULL == pstBloomfilter->pdwHashPos)
return -200;
// 將后面的Data部分讀入 pstFilter
iRet = fread((void*)(pstBloomfilter->pstFilter), 1, pstBloomfilter->dwFilterSize, pFile);
if ((uint32_t)iRet != pstBloomfilter->dwFilterSize)
{
perror("fread(data)");
return -31;
}
pstBloomfilter->cInitFlag = 1;
printf("Load BloomFilter(n=%u, p=%f, m=%u, k=%d), malloc() size=%.2fMBn",
pstBloomfilter->dwMaxItems, pstBloomfilter->dProbFalse, pstBloomfilter->dwFilterBits,
pstBloomfilter->dwHashFuncs, (double)pstBloomfilter->dwFilterSize/1024/1024);
fclose(pFile);
return 0;
}
int main()
{
printf(" test bloomfiltern");
static BaseBloomFilter stBloomFilter = {0};
InitBloomFilter(&stBloomFilter, 0, MAX_ITEMS, P_ERROR);
char url[128] = {0};
for(int i = 0; i < ADD_ITEMS; i++){
sprintf(url, "https://0voice.com/%d.html", i);
if(0 == BloomFilter_Add(&stBloomFilter, (const void*)url, strlen(url))){
// printf("add %s success", url);
}else{
printf("add %s failed", url);
}
memset(url, 0, sizeof(url));
}
// 4. check url exist or not
char* str = "https://0voice.com/0.html";
if (0 == BloomFilter_Check(&stBloomFilter, (const void*)str, strlen(str)) ){
printf("https://0voice.com/0.html existn");
}
char* str2 = "https://0voice.com/10001.html";
if (0 != BloomFilter_Check(&stBloomFilter, (const void*)str2, strlen(str2)) ){
printf("https://0voice.com/10001.html not existn");
}
char * file = "testBloom.txt";
if(SaveBloomFilterToFile(&stBloomFilter, file) == 0)
printf("file save in %sn", file);
else
printf("file can't exitn");
//cout << "Hello world!" << endl;
if(0 == LoadBloomFilterFromFile(&stBloomFilter, file))
printf("read successn");
else
printf("read failuren");
FreeBloomFilter(&stBloomFilter);
getchar();
return 0;
}
-
函數(shù)
+關(guān)注
關(guān)注
3文章
4346瀏覽量
63006 -
數(shù)據(jù)結(jié)構(gòu)
+關(guān)注
關(guān)注
3文章
573瀏覽量
40237 -
過(guò)濾器
+關(guān)注
關(guān)注
1文章
433瀏覽量
19746
發(fā)布評(píng)論請(qǐng)先 登錄
相關(guān)推薦
一文理解布隆過(guò)濾器和布谷鳥(niǎo)過(guò)濾器
![一文理解<b class='flag-5'>布</b><b class='flag-5'>隆</b><b class='flag-5'>過(guò)濾器</b>和布谷鳥(niǎo)<b class='flag-5'>過(guò)濾器</b>](https://file1.elecfans.com//web1/M00/F4/79/wKgaoWcsIZCAd4CBAAJAiCQrz_o550.png)
一種隱私保護(hù)的可逆布魯姆過(guò)濾器PPIBF設(shè)計(jì)
![一種隱私保護(hù)的可逆布魯姆<b class='flag-5'>過(guò)濾器</b>PPIBF設(shè)計(jì)](https://file.elecfans.com/web2/M00/49/5F/poYBAGKhwKiAdzKRAAATNabBwSg950.jpg)
過(guò)濾器的作用
如何使用計(jì)數(shù)型布隆過(guò)濾器進(jìn)行可排序密文檢索的方法概述
![如何使用計(jì)數(shù)型<b class='flag-5'>布</b><b class='flag-5'>隆</b><b class='flag-5'>過(guò)濾器</b>進(jìn)行可排序密文檢索的方法概述](https://file.elecfans.com/web1/M00/81/55/pIYBAFwsaY6AC7W9AABZL9dPPQ0750.png)
解密高效空氣過(guò)濾器的性能及要求
過(guò)濾器的應(yīng)用有哪些
![<b class='flag-5'>過(guò)濾器</b>的應(yīng)用有哪些](https://file.elecfans.com/web1/M00/BE/01/o4YBAF7d8vyAMtPPAAA_c-Ev2iQ671.png)
Google在Play商店中推出搜索過(guò)濾器
絲扣Y過(guò)濾器
絲扣Y過(guò)濾器及過(guò)濾器測(cè)試原理簡(jiǎn)介
科普一下12種管道過(guò)濾器
漢克森過(guò)濾器系列介紹
![漢克森<b class='flag-5'>過(guò)濾器</b>系列介紹](https://file.elecfans.com/web2/M00/94/8E/poYBAGP-obGASHT5AAFd_le7l4A769.png)
一文解析布隆過(guò)濾器設(shè)計(jì)原理
![一文解析<b class='flag-5'>布</b><b class='flag-5'>隆</b><b class='flag-5'>過(guò)濾器</b>設(shè)計(jì)原理](https://file1.elecfans.com/web2/M00/82/B3/wKgZomRdr66AKGpbAAAgo1EbxjI924.png)
評(píng)論