SZaru is a library for using Sawzall aggregators in pure C++, Ruby and Python. Currently, I have implemented the following 3 aggregators:
- Statistical sampling that records the 'top N' data items, based on the CountSketch algorithm from "Finding Frequent Items in Data Streams" (Moses Charikar, Kevin Chen and Martin Farach-Colton, 2002).
- Statistical estimators for the total number of unique data items.
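To give a feel for how the number of unique items can be estimated in bounded memory, here is a minimal sketch of one well-known technique, the "k minimum values" estimator. It is an illustration only and is not taken from SZaru's source: the names `HashString` and `KmvUniqueSketch` are hypothetical, and the hash function is an arbitrary choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <string>

// Hypothetical helper (not part of SZaru): 64-bit FNV-1a followed by the
// splitmix64 finalizer, so hash values are spread over the full 64-bit range.
inline std::uint64_t HashString(const std::string& s) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV-1a prime
  }
  h ^= h >> 30; h *= 0xbf58476d1ce4e5b9ULL;
  h ^= h >> 27; h *= 0x94d049bb133111ebULL;
  h ^= h >> 31;
  return h;
}

// Hypothetical class (not part of SZaru): keeps only the k smallest distinct
// hash values seen so far and estimates the number of distinct items from them.
class KmvUniqueSketch {
 public:
  explicit KmvUniqueSketch(std::size_t k) : k_(k) { assert(k >= 2); }

  void Add(const std::string& item) {
    // Duplicate items hash to the same value, so the set ignores them.
    smallest_.insert(HashString(item));
    if (smallest_.size() > k_) {
      smallest_.erase(std::prev(smallest_.end()));  // drop the largest hash
    }
  }

  double Estimate() const {
    // Fewer than k distinct hashes: the count is exact (barring collisions).
    if (smallest_.size() < k_) {
      return static_cast<double>(smallest_.size());
    }
    // Scale the k-th smallest hash into (0, 1]. If hashes are uniform, the
    // k-th smallest of n distinct values sits near k / n, so n is roughly
    // (k - 1) / kth.
    const double kth =
        (static_cast<double>(*smallest_.rbegin()) + 1.0) / 18446744073709551616.0;
    return (static_cast<double>(k_) - 1.0) / kth;
  }

 private:
  std::size_t k_;
  std::set<std::uint64_t> smallest_;
};
```

A caller would construct `KmvUniqueSketch sketch(256);`, call `sketch.Add(item)` for every item in the stream, and read `sketch.Estimate()` at the end. Memory use is bounded by `k`, and a larger `k` trades memory for a tighter estimate.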
// Example: estimating the top 3 weighted elements with TopEstimator.
#include <iostream>
#include <vector>
#include <szaru.h>

using namespace std;
using namespace SZaru;

int main() {
  // Create an estimator that tracks the top 3 elements.
  TopEstimator<int32_t> *topEst = TopEstimator<int32_t>::Create(3);

  // Add elements with weights; weights for the same element accumulate.
  topEst->AddWeightedElem("abc", 1);
  topEst->AddWeightedElem("def", 2);
  topEst->AddWeightedElem("ghi", 3);
  topEst->AddWeightedElem("def", 4);
  topEst->AddWeightedElem("jkl", 5);

  // Retrieve the estimated top elements, heaviest first.
  vector< TopEstimator<int32_t>::Elem > topElems;
  topEst->Estimate(topElems);

  cout << topElems[0].value << ", " << topElems[0].weight << endl; // => def, 6
  cout << topElems[1].value << ", " << topElems[1].weight << endl; // => jkl, 5
  cout << topElems[2].value << ", " << topElems[2].weight << endl; // => ghi, 3

  delete topEst;
  return 0;
}
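The 'top N' aggregator is described above as being based on CountSketch. The following is a rough, self-contained sketch of that idea, not SZaru's actual implementation: the classes `CountSketch` and `TopNTracker`, the seeded hash function, and the table sizes are all assumptions made for illustration. Each row of the table hashes an item to one bucket and a sign of +1 or -1, and the estimate for an item is the median of its signed bucket values across rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class (not part of SZaru): a plain CountSketch table.
class CountSketch {
 public:
  CountSketch(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), table_(rows * cols, 0) {
    assert(rows > 0 && cols > 0);
  }

  void Add(const std::string& item, std::int64_t weight) {
    for (std::size_t r = 0; r < rows_; ++r) {
      const std::uint64_t h = Hash(item, r);
      table_[r * cols_ + Bucket(h)] += Sign(h) * weight;
    }
  }

  // Median over rows of the signed bucket values for this item.
  std::int64_t Estimate(const std::string& item) const {
    std::vector<std::int64_t> votes(rows_);
    for (std::size_t r = 0; r < rows_; ++r) {
      const std::uint64_t h = Hash(item, r);
      votes[r] = Sign(h) * table_[r * cols_ + Bucket(h)];
    }
    const std::ptrdiff_t mid = static_cast<std::ptrdiff_t>(rows_ / 2);
    std::nth_element(votes.begin(), votes.begin() + mid, votes.end());
    return votes[rows_ / 2];
  }

 private:
  // Assumption: seeded FNV-1a plus the splitmix64 finalizer stands in for the
  // pairwise-independent hash families that the paper's analysis requires.
  static std::uint64_t Hash(const std::string& s, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ (seed * 0x9e3779b97f4a7c15ULL);
    for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
    h ^= h >> 30; h *= 0xbf58476d1ce4e5b9ULL;
    h ^= h >> 27; h *= 0x94d049bb133111ebULL;
    h ^= h >> 31;
    return h;
  }
  std::size_t Bucket(std::uint64_t h) const {
    return static_cast<std::size_t>((h >> 1) % cols_);
  }
  static std::int64_t Sign(std::uint64_t h) { return (h & 1) ? 1 : -1; }

  std::size_t rows_, cols_;
  std::vector<std::int64_t> table_;
};

// Hypothetical class: keeps the n items with the largest estimated weight.
class TopNTracker {
 public:
  typedef std::pair<std::string, std::int64_t> Entry;

  TopNTracker(std::size_t n, std::size_t rows, std::size_t cols)
      : n_(n), sketch_(rows, cols) { assert(n > 0); }

  void Add(const std::string& item, std::int64_t weight) {
    sketch_.Add(item, weight);
    candidates_[item] = sketch_.Estimate(item);
    if (candidates_.size() > n_) {  // evict the candidate with the lowest estimate
      typedef std::map<std::string, std::int64_t>::value_type MapEntry;
      candidates_.erase(std::min_element(
          candidates_.begin(), candidates_.end(),
          [](const MapEntry& a, const MapEntry& b) { return a.second < b.second; }));
    }
  }

  // Current candidates, heaviest estimate first.
  std::vector<Entry> Top() const {
    std::vector<Entry> out(candidates_.begin(), candidates_.end());
    std::sort(out.begin(), out.end(),
              [](const Entry& a, const Entry& b) { return a.second > b.second; });
    return out;
  }

 private:
  std::size_t n_;
  CountSketch sketch_;
  std::map<std::string, std::int64_t> candidates_;
};
```

Feeding the same five weighted elements as in the example above into `TopNTracker top(3, 5, 1024);` and calling `top.Top()` should give the same ranking as long as no hash collisions distort the estimates. With a table much wider than the number of distinct items this is very likely, but the result remains an estimate rather than an exact count.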