SZaru: A port of the excellent Sawzall aggregators.

Last modified: Sat Nov 13 21:41:17 JST 2010


Overview

SZaru is a library for using Google's Sawzall aggregators from pure C++, Ruby, and Python.

Sawzall aggregators use memory-efficient, one-pass algorithms to compute common statistics approximately. For example, a naive 'top N' computation requires O(K) memory, where K is the number of unique elements. SZaru requires only O(N) memory (in most cases N < K), at the cost of some accuracy.
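To make the memory trade-off concrete, here is a minimal Python sketch. It does not use SZaru and does not reproduce its algorithm: the bounded-memory half is the classic Misra-Gries summary, swapped in only because it is short. The function names and the sample stream are made up for illustration.

```python
from collections import Counter


def exact_top_n(stream, n):
    """Exact baseline: one counter per distinct element, so O(K) memory."""
    counts = Counter(stream)
    return counts.most_common(n)


def misra_gries(stream, n):
    """Bounded-memory summary: at most n counters are alive at any time."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < n:
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream: two heavy items followed by 20 items seen once each.
stream = ["a"] * 50 + ["b"] * 30 + [f"rare{i}" for i in range(20)]
exact = exact_top_n(stream, 2)    # needs 22 counters internally
approx = misra_gries(stream, 3)   # never holds more than 3 counters
print(exact)
print(approx)
```

The bounded version keeps the heavy items but reports counts that are lower than the true ones, which is the kind of accuracy loss described above.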

SZaru may therefore be useful for processing large data sets or data streams.

Currently, I have ported the following three aggregators from szl (the open-source implementation of Sawzall):

Top
Statistical sampling that records the 'top N' data items, based on the CountSketch algorithm from "Finding Frequent Items in Data Streams", Moses Charikar, Kevin Chen and Martin Farach-Colton, 2002.
Unique
Statistical estimators for the total number of unique data items.
Quantile
Approximate N-tiles for data items from an ordered domain based on the following paper: Munro & Paterson, "Selection and Sorting with Limited Storage", Theoretical Computer Science, Vol 12, p 315-323, 1980.
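As a rough illustration of the idea behind the Top aggregator, the following Python sketch implements a bare Count Sketch frequency estimator in the spirit of the Charikar, Chen and Farach-Colton paper. It is not SZaru's or szl's code: the class name, the default depth and width, and the use of salted BLAKE2 hashes are all assumptions chosen for brevity. A real 'top N' tracker would also keep a small set of candidate items next to the sketch.

```python
import hashlib
from statistics import median


class CountSketch:
    """Minimal Count Sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, item, row):
        # One salted hash per row yields both a bucket index and a +/-1 sign.
        digest = hashlib.blake2b(
            str(item).encode(), salt=row.to_bytes(8, "little")
        ).digest()
        value = int.from_bytes(digest[:8], "little")
        bucket = (value >> 1) % self.width
        sign = 1 if value & 1 else -1
        return bucket, sign

    def add(self, item, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # The median across rows tolerates collisions in a minority of rows.
        votes = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            votes.append(sign * self.table[row][bucket])
        return median(votes)


# Hypothetical stream: one heavy word plus 100 words seen once each.
sketch = CountSketch()
for word in ["apple"] * 1000 + [f"w{i}" for i in range(100)]:
    sketch.add(word)
heavy = sketch.estimate("apple")
print(heavy)
```

Memory use is fixed at depth × width counters regardless of how many distinct items appear in the stream; widening the rows reduces collisions and therefore the estimation error.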

Example

C++

Ruby

Python

Install

Core library

git clone git://github.com/llamerada/SZaru.git
cd SZaru
./waf configure
./waf 
sudo ./waf install
    

Ruby

# After installing the core library
sudo gem install szaru
    

Python

# Change the current directory to the core library directory
cd SZaru
cd python
python setup.py build
sudo python setup.py install
    

Document

Source

github repository: https://github.com/llamerada/SZaru

License

Apache License Version 2.0