Efficient ways of searching compressed files, that are comparable to searching uncompressed files, would benefit several applications, analytical or otherwise.
Specifically if space savings can be achieved without adding significant costs in searching for content then cost savings can be achieved by reducing number of disks and thereby reducing number of nodes/servers needed to store the data.
Some relevant ideas (some can be applied in combination with others)
- Pre-processing the file by reading it once and extracting metadata that can be used to quickly match against predicates ex: partitioning, file-level indexes, min/max values of a column etc. ex: Netezza (compresses records after pre-processing and uses FPGA to decompress files), Hive(ex: using ORC serde to store data in compressed columnar form) use some of these approaches on compressed files.
- For files with small alphabets code compression would help. For example 2-bit code(similar to bitmaps) compression on a genome file containing A,C, T, G characters – gives 75% space saving (or 25% compression ratio) compared to an original ASCII file containing the characters themselves. Both compressed and uncompressed files can be read in approximately same time since the alphabet contains very few letters.
- Flip the search by compressing search string using compression algorithm’s specifics i.e. translate search string into a compressed binary format that can be quickly compared against the binary compressed content. (NOTE: cannot be used for regular expression searches yet could have wide applicability). For example
- For code compression, code the search string itself.
- For Huffman pre-process to read and build prefix-free codes trie and then code the search string.
- For LZW pre-process once by decompressing the file completely to rebuild code table and then code the search string using the code table.
- For LZ77(used by gzip) pre-process once per sliding window and generate a separate file (a serialized symbol table with block ids as key and a trie of repeated strings may be??) which can be later used to check a search string is present in a block.
For illustrative purposes below is an experiment, its implementation and some results. It does not apply above ideas instead tries to establish baseline by creating an analogy for grep and zgrep(decompress and search) tools.
- Implement GREP to search all matches of a regular expression on a stream reader.
- Unzip .gz files compressed files and retain a copy of original .gz file
- Read uncompressed file and GREP.
- Read compressed file, decompress and stream the content to GREP.
- Repeat and measure difference in time ( TO-DO NOTE: memory is another parameter to measure)
To ensure apples-to-apples comparison use one language for all of the above,
Source code for the implementation is here.
- It is based on GREP implementation using NFAs (available here) based on Kleene’s theorem which states equivalency between DFA and Regular Expressions.
- For 22K compressed file and 225K uncompressed file and 1000 trials – Average Time difference uncompressed – compressed : -0.002 seconds i.e. the additional cost of decompressing and searching is 2 ms
- For 147K compressed file and 2.3M uncompressed file and 100 trials – Average Time difference between searching uncompressed vs compressed files is : -0.009 seconds. i.e. the additional cost of decompressing and searching is 9 ms.
- For 34M compressed file and 229M uncompressed file and 10 trials – Average Time difference between searchinguncompressed – compressed : -1.773 seconds i.e. the additionalcost of decompressing and searching is 1.773 seconds
- Attempt implementation of Flipped Search on LZW and LZ77 and compare results.