TODO: Create a Stub-like template which suggests people read the Corresponding Talk page as well.

TODO: Create an infobox for a publication (or just an article) and apply it to this paper.

A paper about the lightweight compression schemes used in Actian Vector (then MonetDB/X100). The schemes support adaptive selection of the compression scheme, pipeline-efficient compression and decompression, and other useful features.

WRITEME: Describe context for authoring this.

Take-home messages

  • Don't store raw DB data on disk, store the compressed form.
  • Don't decompress entire pages into memory; only decompress small working sets into CPU cache.
  • Don't compress an entire column; compress chunks of it independently, to (1) avoid global dictionary overflow and (2) adapt compression to local features of the data.
  • Exception patching allows (1) accounting for distribution outliers and (2) decoding in tight loops with no branching.
  • Fast compression is also useful, not just fast decompression.
  • With appropriate compression schemes, TPC-H scale factors as much as 25 times higher can be held in memory.
  • The exceptional values mechanism is usable as a skip-list into the compressed data.
  • Compression schemes should (and can) allow for random-access by index into the compressed data.
  • Sampling can be used to choose a compression scheme for a chunk of column data.
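The chunk-wise, sampling-based choice of scheme mentioned in the list above can be illustrated roughly as follows. This is a minimal sketch and not the paper's actual procedure: the function names (`bits_needed`, `choose_scheme`, `plan_column`), the strided sampling, the chunk and sample sizes, and the cost model (bits per code only, ignoring dictionary storage and exceptions) are all assumptions made for illustration.

```python
def bits_needed(n):
    """Bits needed to represent any integer in [0, n] (at least 1)."""
    return max(1, n.bit_length())


def choose_scheme(chunk, sample_size=64):
    """Pick a scheme for one chunk by inspecting a small strided sample.

    Hypothetical cost model: compare the code width FOR would need
    (range of the sample) with the code width DICT would need
    (number of distinct sampled values). Ties go to FOR, since it
    needs no dictionary.
    """
    step = max(1, len(chunk) // sample_size)
    sample = chunk[::step]
    for_bits = bits_needed(max(sample) - min(sample))
    dict_bits = bits_needed(len(set(sample)) - 1)
    return "FOR" if for_bits <= dict_bits else "DICT"


def plan_column(column, chunk_size=1024):
    """Choose a scheme independently for each chunk of the column."""
    return [
        choose_scheme(column[i:i + chunk_size])
        for i in range(0, len(column), chunk_size)
    ]


# A column whose first chunk is a dense range and whose second chunk
# holds a few widely spread values.
dense = [1000 + (i % 13) for i in range(1024)]
sparse = [(5, 1_000_000, 2_000_000_000)[i % 3] for i in range(1024)]
plan = plan_column(dense + sparse, chunk_size=1024)
```

Because each chunk is planned on its own, a change in the data distribution partway through a column only affects the chunks where it occurs.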

Concepts discussed

TODO: Create a glossary template instead of using plain lists here.

DBMS data compression schemes:

  • FOR: Frame of reference; encode the difference to a constant base value
  • DICT: Dictionary; encode indices into a list of values
  • DELTA: Differences; encode current value minus previous value
  • PFOR: Patched FOR - like FOR, but with the decode result 'patched' with an exceptions pass
  • PDICT: Patched DICT (see PFOR)
  • PDELTA: Patched DELTA
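The schemes listed above can be sketched in a few lines each. This is a simplified illustration over Python lists of integers, without bit-packing, and all function names are hypothetical. In particular, the PFOR sketch keeps exception positions in a separate list, which is an assumption chosen for readability and not necessarily the layout used in the paper.

```python
def for_encode(values):
    """FOR: store a base and each value's offset from it."""
    base = min(values)
    return base, [v - base for v in values]


def for_decode(base, codes):
    return [base + c for c in codes]


def delta_encode(values):
    """DELTA: store each value minus the previous one (first minus 0)."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def dict_encode(values):
    """DICT: store indices into a sorted list of distinct values."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def dict_decode(dictionary, codes):
    return [dictionary[c] for c in codes]


def pfor_encode(values, base, bits):
    """PFOR: FOR with a fixed code width; values that do not fit
    become exceptions, stored uncompressed on the side."""
    limit = 1 << bits
    codes, exc_pos, exc_val = [], [], []
    for i, v in enumerate(values):
        d = v - base
        if 0 <= d < limit:
            codes.append(d)
        else:
            codes.append(0)      # placeholder, overwritten when patching
            exc_pos.append(i)
            exc_val.append(v)
    return codes, exc_pos, exc_val


def pfor_decode(base, codes, exc_pos, exc_val):
    # Pass 1: decode every slot the same way, with no per-value test.
    out = [base + c for c in codes]
    # Pass 2: patch the exceptions over the placeholder results.
    for p, v in zip(exc_pos, exc_val):
        out[p] = v
    return out


values = [100, 103, 101, 9999, 102]
encoded = pfor_encode(values, base=100, bits=2)
decoded = pfor_decode(100, *encoded)
```

With FOR-style codes, the value at index `i` is simply `base + codes[i]`, which is what makes random access by index into the compressed data possible. The outlier 9999 would otherwise force a much wider code for the whole chunk; with patching it costs one exception entry.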

DBMSes discussed

The paper mentions the compression schemes known to be used in some DBMSes at the time (2008).

  • IBM DB2: Drops pointer prefixes in B-trees
  • Teradata: Dictionary compression for columns
  • Oracle: Dictionary compression for disk storage blocks
  • Sybase IQ: Multi-scheme compression, each 'page' compressed separately with its own scheme