- Most distributed systems (e.g. HBase, VoltDB) require coordination among their nodes for some or all of their tasks. A system can either handle coordination itself or delegate it to a separate coordination service such as Zookeeper.
- The coordination service itself needs to be fault tolerant and should preferably offer very high throughput and scale horizontally, i.e. it is a distributed system in its own right.
- The Zookeeper paper (ZooKeeper: Wait-free coordination for Internet-scale systems) has these key ideas:
- Provide wait-free base features and let Zookeeper clients build higher-order primitives on top of them, e.g. read/write locks, simple locks without the herd effect, group membership, configuration management.
- The base features are
- sequential/linearizable writes at leader node
- Reads are served locally by the server the client is connected to, without going through the leader
- FIFO execution of client requests
- Optimistic locking, i.e. version-based operations
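
The version-based operations above can be illustrated with a small in-memory toy. This is only a sketch of the idea, not the Zookeeper client API: `ToyStore`, `BadVersionError` and `increment` are hypothetical names invented for this example. A write succeeds only if the caller passes the version it last read, so a client holding a stale version has to re-read and retry instead of holding a lock.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


class ToyStore:
    """Hypothetical in-memory stand-in for a versioned node store."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (data, 0)

    def get(self, path):
        # Returns the data together with the version it was read at.
        return self._nodes[path]

    def set_data(self, path, data, expected_version):
        # Conditional write: apply only if nobody wrote since our read.
        _, current = self._nodes[path]
        if expected_version != current:
            raise BadVersionError(f"expected {expected_version}, found {current}")
        self._nodes[path] = (data, current + 1)
        return current + 1


def increment(store, path):
    """Read-modify-write that retries until its conditional write succeeds."""
    while True:
        data, version = store.get(path)
        try:
            return store.set_data(path, data + 1, version)
        except BadVersionError:
            continue  # someone else wrote first; re-read and try again
```

No client ever blocks another here: a losing writer simply learns that its version is stale and retries.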
- The underlying mechanism for linearizable writes is
- A protocol for atomic broadcast: Zookeeper defines this as guaranteed delivery of a message to all nodes (e.g. a write message from the leader)
- ZAB (Zookeeper Atomic Broadcast) seems to be essentially a variant of the two-phase commit protocol. An article explaining it further is here.
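
To make the propose/acknowledge/commit flow concrete, here is a toy sketch. It is not ZAB itself: it leaves out leader election, recovery and networking, and the `Leader` and `Follower` classes are hypothetical. It does show one way the flow differs from classic two-phase commit: as the Zookeeper paper describes, a majority quorum of acknowledgements is enough to commit, so a minority of failed followers does not block writes.

```python
class Follower:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up        # a follower that is down never acknowledges
        self.pending = {}   # zxid -> proposed update, not yet committed
        self.log = []       # committed updates, in zxid order

    def propose(self, zxid, update):
        if not self.up:
            return False
        self.pending[zxid] = update
        return True         # acknowledgement back to the leader

    def commit(self, zxid):
        if self.up and zxid in self.pending:
            self.log.append((zxid, self.pending.pop(zxid)))


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.next_zxid = 1  # every write gets a monotonically increasing id
        self.log = []

    def broadcast(self, update):
        zxid = self.next_zxid
        self.next_zxid += 1
        acks = 1            # the leader counts itself
        for follower in self.followers:
            if follower.propose(zxid, update):
                acks += 1
        ensemble_size = len(self.followers) + 1
        if acks * 2 <= ensemble_size:
            return None     # no majority: the update is not committed
        self.log.append((zxid, update))
        for follower in self.followers:
            follower.commit(zxid)
        return zxid
```

Because the leader alone assigns ids and every follower commits in id order, all live replicas end up with the same sequence of writes.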
- VoltDB is an in-memory, fault-tolerant, ACID-compliant, horizontally scalable system. Some of its architecture choices are here. The key ideas are
- Single-threaded operations
- No client-controlled transactions. Instead, each stored procedure runs in a transaction.
- Partitioning and a shared-nothing cluster (with replication for fault tolerance)
- Leveraging deterministic operations: instead of writing at a master and propagating changes to all nodes holding replicas, each node that holds a replica executes the same transaction in parallel.
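
The ideas above fit together roughly as in the following toy sketch. It is not VoltDB code: `Cluster`, `Partition`, `Replica` and the two procedures are hypothetical, and the replicas run one after another in a loop rather than truly in parallel. Each partition owns a slice of the keys, each replica applies one stored procedure at a time, and because the procedures are deterministic every replica reaches the same state without shipping changes from a master.

```python
import zlib


def deposit(state, account, amount):
    """Deterministic 'stored procedure'; the whole call is one transaction."""
    state[account] = state.get(account, 0) + amount
    return "ok"


def withdraw(state, account, amount):
    if state.get(account, 0) < amount:
        return "rejected"
    state[account] -= amount
    return "ok"


PROCEDURES = {"deposit": deposit, "withdraw": withdraw}


class Replica:
    """One copy of a partition; runs procedures strictly one at a time."""

    def __init__(self):
        self.state = {}

    def execute(self, name, *args):
        return PROCEDURES[name](self.state, *args)


class Partition:
    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]

    def run(self, name, *args):
        # Every replica executes the same call instead of receiving a diff.
        results = [replica.execute(name, *args) for replica in self.replicas]
        if len(set(results)) != 1:
            raise RuntimeError("replicas diverged")
        return results[0]


class Cluster:
    def __init__(self, partition_count=2):
        self.partitions = [Partition() for _ in range(partition_count)]

    def partition_for(self, key):
        # Shared-nothing: each key belongs to exactly one partition.
        index = zlib.crc32(key.encode()) % len(self.partitions)
        return self.partitions[index]

    def call(self, name, key, *args):
        return self.partition_for(key).run(name, key, *args)
```

With one procedure running at a time per replica, no locks or latches are needed inside a partition, which is the cost the single-threaded design is trying to avoid.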
- VoltDB is based on this paper, which analyzes the operations involved in OLTP systems. Its key observations are
- Buffer management, locking, logging and latching are the costliest operations in OLTP ... unless one strips out all of these components, the performance of a main memory-optimized database is unlikely to be much better than a conventional database where most of the data fit into RAM.
- In a fully stripped down system — e.g., that is single threaded, implements recovery via copying state from other nodes in the network, fits in memory, and uses reduced functionality transactions — the performance is orders of magnitude better than an unmodified system.
- VoltDB uses Zookeeper for leader election. More here.
Some more notes on VoltDB:
- ACID and CAP: https://voltdb.com/blog/disambiguating-acid-and-cap
- Errors in Databases: https://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors