Bigtable: A Distributed Storage System for Structured Data

Abstract

  • Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size

Data Model

A Bigtable is a sparse, distributed, persistent multi- dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) → string

Rows

  • The row keys in a table are arbitrary strings
  • Every read or write of data under a single row key is atomic
  • makes it easier for concurrent updates to the same row.
  • The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of dis- tribution and load balancing

Column Families

  • Column keys are grouped into sets called column fami- lies, which form the basic unit of access control
  • All data stored in a column family is usually of the same type
  • A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used

Timestamps

  • Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp

API

  • The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.
  • Bigtable can be used with MapReduce, a framework for running large-scale parallel computations de- veloped at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.

Building Blocks

  • Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. Bigtable de- pends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.
  • The Google SSTable file format is used internally to store Bigtable data
  • Bigtable relies on a highly-available and persistent distributed lock service called Chubby

Implementation

The Bigtable implementation has three major compo- nents:

  • a library that is linked into every client
  • one master server
  • many tablet servers

A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is auto- matically split into multiple tablets, each approximately 100-200 MB in size by default.

Tablet Location

  • We use a three-level hierarchy analogous to that of a B+- tree to store tablet location information
    1.png

Tablet Assignment

Each tablet is assigned to one tablet server at a time

- The master keeps track of the set of live tablet servers
- Bigtable uses Chubby to keep track of tablet servers
- The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible
- When a master is started by the cluster management system, it needs to discover the current tablet assign- ments before it can change them

Tablet Serving

  • Updates are committed to a commit log that stores redo records
  • the recently committed ones are stored in memory in a sorted buffer called a memtable
  • the older updates are stored in a sequence of SSTables
  • To recover a tablet, a tablet server reads its metadata from the METADATA table. This meta- data contains the list of SSTables that comprise a tablet and a set of a redo points, which are pointers into any commit logs that may contain data for the tablet
    2.png

Refinements

  1. Locality groups
    • Clients can group multiple column families together into a locality group
  2. Compression
    • Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used
  3. Caching for read performance
    • To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS
  4. Bloom filters
    • a read operation has to read from all SSTables that make up the state of a tablet
  5. Commit-log implementation
    • we append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file
  6. Speeding up tablet recovery
    • If the master moves a tablet from one tablet server to another, the source tablet server first does a minor compaction on that tablet
  7. Exploiting immutability
    • Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable

Lesson

  1. large distributed sys- tems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures as- sumed in many distributed protocols
  2. it is important to delay adding new features until it is clear how the new features will be used
  3. the importance of proper system-level monitoring
  4. The most important lesson we learned is the value of simple designs
Donate article here
Share the post