Cloud Computing I - Week04 - 1 - Key-Value stores

Reading Resources


Why Key-Value/NoSQL?

Key-Value Abstraction

  • Key -> Value
  • essentially a dictionary data structure
  • sort of a database
  • In contrast, an RDBMS is
    • schema-based: structured tables
    • each row has a unique primary key
    • queried using SQL (Structured Query Language)
    • supports joins
    • may have foreign keys, which are used to look up and relate rows in another table

Drawbacks of RDBMS-style databases for today's workloads

  • data is large and unstructured
  • lots of random reads and writes
  • sometimes write-heavy
  • foreign keys are rarely needed
  • joins are infrequent

Today's needs

  • speed
  • avoid a Single Point of Failure (SPoF)
  • low Total Cost of Operation/Ownership (TCO)
  • fewer system administrators
  • incremental scalability
  • scale out, not scale up

Scale out, Not Scale up

  • Scale up: grow your cluster capacity by replacing machines with more powerful ones

    • traditional approach
    • not cost effective, as you are buying above the sweet spot on the price curve
    • and you need to replace machines often
  • Scale out: incrementally grow your cluster capacity by adding more COTS (Components Off the Shelf) machines

    • cheaper
    • over a long duration, phase in a few newer (faster) machines as you phase out a few older machines
    • used by most companies that run datacenters and clouds today

Key-value/NoSQL Data Model

  • NoSQL = "Not only SQL"
  • simple API: get(key) and put(key, value) (see the sketch below)
  • some stores offer more powerful operations, e.g. CQL in the Cassandra key-value store
  • Tables
    • "Column Families" in Cassandra
    • "Table" in HBase
    • "Collection" in MongoDB
  • like tables in an RDBMS, but
    • may be unstructured: there may be no schema, i.e. some columns may be missing in some rows
    • don't always support joins or have foreign keys
    • can have index tables, just like RDBMSs
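
A minimal sketch of the get()/put() abstraction described above, using a plain Python dict as the backing store; all names here are illustrative, not any real store's API:

```python
# Minimal illustration of the key-value abstraction: just get() and put(),
# no schema, no joins, no foreign keys.
class KeyValueStore:
    def __init__(self):
        self._data = {}          # dictionary data structure backing the store

    def put(self, key, value):
        self._data[key] = value  # overwrite blindly; no schema is enforced

    def get(self, key):
        return self._data.get(key)  # None if the key is absent

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "city": "London"})   # value can be any blob
store.put("user:43", {"name": "Alan"})                    # rows need not share columns
print(store.get("user:42"))
```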

images04_01_01.png

Which of the following may NOT be present in key-value/NoSQL stores?

  • [x] Joins
  • [x] Foreign Keys
  • [ ] Tables
  • [ ] Rows
  • [x] Same Columns in Some rows

Column Oriented Storage

images04_01_02.png

Why is column-oriented storage useful? (select all correct answers)

  • [x] Queries that touch only a few columns are faster than those in a row-oriented storage.
  • [x] It enables faster range searches using one column.
  • [ ] It leads to lower total storage for the table.
  • [x] It prevents the entire table from being read for a query.

Cassandra

  • a distributed key-value store
  • intended to run in a datacenter
  • originally designed at Facebook
  • open source (now an Apache project)
  • used by IBM, Adobe, HP, eBay, Ericsson, Twitter, Spotify
  • Netflix uses it to keep track of your current position in the video you're watching

Design (Key -> server mapping)

  • similar to a DHT, but without finger tables or routing
  • the client sends queries to one Cassandra server, known as the Coordinator
  • the key -> server mapping is done by the "Partitioner" (sketched below)
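
A sketch of what a hash-based partitioner does conceptually: hash the key onto a ring of positions and hand it to the server owning that position. The server names, ring layout, and use of MD5 are illustrative assumptions, not Cassandra's actual implementation:

```python
import hashlib
from bisect import bisect_left

# Illustrative 4-server ring; positions and names are made up.
RING_SIZE = 2 ** 32
servers = {0: "serverA", 2 ** 30: "serverB", 2 ** 31: "serverC", 3 * 2 ** 30: "serverD"}
points = sorted(servers)

def ring_position(key):
    """Hash the key onto the ring (Chord-like hash partitioning)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def responsible_server(key):
    """Walk clockwise from the key's position to the next server point."""
    pos = ring_position(key)
    idx = bisect_left(points, pos) % len(points)
    return servers[points[idx]]

# The coordinator would forward get()/put() for this key to the server below.
print(responsible_server("user:42"))
```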

Data Placement Strategies

  1. SimpleStrategy
    • uses the Partitioner, of which there are two kinds
      • RandomPartitioner: Chord-like hash partitioning
      • ByteOrderedPartitioner: assigns ranges of keys to servers
        • easier for range queries (e.g. get me all Twitter users whose names start with ...)
        • with hash-based partitioning, such a query may need to ask every server
  2. NetworkTopologyStrategy
    • for multi-DC deployments
    • e.g. two replicas per DC, or three replicas per DC
    • per DC
      • first replica placed according to the Partitioner
      • then go clockwise around the ring until you hit a different rack (see the sketch below)
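
A sketch of the per-DC placement rule described above: first replica from the partitioner, then walk clockwise until further replicas land on different racks. The ring layout and rack names are made-up examples, and the simplification of always requiring a new rack is an assumption:

```python
# Each entry is (ring_position, server, rack); sorted by ring position.
ring = [
    (10, "s1", "rack1"), (20, "s2", "rack1"), (30, "s3", "rack2"),
    (40, "s4", "rack2"), (50, "s5", "rack3"),
]

def place_replicas(key_position, n_replicas):
    """First replica per the partitioner, then clockwise until a new rack is hit."""
    start = next((i for i, (pos, _, _) in enumerate(ring) if pos >= key_position), 0)
    replicas, racks_used = [], set()
    for step in range(len(ring)):
        pos, server, rack = ring[(start + step) % len(ring)]
        if not replicas or rack not in racks_used:   # first replica is unconditional
            replicas.append(server)
            racks_used.add(rack)
        if len(replicas) == n_replicas:
            break
    return replicas

print(place_replicas(15, 3))   # ['s2', 's3', 's5'] -- one replica per rack
```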

Snitches

  • map IPs to racks and DCs
  • configured in cassandra.yaml
  • some options
    • SimpleSnitch: unaware of topology (rack-unaware)
    • RackInferringSnitch: infers the network topology from the octets of the server's IP address
      • 101.102.103.104 = x.<DC octet>.<rack octet>.<node octet>
    • PropertyFileSnitch: uses a config file
    • EC2Snitch: uses EC2 metadata
      • EC2 region = DC
      • Availability Zone = rack
    • more options are available

Writes

  • need to be lock-free and fast
  • the client sends the write to one coordinator node in the Cassandra cluster
    • the coordinator may be per-key, per-client, or per-query
    • a per-key coordinator ensures writes for that key are serialized
  • the coordinator uses the Partitioner to send the query to all replica nodes responsible for the key
  • when X replicas respond, the coordinator returns an acknowledgement to the client
    • X? for now, assume it is a small number
  • keys should always be writable: Hinted Handoff mechanism (sketched below)
    • if any replica is down, the coordinator writes to all the other replicas and keeps the write locally until the down replica comes back up
    • when all replicas are down, the coordinator (front end) buffers the writes (for up to a few hours)
  • one ring per DC
    • a per-DC coordinator is elected to coordinate with other DCs
    • election is done via ZooKeeper, which runs a Paxos (consensus) variant
    • ZooKeeper is an open-source Apache coordination service
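
A sketch of the coordinator's write path with hinted handoff, under the simplifying assumption that replica liveness is already known and X acks are enough to reply; all function and variable names are illustrative:

```python
def coordinator_write(key, value, replicas, is_alive, x_acks, hints):
    """Send the write to all replicas; ack the client once X replicas respond.
    Writes destined for down replicas are kept locally as 'hints' and replayed
    when those replicas come back up (hinted handoff)."""
    acks = 0
    for replica in replicas:
        if is_alive(replica):
            replica_write(replica, key, value)    # assumed RPC to the replica
            acks += 1
        else:
            hints.setdefault(replica, []).append((key, value))  # keep the write locally
    return acks >= x_acks    # True -> reply OK to the client

def replica_write(replica, key, value):
    print(f"{replica} <- put({key!r}, {value!r})")

hints = {}
ok = coordinator_write("user:42", "v1", ["s1", "s2", "s3"],
                       is_alive=lambda r: r != "s3", x_acks=2, hints=hints)
print(ok, hints)    # True, {'s3': [('user:42', 'v1')]}
```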

Writes at the Replica Node

On receiving a write:

  • log it in the disk commit log (for failure recovery)
  • make changes to the appropriate memtables
    • memtable = in-memory representation of multiple key-value pairs
    • a cache that can be searched by key
    • write-back cache, as opposed to write-through

Later, when a memtable is full or old, flush it to disk (sketched below):

  • Data file: an SSTable (Sorted String Table) -- a list of key-value pairs, sorted by key
  • Index file: an SSTable of (key, position in data SSTable) pairs
    • searching the index of every on-disk SSTable for each lookup would be a large overhead
    • a Bloom filter per SSTable avoids this, by quickly ruling out SSTables that cannot contain the key
  • and a Bloom filter (for efficient search) -- next slide
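
A sketch of flushing a full memtable into an on-disk data file plus its index, as described above; the in-memory lists stand in for real files and all names are illustrative:

```python
def flush_memtable(memtable):
    """Write the memtable out as a sorted data file (SSTable) plus an index file
    mapping each key to its position in the data file."""
    data_file, index_file, position = [], [], 0
    for key in sorted(memtable):                 # SSTable = key-value pairs sorted by key
        data_file.append((key, memtable[key]))
        index_file.append((key, position))       # (key, position in data SSTable)
        position += 1
    return data_file, index_file

memtable = {"banana": "v2", "apple": "v1", "cherry": "v3"}
data, index = flush_memtable(memtable)
print(data)    # [('apple', 'v1'), ('banana', 'v2'), ('cherry', 'v3')]
print(index)   # [('apple', 0), ('banana', 1), ('cherry', 2)]
```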

Bloom Filter

  • a compact way of representing a set of items
  • supported operations: insert a key into the set, and check whether a key is in the set
  • some probability of a false positive: an item not in the set may be reported as being in the set
  • never a false negative: an item in the set is always reported as present

images04_01_03.png

  • it is a large bitmap (128 bits shown in the figure), initially all 0s
  • k hash functions (Hash1 .. Hashk) are applied to a key K, each yielding a bit position
  • insert: set all k of those bit positions to 1
  • lookup: check all k bit positions; if any is 0, the key is definitely not in the set; if all are 1, the key is probably in the set (false positives arise when other keys happen to have set those same bits) -- see the sketch below
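
A compact Bloom filter sketch matching the description above (a bitmap plus k hash functions); using salted SHA-1 digests as the k hashes is an assumption for illustration:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=128, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = [0] * m_bits                      # large bitmap, initially all 0s

    def _positions(self, key):
        # k salted hashes of the key, each mapped to a bit position
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1                         # set all k bits

    def maybe_contains(self, key):
        # True may be a false positive; False is never wrong (no false negatives)
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.insert("user:42")
print(bf.maybe_contains("user:42"))   # True
print(bf.maybe_contains("user:99"))   # almost certainly False
```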

Compaction

  • data updates accumulate over time, so SSTables and logs need to be compacted
    • compaction merges SSTables, i.e. it merges the updates for each key
    • run periodically and locally at each server (see the merge sketch below)
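
A sketch of compaction as a merge of per-key updates across SSTables, keeping only the newest value for each key; the timestamps and the two example tables are made up:

```python
def compact(*sstables):
    """Merge several SSTables (lists of (key, (value, timestamp)) sorted by key)
    into one, keeping only the latest-timestamped value per key."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table:
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    # Drop tombstoned items during compaction (see the Delete section below).
    return sorted((k, v) for k, v in merged.items() if v[0] is not None)

old = [("a", ("v1", 1)), ("b", ("v1", 1))]
new = [("a", ("v2", 5)), ("b", (None, 6))]   # None marks a tombstone for key 'b'
print(compact(old, new))                      # [('a', ('v2', 5))]
```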

Delete

  • items are not deleted immediately
  • instead, a tombstone marker is written to the log
  • eventually, when compaction encounters the tombstone, it deletes the item

Reads

  • similar to writes
  • the coordinator contacts X replicas
    • the coordinator sends the read to the replicas that have responded quickest in the past
    • when X replicas respond, the coordinator returns the latest-timestamped value from among those X (sketched below)
    • X?

images04_01_04.png
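
A sketch of the read-side reconciliation described above: wait for X replica responses and return the latest-timestamped value among them; the response format is an assumption:

```python
def reconcile_read(responses, x):
    """responses: list of (replica, value, timestamp); x: read consistency level.
    Return the latest-timestamped value among the first X responses."""
    if len(responses) < x:
        raise TimeoutError("fewer than X replicas responded")
    first_x = responses[:x]                       # replicas that answered quickest
    _, value, ts = max(first_x, key=lambda r: r[2])
    return value, ts

responses = [("s2", "v_old", 3), ("s1", "v_new", 7), ("s3", "v_old", 3)]
print(reconcile_read(responses, x=2))   # ('v_new', 7)
```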

Membership

  • any server in the cluster could be the coordinator
  • so every server needs to maintain a list of all the other servers that are currently in the cluster
  • the list needs to be updated automatically as servers join, leave, and fail

images04_01_05.png

Suspicion Mechanism

  • adaptively sets the timeout based on underlying network and failure behaviour
  • accrual detector:
    • the failure detector outputs a value (PHI) representing suspicion
    • applications set an appropriate threshold
    • PHI calculation for a member
      • based on the inter-arrival times of gossip messages from that member
      • PHI(t) = -log10( P(t_now - t_last) ), where P comes from the distribution (CDF) of past inter-arrival times
    • PHI effectively determines the detection timeout, but takes historical inter-arrival time variations for gossiped heartbeats into account (see the sketch below)
  • in practice, PHI = 5 => 10-15 s detection time
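
A sketch of the accrual (PHI) computation described above, assuming the inter-arrival times of gossiped heartbeats follow an exponential distribution (a common simplification; the real detector fits the observed distribution):

```python
import math
import statistics

def phi(t_now, t_last, inter_arrival_history):
    """PHI(t) = -log10(P(still no heartbeat after t_now - t_last)),
    using the historical mean inter-arrival time of gossip heartbeats."""
    mean = statistics.mean(inter_arrival_history)
    elapsed = t_now - t_last
    p_later = math.exp(-elapsed / mean)      # exponential-tail approximation
    return -math.log10(p_later)

history = [1.0, 1.2, 0.9, 1.1]               # seconds between recent heartbeats
print(phi(t_now=12.0, t_last=10.0, inter_arrival_history=history))  # larger gap -> larger PHI
```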

Cassandra vs RDBMS (MySQL)

              Writes          Reads
  MySQL       300 ms avg      350 ms avg
  Cassandra   0.12 ms avg     15 ms avg

The Mystery of X - The CAP Theorem

  • three very important properties for any distributed system, particularly a storage system
    • Consistency: all nodes see the same data at any time
    • Availability: the system allows operations all the time
    • Partition-tolerance: the system continues to work in spite of network partitions
  • the CAP theorem says you cannot guarantee all 3; you can guarantee at most 2 of them
  • a 500 ms increase in latency for operations at Amazon.com or Google.com can cause a 20% drop in revenue
  • at Amazon, each millisecond of latency implies a $6M yearly loss
  • Service Level Agreements (SLAs) mainly deal with the latencies faced by customers

CAP Theorem Fallout

  • today's clouds need partition tolerance, so a system must choose between consistency and availability
  • Cassandra: Eventual (weak) consistency, availability, partition-tolerance
  • Traditional RDBMSs: Strong consistency over availability under a partition

CAP TradeOff

  • Starting point for NoSQL Revolution
  • A distributed storage system can achieve at most two of C, A and P
  • C or A when P is required
  • RDBMSs are generally deployed on a single site, so they don't need partition tolerance (and can provide both C and A)
  • HBase, HyperTable, etc. choose consistency over availability under a partition

Eventual Consistency

  • if all writes (to a key) stop, then all of its values (replicas) will eventually converge to the latest write
  • if writes continue, the system always tries to keep converging
    • i.e. replicas keep trying to catch up with the newest values, but clients may not be able to tell which value is the latest
  • with many back-to-back writes, a query may still return a stale value
  • works well when the write rate to a key is low

Eventual consistency says:

  • [ ] All values for a given key will eventually be different.
  • [ ] All values for a given key will be the same within a time bound.
  • [x] All values for a given key will eventually be the same after writes have stopped.

RDBMSs vs Key-Value Stores

  • RDBMSs provide ACID
    • Atomicity
    • Consistency
    • Isolation
    • Durability
  • key-value stores provide BASE
    • Basically Available
    • Soft-state (e.g. memtables)
    • Eventual consistency
    • i.e. they prefer availability over consistency

Mystery of X

  • Cassandra offers a choice of consistency levels (the X) for each read/write operation
  • ANY: any server (may not be a replica)
    • fastest: the coordinator caches the write and replies quickly to the client
  • ALL: all replicas
    • ensures strong consistency, but slowest
  • ONE: at least one replica
    • faster than ALL, but cannot tolerate a failure
  • QUORUM: quorum across all replicas in all DCs
    • global consistency, but still fast
  • LOCAL_QUORUM: quorum in the coordinator's DC
    • faster: only waits for a quorum in the first DC the client contacts
  • EACH_QUORUM: quorum in every DC
    • lets each DC do its own quorum: supports hierarchical replies

QUORUM

  • a quorum is > 50% of the replicas (i.e. a majority)
  • any two quorums intersect
    • e.g. client 1 writes to one quorum; a later read by client 2 from any other quorum includes at least one replica that saw the write
  • important, as it brings a strong consistency guarantee (same as ALL)
  • several key-value/NoSQL stores (e.g. Riak and Cassandra) use quorums
  • Reads
    • the client specifies a value of R (<= N, the total number of replicas of that key)
    • R = read consistency level
    • the coordinator waits for R replicas to respond before sending the result to the client
    • in the background, the coordinator checks the consistency of the remaining (N - R) replicas, and initiates read repair if needed
  • Writes
    • the client specifies W (<= N)
    • W = write consistency level
    • writing the new value to W replicas comes in two flavors
      • blocking: the coordinator blocks until the quorum of W is reached
      • asynchronous: just write and return
  • two necessary conditions for strong consistency
    • W + R > N
    • W > N/2
  • select W and R values based on the application (see the sketch below)

  W          R          When to use
  1          1          very few writes and reads
  N          1          great for read-heavy workloads
  N/2 + 1    N/2 + 1    great for write-heavy workloads
  1          N          great for write-heavy workloads with mostly one client writing per key
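
A small check of the two strong-consistency conditions listed above, applied to the example configurations from the table (N, W, R are the replica count and the write/read consistency levels):

```python
def is_strongly_consistent(n, w, r):
    """Strong consistency needs read and write quorums that intersect (W + R > N)
    and write quorums that intersect each other (W > N/2)."""
    return (w + r > n) and (w > n / 2)

N = 3
for w, r in [(1, 1), (N, 1), (N // 2 + 1, N // 2 + 1), (1, N)]:
    print(f"W={w}, R={r}: strong consistency = {is_strongly_consistent(N, w, r)}")
# W=1,R=1 -> False; W=3,R=1 -> True; W=2,R=2 -> True; W=1,R=3 -> False (W <= N/2)
```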

The Consistency Spectrum

  Eventual consistency  <------------------------------>  Strong consistency
  (faster reads and writes)                               (more consistency)

  • Cassandra offers eventual consistency
  • Amazon's Dynamo and LinkedIn's Voldemort use a last-write-wins (LWW) policy
  • the Dynamo system inspired both Cassandra and LinkedIn's Voldemort
  • newer consistency models, in between these two extremes, are being researched

Newer Consistency Models

images04_01_06.png

  • Per-key sequential: per key, all operations have a global order
  • CRDTs (Commutative Replicated Data Types): data structures for which commuted writes give the same result [INRIA, France]
    • commutative writes means the end result is the same regardless of the order in which the writes are applied
    • e.g. value == int, and the only op allowed is +1 (see the sketch below)
    • effectively, servers don't need to worry about consistency
  • Red-Blue: rewrite client transactions to separate operations into red ops vs. blue ops [MPI-SWS, Germany]
    • blue ops can be executed (commuted) in any order across DCs
    • red ops need to be executed in the same order at every DC
  • Causal Consistency: reads must respect the partial (causal) order among operations
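
A sketch of the CRDT idea using the simplest example from the bullets above (the value is an int and the only op is +1): a grow-only counter whose per-replica increments commute, so replicas can merge in any order and still converge. Replica names are illustrative:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments its own slot;
    merging takes the per-slot maximum, so merge order does not matter."""
    def __init__(self, replica_id, replicas):
        self.replica_id = replica_id
        self.counts = {r: 0 for r in replicas}

    def increment(self):
        self.counts[self.replica_id] += 1          # the only allowed op: +1

    def merge(self, other):
        for r in self.counts:
            self.counts[r] = max(self.counts[r], other.counts[r])

    def value(self):
        return sum(self.counts.values())

a = GCounter("A", ["A", "B"])
b = GCounter("B", ["A", "B"])
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)                              # merges applied in either order...
print(a.value(), b.value())                         # ...both replicas converge to 3
```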

Strong Consistency Models

  • Linearizability:
    • each operation by a client is visible instantaneously (in real time) to all other clients
    • i.e. as if there were only one copy of the key-value pair, stored on a single server
    • difficult to support in distributed systems
  • Sequential consistency:
    • the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
    • not as strong as linearizability
  • transaction (ACID) properties are supported by some newer key-value/NoSQL systems
    • HyperDex [Cornell]
    • Spanner [Google]
    • Transaction chains [Microsoft Research]

HBase

  • Google's BigTable was the first "blob-based" storage system
  • Yahoo! open-sourced an implementation of it -> HBase
  • API functions
    • Get/Put(row)
    • Scan(row range, filter): range queries
    • MultiPut: multiple key-value pairs
  • unlike Cassandra, HBase prefers consistency (over availability)

HBase Architecture

images04_01_07.png

  • Client: sends read/write queries
  • ZooKeeper: runs consensus/coordination
  • HRegionServers:
    • each is associated with an HLog (write-ahead log)
    • contains multiple HRegions
      • each HRegion has multiple Stores
        • each Store has a MemStore
        • and multiple StoreFiles
          • each StoreFile wraps an HFile
  • HFile: stored as a file in the underlying HDFS (the same Hadoop file system)
  • HMaster: coordinates with ZooKeeper and the other components

HBase Table Hierarchy

  • HBase Table
    • split into regions: replicated across servers
      • ColumnFamily = a subset of columns with similar query patterns
      • one Store per combination of ColumnFamily + region
        • MemStore for each Store: in-memory updates to the Store, flushed to disk when full
          • StoreFiles for each Store (for each region): where the data lives
            • HFile
  • HFile
    • an SSTable, from Google's BigTable

Which of the following HBase components is stored in memory?

  • [ ] HFile
  • [ ] HRegionServer
  • [x] MemStore
  • [ ] Store

HFile

images04_01_08.png

  • Data block
    • magic number: to uniquely identify the block
    • a series of key-value entries; the whole key portion of an entry is known as the HBase key
      • key length
      • value length
      • row length
      • row: e.g. for a people-tracking application, the Social Security number
      • column family length
      • column family: e.g. "Demographic information"
      • column qualifier: e.g. "Ethnicity"
      • timestamp
      • key type
      • value

Strong Consistency: HBase Write-Ahead Log

images04_01_09.png

  • a write-ahead log (the HLog) is used
  • different keys are assigned to different HRegions
  • the region first writes the entry to the HLog, then to the MemStore
  • when a MemStore is full, it is flushed to a StoreFile (an HFile), as sketched below
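
A sketch of the write-ahead pattern described above: append to the log first, then update the in-memory MemStore, and flush to an HFile-like sorted file when the MemStore is full. The class, threshold, and file representation are illustrative, not HBase internals:

```python
class RegionSketch:
    def __init__(self, flush_threshold=3):
        self.hlog = []                      # write-ahead log (would be on disk)
        self.memstore = {}                  # in-memory updates
        self.store_files = []               # flushed, sorted "HFiles"
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.hlog.append((key, value))      # 1. log first, for failure recovery
        self.memstore[key] = value          # 2. then apply to the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self._flush()

    def _flush(self):                       # 3. MemStore full -> write a sorted StoreFile
        self.store_files.append(sorted(self.memstore.items()))
        self.memstore = {}

region = RegionSketch()
for i in range(4):
    region.put(f"row{i}", f"v{i}")
print(region.store_files)                   # one flushed file with rows 0-2
print(region.memstore)                      # {'row3': 'v3'} still in memory
```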

Log Replay

  • on failure, or upon boot-up, the HLog is replayed
  • timestamps are used to find out where the database is with respect to the log
  • during replay, the edits are re-applied to the MemStore

Cross-Datacenter Replication

  • a single MASTER DC
  • the other, SLAVE, DCs replicate the same tables
  • the MASTER cluster synchronously sends HLogs over to the slave DCs
  • coordination among clusters is via ZooKeeper
  • ZooKeeper can be used like a file system to store control information, e.g.:
  1. /hbase/replication/state
  2. /hbase/replication/peers/<peer cluster no>
  3. /hbase/replication/rs/<hlog>
