Cassandra
History
- Cassandra was created to solve the inbox search problem at Facebook
- It combined ideas from Amazon's Dynamo with Google's BigTable data model
- In 2008, Facebook released it as open source and it became an Apache Incubator project
- In 2010, Cassandra became an Apache Top-Level Project.
Strengths
- Scalable
- Cassandra is incrementally and linearly scalable; capacity can be added with no downtime. The schema-less data model improves agility in development, reducing the need for disruptive schema updates.
- Reliable
- The failure of multiple nodes can be tolerated. Failed nodes can be replaced with no downtime. Cross-data center replication is well supported. Failures of nodes within the cluster are monitored with an Accrual Style Failure Detector. Because all nodes are symmetric and there are no “master” nodes, there is no single point of failure.
- Durable
- Durability is the property that writes, once completed, will survive permanently even in the face of hardware failure. Cassandra provides configurable durability by appending writes to a commitlog first (which obviates the need for disk seeks since this is a sequential operation), then using the fsync system call to flush the data to disk.
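The append-then-fsync write path described above can be sketched in a few lines of Python. This is an illustrative toy, not Cassandra's actual commitlog format: writes are appended sequentially (no disk seeks), flushed with `fsync` before being acknowledged, and can be replayed after a crash.

```python
import os
import tempfile

class CommitLog:
    """Toy append-only commitlog (illustrative, not Cassandra's format)."""

    def __init__(self, path):
        # Append-only: the write path never seeks.
        self.f = open(path, "ab")

    def append(self, mutation: bytes) -> None:
        self.f.write(len(mutation).to_bytes(4, "big"))  # length prefix
        self.f.write(mutation)
        self.f.flush()             # push Python's buffer to the OS
        os.fsync(self.f.fileno())  # force the OS to write to the device

    def replay(self):
        # After a restart, surviving mutations are replayed in order.
        with open(self.f.name, "rb") as f:
            while header := f.read(4):
                yield f.read(int.from_bytes(header, "big"))

path = os.path.join(tempfile.mkdtemp(), "commitlog.bin")
log = CommitLog(path)
log.append(b"INSERT row=1")
log.append(b"INSERT row=2")
print(list(log.replay()))  # [b'INSERT row=1', b'INSERT row=2']
```

Because appends are sequential, the durability guarantee costs far less than random-access writes would.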
- Analytics Without ETL
- Hadoop jobs can be executed directly against your cluster.
- Performant
- Consistency is tunable per operation, allowing consistency levels to be traded for faster response times when needed. There are no reads or seeks in the write path. Multiple cache tuning options allow for optimizing towards specific workloads and data models.
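The tunable-consistency trade-off above comes down to simple replica arithmetic: a read is guaranteed to see the latest write when the read and write replica counts overlap, i.e. R + W > N. A hypothetical helper (the level names mirror Cassandra's, but the function is illustrative, not its implementation):

```python
# Sketch of the replica-count arithmetic behind tunable consistency
# (level names mirror Cassandra's; the code itself is illustrative).

def replicas_required(level: str, replication_factor: int) -> int:
    """How many replica acknowledgements a consistency level needs."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(level)

def read_sees_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    # Overlap guarantee: if R + W > N, every read set intersects every
    # write set, so at least one contacted replica has the newest value.
    r = replicas_required(read_level, rf)
    w = replicas_required(write_level, rf)
    return r + w > rf

rf = 3
print(read_sees_latest_write("QUORUM", "QUORUM", rf))  # True  (2 + 2 > 3)
print(read_sees_latest_write("ONE", "ONE", rf))        # False (1 + 1 <= 3)
```

Dropping to ONE buys latency at the cost of possibly stale reads; QUORUM/QUORUM keeps strong consistency while tolerating a minority of replica failures.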
Key points
- Massively scalable
- Partitioned row store
- Masterless architecture
- Linearly scalable performance
- No single point of failure (SPOF)
- Read/write support across multiple data centers and cloud availability zones
- Access via APIs and queries
- CQL and Thrift
- Replication
- Peer-to-peer
- Written in Java
- Consistency: tunable consistency
- Built-in data compression
- MapReduce support
- Primary and secondary indexes
- Security features
MapReduce
- Wikipedia
- MapReduce é um modelo de programação desenhado para processar grandes volumes de dados em paralelo, dividindo o trabalho em um conjunto de tarefas independentes.
- Programas MapReduce são escritas em um determinado estilo influenciado por construções de programação funcionais, especificamente expressões idiomáticas para listas de processamento de dados
- Este módulo explica a natureza do presente modelo de programação e como ela pode ser usada para escrever programas que são executados no ambiente Hadoop.
Latest features
Cassandra 1.2 introduced many improvements, which are described briefly in this section.
Cassandra 1.2.2 and later support CQL3-based implementations of IAuthenticator and IAuthorizer for use with these previously introduced security features:
Internal authentication based on Cassandra-controlled login accounts and passwords.
Object permission management using internal authorization to grant or revoke permissions for accessing Cassandra data through the familiar relational database GRANT/REVOKE paradigm.
Client-to-node encryption, which protects data in flight from client machines to a database cluster, was also released in Cassandra 1.2.
Virtual nodes
Prior to this release, Cassandra assigned one token per node, and each node owned exactly one contiguous range within the cluster. Virtual nodes (vnodes) change this paradigm from one token and range per node to many tokens per node. This allows each node to own a large number of small ranges distributed throughout the ring, which has a number of important advantages. The shuffle tool upgrades a cluster to use vnodes.
Murmur3Partitioner
This new default partitioner provides faster hashing and improved performance.
Faster startup times
The release provides faster startup/bootup times for each node in a cluster, with internal tests at DataStax showing up to 80% less time needed to load primary indexes. The reductions were achieved through more efficient sampling and loading of indexes into memory caches; index load time improves dramatically because the partition index no longer needs to be scanned.
Improved handling of disk failures
In previous versions, a single unavailable disk could make the whole node unresponsive (while still technically alive and part of the cluster): memtables were not flushed and the node eventually ran out of memory, and if the disk held the commitlog, no further data could be appended to it. The recommended configuration was therefore to deploy Cassandra on top of RAID 10, at the cost of 50% more disk space. The new disk management solves these problems and eliminates the need for RAID, as described in the hardware recommendations.
Multiple independent leveled compactions
Increases the performance of leveled compaction. Cassandra's leveled compaction strategy creates data files of a fixed, relatively small size that are grouped into levels.
Configurable and more frequent tombstone eviction
Tombstones are evicted more often and automatically in Cassandra 1.2 and are easier to manage. Configuring tombstone eviction instead of manually performing compaction saves time, effort, and disk space.
Support for concurrent schema changes
Cassandra 1.1 introduced concurrent modification of schema objects across a cluster, but did not support programmatically and concurrently creating and dropping tables (permanent or temporary). Version 1.2 adds this support, so multiple users can add and drop tables, including temporary tables, concurrently.
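The virtual-nodes idea described above can be simulated to show why many small random token ranges balance ownership automatically (illustrative Python, not Cassandra code; 256 matches the `num_tokens` default introduced in 1.2):

```python
import random
from collections import Counter

# Vnode simulation (illustrative, not Cassandra code): instead of one
# token per node, each node claims many random tokens, so it owns many
# small ranges spread around the ring.

random.seed(1)        # deterministic demo
RING = 2 ** 32        # toy token space
NUM_VNODES = 256      # matches Cassandra 1.2's num_tokens default

def assign_tokens(nodes, vnodes_per_node):
    tokens = {}
    for node in nodes:
        for _ in range(vnodes_per_node):
            tokens[random.randrange(RING)] = node
    return sorted(tokens.items())  # (token, owner) pairs around the ring

def ownership(ring):
    # Each owner holds the range ending at each of its tokens.
    owned = Counter()
    prev = ring[-1][0] - RING  # wrap the last range around zero
    for tok, node in ring:
        owned[node] += tok - prev
        prev = tok
    return {n: owned[n] / RING for n in owned}

ring = assign_tokens(["n1", "n2", "n3"], NUM_VNODES)
shares = ownership(ring)
print(shares)  # each node's share lands close to 1/3, with no manual balancing
```

With a single token per node, balancing required carefully computed token assignments and rebalancing moved whole contiguous ranges; with hundreds of random vnodes, ownership evens out statistically and a joining node takes a small slice from every existing node.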