Data Management in Cloud Environments: NoSQL and NewSQL Data Stores

Authors: Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari and Miriam AM Capretz 

Presented by: Anisha Rao, Edward Montoya, Lina Louis, Himanshee, Nilisha Makam Prashantha

Introduction

  • The significant paper is taken from the Journal of Cloud Computing (impact factor 5.71) and has 399 citations.
  • The paper is mostly a survey of the new technologies developed to meet the scaling demands of cloud environments.
  • It gives a brief outline of NoSQL and NewSQL data stores, how they relate to each other, and their strengths and shortcomings along different dimensions.
  • We also discuss how this paper can shape the future scope of our term project.
  • Key learnings and insights from the paper are summarized in the next sections.

Why do we need NewSQL?

  • Limited support for transactions/ACID in NoSQL.

One reason people moved from traditional DBMSs to NoSQL is that it opened new opportunities for companies that needed to prioritize availability and performance; NoSQL was designed to achieve exactly these two goals. That advantage becomes a drawback for companies that cannot give up their transaction requirements, because most NoSQL stores (a few exceptions aside) offer limited transaction support. The only options left to such organizations were to acquire a more powerful single-node machine or to develop their own custom sharding service that supports transactions.

  • NoSQL offers weak/eventual consistency.

This gap led to a new class of database systems that keeps the ACID properties of an RDBMS while achieving the scalable performance of NoSQL. This class of systems is called NewSQL.

  • Each NoSQL solution has a steep learning curve, so there is a strong need for the best of both NoSQL and traditional RDBMS.

Another major difficulty in adopting NoSQL is that the number of different data stores grew rapidly, and with it the variety of query languages and tools. People cannot learn so many languages and tools all at once. NewSQL is an advantage here: it lets teams reuse the SQL skills they have built up over time while still getting NoSQL-style scalability in the backend.


Key Learnings From The Paper

  • Partitioning:
    • The scale-out techniques used by NoSQL and NewSQL are horizontal partitioning strategies, namely range partitioning and consistent hashing, on a homogeneous cluster node architecture:

Some of the scale-out options implemented by NoSQL and NewSQL are horizontal partitioning strategies such as range partitioning and consistent hashing. Tools like MongoDB, Cassandra, and BerkeleyDB use range partitioning, while Voldemort, DynamoDB, and VoltDB use consistent hashing. The disadvantage of range partitioning is that it can result in hotspots and load-balancing issues.
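To make consistent hashing more concrete, here is a minimal Python sketch of a hash ring with virtual nodes. It is only an illustration under our own assumptions: the class name ConsistentHashRing, the node names, and the choice of MD5 as the ring hash are ours, not taken from the paper or from any of the data stores mentioned.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 is stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted positions on the ring
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node is placed on the ring many times to even out load.
        for i in range(self.vnodes):
            point = ring_hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def node_for(self, key):
        """Return the node owning the first ring position clockwise from the key."""
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user:{i}" for i in range(1000)]
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

# Adding a node only takes keys over from existing nodes; nothing else is reshuffled.
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, every key that changes owner moves to the newly added node, which is the property that makes consistent hashing attractive for scaling out: the rest of the data stays where it is.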

  • Diverse partitioning & Heterogeneous Architecture:

The interesting point here is the diverse partitioning strategies that NewSQL offers. Departing from the homogeneous architecture used so far, some NewSQL systems, NuoDB for example, adopt a heterogeneous architecture.

  • NewSQL investigates more opportunities with regard to diverse partitioning strategies using a heterogeneous cluster node architecture, e.g. NuoDB:

NuoDB designates a few nodes as storage managers (SMs) and the others as transaction engines (TEs). An SM is responsible for maintaining the data and stores a partition of the database, split into blocks that NuoDB calls ‘atoms’. A TE acts as an in-memory cache and processes all the queries. The main point is that NuoDB uses load-balancing schemes to ensure that data that is often used together resides in the same TE. This helps overcome the disadvantages of range partitioning.

  • NewSQL offers live Data Migration:

Another aspect of partitioning in NewSQL is that it supports live data migration. When there is a heavy load on one particular node, the DBMS can shift data to other, less crowded nodes to alleviate hotspots, without interrupting its service. This also helps overcome the problems of range partitioning.

You may think that this is similar to the re-balancing concept in NoSQL systems, but there is more to it: it is harder for NewSQL because it must also maintain ACID properties for transactions during the migration.

  • Requirement for standardizing terminology
    • Standardizing query mechanisms might increase the adoption of NoSQL in practice.

Comparing and analyzing NoSQL and NewSQL points to certain areas that deserve attention in order to improve the adoption of these data stores.

  • Terminology discrepancy leads to confusion.

In our analysis, we noticed many discrepancies in the terminology used across these two fields. Often, features with different names provide the same functionality, or the same name refers to different things. For example, what Riak calls quorum read and write requests corresponds to the routing parameters in Voldemort.

Another example is the term consistency: when it is used, it is not always clear whether it refers to eventual consistency or to the consistency of the ACID properties, which creates ambiguity.

Establishing standards not only helps in comparing data stores but also helps users understand them better when switching between different NoSQL products.

In a similar fashion, standardizing query mechanisms based on their data models would increase the adoption of NoSQL and ease migration between different data models.

  • Concurrency is of interest for NoSQL and NewSQL
    • Solutions need to be able to accommodate large numbers of concurrent users
    • High read and write rates
    • Facilitated by partitioning and replication
  • Primary schemes for handling concurrency 
    • Pessimistic concurrency control  (pessimistic locking)

Pessimistic concurrency control, or pessimistic locking, assumes that two or more concurrent users will try to update the same record or object at the same time. To prevent this situation, a lock is placed onto the accessed entity so that exclusive access is guaranteed to a single operation; other clients trying to access the same data must wait until the first one finishes its work. The entity that is locked depends on the underlying data model.

For example, key-value stores lock records consisting of key-value pairs, column-family stores lock rows, and document stores enforce locking at the document level. In graph databases, specifically in Neo4j, locks are acquired on nodes and their relationships. BerkeleyDB and MongoDB implement readers-writer locks, which allow either multiple readers to access data or a single writer to modify it. Pessimistic locking techniques can lead to performance degradation, especially in write-intensive scenarios.
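The record-level locking described above can be sketched in a few lines of Python. This is a toy in-memory key-value store written for illustration only; the class PessimisticStore and its methods are hypothetical and do not reflect how any of the data stores named above implement locking.

```python
import threading
from collections import defaultdict


class PessimisticStore:
    """Toy key-value store that locks a record for the whole read-modify-write."""

    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)  # one lock per key (record-level)
        self._locks_guard = threading.Lock()       # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks[key]

    def update(self, key, fn, default=0):
        # Other clients touching the same key wait here until the lock is released.
        with self._lock_for(key):
            self._data[key] = fn(self._data.get(key, default))

    def get(self, key):
        with self._lock_for(key):
            return self._data.get(key)


store = PessimisticStore()


def worker():
    for _ in range(1000):
        store.update("counter", lambda value: value + 1)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_count = store.get("counter")
```

Because every update holds the record's lock for the full read-modify-write, no increment is lost, but all eight writers are serialized on the same key, which hints at why this scheme degrades under write-heavy load.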

  • Optimistic concurrency control (optimistic locking)

Optimistic concurrency control, or optimistic locking, assumes that conflicts are possible but rare. Instead of locking the record, the data store checks at the end of the operation to determine whether concurrent users have attempted to modify the same record. If a conflict is identified, different conflict-resolution strategies can be used, such as failing the operation immediately or retrying one of the operations.
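One common way to realize this check is a version number per record that is validated at write time. The sketch below assumes that approach; the names OptimisticStore and VersionConflict are hypothetical, and the small internal lock only makes the final check-and-write step atomic (it is not held while a client works on its copy of the data).

```python
import threading


class VersionConflict(Exception):
    """Raised when a record changed since the client read it."""


class OptimisticStore:
    """Toy store that validates a version number at write time."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        self._commit_guard = threading.Lock()  # guards only the validate-and-write step

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, expected_version, value):
        with self._commit_guard:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(f"{key}: expected v{expected_version}, found v{current_version}")
            self._data[key] = (current_version + 1, value)


store = OptimisticStore()
store.write("doc", 0, {"title": "draft"})

# Two clients read the same version without taking any lock.
version_a, _ = store.read("doc")
version_b, _ = store.read("doc")

store.write("doc", version_a, {"title": "edited by A"})

try:
    store.write("doc", version_b, {"title": "edited by B"})
    conflict_detected = False
except VersionConflict:
    conflict_detected = True

# Conflict resolution by retrying: re-read the current version, then write again.
version_b, _ = store.read("doc")
store.write("doc", version_b, {"title": "edited by B"})
final_version, final_doc = store.read("doc")
```

Here the second writer fails fast and then retries against the fresh version, which corresponds to the two conflict-resolution strategies mentioned above.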

  • Multi-version concurrency control (MVCC)
    • More advanced technique
    • Solutions in NoSQL and NewSQL

In MVCC, when the data store needs to update a record, it does not overwrite the old data, but instead adds a new version and marks the old version as obsolete. Multiple versions are stored, but only one is marked as current. With the MVCC approach, a read operation sees the data the way they were when it began reading, even if the data were modified or deleted by other operations in the meantime.
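The versioning idea can be illustrated with a small Python sketch. It is a simplified model under our own assumptions (a single logical clock, no garbage collection of obsolete versions, no deletes), and the class MVCCStore is hypothetical rather than a description of any particular product.

```python
import itertools


class MVCCStore:
    """Toy multi-version store: writes append versions, reads use a snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_timestamp, value), oldest first
        self._clock = itertools.count(1)
        self._last_commit = 0

    def write(self, key, value):
        # Never overwrite: append a new version stamped with a fresh timestamp.
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def begin_snapshot(self):
        """A reader remembers the latest commit visible when it started."""
        return self._last_commit

    def read(self, key, snapshot_ts):
        # Return the newest version that was committed at or before the snapshot.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)

snapshot = store.begin_snapshot()  # a long-running read starts here
store.write("balance", 250)        # a concurrent writer adds a newer version

old_view = store.read("balance", snapshot)
new_view = store.read("balance", store.begin_snapshot())
```

The reader that started before the second write still sees 100, while a reader that starts afterwards sees 250; neither one ever blocks the writer.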

  • Security And Privacy

Generally speaking, it is possible to affirm that the security features of NoSQL solutions are not as mature as those included in traditional RDBMSs.

  • Advanced security and privacy provisions are needed
    • NoSQL security features are not as mature as those of traditional RDBMSs
    • NoSQL is limited in capability
  • Advanced Security Needed –

Many solutions, such as Redis, Memcached, Voldemort, and Riak, are designed to be used in secure networked environments only. Therefore, they assume that it is the network administrator’s responsibility to ensure that only authorized applications have access to the data store, using mechanisms such as firewalls, operating system configurations, or the adoption of virtual private networks (VPN). In these cases, there is no fine-grained access control to the data store.

  • More security features offered at a cost (Enterprise)
    • MongoDB
    • Cassandra
  • Enterprise has more security –

MongoDB and Cassandra offer additional security functionalities in their enterprise editions, acknowledging the fact that security is a particularly relevant concern for large companies. For instance, data-at-rest encryption and auditing functionalities are available only in Cassandra Enterprise Edition.

  • NewSQL solutions offer RDBMS-like security
    • Clustrix
    • NuoDB
  • NewSQL solutions offer RDBMS-like security –

Among the NewSQL solutions, Clustrix and NuoDB use the authorization and authentication schemes of traditional RDBMS by supporting the GRANT/REVOKE statements.

Grant: the SQL GRANT command gives a user privileges on database objects. When a privilege is granted with the grant option, the recipient can in turn grant it to other users.

Revoke: the REVOKE command withdraws privileges that a user was previously granted on database objects; it is the opposite of GRANT. When a privilege is revoked from a particular user U with cascading behaviour, the privileges that U granted to other users are revoked as well.
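The cascading effect of a revoke can be modelled with a short Python sketch. This is only an in-memory illustration of the idea, not SQL and not the behaviour of Clustrix or NuoDB specifically; the class PrivilegeTable and the user names are hypothetical, and every grant is treated as if it carried the grant option.

```python
class PrivilegeTable:
    """Toy model of GRANT/REVOKE on one object, with cascading revoke."""

    def __init__(self, owner):
        self.owner = owner
        self._grants = {}  # (grantee, privilege) -> grantor

    def has(self, user, privilege):
        return user == self.owner or (user, privilege) in self._grants

    def grant(self, grantor, grantee, privilege):
        # Assumption: only a user holding the privilege may pass it on.
        if not self.has(grantor, privilege):
            raise PermissionError(f"{grantor} does not hold {privilege}")
        self._grants.setdefault((grantee, privilege), grantor)

    def revoke(self, grantee, privilege):
        if self._grants.pop((grantee, privilege), None) is None:
            return
        # Cascade: withdraw everything the grantee had passed on to others.
        dependants = [
            user
            for (user, priv), grantor in self._grants.items()
            if priv == privilege and grantor == grantee
        ]
        for user in dependants:
            self.revoke(user, privilege)


table = PrivilegeTable(owner="alice")
table.grant("alice", "bob", "SELECT")
table.grant("bob", "carol", "SELECT")   # bob passes his privilege on
table.grant("alice", "dave", "SELECT")

table.revoke("bob", "SELECT")

bob_can_read = table.has("bob", "SELECT")
carol_can_read = table.has("carol", "SELECT")
dave_can_read = table.has("dave", "SELECT")
```

After the revoke, bob and carol both lose access because carol's privilege came through bob, while dave keeps his because it was granted directly by the owner.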


How do we plan to improve our project based on key learnings from the paper?

  • Audit data access. This helps us share users’ personal and health data responsibly:

One of the key learnings from the paper that we plan to implement in the future scope of our project is data auditing.

Our project mainly deals with users’ personal information related to their mental health. If we plan to share it with medical practitioners such as therapists and counselors, it is important to know who accessed a user’s information. Under such circumstances, it is important to audit data access patterns.

MongoDB, the tool we used to store most of our data, does not provide auditing functionality in its community edition (auditing is an enterprise feature), which is why we plan to adopt one of the NewSQL tools that provide data auditing as part of the future scope of our project.
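Until the data store itself provides auditing, the kind of access trail described above could be approximated at the application level. The following Python sketch is only an illustration of that idea; the class AuditedRecords, the record identifiers, and the actor names are hypothetical and are not part of our current implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime
    actor: str
    action: str
    record_id: str


class AuditedRecords:
    """Toy wrapper that logs every read of a sensitive record."""

    def __init__(self, records):
        self._records = dict(records)
        self.audit_log = []  # append-only list of AuditEntry

    def read(self, actor, record_id):
        # Log first, so even reads of missing records leave a trace.
        self.audit_log.append(
            AuditEntry(datetime.now(timezone.utc), actor, "read", record_id)
        )
        return self._records.get(record_id)

    def accessed_by(self, record_id):
        """List, in order, who accessed a given record."""
        return [entry.actor for entry in self.audit_log if entry.record_id == record_id]


records = AuditedRecords({"user-42": {"mood_score": 6}})
records.read("therapist-1", "user-42")
records.read("counselor-7", "user-42")
records.read("therapist-1", "user-99")

who_saw_user_42 = records.accessed_by("user-42")
```

A built-in auditing feature would be preferable to this approach, since an application-level log only captures accesses that go through the wrapper.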

  • Enable joining the underlying data with other tables in order to conduct analysis. We want both schema-less tables and joining capabilities.

If someone wants to join the mental health data that we are collecting with other kinds of data, such as ECG or blood pressure data collected from an Apple Watch, a Fitbit, or other medical devices, this is difficult in MongoDB, which is document-oriented and offers only limited join support (the $lookup aggregation stage) compared with SQL joins.


Conclusion

  • Best solution 

This significant paper discussed the challenges in selecting a database solution for cloud computing.

It compared and contrasted NoSQL and NewSQL.

We believe that with consideration of our key learnings the implementation of a NewSQL database could potentially better fit our needs. 

  • NewSQL could potentially better fit our project needs

This is evident in how NewSQL offers RDBMS-like security. NewSQL can be defined as a class of modern relational DBMSs that seek to provide the same scalable performance as NoSQL for OLTP workloads while guaranteeing ACID compliance for transactions, as in an RDBMS. NewSQL solutions aim to achieve the scalability of NoSQL without discarding the relational model, SQL, and the transaction support of legacy DBMSs. Arguably, this is the best of both worlds.

  • NewSQL is like RDBMS with NoSQL scalable performance

For the future scope of this paper, we believe that the authors could potentially explore Hybrid Transaction-Analytical Processing (HTAP). 


References

  1. Andrew Pavlo and Matthew Aslett. “What’s Really New with NewSQL?”
  2. Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari and Miriam AM Capretz. “Data management in cloud environments: NoSQL and NewSQL data stores.” (Base paper.)
