International Journal of Information Management 37 (2017) 63–74
International Journal of Information Management
journal homepage: www.elsevier.com/locate/ijinfomgt
SecureNoSQL: An approach for secure search of encrypted NoSQL databases in the public cloud☆
Mohammad Ahmadian∗, Frank Plochan, Zak Roessler, Dan C. Marinescu
Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
Article info
Article history:
Received 12 April 2016
Accepted 17 November 2016
Available online 20 December 2016
Keywords:
Search over encrypted data
Database as a service
NoSQL
Encryption
Cloud computing
Security
Query processing
Data integrity
Abstract
While many schemes have been proposed to search encrypted relational databases, less attention has
been paid to NoSQL databases. In this paper we report on the design and the implementation of a security
scheme called “SecureNoSQL” for searching encrypted cloud NoSQL databases. Our solution is one of the
first efforts covering not only data confidentiality, but also the integrity of the datasets residing on a cloud
server. In our system a secure proxy carries out the required transformations and the cloud server is not
modified. The construction is applicable to all NoSQL data models and, in our experiments, we present
its application to a Document-store data model. The contributions of this paper include: (1) a descriptive
language based on a subset of JSON notations; (2) a tool to create and parse a security plan consisting of
cryptographic modules, data elements, and mappings of cryptographic modules to the data fields; and
(3) a query and data validation mechanism based on the security plan.
© 2016 Elsevier Ltd. All rights reserved.
1. Introduction and motivation
Data analytics, enterprise, and multimedia applications, as
well as applications in many areas of science, engineering, and
economics, including genomics, structural biology, high energy
physics, astronomy, meteorology, and the study of the environment, take advantage of cloud computing for processing very large
datasets. Companies heavily involved in cloud computing such as
Google and Amazon, e-commerce companies such as eBay, and
social media networks such as Facebook, Twitter, or LinkedIn discovered early on that traditional relational databases cannot handle
the massive amount of data and the real-time demands of online
applications critical for their business model. The relational schema
is of little use for such applications and conversion to NoSQL
databases seems a much better approach.
☆ The datasets used in the experiments reported in this paper are available at:
https://github.com/MoAhmadian/SecureNoSQL.
∗ Corresponding author.
E-mail addresses: ahmadian@knights.ucf.edu (M. Ahmadian),
frank.plochan@knights.ucf.edu (F. Plochan), zak.roessler@knights.ucf.edu
(Z. Roessler), dcm@cs.ucf.edu (D.C. Marinescu).
URL: http://www.cs.ucf.edu/ ahmadian (M. Ahmadian).
http://dx.doi.org/10.1016/j.ijinfomgt.2016.11.005
0268-4012/© 2016 Elsevier Ltd. All rights reserved.
The name NoSQL given to the storage model discussed in this
paper might be misleading. Michael Stonebraker notes that “blinding
performance depends on removing overhead. Such overhead
has nothing to do with SQL, but instead revolves around traditional
implementations of ACID transactions, multi-threading, and disk
management” (Stonebraker, 2010). The “soft-state” approach in the
design of NoSQL databases allows data to be inconsistent and transfers the task of implementing only the subset of the ACID properties
required by a specific application to the application developer.
NoSQL systems ensure that data will be “eventually consistent” at
some future point in time, instead of enforcing consistency at the
time when a transaction is “committed”. Data partitioning among
multiple storage servers and data replication are also tenets of the
NoSQL philosophy as they increase availability, reduce the response
time, and enhance scalability.
Big Data and mobile applications are the two most important
growth areas of cloud computing. Big Data growth can be viewed as
a three-dimensional phenomenon; it implies an increased volume
of data, requires an increased processing speed to produce more
results, and at the same time, it involves a diversity of data sources
and data types (Marinescu, 2013). A delicate balance between data
security and privacy and efficiency of database access is critical
for such applications. Many cloud services used by these applications operate under tight latency constraints; moreover, these
applications have to deal with extremely high data volumes and are
expected to provide reliable services for very large communities of
users. Nowadays NoSQL databases are widely supported by cloud
service providers. Their advantages over traditional databases are
critical for Big Data applications.
Data security and integrity are important factors when choosing a database for cloud applications. They are particularly critical
for applications running on public clouds where multiple virtual machines (VMs) often share the same physical platform (Xu,
Jiang, Wang, Yuan, & Ren, 2014; Xu, Zhang, Wu, & Shi, 2015;
Yu & Wen, 2010). The importance of database security and its
impact on a large number of individuals are illustrated by the
consequences of two major security breaches (Weiss & Miller,
2015; Silver-Greenberg, Goldstein, & Perlroth, 2014). In
November 2013 approximately 40 million records were stolen from
an unencrypted database used by Target stores. The compromised
information included personally identifiable information (PII) and
credit card data. According to an SEC (Securities and Exchange Commission) report, two months later a cyber-attack on JP Morgan
Chase compromised PII records of some 76 million households and
7 million small businesses.
Classic cryptography primitives can protect data while in storage, but plaintext data is vulnerable to insider interference during
processing. This is particularly troubling when searching databases
containing personal information such as healthcare or financial
records (Lin, Tsai, & Lin, 2014), as the entire database is exposed
to such attacks. These circumstances motivate us to investigate
methods for searching encrypted NoSQL databases. Though general computations with encrypted data are theoretically feasible
using the algorithms for Fully Homomorphic Encryption (FHE)
(Gentry, 2009), this is by no means a practical solution at this
time. Existing algorithms for homomorphic encryption increase
the processing time of encrypted data by many orders of magnitude compared with the processing of plaintext data. A recent
implementation of FHE (Halevi & Shoup, 2014) requires about six
minutes per batch; the processing time for a simple operation
on encrypted data dropped to almost one second after improvements (Ducas & Micciancio, 2015). Related areas of research are
Learning With Errors (LWE) (Brakerski & Vaikuntanathan, 2011),
lattice-based encryption (Cash, Hofheinz, Kiltz, & Peikert,
2012; Micciancio & Regev, 2009), and attribute-based encryption
(Gorbunov, Vaikuntanathan, & Wee, 2013).
In this paper we restrict our discussion to query processing over
encrypted NoSQL databases. We introduce a secure proxy called
“SecureNoSQL” that accesses remote cloud servers and applies
efficient cryptographic primitives to encrypt and decrypt queries,
responses, and data. We also designed a descriptive language using JSON¹ notation which enables its users to
generate a security plan. The security plan has four sections which
describe the data elements, the cryptographic modules,
and the mappings between them. The main contributions of this
paper are:
• A JSON-based language for users to: (i) create a security plan for
the database; (ii) describe the security parameters; and (iii) assign
proper cryptographic primitives to the data elements.
• A multi-key, multi-level security mechanism for policy enforcement. This feature is essential because the encryption key is
subject to more frequent changes than the crypto-module. Furthermore, keys are assigned for a single data element, while
encryption algorithms could be applied for several data elements
with several keys. This separation allows a more efficient enforcement of security policy and of key management.
• An effective validation process for the security plan. This validation process enables users to initially evaluate all requests
locally, rather than forwarding large numbers of fallacious key-value
pairs to a cloud server. It also limits the cloud server
workload and reduces the response time latency.
• Support for a comprehensive, flexible protection. The solution is
open-ended; users can add new customized cryptographic modules simply by using the designed descriptive language.
• A balanced system with a security level-proportional overhead;
the overhead is proportional to the desired level of security.
• A secure proxy which translates queries to run over encrypted
data on the remote cloud server with respect to semantics of
queries. The cloud database server is not modified and treats
encrypted documents in the same way as a plaintext database.
Properties of the distributed database such as replication hold
for encrypted data.
• Support for cloud data integrity and protection against an insider
attack.
¹ JSON (JavaScript Object Notation) is a lightweight text-based syntax for storing
and exchanging data objects consisting of key-value pairs. It is used primarily to
transmit data between a server and a web application. JSON’s popularity is due to
the fact that it is self-describing and easy to understand by human and machine. For
more information, visit: http://www.json.org.
The rest of this paper is organized as follows: related work and
NoSQL data models are presented in Section 2. The threat model
for a cloud database is discussed in Section 3. The organization of
the system is presented in Section 4, and the structure of the security
plan and the notation of the descriptive language for generating
the security plan are discussed in Section 5. The mechanism of
query processing is investigated in Section 6. Finally, in Section 8 we
report on measurements of the database response time to different
types of queries and on the encryption and decryption time for OPE
encryptions with output lengths of 64, 128, 256, 512 and 1024-bit.
2. NoSQL databases and related work
NoSQL describes a fairly large number of database technologies; more than 120, by our count, have been created in recent
years. NoSQL databases are non-relational, distributed, horizontally
scalable, and schema-free. They are classified based on their data
models. Choosing a proper data model has an extremely important influence on the performance and scalability of the data stores.
Since our work has a tight connection to NoSQL data models, we
provide brief definitions for several data models.
2.1. Key-value store
This simple data model resembles an associative map or a dictionary where a key uniquely identifies the value. The data can
be a primitive data type such as a string, an integer, or an
array, or it can be an object. This model is effective for storing
distributed data; thus, it is highly scalable and this motivates its
use by cloud data management systems. Systems such as Bigtable
(Chang et al., 2008), CouchDB,² DynamoDB (Sivasubramanian,
2012), MemcacheDB³ and Redis⁴ use this model. This model is not
suitable for applications demanding relations or structures.
² http://couchdb.apache.org.
³ http://www.Memcached.org.
⁴ http://redis.io.
2.2. Column-family store
In this model the data are stored in a column-oriented style and
the dataset is comprised of several rows, each row is indexed by a
unique key, the so-called primary key. Each row is composed of a set
of column families, and different rows can have different column
families. Similarly, the row key resembles the key, and the set of
column families resembles the value represented by the row key.
However, each column family further acts as a key for the one or
more columns that it holds, whereas each column consists of a key-value pair. Hadoop HBase directly implements the Google Bigtable
concepts, whereas Amazon SimpleDB and DynamoDB contain only
a set of column name-value pairs in each row, without having column families. Sometimes, SimpleDB and DynamoDB are classified
as key-value stores. Typically, the data belonging to a given row
are stored together at the same server node. Cassandra provides
the additional functionality of super-columns, which are formed
by grouping various columns together. Cassandra can store a single
row across multiple server nodes using composite partition keys.
In column-family stores, the configuration of column families is
typically performed during start-up. A column family in different
rows can contain different columns. A prior definition of columns is
not required and any data type can be stored in this data model. In
general, column-family stores provide more powerful indexing and
querying than key-value stores because they are based on column
families and columns in addition to row keys. Similar to key-value
stores, any logic requiring relations must be implemented in the
client application.
2.3. Document store
In this model data are stored inside an internal structure, while
in the key-value store the data are opaque to the database. Thus,
the database engine applies metadata to create a higher level of
granularity and delivers a richer experience for modern programming techniques. Document-oriented databases use a key to locate
the document inside the data store. Most document stores use JSON
or BSON (Binary JSON). Document stores are suited to applications
where the input data can be represented in a document format.
A document can contain complex data structures such as nested
objects. A document store allows document grouping into collections. A document in a collection should have a unique key. Unlike
a relational database management system (RDBMS),⁵ where every
row in a table follows the same schema, a document in document
stores may have a different structure from other documents. Document stores provide the capability of indexing documents based on
the primary key as well as on the contents of the documents. Like
key-value stores, they are inefficient in multiple-key transactions
involving cross-document operations.
2.4. Graph database
This data model is used to represent complex structures and
the highly connected data often encountered in real-world applications. In graph databases, the nodes and edges have individual
properties consisting of key-value pairs. Graph databases are a
good alternative for social networking applications, pattern recognition, dependency analysis and recommendation systems. Some
graph databases such as Neo4J⁶ support ACID⁷ properties. Graph
data stores are not as efficient as other NoSQL data stores and do
not scale well horizontally when related nodes are distributed to
different servers.
⁵ A relational database management system (RDBMS) is a database management
system (DBMS) that is based on the relational model.
⁶ http://neo4j.com.
⁷ ACID (Atomicity, Consistency, Isolation, Durability) properties guarantee that
database transactions are processed reliably.
The first system for SQL-aware query processing over an encrypted
database was CryptDB (Popa, Redfield, Zeldovich, & Balakrishnan,
2011). CryptDB satisfies data confidentiality for an SQL relational
database. However, CryptDB cannot perform queries over
data encrypted with different keys. One important application of
searching encrypted data (Cash et al., 2013, 2014; Cheon, Kim, &
Kim, 2016; Song, Wagner, & Perrig, 2000; Tu, Kaashoek, Madden, &
Zeldovich, 2013) is in cloud computing where the clients outsource
their storage and computation. In Cash et al. (2014) a practical
searchable security scheme is introduced which can search on
encrypted data sets in sub-linear time complexity by using different types of indices; however, it is not practical on NoSQL data
sets which are designed to scale to millions of users doing updates
simultaneously (Cattell, 2011).
Order-preserving symmetric encryption (OPE) is a deterministic
encryption scheme which maps integers in the range [1, M] into
a much larger range [1, N] and preserves numerical ordering of
plaintexts (Boldyreva, Chenette, Lee, & O’neill, 2009; Mavroforakis,
Chenette, O’Neill, Kollios, & Canetti, 2015). OPE is attractive because
fundamental database operations such as sorting, simple matching (i.e., finding m items in a database), range queries (i.e., finding
all m items within a given range), and search operations can be
carried out efficiently over encrypted data. Moreover, OPE allows
query processing to be done as efficiently as for unencrypted
data; the database server can locate the desired encrypted data in
logarithmic-time via standard tree-based indexing data structures.
An investigation of OPE security against a known plaintext attack
with known N plaintexts is reported in Xiao and Yen (2012)
and Kerschbaum (2015); the last paper concluded that the ideal
OPE module accomplishes one-wayness security.⁸ The Shannon
entropy⁹ achieved by an ideal OPE is maximal when the mapping
of integers in the range [1, M] to a much larger range [1, N] results in
a uniform distribution. The risk of disclosure caused by main-memory attacks is quantified by Canim, Kantarcioglu, Hore, and Mehrotra
(2010) and Bajaj and Sion (2014). An application of OPE in a cloud
environment is reported in Ahmadian, Paya, and Marinescu (2014)
and Ahmadian (2017). Also, the application of classical cryptography
to a relational database system for embedded devices was studied
in Ahmadian, Khodabandehloo, and Marinescu (2015).
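The order-preserving property described above can be illustrated with a minimal sketch. This is a toy construction, not the scheme of Boldyreva et al. and not the OPE module used by SecureNoSQL: it simply draws M distinct values from [1, N] with a keyed pseudo-random generator and sorts them, so the table index plays the role of the plaintext. All function names, the key, and the sample values are assumptions made for illustration.

```python
import bisect
import random


def build_toy_ope(key: int, m: int, n: int) -> list[int]:
    """Return a strictly increasing table t where t[x - 1] encrypts x in [1, m]."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    rng = random.Random(key)  # the key seeds the secret mapping (toy, not secure)
    return sorted(rng.sample(range(1, n + 1), m))  # m distinct values from [1, n]


def encrypt(table: list[int], x: int) -> int:
    if not 1 <= x <= len(table):
        raise ValueError("plaintext out of range")
    return table[x - 1]


def decrypt(table: list[int], c: int) -> int:
    i = bisect.bisect_left(table, c)
    if i == len(table) or table[i] != c:
        raise ValueError("not a valid ciphertext")
    return i + 1


def range_query(sorted_ciphertexts: list[int], c_lo: int, c_hi: int) -> list[int]:
    """Server side: binary search over ciphertexts only, no decryption needed."""
    left = bisect.bisect_left(sorted_ciphertexts, c_lo)
    right = bisect.bisect_right(sorted_ciphertexts, c_hi)
    return sorted_ciphertexts[left:right]


table = build_toy_ope(key=2016, m=1000, n=2**32)
stored = sorted(encrypt(table, x) for x in [5, 120, 33, 870, 412])
hits = range_query(stored, encrypt(table, 30), encrypt(table, 500))
result = [decrypt(table, c) for c in hits]  # plaintexts in [30, 500]: 33, 120, 412
```

Because the mapping is strictly increasing, the server can keep ciphertexts in an ordinary sorted index and answer range queries in logarithmic time, which is the efficiency argument made in the text.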
NoSQL databases suffer from a lack of proper data protection mechanisms because these databases have been designed to
support high performance and scalability requirements. In order to
protect personal and sensitive information, a privacy- and security-preserving
mechanism is required in Big Data platforms. Integration of privacy-aware access control features into existing Big Data platforms
is discussed in Colombo and Ferrari (2015), Liang, Susilo, and Liu
(2015), and Islam and Islam (2014). In Gantz and Reinsel (2012)
and Tankard (2012) the evolution of Big Data systems from the
perspective of an information security application is studied. The
proxy is a very important element in the designed
structure, and from an Information Technology point of view its
protection deserves special consideration. A cloud-based
monitoring and threat detection system was proposed by Cheon
et al. (2016) and Chow et al. (2009) to secure the critical components
of infrastructure systems.
⁸ One-way functions are easy to compute, but computationally hard to invert.
⁹ The entropy measures the degree of uncertainty; the Shannon entropy of a discrete random variable $X$ with $n$ realizations $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, respectively, is: $H(X) = -\sum_{i=1}^{n} p_i \log p_i$.
3. A cloud computing threat model
A threat model describes the threats against a system. The threat
model of cloud computing can be analyzed from multiple viewpoints and we investigate it from an adversarial perspective. The
adversarial threat model for the Database as a Service (DBaaS) is a
holistic process based on end-to-end security. The model identifies
two classes of threats: external and internal attackers.
3.1. External attacker
An attacker from outside the cloud environment might obtain
unauthorized access to the hosted databases by applying techniques or tools to monitor the communication between the clients
and the cloud servers. External attackers have to bypass firewalls,
intrusion detection systems and other defensive tools without any
authorization.
3.2. Malicious insiders
An insider attacker has different levels of access to cloud
resources. Unauthorized access by malicious insiders who can
bypass most or all data protection mechanisms is a major source
of concern for cloud users. Encrypted data and a secure proxy construction such as SecureNoSQL guarantee that malicious insiders
cannot access user data. The proxy encrypts and decrypts the data,
queries, and responses exchanged between clients and the cloud. There is still the residual
risk of information leakage from encrypted datasets. A malicious
insider could exploit the leaked information to organize more
extensive attacks and amplify the information leakage.
4. System organization
This section introduces a framework to incorporate data
confidentiality and information leakage prevention algorithms.
SecureNoSQL leverages secure query processing for web and
mobile applications using DBaaS. Two different system organizations can address our design objectives. The first is suitable when
all database users belong to the same organization. Then the proxy
runs on a trusted server behind a firewall and the communication
between clients and the proxy is secure.
When the clients access the cloud using the Internet, the second
organization is advisable. In this case, either the client software
includes a copy of the proxy and only encrypted data is transmitted over public communication lines, or the Secure Sockets
Layer (SSL) protocol is used to establish a secure connection to the
proxy. Fig. 1 illustrates the high-level architecture of SecureNoSQL
as a secure proxy between the user’s applications and the cloud NoSQL
database server. The system we report on was designed with several objectives in mind:
• Support multi-user access to an encrypted NoSQL database.
• Enforce confidentiality, privacy of transactions and data integrity.
• Hide from the end-users the complexity of the security mechanisms; the database access should be transparent and the user’s
access should be the same as for an unencrypted database.
• Avoid transmission of unencrypted data over public communication lines.
• Do not require any modification of the NoSQL database management system.
• Create an open-ended system; allow the inclusion of cryptographic modules best suited for an application.
These objectives led us to design a system where a proxy mediates the client access to the remote cloud server running an
unmodified NoSQL database processing system. In this system the
processing of a query involves three phases:
1. Client-side query encoding in JSON format carried out by the
client software;
2. Query encryption and decryption done by a trusted proxy; and
3. Server-side query processing performed by an unmodified
NoSQL database server.
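The three phases above can be sketched end to end as follows. This is an illustrative sketch under strong assumptions, not the SecureNoSQL implementation: field names are replaced by truncated HMAC tokens as a stand-in for deterministic encryption (AES-DET is not available in the Python standard library), integer values go through a trivially insecure increasing affine map as a stand-in for OPE, and the server is simulated by a list filter that understands the MongoDB-style operators `$gt`, `$lt` and `$eq`. Every name, key and parameter below is hypothetical.

```python
import hashlib
import hmac

PROXY_KEY = b"demo-key-held-by-proxy"  # hypothetical secret, known only to the proxy
A, B = 7919, 104729                    # hypothetical parameters of the toy increasing map


def enc_name(name: str) -> str:
    """Deterministic token for a field name (HMAC stand-in for AES-DET)."""
    return hmac.new(PROXY_KEY, name.encode(), hashlib.sha256).hexdigest()[:16]


def enc_int(x: int) -> int:
    """Toy order-preserving map: strictly increasing, NOT secure."""
    return A * x + B


def dec_int(c: int) -> int:
    return (c - B) // A


def proxy_encrypt_doc(doc: dict) -> dict:
    return {enc_name(k): enc_int(v) for k, v in doc.items()}


def proxy_encrypt_query(query: dict) -> dict:
    """Phase 2: rewrite field names and constants, keep the operators."""
    return {enc_name(field): {op: enc_int(v) for op, v in cond.items()}
            for field, cond in query.items()}


def proxy_decrypt_doc(doc: dict, field_names: list[str]) -> dict:
    reverse = {enc_name(n): n for n in field_names}  # token -> plaintext name
    return {reverse[k]: dec_int(v) for k, v in doc.items()}


OPS = {"$gt": lambda a, b: a > b, "$lt": lambda a, b: a < b, "$eq": lambda a, b: a == b}


def server_find(collection: list[dict], query: dict) -> list[dict]:
    """Phase 3: the unmodified server compares ciphertexts like ordinary values."""
    return [d for d in collection
            if all(f in d and OPS[op](d[f], v)
                   for f, cond in query.items() for op, v in cond.items())]


plain_docs = [{"id": 1, "salary": 2500}, {"id": 2, "salary": 4000}, {"id": 3, "salary": 5200}]
cloud = [proxy_encrypt_doc(d) for d in plain_docs]          # what the server stores
client_query = {"salary": {"$gt": 3000}}                    # phase 1: JSON-style query
enc_hits = server_find(cloud, proxy_encrypt_query(client_query))
answers = [proxy_decrypt_doc(d, ["id", "salary"]) for d in enc_hits]
```

The server-side function never sees a key or a plaintext, which mirrors the requirement that the NoSQL server remain unmodified.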
SecureNoSQL is based on general principles of NoSQL database
products. We introduce a new concept, the security plan, materialized as a JSON description of data elements, metadata and
parameter configuration of cryptosystems. A descriptive language
is introduced to generate and parse the security plan automatically.
JSON, a dominant format in NoSQL databases, is selected to express
the designed security plan. We used a subset of JSON notation readable by human and machine.
Document databases, such as MongoDB, store documents inside
a collection in a JSON representation, in a way similar to tables and
records in relational database systems. A query and the corresponding response are also represented in the JSON format; therefore, the
governing format in document databases is JSON. BSON, a binary
extension of JSON, is used by document-oriented databases for
efficient encoding/decoding.
The JSON query model is a functional, declarative notation, designed
especially for working with large volumes of structured, semi-structured and unstructured JSON documents. The data owner
develops the security plan, which maps each chosen crypto-primitive,
with specific parameters, to a particular
data element.
5. Descriptive language for security plan
NoSQL databases benefit from a flexible schema that allows
documents corresponding to the same object to have different numbers of attributes. On the other hand, a full list of
attributes is required to create a comprehensive protection for all
data elements in the database. Therefore, we define a logical operator denoted as Super Document, the union of all attributes from
different versions of the documents related to the same object. Each
database $D$ consists of a set of an arbitrary number $n$ of documents:
$$D = \{d_1, \ldots, d_n\}$$
Furthermore, each document comprises an arbitrary number $m$ of
attributes, each of which is a key-value
pair $(k, v)$:
$$d_i = \{A_1, \ldots, A_m\}, \quad 1 \le i \le n$$
In other words, a Super Document in the scope of a collection
(database) is an aggregation of attributes representing a specific
entity. Thus for any given document $d_i$ it is required to look through the other $n - 1$
documents to extract attributes that are not members of $d_i$ (relative complement). This concept is formalized in Eq. (1). In addition,
a match function, written here as $\mathrm{match}(d_i, d_j)$, determines whether two given documents $d_i, d_j$ can be merged. Two documents can
be combined if they share the same attribute of an identifying
type.
The Super Document of $d_i$ and $d_j$, written here as $S_{i,j}$, is defined as:
$$\exists A_p \in d_i \wedge \exists A_q \in d_j:\quad \mathrm{match}(d_i, d_j) = \mathrm{True} \;\Rightarrow\; S_{i,j} = d_i \cup d_j \qquad (1)$$
The match function is defined as:
$$\mathrm{match}(d_i, d_j) = \begin{cases} \mathrm{True} & \text{iff } \exists A_p \in d_i \wedge \exists A_q \in d_j \mid [(A_p.\mathrm{key} = A_q.\mathrm{key}) \wedge (A_p.\mathrm{value} = A_q.\mathrm{value})] \\ \mathrm{False} & \text{otherwise} \end{cases}$$
provided that $A_p$ and $A_q$ are identifier attributes.
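The Super Document definition can be made concrete with a short sketch. It assumes, for illustration only, that a document is a flat Python dictionary, that the set of identifier attributes is known in advance (here just `id`), and that the union is taken over key-value pairs; the function names and the sample documents are hypothetical.

```python
IDENTIFIERS = {"id"}  # assumed set of identifier attribute names


def match(di: dict, dj: dict, identifiers: set = IDENTIFIERS) -> bool:
    """True iff the documents share an identifier attribute with equal key and value."""
    return any(k in identifiers and k in dj and dj[k] == v for k, v in di.items())


def super_document(di: dict, dj: dict):
    """Union of the attributes (key-value pairs) of two matching documents, else None."""
    if not match(di, dj):
        return None
    return set(di.items()) | set(dj.items())


d1 = {"id": 7, "name": "Ann"}
d2 = {"id": 7, "email": "ann@example.com"}
d3 = {"id": 8, "name": "Ann"}  # same name, but name is not an identifier

merged = super_document(d1, d2)
rejected = super_document(d1, d3)
```

Here `merged` holds the three attributes `id`, `name` and `email`, which is the full attribute list the security plan has to cover for this entity, while `rejected` is `None` because the two documents share no identifier.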
Fig. 1. The organization of the SecureNoSQL.
Fig. 2. The high level structure of the security plan.
5.1. Database security plan
The security plan identifies the mechanism to maintain the security of the data elements in a database. It also determines how to
interpret queries issued by a specific application. The security plan
has four sections, see Fig. 2, describing the security rules for the
data elements and for meta-data such as the field-name (Key) and
the collection name. These sections are the building blocks of the
security plan showing how the rules are enforced. The sections and
their roles are:
1. Collection: includes the name of a collection and a reference to
the encryption module used to encrypt the name of the collection
and the name of fields (metadata).
2. Cryptographic modules: lists the cryptographic modules for
encrypting the fields of the database entries in the query.
3. Data elements: lists the properties of each data field including the
data type; the data type determines the cryptographic modules
to be applied to each field.
4. Mapping cryptographic modules to the fields: assigns the cryptographic modules to data fields; proxy uses this information to
encrypt and decrypt the data elements.
5.1.1. Collection
A collection is defined as a group of NoSQL documents, the
equivalent of a relational database table, see Fig. 3. The name of
the collection must be encrypted. The listing in Fig. 3b illustrates how to
secure a sample collection using the description language.
The key-value pairs (KVP) are the primary data model for a
NoSQL database. The key is used as an index to access the associated value of the data pointed to by the reference ref. The initialization
vector (IV) is a fixed-size, random input to the cryptographic
module encryption. Additionally, a collection exists within a single database. Documents within a collection can have different
fields. Typically, all documents in a collection are related to one
another.
5.1.2. Cryptographic modules
The choice of a particular cryptosystem depends on the security policy of the application. Criteria for algorithm selection
include: (i) the security against theoretical attacks; (ii) the cost
of implementation; (iii) the performance; and (iv) whether the
encryption and decryption can be parallelized. Other factors
involved in the selection of an algorithm are the memory requirements and the integration in the overall system design.
The Cryptographic modules section introduces all encryption modules and
their parameters such as key, key-size, initialization vector and
output-size. The structure of this section is shown in Fig. 4a, complemented by the listing in Fig. 4b presenting the second section of
the security plan for the previous example.
Our proof of concept uses the parametric Order Preserving
Encryption (OPE) and the Advanced Encryption Standard (AES)
modules. The system is open-ended; users can add the cryptosystems best suited to the security requirements of their application.
In our design the definitions of the cryptographic modules and of
the pairs, encryption key and initialization value, are separated following the so-called key separation principle (Galiegue & Zyp, 2013).
This security practice is based on the observations that users have
long- and short-term security policies. The cryptographic modules
are less likely to change while the key and the initialization value
change frequently.
5.1.3. The data elements
In the third section of the security plan, the data elements and
their properties are covered. Fig. 5 presents the structure and
description of the Data element section of the security plan. The listing
in Fig. 5b shows the data elements and their JSON description for the previous example. To ensure the desired level of security,
the security plan should describe all sensitive data elements of the database in its third section.
Fig. 3. The structure of a collection: (a) The chart outlines the structure of a collection containing the name of collection and name of all attributes which are considered as a
meta-data, and should be protected with proper cryptographic module. (b) The description of a collection and security parameters in designed JSON based language. In this
specific case the Advanced Encryption Standard in deterministic (AES-DET) mode with a 128-bit key and an initialization vector (IV) is assigned to encrypt the name of the
collection and the fields name.
Fig. 4. The structure and function of Cryptographic modules: (a) The Security Plan with the second section, the cryptographic module, expanded. The attributes included for
each module are: name, type, key size, key, input and output size. (b) The OPE encryption including the cryptosystems and their attributes. The proxy applies these modules
using the key-value pairs (KVP).
Fig. 5. Structure and description of Data element: (a) The chart outlines the structure of Data elements containing attributes of data elements such as name, type and value for
of collection and name and then introduces security parameters for each data element. (b) The data element section of a sample database which is represented in designed
notation. A data item has 7 fields: id, name, salary, balance, ccn, ssn, and email. The id, name, email, and salary are required fields.
Fig. 6. The structure and description of Mapping cryptographic modules to the Data element: (a) Security plan with the fourth section expanded. This section establishes a
correspondence between the data fields and the cryptographic modules used to encrypt and decrypt the data fields. (b) The mapping section of the schema for a sample
database with 7 fields. For example, the id and the name will be encrypted with OPE 128 bit and AES-DET, respectively.
5.1.4. Mapping cryptographic modules to the fields
The last section of the security plan specifies the cryptographic modules for
all sensitive data fields. Fig. 6 and the listing presented in
Fig. 6b show the mapping of the cryptographic modules and the
corresponding JSON format for a sample application.
The method presented in this paper can be easily extended to
the other NoSQL data models discussed in Section 2. Fig. 7 shows
how this extension from the key-value pair to the document store
model can be carried out.
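As an illustration of the mapping section, the following is a minimal Python sketch of how a proxy might hold a security plan in memory and resolve the cryptographic module assigned to a field. The dictionary layout, the module names `ope128` and `aes_det128`, and the helper `module_for` are assumptions made for this example; the actual notation is the JSON format shown in Figs. 4–6. Only the assignment of OPE 128 bit to id and AES-DET to name is taken from Fig. 6.

```python
# Hypothetical in-memory form of a security plan; the key names below are
# assumptions made for illustration, not the paper's exact JSON notation.
SECURITY_PLAN = {
    "crypto_modules": {
        "ope128": {"type": "OPE", "key_size": 128},
        "aes_det128": {"type": "AES-DET", "key_size": 128},
    },
    # Mapping section: data field -> name of the module that protects it.
    "mapping": {
        "id": "ope128",
        "name": "aes_det128",
    },
}


def module_for(plan, field):
    """Return the module description assigned to `field`.

    Raises KeyError if the field is unmapped or refers to an unknown module,
    so that a sensitive field is never silently left in plaintext.
    """
    module_name = plan["mapping"][field]
    return plan["crypto_modules"][module_name]


print(module_for(SECURITY_PLAN, "id"))
print(module_for(SECURITY_PLAN, "name"))
```

Failing loudly on an unmapped field is a design choice of this sketch; a real proxy could instead fall back to a default module.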
5.2. Query and data validation
The proxy validates the data and the query, supplied as JSON-formatted
input, against the reference security plan. The proxy then applies the
crypto-primitives and generates a new query that follows the NoSQL
query semantics. During this process the proxy applies to each field
the cryptographic module assigned to it. Finally, the proxy forwards the
encrypted query/data to the NoSQL database server. Fig. 8 depicts
the schema validation process.
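The validation step can be sketched in Python as follows. The field names and the set of required fields follow the sample database of Fig. 5; the Python types attached to each field and the function name `validate` are assumptions, and the sketch checks only presence and type, not the full schema.

```python
# Field names and required fields follow the sample database of Fig. 5;
# the Python types assigned to each field are assumptions.
DATA_ELEMENTS = {
    "id": int, "name": str, "salary": int, "balance": int,
    "ccn": str, "ssn": str, "email": str,
}
REQUIRED = {"id", "name", "email", "salary"}


def validate(document):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    # Every required field must be present.
    for field in sorted(REQUIRED - set(document)):
        errors.append(f"missing required field: {field}")
    # Every supplied field must be described in the plan and well typed.
    for field, value in document.items():
        expected = DATA_ELEMENTS.get(field)
        if expected is None:
            errors.append(f"field not described in security plan: {field}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for field: {field}")
    return errors


print(validate({"id": 7, "name": "Alice", "email": "a@example.org", "salary": 5200}))
print(validate({"id": 7, "salary": "high"}))
```

Only a document that yields an empty error list would be passed on to the encryption step.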
As an illustration, consider the listing in Fig. 9a as
input data; running the validation process generates the output shown in Fig. 9b. The output of the validation process is a single file
that contains descriptive information for the data and meta-data in the
designed format and is ready to be executed on SecureNoSQL.
Fig. 7. SecureNoSQL applied to: (a) The key-value data model; Key_1, ..., Key_n are all encrypted using the cryptographic module z, while the corresponding values, Value_1, ...,
Value_n, are encrypted with cryptographic modules 1, 2, ..., n, respectively. (b) The document store data model; the meta-data, such as the collection name, are encrypted, as are the
attributes, using their assigned cryptographic modules.
encrypted; however, the output is consistent with NoSQL semantics.
7. Integrity of data/query/response
Fig. 8. The validation process of input data against security plan in the client side.
Table 1
The overhead of encryption for several encryption schemes.
Database     Plain   OPE64   OPE128   OPE256   OPE512
Size (MB)    170     430     508      662      1000
Integrity and confidentiality are two critical components of data
security. Integrity refers to the consistency of the outsourced data.
The integrity verification algorithm proposed in SecureNoSQL guarantees the integrity of data/queries (see Algorithm 1 and Fig. 11).
The data owner first applies the encryption scheme to the documents
and then calculates a Hashed Message Authentication Code (HMAC)
for each encrypted document. The hash value of any given
document has a fixed length of 512 bits. The data owner concatenates a unique document identifier (ID) with the hash value and stores
the result in an efficient structure such as a hash table, which has constant lookup time, O(1). Next, the data owner transfers the encrypted
dataset to the cloud and sends the hash table containing the hash values
to the proxy. Once the proxy receives the query response from the
server, it initiates the verification process, which checks the authenticity
of the documents by recalculating their hash values. This process is
illustrated in Fig. 11.
Algorithm 1. Document Integrity Verification Algorithm in the Proxy
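The verification flow described in this section can be sketched in Python as follows. This is an illustrative sketch and not a transcription of Algorithm 1: the function names are hypothetical, SHA-512 is assumed because it yields the 512-bit value mentioned in the text, and the document ID is used directly as the hash-table key instead of being concatenated with the hash value.

```python
import hashlib
import hmac


def build_hash_table(key, encrypted_docs):
    """Data-owner side: map each document ID to the HMAC of its ciphertext.

    `encrypted_docs` maps a document ID to ciphertext bytes. SHA-512 is an
    assumption, chosen because it produces a 512-bit value.
    """
    return {
        doc_id: hmac.new(key, ciphertext, hashlib.sha512).digest()
        for doc_id, ciphertext in encrypted_docs.items()
    }


def verify_response(key, hash_table, response):
    """Proxy side: return the IDs of returned documents that fail the check."""
    violations = []
    for doc_id, ciphertext in response.items():
        expected = hash_table.get(doc_id)  # constant-time dictionary lookup
        actual = hmac.new(key, ciphertext, hashlib.sha512).digest()
        # compare_digest avoids leaking information through timing.
        if expected is None or not hmac.compare_digest(expected, actual):
            violations.append(doc_id)
    return violations


key = b"owner-proxy-shared-key"  # hypothetical key, for illustration only
stored = {"doc1": b"ciphertext-1", "doc2": b"ciphertext-2"}
table = build_hash_table(key, stored)
tampered = {"doc1": b"ciphertext-1", "doc2": b"ciphertext-X"}
print(verify_response(key, table, tampered))
```

An empty result means that every returned document matches its stored hash value; any listed ID would be reported as an integrity violation.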
The output of the validation process for the example is illustrated in Fig. 9b. As noted earlier,
the schema reflects the desired security level expressed by the security plan for the database. Table 1 shows the overhead for several
parameters and crypto-primitives.
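The storage overhead reported in Table 1 can be expressed as an expansion factor relative to the plaintext database. The short Python sketch below does this arithmetic; the sizes are those of Table 1, while the names `SIZES_MB` and `expansion_factor` are introduced here for illustration.

```python
# Database sizes in MB, taken from Table 1.
SIZES_MB = {"Plain": 170, "OPE64": 430, "OPE128": 508, "OPE256": 662, "OPE512": 1000}


def expansion_factor(scheme):
    """Size of the encrypted database relative to the plaintext one."""
    return SIZES_MB[scheme] / SIZES_MB["Plain"]


for scheme in SIZES_MB:
    print(f"{scheme}: {expansion_factor(scheme):.2f}x")
```

With these figures the encrypted database is roughly 2.5 times the plaintext size for OPE64 and close to 5.9 times for OPE512.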
6. Processing queries on encrypted data
According to the proposed scheme, processing queries
over encrypted data requires that each query be transformed into its encrypted
version according to the security plan; this task is
carried out by the proxy. The security plan discussed in Section 5
supplies the parameters of the cryptographic modules to be applied
to the data elements involved in the query. Fig. 10 displays the
processing and rewriting of a sample query.
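The rewriting step can be sketched in Python for the sample query of Fig. 10, written here as a dictionary. The two cryptographic functions are toy stand-ins and are assumptions of this sketch: `toy_det` is a deterministic keyed token standing in for AES-DET (unlike AES-DET it cannot be decrypted), and `toy_ope` is a strictly increasing function standing in for OPE (it is not secure). The sketch handles only flat field conditions, not nested logical operators.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key, for illustration only


def toy_det(text):
    """Deterministic stand-in for AES-DET: equal inputs give equal tokens."""
    return hmac.new(KEY, text.encode(), hashlib.sha256).hexdigest()[:16]


def toy_ope(value):
    """Order-preserving stand-in for OPE (strictly increasing, NOT secure)."""
    return 7 * value + 13


def rewrite_query(query):
    """Encrypt field names and immediate values; keep operators such as $gt."""
    rewritten = {}
    for field, condition in query.items():
        if isinstance(condition, dict):
            enc = {op: toy_ope(v) for op, v in condition.items()}
        else:
            enc = toy_ope(condition)  # bare equality match
        rewritten[toy_det(field)] = enc
    return rewritten


plain = {"salary": {"$gt": 5000}, "balance": {"$lt": 2000}}
print(rewrite_query(plain))
```

Because the value mapping preserves order, a comparison such as `$gt` selects the same records on the ciphertext as it would on the plaintext, which is why the operators can be forwarded unchanged.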
To illustrate query encryption, Table 2 lists
some sample queries together with their encrypted versions after the security
plan is enforced. As can be seen, data elements and immediate values are encrypted; however, the output is consistent with NoSQL semantics.
In this configuration the data owners trust only the
proxy (SecureNoSQL); the cloud servers are not considered trustworthy.
Thus, as a result of data integrity verification, all active attacks
by internal or external attackers will be detected by the proposed
Fig. 9. The security plan for the sample input: (a) The data element section of sample security plan. (b) The output of the JSON data validation for the sample database.
Fig. 10. The query db.customers.find({salary:{$gt:5000}, balance:{$lt:2000}}) received from an application. (a) The parsing tree of the query. (b) The cryptographic modules
applied to the data elements according to schema definition.
approach. The message authentication code (MAC) is created by
using the keyed Hash Message Authentication Code (HMAC) as
rephrased in Eq. (2).
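\[
\mathrm{HMAC}(K, \mathit{document}) = H\bigl((K \oplus \mathit{okeyPad}) \,\|\, H\bigl((K \oplus \mathit{ikeyPad}) \,\|\, \mathit{document}\bigr)\bigr) \tag{2}
\]
where:
H is the hash function; ⊕ is the XOR operator; ∥ denotes concatenation;
okeyPad is the one-block-long outer key pad; ikeyPad is the one-block-long inner key pad.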
Algorithm 2 presents the pseudo-code of the HMAC function
for a block size of 64 bytes. The computed hash values, together with the corresponding documents' unique identifiers, can be stored as
key-value pairs in a hash table, thus allowing the proxy to carry out the
lookup in constant time during the verification process.
Algorithm 2. Keyed Hash Message Authentication Code (HMAC) generation
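As a worked illustration of Eq. (2), the Python sketch below computes an HMAC directly from the definition and can be checked against the standard library. It is not a transcription of Algorithm 2: the function name `hmac_manual` is hypothetical, the pad bytes 0x36 and 0x5C are the standard inner and outer pad values, and SHA-256 is assumed as the default because its block size is 64 bytes. Passing `"sha512"` gives a 512-bit value, computed with that function's 128-byte block.

```python
import hashlib
import hmac


def hmac_manual(key, message, hash_name="sha256"):
    """Compute HMAC from its definition, H((K ^ opad) || H((K ^ ipad) || m))."""
    block_size = hashlib.new(hash_name).block_size  # 64 bytes for SHA-256
    if len(key) > block_size:
        key = hashlib.new(hash_name, key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")            # then zero-padded to a block
    inner_pad = bytes(b ^ 0x36 for b in key)        # K xor ikeyPad
    outer_pad = bytes(b ^ 0x5C for b in key)        # K xor okeyPad
    inner = hashlib.new(hash_name, inner_pad + message).digest()
    return hashlib.new(hash_name, outer_pad + inner).digest()


key, doc = b"secret-key", b"encrypted document bytes"
# The manual construction should agree with the library implementation.
print(hmac_manual(key, doc) == hmac.new(key, doc, hashlib.sha256).digest())
```

In practice the library routine is preferable; the manual version is shown only to make the roles of the two pads explicit.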
Table 2
Five sample queries and their corresponding encrypted version.
Fig. 11. (1) The data owner transfers the encrypted database to the cloud server. (2)
The data owner sends the hash database to the proxy. (3) Clients send plain queries to the
proxy. (4) The proxy translates the queries into their encrypted version and forwards them
to the cloud server. (5) The cloud server returns the query response set. (6) The
proxy runs a hash verification process on the query response set and then, based on
the result, either forwards the decrypted response to the client or reports an integrity violation
to the client.
8. Results and discussion
The response time of a query to an encrypted NoSQL database
has several components:
1. the time to encode the query in JSON format;
2. the time to encrypt and decrypt the data;
3. the communication time to/from the server;
4. the database response time.
For our experiments we first created a sample database with one
million records and then determined the overhead of searching an
encrypted database. To do so we measured the database response
time for queries when the records were unencrypted versus when
records were encrypted. Then, we measured the encryption and the
decryption time for different sizes of the ciphertext. We wanted to
isolate the different components of the response time dominated
by the communication time.
The environment used for testing was set up on the Linux
operating system. We chose MongoDB 3.0.2 (Dede, Govindaraju, Gunter,
Canon, & Ramakrishnan, 2013), which is classified as a NoSQL document
store database. A random data generator tool supporting the JS, PHP, and
MySQL formats (Keen, 2016) was used to generate a plaintext data set of one million records. Each record had seven
different data fields, including name, email, and salary, as shown in Listing 9b.
We applied OPE with 64, 128, 256 and 512 bit to the numeric data types,
and AES-DET with 128 bit to the string data types of the plaintext
data set, and generated four encrypted data sets of one million
records each. Finally, we uploaded the five datasets and created five
MongoDB databases, one with the unencrypted data and four with
the encrypted data. Once the MongoDB databases were created, we
ran several types of queries, including equality, greater than, less
than, greater than or equal to, less than or equal to, and OR logical
operations.
The experiments to measure the query time must be carefully
designed. To obtain the average query processing time, each experiment has to be carried out repeatedly. We noticed a significant
reduction of the database management response time after the first
execution of a query, a sign that MongoDB is optimized and caches
the results of the most recent queries. A solution is to disable the
cache or, if this is not feasible, to clear the cache before repeating the
query. Another important observation is that modern processors
have a 64-bit architecture and are optimized for operations on 64-bit integers. For three of the five types of queries, Q2 (Range query),
Q3 (equality), and Q4 (logical), database response time is slightly
shorter for the encrypted database than for the unencrypted one
Fig. 12. The query processing time in milliseconds (ms) for the unencrypted database and for the encrypted databases when the 32-bit keys are encrypted as 64, 128, 256
and 512-bit integers.
Table 3
The query processing time in milliseconds (ms) for the plaintext and for the ciphertext. 32-bit plaintext integers are encrypted as 64, 128, 256 and 512-bit integers. The
record count gives the number of records retrieved by each one of the five types of queries, Q1–Q5.
Query type        Number of matching record(s)   32-bit plaintext   64-bit ciphertext   128-bit ciphertext   256-bit ciphertext   512-bit ciphertext
Q1: Comparison    461,688                        340                310                 355                  370                  380
Q2: Equality      1                              340                380                 390                  400                  410
Q3: Range         991,225                        370                350                 360                  380                  400
Q4: Logical       551,380                        500                540                 550                  555                  560
Q5: Aggregation   1                              600                660                 670                  680                  690
when the keys are 32-bit integers. A plausible explanation is that this
effect is related to cache management.
The results reported in Table 3 and in Fig. 12 show the database
response time for the five MongoDB experiments. Each query was
carried out 100 times with the query cache disabled, and the average
query response time in milliseconds was calculated. We also measured
the encryption and the decryption time; the results are reported
in Fig. 13. The measurement process was automated and
ran under the control of a script which generated the data and
reported the processing time.
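A measurement loop of this kind can be sketched in Python as follows. The function name `average_ms` and the stand-in workload are assumptions made for illustration; this is not the script used in the experiments, and it does not disable any database cache.

```python
import statistics
import time


def average_ms(run_query, repetitions=100):
    """Run `run_query` repeatedly and return the mean wall-clock time in ms."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


# Stand-in workload; in a real experiment this would issue the database query.
print(average_ms(lambda: sum(range(10_000))))
```

Averaging over many repetitions reduces the influence of one-off effects such as the slower first execution noted above.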
Our measurements show that the response time of the NoSQL
database management system to encrypted data depends on the
Fig. 13. Execution time of the OPE module when the key is encrypted as 64, 128,
256, 512, and 1024 bit.
type of the query. The shortest and longest database response
times occur for Q1 (comparison) and Q5 (aggregated queries),
respectively; for these two extremes the time for the unencrypted
database was almost double, but the time for encrypted databases
increases only by 70–80%. As expected, the query processing time
for a given type of query increases, but only slightly, by less than 5%,
when the key length increases from 64 to 128, 256, and 512 bit.
The OPE encryption time increases significantly with the size
of the encryption space; it increases almost tenfold when the size
of the encrypted output increases from 64 bit to 1024 bit, and it
is about 10 ms for 256 bit. The decryption time is considerably
smaller; it increases only slightly, from 0.11 ms to 0.17 ms, when the
size of the encrypted key increases from 64 bit to 1024 bit.
The secure proxy is an important element of the proposed architecture; therefore, the potential attacks that could affect the proxy
should also be taken into consideration. In general, the two major possible attacks on the proxy are Denial of Service (DoS) and unauthorized
access. In a DoS attack, the attacker sends so much network traffic to
the proxy that the system is not capable of processing it within the
expected time frame. Successful DoS attacks can turn the proxy into a
bottleneck of the system. In unauthorized access attacks, attackers
use a proxy to mask their connections while attacking different
targets.
Several solutions exist for improving the security of the proxy
against DoS attacks and reducing their impact, including blocking undesired packets or using multiple proxies with
load balancers. Moreover, to prevent unauthorized access
attacks, appropriate authorization is required to access the
proxy. User authentication based on group membership with different authorizations is the most practical solution.
9. Conclusions and future work
Though the OPE encryption scheme has known security vulnerabilities, it can be very useful for NoSQL database query processing
for the data models discussed in Section 2. While the key is
encrypted using OPE, the other fields of a record can be encrypted
using strong encryption, thus reducing the vulnerability of the data
to attacks. Strong encryption of the value fields could increase the
encryption time but will have little effect on the decryption time.
An important observation is that increasing the size of the codomain of the OPE mapping function from 2^64 to 2^128, 2^256, and
2^512 results in an increase of database response time of up to
5%, except for Q3-type queries, for which the increase is significant.
The penalty for using encrypted rather than unencrypted NoSQL
databases such as MongoDB is less than 5% for Q2, Q4, and Q5,
which is considered to be relatively small. Moreover, the overall
query response time is dominated by the communication time. The
secure proxy is a critical component of the system. The proxy is
multi-threaded and its cache management is non-trivial. The management of the security attributes is rather involved. On the other
hand, a proxy integrated in the client-side software can be lightweight and considerably simpler. We are currently implementing
the two versions of the proxy. Experimental results for multiple large
datasets with up to one million documents show that SecureNoSQL
is rather efficient. Our approach can be extended to a multi-proxy
structure for Big Data applications. We are now implementing a
sophisticated mechanism, based on the PAXOS algorithm
(Lamport, 2001; Marinescu, 2013), for maintaining the consistency of the hash value databases held by the proxies.
Acknowledgment
We thank Victor Shoup from New York University for the NTL C++
library used to manipulate arbitrary-length integers.
References
Ahmadian, M. (2017). Secure query processing in cloud NoSQL. In 2017 IEEE
international conference on consumer electronics (ICCE).
Ahmadian, M., Khodabandehloo, J., & Marinescu, D. (2015). A security scheme for
geographic information databases in location based systems. In IEEE
SoutheastCon (pp. 1–7). http://dx.doi.org/10.1109/SECON.2015.7132941
Ahmadian, M., Paya, A., & Marinescu, D. (2014). Security of applications involving
multiple organizations and order preserving encryption in hybrid cloud
environments. In IEEE international conf. on parallel distributed processing
symposium workshops (IPDPSW) (pp. 894–903). http://dx.doi.org/10.1109/
IPDPSW.2014.102
Bajaj, S., & Sion, R. (2014). Trusteddb: A trusted hardware-based database with
privacy and data confidentiality. IEEE Transactions on Knowledge and Data
Engineering, 26, 752–765.
Boldyreva, A., Chenette, N., Lee, Y., & O’Neill, A. (2009). Order-preserving
symmetric encryption. In Annual international conference on the theory and
applications of cryptographic techniques (pp. 224–241). Springer.
Brakerski, Z., & Vaikuntanathan, V. (2011). Fully homomorphic encryption from
ring-LWE and security for key dependent messages. In Annual cryptology
conference (pp. 505–524). Springer.
Canim, M., Kantarcioglu, M., Hore, B., & Mehrotra, S. (2010). Building disclosure
risk aware query optimizers for relational databases. Proceedings of the VLDB
Endowment, 3, 13–24.
Cash, D., Hofheinz, D., Kiltz, E., & Peikert, C. (2012). Bonsai trees, or how to delegate
a lattice basis. Journal of Cryptology, 25, 601–639.
Cash, D., Jaeger, J., Jarecki, S., Jutla, C. S., Krawczyk, H., Rosu, M.-C., et al. (2014).
Dynamic searchable encryption in very-large databases: Data structures and
implementation. IACR Cryptology ePrint Archive, 2014, 853.
Cash, D., Jarecki, S., Jutla, C., Krawczyk, H., Roşu, M.-C., & Steiner, M. (2013).
Highly-scalable searchable symmetric encryption with support for
Boolean queries. In Advances in cryptology—CRYPTO 2013 (pp. 353–373).
Springer.
Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39,
12–27.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al.
(2008). Bigtable: A distributed storage system for structured data. ACM
Transactions on Computer Systems (TOCS), 26, 4.
Cheon, J. H., Kim, M., & Kim, M. (2016). Optimized search-and-compute circuits
and their application to query evaluation on encrypted data. IEEE Transactions
on Information Forensics and Security, 11, 188–199.
Chow, R., Golle, P., Jakobsson, M., Shi, E., Staddon, J., Masuoka, R., et al. (2009).
Controlling data in the cloud: Outsourcing computation without outsourcing
control. In Proceedings of the 2009 ACM workshop on cloud computing security
(pp. 85–90). ACM.
Colombo, P., & Ferrari, E. (2015). Privacy aware access control for big data: A
research roadmap. Big Data Research, 2, 145–154.
Dede, E., Govindaraju, M., Gunter, D., Canon, R. S., & Ramakrishnan, L. (2013).
Performance evaluation of a MongoDB and hadoop platform for scientific data
analysis. In Proceedings of the 4th ACM workshop on scientific cloud computing
(pp. 13–20). ACM.
Ducas, L., & Micciancio, D. (2015). FHEW: Bootstrapping homomorphic encryption
in less than a second. In Annual international conference on the theory and
applications of cryptographic techniques (pp. 617–640). Springer.
Galiegue, F., & Zyp, K. (2013). JSON schema: Core definitions and terminology.
Internet Engineering Task Force (IETF).
Gantz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital
shadows, and biggest growth in the far east. IDC iView: IDC Analyze the future,
2007, 1–16.
Gentry, C. (2009). A fully homomorphic encryption scheme (Ph.D. thesis). Stanford
University.
Gorbunov, S., Vaikuntanathan, V., & Wee, H. (2013). Attribute-based encryption for
circuits. In Proc. of the forty-fifth annual ACM symposium on theory of computing
(pp. 545–554). http://dx.doi.org/10.1145/2488608.2488677
Halevi, S., & Shoup, V. (2014). Algorithms in HElib. In International cryptology
conference (pp. 554–571). Springer.
Islam, M., & Islam, M. (2014). An approach to provide security to unstructured big
data. In 8th international conf. on software, knowledge information management
and applications (SKIMA) (pp. 1–5). http://dx.doi.org/10.1109/SKIMA.2014.
7083392
Keen, B. (2016). benkeen/generatedata. https://github.com/benkeen/generatedata
Kerschbaum, F. (2015). Frequency-hiding order-preserving encryption. In
Proceedings of the 22nd ACM SIGSAC conference on computer and
communications security (pp. 656–667). ACM.
Lamport, L. (2001). Paxos made simple. ACM Sigact News, 32, 18–25.
Liang, K., Susilo, W., & Liu, J. (2015). Privacy-preserving ciphertext multi-sharing
control for big data storage. IEEE Transactions on Information Forensics and
Security, 10, 1578–1589. http://dx.doi.org/10.1109/TIFS.2015.2419186
Lin, C.-H., Tsai, S.-H., & Lin, Y.-P. (2014). Secure transmission using MIMO
precoding. IEEE Transactions on Information Forensics and Security, 9, 801–813.
http://dx.doi.org/10.1109/TIFS.2014.2309211
Marinescu, D. C. (2013). Cloud computing: Theory and practice. Newnes.
Mavroforakis, C., Chenette, N., O’Neill, A., Kollios, G., & Canetti, R. (2015). Modular
order-preserving encryption, revisited. In Proc. of the 2015 ACM SIGMOD
international conf. on management of data (pp. 763–777). http://dx.doi.org/10.
1145/2723372.2749455
Micciancio, D., & Regev, O. (2009). Lattice-based cryptography. In Post-quantum
cryptography. pp. 147–191. Springer.
Popa, R. A., Redfield, C. M. S., Zeldovich, N., & Balakrishnan, H. (2011). CryptDB:
Protecting confidentiality with encrypted query processing. In Proc. of the
twenty-third ACM symposium on operating systems principles (pp. 85–100).
http://dx.doi.org/10.1145/2043556.2043566
Silver-Greenberg, J., Goldstein, M., & Perlroth, N. (2014). JPMorgan chase hack
affects 76 million households. New York Times, 2.
Sivasubramanian, S. (2012). Amazon dynamoDB: A seamlessly scalable
non-relational database service. In Proc. of ACM SIGMOD international conf. on
management of data (pp. 729–730). http://dx.doi.org/10.1145/2213836.
2213945
Song, D. X., Wagner, D., & Perrig, A. (2000). Practical techniques for searches on
encrypted data. In 2000 IEEE symposium on security and privacy 2000. S&P 2000.
Proceedings (pp. 44–55). IEEE.
Stonebraker, M. (2010). SQL databases v. NoSQL databases. Communications of the
ACM, 53, 10–11. http://dx.doi.org/10.1145/1721654.1721659
Tankard, C. (2012). Big data security. Network Security, 2012, 5–8.
Tu, S., Kaashoek, M. F., Madden, S., & Zeldovich, N. (2013). Processing analytical
queries over encrypted data. Proceedings of the VLDB Endowment, 6, 289–300.
Weiss, N. E., & Miller, R. S. (2015). The target and other financial data breaches:
Frequently asked questions. In Congressional research service, prepared for
members and committees of congress, February (Vol. 4) (p. 2015).
Xiao, L., & Yen, I.-L. (2012). Security analysis for order preserving encryption
schemes. In 2012 46th annual conference on information sciences and systems
(CISS) (pp. 1–6). IEEE.
Xu, L., Jiang, C., Wang, J., Yuan, J., & Ren, Y. (2014). Information security in big data:
Privacy and data mining. IEEE Access, 2, 1149–1176. http://dx.doi.org/10.1109/
ACCESS.2014.2362522
Xu, L., Zhang, X., Wu, X., & Shi, W. (2015). ABSS: An attribute-based sanitizable
signature for integrity of outsourced database with public cloud. In Proceedings
of the 5th ACM conference on data and application security and privacy (pp.
167–169). ACM.
Yu, X., & Wen, Q. (2010). A view about cloud data security from data life cycle. In
International conf. on computational intelligence and software engineering (CiSE)
(pp. 1–4). http://dx.doi.org/10.1109/CISE.2010.5676895
Security of sharded NoSQL databases: A
comparative analysis
Conference Paper · June 2014
DOI: 10.1109/CIACS.2014.6861323
2014 Conference on Information Assurance and Cyber Security (CIACS)
Security of Sharded NoSQL Databases:
A Comparative Analysis
Anam Zahid, Rahat Masood, Muhammad Awais Shibli
School of Electrical Engineering and Computer Sciences (SEECS)
National University of Sciences and Technology (NUST)
Islamabad, Pakistan
Email: {12msitazahid, rahat.masood, awais.shibli}@seecs.edu.pk
The BASE (Basically Available, Soft State and
Eventually Consistent) properties of NoSQL databases
allow them to be scalable and thus, NoSQL systems
inherently support auto-sharding phenomenon [2]. Autosharding is the automatic and native horizontal distribution
of data among different severs in NoSQL databases which,
in turn, increase the performance and throughput of database
operations [9]. Another significant advantage of autosharding is load balancing across the cluster such that a
single server does not get overloaded with all the queries.
This makes NoSQL databases a good candidate for high
transaction and write-intensive database applications [4, 5].
Abstract—NoSQL databases are easy to scale-out because
of their flexible schema and support for BASE (Basically
Available, Soft State and Eventually Consistent) properties.
The process of scaling-out in most of these databases is
supported by sharding which is considered as the key feature
in providing faster reads and writes to the database. However,
securing the data sharded over various servers is a challenging
problem because of the data being distributedly processed and
transmitted over the unsecured network. Though, extensive
research has been performed on NoSQL sharding mechanisms
but no specific criterion has been defined to analyze the
security of sharded architecture. This paper proposes an
assessment criterion comprising various security features for
the analysis of sharded NoSQL databases. It presents a
detailed view of the security features offered by NoSQL
databases and analyzes them with respect to proposed
assessment criteria. The presented analysis helps various
organizations in the selection of appropriate and reliable
database in accordance with their preferences and security
requirements.
Apart from scalability and performance, data security is
probably one of the most difficult challenges faced by the
NoSQL databases now-a-days. These databases were
initially not designed by considering security as an
important feature. Therefore, it has become the sole
responsibility of NoSQL consumers to protect these
databases by using third party tools and services [10].
Furthermore, sharding also pose security risks, even greater
than standalone NoSQL deployments. Some of the main
security risks encountered by these sharded databases are
due to geographic distribution of data, unencrypted data
storage, unauthorized exposure of backup and replicated
data, insecure communication over the network and many
more [3, 13]. Such risks implies the need for strong security
solutions in sharded databases, which not only provides the
security of data-at-rest on a single node, but also maintains
data security during transmission between various nodes in
a sharded environment [11, 12]. Though, NoSQL databases
provide mechanisms to practice basic security principle CIA
(Confidentiality, Integrity and Availability), but still there is
some room for improvement in security solutions especially
those which are designed for sharded environments.
Keywords—Sharding; Database Security; NoSQL; Data and
Applications Security; Comparative Analysis
I.
INTRODUCTION
The term “NoSQL” was first introduced in 1988 for the
relational databases not having SQL interfaces [1].
However, the term was re-introduced in 2009 for the kind of
modern web-scale databases which trade transactional
consistency over large-scale data distributions and
incremental scalability. NoSQL databases were not
originally meant to replace traditional databases, rather they
are more suitable to adopt when relational databases does
not seem appropriate. The main reasons behind adopting
NoSQL databases are their simple yet flexible architecture
and the capability of handling large amount of multimedia,
word processing, social media, emails and other
unstructured data files [6]. While, the conventional
relational databases are hard to scale-out mainly due to their
pre-defined schemas and I/O performance bottlenecks [2, 4].
These issues have made relational databases difficult to fit
in the new computing paradigms such as Grid and Cloud
applications, data warehousing, web2.0, social networking
etc [7]. Contrary to it, NoSQL databases are becoming a
primary choice for cloud applications because of their
highly available, reliable and scalable nature [8].
This paper targets sharding mechanisms in popular
NoSQL databases such as CouchDB, MongoDB and
Cassandra etc and performs a detailed assessment to analyze
the security in their sharded architecture. To the best of our
knowledge, there are no previous assessment criteria to
evaluate the security of existing sharded NoSQL databases.
Therefore, we have identified the factors critical for the
security of sharded databases after a thorough review of
current literature, existing database security controls and
practical approaches. These factors will be used as
1
978-1-4799-5852-8/14/$31.00 ©2014 IEEE
assessment criteria and help enterprises in the evaluation
and selection of suitable sharded NoSQL database according
to their security requirements.
Beside these models, there are also access control
models which satisfy various other security requirements
and offers distinct features from one another. These models
include PBAC (Policy Based Access Control), TBAC (Task
Based Access Control), ABAC (Attribute Based Access
Control), FGAC (Fine Grained Access Control) etc.
Sharded databases need to configure access control policies
to manage consistent authorization strategies throughout the
cluster and to ensure that databases and users have restricted
access only to the resources required to perform their
defined functions [28].
The rest of the paper is organized as follows: Section II
elaborates the assessment criteria for evaluating security of
sharding techniques in various NoSQL databases, Section
III consists of literature review along with the critical
analysis of each technique, Section IV summarizes research
findings and Section V concludes the paper after presenting
some future directions.
II. ASSESSMENT CRITERIA
C. Secure Configurations
A security breach may easily occur due to
misconfigurations at the OS, database or application layer
[26] therefore, sharded database must include a list of
configurations that can be applied by the system
administrators to secure databases uniformly across clusters
at both physical and logical level. These configurations are
implemented based on the security and business needs of an
organization. Sharded databases usually contain one or more
configuration databases holding information about data
configurations and distributions. Hence, it is the
responsibility of cluster administrator to automatically
propagate database configuration information throughout
the cluster [24]. Some of the important categories of secure
configurations include securing of patches and updates,
services, protocols, roles and accounts, files and directories,
backups, ports and registries etc [27].
Organizations moving their data to distributed NoSQL
data stores must consider security as an important factor
besides consistency and availability. Distributing data to
multiple servers in different data centers provides more
avenues for both physical and virtual security attacks
therefore, it is very important to identify the factors
necessary for enforcing security in sharded databases and to
consider that these factors should be applicable to every
sharded database equally. For this purpose, we have
extensively reviewed the existing security controls of
sharded NoSQL databases [13, 17, 18] and have identified a
criterion for the security assessment of sharded database
architectures. This criteria will not only serve as generic
guidelines for the security evaluation of existing and
upcoming NoSQL databases but also highlights those areas
in sharded databases which needs security improvements. A
brief description against each factor of the criteria is given
below:
D. Data Encryption
Data encryption is used to provide confidentiality of
data and applications in a database system [17]. It includes
the encryption of data-at-rest as well as the encryption of
data-in-transit over the network. Sharded databases need to
apply data encryption techniques at the table, row and
column level to secure the information [15, 19]. Common
methods for protecting data-at-rest are implementations of
encryption algorithms such as DES (Data Encryption
Standard) and AES (Advanced Encryption Standard),
together with hashing algorithms (MD5, SHA-1 and SHA-2).
In addition, there exist many different mechanisms to
encrypt data-in-transit. These mechanisms protect data
transmissions from server to server and from server to
client applications. These communications can be encrypted
using standard techniques such as SSL, TLS, IPSec and
SSH. Moreover, the management of encryption keys is an
important part of data encryption and should be handled
with proper care [16].
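As an illustration of data-in-transit protection, the following is a minimal sketch, using only the Python standard library, of a client-side TLS context that verifies the server certificate and refuses legacy protocol versions. The function name make_client_tls_context is a hypothetical helper, not part of any database driver, and the choice of TLS 1.2 as the minimum version is an assumption.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context for connections to a database server."""
    # create_default_context() turns on certificate verification and
    # hostname checking and loads the system's trusted CA certificates.
    ctx = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH)
    # Assumed policy: refuse SSLv3, TLS 1.0 and TLS 1.1.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    context = make_client_tls_context()
    print(context.verify_mode, context.check_hostname, context.minimum_version)
```

A driver that accepts an ssl.SSLContext could be handed this object; whether a given NoSQL client does so depends on the client library and is not covered here.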
A. Authentication
Authentication is the process of verifying the identity of a
user who wants to access the resources, data or applications
of an organization [18]. Authentication can be provided in
many ways, ranging from single-user authentication, to
mutual authentication of a user and a database server, to
two-way authentication between database servers [14]. A
sharded database may provide its own authentication
mechanism or it may rely on third-party systems such as
LDAP directories or PKI to identify and authenticate its
users and database servers [17]. Well-known authentication
techniques include password-based authentication,
multi-factor authentication, certificate-based authentication,
and authentication protocols such as SSL, SSH and Kerberos.
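Password-based authentication, the simplest of the techniques listed above, can be sketched as follows. This is an illustrative example rather than the mechanism of any particular database: it stores a salted PBKDF2 hash instead of the plain password and compares hashes in constant time. The iteration count and the helper names hash_password and verify_password are assumptions.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor; tune for the deployment


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per stored password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


if __name__ == "__main__":
    stored_salt, stored_digest = hash_password("s3cret")
    print(verify_password("s3cret", stored_salt, stored_digest))
    print(verify_password("guess", stored_salt, stored_digest))
```

Only the salt and the digest need to be stored on the server, so a leaked credential table does not directly reveal passwords.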
B. Access Controls
Access control is the mechanism that ensures that only an
authorized person is allowed to access system resources.
Access control can be applied at the system, database,
object and content level, depending on the configuration
chosen by the database administrator [22]. A number of
access control models have been proposed to provide
secure access to database tables and columns that have
sensitive values attached to them. The three most
conventional access control models are Discretionary
Access Control (DAC), Mandatory Access Control (MAC)
and Role-Based Access Control (RBAC) [23, 25].
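The role-based model can be illustrated with a small sketch. The role names, the permission names and the is_allowed helper below are hypothetical and chosen only for illustration; they do not correspond to the built-in roles of any database discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical role-to-permission mapping.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "reader": {"read"},
    "writer": {"read", "write"},
    "dba": {"read", "write", "create_user", "drop_collection"},
}


@dataclass
class User:
    name: str
    roles: Set[str] = field(default_factory=set)


def is_allowed(user: User, action: str) -> bool:
    """Grant an action only if one of the user's roles carries it."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


if __name__ == "__main__":
    alice = User("alice", {"reader"})
    print(is_allowed(alice, "read"), is_allowed(alice, "write"))
```

Because permissions attach to roles rather than to individual users, changing what a role may do takes effect for every user holding that role.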
E. Auditing
Database auditing refers to the monitoring and
recording of individual and collective actions performed by
database users [20]. Auditing helps to identify footprints or
possible password cracking attempts before an attack
occurs [21]. Therefore, in sharded database environments
the database administrators (DBAs) should use regular
auditing or fine-grained auditing techniques to detect and
monitor unauthorized access to data and operations [18].
Some of the important auditing activities include auditing
of database connections, privileged activities and
transaction logs.
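A minimal sketch of connection-level auditing is given below. The AuditTrail class and its method names are hypothetical; a real deployment would write to durable, access-controlled storage rather than to an in-memory list.

```python
import json
import time
from typing import Any, Dict, List


class AuditTrail:
    """Append-only record of user actions, kept in memory for illustration."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def record(self, user: str, action: str, success: bool, **details: Any) -> None:
        entry: Dict[str, Any] = {
            "ts": time.time(),
            "user": user,
            "action": action,
            "success": success,
            "details": details,
        }
        # One JSON document per event keeps the trail easy to parse later.
        self._records.append(json.dumps(entry, sort_keys=True))

    def failed_logins(self, user: str) -> int:
        """Count failed log-on attempts, e.g. to spot password guessing."""
        count = 0
        for line in self._records:
            entry = json.loads(line)
            if entry["user"] == user and entry["action"] == "login" and not entry["success"]:
                count += 1
        return count


if __name__ == "__main__":
    trail = AuditTrail()
    trail.record("mallory", "login", False, source="10.0.0.7")
    trail.record("alice", "login", True, source="10.0.0.3")
    print(trail.failed_logins("mallory"))
```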
servers. Query routers redirect the client query to the
appropriate shards after looking up shard addresses from the
config servers. Query routers are also responsible for
cluster balancing using the two primary operations of chunk
splitting and balancing. MongoDB supports collection-level
partitioning of data. Each collection is distributed evenly
across the MongoDB sharded cluster in the form of chunks [30].
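The routing step described above can be sketched as a lookup in a sorted table of chunk boundaries. This is a simplified illustration, assuming range-based chunks over a string shard key; the chunk boundaries and shard names are hypothetical, and the table stands in for metadata that a query router would obtain from the config servers.

```python
import bisect
from typing import List, Tuple

# Hypothetical chunk table: each chunk starts at its lower bound and ends
# just before the next chunk's lower bound.
CHUNKS: List[Tuple[str, str]] = [
    ("", "shard-a"),
    ("g", "shard-b"),
    ("n", "shard-c"),
    ("t", "shard-a"),
]
LOWER_BOUNDS: List[str] = [lower for lower, _ in CHUNKS]


def route(shard_key: str) -> str:
    """Return the shard holding the chunk whose range contains the key."""
    # bisect_right finds the first chunk starting after the key; the chunk
    # just before it is the one that covers the key.
    index = bisect.bisect_right(LOWER_BOUNDS, shard_key) - 1
    return CHUNKS[index][1]


if __name__ == "__main__":
    for key in ("alice", "mongo", "nosql", "zed"):
        print(key, route(key))
```

Splitting a chunk corresponds to inserting a new boundary into the table, and balancing corresponds to changing the shard name attached to an existing chunk.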
2) Security Analysis: MongoDB supports plain
MongoDB-CR for user authentication in its open source
edition. Additionally, it uses SSL with X.509 certificates not
only for secure communication between the user and the
MongoDB cluster but for intra-cluster authentication as
well. Access control is enabled through system-defined and
custom-defined roles within a sharded cluster. Data
encryption is provided only at the transport level through
SSL, and only basic auditing of data operations is offered.
Security of database configurations is supported at a very
low level and is left entirely to the database administrator.
This section has presented an overview of the identified
assessment criteria along with a few well-known examples of
each. In the next section, we analyze the security features
of six widely adopted, open-source sharded NoSQL
databases on the basis of these common criteria. Each of the
defined security assessment criteria has been classified into
three metric values (low, medium and high) in order to help
NoSQL consumers select a suitable sharded database
according to their requirements. These metric values
describe the significance of the identified security factors in
sharded databases and are shown in Table I, while the
identified security assessment criteria with respect to these
values can be seen in Table II. This classification also acts
as a guideline for the security assessment of various sharded
databases throughout the paper.
III. ANALYSIS OF SHARDING SECURITY IN NOSQL DATABASES
B. Redis
1) Introduction: Redis is a key-value, highly
available and distributed NoSQL data store. Unlike other
NoSQL databases, Redis supports three kinds of
partitioning, namely “client side partitioning”, “proxy
assisted partitioning” and “query routing”. Client-side
partitioning lets the Redis client directly select the
appropriate node for storing a key. Proxy-assisted
partitioning, on the other hand, requires the query to be
sent to a proxy server instead of directly to the appropriate
Redis instance. With query routing, a client query can be
sent to any instance of the Redis cluster, and it is the
responsibility of that instance to forward the query to the
appropriate instance [31]. However, a cluster with a
master-slave configuration can only survive a network
partition when the majority of its master nodes are
reachable. Redis also supports master-slave replication
within its cluster for read efficiency and data redundancy [32].
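Client-side partitioning can be illustrated with a short sketch in which the client itself hashes the key and picks a node. The node addresses are hypothetical, and CRC32 modulo the node count is one simple hashing scheme chosen for illustration rather than the algorithm of any specific Redis client.

```python
import zlib
from typing import List

# Hypothetical Redis instances known to the client.
NODES: List[str] = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]


def node_for_key(key: str, nodes: List[str] = NODES) -> str:
    """Pick the node for a key by hashing the key on the client side."""
    slot = zlib.crc32(key.encode("utf-8")) % len(nodes)
    return nodes[slot]


if __name__ == "__main__":
    for key in ("user:1", "user:2", "session:abc"):
        print(key, "->", node_for_key(key))
```

With this scheme every client must use the same node list; adding or removing a node changes the modulus and therefore remaps most keys, which is the usual drawback of plain hash partitioning.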
2) Security Analysis: Redis provides password-based
authentication to its clients. These passwords are set by the
system administrators and stored in plain text. However,
Redis does not enable authentication by default and listens
on all IP addresses on port 6379. In addition, Redis does not
provide any access control mechanism and relies on
third-party SSL implementations for the security of data
transmissions over untrusted networks. Redis also does not
support configuration security, data encryption or auditing.
This section provides an introduction to sharding
techniques in various NoSQL databases and analyzes their
current security features on the basis of the assessment
criteria proposed in Section II. The analysis covers the
security features currently offered by sharded NoSQL
databases and also highlights effective security controls
that are lacking. It will help organizations, NoSQL vendors
and consumers select an appropriate NoSQL database
according to their security requirements. Additionally, this
analysis will assist in improving the security controls of
various NoSQL databases. The following is a detailed
security analysis of the sharding architecture of several
existing NoSQL databases, namely MongoDB, Redis, HBase,
Cassandra, CouchDB and Couchbase Server, against the
defined assessment criteria. These particular NoSQL
databases were selected for their popularity, open source
nature and easily available documentation.
A. MongoDB
1) Introduction: MongoDB is an open source, highly
available, document oriented, scalable and fault tolerant
NoSQL database [29]. It supports sharding by configuring a
sharded cluster. Each cluster is composed of three main
components namely shards, query routers and configuration
TABLE I. SECURITY ASSESSMENT IMPORTANCE KEY
Metric Value | Description
High | Provides complete support of required features needed to secure data
Medium | Provides a limited set of security feature only and it is advisable to implement missing features
Low | Offers very basic security features or no security at all
C. CouchDB
1) Introduction: CouchDB is a document-oriented,
peer-based NoSQL database with high scalability and
availability [33]. The cluster configuration of CouchDB
supports document redistribution across nodes for large
performance improvements. Any change in a document on a
single node is periodically copied to other nodes of the
cluster using the phenomenon of incremental replication
TABLE II. SECURITY ASSESSMENT CRITERIA
Metric Value | Description
Authentication
High | Sharded database cluster should provide one of the following authentication mechanisms such as: (a) Two-factor authentication (e.g. passwords with fingerprints, PIN code with mobile number etc); (b) Certificate based authentication using PKI with LDAP. It should also support client side, intra-cluster as well as inter-cluster authentication.
Medium | Supports only one type of authentication e.g. SSO, OpenID, SAML or some certificate based authentication etc either within a sharded cluster or with client.
Low | Only Simple Password based client authentication is supported or no authentication at all.
Access Control
High | Dynamic access control rules following principal of least privileges, separation of duties with custom defined access control policies such as UCON (Usage Control Mode) etc.
Medium | Support for either RBAC, DAC, MAC, ABAC or FGAC etc.
Low | Few predefined roles such as user and DBA or no other support for access control at all.
Secure Configuration
High | This type of database configuration security ensures: security of Database Log and configuration files; hardened database servers; limited number of protocols, services and administrator accounts; secured backup and recovery of databases; regular Patching and updates etc.
Medium | Only supports a limited number of protocols for communication and strict access control to registries and accounts
Low | Use of default configurations
Data Encryption
High | Encryption of data-at-transit as well as data-at-rest at application, network and database level. Use of secure protocols like SSL, TLS etc and strong encryption with secure key management mechanisms for data security.
Medium | Encryption of data-at-rest or use of transport layer security
Low | No security of data-at-rest or data during transmission
Auditing
High | Transparent auditing of database, application and user profiles using auditing and monitoring tools. Logging of all internal and external activities and events.
Medium | Database level auditing, logging of all changes to user profiles
Low | Database connection level auditing (i.e. log-on, log-off etc.) or No auditing mechanisms
admin party, for its users. Moreover, authorization is only
implemented at the database level. Medium-level auditing is
provided, which records views and events in log files.
However, CouchDB does not provide automatic logging,
so the configuration of logs is the responsibility of the
database administrators. Furthermore, automatic backups of
database logs and replicas are not supported in CouchDB.
[34]. This incremental replication also helps in maintaining
data redundancy and consistency in a CouchDB cluster.
Moreover, CouchDB uses the techniques of “oversharding”
and “iterative shard replacements” to distribute data evenly
across the cluster [35]. These techniques help a sharded
cluster grow optimally without much downtime.
2) Security Analysis: CouchDB supports basic
password-based authentication as well as cookie-based
authentication for its users. Passwords in CouchDB are
hashed using the PBKDF2 algorithm and are sent over the
network using SSL for the security of data transmission.
Access control in CouchDB only supports a single role, i.e.
D. Cassandra
1) Introduction: Cassandra is a column-oriented,
highly scalable and distributed NoSQL database based on
F. Couchbase Server
1) Introduction: Couchbase Server is an open source,
distributed, document-oriented NoSQL database. It has a
true shared-nothing architecture with auto-sharding and
cross datacenter replication (XDCR) facilities. All the
servers in a cluster are distributed across various data
centers. The documents are also uniformly distributed across
the cluster and stored in special data containers called
vbuckets. A Couchbase cluster scales horizontally, and
nodes can be added and removed when needed [41]. The
mappings of vbuckets to the nodes of the cluster are stored
in a lookup structure called the cluster map. This cluster
map is stored on all the cluster nodes as well as on the
Couchbase client nodes. Whenever a cluster scales in or
out, a rebalancing round starts to distribute vbuckets evenly
across the cluster. High availability and failover are
maintained through replication at the vbucket level, by
keeping an active vbucket on one node and its replica on
another node [42]. Database administrators monitor cluster
performance through either the administrative web interface
or the management REST API [43].
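The two-level lookup described above (key to vbucket, vbucket to node) can be sketched as follows. The number of vbuckets, the CRC32 hash and the round-robin assignment in build_cluster_map are assumptions made for illustration; they are not taken from the Couchbase documentation cited here.

```python
import zlib
from typing import Dict, List

NUM_VBUCKETS = 1024  # assumed number of vbuckets per bucket


def vbucket_for_key(key: str) -> int:
    """Hash a document key to a vbucket; independent of the node list."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_cluster_map(nodes: List[str]) -> Dict[int, str]:
    """Assign vbuckets to nodes round-robin so the load is spread evenly."""
    return {vb: nodes[vb % len(nodes)] for vb in range(NUM_VBUCKETS)}


def locate_document(key: str, cluster_map: Dict[int, str]) -> str:
    """Resolve a document key to the node holding its active vbucket."""
    return cluster_map[vbucket_for_key(key)]


if __name__ == "__main__":
    cluster_map = build_cluster_map(["node-1", "node-2", "node-3"])
    print(locate_document("doc::42", cluster_map))
    # Scaling out only rebuilds the map; the key-to-vbucket hash is unchanged.
    cluster_map = build_cluster_map(["node-1", "node-2", "node-3", "node-4"])
    print(locate_document("doc::42", cluster_map))
```

Because keys hash to vbuckets rather than directly to nodes, a rebalancing round moves whole vbuckets and clients only need an updated copy of the map.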
2) Security Analysis: The Couchbase administrative
web interface and its management REST API use HTTP
basic authentication, while SASL-based external
authentication is also supported. Additionally, the
administrative console only provides read-only
administration rights to its clients. Couchbase provides
secure data replication using SSL only in its XDCR
technology, which is not part of its open source distribution.
Couchbase Server supports logging of every component,
i.e. views, indexes, vbuckets etc. However, configuration
security is still absent in Couchbase Server and is left
entirely to the database administrator.
the architecture of Google’s BigTable and Amazon’s
Dynamo data store. Cassandra combines the partitioning and
replication strategy of Dynamo with the column family data
model of BigTable [36]. A Cassandra cluster consists of
multiple decentralized nodes for the storage of partitioned
data items. Each node is responsible for the management
and storage of its own data items. All these data items are
distributed transparently over the nodes, and each node can
route read and write requests to the appropriate node [37].
Two replication strategies, “Simple Strategy” and “Network
Topology Strategy”, are used to achieve high availability
and scalability. All write operations are first written to the
commit log for durability and recoverability [38].
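Dynamo-style partitioning with simple replica placement can be sketched as a token ring. The example below is a simplified illustration and not Cassandra's implementation: it assumes one token per node derived from an MD5 hash of the node name, and it places replicas on the next distinct nodes clockwise, in the spirit of the “Simple Strategy”. The Ring class and node names are hypothetical.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(value: str) -> int:
    """Map a string to a position on the ring using an MD5 hash."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Token ring with one token per node (no virtual nodes)."""

    def __init__(self, nodes: List[str]) -> None:
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)
        self._tokens: List[int] = [t for t, _ in self._ring]

    def replicas(self, key: str, replication_factor: int) -> List[str]:
        """Return the owner of the key followed by the next nodes clockwise."""
        # The owner is the first node whose token is >= the key's token,
        # wrapping around to the start of the ring if there is none.
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    print(ring.replicas("user:1001", replication_factor=3))
```

Any node can run the same computation on the key, which is why a request arriving at an arbitrary node can be routed to the nodes that own the data.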
2) Security Analysis: Cassandra provides very weak
password-based authentication in which all passwords are
stored as MD5 hashes. Authentication and authorization in
Cassandra are provided only between the client and the
Cassandra cluster, i.e. inter-node message exchange does
not support authentication by default. Hence, any malicious
user with access to the network used by the Cassandra
cluster can cause damage and extract data, bypassing
client-side authentication. However, Cassandra provides
intra-cluster transmission security at the cluster, datacenter
and rack level when SSL/TLS is enabled in its
configuration. By default there is a single superuser in
Cassandra, but other users can be created and assigned
permissions using CQL (Cassandra Query Language).
Furthermore, Cassandra does not support auditing, logging,
data-at-rest encryption or configuration security across the
cluster in its open-source version.
E. HBase
1) Introduction: HBase is an open source,
column-oriented, automatically distributed and scalable
Hadoop data store built on the concepts of Google’s
BigTable architecture. It uses distributed configuration,
replication and write-ahead-logging (WAL) mechanisms to
recover from failures through automatic failover. A client
query in HBase is sent directly to the responsible
RegionServer after a lookup in the .META. and -ROOT-
catalog tables [39]. An HBase cluster has multiple
RegionServers and a single Master server. Each
RegionServer hosts multiple Regions that store table data.
When a single table becomes too big, it is distributed
across multiple Regions [40].
2) Security Analysis: HBase supports token-based
authentication for MapReduce tasks as well as user
authentication. User authentication is performed using
SASL (Simple Authentication and Security Layer) with
Kerberos on a per-connection basis. Additionally,
authorization in HBase is managed by ACLs (Access
Control Lists) or Coprocessors, with column family level
granularity and on a per-user basis. HBase also provides
logging support up to the data node level. However, other
high-level auditing and monitoring features are absent in
HBase, along with configuration security and encryption of
data-at-rest.
IV. RESEARCH FINDINGS & DISCUSSION
From the analysis of the sharding-capable NoSQL
databases above, it can be seen that most of them offer
little or no security, although security is an important
aspect of distributed data processing environments where
multiple users send multiple data requests over unsecured
channels. The assessment clearly shows that most of the
sharded databases, such as MongoDB, Cassandra and Redis,
provide password-based client-side authentication, while
intra-cluster authentication is not very common and is only
provided by using SSL/TLS protocols between client and
server. Role-based authorization for access control is
implemented at a basic level in these databases, and all the
rights to read, write or modify data are assigned to a
superuser by default. These databases also provide no
support for custom-defined roles, except for MongoDB and
HBase, while authorization is mostly defined at
database-level granularity. Sharded database configuration
security is a domain that is completely neglected by NoSQL
databases. Most NoSQL solution vendors recommend the
use of VPNs and firewalls while providing very basic
built-in support for
individual comparative analysis of sharded NoSQL
databases against each criterion is shown in Fig. 1 - 5 below.
The analysis shows that there is a significant need to
improve the security of sharded NoSQL databases. In light
of this evaluation, further research is required in the
domain of sharded NoSQL databases with the objective of
achieving reliable, efficient and secure sharding
mechanisms. We recommend a holistic solution that
achieves robust security features in sharded databases while
keeping scalability and performance issues in mind.
network security. Security of backups and replicas is also
considered to be the sole responsibility of database
administrators. Furthermore, most of the databases do not
support any kind of data transmission security; others have
it disabled by default. Therefore, all intra-cluster and
client-side transmission security should be ensured through
the SSL/TLS protocol. Finally, most of these sharded
databases provide support for auditing at the database or
table level but lack automatic auditing and monitoring in
their open source releases. Table III presents a summarized
view of the research findings and the
Fig. 1. Comparative Results for Authentication (metric values per sharded NoSQL database)
Fig. 2. Comparative Results for Access Control (metric values per sharded NoSQL database)
Fig. 3. Comparative Results for Secure Configurations (metric values per sharded NoSQL database)
Fig. 4. Comparative Results for Data Encryption (metric values per sharded NoSQL database)
DynamoDB and SimpleDB, and analyze how the cloud
computing paradigm affects the security features of these
sharded database delivery models.
REFERENCES
[1] R. Masood. "Fine-Grained Access Control for Database Management Systems." MS (CCS) thesis, National University of Science and Technology, Pakistan, 2013.
[2] 10Gen Corporation. "NoSQL Explained." Internet: http://www.mongodb.com/nosql-explained, 2011 [Mar. 25, 2014].
[3] IBM Corporation. "Data Security and Privacy – A holistic Approach." Internet: www.ibm.com/software/data/optim/protect-data-privacy, Sept. 2011 [Apr. 11, 2014].
[4] CodeFutures Corporation. "Cost-effective Database Scalability using database Sharding." Internet: www.codefutures.com/databasesharding, Jul. 2008 [Mar. 24, 2014].
[5] A. Viswanathan and C.J. Kothari. "Hibernate Framework-based database sharding for SaaS Applications." Internet: http://www.ibm.com/developerworks/library/os-hibernatesaas, Oct. 12, 2010 [Apr. 2, 2014].
[6] C. Roe. "The Growth of Unstructured Data: What To Do with All Those Zettabytes?" Internet: http://www.dataversity.net/the-growthof-unstructured-data-what-are-we-going-to-do-with-all-thosezettabytes, Mar. 15, 2012 [Mar. 28, 2014].
[7] N. Hardiman. "Cloud computing and the rise of big data." Internet: http://www.techrepublic.com/blog/the-enterprise-cloud/cloudcomputing-and-the-rise-of-big-data, Oct. 1, 2013 [Apr. 5, 2014].
[8] A. Schram and K.M. Anderson. "MySQL to NoSQL: data modeling challenges in supporting scalability," in Proc. of the 3rd annual conference on Systems, programming, and applications: software for humanity, 2012, pp. 191-202.
[9] MongoDB Inc. "MongoDB sharding guide - Sharding and MongoDB Release 2.4.6." Internet: http://docs.mongodb.org/manual/sharding/, Mar. 2014 [Mar. 28, 2014].
[10] J.D. Meier, A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla and A. Murukan. "Chapter 18: Securing Your Database Server." Internet: http://msdn.microsoft.com/en-us/library/ff648664.aspx, Jun. 2006 [Mar. 29, 2014].
[11] T. Allard, N. Anciaux, L. Bouganim, Y. Guo, L. Le Folgoc, B. Nguyen, P. Pucheral, I. Ray, I. Ray, and S. Yin. "Secure personal data servers: a vision paper," in Proc. of the VLDB Endowment 3, 2010, pp. 25-35.
[12] U.T. Mattsson. "A practical implementation of transparent encryption and separation of duties in enterprise databases: protection against
Fig. 5. Comparative Results for Auditing (metric values per sharded NoSQL database)
V. CONCLUSION AND FUTURE WORK
This paper presents assessment criteria for the evaluation
of various open source, sharded NoSQL databases. These
sharded databases are analyzed against the proposed
assessment criteria, and the criticality of each criterion in
the existing systems is identified. This not only helps in
further analysis of those areas of sharded databases that
lack security but also enables NoSQL vendors and
customers to improve the security techniques already
implemented. The survey findings show that improving
security is a continuous process that should not be
abandoned. We have also found that there is no single
complete solution to all sharded database security problems,
and any organization that needs to deploy these sharded
databases must consider security at every level, including
the security of the database cluster itself, transmission
security, security of data-at-rest and security of backups
and replicas. As future work, we would like to survey
different NoSQL sharding solutions deployed in the cloud in
the form of DBaaS (Database-as-a-Service), such as Amazon’s
TABLE III. COMPARATIVE ANALYSIS OF SHARDING SECURITY IN VARIOUS NOSQL DATABASES
Assessment Criteria | MongoDB | Redis | CouchDB | Cassandra | HBase | Couchbase Server
Authentication | Medium | Low | Medium | Low | Medium | Medium
Access Control | High | Low | Low | Low | Medium | Low
Secure Configuration | Medium | Low | Low | Low | Low | Low
Data Encryption | Medium | Low | Medium | Medium | Low | Low
Auditing | Low | Low | Medium | Low | Medium | Medium
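The ratings of Table III can be tallied with a short script to see how often each database is rated Low. The sketch below simply re-encodes the table; the variable and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

DATABASES: List[str] = [
    "MongoDB", "Redis", "CouchDB", "Cassandra", "HBase", "Couchbase Server",
]

# Ratings from Table III, one list per criterion, in the order of DATABASES.
RATINGS: Dict[str, List[str]] = {
    "Authentication": ["Medium", "Low", "Medium", "Low", "Medium", "Medium"],
    "Access Control": ["High", "Low", "Low", "Low", "Medium", "Low"],
    "Secure Configuration": ["Medium", "Low", "Low", "Low", "Low", "Low"],
    "Data Encryption": ["Medium", "Low", "Medium", "Medium", "Low", "Low"],
    "Auditing": ["Low", "Low", "Medium", "Low", "Medium", "Medium"],
}


def low_counts() -> Dict[str, int]:
    """Count, per database, how many criteria are rated Low."""
    counts: Counter = Counter()
    for row in RATINGS.values():
        for database, rating in zip(DATABASES, row):
            if rating == "Low":
                counts[database] += 1
    return {database: counts[database] for database in DATABASES}


if __name__ == "__main__":
    for database, count in low_counts().items():
        print(f"{database}: {count} of {len(RATINGS)} criteria rated Low")
```

In this tally 17 of the 30 ratings are Low, and Redis is rated Low on all five criteria, which is consistent with the observation that most of the surveyed databases offer little built-in security.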
external and internal attacks on databases." In E-Commerce Technology 2005, 2005, pp. 559-565.
[13] T. Baccam. "Oracle Database Security: What to look for and Where to secure." Internet: https://www.sans.org/reading-room/analystsprogram/oraclewhitepaper-201004, Apr. 2010 [Mar. 12, 2014].
[14] R. Duncan. "An Overview of Different Authentication Methods and Protocols." SANS Institute, Washington, DC, US, 2001. Available: http://www.sans.org/readingroom/whitepapers/authentication/overview-authentication-methodsprotocols-118
[15] K.W. Nafi, T. Shekha Kar, S.A. Hoque, and M. M. A. Hashem. (2013, Mar.) "A newer user authentication, file encryption and distributed server based cloud computing security architecture." (IJACSA) International Journal of Advanced Computer Science and Applications. [On-line]. 3(10), pp. 181-186. Available: http://arxiv.org/ftp/arxiv/papers/1303/1303.0598.pdf [Mar. 12, 2014].
[16] C. Meyer and J. Schwenk. "Lessons Learned From Previous SSL/TLS Attacks-A Brief Chronology Of Attacks And Weaknesses," in Proc. IACR Cryptology, 2013, pp. 49.
[17] D.I.S.A. "Database security technical implementation guide version 8, release 1." Database STIG for USA Dept. of Defence, Sept. 2007.
[18] Oracle Corp. "Oracle database security guide 10g." Internet: http://docs.oracle.com/cd/B12037_01/network.101/b10773.pdf, Dec. 2003 [Mar. 20, 2014].
[19] A. Lane. "Choosing and Implementing an Enterprise database encryption strategy." Information Week Report, Rep. S7220713, 2013.
[20] P. Badkar. "Oracle Database Security Guide 11g Release 1." Internet: http://docs.oracle.com/cd/B28359_01/network.111/b28531.pdf, Jan. 2014 [Mar. 11, 2014].
[21] J.D. Meier, A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla and A. Murukan. "Chapter 2: Threats and Countermeasures." Internet: msdn.microsoft.com/en-us/library/ff648641.aspx, Jan. 2006 [Apr. 21, 2014].
[22] M.G. Piattini and E. Fernandez-Medina. "Secure databases: state of the art," in Proc. IEEE 34th Annual International Carnahan Conference on Security Technology, 2000, pp. 228-237.
[23] E. Bertino and R. Sandhu. "Database security-concepts, approaches, and challenges," in Proc. IEEE Transactions on Dependable and Secure Computing, 2005, pp. 2-19.
[24] G.E. de Silveira. "Usenix: A Configuration Distribution System for Heterogeneous Networks," in Proc. of the Twelfth Systems Administration Conference (LISA), 1998, pp. 283.
[25] R. Sandhu, D. Ferraiolo, and R. Kuhn. "The NIST model for role-based access control: towards a unified standard," in Proc. ACM workshop on Role-based access control, 2000, pp. 47-63.
[26] A. Ely. "Strategy: Responding to database compromise." Information Week analytics, Rep. S1680810, 2010.
[27] J.D. Meier, A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla and A. Murukan. "Chapter 18: Securing Your Database Server." Internet: http://msdn.microsoft.com/en-us/library/ff648664.aspx, Jun. 2006 [Apr. 20, 2014].
[28] N. Delessy, E.B. Fernandez, M.M. Larrondo-Petrie, and J. Wu. "Patterns for access control in distributed systems," in Proc. of the 14th Conference on Pattern Languages of Programs, 2007, pp. 3.
[29] MongoDB Inc. "MongoDB Architecture Guide." Internet: http://info.mongodb.com/rs/mongodb/images/MongoDB_Architecture_Guide.pdf, Mar. 2014 [Apr. 14, 2014].
[30] 10Gen Corporation. "Sharding and MongoDB Release 2.6.0." Internet: http://docs.mongodb.org/master/MongoDB-shardingguide.pdf, Apr. 2, 2014 [Apr. 20, 2014].
[31] Redis. "Partitioning: how to split data among multiple Redis instances." Internet: http://redis.io/topics/partitioning, 2014 [Apr. 20, 2014].
[32] E. Wolff. "Redis: The Universal NoSQLTool." Internet: http://www.slideshare.net/ewolff/redis-15061825, Nov. 07, 2012 [Apr. 21, 2014].
[33] J.C. Anderson, J. Lehnardt and N. Slater. (2010, Jan.) CouchDB: The Definitive Guide. (1st edition). [On-line]. Available: http://itebooks.info/read/288/ [Apr. 21, 2014].
[34] Couchbase. "Eventual Consistency." Internet: http://docs.couchdb.org/en/latest/intro/consistency.html, 2014 [Apr. 21, 2014].
[35] Couchbase. "CouchDB The definitive guide: Clustering." Internet: http://guide.couchdb.org/draft/clustering.html, 2014 [Apr. 21, 2014].
[36] Cassandra Wiki. "Cassandra Wiki Architecture Overview." Internet: http://wiki.apache.org/cassandra/ArchitectureOverview, 2013 [Apr. 21, 2014].
[37] IBM Corp. "Consider the Apache Cassandra database." Internet: http://www.ibm.com/developerworks/library/os-apache-cassandra/, 2012 [Apr. 21, 2014].
[38] U. Mansoor. "Cassandra Chapter 5: Data Replication Strategies." Internet: http://10kloc.wordpress.com/2012/12/27/cassandra-chapter5-data-replication-strategies, Dec. 12, 2012 [Apr. 22, 2014].
[39] Apache HBase. "The Apache HBase Reference Guide." Internet: http://hbase.apache.org/book.html#arch.overview, Apr. 06, 2014 [Apr. 21, 2014].
[40] V. Karnataki. "HBase-Overview of architecture and data model." Internet: http://netwovenblogs.com/2013/10/10/hbase-overview-ofarchitecture-and-data-model, Oct. 10, 2013 [Apr. 23, 2014].
[41] Couchbase. "Couchbase Server under the Hood - An Architectural Overview." Internet: http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/Couchbase_Server_Architecture_Review.pdf, 2013 [Apr. 24, 2014].
[42] Couchbase. "Dealing with Memcached Challenges - Getting the performance without the gotchas." Internet: http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/Couchbase_Whitepaper_Dealing_with_Memcached_Challenges.pdf, 2013 [Apr. 16, 2014].
[43] Couchbase. "Couchbase Server Features." Internet: http://www.couchbase.com/couchbase-server/features#content1, 2014 [Apr. 12, 2014].
ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015
ISSN : 2322-5157
www.ACSIJ.org
A Survey on Security Issues in Big Data and NoSQL
Ebrahim Sahafizadeh1, Mohammad Ali Nematbakhsh2
1
Computer engineering department, University of Isfahan
Isfahan,81746-73441,Iran
sahafizadeh@eng.ui.ac.ir
2
Computer engineering department, University of Isfahan
Isfahan,81746-73441,Iran
nematbakhsh@eng.ui.ac.ir
Abstract
This paper presents a survey on security and privacy issues in big
data and NoSQL. Due to the high volume, velocity and variety of big
data, security and privacy issues are different in such streaming data
infrastructures with diverse data formats. Therefore, traditional
security models have difficulties in dealing with such large-scale
data. In this paper we present some security issues in big data and
highlight the security and privacy challenges in big data
infrastructures and NoSQL databases.
Keywords: Big Data, NoSQL, Security, Access Control
2.1 Big Data
Big data is a term that refers to collections of large data sets
which are described by what is often referred to as the multiple 'V's.
In [8], seven characteristics are used to describe big data:
volume, velocity, variety, value, veracity, volatility and
complexity; [9], however, does not mention volatility and
complexity. Here we describe each property.
Volume: Volume refers to the size of the data. The size of
data in big data is very large and is usually on the terabyte
and petabyte scale.
Velocity: Velocity refers to the speed at which data is produced
and processed. In big data the rate of data production and
processing is very high.
Variety: Variety refers to the different types of data in big
data. Big data includes structured, unstructured and
semi-structured data, and the data can be in different forms.
Veracity: Veracity refers to the trustworthiness of the data.
Value: Value refers to the worth derived from big data.
Volatility: "Volatility refers to how long the data is going to
be valid and how long it should be stored" [8].
Complexity: "A complex dynamic relationship often exists in
big data. The change of one data might result in the change of
more than one set of data triggering a rippling effect" [8].
Some researchers define the important characteristics of big
data as volume, velocity and variety. In general, the
characteristics of big data are expressed as the three Vs.
1. Introduction
The term big data refers to high-volume, high-velocity and
high-variety information which requires new forms of processing.
Due to these properties, which are sometimes referred to as the
3 'V's, it becomes difficult to process big data using traditional
database management tools [1]. A new challenge is to develop novel
techniques and systems to extensively exploit the large
volume of data. Many information management architectures
have been developed towards this goal [2].
As new technologies develop and the use of big data increases in
several domains, security and privacy have come to be
considered a challenge in big data. There are many security
and privacy issues concerning big data [1, 2, 3, 4, 5 and 6]. In [7]
the top ten security and privacy challenges in big data are
highlighted. Some of these challenges are: secure
computations, secure data storage, granular access control
and data provenance.
2.2 NoSQL
The term NoSQL stands for "Not only SQL" and is used for
modern scalable databases. Scaling is the ability of a system
to increase throughput when the demand for data processing
increases. To support big data processing, platforms
incorporate two forms of scaling: horizontal scaling and
vertical scaling [10].
Horizontal Scaling: in horizontal scaling the workload is
distributed across many servers. In this type of scaling,
multiple systems are added together in order to increase the
throughput.
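As a rough illustration of how a workload can be distributed across servers, the following minimal sketch assigns record keys to servers with a stable hash. The helper names (pick_server, distribute) and the node names are hypothetical and are not taken from any system surveyed here; real NoSQL products typically use more elaborate schemes such as consistent hashing or range partitioning.

```python
import hashlib
from collections import Counter
from typing import Iterable, Sequence


def pick_server(key: str, servers: Sequence[str]) -> str:
    """Map a record key to one server using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and reduce it
    # modulo the number of servers.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def distribute(keys: Iterable[str], servers: Sequence[str]) -> Counter:
    """Count how many keys land on each server."""
    return Counter(pick_server(k, servers) for k in keys)


# Illustrative workload: 10,000 record keys.
keys = [f"user:{i}" for i in range(10_000)]

# Adding servers (scaling out) lowers the number of keys each one handles.
two = distribute(keys, ["node-a", "node-b"])
four = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

print("busiest of 2 servers:", max(two.values()))
print("busiest of 4 servers:", max(four.values()))
```

With simple modulo placement, changing the number of servers remaps most keys, which is one reason production systems usually prefer consistent hashing.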
In this paper we focus on research on access control in big
data and on security issues in NoSQL databases. In section 2
we give an overview of big data and NoSQL technologies, in
section 3 we discuss security challenges in big data and
describe some access control models for big data, and in
section 4 we discuss security challenges in NoSQL databases.
2. Big Data and NoSQL Overview
In this section we give an overview of Big Data and NoSQL.
Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved.
ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015
ISSN : 2322-5157
www.ACSIJ.org
Vertical Scaling: in vertical scaling more processors, more
memory and faster hardware are installed within a single
server.
The main advantages of NoSQL are presented in [11] as
follows: "1) reading and writing data quickly; 2) supporting
mass storage; 3) easy to expand; 4) low cost". In [11], the
data models supported by the studied NoSQL systems are
classified as Key-value, Column-oriented and Document. Many
products claim to be NoSQL databases, such as MongoDB,
CouchDB, Riak, Redis, Voldemort, Cassandra, Hypertable and
HBase.
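To make the three data models more concrete, the sketch below represents the same illustrative record in each shape using plain Python structures. The record contents, the variable names and the find_by_city helper are assumptions made for illustration only; they do not reproduce the API of any of the products listed above.

```python
import json

# Key-value model: the store only understands the key; the value is an
# opaque blob that the application must decode itself.
key_value_store = {
    "user:42": json.dumps({"name": "Alice", "city": "Orlando"}),
}

# Column-oriented model: row key -> column family -> column -> value.
column_store = {
    "42": {
        "profile": {"name": "Alice", "city": "Orlando"},
    },
}

# Document model: self-describing, possibly nested documents whose
# fields can be queried directly.
document_store = [
    {"_id": 42, "name": "Alice", "address": {"city": "Orlando"}},
]


def find_by_city(docs, city):
    """Return the documents whose nested address.city field equals `city`."""
    return [d for d in docs if d.get("address", {}).get("city") == city]


print(json.loads(key_value_store["user:42"])["name"])
print(column_store["42"]["profile"]["city"])
print(find_by_city(document_store, "Orlando"))
```

The key-value store can only be queried by key, whereas the document store can filter on a field inside the document. That difference also affects where access control can be applied.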
Apache Hadoop is an open source implementation of Google's
Bigtable [12] for storing and processing large datasets using
clusters of commodity hardware. Hadoop uses HDFS, a
distributed file system, to store data across clusters. In
section 6 we give an overview of Hadoop and discuss an access
control architecture presented for Hadoop.
[2] using data content. In this case the semantic content of
data plays the major role in access control decision making.
"CBAC makes access control decisions based on the content
similarity between user credentials and data content
dynamically" [2].
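The idea of deciding access from content similarity can be sketched as follows. This is not the algorithm of [2]: the Jaccard similarity over word sets, the 0.2 threshold and the function names (tokenize, jaccard, cbac_allows) are all assumptions chosen only to show the general shape of such a decision.

```python
def tokenize(text):
    """Split text into a set of lower-cased words, dropping basic punctuation."""
    words = (w.strip(".,;:").lower() for w in text.split())
    return {w for w in words if w}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cbac_allows(credential_text, data_text, threshold=0.2):
    """Grant access when credential and data content are similar enough.

    The threshold value is an illustrative assumption, not a value from [2].
    """
    similarity = jaccard(tokenize(credential_text), tokenize(data_text))
    return similarity >= threshold


credential = "cardiology physician heart patients"
related = "Heart surgery notes for cardiology patients."
unrelated = "Quarterly payroll and tax records."

print(cbac_allows(credential, related))
print(cbac_allows(credential, unrelated))
```

Because the decision is computed from the data content at request time, it can change as the data changes, which matches the "dynamically" aspect of the quoted description.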
The attribute relationship methodology is another method for
enforcing security in big data, proposed in [3] and [4].
Protecting valuable information is the main goal of this
methodology. Therefore, [4] focuses on attribute relevance in
big data as a key element for extracting information. In [4],
it is assumed that an attribute with higher relevance is more
important than other attributes. [3] uses a graph to model
attributes and their relationships: attributes are expressed
as nodes, relationships are shown by the edges between nodes,
and the method proceeds by selecting protected attributes
from this graph. The method proposed in [4] is as follows:
"First, all the attributes of the data is extracted and then
generalize the properties. Next, compare the correlation
between attributes and evaluate the relationship. Finally
protect sele...