A paper in IEEE format

Description

I want a report of 15 pages in an IEEE format.

I have attached 9 papers, and I want some information from each paper to be included so that together they develop into a report.


Unformatted Attachment Preview

International Journal of Information Management 37 (2017) 63–74
Journal homepage: www.elsevier.com/locate/ijinfomgt

SecureNoSQL: An approach for secure search of encrypted NoSQL databases in the public cloud

Mohammad Ahmadian*, Frank Plochan, Zak Roessler, Dan C. Marinescu
Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA

Article history: Received 12 April 2016; accepted 17 November 2016; available online 20 December 2016.

Keywords: Search over encrypted data; Database as a service; NoSQL; Encryption; Cloud computing; Security; Query processing; Data integrity

Abstract

While many schemes have been proposed to search encrypted relational databases, less attention has been paid to NoSQL databases. In this paper we report on the design and the implementation of a security scheme called “SecureNoSQL” for searching encrypted cloud NoSQL databases. Our solution is one of the first efforts covering not only data confidentiality, but also the integrity of the datasets residing on a cloud server. In our system a secure proxy carries out the required transformations and the cloud server is not modified. The construction is applicable to all NoSQL data models and, in our experiments, we present its application to a Document-store data model. The contributions of this paper include: (1) a descriptive language based on a subset of JSON notations; (2) a tool to create and parse a security plan consisting of cryptographic modules, data elements, and mappings of cryptographic modules to the data fields; and (3) a query and data validation mechanism based on the security plan.

© 2016 Elsevier Ltd. All rights reserved.

The datasets used in the experiments reported in this paper are available at: https://github.com/MoAhmadian/SecureNoSQL.
* Corresponding author. E-mail addresses: ahmadian@knights.ucf.edu (M. Ahmadian), frank.plochan@knights.ucf.edu (F. Plochan), zak.roessler@knights.ucf.edu (Z. Roessler), dcm@cs.ucf.edu (D.C. Marinescu). URL: http://www.cs.ucf.edu/ ahmadian (M. Ahmadian).
http://dx.doi.org/10.1016/j.ijinfomgt.2016.11.005
0268-4012/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction and motivation

Data analytics, enterprise, and multimedia applications, as well as applications in many areas of science, engineering, and economics, including genomics, structural biology, high energy physics, astronomy, meteorology, and the study of the environment take advantage of cloud computing for processing very large datasets. Companies heavily involved in cloud computing such as Google and Amazon, e-commerce companies such as eBay, and social media networks such as Facebook, Twitter, or LinkedIn discovered early on that traditional relational databases cannot handle the massive amount of data and the real-time demands of online applications critical for their business model. The relational schema is of little use for such applications and conversion to NoSQL databases seems a much better approach.

The name NoSQL given to the storage model discussed in this paper might be misleading. Michael Stonebraker notes that “blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management” (Stonebraker, 2010). The “soft-state” approach in the design of NoSQL databases allows data to be inconsistent and transfers the task of implementing only the subset of the ACID properties required by a specific application to the application developer.
NoSQL systems ensure that data will be “eventually consistent” at some future point in time, instead of enforcing consistency at the time when a transaction is “committed”. Data partitioning among multiple storage servers and data replication are also tenets of the NoSQL philosophy as they increase availability, reduce the response time, and enhance scalability.

Big Data and mobile applications are the two most important growth areas of cloud computing. Big Data growth can be viewed as a three-dimensional phenomenon; it implies an increased volume of data, requires an increased processing speed to produce more results, and at the same time, it involves a diversity of data sources and data types (Marinescu, 2013). A delicate balance between data security and privacy and efficiency of database access is critical for such applications. Many cloud services used by these applications operate under tight latency constraints; moreover, these applications have to deal with extremely high data volumes and are expected to provide reliable services for very large communities of users. Nowadays NoSQL databases are widely supported by cloud service providers. Their advantages over traditional databases are critical for Big Data applications.

Data security and integrity are important factors when choosing a database for cloud applications. They are particularly critical for applications running on public clouds where multiple virtual machines (VMs) often share the same physical platform (Xu, Jiang, Wang, Yuan, & Ren, 2014; Xu, Zhang, Wu, & Shi, 2015; Yu & Wen, 2010). The importance of database security and its impact on a large number of individuals are illustrated by the consequences of two major security breaches (Weiss & Miller, 2015; Silver-Greenberg, Goldstein, & Perlroth, 2014).
In November 2013 approximately 40 million records were stolen from an unencrypted database used by Target stores. The compromised information included personally identifiable information (PII) and credit card data. According to an SEC (Securities and Exchange Commission) report, two months later a cyber-attack on JP Morgan Chase compromised PII records of some 76 million households and 7 million small businesses.

Classic cryptography primitives can protect data while in storage, but plaintext data is vulnerable to insider interference during processing. This is particularly troubling when searching databases containing personal information such as healthcare or financial records (Lin, Tsai, & Lin, 2014), as the entire database is exposed to such attacks. These circumstances motivate us to investigate methods for searching encrypted NoSQL databases. Though general computations with encrypted data are theoretically feasible using the algorithms for Fully Homomorphic Encryption (FHE) (Gentry, 2009), this is by no means a practical solution at this time. Existing algorithms for homomorphic encryption increase the processing time of encrypted data by many orders of magnitude compared with the processing of plaintext data. A recent implementation of FHE (Halevi & Shoup, 2014) requires about six minutes per batch; the processing time for a simple operation on encrypted data dropped to almost one second after improvements (Ducas & Micciancio, 2015). Related areas of research are Learning With Errors (LWE) (Brakerski & Vaikuntanathan, 2011), Lattice-based encryption (Cash, Hofheinz, Kiltz, & Peikert, 2012; Micciancio & Regev, 2009) and Attribute-based Encryption (Gorbunov, Vaikuntanathan, & Wee, 2013). In this paper we restrict our discussion to query processing over encrypted NoSQL databases.
A secure proxy called “SecureNoSQL” for accessing cloud remote servers and applying efficient cryptographic primitives for query, response and data encryption/decryption is introduced. We also designed a descriptive language using JSON (footnote 1) notation which enables its users to generate a security plan. The security plan has four sections which introduce the data elements, the cryptographic modules and the mappings between them. The main contributions of this paper are:

• A JSON-based language for users to: (i) create a security plan for the database; (ii) describe the security parameters; and (iii) assign proper cryptographic primitives to the data elements.
• A multi-key, multi-level security mechanism for policy enforcement. This feature is essential because the encryption key is subject to more frequent changes than the crypto-module. Furthermore, keys are assigned for a single data element, while encryption algorithms could be applied for several data elements with several keys. This separation allows a more efficient enforcement of security policy and of key management.
• An effective validation process for the security plan. This validation process enables users to initially evaluate all requests locally, rather than forwarding large numbers of erroneous key-value pairs to a cloud server. It also limits the cloud server workload and reduces the response time latency.
• Support for a comprehensive, flexible protection. The solution is open-ended; users can add new customized cryptographic modules simply by using the designed descriptive language.
• A balanced system with a security level-proportional overhead; the overhead is proportional to the desired level of security.
• A secure proxy which translates queries to run over encrypted data on the remote cloud server with respect to the semantics of the queries. The cloud database server is not modified and treats encrypted documents in the same way as a plaintext database. Properties of the distributed database such as replication hold for encrypted data.
• Support for cloud data integrity and protection against an insider attack.

Footnote 1: JSON (JavaScript Object Notation) is a lightweight text-based syntax for storing and exchanging data objects consisting of key-value pairs. It is used primarily to transmit data between a server and a web application. JSON’s popularity is due to the fact that it is self-describing and easy to understand by human and machine. For more information, visit: http://www.json.org.

The rest of this paper is organized as follows: related work and NoSQL data models are presented in Section 2. The threat model for a cloud database is discussed in Section 3. The organization of the system is presented in Section 4, and the structure of the security plan and the notation of the descriptive language for generating it are discussed in Section 5. Then the mechanism of query processing is investigated in Section 6. Finally, in Section 8 we report on measurements of the database response time to different types of queries and on the encryption and decryption time for OPE encryptions with output lengths of 64, 128, 256, 512 and 1024 bits.

2. NoSQL databases and related work

A fairly large number of NoSQL database technologies, more than 120 by our count, have been created in recent years. NoSQL databases are non-relational, distributed, horizontally scalable, and schema-free. They are classified based on their data models. Choosing a proper data model has an extremely important influence on the performance and scalability of the data stores. Since our work has a tight connection to NoSQL data models, we provide brief definitions for several data models.

2.1. Key-value store
This simple data model resembles an associative map or a dictionary where a key uniquely identifies the value. The data can be either a primitive data type such as a string, an integer, an array, or it can be an object.
This model is effective for storing distributed data; thus, it is highly scalable and this motivates its use by cloud data management systems. Systems such as Bigtable (Chang et al., 2008), CouchDB (http://couchdb.apache.org), DynamoDB (Sivasubramanian, 2012), MemcacheDB (http://www.Memcached.org) and Redis (http://redis.io) use this model. This model is not suitable for applications demanding relations or structures.

2.2. Column-family store
In this model the data are stored in a column-oriented style. The dataset comprises several rows, and each row is indexed by a unique key, the so-called primary key. Each row is composed of a set of column families, and different rows can have different column families. As in a key-value store, the row key resembles the key, and the set of column families resembles the value represented by the row key. However, each column family further acts as a key for the one or more columns that it holds, whereas each column consists of a key-value pair. Hadoop HBase directly implements the Google Bigtable concepts, whereas Amazon SimpleDB and DynamoDB contain only a set of column name-value pairs in each row, without having column families. Sometimes, SimpleDB and DynamoDB are classified as key-value stores. Typically, the data belonging to a given row are stored together at the same server node. Cassandra provides the additional functionality of super-columns, which are formed by grouping various columns together. Cassandra can store a single row across multiple server nodes using composite partition keys. In column-family stores, the configuration of column families is typically performed during start-up. A column family in different rows can contain different columns. A prior definition of columns is not required and any data type can be stored in this data model.
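The nesting described above (row key, then column family, then column name, then value) can be pictured with a small in-memory sketch. This is only an illustration of the data layout, not how HBase or Cassandra store data; the helper names `make_store`, `put` and `get`, and the sample rows, are hypothetical.

```python
from collections import defaultdict


def make_store():
    """Create an empty toy store: row key -> column family -> column -> value."""
    return defaultdict(lambda: defaultdict(dict))


def put(store, row_key, family, column, value):
    """Write one column value; rows and families are created on demand."""
    store[row_key][family][column] = value


def get(store, row_key, family, column, default=None):
    """Read one column value without creating missing rows or families."""
    row = store.get(row_key)
    if row is None:
        return default
    return row.get(family, {}).get(column, default)


store = make_store()
# Different rows may carry different column families and columns.
put(store, "user:1", "profile", "name", "Alice")
put(store, "user:1", "billing", "balance", 120)
put(store, "user:2", "profile", "email", "bob@example.com")
```

Here `user:2` has no `billing` family at all, which mirrors the statement that a prior definition of columns is not required.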
In general, column-family stores provide more powerful indexing and querying than key-value stores because they are based on column families and columns in addition to row keys. Similar to key-value stores, any logic requiring relations must be implemented in the client application.

2.3. Document store
In this model data are stored inside an internal structure, while in the key-value store the data are opaque to the database. Thus, the database engine applies metadata to create a higher level of granularity and delivers a richer experience for modern programming techniques. Document-oriented databases use a key to locate the document inside the data store. Most document stores use JSON or BSON (Binary JSON). Document stores are suited to applications where the input data can be represented in a document format. A document can contain complex data structures such as nested objects. A document store allows document grouping into collections. A document in a collection should have a unique key. Unlike a relational database management system (RDBMS) (footnote 5), where every row in a table follows the same schema, a document in document stores may have a different structure from other documents. Document stores provide the capability of indexing documents based on the primary key as well as on the contents of the documents. Like key-value stores, they are inefficient in multiple-key transactions involving cross-document operations.

2.4. Graph database
This data model is used to represent complex structures and the highly connected data often encountered in real-world applications. In graph databases, the nodes and edges have individual properties consisting of key-value pairs. Graph databases are a good alternative for social networking applications, pattern recognition, dependency analysis and recommendation systems. Some graph databases such as Neo4J (footnote 6) support ACID (footnote 7) properties.
Graph data stores are not as efficient as other NoSQL data stores and do not scale well horizontally when related nodes are distributed to different servers.

Footnote 5: A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model.
Footnote 6: http://neo4j.com.
Footnote 7: ACID (Atomicity, Consistency, Isolation, Durability) properties guarantee that database transactions are processed reliably.

The first SQL-aware query processing using an encrypted database was CryptDB (Popa, Redfield, Zeldovich, & Balakrishnan, 2011). CryptDB satisfies data confidentiality for an SQL relational database. However, CryptDB cannot perform queries over data encrypted with different keys. One important application of searching encrypted data (Cash et al., 2013, 2014; Cheon, Kim, & Kim, 2016; Song, Wagner, & Perrig, 2000; Tu, Kaashoek, Madden, & Zeldovich, 2013) is in cloud computing where the clients outsource their storage and computation. In Cash et al. (2014) a practical searchable security scheme is introduced which can search on encrypted data sets in sub-linear time complexity by using different types of indices; however, it is not practical on NoSQL data sets which are designed to scale to millions of users doing updates simultaneously (Cattell, 2011).

Order-preserving symmetric encryption (OPE) is a deterministic encryption scheme which maps integers in the range [1, M] into a much larger range [1, N] and preserves numerical ordering of plaintexts (Boldyreva, Chenette, Lee, & O’Neill, 2009; Mavroforakis, Chenette, O’Neill, Kollios, & Canetti, 2015). OPE is attractive because fundamental database operations such as sorting, simple matching (i.e., finding m items in a database), range queries (i.e., finding all m items within a given range), and search operations can be carried out efficiently over encrypted data.
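To make the order-preserving property concrete, the following is a minimal toy sketch, not the construction of Boldyreva et al. and not secure: it picks M distinct values from [1, N] with a seeded pseudo-random generator and sorts them, so plaintext i maps to the i-th smallest value. The function names, the seed used as a "key", and the sample salaries are all assumptions made for illustration.

```python
import bisect
import random


def build_toy_ope_table(key, m, n):
    """Return a strictly increasing list of m ciphertexts drawn from [1, n].

    Toy stand-in for OPE: `random.Random` is NOT cryptographically secure.
    """
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    rng = random.Random(key)
    return sorted(rng.sample(range(1, n + 1), m))


def toy_encrypt(table, x):
    """Map plaintext x in [1, m] to the x-th smallest ciphertext."""
    if not 1 <= x <= len(table):
        raise ValueError("plaintext out of range")
    return table[x - 1]


def toy_decrypt(table, c):
    """Invert toy_encrypt with a binary search over the sorted table."""
    i = bisect.bisect_left(table, c)
    if i == len(table) or table[i] != c:
        raise ValueError("not a valid ciphertext")
    return i + 1


def encrypted_range_query(enc_values, table, low, high):
    """Select ciphertexts in [Enc(low), Enc(high)] without decrypting them."""
    lo_c, hi_c = toy_encrypt(table, low), toy_encrypt(table, high)
    return [c for c in enc_values if lo_c <= c <= hi_c]


TABLE = build_toy_ope_table(key=2017, m=1000, n=2**32)
salaries = [120, 45, 990, 300, 45]
enc_salaries = [toy_encrypt(TABLE, s) for s in salaries]
hits = encrypted_range_query(enc_salaries, TABLE, 100, 500)
```

Because the mapping is strictly increasing and deterministic, the server-side filter in `encrypted_range_query` compares ciphertexts only, which is the reason sorting and range queries remain possible on encrypted values.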
Moreover, OPE allows query processing to be done as efficiently as for unencrypted data; the database server can locate the desired encrypted data in logarithmic time via standard tree-based indexing data structures. An investigation of OPE security against a known plaintext attack with known N plaintexts is reported in Xiao and Yen (2012) and Kerschbaum (2015); the last paper concluded that the ideal OPE module accomplishes one-wayness security (footnote 8). The Shannon entropy (footnote 9) achieved by an ideal OPE is maximal when the mapping of integers in the range [1, M] to a much larger range [1, N] results in a uniform distribution. The risk of disclosure caused by main memory attack is quantified by Canim, Kantarcioglu, Hore, and Mehrotra (2010) and Bajaj and Sion (2014). An application of OPE in a cloud environment is reported in Ahmadian, Paya, and Marinescu (2014) and Ahmadian (2017). Also, the application of classical cryptography to a relational database system for embedded devices was studied in Ahmadian, Khodabandehloo, and Marinescu (2015).

NoSQL databases suffer from a lack of proper data protection mechanisms because these databases have been designed to support high performance and scalability requirements. In order to protect personal and sensitive information, a privacy and security preserving mechanism is required in Big Data platforms. Integration of privacy aware access control features into existing Big Data systems is discussed in Colombo and Ferrari (2015), Liang, Susilo, and Liu (2015), and Islam and Islam (2014). In Gantz and Reinsel (2012) and Tankard (2012) the evolution of Big Data Systems from the perspective of an information security application is studied. The proxy is a very important element in the designed structure and, from an Information Technology perspective, its protection deserves special consideration. A cloud based monitoring and threat detection system was proposed by Cheon et al. (2016) and Chow et al. (2009) for critical components to make infrastructure systems secure.

Footnote 8: One-way functions are easy to compute, but computationally hard to invert.
Footnote 9: The entropy measures the degree of uncertainty; the Shannon entropy of a discrete random variable X with n realizations x_1, x_2, ..., x_n with probabilities p_1, p_2, ..., p_n, respectively, is: $H(X) = -\sum_{i=1}^{n} p_i \log p_i$.

3. A cloud computing threat model

A threat model describes the threats against a system. The threat model of cloud computing can be analyzed from multiple viewpoints and we investigate it from an adversarial perspective. The adversarial threat model for the Database as a Service (DBaaS) is a holistic process based on end-to-end security. The model identifies two classes of threats: external and internal attackers.

3.1. External attacker
An attacker from outside the cloud environment might obtain unauthorized access to the hosted databases by applying techniques or tools to monitor the communication between the clients and the cloud servers. External attackers have to bypass firewalls, intrusion detection systems and other defensive tools without any authorization.

3.2. Malicious insiders
Insider attackers have different levels of access to cloud resources. Unauthorized access by malicious insiders who can bypass most or all data protection mechanisms is a major source of concern for cloud users. Encrypted data and a secure proxy construction such as SecureNoSQL guarantee that malicious insiders cannot access user data. The proxy encrypts/decrypts data and query/response between clients and cloud. There is still the residual risk of information leakage from encrypted datasets. A malicious insider could exploit the leaked information to organize more extensive attacks and amplify the information leakage.

4. System organization

This section introduces a framework to incorporate data confidentiality and information leakage prevention algorithms. SecureNoSQL leverages secure query processing for web and mobile applications using DBaaS. Two different system organizations can address our design objectives. The first is suitable when all database users belong to the same organization. Then the proxy runs on a trusted server behind a firewall and the communication between clients and the proxy is secure. When the clients access the cloud using the Internet the second organization is advisable. In this case, either the client software includes a copy of the proxy and only encrypted data is transmitted over public communication lines, or the Secure Sockets Layer (SSL) protocol is used to establish a secure connection to the proxy. Fig. 1 illustrates the high-level architecture of SecureNoSQL as a secure proxy between the user’s applications and the cloud NoSQL database server.

Fig. 1. The organization of the SecureNoSQL.

The system we report on was designed with several objectives in mind:

• Support multi-user access to an encrypted NoSQL database. Enforce confidentiality, privacy of transactions and data integrity.
• Hide from the end-users the complexity of the security mechanisms; the database access should be transparent and the user’s access should be the same as for an unencrypted database.
• Avoid transmission of unencrypted data over public communication lines.
• Do not require any modification of the NoSQL database management system.
• Create an open-ended system; allow the inclusion of cryptographic modules best suited for an application.

These objectives led us to design a system where a proxy mediates the client access to the cloud remote server running an unmodified NoSQL database processing system. In this system the processing of a query involves three phases:

1. Client-side query encoding in JSON format carried out by the client software;
2. Query encryption and decryption done by a trusted proxy; and
3. Server-side query processing performed by an unmodified NoSQL database server.

SecureNoSQL is based on general principles of NoSQL database products. We introduce a new concept, the security plan, materialized as a JSON description of data elements, metadata and parameter configuration of cryptosystems. A descriptive language is introduced to generate and parse the security plan automatically. JSON, a dominant format in NoSQL databases, is selected to express the designed security plan. We used a subset of JSON notation readable by human and machine. Document databases, such as MongoDB, store documents inside the collection by JSON representation in a similar way as tables and records in relational database systems. A query and the corresponding response are also represented in the JSON format; therefore, the governing format in document databases is JSON. BSON, a binary extension of JSON, is used by document-oriented databases for efficient encoding/decoding. The JSON query model is a functional, declarative notation, designed especially for working with large volumes of structured, semistructured and unstructured JSON documents. The data owner develops the security plan that maps each chosen crypto-primitive, with its specific parameters, to a particular data element.

5. Descriptive language for security plan

The NoSQL database benefits from a flexible schema that allows a different number of attributes for the documents corresponding to the same object. On the other hand, a full list of attributes is required to create a comprehensive protection for all data elements in the database. Therefore, we define a logical operator denoted as Super Document, the union of all attributes from different versions of the documents related to the same object. Each database D consists of a set of an arbitrary number n of documents:

$D = \{d_1, \ldots, d_n\}$

Furthermore, each document comprises an arbitrary number m of attributes, where each attribute is a key-value pair $\langle k, v \rangle$:

$d_i = \{A_1, \ldots, A_m\}, \quad 1 \le i \le n$

In other words, a Super Document in the scope of a collection (database) is an aggregation of attributes representing a specific entity. Thus for any given document $d_i$ it is required to look for n − 1 documents to extract attributes that are not members of $d_i$ (relative complement). This concept is rephrased in Eq. (1). In addition, a match function $\operatorname{match}(d_i, d_j)$ determines whether two given documents $d_i$, $d_j$ are desirable for merging or not. Two documents can be combined if they share the same attribute from an identifying type. The Super Document $\mathrm{SD}_{i,j}$ is defined as:

$\mathrm{SD}_{i,j} = \{\, d_i, d_j \mid \exists A_p \in d_i \wedge \exists A_q \in d_j \,\}, \quad \text{if } \operatorname{match}(d_i, d_j) = \text{True} \;\Rightarrow\; \mathrm{SD}_{i,j} = d_i \cup d_j \qquad (1)$

The function $\operatorname{match}$ is defined as:

$\operatorname{match}(d_i, d_j) = \begin{cases} \text{True} & \text{iff } \exists A_p \in d_i \wedge \exists A_q \in d_j \mid (A_p.\mathit{key} = A_q.\mathit{key}) \wedge (A_p.\mathit{value} = A_q.\mathit{value}) \\ \text{False} & \text{otherwise} \end{cases}$

provided that $A_p$ and $A_q$ are identifier attributes.

5.1. Database security plan
The security plan identifies the mechanism to maintain the security of the data elements in a database. It also determines how to interpret queries issued by a specific application. The security plan has four sections, see Fig. 2, describing the security rules for the data elements and for meta-data such as the field-name (Key) and the collection name. These sections are the building blocks of the security plan showing how the rules are enforced. The sections and their roles are:

1. Collection: includes the name of a collection and a reference to the encryption module used to encrypt the name of the collection and the names of the fields (metadata).
2. Cryptographic modules: lists the cryptographic modules for encrypting the fields of the database entries in the query.
3. Data elements: lists the properties of each data field including the data type; the data type determines the cryptographic modules to be applied to each field.
4. Mapping cryptographic modules to the fields: assigns the cryptographic modules to data fields; the proxy uses this information to encrypt and decrypt the data elements.

Fig. 2. The high level structure of the security plan.

5.1.1. Collection
A collection is defined as a group of NoSQL documents, the equivalent of a relational database table, see Fig. 3. The name of the collection must be encrypted. The listing in Fig. 3b illustrates how to secure a sample collection using the description language. The key-value pairs (KVP) are the primary data model for a NoSQL database. The key is used as an index to access the associated value of the data pointed to by the reference ref. The initialization vector (IV) is a fixed-size, random input to the cryptographic module encryption. Additionally, a collection exists within a single database. Documents within a collection can have different fields. Typically, all documents in a collection are related with one another.

5.1.2. Cryptographic modules
The choice of a particular cryptosystem depends on the security policy of the application. Multiple criteria for algorithm selection include: (i) the security against theoretical attacks; (ii) the cost of implementation; (iii) the performance; and (iv) whether the encryption and decryption can be parallelized. Other factors involved in the selection of an algorithm are the memory requirements and the integration in the overall system design. The Cryptographic modules section introduces all encryption modules and their parameters such as key, key-size, initialization vector and output-size. The structure of this section is shown in Fig. 4a, complemented by the listing in Fig. 4b presenting the second section of the security plan for the previous example. Our proof of concept uses the parametric Order Preserving Encryption (OPE) and the Advanced Encryption Standard (AES) modules.
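The four-section layout can be pictured with a small sketch. The key names below (`collection`, `cryptographic_modules`, `data_elements`, `mapping`), the collection name `customers`, the `key_ref`/`iv_ref` placeholders and the module choices for every field other than `id` and `name` are assumptions made for illustration; they are not the paper's actual notation, which is given in its figures. The sketch is written as a Python dictionary, which serializes directly to JSON, plus a consistency check that the sections refer to each other correctly.

```python
import json

# Hypothetical security plan with the four sections described in the text.
SECURITY_PLAN = {
    "collection": {"name": "customers", "module": "aes_det_128"},
    "cryptographic_modules": {
        # Keys and IVs are referenced, not embedded (key separation).
        "aes_det_128": {"type": "AES-DET", "key_size": 128,
                        "key_ref": "k1", "iv_ref": "iv1"},
        "ope_128": {"type": "OPE", "output_size": 128, "key_ref": "k2"},
    },
    "data_elements": {
        "id": {"type": "integer", "required": True},
        "name": {"type": "string", "required": True},
        "salary": {"type": "integer", "required": True},
        "balance": {"type": "integer", "required": False},
        "ccn": {"type": "string", "required": False},
        "ssn": {"type": "string", "required": False},
        "email": {"type": "string", "required": True},
    },
    "mapping": {
        "id": "ope_128", "name": "aes_det_128",
        # Assumed assignments: numeric fields to OPE, text fields to AES-DET.
        "salary": "ope_128", "balance": "ope_128",
        "ccn": "aes_det_128", "ssn": "aes_det_128", "email": "aes_det_128",
    },
}


def check_plan(plan):
    """Return a list of inconsistencies between the four sections."""
    problems = []
    modules = plan["cryptographic_modules"]
    elements = plan["data_elements"]
    if plan["collection"]["module"] not in modules:
        problems.append("collection refers to an undefined module")
    for field in elements:
        if field not in plan["mapping"]:
            problems.append(f"no module mapped to field: {field}")
    for field, module in plan["mapping"].items():
        if field not in elements:
            problems.append(f"mapping for undeclared field: {field}")
        if module not in modules:
            problems.append(f"undefined module for field: {field}")
    return problems


plan_as_json = json.dumps(SECURITY_PLAN, indent=2)
```

Keeping keys behind references such as `key_ref` means a key can be rotated without touching the module definitions or the field mapping.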
The system is open-ended; users can add the cryptosystems best suited to the security requirements of their application. In our design the definitions of the cryptographic modules and of the pairs, encryption key and initialization value, are separated following the so-called key separation principle (Galiegue & Zyp, 2013). This security practice is based on the observation that users have long- and short-term security policies. The cryptographic modules are less likely to change while the key and the initialization value change frequently.

Fig. 3. The structure of a collection: (a) The chart outlines the structure of a collection containing the name of the collection and the names of all attributes, which are considered meta-data and should be protected with a proper cryptographic module. (b) The description of a collection and security parameters in the designed JSON based language. In this specific case the Advanced Encryption Standard in deterministic (AES-DET) mode with a 128-bit key and an initialization vector (IV) is assigned to encrypt the name of the collection and the field names.

Fig. 4. The structure and function of Cryptographic modules: (a) The Security Plan with the second section, the cryptographic module, expanded. The attributes included for each module are: name, type, key size, key, input and output size. (b) The OPE encryption including the cryptosystems and their attributes. The proxy applies these modules using the key-value pairs (KVP).

5.1.3. The data elements
The third section of the security plan covers the data elements and their properties. Fig. 5 presents the structure and description of the Data element section of the security plan. The listing displayed in Fig. 5b shows the data elements and their JSON description for the previous example. To ensure the desired level of security, the security plan should provide the description of all sensitive data elements of the database in its third section.

Fig. 5. Structure and description of Data element: (a) The chart outlines the structure of Data elements, containing attributes such as name, type and value, and then introduces security parameters for each data element. (b) The data element section of a sample database represented in the designed notation. A data item has 7 fields: id, name, salary, balance, ccn, ssn, and email. The id, name, email, and salary are required fields.

5.1.4. Mapping cryptographic modules to the fields
The last section of the security plan specifies the cryptographic modules for all sensitive data fields. Fig. 6a and the listing presented in Fig. 6b show the mapping of the cryptographic modules and the corresponding JSON format for a sample application.

Fig. 6. The structure and description of Mapping cryptographic modules to the Data element: (a) Security plan with the fourth section expanded. This section establishes a correspondence between the data fields and the cryptographic modules used to encrypt and decrypt the data fields. (b) The mapping section of the schema for a sample database with 7 fields. For example, the id and the name will be encrypted with OPE 128 bit and AES-DET, respectively.

The method presented in this paper can be easily extended to the other NoSQL data models discussed in Section 2. Fig. 7 shows how this extension from the key-value pair to the document store model can be carried out.

5.2. Query and data validation
The proxy validates the data and the query, as JSON-formatted input, against the reference security plan. Then the proxy enforces the crypto-primitives and generates a new query following the NoSQL query semantics.
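The validate-then-rewrite step can be sketched as follows. This is an illustration under stated assumptions, not the proxy's implementation: the field types are guesses based on the seven sample fields, `det_token` is a one-way HMAC-based stand-in that only mimics the determinism of AES-DET (it cannot be decrypted), and `toy_ope` is an insecure monotone map standing in for OPE. All function names are hypothetical.

```python
import hashlib
import hmac

# Assumed description of the seven sample fields: (Python type, required?).
DATA_ELEMENTS = {
    "id": (int, True), "name": (str, True), "salary": (int, True),
    "balance": (int, False), "ccn": (str, False), "ssn": (str, False),
    "email": (str, True),
}


def validate(doc):
    """Check a JSON-like document locally, before anything is sent out."""
    errors = []
    for field, (ftype, required) in DATA_ELEMENTS.items():
        if field not in doc:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(doc[field], ftype):
            errors.append(f"wrong type for field: {field}")
    for field in doc:
        if field not in DATA_ELEMENTS:
            errors.append(f"unknown field: {field}")
    return errors


def det_token(key, text):
    """Deterministic stand-in for AES-DET (one-way, for illustration only)."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def toy_ope(x):
    """Insecure strictly increasing map standing in for an OPE module."""
    return 7919 * x + 13


def rewrite(doc, key):
    """Validate, then replace field names and values by protected forms."""
    errors = validate(doc)
    if errors:
        raise ValueError("; ".join(errors))
    protected = {}
    for field, value in doc.items():
        enc_value = toy_ope(value) if isinstance(value, int) else det_token(key, value)
        protected[det_token(key, field)] = enc_value
    return protected


KEY = b"demo-key-not-for-production"
record = {"id": 7, "name": "Alice", "salary": 5200, "email": "alice@example.com"}
protected_record = rewrite(record, KEY)
```

Rejecting a malformed document in `validate` happens entirely on the client or proxy side, which matches the stated goal of not forwarding erroneous key-value pairs to the cloud server.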
During this process the proxy applies the assigned cryptographic module to each field. Finally, the proxy forwards the newly encrypted query/data to the NoSQL database server. Fig. 8 depicts the schema validation process. For illustration, consider the listing in Fig. 9a as input data; running the validation process generates the output shown in Fig. 9b. The output of the validation process is a single file that contains descriptive information for the data and meta-data in the designed format and is ready to execute on SecureNoSQL. As noted earlier, the schema reflects the desired security level expressed by the security plan for the database. Table 1 shows the overhead for several parameters and crypto-primitives.

Fig. 7. SecureNoSQL applied to: (a) The key-value data model; Key_1, ..., Key_n are all encrypted using the cryptographic module z, while the corresponding values, Value_1, ..., Value_n, are encrypted with cryptographic modules 1, 2, ..., n, respectively. (b) The document store data model; the meta-data, such as the collection name, is encrypted, as are the attributes, with the assigned cryptographic modules.

Fig. 8. The validation process of input data against the security plan on the client side.

Fig. 9. The security plan for the sample input: (a) The data element section of the sample security plan. (b) The output of the JSON data validation for the sample database.

Table 1. The overhead of encryption for several encryption schemes.

Database     Plain   OPE64   OPE128   OPE256   OPE512
Size (MB)    170     430     508      662      1000

6. Processing queries on encrypted data

According to the proposed scheme, in order to process queries over encrypted data, the queries must be translated into their encrypted version with respect to the security plan; this task is carried out in the proxy. The security plan discussed in Section 5 supplies the parameters of the cryptographic modules to be applied to the data elements involved in the query. Fig. 10 displays the processing and rewriting of a sample query. To illustrate query encryption, Table 2 lists some sample encrypted queries after enforcing the security plan. As can be seen, data elements and immediate values are encrypted; however, the output is consistent with NoSQL semantics.

Fig. 10. The query db.customers.find({salary:{$gt:5000}, balance:{$lt:2000}}) received from an application. (a) The parsing tree of the query. (b) The cryptographic modules applied to the data elements according to the schema definition.

Table 2. Five sample queries and their corresponding encrypted versions.

7. Integrity of data/query/response

Integrity and confidentiality are two critical components of data security. Integrity refers to the consistency of the outsourced data. The proposed integrity verification algorithm in SecureNoSQL guarantees the integrity of data/queries (see Algorithm 1 and Fig. 11). The data owner first applies the encryption scheme to the documents, and then calculates a Hashed Message Authentication Code (HMAC) for each of the encrypted documents. The hash value of any given document has a fixed length of 512 bits; the data owner concatenates a unique document identifier (ID) with the hash value and stores the result in an efficient structure such as a hash table, which has constant lookup time, O(1). Next, the data owner transfers the encrypted dataset to the cloud and sends the hash table containing the hash values to the proxy. Once the proxy receives the query response from the server, it initiates the verification process to check the authenticity of the documents by recalculating the hash values. This process is illustrated in Fig. 11.

Algorithm 1. Proxy Document Integrity Verification Algorithm

In this configuration the data owners trust only the proxy (SecureNoSQL); the cloud servers are not considered trustworthy. Thus, as a result of data integrity verification, all active attacks carried out by an internal or external attacker will be detected by the proposed approach. The message authentication code (MAC) is created using the keyed Hash Message Authentication Code (HMAC), as rephrased in Eq. (2).

$$\mathrm{HMAC}(K, \mathit{document}) = H\big((K \oplus \mathit{okeyPad}) \,\|\, H\big((K \oplus \mathit{ikeyPad}) \,\|\, \mathit{document}\big)\big) \qquad (2)$$

where $H$ represents the hash function, $\oplus$ is the XOR operator, $\|$ denotes concatenation, okeyPad is the one-block-long outer key pad, and ikeyPad is the one-block-long inner key pad.

Algorithm 2 presents the pseudo-code of the HMAC function for a block size of 64 bytes. The computed hash values, together with the corresponding document's unique identifier, can be stored as key-value pairs in a hash table, thus allowing the proxy to carry out the lookup in constant time during the verification process.

Algorithm 2. Keyed Hash Message Authentication Code (HMAC) generation

Fig. 11. (1) The data owner transfers the encrypted database to the cloud server. (2) The data owner sends the hash database to the proxy. (3) Clients send plain queries to the proxy. (4) The proxy translates the queries to their encrypted version and forwards them to the cloud server. (5) The cloud server returns the query response set. (6) The proxy runs a hash verification process on the query response set and then, based on the result, either forwards the decrypted response or reports an integrity violation to the client.

8. Results and discussion

The response time of a query to an encrypted NoSQL database has several components:
the time to encode the query in JSON format; the time to encrypt and decrypt the data; the communication time to/from the server; and the database response time.

For our experiments we first created a sample database with one million records and then determined the overhead of searching an encrypted database. To do so, we measured the database response time for queries when the records were unencrypted and when they were encrypted. Then we measured the encryption and decryption time for different sizes of the ciphertext. We wanted to isolate the individual components of the response time, which is dominated by the communication time.

The test environment was set up on the Linux operating system. We chose MongoDB 3.0.2 (Dede, Govindaraju, Gunter, Canon, & Ramakrishnan, 2013), which is classified as a NoSQL document store database. A random data generator tool written in JS, PHP, and MySQL (Keen, 2016) was used to generate a one-million-record plaintext data set. Each record had seven data fields, including name, email, and salary, as shown in the listing in Fig. 9b. We applied OPE with 64, 128, 256 and 512 bits to the numeric data and AES-DET with 128 bits to the string data of the plaintext data set, generating four encrypted data sets of one million records each. Finally, we uploaded the five data sets and created five MongoDB databases, one with the unencrypted data and four with the encrypted data. Once the MongoDB databases were created, we ran several types of queries, including equality, greater than, less than, greater than or equal to, less than or equal to, and logical OR operations.

The experiments to measure the query time must be carefully designed. To obtain the average query processing time, each experiment has to be carried out repeatedly. We noticed a significant reduction of the database management response time after the first execution of a query, a sign that MongoDB is optimized and caches the results of the most recent queries.
A solution is to disable the cache or, if this is not feasible, to clear the cache before repeating the query. Another important observation is that modern processors have a 64-bit architecture and are optimized for operations on 64-bit integers. For three of the five types of queries, Q2 (Range query), Q3 (equality), and Q4 (logical), the database response time is slightly shorter for the encrypted database than for the unencrypted one when the keys are 32-bit integers. The most plausible explanation is related to cache management.

Fig. 12. The query processing time in milliseconds (ms) for the unencrypted database and for the encrypted databases when the 32-bit keys are encrypted as 64, 128, 256 and 512-bit integers.

Table 3. The query processing time in milliseconds (ms) for the plaintext and for the ciphertext. 32-bit plaintext integers are encrypted as 64, 128, 256 and 512-bit integers. The record count gives the number of records retrieved by each one of the five types of queries, Q1–Q5.

Query type        Matching record(s)   32-bit plaintext   64-bit ciphertext   128-bit ciphertext   256-bit ciphertext   512-bit ciphertext
Q1: Comparison    461,688              340                310                 355                  370                  380
Q2: Equality      1                    340                380                 390                  400                  410
Q3: Range         991,225              370                350                 360                  380                  400
Q4: Logical       551,380              500                540                 550                  555                  560
Q5: Aggregation   1                    600                660                 670                  680                  690

The results reported in Table 3 and in Fig. 12 show the database response time for the five MongoDB experiments. Each query was carried out 100 times with the query cache disabled, and the average query response time in milliseconds was calculated. We also measured the encryption and decryption times; the results are reported in Fig. 13. The measurement process was automated and ran under the control of a script that generated the data and reported the processing time.
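The averaging procedure described here (repeat each query, keep cache handling out of the timed region, report the mean in milliseconds) can be sketched as a small Python harness. This is not the authors' measurement script: the function name `average_query_time_ms`, its parameters, and the stand-in workload are assumptions made for illustration, and a real run would pass a callable that issues the database query.

```python
import time
from statistics import mean


def average_query_time_ms(run_query, repetitions=100, before_each=None):
    """Average wall-clock time of run_query() in milliseconds.

    run_query   -- zero-argument callable that executes one query
    before_each -- optional callable run before every repetition and kept
                   outside the timed region, e.g. to clear a query cache
    """
    samples = []
    for _ in range(repetitions):
        if before_each is not None:
            before_each()
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


if __name__ == "__main__":
    # Stand-in workload; a real measurement would issue a database query here.
    data = list(range(100_000))
    avg = average_query_time_ms(lambda: [x for x in data if x > 5000])
    print(f"average over 100 runs: {avg:.3f} ms")
```

Keeping the `before_each` hook outside the timed region reflects the observation about caching: the cost of clearing a cache should not be counted as part of the query time.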
Our measurements show that the response time of the NoSQL database management system to encrypted data depends on the type of the query. The shortest and longest database response times occur for Q1 (comparison) and Q5 (aggregated queries), respectively; for these two extremes the time for the unencrypted database was almost double, but the time for encrypted databases increases only by 70–80%. As expected, the query processing time for a given type of query increases, but only slightly, by less than 5%, when the key length increases from 64 to 128, 256, and 512 bits. The OPE encryption time increases significantly with the size of the encryption space; it increases almost tenfold when the size of the encrypted output increases from 64 bits to 1024 bits, and it is about 10 ms for 256 bits. The decryption time is considerably smaller; it increases only slightly, from 0.11 ms to 0.17 ms, when the size of the encrypted key increases from 64 bits to 1024 bits.

Fig. 13. Execution time of the OPE module when the key is encrypted as 64, 128, 256, 512, and 1024 bit.

The secure proxy is an important element of the proposed architecture; therefore, the potential attacks that could affect the proxy should also be taken into consideration. In general, the two major possible attacks on the proxy are Denial of Service (DoS) and unauthorized access. In a DoS attack, the attacker sends so much network traffic to the proxy that the system cannot process it within the expected time frame. Successful DoS attacks can turn the proxy into a bottleneck of the system. In unauthorized access attacks, attackers use a proxy to mask their connections while attacking different targets. Several solutions exist for improving the security of the proxy against DoS attacks and reducing their impact, including blocking the undesired packets or using multiple proxies with load balancers.
Moreover, to prevent unauthorized access attacks, access to the proxy must be protected by a well-fitted authorization scheme. User authentication based on group membership, with different authorizations per group, is the most practical solution.

9. Conclusions and future work

Though the OPE encryption scheme has known security vulnerabilities, it can be very useful for NoSQL database query processing for the data models discussed in Section 2. While the key is encrypted using OPE, the other fields of a record can be encrypted using strong encryption, thus reducing the vulnerability of the data to attacks. Strong encryption of the value fields could increase the encryption time but will have little effect on the decryption time.

An important observation is that increasing the size of the codomain of the OPE mapping function from 2^64 to 2^128, 2^256, and 2^512 results in an increase of the database response time of up to 5%, except for Q3-type queries, for which the increase is significant. The penalty for using encrypted rather than unencrypted NoSQL databases such as MongoDB is less than 5% for Q2, Q4, and Q5, which is considered relatively small. Moreover, the overall query response time is dominated by the communication time.

The secure proxy is a critical component of the system. The proxy is multi-threaded and its cache management is non-trivial. The management of the security attributes is rather involved. On the other hand, a proxy integrated in the client-side software can be lightweight and considerably simpler. We are currently implementing the two versions of the proxy. Experimental results for multiple large datasets with up to one million documents show that SecureNoSQL is rather efficient. Our approach can be extended to a multi-proxy structure for Big Data applications.
We are now implementing a sophisticated mechanism, based on the PAXOS algorithm (Lamport, 2001; Marinescu, 2013), for maintaining the consistency of the hash value database across the proxies.

Acknowledgment

We thank Victor Shoup from New York University for the NTL C++ library, which we used to manipulate arbitrary-length integers.

References

Ahmadian, M. (2017). Secure query processing in cloud NoSQL. In 2017 IEEE international conference on consumer electronics (ICCE). Ahmadian, M., Khodabandehloo, J., & Marinescu, D. (2015). A security scheme for geographic information databases in location based systems. In IEEE SoutheastCon (pp. 1–7). http://dx.doi.org/10.1109/SECON.2015.7132941 Ahmadian, M., Paya, A., & Marinescu, D. (2014). Security of applications involving multiple organizations and order preserving encryption in hybrid cloud environments. In IEEE international conf. on parallel distributed processing symposium workshops (IPDPSW) (pp. 894–903). http://dx.doi.org/10.1109/IPDPSW.2014.102 Bajaj, S., & Sion, R. (2014). TrustedDB: A trusted hardware-based database with privacy and data confidentiality. IEEE Transactions on Knowledge and Data Engineering, 26, 752–765. Boldyreva, A., Chenette, N., Lee, Y., & O’Neill, A. (2009). Order-preserving symmetric encryption. In Annual international conference on the theory and applications of cryptographic techniques (pp. 224–241). Springer. Brakerski, Z., & Vaikuntanathan, V. (2011). Fully homomorphic encryption from ring-LWE and security for key dependent messages. In Annual cryptology conference (pp. 505–524). Springer. Canim, M., Kantarcioglu, M., Hore, B., & Mehrotra, S. (2010). Building disclosure risk aware query optimizers for relational databases. Proceedings of the VLDB Endowment, 3, 13–24. Cash, D., Hofheinz, D., Kiltz, E., & Peikert, C. (2012). Bonsai trees, or how to delegate a lattice basis. Journal of Cryptology, 25, 601–639. Cash, D., Jaeger, J., Jarecki, S., Jutla, C. S., Krawczyk, H., Rosu, M.-C., et al.
(2014). Dynamic searchable encryption in very-large databases: Data structures and implementation. IACR Cryptology ePrint Archive, 2014, 853. Cash, D., Jarecki, S., Jutla, C., Krawczyk, H., Roşu, M.-C., & Steiner, M. (2013). Highly-scalable searchable symmetric encryption with support for Boolean queries. In Advances in cryptology—CRYPTO 2013 (pp. 353–373). Springer. Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39, 12–27. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26, 4. Cheon, J. H., Kim, M., & Kim, M. (2016). Optimized search-and-compute circuits and their application to query evaluation on encrypted data. IEEE Transactions on Information Forensics and Security, 11, 188–199. Chow, R., Golle, P., Jakobsson, M., Shi, E., Staddon, J., Masuoka, R., et al. (2009). Controlling data in the cloud: Outsourcing computation without outsourcing control. In Proceedings of the 2009 ACM workshop on cloud computing security (pp. 85–90). ACM. Colombo, P., & Ferrari, E. (2015). Privacy aware access control for big data: A research roadmap. Big Data Research, 2, 145–154. Dede, E., Govindaraju, M., Gunter, D., Canon, R. S., & Ramakrishnan, L. (2013). Performance evaluation of a MongoDB and hadoop platform for scientific data analysis. In Proceedings of the 4th ACM workshop on scientific cloud computing (pp. 13–20). ACM. Ducas, L., & Micciancio, D. (2015). FHEW: Bootstrapping homomorphic encryption in less than a second. In Annual international conference on the theory and applications of cryptographic techniques (pp. 617–640). Springer. Galiegue, F., & Zyp, K. (2013). JSON schema: Core definitions and terminology. Internet Engineering Task Force (IETF). Gantz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. 
IDC iView: IDC Analyze the future, 2007, 1–16. Gentry, C. (2009). A fully homomorphic encryption scheme (Ph.D. thesis). Stanford University. Gorbunov, S., Vaikuntanathan, V., & Wee, H. (2013). Attribute-based encryption for circuits. In Proc. of the forty-fifth annual ACM symposium on theory of computing (pp. 545–554). http://dx.doi.org/10.1145/2488608.2488677 Halevi, S., & Shoup, V. (2014). Algorithms in HElib. In International cryptology conference (pp. 554–571). Springer. Islam, M., & Islam, M. (2014). An approach to provide security to unstructured big data. In 8th international conf. on software, knowledge information management and applications (SKIMA) (pp. 1–5). http://dx.doi.org/10.1109/SKIMA.2014.7083392 Keen, B. (2016). benkeen/generatedata. https://github.com/benkeen/generatedata Kerschbaum, F. (2015). Frequency-hiding order-preserving encryption. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security (pp. 656–667). ACM. Lamport, L. (2001). Paxos made simple. ACM Sigact News, 32, 18–25. Liang, K., Susilo, W., & Liu, J. (2015). Privacy-preserving ciphertext multi-sharing control for big data storage. IEEE Transactions on Information Forensics and Security, 10, 1578–1589. http://dx.doi.org/10.1109/TIFS.2015.2419186 Lin, C.-H., Tsai, S.-H., & Lin, Y.-P. (2014). Secure transmission using MIMO precoding. IEEE Transactions on Information Forensics and Security, 9, 801–813. http://dx.doi.org/10.1109/TIFS.2014.2309211 Marinescu, D. C. (2013). Cloud computing: Theory and practice. Newnes. Mavroforakis, C., Chenette, N., O’Neill, A., Kollios, G., & Canetti, R. (2015). Modular order-preserving encryption, revisited. In Proc. of the 2015 ACM SIGMOD international conf. on management of data (pp. 763–777). http://dx.doi.org/10.1145/2723372.2749455 Micciancio, D., & Regev, O. (2009). Lattice-based cryptography. In Post-quantum cryptography (pp. 147–191). Springer. Popa, R. A., Redfield, C. M. S., Zeldovich, N., & Balakrishnan, H.
(2011). CryptDB: Protecting confidentiality with encrypted query processing. In Proc. of the twenty-third ACM symposium on operating systems principles (pp. 85–100). http://dx.doi.org/10.1145/2043556.2043566 Silver-Greenberg, J., Goldstein, M., & Perlroth, N. (2014). JPMorgan chase hack affects 76 million households. New York Times, 2. Sivasubramanian, S. (2012). Amazon dynamoDB: A seamlessly scalable non-relational database service. In Proc. of ACM SIGMOD international conf. on management of data (pp. 729–730). http://dx.doi.org/10.1145/2213836.2213945 Song, D. X., Wagner, D., & Perrig, A. (2000). Practical techniques for searches on encrypted data. In Proceedings of the 2000 IEEE symposium on security and privacy (S&P 2000) (pp. 44–55). IEEE. Stonebraker, M. (2010). SQL databases v. NoSQL databases. Communications of the ACM, 53, 10–11. http://dx.doi.org/10.1145/1721654.1721659 Tankard, C. (2012). Big data security. Network Security, 2012, 5–8. Tu, S., Kaashoek, M. F., Madden, S., & Zeldovich, N. (2013). Processing analytical queries over encrypted data. Proceedings of the VLDB Endowment, 6, 289–300. Weiss, N. E., & Miller, R. S. (2015). The target and other financial data breaches: Frequently asked questions. In Congressional research service, prepared for members and committees of congress, February (Vol. 4) (p. 2015). Xiao, L., & Yen, I.-L. (2012). Security analysis for order preserving encryption schemes. In 2012 46th annual conference on information sciences and systems (CISS) (pp. 1–6). IEEE. Xu, L., Jiang, C., Wang, J., Yuan, J., & Ren, Y. (2014). Information security in big data: Privacy and data mining. IEEE Access, 2, 1149–1176. http://dx.doi.org/10.1109/ACCESS.2014.2362522 Xu, L., Zhang, X., Wu, X., & Shi, W. (2015). ABSS: An attribute-based sanitizable signature for integrity of outsourced database with public cloud. In Proceedings of the 5th ACM conference on data and application security and privacy (pp. 167–169). ACM. Yu, X., & Wen, Q. (2010).
A view about cloud data security from data life cycle. In International conf. on computational intelligence and software engineering (CiSE) (pp. 1–4). http://dx.doi.org/10.1109/CISE.2010.5676895

2014 Conference on Information Assurance and Cyber Security (CIACS)
Conference Paper · June 2014 · DOI: 10.1109/CIACS.2014.6861323

Security of Sharded NoSQL Databases: A Comparative Analysis

Anam Zahid, Rahat Masood, Muhammad Awais Shibli
School of Electrical Engineering and Computer Sciences (SEECS)
National University of Sciences and Technology (NUST), Islamabad, Pakistan
Email: {12msitazahid, rahat.masood, awais.shibli}@seecs.edu.pk

Abstract—NoSQL databases are easy to scale out because of their flexible schema and support for BASE (Basically Available, Soft State and Eventually Consistent) properties. The process of scaling out in most of these databases is supported by sharding, which is considered the key feature in providing faster reads and writes to the database. However, securing the data sharded over various servers is a challenging problem because the data is processed in a distributed manner and transmitted over an unsecured network. Although extensive research has been performed on NoSQL sharding mechanisms, no specific criteria have been defined to analyze the security of a sharded architecture. This paper proposes assessment criteria comprising various security features for the analysis of sharded NoSQL databases. It presents a detailed view of the security features offered by NoSQL databases and analyzes them with respect to the proposed assessment criteria. The presented analysis helps organizations select an appropriate and reliable database in accordance with their preferences and security requirements.

Keywords—Sharding; Database Security; NoSQL; Data and Applications Security; Comparative Analysis

I. INTRODUCTION

The term “NoSQL” was first introduced in 1998 for relational databases that did not have SQL interfaces [1]. However, the term was re-introduced in 2009 for the kind of modern web-scale databases which trade transactional consistency for large-scale data distribution and incremental scalability. NoSQL databases were not originally meant to replace traditional databases; rather, they are more suitable when relational databases do not seem appropriate. The main reasons for adopting NoSQL databases are their simple yet flexible architecture and their capability of handling large amounts of multimedia, word processing, social media, email and other unstructured data files [6]. Conventional relational databases, in contrast, are hard to scale out, mainly due to their pre-defined schemas and I/O performance bottlenecks [2, 4]. These issues have made relational databases difficult to fit into new computing paradigms such as Grid and Cloud applications, data warehousing, Web 2.0, and social networking [7]. NoSQL databases, on the other hand, are becoming a primary choice for cloud applications because of their highly available, reliable and scalable nature [8].

The BASE (Basically Available, Soft State and Eventually Consistent) properties of NoSQL databases allow them to be scalable, and thus NoSQL systems inherently support auto-sharding [2]. Auto-sharding is the automatic and native horizontal distribution of data among different servers in NoSQL databases, which in turn increases the performance and throughput of database operations [9]. Another significant advantage of auto-sharding is load balancing across the cluster, so that no single server is overloaded with all the queries. This makes NoSQL databases a good candidate for high-transaction and write-intensive database applications [4, 5].

Apart from scalability and performance, data security is probably one of the most difficult challenges faced by NoSQL databases nowadays. These databases were not initially designed with security as an important feature. Therefore, it has become the sole responsibility of NoSQL consumers to protect these databases by using third-party tools and services [10]. Furthermore, sharding also poses security risks, even greater than those of standalone NoSQL deployments. Some of the main security risks encountered by sharded databases stem from the geographic distribution of data, unencrypted data storage, unauthorized exposure of backup and replicated data, and insecure communication over the network, among others [3, 13]. Such risks imply the need for strong security solutions in sharded databases, which not only secure data-at-rest on a single node but also maintain data security during transmission between the various nodes in a sharded environment [11, 12]. Although NoSQL databases provide mechanisms to practice the basic security principles of CIA (Confidentiality, Integrity and Availability), there is still room for improvement in security solutions, especially those designed for sharded environments.
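The idea behind auto-sharding described in this introduction, horizontal distribution of records so that no single server receives all of the load, can be illustrated with a minimal Python sketch. Hash-based placement is only one possible strategy and is an assumption of this example; the function name `shard_for`, the key format `user:<n>` and the shard count are hypothetical and do not describe how any particular NoSQL product assigns data.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a record key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    shards = 4
    # Count how many of 10,000 hypothetical keys land on each shard.
    load = Counter(shard_for(f"user:{i}", shards) for i in range(10_000))
    for shard in range(shards):
        print(f"shard {shard}: {load[shard]} keys")
```

Because the hash spreads keys roughly evenly, each shard holds a similar share of the data. It also means that the records of one logical data set end up on several servers, which is the setting in which the security risks listed above arise.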
This paper targets sharding mechanisms in popular NoSQL databases such as CouchDB, MongoDB and Cassandra etc and performs a detailed assessment to analyze the security in their sharded architecture. To the best of our knowledge, there are no previous assessment criteria to evaluate the security of existing sharded NoSQL databases. Therefore, we have identified the factors critical for the security of sharded databases after a thorough review of current literature, existing database security controls and practical approaches. These factors will be used as 1 978-1-4799-5852-8/14/$31.00 ©2014 IEEE assessment criteria and help enterprises in the evaluation and selection of suitable sharded NoSQL database according to their security requirements. Beside these models, there are also access control models which satisfy various other security requirements and offers distinct features from one another. These models include PBAC (Policy Based Access Control), TBAC (Task Based Access Control), ABAC (Attribute Based Access Control), FGAC (Fine Grained Access Control) etc. Sharded databases need to configure access control policies to manage consistent authorization strategies throughout the cluster and to ensure that databases and users have restricted access only to the resources required to perform their defined functions [28]. The rest of the paper is organized as follows: Section II elaborates the assessment criteria for evaluating security of sharding techniques in various NoSQL databases, Section III consists of literature review along with the critical analysis of each technique, Section IV summarizes research findings and Section V concludes the paper after presenting some future directions. II. ASSESSMENT CRITERIA C. 
Secure Configurations A security breach may easily occur due to misconfigurations at the OS, database or application layer [26] therefore, sharded database must include a list of configurations that can be applied by the system administrators to secure databases uniformly across clusters at both physical and logical level. These configurations are implemented based on the security and business needs of an organization. Sharded databases usually contain one or more configuration databases holding information about data configurations and distributions. Hence, it is the responsibility of cluster administrator to automatically propagate database configuration information throughout the cluster [24]. Some of the important categories of secure configurations include securing of patches and updates, services, protocols, roles and accounts, files and directories, backups, ports and registries etc [27]. Organizations moving their data to distributed NoSQL data stores must consider security as an important factor besides consistency and availability. Distributing data to multiple servers in different data centers provides more avenues for both physical and virtual security attacks therefore, it is very important to identify the factors necessary for enforcing security in sharded databases and to consider that these factors should be applicable to every sharded database equally. For this purpose, we have extensively reviewed the existing security controls of sharded NoSQL databases [13, 17, 18] and have identified a criterion for the security assessment of sharded database architectures. This criteria will not only serve as generic guidelines for the security evaluation of existing and upcoming NoSQL databases but also highlights those areas in sharded databases which needs security improvements. A brief description against each factor of the criteria is given below: D. 
Data Encryption Data Encryption is used to provide confidentiality of data and applications in a database system [17]. It includes the encryption of data-at-rest as well as the encryption of data-in-transit over the network. Sharded databases are needed to apply data encryption techniques at table, row and column level to secure the information [15, 19]. The common methods of data-at- rest encryption are the implementations of algorithms such as DES (Data Encryption Standard), AES (Advance Encryption Standard) and Hashing (MD5, SHA1 &2) etc. In addition, there exist many different mechanisms to encrypt the data-in-transit. These mechanisms protect the data transmissions from server-to-server and from server to client applications. These communications can be encrypted using standard encryption techniques such as SSL, IPSec, TLS and SSH etc. Moreover, management of encryption keys is also an important factor of data encryption and should be handled with proper care [16]. A. Authentication The process of verifying a user’s identity who wants to access the resources, data or applications of an organization is known as authentication [18]. Authentication can be provided in many ways, ranging from a single user authentication to mutual authentication of user with database server and then to two-way authentication between database servers [14]. Sharded databases may provide its own authentication mechanism or it may rely on some third party systems such as LADP directories or PKI to identify and authenticate its users and database servers [17]. A few of the well known authentication techniques include Password based authentication, Multi-Factor authentication, Certificate based authentication, authentication protocols like SSL, SSH and Kerberos etc. B. Access Controls Access control is the mechanism through which we can ensure that only an authorized person is allowed to access system resources. 
Access control can be applied at the system, database, object, and content level, depending on the configuration chosen by the database administrator [22]. A number of access control models have been proposed to provide secure access to database tables and columns that hold sensitive values. The three most conventional access control models are Discretionary Access Control (DAC), Mandatory Access Control (MAC), and Role-Based Access Control (RBAC) [23, 25]. E. Auditing Database auditing refers to the monitoring and recording of individual and collective actions performed by database users [20]. Auditing helps to identify footprinting or possible password-cracking attempts before an attack occurs [21]. Therefore, in sharded database environments, database administrators (DBAs) should use regular or fine-grained auditing techniques to detect and monitor unauthorized access to data and operations [18]. Important auditing activities include the auditing of database connections, privileged activities, and transaction logs. servers. Query routers redirect the client query to the appropriate shards after looking up shard addresses from the config servers. Query routers are also responsible for cluster balancing using the two primary operations of chunk splitting and balancing. MongoDB supports collection-level partitioning of data. Each collection is distributed evenly across the MongoDB sharded cluster in the form of chunks [30]. 2) Security Analysis: MongoDB supports plain MongoDB-CR for user authentication in its open-source edition. Additionally, it uses SSL with X.509 certificates not only for secure communication between the user and the MongoDB cluster but also for intra-cluster authentication. Access control is enabled through system-defined and custom-defined roles within a sharded cluster.
Moreover, data encryption is provided only at the transport level through SSL, and only basic auditing of data operations is offered. Security of database configurations is supported at a very low level and depends entirely on the database administrator. This section has presented an overview of the identified assessment criteria along with a few well-known examples of each. In the next section, we analyze the security features of six widely adopted, open-source sharded NoSQL databases on the basis of these common criteria. Each of the defined security assessment criteria has been classified into three metric values (low, medium, and high) to help NoSQL consumers select a suitable sharded database according to their requirements. These metric values describe the significance of the identified security factors in sharded databases and are shown in Table I, while the security assessment criteria with respect to these values can be seen in Table II. This classification also acts as a guideline for the security assessment of the various sharded databases throughout the paper. III. ANALYSIS OF SHARDING SECURITY IN NOSQL DATABASES B. Redis 1) Introduction: Redis is a key-value, highly available and distributed NoSQL data store. Unlike other NoSQL databases, Redis supports three kinds of partitioning, namely "client side partitioning", "proxy assisted partitioning" and "query routing". Client side partitioning lets the Redis client directly select the appropriate node for the storage of data keys. Proxy assisted partitioning, on the other hand, requires the query to be delivered to a proxy server instead of being sent directly to the appropriate Redis instance. With query routing, a client query can be sent to any instance of the Redis cluster, and it is the responsibility of that instance to forward the query to the appropriate instance [31].
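As an illustration of client side partitioning, the sketch below shows how a client could map a key to a node by hashing the key and taking the result modulo the number of nodes. The node addresses and the helper name `select_node` are hypothetical, and CRC32 is used only as an example hash function; this is a minimal sketch of hash partitioning under those assumptions, not Redis's own implementation.

```python
import zlib
from typing import Sequence

# Hypothetical node addresses; a real deployment would list its own Redis instances.
NODES = ("10.0.0.1:6379", "10.0.0.2:6379", "10.0.0.3:6379")


def select_node(key: str, nodes: Sequence[str] = NODES) -> str:
    """Pick the node responsible for `key` on the client side."""
    digest = zlib.crc32(key.encode("utf-8"))  # stable 32-bit hash of the key
    return nodes[digest % len(nodes)]         # modulo maps the hash onto one node


if __name__ == "__main__":
    for key in ("user:1000", "user:1001", "session:abc"):
        print(key, "->", select_node(key))
```

Because the mapping depends on the number of nodes, adding or removing a node remaps most keys under this simple scheme, which is one reason proxy assisted partitioning and query routing exist as alternatives.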
However, a cluster with a master-slave configuration can survive a partition only when the majority of its master nodes remain reachable. Redis also supports master-slave replication within its cluster for read efficiency and data redundancy [32]. 2) Security Analysis: Redis provides password-based authentication to its clients. These passwords are set by the system administrators and stored in plain text. However, Redis does not enable authentication by default and listens on all IP addresses on port 6379. In addition, Redis does not provide any access control mechanism and relies on third-party SSL implementations to secure data transmissions over untrusted networks. Redis also provides no support for configuration security, data encryption, or auditing. This section provides an introduction to sharding techniques in various NoSQL databases and analyzes their current security features on the basis of the proposed assessment criteria discussed in Section II. The analysis covers the security features currently offered by sharded NoSQL databases and also highlights the effective security controls that are lacking. It will help organizations, NoSQL vendors, and consumers select an appropriate NoSQL database according to their security requirements. Additionally, this analysis will assist in improving the security controls of various NoSQL databases. The following is a detailed security analysis of the sharding architecture of several existing NoSQL databases, namely MongoDB, Redis, HBase, Cassandra, CouchDB, and Couchbase Server, compared against the defined assessment criteria. Note that the basic criteria for selecting these particular NoSQL databases in this paper are their popularity, open-source nature, and easily available documentation. A. MongoDB 1) Introduction: MongoDB is an open-source, highly available, document-oriented, scalable, and fault-tolerant NoSQL database [29].
It supports sharding by configuring a sharded cluster. Each cluster is composed of three main components namely shards, query routers and configuration TABLE I. Metric Value SECURITY ASSESSMENT IMPORTANCE KEY Description High Provides complete support of required features needed to secure data Medium Provides a limited set of security feature only and it is advisable to implement missing features Low Offers very basic security features or no security at all C. CouchDB 1) Introduction: CouchDB is a document oriented, peer based nosql database having high scalability and availability [33]. The cluster configuration of CouchDB support document redistribution across nodes for large performance improvements. Any change in a document on a single node is periodically copied to other nodes of the cluster using the phenomenon of incremental replication 3 TABLE II. SECURITY ASSESSMENT CRITERIA Metric Value Description Authentication High Medium Low Sharded database cluster should provide one of the following authentication mechanisms such as Two-factor authentication (e.g. passwords with fingerprints, PIN code with mobile number etc) Certificate based authentication using PKI with LDAP It should also support client side, intra-cluster as well as inter-cluster authentication. Supports only one type of authentication e.g. SSO, OpenID, SAML or some certificate based authentication etc either within a sharded cluster or with client. Only Simple Password based client authentication is supported or no authentication at all. Access Control High Dynamic access control rules following principal of least privileges, separation of duties with custom defined access control policies such as UCON (Usage Control Mode) etc. Medium Support for either RBAC , DAC, MAC, ABAC or FGAC etc. Low Few predefined roles such as user and DBA or no other support for access control at all. 
Secure Configuration High This type of database configuration security ensures Security of Database Log and configuration files Hardened database servers Limited number of protocols, services and administrator accounts Secured backup and recovery of databases Regular Patching and updates etc Medium Only supports a limited number of protocols for communication and strict access control to registries and accounts Low Use of default configurations Data Encryption High Encryption of data-at-transit as well as data-at-rest at application, network and database level. Use of secure protocols like SSL, TLS etc and strong encryption with secure key management mechanisms for data security. Medium Encryption of data-at-rest or use of transport layer security Low No security of data-at-rest or data during transmission Auditing High Transparent auditing of database, application and user profiles using auditing and monitoring tools. Logging of all internal and external activities and events. Medium Database level auditing, logging of all changes to user profiles Low Database connection level auditing (i-e log-on, log-off etc.) or No auditing mechanisms admin party, for its users. Moreover, authorization is only implemented at database level. A very medium level auditing is provided to log views and events in log files. However, CouchDB does not provide automatic logging thus the configuration of logs is the responsibility of the database administrators. Furthermore, automatic backups of database logs and replicas are also not supported in CouchDB database. [34]. This incremental replication also helps in maintaining data redundancy and consistency in a CouchDB cluster. Moreover, CouchDB uses the techniques of “oversharding” and “iterative shard replacements” to distribute data evenly across the cluster [35]. This technique helps sharded cluster to grow optimally without much downtime. 
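The idea behind oversharding can be sketched as follows: data is split into many more shards than there are nodes, so that growing the cluster means moving a few whole shards to the new node rather than re-hashing every document. The shard count, node names, and helper functions below are hypothetical and chosen only for illustration; this is a sketch of the general technique, not CouchDB's actual implementation.

```python
import zlib
from collections import Counter
from typing import Dict, List

NUM_SHARDS = 16  # assumed value: deliberately larger than the number of nodes


def shard_for(doc_id: str) -> int:
    """Map a document id to a fixed shard; this never changes when nodes are added."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def assign_shards(nodes: List[str]) -> Dict[int, str]:
    """Spread the shards over the initial nodes in round-robin order."""
    return {shard: nodes[shard % len(nodes)] for shard in range(NUM_SHARDS)}


def add_node(assignment: Dict[int, str], new_node: str) -> Dict[int, str]:
    """Move just enough whole shards to `new_node` to even out the load."""
    counts = Counter(assignment.values())
    target = NUM_SHARDS // (len(counts) + 1)  # fair share for the new node
    updated = dict(assignment)
    moved = 0
    for shard in sorted(updated):
        if moved >= target:
            break
        owner = updated[shard]
        if counts[owner] > target:  # only take shards from overloaded nodes
            counts[owner] -= 1
            updated[shard] = new_node
            moved += 1
    return updated


if __name__ == "__main__":
    before = assign_shards(["node-a", "node-b"])
    after = add_node(before, "node-c")
    print("shards per node:", dict(Counter(after.values())))
```

In this sketch the document-to-shard mapping stays fixed, and only the shard-to-node assignment changes when the cluster grows.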
2) Security Analysis: CouchDB supports basic password based authentication as well as cookie based authentication for its users. Passwords in CouchDB are hashed using PBKDF2 hashing algorithm and are sent over the network using SSL for the security of data transmission. Access control in CouchDB only supports a single role i-e D. Cassandra 1) Introduction: Cassandra is a column-oriented, highly scalable and distributed NoSQL database based on 4 F. Couchbase Server 1) Introduction: Couchbase server is an open source distributed, document-oriented and shared-nothing architecture based NoSQL database. It has a true sharednothing architecture with auto-sharding and cross cluster replication (XCDR) facilities. All the servers in a cluster are distributed across various data centers. The documents are also uniformly distributed across the cluster and stored in special data containers called vbuckets. A Couchbase cluster scales in completely horizontal fashion and more nodes can be added and removed when needed [41]. The mappings of vbuckets to the nodes of the cluster are stored in a lookup structure called cluster map. This cluster map is stored in all the cluster nodes as well as in the Couchbase client nodes. Whenever a cluster wants to scale-in or scale-out, a balancing round starts to balance vbuckets evenly among the cluster. High availability and failover is maintained through replication at the vbucket level. This is done by maintaining an active vbucket present on one node and its replica on another node [42]. A cluster performance and monitoring is maintained by database administrators through either its administrative web interface or management REST API [43]. 2) Security Analysis: The Couchbase administrative web interface and its management REST API use HTTP basic authentication while SASL based external authentication is also supported. Additionally, the administrative console only provides read-only administration rights to its clients. 
Couchbase only provides secure data replication using SSL in its XCDR technology which is not a part of its open source distribution. Couchbase server supports logging of every component i-e view, index and vbucket etc. However, configuration security is still absent in Couchbase server and is totally relied upon database administrator. the architecture of Google’s BigTable and Amazon’s Dynamo data store. Cassandra uses partition and replication strategy of Dynamo combined together with the column family data model from BigTable [36]. A Cassandra cluster consists of multiple decentralized nodes for the storage of partitioned data items. Each node is responsible for the management and storage of its own data items. All these data items are distributed transparently over the nodes and each node can route the read and write requests to the appropriate node [37]. Replication policies are mainly used at two levels namely “Simple Strategy”, and “Network Topolgy Strategy”, to achieve high availability and scalability. All the write operations are first written into commit log for durability and recoverability [38]. 2) Security Analysis: Cassandra provides very weak password based authentication where all passwords are stored using MD5 hash. All the authentication and authorization in Cassandra is provided between client and Cassandra cluster i-e inter-node message exchange does not support authentication by default. Hence, any malicious user having access to network used by Cassandra cluster can cause damage and extract data after bypassing client side authentications. However, Cassandra provides intra-cluster transmission security at cluster, datacenter and rack level by enabling SSL/TLS in its configurations. By default there is a single super user in Cassandra but other users can be created by assigning permissions to them using CQL (Cassandra Query Language). 
Furthermore, Cassandra does not support any auditing, logging, data-at-rest encryption and configuration security across the cluster in its open-source version. E. HBase 1) Introduction: HBase is an open source columnoriented, automatically distributed and scalable hadoop data store build on the concept of Google’s BigTable underlying architecture. It uses distributed configuration, replication and write-ahead-logging (WAL) mechanisms to recover from automatic failovers. The client query in HBase is directly transferred to the particular RegionServer after performing a lookup operation in the .META. and –ROOTcatalog tables [39]. An HBase cluster has multiple RegionServers and a single Master server. Each RegionServer has multiple Regions to store table data in them. When a single table becomes too big, it is distributed across multiple Regions [40]. 2) Security Analysis: HBase not only supports token based authentication for mapreduce tasks but user authentication is also supported by HBase. The user authentication is done by using SASL (Simple Authentication and Security Layer) with Kerberos on per connection basis. Additionally, authorizations in HBase are managed by ACL (Access Control Lists) or Coprocessors with column family level granularity and on per user basis. HBase also provides logging support up to data node level. However, other high level auditing and monitoring features are absent in HBase along with configuration security and encryption of data-at-rest. IV. RESEARCH FINDINGS & DISCUSSION From the analysis of above mentioned sharding supported NoSQL databases, it can be seen that most of them offer no security or very less security which is an important aspect of distributed data processing environments where multiple users send multiple data requests over unsecured channel. The assessment clearly shows that: most of the sharded databases provide password based client side authentication like in MongoDB, Cassandra and Redis. 
Intra-cluster authentication, in contrast, is not very common and is provided only by using the SSL/TLS protocols between client and server. Role-based authorization for access control is implemented at a basic level in these databases, and all rights to read, write, or modify data are assigned to a super user by default. These databases also provide no support for custom-defined roles, except for MongoDB and HBase, while authorization is mostly defined at database-level granularity. Configuration security is a domain that is completely neglected by sharded NoSQL databases. Most NoSQL vendors recommend the use of VPNs and firewalls while providing very basic built-in support for network security. The security of backups and replicas is also considered the sole responsibility of database administrators. Furthermore, most of the databases do not support any kind of data transmission security, and others have it disabled by default. It is therefore recommended that all intra-cluster and client-side transmissions be secured through the SSL/TLS protocol. Finally, most of these sharded databases support auditing at the database or table level but lack automatic auditing and monitoring in their open-source releases. The individual comparative analysis of the sharded NoSQL databases against each criterion is shown in Figs. 1-5. The analysis shows that there is a significant need to improve the security of sharded NoSQL databases. In light of this evaluation, research is required in the domain of sharded NoSQL databases with the objective of achieving reliable, efficient, and secure sharding mechanisms. A holistic solution is recommended to achieve robust security features in sharded databases while keeping scalability and performance in mind.
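The ratings reported in Table III can also be tabulated programmatically to compare the databases at a glance. The sketch below encodes the table as a dictionary and counts how often each metric value occurs. The numeric weights (Low = 0, Medium = 1, High = 2) and the function names are assumptions made here for illustration; they are not part of the paper's assessment criteria.

```python
from collections import Counter
from typing import Dict

DATABASES = ["MongoDB", "Redis", "CouchDB", "Cassandra", "HBase", "Couchbase Server"]

# Ratings transcribed from Table III, listed in the order of DATABASES.
TABLE_III = {
    "Authentication":       ["Medium", "Low", "Medium", "Low", "Medium", "Medium"],
    "Access Control":       ["High", "Low", "Low", "Low", "Medium", "Low"],
    "Secure Configuration": ["Medium", "Low", "Low", "Low", "Low", "Low"],
    "Data Encryption":      ["Medium", "Low", "Medium", "Medium", "Low", "Low"],
    "Auditing":             ["Low", "Low", "Medium", "Low", "Medium", "Medium"],
}

# Assumed weights, used only to produce a rough aggregate per database.
WEIGHTS = {"Low": 0, "Medium": 1, "High": 2}


def rating_counts() -> Counter:
    """Count how many table cells carry each metric value."""
    return Counter(value for row in TABLE_III.values() for value in row)


def scores() -> Dict[str, int]:
    """Sum the assumed weights over all criteria for each database."""
    return {
        db: sum(WEIGHTS[row[i]] for row in TABLE_III.values())
        for i, db in enumerate(DATABASES)
    }


if __name__ == "__main__":
    print(rating_counts())
    for db, score in sorted(scores().items(), key=lambda item: -item[1]):
        print(f"{db}: {score}")
```

Under these assumed weights, more than half of the thirty ratings are Low, which matches the paper's observation that most of the surveyed databases offer little built-in security.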
Table III presents a summarized view of the research findings.

TABLE III. COMPARATIVE ANALYSIS OF SHARDING SECURITY IN VARIOUS NOSQL DATABASES

Assessment Criteria    MongoDB  Redis  CouchDB  Cassandra  HBase   Couchbase Server
Authentication         Medium   Low    Medium   Low        Medium  Medium
Access Control         High     Low    Low      Low        Medium  Low
Secure Configuration   Medium   Low    Low      Low        Low     Low
Data Encryption        Medium   Low    Medium   Medium     Low     Low
Auditing               Low      Low    Medium   Low        Medium  Medium

Fig. 1. Comparative Results for Authentication
Fig. 2. Comparative Results for Access Control
Fig. 3. Comparative Results for Secure Configurations
Fig. 4. Comparative Results for Data Encryption
Fig. 5. Comparative Results for Auditing
(Each figure plots the metric value, from Low to High, for each sharded NoSQL database.)

V. CONCLUSION AND FUTURE WORK

This paper presents assessment criteria for the evaluation of various open-source sharded NoSQL databases. These sharded databases are analyzed according to the proposed assessment criteria, and their criticality is identified in the existing systems. This not only helps in further analysis of the areas in sharded databases that lack security but also enables NoSQL vendors and customers to improve the security techniques already implemented. The survey findings show that improving security is a continuous process that should not be halted. It was also found that there is no single complete solution to all sharded database security problems; any organization that needs to implement these sharded databases must consider security at every level, including the security of the database cluster itself, transmission security, security of data-at-rest, and backup/replica security. As our future work, we would like to survey different NoSQL sharding solutions deployed in the cloud in the form of DBaaS (Database-as-a-Service), such as Amazon's DynamoDB and SimpleDB, and analyze how the cloud computing paradigm affects the security features of these sharded database delivery models.

REFERENCES

[1] R. Masood. "Fine-Grained Access Control for Database Management Systems." MS (CCS) thesis, National University of Science and Technology, Pakistan, 2013.
[2] 10Gen Corporation. "NoSQL Explained." Internet: http://www.mongodb.com/nosql-explained, 2011 [Mar. 25, 2014].
[3] IBM Corporation. "Data Security and Privacy – A holistic Approach." Internet: www.ibm.com/software/data/optim/protect-data-privacy, Sept. 2011 [Apr. 11, 2014].
[4] CodeFutures Corporation. "Cost-effective Database Scalability using database Sharding." Internet: www.codefutures.com/databasesharding, Jul. 2008 [Mar. 24, 2014].
[5] A. Viswanathan and C.J. Kothari. "Hibernate Framework-based database sharding for SaaS Applications." Internet: http://www.ibm.com/developerworks/library/os-hibernatesaas, Oct. 12, 2010 [Apr. 2, 2014].
[6] C. Roe. "The Growth of Unstructured Data: What To Do with All Those Zettabytes?" Internet: http://www.dataversity.net/the-growthof-unstructured-data-what-are-we-going-to-do-with-all-thosezettabytes, Mar. 15, 2012 [Mar. 28, 2014].
[7] N. Hardiman. "Cloud computing and the rise of big data." Internet: http://www.techrepublic.com/blog/the-enterprise-cloud/cloudcomputing-and-the-rise-of-big-data, Oct. 1, 2013 [Apr. 5, 2014].
[8] A. Schram and K.M. Anderson. "MySQL to NoSQL: data modeling challenges in supporting scalability," in Proc. of the 3rd annual conference on Systems, programming, and applications: software for humanity, 2012, pp. 191-202.
[9] MongoDB Inc. "MongoDB sharding guide - Sharding and MongoDB Release 2.4.6." Internet: http://docs.mongodb.org/manual/sharding/, Mar. 2014 [Mar. 28, 2014].
[10] J.D. Meier, A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla and A. Murukan. "Chapter 18: Securing Your Database Server." Internet: http://msdn.microsoft.com/en-us/library/ff648664.aspx, Jun. 2006 [Mar. 29, 2014].
[11] T. Allard, N. Anciaux, L. Bouganim, Y. Guo, L. Le Folgoc, B. Nguyen, P. Pucheral, I. Ray, I. Ray, and S. Yin. "Secure personal data servers: a vision paper," in Proc. of the VLDB Endowment 3, 2010, pp. 25-35.
[12] U.T. Mattsson. "A practical implementation of transparent encryption and separation of duties in enterprise databases: protection against external and internal attacks on databases," in E-Commerce Technology 2005, 2005, pp. 559-565.
[13] T. Baccam. "Oracle Database Security: What to look for and Where to secure." Internet: https://www.sans.org/reading-room/analystsprogram/oraclewhitepaper-201004, Apr. 2010 [Mar. 12, 2014].
[14] R. Duncan. "An Overview of Different Authentication Methods and Protocols." SANS Institute, Washington, DC, US, 2001. Available: http://www.sans.org/readingroom/whitepapers/authentication/overview-authentication-methodsprotocols-118
[15] K.W. Nafi, T. Shekha Kar, S.A. Hoque, and M.M.A. Hashem. (2013, Mar.) "A newer user authentication, file encryption and distributed server based cloud computing security architecture." (IJACSA) International Journal of Advanced Computer Science and Applications. [On-line]. 3(10), pp. 181-186. Available: http://arxiv.org/ftp/arxiv/papers/1303/1303.0598.pdf [Mar. 12, 2014].
[16] C. Meyer and J. Schwenk. "Lessons Learned From Previous SSL/TLS Attacks - A Brief Chronology Of Attacks And Weaknesses," in Proc. IACR Cryptology, 2013, pp. 49.
[17] D.I.S.A. "Database security technical implementation guide version 8, release 1." Database STIG for USA Dept. of Defence, Sept. 2007.
[18] Oracle Corp. "Oracle database security guide 10g." Internet: http://docs.oracle.com/cd/B12037_01/network.101/b10773.pdf, Dec. 2003 [Mar. 20, 2014].
[19] A. Lane. "Choosing and Implementing an Enterprise database encryption strategy." Information Week Report, Rep. S7220713, 2013.
[20] P. Badkar. "Oracle Database Security Guide 11g Release 1." Internet: http://docs.oracle.com/cd/B28359_01/network.111/b28531.pdf, Jan. 2014 [Mar. 11, 2014].
[21] J.D. Meier, A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla and A. Murukan. "Chapter 2: Threats and Countermeasures." Internet: msdn.microsoft.com/en-us/library/ff648641.aspx, Jan. 2006 [Apr. 21, 2014].
[22] M.G. Piattini and E. Fernandez-Medina. "Secure databases: state of the art," in Proc. IEEE 34th Annual International Carnahan Conference on Security Technology, 2000, pp. 228-237.
[23] E. Bertino and R. Sandhu. "Database security-concepts, approaches, and challenges," in Proc. IEEE Transactions on Dependable and Secure Computing, 2005, pp. 2-19.
[24] G.E. de Silveira. "Usenix: A Configuration Distribution System for Heterogeneous Networks," in Proc. of the Twelfth Systems Administration Conference (LISA), 1998, pp. 283.
[25] R. Sandhu, D. Ferraiolo, and R. Kuhn. "The NIST model for role-based access control: towards a unified standard," in Proc. ACM workshop on Role-based access control, 2000, pp. 47-63.
[26] A. Ely. "Strategy: Responding to database compromise." Information Week analytics, Rep. S1680810, 2010.
[27] J.D. Meier, A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla and A. Murukan. "Chapter 18: Securing Your Database Server." Internet: http://msdn.microsoft.com/en-us/library/ff648664.aspx, Jun. 2006 [Apr. 20, 2014].
[28] N. Delessy, E.B. Fernandez, M.M. Larrondo-Petrie, and J. Wu. "Patterns for access control in distributed systems," in Proc. of the 14th Conference on Pattern Languages of Programs, 2007, pp. 3.
[29] MongoDB Inc. "MongoDB Architecture Guide." Internet: http://info.mongodb.com/rs/mongodb/images/MongoDB_Architecture_Guide.pdf, Mar. 2014 [Apr. 14, 2014].
[30] 10Gen Corporation. "Sharding and MongoDB Release 2.6.0." Internet: http://docs.mongodb.org/master/MongoDB-shardingguide.pdf, Apr. 2, 2014 [Apr. 20, 2014].
[31] Redis. "Partitioning: how to split data among multiple Redis instances." Internet: http://redis.io/topics/partitioning, 2014 [Apr. 20, 2014].
[32] E. Wolff. "Redis: The Universal NoSQL Tool." Internet: http://www.slideshare.net/ewolff/redis-15061825, Nov. 07, 2012 [Apr. 21, 2014].
[33] J.C. Anderson, J. Lehnardt and N. Slater. (2010, Jan.) CouchDB: The Definitive Guide. (1st edition). [On-line]. Available: http://itebooks.info/read/288/ [Apr. 21, 2014].
[34] Couchbase. "Eventual Consistency." Internet: http://docs.couchdb.org/en/latest/intro/consistency.html, 2014 [Apr. 21, 2014].
[35] Couchbase. "CouchDB The definitive guide: Clustering." Internet: http://guide.couchdb.org/draft/clustering.html, 2014 [Apr. 21, 2014].
[36] Cassandra Wiki. "Cassandra Wiki Architecture Overview." Internet: http://wiki.apache.org/cassandra/ArchitectureOverview, 2013 [Apr. 21, 2014].
[37] IBM Corp. "Consider the Apache Cassandra database." Internet: http://www.ibm.com/developerworks/library/os-apache-cassandra/, 2012 [Apr. 21, 2014].
[38] U. Mansoor. "Cassandra Chapter 5: Data Replication Strategies." Internet: http://10kloc.wordpress.com/2012/12/27/cassandra-chapter5-data-replication-strategies, Dec. 12, 2012 [Apr. 22, 2014].
[39] Apache HBase. "The Apache HBase Reference Guide." Internet: http://hbase.apache.org/book.html#arch.overview, Apr. 06, 2014 [Apr. 21, 2014].
[40] V. Karnataki. "HBase-Overview of architecture and data model." Internet: http://netwovenblogs.com/2013/10/10/hbase-overview-ofarchitecture-and-data-model, Oct. 10, 2013 [Apr. 23, 2014].
[41] Couchbase. "Couchbase Server under the Hood - An Architectural Overview." Internet: http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/Couchbase_Server_Architecture_Review.pdf, 2013 [Apr. 24, 2014].
[42] Couchbase. "Dealing with Memcached Challenges - Getting the performance without the gotchas." Internet: http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/Couchbase_Whitepaper_Dealing_with_Memcached_Challenges.pdf, 2013 [Apr. 16, 2014].
[43] Couchbase. "Couchbase Server Features." Internet: http://www.couchbase.com/couchbase-server/features#content1, 2014 [Apr. 12, 2014].

ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No. 16, July 2015. ISSN: 2322-5157. www.ACSIJ.org

A Survey on Security Issues in Big Data and NoSQL

Ebrahim Sahafizadeh (1), Mohammad Ali Nematbakhsh (2)
(1) Computer engineering department, University of Isfahan, Isfahan, 81746-73441, Iran, sahafizadeh@eng.ui.ac.ir
(2) Computer engineering department, University of Isfahan, Isfahan, 81746-73441, Iran, nematbakhsh@eng.ui.ac.ir

Abstract
This paper presents a survey on security and privacy issues in big data and NoSQL. Due to the high volume, velocity and variety of big data, security and privacy issues are different in such streaming data infrastructures with diverse data formats. Therefore, traditional security models have difficulties in dealing with such large-scale data. In this paper we present some security issues in big data and highlight the security and privacy challenges in big data infrastructures and NoSQL databases.
Keywords: Big Data, NoSQL, Security, Access Control

2.1 Big Data
Big data is a term that refers to collections of large data sets, which are described by what is often referred to as the multi 'V' model. In [8], seven characteristics are used to describe big data: volume, variety, velocity, value, veracity, volatility and complexity; [9], however, does not mention volatility and complexity. Here we describe each property. Volume: Volume refers to the size of the data. The size of data in big data is very large and is usually on the terabyte or petabyte scale. Velocity: Velocity refers to the speed at which data is produced and processed. In big data, the rate of data production and processing is very high.
Variety: Variety refers to the different types of data in big data. Big data includes structured, unstructured and semistructured data and the data can be in different forms. Veracity: Veracity refers to the trust of data. Value: Value refers to the worth drives from big data. Volatility: "Volatility refers to how long the data is going to be valid and how long it should be stored" [8]. Complexity: "A complex dynamic relationship often exists in big data. The change of one data might result in the change of more than one set of data triggering a rippling effect" [8]. Some researchers defined the important characteristics of big data are volume, velocity and variety. In general, the characteristics of big data are expressed as three Vs. 1. Introduction The term big data refers to high volume, velocity and variety information which requires new forms of processing. Due to these properties which are referred sometimes as 3 'V's, it becomes difficult to process big data using traditional database management tools [1]. A new challenge is to develop novel techniques and systems to extensively exploit the large volume of data. Many information management architectures have been developed towards this goal [2]. As developing new technologies and increasing the use of big data in several scopes, security and privacy has been considered as a challenge in big data. There are many security and privacy issues about big data [1, 2, 3, 4, 5 and 6]. In [7] top ten security and privacy challenges in big data is highlighted. Some of these challenges are: secure computations, secure data storages, granular access control and data provenance. 2.2 NoSQL The term NoSQL stands for "Not only SQL" and it is used for modern scalable databases. Scaling is the ability of the system to increase throughput when the demands increase in terms of data processing. 
To support big data processing, the platforms incorporate scaling in two forms of scalability: horizontal scaling and vertical scaling [10]. Horizontal Scaling: in horizontal scaling the workload distributes across many servers. In this type of scalability multiple systems are added together in order to increase the throughput. In this paper we focus on researches in access control in big data and security issues on NoSQL databases. In section 2 we have an overview on big data and NoSQL technologies, in section 3 we discuss security challenges in big data and describe some access control model in big data and in section 4 we discuss security challenges in NoSQL databases. 2. Big Data and NoSQL Overview In this section we have an overview on Big Data and NoSQL. 68 Copyright (c) 2015 Advances in Computer Science: an International Journal. All Rights Reserved. ACSIJ Advances in Computer Science: an International Journal, Vol. 4, Issue 4, No.16 , July 2015 ISSN : 2322-5157 www.ACSIJ.org Vertical Scaling: in vertical scaling more processors, more memory and faster hardware are installed within a single server. The main advantages of NoSQL is presented in [11] as the following: "1) reading and writing data quickly; 2) supporting mass storage; 3) easy to expand; 4) low cost". In [11] the data models that studied NoSQL systems support are classified as Key-value, Column-oriented and Document. There are many products claim to be part of the NoSQL database, such as MongoDB, CouchDB, Riak, Redis, Voldermort, Cassandera, Hypertable and HBase. Apache Hadoop is an open source implementation of Google big table [12] for storing and processing large datasets using clusters of commodity hardware. Hadoop uses HDFS which is a distributed file system to store data across clusters. In section 6 we have an overview of Hadoop and discuss an access control architecture presented for Hadoop. [2] using data content. 
In this case the semantic content of data plays the major role in access control decision making. "CBAC makes access control decisions based on the content similarity between user credentials and data content dynamically" [2]. Attribute relationship methodology is another method to enforce security in big data proposed in [3] and [4]. Protecting the valuable information is the main goal of this methodology. Therefore [4] focuses on attribute relevance in big data as a key element to extract the information. In [4], it is assumed that the attribute with higher relevance is more important than other attributes. [3] uses a graph to model attributes and their relationship. Attributes are expressed as node and relationship is shown by the edge between each node and the method is proposed by selecting protected attributes from this graph. The method proposed in [4] is as follow: "First, all the attributes of the data is extracted and then generalize the properties. Next, compare the correlation between attributes and evaluate the relationship. Finally protect sele...