JOURNAL OF EMERGING TECHNOLOGIES IN ACCOUNTING
Vol. 13, No. 1
Spring 2016
pp. 121–140
American Accounting Association
DOI: 10.2308/jeta-51463
Data Analytics: Introduction to Using Analytics in Auditing
David Y. Chan
St. John’s University
Alexander Kogan
Rutgers, The State University of New Jersey, Newark
ABSTRACT: This is a hands-on, introductory data analytics teaching case that can be used in an auditing or
related course. Students will learn about data attributes, data creation, structured query language (SQL), basic
statistics, and performing basic audit procedures using analytics by utilizing the open source software R. Instructors
can use this case for an in-class discussion or an independent out-of-class assignment. A solutions guide is available
in the Teaching Notes. Multimedia files are available for download; see Appendix B.
Keywords: auditing; data analytics; R; teaching case study.
INTRODUCTION
We are in the era of Big Data, and accounting and auditing professionals with data analytics skills are in high
demand. Accounting firms today expect graduates to have an appreciation for analytics and to understand where
analytics may be used (Ernst & Young Academic Resource Center [EYARC] Colloquium on Analytics in the
Accounting Curriculum [EY 2015]). However, many accounting programs do not expose their students to data analytics in their
curricula. The AACSB issued Standard A7 that provides guidance for accounting programs including learning experiences that
develop skills and knowledge relating to data creation, data sharing, data analytics, data mining, data reporting, and data storing
in an organization (AACSB 2014). The purpose of this data analytics case is to gently introduce and familiarize students with
the use of data analytics in general and its use in the accounting and auditing contexts. Furthermore, students will learn about
data attributes, data creation, structured query language (SQL), basic statistics, and performing basic audit procedures using
analytics. The case utilizes the free, open source software R (The R Foundation 2015). Instructors can use this case for an in-classroom, instructor-led discussion or an independent out-of-classroom student assignment. The case provides all the
necessary instructions, from setting up R to performing common audit-related procedures. The instructions are all-inclusive, and students will not need resources outside of this case and its related materials.
The scripts used in this case can be downloaded as a text file; see the link to "Scripts" in Appendix B.
Background Information
As a new staff auditor for a public accounting firm, you have been assigned to the XYZ Inc. audit engagement. XYZ is a
public company. The engagement partner has suggested the use of computer-assisted audit tools and techniques (CAATTs)
whenever possible in order to ensure that an effective and efficient audit is performed. A CAATT is computer software that
allows auditors to perform data analytics. Furthermore, CAATTs may aid the auditor in testing 100 percent of the transaction
population and automating or semi-automating the performance of audit procedures. While many audit procedures can be
performed manually or with the use of Microsoft Excel, manual methods are not always effective or efficient and the use of
spreadsheet software has data limitations. For example, Excel 2013 only allows for 1,048,576 rows (observations) and 16,384 columns (variables) (Microsoft 2015). The limitation on the number of columns may not be an issue, but the limitation on the number of rows may become a constraint.
We acknowledge and thank Miklos A. Vasarhelyi, Hui Du, the reviewers, and participants of the 2015 AIS Educator Conference and the 2015 AAA
Annual Meeting for the insightful comments and suggestions to advance and improve our paper.
Supplemental materials can be accessed by clicking the links in Appendix B.
Editor’s note: Accepted by Miklos A. Vasarhelyi.
Submitted: March 2015
Accepted: March 2016
Published Online: April 2016
Furthermore, computational performance issues will arise in Excel when analyzing large datasets.
The audit manager on the engagement has assigned to you specific audit procedures to be performed within the revenue
cycle of the audit program. The audit procedures assigned include footing, re-computing, scanning, sample selection, and an
analytical procedure. There are two popular CAATTs in the audit software market: ACL and IDEA. However, the firm prefers
the use of the data analytical software R because of its open source nature and also the versatility of the analysis in the software.
The use of open source software is becoming more mainstream with many different organizations (Deloitte 2015).
Furthermore, many students are familiar with R from their statistics courses. Interfacing or communicating with R is done
through coding or scripting, unlike in Excel where there is a graphic user interface (GUI). Unlike CAATTs that utilize GUIs,
the benefit of a scripting interface is its inherent ability to document, review, and reproduce the path of analysis. Scripting
allows firms and regulators to perform reviews of the audit work. In R, users write the script and then execute the script. While
there is initially a steeper learning curve compared with GUI software, scripting software allows the auditor to reuse scripts in
subsequent analysis and therefore makes analysis more efficient.
Before proceeding with any audit procedures or analysis, you will have to install the R software on your computer. R is
compatible with Windows, Apple, and Linux computers. In this case, we will use the Windows version for demonstration
purposes. Please refer to Appendix A for detailed instructions on installing R.
Installing and Loading Packages
The R software comes with pre-installed add-on packages for basic analysis. However, R has a long-standing open source
community that develops packages for more advanced data manipulation and analysis. R has an extensive number of free
packages that are contributed by the open source community. ‘‘Free’’ does not in any way mean that the packages are inferior.
Users in the community may have created a specific feature that was not available in the basic software and wanted to share the
developed feature with the community. Some packages are developed for bleeding-edge analytics. The contribution from the
open source community and the variety of available analytics differentiates R from other analytical software. You will need to
download, install, and load two packages ("sqldf" and "forecast") for the exercises in this case. You can copy and paste the scripts from the "Scripts" text document (and as shown below) into the R Editor window to download and install the two packages. A video that shows how scripts are executed in R is available for download; see Appendix B.
Script 1:
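The executable code ships in the supplemental "Scripts" file, which remains the authoritative source; a minimal sketch of what Script 1 likely contains:

    install.packages(c("sqldf", "forecast"))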
After copying and pasting the code into the R Editor window, highlight the pasted script, press and hold "Ctrl," and then press "R" on the keyboard. This will send the highlighted code to the R Console window and execute it. Two "Question" dialog boxes may pop up. You should click "Yes" in both dialog boxes. A new window, "HTTPS CRAN mirror," will pop up. Select the CRAN mirror closest to your area and click "OK." For example, you can select USA (CA 1) if you are located near California, United States. The software will automatically begin to download and install the two packages from the selected mirror. Packages only need to be installed once.
After installing the two packages, the "sqldf" and "forecast" packages will need to be loaded. These packages have to be loaded every time the R software is started. To load the two packages, enter the two-line script below into the R Editor window, highlight the two lines, press and hold "Ctrl," and then press "R" on the keyboard. The scripts will execute and the packages will be loaded into the R software. Notice the code passes to the R Console window and executes.
Script 2:
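A sketch of the likely two-line loading script:

    library(sqldf)
    library(forecast)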
Setting Format Output Options
R is designed for statistical analysis, and occasionally the software renders financial or nonfinancial values in scientific (exponential) notation, such as "1e+06." You will need to command R to use the fixed-point format instead of the exponential format, since fixed-point notation is commonly used when dealing with monetary/dollar values. The fixed-point format will output number values in the numeric format by default. The various data attributes will be discussed in the "Data Cleaning" section. Enter the following script into the R Editor window, highlight the script, press and hold "Ctrl," and then press "R" on the keyboard. This option command has to be executed every time the R software is started.
Script 3:
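A plausible version of Script 3, using R's scipen option to penalize scientific notation (the distributed script may differ in the exact penalty value):

    options(scipen = 999)  # prefer fixed-point output over exponential notation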
The R software is now set up and ready for use in this case.
Importing Data
Obtaining data is the first step in data analytics. In an audit, the client is responsible for providing the auditors with
financial data to audit. The client has provided the engagement team data from their sales journal, shipping journal, cash
journal, and customer master file. The client has also provided the aggregated revenue account balances for the years 2002 through 2012. The files are provided to the auditors in five comma-separated value (CSV) files. CSV is an open format that is commonly used to transport data and can be easily read by analytic software.
Use the R scripts below to import these five CSV files into R. The scripts will pull the respective CSV files from a hosted server (also see the links to Sales1, Ship1, Cash1, Customer1, and Revenue in Appendix B). Once imported, R will store the data in dataframes. Dataframes are similar to a database table or an Excel sheet. Enter the scripts below into the R Editor window, highlight them, press and hold "Ctrl," and then press "R" on the keyboard.
Script 4:
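A sketch of the likely import script. Only the Sales1 URL is spelled out in the definition table below; the other four file names are assumptions that follow the same pattern:

    Sales1 <- read.csv('http://davidchan.net/data/Sales1.csv', header = TRUE, sep = ",")
    Ship1 <- read.csv('http://davidchan.net/data/Ship1.csv', header = TRUE, sep = ",")          # assumed URL
    Cash1 <- read.csv('http://davidchan.net/data/Cash1.csv', header = TRUE, sep = ",")          # assumed URL
    Customer1 <- read.csv('http://davidchan.net/data/Customer1.csv', header = TRUE, sep = ",")  # assumed URL
    Rev1 <- read.csv('http://davidchan.net/data/Revenue.csv', header = TRUE, sep = ",")         # assumed URL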
Script Definition

Code                                       Definition
read.csv                                   Reads a comma-separated value ("csv") file
'http://davidchan.net/data/Sales1.csv'     File location
header = TRUE                              Variable names are included at the top of the columns
sep = ","                                  Data is separated by commas
After importing the data into R, the "View" command can be used to open a dataframe. For example, to view the imported "Sales1" dataframe, enter the script shown below into the R Editor window, highlight it, press and hold "Ctrl," and then press "R" on the keyboard:
Script 5:
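The likely one-liner:

    View(Sales1)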
Data Diagnostics
Once the CSV data files have been imported into R, you should run some preliminary data diagnostics/checks. The purpose
of running diagnostic checks is to determine whether the files have been imported correctly and whether the data is what you
will need for performing the procedures. You can use an array of diagnostic techniques to check for the number of rows
(observations) and columns (variables), verify the variable names, and view and inspect the first six rows of data or the last six
rows of data. Diagnosing the data is important because you want to confirm that you are analyzing the data you expect to analyze.
Enter the scripts below into the R Editor window, highlight them, press and hold "Ctrl," and then press "R" on the keyboard to show the number of rows (nrow) and columns (ncol) in the "Sales1" dataframe:
Script 6:
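A sketch:

    nrow(Sales1)
    ncol(Sales1)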
Checkpoint 1
(1) How many rows are in the ‘‘Sales1’’ dataframe?
(2) How many columns are in the ‘‘Sales1’’ dataframe?
Enter and execute the script below into R to show the variable (column) names in the ‘‘Sales1’’ dataframe:
Script 7:
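Likely:

    names(Sales1)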
Checkpoint 2
(1) List the variable names in the "Sales1" dataframe.
Enter and execute the scripts below into R to show the first six rows (head) and last six rows (tail) of the ‘‘Sales1’’
dataframe:
Script 8:
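A sketch:

    head(Sales1)
    tail(Sales1)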
Checkpoint 3
(1) What is the third ‘‘Sales_Order_No’’ in the ‘‘Sales1’’ dataframe?
(2) What is the last ‘‘Sales_Order_No’’ in the ‘‘Sales1’’ dataframe?
Data Cleaning
The client may provide the auditors with the data needed for auditing, but the data may not be in a format that is usable by R for analysis. Data cleaning is necessary to get the data into a format that R can use and compute from. For example, the client may provide monetary accounting data with commas ("1,500"). Numeric data with commas are recognized as text, or as the character format, in R. R cannot do calculations on variables identified as character variables, so they will need to be converted into the numeric format. There are five basic data formats: (1) numeric, (2) integer, (3) character, (4) factor, and (5) date. Below are examples of each:
Format Type   Example
Numeric       837223, 123.23, 2320840.98 (can have decimals)
Integer       235, 8372, 23208 (no decimals)
Character     data, auditing, Main Street
Factor        0, 1
Date          2014-12-11
If you open the "Sales1," "Cash1," and "Customer1" dataframes, you will notice variables that contain numeric values with commas in them. R may therefore categorize these numeric variables as either Character or Factor variables. Before any subsequent analysis on these variables can occur in R, the commas have to be removed and the variables converted into the numeric format.
First, consider the Sales Journal Dataframe (Sales1):
R has a function called "str" that shows the data format of each variable in a dataframe. Enter the script below into the R Editor window, highlight it, press and hold "Ctrl," and then press "R" on the keyboard:
Script 9:
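Likely:

    str(Sales1)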
Notice the "Invoice_Amount" variable in the "Sales1" dataframe is categorized as a Factor variable (Figure 1).
A numeric variable cannot have commas between the digits. You will need to strip the commas from the numeric values in the "Invoice_Amount" variable and convert the variable into a numeric variable using the following code:
FIGURE 1
Sales1 Dataframe
Script 10:
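A plausible reconstruction, assembled from the pieces in the definition table below:

    Sales1$Invoice_Amount <- as.numeric(gsub(",", "", Sales1$Invoice_Amount))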
Script Definition

Code                      Definition
Sales1$Invoice_Amount     Variable "Invoice_Amount" in the "Sales1" dataframe
as.numeric                Convert the variable to the numeric data type
gsub                      Find-and-replace (remove-comma) function
",", ""                   Replace each comma with nothing
Next, let's consider the Cash Receipts Journal dataframe:
In the Cash Receipts dataframe, the "Invoice_Amount" and "Payment_Received" variables are both categorized as Character variables but should be Numeric variables. Again, this is due to the commas between the digits. You will need to strip the commas from the variables and convert them into Numeric variables using the following scripts:
Script 11:
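A sketch of the likely conversions:

    Cash1$Invoice_Amount <- as.numeric(gsub(",", "", Cash1$Invoice_Amount))
    Cash1$Payment_Received <- as.numeric(gsub(",", "", Cash1$Payment_Received))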
Checkpoint 4
(1) Remove the commas and convert the following two variables into numeric variables: "Customer_Balance" and "Customer_Max_Credit_Amount" in the Customer Master File dataframe "Customer1."
Next, determine whether the other variables in the dataframes are categorized in the correct data type. As you recall, a
variable can be a Numeric, Integer, Character, Date, or Factor data type. If a variable is not categorized correctly, then you will
need to convert the variable into the correct type.
Enter and execute the scripts below in R to determine the data type for each of the variables in each of the respective
dataframes:
Script 12:
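Likely one str() call per dataframe:

    str(Sales1); str(Ship1); str(Cash1); str(Customer1); str(Rev1)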
You will notice that the dataframe "Sales1" has two variables in an incorrect format: (1) "Invoice_Date" and (2) "Invoice_Description" (Figure 2). R is categorizing "Invoice_Date" and "Invoice_Description" as Factor variables. "Invoice_Date" should be in the Date format and "Invoice_Description" should be in the Character format, so you will need to convert "Invoice_Date" from the Factor format to the Date format and "Invoice_Description" from the Factor format to the Character format. The "Invoice_Description" variable describes the type of sale and therefore belongs in the Text or Character format. Here are the scripts to do so:
FIGURE 2
Sales1 Dataframe
Script 13:
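A plausible version, assuming the dates are stored in the ISO yyyy-mm-dd form shown in the format table above:

    Sales1$Invoice_Date <- as.Date(Sales1$Invoice_Date)
    Sales1$Invoice_Description <- as.character(Sales1$Invoice_Description)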
In the Shipping dataset, you will need to convert the variable ‘‘Shipping_Number’’ to the Character format and the variable
‘‘Shipping_Date’’ to the Date format.
Script 14:
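A sketch:

    Ship1$Shipping_Number <- as.character(Ship1$Shipping_Number)
    Ship1$Shipping_Date <- as.Date(Ship1$Shipping_Date)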
In the Customer dataset, you need to convert the variable "Customer_No" into the Integer format; "Customer_Name," "Customer_Address," "Customer_City," and "Customer_State" into the Character format; "Customer_Credit_Rating" into the Factor format; and "Customer_Max_Credit_Amount" into the Numeric format.
Script 15:
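A plausible reconstruction:

    Customer1$Customer_No <- as.integer(Customer1$Customer_No)
    Customer1$Customer_Name <- as.character(Customer1$Customer_Name)
    Customer1$Customer_Address <- as.character(Customer1$Customer_Address)
    Customer1$Customer_City <- as.character(Customer1$Customer_City)
    Customer1$Customer_State <- as.character(Customer1$Customer_State)
    Customer1$Customer_Credit_Rating <- as.factor(Customer1$Customer_Credit_Rating)
    Customer1$Customer_Max_Credit_Amount <- as.numeric(gsub(",", "", Customer1$Customer_Max_Credit_Amount))  # a no-op if Checkpoint 4 was completed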
Checkpoint 5
(1) In the Cash dataset, convert the variable ‘‘Payment_Date’’ to the Date format.
Finally, you should re-examine all the dataframes and determine whether all the variables are in the correct data type before
you continue. Enter the following scripts and verify:
Script 16:
Structured Query Language
Structured Query Language (SQL) is a standard language for relational database management systems (DBMSs). Auditors can use the language to access data, make queries, create new tables (dataframes), and manipulate data in a database. For the
purpose of this case study and its use with R, focus on the latter three. You will need to understand how to use the SQL SELECT statement. This statement starts with the SELECT keyword followed by a comma-separated list of the variables that will be displayed in the result set generated by the statement. This is followed by the FROM clause, which lists the dataframes required to construct the result set. The statement may then have a WHERE clause that provides the conditions the result set must satisfy. For example, suppose you want to SELECT the variables "Invoice_No" and "Invoice_Amount" FROM the "Sales1" dataframe WHERE "Invoice_Amount" is greater than 1500. This SQL command would return the variables "Invoice_No" and "Invoice_Amount" for the rows where "Invoice_Amount" is greater than 1500. Further examples are articulated below.
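For instance, with the sqldf package loaded, this example could be written as:

    sqldf("SELECT Invoice_No, Invoice_Amount FROM Sales1 WHERE Invoice_Amount > 1500")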
Create New Dataframes
Prior to analyzing data, create a new dataframe by extracting the relevant variables from pre-existing dataframes. The purpose of creating a new dataframe is twofold: (1) it preserves the original data, and (2) it speeds up subsequent analysis, since the new dataframe is smaller and consists only of the relevant variables.
In auditing accounting information, you need to select variables in various data types (e.g., Numeric, Character, and Date).
Below are examples of the SQL queries to extract these various data types. First, create a new dataframe called SQL1 with the
sales journal information of a specific invoice. This can be accomplished by a SQL query for the specific invoice number. The
data type of the variable "Invoice_No" in the dataframe is Numeric. In the example below, SELECT all the variables FROM the dataset "Sales1" WHERE "Invoice_No" is 1031. The asterisk (*) after the SELECT keyword requests that all the variables be selected.
Script 17:
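A plausible query:

    SQL1 <- sqldf("SELECT * FROM Sales1 WHERE Invoice_No = 1031")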
Second, create a new dataframe called SQL2 with the customer master file information for the customer Arch Coal, Inc. Note that values of Character data type variables have to be enclosed in single quotation marks ('').
Script 18:
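A sketch (the exact spelling of the stored customer name may differ):

    SQL2 <- sqldf("SELECT * FROM Customer1 WHERE Customer_Name = 'Arch Coal, Inc.'")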
Third, create two new dataframes called SQL3A and SQL3B for transactions that fall within a specific range of dates. Before writing these queries, you need to create a new dataframe called Sales2 and convert the variable "Invoice_Date" into the Character data type; R is not able to compare the dates in these queries without this conversion.
Script 19:
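A plausible reconstruction:

    Sales2 <- Sales1
    Sales2$Invoice_Date <- as.character(Sales2$Invoice_Date)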
As a Character data type, each date has to be enclosed in single quotation marks (''). For the new dataframe SQL3A, select all the invoices from the Sales Journal dated before July 1, 2010. For the new dataframe SQL3B, select all the invoices from the Sales Journal dated between July 1, 2010 and September 30, 2010.
Script 20:
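A sketch; with ISO-formatted date strings, the text comparisons below order correctly:

    SQL3A <- sqldf("SELECT * FROM Sales2 WHERE Invoice_Date < '2010-07-01'")
    SQL3B <- sqldf("SELECT * FROM Sales2 WHERE Invoice_Date BETWEEN '2010-07-01' AND '2010-09-30'")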
Checkpoint 6
(1) Create a new dataframe called SQL3C with the customer master file information of the customer ‘‘MSCI, Inc.’’
(2) Open the dataframe SQL3C. What is the Customer Max Credit Amount?
Dataframes may need to be merged or joined for analysis. You can use SQL to join two different dataframes on a matching unique identifier that is common to both. You can perform a left join or an inner join. In a left join, every row from the first dataframe is kept and is matched, where possible, with rows from the second dataframe; any row in the second dataframe without a match in the first is excluded (see Figure 3).
FIGURE 3
Left Join
An inner join includes only the rows that have a match in both dataframes (see Figure 4). Further examples are articulated below.
In the example below, you will create a new dataframe called SQL4A by joining the two dataframes SQL3A and Cash1. You want to keep all the rows in SQL3A and attach the matching rows from Cash1; rows in Cash1 without a match will be excluded. Therefore, you will need to use the left join statement. The two unique identifiers that are common to both dataframes are "Customer_No" and "Invoice_No." Use both identifiers because a customer can have multiple invoices.
Script 21:
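A plausible left-join query:

    SQL4A <- sqldf("SELECT * FROM SQL3A LEFT JOIN Cash1
                    ON SQL3A.Customer_No = Cash1.Customer_No
                    AND SQL3A.Invoice_No = Cash1.Invoice_No")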
Next, create a new dataframe called SQL4B by joining the two dataframes SQL3B and Cash1. You want a dataframe containing only those rows that match in both SQL3B and Cash1. Therefore, use the inner join statement. Again, join on both "Customer_No" and "Invoice_No" because a customer can have multiple invoices.
Script 22:
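A plausible inner-join query:

    SQL4B <- sqldf("SELECT * FROM SQL3B INNER JOIN Cash1
                    ON SQL3B.Customer_No = Cash1.Customer_No
                    AND SQL3B.Invoice_No = Cash1.Invoice_No")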
FIGURE 4
Inner Join
Data Manipulation
When doing analysis, it is sometimes necessary to create a new variable out of one or more preexisting variables. For example, you may want to create a count variable based on the number of days between two dates, or a Factor variable that indicates the occurrence of a specific attribute.
Before manipulating any data, it is a good rule of thumb to create a duplicate dataframe in order to preserve the content of the original. Here you will create a new dataframe called SQL5 in order to preserve the original data in SQL4A. You also need to convert the variable "Invoice_Date" back into the Date data type (it was changed above) and convert the variable "Payment_Date" into the Date format. Using the data in SQL5, you want a binary (factor) indicator equal to 1 when a payment is made within the first 10 days and 0 otherwise. First, you need to calculate the number of days between the invoice and payment dates; call this count variable "Diff_Date."
Script 23:
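A sketch of the likely steps:

    SQL5 <- SQL4A
    SQL5$Invoice_Date <- as.Date(SQL5$Invoice_Date)
    SQL5$Payment_Date <- as.Date(SQL5$Payment_Date)
    SQL5$Diff_Date <- as.numeric(SQL5$Payment_Date - SQL5$Invoice_Date)  # days from invoice to payment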
Before proceeding, you will need to remove the additional variable "X" created as a result of the join above; R will be confused because there are two columns named "X." Remove the duplicate variable in column 8.
Script 24:
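One plausible way to drop the duplicate, assuming it sits in position 8 as the text states:

    SQL5 <- SQL5[, -8]  # remove the column in position 8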
Next, create a new variable called "Early" by using the "Diff_Date" variable calculated above and the SQL CASE statement; "Early" is a Factor variable that indicates whether a payment was made within 10 days. You will also need to convert the newly created variable "Early" into the Factor data type.
Script 25:
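A plausible reconstruction:

    SQL5 <- sqldf("SELECT *, CASE WHEN Diff_Date <= 10 THEN 1 ELSE 0 END AS Early FROM SQL5")
    SQL5$Early <- as.factor(SQL5$Early)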
Similarly, you may want to create a new variable called "Late" to identify payments that were made after more than 30 days and are thus considered late.
Script 26:
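A sketch mirroring the "Early" script:

    SQL5 <- sqldf("SELECT *, CASE WHEN Diff_Date > 30 THEN 1 ELSE 0 END AS Late FROM SQL5")
    SQL5$Late <- as.factor(SQL5$Late)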
Summary Statistics
Before performing audit procedures on the data, you may want to run some basic statistics on the variables that you will use. In R, you can use the summary() function to get a quick overview of basic statistical information about the individual variables in a dataframe.
Script 27:
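Likely:

    summary(SQL5)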
For each variable, R can also produce specific statistical measures such as mean, median, quantile, and range.
Script 28:
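A sketch, using "Invoice_Amount" as the example variable:

    mean(SQL5$Invoice_Amount)
    median(SQL5$Invoice_Amount)
    quantile(SQL5$Invoice_Amount)
    range(SQL5$Invoice_Amount)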
R can also be used to produce plots and charts. Charts can provide useful visualizations of data and can be used to look for anomalies or outliers.
Create a histogram of the variable "Invoice_Amount" in the Sales Journal:
Script 29:
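Likely:

    hist(Sales1$Invoice_Amount)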
Create a boxplot with the "Invoice_Amount" and "Customer_Credit_Rating" variables:
Script 30:
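A sketch, assuming both variables are available in a single dataframe (a joined dataframe would be needed if they are not):

    boxplot(Invoice_Amount ~ Customer_Credit_Rating, data = Sales1)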
Create a Scatterplot of the variable ‘‘Invoice_Amount’’ in the Sales Journal:
Script 31:
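Likely:

    plot(Sales1$Invoice_Amount)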
When performing analytical procedures during an audit, the goal is to look at the relationships among the data. Certain accounts on the financial statements, or related financial and nonfinancial information, can be expected to be positively or negatively related to each other. You can use a correlation test to determine the extent to which two Numeric variables are related. In the example below, find the relationship between the variables "Customer_Max_Credit_Amount" and "Customer_Credit_Rating." You would expect a positive relationship between these two variables: generally, a company grants higher credit limits to customers with higher credit ratings. Before proceeding with the correlation test, you will need to convert the "Customer_Credit_Rating" variable into a Numeric variable.
Script 32:
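A plausible reconstruction; as.numeric() on a factor returns the underlying level codes, which serves as a rough numeric rating scale:

    Customer1$Customer_Credit_Rating <- as.numeric(Customer1$Customer_Credit_Rating)
    cor(Customer1$Customer_Max_Credit_Amount, Customer1$Customer_Credit_Rating)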
Audit Application
Common audit procedures include footing, scanning, and re-computing the totals of a column. You can use the sum() function in R to add up a column.
Script 33:
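Likely:

    sum(Sales1$Invoice_Amount)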
Checkpoint 7
(1) What is the footed amount for the ‘‘Invoice_Amount’’ variable?
Scanning for irregular transactions is another popular audit procedure. Before scanning, you may want to sort the data by a specific variable. For example, create a new dataframe called Sort1 with the data from dataframe Sales1 sorted by the "Sales_Order_No" variable:
Script 34:
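A sketch:

    Sort1 <- Sales1[order(Sales1$Sales_Order_No), ]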
Sorting can also be done on two variables. For example, you can sort by "Sales_Order_No" and then by "Invoice_Amount."
Script 35:
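Likely:

    Sort2 <- Sales1[order(Sales1$Sales_Order_No, Sales1$Invoice_Amount), ]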
Sorting can also be done in descending order. For example, use the code shown below to sort the Sales1 dataframe from low to high on the "Sales_Order_No" variable and from high to low on the "Invoice_Amount" variable. To sort a variable in descending order, add a minus sign ("-") in front of it; here, add the "-" in front of "Invoice_Amount."
Script 36:
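A sketch:

    Sort3 <- Sales1[order(Sales1$Sales_Order_No, -Sales1$Invoice_Amount), ]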
Recalculation of variables can be performed in R as well. If you want to confirm whether the discounts taken by customers were calculated correctly, simply re-compute the discounts and determine whether there are any differences from your own computations.
First, create a new dataframe called AUD1 that includes only the transactions that were paid early; only invoices paid early should have a discount. Use the "Early" variable created above as the constraint.
Script 37:
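A plausible query (if "Early" is stored as a factor/character, the comparison value may need quoting as '1'):

    AUD1 <- sqldf("SELECT * FROM SQL5 WHERE Early = 1")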
The discount rate is 2 percent. Recalculate the discount by multiplying the invoice amount by the discount rate; call the new variable "Recal1." You will also need to round the computed discount and convert it into a Numeric variable called "Recal2."
Script 38:
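A sketch:

    AUD1$Recal1 <- AUD1$Invoice_Amount * 0.02
    AUD1$Recal2 <- as.numeric(round(AUD1$Recal1, 2))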
Last, calculate whether there is a difference between your calculation of the discount and what the client took as the discount.
Script 39:
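A sketch; the client's discount is assumed here to be derivable as the invoice amount less the payment received, and the variable name Client_Disc is hypothetical:

    AUD1$Client_Disc <- AUD1$Invoice_Amount - AUD1$Payment_Received  # assumed derivation
    AUD1$Diff <- AUD1$Recal2 - AUD1$Client_Disc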
Checkpoint 8
(1) Was there a difference between the client’s calculated discounts and your calculation?
R can also be used for making sample selections.
To take a random sample of three observations from the "Sales1" dataframe, enter the code below into R.
Script 40:
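A plausible version:

    Sample1 <- Sales1[sample(nrow(Sales1), 3), ]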
R can also be used to select a systematic sample. Below you will select every fifth row in the dataframe ‘‘Sales1’’:
Script 41:
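A sketch:

    Sample2 <- Sales1[seq(5, nrow(Sales1), by = 5), ]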
A stratified sample can also be taken. Since base R does not include a function for stratified sample selection, a custom function is required. Copy and paste the code below into the R Editor and execute it:
Script 42:
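A compact stand-in for the referenced custom function; the original (see the link below) is longer, but this version accepts the same four inputs described later in the text:

    stratified <- function(df, id, group, size) {
      # df: the dataframe; id: column number of the unique ID (kept to match
      # the inputs described in the text); group: column number of the stratum
      # variable; size: number of rows to draw from each stratum
      rows_by_stratum <- split(seq_len(nrow(df)), df[[group]])
      picked <- unlist(lapply(rows_by_stratum, function(rows)
        rows[sample.int(length(rows), min(size, length(rows)))]))
      df[picked, ]
    }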
For more, please see: http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/.
In stratified sample selection, the population is divided into groups, or strata, and selections are made within each stratum. Continue with the dataframe SQL5 and use the factor variable "Early" created above as the stratum indicator for this exercise. There are two strata in the "Early" variable: (1) customers that paid early and (2) customers that did not. If you did not know the number of groups in a variable, you would want to determine that first.
Script 43:
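Likely:

    levels(SQL5$Early)    # strata of a factor variable
    # unique(SQL5$Early)  # alternative that works for any variable type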
Next, make your stratified sample selection using the custom function above. The function requires the following inputs: the dataframe, the column number of the unique ID variable, the column number of the grouping variable, and the sample size. In this example, use the SQL5 dataframe, the "Customer_No" variable (column 2), the "Early" variable (column 14), and a sample size of 3. The column numbers of the variables can be obtained with the names() function in R.
Script 44:
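A sketch of the likely call:

    Sample3 <- stratified(SQL5, 2, 14, 3)  # dataframe, ID column, stratum column, size per stratum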
Checkpoint 9
(1) How many observations are in the sample selected?
Analytical procedures can be used to determine whether changes in account balances are in line with expectations. In R you can perform a time-series analysis to statistically predict, or estimate, the expected future value of an account balance using historical data. To illustrate, use the client's revenue account balances for the past 11 years (2002 to 2012) to estimate the revenue balances for subsequent years. For this exercise, an exponential smoothing time-series model will be used to make the prediction.
You will use the variable "Revenue" in the "Rev1" dataframe for this analysis. Enter the script below into the R Editor to create the time-series vector, called MYTS. You will need the dataframe and variable, the start period, the end period, and the frequency of the periods. The revenue balances are annual, so the frequency is one.
Script 45:
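A plausible version:

    MYTS <- ts(Rev1$Revenue, start = 2002, end = 2012, frequency = 1)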
Next, use the exponential smoothing function "HoltWinters" in R to fit the model and store the estimation information in the object "FIT." You will need the time-series vector "MYTS" created above, and you will need to set the gamma option to FALSE.
Script 46:
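Likely:

    FIT <- HoltWinters(MYTS, gamma = FALSE)  # gamma = FALSE: no seasonal component (annual data)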
Once the model has been fitted, you can obtain the estimated revenue balances for the years 2013, 2014, and 2015 using the forecast() function. You can also visualize the estimates with the plot() function.
Script 47:
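A sketch; h = 3 requests forecasts for 2013 through 2015:

    FC <- forecast(FIT, h = 3)
    FC
    plot(FC)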
Checkpoint 10
(1) What is the forecasted revenue account balance for 2013, 2014, and 2015?
CONCLUSION
The utilization of CAATTs to perform audit procedures can increase an audit's effectiveness and efficiency over manual audit procedures. The efficiency gained through using CAATTs allows auditors to reallocate their time to the evaluation of exceptions or anomalies. As we move toward automated or continuous audits, the majority of the auditor's work will be exercising professional judgment. Although learning how to use analytical tools takes an investment of time, the evolution of the skillsets needed in auditing is clear, and dependence on data analytics will continue to grow. As a result, the demand for students with data analytics skills will only increase. After completing the case, you should have a greater appreciation for the use of data analytics in general, and in the auditing context specifically. Furthermore, you should have gained an understanding of data attributes, data creation, structured query language (SQL), the use of basic statistics, and the performance of basic audit procedures using analytics.
REFERENCES
Association to Advance Collegiate Schools of Business (AACSB). 2014. Information Technology Skills and Knowledge for Accounting Graduates: An Interpretation. AACSB International Accounting Accreditation Standard A7. Tampa, FL: AACSB. Available at: http://www.aacsb.edu/~/media/AACSB/Publications/white-papers/accounting-accreditation-standard-7.ashx
Deloitte. 2015. Analytics Trends 2015 – A Below-the-Surface Look. Available at: http://public.deloitte.com/media/analytics/trends/analytics-trends.html
Ernst & Young (EY). 2015. Colloquium on Analytics in the Accounting Curriculum. The 2015 Ernst & Young Academic Resource
Center (EYARC). London, UK: EY.
Microsoft. 2015. Excel Specifications and Limits. Available at: https://support.office.com/en-ie/article/Excel-specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f
The R Foundation. 2015. The R Project for Statistical Computing. Available at: https://www.r-project.org/
APPENDIX A
Download R
In order to complete the assignments in this case study, you will need to download and install the R software.
Download R: http://cran.stat.ucla.edu/
You may download the Windows, Mac, or Linux versions of the software but note that the interface may be different from
what is illustrated below. The instructions below are designed for the R Windows edition of the software.
A video that shows how to download R is available for download, see the link to ‘‘Download_R’’ in Appendix B.
Install R
The installation of R is similar to how you install other software. Below are the step-by-step screenshots of the installation
process.
A video that shows how to install R is available for download, see the link to ‘‘Install_R’’ in Appendix B.
Double-click the downloaded R software *.exe file and click "Next":1
Accept the defaults and click "Next":
1 All images are author screenshots of the installation process for R for Windows 3.1.2.
Accept default and click ‘‘Next’’:
After R installs, click "Finish":
Setup R
Open the R software from the Start window or using the Desktop icon. The R software will open and the R Console
window should be prominently displayed.
A video that shows how to set up R is available for download, see the link to ‘‘Set_Up_R’’ in Appendix B.
R Console Window:
Click ‘‘File’’ from the menu and click ‘‘New Script’’:
The R Editor window will open and will be prominently displayed:
Click ‘‘Windows’’ from the menu and then click ‘‘Tile Horizontally’’:
The screen below will now show the ‘‘R Editor’’ window on top of the ‘‘R Console’’ window:
The R Editor window is used to read, write, and edit R code or scripts, and the R Console is used to execute the scripts and display the output of the analysis. Tiling the windows allows the user to see the scripts, their execution, and the output of the analysis in an efficient manner.
APPENDIX B
Scripts: http://dx.doi.org/10.2308/jeta-51463.s01
Sales1: http://dx.doi.org/10.2308/jeta-51463.s02
Ship1: http://dx.doi.org/10.2308/jeta-51463.s03
Cash1: http://dx.doi.org/10.2308/jeta-51463.s04
Customer1: http://dx.doi.org/10.2308/jeta-51463.s05
Revenue: http://dx.doi.org/10.2308/jeta-51463.s06
Execute_R_scripts: http://dx.doi.org/10.2308/jeta-51463.s07
Download_R: http://dx.doi.org/10.2308/jeta-51463.s08
Install_R: http://dx.doi.org/10.2308/jeta-51463.s09
Set_Up_R: http://dx.doi.org/10.2308/jeta-51463.s10
CASE LEARNING OBJECTIVES AND IMPLEMENTATION GUIDANCE
Learning Objectives
• Introduce students to data analytics in the accounting and auditing context.
• Develop an understanding of data attributes, data creation, structured query language (SQL), basic statistics, and the performance of basic audit procedures using analytics.
• Provide hands-on practice in using the open source analytical software R.
Instructor Guide
This case has been successfully used in an external audit, internal audit, and fraud examination course. Furthermore, the
case has been facilitated both as an in-class, instructor-led discussion and as a student take-home assignment. The authors found
that the latter was more effective and efficient as the students were able to work the case at their own pace. The case includes all
the necessary instructions and related materials for students to complete the case with little or no instruction from the instructor. The instructor can use the associated checkpoints in the case to verify that students have completed the case. If an instructor prefers to use the case as an in-class, instructor-led discussion, then the authors found that allocating two to three hours of class time should be adequate. However, there will be a positive correlation between the number of students in the class and the amount of time that should be allocated: there will likely be more questions when there are more students, and therefore more time may be required. For efficiency purposes, it is highly recommended that the students: (1) install R on their
own computers, (2) download the required materials, and (3) confirm that their computers can access the Internet before class.
The instructor may also use the case in a hybrid approach. For the hybrid approach, the case can be discussed in class and the
students are then assigned to work on the case at home.
TEACHING NOTES AND STUDENT VERSION OF THE CASE
Teaching Notes and the Student Version of the Case are available only to non-student-member subscribers to Journal of
Emerging Technologies in Accounting through the American Accounting Association’s electronic publications system at http://
www.aaapubs.org/. Non-student-member subscribers should use their usernames and passwords for entry into the system
where the Teaching Notes can be reviewed and printed. The ‘‘Student Version of the Case’’ is available as a supplemental file
that is posted with the Teaching Notes. Please do not make the Teaching Notes available to students or post them on websites.
If you are a non-student-member of AAA with a subscription to Journal of Emerging Technologies in Accounting and have
any trouble accessing this material, please contact the AAA headquarters office at info@aaahq.org or (941) 921-7747.
Database Systems Journal vol. VI, no. 1/2015
Big Data Analytics Platforms analyze from startups to traditional database players
Ionuţ ŢĂRANU
Bucharest University of Economic Studies
ionut.tanaru@gmail.com
Big data analytics enables organizations to analyze a mix of structured, semi-structured and
unstructured data in search of valuable business information and insights. The analytical
findings can lead to more effective marketing, new revenue opportunities, better customer
service, improved operational efficiency, competitive advantages over rival organizations and
other business benefits. With so many emerging trends around big data and analytics, IT
organizations need to create conditions that will allow analysts and data scientists to
experiment. “You need a way to evaluate, prototype and eventually integrate some of these
technologies into the business,” says Chris Curran[1]. In this paper we are going to review 10
Top Big Data Analytics Platforms and compare the key-features.
Keywords: Big data, In-memory, Hadoop, Data analysis
1. Introduction
The growth of data – both structured and unstructured – will present challenges as well as opportunities for organisations over the next five years.
With growing data volumes, it is essential
that real-time information that is of use to
the business can be extracted from its IT
systems, otherwise the business risks
being swamped by a data deluge.
Meanwhile, competitors that use data to
deliver better insights to decision-makers
stand a better chance of thriving through
the difficult economy and beyond. To
analyze such a large volume of data, big
data analytics is typically performed
using specialized software tools and
applications for predictive analytics, data
mining, text mining, forecasting and data
optimization. Collectively these processes
are separate but highly integrated
functions of high-performance analytics.
Today's advances in analyzing Big Data
allow researchers to decode human DNA
in minutes, predict where terrorists plan
to attack, determine which gene is most
likely to be responsible for certain
diseases and, of course, which ads you
are most likely to respond to on
Facebook. The business cases for
leveraging Big Data are compelling. For
instance, Netflix mined its subscriber data
to put the essential ingredients together for
its recent hit House of Cards, and subscriber
data also prompted the company to bring
Arrested Development back from the dead.
Another example comes from one of the
biggest mobile carriers in the world.
France's Orange launched its Data for
Development project by releasing subscriber
data for customers in the Ivory Coast. The
2.5 billion records, which were made
anonymous, included details on calls and
text messages exchanged between 5 million
users. Researchers accessed the data and
sent Orange proposals for how the data
could serve as the foundation for
development projects to improve public
health and safety. Proposed projects
included one that showed how to improve
public safety by tracking cell phone data to
map where people went after emergencies;
another showed how to use cellular data for
disease containment.[2] So it seems that
data analysis is a do-or-die requirement
for today's businesses. We analyze below
notable vendor choices, from Hadoop
upstarts to traditional database players.
2. Top 10 Big Data Analytics Platforms
2.1. Actian (Fig. 1 – Actian Analytics Platform)
Fig. 1. Actian Analytics Platform
• Analytical DBMS: Actian Matrix (formerly ParAccel), Actian Vector (formerly Vectorwise).
• In-memory DBMS: Actian Matrix In-Memory Option (data stored to both memory and disk).
• Hadoop distribution: None.
• Stream-processing technology: None.
• Hardware/software systems: None (software-only vendor).
The company is counting on the combination of fast analytical DBMS options, cloud services, and data-integration and data-analytics software geared to a world in which Hadoop is a prominent fixture of the data-management architecture. Actian DataFlow includes SQL-, ETL-, and data-cleansing-on-Hadoop options that work with distributions from Apache, Cloudera, Hortonworks, and others.[3]
2.2. Amazon
• Analytical DBMS: Amazon Redshift service (based on the ParAccel engine); Amazon Relational Database Service.
• In-memory DBMS: None. Third-party options on AWS include Altibase, SAP Hana, and ScaleOut.
• Hadoop distributions: Amazon Elastic MapReduce. Third-party options include Cloudera and MapR.
• Stream-processing technology: Amazon Kinesis.
• Hardware/software systems: Not applicable.
AWS is located in 11 geographical "regions": US East (Northern Virginia), where the majority of AWS servers are based, US West (Northern California), US West (Oregon), Brazil (São Paulo), Europe (Ireland and Germany), Southeast Asia (Singapore), East Asia (Tokyo and Beijing), and Australia (Sydney). There is also a "GovCloud," based in the Northwestern United States, provided for U.S. government customers, complementing existing government agencies already using the US East region. Each region is wholly contained within a single country, and all of its data and services stay within the designated region.
Amazon Web Services (Fig. 2 – Amazon Web Service) hosts a who's who list of data-management services from third-party players – Cloudera, Microsoft, Oracle, SAP, and many others – but the cloud giant has its own long-term ambitions where big-data analysis is concerned.[4] Building on its Elastic Compute Cloud (EC2) and Simple Storage Service (S3) storage infrastructure, Amazon launched its Hadoop-based Elastic MapReduce service back in 2009. In 2013, AWS added the Redshift data warehousing service (based on the ParAccel DBMS), which is supported by another who's who list of independent data-integration, business intelligence, and analytics vendors. Rounding out AWS's big-data capabilities are the DynamoDB NoSQL database management service and the Kinesis stream-processing service.
Fig. 2. Amazon Web Service
• Amazon DynamoDB provides a scalable, low-latency NoSQL online database service backed by SSDs.
• Amazon ElastiCache provides in-memory caching for web applications. This is Amazon's implementation of Memcached and Redis.
• Amazon Relational Database Service (RDS) provides a scalable database server with MySQL, Oracle, SQL Server, and PostgreSQL support.[22]
• Amazon Redshift provides petabyte-scale data warehousing with column-based storage and multi-node compute.
• Amazon SimpleDB allows developers to run queries on structured data. It operates in concert with EC2 and S3 to provide "the core functionality of a database."
• AWS Data Pipeline provides a reliable service for data transfer between different AWS compute and storage services (e.g., Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EMR). In other words, this service is simply a data-driven workload management system that provides a simple management API for managing and monitoring data-driven workloads in cloud applications.[23]
• Amazon Kinesis streams data in real time, with the ability to process thousands of data streams on a per-second basis. The service, designed for real-time apps, allows developers to pull any amount of data, from any number of sources, scaling up or down as needed.[5]
2.3. Cloudera
• Analytical DBMS: HBase; although not a DBMS, Cloudera Impala supports SQL querying on top of Hadoop.
• In-memory DBMS: Although not a DBMS, Apache Spark supports in-memory analysis on top of Hadoop.
• Hadoop distributions: CDH open-source distribution, Cloudera Standard, Cloudera Enterprise.
• Stream-processing technology: Open-source stream-processing options on Hadoop include Storm.
• Hardware/software systems: Partner appliances, preconfigured hardware, or both available from Cisco, Dell, HP, IBM, NetApp, and Oracle.
Cloudera Inc. is an American software company that provides Apache Hadoop-based software, support and services, and training to business customers.[6] Cloudera's open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), targets enterprise-class deployments of that technology. Cloudera says that more than 50% of its engineering output is donated upstream to the various Apache-licensed open source projects (Apache Hive, Apache Avro, Apache HBase, and so on) that combine to form the Hadoop platform. Cloudera is also a sponsor of the Apache Software Foundation.[7]
2.4. HP HAVEn
• Analytical DBMS: HP Vertica Analytics Platform Version 7 (Crane release).
• In-memory DBMS: Vertica is not an in-memory database, but with high RAM-to-disk ratios the company says it can ensure near-real-time query performance.
• Hadoop distribution: None.
• Stream-processing technology: None.
• Hardware/software systems: HP ConvergedSystem 300 for Vertica, plus a choice of reference architectures for the Cloudera, Hortonworks, and MapR Hadoop distributions.
Fig. 3. HAVEn Ecosystem
HP calls its big-data platform architecture HAVEn (Fig. 3 – HAVEn Ecosystem), an acronym for Hadoop, Autonomy, Vertica, Enterprise Security, and "n" applications.
The cluster-based, column-oriented Vertica Analytics Platform is designed to manage large, fast-growing volumes of data and to provide very fast query performance when used for data warehouses and other query-intensive applications. The product claims to drastically improve query performance over traditional relational database systems and to provide high availability and petabyte scalability on commodity enterprise servers.
Its design features include:
• Column-oriented storage organization, which increases the performance of sequential record access at the expense of common transactional operations such as single-record retrieval, updates, and deletes.[9]
• A standard SQL interface with many analytics capabilities built in, such as time-series gap filling/interpolation, event-based windowing and sessionization, pattern matching, event series joins, statistical computation (e.g., regression analysis), and geospatial analysis.
• Out-of-place updates and hybrid storage organization, which increase the performance of queries, insertions, and loads, but at the expense of updates and deletes.
• Compression, which reduces storage costs and I/O bandwidth. High compression is possible because columns of homogeneous datatype are stored together and because updates to the main store are batched.[10]
• A shared-nothing architecture, which reduces system contention for shared resources and allows gradual degradation of performance in the face of hardware failure.
• Ease of use and maintenance through automated data replication, server recovery, query optimization, and storage optimization.
• Support for the standard programming interfaces ODBC, JDBC, and ADO.NET.
• High-performance, parallel data transfer to statistical tools such as Distributed R, and the ability to store machine learning models and use them for in-database scoring.[11][12]
2.5. Hortonworks
• Analytical DBMS: HBase; although not a DBMS, Hive is Hortonworks' option for SQL querying on top of Hadoop.
• In-memory DBMS: Although not a DBMS, Apache Spark supports in-memory analysis on top of Hadoop.
• Hadoop distributions: Hortonworks Data Platform (HDP) 2.0, HDP for Windows, Hortonworks Sandbox (free, single-node desktop software offering Hadoop tutorials).
• Stream-processing technology: Open-source stream-processing options on Hadoop include Storm.
• Hardware/software systems: Partner appliances, preconfigured hardware, or both available from HP, Teradata, and others.
On the matter of customer acquisition, six-year-old Cloudera probably has a slight lead over three-year-old Hortonworks (Fig. 4 – Hortonworks Data Platform), but only just. Analysts estimate Cloudera's base of paying subscribers at around 350, while Hortonworks' CEO Rob Bearden says his company has acquired 250 customers over the past five quarters.
Fig. 4. Hortonworks Data platform
The most significant point of disagreement between Cloudera and Hortonworks lies in their answers to a single question – and the one that, arguably, matters most to enterprise customers: should Hadoop complement or replace traditional enterprise data warehouse (EDW) investments?
2.6. IBM
• Analytical DBMS: DB2, Netezza (Fig. 5 – IBM Netezza platform).
• In-memory DBMS: DB2 with BLU Acceleration, solidDB.
• Hadoop distribution: InfoSphere BigInsights.
• Stream-processing technology: InfoSphere Streams.
• Hardware/software systems: PureData System for Operational Analytics (DB2), IBM PureData System for Analytics (Netezza); PureData System for Hadoop (BigInsights).
Fig. 5. IBM Netezza platform
Although IBM has plenty of products and services, it is not a product-oriented provider of technology. IBM leads with its deep integration and consulting expertise, in a consultative approach focused on building business-differentiating "solutions" that might incorporate multiple products.
IBM Netezza Analytics' advanced technology fuses data warehousing and in-database analytics into a scalable, high-performance, massively parallel advanced analytic platform that is designed to crunch through petascale data volumes. This allows users to ask questions of the data that could not have been contemplated on other architectures. IBM Netezza Analytics is designed to quickly and effectively provide better and faster answers to the most sophisticated business questions.[13]
2.7. Microsoft
• Analytical DBMS: SQL Server 2012 Parallel Data Warehouse (PDW).
• In-memory DBMS: SQL Server 2014 In-Memory OLTP (an option in SQL Server 2014, set for release by the second quarter of 2014).
• Stream-processing technology: Microsoft StreamInsight.
• Hadoop distribution: HDInsight/Windows Azure HDInsight Service (based on the Hortonworks Data Platform).
• Hardware/software systems: Dell Parallel Data Warehouse Appliance, HP Enterprise Parallel Data Warehouse Appliance.
The Microsoft Analytics Platform System (Fig. 6 – Microsoft Analytics Platform System) is a turnkey big data analytics appliance, combining Microsoft's massively parallel processing (MPP) data warehouse technology – the SQL Server Parallel Data Warehouse (PDW) – with HDInsight, Microsoft's 100% Apache Hadoop distribution. To integrate data from SQL Server PDW with data from Hadoop, APS offers the PolyBase data querying technology.[14]
Fig.6. Microsoft Analytics Platform System
2.8. Oracle
• Analytical DBMSs: Oracle Database, Oracle MySQL, Oracle Essbase.
• In-memory DBMS: Oracle TimesTen, Oracle Database 12c In-Memory Option (announced in 2013 without details, roadmaps, or release dates).
• Stream-analysis option: Oracle Event Processing.
• Hadoop distribution: Resells and supports Cloudera Enterprise.
• Hardware/software systems: Exadata, Exalytics, Oracle Big Data Appliance.
Fig.7. Oracle Big Data Appliance
The Oracle Big Data Appliance consists of hardware and software from Oracle Corporation designed to integrate enterprise data, both structured and unstructured. It includes the Oracle Exadata Database Machine and the Oracle Exalytics Business Intelligence Machine, used for obtaining, consolidating, and loading unstructured data into Oracle Database 11g. The product also includes an open source distribution of Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator with Application Adapter for Hadoop, Oracle Loader for Hadoop, an open source distribution of R, Oracle Linux, and the Oracle Java HotSpot Virtual Machine.[15]
By combining the newest technologies from the Hadoop ecosystem with powerful Oracle SQL capabilities on a single pre-configured platform, the Oracle Big Data Appliance (Fig. 7 – Oracle Big Data Appliance) is uniquely able to support rapid development of new Big Data applications and tight integration with existing relational data. Oracle Big Data Appliance is pre-configured for secure environments, leveraging Apache Sentry, Kerberos, both network encryption and encryption at rest, as well as Oracle Audit Vault and Database Firewall.[16]
2.9. Pivotal
• Analytical DBMS: Pivotal Greenplum Database.
• In-memory DBMS: Pivotal GemFire and SQLFire. Pivotal HD is used in combination with GemFire XD and HAWQ for in-memory analysis on top of Hadoop.
• Stream-analysis option: Pivotal is working on a project aimed at integrating its GemFire (NoSQL) and SQLFire in-memory data grid capabilities with Pivotal Hadoop and Spring XD as a data-ingest mechanism to support scalable, streaming-data analysis.
• Hadoop distribution: Pivotal HD.
• Hardware/software systems: Pivotal Data Computing Appliance.
Pivotal HD is 100% Apache Hadoop compliant and supports all Hadoop Distributed File System (HDFS) file formats. In addition, Pivotal HD supports Apache Hadoop-related projects, including YARN (aka MapReduce 2.0), ZooKeeper and Oozie (for resource and workflow management), and Hive and HBase (for language and analytics support). [17]
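Because Pivotal HD tracks stock Apache Hadoop, code written against the standard HDFS API should run on it unchanged. A minimal sketch follows, assuming a reachable NameNode and an existing /data/sales directory (both illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch of touching HDFS through the standard Apache Hadoop API.
 * Because Pivotal HD is Apache Hadoop compliant, code like this needs no
 * vendor-specific changes. The NameNode address and paths are illustrative.
 */
public class HdfsListSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
        try (FileSystem fs = FileSystem.get(conf)) {
            // List every file in the directory with its size in bytes.
            for (FileStatus status : fs.listStatus(new Path("/data/sales"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}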
Pivotal GemFire® stores all operational data compressed and in-memory to avoid disk I/O time lags. Nodes operate in a cluster, optimizing data distribution and processing to ensure the highest speed and balanced utilization of system resources. Pivotal GemFire scales elastically and linearly: adding nodes increases capacity predictably. [18]
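A minimal client-side sketch of the put/get model GemFire exposes, assuming a running cluster with a locator at the address shown and a pre-defined "orders" region (both illustrative):

import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.client.ClientCache;
import com.gemstone.gemfire.cache.client.ClientCacheFactory;
import com.gemstone.gemfire.cache.client.ClientRegionShortcut;

/**
 * Minimal sketch of a GemFire client putting and getting a key in an
 * in-memory region. The locator address and region name are illustrative
 * and assume a running cluster that already defines the "orders" region.
 */
public class GemFireClientSketch {
    public static void main(String[] args) {
        ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("locator-host", 10334) // hypothetical locator
                .create();
        // PROXY regions hold no local state; every operation goes to the cluster.
        Region<String, String> orders = cache
                .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
                .create("orders");
        orders.put("order-42", "{\"item\":\"disk\",\"qty\":3}");
        System.out.println(orders.get("order-42"));
        cache.close();
    }
}

Because the data lives in memory across the cluster rather than on local disk, reads and writes like these avoid disk I/O entirely on the hot path.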
2.10. SAP
• Analytical DBMSs: SAP HANA, SAP IQ.
• In-memory DBMS: SAP HANA.
• Stream-analysis option: SAP Event Stream Processing.
• Hadoop distribution: Resells and supports Hortonworks and Intel; Hadoop integrations certified by Cloudera and MapR.
• Hardware/software systems: Multiple hardware configuration partners, including Dell, Cisco, Fujitsu, Hitachi, HP, and IBM.
Fig.8. SAP HANA architecture
SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP SE. [19] HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform. SAP HANA was previously called SAP High-Performance Analytic Appliance. [20]
The main process, called the index server, has the structure shown in Fig.8. The index server performs session management, authorization, transaction management, and command processing. Note that HANA has both a row store and a column store. Users can create tables using either store, but the column store has more capabilities. The index server also manages persistence between cached in-memory images of database objects, log files, and permanent storage files.
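The row-versus-column choice is made per table at creation time. A hedged JDBC sketch follows, with host, credentials, and table definitions as illustrative assumptions; CREATE COLUMN TABLE and CREATE ROW TABLE are standard HANA DDL, but should be verified against the installed release.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Minimal sketch of choosing between HANA's row store and column store at
 * table creation time over JDBC. Host, credentials, and table definitions
 * are illustrative.
 */
public class HanaStoreSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sap://hana-host:30015"; // hypothetical instance
        try (Connection con = DriverManager.getConnection(url, "SYSTEM", "password");
             Statement st = con.createStatement()) {
            // Column store: the usual choice for analytic scans and aggregates.
            st.execute("CREATE COLUMN TABLE SALES "
                     + "(ID INT PRIMARY KEY, AMOUNT DECIMAL(12,2))");
            // Row store: suited to small, frequently updated tables.
            st.execute("CREATE ROW TABLE SESSION_STATE "
                     + "(SID INT PRIMARY KEY, PAYLOAD NVARCHAR(200))");
        }
    }
}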
The authorization manager provides authentication and authorization services, and can secure access using the SAML, OAuth, or Kerberos authentication protocols.
The Extended Services (XS) Engine is a web server with privileged access to the database. Applications written in server-side JavaScript or as Java Servlets can be deployed to the XS Engine; these can be either HTML web applications or REST web service endpoints. Server-side JavaScript includes jQuery-based extensions for database access and for accessing HTTP request and response messages. The JavaScript engine is based on the Mozilla SpiderMonkey project. [21]
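To illustrate the servlet option, the sketch below shows the kind of REST-style endpoint such an engine could host: it turns one SQL query into a small JSON response. This is a generic servlet sketch, not XS-specific code; the constructor-injected DataSource is purely illustrative, since how the XS Engine actually wires database connections into deployed servlets is deployment-specific and not shown here.

import java.io.IOException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

/**
 * Minimal sketch of a REST-style Java servlet answering one query as JSON.
 * The DataSource wiring is illustrative; connection setup is
 * deployment-specific and assumed to be supplied by the container.
 */
public class SalesCountServlet extends HttpServlet {
    private final DataSource ds; // assumed to be supplied by the container

    public SalesCountServlet(DataSource ds) {
        this.ds = ds;
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("application/json");
        try (Connection con = ds.getConnection();
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM SALES")) {
            rs.next();
            // One row, one column: the total number of sales records.
            resp.getWriter().write("{\"sales\":" + rs.getLong(1) + "}");
        } catch (Exception e) {
            resp.sendError(500, e.getMessage());
        }
    }
}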
5. Conclusions
With data growing so rapidly, and with unstructured data now accounting for 90% of all data, the time has come for enterprises to re-evaluate their approach to data storage, management, and analytics. Legacy systems will remain necessary for specific high-value, low-volume workloads and will complement the use of Hadoop, optimizing the data management structure in your organization by putting the right Big Data workloads in the right systems. The cost-effectiveness, scalability, and streamlined architectures of Hadoop will make the technology more and more attractive. In fact, the need for Hadoop is no longer a question; the only question remaining is how best to take advantage of it. All of these tools provide a rich feature set ready for enterprise use. It will be up to the end user to do a thorough comparison and select the tool that best fits its needs.
References
[1] http://www.networkworld.com/article/2837779/big-data-business-intelligence/8-big-trends-in-big-data-analytics.html
[2] http://www.datamation.com/applications/big-data-analytics-overview.html
[3] http://www.actian.com/solutions/#customer-analytics-content
[4] http://aws.amazon.com/
[5] http://en.wikipedia.org/wiki/Amazon_Web_Services
[6] http://en.wikipedia.org/wiki/Cloudera
[7] http://www.apache.org/foundation/sponsorship.html
[8] Vance, Ashlee (16 March 2009). "Bottling the Magic Behind Google and Facebook". The New York Times.
[9] Monash, C.: "Are row-oriented RDBMS obsolete?", DBMS2, January 22, 2007.
[10] Monash, C.: "Mike Stonebraker on database compression – comments", DBMS2, March 24, 2007.
[11] Gagliordi, Natalie. "HP adds scale to open-source R in latest big data platform". ZDNet.
[12] Prasad, Shreya; Fard, Arash; Gupta, Vishrut; Martinez, Jorge; LeFevre, Jeff; Xu, Vincent; Hsu, Meichun; Roy, Indrajit (2015). "Enabling predictive analytics in Vertica: Fast data transfer, distributed model creation and in-database prediction". ACM SIGMOD International Conference on Management of Data (SIGMOD).
[13] http://www-01.ibm.com/software/data/puredata/analytics/nztechnology/analytics.html
[14] http://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
[15] Darrow, Barb (2011-10-03). "Oracle BigData Appliance stakes big claim".
[16] http://www.oracle.com/technetwork/database/bigdataappliance/overview/index.html
[17] http://pivotal.io/
[18] http://pivotal.io/big-data/pivotal-gemfire
[19] Kelly, Jeff (July 12, 2013). "Primer on SAP HANA". Wikibon. Retrieved October 9, 2013.
[20] http://en.wikipedia.org/wiki/SAP_HANA
[21] https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey
Mr. Ionuţ Ţăranu graduated from the Faculty of Cybernetics, Statistics and Economic Informatics of the Academy of Economic Studies in 1996, earning his Master's degree in "Database support for business". He is currently pursuing a doctorate in economics, specializing in "Soft-computing methods for early medical diagnosis". He was an Assistant Professor for four years at "Titu Maiorescu" University and for another four years at the Academy of Economic Studies in Bucharest. He has published a series of articles, of which the most important are "Applying ABCD Rule of Dermatoscopy using cognitive systems" and "ABCDE Rule in Dermoscopy – Registration and determining the impact of parameter E for evolution in diagnosing skin cancer using soft computing algorithms".
Mr. Ţăranu is currently the General Manager of the Stima Soft company. He has more than 15 years of experience as a project manager and business analyst, with over 13 years of expertise in software development, Business Process Management, Enterprise Architecture design, and outsourcing services. He is also involved in research projects, of which the most relevant are:
• Development of an intelligent system for predicting, analyzing, and monitoring performance indicators of technological and business processes in the renewable energy area;
• Development of an eHealth platform for improving quality of life and personalizing therapy for patients with diabetes;
• Development of an educational portal and integrated electronic education system at the "Carol Davila" University of Medicine and Pharmacy to develop medical performance in the dermatological oncology field.