JOURNAL OF EMERGING TECHNOLOGIES IN ACCOUNTING
Vol. 13, No. 1
Spring 2016
pp. 121–140
American Accounting Association
DOI: 10.2308/jeta-51463
Data Analytics: Introduction to Using Analytics in Auditing
David Y. Chan
St. John’s University
Alexander Kogan
Rutgers, The State University of New Jersey, Newark
ABSTRACT: This is a hands-on, introductory data analytics teaching case that can be used in an auditing or
related course. Students will learn about data attributes, data creation, structured query language (SQL), basic
statistics, and performing basic audit procedures using analytics by utilizing the open source software R. Instructors
can use this case for an in-class discussion or an independent out-of-class assignment. A solutions guide is available
in the Teaching Notes. Multimedia files are available for download; see Appendix B.
Keywords: auditing; data analytics; R; teaching case study.
INTRODUCTION
We are in the era of Big Data, and accounting and auditing professionals with data analytics skills are in high
demand. Accounting firms today expect graduates to have an appreciation for analytics and to understand where
analytics may be used (Ernst & Young Academic Resource Center [EYARC] Colloquium on Analytics in the
Accounting Curriculum [EY 2015]). However, many accounting programs do not expose their students to data analytics in their
curricula. The AACSB issued Standard A7 that provides guidance for accounting programs including learning experiences that
develop skills and knowledge relating to data creation, data sharing, data analytics, data mining, data reporting, and data storing
in an organization (AACSB 2014). The purpose of this data analytics case is to gently introduce and familiarize students with
the use of data analytics in general and its use in the accounting and auditing contexts. Furthermore, students will learn about
data attributes, data creation, structured query language (SQL), basic statistics, and performing basic audit procedures using
analytics. The case utilizes the free, open source software R (The R Foundation 2015). Instructors can use this case for an in-classroom, instructor-led discussion or an independent out-of-classroom student assignment. The case provides all the
necessary instructions, from setting up R to performing common audit-related procedures. The instructions are all-inclusive, and students will not need resources outside of this case and its related materials.
The scripts used in this case can be downloaded as a text file; see the link to "Scripts" in Appendix B.
Background Information
As a new staff auditor for a public accounting firm, you have been assigned to the XYZ Inc. audit engagement. XYZ is a
public company. The engagement partner has suggested the use of computer-assisted audit tools and techniques (CAATTs)
whenever possible in order to ensure that an effective and efficient audit is performed. A CAATT is computer software that
allows auditors to perform data analytics. Furthermore, CAATTs may aid the auditor in testing 100 percent of the transaction
population and automating or semi-automating the performance of audit procedures. While many audit procedures can be
performed manually or with the use of Microsoft Excel, manual methods are not always effective or efficient and the use of
spreadsheet software has data limitations. For example, Excel 2013 only allows for 1,048,576 rows (observations) and 16,384 columns (variables) (Microsoft 2015). The limitation on the number of columns may not be an issue, but the limitation on the number of rows may become a constraint.
We acknowledge and thank Miklos A. Vasarhelyi, Hui Du, the reviewers, and participants of the 2015 AIS Educator Conference and the 2015 AAA
Annual Meeting for the insightful comments and suggestions to advance and improve our paper.
Supplemental materials can be accessed by clicking the links in Appendix B.
Editor’s note: Accepted by Miklos A. Vasarhelyi.
Submitted: March 2015
Accepted: March 2016
Published Online: April 2016
Furthermore, computational performance issues will arise in Excel when analyzing large datasets.
The audit manager on the engagement has assigned to you specific audit procedures to be performed within the revenue
cycle of the audit program. The audit procedures assigned include footing, re-computing, scanning, sample selection, and an
analytical procedure. There are two popular CAATTs in the audit software market: ACL and IDEA. However, the firm prefers
the use of the data analytical software R because of its open source nature and also the versatility of the analysis in the software.
The use of open source software is becoming more mainstream with many different organizations (Deloitte 2015).
Furthermore, many students are familiar with R from their statistics courses. Interfacing or communicating with R is done
through coding or scripting, unlike in Excel where there is a graphic user interface (GUI). Unlike CAATTs that utilize GUIs,
the benefit of a scripting interface is its inherent ability to document, review, and reproduce the path of analysis. Scripting
allows firms and regulators to perform reviews of the audit work. In R, users write the script and then execute the script. While
there is initially a steeper learning curve compared with GUI software, scripting software allows the auditor to reuse scripts in
subsequent analysis and therefore makes analysis more efficient.
Before proceeding with any audit procedures or analysis, you will have to install the R software on your computer. R is
compatible with Windows, Apple, and Linux computers. In this case, we will use the Windows version for demonstration
purposes. Please refer to Appendix A for detailed instructions on installing R.
Installing and Loading Packages
The R software comes with pre-installed add-on packages for basic analysis. However, R has a long-standing open source
community that develops packages for more advanced data manipulation and analysis. R has an extensive number of free
packages that are contributed by the open source community. ‘‘Free’’ does not in any way mean that the packages are inferior.
Users in the community may have created a specific feature that was not available in the basic software and wanted to share the
developed feature with the community. Some packages are developed for bleeding-edge analytics. The contribution from the
open source community and the variety of available analytics differentiates R from other analytical software. You will need to
download, install, and load two packages ("sqldf" and "forecast") for the exercises in this case. You can copy and paste the scripts from the "Scripts" text document (and as shown below) into the R Editor window to download and install the two packages. A video that shows how scripts are executed in R is available for download; see Appendix B.
Script 1:
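The executable code ships in the supplemental "Scripts" file, which remains the authoritative source; a minimal sketch of what Script 1 likely contains:

    install.packages(c("sqldf", "forecast"))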
After copying and pasting the code into the R Editor window, highlight the pasted script, press and hold "Ctrl," and then press "R" on the keyboard. This will send the highlighted code to the R Console window and execute it. Two "Question" dialog boxes may pop up. You should click "Yes" in both dialog boxes. A new window, "HTTPS CRAN mirror," will pop up. Select the CRAN mirror closest to your area and click "OK." For example, you can select USA (CA 1) if you are located near California, United States. The software will automatically begin to download and install the two packages from the selected mirror. Packages only need to be installed once.
After installing the two packages, the "sqldf" and "forecast" packages will need to be loaded. These packages have to be loaded every time the R software is started. To load the two packages, enter the two-line script below into the R Editor window, highlight the two lines, press and hold "Ctrl," and then press "R" on the keyboard. The scripts will execute and the packages will be loaded into the R software. Notice the code passes to the R Console window and executes.
Script 2:
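A sketch of the likely two-line loading script:

    library(sqldf)
    library(forecast)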
Setting Format Output Options
R is designed for statistical analysis, and occasionally the software renders financial or nonfinancial values in scientific (exponential) notation, such as "1e+06." You will need to command R to use the fixed-point format instead of the exponential format, since fixed-point notation is commonly used when dealing with monetary/dollar values. The fixed-point format will output number values in the numeric format by default. The various data attributes will be discussed in the "Data Cleaning" section. Enter the following script into the R Editor window, highlight the script, press and hold "Ctrl," and then press "R" on the keyboard. This option command has to be executed every time the R software is started.
Script 3:
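A plausible version of Script 3, using R's scipen option to penalize scientific notation (the distributed script may differ in the exact penalty value):

    options(scipen = 999)  # prefer fixed-point output over exponential notation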
The R software is now set up and ready for use in this case.
Importing Data
Obtaining data is the first step in data analytics. In an audit, the client is responsible for providing the auditors with
financial data to audit. The client has provided the engagement team data from their sales journal, shipping journal, cash
journal, and customer master file. The client has also provided the aggregated revenue account balances for the years 2002 through 2012. The files are provided to the auditors in five comma-separated value (CSV) files. CSV is an open format that is commonly used to transport data and can be easily read by analytic software.
Use the R scripts below to import these five CSV files into R. The scripts will pull the respective CSV files from a hosted server (also see the links to Sales1, Ship1, Cash1, Customer1, and Revenue in Appendix B). Once imported, R will store the data in dataframes. Dataframes are similar to a database table or an Excel sheet. Enter the scripts below into the R Editor window, highlight them, press and hold "Ctrl," and then press "R" on the keyboard.
Script 4:
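A sketch of the likely import script. Only the Sales1 URL is spelled out in the definition table below; the other four file names are assumptions that follow the same pattern:

    Sales1 <- read.csv('http://davidchan.net/data/Sales1.csv', header = TRUE, sep = ",")
    Ship1 <- read.csv('http://davidchan.net/data/Ship1.csv', header = TRUE, sep = ",")          # assumed URL
    Cash1 <- read.csv('http://davidchan.net/data/Cash1.csv', header = TRUE, sep = ",")          # assumed URL
    Customer1 <- read.csv('http://davidchan.net/data/Customer1.csv', header = TRUE, sep = ",")  # assumed URL
    Rev1 <- read.csv('http://davidchan.net/data/Revenue.csv', header = TRUE, sep = ",")         # assumed URL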
Script Definition

Code                                       Definition
read.csv                                   Reads a comma-separated value ("csv") file
'http://davidchan.net/data/Sales1.csv'     File location
header = TRUE                              Variable names are included at the top of the columns
sep = ","                                  Data is separated by commas
After importing the data into R, the "View" command can be used to open a dataframe. For example, to view the imported "Sales1" dataframe, enter the script shown below into the R Editor window, highlight it, press and hold "Ctrl," and then press "R" on the keyboard:
Script 5:
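The likely one-liner:

    View(Sales1)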
Data Diagnostics
Once the CSV data files have been imported into R, you should run some preliminary data diagnostics/checks. The purpose
of running diagnostic checks is to determine whether the files have been imported correctly and whether the data is what you
will need for performing the procedures. You can use an array of diagnostic techniques to check for the number of rows
(observations) and columns (variables), verify the variable names, and view and inspect the first six rows of data or the last six
rows of data. Diagnosing the data is important because you want to confirm that you are analyzing the data you expect to analyze.
Enter the scripts below into the R Editor window, highlight them, press and hold "Ctrl," and then press "R" on the keyboard to show the number of rows (nrow) and columns (ncol) in the "Sales1" dataframe:
Script 6:
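A sketch:

    nrow(Sales1)
    ncol(Sales1)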
Checkpoint 1
(1) How many rows are in the ‘‘Sales1’’ dataframe?
(2) How many columns are in the ‘‘Sales1’’ dataframe?
Enter and execute the script below into R to show the variable (column) names in the ‘‘Sales1’’ dataframe:
Script 7:
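Likely:

    names(Sales1)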
Checkpoint 2
(1) List the variable names in the "Sales1" dataframe.
Enter and execute the scripts below into R to show the first six rows (head) and last six rows (tail) of the ‘‘Sales1’’
dataframe:
Script 8:
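A sketch:

    head(Sales1)
    tail(Sales1)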
Checkpoint 3
(1) What is the third ‘‘Sales_Order_No’’ in the ‘‘Sales1’’ dataframe?
(2) What is the last ‘‘Sales_Order_No’’ in the ‘‘Sales1’’ dataframe?
Data Cleaning
The client may provide the auditors with the data needed for auditing, but the data may not be in a format that is usable by R for analysis. Data cleaning is necessary to get the data into a format that R can use and compute from. For example, the client may provide monetary accounting data with commas ("1,500"). Numeric data with commas are recognized as text, or as the character format, in R. R cannot do calculations on variables identified as character variables, so they will need to be converted into the numeric format. There are five basic data formats: (1) numeric, (2) integer, (3) character, (4) factor, and (5) date. Below are examples of each:
Format Type   Example
Numeric       837223, 123.23, 2320840.98 (can have decimals)
Integer       235, 8372, 23208 (no decimals)
Character     data, auditing, Main Street
Factor        0, 1
Date          2014-12-11
If you open the "Sales1," "Cash1," and "Customer1" dataframes, you will notice variables that contain numeric values with commas in them. R may therefore categorize these numeric variables as either Character or Factor variables. Before any subsequent analysis on these variables can occur in R, the commas have to be removed and the variables converted into the numeric format.
First, consider the Sales Journal Dataframe (Sales1):
R has a function called "str" that shows the data format of each variable in a dataframe. Enter the script below into the R Editor window, highlight it, press and hold "Ctrl," and then press "R" on the keyboard:
Script 9:
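Likely:

    str(Sales1)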
Notice the "Invoice_Amount" variable in the "Sales1" dataframe is categorized as a Factor variable (Figure 1).
A numeric variable cannot have commas between the digits. You will need to strip the commas from the numeric values in the "Invoice_Amount" variable and convert the variable into a numeric variable using the following code:
FIGURE 1
Sales1 Dataframe
Script 10:
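A plausible reconstruction, assembled from the pieces in the definition table below:

    Sales1$Invoice_Amount <- as.numeric(gsub(",", "", Sales1$Invoice_Amount))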
Script Definition

Code                      Definition
Sales1$Invoice_Amount     Variable "Invoice_Amount" in the "Sales1" dataframe
as.numeric                Convert the variable to the numeric data type
gsub                      Find-and-replace (remove-comma) function
",", ""                   Replace each comma with nothing
Next, let's consider the Cash Receipts Journal dataframe:
In the Cash Receipts dataframe, the "Invoice_Amount" and "Payment_Received" variables are both categorized as Character variables but should be Numeric variables. Again, this is due to the commas between the digits. You will need to strip the commas from the variables and convert them into Numeric variables using the following scripts:
Script 11:
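A sketch of the likely conversions:

    Cash1$Invoice_Amount <- as.numeric(gsub(",", "", Cash1$Invoice_Amount))
    Cash1$Payment_Received <- as.numeric(gsub(",", "", Cash1$Payment_Received))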
Checkpoint 4
(1) Remove the commas and convert the following two variables into numeric variables: "Customer_Balance" and "Customer_Max_Credit_Amount" in the Customer Master File dataframe "Customer1."
Next, determine whether the other variables in the dataframes are categorized in the correct data type. As you recall, a
variable can be a Numeric, Integer, Character, Date, or Factor data type. If a variable is not categorized correctly, then you will
need to convert the variable into the correct type.
Enter and execute the scripts below in R to determine the data type for each of the variables in each of the respective
dataframes:
Script 12:
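Likely one str() call per dataframe:

    str(Sales1); str(Ship1); str(Cash1); str(Customer1); str(Rev1)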
You will notice that the dataframe "Sales1" has two variables in an incorrect format: (1) "Invoice_Date" and (2) "Invoice_Description" (Figure 2). R is categorizing "Invoice_Date" and "Invoice_Description" as Factor variables. "Invoice_Date" should be in the Date format and "Invoice_Description" should be in the Character format, so you will need to convert "Invoice_Date" from the Factor format to the Date format and "Invoice_Description" from the Factor format to the Character format. The "Invoice_Description" variable describes the type of sale and therefore belongs in the Text or Character format. Here are the scripts to do so:
FIGURE 2
Sales1 Dataframe
Script 13:
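A plausible version, assuming the dates are stored in the ISO yyyy-mm-dd form shown in the format table above:

    Sales1$Invoice_Date <- as.Date(Sales1$Invoice_Date)
    Sales1$Invoice_Description <- as.character(Sales1$Invoice_Description)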
In the Shipping dataset, you will need to convert the variable ‘‘Shipping_Number’’ to the Character format and the variable
‘‘Shipping_Date’’ to the Date format.
Script 14:
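A sketch:

    Ship1$Shipping_Number <- as.character(Ship1$Shipping_Number)
    Ship1$Shipping_Date <- as.Date(Ship1$Shipping_Date)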
In the Customer dataset, you need to convert the variable "Customer_No" into the Integer format; "Customer_Name," "Customer_Address," "Customer_City," and "Customer_State" into the Character format; "Customer_Credit_Rating" into the Factor format; and "Customer_Max_Credit_Amount" into the Numeric format.
Script 15:
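A plausible reconstruction:

    Customer1$Customer_No <- as.integer(Customer1$Customer_No)
    Customer1$Customer_Name <- as.character(Customer1$Customer_Name)
    Customer1$Customer_Address <- as.character(Customer1$Customer_Address)
    Customer1$Customer_City <- as.character(Customer1$Customer_City)
    Customer1$Customer_State <- as.character(Customer1$Customer_State)
    Customer1$Customer_Credit_Rating <- as.factor(Customer1$Customer_Credit_Rating)
    Customer1$Customer_Max_Credit_Amount <- as.numeric(gsub(",", "", Customer1$Customer_Max_Credit_Amount))  # a no-op if Checkpoint 4 was completed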
Checkpoint 5
(1) In the Cash dataset, convert the variable ‘‘Payment_Date’’ to the Date format.
Finally, you should re-examine all the dataframes and determine whether all the variables are in the correct data type before
you continue. Enter the following scripts and verify:
Script 16:
Structured Query Language
Structured Query Language (SQL) is a standard language for relational database management systems (DBMSs). Auditors can use the language to access data, make queries, create new tables (dataframes), and manipulate data in a database. For the
purpose of this case study and its use with R, focus on the latter three. You will need to understand how to use the SQL SELECT statement. This statement starts with the SELECT keyword followed by a comma-separated list of the variables that will be displayed in the result set generated by the statement. This is followed by the FROM clause, which lists the dataframes required to construct the result set. The statement may then have a WHERE clause that provides the conditions the result set must satisfy. For example, suppose you want to SELECT the variables "Invoice_No" and "Invoice_Amount" FROM the "Sales1" dataframe WHERE "Invoice_Amount" is greater than 1500. This SQL command would return the variables "Invoice_No" and "Invoice_Amount" for the rows where "Invoice_Amount" is greater than 1500. Further examples are articulated below.
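For instance, with the sqldf package loaded, this example could be written as:

    sqldf("SELECT Invoice_No, Invoice_Amount FROM Sales1 WHERE Invoice_Amount > 1500")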
Create New Dataframes
Prior to analyzing data, create a new dataframe by extracting the relevant variables from pre-existing dataframes. The purpose of creating a new dataframe is twofold: (1) it preserves the original data, and (2) it speeds up subsequent analysis, since the new dataframe is smaller and consists only of the relevant variables.
In auditing accounting information, you need to select variables in various data types (e.g., Numeric, Character, and Date).
Below are examples of the SQL queries to extract these various data types. First, create a new dataframe called SQL1 with the
sales journal information of a specific invoice. This can be accomplished by a SQL query for the specific invoice number. The
data type of the variable "Invoice_No" in the dataframe is Numeric. In the example below, SELECT all the variables FROM the dataset "Sales1" WHERE "Invoice_No" is 1031. The asterisk (*) after the SELECT keyword requests that all the variables be selected.
Script 17:
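A plausible query:

    SQL1 <- sqldf("SELECT * FROM Sales1 WHERE Invoice_No = 1031")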
Second, create a new dataframe called SQL2 with the customer master file information for the customer Arch Coal, Inc. Note that values of Character data type variables have to be enclosed in single quotation marks ('').
Script 18:
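A sketch (the exact spelling of the stored customer name may differ):

    SQL2 <- sqldf("SELECT * FROM Customer1 WHERE Customer_Name = 'Arch Coal, Inc.'")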
Third, create two new dataframes called SQL3A and SQL3B for transactions that fall within a specific range of dates. Before writing these queries, you need to create a new dataframe called Sales2 and convert the variable "Invoice_Date" into the Character data type; R is not able to compare the dates in these queries without this conversion.
Script 19:
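A plausible reconstruction:

    Sales2 <- Sales1
    Sales2$Invoice_Date <- as.character(Sales2$Invoice_Date)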
As a Character data type, each date has to be enclosed in single quotation marks (''). For the new dataframe SQL3A, select all the invoices from the Sales Journal dated before July 1, 2010. For the new dataframe SQL3B, select all the invoices from the Sales Journal dated between July 1, 2010 and September 30, 2010.
Script 20:
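A sketch; with ISO-formatted date strings, the text comparisons below order correctly:

    SQL3A <- sqldf("SELECT * FROM Sales2 WHERE Invoice_Date < '2010-07-01'")
    SQL3B <- sqldf("SELECT * FROM Sales2 WHERE Invoice_Date BETWEEN '2010-07-01' AND '2010-09-30'")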
Checkpoint 6
(1) Create a new dataframe called SQL3C with the customer master file information of the customer ‘‘MSCI, Inc.’’
(2) Open the dataframe SQL3C. What is the Customer Max Credit Amount?
Dataframes may need to be merged or joined for analysis. You can use SQL to join two different dataframes on a matching unique identifier that is common to both. You can perform a left join or an inner join. In a left join, every row from the first dataframe is kept and is matched, where possible, with rows from the second dataframe; any row in the second dataframe without a match in the first is excluded (see Figure 3).
FIGURE 3
Left Join
An inner join includes only the rows that have a match in both dataframes (see Figure 4). Further examples are articulated below.
In the example below, you will create a new dataframe called SQL4A by joining the two dataframes SQL3A and Cash1. You want to keep all the rows in SQL3A and attach the matching rows from Cash1; rows in Cash1 without a match will be excluded. Therefore, you will need to use the left join statement. The two unique identifiers that are common to both dataframes are "Customer_No" and "Invoice_No." Use both identifiers because a customer can have multiple invoices.
Script 21:
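A plausible left-join query:

    SQL4A <- sqldf("SELECT * FROM SQL3A LEFT JOIN Cash1
                    ON SQL3A.Customer_No = Cash1.Customer_No
                    AND SQL3A.Invoice_No = Cash1.Invoice_No")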
Next, create a new dataframe called SQL4B by joining the two dataframes SQL3B and Cash1. You want a dataframe containing only those rows that match in both SQL3B and Cash1. Therefore, use the inner join statement. Again, join on both "Customer_No" and "Invoice_No" because a customer can have multiple invoices.
Script 22:
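A plausible inner-join query:

    SQL4B <- sqldf("SELECT * FROM SQL3B INNER JOIN Cash1
                    ON SQL3B.Customer_No = Cash1.Customer_No
                    AND SQL3B.Invoice_No = Cash1.Invoice_No")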
FIGURE 4
Inner Join
Data Manipulation
When doing analysis, it is sometimes necessary to create a new variable out of one or more preexisting variables. For example, you may want to create a count variable based on the number of days between two dates, or a Factor variable that indicates the occurrence of a specific attribute.
Before manipulating any data, it is a good rule of thumb to create a duplicate dataframe in order to preserve the content of the original. Here you will create a new dataframe called SQL5 in order to preserve the original data in SQL4A. You also need to convert the variable "Invoice_Date" back into the Date data type (it was changed above) and convert the variable "Payment_Date" into the Date format. Using the data in SQL5, you want a binary (factor) indicator equal to 1 when a payment is made within the first 10 days and 0 otherwise. First, you need to calculate the number of days between the invoice and payment dates; call this count variable "Diff_Date."
Script 23:
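A sketch of the likely steps:

    SQL5 <- SQL4A
    SQL5$Invoice_Date <- as.Date(SQL5$Invoice_Date)
    SQL5$Payment_Date <- as.Date(SQL5$Payment_Date)
    SQL5$Diff_Date <- as.numeric(SQL5$Payment_Date - SQL5$Invoice_Date)  # days from invoice to payment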
Before proceeding, you will need to remove the additional variable "X" created as a result of the join above; R will be confused because there are two columns named "X." Remove the duplicate variable in column 8.
Script 24:
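One plausible way to drop the duplicate, assuming it sits in position 8 as the text states:

    SQL5 <- SQL5[, -8]  # remove the column in position 8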
Next, create a new variable called "Early" by using the "Diff_Date" variable calculated above and the SQL CASE statement; "Early" is a Factor variable that indicates whether a payment was made within 10 days. You will also need to convert the newly created variable "Early" into the Factor data type.
Script 25:
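A plausible reconstruction:

    SQL5 <- sqldf("SELECT *, CASE WHEN Diff_Date <= 10 THEN 1 ELSE 0 END AS Early FROM SQL5")
    SQL5$Early <- as.factor(SQL5$Early)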
Similarly, you may want to create a new variable called "Late" to identify payments that were made after more than 30 days and are thus considered late.
Script 26:
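A sketch mirroring the "Early" script:

    SQL5 <- sqldf("SELECT *, CASE WHEN Diff_Date > 30 THEN 1 ELSE 0 END AS Late FROM SQL5")
    SQL5$Late <- as.factor(SQL5$Late)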
Summary Statistics
Before performing audit procedures on the data, you may want to run some basic statistics on the variables that you will use. In R, you can use the summary() function to get a quick overview of basic statistical information about the individual variables in a dataframe.
Script 27:
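Likely:

    summary(SQL5)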
For each variable, R can also produce specific statistical measures such as mean, median, quantile, and range.
Script 28:
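A sketch, using "Invoice_Amount" as the example variable:

    mean(SQL5$Invoice_Amount)
    median(SQL5$Invoice_Amount)
    quantile(SQL5$Invoice_Amount)
    range(SQL5$Invoice_Amount)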
R can also be used to produce plots and charts. Charts can provide useful visualizations of data and can be used to look for anomalies or outliers.
Create a histogram of the variable "Invoice_Amount" in the Sales Journal:
Script 29:
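Likely:

    hist(Sales1$Invoice_Amount)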
Create a boxplot with the "Invoice_Amount" and "Customer_Credit_Rating" variables:
Script 30:
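A sketch, assuming both variables are available in a single dataframe (a joined dataframe would be needed if they are not):

    boxplot(Invoice_Amount ~ Customer_Credit_Rating, data = Sales1)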
Create a Scatterplot of the variable ‘‘Invoice_Amount’’ in the Sales Journal:
Script 31:
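Likely:

    plot(Sales1$Invoice_Amount)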
When performing analytical procedures during an audit, the goal is to look at the relationships among the data. Certain accounts on the financial statements, or related financial and nonfinancial information, can be expected to be positively or negatively related to each other. You can use a correlation test to determine the extent to which two Numeric variables are related. In the example below, find the relationship between the variables "Customer_Max_Credit_Amount" and "Customer_Credit_Rating." You would expect a positive relationship between these two variables: generally, a company grants higher credit limits to customers with higher credit ratings. Before proceeding with the correlation test, you will need to convert the "Customer_Credit_Rating" variable into a Numeric variable.
Script 32:
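A plausible reconstruction; as.numeric() on a factor returns the underlying level codes, which serves as a rough numeric rating scale:

    Customer1$Customer_Credit_Rating <- as.numeric(Customer1$Customer_Credit_Rating)
    cor(Customer1$Customer_Max_Credit_Amount, Customer1$Customer_Credit_Rating)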
Audit Application
Common audit procedures include footing, scanning, and re-computing the totals of a column. You can use the sum() function in R to add up a column.
Script 33:
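Likely:

    sum(Sales1$Invoice_Amount)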
Checkpoint 7
(1) What is the footed amount for the ‘‘Invoice_Amount’’ variable?
Scanning for irregular transactions is another popular audit procedure. Before scanning, you may want to sort the data by a specific variable. For example, create a new dataframe called Sort1 with the data from dataframe Sales1 sorted by the "Sales_Order_No" variable:
Script 34:
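A sketch:

    Sort1 <- Sales1[order(Sales1$Sales_Order_No), ]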
Sorting can also be done on two variables. For example, you can sort by "Sales_Order_No" and then by "Invoice_Amount."
Script 35:
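Likely:

    Sort2 <- Sales1[order(Sales1$Sales_Order_No, Sales1$Invoice_Amount), ]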
Sorting can also be done in descending order. For example, use the code shown below to sort the Sales1 dataframe from low to high on the "Sales_Order_No" variable and from high to low on the "Invoice_Amount" variable. To sort a variable in descending order, add a minus sign ("-") in front of it; here, add the "-" in front of "Invoice_Amount."
Script 36:
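A sketch:

    Sort3 <- Sales1[order(Sales1$Sales_Order_No, -Sales1$Invoice_Amount), ]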
Recalculation of variables can be performed in R as well. If you want to confirm whether the discounts taken by customers were calculated correctly, simply re-compute the discounts and determine whether there are any differences from your own computations.
First, create a new dataframe called AUD1 that includes only the transactions that were paid early; only invoices paid early should have a discount. Use the "Early" variable created above as the constraint.
Script 37:
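A plausible query (if "Early" is stored as a factor/character, the comparison value may need quoting as '1'):

    AUD1 <- sqldf("SELECT * FROM SQL5 WHERE Early = 1")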
The discount rate is 2 percent. Recalculate the discount by multiplying the invoice amount by the discount rate; call the new variable "Recal1." You will also need to round the computed discount and convert it into a Numeric variable called "Recal2."
Script 38:
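A sketch:

    AUD1$Recal1 <- AUD1$Invoice_Amount * 0.02
    AUD1$Recal2 <- as.numeric(round(AUD1$Recal1, 2))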
Last, calculate whether there is a difference between your calculation of the discount and what the client took as the discount.
Script 39:
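A sketch; the client's discount is assumed here to be derivable as the invoice amount less the payment received, and the variable name Client_Disc is hypothetical:

    AUD1$Client_Disc <- AUD1$Invoice_Amount - AUD1$Payment_Received  # assumed derivation
    AUD1$Diff <- AUD1$Recal2 - AUD1$Client_Disc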
Checkpoint 8
(1) Was there a difference between the client’s calculated discounts and your calculation?
R can also be used for making sample selections.
To take a random sample of three observations from the "Sales1" dataframe, enter the code below into R.
Script 40:
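A plausible version:

    Sample1 <- Sales1[sample(nrow(Sales1), 3), ]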
R can also be used to select a systematic sample. Below you will select every fifth row in the dataframe ‘‘Sales1’’:
Script 41:
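A sketch:

    Sample2 <- Sales1[seq(5, nrow(Sales1), by = 5), ]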
A stratified sample can also be taken. Since base R does not include a function for stratified sample selection, a custom function is required. Copy and paste the code below into the R Editor and execute it:
Script 42:
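A compact stand-in for the referenced custom function; the original (see the link below) is longer, but this version accepts the same four inputs described later in the text:

    stratified <- function(df, id, group, size) {
      # df: the dataframe; id: column number of the unique ID (kept to match
      # the inputs described in the text); group: column number of the stratum
      # variable; size: number of rows to draw from each stratum
      rows_by_stratum <- split(seq_len(nrow(df)), df[[group]])
      picked <- unlist(lapply(rows_by_stratum, function(rows)
        rows[sample.int(length(rows), min(size, length(rows)))]))
      df[picked, ]
    }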
For more, please see: http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/.
In stratified sample selection, the population is divided into groups, or strata, and selections are made within each stratum. Continue with the dataframe SQL5 and use the factor variable "Early" created above as the stratum indicator for this exercise. There are two strata in the "Early" variable: (1) customers that paid early and (2) customers that did not. If you did not know the number of groups in a variable, you would want to determine that first.
Script 43:
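Likely:

    levels(SQL5$Early)    # strata of a factor variable
    # unique(SQL5$Early)  # alternative that works for any variable type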
Next, make your stratified sample selection using the custom function above. The function requires the following inputs: the dataframe, the column number of the unique ID variable, the column number of the grouping variable, and the sample size. In this example, use the SQL5 dataframe, the "Customer_No" variable (column 2), the "Early" variable (column 14), and a sample size of 3. The column numbers of the variables can be obtained with the names() function in R.
Script 44:
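A sketch of the likely call:

    Sample3 <- stratified(SQL5, 2, 14, 3)  # dataframe, ID column, stratum column, size per stratum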
Checkpoint 9
(1) How many observations are in the sample selected?
Analytical procedures can be used to determine whether changes in account balances are in line with expectations. In R you can perform a time-series analysis to statistically predict, or estimate, the expected future value of an account balance using historical data. To illustrate, use the client's revenue account balances for the past 11 years (2002 to 2012) to estimate the revenue balances for subsequent years. For this exercise, an exponential smoothing time-series model will be used to make the prediction.
You will use the variable "Revenue" in the "Rev1" dataframe for this analysis. Enter the script below into the R Editor to create the time-series vector, called MYTS. You will need the dataframe and variable, the start period, the end period, and the frequency of the periods. The revenue balances are annual, so the frequency is one.
Script 45:
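A plausible version:

    MYTS <- ts(Rev1$Revenue, start = 2002, end = 2012, frequency = 1)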
Next, use the exponential smoothing function "HoltWinters" in R to fit the model and store the estimation information in the object "FIT." You will need the time-series vector "MYTS" created above, and you will need to set the gamma option to FALSE.
Script 46:
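Likely:

    FIT <- HoltWinters(MYTS, gamma = FALSE)  # gamma = FALSE: no seasonal component (annual data)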
Once the model has been fitted, you can obtain the estimated revenue balances for the years 2013, 2014, and 2015 using the forecast() function. You can also visualize the estimates with the plot() function.
Script 47:
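A sketch; h = 3 requests forecasts for 2013 through 2015:

    FC <- forecast(FIT, h = 3)
    FC
    plot(FC)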
Checkpoint 10
(1) What is the forecasted revenue account balance for 2013, 2014, and 2015?
CONCLUSION
The utilization of CAATTs to perform audit procedures can increase an audit's effectiveness and efficiency over manual audit procedures. The efficiency gained through using CAATTs allows auditors to reallocate their time to the evaluation of exceptions or anomalies. As we move toward automated or continuous audits, the majority of the auditor's work will be exercising professional judgment. Although learning how to use analytical tools takes an investment of time, the evolution of the skillsets needed in auditing is clear, and dependence on data analytics will continue to grow. As a result, the demand for students with data analytics skills will only increase. After completing the case, you should have a greater appreciation for the use of data analytics in general, and in the auditing context specifically. Furthermore, you should have gained an understanding of data attributes, data creation, structured query language (SQL), the use of basic statistics, and the performance of basic audit procedures using analytics.
REFERENCES
Association to Advance Collegiate Schools of Business (AACSB). 2014. Information Technology Skills and Knowledge for Accounting Graduates: An Interpretation. AACSB International Accounting Accreditation Standard A7. Tampa, FL: AACSB. Available at: http://www.aacsb.edu/~/media/AACSB/Publications/white-papers/accounting-accreditation-standard-7.ashx
Deloitte. 2015. Analytics Trends 2015 – A Below-the-Surface Look. Available at: http://public.deloitte.com/media/analytics/trends/analytics-trends.html
Ernst & Young (EY). 2015. Colloquium on Analytics in the Accounting Curriculum. The 2015 Ernst & Young Academic Resource
Center (EYARC). London, UK: EY.
Microsoft. 2015. Excel Specifications and Limits. Available at: https://support.office.com/en-ie/article/Excel-specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f
The R Foundation. 2015. The R Project for Statistical Computing. Available at: https://www.r-project.org/
APPENDIX A
Download R
In order to complete the assignments in this case study, you will need to download and install the R software.
Download R: http://cran.stat.ucla.edu/
You may download the Windows, Mac, or Linux versions of the software but note that the interface may be different from
what is illustrated below. The instructions below are designed for the R Windows edition of the software.
A video that shows how to download R is available for download, see the link to ‘‘Download_R’’ in Appendix B.
Install R
The installation of R is similar to how you install other software. Below are the step-by-step screenshots of the installation
process.
A video that shows how to install R is available for download, see the link to ‘‘Install_R’’ in Appendix B.
Double-click the downloaded R software *.exe file and click "Next":1
Accept the defaults and click "Next":
1 All images are author screenshots of the installation process for R for Windows 3.1.2.
Accept default and click ‘‘Next’’:
After R installs, click "Finish":
Setup R
Open the R software from the Start window or using the Desktop icon. The R software will open and the R Console
window should be prominently displayed.
A video that shows how to set up R is available for download, see the link to ‘‘Set_Up_R’’ in Appendix B.
R Console Window:
Click ‘‘File’’ from the menu and click ‘‘New Script’’:
The R Editor window will open and will be prominently displayed:
Click ‘‘Windows’’ from the menu and then click ‘‘Tile Horizontally’’:
The screen below will now show the ‘‘R Editor’’ window on top of the ‘‘R Console’’ window:
The R Editor window is used to read, write, and edit R code or scripts, and the R Console is used to execute the scripts and display the output of the analysis. Tiling the windows allows the user to see the scripts, their execution, and the output of the analysis in an efficient manner.
APPENDIX B
Scripts: http://dx.doi.org/10.2308/jeta-51463.s01
Sales1: http://dx.doi.org/10.2308/jeta-51463.s02
Ship1: http://dx.doi.org/10.2308/jeta-51463.s03
Cash1: http://dx.doi.org/10.2308/jeta-51463.s04
Customer1: http://dx.doi.org/10.2308/jeta-51463.s05
Revenue: http://dx.doi.org/10.2308/jeta-51463.s06
Execute_R_scripts: http://dx.doi.org/10.2308/jeta-51463.s07
Download_R: http://dx.doi.org/10.2308/jeta-51463.s08
Install_R: http://dx.doi.org/10.2308/jeta-51463.s09
Set_Up_R: http://dx.doi.org/10.2308/jeta-51463.s10
CASE LEARNING OBJECTIVES AND IMPLEMENTATION GUIDANCE
Learning Objectives
• Introduce students to data analytics in the accounting and auditing context.
• Develop an understanding of data attributes, data creation, structured query language (SQL), basic statistics, and the performance of basic audit procedures using analytics.
• Provide hands-on practice in using the open source analytical software R.
Instructor Guide
This case has been successfully used in an external audit, internal audit, and fraud examination course. Furthermore, the
case has been facilitated both as an in-class, instructor-led discussion and as a student take-home assignment. The authors found
that the latter was more effective and efficient as the students were able to work the case at their own pace. The case includes all
the necessary instructions and related materials for students to complete the case with little or no instruction from the instructor. The instructor can use the associated checkpoints in the case to verify that students have completed the case. If an instructor prefers to use the case as an in-class, instructor-led discussion, then the authors found that allocating two to three hours of class time should be adequate. However, there will be a positive correlation between the number of students in the class and the amount of time that should be allocated: there will likely be more questions when there are more students, and therefore more time may be required. For efficiency purposes, it is highly recommended that the students: (1) install R on their
own computers, (2) download the required materials, and (3) confirm that their computers can access the Internet before class.
The instructor may also use the case in a hybrid approach. For the hybrid approach, the case can be discussed in class and the
students are then assigned to work on the case at home.
TEACHING NOTES AND STUDENT VERSION OF THE CASE
Teaching Notes and the Student Version of the Case are available only to non-student-member subscribers to Journal of
Emerging Technologies in Accounting through the American Accounting Association’s electronic publications system at http://
www.aaapubs.org/. Non-student-member subscribers should use their usernames and passwords for entry into the system
where the Teaching Notes can be reviewed and printed. The ‘‘Student Version of the Case’’ is available as a supplemental file
that is posted with the Teaching Notes. Please do not make the Teaching Notes available to students or post them on websites.
If you are a non-student-member of AAA with a subscription to Journal of Emerging Technologies in Accounting and have
any trouble accessing this material, please contact the AAA headquarters office at info@aaahq.org or (941) 921-7747.
Database Systems Journal vol. VI, no. 1/2015
Big Data Analytics Platforms analyze from startups to traditional database players
Ionuţ ŢĂRANU
Bucharest University of Economic Studies
ionut.tanaru@gmail.com
Big data analytics enables organizations to analyze a mix of structured, semi-structured and
unstructured data in search of valuable business information and insights. The analytical
findings can lead to more effective marketing, new revenue opportunities, better customer
service, improved operational efficiency, competitive advantages over rival organizations and
other business benefits. With so many emerging trends around big data and analytics, IT
organizations need to create conditions that will allow analysts and data scientists to
experiment. “You need a way to evaluate, prototype and eventually integrate some of these
technologies into the business,” says Chris Curran[1]. In this paper we are going to review 10
Top Big Data Analytics Platforms and compare the key-features.
Keywords: Big data, In-memory, Hadoop, Data analysis
1. Introduction
The growth of data – both structured and unstructured – will present challenges as well as opportunities for organisations over the next five years.
With growing data volumes, it is essential
that real-time information that is of use to
the business can be extracted from its IT
systems, otherwise the business risks
being swamped by a data deluge.
Meanwhile, competitors that use data to
deliver better insights to decision-makers
stand a better chance of thriving through
the difficult economy and beyond. To
analyze such a large volume of data, big
data analytics is typically performed
using specialized software tools and
applications for predictive analytics, data
mining, text mining, forecasting and data
optimization. Collectively these processes
are separate but highly integrated
functions of high-performance analytics.
Today's advances in analyzing Big Data
allow researchers to decode human DNA
in minutes, predict where terrorists plan
to attack, determine which gene is most
likely to be responsible for certain
diseases and, of course, which ads you
are most likely to respond to on
Facebook. The business cases for
leveraging Big Data are compelling. For
instance, Netflix mined its subscriber data
to put the essential ingredients together for
its recent hit House of Cards, and subscriber
data also prompted the company to bring
Arrested Development back from the dead.
Another example comes from one of the
biggest mobile carriers in the world.
France's Orange launched its Data for
Development project by releasing subscriber
data for customers in the Ivory Coast. The
2.5 billion records, which were made
anonymous, included details on calls and
text messages exchanged between 5 million
users. Researchers accessed the data and
sent Orange proposals for how the data
could serve as the foundation for
development projects to improve public
health and safety. Proposed projects
included one that showed how to improve
public safety by tracking cell phone data to
map where people went after emergencies;
another showed how to use cellular data for
disease containment.[2] So it seems that
data analysis is a do-or-die requirement
for today's businesses. We analyze below
notable vendor choices, from Hadoop
upstarts to traditional database players.
2. Top 10 Big Data Analytics Platforms
2.1. Actian (Fig. 1 – Actian Analytics Platform)
Fig. 1. Actian Analytics Platform
• Analytical DBMS: Actian Matrix (formerly ParAccel), Actian Vector (formerly Vectorwise).
• In-memory DBMS: Actian Matrix In-Memory Option (data stored to both memory and disk).
• Hadoop distribution: None.
• Stream-processing technology: None.
• Hardware/software systems: None (software-only vendor).
The company is counting on the combination of fast analytical DBMS options, cloud services, and data-integration and data-analytics software geared to a world in which Hadoop is a prominent fixture of the data-management architecture. Actian DataFlow includes SQL-, ETL-, and data-cleansing-on-Hadoop options that work with distributions from Apache, Cloudera, Hortonworks, and others.[3]
2.2. Amazon
• Analytical DBMS: Amazon Redshift service (based on the ParAccel engine); Amazon Relational Database Service.
• In-memory DBMS: None. Third-party options on AWS include Altibase, SAP Hana, and ScaleOut.
• Hadoop distributions: Amazon Elastic MapReduce. Third-party options include Cloudera and MapR.
• Stream-processing technology: Amazon Kinesis.
• Hardware/software systems: Not applicable.
AWS is located in 11 geographical "regions": US East (Northern Virginia), where the majority of AWS servers are based, US West (Northern California), US West (Oregon), Brazil (São Paulo), Europe (Ireland and Germany), Southeast Asia (Singapore), East Asia (Tokyo and Beijing), and Australia (Sydney). There is also a "GovCloud," based in the Northwestern United States, provided for U.S. government customers, complementing existing government agencies already using the US East region. Each region is wholly contained within a single country, and all of its data and services stay within the designated region.
Amazon Web Services (Fig. 2 – Amazon Web Service) hosts a who's who list of data-management services from third-party players – Cloudera, Microsoft, Oracle, SAP, and many others – but the cloud giant has its own long-term ambitions where big-data analysis is concerned.[4] Building on its Elastic Compute Cloud (EC2) and Simple Storage Service (S3) storage infrastructure, Amazon launched its Hadoop-based Elastic MapReduce service back in 2009. In 2013, AWS added the Redshift data warehousing service (based on the ParAccel DBMS), which is supported by another who's who list of independent data-integration, business intelligence, and analytics vendors. Rounding out AWS's big-data capabilities are the DynamoDB NoSQL database management service and the Kinesis stream-processing service.
Fig. 2. Amazon Web Service
• Amazon DynamoDB provides a scalable, low-latency NoSQL online database service backed by SSDs.
• Amazon ElastiCache provides in-memory caching for web applications. This is Amazon's implementation of Memcached and Redis.
• Amazon Relational Database Service (RDS) provides a scalable database server with MySQL, Oracle, SQL Server, and PostgreSQL support.[22]
• Amazon Redshift provides petabyte-scale data warehousing with column-based storage and multi-node compute.
• Amazon SimpleDB allows developers to run queries on structured data. It operates in concert with EC2 and S3 to provide "the core functionality of a database."
• AWS Data Pipeline provides a reliable service for data transfer between different AWS compute and storage services (e.g., Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EMR). In other words, this service is simply a data-driven workload management system that provides a simple management API for managing and monitoring data-driven workloads in cloud applications.[23]
• Amazon Kinesis streams data in real time, with the ability to process thousands of data streams on a per-second basis. The service, designed for real-time apps, allows developers to pull any amount of data, from any number of sources, scaling up or down as needed.[5]
2.3. Cloudera
• Analytical DBMS: HBase; although not a DBMS, Cloudera Impala supports SQL querying on top of Hadoop.
• In-memory DBMS: Although not a DBMS, Apache Spark supports in-memory analysis on top of Hadoop.
• Hadoop distributions: CDH open-source distribution, Cloudera Standard, Cloudera Enterprise.
• Stream-processing technology: Open-source stream-processing options on Hadoop include Storm.
• Hardware/software systems: Partner appliances, preconfigured hardware, or both available from Cisco, Dell, HP, IBM, NetApp, and Oracle.
Cloudera Inc. is an American software company that provides Apache Hadoop-based software, support and services, and training to business customers.[6] Cloudera's open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), targets enterprise-class deployments of that technology. Cloudera says that more than 50% of its engineering output is donated upstream to the various Apache-licensed open source projects (Apache Hive, Apache Avro, Apache HBase, and so on) that combine to form the Hadoop platform. Cloudera is also a sponsor of the Apache Software Foundation.[7]
2.4. HP HAVEn
• Analytical DBMS: HP Vertica Analytics Platform Version 7 (Crane release).
• In-memory DBMS: Vertica is not an in-memory database, but with high RAM-to-disk ratios the company says it can ensure near-real-time query performance.
• Hadoop distribution: None.
• Stream-processing technology: None.
• Hardware/software systems: HP ConvergedSystem 300 for Vertica, plus a choice of reference architectures for the Cloudera, Hortonworks, and MapR Hadoop distributions.
Fig. 3. HAVEn Ecosystem
HP calls its big-data platform architecture HAVEn (Fig. 3 – HAVEn Ecosystem), an acronym for Hadoop, Autonomy, Vertica, Enterprise Security, and "n" applications.
The cluster-based, column-oriented Vertica Analytics Platform is designed to manage large, fast-growing volumes of data and to provide very fast query performance when used for data warehouses and other query-intensive applications. The product claims to drastically improve query performance over traditional relational database systems and to provide high availability and petabyte scalability on commodity enterprise servers.
Its design features include:
• Column-oriented storage organization, which increases the performance of sequential record access at the expense of common transactional operations such as single-record retrieval, updates, and deletes.[9]
• A standard SQL interface with many analytics capabilities built in, such as time-series gap filling/interpolation, event-based windowing and sessionization, pattern matching, event series joins, statistical computation (e.g., regression analysis), and geospatial analysis.
• Out-of-place updates and hybrid storage organization, which increase the performance of queries, insertions, and loads, but at the expense of updates and deletes.
• Compression, which reduces storage costs and I/O bandwidth. High compression is possible because columns of homogeneous datatype are stored together and because updates to the main store are batched.[10]
• A shared-nothing architecture, which reduces system contention for shared resources and allows gradual degradation of performance in the face of hardware failure.
• Ease of use and maintenance through automated data replication, server recovery, query optimization, and storage optimization.
• Support for the standard programming interfaces ODBC, JDBC, and ADO.NET.
• High-performance, parallel data transfer to statistical tools such as Distributed R, and the ability to store machine learning models and use them for in-database scoring.[11][12]
2.5. Hortonworks
• Analytical DBMS: HBase; although not a DBMS, Hive is Hortonworks' option for SQL querying on top of Hadoop.
• In-memory DBMS: Although not a DBMS, Apache Spark supports in-memory analysis on top of Hadoop.
• Hadoop distributions: Hortonworks Data Platform (HDP) 2.0, HDP for Windows, Hortonworks Sandbox (free, single-node desktop software offering Hadoop tutorials).
• Stream-processing technology: Open-source stream-processing options on Hadoop include Storm.
• Hardware/software systems: Partner appliances, preconfigured hardware, or both available from HP, Teradata, and others.
On the matter of customer acquisition, six-year-old Cloudera probably has a slight lead over three-year-old Hortonworks (Fig. 4 – Hortonworks Data Platform), but only just. Analysts estimate Cloudera's base of paying subscribers at around 350, while Hortonworks' CEO Rob Bearden says his company has acquired 250 customers over the past five quarters.
Fig. 4. Hortonworks Data platform
The most significant point of disagreement between Cloudera and Hortonworks lies in their answers to a single question – and the one that, arguably, matters most to enterprise customers: should Hadoop complement or replace traditional enterprise data warehouse (EDW) investments?
2.6. IBM
• Analytical DBMS: DB2, Netezza (Fig. 5 – IBM Netezza platform).
• In-memory DBMS: DB2 with BLU Acceleration, solidDB.
• Hadoop distribution: InfoSphere BigInsights.
• Stream-processing technology: InfoSphere Streams.
• Hardware/software systems: PureData System for Operational Analytics (DB2), IBM PureData System for Analytics (Netezza); PureData System for Hadoop (BigInsights).
Fig. 5. IBM Netezza platform
Although IBM has plenty of products and services, it is not a product-oriented provider of technology. IBM leads with its deep integration and consulting expertise, in a consultative approach focused on building business-differentiating "solutions" that might incorporate multiple products.
IBM Netezza Analytics' advanced technology fuses data warehousing and in-database analytics into a scalable, high-performance, massively parallel advanced analytic platform that is designed to crunch through petascale data volumes. This allows users to ask questions of the data that could not have been contemplated on other architectures. IBM Netezza Analytics is designed to quickly and effectively provide better and faster answers to the most sophisticated business questions.[13]
2.7. Microsoft
• Analytical DBMS: SQL Server 2012 Parallel Data Warehouse (PDW).
• In-memory DBMS: SQL Server 2014 In-Memory OLTP (an option in SQL Server 2014, set for release by the second quarter of 2014).
• Stream-processing technology: Microsoft StreamInsight.
• Hadoop distribution: HDInsight/Windows Azure HDInsight Service (based on the Hortonworks Data Platform).
• Hardware/software systems: Dell Parallel Data Warehouse Appliance, HP Enterprise Parallel Data Warehouse Appliance.
The Microsoft Analytics Platform System (Fig. 6 – Microsoft Analytics Platform System) is a turnkey big data analytics appliance, combining Microsoft's massively parallel processing (MPP) data warehouse technology – the SQL Server Parallel Data Warehouse (PDW) – with HDInsight, Microsoft's 100% Apache Hadoop distribution. To integrate data from SQL Server PDW with data from Hadoop, APS offers the PolyBase data querying technology.[14]
Fig.6. Microsoft Analytics Platform System
2.8. Oracle
• Analytical DBMSs: Oracle Database, Oracle MySQL, Oracle Essbase.
• In-memory DBMS: Oracle TimesTen, Oracle Database 12c In-Memory Option (announced in 2013 without details, roadmaps, or release dates).
• Stream-analysis option: Oracle Event Processing.
• Hadoop distribution: Resells and supports Cloudera Enterprise.
• Hardware/software systems: Exadata, Exalytics, Oracle Big Data Appliance.
Fig.7. Oracle Big Data Appliance
The Oracle Big Data Appliance consists of hardware and software from Oracle Corporation designed to integrate enterprise data, both structured and unstructured. It includes the Oracle Exadata Database Machine and the Oracle Exalytics Business Intelligence Machine, used for obtaining, consolidating, and loading unstructured data into Oracle Database 11g. The product also includes an open source distribution of Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator with Application Adapter for Hadoop, Oracle Loader for Hadoop, an open source distribution of R, Oracle Linux, and the Oracle Java HotSpot Virtual Machine.[15]
By combining the newest technologies from the Hadoop ecosystem with powerful Oracle SQL capabilities on a single pre-configured platform, the Oracle Big Data Appliance (Fig. 7 – Oracle Big Data Appliance) is uniquely able to support rapid development of new Big Data applications and tight integration with existing relational data. Oracle Big Data Appliance is pre-configured for secure environments, leveraging Apache Sentry, Kerberos, both network encryption and encryption at rest, as well as Oracle Audit Vault and Database Firewall.[16]
2.9. Pivotal
• Analytical DBMS: Pivotal Greenplum Database.
• In-memory DBMS: Pivotal GemFire and SQLFire. Pivotal HD is used in combination with GemFire XD and HAWQ for in-memory analysis on top of Hadoop.
• Stream-analysis option: Pivotal is working on a project aimed at integrating its GemFire (NoSQL) and SQLFire in-memory data grid capabilities with Pivotal Hadoop and Spring XD as a data-ingest mechanism to support scalable, streaming-data analysis.
• Hadoop distribution: Pivotal HD.
• Hardware/software systems: Pivotal Data Computing Appliance.
Pivotal HD is 100% Apache Hadoop compliant and supports all Hadoop Distributed File System (HDFS) file formats. In addition, Pivotal HD supports Apache Hadoop-related projects, including YARN (aka MapReduce 2.0), ZooKeeper and Oozie (for resource and workflow management), and Hive and HBase (for language and analytics support). [17]
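Because Pivotal HD tracks stock Apache Hadoop, code written against the standard HDFS API should run on it unchanged. A minimal sketch follows, assuming a reachable NameNode and an existing /data/sales directory (both illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch of touching HDFS through the standard Apache Hadoop API.
 * Because Pivotal HD is Apache Hadoop compliant, code like this needs no
 * vendor-specific changes. The NameNode address and paths are illustrative.
 */
public class HdfsListSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
        try (FileSystem fs = FileSystem.get(conf)) {
            // List every file in the directory with its size in bytes.
            for (FileStatus status : fs.listStatus(new Path("/data/sales"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}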
Pivotal GemFire® stores all operational data compressed and in-memory to avoid disk I/O time lags. Nodes operate in a cluster, optimizing data distribution and processing to ensure the highest speed and balanced utilization of system resources. Pivotal GemFire scales elastically and linearly: adding nodes increases capacity predictably. [18]
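A minimal client-side sketch of the put/get model GemFire exposes, assuming a running cluster with a locator at the address shown and a pre-defined "orders" region (both illustrative):

import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.client.ClientCache;
import com.gemstone.gemfire.cache.client.ClientCacheFactory;
import com.gemstone.gemfire.cache.client.ClientRegionShortcut;

/**
 * Minimal sketch of a GemFire client putting and getting a key in an
 * in-memory region. The locator address and region name are illustrative
 * and assume a running cluster that already defines the "orders" region.
 */
public class GemFireClientSketch {
    public static void main(String[] args) {
        ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("locator-host", 10334) // hypothetical locator
                .create();
        // PROXY regions hold no local state; every operation goes to the cluster.
        Region<String, String> orders = cache
                .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
                .create("orders");
        orders.put("order-42", "{\"item\":\"disk\",\"qty\":3}");
        System.out.println(orders.get("order-42"));
        cache.close();
    }
}

Because the data lives in memory across the cluster rather than on local disk, reads and writes like these avoid disk I/O entirely on the hot path.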
2.10. SAP
• Analytical DBMSs: SAP HANA, SAP IQ.
• In-memory DBMS: SAP HANA.
• Stream-analysis option: SAP Event Stream Processing.
• Hadoop distribution: Resells and supports Hortonworks and Intel; Hadoop integrations certified by Cloudera and MapR.
• Hardware/software systems: Multiple hardware configuration partners, including Dell, Cisco, Fujitsu, Hitachi, HP, and IBM.
Fig.8. SAP HANA architecture
SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP SE. [19] HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform. SAP HANA was previously called SAP High-Performance Analytic Appliance. [20]
The main process, called the index server, has the structure shown in Fig.8. The index server performs session management, authorization, transaction management, and command processing. Note that HANA has both a row store and a column store. Users can create tables using either store, but the column store has more capabilities. The index server also manages persistence between cached in-memory images of database objects, log files, and permanent storage files.
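The row-versus-column choice is made per table at creation time. A hedged JDBC sketch follows, with host, credentials, and table definitions as illustrative assumptions; CREATE COLUMN TABLE and CREATE ROW TABLE are standard HANA DDL, but should be verified against the installed release.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Minimal sketch of choosing between HANA's row store and column store at
 * table creation time over JDBC. Host, credentials, and table definitions
 * are illustrative.
 */
public class HanaStoreSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sap://hana-host:30015"; // hypothetical instance
        try (Connection con = DriverManager.getConnection(url, "SYSTEM", "password");
             Statement st = con.createStatement()) {
            // Column store: the usual choice for analytic scans and aggregates.
            st.execute("CREATE COLUMN TABLE SALES "
                     + "(ID INT PRIMARY KEY, AMOUNT DECIMAL(12,2))");
            // Row store: suited to small, frequently updated tables.
            st.execute("CREATE ROW TABLE SESSION_STATE "
                     + "(SID INT PRIMARY KEY, PAYLOAD NVARCHAR(200))");
        }
    }
}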
The authorization manager provides authentication and authorization services, and can secure access using the SAML, OAuth, or Kerberos authentication protocols.
The Extended Services (XS) Engine is a web server with privileged access to the database. Applications written in server-side JavaScript or as Java Servlets can be deployed to the XS Engine; these can be either HTML web applications or REST web service endpoints. Server-side JavaScript includes jQuery-based extensions for database access and for accessing HTTP request and response messages. The JavaScript engine is based on the Mozilla SpiderMonkey project. [21]
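To illustrate the servlet option, the sketch below shows the kind of REST-style endpoint such an engine could host: it turns one SQL query into a small JSON response. This is a generic servlet sketch, not XS-specific code; the constructor-injected DataSource is purely illustrative, since how the XS Engine actually wires database connections into deployed servlets is deployment-specific and not shown here.

import java.io.IOException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

/**
 * Minimal sketch of a REST-style Java servlet answering one query as JSON.
 * The DataSource wiring is illustrative; connection setup is
 * deployment-specific and assumed to be supplied by the container.
 */
public class SalesCountServlet extends HttpServlet {
    private final DataSource ds; // assumed to be supplied by the container

    public SalesCountServlet(DataSource ds) {
        this.ds = ds;
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("application/json");
        try (Connection con = ds.getConnection();
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM SALES")) {
            rs.next();
            // One row, one column: the total number of sales records.
            resp.getWriter().write("{\"sales\":" + rs.getLong(1) + "}");
        } catch (Exception e) {
            resp.sendError(500, e.getMessage());
        }
    }
}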
5. Conclusions
With data growing so rapidly, and with unstructured data now accounting for 90% of all data, the time has come for enterprises to re-evaluate their approach to data storage, management, and analytics. Legacy systems will remain necessary for specific high-value, low-volume workloads and will complement the use of Hadoop, optimizing the data management structure in your organization by putting the right Big Data workloads in the right systems. The cost-effectiveness, scalability, and streamlined architectures of Hadoop will make the technology more and more attractive. In fact, the need for Hadoop is no longer a question; the only question remaining is how best to take advantage of it. All of these tools provide a rich feature set ready for enterprise use. It will be up to the end user to do a thorough comparison and select the tool that best fits its needs.
References
[1] http://www.networkworld.com/article/2837779/big-data-business-intelligence/8-big-trends-in-big-data-analytics.html
[2] http://www.datamation.com/applications/big-data-analytics-overview.html
[3] http://www.actian.com/solutions/#customer-analytics-content
[4] http://aws.amazon.com/
[5] http://en.wikipedia.org/wiki/Amazon_Web_Services
[6] http://en.wikipedia.org/wiki/Cloudera
[7] http://www.apache.org/foundation/sponsorship.html
[8] Vance, Ashlee (16 March 2009). "Bottling the Magic Behind Google and Facebook". The New York Times.
[9] Monash, C.: "Are row-oriented RDBMS obsolete?", DBMS2, January 22, 2007.
[10] Monash, C.: "Mike Stonebraker on database compression – comments", DBMS2, March 24, 2007.
[11] Gagliordi, Natalie. "HP adds scale to open-source R in latest big data platform". ZDNet.
[12] Prasad, Shreya; Fard, Arash; Gupta, Vishrut; Martinez, Jorge; LeFevre, Jeff; Xu, Vincent; Hsu, Meichun; Roy, Indrajit (2015). "Enabling predictive analytics in Vertica: Fast data transfer, distributed model creation and in-database prediction". ACM SIGMOD International Conference on Management of Data (SIGMOD).
[13] http://www-01.ibm.com/software/data/puredata/analytics/nztechnology/analytics.html
[14] http://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
[15] Darrow, Barb (2011-10-03). "Oracle BigData Appliance stakes big claim".
[16] http://www.oracle.com/technetwork/database/bigdataappliance/overview/index.html
[17] http://pivotal.io/
[18] http://pivotal.io/big-data/pivotal-gemfire
[19] Kelly, Jeff (July 12, 2013). "Primer on SAP HANA". Wikibon. Retrieved October 9, 2013.
[20] http://en.wikipedia.org/wiki/SAP_HANA
[21] https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey
Mr. Ionuţ Ţăranu graduated from the Faculty of Cybernetics, Statistics and Economic Informatics of the Academy of Economic Studies in 1996, earning his Master's degree in "Database support for business". He is currently pursuing a doctorate in economics, specializing in "Soft-computing methods for early medical diagnosis". He was an Assistant Professor for four years at "Titu Maiorescu" University and for another four years at the Academy of Economic Studies in Bucharest. He has published a series of articles, of which the most important are "Applying ABCD Rule of Dermatoscopy using cognitive systems" and "ABCDE Rule in Dermoscopy – Registration and determining the impact of parameter E for evolution in diagnosing skin cancer using soft computing algorithms".
Mr. Ţăranu is currently the General Manager of the Stima Soft company. He has more than 15 years of experience as a project manager and business analyst, with over 13 years of expertise in software development, Business Process Management, Enterprise Architecture design, and outsourcing services. He is also involved in research projects, of which the most relevant are:
• Development of an intelligent system for predicting, analyzing, and monitoring performance indicators of technological and business processes in the renewable energy area;
• Development of an eHealth platform for improving quality of life and personalizing therapy for patients with diabetes;
• Development of an educational portal and integrated electronic education system at the "Carol Davila" University of Medicine and Pharmacy to develop medical performance in the dermatological oncology field.