Review the case study and write a 500-750-word recommendation for the best course of action.


vpnl27

Business Finance

Description

Review "Simulation Case Study: Phoenix Boutique Hotel Group," this topic's case study, in which you provide guidance to Phoenix Boutique Hotel Group (PBHG) founder Bree Bristowe.

In addition to creating a simulation model, prepare a 500-750-word recommendation for Bristowe's best course of action. Explain your model and the rationale for your recommendations.

Use an Excel spreadsheet file for the calculations and explanations. Where a formula was used to calculate an entry, the cell should contain that formula rather than a pasted value. Students are highly encouraged to use the "Simulation Case Study: Phoenix Boutique Hotel Group Template" Excel resource to complete this assignment.
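Before building the workbook, it can help to prototype the simulation logic in a few lines of code. The sketch below is purely illustrative: the demand distribution, room count, rate, and cost are hypothetical placeholders, not figures from the PBHG case, and the graded deliverable still has to be the Excel file with its formulas visible.

```python
import numpy as np

# Hypothetical inputs -- replace every value with the figures given in the PBHG case.
rng = np.random.default_rng(42)
n_nights = 10_000                 # simulated nights (Monte Carlo trials)
rooms_available = 40              # assumed capacity of one boutique property
room_rate = 180.0                 # assumed average daily rate
variable_cost = 35.0              # assumed variable cost per occupied room
mean_demand, sd_demand = 34, 9    # assumed nightly demand distribution

# Simulate nightly demand (never negative), resulting occupancy, and profit.
demand = np.maximum(rng.normal(mean_demand, sd_demand, n_nights), 0).round()
occupied = np.minimum(demand, rooms_available)
profit = occupied * (room_rate - variable_cost)

print(f"Mean nightly profit: {profit.mean():,.0f}")
print(f"5th-95th percentile: {np.percentile(profit, [5, 95])}")
print(f"P(sell out):         {(demand >= rooms_available).mean():.1%}")
```

In Excel the same structure maps to NORM.INV(RAND(), mean, sd) for demand, MIN() for occupancy, and a one-way Data Table to repeat the trial, which keeps every calculation in a visible cell formula as the rubric requires.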

You are required to use at least three academic references to support your report.

Prepare this assignment according to the guidelines found in the APA Style Guide.

You will be graded on the following rubric criteria:

1. Simulation Model: The simulation model is clear and purposefully chosen for the task.

2. Explanation of the Simulation Model: The explanation of the simulation model is substantial and thoughtful.

3. Recommendations for the Best Course of Action: The recommendations for the best course of action are thorough and thoughtful.

4. Rationale: The rationale directly justifies the recommendations for the best course of action with compelling support.

5. Thesis Development and Purpose: Thesis is comprehensive and contains the essence of the paper. Thesis statement makes the purpose of the paper clear.

6. Argument Logic and Construction: Clear and convincing argument that presents a persuasive claim in a distinctive and compelling manner. All sources are authoritative.

7. Documentation of Sources (citations, footnotes, references, bibliography, etc., as appropriate to assignment and style): Sources are completely and correctly documented, as appropriate to assignment and style, and format is free of error.

Unformatted Attachment Preview

Decision Sciences Journal of Innovative Education Volume 12 Number 3 July 2014 Printed in the U.S.A.  C 2014 Decision Sciences Institute TEACHING BRIEF A Demonstration of Regression False Positive Selection in Data Mining Jonathan P. Pinder Wake Forest University, School of Business, P.O. Box 7897, Winston-Salem, NC 27109, e-mail: pinderjp@wfu.edu ABSTRACT Business analytics courses, such as marketing research, data mining, forecasting, and advanced financial modeling, have substantial predictive modeling components. The predictive modeling in these courses requires students to estimate and test many linear regressions. As a result, false positive variable selection (type I errors) is nearly certain to occur. This article describes an in-class demonstration that shows the frequency and impact of false positives on data mining regression-based predictive modeling. In this demonstration, 500 randomly generated independent (X) variables are individually regressed against a single, randomly generated (Y) variable, and the resulting 500 pvalues are sorted and examined. This experiment is repeated and the distribution of the number of variables significant at the 5% level resulting from this simulation is presented and discussed. The demonstration provides a tangible example in which students see the reality and risks of incorrectly inferring statistical significance of independent regression variables. Students have expressed a deeper understanding and appreciation of the risks of type I errors through this demonstration. This demonstration is innovative because the scale of the simulation allows the students to experience the near certainty that the correlations shown in the results are truly random. Subject Areas: Business Analytics, Hypothesis Testing, Regression, and Simulation. INTRODUCTION The current job market for business graduates in business analytics is drawing an increasing number of business students to analytical courses (Jordan, 2013). Specifically, advanced students are interested in learning data mining for regression modeling as a business analytics predictive modeling tool. Davenport and Harris (2007), Nisbet, Elder, and Miner (2009), Fisher and Raman (2010), and Siegel (2013) present data mining applications of regression-based predictive modeling for financial, operations, marketing, and human resource functions. When first learning regression modeling, students are confronted with decisions regarding variable selection and model structure. Traditional regression model-building emphasizes parsimony of variables, statistical significance, and efficient parameter estimation (e.g., Sharpe, DeVeaux, & Velleman, 2010). 199 200 Impact of False Positives on Data Mining Furthermore, as the number of variables increases, combinatoric issues make variable selection decisions particularly complex. In contrast, the focus of data mining and predictive modeling is on predictive accuracy, with less emphasis on significance and causal relationships (Jordan, 2013; Kuhl & Johnson, 2013; Nisbet et al., 2009). Students are customarily taught null hypothesis significance testing (NHST) as part of introductory statistics courses (e.g., Sharpe et al., 2010). Specifically, students identify a null hypothesis (H0 ) of no relationship between one variable (i.e., the independent variable) and another (i.e., the dependent variable). 
This null hypothesis is tested against an alternative hypothesis (HA ) of a statistically significant relationship between the variables using an inferential statistical test. A relationship between variables is considered statistically significant if there is strong evidence that the observed relationship is unlikely to be due to chance (Fisher, 1955). In the NHST commonly used in practice, analysts reject the null hypothesis when the probability of incorrectly rejecting a true null hypothesis falls beneath an established criterion level alpha (α). Alpha represents the maximum level that the researcher will accept for incorrectly rejecting the null hypothesis when the null is true. This situation is often referred to as a false positive (FP). In his influential book Statistical Methods for Research Workers (1925), Fisher proposes a 1 in 20 chance of being exceeded by chance, as a limit for statistical significance. Current convention thus typically sets the probability of a type I error (α) at 5%. In the context of regression, if the analyst finds evidence for a significant relationship between an independent and dependent variable, the null hypothesis may be rejected, and the alternative hypothesis may be assumed to be true. While students have a technical interpretive understanding of p-values and statistical significance for a single regression, they do not have an understanding of, or appreciation for, the implications and impact of running hundreds of regressions on “big data” (Jordan, 2013). They lack the expertise and the subject matter knowledge necessary to make appropriate modeling decisions, and seek to replace their naive knowledge-base with cookbook numerical decision-making (Gigerenzer, 2004). As a result, students do not fully comprehend the frequency, risks, and impacts, of using predictive models that contain predictor variables that are not actually correlated with the dependent variable. False positive selection occurs when independent variables that are not correlated with the dependent variable are selected, based upon selection criteria such as a p-value less than .05, for inclusion in a regression model (Bühlmann, Rütmann, & Kalisch, 2011). Such FPs are type I errors for the regression coefficients and are nearly certain to occur in data mining and predictive model building. Three classroom demonstrations are described in this article that introduce and explain the concepts of type I errors and FP selection in data mining and predictive modeling to students in business analytics. Through the use of the demonstrations, students can learn to avoid common mistakes and risks that occur when using regression in data mining for predictive modeling. The demonstrations are innovative because they prove empirically, and dynamically, the frequency and near-certainty of creating predictive models with illegitimate variables in data Pinder 201 mining contexts if analysts do not utilize statistical thinking (Gigerenzer, 2004) and subject matter expertise. 
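As a quick, self-contained illustration of the scale of the problem (not part of Pinder's own demonstration), the sketch below regresses pure noise on pure noise a few thousand times: because p-values are uniformly distributed when the null hypothesis is true, roughly 5% of the slopes come out "significant" at α = .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_obs, n_tests = 0.05, 250, 5_000

pvals = []
for _ in range(n_tests):
    x = rng.uniform(0, 100, n_obs)
    y = rng.uniform(0, 100, n_obs)   # generated independently of x
    pvals.append(stats.linregress(x, y).pvalue)

share = np.mean(np.array(pvals) <= alpha)
print(f"Share of 'significant' slopes: {share:.3f} (expected about {alpha})")
```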
LITERATURE REVIEW Gigerenzer (2004) and Duffy (2010) discuss the lack of statistical critical thinking that occurs when NHST becomes, as Gigerenzer phrases it, “ritualistic.” Specifically in the context of business and economics, White (2000) discusses the problem of FP selection in forecasting time series data, and presents an extensive literature review of the dangers of what White calls “data snooping.” In particular White states: “Even when no exploitable forecasting relation exists, looking long enough and hard enough at a given set of data will often reveal one or more forecasting models that look good, but are in fact useless.” As forecasting models are a specific form of predictive modeling, White’s observation can be extended to include predictive modeling in data mining contexts. Meinshausen, Meier, and Bühlmann (2009) discuss the general problem of FP selection in the context of multiple (high-dimensional) regression and propose data-splitting methods to reduce the likelihood of the problem. In the field of medical research, FP selection is a significant problem because it may cause researchers to incorrectly infer disease contributing (or curing) factors. Bühlmann et al. (2011) discuss the problem in detail, and present a subsampling method to address the issue. In an example that is relevant to business, Ferson, Sarkissian, and Simin (2008) study the estimation of asset pricing regressions and the biases that occur in the regressions due to the effects of “data snooping.” These articles, along with many other works cited in these articles, demonstrate the real and pragmatic aspects of the problem of FP selection in regression-based predictive model building that must be conveyed to students. False positive selection occurs because random chance and the law of large numbers affect the sampling process in a way that results in a sample that yields false evidence of a relationship where none actually exists. However, psychologists have long understood that people experience difficulty in understanding chance or random processes such as those involved in inferential statistics. The inability to understand chance affects various psychological processes, producing cognitive biases such as in the perception of illusory correlation in social and non-social cognitive processes (Tversky & Kahneman, 1973). People regularly attribute deterministic certainty to random processes, and the fact that random processes are part of hypothesis testing may be one of the reasons students experience difficulty comprehending the subject and appear to have a bias toward rejection of the null hypothesis. With the elements of regression and the severity of the modeling problem context established, a pedagogical methodology that conveys the immediacy of FP selection is needed. Duffy (2010) describes spreadsheet exercises that demonstrate the nature and frequency of type I errors using independent randomly generated data. One exercise had students use a correlation matrix to examine and discuss multiple relationships of what the students believed to be genuine data that were 202 Impact of False Positives on Data Mining actually independent simulated variables. In another exercise, students were shown multiple comparisons using analysis of variance for groups that were not actually statistically significantly different. These demonstrations highlight the risks of data dredging and a posterior hypothesis generation. 
A way to increase students’ cognizance of the frequency, and near certainty, of FP selections is thus needed. Using simulation to create variables known to be random provides such a way to demonstrate the FPs that are nearly certain to occur in a large volume of regressions. The learning aspect of the demonstrations that has the greatest impact on students’ acquisition and retention of the intellectual concepts is the live, dynamic, creation of regressions based on random numbers. Students can readily see that the numbers generated are bogus and thus any “significance” is purely coincidental. The remedies proposed are visceral to the students because the simulations are live and they can witness the FPs first hand. The combination of an effective simulation pedagogy with a compelling context provides the basis for the demonstration. THE DEMONSTRATIONS The demonstrations are designed to accomplish three learning objectives: (1) increase students’ cognizance of the frequency, and near certainty, of FP selections; (2) direct them toward the deduction of the appropriate distribution of FPs; and (3) persuade them that the proper remedies for FPs are statistical experience and subject matter expertise. They are presented in three stages of approximately 15 to 25 minutes each. The demonstrations take place in conjunction with an assigned case study. Several case studies (e.g., Nisbet et al., 2009) have been used for this exercise in two previous semesters. One example of such a case study is from the textbook by Sharpe et al. (2010). It is based on a not-for-profit fund-raising organization and contains data with 481 X variables. Prior to class, students are assigned the task of building a predictive model for the case. At the onset of class, students are queried about their models. After they have briefly discussed their predictive models, they are asked “How many false positive X variables do you think your model contains?” This discussion is followed by the question “What is the distribution of the false positives?” To better discuss these questions, students are introduced to a computer simulation that generates random variables and provides the regression results. First, 250 observations for the Y variable, the dependent variable in the regressions, are generated from a continuous Uniform distribution over the interval (0, 100): Yi = U (0, 100) i = 1, 2, ..., 250 (1) Duffy (2010) created simulated data that were scaled to induce the students into falsely correlating variables in the correlation matrix. For this reason, the randomly generated X variables in this exercise were scaled differently to create the resemblance of real data. Scaling the variables differently also demonstrates that scaling has no effect on statistical significance. Scales (power of 10) for each of the 20 X variables are generated using the formula scalej = 10p j = 1, 2, ..., n (2) Pinder 203 Figure 1: Excerpt from one sample of a random Y and 20 independent random X variables. where P is generated from a discrete Uniform distribution over the interval [1,5], i.e., P  Discrete Uniform( 15 ) and n is the number of X random variables to be generated as independent regression variables. Next, 250 observations for each of the 20 X variables are generated using the formula   Xij = U 0, Scalej i = 1, 2, ..., 250; j = 1, 2, ..., n. (3) An excerpt from a single sample of the randomly generated data is shown in Figure 1. 
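A minimal NumPy sketch of the data-generating step in equations (1)-(3), assuming 20 X variables and scale exponents drawn uniformly from 1 to 5 as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_x = 250, 20

# Equation (1): dependent variable Y ~ Uniform(0, 100)
y = rng.uniform(0, 100, n_obs)

# Equation (2): a power-of-ten scale for each X variable (exponents 1..5)
scales = 10.0 ** rng.integers(1, 6, size=n_x)

# Equation (3): each X_j ~ Uniform(0, scale_j), generated independently of Y
X = rng.uniform(0, 1, size=(n_obs, n_x)) * scales
```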
Students are shown Figure 2 containing the results of the 20 simple regressions of each X variable against the Y variable for this particular simulation trial. The left panel of Figure 2 shows the coefficient of determination (r2 ), tstatistic for the regression slope coefficient, and corresponding p-value for each of the 20 X variables in the sample. The panel on the right shows the X variables sorted and ranked according to the regression results. Note the 2 FPs in this trial. A new sample of data is then generated using the F9 (recalc) key, generating 20 new regression results and the corresponding number of false positives, FP n FP = Ii (4) i=1 where n is the number of X variables (1, 2, . . . , n), and  0 if p > 0.05· Ii = 1 if p ≤ 0.05 (5) Using the Data Table command (one-way data table), 10,000 trials of 20 simple regressions were run using Excel. The results of the trials are shown 204 Impact of False Positives on Data Mining Figure 2: Results for a single simulation trial of 20 simple regressions. in Figure 3. Figure 3 and the corresponding simulation probability distribution (Figure 4 without the Binomial distribution superimposed) are shown to the students. This shows the relative percentage of times that FPs occur. Students are then asked “What is the probability distribution for the false positives?” They are often unable to answer this question, perhaps because the regression context of this discussion is considerably separated in time from previous discussions of discrete probability distributions. This raises the pedagogical question of how to direct students to the appropriate distribution. It provides a transition into the next demonstration in which manipulatives are used and provide a break in the pattern of question-and-answer dialog. At this point in the class session, a single 20-sided die (d20) is distributed to each student. The professor rolls 20 such dice and the students roll a single die. The number of “20”s rolled by the professor and the number of “20”s rolled by the students are counted separately. Students begin to recognize the distribution of their individual rolls as a single Bernoulli trial, and that the distribution of the number of “20”s rolled by the professor (with 20 dice) is Binomial (Bin(20, .05)). This guides students to the conclusion that FP is Binomial; i.e., FPBin(20, .05). Next, students are shown a comparison of the results of the simulation (Figure 3) with a Binomial 20, .05 (Figure 4): These results show that the simulation empirical distribution is Binomial for 20 trials of 5% probability of success; i.e., empirically, FP  Bin(20, .05). As an example, it is noted that the probability of no FPs (f(0)) is .9520  35.85%. To further impress upon the students the relevance and risks for data mining, 250 observations of 500 X variables (X1 – X500 ) and a single Y variable are generated Pinder 205 Figure 3: Simulation results of 10,000 trials of number of false positives in 20 simple regressions. in the same manner as previously described for the 20 X variables. An excerpt from a single sample of the randomly generated data is presented in Figure 5. At this time, students are shown the spreadsheet containing the results of the 500 simple regressions of each X variable against the Y variable for this particular simulation trial. The abridged results of the regressions resulting from this sample, along with the number of FPs, are shown in Figure 6. 
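Continuing the sketch above, the per-trial false-positive count in equations (4) and (5) and the 10,000-trial distribution can be reproduced outside Excel as follows; the empirical share of trials with no false positives should land near .95^20 ≈ 35.8%, matching the Binomial(20, .05) result quoted in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_obs, n_x, alpha, n_trials = 250, 20, 0.05, 10_000

def false_positives() -> int:
    """One trial: regress each random X on a random Y and count p-values <= alpha."""
    y = rng.uniform(0, 100, n_obs)
    scales = 10.0 ** rng.integers(1, 6, size=n_x)
    X = rng.uniform(0, 1, size=(n_obs, n_x)) * scales
    pvals = [stats.linregress(X[:, j], y).pvalue for j in range(n_x)]
    return sum(p <= alpha for p in pvals)

fp_counts = np.array([false_positives() for _ in range(n_trials)])

print("Empirical P(FP = 0):", (fp_counts == 0).mean())
print("Binomial  P(FP = 0):", stats.binom.pmf(0, n_x, alpha))   # 0.95**20, about 0.358
```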
As before, the left panel of Figure 6 shows the coefficient of determination (r2 ), t-statistic for the regression slope coefficient, and corresponding p-value for each of the 500 X variables in the sample. The panel on the right shows the X variables sorted and ranked according to regression results. Note the 28 FPs in this trial. As before, the F9 recalc key is used to generate samples of 250 observations for each of 501 Uniform random variables. Each time the key is pressed, students observe another trial of the simulation with the corresponding number of FP variables. When the students are asked to describe the distribution of FPs, they typically state Binomial (500, .05) which has an average of 25, but are often unable to describe what that distribution would look like. The students are pressed until a student recalls that the distribution will be a discrete Normal approximation to the Binomial. If students fail to recall the discrete Normal approximation, they are asked for the distribution of the number of heads resulting from dropping 100 pennies on a table (another interesting manipulative). They are further questioned as to why the Binomial distribution yields a discrete Normal. If the professor is 206 Impact of False Positives on Data Mining Figure 4: Simulation and Binomial distributions of false positives in 20 simple regressions. fortunate, there will be a student who recalls that according to the Central Limit Theorem, the sum of IID variables tends toward the Normal distribution. As such, each regression is a single IID Bernoulli trial, and these summed Bernoulli trials create a Binomial. Thus, in accordance with the Central Limit Theorem, the total number of “successes” (FPs) will be Normally distributed—as is shown in a Galton Board. Figure 7 shows the number of FPs for 10,000 trials of 500 simple regressions. It is noted that this figure is the result of 5,000,000 regressions. Figure 7 required about 40 minutes of computation time on a laptop and about 19.5 minutes on a desktop computer. Given these computational times, Figure 7 was created prior to class. The considerable volume of these regressions creates a substantial impact upon the students as does the simulated empirical evidence of the FP significance. Next, students are shown a comparison of the results of the simulation (Figure 7) with those of a Binomial distribution with 500 trials and a probability of “success” of 5% (Figure 8). Pinder 207 Figure 5: Excerpt from one sample of a random Y and 500 independent random X variables. The results show that the empirical distribution of the simulation is Binomial for 500 trials of 5% probability of success, and is also approximately Normal with an average of 25 ( = 500 × .05) and a standard deviation of (23.75)1/2 ( = (np(1-p))1/2 ); i.e., empirically, FP  Bin(500, .05)  N(25, 23.75). Students note that with 500 variables there is a near certainty of FPs, and are asked how few variables it would take for to be nearly certain of the emergence of FPs. Figure 9 shows the probability of at least 1 false positive variable (FP) as a function of the number of X variables computed using a Binomial distribution ( Bin(# of X variables, .05)). Students are asked “What exactly have we looked at?” to which the answer is: purely random and truly uncorrelated variables. 
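The curve in Figure 9 has a closed form: with independent tests at α = .05, P(at least one false positive) = 1 − .95^n. A short sketch of that calculation, together with the Binomial-versus-Normal comparison used in the text (mean np = 25 and variance np(1 − p) = 23.75 when n = 500):

```python
from scipy import stats

alpha = 0.05

# Figure 9 analog: probability of at least one false positive vs. number of X variables.
for n in (1, 10, 20, 59, 100, 500):
    print(n, round(1 - (1 - alpha) ** n, 4))
# By n = 59 the probability of at least one false positive already exceeds 0.95.

# Bin(500, .05) against its Normal approximation N(25, 23.75).
n = 500
mu, var = n * alpha, n * alpha * (1 - alpha)
print("Exact P(FP >= 1):", 1 - stats.binom.pmf(0, n, alpha))
print("Normal approx. P(20 <= FP <= 30):",
      stats.norm.cdf(30.5, mu, var ** 0.5) - stats.norm.cdf(19.5, mu, var ** 0.5))
```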
In response, students naturally ask “What happens when we have variables that are not false positives?” One common student suggestion is to look at the number of FPs in the upper tail of the Binomial distribution, but, students quickly realize that comparing the number of “significant” variables to the Binomial distribution will not help them distinguish which variables are the FPs and which are not. Students conclude that the experience gained from seeing the FPs through simulation provides improved statistical thinking and judgment. This corresponds to the findings of Gigerenzer (2004) that in-depth experiments and demonstrations are needed pedagogically to convey statistical thinking to students. Ultimately, subject matter expertise is required to build valid predictive models. As an example, given meteorological data, one could build a predictive model that might have strong predictive metrics, but the model will not be as effective as a model built upon scientific knowledge of thermodynamics, climatology, and meteorology. For training students, the next best thing is the pseudo-experience gained through in-depth case studies. 208 Impact of False Positives on Data Mining Figure 6: Results for a single simulation trial of 500 simple regressions. At this point in the class session the students request additional time outside of class to go back to the case study that initiated the discussion of FPs and study the explanatory variables in further detail. Subsequent class sessions involve the development of predictive models for various situations such as credit applications, fraud detection, and customer retention (Nisbet et al., 2009). During those class sessions, the importance of understanding false positive variable selection in other modeling contexts (e.g., multiple regression and ANOVA) is discussed in the Pinder 209 Figure 7: Simulation results of 10,000 trials of number of false positives in 500 simple regressions. relevant contexts. Finally, predictive models are used as inputs to optimization models for applications in insurance risk management, revenue management, customer retention, and supply chain management. CONCLUSIONS The demonstrations presented in this article are designed to accomplish three learning objectives: (1) increase students’ cognizance of the frequency, and near certainty, of FP selections; (2) direct the students toward the deduction of the appropriate distribution of FPs; and (3) persuade the students that the proper remedies for FPs are statistical thinking and subject matter expertise. Unfortunately, because the course is a new advanced business analytics “topics” elective, effects on student learning, relative to other teaching methods, have yet to be measured. This is certainly an area for future research and validation. Students’ aggregate assessment of learning of predictive modeling and data mining in the prerequisite course creates a common baseline so that their qualitative assessments of the exercise provide a qualitative measure of the efficacy of this exercise as a pilot pedagogy. Figure 10 summarizes student assessments of the demonstrations’ effectiveness (based upon similar university course evaluation instruments) by two classes of MBA students enrolled in an advanced business analytics course with predictive modeling as a significant curricular component. The assessment process was voluntary and anonymous; the response rate was 100% of enrollment (64). Responses offer evidence that the lecture promotes student learning as intended. 
210 Impact of False Positives on Data Mining Figure 8: Simulation and Binomial distributions of false positives in 500 simple regressions. The range of predictive modeling applications shown in Davenport and Harris (2007), Nisbet et al. (2009), Fisher and Raman (2010), and Siegel (2013), combined with the demand for business analytics professionals in the work place, indicate that predictive modeling is an important methodology for MBA students in an analytics course. Given the “big data” nature of predictive modeling, FPs are a near certainty and students must take this fact into Pinder 211 Figure 9: Probability of false positives versus number of X variables. consideration when developing predictive models. The results of the students’ responses to the in-class demonstrations suggest that the demonstrations engage students in a manner that increases their knowledge of FPs and their ability to build regression-based predictive models in data mining situations. After applying the demonstrations’ content to further case studies, students have also expressed a deeper understanding of the regression-based predictive modeling process.1 1 Spreadsheets for the demonstrations are available from the author. 212 Impact of False Positives on Data Mining Figure 10: Distribution of student responses to assessment items (n = 64). REFERENCES Bühlmann, P., Rütimann, P., & Kalisch, M. (2011). Controlling false positive selections in high-dimensional regression and causal inference. Statistical Methods in Medical Research, 22(5), 466–492. Davenport, T. H., & Harris, J. G. (2007). Competing on analytics, the new science of winning. Boston, MA: Harvard Business School Press. Duffy, S. (2010). Random numbers demonstrate the frequency of type I errors: Three spreadsheets of class demonstration. Journal of Statistics Education, 18(2), 1–16. Pinder 213 Ferson, W. E., Sarkissian, S., & Simin, T. (2008). Asset pricing models with conditional betas and alphas: The effects of data snooping and spurious regression. Journal of Financial and Quantitative Analysis, 43(2), 331–354. Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd. Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, 17(1), 69–78. Fisher, M. L., & Raman, A. (2010). The new science of retailing, how analytics are transforming the supply chain and improving performance. Boston, MA: Harvard Business School Press. Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. Jordan, J. (2013, October 21). The corporate downside of Big Data. The Wall Street Journal; Journal Report | Leadership on IT, p. R4. Kuhl, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer. Meinhausen, N., Meier, L., & Bühlmann, P. (2009). p-values for high-dimensional regression. Journal of the American Statistical Association, 104(488), 1671– 1681. Nisbet, R., Elder, J., & Miner, G. (2009). Handbook of statistical analysis & data mining applications. Boston, MA: Academic Press. Siegel, E. (2013). Predictive analytics. New York: John Wiley and Sons. Sharpe, N. R., DeVeaux, R. D., & Velleman, P. F. (2010). Business statistics. Boston, MA: Addison Wesley. Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2), 207–232. White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097– 1126. 
APPENDIX Implementation in Excel The implementation of the simulation of one randomly generated Y variable and 20 randomly generated X variables begins with generating 250 observations for the Y variable from a Uniform (0,100) distribution in Excel using the formula Yi = RAND()∗ 100 i = 1, 2, ..., 250 (A1) where RAND() is the Excel function that generates continuous Uniform distribution over the interval [0, 1]. A scaling factor (power of 10) for each of the 20 X variables is generated using the Excel formula Scalej = 10∧ ROUND(RAND()∗ 4, 0) j = 1, 2, ..., 20 (A2) 214 Impact of False Positives on Data Mining where j is the number of X random variables to be used as the independent regression variables. A total of 250 observations for each of the 20 X variables are generated in Excel using the formula Xij= RAND()∗ Scalej i= 1, 2,..., 250; j = 1, 2,..., 20. (A3) The first regression (Y and X1) formula is entered by selecting cell range C2:C4 and entering {= LINEST(Y range, Xrange, 1, 1)} i= 1, 2,..., 250 (A4) where LINEST() is the Excel function that calculates the regression coefficient for the X variable (i.e., slope), the corresponding standard error for that coefficient, and the r2 for the simple regression. Note that the brackets ({}) surrounding the formula are the result of using Excel’s array math, and is achieved by pressing control shift enter simultaneously, rather than simply enter, after entering the formula. The tstatistic and p-value are then calculated using the regression results. The Excel code for the t-statistic is = β1 /SEβ1 . (A5) The Excel formula for the p-value is = T.DIST.2T(ABS(t − statistic), 250 − 2). (A6) These formulae are then copied and pasted for the next 19 X variables. Finally, the number of FPs is counted using the Excel pseudocode formula: = SUM((Y range < 0.05)∗ 1). (A7) Note that the formula is also an array formula and requires control shift enter to be pressed simultaneously. The specific formulae used to calculate the 20 simple regressions of the Uniformly distributed simulation data are shown in Figure 11. These formulae result in the spreadsheet shown in Figure 12. After the number of FPs has been calculated for the single trial, the next task is to run the 20 regressions multiple times. This is accomplished using a one-way data table (found on the Data/What-If-Analysis/Data Table tab on the ribbon). The trial number is used as an index counter to force Excel to recalculate the random numbers as it puts the trial number into the blank cell C1. The formula in cell B5 is = Data!C7. In this spreadsheet (workbook), Data is the name of the worksheet with the 20 regressions, and cell C7 is the cell in that worksheet in which the number of FPs is calculated (see Figures 11 and 12). This process of running 10,000 trials is shown in Figure 13. Another option for generating the simulation data is to use the Normal distribution, as per common regression assumptions, to generate the simulation variables for the regressions. One method to generate the Yi observations would be to use the Excel formula = 100 + NORM.S.INV(RAND())∗ 10 i= 1, 2,..., 250 (A8) in which NORM.S.INV(z) is the Excel formula that returns a standard (z) score from a given percentile. This creates 250 observations from a Normal distribution Pinder 215 Figure 11: Spreadsheet formulae for simple regressions of simulation data from Uniform distributions. with an average of 100 and a standard deviation of 10; i.e., Yi  N(100,102 ). 
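For readers who want to cross-check the spreadsheet arithmetic outside Excel, the quantities produced by LINEST and T.DIST.2T in equations (A1)-(A6), namely the slope, its standard error, t = β1/SE, and the two-tailed p-value on 250 − 2 degrees of freedom, can be computed directly. This is only a sketch of one trial, not a replacement for the workbook the demonstration uses.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 250

y = rng.uniform(0, 100, n)             # (A1): Y ~ Uniform(0, 100)
scale = 10.0 ** rng.integers(1, 6)     # (A2): power-of-ten scale for X
x = rng.uniform(0, scale, n)           # (A3): X ~ Uniform(0, scale)

# LINEST equivalents (A4): slope, intercept, and the slope's standard error.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
se_slope = np.sqrt((resid @ resid) / (n - 2) / np.sum((x - x.mean()) ** 2))

t_stat = slope / se_slope                          # (A5): t = beta1 / SE(beta1)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # (A6): T.DIST.2T equivalent

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, false positive: {p_value <= 0.05}")
```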
Similarly, the observations for the Xij variables can be generated using the pseudocode formula: = Scalej + NORM.S.INV(RAND())∗ .10∗ Scalej i = 1, 2, ..., 250; j = 1, 2, ..., 20 (A9) where Scalej is calculated for each X variable using equation (A3) from above. This creates 250 observations for each X variable from a Normal distribution with an average of Scalej and a standard deviation of 10% of Scalej ; i.e., Xij  N(Scalej , (.10 Scalej )2 ). The implementation of these formulae is shown in Figure 14. Because a p-value criterion of .05 is used in both experiments, the results from the Normally distributed simulation data are the same as those of the simulation data from a Uniform distribution. However, the Normally distributed simulation data are better because it conforms to the data assumptions of regression better than the Uniformly distributed data. The Normally distributed data are thus more appropriate for the regression assumptions and provide a more convincing demonstration of the frequency of occurrence of FP variables in regression model building. As such, the Normal distribution will be used in subsequent use of this demonstration. 216 Impact of False Positives on Data Mining Figure 12: Spreadsheet of 20 simple regressions of simulation data from Uniform distributions. Figure 13: One-way data/table to create 10,000 trials of the experiment. Pinder 217 Figure 14: Spreadsheet formulae for simple regressions of simulation data from Normal distributions. Jonathan P. Pinder is an Associate Professor of Management at Wake Forest University in Winston-Salem, North Carolina. He earned a BS in Civil Engineering at North Carolina State University and a Ph.D. in Business Administration at the University of North Carolina at Chapel Hill. Dr. Pinder’s research interests include, business analytics, revenue management, stochastic optimization, nonlinear dynamical systems, resource allocation, and inventory management models. He has published articles in numerous journals including: Decision Sciences, Journal of Operations Management, The Journal of the Operational Research Society, Journal of Forecasting, Journal of Business and Economics, Decision Sciencse Journal of Innovative Education, and Managerial and Decision Economics. He has received more than a dozen teaching awards. Copyright of Decision Sciences Journal of Innovative Education is the property of WileyBlackwell and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. Intelligent Data Analysis 17 (2013) 753–769 DOI 10.3233/IDA-130605 IOS Press 753 A methodological approach to mining and simulating data in complex information systems Marina V. Sokolova and Antonio Fernández-Caballero∗ Instituto de Investigación en Informática de Albacete (i3A) and Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, Albacete, Spain Abstract. Complex emergent systems are known to be ill-managed because of their complex nature. This article introduces a novel interdisciplinary approach towards their study. In this sense, the DeciMaS methodological approach to mining and simulating data in complex information systems is introduced. The DeciMaS framework consists of three principal phases, preliminary domain and system analysis, system design and coding, and simulation and decision making. 
The framework offers a sequence of steps in order to support a domain expert who is not a specialist in data mining during the knowledge discovery process. With this aim a generalized structure of a decision support system (DSS) has been worked out. The DSS is virtually and logically organized into a three-leveled architecture. The first layer is dedicated to data retrieval, fusion and pre-processing, the second one discovers knowledge from data, and the third layer deals with making decisions and generating output information. Data mining is aimed to solve the following problems: association, classification, function approximation, and clustering. DeciMaS populates the second logical level of the DSS with agents which are aimed to complete these tasks. The agents use a wide range of data mining procedures that include approaches for estimation and prediction: regression analysis, artificial networks (ANNs), self-organizational methods, in particular, Group Method of Data Handling, and hybrid methods. The association task is solved with artificial neural networks. The ANNs are trained with different training algorithms such as backpropagation, resilient propagation and genetic algorithms. In order to assess the proposal an exhaustive experiment, designed to evaluate the possible harm caused by environmental contamination upon public health, is introduced in detail. Keywords: Complex systems, decision support systems, data mining, simulation 1. Introduction Complex and emergent information systems is an extensive field of knowledge. Generally speaking, a complex system (CS) is a composite object which consists of many heterogeneous (and, in many occasions, complex as well) subsystems [26], and incorporates emergent features that arise from interactions within the different levels. Such systems behave in non-trivial ways, originated in composite functional internal flows and structures. As a general rule, researchers face difficulties when trying to model, simulate and control complex systems. Due to these facts, it would be correct to say that one of the crucial issues of modern science is to puzzle out the CS paradigm. In fact, modeling complex systems is not ∗ Corresponding author: Antonio Fernández-Caballero, Instituto de Investigación en Informática de Albacete (i3A) and Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, 02071, Albacete, Spain. E-mail: Antonio.Fdez@ uclm.es. c 2013 – IOS Press and the authors. All rights reserved 1088-467X/13/$27.50  754 M.V. Sokolova and A. Fernández-Caballero / A methodological approach to mining and simulating data Table 1 Methods for system analysis 1 Stage System description 2 System decomposition 3 Study of subsystems 4 Integration/ Aggregation of results Realization Expert description graphical methods, graphs, Petri nets, hierarchies, “AND-OR” and morphological trees, ontologies Criterion-based methods Heuristic approaches Alternative-based methods Hierarchical methods Problem-solving methods Data mining techniques Knowledge management tools Data fusion methods Decision building tools (hierarchical selection and composition) an easy and trivial task. Because of high complexity of CS, traditional approaches fail in developing theories and formalisms for their analysis [31]. Such a study can only be realized by a cross-sectoral approach, which uses knowledge and theoretical backgrounds from various disciplines as well as collaborative efforts of research groups and interested institutions. 
On the other hand, an effective approach to CS study has to follow the principles of system analysis, when we have to switch over to the abstract view of the system and perform the following flow of tasks: – Description of a system. Identification of its main properties and parameters. – Study of interconnections amongst parts of the system, which includes informational, physical, dynamical, temporal interactions, as well as the functionality of parts within the system. – Study of external system interactions with the environment and with other systems. – System decomposition and partitioning. Decomposition supposes extraction of series of system parts, and partitioning suggests extraction of parallel system parts. These methods are based on cluster analysis (iterative process of integration of system elements into groups) or content analysis (system division into parts, based on physical partitioning or function analysis). – Study of each subsystem or system part, using optimal corresponding tools (multidisciplinary approaches, problem solving methods, expert advice, knowledge discovery tools, and so on). – Integration of results received from the previous stage, and obtaining a pooled fused knowledge about the system. Synthesis of knowledge and composition of a whole model of the system. It can include formal methods for design, multi-criteria methods for optimization, decision-based and hierarchical design, artificial intelligence approaches, case-based reasoning, and others, for example, hybrid methods [36]. In Table 1 some possible approaches and tools are offered, which can be used at each stage and are not limited by the mentioned methods. To view in more details the stages (as provided in the second column of Table 1) and the methods (offered in the third column), the most frequently used are described. The first stage is “System description” that serves as a background for future system analysis. On this stage knowledge, rules and databases are created. A number of various methods are applied here: (1) expert description, essential in this case, can be combined with graphical methods of system representation (data flow diagrams, decision tables and trees, and so on), (2) graphs in form of flow graph representations, potential graph representations, resistance graph representations, line graph representations, (3) Petri nets, (4) taxonomies, vocabularies, and various kinds of hierarchies, (5) “AND-OR”, morphological trees and their modifications, (6) ontologies which unify necessary expert and technical information about domains of interest and databases. M.V. Sokolova and A. Fernández-Caballero / A methodological approach to mining and simulating data 755 The second stage is “System decomposition”, necessary for studying the system components. Among the possible solutions there are criterion-based methods. The third stage, “Study of subsystems”, implies a wide usage of problem-solving methods aimed to describe knowledge and reasoning methods used to solve a task. Then, on this stage there is a great variety of data mining techniques to solve the following tasks: (1) classification, (2) regression, (3) attribute importance, (4) association, and (5) clustering (e.g. [8,23]). Some specific methods used for this list of tasks include statistics, artificial intelligence, decision trees, fuzzy logic, etc. Of course, there are a number of novel methods, modifications and hybridization of existing tools for data mining that appear permanently and are successfully used at this stage [11]. 
At the fourth stage, “Integration”, sets of methods dedicated to integration and composition are used: (1) methods for data integration, (2) multi-criteria evaluation, (3) decision making, including group and distributed decision making, (4) evolutionary methods, (5) artificial intelligence methods, and so on. A researcher has to accept some assumptions about the system’s detailed elaboration level to be able to analyze it in practice. This means that he/she has to chose those entities considered to be the elemental ones for research. As a rule, this task is solved by specialists. Generally, there are various specialists who collaborate on the same case study. Levin [26] enumerates the traditional classes of specialists: (1) novice or beginner, (2) professional or competent, and (3) expert. The author cites the classification of Denning [7] who suggests a seven-level specification: (1) novice or beginner, (2) advanced beginner or rookie, (3) professional or competent, (4) proficient professional or star, (5) expert or virtuoso, (6) master, and (7) legend. The role played by specialists, especially at higher levels, should not be underestimated, as they provide informational support and determine an ontological basis of the research. The qualitative and quantitative research outcomes have to be evaluated. In case of complex systems different types of criteria are needed, often hybrid and at composite scales, as mono-scaled viewpoints have proved themselves to be wrong and unappropriated for CS solutions [33]. Usually, in such cases they have to be provided as local measurements for each component of the abstract system model, and general evaluation criteria for the system are received as a fused hybrid estimation. The rest of the paper is organized as described next. Section 2 presents a short state of the art in the sphere of complex systems modeling. Next, Section 3 introduces DeciMaS, our methodological approach to mining and simulating data in complex information systems. Then, in Section 4, the DeciMaS data mining work flow is explained in detail. In order to assess the proposal, an experiment designed to evaluate the possible harm caused by environmental contamination upon public health is introduced in Section 5. Lastly, some conclusions are drawn in Section 6. 2. Related works Complex adaptive systems are characterized by self-organization, adaptation, heterogeneity across scales, and distributed control. Though these systems are ill-determined, ambiguous, uncertain, they have commonalities, that help working out a general way for their study [41]. Also, complex systems or systems of systems are characterized by high complexity and great number of interacting components. Here difficulties already appear during the creation of an abstract model of a CS, due to a great number of decisions to be made regarding its design. Non-traditional tools from different domains are highly effective in case of CS, providing novel ways for decision generation and solutions finding. CS cover several dimensions, including economical, ecological, social sub-systems, which may have any level (positive or negative) of collaboration and acceptance between them [28,32]. 756 M.V. Sokolova and A. Fernández-Caballero / A methodological approach to mining and simulating data Notice that the aim of this approach is to introduce an integrated framework based on the agent paradigm, which enables intelligent knowledge discovery and support to a specialist during all stages of the decision creation process. 
Indeed, as the decision making environment has become more complex and decentralized, support provided to decision makers by traditional one function or single user focused decision support system (DSS) has evolved from simple predefined reports to complex and intelligent agent-based analysis, suggestion and judgment [15,16]. A little knowledge of non-specialists in data mining leads to erroneous conclusions and dangerous decisions. The best solution to avoid these problems is to apply a “white-box” methodology, which suggests that users should understand the algorithmic and statistical model structures underlying the software [25]. However, usage of intelligent agents may lead a user through a sequence of steps starting with information retrieval and finishing with the creation of decision alternatives and their evaluation. In this case it is more effective to use “blackbox” methodologies for data mining applications, where the agents create and simulate the dynamic behavior of a complex system. Indeed, many decision support and expert systems are based on intelligent agents [22,29]. One traditional field of application of DSS is medicine, and some recent academic reports deal with examples of novel usage of agent-based DSS for home and hospital care, pre-hospital emergency care and health monitoring and surveillance [2]. Some complex systems are sensor-based, such as medical equipment and diagnostic centers [24,30], modern sensor-based aircrafts [42], or software program tools, such as expert systems for socio-environmental simulation [38]. Specialists working with CS often face difficulties in their analysis and decision making. This problem is caused by the existence of numerous approaches that overlap and supplement each other, but do not meet all requirements of a specialist who needs a multi-focal view on a problem at hand and clear methodology applied to it. Existing approaches offer their solutions, however just a few are organized into methodologies [26,33]. However, not all of them permit both to extract general principles as well as to create specific decisions for a given domain. 3. The DeciMaS approach for complex systems study The purpose of the DeciMaS framework is to provide and to facilitate complex systems analysis, simulation, and their comprehension and management. From this standpoint, principles of the system approach are implemented in this framework. The overall approach used in the DeciMaS framework is straightforward. The system is decomposed into subsystems, and intelligent agents are used to examine them. Then, obtained fragments of knowledge are pooled together and general patterns of the system behavioral tendencies are produced [35,36]. The framework consists of the following three principal phases: 1. Preliminary domain and system analysis. This is the initial and preparatory phase where an analyst, in collaboration with experts, studies the domain of interest, extracts entities and discovers its properties and relations. Then, he/she states the main and supplemental goals of research, and the possible scenarios and functions of the system. During this exploration analysis, the analyst researches the following questions: what the system has to do and how it has to do it. As a result of this collaboration the meta-ontology and the knowledge base appear. This phase is supported by the Protégé Knowledge Editor that implements the meta-ontology and the Prometheus Design Kit which is used to design the multi-agent system. M.V. Sokolova and A. 
Fernández-Caballero / A methodological approach to mining and simulating data 757 Fig. 1. Architecture of the proposed decision support system. 2. System design and coding. The active “element” of this phase is a developer who implements an agent-based system and prepares it for further usage. As support at this phase, JACK Intelligent Agents and JACK Development Environment software tools are used. Once coding has finished and the system has been tested, the second phase of DeciMaS is concluded. 3. Simulation and decision making. This is the last phase of the DeciMaS framework. During this phase, the final user, a decision maker, interacts with the system. This interaction consists of constructing solutions and policies, and estimating consequences of possible actions on the basis of simulation models. This phase is supported with JACK Development Environment. The proposed framework offers a sequence of steps to support a domain expert, who is not a specialist in data mining, during the knowledge discovery process. With this aim, a generalized structure of a decision support system is worked out. The DSS is virtually and logically organized into a three-leveled architecture. Figure 1 illustrates the DSS architecture. The first layer is dedicated to data retrieval, fusion and pre-processing, the second one discovers knowledge from the data, and the third layer deals with making decisions and generating output information. The DSS has on open architecture and may be filled with additional data mining methods. 4. The DeciMaS data mining work flow In a standard DSS, information is transformed from an initial “raw” state to a “knowledge” state, which suggests organized data sets, models and dependencies, and, finally, to a “new knowledge” state that contains recommendations, risk assessment values and forecasts. The way the information changes, as it passes through the DeciMaS stages, is shown in Fig. 2. In general terms, data mining refers to extracting or “mining” knowledge from data sources. Data matching is a process of bringing together data from different, and sometimes heterogeneous, data sources and comparing them to find out whether they represent the same real-world object [1,9]. However, data 758 M.V. Sokolova and A. Fernández-Caballero / A methodological approach to mining and simulating data Fig. 2. Information transformation from “raw” to “new knowledge” state. mining may be viewed as a part of a knowledge discovery process which consists of an iterative sequence of the following steps [13]: 1. Data cleaning (to remove noise and inconsistent data). 2. Data integration (where multiple data sources may be combined). 3. Data selection (where data relevant to the analysis task are retrieved from the database). 4. Data transformation (where data are transformed or consolidated to be appropriate for mining by performing summary or aggregation operations, for instance). 5. Data mining (an essential process where intelligent methods are applied to extract data patterns). 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures). 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the retrieved knowledge to the user). Actually, steps from one to four constitute data preprocessing, which is performed to prepare data for mining. 
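As a concrete, deliberately generic illustration of steps 1-4 above, a preprocessing pass of the kind DeciMaS automates might look like the pandas sketch below. The 30% missing-value cutoff, exponential smoothing with α = 0.15, and Z-score normalization mirror choices reported later in the experiment; the 3-standard-deviation outlier rule is an assumption.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, max_missing: float = 0.30, alpha: float = 0.15) -> pd.DataFrame:
    """Steps 1-4 of the knowledge discovery sequence: clean, select, and transform."""
    # Data selection: drop series with more than 30% missing values.
    df = df.loc[:, df.isna().mean() <= max_missing]

    # Data cleaning: mark outliers (beyond 3 standard deviations) as missing, then fill gaps.
    z = (df - df.mean()) / df.std()
    df = df.mask(z.abs() > 3).interpolate(limit_direction="both")

    # Data transformation: exponential smoothing followed by Z-score normalization
    # (Min-Max normalization is the alternative when the bounds of a series are known).
    df = df.ewm(alpha=alpha).mean()
    return (df - df.mean()) / df.std()
```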
The DeciMaS framework uses various data mining techniques to support knowledge transformation to achieve the following global goals: – Information preprocessing. As a rule, primary data are incomplete, noisy and inconsistent. Attributes of interest may not be available and various errors (of data transition, measurement, etc.) may occur. A considerable number of data mining algorithms are susceptible to data quality. That is why data mining techniques usually improve accuracy and efficiency of mining algorithms [6,13, 27]. Data preprocessing within DeciMaS include methods for missing values and outliers detection and treatment, methods for smoothing and normalization [3]. – Data mining. In general, data mining is aimed to solve the following problems: association, classification, function approximation, and clustering. DeciMaS populates the second logical level of the DSS with agents that are aimed to complete these tasks. The agents use a wide range of data mining procedures which include approaches for estimation and prediction: regression analysis, artificial networks, self-organizational methods, in particular, generalized ordered weighted hybrid logarithm averaging (GOWHLA) [43], Group Method of Data Handling (GMDH) [37], and other hybrid methods [10,21,39]. Then, methods used for classification, decomposition and partitioning, based on statistical procedures, are presented [25]. Next, the association task is resolved with artificial neural networks (ANNs) [4,5]. ANNs are trained with different training algorithms such as backpropagation (BP), resilient propagation (RPROP) and training with genetic algorithms (GA) [14, 34]. – Support in decision making. DeciMaS supports data mining process, presenting to a user information from created knowledge bases and permitting generation of “what-if” scenarios. To enable M.V. Sokolova and A. 
Fernández-Caballero / A methodological approach to mining and simulating data 1 Type of disease/Pollutant Endogenous diseases 2 Exogenous diseases 3 4 5 6 Transport Usage of petroleum products Water characteristics Wastes 7 Principal miner products 759 Table 2 Diseases studied in research Disease class Certain conditions originating in the perinatal period, Congenital malformations, deformations and chromosomal abnormalities Certain infectious and parasitic diseases Neoplasm, Diseases of the blood and bloodforming organs and certain disorders involving the immune mechanism, Endocrine, nutritional and metabolic diseases, Mental and behavioral disorders, Diseases of the nervous system, Diseases of the eye and adnexa, Diseases of the ear and mastoid process, Diseases of the circulatory system, Diseases of the respiratory system, Diseases of the digestive system, Diseases of the skin and subcutaneous tissue, Diseases of the musculoskeletal system and connective tissue, Diseases of the genitourinary system, Pregnancy, childbirth and the puerperium, Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified, External causes of morbidity and mortality Number of Lorries, Buses, Autos, Tractors, Motorcycles, Others Petroleum liquid gases; Petroleum autos; Petroleum; Kerosene; Gasohol; Fuel-oil DQO; DBO5; Solids in suspension; Nitrites Non-dangerous chemical wastes, Other non-dangerous chemical wastes, Non-dangerous metal wastes, Wastes from used equipment, paper, Dangerous wastes of glass, Dangerous wastes of rubber, Dangerous solid wastes, Dangerous vitrified wastes, Wastes from used equipment, Metallic and phosphorus wastes Hull; Mercury; Kaolin; Salt; Thenardite; Diatomite; Gypsum; Rock; Others; user-computer interaction, visualization methods which are allowed by the programming environments, are used. 5. Data and results The experiment designed to evaluate possible harm caused by environmental contamination upon public health was conducted for the Spanish region of Castilla-La Mancha. Retrospective data dated from 1989 until 2007, was used. Resources offered by Instituto Nacional de Estadística and by Instituto de Estadística de Castilla-La Mancha were used for the research [17]. The factors that described the “Environmental pollution – Human health” system were used as indicators of human health and as influencing indirect factors of environmental pollution [19]. Morbidity, classified by sex and age, was accepted as an indicator to evaluate human health. The diseases included in the research were chosen in accordance with the International Statistical Classification of Diseases and Related Health Problems (ICD) [18] and are given in Table 2. Pollutants are represented with indirect pollution sources such as number of vehicles in use, wastes (including dangerous wastes), quality of potable water, and others (see Table 2). Information was retrieved from CSV, DOC and XLS-files and fused together. There were 148 data files that contain information of interest. After extraction, data was placed in data storages agree with the domain ontology. 5.1. Information preprocessing The case of the human health impact assessment appeared to be an optimal domain for application. First, the information is scarce and heterogeneous. The heterogeneity of the initial data is multiple as var- 760 M.V. Sokolova and A. Fernández-Caballero / A methodological approach to mining and simulating data Table 3 The outcomes of the missing values and outliers detection Factor X0 X2 ... 
5.1. Information preprocessing

The case of human health impact assessment appeared to be an optimal domain for application. First, the information is scarce and heterogeneous. The heterogeneity of the initial data is manifold, as various parameters (for example, morbidity indicators versus waste subgroups) have different amounts of data available. The periods between registrations of the parameters also differed: for example, one parameter was registered monthly and another yearly. Some data were measured on different scales; for example, morbidity was measured both in persons and in thousands of persons. Second, the data sets are short, so it was decided to apply data mining methods suited to small samples, such as GMDH-based models and committee machines. Last but not least, the domain of the study represents one of the most difficult problem areas: the real interrelations between its components have not yet been thoroughly studied, even by domain experts [12]. That is why application of the DeciMaS methodology can be very effective, as it facilitates the discovery of new knowledge and gives a new understanding of the nature of the complex system.

The data were checked for the presence of outliers, which can be caused by registration errors or misprints. Outliers were eliminated in the previous step and marked as missing values. Data sets with more than 30% missing values were excluded from the analysis. The outcomes are given in Table 3.

Table 3. Outcomes of the missing values and outliers detection (values in %)
Factor   Missing values   Outliers     Factor   Missing values   Outliers
X0       38               0            X1       78               0
X2       38               0            X3       40               0.950
...      ...              ...          ...      ...              ...
X44      18               0            X45      12               0
X46      0                12           X47      0                12
X48      6                6            X49      18               0
...      ...              ...          ...      ...              ...
Y0       0                0            Y1       12               0
Y2       ...              6            Y3       0                6
...      ...              ...          ...      ...              ...

The bar chart given in Fig. 3 visualizes the gap-filling procedure for the data sets before and after treatment.

Fig. 3. Bar chart exemplifying the gap-filling procedure: the data before (in red) and after filling the gaps (in blue).

Next, the data were smoothed and normalized. Smoothing was applied to homogenize the data after the treatment of missing values; exponential smoothing with the coefficient α equal to 0.15 was used. The data were normalized using two methods: Z-score standardization (for the case where extreme values are not established) and Min-Max normalization (when minimal and maximal values are present in a data set).

Decomposition of the studied complex system, "Environmental pollution – Human health", was carried out by means of correlation analysis. A set of non-correlated independent variables X was created for each dependent variable Y. The independent variables (pollutants) that showed insignificant correlation with the dependent variable (disease) were also included in the set. Because the data sets are short, non-parametric correlation was used, calculated with the rank correlation coefficient and Kendall's τ statistic.
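The smoothing, normalization and correlation screening described above can be illustrated with the following sketch (α = 0.15 and the two normalization rules come from the text; the 0.05 significance cut-off, the decision to keep only significantly correlated pollutants and the function names are our assumptions):

```python
# Sketch of the smoothing, normalization and Kendall-tau screening described
# above. Assumes each indicator is a 1-D numpy array of yearly values.
import numpy as np
from scipy.stats import kendalltau

def exponential_smoothing(x, alpha=0.15):
    """Simple exponential smoothing with the coefficient used in the text."""
    s = np.empty_like(x, dtype=float)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def z_score(x):
    return (x - x.mean()) / x.std()

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

def screen_pollutants(disease, pollutants, alpha_level=0.05):
    """Keep pollutants whose Kendall tau correlation with the disease series
    is significant at alpha_level (the cut-off is an assumption)."""
    selected = {}
    for name, series in pollutants.items():
        tau, p_value = kendalltau(disease, series)
        if p_value < alpha_level:
            selected[name] = tau
    return selected
```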
5.2. Data mining

For every class of diseases, regression analysis was performed by plotting the morbidity value against a pollutant or several pollutants. Simple, power, hyperbolic and exponential regression models were created. Each model was evaluated with the Fisher F-value, and the models that did not satisfy the F-test were eliminated from the list of accepted models. Generally, the number of accepted regression models was low, and the correlation coefficients of the best-performing univariate regression models ranged from 0.48 to 0.82. Figures 4 and 5 present examples of regression models and their approximation to the real data.

Fig. 4. Univariate linear regression to model Y0 = f(X1).

The model given in Fig. 4 is a univariate regression model that constructs the function Y0 = f(X1) and equals Y0 = 6.42X1 − 0.068. The red line represents the initial data, and the blue line represents the data approximated with the model. The correlation coefficient for this model is R = 0.48, the determination coefficient is D = 0.23 and the F-criterion is F = 4.4. The regression model given in Fig. 5, which models the dependency Y14 = f(X44) and has the form Y14 = 4.43X44 − 0.144, fits the initial series better and satisfies the statistical criteria: the correlation coefficient R = 0.82, the determination coefficient D = 0.68 and the F-criterion F = 30.09.

Fig. 5. Univariate regression to model Y14 = f(X44).

In general, the univariate regression models for this case study were characterized by low values of the statistical indicators and proved unsuitable for modeling on their own. Multiple regression models showed better performance. For instance, the multiple regression model for Y15 is given in Fig. 6. The model is written as Y15 = 0.022X14 + 0.001X4 + 0.012, and its statistical criteria are: correlation coefficient R = 0.77, determination coefficient D = 0.59 and F-criterion F = 20.69. That is, the explanatory variables X4 and X14 account for approximately 59% of the variance in the dependent variable Y15.

Fig. 6. Multiple regression to model Y15 = f(X4, X14).
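A minimal sketch of how such a regression model can be fitted and scored with the criteria reported above (R, D and the Fisher F-value) is given below; the data arrays and the function name are placeholders, and X is assumed to be an n × k array of pollutant series:

```python
# Sketch: fit a multiple linear regression and report R, D (= R^2) and the
# F-criterion used to accept or reject models. Data arrays are placeholders.
import numpy as np

def fit_and_evaluate(y, X):
    """Ordinary least squares with an intercept; returns coefficients, R, D, F."""
    X1 = np.column_stack([X, np.ones(len(y))])          # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ beta
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    D = 1 - ss_res / ss_tot                              # determination coefficient
    R = np.sqrt(D)                                       # correlation coefficient
    k = X.shape[1]                                       # number of predictors
    n = len(y)
    F = (D / k) / ((1 - D) / (n - k - 1))                # Fisher F-criterion
    return beta, R, D, F
```

A model would then be kept only if F exceeds the tabulated critical value for (k, n − k − 1) degrees of freedom, mirroring the F-test used above to accept or reject models.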
Neural network-based models calculated for the experimental data sets also demonstrated high performance. Networks trained with resilient propagation and with backpropagation have similar architectures, and the training and testing procedures were equivalent. Neural networks trained with genetic algorithms were applied as well: in this case the genes of the chromosome represent the network weights, and with every new offspring population the error function is driven down, by changing the matrix of internal weights, until it approaches an optimal or acceptable solution [40].

For feedforward networks trained with the backpropagation algorithm, the learning rate and momentum were varied within the interval [0, 0.99]; better results were obtained with the learning rate within [0.85, 0.99] and the momentum within [0.3, 0.4].

Fig. 7. Charts for neural network models trained with the backpropagation algorithm: the real model (in red), the approximated model (in blue), and the training error plotted against epochs.

Feedforward neural networks trained with the resilient propagation algorithm showed high performance with the zero tolerance equal to 10^(−15), the initial update value within the range [0.05, 0.15], and the maximum step equal to 50.

Fig. 8. Charts for neural network models trained with the resilient propagation algorithm: the real model (in red), the approximated model (in blue), and the training error plotted against epochs.

Neural networks were also trained with genetic algorithms, with the following training parameters:
– Population size, the size of the population used for training, varied from 30% to 60%;
– Mutation percent, equal to 20%, the share of the population to which the mutation operator is applied;
– Percent to mate, equal to 30%, the share of the population to which the crossover operator is applied.

Fig. 9. Charts for neural network models trained with genetic algorithms: the real model (in red), the approximated model (in blue), and the error function plotted against generations.

The dependence between diseases and pollutants was also modeled with GMDH algorithms, which identify both linear and nonlinear polynomial models within the same approach [20]. Figure 10 shows the models Y27 = 0.2X31 − 0.04X60 + 0.24, with R = 0.94, D = 0.88 and MAE = 0.077, and Y89 = 4.20X31 − 0.01X21 − 1.01, with R = 0.91, D = 0.83 and MAE = 0.124. In general, GMDH-based models demonstrate high performance and efficiency when working with short data sets. The models were obtained with a combinatorial algorithm in which combinations of the following polynomial terms were used: X, X^2, X1·X2, X1·X2^2, X1^2·X2, X1^2·X2^2, 1/X, 1/(X1·X2). Model selection was stopped when the regularity criterion started to decrease.

Fig. 10. Charts for models trained with the Group Method of Data Handling: the real model (in red) and the approximated model (in blue).
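The combinatorial search over these polynomial terms can be sketched as follows (a simplification of the GMDH combinatorial algorithm: the candidate-term list follows the text, while the hold-out "regularity" split, the subset-size limit and the function names are our assumptions):

```python
# Sketch of a GMDH-style combinatorial search: candidate polynomial terms are
# built from two pollutants, every small subset of terms is fitted by least
# squares on a training part, and the model with the best error on a hold-out
# ("regularity") part is kept. This is a simplification, not the exact algorithm.
from itertools import combinations
import numpy as np

def candidate_terms(x1, x2):
    # Terms listed in the text; the reciprocal terms assume non-zero inputs.
    return {
        "x1": x1, "x2": x2, "x1^2": x1**2, "x2^2": x2**2,
        "x1*x2": x1 * x2, "x1*x2^2": x1 * x2**2, "x1^2*x2": x1**2 * x2,
        "x1^2*x2^2": (x1 * x2) ** 2, "1/x1": 1 / x1, "1/(x1*x2)": 1 / (x1 * x2),
    }

def combinatorial_gmdh(y, x1, x2, max_terms=3, train_fraction=0.7):
    terms = candidate_terms(x1, x2)
    n_train = int(len(y) * train_fraction)
    best = None
    for k in range(1, max_terms + 1):
        for subset in combinations(terms, k):
            A = np.column_stack([terms[t] for t in subset] + [np.ones(len(y))])
            beta, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
            hold_out_error = np.mean((y[n_train:] - A[n_train:] @ beta) ** 2)
            if best is None or hold_out_error < best[0]:
                best = (hold_out_error, subset, beta)
    return best  # (regularity-type error, selected terms, coefficients)
```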
A final model for every dependency is a committee machine, written as Y = f(X1, X2, ..., Xn), where n is the number of pollutants included in the model. As an example, the outcomes of modeling for the variable of interest Y35 (disease: external causes of death; age group: all ages) are discussed. First, after the decomposition step, the set of variables (pollutants) that could be included into models for Y35 was reduced to the following factors: X8, X9, X12, X60, X61, X62, X63, X64. Several models built on these factors were created for Y35, evaluated, and the best were selected. The models included into this committee machine are:

1. Multiple regression model, Y35 = f1(X9, X61).
2. Neural network trained with the backpropagation algorithm, Y35 = f2(X8, X63, X9).
3. Neural network trained with the resilient propagation algorithm, Y35 = f3(X60, X62, X12).
4. Neural network trained with genetic algorithms, Y35 = f4(X64, X12).

The final model generated by the committee machine is

Y35 = (f1·Rf1 + f2·Rf2 + f3·Rf3 + f4·Rf4) / (Rf1 + Rf2 + Rf3 + Rf4),   (1)

where fi is a model included into the committee machine and Rfi is the correlation coefficient of the i-th model, i = 1, ..., n, with n the number of models.

Table 4 shows the statistical criteria for the models included into the committee machine and for the committee machine itself. The criteria calculated are the correlation coefficient R, the determination coefficient D and the mean absolute error MAE.

Table 4. Committee machine for the variable of interest Y35 (disease: external causes of death; age group: all ages)
Model                Inputs X                         R      D      MAE
Model 1              X9, X61                          0.67   0.44   0.21
Model 2              X8, X63, X9                      0.72   0.52   0.19
Model 3              X60, X62, X12                    0.85   0.72   0.14
Model 4              X64, X12                         0.87   0.76   0.15
Committee machine    X9, X12, X60, X61, X62, X64      0.91   0.83   0.12

The table shows that the committee machine is evaluated with better results than any single model: correlation coefficient R = 0.91, determination coefficient D = 0.83 and mean absolute error MAE = 0.12. The second column indicates the pollutants included into each model; the committee machine combines the factors used by the individual models.

5.3. Simulation

For this case study the final decision is made by the specialist; the system, however, offers the information needed to ground it. First, models in the form of committee machines and predictions were created, and hidden patterns and possible tendencies were discovered. Second, the results of the impact assessment explain the qualitative and quantitative dependencies between pollutants and diseases. Finally, the DSS supports simulation. The committee machine model for the variable of interest "Neoplasms", Y20, built according to formula (1), is used here. Suppose there is a need to change the value of a pollutant and observe how a specific morbidity class would respond to this change, and suppose that the pollutant is "Usage of fuel-oil", X8. Five models compose the committee machine for the variable Y20, and one of them (the neural network trained with the resilient propagation algorithm) includes X8 as an input variable. Table 5 shows the outcomes of the simulation: the "Predicted value" column contains the values predicted by the model, and the remaining columns contain the values of Y20 recalculated under the hypothesis that the variable X8 is increased by 20, 40 and 80 percent.

Table 5. Simulation outcomes for Y20 under increases of the independent variable X8
Step   Predicted value   +20%   +40%   +80%
1      0.48              0.71   0.90   1.25
2      0.51              0.74   0.94   1.31
3      0.53              0.77   0.98   1.34
4      0.55              0.79   1.02   1.39
5      0.57              0.82   1.05   1.45
6      0.59              0.84   1.08   1.49
7      0.60              0.87   2.02   1.54
8      0.61              0.90   2.05   1.58
9      0.64              0.92   2.08   1.65
10     0.65              0.94   3.2    1.69
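Formula (1) and the what-if perturbation of X8 can be sketched as follows (illustrative only: the member models f_i are placeholder callables, and the 20/40/80% increases mirror Table 5 without reproducing its values):

```python
# Sketch of the committee-machine combination (formula (1)) and of the
# "what-if" simulation: the input pollutant X8 is scaled up by 20/40/80% and
# the committee prediction is recomputed. The models f_i are placeholders.
import numpy as np

def committee_predict(models, correlations, inputs):
    """Weighted average of the member models; weights are their correlation coefficients."""
    predictions = np.array([f(inputs) for f in models])
    weights = np.array(correlations)
    return np.sum(weights * predictions) / np.sum(weights)

def what_if_on_x8(models, correlations, inputs, increases=(0.2, 0.4, 0.8)):
    """Recompute the committee output when X8 is increased by the given fractions."""
    results = {"baseline": committee_predict(models, correlations, inputs)}
    for inc in increases:
        perturbed = dict(inputs)
        perturbed["X8"] = inputs["X8"] * (1 + inc)
        results[f"+{int(inc * 100)}%"] = committee_predict(models, correlations, perturbed)
    return results
```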
The model used in this simulation is characterized by a correlation coefficient R = 0.904 and F = 7.354 (F > Ftable). The determination coefficient D shows that the variable X8 explains approximately 81.8% of the variance in Y20. The values of Y20 are given on a normalized scale and represent the relative growth of the process.

6. Conclusions

In the DeciMaS framework we have embodied the two solutions proposed in the Introduction. First, we have brought together existing methods for the creation of decision support systems, and more concretely within an agent-based decision support system architecture; these methods cover data preprocessing, data mining and decision generation. Second, we have introduced an interdisciplinary and flexible methodology for complex, systemic domains and policies, which facilitates study and modeling for a wide range of applications.

The case study presented in the article applied the DeciMaS framework to the identification and evaluation of the environmental impact upon human health and to the generation of sets of alternative decisions. The computational experiment was carried out by means of an agent-based decision support system that sequentially executed and completed each stage of DeciMaS. The study produced several constitutive outcomes and observations regarding both the subject and the methods of the study. It supported making predictions of possible states of the system and creating a pool of alternative decisions for better understanding and possible correction of the complex system. Modeling helped to discover non-linear relationships between indicators of human health and pollutants, and to generate linear and non-linear mathematical models based on hybrid techniques, which included different types of regressions and artificial neural networks. Our results show that models based on hybrid committee machines and neural networks may be implemented in decision support systems, as they have demonstrated good performance in approximation and forecasting. The performance gap between the linear and non-linear models confirms a non-linear relationship between the factors influencing morbidity. Finally, different predictions of possible morbidity states under various assumptions were calculated with the hybrid and neural network-based models; the predictions were generated through sensitivity analysis for increases and decreases of the explanatory variables.

The general conclusions on the use of the DeciMaS framework are the following. DeciMaS supports the standard flow of steps of an information system's life cycle, which makes the framework useful and generally applicable across a wide range of complex domains. The ability of DeciMaS to be easily adapted to any domain of interest is important: the framework is organized in such a way that the change of domain is realized during its first stage, while all further procedures of data mining and decision generation are completed in the same way for different domains. This characteristic adds flexibility to the framework and widens its areas of application. Moreover, the use of agent teams makes it possible to distribute, control and synchronize the workflows within the system, supervised and organized by the team-leader agents to manage autonomous knowledge discovery. Additionally, the DeciMaS framework uses established terminology and integrates tools and methods from various disciplines, making good use of their strengths.
This facilitates the usage of the DeciMaS framework by non-scientific users.

Acknowledgements

This work was partially supported by the Spanish Ministerio de Economía y Competitividad/FEDER under project TIN2010-20845-C03-01, and by the Spanish Junta de Comunidades de Castilla-La Mancha/FEDER under project PII2I09-0069-0994.

References

[1] M. Agyemang, K. Barker and R. Alhajj, A comprehensive survey of numeric and symbolic outlier mining techniques, Intelligent Data Analysis 10 (2006), 521–538.
[2] R. Annicchiarico, U. Cortés and C. Urdiales, Agent Technology and e-Health, Whitestein Series in Software Agent Technologies and Autonomic Computing, Birkhäuser Basel, 2008.
[3] T. Bossomaier, D. Jarratt, M.M. Anver, T. Scott and J. Thompson, Data integration in agent based modelling, Complexity International 11 (2005), 6–18.
[4] A. Casals and A. Fernández-Caballero, Robotics and autonomous systems in the 50th anniversary of artificial intelligence, Robotics and Autonomous Systems 55 (2007), 837–839.
[5] T. Chen and Y.C. Wang, A fuzzy-neural approach for global CO2 concentration forecasting, Intelligent Data Analysis 15 (2011), 763–777.
[6] B. Clarke, E. Fokoué and H.Z. Hao, Principles and Theory for Data Mining and Machine Learning, Springer Science + Business Media, 2009.
[7] P.J. Denning and R. Dunham, The profession of IT: The core of the third-wave professional, Communications of the ACM 44 (2001), 21–25.
[8] A.M. Denton, C.A. Besemann and D.H. Dorr, Pattern-based time-series subsequence clustering using radial distribution functions, Knowledge and Information Systems 18 (2009), 129–154.
[9] C.F. Dorneles, R. Gonçalves and R. dos Santos Mello, Approximate data instance matching: A survey, Knowledge and Information Systems 27 (2011), 1–21.
[10] S.J. Farlow, Self-Organizing Methods in Modeling: GMDH-Type Algorithms, Marcel Dekker, 1984.
[11] E. Fuchs, T. Gruber, H. Pree and B. Sick, Temporal data mining using shape space representations of time series, Neurocomputing 74 (2010), 379–393.
[12] J.M. Gohlke, S.H. Hrynkow and C.J. Portier, Health, economy, and environment: Sustainable energy choices for a nation, Environmental Health Perspectives 6 (2008), 236–237.
[13] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition, The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, 2006.
[14] S. Haykin, Neural Networks: A Comprehensive Foundation, McMillan, 1994.
[15] T.J. Hess, L.P. Rees and T.R. Rakes, Using autonomous software agents to create the next generation of decision support systems, Decision Sciences 31 (2000), 1–31.
[16] J.A. Iglesias, A. Ledezma, A. Sanchis and G. Kaminka, A plan classifier based on chi-square distribution tests, Intelligent Data Analysis 15 (2011), 131–149.
[17] Instituto de Estadística de Castilla-La Mancha, http://www.ine.es/en/welcome.htm.
[18] International Classification of Diseases, http://www.who.int/classifications/icd/en/.
[19] ISO 14031:1999, Environmental Management – Environmental Performance – Guidelines, 1999.
[20] A.G. Ivakhnenko, G.A. Ivakhnenko, E.A. Savchenko and D. Wunsch, Problems of further development of GMDH algorithms: Part 2, Pattern Recognition and Image Analysis 12 (2002), 6–18.
[21] M. Last, A. Kandel and H. Bunke, Data Mining in Time Series Databases, World Scientific, 2004.
[22] C.T. Leondes, Fuzzy Logic and Expert Systems Applications, vol. 6, Academic Press, 2007.
[23] T. Li, Clustering based on matrix approximation: A unifying view, Knowledge and Information Systems 17 (2008), 1–15.
[24] P.J. Li, X.S. Qin, K.H. Adjallah, B. Eynard and J. Lee, Cooperative decision making for diagnosis of complex system based on game theory: Survey and an alternative scheme, in: 2006 IEEE International Conference on Industrial Informatics, 2006, pp. 725–730.
[25] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley and Sons, 2005.
[26] M.S. Levin, Composite Systems Decisions, Springer-Verlag, 2006.
[27] T. Marwala, Computational Intelligence for Missing Data Imputation, Estimation, and Management: Knowledge Optimization Techniques, IGI Global, 2006.
[28] A.A. Martins, M.G. Cardoso and I.M.S. Pinto, Mapping atmospheric pollutants emissions in European countries, Intelligent Data Analysis 16 (2012), 153–164.
[29] T. Nguyen, J. de Kok and M. Titus, A new approach to testing an integrated water systems model using qualitative scenarios, Environmental Modelling and Software 22 (2007), 1557–1571.
[30] G.M. O'Hare, M.J. O'Grady, R. Tynan, C. Muldoon, H.R. Kolar, A.G. Ruzzelli, D. Diamond and E. Sweeney, Embedding intelligent decision making within complex dynamic environments, Artificial Intelligence Review 27 (2007), 189–201.
[31] D. Pascot, F. Bouslama and S. Mellouli, Architecturing large integrated complex information systems: An application to healthcare, Knowledge and Information Systems 27 (2011), 115–140.
[32] S.A. Rahman, A.A. Bakar and Z.A. Mohamed-Hussein, An intelligent data pre-processing of complex datasets, Intelligent Data Analysis 16 (2012), 305–325.
[33] J. Rotmans, Tools for integrated sustainability assessment: A two-track approach, The Integrated Assessment Journal 6 (2006), 35–57.
[34] S.N. Sivanandam and S.N. Deepa, Introduction to Genetic Algorithms, Springer-Verlag, 2008.
[35] M.V. Sokolova and A. Fernández-Caballero, The Protégé – Prometheus approach to support multi-agent systems creation, in: Proceedings of the Tenth International Conference on Enterprise Information Systems, vol. AIDSS, 2008, pp. 442–445.
[36] M.V. Sokolova and A. Fernández-Caballero, Data mining driven decision making, Lecture Notes in Computer Science 5179 (2009), 220–225.
[37] D. Srinivasan, Energy demand prediction using GMDH networks, Neurocomputing 72 (2008), 625–629.
[38] D. Torii, T. Ishida, S. Bonneaud and A. Drogoul, Layering social interaction scenarios on environmental simulation, Lecture Notes in Computer Science 3425 (2005), 78–88.
[39] C. Tran, A. Abraham and L. Jain, Decision support systems using hybrid neurocomputing, Neurocomputing 61 (2004), 85–97.
[40] A.J.F. van Rooij, R.P. Johnson and L.C. Jain, Neural Network Training Using Genetic Algorithms, World Scientific, 1996.
[41] C.H. Weng and Y.L. Chen, Mining fuzzy association rules from uncertain data, Knowledge and Information Systems 23 (2009), 129–152.
[42] N. Xu and X. Wang, An information fusion method based on game theory, Proceedings of the 2nd International Conference on Signal Processing Systems 1 (2010), 95–98.
[43] L.G. Zhou and H.Y. Chen, Generalized ordered weighted logarithm aggregation operators and their applications to group decision making, International Journal of Intelligent Systems 25 (2010), 683–707.
Simulation Case Study: Phoenix Boutique Hotel Group

Phoenix Boutique Hotel Group (PBHG) was founded in 2007 by Bree Bristowe. Having worked for several luxury resorts, Bristowe decided to pursue her dream of owning and operating a boutique hotel. Her hotel, which she called PHX, was located in an area that included several high-end resorts and business hotels. PHX filled a niche market for "modern travelers looking for excellent service and contemporary design without the frills." Since opening PHX, Bristowe has invested in, purchased, or renovated three other small hotels in the Phoenix metropolitan area: Canyon Inn PHX, PHX B&B, and The PHX Bungalows.

One of the customer service enhancements Bristowe has implemented is a centralized, toll-free reservation system. Although many customers book specific hotels online, the phone reservation system enables PBHG to find the best reservation match across all properties. It has been an excellent option for customers who have preferences regarding room type, amenity options, and the best price across the four hotel locations. Currently, three agents are on staff for the 6 a.m. to 2 p.m. call shift. The time between calls during this shift is represented in Table 1, and the time to process reservation requests during this shift is represented in Table 2.

Table 1: Incoming Call Distribution (current)
Time Between Calls (Minutes):  1     2     3     4     5     6
Probability:                   0.13  0.23  0.27  0.19  0.15  0.09

Table 2: Service Time Distribution
Time to Process Customer Inquiries (Minutes):  1     2     3     4     5     6     7
Probability:                                   0.19  0.17  0.16  0.15  0.11  0.08  0.03

Bristowe wants to ensure customers are not on hold for longer than 2 minutes. She is debating hiring additional staff for this shift based on the available data. Additionally, Bristowe and PBHG will soon be featured in a national travel magazine with a circulation of over a million subscriptions. Bristowe is worried that the current operators may not be able to handle the increase in reservations. The projected incoming call distribution is represented in Table 3.

Table 3: Incoming Call Distribution (projected)
Time Between Calls (Minutes):  1     2     3     4     5     6
Probability:                   0.26  0.27  0.24  0.14  0.11  0.06

Bristowe has asked for your advice in evaluating the current phone reservation system. Create a simulation model to investigate her concerns and make recommendations about the reservation agents.

Excel template (provided resource)

Arrival Interval Distribution
Probability   Random Number Lower Limit   Range Upper Limit   Arrival Gap (Minutes)
0.13          0                           10                  1
0.23          11                          31                  2
0.27          32                          53                  3
0.19          54                          73                  4
0.15          74                          89                  5
0.09          90                          99                  6

Service Time Distribution
Probability   Random Number Lower Limit   Range Upper Limit   Service Time (Minutes)
0.19          0                           19                  1
0.17          20                          38                  2
0.16          39                          56                  3
0.15          57                          73                  4
0.11          74                          86                  5
0.08          87                          96                  6
0.03          97                          99                  7

Trial-run worksheet columns: Customer Number, Random Number (arrival), Arrival Gap, Arrive Time, Random Number (service), Service Time, Service Start, Service End, Time in System, Time on Hold, Time Server Idle. Summary rows report the average and maximum values for the trial run and the Percent Utilization. The template supplies random numbers for customers 1 through 15: 53, 30, 12, 53, 30, 51, 24, 5, 25, 39, 90, 87, 50, 94, 81 for the arrival gaps and 1, 1, 68, 54, 82, 81, 90, 42, 7, 4, 78, 32, 98, 82, 63 for the service times; the remaining columns are to be computed from the distribution tables.
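Before the Excel model is built, the same logic can be prototyped with a short Monte Carlo sketch (illustrative only: the assignment itself requires the Excel template, and the number of simulated calls, the three-agent assumption and the function names here are our own choices; the probabilities are used exactly as given in Tables 1-3):

```python
# Monte Carlo sketch of the PBHG reservation line (illustrative only; the
# assignment asks for an Excel model). Probabilities are taken from Tables 1-3
# as given; the agent count and the number of simulated calls are parameters
# we chose for this sketch.
import random

CURRENT_GAPS   = {1: 0.13, 2: 0.23, 3: 0.27, 4: 0.19, 5: 0.15, 6: 0.09}
PROJECTED_GAPS = {1: 0.26, 2: 0.27, 3: 0.24, 4: 0.14, 5: 0.11, 6: 0.06}
SERVICE_TIMES  = {1: 0.19, 2: 0.17, 3: 0.16, 4: 0.15, 5: 0.11, 6: 0.08, 7: 0.03}

def draw(dist):
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights)[0]

def simulate_shift(gap_dist, n_agents=3, n_calls=150):
    """Simulate one shift; returns the average and maximum caller hold times."""
    free_at = [0.0] * n_agents       # time at which each agent becomes free
    clock = 0.0
    holds = []
    for _ in range(n_calls):
        clock += draw(gap_dist)      # next call arrives
        agent = min(range(n_agents), key=lambda i: free_at[i])
        start = max(clock, free_at[agent])
        holds.append(start - clock)  # time the caller waits on hold
        free_at[agent] = start + draw(SERVICE_TIMES)
    return sum(holds) / len(holds), max(holds)

if __name__ == "__main__":
    random.seed(42)
    for label, dist in [("current", CURRENT_GAPS), ("projected", PROJECTED_GAPS)]:
        avg_hold, max_hold = simulate_shift(dist)
        print(f"{label}: average hold {avg_hold:.2f} min, max hold {max_hold:.2f} min")
```

Re-running simulate_shift with the projected distribution and a larger n_agents value gives the comparison needed to judge the 2-minute hold-time target.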

Explanation & Answer

Hoping you are doing great... Below are the responses to the prompt above. I completed the calculations in the provided spreadsheet and the evaluation in a Word document. Hit me up if you need assistance with edits.
I have revised the thesis and added subtitles to show where each requirement is addressed. Please see the MS Word file labeled BPHG.docx

Running Head: PBHG CASE STUDY


PBHG Case Study:
Evaluating the current phone reservation system
Student’s Name
Professor
Course
Date



Evaluating the current phone reservation system
PBHG is a group of hotels that offers a toll-free phone line for reservations. This is one strategy that makes the company competitive, because customers typically want information, including the price and the quality of the services they can expect, before paying for their reservations. On this note, the organization aims to ensure that clients do not spend a long time on hold, while also ensuring that the staff responsible for the reservation process are not left idle. The organization...



