Grand Canyon University IBM SPSS Access Statistical Analysis Help

Mathematics

Description

Some commonly employed statistical analyses include correlation and regression. In this assignment, you will practice correlation and regression techniques from an SPSS data set.

General Requirements:

Use the following information to ensure successful completion of the assignment:

  • Review "SPSS Access Instructions" for information on how to access SPSS for this assignment.
  • Access the document, "Introduction to Statistical Analysis Using IBM SPSS Statistics, Student Guide" to complete the assignment.
  • Download the file "Bank.sav" and open it with SPSS. Use the data to complete the assignment.
  • Download the file "Census.sav" and open it with SPSS. Use the data to complete the assignment.

Directions:

Perform the following tasks to complete this assignment:

  1. Locate the data set "Bank.sav" and open it with SPSS. Follow the steps in section 10.15 Learning Activity as written. Answer questions 1-3 in the activity based on your observations of the SPSS output. Type your answers into a Word document. Copy and paste the full SPSS output including any supporting graphs and tables directly from SPSS into the Word document for submission to the instructor. The SPSS output must be submitted with the problem set answers in order to receive full credit for the assignment.
  2. Locate the data set "Census.sav" and open it with SPSS. Follow the steps in section 11.16 Learning Activity as written. Answer questions 1, 2, 3, and 5 in the activity based on your observations of the SPSS output. Type your answers into a Word document. Copy and paste the full SPSS output including any supporting graphs and tables directly from SPSS into the Word document for submission to the instructor. The SPSS output must be submitted with the problem set answers in order to receive full credit for the assignment.
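The two tasks above rest on the Pearson correlation coefficient and simple linear regression. As a rough cross-check on what those statistics mean (not a replacement for the required SPSS output), the following minimal Python sketch computes both by hand. The function names, variable names, and data values are hypothetical and are not taken from Bank.sav or Census.sav.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def simple_linear_regression(x, y):
    """Least-squares (slope, intercept) for the line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("slope is undefined when x is constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up illustrative values only (hypothetical, not from the course files).
    years_of_education = [8, 12, 12, 15, 16, 19]
    salary_thousands = [24, 30, 33, 41, 45, 60]

    r = pearson_r(years_of_education, salary_thousands)
    slope, intercept = simple_linear_regression(years_of_education, salary_thousands)
    print(f"r = {r:.3f}, R squared = {r * r:.3f}")
    print(f"predicted salary = {intercept:.2f} + {slope:.2f} * education")
```

In simple linear regression with one predictor, the square of the correlation coefficient equals the proportion of variance explained, which is why the sketch prints both values side by side.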

Bank

https://lms-grad.gcu.edu/learningPlatform/content/content.lc?operation=viewContent&contentId=6e2fb2f9-a67a-472d-881b-e90db7a99d2b

Census

https://lms-grad.gcu.edu/learningPlatform/content/content.lc?operation=viewContent&contentId=e5f845fb-cfcc-4cda-bdbd-cecc1a28db56

Unformatted Attachment Preview

Introduction to Statistical Analysis Using IBM SPSS Statistics
Student Guide
Course Code: 0G517 ERC 1.0

Licensed Materials - Property of IBM
© Copyright IBM Corp. 2010
0G517
Published October 2010

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. SPSS, SamplePower, and PASW are trademarks of SPSS Inc., an IBM Company, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.

This guide contains proprietary information which is protected by copyright. No part of this document may be photocopied, reproduced, or translated into another language without a legal license agreement from IBM Corporation.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

Table of Contents

LESSON 0: COURSE INTRODUCTION
  0.1 INTRODUCTION
  0.2 COURSE OBJECTIVES
  0.3 ABOUT SPSS
  0.4 SUPPORTING MATERIALS
  0.5 COURSE ASSUMPTIONS

LESSON 1: INTRODUCTION TO STATISTICAL ANALYSIS
  1.1 OBJECTIVES
  1.2 INTRODUCTION
  1.3 BASIC STEPS OF THE RESEARCH PROCESS
  1.4 POPULATIONS AND SAMPLES
  1.5 RESEARCH DESIGN
  1.6 INDEPENDENT AND DEPENDENT VARIABLES
  1.7 NOTE ABOUT DEFAULT STARTUP FOLDER AND VARIABLE DISPLAY IN DIALOG BOXES
  1.8 LESSON SUMMARY
  1.9 LEARNING ACTIVITY

LESSON 2: UNDERSTANDING DATA DISTRIBUTIONS – THEORY
  2.1 OBJECTIVES
  INTRODUCTION
  2.2 LEVELS OF MEASUREMENT AND STATISTICAL METHODS
  2.3 MEASURES OF CENTRAL TENDENCY AND DISPERSION
  2.4 NORMAL DISTRIBUTIONS
  2.5 STANDARDIZED (Z-) SCORES
  2.6 REQUESTING STANDARDIZED (Z-) SCORES
  2.7 STANDARDIZED (Z-) SCORES OUTPUT
  2.8 PROCEDURE: DESCRIPTIVES FOR STANDARDIZED (Z-) SCORES
  2.9 DEMONSTRATION: DESCRIPTIVES FOR Z-SCORES
  2.10 LESSON SUMMARY
  2.11 LEARNING ACTIVITY

LESSON 3: DATA DISTRIBUTIONS FOR CATEGORICAL VARIABLES
  3.1 OBJECTIVES
  3.2 INTRODUCTION
  3.3 USING FREQUENCIES TO SUMMARIZE NOMINAL AND ORDINAL VARIABLES
  3.4 REQUESTING FREQUENCIES
  3.5 FREQUENCIES OUTPUT
  3.6 PROCEDURE: FREQUENCIES
  3.7 DEMONSTRATION: FREQUENCIES
  3.8 LESSON SUMMARY
  3.9 LEARNING ACTIVITY

LESSON 4: DATA DISTRIBUTIONS FOR SCALE VARIABLES
  4.1 OBJECTIVES
  4.2 INTRODUCTION
  4.3 SUMMARIZING SCALE VARIABLES USING FREQUENCIES
  4.4 REQUESTING FREQUENCIES
  4.5 FREQUENCIES OUTPUT
  4.6 PROCEDURE: FREQUENCIES
  4.7 DEMONSTRATION: FREQUENCIES
  4.8 SUMMARIZING SCALE VARIABLES USING DESCRIPTIVES
  4.9 REQUESTING DESCRIPTIVES
  4.10 DESCRIPTIVES OUTPUT
  4.11 PROCEDURE: DESCRIPTIVES
  4.12 DEMONSTRATION: DESCRIPTIVES
  4.13 SUMMARIZING SCALE VARIABLES USING THE EXPLORE PROCEDURE
  4.14 REQUESTING EXPLORE
  4.15 PROCEDURE: EXPLORE
  4.16 DEMONSTRATION: EXPLORE
  4.17 LESSON SUMMARY
  4.18 LEARNING ACTIVITY

LESSON 5: MAKING INFERENCES ABOUT POPULATIONS FROM SAMPLES
  5.1 OBJECTIVES
  5.2 INTRODUCTION
  5.3 BASICS OF MAKING INFERENCES ABOUT POPULATIONS FROM SAMPLES
  5.4 INFLUENCE OF SAMPLE SIZE
  5.5 HYPOTHESIS TESTING
  5.6 THE NATURE OF PROBABILITY
  5.7 TYPES OF STATISTICAL ERRORS
  5.8 STATISTICAL SIGNIFICANCE AND PRACTICAL IMPORTANCE
  5.9 LESSON SUMMARY
  5.10 LEARNING ACTIVITY

LESSON 6: RELATIONSHIPS BETWEEN CATEGORICAL VARIABLES
  6.1 OBJECTIVES
  6.2 INTRODUCTION
  6.3 CROSSTABS
  6.4 CROSSTABS ASSUMPTIONS
  6.5 REQUESTING CROSSTABS
  6.6 CROSSTABS OUTPUT
  6.7 PROCEDURE: CROSSTABS
  6.8 EXAMPLE: CROSSTABS
  6.9 CHI-SQUARE TEST
  6.10 REQUESTING THE CHI-SQUARE TEST
  6.11 CHI-SQUARE OUTPUT
  6.12 PROCEDURE: CHI-SQUARE TEST
  6.13 EXAMPLE: CHI-SQUARE TEST
  6.14 CLUSTERED BAR CHART
  6.15 REQUESTING A CLUSTERED BAR CHART WITH CHART BUILDER
  6.16 CLUSTERED BAR CHART FROM CHART BUILDER OUTPUT
  6.17 PROCEDURE: CLUSTERED BAR CHART WITH CHART BUILDER
  6.18 EXAMPLE: CLUSTERED BAR CHART WITH CHART BUILDER
  6.19 ADDING A CONTROL VARIABLE
  6.20 REQUESTING A CONTROL VARIABLE
  6.21 CONTROL VARIABLE OUTPUT
  6.22 PROCEDURE: ADDING A CONTROL VARIABLE
  6.23 EXAMPLE: ADDING A CONTROL VARIABLE
  6.24 EXTENSIONS: BEYOND CROSSTABS
  6.25 ASSOCIATION MEASURES
  6.26 LESSON SUMMARY
  6.27 LEARNING ACTIVITY

LESSON 7: THE INDEPENDENT-SAMPLES T TEST
  7.1 OBJECTIVES
  7.2 INTRODUCTION
  7.3 THE INDEPENDENT-SAMPLES T TEST
  7.4 INDEPENDENT-SAMPLES T TEST ASSUMPTIONS
  7.5 REQUESTING THE INDEPENDENT-SAMPLES T TEST
  7.6 INDEPENDENT-SAMPLES T TEST OUTPUT
  7.7 PROCEDURE: INDEPENDENT-SAMPLES T TEST
  7.8 DEMONSTRATION: INDEPENDENT-SAMPLES T TEST
  7.9 ERROR BAR CHART
  7.10 REQUESTING AN ERROR BAR CHART WITH CHART BUILDER
  7.11 ERROR BAR CHART OUTPUT
  7.12 DEMONSTRATION: ERROR BAR CHART WITH CHART BUILDER
  7.13 LESSON SUMMARY
  7.14 LEARNING ACTIVITY

LESSON 8: THE PAIRED-SAMPLES T TEST
  8.1 OBJECTIVES
  8.2 INTRODUCTION
  8.3 THE PAIRED-SAMPLES T TEST
  8.4 ASSUMPTIONS FOR THE PAIRED-SAMPLES T TEST
  8.5 REQUESTING A PAIRED-SAMPLES T TEST
  8.6 PAIRED-SAMPLES T TEST OUTPUT
  8.7 PROCEDURE: PAIRED-SAMPLES T TEST
  8.8 DEMONSTRATION: PAIRED-SAMPLES T TEST
  8.9 LESSON SUMMARY
  8.10 LEARNING ACTIVITY

LESSON 9: ONE-WAY ANOVA
  9.1 OBJECTIVES
  9.2 INTRODUCTION
  9.3 ONE-WAY ANOVA
  9.4 ASSUMPTIONS OF ONE-WAY ANOVA
  9.5 REQUESTING ONE-WAY ANOVA
  9.6 ONE-WAY ANOVA OUTPUT
  9.7 PROCEDURE: ONE-WAY ANOVA
  9.8 DEMONSTRATION: ONE-WAY ANOVA
  9.9 POST HOC TESTS WITH A ONE-WAY ANOVA
  9.10 REQUESTING POST HOC TESTS WITH A ONE-WAY ANOVA
  9.11 POST HOC TESTS OUTPUT
  9.12 PROCEDURE: POST HOC TESTS WITH A ONE-WAY ANOVA
  9.13 DEMONSTRATION: POST HOC TESTS WITH A ONE-WAY ANOVA
  9.14 ERROR BAR CHART WITH CHART BUILDER
  9.15 REQUESTING AN ERROR BAR CHART WITH CHART BUILDER
  9.16 ERROR BAR CHART OUTPUT
  9.17 PROCEDURE: ERROR BAR CHART WITH CHART BUILDER
  9.18 DEMONSTRATION: ERROR BAR CHART WITH CHART BUILDER
  9.19 LESSON SUMMARY
  9.20 LEARNING ACTIVITY

LESSON 10: BIVARIATE PLOTS AND CORRELATIONS FOR SCALE VARIABLES
  10.1 OBJECTIVES
  10.2 INTRODUCTION
  10.3 SCATTERPLOTS
  10.4 REQUESTING A SCATTERPLOT
  10.5 SCATTERPLOT OUTPUT
  10.6 PROCEDURE: SCATTERPLOT
  10.7 DEMONSTRATION: SCATTERPLOT
  10.8 ADDING A BEST FIT STRAIGHT LINE TO THE SCATTERPLOT
  10.9 PEARSON CORRELATION COEFFICIENT
  10.10 REQUESTING A PEARSON CORRELATION COEFFICIENT
  10.11 BIVARIATE CORRELATION OUTPUT
  10.12 PROCEDURE: PEARSON CORRELATION WITH BIVARIATE CORRELATIONS
  10.13 DEMONSTRATION: PEARSON CORRELATION WITH BIVARIATE CORRELATIONS
  10.14 LESSON SUMMARY
  10.15 LEARNING ACTIVITY

LESSON 11: REGRESSION ANALYSIS
  11.1 OBJECTIVES
  11.2 INTRODUCTION
  11.3 SIMPLE LINEAR REGRESSION
  11.4 SIMPLE LINEAR REGRESSION ASSUMPTIONS
  11.5 REQUESTING SIMPLE LINEAR REGRESSION
  11.6 SIMPLE LINEAR REGRESSION OUTPUT
  11.7 PROCEDURE: SIMPLE LINEAR REGRESSION
  11.8 DEMONSTRATION: SIMPLE LINEAR REGRESSION
  11.9 MULTIPLE REGRESSION
  11.10 MULTIPLE LINEAR REGRESSION ASSUMPTIONS
  11.11 REQUESTING MULTIPLE LINEAR REGRESSION
  11.12 MULTIPLE LINEAR REGRESSION OUTPUT
  11.13 PROCEDURE: MULTIPLE LINEAR REGRESSION
  11.14 DEMONSTRATION: MULTIPLE LINEAR REGRESSION
  11.15 LESSON SUMMARY
  11.16 LEARNING ACTIVITY

LESSON 12: NONPARAMETRIC TESTS
  12.1 OBJECTIVES
  12.2 INTRODUCTION
  12.3 NONPARAMETRIC ANALYSES
  12.4 THE INDEPENDENT SAMPLES NONPARAMETRIC ANALYSIS
  12.5 REQUESTING AN INDEPENDENT SAMPLES NONPARAMETRIC ANALYSIS
  12.6 INDEPENDENT SAMPLES NONPARAMETRIC TESTS OUTPUT
  12.7 PROCEDURE: INDEPENDENT SAMPLES NONPARAMETRIC TESTS
  12.8 DEMONSTRATION: INDEPENDENT SAMPLES NONPARAMETRIC TESTS
  12.9 THE RELATED SAMPLES NONPARAMETRIC ANALYSIS
  12.10 REQUESTING A RELATED SAMPLES NONPARAMETRIC ANALYSIS
  12.11 RELATED SAMPLES NONPARAMETRIC TESTS OUTPUT
  12.12 PROCEDURE: RELATED SAMPLES NONPARAMETRIC TESTS
  12.13 DEMONSTRATION: RELATED SAMPLES NONPARAMETRIC TESTS
  12.14 LESSON SUMMARY
  12.15 LEARNING ACTIVITY

LESSON 13: COURSE SUMMARY
  13.1 COURSE OBJECTIVES REVIEW
  13.2 COURSE REVIEW: DISCUSSION QUESTIONS
  13.3 NEXT STEPS

APPENDIX A: INTRODUCTION TO STATISTICAL ANALYSIS REFERENCES
  1.1 INTRODUCTION
  1.2 REFERENCES

Lesson 0: Course Introduction

0.1 Introduction

The focus of this two-day course is an introduction to the statistical component of IBM® SPSS® Statistics. This is an application-oriented course and the approach is practical. You'll take a look at several statistical techniques and discuss situations in which you would use each technique, the assumptions made by each method, how to set up the analysis using PASW® Statistics, as well as how to interpret the results. This includes a broad range of techniques for exploring and summarizing data, as well as investigating and testing underlying relationships. You will gain an understanding of when and why to use these various techniques as well as how to apply them with confidence, and interpret their output, and graphically display the results.
0.2 Course Objectives

After completing this course students will be able to:

  • Perform basic statistical analysis using selected statistical techniques with PASW Statistics

To support the achievement of this primary objective, students will also be able to:

  • Explain the basic elements of quantitative research and issues that should be considered in data analysis
  • Determine the level of measurement of variables and obtain appropriate summary statistics based on the level of measurement
  • Run the Frequencies procedure to obtain appropriate summary statistics for categorical variables
  • Request and interpret appropriate summary statistics for scale variables
  • Explain how to make inferences about populations from samples
  • Perform crosstab analysis on categorical variables
  • Perform a statistical test to determine whether there is a statistically significant relationship between categorical variables
  • Perform a statistical test to determine whether there is a statistically significant difference between two groups on a scale variable
  • Perform a statistical test to determine whether there is a statistically significant difference between the means of two scale variables
  • Perform a statistical test to determine whether there is a statistically significant difference among three or more groups on a scale dependent variable
  • Perform a statistical test to determine whether two scale variables are correlated (related)
  • Perform linear regression to determine whether one or more variables can significantly predict or explain a dependent variable
  • Perform non-parametric tests on data that don’t meet the assumptions for standard statistical tests

0.3 About SPSS

SPSS® Inc., an IBM® Company is a leading global provider of predictive analytics software and solutions.
The Company’s complete portfolio of products - data collection, statistics, modeling and deployment - captures people's attitudes and opinions, predicts outcomes of future customer interactions, and then acts on these insights by embedding analytics into business processes. SPSS solutions address interconnected business objectives across an entire organization by focusing on the convergence of analytics, IT architecture and business process. Commercial, government and academic customers worldwide rely on SPSS technology as a competitive advantage in attracting, retaining and growing customers, while reducing fraud and mitigating risk. SPSS was acquired by IBM® in October 2009. For more information, visit http://www.spss.com.

0.4 Supporting Materials

We use several datasets in the course because no one data file contains all the types of variables and relationships between them that are ideal for every technique we discuss. As much as possible, we try to minimize the need within one lesson to switch between datasets, but the first priority is to use appropriate data for each method.

The following data files are used in this course:

  • Bank.sav
  • Drinks.sav
  • Census.sav
  • Employee data.sav
  • SPSS_CUST.sav

0.5 Course Assumptions

General computer literacy. Completion of the "Introduction to PASW Statistics" and/or "Data Management and Manipulation with PASW Statistics" courses, or experience with PASW Statistics, including familiarity with opening, defining, and saving data files and with manipulating and saving output. Basic statistical knowledge or at least one introductory level course in statistics is recommended.

Note about Default Startup Folder and Variable Display in Dialog Boxes

In this course, all of the files used for the demonstrations and exercises are located in the folder c:\Train\Statistics_IntroAnalysis.
Note: If the course files are stored in a different location, your instructor will give you instructions specific to that location.

Either variable names or longer variable labels will appear in list boxes in dialog boxes. Additionally, variables in list boxes can be ordered alphabetically or by their position in the file. In this course, we will display variable names in alphabetical order within list boxes.

1) Select Edit...Options
2) Select the General tab (if necessary)
3) Select Display names in the Variable Lists group on the General tab
4) Select Alphabetical
5) Select OK and OK in the information box to confirm the change

Lesson 1: Introduction to Statistical Analysis

1.1 Objectives

After completing this lesson students will be able to:
• Explain the basic elements of quantitative research and issues that should be considered in data analysis

To support the achievement of the primary objective, students will also be able to:
• Explain the basic steps of the research process
• Explain differences between populations and samples
• Explain differences between experimental and non-experimental research designs
• Explain differences between independent and dependent variables

1.2 Introduction

The goal of this course is to enable you to perform useful analyses on your data using PASW Statistics. Keeping this in mind, these lessons demonstrate how to perform descriptive and inferential statistical analyses and create charts to support these analyses. This course guide will focus on the elements necessary for you to answer questions from your data.

In this chapter, we begin by briefly reviewing the basic elements of quantitative research and issues that should be considered in data analysis. We will then discuss a number of statistical procedures that PASW Statistics performs. This is an application-oriented course and the approach will be practical.
We will discuss:
1) The situations in which you would use each technique.
2) The assumptions made by the method.
3) How to set up the analysis using PASW Statistics.
4) Interpretation of the results.

We will not derive proofs, but rather focus on the practical matters of data analysis in support of answering research questions. For example, we will discuss what correlation coefficients are, when to use them, and how to produce and interpret them, but will not formally derive their properties. This course is not a substitute for a course in statistics. You will benefit if you have had such a course in the past, but even if not, you will understand the basics of each technique after completion of this course.

We will cover descriptive statistics and exploratory data analysis, and then examine relationships between categorical variables using crosstabulation tables and chi-square tests. Testing for mean differences between groups using T Tests and analysis of variance (ANOVA) will be considered. Correlation and regression will be used to investigate the relationships between interval/scale variables and we will also discuss some non-parametric techniques. Graphs comprise an integral part of the analyses and we will demonstrate how to create and interpret these as well.

1.3 Basic Steps of the Research Process

All research projects, whether analyzing a survey, doing program evaluations, assessing marketing campaigns, doing pharmaceutical research, etc., can be broken down into a number of discrete components. These components can be categorized in a variety of ways. We might summarize the main steps as:
1) Specify exactly the aims and objectives of the research along with the main hypotheses.
2) Define the population and sample design.
3) Choose a method of data collection, design the research and decide upon an appropriate sampling strategy.
4) Collect the data.
5) Prepare the data for analysis.
6) Analyze the data.
7) Report the findings.

Some of these points may seem obvious, but it is surprising how often some of the most basic principles are overlooked, potentially resulting in data that is impossible to analyze with any confidence. Each step is crucial for a successful research project and it is never too early in the process to consider the methods that you intend to use for your data analysis.

In order to place the statistical techniques that we will discuss in this course in the broader framework of research design, we will briefly review some of the considerations of the first steps. Statistics and research design are highly interconnected disciplines and you should have a thorough grasp of both before embarking on a research project. This introductory chapter merely skims the surface of the issues involved in research design. If you are unfamiliar with these principles, we recommend that you refer to the research methodology literature for more thorough coverage of the issues.

Research Objectives

It is important that a research project begin with a set of well-defined objectives. Yet this step is often overlooked or poorly carried out. The specific aims and objectives may not be addressed because those commissioning the research do not know exactly which questions they would like answered. This rather vague approach can be a recipe for disaster and may result in a completely wasted opportunity, as the most interesting aspects of the subject matter under investigation could well be missed. If you do not identify the specific objectives, you may fail to collect the necessary information or to ask the necessary questions in the correct form. You can end up with a data file that does not contain the information that you need for your data analysis step.

For example, you may be asked to conduct a survey "to find out about alcohol consumption and driving". This general objective could lead to a number of possible survey questions.
Rather than proceeding with this general objective, you need to uncover more specific hypotheses that are of interest to your organization. This example could lead to a number of very specific research questions, such as:
• “What proportion of people admit to driving while above the legal alcohol limit?”
• “What demographic factors (e.g., age/sex/social class) are linked with a propensity to drunk-driving?”
• “Does having a conviction for drunk-driving affect attitudes towards driving while over the legal limit?”

These specific research questions would then define the questionnaire items. Additionally, the research questions will affect the definition of the population and the sampling strategy. For example, the third question above requires that the respondent have a drunk-driving conviction. Given that a relatively small proportion of the general population has such a conviction, you would need to take that into consideration when defining the population and sampling design.

Therefore, it is essential to state formally the main aims and objectives at the outset of the research so the subsequent stages can be done with these specific questions in mind.

1.4 Populations and Samples

In studies involving statistical analysis it is important to be able to characterize accurately the population under investigation. The population is the group to which you wish to generalize your conclusions, while the sample is the group you directly study. In some instances the sample and population are identical or nearly identical; consider the Census of any country. In the majority of studies, the sample represents a small proportion of the population.

In the example above, the population might be defined as those people with registered drivers' licenses. We could select a sample from the drivers' license registration list for our survey.
Other common examples are: membership surveys in which a small percentage of members are sent questionnaires, medical experiments in which samples of patients with a disease are given different treatments, marketing studies in which users and non-users of a product are compared, and political polling.

The problem is to draw valid inferences from data summaries in the sample so that they apply to the larger population. In some sense you have complete information about the sample, but you want conclusions that are valid for the population. An important component of statistics and a large part of what we cover in the course involves statistical tests used in making such inferences. Because the findings can only be generalized to the population under investigation, you should give careful thought to defining the population of interest to you and making certain that the sample reflects this population. To state it in a simple way, statistical inference provides a method of drawing conclusions about a population of interest based on sample results.

1.5 Research Design

With specific research goals and a target population in mind, it is then possible to begin the design stage of the research. There are many things to consider at the design stage. We will consider a few issues that relate specifically to data analysis and statistical techniques. This is not meant as a complete list of issues to consider. For example, for survey projects, the mode of data collection, question selection and wording, and questionnaire design are all important considerations. Refer to the survey research literature as well as general research methodology literature for discussion of these and other research design issues.

First, you must consider the type of research that will be most appropriate to the research aims and objectives. Two main alternatives are experimental and non-experimental research. The data may be recorded using either objective or subjective techniques.
The former includes items measured by an instrument and by computer such as physiological measures (e.g., heart rate) while the latter includes observational techniques such as recordings of a specific behavior and responses to questionnaire surveys.

Most research goals lend themselves to one particular form of research, although there are cases where more than one technique may be used. For example, a questionnaire survey would be inappropriate if the aim of the research was to test the effectiveness of different levels of a new drug to relieve high blood pressure. This type of work would be more suited to a tightly controlled experimental study in which the levels of the drug administered could be carefully controlled and objective measures of blood pressure could be accurately recorded. On the other hand, this type of laboratory-based work would not be a suitable means of uncovering people’s voting intentions.

The classic experimental design consists of two groups: the experimental group and the control group. They should be equivalent in all respects other than that those in the former group are subjected to an effect or treatment and those in the latter are not. Therefore, any differences between the two groups can be directly attributed to the effect of this treatment. The treatment variables are usually referred to as independent variables, and the quantity being measured as the effect is the dependent variable. There are many other research designs, but most are more elaborate variations on this basic theme.

In non-experimental research, you rarely have the opportunity to implement such a rigorously controlled design. For example, we cannot randomly assign students to schools. However, the same general principles apply to many of the analyses you perform.
1.6 Independent and Dependent Variables

In general, the dependent (sometimes referred to as the outcome) variable is the one we wish to study as a function of other variables. Within an experiment, the dependent variable is the measure expected to change as a result of the experimental manipulation. For example, a drug experiment designed to test the effectiveness of different sleeping pills might employ the number of hours of sleep as the dependent variable. In surveys and other non-experimental studies, the dependent variable is also studied as a function of other variables. However, no direct experimental manipulation is performed; rather the dependent variable is hypothesized to vary as a result of changes in the other (independent) variables.

Correspondingly, independent (sometimes referred to as predictor) variables are those used to measure features manipulated by the experimenter in an experiment. In a non-experimental study, they represent variables believed to influence or predict a dependent measure.

Thus terms (dependent, independent) reasonably applied to experiments have taken on more general meanings within statistics. Whether such relations are viewed causally, or as merely predictive, is a matter of belief and reasoning. As such, it is not something that statistical analysis alone can resolve. To illustrate, we might investigate the relationship between starting salary (dependent) and years of education, based on survey data, and then develop an equation predicting starting salary from years of education. Here starting salary would be considered the dependent variable although no experimental manipulation of education has been performed. One way to think of the distinction is to ask yourself which variable is likely to influence the other. In summary, the dependent variable is believed to be influenced by, or be predicted by, the independent variable(s).
Finally, in some studies, or parts of studies, the emphasis is on exploring and characterizing relationships among variables with no causal view or focus on prediction. In such situations there is no designation of dependent and independent variables. For example, in crosstabulation tables and correlation matrices the distinction between dependent and independent variables is not necessary. It rather resides in the eye of the beholder (researcher).

1.7 Note about Default Startup Folder and Variable Display in Dialog Boxes

In this course, all of the files used for the demonstrations and exercises are located in the folder c:\Train\Statistics_IntroAnalysis. You can set the startup folder that will appear in all Open and Save dialog boxes. We will use this option to set the startup folder.

Select Edit...Options, and then select the File Locations tab
Select the Browse button to the right of the Data Files text box
Select Train from the Look In: drop down list, then select Statistics_IntroAnalysis from the list of folders and select Set button
Click the Browse button to the right of the Other Files text box and repeat the process to set this folder to Train\Statistics_IntroAnalysis

Figure 1.1 Set Default File Location in the Edit Options Dialog Box

Note: If the course files are stored in a different location, your instructor will give you instructions specific to that location.

Either variable names or longer variable labels will appear in list boxes in dialog boxes. Additionally, variables in list boxes can be ordered alphabetically or by their position in the file. In this course, we will display variable names in alphabetical order within list boxes.
Select General tab
Select Display names in the Variable Lists group on the General tab
Select Alphabetical (Not shown)
Select OK and then OK in the information box to confirm the change

1.8 Lesson Summary

In this lesson, we reviewed the basic elements of quantitative research and issues that should be considered in data analysis.

Lesson Objectives Review

Students who have completed this lesson should now be able to:
• Explain the basic elements of quantitative research and issues that should be considered in data analysis

To support the achievement of the primary objective, students should now also be able to:
• Explain the basic steps of the research process
• Explain differences between populations and samples
• Explain differences between experimental and non-experimental research designs
• Explain differences between independent and dependent variables

1.9 Learning Activity

In this set of learning activities you won’t need any supporting material.

1. In each of the following scenarios, state the possible goals of the research, the type of design you can use, and the independent and dependent variables:
   a. The relationship between gender and whether a product was purchased.
   b. The difference between income categories (e.g., low, medium, and high) and number of years of education.
   c. The effect of two different marketing campaigns on number of items purchased.
2. In your own organization/field, are experimental studies ever done? If not, can you imagine how an experiment might be done to study a topic of interest to you or your organization? Describe that and the challenges such an experimental design would encounter.
Lesson 2: Understanding Data Distributions – Theory

2.1 Objectives

After completing this lesson students will be able to:
• Determine the level of measurement of variables and obtain appropriate summary statistics based on the level of measurement

To support the achievement of this primary objective, students will also be able to:
• Describe the levels of measurement used in PASW Statistics
• Use measures of central tendency and dispersion
• Use normal distributions and z-scores

Introduction

Ideally, we would like to obtain as much information as possible from our data. In practice, however, given the measurement level of our variables, only some information is meaningful. In this lesson we will discuss level of measurement and see how this determines the summary statistics we can request.

Business Context

Understanding how level of measurement impacts the kind of information we can obtain is an important step before we collect our data. In addition, level of measurement also determines the kind of research questions we can answer, and so this is a critical step in the research process.

Supporting Materials

The file Census.sav is a PASW Statistics data file from a survey done on the general adult population. Questions were included about various attitudes and demographic characteristics.

2.2 Levels of Measurement and Statistical Methods

The term levels of measurement refers to the properties and meaning of numbers assigned to observations for each item. Many statistical techniques are only appropriate for data measured at particular levels or combinations of levels. Therefore, when possible, you should determine the analyses you will be using before deciding upon the level of measurement to use for each of your variables.
For example, if you want to report and test the mean age of your customers, you will need to ask their age in years (or year of birth) rather than asking them to choose an age group into which their age falls.

Because measurement type is important when choosing test statistics, we briefly review the common taxonomy of level of measurement. The four major classifications that follow are found in many introductory statistics texts. They are presented beginning with the weakest and ending with those having the strongest measurement properties. Each successive level can be said to contain the properties of the preceding types and to record information at a higher level.

• Nominal: In nominal measurement each numeric value represents only a category or group identifier. The categories cannot be ranked and have no underlying numeric value. An example would be marital status, coded 1 (Married), 2 (Widowed), 3 (Divorced), 4 (Separated) and 5 (Never Married); each number represents a category and the matching of specific numbers to categories is arbitrary. Counts and percentages of observations falling into each category are appropriate summary statistics. Such statistics as the mean (the average marital status?) would not be appropriate, but the mode would be appropriate (the most frequent category).
• Ordinal: For ordinal measures the data values represent ranking or ordering information. However, the differences between adjacent data values along the scale are not necessarily equal. An example would be specifying how happy you are with your life, coded 1 (Very Happy), 2 (Happy), and 3 (Not Happy). There are specific statistics associated with ranks; PASW Statistics provides a number of them, mostly within the Crosstabs, Nonparametric and Ordinal Regression procedures. The mode and median can be used as summary statistics.
• Interval: In interval measurement, a unit increase in numeric value represents the same change in quantity regardless of where it occurs on the scale. For interval scale variables such summaries as means and standard deviations are appropriate. Statistical techniques such as regression and analysis of variance assume that the dependent (or outcome) variable is measured on an interval scale. Examples might be temperature in degrees Fahrenheit or SAT score.
• Ratio: Ratio measures have interval scale properties with the addition of a meaningful zero point; that is, zero indicates complete absence of the characteristic measured. For statistics such as ANOVA and regression only interval scale properties are assumed, so ratio scales have stronger properties than necessary for most statistical analyses. Health care researchers often use ratio scale variables (number of deaths, admissions, discharges) to calculate rates. The ratio of two variables with ratio scale properties can thus be directly interpreted. Money is an example of a ratio scale, so someone with $10,000 has ten times as much as someone with $1,000.

The distinction between the four types is summarized below.

Table 2.1 Level of Measurement Properties

Level of Measurement | Categories | Ranks | Equal Intervals | True Zero Point
Nominal              | Yes        | No    | No              | No
Ordinal              | Yes        | Yes   | No              | No
Interval             | Yes        | Yes   | Yes             | No
Ratio                | Yes        | Yes   | Yes             | Yes

These four levels of measurement are often combined into two main types: categorical, consisting of nominal and ordinal measurement levels, and scale (or continuous), consisting of interval and ratio measurement levels.

The measurement level variable attribute in PASW Statistics recognizes three measurement levels: Nominal, Ordinal and Scale. The icon indicating the measurement level is displayed preceding the variable name or label in the variable lists of all dialog boxes. The following table shows the most common icons used for the measurement levels.
Special data types, such as Date and Time variables, have distinct icons not shown in this table.

Table 2.2 Variable List Icons
[Table of icons by measurement level (Nominal, Ordinal, Scale) and data type (Numeric, String); icon images not reproduced. Some combinations are marked Not Applicable.]

Rating Scales and Dichotomous Variables

A common scale used in surveys and market research is an ordered rating scale usually consisting of five- or seven-point scales. Such ordered scales are also called Likert scales and might be coded 1 (Strongly Agree, or Very Satisfied), 2 (Agree, or Satisfied), 3 (Neither agree nor disagree, or Neutral), 4 (Disagree, or Dissatisfied), and 5 (Strongly Disagree, or Very Dissatisfied). There is an ongoing debate among researchers as to whether such scales should be considered ordinal or interval. PASW Statistics contains procedures capable of handling such variables under either assumption. When in doubt about the measurement scale, some researchers run their analyses using two separate methods, since each makes different assumptions about the nature of the measurement. If the results agree, the researcher has greater confidence in the conclusion.

Dichotomous (binary) variables containing two possible responses (often coded 0 and 1) are often considered to fall into all of the measurement levels except ratio (at least as independent variables). As we will see, this flexibility allows them to be used in a wide range of statistical procedures.

Implications of Measurement Level

As we have discussed, the level of measurement of a variable is important because it determines the appropriate summary statistics, tables, and graphs to describe the data. The following table summarizes the most common summary measures and graphs for each of the measurement levels and PASW Statistics procedures that can produce them.
Table 2.3 Summary of Descriptive Statistics and Graphs

                             | NOMINAL                                    | ORDINAL                                     | SCALE
Definition                   | Unordered Categories                       | Ordered Categories                          | Metric/Numeric Values
Examples                     | Labor force status, gender, marital status | Satisfaction ratings, degree of education   | Income, height, weight
Measures of Central Tendency | Mode                                       | Mode, Median                                | Mode, Median, Mean
Measures of Dispersion       | N/A                                        | Min/Max/Range, InterQuartile Range (IQR)    | Min/Max/Range, IQR, Standard Deviation/Variance
Graph                        | Pie or Bar                                 | Pie or Bar                                  | Histogram, Box & Whisker, Stem & Leaf
Procedures                   | Frequencies                                | Frequencies                                 | Frequencies, Descriptives, Explore

Measurement Level and Statistical Methods

Statistics are available for variables at all levels of measurement for more advanced analysis. In practice, your choice of method depends on the questions you are interested in asking of the data and the nature of the measurements you make. The table below suggests which statistical techniques are most appropriate, based on the measurement level of the dependent and independent variable. Much more extensive diagrams and discussion are found in Andrews et al. (1981), or other standard statistical texts.

Table 2.4 Level of Measurement and Appropriate Statistical Methods

Dependent Variable | Independent: Nominal                    | Independent: Ordinal                                   | Independent: Interval/Ratio
Nominal            | Crosstabs                               | Crosstabs                                              | Discriminant, Logistic Regression
Ordinal            | Nonparametric tests, Ordinal Regression | Nonparametric correlation, Optimal Scaling Regression  | Ordinal Regression
Interval/Ratio     | T Test, ANOVA                           | Nonparametric Correlation                              | Correlation, Linear Regression

Best Practice

If in doubt about the measurement properties of your variables, you can apply a statistical technique that assumes weaker measurement properties and compare the results to methods making stronger assumptions. A consistent answer provides greater confidence in the conclusions.

Apply Your Knowledge

1.
PASW Statistics distinguishes three levels of measurement. Which of these is not one of those levels?
   a. Categorical
   b. Scale
   c. Nominal
   d. Ordinal
2. True or false? An ordinal variable has all properties of a nominal variable.
3. Consider the dataset depicted below. Which statements are correct?
   a. The variable region is an ordinal variable
   b. The variable age is a scale variable
   c. The variable agecategory is an ordinal variable
   d. The variable salarycategory is a scale variable

2.3 Measures of Central Tendency and Dispersion

Measures of central tendency and dispersion are the most common measures used to summarize the distribution of variables. We give a brief description of each of these measures below.

Measures of Central Tendency

Statistical measures of central tendency give that one number that is often used to summarize the distribution of a variable. They may be referred to generically as the "average." There are three main central tendency measures: mode, median, and mean. In addition, Tukey devised the 5% trimmed mean.

• Mode: The mode for any variable is merely the group or class that contains the most cases. If two or more groups contain the same highest number of cases, the distribution is said to be “multimodal.” This measure is more typically used on nominal or ordinal data and can easily be determined by examining a frequency table.
• Median: If all the cases for a variable are arranged in order according to their value, the median is that value that splits the cases into two equally sized groups. The median is the same as the 50th percentile. Medians are resistant to extreme scores, and so are considered robust measures of central tendency.
• Mean: The mean is the simple arithmetic average of all the values in the distribution (i.e., the sum of the values of all cases divided by the total number of cases). It is the most commonly reported measure of central tendency.
The mean, along with the associated measures of dispersion, is the basis for many statistical techniques.
• 5% trimmed mean: The 5% trimmed mean is the mean calculated after the extreme upper 5% and the extreme lower 5% of the data values are dropped. Such a measure is resistant to extreme values.

The specific measure that you choose will depend on a number of factors, most importantly the level of measurement of the variable. The mean is considered the most "powerful" measure of the three classic measures of central tendency. However, it is good practice to compare the median, mean, and 5% trimmed mean to get a more complete understanding of a distribution.

Measures of Dispersion

Measures of dispersion or variability describe the degree of spread, dispersion, or variability around the central tendency measure. You might think of this as a measure of the extent to which observations cluster within the distribution. There are a number of measures of dispersion, including simple measures such as maximum, minimum, and range, common statistical measures such as standard deviation and variance, as well as the interquartile range (IQR).

• Maximum: Simply the highest value observed for a particular variable. By itself, it can tell us nothing about the shape of the distribution, merely how high the top value is.
• Minimum: The lowest value in the distribution; like the maximum, it is only useful when reported in conjunction with other statistics.
• Range: The difference between the maximum and minimum values gives a general impression of how broad the distribution is. It says nothing about the shape of a distribution and can give a distorted impression of the data if just one case has an extreme value.
• Variance: Both the variance and standard deviation provide information about the amount of spread around the mean value. They are overall measures of how clustered around the mean the data values are.
The variance is calculated by summing the square of the difference between the value and the mean for each case and dividing this quantity by the number of cases minus 1. If all cases had the same value, the variance (and standard deviation) would be zero. The variance measure is expressed in the units of the variable squared. This can cause difficulty in interpretation, so more often the standard deviation is used. In general terms, the larger the variance, the more spread there is in the data; the smaller the variance, the more the data values are clustered around the mean.
• Standard Deviation: The standard deviation is the square root of the variance, which restores the value of variability to the units of measurement of the original variable. It is therefore easier to interpret. Either the variance or standard deviation is often used in conjunction with the mean as a basis for a wide variety of statistical techniques.
• Interquartile Range (IQR): This measure of variation is the range of values between the 25th and 75th percentile values. Thus, the IQR represents the range of the middle 50 percent of the sample and is more resistant to extreme values than the standard deviation.

Like the measures of central tendency, these measures differ in their usefulness with variables of different measurement levels. The variability measures, variance and standard deviation, are used in conjunction with the mean for statistical evaluation of the distribution of a scale variable. The other measures of dispersion, although less useful statistically, can provide useful descriptive information about a variable.

Apply Your Knowledge

1. True or false? The mode is that value that splits the cases into two equally sized groups.
2. True or false? Consider the table depicted below. The salaries of men are clustered tighter around their mean than the salaries of women around their mean.
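
The measures of central tendency and dispersion described in section 2.3 can be sketched outside PASW Statistics as well. The following is a minimal illustration in Python using only the standard library; the salary values are hypothetical and made up for this example, and the quartile convention used by `statistics.quantiles` is an assumption that may not match the percentile method PASW Statistics applies.

```python
import statistics

# Hypothetical sample of salaries (in $1,000s); invented for illustration only.
salaries = [22, 25, 25, 27, 30, 31, 33, 36, 40, 95]

# Measures of central tendency.
mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)

# Sample variance as described in the text: sum of squared differences
# from the mean, divided by the number of cases minus 1.
n = len(salaries)
variance = sum((x - mean) ** 2 for x in salaries) / (n - 1)

# Standard deviation: square root of the variance, back in original units.
std_dev = variance ** 0.5

# Simple measures of dispersion.
value_range = max(salaries) - min(salaries)

# Quartile cut points (25th, 50th, 75th percentiles). Several percentile
# conventions exist, so the IQR may differ slightly between packages.
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std_dev={std_dev:.2f}")
print(f"range={value_range}, IQR={iqr}")
```

In this made-up sample the single extreme value (95) pulls the mean above the median, which illustrates why the median and IQR are described as resistant to extreme values.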
2.4 Normal Distributions

An important statistical concept is that of the normal distribution. This is a frequency (or probability) distribution which is symmetrical and is often referred to as the normal bell-shaped curve. The histogram below illustrates a normal distribution. The mean, median and mode exactly coincide in a perfectly normal distribution. And the proportion of cases contained within any portion of the normal curve can be exactly calculated mathematically.

Its symmetry means that 50% of cases lie to either side of the central point as defined by the mean. Two of the other most frequently-used representations are the portions lying between plus and minus one standard deviation of the mean (containing approximately 68% of cases) and that between plus and minus 1.96 standard deviations (containing approximately 95% of cases, sometimes rounded up to 2.00 for convenience). Thus, if a variable is normally distributed, we expect 95% of the cases to be within roughly 2 standard deviations from the mean.

Figure 2.1 The Normal Distribution

Many naturally occurring phenomena, such as height, weight and blood pressure, are distributed normally. Random errors also tend to conform to this type of distribution. It is important to understand the properties of normal distributions and how to assess the normality of particular distributions because of their theoretical importance in many inferential statistical procedures.

2.5 Standardized (Z-) Scores

The properties of the normal distribution allow us to calculate a standardized score, often referred to as a z-score, which indicates the number of standard deviations above or below the sample mean for each value. Standardized scores can be used to calculate the relative position of each value in the distribution.
Z-scores are most often used in statistics to standardize variables of unequal scale units for statistical comparisons or for use in multivariate procedures.

For example, if you obtain a score of 68 out of 100 on a test of verbal ability, this information alone is not enough to tell how well you did in relation to others taking the test. However, if you know the mean score is 52.32, the standard deviation 8.00 and the scores are normally distributed, you can calculate the proportion of people who achieved a score at least as high as your own.

The standardized score is calculated by subtracting the mean from the value of the observation in question (68 - 52.32 = 15.68) and dividing by the standard deviation for the sample (15.68 / 8 = 1.96).

    Standardized Score = (Case Score - Sample Mean) / Standard Deviation

Therefore, the mean of a standardized distribution is 0 and the standard deviation is 1.

In this case, your score of 68 is 1.96 standard deviations above the mean.

The histogram of the normal distribution above displays the distribution as a Z-score so the values on the x-axis are standard deviation units. From this figure, we can see only 2.5% of the cases are likely to have a score above 68 on the verbal ability test (1.96 standardized score). The normal distribution table (see below), found in an appendix of most statistics books, shows proportions for z-score values.

Table 2.5 Normal Distribution Table

A score of 1.96, for example, corresponds to a value of .025 in the “one-tailed” column and .050 in the “two-tailed” column. The former means that the probability of obtaining a z-score at least as large as +1.96 is .025 (or 2.5%), the latter that the probability of obtaining a z-score of more than +1.96 or less than -1.96 is .05 (or 5%) or 2.5% at each end of the distribution. You can see these cutoffs in the histogram above.
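The verbal ability example can be checked with a short Python sketch. The function names `z_score` and `upper_tail_probability` are illustrative assumptions; the tail probability uses the standard relationship between the normal distribution and the complementary error function (`math.erfc`) instead of a printed table.

```python
import math

def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

def upper_tail_probability(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Figures from the verbal ability example: score 68, mean 52.32, SD 8.00
z = z_score(68, 52.32, 8.00)
one_tailed = upper_tail_probability(z)           # area above +1.96
two_tailed = 2 * upper_tail_probability(abs(z))  # area beyond +/-1.96

print(round(z, 2), round(one_tailed, 3), round(two_tailed, 3))  # prints: 1.96 0.025 0.05
```

The rounded results match the one-tailed and two-tailed values quoted from the normal distribution table.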
As we mentioned, another advantage of standardized scores is that they allow for comparisons on variables measured in different units. For example, in addition to the verbal test score, you might have a mathematics test score of 150 out of 200 (or 75%). Although it appears that you did better on the mathematics test from the percentages alone, you would need to calculate the z-score for the mathematics test and compare the z-scores in order to answer the question.

You might want to compute z-scores for a series of variables and determine whether certain subgroups of your sample are, on average, above or below the mean on these variables by requesting descriptive statistics or using the Case Summaries procedure. For example, you might want to compare a customer’s yearly revenue using z-scores.

2.6 Requesting Standardized (Z-) Scores

The Descriptives procedure has an option to calculate standardized score variables. A new variable containing the standardized values is calculated for the specified variables.

Creating standardized scores is accomplished by following these steps:

  1) Choose variables to transform into standardized scores.
  2) Review the new variables that were created.

2.7 Standardized (Z-) Scores Output

The Descriptives procedure provides descriptive statistics of the original variables. The standardized variables will appear in the Data Editor.

Figure 2.2 Example of Descriptives Output

2.8 Procedure: Descriptives for Standardized (Z-) Scores

The Descriptives procedure is accessed from the Analyze…Descriptive Statistics…Descriptives menu choice. With the Descriptives dialog box open:

  1) Place one or more scale variables in the Variable(s) box.
  2) Select the Save standardized values as variables box.
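To make the effect of the Save standardized values as variables option concrete, here is a minimal Python sketch of the same idea outside SPSS. It is not the SPSS implementation: the helper `add_z_scores`, the three made-up cases, and the use of the sample standard deviation (n - 1 in the denominator) are assumptions for illustration.

```python
import math

def add_z_scores(rows, variables):
    """Add a 'Z'-prefixed standardized copy of each named variable to every row."""
    for name in variables:
        values = [row[name] for row in rows if row[name] is not None]
        mean = sum(values) / len(values)
        # Sample standard deviation (n - 1 in the denominator); an assumption here
        std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
        for row in rows:
            # A missing value stays missing in the standardized variable
            row["Z" + name] = None if row[name] is None else (row[name] - mean) / std_dev
    return rows

# Hypothetical cases, not taken from any course data file
cases = [
    {"educ": 12, "age": 60},
    {"educ": 16, "age": 30},
    {"educ": 14, "age": 45},
]
add_z_scores(cases, ["educ", "age"])
```

After the call, each case carries two extra fields, `Zeduc` and `Zage`. In this made-up data the first case is one standard deviation below the mean on education and one above the mean on age.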
Figure 2.3 Descriptives Dialog Box to Create Z-Scores

2.9 Demonstration: Descriptives for Z-Scores

We will work with the Census.sav data file in this example. We create standardized scores for number of years of education (educ) and age of respondent (age). We would like to determine where respondents fall on the distribution of these variables.

Detailed Steps for Z-Scores

  1) Place the variables educ and age in the Variable(s) box.
  2) Select the Save standardized values as variables box.

Results from Z-Scores

By default, the new variable name is the old variable name prefixed with the letter "Z". Two new variables, zeduc and zage, containing the z-scores of the two variables, are created at the end of the data file. These variables can be saved in your file and used in any statistical procedure.

We observe that:

  • The first person (row) in the data file is below the average on education but above the average on age.

Figure 2.4 Two Z-score Variables in the Data Editor

Apply Your Knowledge

  1. True or false? It is meaningful to calculate standardized scores only for variables with a scale measurement level in PASW Statistics.
  2. Consider the data below, where we computed standardized values for the variables educ (highest year of education) and salary (salary in dollars). Which of the following statements are correct?
     a. The observation with employee_id=49 has a salary very close to the mean salary.
     b. The observation with employee_id=50 has a salary that is more than one standard deviation above the mean.
     c. The observation with employee_id=46 is more extreme in her education than in salary.

Additional Resources

For additional information on Level of Measurement and Statistical Tests, see:

Further Info

Andrews, Frank M., Klem, L., Davidson, T.N., O’Malley, P.M. and Rodgers, W.L. 1981. A Guide for Selecting Statistical Techniques for Analyzing Social Science Data. Ann Arbor, MI: Institute for Social Research, University of Michigan.

Velleman, Paul F. and Wilkinson, L. 1993. “Nominal, Ordinal, Interval, and Ratio Typologies are Misleading,” The American Statistician, vol. 47, pp. 65-72.

2.10 Lesson Summary

We explored the concept of the level of measurement and the appropriate summary statistics given level of measurement. We also discussed the normal distribution and z-scores.

Lesson Objectives Review

Students who have completed this lesson should now be able to:

  • Determine the level of measurement of variables and obtain appropriate summary statistics based on the level of measurement

To support the achievement of the primary objective, students should now also be able to:

  • Describe the levels of measurement used in PASW Statistics
  • Use measures of central tendency and dispersion
  • Use normal distributions and z-scores

2.11 Learning Activity

The overall goal of this learning activity is to create standardized (Z-) scores for several variables. In this set of learning activities you will use the Drinks.sav data file.

Supporting Materials

The file Drinks.sav, a PASW Statistics data file that contains hypothetical data on 35 beverages. Included is information on their characteristics (e.g., % alcohol), price, origin, and a rating of quality.

  1. Create standardized scores for all scale variables (price through alcohol). Which beverages have positive standardized scores on every variable? What does this mean?
  2. What is the most extreme z-score on each variable? What is the most extreme z-score across all variables?
  3. What beverage is most typical of all beverages, that is, has z-score values closest to 0 for these variables?
  4. If the variable is normally distributed, what percentage of cases should be more than 1 standard deviation above the mean or more than 1 standard deviation below the mean?
     Calculate this percentage for a couple of the variables. Is the percentage of beverages with an absolute z-score above 1 close to the theoretical value?

Lesson 3: Data Distributions for Categorical Variables

3.1 Objectives

After completing this lesson students will be able to:

  • Run the Frequencies procedure to obtain appropriate summary statistics for categorical variables

To support the achievement of this primary objective, students will also be able to:

  • Use the options in the Frequencies procedure
  • Interpret the results of the Frequencies procedure

3.2 Introduction

As a first step in analyzing data, one must gain knowledge of the overall distribution of the individual variables and check for any unusual or unexpected values. You often want to examine the values that occur in a variable and the number of cases in each. For some variables, you want to summarize the distribution of the variable by examining simple summary measures including the mode, median, and minimum and maximum values. In this lesson, we will review tables and graphs appropriate for describing categorical (nominal and ordinal) variables.

Business Context

Summaries of individual variables provide the basis for more complex analyses. There are a number of reasons for performing single variable analyses. One would be to establish base rates for the population sampled. These rates may be of immediate interest: What percentage of our customers is satisfied with services this year? In addition, studying a frequency table containing many categories might suggest ways of collapsing groups for a more succinct and statistically appropriate table. When studying relationships between variables, the base rates of the separate variables indicate whether there is a sufficient sample size in each group to proceed with the analysis.
A second use of such summaries would be as a data-checking device: unusual values would be apparent in a frequency table.

Supporting Materials

The file Census.sav, a PASW Statistics data file from a survey done on the general adult population. Questions were included about various attitudes and demographic characteristics.

3.3 Using Frequencies to Summarize Nominal and Ordinal Variables

The most common technique for describing categorical data is a frequency analysis, which provides a summary table indicating the number and percentage of cases falling into each category of a variable, as well as the number of valid and missing cases. We can also use the mode, which indicates the category with the highest frequency, and, if there is a large number of categories, the median (for ordinal variables), which is the value above and below which half the cases fall.

Figure 3.1 Typical Frequencies Table

To represent the frequencies graphically we use bar or pie charts.

  • A pie chart displays the contribution of parts to a whole. Each slice of a pie chart corresponds to a group that is defined by a single grouping variable.
  • A bar chart displays the count for each distinct value or category as a separate bar, allowing you to compare categories vertically.

Figure 3.2 Pie Chart Illustrated

Figure 3.3 Bar Chart Illustrated

3.4 Requesting Frequencies

Requesting Frequencies is accomplished by following these steps:

  1) Choose variables for the Frequencies procedure.
  2) Request additional summary statistics and graphs.
  3) Review the procedure output to investigate the distribution of the variables including:
     a. Frequency Tables
     b. Graphs

3.5 Frequencies Output

The information in the frequency table consists of counts and percentages:

  • The Frequency column contains counts, i.e., the number of occurrences of each data value.
  • The Percent column shows the percentage of cases in each category relative to the number of cases in the entire data set, including those with missing values.
  • The Valid Percent column contains the percentage of cases in each category relative to the number of valid (non-missing) cases.
  • The Cumulative Percent column contains the percentage of cases whose values are less than or equal to the indicated value. Cumulative percent is only useful for variables that are ordinal.

Figure 3.4 Example of Frequency Output

3.6 Procedure: Frequencies

The Frequencies procedure is accessed from the Analyze…Descriptive Statistics…Frequencies menu choice. With the Frequencies dialog box open:

  1) Place one or more variables in the Variable(s) box.
  2) Open the Statistics dialog to request summary statistics.
  3) Open the Charts dialog to request graphs.

Figure 3.5 Frequencies Dialog Box

In the Statistics dialog box:

  1) Ask for the appropriate measures of central tendency and dispersion.

Figure 3.6 Frequencies: Statistics Dialog Box

In the Charts dialog:

  1) Ask for the appropriate chart based on the scale of measurement of the variable.

Figure 3.7 Frequencies: Charts Dialog Box

3.7 Demonstration: Frequencies

We will work with the Census.sav data file in this lesson. In this example we examine the distribution of the variables marital and happy. These variables are either nominal or ordinal in scale of measurement.

Detailed Steps for Frequencies

  1) Place the variables marital and happy in the Variable(s) box.
  2) In the Statistics dialog, select Mode and Median.
  3) In the Charts dialog, select Bar Chart in the Chart Types area and Percentages in the Chart Value area.

Results from Frequencies

The first table produced is the table labeled Statistics.
Figure 3.8 Statistics for Marital Status and General Happiness

This table shows the number of cases having a valid value on Marital Status (2018) and General Happiness (2015), the number of cases having a (user- or system-) missing value (5 and 8, respectively) and the Mode and Median. The mode, the category that has the highest frequency, is a value of 1 and 2 respectively, and represents the category of “Married” for marital and the “Pretty Happy” group for happy. The median, the middle point of the distribution (50th percentile), is a value of 2 for both variables.

The second table shows the frequencies and percentages for each variable. This table confirms that almost half of the respondents are married. Since there is almost no missing data for marital status, the percentages in the Percent column and in the Valid Percent column are almost identical.

Figure 3.9 Frequency Table of Marital Status

Examine the table. Note the disparate category sizes. About half of the sample is married, and there is one category that has less than 5% of the cases. Before using this variable in a crosstabulation analysis, should you consider combining some of the categories with fewer cases? Decisions about collapsing categories usually have to do with which groups need to be kept distinct in order to answer the research question asked, and the sample sizes for the groups. For example, could we create a “was previously married” group?

The bar chart summarizes the distribution that we observed in the frequency table and allows us to “see” the distribution.

Figure 3.10 Bar Chart of Marital Status

Tip: For a nominal variable (where the order of the categories is arbitrary), sorting the table and graph descending on counts gives better insight into what the main categories are (use the Format subdialog box to sort descending on counts).

For the variable happy, over half of the people fall into one category, pretty happy.
Might it be interesting to look at the relationship between this variable and marital status: to what extent is general happiness related to marital status?

Figure 3.11 Frequency Table of General Happiness

Next we view a bar chart based on the general happiness variable. Does the picture make it easier to understand the distribution?

Figure 3.12 Bar Chart of General Happiness

Note: For an ordinal variable, sorting the categories on descending/ascending counts (which was useful for nominal variables) will disturb the natural order of categories and so is not as useful for an ordinal variable.

Apply Your Knowledge

  1. See the output below. Which statements are correct?
     a. The median is an appropriate statistic to report for the variable region.
     b. The region that has the highest frequency is the North.
     c. The cumulative percent is meaningful for region.
     d. The columns Percent and Valid Percent are identical because there are no missing values on region.
  2. See the output below. Which bar chart is best to present the distribution of the ordinal variable age of employees, in categories? Bars sorted descending on count (A) or sorted on ascending value (B)?
  3. See the table below (a frequency table of HAPPINESS OF MARRIAGE, with those not married defined as missing). Which statements are correct?
     a. 29.5% of those married are very happy in their marriage.
     b. The mode is the category pretty happy.
     c. 96.9% of those married are pretty happy or very happy.

Additional Resources

For additional information on how to present data in tables and graphs, see:

Further Info

Few, Stephen. 2004. Show Me the Numbers: Designing Tables and Graphs to Enlighten. Analytics Press.

3.8 Lesson Summary

In this lesson we used the Frequencies procedure to explore the distribution of categorical variables, via both tables and graphs.

Lesson Objectives Review

Students who have completed this lesson should now be able to:

  • Run the Frequencies procedure to obtain appropriate summary statistics for categorical variables

To support the achievement of the primary objective, students should now also be able to:

  • Use the options in the Frequencies procedure
  • Interpret the results of the Frequencies procedure

3.9 Learning Activity

The overall goal of this learning activity is to run Frequencies to explore the distributions of several variables. In the exercises you will use the data file Census.sav.

Supporting Materials

The file Census.sav, a PASW Statistics data file from a survey done on the general adult population. Questions were included about various attitudes and demographic characteristics.

  1. Run the Frequencies procedure on the following variables: sex, wrkstat (Labor Force Status), paeduc (Father’s highest degree), and satjob (Job or Housework). What is the scale of measurement for each? Request appropriate summary statistics and charts.
  2. For which of these variables is it appropriate to use the median? What conclusions can you draw about the distributions of these variables?
  3. What percent of respondents have a bachelor’s degree, or higher? What percent of respondents are working?
  4. How might you combine some of the categories of wrkstat to ensure that there are a sufficient number of respondents in each category?
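As a rough illustration of the frequency table columns described in section 3.5 (Frequency, Percent, Valid Percent, Cumulative Percent), the Python sketch below builds the same four columns for a small made-up variable. The function `frequency_table`, the coding scheme, and the use of `None` for missing values are assumptions for this example and do not reproduce SPSS output.

```python
from collections import Counter

def frequency_table(values):
    """Return one row per category with count, percent, valid percent, cumulative percent."""
    total = len(values)                                   # all cases, including missing
    counts = Counter(v for v in values if v is not None)  # valid cases only
    n_valid = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category in sorted(counts):  # ascending order, as for an ordinal variable
        count = counts[category]
        valid_percent = 100 * count / n_valid
        cumulative += valid_percent
        rows.append({
            "value": category,
            "frequency": count,
            "percent": 100 * count / total,
            "valid_percent": valid_percent,
            "cumulative_percent": cumulative,
        })
    return rows

# Hypothetical coding: 1 = very happy, 2 = pretty happy, 3 = not too happy, None = missing
happy = [1, 2, 2, 3, 2, 1, None, 2, 3, None]
table = frequency_table(happy)
mode = max(table, key=lambda row: row["frequency"])["value"]
```

With two of the ten made-up cases missing, the Percent and Valid Percent columns differ: category 2 accounts for 40% of all cases but 50% of valid cases.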
Lesson 4: Data Distributions for Scale Variables

4.1 Objectives

After completing this lesson students will be able to:

  • Request and interpret appropriate summary statistics for scale variables

To support the achievement of this primary objective, students will also be able to:

  • Use the options in the Frequencies, Descriptives, and Explore procedures
  • Interpret the results of the Frequencies, Descriptives, and Explore procedures

4.2 Introduction

As a first step in analyzing your data, you must first gain knowledge of the overall distribution of the individual variables and check for any unusual or unexpected values. You often want to examine the values that occur in a variable and the number of cases in each. For some variables, you want to summarize the distribution of the variable by examining simple summary measures including minimum and maximum values for the range. Frequently used summary measures describe the central tendency of the distribution, such as the arithmetic mean, and dispersion, the spread around the central point. In this lesson, we will review tables and graphs appropriate for describing scale (interval and ratio) variables.

Business Context

Summaries of individual variables provide the basis for more complex analyses. There are a number of reasons for performing single variable analyses. One would be to establish base rates for the population sampled. These rates may be of immediate interest: What is the average customer satisfaction? In addition, studying distributions might suggest ways of collapsing information for a more succinct and statistically appropriate table. When studying relationships between variables, the base rates of the separate variables indicate whether there is a sufficient sample size in each group to proceed with the analysis.
A second use of such summaries would be as a data-checking device, as unusual values would be apparent in tables.

Supporting Materials

The file Census.sav, a PASW Statistics data file from a survey done on the general adult population. Questions were included about various attitudes and demographic characteristics.

4.3 Summarizing Scale Variables Using Frequencies

When working with categorical variables, frequency tables containing counts and percentages are appropriate summaries. For a scale variable, counts and percentages may still be of interest, especially when the variables can take only a limited number of distinct values. For example, when working with a one to five point rating scale we might be very interested in knowing the percentage of respondents who reply “Strongly Agree.” However, as the number of possible response values increases, frequency tables based on interval scale variables become less useful. Suppose we asked respondents for their family income to the nearest dollar. It is likely that each response would have a different value and so a frequency table would be quite lengthy and not particularly helpful as a summary of the variable. In data cleaning, you might find a frequency table useful for examining possible clustering of cases on specific values or looking at cumulative percentages. But beware of using frequency tables for scale variables with many values, as they can be very long.

If the variables of interest are scale we can expand the summaries to include means, standard deviations and other statistical measures. You will want to spend some time looking over the summary statistics you requested. Do they make sense, or is something unusual?

For a categorical variable, we request a pie chart or a bar chart to graphically display the distribution of the variable. For a scale variable, a histogram is used to display the distribution.
Tip: A normal curve can be superimposed on the histogram and helps you to judge whether the variable is normally distributed.

4.4 Requesting Frequencies

Requesting statistics and a graphical display is accomplished by following these steps:

  1) Select variables in the Frequencies procedure.
  2) Request additional summary statistics and graphs.
  3) Review the procedure output to investigate the distribution of the variables including:
     a. Frequency tables (if requested)
     b. Statistics tables
     c. Graphs

4.5 Frequencies Output

Statistics for the variable are presented in a separate table.

Figure 4.1 Example of Summary Statistics for Frequencies Output

A histogram shows the distribution graphically. A histogram has bars, but, unlike the bar chart, they are plotted along an equal interval scale. The height of each bar is the count of values of a quantitative variable falling within the interval. A histogram shows the shape, center, and spread of the distribution.

Figure 4.2 Example of Histogram for Frequencies Output

4.6 Procedure: Frequencies

The Frequencies procedure is accessed from the Analyze…Descriptive Statistics…Frequencies menu choice. With the Frequencies dialog box open:

  1) Place one or more variables in the Variable(s) box.
  2) Deselect the Display frequency tables check box for variables with many values.
  3) Open the Statistics dialog to request summary statistics.
  4) Open the Charts dialog to request graphs.

Figure 4.3 Frequencies Dialog Box

In the Statistics dialog:

  1) Select appropriate measures of central tendency.
  2) Select appropriate measures of dispersion.

Figure 4.4 Frequencies: Statistics Dialog Box

In the Charts dialog:

  1) Ask for a histogram for scale variables.
  2) Optionally, superimpose a normal curve on the histogram.
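Section 4.5 describes a histogram as bars over equal intervals, with each bar's height equal to the count of values falling in that interval. The following Python sketch shows that counting step only. The function `histogram_counts`, the bin boundaries, and the ages are hypothetical choices for illustration; SPSS selects its own bin widths.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in each equal-width interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the plotted range are ignored
            counts[index] += 1
    return counts

# Made-up ages; bins are 10-29, 30-49, 50-69, 70-89
ages = [18, 22, 25, 31, 34, 38, 41, 45, 47, 52, 58, 63, 71, 89]
counts = histogram_counts(ages, start=10, width=20, n_bins=4)

# Text rendering: one row per interval, one '#' per case
for i, count in enumerate(counts):
    low = 10 + 20 * i
    print(f"{low}-{low + 19}: {'#' * count}")
```

Because the intervals are equally wide, bar heights can be compared directly, which is what distinguishes a histogram from a bar chart of arbitrary categories.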
Figure 4.5 Frequencies: Charts Dialog Box

4.7 Demonstration: Frequencies

We will work with the Census.sav data file in this lesson. In this demonstration we examine the distribution of number of brothers and sisters (sibs) and respondent’s age. We would like to see the distribution of these variables.

Detailed Steps for Frequencies

  1) Place the variables sibs and age in the Variable(s) box.
  2) Deselect the Display frequency tables check box for variables with many values.
  3) Select Mode, Median, Mean, Minimum, Maximum and Standard Deviation in the Statistics dialog.
  4) Select Histograms and Show normal curve on histogram in the Charts dialog.

Note: If you request histograms and summary statistics for scale variables with many categories, you might want to uncheck (turn off) Display frequency tables in the Frequencies dialog box, as there may be almost as many distinct values as there are cases in the data file.

Results from Frequencies

The table labeled Statistics shows the requested statistics.

Figure 4.6 Summary Statistics for Number of Brothers and Sisters and Age of Respondent

This table shows the number of cases having a valid value on sibs (2021) and age (2013), the number of cases having a (user- or system-) missing value (2 and 10, respectively) and measures of central tendency and dispersion. The minimum value is 0 and the maximum value is 55 (seems unusual) for number of siblings. For age, the minimum value is 18 and the maximum value 89. Note that the means and medians within each variable are similar, which suggests (but does not establish) that the variables are roughly normally distributed within the defined range.

We can visually check the distribution of these variables with a histogram.
Figure 4.7 Histogram of Number of Brothers and Sisters

We can see that the lower range of values is truncated at 0 and the number of people is greatest between 0 and 6 siblings, although we do have some extreme values. The distribution is not normal.

Figure 4.8 Histogram of Age of Respondent

We can see that the lower range of values is truncated at 18 and the number of people is highest in the middle age values (the "baby boomers") with the number of cases tapering off at the higher ages as we would expect. Thus, the age variable for respondents of this sample of adults is roughly normally distributed.

Apply Your Knowledge

  1. Suppose we have a variable region (with the categories north/east/south/west). Which of these statements is true?
     a. The mean is a meaningful statistic for region.
     b. The standard deviation is a meaningful statistic for region.
     c. The median is a meaningful statistic for region.
  2. See output below, with statistics for two variables: Current Salary and Beginning Salary (data collected on employees). Which statements are correct?
     a. There are 474 cases in the dataset.
     b. Both variables are skewed to the right (meaning: there are employees with some large salaries compared to the average).
     c. Half of the employees have a current salary below 30,750.
     d. The highest current salary is 135,000.
  3. See the histogram below for Current Salary (data collected on employees). Which of these statements is correct?
     a. The variable seems normally distributed.
     b. The variable is skewed to the right.
     c. The standard deviation would be smaller if the case with a salary of 135,000 were removed from the histogram.
4.8 Summarizing Scale Variables using Descriptives

The Descriptives procedure is a good alternative to Frequencies when the objective is to summarize scale variables. Descriptives is usually used to provide a table of statistical summaries (means, standard deviations, variance, minimum, maximum, etc.) for several scale variables. The Descriptives procedure also provides a succinct summary of the number of cases with valid values for each variable included in the table as well as the number of cases with valid values for all variables included in the table. These summaries are quite useful in evaluating the extent of missing values in your data and in identifying variables with missing values for a large proportion of the data.

4.9 Requesting Descriptives

Running Descriptives is accomplished by following these steps:

  1) Select variables for the Descriptives procedure.
  2) Review the procedure output to investigate the distribution of the variables.

4.10 Descriptives Output

The figure below shows the Descriptives output table for a few variables.

Figure 4.9 Example Descriptives Output

The minimum and maximum provide an efficient way to check for values outside the expected range. In general, this is a useful check for categorical variables as well. Thus, although mean and standard deviation are not relevant for respondent’s sex, minimum and maximum for this variable show that there are no values outside the expected range.

The last row in the table, labeled Valid N (listwise), gives the number of cases that have a valid value on all of the variables appearing in the table. In this example, 1333 cases have valid values for all three variables listed. Although this number is not particularly useful for this set of variables, it would be useful for a set of variables that you intended to use for a specific multivariate analysis.
As you proceed with your analysis plans, it is helpful to know how many cases have complete information and which variables are likely to be the main sources of potential problems.

4.11 Procedure: Descriptives

The Descriptives procedure is accessed from the Analyze…Descriptive Statistics…Descriptives menu choice. With the Descriptives dialog box open:

1) Place one or more variables in the Variable(s) box.

Figure 4.10 Descriptives Dialog Box

Only numeric variables appear in the Descriptives dialog box. The Save standardized values as variables feature creates new variables that are standardized forms of the original variables. These new variables, referred to as z-scores, have values standardized to a mean of 0 and standard deviation of 1.

The Options button allows you to select additional summary statistics to display. You can also select the display order of the variables in the table (for example, by ascending or descending mean value).

4.12 Demonstration: Descriptives

We will work with the Census.sav data file in this lesson. In this example we examine the summary statistics of number of siblings, respondent's age, education and respondent's gender. We would like to see the summary statistics for these variables, as well as how much missing data there is, and if there are unusual cases.

Detailed Steps for Descriptives

1) Place the variables sex, sibs, educ, and age in the Variable(s) box.

Results from Descriptives

The table labeled Descriptive Statistics contains the statistics.

Figure 4.11 Descriptives Output

The column labeled N shows the number of valid observations for each variable in the table. We see there is little variation in the number of valid observations. The number of valid cases can be a useful check on the data and help us determine which variables might be appropriate for specific analyses.
Here 2006 cases have valid values for the entire set of questions. The minimum and maximum provide an efficient way to check for values outside the expected range. Here the maximum for the variable sibs seems high and deserves further investigation.

4.13 Summarizing Scale Variables using the Explore Procedure

Exploratory data analysis (EDA) was primarily developed by John Tukey. He devised several statistical measures and plots designed to reveal data features that might not be readily apparent from standard statistical summaries. Exploratory data analysis can be viewed either as an analysis in its own right, or as a set of data checks that investigators perform before applying inferential testing procedures.

These methods are best applied to variables with at least ordinal (more commonly interval) or scale properties and which can take many different values. The plots and summaries would be less helpful for a variable that takes on only a few values (for example, a five-point scale).

4.14 Requesting Explore

Running Explore is accomplished with these steps:

1) Select variables on which to report statistics in the Dependent List box.
2) Select grouping variables in the Factor box.
3) Request additional summary statistics and graphs.
4) Review the procedure output to investigate the summary statistics and distribution of the variables, including tables and graphs.

Explore Output

The Descriptives table displays a series of descriptive statistics for age. From the previous table (not shown), we know that these statistics are based on 1763 respondents.

Figure 4.12 Summaries for Age of Respondent

First, several measures of central tendency appear: the Mean, 5% Trimmed Mean, and Median. These statistics attempt to describe with a single number where data values are typically found, or the center of the distribution.
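
To make the three measures of central tendency concrete, here is a minimal Python sketch. The sibling counts are hypothetical, and the trimmed_mean helper is a simplified assumption: it drops whole cases from each tail, which may not match the exact trimming rule SPSS applies.

```python
import statistics

# Hypothetical sibling counts with one extreme value (not from Census.sav).
sibs = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 31]


def trimmed_mean(values, percent=5):
    """Mean after dropping `percent` percent of the cases from each tail.

    Simplified: only whole cases are dropped (integer arithmetic).
    """
    ordered = sorted(values)
    k = (len(ordered) * percent) // 100  # cases to drop per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


mean = statistics.mean(sibs)
trimmed = trimmed_mean(sibs, percent=5)
median = statistics.median(sibs)

print("mean:", mean)
print("5% trimmed mean:", trimmed)
print("median:", median)
```

In this toy data the single value of 31 pulls the mean above both the trimmed mean and the median, the pattern that points to a right-skewed distribution.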
Useful information about the distribution can be gained by comparing these values to each other. If the mean were considerably above or below the median and trimmed mean, it would suggest a skewed or asymmetric distribution.

The measures of central tendency are followed in the table by several measures of dispersion or variability. These indicate to what degree observations tend to cluster or be widely separated. Both the standard deviation (Std. Deviation) and Variance (standard deviation squared) appear. The standard error (Std. Error) is an estimate of the standard deviation of the mean if repeated samples of the same size (here 1763) were taken. It is used in calculating the 95% confidence interval for the sample mean. Technically speaking, if we drew 100 samples of this size (1763) and constructed a 95% confidence interval for the mean for each of the 100 samples, then the expectation is that 95 out of these 100 intervals would contain the (unknown) population mean.

Also appearing is the Interquartile Range (often abbreviated to IQR), which is essentially the range between the 25th and the 75th percentile values. It is a variability measure more resistant to extreme scores than the standard deviation. We also see the Minimum, Maximum and Range. The final two statis...
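
The dispersion measures discussed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPSS computation: the ages are hypothetical, the confidence interval uses a normal critical value (reasonable for a large sample such as 1763, less so for this tiny toy sample), and the quartiles follow the default rule of Python's statistics.quantiles, which may differ from the quartile convention SPSS uses.

```python
import math
import statistics

# Hypothetical ages (not from Census.sav).
ages = [23, 27, 31, 35, 38, 42, 45, 51, 58, 70]

n = len(ages)
mean = statistics.mean(ages)
std_dev = statistics.stdev(ages)   # sample standard deviation
variance = std_dev ** 2            # standard deviation squared
std_error = std_dev / math.sqrt(n)  # estimated SD of the sample mean

# 95% confidence interval for the mean, normal approximation (assumption).
z = statistics.NormalDist().inv_cdf(0.975)
lower = mean - z * std_error
upper = mean + z * std_error

# Interquartile range: 75th percentile minus 25th percentile.
q1, _, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

print("mean:", mean)
print("std. deviation:", round(std_dev, 3))
print("std. error:", round(std_error, 3))
print("95% CI:", (round(lower, 2), round(upper, 2)))
print("IQR:", iqr)
```

Replacing the largest age with a far more extreme value would inflate the standard deviation and the confidence interval width, while the IQR would stay the same, which illustrates why the IQR is described as more resistant to extreme scores.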

Explanation & Answer

Attached.

Running head: MODULE 5 PROBLEM SET

Learning Activity 10.15
1. Suppose you are interested in understanding how an employee’s demographic
characteristics, beginning salary and time at the bank and in the work force are
related to current salary. Start by producing scatterplots of salbe, sex, time, age,
edlevel, and work with salnow. Add a fit line to each plot. Check on the variable
labels for time and work so you understand what these variables are measuring.
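
The fit line that SPSS adds to a scatterplot is a least-squares regression line. As a rough illustration of what that line represents, here is a minimal Python sketch; the salary figures are hypothetical (in thousands) and are not taken from Bank.sav, and fit_line is an illustrative helper rather than anything SPSS exposes.

```python
# Hypothetical beginning and current salaries in thousands (not from Bank.sav).
beginning_salary = [6, 8, 10, 12, 14]
current_salary = [11, 15, 14, 22, 28]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(beginning_salary, current_salary)
print("slope:", slope)
print("intercept:", intercept)
```

A positive slope means the line rises from left to right; a negative slope means it falls. The slope depends on the units of both variables, so it describes direction and rate of change rather than how closely the points follow the line.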

2. Describe the relationships based on the scatterplots. Do they all appear to be
linear? Are any relationships negative? What is the strongest relationship?
Chart 2: Beginning salary and current salary show a positive relationship with a
relatively steep slope of the fitted line, suggesting a strong relationship between the
two variables.
Chart 5: Job seniority and current salary show a positive linear relationship. The
relationship is not very clear, however, because the values of job seniority are spread
almost evenly, so the fitted line has only a slight slope.

Chart 7: Age of employee and current salary show a negative linear relationship. The
reported value of 0.021 is small, so the relationship is weak; the downward slope
indicates that salary tends to decrease as age increases.
Chart 9: Educational level and current salary show a strong positive linear
relationship. The fitted line predicts better at the lower educational levels.
Chart 11: Work experience and current salary show a negative linear relationship.
The reported value of 0.009 is very small, so the relationship is very weak; the downward slope indicates that as work experi...
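
When judging the scatterplots described above, the direction of a relationship comes from the sign of the correlation (or of the slope), while its strength comes from how tightly the points cluster around the line, which Pearson's r and its square summarize. The following minimal Python sketch uses two hypothetical data sets, neither taken from Bank.sav. Whether the small values quoted in the answer are R squared values from the fit line is an assumption; the text does not say.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one strong positive and one weak negative relationship.
x = [1, 2, 3, 4, 5]
y_strong = [3, 5, 4, 8, 10]
y_weak = [6, 3, 8, 2, 6]

r_strong = pearson_r(x, y_strong)
r_weak = pearson_r(x, y_weak)

print("strong: r =", round(r_strong, 3), "R squared =", round(r_strong ** 2, 3))
print("weak:   r =", round(r_weak, 3), "R squared =", round(r_weak ** 2, 3))
```

A negative r with an R squared close to zero, as in the second data set, corresponds to a fit line that slopes downward but explains very little of the variation in the outcome.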

