Chapter 6
Building and Using Models
All models are wrong, but some are useful.
—George E. P. Box
Model building is one of the methods used in the improvement frameworks discussed in Chapter
4. It is a broad topic; in this chapter we develop a basic understanding of the types and uses of
models and the functional capability to apply basic regression analysis to real problems. More
extensive study and applications experience will be required to develop mastery of the entire
field of regression analysis or of other, more complex model-building methods.
The power of statistical thinking is in developing process knowledge that can be used to manage
and improve processes. The most effective way to create process knowledge is to develop a
model that describes the behavior of the process. Webster’s New World College
Dictionary defines model as “a generalized, hypothetical description, often based on an analogy,
used in analyzing and explaining something.” In this chapter we learn how to integrate our
frameworks and tools (see Chapters 4 and 5) and enhance them by building models. Model
development is an iterative process in which we move back and forth between hypotheses about
what process variables are important and data that confirm or deny these hypotheses.
Our overall strategy is as follows: We build a model that relates the process outputs (y’s) to
process variables (x’s). We then use the model to (1) better understand the process, (2) control
the process, (3) predict future process performance, (4) measure the effects of process changes,
and (5) manage and improve the process. We build a model by collecting data. The process
outputs will display variation. Using statistical techniques, in this case least squares regression
analysis, we analyze the variation to discover the relationship between the process variables (x’s)
and the process outputs (y’s). These relationships are summarized in the form of a model. In
reviewing our strategy, we see that we have used all three elements of statistical thinking:
process, variation, and data.
Our goal is to build useful models. George Box points out that “all models are wrong, but some
are useful.”1 All models are wrong because we never know the true state of nature. Fortunately, a
model can be useful without exactly reproducing the phenomenon being studied. A useful model
is simple and parsimonious, yet enables us to manage the process effectively, predict future
process performance, and predict how process performance will change if the process operations
are changed.
There are two basic strategies for developing process models: analyzing existing process data
and proactively experimenting with the process. In this chapter we discuss the first approach.
This strategy uses the statistical techniques of correlation and regression analysis. The second
strategy involves experimenting with the process itself in a systematic, planned manner. This
strategy uses the technique of statistical design of experiments. Studying correlations between
variables using scatter plots was discussed in Chapter 5. Data collected without the aid of a
statistical design have many limitations (see the section titled “Some Limitations of Using
Existing Data”). It is easier to appreciate these limitations after one has an understanding of
regression analysis (discussed in this chapter) and design of experiments (discussed in Chapter
7).
In this chapter, we provide some examples of business models and discuss types and uses of
models. This is followed by a discussion of the use of regression analysis of process data to
construct process models. The chapter concludes with a summary of key points, a project update,
and exercises.
EXAMPLES OF BUSINESS MODELS
Discussing a variety of models is helpful in understanding the value and benefits of models and
seeing the variety of situations in which models are used. Box, Hunter, and Hunter call our
attention to a useful model.2 In Yellowstone National Park, the park rangers need to predict when
the Old Faithful geyser will erupt because this is what tourists come to see. Through data
analysis the rangers have determined that if the previous eruption lasted for 2 minutes or less, the
next eruption will be in 45 to 55 minutes. If the previous eruption lasted 3 to 5 minutes, the next
eruption will be in 70 to 85 minutes. The accuracy of this model and its predictions directly
affect the satisfaction of the tourists (i.e., customer satisfaction).
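The rangers' rule is itself a simple model and can be written down directly. A sketch in Python (the function name and the treatment of durations the rule does not cover are our own choices; the intervals are those quoted above):

```python
def predict_next_eruption(previous_duration_min):
    """Predict the interval to the next Old Faithful eruption.

    Returns a (low, high) interval in minutes, or None where the
    rangers' rule as quoted does not apply.
    """
    if previous_duration_min <= 2:
        return (45, 55)       # short eruption: next one in 45 to 55 minutes
    if 3 <= previous_duration_min <= 5:
        return (70, 85)       # long eruption: next one in 70 to 85 minutes
    return None               # durations between 2 and 3 minutes, or over 5, not covered
```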
One situation we all can relate to is weight gain of humans and the variables that affect weight
gain. Two key variables affect weight gain: the number of calories we consume and the amount
of exercise we do. In most situations increasing caloric intake increases our weight, whereas
increasing exercise decreases our weight. From a modeling viewpoint, we would say “weight
gain is a function of caloric intake and exercise.” This is represented mathematically as

weight gain = f(caloric intake, exercise)

where f is some unknown mathematical function. This is a crude model at this point. We do not
know the form of the function f, the magnitude and nature of the effects of caloric intake and
exercise on weight gain, or whether these two variables affect weight gain independently of each
other. The model is useful, however, because it tells us two key variables to pay attention to:
caloric intake and exercise.
We call the small subset of variables that have a major effect on the process “key drivers.”
Caloric intake and exercise are key drivers of (have a major influence on) weight gain of
humans. There are no doubt other key drivers of weight gain (e.g., heredity). Our experience is
that processes typically have five or six key drivers and perhaps a few other variables that have
smaller effects.
Another example of multiple drivers was the Chapter 2 case study for Anheuser-Busch
investigating the effects of advertising on sales.3 Three variables were studied: percent change in
advertising level, amount spent on sales effort, and amount spent on sales material. The purpose
of the study was to determine a functional relationship between sales and percent change in
advertising level, which turned out to have a complicated mathematical form.
In another advertising study, Montgomery, Peck, and Vining report data on advertising and sales
of 30 restaurants that could be described by this linear relationship:4

sales = $50,976 + $7.92 × (advertising expense)

This model was developed for advertising expenses ranging from approximately $3,000 to
$20,000 and suggests that a restaurant would have sales of $50,976 with no advertising and that
sales would increase $7.92 for every dollar spent in advertising. A straight-line relationship is the
simplest model possible to describe the relationship between two variables.
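Because the intercept and slope are stated, the fitted line can be evaluated directly. A small sketch (the function name is ours):

```python
def predicted_sales(advertising_dollars):
    """Fitted line from the restaurant study: sales = $50,976 + $7.92
    per advertising dollar. The model was developed for advertising
    expenses of roughly $3,000 to $20,000; predictions outside that
    range are extrapolations.
    """
    return 50976 + 7.92 * advertising_dollars
```

For example, predicted_sales(10000) gives $130,176 for a $10,000 advertising expense.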
Davis describes a model used by a major credit card company to predict the probability that a
delinquent account will pay its bill.5 Among other things, Davis found that the outcome of the
collection strategy depended on the type of account, the balance owed, and the number of
months it was overdue. Collection strategies developed using the model produced an annual
savings of $37 million and improved customer satisfaction. An unanticipated finding was that in
some instances the best strategy is “to do nothing.”
Oil companies typically develop blending models that describe the quality of the gasoline
(octane number) as a function of the proportion of the types of components that make up the
gasoline. A four-component linear blending model might look like this:

octane = b1x1 + b2x2 + b3x3 + b4x4

where x1, x2, x3, and x4 are the fractions of the different components in the gasoline and the b’s
are coefficients derived from the data. The coefficients (b’s) in the model describe the blending
behavior of the components and enable us to predict (calculate) the octane of a blend given the
volume fraction of the different components (x’s) in the blend.
For example, one blending model developed by Snee looked like this:6
where
x1 = light FCC
x2 = alkylate
x3 = butane
x4 = reformate
This model predicts that a gasoline consisting of 20% light FCC, 30% alkylate, 6% butane, and
44% reformate would have an octane of 101.3.
Blending models generally involve 10 to 15 gasoline components and often involve nonlinear
blending terms (e.g., x1x2, x5x6).
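To make the arithmetic concrete, here is a minimal sketch of evaluating a linear blending model. The coefficients below are made up for illustration; they are not the coefficients of Snee's model, so the predicted octane differs from the 101.3 quoted above.

```python
# Hypothetical octane blending coefficients (b's); illustrative only,
# not the values from Snee's published model.
coef = {"light FCC": 95.0, "alkylate": 105.0, "butane": 93.0, "reformate": 98.0}

def blend_octane(fractions, coef):
    """Linear blending model: octane = sum of b_i * x_i over components."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(coef[c] * x for c, x in fractions.items())

# The blend quoted in the text: 20% light FCC, 30% alkylate, 6% butane, 44% reformate
blend = {"light FCC": 0.20, "alkylate": 0.30, "butane": 0.06, "reformate": 0.44}
octane = blend_octane(blend, coef)
```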
Blending models, together with cost and manufacturing data, are typically used as inputs to
linear and nonlinear programming algorithms to develop minimum cost blending strategies. They
help management make decisions such as ways to operate a process or run a business to
maximize profit or minimize cost. When the objective function to be maximized (profit) or
minimized (cost) and the constraints are all linear, we use a linear programming algorithm.
Nonlinear programming algorithms are used when nonlinear equations are involved in either
constraints or the objective function.7
Another area where models can be helpful is in predicting the workforce needed to handle the
workload in call centers and telephone repair centers. In both instances it is advantageous to be
able to accurately predict the workload so that the right-size workforce is scheduled: Too few
workers produce unsatisfied customers; too many workers unnecessarily increase costs.
Similar analyses are needed to “size” web sites (determine the appropriate number of servers or
routers).
Studies have shown that telephone repair needs vary with the day of the week (Monday has the
highest load), the season of the year, rainfall, and temperature. Models that predict repair loads
from these variables are very effective in managing workloads and maintaining customer
satisfaction. In addition to helping understand how the variables affect the repair load, these
models can be used for daily resource management, seasonal planning, and risk assessment of
the effects of large rainstorms.
The models discussed in the previous paragraphs were developed based on empirical relations
derived from data. Models can also be based on theoretical relationships, such as engineering
laws, econometric theories, and chemical reaction kinetics. For example, in the case of chemical
reactions in which material A reacts to produce product B, and it is known that first-order
kinetics apply, the amount of product B formed at any point in time (x) is given by

y = b1[1 − e^(−b2x)]

where the natural number e is about 2.72. The coefficients b1 and b2 are unique to materials A
and B and are derived from data. This model is called a theoretical or mechanistic
model because it is based on theoretical considerations regarding the nature of chemical
reactions. The coefficient b1 is the concentration of material B at the end of the reaction and b2 is
the rate of the reaction.
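Assuming the standard integrated first-order form y = b1[1 − e^(−b2x)], consistent with the description of b1 as the final concentration and b2 as the reaction rate, the model is easy to evaluate (the coefficient values below are illustrative, not from a real reaction):

```python
import math

def product_concentration(x, b1, b2):
    """Amount of product B at time x under first-order kinetics:
    y = b1 * (1 - exp(-b2 * x)).  b1 is the concentration of B at the
    end of the reaction; b2 is the reaction rate. Both come from data."""
    return b1 * (1.0 - math.exp(-b2 * x))

# Illustrative (hypothetical) coefficients
b1, b2 = 0.8, 0.5
curve = [product_concentration(x, b1, b2) for x in range(0, 11)]
```

The curve starts at zero and rises toward the asymptote b1 as the reaction runs to completion.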
Models have different amounts of complexity and different uses. In the next section, we discuss
both in greater detail.
TYPES AND USES OF MODELS
A key distinction to grasp is between empirical models and theoretical models. Empirical models
are based on data and are developed by studying plots of the relationship between the process
outputs (y) and one or more predictor variables (x’s). The observed relationships are
subsequently described and quantified by a mathematical equation. The resulting equation is a
model for the relationship. A linear relationship would be described by the equation

y = b0 + b1x

where y is the process output of interest, x is the predictor variable, and b0 and b1 describe the
intercept and slope of the line. Linear relationships are typically identified by plotting y versus x,
hence the form of the model is identified empirically from the data. The linear model for the
relationship between restaurant sales and advertising discussed earlier was developed in this
manner.
Empirical models can take a number of forms and often include curvilinear terms (squares such
as x1² and cross products such as x1x2) and exponential terms (e^x). Empirical models and their
formulation will be discussed further in the following section titled “Regression Modeling
Process.” Theoretical models are based on known theoretical relationships and mechanistic
understanding. The chemical reaction rate equation discussed in the section titled “Examples of
Business Models” is an example of a theoretical model.
Theoretical models are difficult to develop, and can be very complex. In a business setting,
economic theory is often used to decide which variables should be considered and what form of
mathematical relationship should be used in the model. Theoretical knowledge should be
incorporated in the empirical modeling process wherever it is available, but the need for a
theoretical model often depends on the use of the model. As Box, Hunter, and Hunter point out,
the empirical model used by the park rangers to predict the eruptions of Old Faithful is good
enough for their purposes.8 Geologists, however, might be interested in the hydrological
mechanisms that govern the geyser’s eruptions and study rate of pressure buildup, water
temperature, and so on. Such studies might improve predictions of the time of eruption. In most
instances, an empirical model will be sufficient. If deep understanding is needed, then the
consideration of fundamental theoretical laws is in order. Theoretical models generally do a
better job of predicting outside the region of the data (extrapolation) than do empirical models.
Another type of model often used in the analysis of time-related data, such as econometric data,
is the time series model. Time series models typically predict the value for a given period (a day,
a month, or a year) as a function of the values of the variable observed on previous days, months,
or years. The basic assumption is that there is a predictable pattern in the past behavior of the
time series, and that the process that produced the data will not change going forward. We can
thus use the time series model to predict the behavior of the process in the future. Time series
models use autocorrelation functions and are beyond the scope of this book.9
Recently, there has been much research in the field of data mining, or analysis of massive data
sets. With the growth of Internet commerce, companies such as Google and Amazon have been
able to accumulate millions or even billions of data records, which often include thousands of
individual variables. A number of advanced modeling methods for identifying structure in these
massive data sets have been developed, such as classification and regression trees (CART),
random forests, neural networks, and nearest neighbors, to name just a few. In addition, a whole
new discipline, often referred to as machine learning, has developed that attempts to automate
much of the modeling process. In other words, this discipline is developing methods that would
allow computers (the machine) to learn about the data and identify the most appropriate model
among an almost limitless set of potential models.
Uses of Models
There are four main uses of models:
1. Predict future process performance.
2. Measure the effects of process changes.
3. Control, manage, and improve the process.
4. Deepen understanding of the process.
Models that forecast future sales, revenues, or earnings are examples of using models to make
predictions. A model can also help us determine the effects of potential process changes without
disturbing the existing process. Using a model we can analyze the predicted process performance
for various conditions of interest. It is also often important to predict how the process
performance will change given specified changes in the levels of the predictor variables. For
example, “If I double my advertising expense, how will my sales volume change?” The third use
of models helps us control and manage the process to produce consistent products. When the
process is off target, the model can tell us our options: which variables can be changed by how
much to get the process back on target.
Finally, as we construct and use the model, we deepen our understanding of the process and the
variables that drive its performance. The modeling process, if done properly, forces us to look
systematically at all parts of the process. At some point even the best models fail. The detection
of the failure and subsequent enhancement of the model adds to our knowledge of the process.
One strategy for developing models is to observe the process “as is” by collecting process data.
When we have the data in hand, we need a technique to analyze the data and build the model,
and regression analysis is the technique we often use. Regression analysis is sometimes referred
to as a mathematical French curve because it helps us smooth out variation and identify
predictive relationships in many dimensions (i.e., many predictor variables). For many years
engineers and draftspeople used a flexible straightedge, called a French curve, to draw lines and
curves through clouds of points in one (y versus x) and sometimes two dimensions
(y versus x1 and x2). Today, of course, this is done with computer software. There are a number
of ways to build models, both formal and informal, but we focus in the following sections on
regression analysis because it is so widely used. Chapter 7 explains how to use regression
analysis to analyze the results of designed experiments.
REGRESSION MODELING PROCESS
When building process models using regression analysis we must consider (1) the number of
predictor variables (x’s), (2) the nature of the relationship between the response (y) and the
predictor variables (x’s), (3) a procedure for calculating the coefficients in the model, and (4) an
overall method for building the model. These considerations are the subject of this section.
Schematically, the process can be represented as shown in Figure 6.1. Process inputs (x’s) move
through a series of process steps influenced by process variables (additional x’s) to produce
process outputs (y’s), which are then sent to an internal or external customer. Variation in the
process inputs and process variables causes variation in the process output measures (y’s). In the
case of our weight-gain example, caloric intake is an input variable (x), amount of exercise is a
process variable (x), and weight gain is a process output variable (y). Amount of caloric intake
and exercise have effects on (cause) weight gain.
FIGURE 6.1 Process Diagram
The overall objective is to use process data to build a model that helps us understand how the
process works and to predict process performance (y) from various predictor variables (x’s),
which include process inputs and process operating variables. In the case of one predictor
variable (x), in which the relationship between the response (y) and x can be described by a
straight line, the linear model would have the following form:

y = b0 + b1x
where b0 and b1 are coefficients to be estimated from the data by a technique known as least
squares. In this model, b1 is the slope of the line relating y and x and b0 is the intercept—the
value of y where x = 0 (Figure 6.2). This is the same form of an equation for a straight line used
in algebra: y = mx + b, where m is the slope (m is the same as b1) and b is the intercept (b is the
same as b0). If the relationship between y and x is not straight but curved, a model that contains a
quadratic term might be used. For example,

y = b0 + b1x + b2x²
FIGURE 6.2 Straight-Line Model
We will restrict our discussion to linear and quadratic models because these models are widely
used in practice and illustrate the key aspects of building and using models. We acknowledge
that many other forms of curvilinear models are also encountered in practice.
Multiple Predictor Variables
Few processes involve a single predictor variable. When we include multiple predictor variables,
the linear model for p process variables is

y = b0 + b1x1 + b2x2 + … + bpxp

and the quadratic model becomes the linear model with the addition of square terms (xi²) and
cross-product terms (xixj). In the case of three predictor variables, the linear model is

y = b0 + b1x1 + b2x2 + b3x3

and the quadratic model is

y = b0 + b1x1 + b2x2 + b3x3 + b12x1x2 + b13x1x3 + b23x2x3 + b11x1² + b22x2² + b33x3²
As before, the coefficients (b’s) in the models are estimated from process data using the method
of least squares. As in the case of one predictor variable, the quadratic model is formed by
adding curvilinear terms—for example, x1x2, x1x1—to the linear model. The cross-product terms
in the quadratic models (x1x2, x2x3) describe a form of curvilinearity known as interaction. When
two variables interact, the effect of one variable (x1) is dependent on the level of another variable
(x2). This characteristic of process variables is discussed in greater detail in Chapter 7.
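The meaning of interaction can be made concrete with a small numeric sketch (all coefficient values below are hypothetical): when the model contains a cross-product term, the change in y per unit change in x1 is b1 + b12·x2, so it shifts with the level of x2.

```python
def y(x1, x2, b0=1.0, b1=2.0, b2=3.0, b12=1.5):
    """Two-variable model with an interaction (cross-product) term."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

# Effect of raising x1 from 0 to 1, at two different levels of x2
effect_at_x2_0 = y(1, 0) - y(0, 0)  # slope in x1 when x2 = 0: b1 = 2.0
effect_at_x2_2 = y(1, 2) - y(0, 2)  # slope in x1 when x2 = 2: b1 + 2*b12 = 5.0
```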
In the following sections we discuss a methodology for building a regression model and illustrate
the methodology in the situation of one and three predictor variables.
A Method for Building Regression Models
A key aspect of statistical thinking is that all work is done in a system of interconnected
processes. Building models using regression analysis is a process with five key steps:
1. Get to know your data. Create and examine graphical and analytical summaries of the
response (y) and predictor variables (x’s).
2. Formulate the model: linear, curvilinear, and so on.
3. Fit the model to the data.
4. Check the fit of the model.
5. Report and use the model as appropriate.
Modeling begins after the problem has been properly formulated, the use of the model has been
determined, the process variables are understood, and the data have been collected. This five-step
process is summarized in detail in Table 6.1 and shown graphically in Figure 6.3. Model building
is not a linear, single-pass process; many iterations and recycle loops may be needed to develop a
useful model. The method for building regression models summarized in Table 6.1 was created
using the concepts and methods of statistical engineering discussed in Chapter 2.
FIGURE 6.3 Regression Analysis Method
TABLE 6.1 Method for Building Regression Models
1. Get to know your data
• Examine the summary statistics (mean, standard deviation, minimum, maximum) for
all x’s and y’s.
• Examine the correlation matrix.
• When data are collected sequentially in time, do a time plot for each x and y.
• Plot y versus each of the x’s.
• Construct scatter plots for all pairs of x’s, or at least those x’s that have high correlation
coefficients in the correlation matrix.
• Examine all plots for outlier or atypical points.
2. Formulate the model
• Use subject matter knowledge whenever possible to guide the selection of the model
form.
• Study the form of the relationship between y and each of the x’s in the plots to determine
whether linear relationships exist or whether curvilinear terms need to be added to the
model.
3. Fit the model to the data
• Calculate the regression coefficients in the model and the associated regression
statistics.
• Examine the regression results for significant variables (key drivers).
• Assess the fit of the model using the adjusted R-squared statistic.
• Study the regression coefficient variance inflation factors (VIFs) to determine whether
correlations among predictor variables (multicollinearity) are a problem.
4. Check the fit of the model
• Construct plots of the residuals (observed minus predicted) to identify any abnormal
patterns or atypical data points:
– Residuals versus predicted values
– Residuals versus predictor variables (x’s)
– Residuals versus time or observation sequence
– Normal probability plot of residuals
• Nonrandom patterns in the residuals indicate that the model is not adequate.
5. Report and use the model
• Use the model as appropriate.
• Create and circulate a document that summarizes the model, the data, and the
assumptions used to create it.
• Establish a procedure to continually check the performance of the model over time to
detect any deviations from the assumptions used to construct the model.
Least Squares
In step three of the methodology, “fit the model to the data,” we use a technique known
as least squares. Least squares calculations can be done in JMP or other statistical
software, so we will focus on the interpretation of the results of these calculations and
how these models can be used to better understand and manage the process being
studied. To understand least squares, let’s study an example involving one predictor
variable (x) and one response (y). For simplicity we will assume that we have five
observations:
Observation    x     y
1              8     1
2             10     6
3             12     4
4             14     6
5             16    10
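For these five observations, the least squares estimates come from the closed-form formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. A plain-Python sketch of the calculation:

```python
x = [8, 10, 12, 14, 16]
y = [1, 6, 4, 6, 10]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least squares estimates for the line y = b0 + b1*x
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx            # slope
b0 = y_bar - b1 * x_bar   # intercept

# Residuals (observed minus predicted); least squares makes them sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

For these data the formulas give b1 = 0.9 and b0 = −5.4, so the fitted line is y = −5.4 + 0.9x.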