PREDICTIVE MODELING

Advances in Predictive Modeling: How In-Database Analytics Will Evolve to Change the Game

Sule Balkan and Michael Goul

Sule Balkan is clinical assistant professor at Arizona State University, department of information systems. sule.balkan@asu.edu

Abstract
Organizations using predictive modeling will benefit from
recent efforts in in-database analytics—especially when they
become mainstream, and after the advantages evolve over
time as adoption of these analytics grows. This article posits
that most benefits will remain under-realized until campaigns
apply and adapt these enhancements for improved productivity. Campaign managers and analysts will fashion in-database
analytics (in conjunction with their database experts) to support their most important and arduous day-to-day activities. In
this article, we review issues related to building and deploying
analytics with an eye toward how in-database solutions
advance the technology. We conclude with a discussion of how
analysts will benefit when they take advantage of the tighter
coupling of databases and predictive analytics tool suites,
particularly in end-to-end campaign management.
Michael Goul is professor and chair at Arizona State University, department of information systems. michael.goul@asu.edu

Introduction
Decoupling data management from applications has
provided significant advantages, mostly related to data
independence. It is therefore surprising that many vendors
are more tightly coupling databases and data warehouses
with tool suites that support business intelligence (BI)
analysts who construct and manage predictive models.
These analysts and their teams construct and deploy models
for guiding campaigns in areas such as marketing, fraud
detection, and credit scoring, where unknown business
patterns and/or inefficiencies can be discovered.
“In-database analytics” includes the embedding of
predictive modeling functionalities into databases or data
warehouses. It differs from “in-memory analytics,” which is
designed to minimize disk access. In-database analytics
focuses on the movement of data between the database
or data warehouse and analysts’ workbenches. In the
simplest form of in-database analytics, the computation
of aggregates such as average, variance, and other statistical summaries can be performed by parallel database
engines quickly and efficiently—especially in contrast to
performing computations inside an analytics tool suite
with comparatively slow file management systems. In
tightly coupled environments, those aggregates can be
passed from the data engine to the predictive modeling
tool suite when building analytical models such as statistical regression models, decision trees, and even neural
networks. In-database analytics also enable streamlining
of modeling processes.
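As a rough illustration of the simplest case described above, the following sketch (our own, using SQLite and an invented customer_history table purely as stand-ins for a parallel warehouse engine and its schema) computes the aggregates inside the database and passes only the summary row back to the modeling environment:

import sqlite3

# Stand-in for the parallel database/warehouse engine; in practice this would
# be a connection to the production warehouse rather than a local file.
conn = sqlite3.connect("warehouse.db")

# The engine computes the aggregates; only one summary row crosses the wire
# to the analyst's workbench instead of every underlying observation.
n, mean_amt, var_amt = conn.execute("""
    SELECT COUNT(*),
           AVG(purchase_amt),
           AVG(purchase_amt * purchase_amt) - AVG(purchase_amt) * AVG(purchase_amt)
    FROM customer_history
""").fetchone()

# These summaries can now seed model construction in the tool suite without
# extracting the raw table.
print(n, mean_amt, var_amt)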
The typical modeling processes referred to as CRISP-DM,
SEMMA, and KDD contain common BI steps or phases.
Knowledge Discovery in Databases (KDD) refers to the
broad process of finding knowledge using data mining
(DM) methods (Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, 1996). KDD relies on using a database
along with any required preprocessing, sub-sampling, and
transformation of values in that database. Another version
of a DM process approach was developed by SAS Institute:
Sample, Explore, Modify, Model, Assess (SEMMA) refers
to the lifecycle of conducting a DM project.
Another approach, CRISP-DM, was developed by a
consortium of Daimler Chrysler, SPSS, and NCR. It stands
for CRoss-Industry Standard Process for Data Mining,
and its cycle has six stages: business understanding, data
understanding, data preparation, modeling, evaluation,
and deployment (Azevedo and Santos, 2008). All three
methodologies address data mining processes. Even though
the three methodologies are different, their common
objective is to produce BI by guiding the construction of
predictive models based on historical data.
A traditional way of discussing methodologies for predictive analytics involves a “sense, assess, and respond” cycle
that organizations and managers should apply in making
effective decisions (Houghton, El Sawy, Gray, Donegan,
and Joshi, 2004). Using historical data to enable managers
to sense what is happening in the environment has been the
18
BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2
foundation of the recent thrust to vitalize evidence-based
management (Pfeffer and Sutton, 2006). Predictive models
help managers assess and respond to the environment in
ways that are informed by historical data and the patterns
within that data. Predictive models help to scale responses
because, for example, scoring models can be constructed
to enable the embedding of decision rules into business
processes. In-database analytics can streamline elements of
the “sense, assess, and respond” cycle beyond those steps or
phases in KDD, SEMMA, and CRISP-DM.
This article explains how basic in-database analytics
will advance predictive modeling processes. However,
we argue that the most important advancements will
be discovered when actual campaigns are orchestrated
and campaign managers access the new, more tightly
coupled predictive modeling tool suites and database/data
warehouse engines. We assert that the most important
practical contribution of in-database analytics will occur
when analysts are under pressure to produce models
within time-constrained campaigns, and performances
from earlier campaign steps need to be incorporated to
inform follow-up campaign steps.
The next section discusses current impediments to predictive analytics and how in-database analytics will attempt
to address them. We also discuss the benefits to be realized
after more tightly coupled predictive analytics tool suites
and databases/data warehouses become widely available.
These benefits will be game-changers and will occur in such
areas as end-to-end campaign management.
What is Wrong with Current
Predictive Analytics Tool Suites?
Current analytics solutions require many steps and take
a great deal of time. For analysts who build, maintain,
deploy, and track predictive models, the process consists
of many distributed processes (distributed among
analysts, tool suites, and so on). This section discusses
challenges that analysts face when building and deploying
predictive models.
Time-Consuming Processes
To build a predictive model, an analyst may have to tap
into many different data sources. Data sources must contain known values for target variables in order to be used when constructing a predictive model. All the attributes that might be independent variables in a model may reside in different tables or even different databases. It takes time and effort to collect and synthesize this data.

Figure 1. SEMMA methodology supported by the SAS Enterprise Miner environment: Sample (input data, sampling, data partition); Explore (ranks, plots, variable selection); Modify (transform variables, filter outliers, missing value imputation); Model (regression, tree, neural network); Assess (assessment, score, report).
Once all of the needed data is merged, each of the independent variables is evaluated to ascertain the relations,
correlations, patterns, and transformations that will be
required. However, most of the data is not ready to be
analyzed unless it has been appropriately customized. For
example, character variables such as gender need to be
manipulated, as do numeric variables such as ZIP code.
Some continuous variables may need to be converted into
scales. After all of this preparation, the modeling process
continues through one of the many methodologies such as
KDD, CRISP-DM, or SEMMA. For our purposes in this
article, we will use SEMMA (see Figure 1).
The first step of SEMMA is data sampling and data
partitioning. A random sample is drawn from a population to prevent bias in the model that will be developed.
Then, a modeling data set is partitioned into training and
validation data sets. Next is the Explore phase, where each
explanatory variable is evaluated and its associations with
other variables are analyzed. This is a time-consuming step,
especially if the problem at hand requires evaluating many
independent variables.
In the Modify phase, variables are transformed; outliers
are identified and filtered; and for those variables that are
not fully populated, missing value imputation strategies
are determined. Rectifying and consolidating different
analysts’ perspectives with respect to the Modify phase
can be arduous and confusing. In addition, when applying
transformations and inserting missing values in large data
sets, a tool suite must apply operations to all observations
and then store the resulting transformations within the tool
suite’s file management system.
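To make the contrast concrete, here is a minimal sketch (ours; the table, columns, and cap values are invented for illustration) that expresses a Modify-phase imputation, outlier cap, and recoding as a single set-based statement, so the database engine does the work rather than the tool suite's file management system:

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Build the modeling table in-database: mean-impute income, cap a long-tailed
# tenure variable, and recode a character variable, all in one pass.
conn.executescript("""
    DROP TABLE IF EXISTS modeling_abt;
    CREATE TABLE modeling_abt AS
    SELECT customer_id,
           COALESCE(income, (SELECT AVG(income) FROM customer_history)) AS income_imputed,
           CASE WHEN tenure_months > 240 THEN 240 ELSE tenure_months END AS tenure_capped,
           CASE WHEN gender = 'F' THEN 1 ELSE 0 END AS gender_f,
           responded AS target
    FROM customer_history;
""")
conn.commit()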
Many techniques can be used in the Model phase of
SEMMA, such as regression analysis, decision trees, and
neural networks. In constructing models, many tool suites
suffer from slow file management systems, which can
constrain the number and quality of models that an analyst
can realistically construct.
The last phase of SEMMA is the Assess phase, where all
models built in the modeling phase are assessed based
on validation results. This process is handled within
tool suites, and it takes considerable time and many
steps to complete.
Multiple Versions and Sources of the Truth
Another difficulty in building and maintaining predictive
models, especially in terms of campaign management,
is the risk that modelers may be basing their analysis on
multiple versions and sources of data. That base data is
often referred to as the “truth,” and the problem is often
referred to as having “multiple versions of the truth.”
To complete the time-consuming tasks of building
predictive models as just described, each modeler extracts
data from a data warehouse into an analytics workstation.
This may create a situation where different modelers are
working from different sources of truth, as modelers
might extract data snapshots at different times (Gray and
Watson, 2007). Also, having multiple modelers working on
different predictive models can mean that each modeler is
analyzing the data and creating different transformations
from the same raw data without adopting a standardized
method or a naming convention. This makes deploying
multiple models very difficult, as the same raw data may
be transformed in different ways using different naming
conventions. It also makes transferring or sharing models
across different business areas challenging.
Another difficulty relates to the computing resources on
each modeler’s workbench when multiple modelers are
going through similar, redundant steps of data preparation, transformation, segmentation, scoring, and all the
other functions that can take a great deal of disk space
and CPU time.
The Challenges of Leveraging Unstructured Data and Web
Data Mining in Modeling Environments
Modelers often tap into readily available raw data in the
database or data warehouse. However, unstructured data
is rarely used during these phases because handling data
in the form of text, e-mail documents, and images is
computationally difficult and time consuming. Converting unstructured data into information is costly in a
campaign management environment, so it isn’t often
done. The challenges of creating reusable and repeatable
variables for deployment make using unstructured data
even more difficult.
Web data mining spiders and crawlers are often used
to gather unstructured data. Current analyst tool suite
processes for unstructured data require that modelers
understand archaic processing commands expressed in
specialized, non-standard syntax. There are impediments
to both gathering and manipulating unstructured data,
and there are difficulties in capturing and applying
predictive models that deal with unstructured data. For
example, clustering models may facilitate identifying rules
for detecting what cluster a new document is most closely
aligned with. However, exporting that clustering rule from
the predictive modeling workbench into a production
environment is very difficult.
Managing BI Knowledge Worker Training and
Standardization of Processes
In most organizations, there is a centralized BI group that
builds, maintains, and deploys multiple predictive models
for different business units. This creates economies of scale,
because having a centralized BI group is definitely more
cost effective than the alternative. However, the economies
of scale do not cascade into standardization of processes
among analyst teams. Each individual contributor usually
ends up with customized versions of code. Analysts may
not be aware of the latest constructs others have advanced.
What Basic Changes Will In-Database
Analytics Foster?
In-database analytics’ major advantage is the efficiencies
it brings to predictive model construction processes due
to processing speeds made possible by harnessing parallel
database/warehouse engine capabilities. Time savings are
generated in the completion of computationally intensive
modeling tasks. Faster transformations, missing-value
imputations, model building, and assessment operations
create opportunities by leaving more time available
for fine-tuning model portfolios. Thanks to increasing
cooperation between database/warehouse experts and
predictive modeling practitioners, issues associated with
non-standardized metadata may also be addressed. In
addition, there is enhanced support for analyses of very
large data sets. This couldn’t come at a better time, because
data volumes are always growing.
In-database analytics make it easier to process and use
unstructured data by converting complicated statistical processes into manageable queries. Tapping into
unstructured data and creating repeatable and reusable
information—and combining this into the model-building
process—may aid in constructing much better predictive
models. For example, moving clustering rules into the
database eliminates the difficulty of exporting these rules to
and from tool suites. It also eliminates most temporary data
storage difficulties for analyst workbenches.
Shared environments created by in-database analytics may
bring business units together under common goals. As
different business units tap into the same raw data, including all possible versions of transformations and metadata,
productivity can be enhanced. When new ways of building
models are available, updates can be made in-database.
All individual contributors have access to the latest
developments, and no single business unit or individual
is left behind. Saving time in the labor-intensive steps of
model building, working from a single source of truth,
having access to repeatable and reusable structured and unstructured data, and making sure all the business units are working with the same standards and updates—all this makes it easier to transfer knowledge as new analysts join or move across business units. Table 1 summarizes the preliminary benefits of in-database analytics for modelers.

Data set creation and preparation: Reduce cycle time by parallel-processing multiple functions; accurate and timely completion of tasks by functional embedding.
Data processing and model building by multiple analysts: Eliminate multiple versions of truth and large data set movements to and from analytical tool suites.
Unstructured data management: Broaden analytics capability by streamlining repeatability and reusability.
Training and standardization: Create operational and analytical efficiencies; access to latest developments; automatically update metadata.
Table 1. Preliminary benefits of in-database analytics

Context for In-Database Analytics Innovation

To drive measurable business results from predictive models, SEMMA (or a similar methodology) is followed by a deployment cycle. That cycle may involve the continued application of models in a (recurring) campaign, refinement when model performance results are used to revise other models, making decisions on whether completely new models are required given model performance, and so on. We distinguish deployment from the SEMMA-supported phase (intelligence) because deployment often engages the broader organization and requires a predictive model (or models) to be put into actual business use. This section introduces a new methodology we created to describe deployment: "DEEPER" (Design, Embed, Empower, Performance measurement, Evaluate, and Re-target). Figure 2 depicts the iterative relationship between SEMMA and DEEPER.

Figure 2. DEEPER phases guide the deployment, adoption, evaluation, and recalibration of predictive models.

The DEEPER phases delineate, in sequential fashion, the types of activities involved in model deployment with a special emphasis on campaign management. The design phase involves making plans for how to transition a scoring model (or models) from the tool suite (where it was developed) to actual application in a business context. It also involves thinking about how to capture the results of applying the model and storing those results for subsequent analysis. There may also be other data that a campaign manager wishes to capture, such as the time taken before seeing a response from a target. A proper design can eliminate missteps in a campaign. For example, if a targeted catalog mailing is enabled by a scoring model developed using SEMMA, then users must choose which deciles to target first, how to capture the results of the campaign (e.g., actual purchases or requests for new service), and what new data might be appropriate to capture during the campaign.
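As a small illustration of that decile choice (ours, not from the article; NTILE requires a reasonably recent engine such as SQLite 3.25 or later, and the campaign_scores table is hypothetical), the first mailing wave could be selected in-database:

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Rank scored customers into deciles inside the database and release only
# the top two deciles to the first wave of the catalog mailing.
wave_one = conn.execute("""
    WITH ranked AS (
        SELECT customer_id,
               score,
               NTILE(10) OVER (ORDER BY score DESC) AS decile
        FROM campaign_scores
    )
    SELECT customer_id, score, decile
    FROM ranked
    WHERE decile <= 2
""").fetchall()

print(len(wave_one), "targets selected for the first mailing wave")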
Once designed, the model must be accurately embedded
into business processes. Model score views must be secured;
developers must ensure scores appear in user interfaces at
the right time; and process managers must be able to insert
scores into automated business process logic. Embedding
a predictive model may require safeguards for exceptions: if there are cases where the model does not apply, additional safeguards need to be considered.
Making the results of a predictive model (e.g., a score)
available to people and systems is just the first step in
ensuring it is used. In the empower phase, employees
may need to be trained to interpret model results; they
may have to learn to look at data in a certain way using
new interfaces; or they may need to learn the benefits of
evidence-based management approaches as supported by
predictive modeling. Similarly, if people are involved, testing may be required to ensure that training approaches are
working as intended. The empower step ensures appropriate
behaviors by both systems and people as they pertain to the
embedding of the predictive model into business processes.
A campaign begins in earnest after the empower phase.
Targets receive their model-prescribed treatments, and
reactions are collected as planned for in the design phase
of DEEPER. This reactions-directed phase, performance
measurement, involves ensuring the reactions and events
subsequent to a predictive model’s application are captured
and stored for later analysis. The results may also be
captured and made available in real-time support for
campaign managers. Dashboards may be appropriate for
monitoring campaign progress, and alerts may support
managers in making corrections should a campaign stray
from an intended path. If there is an anomaly, or when a
campaign has reached a checkpoint, campaign managers
take time to evaluate the effectiveness or current progress of
the campaign. The objective is to address questions such as:
■■ Are error levels acceptable?

■■ Were campaign results worth the investment in the predictive analytics solution?

■■ How is actual behavior different from predicted behavior for a model or a model decile?

This is the phase when the campaign’s effectiveness and current progress are assessed.
The results of the evaluate phase of DEEPER may lead to a
completely new modeling effort. This is depicted in Figure 2 by the gray background arrow leading from evaluate to
the sample phase of SEMMA. This implies a transition
from deployment back to what we have referred to as
intelligence. However, there is not always time to return
to the intelligence cycle, and minor alterations to a model
might be deemed more appropriate than starting over. The
latter decision is most prevalent in time-pressured, recurring campaigns. We refer to this phase as re-target, which
requires analysts to take into account new information
gathered as part of the performance measurement deployment phase. It also takes advantage of the plans for how
this response information was encoded per the design phase
of deployment.
The most important consideration involves interpreting
results from the campaign and managing non-performing
targets. A non-performing target is one that scored high in
a predictive model, for example, but that did not respond
as predicted. In a recurring campaign, there may be an
effort to re-target that subset. There could also be an effort
to re-target the campaign to another set of targets, e.g.,
those initially scored into other deciles. Re-targeting can
be a time-consuming process; new data sets with response
results need to be made available to predictive modeling
tool suites, and findings from tracking need to be incorporated into decisions.
DEEPER provides the context for considering how
improvements to in-database analytics can be game-changers. In-database analytics can make significant inroads into
DEEPER processes that take time and are under-supported
by predictive modeling tool suites. However, these improvements will be driven by analysts who work closely with
their organizations’ database experts. This combination
of analyst and data management skills, experience, and
knowledge will spur innovation significantly beyond
current expectations.
How Might In-Database Analytics
for DEEPER Evolve?
Extending in-database analytics to DEEPER processes
requires considering how each DEEPER phase might be
streamlined given tighter coupling between predictive
modeling tool suites and databases/data warehouses.
Although many of the advantages of this tighter coupling
may be realized differently by different organizations, there
are generic value streams to guide efforts. Here the phrase
“value stream” refers to process flows central to DEEPER.
This section discusses these generic value streams: (1)
intelligence-to-plan, (2) plan-to-implementation, (3)
implementation-to-use, (4) use-to-results, (5) results-to-evaluation, and (6) evaluation-to-decision.
In the design phase of DEEPER, planning can be facilitated by examining possible end-user database views that
could be augmented with predictive intelligence. Instead
of creating new interfaces, it is possible that Web pages
equipped with embedded database queries can quickly
retrieve and display predictive model scores to decision
makers or front-line employees. Many of these displays are
already incorporated into business processes, so opportunities to use the tables and queries to supply model results can
streamline implementation. When additional data items
need to be captured, that data may be captured at the point
of sale or other customer touch points. A review of current
metadata may speed up the design of a suitable deployment
strategy. In addition to “pushing” model intelligence to
interfaces, there may also be ways of “pulling” data from
the database/warehouse to facilitate re-targeting or for
initiating new SEMMA cycles.
For example, it may be possible to design queries to
automate the retrieval of data items such as target response
times from operational data stores. Similarly, it may be
possible to use SQL to aggregate the information needed
for this type of next-step analysis. For example, total
sales to a customer within a specified time period can be
aggregated using a query and then used in the re-targeting
phase to reflect whether a target performed as predicted.
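A sketch of that kind of pull query follows (ours; the table names, columns, and campaign window are invented), totaling each target's purchases during the window in-database and flagging whether the target performed as predicted:

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Total sales per scored target during the campaign window, computed by the
# database engine, then flagged for the re-targeting decision.
performance = conn.execute("""
    SELECT s.customer_id,
           s.score,
           COALESCE(SUM(o.amount), 0) AS sales_in_window,
           CASE WHEN COALESCE(SUM(o.amount), 0) > 0
                THEN 'performed' ELSE 'non-performing' END AS outcome
    FROM campaign_scores s
    LEFT JOIN orders o
           ON o.customer_id = s.customer_id
          AND o.order_date BETWEEN '2010-04-01' AND '2010-06-30'
    GROUP BY s.customer_id, s.score
""").fetchall()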
In-database analytics can support the design phase because
it eliminates many of the traditional bottlenecks such as
complex requirements gathering and the creation of formal
specification documents (including use cases). Instead,
existing use cases can be reviewed and augmented, and
database/warehouse–supported metadata facilities can
support the design of schema for capturing new target
response data. We refer to this as an intelligence-to-plan
value stream for the in-database analytics supported design
deployment phase.
In the embed phase, transferring scored model results
to tables is a first step in considering ways to make use
of database/warehouse capabilities to support DEEPER.
Once the scores are appropriately stored in tables, there are
many opportunities to use queries to embed the scores into
people-supported and automated business processes. For
example, coding to retrieve scores for inclusion in front-line
employee interfaces can be done in a manner consistent
with other embedded SQL applications. This saves time
in training interface developers because it implies that
the same personnel who implemented the interfaces can
effectively alter them to include new intelligence.
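For instance, a minimal sketch (ours; the campaign_scores table and the call-center use are hypothetical) of the lookup a front-line interface might issue is simply another parameterized query against the scores table:

import sqlite3
from typing import Optional

def churn_score_for(conn: sqlite3.Connection, customer_id: int) -> Optional[float]:
    """Fetch the latest in-database model score for display in a front-line screen."""
    row = conn.execute(
        "SELECT score FROM campaign_scores WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect("warehouse.db")
print(churn_score_for(conn, 12345))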
There is also no need for additional project governance
functions or specialized software. In fact, database/
warehouse triggers and alerts can be used to ensure that
predictive analytics are used only when model deployment
assumptions are relevant. As the database/warehouse is the
same place where analytic model results reside, there are
numerous implementation advantages. We refer to this as
a plan-to-implementation value stream for the in-database
analytics supported embed deployment phase.
After implementation, testing will ensure that model
results/scores are understandable to decision makers (the
empower phase) and that their performance can scale
when production systems are at high capacity. Such stress
tests can be conducted in a manner similar to database
view tests. Because of the inherent speed of database/
warehouse systems, their performance will likely exceed
separate, isolated workbench performance. Global roll-out
can be eased by tried-and-true database/warehouse roll-out
processes. We refer to this as an implementation-to-use value
stream for the in-database analytics supported empower
deployment phase.
Similarly, the use-to-results value stream is that part of a
campaign when actions are taken and targets respond.
In this performance measurement phase of deployment,
dashboards can be used to track performance, database
tables can automatically collect and store ongoing
campaign results, queries can aggregate responses over
time as part of automating responses, and many other
in-database solutions can help to streamline related
processes. This information is central to the evaluate phase,
where the results-to-evaluation value stream can enable
careful scrutiny of the predictive analytics model portfolio.
Queries can be written to compare actual results to those
predicted during SEMMA phases. When more than one
model has been constructed in the SEMMA processes, all
can be re-examined in light of the new information about
responses. If-then statements can be embedded in queries
to identify target segments that have responded according
to business goals, and remaining non-responders can be
quickly identified.
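A minimal sketch of such evaluation queries (ours; the campaign_results table and its columns are assumed for illustration) compares actual against predicted response rates by decile and isolates the remaining non-responders for re-targeting:

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Actual vs. predicted response rate for each decile of the deployed model.
by_decile = conn.execute("""
    SELECT decile,
           AVG(CASE WHEN responded = 1 THEN 1.0 ELSE 0.0 END) AS actual_rate,
           AVG(predicted_rate) AS predicted_rate
    FROM campaign_results
    GROUP BY decile
    ORDER BY decile
""").fetchall()

# Non-responders in the top deciles: candidates for the re-target phase.
non_responders = conn.execute("""
    SELECT customer_id
    FROM campaign_results
    WHERE decile <= 2 AND responded = 0
""").fetchall()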
Such analysis can be done for each analytical model in the portfolio and for each decile of predicted respondents associated with those models. This has been an enormously time-consuming process in the past, but the database/warehouse query engine can conduct this type of post-analysis efficiently. Queries can also identify subsets of respondents that outperformed the predicted model performance—and those that significantly under-performed. This type of analysis can be quickly supported through queries, and it can provide significant insight for the re-target phase.

Following the results-to-evaluation value stream of the deployment cycle, the evaluation-to-decision value stream focuses on whether a new intelligence cycle (a repeat of SEMMA processes) is required. If performance results indicate major model failures, then a repeat is likely necessary to resurrect and continue a campaign. Even if there weren’t major failures, environmental changes such as economic conditions may have rendered models outdated. Data collected in the performance evaluation phase may help to streamline the decision process. If costs aren’t being recovered, then it is likely that either the campaign will cease or a new intelligence cycle is necessary.

Often a portfolio of models is created in the initial intelligence cycle. It may be possible to use queries to automate the process of recalculating the prior and anticipated performance of the models in the portfolio. If models exist that were not used but appear to perform better, those models may be used in the next DEEPER cycle. Alternatively, a combination or pooling of models might be most appropriate. Again, automated queries might be able to provide decision support for such pooling options, and they can aid in scheduling the appropriate model for the data sets as the DEEPER cycle progresses. In addition, it may be possible to use queries to apply business rules to manage data sets, and prior results could inform the scheduling of resting periods for targets such that each target isn’t inundated with catalog mailings, for example.

Conclusion

Table 2 summarizes key generic value streams that can be supported by in-database analytics and briefly describes the possibilities discussed in this section. Opportunities to evolve in-database analytics are likely to be numerous.
In-database analytics create an environment where
functions are embedded and processed in parallel, thereby
streamlining the steps of both intelligence (e.g., SEMMA)
and deployment (e.g., DEEPER) cycles. As data sources
are updated, attribute names and formats may change, yet
they are sharable. In-database analytics can support quality
checks and create warning messages if the range, format,
and/or type of data differ from a previous version or model
assumptions. If external data has attributes that were not in
the data dictionary, metadata can be updated automatically.
Data conversions can be handled in-database and only once
instead of being repeated by multiple modelers. In-database
analytics fosters stability, enhances efficiency, and improves
productivity across business units.
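A minimal sketch of such a check (ours; the expected ranges are invented stand-ins for metadata recorded when the model was built) compares a refreshed attribute against its prior range and emits a warning when the model's assumptions no longer hold:

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Ranges recorded for the prior data version / model assumptions (illustrative).
expected_ranges = {"income": (0, 500000), "tenure_months": (0, 240)}

for column, (lo, hi) in expected_ranges.items():
    # Column names come from our own trusted metadata dict, so the f-string is safe here.
    observed_lo, observed_hi = conn.execute(
        f"SELECT MIN({column}), MAX({column}) FROM customer_history"
    ).fetchone()
    if observed_lo is None or observed_lo < lo or observed_hi > hi:
        print(f"WARNING: {column} range ({observed_lo}, {observed_hi}) "
              f"differs from model assumption ({lo}, {hi})")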
In-database analytics will be critical to a company’s
bottom line when models are deployed and there is
time pressure for multiple, successive campaigns where
ongoing results can be used to build updated, improved
predictive models. Enhancements can be realized in a
host of value streams. For example, in-database analytics
can significantly reduce cycle times for rebuilding and
redeploying updated models to meet campaign deadlines. As multiple models are constructed, in-database
analytics will enable managing them as a portfolio.
Timely responses, tracking, and fast interpretation of
PREDICTIVE MODELING
Intelligence-to-plan
Planning is streamlined; push and pull strategies are feasible; schema design can support planning
Plan-to-implementation
Scores maintained in-database; embedded SQL in HTML can facilitate view deployment; triggers and alerts can be used
to guard for exceptions
Implementation-to-use
Stress testing and global rollout follow database/warehouse methodologies and rely on common human and physical
resources
Use-to-results
Dashboards can be readily adapted; database/warehouse tables can be used as response aggregators
Results-to-evaluation
Re-examine all created models efficiently in light of response information; embed if-then logic to re-target nonresponders
Evaluation-to-decision
Consider applying different models; allow targeted respondents to “rest”; use database to provide decision support for
deciding to re-target or re-enter the intelligence cycle
Table 2. Generic value streams and areas for innovation with in-database analytics
early responders to campaigns will enable companies to
fine-tune business rules and react in record time.
As the fine line between intelligence and deployment cycles
fades because of the fast-paced environment supported
by in-database analytics, businesses may move away
from the concept of campaign management into trigger-based, “lights-out” processing, where all data feeds are automatically updated and processed, and there is no need to compile data into periodic campaigns. There will be real-time decision making with instant scoring each time there
is an update in one of the important independent variables.
Analysts will spend their time fine-tuning model performance, building business rules, analyzing early results,
monitoring data movements, and optimizing the use of
multiple models—instead of dealing with the manual tasks
of data preparation, data cleansing, and managing file
movements and basic statistical processes that have been
moved into the database/warehouse.
Although lights-out processing is not on the near-term
horizon, the evolution of in-database analytics promises to
move organizations in that direction. Once in the hands of
analysts and their database/warehouse teams, in-database
analytics will be a game-changer.
References
Azevedo, Ana, and Manuel Felipe Santos [2008]. “KDD,
SEMMA AND CRISP-DM: A Parallel Overview.”
IADIS European Conference Data Mining,
pp. 182–185.
Fayyad, U. M., Gregory Piatetsky-Shapiro, Padhraic
Smyth, and Ramasamy Uthurusamy [1996]. Advances in
Knowledge Discovery and Data Mining, AAAI Press/The
MIT Press.
Gray, Paul, and Hugh J. Watson [2007]. “What Is
New in BI,” Business Intelligence Journal, Vol. 12, No. 1.
Houghton, Bob, Omar A. El Sawy, Paul Gray, Craig
Donegan, and Ashish Joshi [2004]. “Vigilant
Information Systems for Managing Enterprises in
Dynamic Supply Chains: Real-Time Dashboards at
Western Digital,” MIS Quarterly Executive,
Vol. 3, No. 1.
Pfeffer, Jeffrey, and Robert I. Sutton [2006]. “Evidence
Based Management,” Harvard Business Review, January.
Services to Support Knowledge Sharing in Complex Business
Networks, Big Data as the Source
Abdussalam Ali and Igor Hawryszkiewycz
University of Technology Sydney, Sydney, Australia
Abdussalam.M.Ali@student.uts.edu.au
Igor.Hawryszkiewycz@uts.edu.au
Abstract: Big Data has become a buzzword referring to complex and massive data, structured or unstructured, that is not easily captured and processed by traditional tools and software applications. The term refers to the large volume of data created through activities that use information and communication technology (ICT). Big
Data in our research is the big container to be explored to discover knowledge and information based on the searching
context. Big Data may include both explicit and tacit sources of information and knowledge. Businesses should consider all
sources in the business environment when discovering and capturing knowledge by accessing this data. Business networks,
as a type of social network, are the mechanism for performing knowledge sharing and transfer for innovation and gaining
competitive advantages. The goal of our research is to design and implement a generic model of services that support and
coordinate business networks to discover, capture, create and share knowledge. Implementing these services involves
considering Big Data as the source of information and knowledge. This model is to be implemented as a platform of
services in the cloud. Although the model and services are generic, the platform is to be customisable for a business’s specific
needs. These services will support businesses in terms of creating collaborative environments and adapting to any changes
that are happening in the business environment while collaboration is in operation. A prototype of services is to be
implemented in the cloud environment. This prototype is to be tested by experts through a case study to measure the
success and performance of the model.
Keywords: big data, knowledge sharing, business networks, cloud computing
1. Introduction
Big Data has emerged through the development of information and communication technology (ICT) and its use by people. The term refers to the large volume of data and information around the world, which grows rapidly through the use of advanced technology and huge storage systems.
Nearly 20 quintillion (2 × 10^19) bytes of data are produced every day, containing unstructured, semi‐structured and structured content. This content includes text and multimedia such as voice, audio and images (Barnaghi,
Sheth & Henson 2013). Organisations have started to explore the huge volume of data. This data is not
organised adequately in a database manner (Davenport, Barth & Bean 2012). Knowledge Management (KM) is
one of the disciplines affected by this emergence. Businesses explore Big Data to capture the information and
create knowledge for innovation and competitive advantages.
In the last few decades businesses have started to introduce technology to support their knowledge
management strategies, and KM systems have been developed accordingly. Technology is considered a
success factor for KM by many research studies, such as Wong (2005), Moffett et al. (2003) and Davenport et al. (1998). The study by Moffett et al. (2003) is based on a survey of 1000 British companies; they found that most of these companies introduced technology as a main component to support KM. Wong’s (2005) study, based on small and medium enterprises (SMEs), states that properly implemented technology is one of the main factors that support the success of KM. Alazmi and Zairi (2003) reviewed 15 studies of the critical factors affecting KM; their findings show that components related to the technology factor represent the highest percentage (17%) compared with other components mentioned in the literature.
Although these studies state that technology is a critical factor in making KM successful in the firm, many also present issues with technology in relation to KM systems. These issues can be summarised as follows:

Technology serves as little more than a storage repository for information. That is because KM systems are designed in the same way and tradition as information systems ((Currie & Maire 2004), (Nunes et al. 2006) and (Birkinshaw 2001)). In addition, IT systems operate as stores of explicit knowledge more than the tacit type, as tacit knowledge is difficult to gather ((Nunes et al. 2006) and (Birkinshaw 2001)).
This leads to the argument that the “social interaction” phenomenon is overlooked when introducing IT to support KM. Although social interaction is important for people to exchange information and knowledge, IT is treated as a replacement for “social interaction” (Birkinshaw 2001). McDermott (1999) reports that knowledge is different from information in many respects. As a result, KM systems cannot be designed and implemented based on information systems concepts.
Another limitation that can be mentioned here is that KM systems are not kept up to date and do not cater for emerging needs ((Fischer & Ostwald 2001) and (Van Zolingen, Streumer & Stooker 2001)). One of these emerging needs is Big Data, as it is the main source of knowledge and knowledge creation and consists of both sources of knowledge, tacit and explicit.
The aim of our research is to implement a generic model of services that support the social interaction in
knowledge sharing and transfer. From a Big Data perspective, these services will support knowledge discovery
and capturing. In our model we consider Big Data as the big container that contains all information and
knowledge sources. This information and knowledge is in soft or hard form, online or offline and explicit or
tacit. The services to be implemented will support businesses in exploring and discovering the knowledge that resides within Big Data. Section 2, “Big Data as the Knowledge Source”, presents more information about this aspect.
The other component to be mentioned here is business networks. Business networks, as a type of social
network, are the mechanism for performing knowledge sharing and transfer for innovation and gaining
competitive advantages. The model is supposed to coordinate and manage these networks for creating and
sharing the knowledge. In addition, the model (Ali, Hawryszkiewycz & Chen 2014) flexibly provides businesses
and business units with the ability to quickly share and analyse knowledge to address emerging business needs
in their environment. This is based on the fact that businesses, these days, operate in complex environments.
The services need to be generic and reconfigurable as knowledge needs cannot be anticipated in today’s
dynamic environment. Hasgall (2012) performed a study based on a questionnaire to understand how social
networks are effective in supporting organisations to adapt and respond to changes in their complex
environment. The findings lead to a conclusion that social networks support employees by providing them with
knowledge. This knowledge can be integrated into the firm and can increase the sensitivity of the workers to
the environmental changes.
Consequently a flexible approach is needed where knowledge flows and responsibilities can be easily changed
without the need to reprogram systems. A typical scenario may be a new partner entering a network,
decisions to develop new products and services that require new expertise, or simply improving workflows.
Each of these not only brings in new knowledge but often also requires the rearrangement of responsibility for
processing the knowledge. Networks also exist within businesses where different business units network to
create new products and services for business clients.
2. Big data as the knowledge source
Kabir and Carayannis (2013) characterise Big Data as data that is too large to be easily captured and analysed by traditional technology and tools. The authors consider Big Data a main resource for creating knowledge, one that grows continuously because of many factors, including continuous innovation in IT hardware and software. It is a massive lake that can be used as a resource to create knowledge (Kabir & Carayannis 2013).
Agrawal et al. (2011) present the challenges of Big Data as a source of knowledge to support decision making and innovation. These challenges are:

Heterogeneity, which refers to differences in data format; even if the data captured is in the same format, differences will exist in terms of data structure and organisation.

Scale, as the main characteristic of Big Data is its size and volume. Managing these large volumes and their continuous growth is one of the challenges of Big Data.

Timeliness, which means that the speed and growth rate of data increase as new technology is introduced and continuously developed. The challenge here is how to cope with these increasing rates in terms of discovering and capturing the data.
Privacy is one of the big concerns in the context of Big Data. Dealing with privacy is both a technical and a social issue. Acquiring personal data, for example, will raise many questions regarding privacy and the level at which this data can be used and published.
Sorting out how to solve these issues is not within our research scope; however, it may be included in our framework for future development. Big Data in our model is the big container to be explored to discover knowledge and information based on the searching context. Big Data may include both explicit and tacit sources of information and knowledge. The Internet, online databases, electronic sheets, electronic documents, hard disk drives, and offline printed documents and files all compose the Big Data container. On the other hand, experts, skilled people, consultants, managers, workers and members of communities of practice are all examples of tacit sources in the Big Data container. Information in the Big Data container is presented in different formats, including database records, Word and PDF documents, video, audio, images, etc. Capturing and analysing knowledge from these formats is done by specialised applications. In our research we may support these formats in terms of discovering them, indexing them and knowing how to link these sources with the knowledge created from them. A recommender service may be implemented in the future to support knowledge discovery for other explorers and users.
Kabir and Carayannis (2013) present their “Big Data Strategy Framework” as in Figure 1.
Figure 1: Big data strategy framework (Kabir & Carayannis 2013)
The authors present infrastructure, team building and the knowledge base as the main components and aspects of their framework. Infrastructure includes technology as one of its subcomponents. Teams should be created based on the business’s objectives. The knowledge created by businesses should support innovation and competitive advantage and be considered new knowledge for future use and sharing (Kabir & Carayannis 2013).
Our model is not based on this framework, but it supports our previous arguments that the other factors such
as social factors and the environment should be considered as well.
In conclusion, our aim is to design and implement a generic model of services that support and coordinate
business networks to discover, capture, create and share knowledge. Implementing these services involves
considering Big Data as the source of information and knowledge.
3. The proposed model
This paper proposes a model to manage discovering, capturing, organising and sharing knowledge between
business networks within a complex environment. The paper sees knowledge sharing as predominantly a
socio‐technical issue and Big Data as the source of knowledge.
The model provides the flexibility needed in today’s environment. In our model, any business creates its own
groups and organisations to gather knowledge and information. The organisation in our model is defined
according to Living Systems Theory (LST) as a group of groups that deal with one or more gathering projects
((Ali & Hawryszkiewycz 2012) and (Miller 1965)). Each group within the organisation processes the
knowledge and information. Thus if new groups are created to respond to some event, knowledge must
quickly flow to these groups.
The strategies toward the model implementation can be described as follows:
3.1 Generic knowledge management functions, elements and activities
Boundary roles must often define the knowledge elements to be managed and the assignment of these
responsibilities to roles in their business unit. Referring to Ali et al. (2014), knowledge management functions
(KMF) have been defined in many works. These include Fernandez and Sabherwal (2010), Awad and Ghaziri (2004) and Dalkir (2011), and the functions can be described as follows:
Discovering: The process of finding where the knowledge resides.
Gathering: Fernandez and Sabherwal (2010) define gathering as the process of obtaining knowledge from
the tacit (individuals) and explicit (such as manuals) sources.
Filtering: It is the process of minimising the knowledge and/or information gathered by rejecting redundancy (Dalkir 2011).
Organising: The process of composing the knowledge so that it can be easily retrieved and used to make
decisions (Awad & Ghaziri 2004).
Sharing: It is a way of transferring knowledge between individuals and groups ((Awad & Ghaziri 2004) and
(Fernandez & Sabherwal 2010)).
In this paper we have illustrated these functions by joining them to Big Data as shown in Figure 2.
Formally, there may be any number of knowledge elements, such as sales, purchases, proposals and so on. So
we might define a knowledge element K(sale) or K(purchase). It may be the latest sale or some new idea. Each of
these knowledge elements will go through the functions in Figure 2. We use the notation Discover(K(sale)),
Gather(K(sale)) and Discover(K(purchase)). We call these knowledge processing activities. Thus any knowledge
element goes through all the KMFs.
A knowledge processing activity is a knowledge processing function applied to a knowledge element.
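One possible reading of this notation, sketched below purely for illustration (the class names are ours, not part of the authors' model), treats a knowledge element and a knowledge processing activity as simple records:

from dataclasses import dataclass
from enum import Enum

class KMF(Enum):
    """The knowledge management functions listed above."""
    DISCOVER = "discover"
    GATHER = "gather"
    FILTER = "filter"
    ORGANISE = "organise"
    SHARE = "share"

@dataclass(frozen=True)
class KnowledgeElement:
    name: str                     # e.g. "sale", "purchase", "proposal"

@dataclass(frozen=True)
class KnowledgeActivity:
    function: KMF                 # which knowledge management function is applied
    element: KnowledgeElement     # the knowledge element it is applied to

# Discover(K(sale)) and Gather(K(sale)) in the paper's notation:
k_sale = KnowledgeElement("sale")
activities = [KnowledgeActivity(KMF.DISCOVER, k_sale),
              KnowledgeActivity(KMF.GATHER, k_sale)]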
From Figure 2 the following points can be highlighted:
These functions are not sequential. For example, while organising, the user may ask for more captured or discovered knowledge.
While a specific function is running, the user can query Big Data for more knowledge.
Knowledge creation can happen at any stage.
Figure 2: Knowledge management functions
3.2 Allocations:
Our goal is to develop a framework that provides choices for changing allocations as systems evolve. The goal
is to provide the flexibility to reconfigure the requirements as needs change. That provides the ability for
networks to share knowledge by assigning responsibilities. The following choices are possible:
Type 1 Allocation (knowledge management function specialists) ‐ Allocate all activities of the same
knowledge management function to one group.
Type 2 Allocation (knowledge element specialists) ‐ Allocate all knowledge processing activities on the
same knowledge element to one organisation. The organisation then distributes the different knowledge
processing activities to different groups.
Type 3 – Each functional unit has its own knowledge processing organisation or group.
Type 4 – Totally open (hybrid).
Figure 3: Type 1 allocation
Allocations are at two levels – allocation of the knowledge activity to the group, followed by the allocation of
action tasks to roles in the group.
As an example, the model for type 1 allocation is shown in Figure 3. Figure 3 illustrates an organisation of three
groups which gathers knowledge by assigning roles to them.
There are actors participating in more than one network. For example, the coordinators participate in the
organisational network and also in the group network.
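One way these allocation choices might be represented, again sketched only for illustration (it reuses the KnowledgeActivity record from the earlier sketch, and the group-lookup convention is ours), maps each knowledge processing activity to a responsible group under a chosen allocation type:

from enum import Enum

class AllocationType(Enum):
    FUNCTION_SPECIALISTS = 1   # Type 1: one group per knowledge management function
    ELEMENT_SPECIALISTS = 2    # Type 2: one organisation/group per knowledge element
    PER_BUSINESS_UNIT = 3      # Type 3: each functional unit has its own KM group
    HYBRID = 4                 # Type 4: totally open

def responsible_group(activity, allocation_type, groups):
    """Return the group responsible for a knowledge processing activity."""
    if allocation_type is AllocationType.FUNCTION_SPECIALISTS:
        return groups[activity.function]        # e.g. the dedicated "gathering" group
    if allocation_type is AllocationType.ELEMENT_SPECIALISTS:
        return groups[activity.element.name]    # e.g. the "sale" organisation
    # Types 3 and 4 depend on the business-unit structure and are left open here.
    raise NotImplementedError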
The model does not at this stage include the agencies used in the exchange of information. The goal is to
create these agencies through a cloud platform.
The output of the research is to implement a platform based on cloud technology to support these roles within
the groups. The major functions of this platform are as following:
Creating and resigning the groups.
Supporting group members to access the platform’s services to finish the role’s tasks.
Enabling knowledge sharing between the groups and organisations.
Supporting collaboration between businesses and enterprises.
4. Model implementation
Our goal is to implement the model in the cloud environment to support knowledge sharing across a broad community.
The implementation will create the services for implementing the relationships between roles within business
networks for effective knowledge discovery, gathering and sharing. These services can be categorised as
administrative services and processing services. Administrative services are those modules that support the
administrative activities. Examples of this are creating groups and organisations, creating roles,
assigning/resigning the roles to/from the organisations and groups, and creating users and assigning them to
roles.
Figure 4: High level illustration of the model
The processing services are the modules which are accessed by the users within the groups and organisations.
These services support the knowledge management functions shown in Figure 2. Our model should provide
services for businesses to create their collaborative environments either within the business itself or in
collaboration with other businesses. Services are supposed to provide the business with the capability to adapt
to the changes that occur in the business itself or in the environment. This adaptation includes managing
groups, organisations, roles and users as well as managing the relationships between these sets. Knowledge
management functions and activities should be considered in this operation.
Our goal is to make these services configurable so that new required services can be implemented upon
business request and added to the system at any time. In addition, roles and their responsibilities can be
modified as well.
Figure 4 illustrates a scenario of gathering knowledge for a “new course project” in a faculty.
Big Data is the container of sources, as defined previously, that can be explored by users for knowledge
based on the defined knowledge elements. The cloud environment will contain the services that support the
knowledge management functions. The collaborative organisation “new course” discovers, captures and
organises the knowledge. This knowledge is based on the knowledge elements defined as “subject material”
and “market and prices”. Services in the cloud will support businesses in:
Creating organisations and groups (e.g., discovering, capturing and organising).
Creating roles (e.g., coordinator, material‐discoverer and organiser).
Assigning these roles to the groups/organisations.
Assigning users (e.g., John, David, Mac, etc.) to these roles.
Supporting the knowledge management functions and processes.
The services to be implemented are not considered interfaces between the users and Big Data, as shown in Figure 4. Rather, they support users in discovering knowledge through Big Data and in creating and sharing knowledge among
themselves.
4.1 Technology
Cloud technology is the infrastructure to be used for implementing our model, in order to gain its advantages. That can be achieved by implementing the model as a platform of services delivered as Software as a Service (SaaS) to the beneficiaries. The following are some advantages relevant to our research (Marston et al. 2011):
1. The low cost of using cloud services. This allows small and medium businesses to benefit from these services. Knowledge management receives little focus in small and medium businesses (Pillania 2008), and one of the reasons is the high cost of dedicated knowledge management applications and systems (Nunes et al. 2006).

2. Large capacity. One of the cloud’s features is providing large amounts of data storage. In our model, businesses need not be concerned about continuous scalability, whether in terms of the number of organisations and groups created within the system or the amount of knowledge produced.

3. Mobility. Cloud computing is an online technology, which allows people to share knowledge and participate in organisations and groups even if they are in different and distant geographic areas.
Services to be implemented can be categorised as follows:

Management services: these services support the creation of the objects and components of the model and maintain the relationships between them. They include managing and maintaining the groups and organisations, roles, users and knowledge elements.

Processing services: these services support the knowledge management functions performed. These include knowledge discovery, capturing, filtering and organising.

Sharing services: the knowledge created through the knowledge functions is to be shared among the users. These services support this sort of process and include services that manage requests, responses and broadcasts.

Notification service(s): these services support the communications between the users.
The platform is to be a browser based application. That allows access to the services used by users to create
their organisations and groups, and creating and sharing the knowledge. These services allow the users to
access and manage SQL tables at the back end.
The model will be tested by creating different scenarios and having them evaluated by experts. The tests should demonstrate that our generic model caters for the different collaborative scenarios in terms of knowledge creation and sharing.
5. Summary and future research
This paper presents a model for facilitating knowledge management in complex business systems. It illustrates, at a high level, how the model operates, and it explains the choice of technology for implementation and the reasoning behind that choice.
The semantics of all activities are still to be defined, and the services will then be defined and designed based on those semantics. Semantics here are high-level descriptions of how the model operates and how the relationships between the different component sets are maintained. These semantics will define the operations that take place in the collaborative environment, including the semantics of coordination, management and KM activities.
Figure 5 illustrates how our model is to be developed over time. We start by working through business scenarios and then define the business model and its semantics accordingly. These semantics are applied back to the scenarios, so that they can be evaluated and any necessary changes made to the business model. The implemented services are likewise tested and evaluated through different scenarios. Changes and modifications are made to the business model, semantics and services until the testing criteria are satisfied.
There will be continuous evaluation across these three levels until the system reaches a stable and satisfactory state. In other words, the business model, semantics and technical model are all subject to change in each iteration until the system stabilises.
These services are to be implemented in the cloud to take advantage of the cloud computing environment, initially as a prototype for testing and evaluation.
Figure 5: Model development
References
Agrawal, D., Bernstein, P., Bertino, E., Davidson, S. & Dayal, U. 2011, Challenges and Opportunities with Big Data, Purdue
University.
Alazmi, M. & Zairi, M. 2003, 'Knowledge management critical success factors', Total Quality Management & Business
Excellence, vol. 14, no. 2, pp. 199‐204.
Ali, A. & Hawryszkiewycz, I. 2012, 'A Modelling Approach for Knowledge Management in Complex Business Systems',
IADIS International Conference WWW/Internet, Madrid.
Ali, A., Hawryszkiewycz, I. & Chen, J. 2014, 'Services for Knowledge Sharing in Dynamic Business Networks', paper
presented to the Australasian Software Engineering Conference, Sydney.
Awad, E.M. & Ghaziri, H.M. 2004, 'Working Smarter, Not Harder', in, Knowledge Management, Pearson Education, Inc,
New Jersey, pp. 24‐5.
Barnaghi, P., Sheth, A. & Henson, C. 2013, 'From Data to Actionable Knowledge: Big Data Challenges in the Web of Things',
IEEE Intelligent Systems, vol. 28, no. 6, pp. 6‐11.
Birkinshaw, J. 2001, 'Why is Knowledge Management So Difficult?', Business Strategy Review, vol. 12, no. 1,
pp. 11‐8.
Currie, G. & Maire, K. 2004, 'The Limits of a Technological Fix to Knowledge Management: Epistemological, Political and
Cultural Issues in the Case of Intranet Implementation', Management Learning, vol. 35, no. 1, pp. 9‐29.
Dalkir, K. 2011, 'The Knowledge Management Cycle', in Knowledge Management in Theory and Practice, 2nd edn,
Massachusetts Institute of Technology, London, pp. 31‐58.
Davenport, T.H., Barth, P. & Bean, R. 2012, 'How 'Big Data' Is Different', MIT Sloan Management Review, vol. 54, no. 1, pp.
42‐47.
Davenport, T.H., De Long, D.W. & Beers, M.C. 1998, 'Successful knowledge management projects', Sloan Management
Review, vol. 39, no. 2, pp. 43‐57.
Fernandez, I. & Sabherwal, R. 2010, 'Knowledge Management Solutions: Processes and Systems', in Knowledge
Management, Systems and Processes, M.E. Sharpe, Inc, New York, pp. 56‐70.
Fischer, G. & Ostwald, J. 2001, 'Knowledge Management: Problems, Promises, Realities, and Challenges', IEEE Intelligent
Systems, vol. 16, no. 1, pp. 60‐72.
Hasgall, A.E. 2012, 'The effectiveness of social networks in complex adaptive working environments', Journal of Systems
and Information Technology, vol. 14, no. 3, pp. 220‐35.
Kabir, N. & Carayannis, E. 2013, 'Big Data, Tacit Knowledge and Organizational Competitiveness', Journal of Intelligence Studies in Business,
vol. 3, no. 3, pp. 54‐62.
Marston, S., Li, Z., Bandyopadhyay, S., Zhang, J. & Ghalsasi, A. 2011, 'Cloud computing ‐‐ The business perspective', Decision
Support Systems, vol. 51, no. 1, pp. 176‐89.
McDermott, R. 1999, 'Why information technology inspired but cannot deliver knowledge management', California
Management Review, vol. 41, no. 4, pp. 103‐17.
Miller, J.G. 1965, 'Living systems: Structure and process', Behavioral Science, vol. 10, no. 4, pp. 337‐79.
Moffett, S., McAdam, R. & Parkinson, S. 2003, 'Technology and people factors in knowledge management: an empirical
analysis', Total Quality Management & Business Excellence, vol. 14, no. 2, pp. 215‐24.
Nunes, M.B., Annansingh, F., Eaglestone, B. & Wakefield, R. 2006, 'Knowledge management issues in knowledge‐intensive
SMEs', Journal of Documentation, vol. 62, no. 1, pp. 101‐19.
Pillania, R.K. 2008, 'Strategic issues in knowledge management in small and medium enterprises', Knowledge Management
Research & Practice, vol. 6, no. 4, pp. 334‐8.
Van Zolingen, S.J., Streumer, J.N. & Stooker, M. 2001, 'Problems in Knowledge Management: A Case Study of a Knowledge‐
Intensive Company', International Journal of Training and Development, vol. 5, no. 3, pp. 168‐84.
Wong, K.Y. 2005, 'Critical success factors for implementing knowledge management in small and medium enterprises',
Industrial Management & Data Systems, vol. 105, no. 3‐4, pp. 261‐79.
PREDICTIVE MODELING
Advances in Predictive
Modeling: How
In-Database Analytics
Will Evolve to Change
the Game
Sule Balkan and Michael Goul
Abstract
Sule Balkan is clinical assistant
professor at Arizona State University,
department of information systems.
sule.balkan@asu.edu
Organizations using predictive modeling will benefit from
recent efforts in in-database analytics—especially when they
become mainstream, and after the advantages evolve over
time as adoption of these analytics grows. This article posits
that most benefits will remain under-realized until campaigns
apply and adapt these enhancements for improved productivity. Campaign managers and analysts will fashion in-database
analytics (in conjunction with their database experts) to support their most important and arduous day-to-day activities. In
this article, we review issues related to building and deploying
analytics with an eye toward how in-database solutions
advance the technology. We conclude with a discussion of how
analysts will benefit when they take advantage of the tighter
coupling of databases and predictive analytics tool suites,
particularly in end-to-end campaign management.
Introduction
Michael Goul is professor and chair at
Arizona State University, department
of information systems.
michael.goul@asu.edu
Decoupling data management from applications has
provided significant advantages, mostly related to data
independence. It is therefore surprising that many vendors
are more tightly coupling databases and data warehouses
with tool suites that support business intelligence (BI)
analysts who construct and manage predictive models.
These analysts and their teams construct and deploy models
for guiding campaigns in areas such as marketing, fraud
detection, and credit scoring, where unknown business
patterns and/or inefficiencies can be discovered.
“In-database analytics” includes the embedding of
predictive modeling functionalities into databases or data
warehouses. It differs from “in-memory analytics,” which is
designed to minimize disk access. In-database analytics
focuses on the movement of data between the database
or data warehouse and analysts’ workbenches. In the
simplest form of in-database analytics, the computation
of aggregates such as average, variance, and other statistical summaries can be performed by parallel database
engines quickly and efficiently—especially in contrast to
performing computations inside an analytics tool suite
with comparatively slow file management systems. In
tightly coupled environments, those aggregates can be
passed from the data engine to the predictive modeling
tool suite when building analytical models such as statistical regression models, decision trees, and even neural
networks. In-database analytics also enable streamlining
of modeling processes.
The typical modeling processes referred to as CRISP-DM,
SEMMA, and KDD contain common BI steps or phases.
Knowledge Discovery in Databases (KDD) refers to the
broad process of finding knowledge using data mining
(DM) methods (Fayyad, Piatetski-Shapiro, Smyth, and
Uthurusamy, 1996). KDD relies on using a database
along with any required preprocessing, sub-sampling, and
transformation of values in that database. Another version
of a DM process approach was developed by SAS Institute:
Sample, Explore, Modify, Model, Assess (SEMMA) refers
to the lifecycle of conducting a DM project.
Another approach, CRISP-DM, was developed by a
consortium of Daimler Chrysler, SPSS, and NCR. It stands
for CRoss-Industry Standard Process for Data Mining,
and its cycle has six stages: business understanding, data
understanding, data preparation, modeling, evaluation,
and deployment (Azevedo and Santos, 2008). All three
methodologies address data mining processes. Even though
the three methodologies are different, their common
objective is to produce BI by guiding the construction of
predictive models based on historical data.
A traditional way of discussing methodologies for predictive analytics involves a “sense, assess, and respond” cycle
that organizations and managers should apply in making
effective decisions (Houghton, El Sawy, Gray, Donegan,
and Joshi, 2004). Using historical data to enable managers
to sense what is happening in the environment has been the
foundation of the recent thrust to vitalize evidence-based
management (Pfeffer and Sutton, 2006). Predictive models
help managers assess and respond to the environment in
ways that are informed by historical data and the patterns
within that data. Predictive models help to scale responses
because, for example, scoring models can be constructed
to enable the embedding of decision rules into business
processes. In-database analytics can streamline elements of
the “sense, assess, and respond” cycle beyond those steps or
phases in KDD, SEMMA, and CRISP-DM.
This article explains how basic in-database analytics
will advance predictive modeling processes. However,
we argue that the most important advancements will
be discovered when actual campaigns are orchestrated
and campaign managers access the new, more tightly
coupled predictive modeling tool suites and database/data
warehouse engines. We assert that the most important
practical contribution of in-database analytics will occur
when analysts are under pressure to produce models
within time-constrained campaigns, and performances
from earlier campaign steps need to be incorporated to
inform follow-up campaign steps.
The next section discusses current impediments to predictive analytics and how in-database analytics will attempt
to address them. We also discuss the benefits to be realized
after more tightly coupled predictive analytics tool suites
and databases/data warehouses become widely available.
These benefits will be game-changers and will occur in such
areas as end-to-end campaign management.
What is Wrong with Current
Predictive Analytics Tool Suites?
Current analytics solutions require many steps and take
a great deal of time. For analysts who build, maintain,
deploy, and track predictive models, the process consists
of many distributed processes (distributed among
analysts, tool suites, and so on). This section discusses
challenges that analysts face when building and deploying
predictive models.
Time-Consuming Processes
To build a predictive model, an analyst may have to tap
into many different data sources.
Figure 1. SEMMA methodology supported by the SAS Enterprise Miner environment: Sample (input data, sampling, data partition); Explore (ranks, plots, variable selection); Modify (variable transformation, outlier filtering, missing-value imputation); Model (regression, tree, neural network); Assess (assessment, scoring, reporting)
Data sources must contain known values for target variables in order to be used
when constructing a predictive model. All the attributes
that might be independent variables in a model may reside
in different tables or even different databases. It takes time
and effort to collect and synthesize this data.
Once all of the needed data is merged, each of the independent variables is evaluated to ascertain the relations,
correlations, patterns, and transformations that will be
required. However, most of the data is not ready to be
analyzed unless it has been appropriately customized. For
example, character variables such as gender need to be encoded, as do nominally numeric variables such as ZIP code.
Some continuous variables may need to be converted into
scales. After all of this preparation, the modeling process
continues through one of the many methodologies such as
KDD, CRISP-DM, or SEMMA. For our purposes in this
article, we will use SEMMA (see Figure 1).
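As a concrete illustration of this customization step, the short Python sketch below encodes a character variable, treats ZIP code as a categorical value, and converts a continuous variable into a scale. The file name and the gender, zip_code, income, and response columns are hypothetical.

import pandas as pd

df = pd.read_csv("modeling_extract.csv")                     # hypothetical merged extract
df["gender_f"] = (df["gender"] == "F").astype(int)           # character variable -> indicator
df["zip3"] = df["zip_code"].astype(str).str[:3]              # ZIP treated as a category, not a number
df["income_band"] = pd.qcut(df["income"], 5, labels=False)   # continuous variable -> scale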
The first step of SEMMA is data sampling and data
partitioning. A random sample is drawn from a population to prevent bias in the model that will be developed.
Then, a modeling data set is partitioned into training and
validation data sets. Next is the Explore phase, where each
explanatory variable is evaluated and its associations with
other variables are analyzed. This is a time-consuming step,
especially if the problem at hand requires evaluating many
independent variables.
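A minimal sketch of the Sample step, continuing the hypothetical data frame from the previous sketch: draw a random sample and partition it into training and validation sets. The sampling fraction and split ratio are illustrative choices.

from sklearn.model_selection import train_test_split

sample = df.sample(frac=0.10, random_state=42)   # random sample from the population extract
train, valid = train_test_split(
    sample, test_size=0.3, random_state=42, stratify=sample["response"]
)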
In the Modify phase, variables are transformed; outliers
are identified and filtered; and for those variables that are
not fully populated, missing value imputation strategies
are determined. Rectifying and consolidating different
analysts’ perspectives with respect to the Modify phase
can be arduous and confusing. In addition, when applying
transformations and inserting missing values in large data
sets, a tool suite must apply operations to all observations
and then store the resulting transformations within the tool
suite’s file management system.
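The Modify activities described here might look like the following sketch, again on hypothetical columns (income, purchase_amt, age); the transformation, the outlier threshold, and the imputation rule are illustrative, not prescriptions.

import numpy as np

train = train.copy()
train["log_income"] = np.log1p(train["income"])                                # variable transformation
train = train[train["purchase_amt"] <= train["purchase_amt"].quantile(0.99)]   # filter extreme outliers
train["age"] = train["age"].fillna(train["age"].median())                      # missing-value imputation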
Many techniques can be used in the Model phase of
SEMMA, such as regression analysis, decision trees, and
neural networks. In constructing models, many tool suites
suffer from slow file management systems, which can
constrain the number and quality of models that an analyst
can realistically construct.
The last phase of SEMMA is the Assess phase, where all
models built in the modeling phase are assessed based
on validation results. This process is handled within
tool suites, and it takes considerable time and many
steps to complete.
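Continuing the same hypothetical partitions, the Model and Assess steps might be sketched as follows: fit a regression and a decision tree on the training set, then compare them on the validation set. The feature list assumes the engineered columns from the earlier sketches are present in both partitions and carry no missing values.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

features = ["gender_f", "income_band"]          # illustrative feature set
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
}
for name, model in candidates.items():
    model.fit(train[features], train["response"])
    auc = roc_auc_score(valid["response"], model.predict_proba(valid[features])[:, 1])
    print(name, "validation AUC:", round(auc, 3))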
Multiple Versions and Sources of the Truth
Another difficulty in building and maintaining predictive
models, especially in terms of campaign management,
is the risk that modelers may be basing their analysis on
multiple versions and sources of data. That base data is
often referred to as the “truth,” and the problem is often
referred to as having “multiple versions of the truth.”
To complete the time-consuming tasks of building
predictive models as just described, each modeler extracts
data from a data warehouse into an analytics workstation.
This may create a situation where different modelers are
working from different sources of truth, as modelers
might extract data snapshots at different times (Gray and
Watson, 2007). Also, having multiple modelers working on
different predictive models can mean that each modeler is
analyzing the data and creating different transformations
from the same raw data without adopting a standardized
method or a naming convention. This makes deploying
multiple models very difficult, as the same raw data may
be transformed in different ways using different naming
conventions. It also makes transferring or sharing models
across different business areas challenging.
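One way to avoid these divergent, analyst-specific transformations is to define them once, in-database, and have every modeler read from the same object. The sketch below (Python with an embedded SQL statement; table and column names are hypothetical) creates a shared view that fixes both the transformation and the naming convention in a single place.

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")           # stand-in for the warehouse connection
conn.execute("""
    CREATE VIEW IF NOT EXISTS customer_model_base AS
    SELECT customer_id,
           CASE WHEN gender = 'F' THEN 1 ELSE 0 END AS gender_f,
           SUBSTR(zip_code, 1, 3)                   AS zip3,
           total_purchases_12m
    FROM customer_master
""")
shared = pd.read_sql("SELECT * FROM customer_model_base", conn)   # every analyst reads the same truth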
Another difficulty relates to the computing resources on
each modeler’s workbench when multiple modelers are
going through similar, redundant steps of data preparation, transformation, segmentation, scoring, and all the
other functions that can take a great deal of disk space
and CPU time.
The Challenges of Leveraging Unstructured Data and Web
Data Mining in Modeling Environments
Modelers often tap into readily available raw data in the
database or data warehouse. However, unstructured data
is rarely used during these phases because handling data
in the form of text, e-mail documents, and images is
computationally difficult and time consuming. Converting unstructured data into information is costly in a
campaign management environment, so it isn’t often
done. The challenges of creating reusable and repeatable
variables for deployment make using unstructured data
even more difficult.
Web data mining spiders and crawlers are often used
to gather unstructured data. Current analyst tool suite
processes for unstructured data require that modelers
understand archaic processing commands expressed in
specialized, non-standard syntax. There are impediments
to both gathering and manipulating unstructured data,
and there are difficulties in capturing and applying
predictive models that deal with unstructured data. For
example, clustering models may facilitate identifying rules
for detecting what cluster a new document is most closely
aligned with. However, exporting that clustering rule from
the predictive modeling workbench into a production
environment is very difficult.
Managing BI Knowledge Worker Training and
Standardization of Processes
In most organizations, there is a centralized BI group that
builds, maintains, and deploys multiple predictive models
for different business units. This creates economies of scale,
because having a centralized BI group is definitely more
cost effective than the alternative. However, the economies
of scale do not cascade into standardization of processes
among analyst teams. Each individual contributor usually
ends up with customized versions of code. Analysts may
not be aware of the latest constructs others have advanced.
What Basic Changes Will In-Database
Analytics Foster?
In-database analytics’ major advantage is the efficiencies
it brings to predictive model construction processes due
to processing speeds made possible by harnessing parallel
database/warehouse engine capabilities. Time savings are
generated in the completion of computationally intensive
modeling tasks. Faster transformations, missing-value
imputations, model building, and assessment operations
create opportunities by leaving more time available
for fine-tuning model portfolios. Thanks to increasing
cooperation between database/warehouse experts and
predictive modeling practitioners, issues associated with
non-standardized metadata may also be addressed. In
addition, there is enhanced support for analyses of very
large data sets. This couldn’t come at a better time, because
data volumes are always growing.
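The basic pattern is to push the heavy computation to the database engine and pull back only summary rows. A small sketch, with hypothetical table and column names, using an embedded SQL aggregate rather than extracting the full detail table into the tool suite:

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")
summary = pd.read_sql("""
    SELECT segment,
           COUNT(*)          AS n_customers,
           AVG(order_value)  AS avg_order_value,
           AVG(order_value * order_value) - AVG(order_value) * AVG(order_value)
                             AS var_order_value   -- variance computed in-engine
    FROM transactions
    GROUP BY segment
""", conn)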
In-database analytics make it easier to process and use
unstructured data by converting complicated statistical processes into manageable queries. Tapping into
unstructured data and creating repeatable and reusable
information—and combining this into the model-building
process—may aid in constructing much better predictive
models. For example, moving clustering rules into the
database eliminates the difficulty of exporting these rules to
and from tool suites. It also eliminates most temporary data
storage difficulties for analyst workbenches.
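For example, once cluster centroids are known, the assignment rule can be written directly as a SQL CASE expression and scored where the data lives, so nothing has to be exported from the workbench. The two centroids and the standardized attributes below are purely illustrative.

import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE customer_cluster AS
    SELECT customer_id,
           CASE WHEN (recency_z - 0.8) * (recency_z - 0.8) +
                     (spend_z - 1.2)   * (spend_z - 1.2)
                   <
                     (recency_z + 0.5) * (recency_z + 0.5) +
                     (spend_z + 0.9)   * (spend_z + 0.9)
                THEN 'cluster_A' ELSE 'cluster_B' END AS cluster_label
    FROM customer_features
""")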
Shared environments created by in-database analytics may
bring business units together under common goals. As
different business units tap into the same raw data, including all possible versions of transformations and metadata,
productivity can be enhanced. When new ways of building
models are available, updates can be made in-database.
All individual contributors have access to the latest
developments, and no single business unit or individual
is left behind. Saving time in the labor-intensive steps of
model building, working from a single source of truth, having access to repeatable and reusable structured and unstructured data, and making sure all the business units are working with the same standards and updates. All of this makes it easier to transfer knowledge as new analysts join or move across business units. Table 1 summarizes the preliminary benefits of in-database analytics for modelers.
Data set creation and preparation: reduce cycle time by parallel-processing multiple functions; accurate and timely completion of tasks by functional embedding.
Data processing and model building by multiple analysts: eliminate multiple versions of truth and large data set movements to and from analytical tool suites.
Unstructured data management: broaden analytics capability by streamlining repeatability and reusability.
Training and standardization: create operational and analytical efficiencies; access to latest developments; automatically update metadata.
Table 1. Preliminary benefits of in-database analytics
Context for In-Database Analytics Innovation
To drive measurable business results from predictive models, SEMMA (or a similar methodology) is followed by a deployment cycle. That cycle may involve the continued application of models in a (recurring) campaign, refinement when model performance results are used to revise other models, making decisions on whether completely new models are required given model performance, and so on. We distinguish deployment from the SEMMA-supported phase (intelligence) because deployment often engages the broader organization and requires a predictive model (or models) to be put into actual business use. This section introduces a new methodology we created to describe deployment: “DEEPER” (Design, Embed, Empower, Performance measurement, Evaluate, and Re-target). Figure 2 depicts the iterative relationship between SEMMA and DEEPER.
Figure 2. DEEPER phases guide the deployment, adoption, evaluation, and recalibration of predictive models. The figure shows the intelligence cycle (SEMMA: Sample, Explore, Modify, Model, Assess) linked to the deployment cycle (DEEPER: Design, Embed, Empower, Performance measurement, Evaluate, Re-target).
The DEEPER phases delineate, in sequential fashion, the types of activities involved in model deployment, with a special emphasis on campaign management. The
design phase involves making plans for how to transition a
scoring model (or models) from the tool suite (where it was
developed) to actual application in a business context. It
also involves thinking about how to capture the results of
applying the model and storing those results for subsequent
analysis. There may also be other data that a campaign
manager wishes to capture, such as the time taken before
seeing a response from a target. A proper design can eliminate missteps in a campaign. For example, if a targeted
catalog mailing is enabled by a scoring model developed
using SEMMA, then users must choose which deciles to
target first, how to capture the results of the campaign (e.g.,
actual purchases or requests for new service), and what new
data might be appropriate to capture during the campaign.
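As a sketch of what this design work can produce, the following response-capture table records, for a hypothetical catalog campaign, which decile each target sat in, when the treatment went out, and how and when the target responded. All names and columns are illustrative.

import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS campaign_response (
        campaign_id    TEXT NOT NULL,
        customer_id    TEXT NOT NULL,
        model_version  TEXT NOT NULL,
        score_decile   INTEGER,        -- decile chosen for targeting
        mailed_at      TEXT,
        responded_at   TEXT,           -- NULL until a response is observed
        response_type  TEXT,           -- e.g. purchase or new-service request
        response_value REAL,
        PRIMARY KEY (campaign_id, customer_id)
    )
""")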
Once designed, the model must be accurately embedded
into business processes. Model score views must be secured;
developers must ensure scores appear in user interfaces at
the right time; and process managers must be able to insert
scores into automated business process logic. Embedding
a predictive model may require safeguards for exceptions: if there are cases where a model should not be applied, additional safeguards need to be considered.
Making the results of a predictive model (e.g., a score)
available to people and systems is just the first step in
ensuring it is used. In the empower phase, employees
may need to be trained to interpret model results; they
may have to learn to look at data in a certain way using
new interfaces; or they may need to learn the benefits of
evidence-based management approaches as supported by
predictive modeling. Similarly, if people are involved, testing may be required to ensure that training approaches are
working as intended. The empower step ensures appropriate
behaviors by both systems and people as they pertain to the
embedding of the predictive model into business processes.
A campaign begins in earnest after the empower phase.
Targets receive their model-prescribed treatments, and
reactions are collected as planned for in the design phase
of DEEPER. This reactions-directed phase, performance
measurement, involves ensuring the reactions and events
subsequent to a predictive model’s application are captured
and stored for later analysis. The results may also be
captured and made available in real-time support for
campaign managers. Dashboards may be appropriate for
monitoring campaign progress, and alerts may support
managers in making corrections should a campaign stray
from an intended path. If there is an anomaly, or when a
campaign has reached a checkpoint, campaign managers
take time to evaluate the effectiveness or current progress of
the campaign. The objective is to address questions such as:
■■ Are error levels acceptable?
■■ Were campaign results worth the investment in the predictive analytics solution?
■■ How is actual behavior different from predicted behavior for a model or a model decile?
This is the phase when the campaign’s effectiveness and
current progress are assessed.
The results of the evaluate phase of DEEPER may lead to a
completely new modeling effort. This is depicted in Figure
3 by the gray background arrow leading from evaluate to
the sample phase of SEMMA. This implies a transition
from deployment back to what we have referred to as
intelligence. However, there is not always time to return
to the intelligence cycle, and minor alterations to a model
might be deemed more appropriate than starting over. The
latter decision is most prevalent in time-pressured, recurring campaigns. We refer to this phase as re-target, which
requires analysts to take into account new information
gathered as part of the performance measurement deployment phase. It also takes advantage of the plans for how
this response information was encoded per the design phase
of deployment.
The most important consideration involves interpreting
results from the campaign and managing non-performing
targets. A non-performing target is one that scored high in
a predictive model, for example, but that did not respond
as predicted. In a recurring campaign, there may be an
effort to re-target that subset. There could also be an effort
to re-target the campaign to another set of targets, e.g.,
those initially scored into other deciles. Re-targeting can
be a time-consuming process; new data sets with response
results need to be made available to predictive modeling
tool suites, and findings from tracking need to be incorporated into decisions.
DEEPER provides the context for considering how
improvements to in-database analytics can be game-changers. In-database analytics can make significant inroads to
DEEPER processes that take time and are under-supported
by predictive modeling tool suites. However, these improvements will be driven by analysts who work closely with
their organizations’ database experts. This combination
of analyst and data management skills, experience, and
knowledge will spur innovation significantly beyond
current expectations.
How Might In-Database Analytics
for DEEPER Evolve?
Extending in-database analytics to DEEPER processes
requires considering how each DEEPER phase might be
streamlined given tighter coupling between predictive
modeling tool suites and databases/data warehouses.
Although many of the advantages of this tighter coupling
may be realized differently by different organizations, there
are generic value streams to guide efforts. Here the phrase
“value stream” refers to process flows central to DEEPER.
This section discusses these generic value streams: (1)
intelligence-to-plan, (2) plan-to-implementation, (3)
implementation-to-use, (4) use-to-results, (5) results-to-evaluation, and (6) evaluation-to-decision.
In the design phase of DEEPER, planning can be facilitated by examining possible end-user database views that
could be augmented with predictive intelligence. Instead
of creating new interfaces, it is possible that Web pages
equipped with embedded database queries can quickly
retrieve and display predictive model scores to decision
makers or front-line employees. Many of these displays are
already incorporated into business processes, so opportunities to use the tables and queries to supply model results can
streamline implementation. When additional data items
need to be captured, that data may be captured at the point
of sale or other customer touch points. A review of current
metadata may speed up the design of a suitable deployment
strategy. In addition to “pushing” model intelligence to
interfaces, there may also be ways of “pulling” data from
the database/warehouse to facilitate re-targeting or for
initiating new SEMMA cycles.
For example, it may be possible to design queries to
automate the retrieval of data items such as target response
times from operational data stores. Similarly, it may be
possible to use SQL to aggregate the information needed
for this type of next-step analysis. For example, total
sales to a customer within a specified time period can be
aggregated using a query and then used in the re-targeting
phase to reflect whether a target performed as predicted.
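A sketch of that aggregation, expressed as an embedded SQL query over hypothetical sales tables, with a 90-day window chosen purely for illustration:

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")
recent_sales = pd.read_sql("""
    SELECT customer_id,
           SUM(sale_amount) AS total_sales_90d
    FROM sales_transactions
    WHERE sale_date >= DATE('now', '-90 days')
    GROUP BY customer_id
""", conn)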
In-database analytics can support the design phase because
it eliminates many of the traditional bottlenecks such as
complex requirements gathering and the creation of formal
specification documents (including use cases). Instead,
existing use cases can be reviewed and augmented, and
database/warehouse–supported metadata facilities can
support the design of schema for capturing new target
response data. We refer to this as an intelligence-to-plan
value stream for the in-database analytics supported design
deployment phase.
In the embed phase, transferring scored model results
to tables is a first step in considering ways to make use
of database/warehouse capabilities to support DEEPER.
Once the scores are appropriately stored in tables, there are
many opportunities to use queries to embed the scores into
people-supported and automated business processes. For
example, coding to retrieve scores for inclusion in front-line
employee interfaces can be done in a manner consistent
with other embedded SQL applications. This saves time
in training interface developers because it implies that
the same personnel who implemented the interfaces can
effectively alter them to include new intelligence.
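Retrieving a stored score for a front-line interface can then be an ordinary parameterized query, as in this sketch (the customer_scores table and the model name are hypothetical):

import sqlite3

conn = sqlite3.connect("warehouse.db")

def score_for(customer_id):
    # Look up the stored score for one customer, for display in a user interface.
    row = conn.execute(
        "SELECT score FROM customer_scores "
        "WHERE customer_id = ? AND model_name = 'churn_v3'",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None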
There is also no need for additional project governance
functions or specialized software. In fact, database/
warehouse triggers and alerts can be used to ensure that
predictive analytics are used only when model deployment
assumptions are relevant. As the database/warehouse is the
same place where analytic model results reside, there are
numerous implementation advantages. We refer to this as
a plan-to-implementation value stream for the in-database
analytics supported embed deployment phase.
After implementation, testing will ensure that model
results/scores are understandable to decision makers (the
empower phase) and that their performance can scale
when production systems are at high capacity. Such stress
tests can be conducted in a manner similar to database
view tests. Because of the inherent speed of database/
warehouse systems, their performance will likely exceed
separate, isolated workbench performance. Global roll-out
can be eased by tried-and-true database/warehouse roll-out
processes. We refer to this as an implementation-to-use value
stream for the in-database analytics supported empower
deployment phase.
Similarly, the use-to-results value stream is that part of a
campaign when actions are taken and targets respond.
In this performance measurement phase of deployment,
dashboards can be used to track performance, database
tables can automatically collect and store ongoing
campaign results, queries can aggregate responses over
time as part of automating responses, and many other
in-database solutions can help to streamline related
processes. This information is central to the evaluate phase,
where the results-to-evaluation value stream can enable
careful scrutiny of the predictive analytics model portfolio.
Queries can be written to compare actual results to those
predicted during SEMMA phases. When more than one
model has been constructed in the SEMMA processes, all
can be re-examined in light of the new information about
responses. If-then statements can be embedded in queries
to identify target segments that have responded according
to business goals, and remaining non-responders can be
quickly identified.
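A sketch of such an evaluation query, joining the hypothetical campaign_response table to a table of predicted decile response rates and flagging deciles that fall short of a goal (the 80 percent threshold is illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")
evaluation = pd.read_sql("""
    SELECT r.score_decile,
           p.predicted_rate,
           AVG(CASE WHEN r.responded_at IS NOT NULL THEN 1.0 ELSE 0.0 END) AS actual_rate,
           CASE WHEN AVG(CASE WHEN r.responded_at IS NOT NULL THEN 1.0 ELSE 0.0 END)
                     >= 0.8 * p.predicted_rate
                THEN 'on track' ELSE 'under-performing' END AS status
    FROM campaign_response r
    JOIN decile_predictions p ON p.score_decile = r.score_decile
    WHERE r.campaign_id = 'FALL_CATALOG'
    GROUP BY r.score_decile, p.predicted_rate
""", conn)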
Such analysis can be done for each analytical model in the portfolio and for each decile of predicted respondents associated with those models. This has been an enormously time-consuming process in the past, but the database/warehouse query engine can conduct this type of post-analysis efficiently. Queries can also identify subsets of respondents that outperformed the predicted model performance, and those that significantly under-performed. This type of analysis can be quickly supported through queries, and it can provide significant insight for the re-target phase.
Following the results-to-evaluation value stream of the deployment cycle, the evaluation-to-decision value stream focuses on whether a new intelligence cycle (a repeat of SEMMA processes) is required. If performance results indicate major model failures, then a repeat is likely necessary to resurrect and continue a campaign. Even if there weren't major failures, environmental changes such as economic conditions may have rendered models outdated. Data collected in the performance evaluation phase may help to streamline the decision process. If costs aren't being recovered, then it is likely that either the campaign will cease or a new intelligence cycle is necessary.
Often a portfolio of models is created in the initial intelligence cycle. It may be possible to use queries to automate the process of recalculating the prior and anticipated performance of the models in the portfolio. If models exist that were not used but appear to perform better, those models may be used in the next DEEPER cycle. Alternatively, a combination or pooling of models might be most appropriate. Again, automated queries might be able to provide decision support for such pooling options, and they can aid in scheduling the appropriate model for the data sets as the DEEPER cycle progresses. In addition, it may be possible to use queries to apply business rules to manage data sets, and prior results could inform the scheduling of resting periods for targets such that each target isn't inundated with catalog mailings, for example.
Conclusion
Table 2 summarizes key generic value streams that can be
supported by in-database analytics and briefly describes
the possibilities discussed in this section. Opportunities to
evolve in-database analytics are likely to be numerous.
In-database analytics create an environment where
functions are embedded and processed in parallel, thereby
streamlining the steps of both intelligence (e.g., SEMMA)
and deployment (e.g., DEEPER) cycles. As data sources
are updated, attribute names and formats may change, yet
they are sharable. In-database analytics can support quality
checks and create warning messages if the range, format,
and/or type of data differ from a previous version or model
assumptions. If external data has attributes that were not in
the data dictionary, metadata can be updated automatically.
Data conversions can be handled in-database and only once
instead of being repeated by multiple modelers. In-database
analytics fosters stability, enhances efficiency, and improves
productivity across business units.
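One such quality check can be written as a query against a table of stored model assumptions. In the sketch below, the model_assumptions table and the income attribute are hypothetical; the warning simply flags values outside the range the model was built on.

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")
check = pd.read_sql("""
    SELECT a.attribute_name,
           a.expected_min, a.expected_max,
           MIN(c.income) AS observed_min,
           MAX(c.income) AS observed_max
    FROM model_assumptions a, customer_master c
    WHERE a.attribute_name = 'income'
    GROUP BY a.attribute_name, a.expected_min, a.expected_max
""", conn)
for _, row in check.iterrows():
    if row.observed_min < row.expected_min or row.observed_max > row.expected_max:
        print("WARNING:", row.attribute_name, "falls outside the modeled range")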
In-database analytics will be critical to a company’s
bottom line when models are deployed and there is
time pressure for multiple, successive campaigns where
ongoing results can be used to build updated, improved
predictive models. Enhancements can be realized in a
host of value streams. For example, in-database analytics
can significantly reduce cycle times for rebuilding and
redeploying updated models to meet campaign deadlines. As multiple models are constructed, in-database
analytics will enable managing them as a portfolio.
Intelligence-to-plan: planning is streamlined; push and pull strategies are feasible; schema design can support planning.
Plan-to-implementation: scores maintained in-database; embedded SQL in HTML can facilitate view deployment; triggers and alerts can be used to guard for exceptions.
Implementation-to-use: stress testing and global rollout follow database/warehouse methodologies and rely on common human and physical resources.
Use-to-results: dashboards can be readily adapted; database/warehouse tables can be used as response aggregators.
Results-to-evaluation: re-examine all created models efficiently in light of response information; embed if-then logic to re-target non-responders.
Evaluation-to-decision: consider applying different models; allow targeted respondents to “rest”; use the database to provide decision support for deciding to re-target or re-enter the intelligence cycle.
Table 2. Generic value streams and areas for innovation with in-database analytics
Timely responses, tracking, and fast interpretation of early responders to campaigns will enable companies to fine-tune business rules and react in record time.
As the fine line between intelligence and deployment cycles
fades because of the fast-paced environment supported
by in-database analytics, businesses may move away
from the concept of campaign management into trigger-based, “lights-out” processing, where all data feeds are automatically updated and processed, and there is no need to compile data into periodic campaigns. There will be real-time decision making with instant scoring each time there
is an update in one of the important independent variables.
Analysts will spend their time fine-tuning model performance, building business rules, analyzing early results,
monitoring data movements, and optimizing the use of
multiple models—instead of dealing with the manual tasks
of data preparation, data cleansing, and managing file
movements and basic statistical processes that have been
moved into the database/warehouse.
Although lights-out processing is not on the near-term
horizon, the evolution of in-database analytics promises to
move organizations in that direction. Once in the hands of
analysts and their database/warehouse teams, in-database
analytics will be a game-changer.
References
Azevedo, Ana, and Manuel Felipe Santos [2008]. “KDD,
SEMMA AND CRISP-DM: A Parallel Overview.”
IADIS European Conference Data Mining,
pp. 182–185.
Fayyad, U. M., Gregory Piatetski-Shapiro, Padhraic
Smyth, and Ramasamy Uthurusamy [1996]. Advances in
Knowledge Discovery and Data Mining, AAAI Press/The
MIT Press.
Gray, Paul, and Hugh J. Watson [2007]. “What Is
New in BI,” Business Intelligence Journal, Vol. 12, No. 1.
Houghton, Bob, Omar A. El Sawy, Paul Gray, Craig
Donegan, and Ashish Joshi [2004]. “Vigilant
Information Systems for Managing Enterprises in
Dynamic Supply Chains: Real-Time Dashboards at
Western Digital,” MIS Quarterly Executive,
Vol. 3, No. 1.
Pfeffer, Jeffrey, and Robert I. Sutton [2006]. “Evidence
Based Management,” Harvard Business Review, January.