ASSESSING THE
MARKET VALUE RATIOS
OF RECENT HOME SALES
Fairfax County Virginia and the
Surrounding Communities
Parcel Real Estate Analytics
7054 Haycock Rd
Falls Church, VA 22043
(703) 538-8310
parcelrealestateanalytics@vt.edu
Contents
Introduction
Business Understanding
Data Understanding
Data Preparation
Data Modeling
Evaluation
Deployment, Monitoring and Maintenance
Conclusion
References Consulted
Appendix A
www.parcelrealestateanalytics.com
1. Introduction
The abundance of publicly available data in the real estate industry has been a boon, but it remains a challenge for industry incumbents to extract meaningful information from it to serve their business and personal needs. Despite having a sea of data, businesses struggle to guide seller and buyer decisions with targeted data mining results, often simply for lack of initiative. In this proposal, we describe how data mining can help solve business problems using the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. CRISP-DM is an industry-proven way to guide data mining and analytic efforts. As a methodology, it includes descriptions of the typical phases of a project, the tasks involved in each phase, and an explanation of the relationships between these tasks. As a process model, CRISP-DM provides an overview of the data mining life cycle: a sequence of six phases, starting with developing a good understanding of the business and ending with a recommended deployment of a solution that satisfies the specific business needs.
Figure 1: The CRISP-DM Model. [1]
[1] Majid Bahrepour. The Forgotten Step in CRISP-DM and ASUM-DM Methodologies. https://amsterdam.luminis.eu/2018/08/17/the-forgotten-step-in-crisp-dm-and-asum-dm-methodologies/
2. Business Understanding
BUSINESS PROBLEM
Settling on the right price when selling a home is a top-of-mind question for sellers. While many businesses provide information on the housing market, including available properties and sale prices, they are mostly geared toward providing value to the buyer rather than the seller. Homeowners who are ready to sell are often constrained by the lack of accuracy in websites and applications such as Trulia, Zillow, and Realtor.com. Not only is it difficult to list the right price, but it is even more challenging to settle on a price that maximizes profit and sells within the seller's timing needs (days on market). For example, someone moving to a new location for a job that starts in two weeks typically wants to price their home for a quick sale. This is very different from someone selling their home with hopes of eventually retiring to Mexico: the first seller prices for a shorter time on the market, while the second prices to maximize profit and can tolerate an extended period of days on market (DOM). The ability to project sale prices based on a seller's desired DOM is therefore very valuable to home sellers.
BUSINESS OBJECTIVE
To address this business need, our team of business, data, and data mining experts will perform a theoretical analysis to develop a pricing strategy tool that uses available real estate data to predict an estimated sale price for a home given a desired sell-date range (days on market) and a desired bottom line.
The revenue model for this project would be a subscription model that allows sellers and realtors to
access the service and create DOM estimates with associated probabilities by asking price.
BUSINESS SUCCESS CRITERIA
To understand whether our model addresses the business problem, we will have key performance indicators in place (discussed in more detail in the deployment section). Overall, we will look at customer feedback in addition to satisfaction, precision, and acquisition measures.
SITUATION ASSESSMENT
To help achieve the business objectives we will examine regional data, particularly for Fairfax County, Virginia, available through government websites and online marketplaces. The data will help us establish a proof of concept and determine the feasibility of the business proposal.
DATA MINING GOAL
Our goal is to develop models that can provide sellers and realtors with useful insights and help
answer the following questions for them:
1. How soon can I sell the house? (Sellers)
2. How much should I sell the house for? (Sellers and Realtors)
3. Do I need to cut the price to sell the house? (Sellers and Realtors)
Because multiple factors impact the sale of a home, understanding the housing market within the Fairfax County region is key. We will do this by looking at common characteristics of homes that have sold within 0-30, 30-60, and 60-90 days on the market, and at the home features that affect a home's listing price in relation to its sale. Both unsupervised and supervised machine learning algorithms will be used.
Unsupervised:
• Hierarchical and k-means clustering: We will explore underlying groups in the data and identify the attributes that characterize the clusters.

Supervised:
• Regression: We will build a predictive model with optimal asking price as the dependent variable. We will include variables such as assessed value, square footage, buyer-seller index, and days on market.
• Classification (decision tree and logistic regression): We will use classification algorithms to advise whether a property can be sold within a desired amount of time.
With these models in place we can understand the market better and provide both sellers and
realtors with advice so they achieve optimal economic gain.
REQUIREMENTS, ASSUMPTIONS, CONSTRAINTS, RISKS AND CONTINGENCIES
Our team has discussed and identified the following requirements, assumptions, constraints, risks
and contingencies:
1. We will look only at data for the years 2006-2016.
2. We assume 50 days as the cut-off point between quick-selling and slow-selling homes.
3. There is a risk we will encounter limitations in the data, such as incomplete records, a lack of data on buyers, and a lack of data where the factors that influence the value of a home are subjective.
4. External factors could impact the accuracy of our predictions, such as interest rates, the economy, government policies, the stock market, and natural disasters.
3. Data Understanding
The data we will be exploring is readily available through government websites and online
marketplaces. The data is largely reliable, and the cost is minimal. We do not anticipate the need for
any further investment to obtain additional data sources or sets based on the scope of the project.
DATA SOURCES
We have identified the following potential sources of data freely available from the data.gov data
catalog. All of the datasets refer specifically to home data from Fairfax County, Virginia and provide a
varied representation of property, owner, and sales data.
In response to the business problem stated above, we will collect the following data for sold homes
via the Zillow API:
• Median Sales Price (Seasonally adjusted)
• Median List Price
• Sale-to-List Ratio
• Median Price Cut (%)
• Days on Zillow
• Market Health Index
• Buyer-Seller Index
Additionally, we will leverage the following freely available data sets from data.gov:
• Tax Administration's Real Estate - Land Data: data points related to the size and location of land parcels in Fairfax County.
• Tax Administration's Real Estate - Sales Data: data related to sale dates and sale prices.
• Tax Administration's Real Estate - Dwelling Data: specific data about the dwelling, e.g. bedroom count, bathroom count, etc.
• Tax Administration's Real Estate - Parcels Data: data related to land parcels in Fairfax County. This set may require additional evaluation to determine whether it provides unique data that cannot be gleaned from the other sets.
• Tax Administration's Real Estate - Assessed Values: assessed values could help show how significantly the seller's timeline (available DOM) impacts price adjustments; a regression of sale price on assessed value will assist in this analysis.
Each data.gov set can be cross-referenced and joined using the PARID (presumably, Parcel ID) foreign
key, which exists in all of the data sets. For the specific business problem outlined above, we will
look only at the data for the years 2006-2016. Only sales listed as “valid and verified” in the data will
be considered. We intend to combine the Fairfax County sources with data from Zillow to obtain the
days on market (DOM) value for properties which fit these parameters.
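The PARID cross-referencing described above can be sketched in a few lines of Python. The records and field names below are invented for illustration; they are not the actual data.gov schema.

```python
# Sketch: joining two Fairfax County data sets on the shared PARID key.
# Records and field names are illustrative, not the real column names.
sales = [
    {"PARID": "0123-01-0001", "sale_price": 540000, "sale_year": 2014},
    {"PARID": "0123-01-0002", "sale_price": 610000, "sale_year": 2015},
]
dwellings = [
    {"PARID": "0123-01-0001", "bedrooms": 3, "bathrooms": 2},
    {"PARID": "0123-01-0002", "bedrooms": 4, "bathrooms": 3},
]

# Index the dwelling data by PARID, then enrich each sale record.
by_parid = {d["PARID"]: d for d in dwellings}
joined = [{**s, **by_parid[s["PARID"]]} for s in sales if s["PARID"] in by_parid]

print(joined[0])
```

In practice the same join would be done in Tableau or a database, but the logic is identical: index one set by PARID and look each record of the other set up against it.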
The sources of the data are independent, though the underlying raw data will significantly overlap.
Data from both sources contains complementary information that will help sellers develop a
comprehensive view of the market and determine an optimal asking price.
We will compile data over the course of several months. Parcel size and square footage of the
dwelling will also be considered as we seek to provide sellers with an optimal starting price point
given the speed with which they need to close their sale.
DATA IDENTIFICATION AND DESCRIPTION
Internal deliberations have identified some of the most important attributes to consider for our analysis. Please note this list is not exhaustive; we may add or remove attributes as we progress to data analysis.
Attribute | Description
Days on Market | Days the home has been posted on Zillow until it is sold
Zillow Sale Price | The final sale price of the home from Zillow
Square Footage | The square footage of the home
Zillow Asking Price | The list price of the home posted on Zillow
Assessed Value | The appraised value of the home
Parcel Size | The total acreage of the home
Zip Code | The zip code to which the home belongs
Buyer-Seller Index | An index developed by Zillow on a scale of 0 to 10 (0 being a strong seller's market and 10 a strong buyer's market)
Market Health Index | An index developed by Zillow on a scale of 0 to 10 (0 being the least healthy and 10 the healthiest)
To optimize a seller's decision-making process with regard to setting, dropping, and raising the list price, we will attempt both unsupervised and supervised machine learning algorithms using the aforementioned attributes and target variables.
Test calls to the API were performed successfully by setting up a Web Data Connector for Tableau. While calling the freely available API costs nothing, hosting a Web Data Connector to process the calls would incur nominal fees, and if Tableau were used for additional data processing, this too would carry a software licensing fee.
4. Data Preparation
Data Preparation will consist of two phases: Phase 1: Data Consolidation and Phase 2: Data Cleaning
and Transformation. In Phase 1 relevant data is collected and consolidated from the identified data
sets, which is then made ready for transformation. In Phase 2 the data that are incomplete, duplicate,
or incorrectly formatted are corrected and/or eliminated. The data is then built into the appropriate
form or structure needed for data modeling.
Phase 1: Data Consolidation
A search for datasets which are freely available and expose the real estate data needed for the given
business problem is daunting. To narrow the search, we looked specifically for datasets concerning
the US state of Virginia. Fairfax County, shown in Figure 1, which is considered part of the greater
metropolitan area of the District of Columbia, makes numerous datasets freely available via their
website, http://data-fairfaxcountygis.opendata.arcgis.com (henceforth referred to as the Fairfax Data
Site, or FDS).
When exploring the site for sets specific to Real Estate, several potential sets were uncovered. Many
sets had similar features and most included Parcel ID as a potential match point. The Parcel ID,
however, was later found to be somewhat unreliable as the ID would often be split and/or modified
as properties were broken apart and sold piecemeal. Ongoing exploration of the data led to a data
set which included the feature, Market Sale Ratio, which is the ratio of the market value of the
property to assessed value of the property. In the end, the Market Sale Ratio dataset was selected as
a primary dataset for the problem (http://datafairfaxcountygis.opendata.arcgis.com/datasets/market-sale-ratio).
Using the datasets at FDS, we would be able to harvest our preferred features into Tableau or
another tool. The flexibility of the ArcGIS API allows users to refine the data chosen to harvest and
focus analysis early on. An API query can be modified to eliminate null values or even isolate a range
of values during harvest which can rapidly accelerate the data cleaning process.
A basic API call is presented below. Certain features are noticeable, and the query appears very
generic. It calls for all fields to be returned in JSON format:
https://services1.arcgis.com/ioennV6PpG5Xodq0/ArcGIS/rest/services/OpenData_S4/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
Alternatively, a customized call to the API appears as such:
https://services1.arcgis.com/ioennV6PpG5Xodq0/ArcGIS/rest/services/OpenData_S4/FeatureServer/1/query?where=SALES_VALUE%20%3E%3D%201%20AND%20SALES_VALUE%20%3C%3D%201000000000&outFields=PIN,HOUSI_UNIT_TYPE,MARKE_SALE_RATIO,MARKE_VALUE,ASSES_VALUE,SALES_VALUE,VALID_FROM,VALID_TO,PARCE_ID&outSR=4326&f=json
This call specifies a range for the SALES_VALUE field (used here to eliminate null values) and lists the fields for the query to return. Indicating the fields could be considered redundant, since the script used for calling the API outlines the data schema and the table to be populated by the API call; only fields present in the script will be populated.
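Assembling such a query string by hand is error-prone. A small sketch using only the Python standard library, with the endpoint and field names copied from the call above, shows one way to build it; `urlencode` handles the percent-encoding (it encodes spaces as `+`, an equally valid query-string form).

```python
from urllib.parse import urlencode

# Sketch: assembling the customized ArcGIS REST query shown above,
# letting urlencode handle percent-encoding of the WHERE clause.
base = ("https://services1.arcgis.com/ioennV6PpG5Xodq0/ArcGIS/rest/services/"
        "OpenData_S4/FeatureServer/1/query")
params = {
    "where": "SALES_VALUE >= 1 AND SALES_VALUE <= 1000000000",
    "outFields": "PIN,HOUSI_UNIT_TYPE,MARKE_SALE_RATIO,MARKE_VALUE,"
                 "ASSES_VALUE,SALES_VALUE,VALID_FROM,VALID_TO,PARCE_ID",
    "outSR": 4326,
    "f": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

The resulting URL can then be handed to any HTTP client or to the Web Data Connector described below.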
One other calculation performed on the raw data was the conversion of the harvested Unix date to a standard date format. The Unix date first needed to be truncated to 10 digits (seconds), since the raw value used 13 digits and therefore included milliseconds. There may be instances where this precision is beneficial, but the stated problem does not require it. So, as the data was pulled into Tableau, the following two conversions were applied:
1. To truncate to 10 digits: LEFT(STR([Unix Date]),10)
2. To convert the Unix date: DATEADD('second', INT([TruncatedDate]), #1970-01-01#)
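For readers working outside Tableau, the same two-step conversion can be sketched in Python:

```python
from datetime import datetime, timezone

def unix_ms_to_date(raw: int) -> datetime:
    """Truncate a 13-digit Unix timestamp (milliseconds) to 10 digits
    (seconds), then convert it to a standard UTC date, mirroring the
    two Tableau calculations above."""
    seconds = int(str(raw)[:10])  # keep the first 10 digits
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Round-trip check with a known date.
stamp_ms = int(datetime(2016, 6, 15, tzinfo=timezone.utc).timestamp()) * 1000
print(unix_ms_to_date(stamp_ms))
```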
For purposes of plotting parcels on a map for visualization, the geometry attributes of the dataset can
be isolated with a unique Web Data Connector to populate a table with the Parcel ID Number, along
with the latitude and longitude for each object. To generate the appropriate map in Tableau for this
data, generic State and County columns for each parcel will be added during the data harvest,
appropriately populated with Virginia and/or Fairfax.
APIs, particularly those with well-documented standards like the ArcGIS API, make data harvesting a practical and efficient means of acquiring the object attributes needed to analyze a business problem. Attributes can easily be harvested into one or more tables to create usable feature vectors. Additionally, if multiple data sets are associated with a single API, an attribute may exist across tables, making it convenient to join objects and create associations for analysis.
Phase 2: Data Cleaning and Transformation
We will examine the consolidated data and identify missing values and patterns. Missing values will be investigated and imputed appropriately to preserve the power of the data. Due to the vast range of the variables in the data, we will also normalize the variables for meaningful modeling. As many of the variables in the data are highly correlated, we will use these variables selectively in the regression and classification models to avoid complications in our predictive models.

Data pertaining to Fairfax County is extracted from Zillow.com using the Zillow API and the Tableau web data connector. Data pertaining to neighboring counties with similar demographics and market conditions is
also extracted for the purpose of model training and testing. Table 1 illustrates the new variables needed for the data mining tasks.
Variable | Type | Description
Quick Sell | Category | 0 = house not sold within 50 days; 1 = house sold within 50 days
Price Reduction | Category | 0 = asking price reduced; 1 = asking price not reduced
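Once DOM and list-price history are consolidated, deriving the two categorical variables in Table 1 is a one-pass transformation. The records and field names below are illustrative, and the coding follows the table (Price Reduction: 0 = asking price reduced, 1 = not reduced).

```python
# Sketch: deriving the Quick Sell and Price Reduction flags of Table 1.
listings = [
    {"parid": "A1", "days_on_market": 35, "list_price": 500000, "final_list_price": 500000},
    {"parid": "A2", "days_on_market": 80, "list_price": 650000, "final_list_price": 620000},
]

for home in listings:
    # Quick Sell: 1 if the home sold within the 50-day cut-off.
    home["quick_sell"] = 1 if home["days_on_market"] <= 50 else 0
    # Price Reduction: 0 if the asking price was reduced before sale.
    home["price_reduction"] = 0 if home["final_list_price"] < home["list_price"] else 1

print([(h["parid"], h["quick_sell"], h["price_reduction"]) for h in listings])
```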
5. Data Modeling
OVERVIEW
We propose several modeling techniques in line with our business objective: hierarchical and k-means clustering, classification (decision tree), linear regression, and logistic regression. These models will help us understand the market better and advise sellers on achieving optimal economic gain. The proposed models can provide sellers with useful insights in the following areas:
1. How soon can I sell the house?
2. How much should I sell the house for?
3. Do I need to cut the price to sell the house?
UNSUPERVISED LEARNING
Days on market is an important attribute in the business objective, but not much information is
known as to how it relates to other variables in the housing market. Thus, we want to use clustering
analysis to explore the data set before performing other data mining tasks.
HIERARCHICAL CLUSTERING
Clustering analysis forms groups or classes based on the distance between data points. Instances belonging to the same group are similar, whereas instances belonging to different groups are less similar. Not knowing how many groups to extract, we want to perform hierarchical clustering to identify meaningful clusters. This exploratory effort will help us unveil natural groupings in the data. With the target variable unknown, we will start the analysis with variables based on some of the important attributes in the housing market: days on market, average sale price, median listing price per square foot, and inventory age. Price and square footage are prominent factors for sellers and buyers to consider in a deal; days on market is of business interest, and inventory age reflects the supply and demand conditions of the market. The variable data is normalized. We will use Euclidean distance and centroid linkage to perform the analysis. We will partition the available data into ⅓ for testing and ⅔ for training. The data for the aforementioned variables contains no missing values. However, due to the limited amount of data available for Fairfax County, we may use data from neighboring counties with similar demographics and market conditions, such as Arlington County, to train the model.
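Because the clustering variables are measured on very different scales (dollars, days, square feet), the normalization step matters for Euclidean distances. A minimal min-max sketch, with illustrative days-on-market values:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] so that variables measured
    in dollars, days, or square feet contribute comparably to
    Euclidean distances."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative days-on-market values for a handful of listings.
dom = [12, 30, 48, 90]
print(min_max_normalize(dom))  # smallest maps to 0.0, largest to 1.0
```

In practice each clustering variable would be normalized this way (or standardized to z-scores) before computing distances.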
K-MEANS CLUSTERING
K-means clustering is another way to find groupings in the data. Having a number of clusters in mind, we can prescribe a value for k and train the model to find k centroids, grouping the points closest to the same centroid into one cluster. Distortion is the sum of squared differences between each data point and its associated centroid; we use the distortion value to evaluate the quality of the clustering model. The lower the distortion, the better the clustering. We use normalized data for this analysis, and the model is trained and tested using the same protocol and variables as in hierarchical clustering. If the model is unable to produce meaningful clusters, we will adjust the variables selected in the model and repeat the clustering analysis to obtain insightful results.
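A minimal, self-contained sketch of the k-means procedure and its distortion measure, with toy pre-normalized points (a production model would use a library implementation with better seeding):

```python
def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: returns (centroids, distortion), where
    distortion is the sum of squared distances from each point to its
    assigned centroid -- the quality measure described above."""
    centroids = points[:k]  # simple deterministic seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: _sq_dist(p, centroids[i]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    distortion = sum(min(_sq_dist(p, c) for c in centroids) for p in points)
    return centroids, distortion

# Two obvious groups of (normalized DOM, normalized price) pairs.
pts = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.88), (0.92, 0.9)]
cents, dist = kmeans(pts, k=2)
print(cents, dist)
```

Running k-means for several values of k and plotting the distortion (the "elbow method") is one common way to pick the number of clusters.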
SUPERVISED LEARNING
Regression
Regression analysis allows us to statistically estimate the relationship between variables. To correctly make use of a regression model, we first need to identify the dependent variable, the one that varies in response to another variable, the independent variable. In our case, the sale price of a home is the dependent variable and the size of the home is an independent variable.
Multiple Linear Regression
We will build a predictive model with estimated sales price as the target variable. We will include
variables such as assessed value, square footage and zip code, etc. and avoid using variables that are
highly correlated.
Before applying multiple linear regression, we will first establish the existence of a relationship between our variables of interest. To establish this relationship, we can use scatter plots, which will help us estimate the price more accurately across all data points and identify multicollinearity. To determine the goodness of fit, we can use measures such as the residual sum of squares (RSS), the mean squared error (MSE), or the root mean squared error (RMSE), where RMSE = √MSE. [2]
We will then train regression models on the data with the expected variables. Using tools like matplotlib, we can plot the modeled data side by side with the actual data, and then use the measures stated above to find the linear regression equation that best estimates home prices.
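As a simplified sketch of this workflow, a one-variable least-squares fit with RMSE can be written directly; the square-footage and price figures below are invented for illustration:

```python
import math

def fit_simple_regression(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope).
    A one-variable stand-in for the multiple regression described above."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

# Illustrative data: square footage vs. sale price (thousands of dollars).
sqft = [1200, 1600, 2000, 2400, 3000]
price = [310, 405, 500, 610, 740]

b0, b1 = fit_simple_regression(sqft, price)
preds = [b0 + b1 * xi for xi in sqft]
rmse = math.sqrt(sum((p - yi) ** 2 for p, yi in zip(preds, price)) / len(price))
print(round(b1, 3), round(rmse, 1))
```

The same RSS/MSE/RMSE comparison extends directly to models with several predictors.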
When using regression to estimate the sale price, care must be taken to interpret the results correctly, because price and demand are always related in economics. Price goes up or down with demand, other factors held constant, and the two are therefore dependent on one another. When modeling price as a regression on selected attributes, we will need to factor in the demand element to produce a correct forecast. As we build the regression model, without adding too much dimensionality, we will attempt to add modulators that reflect the effect of market conditions on the predicted sale price.
Classification (Decision Tree)
Many sellers want to sell their homes sooner rather than later, and the average days on market has been around 50 in recent months. Given the current market conditions, we therefore assume 50 days as the cut-off point between quick-selling and slow-selling homes, and we use it as a baseline to advise sellers on whether they can sell their properties within the market average. We want to build a classification model to predict the probability of selling a home within 50 days based on important variables of the home, including some of the variables that generate meaningful clustering. Examples include median listing price per square foot, square footage, and number of bedrooms. To avoid overfitting, the model is cross-validated with the minimum leaf size set to 5. If necessary, we may use data from neighboring counties to train and test the model.

[2] Bhavesh P. (2017, April 17). Predicting house value using regression analysis. Towards Data Science. Retrieved October 12, 2018, from https://towardsdatascience.com/regression-analysis-model-used-in-machine-learning-318f7656108a
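Before committing to a full tree, the core idea can be illustrated with a single-split "stump" on one variable; the price-per-square-foot values and quick-sell labels below are invented for illustration:

```python
def best_stump(values, labels):
    """Pick the single threshold on one feature that best separates
    quick sells (1) from slow sells (0) -- a depth-1 stand-in for the
    decision tree described above. Returns (threshold, accuracy)."""
    best = None
    for t in sorted(set(values)):
        # Rule: predict "quick sell" when the feature is below the threshold.
        correct = sum((v < t) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(labels)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

# Illustrative: median listing price per sq ft vs. quick-sell flag.
ppsf = [210, 225, 240, 260, 300, 320, 350, 400]
quick = [1, 1, 1, 1, 0, 0, 0, 0]
print(best_stump(ppsf, quick))
```

A real decision tree repeats this greedy search recursively on each resulting subset, stopping when a leaf would fall below the minimum leaf size.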
Logistic Regression
Once a seller has an estimated sale price and a days-on-market target in mind, another model is required to help with the decision of whether to reduce the price. We will build a logistic regression model whose target variable is the probability of making a price reduction (0-1).

To determine whether the price needs to be reduced, we will train the model to capture the best-matching relationship among the pricing variables of the homes. The trained regression algorithm will be based on property variables such as square footage and market variables such as the market health index. To check the goodness of fit of the model, we will use the R² score, and cross-validation will be used to examine the dataset and the model. Alternatively, we can also build a classification tree to compare against the logistic regression model.
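To illustrate how such a fitted model would score a listing, the sketch below applies the logistic function to a linear combination of two normalized inputs. The coefficients are invented for illustration, not fitted values.

```python
import math

def price_cut_probability(sqft_norm, market_health_norm,
                          b0=-0.5, b_sqft=1.2, b_health=-2.0):
    """Logistic model sketch: maps a linear combination of (normalized)
    property and market variables to a 0-1 probability of a price cut.
    The coefficients are illustrative, not fitted values."""
    z = b0 + b_sqft * sqft_norm + b_health * market_health_norm
    return 1.0 / (1.0 + math.exp(-z))

# Under these toy coefficients, a large home in a weak market scores a
# higher price-cut probability than a modest home in a healthy market.
print(round(price_cut_probability(0.9, 0.2), 3),
      round(price_cut_probability(0.3, 0.9), 3))
```

The seller-facing tool would compare this probability against a threshold chosen from the ROC analysis described in the evaluation section.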
6. Evaluation
The purpose of the evaluation stage is to select the models that best generalize to unseen data and meet our business objectives within our constraints. For all models, we will use 10-fold cross-validation to estimate out-of-sample performance, in addition to any validation criteria specific to a given model. We will also examine whether the models improve our decision making in advising both sellers and realtors.
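The 10-fold split can be sketched as an index partition, independent of any particular modeling library:

```python
def kfold_indices(n, k=10):
    """Split n sample indices into k roughly equal folds, as in the
    10-fold cross-validation described above. Returns a list of
    (train_indices, test_indices) pairs, one per fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        (sorted(i for f in folds[:j] + folds[j + 1:] for i in f), folds[j])
        for j in range(k)
    ]

splits = kfold_indices(100, k=10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each model is fitted k times, once per training set, and its scores on the held-out folds are averaged to estimate out-of-sample performance.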
MODEL EVALUATION
Before applying any predictive models to the data, we use clustering as an exploratory effort to
uncover underlying groups in the data regarding the housing market. To identify relevant attributes
to use in our predictive models, we want to use clustering to gain insights and profile the groups
based on their characteristics.
To effectively evaluate a clustering model without a target variable or labeled class, we will use internal and external quality criteria, each of which encompasses different measures:
● Internal quality criteria focus on the data itself, inter-cluster and intra-cluster, without using external information.
● External quality criteria are used to match the clusters against some predefined grouping.
Meaningful clusters will help us select attributes and tune the parameters in the predictive models, and will provide our customers with market insights in the form of dashboards.
We use classification models, including decision tree and logistic regression, to accomplish two main
business objectives: predicting if a seller can sell his/her house within 50 days and if a price reduction
is needed. We will build a confusion matrix to test the overall performance of the classification
models. From the confusion matrix, several metrics are calculated to evaluate the model’s
performance:
● Classification Accuracy: Overall, how often is the classifier correct?
● Sensitivity: When the actual value is positive, how often is the prediction correct?
● False Positive Rate: When the actual value is negative, how often is the prediction
incorrect?
● Precision: When a positive value is predicted, how often is the prediction correct?
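These four metrics follow directly from the confusion matrix counts; the counts below are illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four confusion matrix metrics listed above from
    true/false positive and true/false negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),            # true positive rate
        "false_positive_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

# Illustrative counts for the quick-sell classifier
# (positive class = "sold within 50 days").
m = classification_metrics(tp=40, fp=5, fn=10, tn=45)
print(m)
```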
We also want to plot the Receiver Operating Characteristic (ROC) curve for our models to identify the optimal threshold balancing sensitivity against the false positive rate. As a result, the optimal threshold will help us select the model best suited to advise whether a property can be sold within 50 days and whether the seller needs to reduce the asking price.
In addition, as we may use both decision tree and logistic regression to answer the same business
question, lift curves will help us compare the models in terms of effectiveness over random guessing.
The main purpose of the regression model is to give customers the ability to predict a value of interest, such as the optimal asking price. We will test the principal assumptions of the regression, for example the linearity of the relationship between the attributes and the target variable. Like many other models, regression models are subject to overfitting, so we will attempt to limit the complexity of the model while retaining its functionality.
BUSINESS EVALUATION
The goal of this analysis is to provide realtors and sellers with knowledge of the factors that influence
Market Value Ratio. The return on the investment for analyzing and assessing this data is potentially
substantial, since realtors can prepare sellers for the market prior to listing a home. Sellers should
have clear knowledge of what to expect from a given sale regardless of time-of-year or zip code.
Specific benefits of the analysis and how it will impact a home sale include:
● The ability to price homes more accurately based on all the factors surrounding a sale.
● The ability to make sellers aware of a likely timeline for their listing.
● Targeted marketing campaigns for recruitment of sellers from soon-to-be "hot" markets.
● Increased seller and realtor satisfaction.
To evaluate the business effectiveness of our use of this analysis, we will seek feedback from the customers impacted by the use of our data. We are actively seeking partners in the world of real estate
interested in our analysis. Once a partner is secured, we would recommend they use our data
analytics service for not less than one year. We will then monitor the partner entity’s overall sales
totals and how their cumulative Market Value Ratio performs for the year. These observations will
have minimal costs associated but will greatly inform the Return on Investment for our analytic
services. Moreover, we will be monitoring the key performance indicators mentioned in the
deployment section to measure the progress towards our business goals, as well as our partners’
operational well-being in the real estate market. In the grand scheme of things, our business results
from machine learning should indicate an increase in the overall market efficiency; both sellers and
buyers should benefit.
NEXT STEPS
Additional observations would be made via customer surveys of both our business partner and their
direct customers. Assuming our partner uses the data as we outline herein, surveying their
customers will provide qualitative data on the effectiveness of our analysis. The survey will present
the biggest cost and time commitment but is necessary to assess the efficacy of our system. See
Appendix A for an example of the survey.
7. Deployment, Monitoring and Maintenance
DEPLOYMENT PLAN
The predictive model is designed for homeowners and realtors as a price strategy tool to allow them
to make sales price estimates given a desired sell-date range (days on market), and desired bottom
line.
Because of the differences between these two audiences, the model will be deployed in two ways:

1. For Realtors: Partner with realtor.com and embed our model into the realtor.com hub, allowing our pricing strategy tool to be highlighted on each realtor's professional online dashboard. Realtors will have the option to select either a subscription or a single purchase to use the model. We would then provide a percentage of profits to realtor.com as part of the partnership.
2. For Homeowners: Develop a web-based application where homeowners can make a single purchase to use the model.
Both deployments will have a user interface designed so that both homeowners and realtors can
input specifications about the home (features, days on market, profit, etc…) into a web-based
application which then runs the predictive model algorithms on the back end and ultimately provides
the seller or realtor with an estimated sale price. Because we are focusing only on Fairfax County, Virginia, our partnerships with realtor.com and our outreach to sellers will pertain only to that county.
KEY PERFORMANCE INDICATORS
In order to understand if our price strategy tool addresses the business problem, we will have the
following key performance indicators in place.
1. Customer Acquisition and Retention
a) Customer Acquisition: Number of new customers that have been added to customer base
b) Revenue Growth rate: Percentage increase in sales between two time periods
c) Customer Churn Rate: The rate at which customers discontinue or opt out of renewing a price strategy tool subscription.
d) Repeat Purchase rate: Percentage of your current customers that have returned to buy
another single purchase or renewed subscription.
2. Measures for Model Performance
a) Percentage of homes sold using the predicted sale price and days on market
b) Percentage of homes sold within 15% of the predicted sale price and days on market
3. Customer Satisfaction
a) Customer Satisfaction Score (CSAT): Average score of all customer responses rating their satisfaction with the price strategy tool
b) Net Promoter Score (NPS): How likely customers are to recommend the price strategy tool on a scale from 0 to 10 (promoters: 9-10; passives: 7-8; detractors: 0-6)
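As an illustration of how two of these indicators could be computed, the sketch below derives churn rate and Net Promoter Score. The function names, sample subscriber counts, and survey ratings are assumptions, not real customer data.

```python
# Illustrative KPI calculations; the sample counts and survey ratings
# below are assumptions, not real customer data.

def churn_rate(subscribers_at_start, subscribers_lost):
    """Share of subscribers who discontinued during the period."""
    return subscribers_lost / subscribers_at_start

def net_promoter_score(ratings):
    """NPS on a 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)

quarterly_churn = churn_rate(200, 10)              # 10 of 200 subscribers lost
nps = net_promoter_score([10, 9, 8, 7, 6, 3, 10])  # 3 promoters, 2 detractors of 7
```

The same structure extends naturally to the acquisition and repeat-purchase rates, which are simple ratios over the customer base.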
RISKS
Data is the fuel for any successful predictive modeling exercise. We have therefore identified data limitations that could cause our model's results to diverge from actual outcomes:
• No information about potential buyers.
• The value of a house can be heavily influenced by features that are extremely subjective: the artistic quality of the residence, its architecture, or a specific style might appeal strongly to one buyer but not another.
MONITORING AND MAINTENANCE
Monitoring and maintenance become critical once the price strategy tool is part of customers' daily activity. Proper maintenance will reduce periods in which the data mining models deliver degraded results. To monitor the deployment effectively, we will adopt a comprehensive monitoring and maintenance plan and develop guidelines from first principles.
The significance of this step is to determine the extent to which our models achieve the expected degree of predictive confidence. We will also evaluate the models periodically to ensure their effectiveness and to make continuous improvements. The monitoring and maintenance plan will include:
1. Tracking and updating the data that influence the models
2. Measuring and monitoring the validity and accuracy of each model
3. Identifying accuracy thresholds or expected changes in the data
4. Mapping the data dependencies between the modeling techniques (k-means and hierarchical clustering, classification, linear regression, and logistic regression)
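Item 3, identifying expected changes in the data, could be implemented with a simple drift check such as the sketch below. The `feature_drifted` helper, the sample prices, and the two-standard-deviation threshold are all illustrative assumptions.

```python
import statistics

# Illustrative drift check: flag a feature whose recent mean has moved
# more than z_threshold training standard deviations away from the
# training mean. Helper name, threshold, and data are assumptions.

def feature_drifted(train_values, recent_values, z_threshold=2.0):
    """True when the recent mean shifts beyond the allowed z-score band."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(recent_values) - mu)
    return shift > z_threshold * sigma

train_prices = [480_000, 500_000, 520_000, 510_000, 490_000]
```

A check like this, run on each model input as new sales data arrives, gives an early signal that the training data no longer reflects the market.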
COMPONENTS OF THE MONITORING AND MAINTENANCE PLAN
The monitoring and maintenance plan comprises five main components. Each component serves as a check during monitoring and maintenance, and we will coordinate all of them throughout the process.
TASK PRIORITIES SYSTEM
The task priorities system governs how monitoring and maintenance work is scheduled. Its main aim is to minimize losses and deficiencies in all situations: emergency responses take precedence, followed by scheduled operations and services.
DEVELOPED PROCESSES SYSTEM
Each stage, from data acquisition through modeling to validation, has well-defined procedures and assumptions. This system verifies that we follow those procedures, records the challenges faced at each step as we obtain data, defines the purpose of routine activities, and considers how frequently each stage operates.
TASK ORDER SYSTEM
The project has a detailed work order system, as discussed in previous sections. Each work order includes all pertinent information, such as the source, description, cost, and time taken to achieve the output. This data is crucial for evaluating system performance. For this component to be effective, all requests and performed activities will be recorded in detail, and we will track the deviation of each outcome from the expected result.
TRAINING
The project will train users on how to use the model effectively. The plan recognizes the importance of giving users opportunities to gain technical expertise, broaden their knowledge and skills, and learn new methodologies. Training will be held regularly, and we will involve realtors in this component of monitoring and maintenance.
We will determine the accuracy of the system by comparing model outputs to actual outcomes wherever possible, a process known as validation. We will select one or more data points and check their results in the model, and we will run other models on the same prediction task, for example determining prices in a given area. Finally, we will calculate the mean absolute percentage error of the predictions. If the error exceeds 10%, the model will be deemed unreliable and withdrawn from use; in that situation, we will update the model and rely on conventional models in the short run. In the long term, a new model will be developed that takes more factors into consideration, making it more relevant and sound.
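One way to implement this error check is with a mean absolute percentage error, which pairs naturally with a percentage cutoff. In the sketch below, the sale prices are made-up sample figures and the 10% threshold comes from the plan.

```python
# Validation sketch: compare predicted prices to realized sale prices
# and compute the mean absolute percentage error (MAPE). The prices
# below are made-up sample figures; the 10% cutoff is the plan's.

def mape(actual, predicted):
    """Mean absolute percentage error between actual and predicted prices."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

actual_prices    = [500_000, 420_000, 610_000]
predicted_prices = [480_000, 450_000, 600_000]

error = mape(actual_prices, predicted_prices)   # ≈ 0.043
model_acceptable = error <= 0.10                # within the 10% threshold
```

Running this check on each batch of closed sales gives a simple pass/fail signal for whether the model stays in production.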
8. Conclusion
With the successful implementation of these supervised and unsupervised machine learning models, we will gain a competitive advantage and offer state-of-the-art services to our customers as an industry leader. It is imperative to monitor the performance of the existing clustering, classification, and regression models and to explore alternative options. Because data mining is an iterative process, we will continuously gather new data and improve our algorithms to fit our business scope and serve client requirements.
References Consulted
Astala, R. (2018, January 8). Machine Learning Studio module reference. Retrieved from Microsoft Azure: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/cross-validate-model
Bahrepour, M. (2018, August 17). The forgotten step in CRISP-DM and ASUM-DM methodologies. Retrieved from https://amsterdam.luminis.eu/2018/08/17/the-forgotten-step-in-crisp-dm-and-asum-dm-methodologies
Bhavesh, P. (2017, April 17). Predicting house value using regression analysis. Towards Data Science. Retrieved October 12, 2018, from https://towardsdatascience.com/regression-analysis-model-used-in-machine-learning-318f7656108a
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM Consortium.
Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1-36. http://www.jstatsoft.org/v61/i06/paper
Dobbin, K. K., & Simon, R. M. (2011). Optimally splitting cases for training and testing high dimensional classifiers. BMC Medical Genomics, 4, 31.
Fairfax County, Virginia [Map]. Retrieved October 11, 2018, from https://upload.wikimedia.org/wikipedia/commons/e/e2/Dc22counties.jpg
López-Campos, M. A., Márquez, A. C., & Fernández, J. F. G. (2018). The integration of open reliability, maintenance, and condition monitoring management systems. In Advanced Maintenance Modelling for Asset Management (pp. 43-78). Springer, Cham.
Ng, R. (2018, November 4). Evaluating a classification model. Retrieved from https://www.ritchieng.com/machine-learning-evaluate-classification-model/
Provost, F., & Fawcett, T. (2013). Data Science for Business. O'Reilly Media, Inc.
Rokach, L., & Maimon, O. (2005). Clustering methods. In O. Maimon & L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA.
Team Lynch. (2018, November 4). Customer satisfaction survey. Retrieved from http://www.teamlynchrealestate.com/Customer-Satisfaction-Survey
Appendix A
Reproduced from: http://teamlynchrealestate.com