ASSESSING THE
MARKET VALUE RATIOS
OF RECENT HOME SALES
Fairfax County Virginia and the
Surrounding Communities
Parcel Real Estate Analytics
7054 Haycock Rd
Falls Church, VA 22043
(703) 538-8310
parcelrealestateanalytics@vt.edu
Contents
Introduction
Business Understanding
Data Understanding
Data Preparation
Data Modeling
Evaluation
Deployment, Monitoring and Maintenance
Conclusion
References Consulted
Appendix A
www.parcelrealestateanalytics.com
1. Introduction
The abundance of publicly available data in the real estate industry has been a boon, but it remains a challenge for industry incumbents to extract meaningful information from it to serve their business and personal needs. Despite having a sea of data, businesses struggle to guide seller and buyer decisions with targeted data mining results, often simply for lack of initiative. In this proposal, we describe how data mining can help solve business problems using the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. CRISP-DM is an industry-proven way to guide data mining and analytic efforts. As a methodology, it includes descriptions of the typical phases of a project, the tasks involved in each phase, and an explanation of the relationships between these tasks. As a process model, CRISP-DM provides an overview of the data mining life cycle: a sequence of six phases, starting with developing a good understanding of the business and ending with a recommended deployment of a solution that satisfies the specific business needs.
Figure 1: The CRISP-DM Model. [1]
[1] Majid Bahrepour. The Forgotten Step in CRISP-DM and ASUM-DM Methodologies. https://amsterdam.luminis.eu/2018/08/17/the-forgotten-step-in-crisp-dm-and-asum-dm-methodologies/
2. Business Understanding
BUSINESS PROBLEM
Settling on the right price when selling a home is a top-of-mind question for sellers. While many businesses provide information on the housing market, including available properties and sale prices, they are mostly geared toward providing value to the buyer rather than the seller. Homeowners who are ready to sell are often constrained by the lack of accuracy in websites and applications such as Trulia, Zillow, and Realtor.com. Not only is it difficult to list the right price, but it is even more challenging to settle on a price that maximizes profit and sells within the seller's timing needs (days on market). For example, someone moving to a new location for a job that starts in two weeks typically wants to price their home for a quick sale. This is very different from someone selling their home with hopes of eventually retiring to Mexico: the first seller prices for a shorter time on the market, while the second prices to maximize profit and can tolerate an extended period of days on market (DOM). The ability to project sale prices based on a seller's desired DOM is therefore very valuable to home sellers.
BUSINESS OBJECTIVE
To address this business need, our team of business, data, and data mining experts will perform a theoretical analysis to develop a pricing strategy tool that uses available real estate data to predict an estimated sale price for a home given a desired sell-date range (days on market) and a desired bottom line.
The revenue model for this project would be a subscription model that allows sellers and realtors to
access the service and create DOM estimates with associated probabilities by asking price.
BUSINESS SUCCESS CRITERIA
To understand whether our model addresses the business problem, we will have key performance indicators in place (discussed in more detail in the deployment section). Overall, we will look at customer feedback in addition to satisfaction, precision, and acquisition measures.
SITUATION ASSESSMENT
To help achieve the business objectives we will examine regional data, particularly for Fairfax County, Virginia, available through government websites and online marketplaces. The data will help us establish a proof of concept and determine the feasibility of the business proposal.
DATA MINING GOAL
Our goal is to develop models that can provide sellers and realtors with useful insights and help
answer the following questions for them:
1. How soon can I sell the house? (Sellers)
2. How much should I sell the house for? (Sellers and Realtors)
3. Do I need to cut the price to sell the house? (Sellers and Realtors)
Because multiple factors impact the sale of a home, understanding the housing market within the Fairfax County region is key. We will do this by looking at common characteristics of homes that have sold within 0-30, 30-60, and 60-90 days on the market, and at the home features that affect a home's listing price in relation to its sale. Both unsupervised and supervised machine learning algorithms will be used.
Unsupervised:
• Hierarchical and k-means clustering: We will explore underlying groups in the data and identify the attributes that characterize the clusters.

Supervised:
• Regression: We will build a predictive model with optimal asking price as the dependent variable. We will include variables such as assessed value, square footage, buyer-seller index, and days on market.
• Classification (decision tree and logistic regression): We will use classification algorithms to advise whether a property can be sold within a desired amount of time.
With these models in place we can understand the market better and provide both sellers and
realtors with advice so they achieve optimal economic gain.
REQUIREMENTS, ASSUMPTIONS, CONSTRAINTS, RISKS AND CONTINGENCIES
Our team has discussed and identified the following requirements, assumptions, constraints, risks
and contingencies:
1. We will look only at data for the years 2006-2016.
2. We assume 50 days as the cut-off point between quick-selling and slow-selling homes.
3. There is a risk we will encounter limitations in the data, such as incomplete records, a lack of data on buyers, and a lack of data where the factors that influence the value of a home are subjective.
4. External factors could impact the accuracy of our predictions, such as interest rates, the economy, government policies, the stock market, and natural disasters.
3. Data Understanding
The data we will be exploring is readily available through government websites and online
marketplaces. The data is largely reliable, and the cost is minimal. We do not anticipate the need for
any further investment to obtain additional data sources or sets based on the scope of the project.
DATA SOURCES
We have identified the following potential sources of data freely available from the data.gov data
catalog. All of the datasets refer specifically to home data from Fairfax County, Virginia and provide a
varied representation of property, owner, and sales data.
In response to the business problem stated above, we will collect the following data for sold homes
via the Zillow API:
• Median Sales Price (Seasonally adjusted)
• Median List Price
• Sale-to-List Ratio
• Median Price Cut (%)
• Days on Zillow
• Market Health Index
• Buyer-Seller Index
Additionally, we will leverage the following freely available data sets from data.gov:
• Tax Administration's Real Estate - Land Data: data points related to the size and location of land parcels in Fairfax County.
• Tax Administration's Real Estate - Sales Data: data related to sale dates and sale prices.
• Tax Administration's Real Estate - Dwelling Data: specific data about the dwelling, e.g. bedroom count, bathroom count, etc.
• Tax Administration's Real Estate - Parcels Data: data related to land parcels in Fairfax County. This set may require additional evaluation to determine whether it provides unique data that cannot be gleaned from the other sets.
• Tax Administration's Real Estate - Assessed Values: assessed values could help show how significantly the seller's timeline (available DOM) impacts price adjustments; a regression of sale price on assessed value will assist in this analysis.
Each data.gov set can be cross-referenced and joined using the PARID (presumably, Parcel ID) foreign
key, which exists in all of the data sets. For the specific business problem outlined above, we will
look only at the data for the years 2006-2016. Only sales listed as “valid and verified” in the data will
be considered. We intend to combine the Fairfax County sources with data from Zillow to obtain the
days on market (DOM) value for properties which fit these parameters.
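The PARID cross-referencing described above can be sketched in a few lines of Python. The records and field names below are invented for illustration; they are not the actual data.gov schema.

```python
# Sketch: joining two Fairfax County data sets on the shared PARID key.
# Records and field names are illustrative, not the real column names.
sales = [
    {"PARID": "0123-01-0001", "sale_price": 540000, "sale_year": 2014},
    {"PARID": "0123-01-0002", "sale_price": 610000, "sale_year": 2015},
]
dwellings = [
    {"PARID": "0123-01-0001", "bedrooms": 3, "bathrooms": 2},
    {"PARID": "0123-01-0002", "bedrooms": 4, "bathrooms": 3},
]

# Index the dwelling data by PARID, then enrich each sale record.
by_parid = {d["PARID"]: d for d in dwellings}
joined = [{**s, **by_parid[s["PARID"]]} for s in sales if s["PARID"] in by_parid]

print(joined[0])
```

In practice the same join would be done in Tableau or a database, but the logic is identical: index one set by PARID and look each record of the other set up against it.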
The sources of the data are independent, though the underlying raw data will significantly overlap.
Data from both sources contains complementary information that will help sellers develop a
comprehensive view of the market and determine an optimal asking price.
We will compile data over the course of several months. Parcel size and square footage of the
dwelling will also be considered as we seek to provide sellers with an optimal starting price point
given the speed with which they need to close their sale.
DATA IDENTIFICATION AND DESCRIPTION
Internal deliberations have identified some of the most important attributes to consider for our analysis. Please note this list is not exhaustive; we may add or remove attributes as we progress to data analysis.
Attribute | Description
Days on Market | Days the home has been posted on Zillow until it is sold
Zillow Sale Price | The final sale price of the home from Zillow
Square Footage | The square footage of the home
Zillow Asking Price | The list price of the home posted on Zillow
Assessed Value | The appraised value of the home
Parcel Size | The total acreage of the home
Zip Code | The zip code to which the home belongs
Buyer-Seller Index | An index developed by Zillow on a scale of 0 to 10 (0 being a strong seller's market and 10 a strong buyer's market)
Market Health Index | An index developed by Zillow on a scale of 0 to 10 (0 being the least healthy and 10 the healthiest)
To optimize a seller's decision-making process with regard to setting, dropping, and raising the list price, we will attempt both unsupervised and supervised machine learning algorithms using the aforementioned attributes and target variables.
Test calls to the API were performed successfully by setting up a Web Data Connector for Tableau. While calling the freely available API costs nothing, hosting a Web Data Connector to process the calls would incur nominal fees, and if Tableau were used for additional data processing, this too would carry a software licensing fee.
4. Data Preparation
Data Preparation will consist of two phases: Phase 1: Data Consolidation and Phase 2: Data Cleaning
and Transformation. In Phase 1 relevant data is collected and consolidated from the identified data
sets, which is then made ready for transformation. In Phase 2 the data that are incomplete, duplicate,
or incorrectly formatted are corrected and/or eliminated. The data is then built into the appropriate
form or structure needed for data modeling.
Phase 1: Data Consolidation
A search for datasets which are freely available and expose the real estate data needed for the given
business problem is daunting. To narrow the search, we looked specifically for datasets concerning
the US state of Virginia. Fairfax County, shown in Figure 1, which is considered part of the greater
metropolitan area of the District of Columbia, makes numerous datasets freely available via their
website, http://data-fairfaxcountygis.opendata.arcgis.com (henceforth referred to as the Fairfax Data
Site, or FDS).
When exploring the site for sets specific to Real Estate, several potential sets were uncovered. Many
sets had similar features and most included Parcel ID as a potential match point. The Parcel ID,
however, was later found to be somewhat unreliable as the ID would often be split and/or modified
as properties were broken apart and sold piecemeal. Ongoing exploration of the data led to a data
set which included the feature, Market Sale Ratio, which is the ratio of the market value of the
property to assessed value of the property. In the end, the Market Sale Ratio dataset was selected as
a primary dataset for the problem (http://datafairfaxcountygis.opendata.arcgis.com/datasets/market-sale-ratio).
Using the datasets at FDS, we would be able to harvest our preferred features into Tableau or
another tool. The flexibility of the ArcGIS API allows users to refine the data chosen to harvest and
focus analysis early on. An API query can be modified to eliminate null values or even isolate a range
of values during harvest which can rapidly accelerate the data cleaning process.
A basic API call is presented below. Certain features are noticeable, and the query appears very
generic. It calls for all fields to be returned in JSON format:
https://services1.arcgis.com/ioennV6PpG5Xodq0/ArcGIS/rest/services/OpenData_S4/FeatureServer/1/query?where=1%3D1&outFields=*&outSR=4326&f=json
Alternatively, a customized call to the API appears as such:
https://services1.arcgis.com/ioennV6PpG5Xodq0/ArcGIS/rest/services/OpenData_S4/FeatureServer/1/query?where=SALES_VALUE%20%3E%3D%201%20AND%20SALES_VALUE%20%3C%3D%201000000000&outFields=PIN,HOUSI_UNIT_TYPE,MARKE_SALE_RATIO,MARKE_VALUE,ASSES_VALUE,SALES_VALUE,VALID_FROM,VALID_TO,PARCE_ID&outSR=4326&f=json
This call specifies a range for the SALES_VALUE field (used here to eliminate null values) and lists the fields for the query to return. Indicating the fields could be considered redundant, since the script used for calling the API outlines the data schema and the table to be populated by the API call; only fields present in the script will be populated.
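Assembling such a query string by hand is error-prone. A small sketch using only the Python standard library, with the endpoint and field names copied from the call above, shows one way to build it; `urlencode` handles the percent-encoding (it encodes spaces as `+`, an equally valid query-string form).

```python
from urllib.parse import urlencode

# Sketch: assembling the customized ArcGIS REST query shown above,
# letting urlencode handle percent-encoding of the WHERE clause.
base = ("https://services1.arcgis.com/ioennV6PpG5Xodq0/ArcGIS/rest/services/"
        "OpenData_S4/FeatureServer/1/query")
params = {
    "where": "SALES_VALUE >= 1 AND SALES_VALUE <= 1000000000",
    "outFields": "PIN,HOUSI_UNIT_TYPE,MARKE_SALE_RATIO,MARKE_VALUE,"
                 "ASSES_VALUE,SALES_VALUE,VALID_FROM,VALID_TO,PARCE_ID",
    "outSR": 4326,
    "f": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

The resulting URL can then be handed to any HTTP client or to the Web Data Connector described below.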
One other calculation performed on the raw data was the conversion of the harvested Unix date to a standard date format. The Unix date first needed to be truncated to 10 digits (seconds), since the raw value used 13 digits and therefore included milliseconds. There may be instances where this precision is beneficial, but the stated problem does not require it. So, as the data was pulled into Tableau, the following two conversions were applied:
1. To truncate to 10 digits: LEFT(STR([Unix Date]),10)
2. To convert the Unix date: DATEADD('second', INT([TruncatedDate]), #1970-01-01#)
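For readers working outside Tableau, the same two-step conversion can be sketched in Python:

```python
from datetime import datetime, timezone

def unix_ms_to_date(raw: int) -> datetime:
    """Truncate a 13-digit Unix timestamp (milliseconds) to 10 digits
    (seconds), then convert it to a standard UTC date, mirroring the
    two Tableau calculations above."""
    seconds = int(str(raw)[:10])  # keep the first 10 digits
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Round-trip check with a known date.
stamp_ms = int(datetime(2016, 6, 15, tzinfo=timezone.utc).timestamp()) * 1000
print(unix_ms_to_date(stamp_ms))
```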
For purposes of plotting parcels on a map for visualization, the geometry attributes of the dataset can
be isolated with a unique Web Data Connector to populate a table with the Parcel ID Number, along
with the latitude and longitude for each object. To generate the appropriate map in Tableau for this
data, generic State and County columns for each parcel will be added during the data harvest,
appropriately populated with Virginia and/or Fairfax.
APIs, particularly those with well-documented standards like the ArcGIS API, make data harvesting a practical and efficient means of acquiring the object attributes needed to analyze a business problem. Attributes can easily be harvested into one or more tables to create usable feature vectors. Additionally, if multiple data sets are associated with a single API, an attribute may exist across tables, making it convenient to join objects and create associations for analysis.
Phase 2: Data Cleaning and Transformation
We will examine the consolidated data and identify missing values and patterns. Missing values will be investigated and imputed appropriately to preserve the power of the data. Due to the vast range of the variables in the data, we will also normalize the variables for meaningful modeling. As many of the variables in the data are highly correlated, we will use these variables selectively in the regression and classification models to avoid complications in our predictive models.

Data pertaining to Fairfax County is extracted from Zillow.com using the Zillow API and the Tableau web data connector. Data pertaining to neighboring counties with similar demographics and market conditions is
also extracted for the purpose of model training and testing. Table 1 illustrates the new variables needed for the data mining tasks.
Variable | Type | Description
Quick Sell | Category | 0 = house not sold within 50 days; 1 = house sold within 50 days
Price Reduction | Category | 0 = asking price reduced; 1 = asking price not reduced
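Once DOM and list-price history are consolidated, deriving the two categorical variables in Table 1 is a one-pass transformation. The records and field names below are illustrative, and the coding follows the table (Price Reduction: 0 = asking price reduced, 1 = not reduced).

```python
# Sketch: deriving the Quick Sell and Price Reduction flags of Table 1.
listings = [
    {"parid": "A1", "days_on_market": 35, "list_price": 500000, "final_list_price": 500000},
    {"parid": "A2", "days_on_market": 80, "list_price": 650000, "final_list_price": 620000},
]

for home in listings:
    # Quick Sell: 1 if the home sold within the 50-day cut-off.
    home["quick_sell"] = 1 if home["days_on_market"] <= 50 else 0
    # Price Reduction: 0 if the asking price was reduced before sale.
    home["price_reduction"] = 0 if home["final_list_price"] < home["list_price"] else 1

print([(h["parid"], h["quick_sell"], h["price_reduction"]) for h in listings])
```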
5. Data Modeling
OVERVIEW
We propose several modeling techniques in line with our business objective: hierarchical and k-means clustering, classification (decision tree), linear regression, and logistic regression. These models will help us understand the market better and advise sellers on achieving optimal economic gain. The proposed models can provide sellers with useful insights in the following areas:
1. How soon can I sell the house?
2. How much should I sell the house for?
3. Do I need to cut the price to sell the house?
UNSUPERVISED LEARNING
Days on market is an important attribute in the business objective, but not much information is
known as to how it relates to other variables in the housing market. Thus, we want to use clustering
analysis to explore the data set before performing other data mining tasks.
HIERARCHICAL CLUSTERING
Clustering analysis forms groups or classes based on the distance between data points. Instances belonging to the same group are similar, whereas instances belonging to different groups are less similar. Not knowing how many groups to extract, we want to perform hierarchical clustering to identify meaningful clusters. This exploratory effort will help us unveil natural groupings in the data. With the target variable unknown, we will start the analysis with variables based on some of the important attributes in the housing market: days on market, average sale price, median listing price per square foot, and inventory age. Price and square footage are prominent factors for sellers and buyers to consider in a deal; days on market is of business interest, and inventory age reflects the supply and demand conditions of the market. The variable data is normalized. We will use Euclidean distance and centroid linkage to perform the analysis. We will partition the available data into ⅓ for testing and ⅔ for training. The data for the aforementioned variables contains no missing values. However, due to the limited amount of data available for Fairfax County, we may use data from neighboring counties with similar demographics and market conditions, such as Arlington County, to train the model.
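Because the clustering variables are measured on very different scales (dollars, days, square feet), the normalization step matters for Euclidean distances. A minimal min-max sketch, with illustrative days-on-market values:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] so that variables measured
    in dollars, days, or square feet contribute comparably to
    Euclidean distances."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative days-on-market values for a handful of listings.
dom = [12, 30, 48, 90]
print(min_max_normalize(dom))  # smallest maps to 0.0, largest to 1.0
```

In practice each clustering variable would be normalized this way (or standardized to z-scores) before computing distances.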
K-MEANS CLUSTERING
K-means clustering is another way to find groupings in the data. Having a number of clusters in mind, we can prescribe a value for k and train the model to find k centroids, grouping the points closest to the same centroid into one cluster. Distortion is the sum of squared differences between each data point and its associated centroid; we use the distortion value to evaluate the quality of the clustering model. The lower the distortion, the better the clustering. We use normalized data for this analysis, and the model is trained and tested using the same protocol and variables as in hierarchical clustering. If the model is unable to produce meaningful clusters, we will adjust the variables selected in the model and repeat the clustering analysis to obtain insightful results.
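A minimal, self-contained sketch of the k-means procedure and its distortion measure, with toy pre-normalized points (a production model would use a library implementation with better seeding):

```python
def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: returns (centroids, distortion), where
    distortion is the sum of squared distances from each point to its
    assigned centroid -- the quality measure described above."""
    centroids = points[:k]  # simple deterministic seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: _sq_dist(p, centroids[i]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    distortion = sum(min(_sq_dist(p, c) for c in centroids) for p in points)
    return centroids, distortion

# Two obvious groups of (normalized DOM, normalized price) pairs.
pts = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.88), (0.92, 0.9)]
cents, dist = kmeans(pts, k=2)
print(cents, dist)
```

Running k-means for several values of k and plotting the distortion (the "elbow method") is one common way to pick the number of clusters.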
SUPERVISED LEARNING
Regression
Regression analysis allows us to statistically estimate the relationship between variables. To correctly make use of a regression model, we first need to identify the dependent variable, the one that varies in response to another variable, the independent variable. In our case, the sale price of a home is the dependent variable and the size of the home is an independent variable.
Multiple Linear Regression
We will build a predictive model with estimated sales price as the target variable. We will include
variables such as assessed value, square footage and zip code, etc. and avoid using variables that are
highly correlated.
Before applying multiple linear regression, we will first establish the existence of a relationship between our variables of interest. To establish this relationship, we can use scatter plots, which will help us estimate the price more accurately across all data points and identify multicollinearity. To determine the goodness of fit, we can use measures such as the residual sum of squares (RSS), the mean squared error (MSE), or the root mean squared error (RMSE), where RMSE = √MSE. [2]
We will then train regression models on the data with the expected variables. Using tools like matplotlib, we can plot the modeled data side by side with the actual data, and then use the measures stated above to find the linear regression equation that best estimates home prices.
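As a simplified sketch of this workflow, a one-variable least-squares fit with RMSE can be written directly; the square-footage and price figures below are invented for illustration:

```python
import math

def fit_simple_regression(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope).
    A one-variable stand-in for the multiple regression described above."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

# Illustrative data: square footage vs. sale price (thousands of dollars).
sqft = [1200, 1600, 2000, 2400, 3000]
price = [310, 405, 500, 610, 740]

b0, b1 = fit_simple_regression(sqft, price)
preds = [b0 + b1 * xi for xi in sqft]
rmse = math.sqrt(sum((p - yi) ** 2 for p, yi in zip(preds, price)) / len(price))
print(round(b1, 3), round(rmse, 1))
```

The same RSS/MSE/RMSE comparison extends directly to models with several predictors.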
When using regression to estimate the sale price, care must be taken to interpret the results correctly, because price and demand are always related in economics. Price goes up or down with demand, other factors held constant, and the two are therefore dependent on one another. When modeling price as a regression on selected attributes, we will need to factor in the demand element to produce a correct forecast. As we build the regression model, without adding too much dimensionality, we will attempt to add modulators that reflect the effect of market conditions on the predicted sale price.
Classification (Decision Tree)
Many sellers want to sell their homes sooner rather than later, and the average days on market has been around 50 in recent months. Given the current market conditions, we therefore assume 50 days as the cut-off point between quick-selling and slow-selling homes, and we use it as a baseline to advise sellers on whether they can sell their properties within the market average. We want to build a classification model to predict the probability of selling a home within 50 days based on important variables of the home, including some of the variables that generate meaningful clustering. Examples include median listing price per square foot, square footage, and number of bedrooms. To avoid overfitting, the model is cross-validated with the minimum leaf size set to 5. If necessary, we may use data from neighboring counties to train and test the model.

[2] Bhavesh P. (2017, April 17). Predicting house value using regression analysis. Towards Data Science. Retrieved October 12, 2018, from https://towardsdatascience.com/regression-analysis-model-used-in-machine-learning-318f7656108a
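Before committing to a full tree, the core idea can be illustrated with a single-split "stump" on one variable; the price-per-square-foot values and quick-sell labels below are invented for illustration:

```python
def best_stump(values, labels):
    """Pick the single threshold on one feature that best separates
    quick sells (1) from slow sells (0) -- a depth-1 stand-in for the
    decision tree described above. Returns (threshold, accuracy)."""
    best = None
    for t in sorted(set(values)):
        # Rule: predict "quick sell" when the feature is below the threshold.
        correct = sum((v < t) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(labels)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

# Illustrative: median listing price per sq ft vs. quick-sell flag.
ppsf = [210, 225, 240, 260, 300, 320, 350, 400]
quick = [1, 1, 1, 1, 0, 0, 0, 0]
print(best_stump(ppsf, quick))
```

A real decision tree repeats this greedy search recursively on each resulting subset, stopping when a leaf would fall below the minimum leaf size.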
Logistic Regression
Once a seller has an estimated sale price and a days-on-market target in mind, another model is required to help with the decision of whether to reduce the price. We will build a logistic regression model whose target variable is the probability of making a price reduction (0-1).

To determine whether the price needs to be reduced, we will train the model to capture the best-matching relationship among the pricing variables of the homes. The trained regression algorithm will be based on property variables such as square footage and market variables such as the market health index. To check the goodness of fit of the model, we will use the R² score, and cross-validation will be used to examine the dataset and the model. Alternatively, we can also build a classification tree to compare against the logistic regression model.
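To illustrate how such a fitted model would score a listing, the sketch below applies the logistic function to a linear combination of two normalized inputs. The coefficients are invented for illustration, not fitted values.

```python
import math

def price_cut_probability(sqft_norm, market_health_norm,
                          b0=-0.5, b_sqft=1.2, b_health=-2.0):
    """Logistic model sketch: maps a linear combination of (normalized)
    property and market variables to a 0-1 probability of a price cut.
    The coefficients are illustrative, not fitted values."""
    z = b0 + b_sqft * sqft_norm + b_health * market_health_norm
    return 1.0 / (1.0 + math.exp(-z))

# Under these toy coefficients, a large home in a weak market scores a
# higher price-cut probability than a modest home in a healthy market.
print(round(price_cut_probability(0.9, 0.2), 3),
      round(price_cut_probability(0.3, 0.9), 3))
```

The seller-facing tool would compare this probability against a threshold chosen from the ROC analysis described in the evaluation section.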
6. Evaluation
The purpose of the evaluation stage is to select the models that best generalize to unseen data and meet our business objectives within our constraints. For all models, we will use 10-fold cross-validation to estimate out-of-sample performance, in addition to any validation criteria specific to a given model. We will also examine whether the models improve our decision making in advising both sellers and realtors.
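The 10-fold split can be sketched as an index partition, independent of any particular modeling library:

```python
def kfold_indices(n, k=10):
    """Split n sample indices into k roughly equal folds, as in the
    10-fold cross-validation described above. Returns a list of
    (train_indices, test_indices) pairs, one per fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        (sorted(i for f in folds[:j] + folds[j + 1:] for i in f), folds[j])
        for j in range(k)
    ]

splits = kfold_indices(100, k=10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each model is fitted k times, once per training set, and its scores on the held-out folds are averaged to estimate out-of-sample performance.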
MODEL EVALUATION
Before applying any predictive models to the data, we use clustering as an exploratory effort to
uncover underlying groups in the data regarding the housing market. To identify relevant attributes
to use in our predictive models, we want to use clustering to gain insights and profile the groups
based on their characteristics.
To effectively evaluate a clustering model without a target variable or labeled class, we will use internal and external quality criteria, each of which encompasses different measures:
● Internal quality criteria focus on the data itself, inter-cluster and intra-cluster, without using external information.
● External quality criteria are used to match the clusters against some predefined grouping.
Meaningful clusters will help us select attributes and tune the parameters in the predictive models, and will provide our customers with market insights in the form of dashboards.
We use classification models, including decision tree and logistic regression, to accomplish two main
business objectives: predicting if a seller can sell his/her house within 50 days and if a price reduction
is needed. We will build a confusion matrix to test the overall performance of the classification
models. From the confusion matrix, several metrics are calculated to evaluate the model’s
performance:
● Classification Accuracy: Overall, how often is the classifier correct?
● Sensitivity: When the actual value is positive, how often is the prediction correct?
● False Positive Rate: When the actual value is negative, how often is the prediction
incorrect?
● Precision: When a positive value is predicted, how often is the prediction correct?
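These four metrics follow directly from the confusion matrix counts; the counts below are illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four confusion matrix metrics listed above from
    true/false positive and true/false negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),            # true positive rate
        "false_positive_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

# Illustrative counts for the quick-sell classifier
# (positive class = "sold within 50 days").
m = classification_metrics(tp=40, fp=5, fn=10, tn=45)
print(m)
```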
We also want to plot the Receiver Operating Characteristic (ROC) curve for our models to identify the optimal threshold balancing sensitivity against the false positive rate. As a result, the optimal threshold will help us select the model best suited to advise whether a property can be sold within 50 days and whether the seller needs to reduce the asking price.
In addition, as we may use both decision tree and logistic regression to answer the same business
question, lift curves will help us compare the models in terms of effectiveness over random guessing.
The main purpose of the regression model is to give customers the ability to predict a value of interest, such as the optimal asking price. We will test the principal assumptions of the regression, for example the linearity of the relationship between the attributes and the target variable. Like many other models, regression models are subject to overfitting, so we will attempt to limit the complexity of the model while retaining its functionality.
BUSINESS EVALUATION
The goal of this analysis is to provide realtors and sellers with knowledge of the factors that influence
Market Value Ratio. The return on the investment for analyzing and assessing this data is potentially
substantial, since realtors can prepare sellers for the market prior to listing a home. Sellers should
have clear knowledge of what to expect from a given sale regardless of time-of-year or zip code.
Specific benefits of the analysis and how it will impact a home sale include:
● The ability to price homes more accurately based on all the factors surrounding a sale.
● The ability to make sellers aware of a likely timeline for their listing.
● Targeted marketing campaigns for recruitment of sellers from soon-to-be "hot" markets.
● Increased seller and realtor satisfaction.
To evaluate the business effectiveness of our use of this analysis, we will seek feedback from the customers impacted by the use of our data. We are actively seeking partners in the world of real estate
interested in our analysis. Once a partner is secured, we would recommend they use our data
analytics service for not less than one year. We will then monitor the partner entity’s overall sales
totals and how their cumulative Market Value Ratio performs for the year. These observations will
have minimal costs associated but will greatly inform the Return on Investment for our analytic
services. Moreover, we will be monitoring the key performance indicators mentioned in the
deployment section to measure the progress towards our business goals, as well as our partners’
operational well-being in the real estate market. In the grand scheme of things, our business results
from machine learning should indicate an increase in the overall market efficiency; both sellers and
buyers should benefit.
NEXT STEPS
Additional observations would be made via customer surveys of both our business partner and their
direct customers. Assuming our partner uses the data as we outline herein, surveying their
customers will provide qualitative data on the effectiveness of our analysis. The survey will present
the biggest cost and time commitment but is necessary to assess the efficacy of our system. See
Appendix A for an example of the survey.
7. Deployment, Monitoring and Maintenance
DEPLOYMENT PLAN
The predictive model is designed for homeowners and realtors as a price strategy tool to allow them
to make sales price estimates given a desired sell-date range (days on market), and desired bottom
line.
Because of the differences between these two audiences, the model will be deployed in two ways:

1. For Realtors: Partner with realtor.com and embed our model into the realtor.com hub, allowing our pricing strategy tool to be highlighted on each realtor's professional online dashboard. Realtors will have the option to select either a subscription or a single purchase to use the model. We would then provide a percentage of profits to realtor.com as part of the partnership.
2. For Homeowners: Develop a web-based application where homeowners can make a single purchase to use the model.
Both deployments will have a user interface designed so that both homeowners and realtors can
input specifications about the home (features, days on market, profit, etc…) into a web-based
application which then runs the predictive model algorithms on the back end and ultimately provides
the seller or realtor with an estimated sale price. Because we are focusing only on Fairfax County, Virginia, our partnerships with realtor.com and our outreach to sellers will pertain only to that county.
KEY PERFORMANCE INDICATORS
In order to understand if our price strategy tool addresses the business problem, we will have the
following key performance indicators in place.
1. Customer Acquisition and Retention
a) Customer Acquisition: Number of new customers that have been added to customer base
b) Revenue Growth rate: Percentage increase in sales between two time periods
c) Customer Churn Rate: The rate at which customers discontinue or opt out of renewing a price strategy tool subscription.
d) Repeat Purchase rate: Percentage of your current customers that have returned to buy
another single purchase or renewed subscription.
2. Measures for Model Performance
a) Percentage of homes sold using the predicted sale price and days on market
b) Percentage of homes sold within 15% of the predicted sale price and days on market
3. Customer Satisfaction
a) Customer Satisfaction Score (CSAT): Average score of all customer responses rating their satisfaction with the price strategy tool
b) Net Promoter Score (NPS): How likely customers are to recommend the price strategy tool on a scale from 0 to 10 (promoters: 9-10; passives: 7-8; detractors: 0-6)
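As an illustration of how two of these indicators could be computed, the sketch below derives churn rate and Net Promoter Score. The function names, sample subscriber counts, and survey ratings are assumptions, not real customer data.

```python
# Illustrative KPI calculations; the sample counts and survey ratings
# below are assumptions, not real customer data.

def churn_rate(subscribers_at_start, subscribers_lost):
    """Share of subscribers who discontinued during the period."""
    return subscribers_lost / subscribers_at_start

def net_promoter_score(ratings):
    """NPS on a 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)

quarterly_churn = churn_rate(200, 10)              # 10 of 200 subscribers lost
nps = net_promoter_score([10, 9, 8, 7, 6, 3, 10])  # 3 promoters, 2 detractors of 7
```

The same structure extends naturally to the acquisition and repeat-purchase rates, which are simple ratios over the customer base.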
RISKS
Data is the fuel for any successful predictive modeling exercise. We have therefore identified data limitations that could cause our model's results to diverge from actual outcomes:
• No information about potential buyers.
• The value of a house can be heavily influenced by features that are extremely subjective: the artistic quality of the residence, its architecture, or a specific style might appeal strongly to one buyer but not another.
MONITORING AND MAINTENANCE
Monitoring and maintenance become critical once the price strategy tool is part of customers' daily activity. Proper maintenance will reduce periods in which the data mining models deliver degraded results. To monitor the deployment effectively, we will adopt a comprehensive monitoring and maintenance plan and develop guidelines from first principles.
The significance of this step is to determine the extent to which our models achieve the expected degree of predictive confidence. We will also evaluate the models periodically to ensure their effectiveness and to make continuous improvements. The monitoring and maintenance plan will include:
1. Tracking and updating the data that influence the models
2. Measuring and monitoring the validity and accuracy of each model
3. Identifying accuracy thresholds or expected changes in the data
4. Mapping the data dependencies between the modeling techniques (k-means and hierarchical clustering, classification, linear regression, and logistic regression)
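Item 3, identifying expected changes in the data, could be implemented with a simple drift check such as the sketch below. The `feature_drifted` helper, the sample prices, and the two-standard-deviation threshold are all illustrative assumptions.

```python
import statistics

# Illustrative drift check: flag a feature whose recent mean has moved
# more than z_threshold training standard deviations away from the
# training mean. Helper name, threshold, and data are assumptions.

def feature_drifted(train_values, recent_values, z_threshold=2.0):
    """True when the recent mean shifts beyond the allowed z-score band."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(recent_values) - mu)
    return shift > z_threshold * sigma

train_prices = [480_000, 500_000, 520_000, 510_000, 490_000]
```

A check like this, run on each model input as new sales data arrives, gives an early signal that the training data no longer reflects the market.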
COMPONENTS OF THE MONITORING AND MAINTENANCE PLAN
The monitoring and maintenance plan comprises five main components. Each component serves as a check during monitoring and maintenance, and we will coordinate all of them throughout the process.
TASK PRIORITIES SYSTEM
The task priorities system governs how monitoring and maintenance work is scheduled. Its main aim is to minimize losses and deficiencies in all situations: emergency responses take precedence, followed by scheduled operations and services.
DEVELOPED PROCESSES SYSTEM
Each stage, from data acquisition through modeling to validation, has well-defined procedures and assumptions. This system verifies that we follow those procedures, records the challenges faced at each step as we obtain data, defines the purpose of routine activities, and considers how frequently each stage operates.
TASK ORDER SYSTEM
The project has a detailed work order system, as discussed in previous sections. Each work order includes all pertinent information, such as the source, description, cost, and time taken to achieve the output. This data is crucial for evaluating system performance. For this component to be effective, all requests and performed activities will be recorded in detail, and we will track the deviation of each outcome from the expected result.
TRAINING
The project will train users on how to use the model effectively. The plan recognizes the importance of giving users opportunities to gain technical expertise, broaden their knowledge and skills, and learn new methodologies. Training will be held regularly, and we will involve realtors in this component of monitoring and maintenance.
We will determine the accuracy of the system by comparing model outputs to actual outcomes wherever possible, a process known as validation. We will select one or more data points and check their results in the model, and we will run other models on the same prediction task, for example determining prices in a given area. Finally, we will calculate the mean absolute percentage error of the predictions. If the error exceeds 10%, the model will be deemed unreliable and withdrawn from use; in that situation, we will update the model and rely on conventional models in the short run. In the long term, a new model will be developed that takes more factors into consideration, making it more relevant and sound.
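One way to implement this error check is with a mean absolute percentage error, which pairs naturally with a percentage cutoff. In the sketch below, the sale prices are made-up sample figures and the 10% threshold comes from the plan.

```python
# Validation sketch: compare predicted prices to realized sale prices
# and compute the mean absolute percentage error (MAPE). The prices
# below are made-up sample figures; the 10% cutoff is the plan's.

def mape(actual, predicted):
    """Mean absolute percentage error between actual and predicted prices."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

actual_prices    = [500_000, 420_000, 610_000]
predicted_prices = [480_000, 450_000, 600_000]

error = mape(actual_prices, predicted_prices)   # ≈ 0.043
model_acceptable = error <= 0.10                # within the 10% threshold
```

Running this check on each batch of closed sales gives a simple pass/fail signal for whether the model stays in production.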
8. Conclusion
With the successful implementation of these supervised and unsupervised machine learning models, we will gain a competitive advantage and offer state-of-the-art services to our customers as an industry leader. It is imperative to monitor the performance of the existing clustering, classification, and regression models and to explore alternative options. Because data mining is an iterative process, we will continuously gather new data and improve our algorithms to fit our business scope and serve client requirements.
References Consulted
Astala, R. (2018, January 8). Machine Learning Studio module reference. Retrieved from Microsoft Azure: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/cross-validate-model
Bahrepour, M. (2018, August 17). The forgotten step in CRISP-DM and ASUM-DM methodologies. Retrieved from https://amsterdam.luminis.eu/2018/08/17/the-forgotten-step-in-crisp-dm-and-asum-dm-methodologies
Bhavesh, P. (2017, April 17). Predicting house value using regression analysis. Towards Data Science. Retrieved October 12, 2018, from https://towardsdatascience.com/regression-analysis-model-used-in-machine-learning-318f7656108a
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM Consortium.
Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1-36. http://www.jstatsoft.org/v61/i06/paper
Dobbin, K. K., & Simon, R. M. (2011). Optimally splitting cases for training and testing high dimensional classifiers. BMC Medical Genomics, 4, 31.
Fairfax County, Virginia [Map]. Retrieved October 11, 2018, from https://upload.wikimedia.org/wikipedia/commons/e/e2/Dc22counties.jpg
López-Campos, M. A., Márquez, A. C., & Fernández, J. F. G. (2018). The integration of open reliability, maintenance, and condition monitoring management systems. In Advanced Maintenance Modelling for Asset Management (pp. 43-78). Springer, Cham.
Ng, R. (2018, November 4). Evaluating a classification model. Retrieved from https://www.ritchieng.com/machine-learning-evaluate-classification-model/
Provost, F., & Fawcett, T. (2013). Data Science for Business. O'Reilly Media, Inc.
Rokach, L., & Maimon, O. (2005). Clustering methods. In O. Maimon & L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA.
Team Lynch. (2018, November 4). Customer satisfaction survey. Retrieved from http://www.teamlynchrealestate.com/Customer-Satisfaction-Survey
Appendix A
Reproduced from: http://teamlynchrealestate.com