INFO 370 FINAL PROJECT

Jonathan Lin, Yu Che Lin, Eva Yin, Ashley Zhou

Project Overview

For this research project, our team decided not to use the Zillow dataset, because we realized that its variables are mostly about price (median price of homes with one bedroom, two bedrooms, etc.), and it would be meaningless to use price variables to predict another median price variable. Instead, we chose a dataset of housing information in Beijing because, in addition to home sale values, it contains other metrics such as the size of the house, the district, etc. Correspondingly, our research focus became identifying potential factors that affect the total transaction price of a house in Beijing (the totalPrice listed in the dataset, representing the transaction price of the entire house through the Lianjia website in a certain time period), and then using those factors to predict the sale prices of future houses that appear on the market.

In terms of the source of data, we worked with the Kaggle Research Dataset (https://www.kaggle.com/ruiqurm/lianjia?fbclid=IwAR3wd9hpmt4sA4z5TueQVsQzMuL5eMGkqrQU8WdnQEpFp66Hu0_f0VjUDuI), which was fetched from Lianjia.com, a website where people post housing information in Beijing. The dataset includes various housing-related metrics, such as the time of transaction, the area (square) of the house, the number of bathrooms, etc.

The target audience of our resource is people who are planning to buy or sell houses in Beijing in the next few months or years, either for their own use or for financial investment. We hope our audience will gain some insight into the trend of listed housing prices in Beijing, which may ultimately help them decide which months are the best time to buy or sell properties, how to pick houses that have greater growth potential, etc.

Background Information

To begin with, we did some background research to help contextualize our research. We believe our research would be useful for our potential audience, as one of our sources clearly states that “property values have become an increasingly common topic of conversation in recent years”. It also lists the most influential factors on housing prices, including supply and demand, interest rates, economic growth, demographics, location, growth potential, a second bedroom, parking, and home improvements. Each of them affects house prices to a different degree and in a different direction (House Prices). Similarly, according to a research paper on the factors that influence real estate prices in London, several of the most significant variables that may affect house prices include population density, income, and GVA (a measure of the value of goods and services produced in an area) (Gu).

Approaches Overview

As for our approach, we first did some data cleanup and preparation. We dropped all irrelevant variables, including url (the URL used to fetch the data), id (transaction id), and Cid (community id), because they do not contain information about the house itself. We also removed the columns Lng (longitude) and Lat (latitude). Even though they represent the geographic location of a house and might affect its price, raw longitude and latitude values are too abstract to identify a location, which may confuse our audience, and we are currently not focusing on the relationship between location and housing price.
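The column cleanup described above can be sketched in pandas as follows. This is a minimal illustration rather than the project's actual code: the tiny `raw` frame and the helper name `drop_unused_columns` are made up for the example, and only the five dropped column names come from the text.

```python
import pandas as pd

# Tiny made-up frame that only mimics the column names mentioned in the text.
raw = pd.DataFrame({
    "url": ["https://example.com/1", "https://example.com/2"],
    "id": ["a1", "a2"],
    "Cid": [111, 222],
    "Lng": [116.47, 116.45],
    "Lat": [40.01, 39.88],
    "square": [131.0, 132.38],
    "totalPrice": [415.0, 575.0],
})

# Identifier and coordinate columns that are discarded before modeling.
DROPPED_COLUMNS = ["url", "id", "Cid", "Lng", "Lat"]


def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the identifier and coordinate columns."""
    return df.drop(columns=DROPPED_COLUMNS)


cleaned = drop_unused_columns(raw)
print(list(cleaned.columns))
```

Returning a new frame instead of dropping in place keeps the raw data available for later checks.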

For missing values, we dropped all rows that contain null values. Each house is independent of the others (for example, one house's renovation condition is not affected by another's), so it would be improper to fill in a house's empty feature with a mean value or with the value from the row above.
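A minimal sketch of this row-dropping step, assuming the data sits in a pandas DataFrame. The values below are invented for illustration; only the idea of dropping any row with a null value comes from the text.

```python
import numpy as np
import pandas as pd

# Invented example rows; two of them have a missing value.
houses = pd.DataFrame({
    "square": [80.0, 120.0, 95.0, 60.0],
    "renovationCondition": [3.0, np.nan, 1.0, 4.0],
    "totalPrice": [400.0, 650.0, np.nan, 300.0],
})

# Keep only rows where every column is present, instead of imputing.
complete = houses.dropna().reset_index(drop=True)
print(len(houses), "->", len(complete))
```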

As for new variables, we first added a month column and a year column as extra features for each row, representing the month and year in which a transaction occurred; both are derived from the tradeTime column. In addition, we made a totalRoom column representing the total number of rooms in a house, calculated by adding the numbers of livingRoom, drawingRoom, kitchen and bathroom.
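The derived columns could be built roughly like this. The sketch assumes that tradeTime holds date strings pandas can parse and that the room columns are spelled exactly as in the text; both are assumptions, and the rows are invented.

```python
import pandas as pd

# Invented rows; the date format and exact column spellings are assumptions.
houses = pd.DataFrame({
    "tradeTime": ["2016-08-09", "2017-01-15", "2018-01-03"],
    "livingRoom": [2, 3, 1],
    "drawingRoom": [1, 2, 1],
    "kitchen": [1, 1, 1],
    "bathroom": [1, 2, 1],
})

# Split the transaction date into year and month features.
trade_dates = pd.to_datetime(houses["tradeTime"])
houses["year"] = trade_dates.dt.year
houses["month"] = trade_dates.dt.month

# totalRoom is the row-wise sum of the four room counts.
room_columns = ["livingRoom", "drawingRoom", "kitchen", "bathroom"]
houses["totalRoom"] = houses[room_columns].sum(axis=1)
print(houses[["year", "month", "totalRoom"]])
```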

For feature selection, we first used univariate feature selection to select the variables most correlated with our outcome of interest. The top 6 variables were DOM (a house's active days on the market), followers (the number of people following the transaction), square (the total area of the house), totalRoom (the total number of rooms), ladderRatio (the ratio between the number of residents on a floor and the number of elevators and ladders on that floor, which describes how many ladders a resident has on average) and floorfixed (the height of the house). We then tried to verify the univariate selection result with a heatmap (attached below), which visualizes the correlation between all the variables and our outcome of interest (total transaction price). When looking at the heatmap, we ignored the categorical variables (renovationCondition, buildingStructure, etc.) and concentrated on the continuous variables. Based on the lightness of the block color (the lighter the block, the more strongly the two variables are related), square, totalRoom, floorfixed, DOM and year (the year in which the transaction occurred) are the most relevant variables. Most of these match what we got from univariate feature selection (square, totalRoom, DOM, floorfixed), but the heatmap also suggests a new variable to consider (year).

In the end, we compared and combined the variables from both univariate feature selection and the heatmap, and settled on a final set of DOM, followers, square, totalRoom, floorfixed and year. We dropped the ladderRatio variable because its correlation is among the lowest in the heatmap.
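As an illustration of the two checks described above, the sketch below runs scikit-learn's `SelectKBest` with `f_regression` and then ranks features by absolute correlation with the target. The report does not say which scoring function was used, so `f_regression` is an assumption, and the data is synthetic (the `noise` column and all coefficients are invented).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: two informative features and one pure-noise column.
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    "square": rng.uniform(30, 200, n),
    "totalRoom": rng.integers(2, 10, n).astype(float),
    "noise": rng.normal(size=n),
})
total_price = (
    4.0 * features["square"]
    + 60.0 * features["totalRoom"]
    + rng.normal(scale=20.0, size=n)
)

# Univariate selection: score each feature on its own and keep the top k.
selector = SelectKBest(score_func=f_regression, k=2).fit(features, total_price)
selected = list(features.columns[selector.get_support()])

# Cross-check with absolute Pearson correlation, as a heatmap would show.
correlations = features.corrwith(total_price).abs().sort_values(ascending=False)

print(selected)
print(correlations)
```

Comparing the two rankings is a cheap sanity check; a feature that scores well in one but not the other deserves a closer look.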

In [13]:
import analysis as ana
import importlib
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[13]:
In [3]:
ana.correlation(ana.data)

Visualization

The bar graph below shows the average housing transaction price (in units of 10k RMB) for each month. The monthly averages are very close, with September having the highest average transaction price and May the lowest. Compared with the rest of the year, transaction prices are relatively lower from April to July.

In [4]:
ana.month(ana.data)

The bar graph below shows the average housing transaction price (in units of 10k RMB) for each year. There is an overall increasing trend in housing prices from 2010 to 2018, with 2017 having the highest average transaction price. Note that the dataset only contains January transactions for 2018, which might be why the average price for 2018 is lower than that for 2017.

In [5]:
ana.year(ana.data)
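The averages behind the two bar graphs can be computed with a group-by. The sketch below is only an assumption about what helpers such as `ana.month` and `ana.year` aggregate before plotting; the helper `mean_price_by` and the rows are invented.

```python
import pandas as pd

# Invented transactions (totalPrice in units of 10k RMB).
sales = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017, 2018],
    "month": [5, 9, 5, 9, 1],
    "totalPrice": [300.0, 400.0, 320.0, 420.0, 350.0],
})


def mean_price_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Average totalPrice for each distinct value of the given column."""
    return df.groupby(column)["totalPrice"].mean()


monthly_mean = mean_price_by(sales, "month")
yearly_mean = mean_price_by(sales, "year")

highest_month = int(monthly_mean.idxmax())
lowest_month = int(monthly_mean.idxmin())
highest_year = int(yearly_mean.idxmax())
print(highest_month, lowest_month, highest_year)
```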

The histogram below shows the distribution of housing transaction prices. The majority of transactions occur at prices between 0k and 5000k RMB, and relatively few transactions occur at prices above 1000k RMB.

In [6]:
ana.price(ana.data)

The scatter plot below shows the relationship between the area (square) of a house and its transaction price. The overall trend is that the transaction price increases as the area increases, which is quite intuitive. However, the dispersion of the points shows that houses with the same area (for instance, 300) may still sell at different prices, meaning that other factors also affect the transaction price.

In [7]:
ana.square_corr(ana.data)

Modeling

In our case, we use a regressor rather than a classifier for the modeling, because classifiers are generally used to predict categorical values (labels, etc.) while regressors are generally used to predict quantities, and our outcome of interest, the total transaction price, is a continuous variable. Our data is labelled, so we use supervised learning.

We first tried the SGD Regressor, since our dataset contains over 100k samples, and the KNN (k nearest neighbors) Regressor, a common algorithm for predicting a quantifiable outcome of interest: to predict the value for any observation (row), we simply look at the values of the K most similar points. We performed polynomial transformations, cross validation and grid search for both models, and added a scaler for KNN. The resulting MAE (scored as negative MAE) was 131.315 for KNN and 127.842 for SGD. To improve our accuracy, we tried the Ridge Regressor, a technique for analyzing multiple regression data. We similarly performed polynomial transformations, cross validation and grid search, and got an MAE of 120.09, which suggests that it is slightly more accurate than the first two models. We think a potential reason is that the Ridge Regressor accounts for some of the multicollinearity we saw in the heatmap.
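A minimal sketch of the kind of search described above, using a scikit-learn pipeline with polynomial features, a scaler and Ridge, scored with `neg_mean_absolute_error`. The data is synthetic and the parameter grid is invented; the report does not list the actual grid, and it mentions a scaler only for KNN, so including one here is a choice made for the example. The SGD and KNN runs could reuse the same structure by swapping the final estimator.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression data standing in for the housing features.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(scale=5.0, size=200)

pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),  # polynomial transformation
    ("scale", StandardScaler()),                       # put features on one scale
    ("ridge", Ridge()),                                # L2-regularized regression
])

# Hypothetical grid; a larger one costs more fits (combinations x folds).
param_grid = {"poly__degree": [1, 2], "ridge__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)

# scikit-learn maximizes scores, so MAE is reported with a negative sign.
best_mae = -search.best_score_
print(search.best_params_, round(best_mae, 3))
```

Because the score is negated, a value closer to zero is better, and flipping the sign recovers the MAE in the target's units.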

Due to the sample size of our dataset (150k+ samples), we didn't have the computational power to search over more hyperparameters in our models' grid search. We tried training on smaller samples of the data (20k samples), but these often excluded categories that are rare in the dataset (e.g. the brick and wood building structure is present in only ~1% of the full dataset).
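One possible way around the rare-category problem is stratified sampling, where each category is sampled at the same rate. This is not something the report describes doing; the sketch below only illustrates the idea with pandas' `groupby(...).sample`, on invented data in which "other" is a placeholder for all remaining building structures.

```python
import pandas as pd

# Invented data: a rare category that makes up 1% of the rows.
houses = pd.DataFrame({
    "buildingStructure": ["brick and wood"] * 10 + ["other"] * 990,
    "totalPrice": range(1000),
})

# Sample 20% within each category so rare categories stay represented.
subset = houses.groupby("buildingStructure").sample(frac=0.2, random_state=0)
counts = subset["buildingStructure"].value_counts()
print(counts)
```

A plain random sample of the same size could miss the rare category entirely, while the stratified version keeps its share fixed.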

To further validate our predictions, we made a residuals vs. fitted plot and a predicted vs. real price plot for the Ridge Regressor (graphs below). Based on the second plot (with the x axis being the predicted price and the y axis being the real price), the overall trend of our predictions aligns with the real values, but there are still some outliers (such as negative predicted values). Based on the residuals plot, our predictions are relatively balanced, as there are neither too many over-predictions nor too many under-predictions.

In [10]:
# Residuals vs. fitted values for the Ridge model
ana.predictions_df.plot(kind='scatter', x='Ridge', y='Ridge_res', alpha = .3);
# Horizontal reference line at zero residual
ana.plt.axhline(0, color='r')
ana.plt.ylabel('Residuals of Ridge (unit 10k)')
ana.plt.xlabel('Ridge Predicted Transaction Price (unit 10k)')
ana.plt.title('Residuals of Ridge Models and the Real Transaction Prices vs. Ridge Predicted Transaction Price')
ana.plt.show()
In [12]:
# Ridge predictions vs. real transaction prices
ana.predictions_df.plot(kind='scatter', x='Ridge', y='realPrices', alpha = .3);
# Identity line (predicted == real) for reference
ana.plt.plot(ana.predictions_df.realPrices, ana.predictions_df.realPrices, c='r')
ana.plt.xlabel('Ridge Predicted Transaction Price (unit 10k)')
ana.plt.ylabel('Real House Transaction Price (unit 10k)')
ana.plt.title('Relationship between the predictions from models and the real transaction prices')
ana.plt.show()
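The balance of over- and under-predictions read off the plots can also be checked numerically. The sketch below uses a small invented frame with the same column names as the plotting cells; defining the residual as real price minus predicted price is an assumption, since the report does not state the sign convention.

```python
import pandas as pd

# Invented predictions (units of 10k RMB); column names follow the plots.
predictions = pd.DataFrame({
    "realPrices": [300.0, 450.0, 520.0, 80.0],
    "Ridge": [320.0, 430.0, 520.0, -10.0],
})

# Assumed convention: residual = real price - predicted price.
predictions["Ridge_res"] = predictions["realPrices"] - predictions["Ridge"]

over_predictions = int((predictions["Ridge_res"] < 0).sum())   # predicted too high
under_predictions = int((predictions["Ridge_res"] > 0).sum())  # predicted too low
negative_predictions = int((predictions["Ridge"] < 0).sum())   # impossible prices
mae = float(predictions["Ridge_res"].abs().mean())

print(over_predictions, under_predictions, negative_predictions, mae)
```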

Considering the running time, we currently have to use a relatively small range of hyperparameters; with no running time limitation, we could use a larger range in the future to improve our predictions. To conclude, we hope our audience will take a close look at a house's active days on the Lianjia website, the number of people following the house, the total area of the house, the total number of rooms and the height of the house before actually conducting a transaction, which may give them a more informed view of the final transaction price of a house in Beijing.

Resources

Vasel, Kathryn. “Why West Coast home prices are surging.” https://money.cnn.com/2018/06/13/real_estate/west-coast-housing-markets/index.html

Positive Money. “Why are House Prices so High?” https://positivemoney.org/issues/house-prices/

“House Prices.” Information About Factors That Determine Property Prices - HomeGuru, www.homeguru.com.au/house-prices.

Gu, Yiyang. “What are the most important factors that influence the changes in London Real Estate Prices? How to quantify them?” https://arxiv.org/pdf/1802.08238.pdf