项目背景

数据是从kaggle下载的，Airbnb是一个让大众出租住宿民宿的网站，提供短期出租房屋或房间的服务，并且以其独特性的居住体验发展迅速，这里我们拿到的数据是2019年纽约的民宿数据。

这里需要注意一点，有的压缩包解压之后直接使用csv文件会有问题，正确的方式是，1）右击csv文件，打开方式选择txt，然后保存，Encloding选ANSI保存，2）再使用Excel打开csv文件，然后另存为，保存类型选择CSV UTF-8（逗号分隔），这样文件可以解决基本的中文乱码/列错位等等问题；

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('AB_NYC_2019v.csv',encoding='utf_8_sig')

df.head()

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	last_review	reviews_per_month	calculated_host_listings_count	availability_365
0	34177117	Sonder \| The Biltmore \| 1BR	219517861	Sonder (NYC)	Manhattan	Theater District	40.76118	-73.98635	Entire home/apt	699	29	NaN	NaN	327	365
1	34202946	Sonder \| 21 Chelsea \| Stunning 1BR + Rooftop	219517861	Sonder (NYC)	Manhattan	Chelsea	40.74285	-73.99595	Entire home/apt	277	29	NaN	NaN	327	365
2	35005862	Sonder \| 21 Chelsea \| Classic 1BR + Rooftop	219517861	Sonder (NYC)	Manhattan	Chelsea	40.74288	-73.99438	Entire home/apt	275	29	NaN	NaN	327	365
3	33528090	Sonder \| 21 Chelsea \| Vibrant 1BR + Rooftop	219517861	Sonder (NYC)	Manhattan	Chelsea	40.74250	-73.99443	Entire home/apt	260	29	NaN	NaN	327	365
4	34570839	Sonder \| 21 Chelsea \| Quaint 1BR + Rooftop	219517861	Sonder (NYC)	Manhattan	Chelsea	40.74305	-73.99415	Entire home/apt	254	29	NaN	NaN	327	365

df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

df.describe()

	id	host_id	latitude	longitude	price	minimum_nights	number_of_reviews	reviews_per_month	calculated_host_listings_count	availability_365
count	4.889500e+04	4.889500e+04	48895.000000	48895.000000	48895.000000	48895.000000	48895.000000	38843.000000	48895.000000	48895.000000
mean	1.901714e+07	6.762001e+07	40.728949	-73.952170	152.720687	7.029962	23.274466	1.373221	7.143982	112.781327
std	1.098311e+07	7.861097e+07	0.054530	0.046157	240.154170	20.510550	44.550582	1.680442	32.952519	131.622289
min	2.539000e+03	2.438000e+03	40.499790	-74.244420	0.000000	1.000000	0.000000	0.010000	1.000000	0.000000
25%	9.471945e+06	7.822033e+06	40.690100	-73.983070	69.000000	1.000000	1.000000	0.190000	1.000000	0.000000
50%	1.967728e+07	3.079382e+07	40.723070	-73.955680	106.000000	3.000000	5.000000	0.720000	1.000000	45.000000
75%	2.915218e+07	1.074344e+08	40.763115	-73.936275	175.000000	5.000000	24.000000	2.020000	2.000000	227.000000
max	3.648724e+07	2.743213e+08	40.913060	-73.712990	10000.000000	1250.000000	629.000000	58.500000	327.000000	365.000000

描述性统计分析：

一共有约48000+条数据
价格[price]的均值为152美元，中位数106美元，数据向右/高偏移；存在极高/低异常值；
最少预订天/夜晚[minimum_nights]存在极高异常值；（如果有特殊情况，结合实际处理，这里以常规情况为例）
每位房东的房源数量[calculated_host_listings_count]均值是7，中位数是1，说明大部分房东都只有1个房源，存在极高异常值；
最近一次浏览日期[last_review]和每月浏览数量均值[reviews_per_month]都有10000+个缺失值，0填充；

1.数据处理

df['last_review'].fillna(0,inplace=True)
df['reviews_per_month'].fillna(0,inplace=True)

df['unit_price'] = (df.price / df.minimum_nights)  #增加一列字段 均价
df.unit_price.plot()  #查看单价的大概分布

<matplotlib.axes._subplots.AxesSubplot at 0x19f1beffa20>

df.unit_price.describe()

count    48895.000000
mean        70.174247
std        157.620388
min          0.000000
25%         20.000000
50%         44.500000
75%         81.500000
max       8000.000000
Name: unit_price, dtype: float64

df[df['unit_price'] > 500].count()

id                                316
name                              316
host_id                           316
host_name                         316
neighbourhood_group               316
neighbourhood                     316
latitude                          316
longitude                         316
room_type                         316
price                             316
minimum_nights                    316
number_of_reviews                 316
last_review                       316
reviews_per_month                 316
calculated_host_listings_count    316
availability_365                  316
unit_price                        316
dtype: int64

有316条房间单价unit_price大于500，占总样本数的0.64%，根据统计学，这部分数据我们做删除处理；

df = df[df.unit_price < 500]
df = df[df.unit_price > 1]

df.describe()

	id	host_id	latitude	longitude	price	minimum_nights	number_of_reviews	reviews_per_month	calculated_host_listings_count	availability_365	unit_price
count	4.822100e+04	4.822100e+04	48221.000000	48221.000000	48221.000000	48221.000000	48221.000000	48221.000000	48221.000000	48221.000000	48221.000000
mean	1.903368e+07	6.764915e+07	40.728922	-73.952029	143.473466	6.173472	23.470521	1.100040	7.190871	111.849899	61.690299
std	1.097072e+07	7.860904e+07	0.054550	0.046198	176.823568	10.942889	44.688066	1.602071	33.161159	131.099858	63.639782
min	2.539000e+03	2.438000e+03	40.499790	-74.244420	10.000000	1.000000	0.000000	0.000000	1.000000	0.000000	1.016667
25%	9.488762e+06	7.833913e+06	40.690040	-73.982910	69.000000	1.000000	1.000000	0.040000	1.000000	0.000000	20.000000
50%	1.969877e+07	3.086886e+07	40.722960	-73.955570	105.000000	3.000000	5.000000	0.380000	1.000000	44.000000	44.000000
75%	2.914645e+07	1.074344e+08	40.763170	-73.936120	175.000000	5.000000	24.000000	1.610000	2.000000	224.000000	80.000000
max	3.648724e+07	2.743213e+08	40.913060	-73.712990	10000.000000	365.000000	629.000000	58.500000	327.000000	365.000000	499.500000

df.unit_price.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x19f178b8208>

2.民宿区域分析

从neighbourhood_group/neighbourhood两个维度查看不同区域民宿数量分布

df_pie = df.neighbourhood_group.value_counts().reset_index()
df_pie

	index	neighbourhood_group
0	Manhattan	21287
1	Brooklyn	19880
2	Queens	5614
3	Bronx	1070
4	Staten Island	370

plt.figure(figsize=(15,9.3))
explode = (0.1,0,0,0,0) #为了使最大占比区域分离
plt.pie(x='neighbourhood_group',labels='index',explode=explode,autopct='%1.1f%%',data=df_pie)
plt.title("The Rate of different Landmark Point")
plt.show()

pivot = df.pivot_table(index='neighbourhood',values='id',aggfunc='count')
df_pie_2 = pivot.sort_values(by='id',ascending=False).reset_index().head(20) #这里只取值前20，否则显示太多
df_pie_2.head()

	neighbourhood	id
0	Williamsburg	3888
1	Bedford-Stuyvesant	3666
2	Harlem	2638
3	Bushwick	2445
4	Upper West Side	1936

plt.figure(figsize=(15,9.3))
plt.pie(x='id',labels='neighbourhood',autopct='%1.1f%%',data=df_pie_2)
plt.show()

plt.figure(figsize=(10,8))
pivot[pivot.id > 2500].sort_values(by='id',ascending=False).plot.bar()
plt.xticks(rotation=30)

(array([0, 1, 2]), <a list of 3 Text xticklabel objects>)<Figure size 720x576 with 0 Axes>

房间最多的区域前3个区域是Williamsburg 3888,Bedford-Stuyvesant 3666,Harlem 2638

3.对比Top3区域的均价

df1 = df[['neighbourhood','unit_price']].groupby(by=['neighbourhood','unit_price']).count().reset_index()
df1.head()

	neighbourhood	unit_price
0	Allerton	5.000000
1	Allerton	12.142857
2	Allerton	15.750000
3	Allerton	16.333333
4	Allerton	17.000000

top3 = ['Williamsburg','Bedford-Stuyvesant','Harlem']
for i in top3:plt.hist(df1[df1.neighbourhood == i].unit_price,bins=20)plt.xlabel(i)plt.show()

前两个价格区间类似，都集中在20美元以下，最后一个集中在20美元左右，之后都呈递减趋势；
值得注意的是，前3个区域的民宿价格根据数量递减，最高价格也呈递减趋势；
第一多的Williamsburg最高房价高达500，而第二多的Bedford-Stuyvesant价格最高达400多（小于500），而第三的Harlem则仅需350；

4.最受欢迎的民宿

df.sort_values(by=['number_of_reviews','reviews_per_month'],ascending=False).iloc[:10,:] #注意by后面参数的顺序影响结果

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	unit_price
5321	9145202	Room near JFK Queen Bed	47621202	Dona	Queens	Jamaica	40.66730	-73.76831	Private room	47	1	629	2019/7/5	14.58	2	333	47.0
8515	903972	Great Bedroom in Manhattan	4734398	Jj	Manhattan	Harlem	40.82085	-73.94025	Private room	49	1	607	2019/6/21	7.75	3	293	49.0
4259	903947	Beautiful Bedroom in Manhattan	4734398	Jj	Manhattan	Harlem	40.82124	-73.93838	Private room	49	1	597	2019/6/23	7.72	3	342	49.0
4709	891117	Private Bedroom in Manhattan	4734398	Jj	Manhattan	Harlem	40.82264	-73.94041	Private room	49	1	594	2019/6/15	7.57	3	339	49.0
15330	10101135	Room Near JFK Twin Beds	47621202	Dona	Queens	Jamaica	40.66939	-73.76975	Private room	47	1	576	2019/6/27	13.40	2	173	47.0
15923	8168619	Steps away from Laguardia airport	37312959	Maya	Queens	East Elmhurst	40.77006	-73.87683	Private room	46	1	543	2019/7/1	11.59	5	163	46.0
14803	834190	Manhattan Lux Loft.Like.Love.Lots.Look !	2369681	Carol	Manhattan	Lower East Side	40.71921	-73.99116	Private room	99	2	540	2019/7/6	6.95	1	179	49.5
4427	16276632	Cozy Room Family Home LGA Airport NO CLEANING FEE	26432133	Danielle	Queens	East Elmhurst	40.76335	-73.87007	Private room	48	1	510	2019/7/6	16.22	5	341	48.0
10001	3474320	Private brownstone studio Brooklyn	12949460	Asa	Brooklyn	Park Slope	40.67926	-73.97711	Entire home/apt	160	1	488	2019/7/1	8.14	1	269	160.0
44566	166172	LG Private Room/Family Friendly	792159	Wanda	Brooklyn	Bushwick	40.70283	-73.92131	Private room	60	3	480	2019/7/7	6.70	1	0	20.0

前10个最受欢迎的民宿，其中4家在Manhattan，4家在Queens，2家在Brooklyn
其中前8个民宿均价都在50上下浮动，最末有1个极高值160，有1个极低值20，再看房型，160的房型是整屋出租，其他都是单间
前10家民宿的浏览量在480~630之间稳定浮动
前10家民宿中，排前5的房子由2个人经营，即一个人开多家民宿，猜测是有专业的运营团队，后5个目前不确定（因为这里只显示高人气房子）

下面我们看下排行后面5个的房东是否只经营1家民宿，从后面的全部数据中筛选

df[df['host_name'] == 'Maya'].count()

id                                46
name                              46
host_id                           46
host_name                         46
neighbourhood_group               46
neighbourhood                     46
latitude                          46
longitude                         46
room_type                         46
price                             46
minimum_nights                    46
number_of_reviews                 46
last_review                       46
reviews_per_month                 46
calculated_host_listings_count    46
availability_365                  46
unit_price                        46
dtype: int64

df['host_name'].isin(['Carol']).value_counts()

False    48176
True        45
Name: host_name, dtype: int64

df['host_name'].isin(['Danielle']).value_counts()

False    48145
True        76
Name: host_name, dtype: int64

df['host_name'].isin(['Asa']).value_counts()

False    48220
True         1
Name: host_name, dtype: int64

df['host_name'].isin(['Wanda']).value_counts()

False    48217
True         4
Name: host_name, dtype: int64

这里通过两种方法可以求得后5个人气民宿的房东所拥有的民宿数量，排行前10的民宿中，有9家都是团队运营，只有一家房东Asa是个人运营；
可能Airbin更趋于一种商业模式，而非最初的出发点，把自己闲置的房间share给旅行者；
有点类似中国的二手房东，只是目标客户不再是本地人，而是旅行者，租赁方式也由长租改为短租；
不过Airbin经过这么久的发展，已经是成熟的市场，团队应该比个人更专业些，排行靠前也就不足为奇了；

下面我们算下个人和团体的比例。计算方式：房东名下只有1套房的的视为个体，1次以上的视为团体

df2 = df['host_name'].value_counts()
df2.head()

Michael         400
David           390
Sonder (NYC)    327
John            291
Alex            272
Name: host_name, dtype: int64

df2.describe()

count    11355.000000
mean         4.244826
std         14.076734
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        400.000000
Name: host_name, dtype: float64

(df2 > 1).sum() #团体经营

(df2 == 1).sum() #个体经营

个体经营者约6837人，团体经营者约4518人，个人运营者更多，不过注意前者只有1套房子，后者有多套房子
一共有48895条数据，减去6837条个人运营的1个房子，剩下的都是团体运营的房子，团体拥有总房子数量为个人的6倍左右；

5. 受欢迎的房子特点

keyword = df['name'].to_frame().to_csv('hotword.txt', index=False, sep=',', header=None) #方法to_frame()将Series转DataFrame

from wordcloud import WordCloud
text = open('hotword.txt',encoding = 'utf_8_sig').read()
wordcloud = WordCloud(background_color="white",max_font_size=50).generate(text)
plt.figure(figsize=(10,6.18))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

通过词云呈现出高热点搜索词：

搜索词最多的是Private/Apartment/Bedroom等房型，所以一定要标注出房价类型
其次出现heart of/Central Park/In Brooklyn/East Village等描述地理位置的词汇
同样出现的有形容词类如Beautiful/Cozy/Bright/Charming/Moden/Lovely/Quiet等
房东在添加标题的时候可以考虑加入高频搜索词汇，以吸引更多的人浏览

后记：这里只是粗略的进行分析，如果有面积，实际上更深入的还可以利用统计学/机器学习领域的最小二乘法去分析下定价，或者天气等不同因素的影响，挖掘更深度的东西；

已经开始看机器学习了，不确定是否用得上，不过一口气听了10多集，觉得还行，另外，在学习machine learning之前，建议先打好基础，包括统计学，很多人推荐很多经典书，这里推荐一本深入浅出统计学，我刷了2遍，竟然发现machine learning很多用到统计学的。

最近看的一本书里提到一句话，大概意思是：数据分析只能在现有的格局里找到最优解，但是人类的思想却可以突破当下的困局，从而带来全面的革新。所以数据分析有优势，但也确实有其弊端。

继续学习吧~

Airbin_0720