Analyzing Airbnb's Beijing listings with Python: where should you actually stay on a trip to Beijing?

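I'm a little embarrassed to admit it, but apart from one visit while I was still in my mother's belly, I (A-Huahua) have never had anything to do with Beijing. And I so badly want to eat Beijing's baodu (quick-boiled tripe), hot-pot mutton, douzhi (fermented mung bean drink), roast duck, zhajiang noodles, luzhu huoshao, pea cake, fried sesame balls, lüdagun, jiaoquan (fried dough rings), and so on and so on...

So here is the question: if a holiday suddenly dropped into my lap and I could go to Beijing, where should I stay?

Conveniently, Airbnb has made part of its data public (>>Airbnb北京数据集<<), so, ummmm, let me take a good look at it.

(Oh, and the full code is here: >>Airbnb数据分析及可视化<<, one click to fork and run.)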

Reading the data

import re
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import font_manager
import geopandas as gpd  # geopandas handles geographic data; geohash is also worth reading up on
from shapely.geometry import Point  # turns a (longitude, latitude) pair into a point geometry
import mapclassify

%matplotlib inline

warnings.filterwarnings('ignore')

listing_path = 'listings.csv'
neighbourhoods_path = 'neighbourhoods.csv'
reviews_path = 'reviews.csv'

listing = pd.read_csv(listing_path)
neighbourhoods = pd.read_csv(neighbourhoods_path)
reviews = pd.read_csv(reviews_path)

listing

Hmm... that is hard to read. Let me call info() on it.

listing.info()

The summary shows 16 feature columns and 28,452 rows.

Four columns have some missing values (name/neighbour/last_review/reviews_per_month).
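To see exactly how much is missing in each column, rather than reading it off the info() output, something like the following could be used. This is only a sketch: `missing_summary` is a hypothetical helper, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and share of missing values for the columns that have any."""
    missing = df.isna().sum()
    summary = pd.DataFrame({"missing": missing, "share": missing / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up frame, for illustration only
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "price": [100, 200, 300, 400],
    "reviews_per_month": [np.nan, np.nan, 1.5, 0.2],
})
print(missing_summary(toy))
```

On the real data this would be called as `missing_summary(listing)`.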

Understanding the data

Some simple processing and description of the features

Reduce neighbourhood to just its Chinese name

a = listing['neighbourhood']

def colum_to_str(data):
    # Keep only the first \w+ token of each value
    neighbourhood = []
    a = data.str.findall(r'\w+').tolist()
    for i in a:
        neighbourhood.append(i[0])
    return neighbourhood

listing['neighbourhood'] = colum_to_str(a)
listing['neighbourhood'].unique()
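A small standalone illustration of why taking the first `\w+` match leaves only the Chinese name. The sample strings are an assumption about what the raw column looks like (a Chinese name, optionally followed by a slash and an English name), and `first_word_token` is a hypothetical helper that also guards against values with no word characters at all.

```python
import re

def first_word_token(value: str) -> str:
    """Return the first run of word characters in value, or '' if there is none."""
    tokens = re.findall(r"\w+", value)
    return tokens[0] if tokens else ""

# Hypothetical sample values; the real column contents are an assumption here
samples = ["朝阳区 / Chaoyang", "东城区", " 海淀区 / Haidian "]
cleaned = [first_word_token(s) for s in samples]
print(cleaned)
```

In Python 3, `\w` on a str pattern matches Chinese characters too, so the space and slash are what end the first token.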

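Room type

There are three room types: entire home, private room, and shared room.

listing['room_type'].unique()

Most listings are rented out as entire homes; very few are shared rooms.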

plt.subplot(311)
# Take labels and heights from the same value_counts() result so each bar gets its own label
room_counts = listing['room_type'].value_counts()
plt.bar(x=room_counts.index, height=room_counts.values)

Numeric features

new_columns = ['price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'availability_365']
data = listing[new_columns]

def is_number(data):
    # Split the column names into numeric (int64/float64) and everything else
    int_columns = []
    str_columns = []
    columns_name = data.columns.tolist()
    for i in range(data.shape[1]):
        if data[columns_name[i]].dtype == 'int64' or data[columns_name[i]].dtype == 'float64':
            int_columns.append(columns_name[i])
        else:
            str_columns.append(columns_name[i])
    return int_columns, str_columns

int_columns, str_columns = is_number(listing)

import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['FangSong']  # default font, so Chinese labels render
mpl.rcParams['axes.unicode_minus'] = False

# Plot the distribution of each numeric column
for i in range(len(data.columns)):
    plt.hist(data[data.columns[i]].to_numpy())  # to_numpy() in place of the removed get_values()
    plt.xlabel(data.columns[i])
    plt.show()

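Looking at the distributions of the continuous features, we can conclude:

price is widely spread, with a mean of about 600 yuan and a median of about 400 yuan, but extreme values distort it and need to be handled later;

minimum_nights contains outliers (anything above 365 days is treated as an outlier here), which also need to be handled later;

availability_365 looks normal; close to half of the listings are available about 350 days a year;

calculated_host_listings_count shows some clusters of values, suggesting that some hosts run homestays professionally and operate several properties;

number_of_reviews has extreme values, so it is worth checking whether reviews are being artificially inflated.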
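One possible way to handle the two problems flagged above is to drop rows with minimum_nights over 365 and rows whose price is above a high quantile. This is a sketch under assumptions, not the author's cleaning step: `drop_outliers` and the 0.99 default cut-off are hypothetical choices, and the toy frame is made up.

```python
import pandas as pd

def drop_outliers(df: pd.DataFrame, max_nights: int = 365,
                  price_quantile: float = 0.99) -> pd.DataFrame:
    """Drop rows with minimum_nights above max_nights or price above the given quantile."""
    price_cap = df["price"].quantile(price_quantile)
    keep = (df["minimum_nights"] <= max_nights) & (df["price"] <= price_cap)
    return df.loc[keep].copy()

# Tiny made-up frame, only to show the behaviour
toy = pd.DataFrame({
    "price": [300, 400, 450, 500, 68000],
    "minimum_nights": [1, 2, 1000, 3, 1],
})
cleaned = drop_outliers(toy, price_quantile=0.9)
print(cleaned)
```

Whether to drop or cap such rows depends on the question being asked; dropping is simply the easiest to reason about.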

Agencies vs. individuals

The top agency has quite a lot of listings, huh

listing[['host_name','name']].groupby('host_name').count().sort_values(by='name', ascending=False)

Distribution by district

listing['neighbourhood'].value_counts()

Starting the analysis

Districts (distribution, price)

neighbourhood_gpd = gpd.GeoDataFrame.from_file('neighbourhoods.geojson')
neighbourhood_gpd.head()

def listing_to_gbd(data):
    # zip pairs the two columns element-wise: zip(a, b) yields (a[0], b[0]), (a[1], b[1]), ...
    data['geometry'] = list(zip(data['longitude'], data['latitude']))
    data['geometry'] = data['geometry'].apply(Point)
    data = gpd.GeoDataFrame(data)  # convert to something geopandas understands
    return data

listing_gbd = listing_to_gbd(listing)
listing_gbd.head()

Count the listings in each district and sort from largest to smallest

neighbourhood_gpd_groupby = listing_gbd[['neighbourhood','id']].groupby('neighbourhood').count().sort_values(by='id',ascending=False).reset_index()

new_neighbourhood = neighbourhood_gpd.merge(neighbourhood_gpd_groupby,on='neighbourhood', how='left')

new_neighbourhood.head(5)

Plotting

base = new_neighbourhood.plot(column='id', cmap='Oranges', scheme='fisher_jenks', legend=True, edgecolor='white', figsize=(20, 20))  # base layer
listing_gbd.plot(ax=base, color='lightgreen', marker='o', markersize=65, alpha=0.05)  # overlay the listing locations on the base layer
plt.gca().xaxis.set_major_locator(plt.NullLocator())  # hide the x-axis ticks
plt.gca().yaxis.set_major_locator(plt.NullLocator())  # hide the y-axis ticks

def explode_situtation(data):
    # Pull out (explode) every slice whose count is above the mean
    explode = {}
    for i in range(len(data)):
        if data.iloc[i] > data.mean():
            explode[data.index[i]] = 0.1
        else:
            explode[data.index[i]] = 0
    return explode

data2 = listing.neighbourhood.value_counts()
explode = list(explode_situtation(data2).values())
label2 = data2.index.tolist()  # labels in the same order as the counts in data2

plt.figure(figsize=(20, 20))
plt.title('Share of listings by district', fontdict={'fontsize': 18})
plt.pie(data2, labels=label2, autopct='%.2f%%', explode=explode, startangle=90,
        counterclock=False, textprops={'fontsize': 12, 'color': 'black'},
        colors=sns.color_palette('hls', n_colors=18))
plt.legend(loc='best', shadow=True, fontsize=11)

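The exploration shows that more than 60% of the listings are concentrated in Chaoyang District, Miyun County, and Dongcheng District, three core areas of Beijing.

In the districts surrounding Chaoyang, Miyun, and Dongcheng, listings also cluster near the borders with those three, especially in Xicheng and Haidian.

In the remaining districts the listings are spread fairly evenly, with no obvious center.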
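A concentration figure like the one above can be checked directly from the counts instead of being read off the pie chart. A minimal sketch, where `top_share` is a hypothetical helper and the district labels are made up:

```python
import pandas as pd

def top_share(districts: pd.Series, n: int = 3) -> pd.Series:
    """Return the share of rows held by each of the n most frequent values."""
    return districts.value_counts(normalize=True).head(n)

# Made-up district labels, for illustration only
toy = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1)
shares = top_share(toy, n=2)
print(shares)
print("combined share:", shares.sum())
```

On the real data the call would be `top_share(listing['neighbourhood'])`, and the sum of the result is the combined share of the top three districts.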

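# Next, let's see whether room prices differ between districts.
# To keep outliers from distorting the picture, first look at the price distribution and clean it up sensibly.
sns.histplot(listing['price'], kde=True, color='b')

# a has one row per distinct (district, price) pair,
# so each histogram below shows the spread of distinct prices in a district
a = listing[['neighbourhood', 'price']].groupby(['neighbourhood', 'price']).count().reset_index()
for i in label2:
    plt.hist(a[a['neighbourhood'] == i].price)
    plt.xlabel(i)
    plt.show()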

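price_is_0 = listing[listing['price'] == 0]
price_is_0

# Some listings turn out to be test listings; drop those together with the zero-price listings
test_house = listing[listing.name.str.startswith('测试', na=False)]  # '测试' means "test"; na=False skips missing names
test_house

drop_index_list = price_is_0.index.tolist() + test_house.index.tolist()
# Assumed step: the source never defines listing_dealt, which is used from here on
listing_dealt = listing.drop(index=drop_index_list)
# Cleaning done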

# Build every statistic from the same groupby so each row stays aligned with its district
price_by_area = listing_dealt.groupby('neighbourhood')['price']
b = pd.DataFrame({
    'max_price': price_by_area.max(),
    'min_price': price_by_area.min(),
    'median_price': price_by_area.median(),
    'q25_price': price_by_area.quantile(0.25),
    'q75_price': price_by_area.quantile(0.75),
})
b['iqr'] = b['q75_price'] - b['q25_price']  # interquartile range
b = b.rename_axis('district').reset_index()

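From the analysis above, homestay prices are highest overall in Fangshan District (899 yuan, more than 200 yuan above the runner-up), followed by Shunyi (678 yuan) and then Haidian (597 yuan); prices in these three districts are also widely spread.

Chaoyang, Miyun, and Dongcheng, which have the most listings, mostly price between 300 and 600 yuan. To some extent this explains why they have so many listings: they are central and offer good value for money.

The gap between the highest and lowest price is large in every district, and in some districts the highest price exceeds 10,000 yuan. It is unclear whether these are outliers or genuinely expensive listings; a closer look at the larger dataset may be needed.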
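One common rule of thumb for deciding whether such a five-figure price is an outlier is Tukey's fence: anything above Q3 + 1.5 × IQR is flagged for a closer look. The sketch below uses only the standard library; `tukey_upper_fence` is a hypothetical helper and the prices are made up, so it illustrates the rule rather than any result from the dataset.

```python
from statistics import quantiles

def tukey_upper_fence(prices, k=1.5):
    """Return Q3 + k * IQR, the usual upper fence for flagging high outliers."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

# Made-up prices for one district, for illustration only
prices = [200, 300, 350, 400, 450, 600, 12000]
fence = tukey_upper_fence(prices)
suspects = [p for p in prices if p > fence]
print(fence, suspects)
```

A flagged price is only a candidate: a genuine luxury courtyard would be flagged just like a data-entry mistake, so the listing itself still has to be inspected.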

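The listings themselves

Top 10 most popular listings

For now, assume each review stands for one guest; a listing counts as popular when both its total number of reviews and its reviews per month are above the 90th percentile.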

avg_review = listing_dealt['number_of_reviews'].quantile(0.9)

avg_month_review = listing_dealt['reviews_per_month'].quantile(0.9)

popular_house = listing_dealt[(listing_dealt['number_of_reviews']>avg_review) & (listing_dealt['reviews_per_month']>avg_month_review)]

popular_house.sort_values(by=['number_of_reviews','reviews_per_month'],ascending=False).head(10)

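Most of the most popular listings are in Chaoyang (5), some are in Dongcheng (3), and the rest are in Xicheng (2). Prices span a wide range, from 87 to 899 yuan: the 87-yuan listing is a shared room and the 899-yuan one is an entire home.

Of the top 10, 90% are run by professional operators with more than one property; only one is individually run. This indirectly suggests that individually run listings have somewhat lower occupancy than professionally run ones, perhaps because of promotion and operations; the conversion rate at each step would need a closer look.

The top 10 consist of 4 entire homes, 4 private rooms, and 2 shared rooms, which is broadly consistent with the earlier room-type findings. Shared rooms also turn out to be fairly popular, so a tentative takeaway is that for someone starting a homestay, shared rooms might be a good choice, with less competition than entire homes and private rooms.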

What the most popular listings have in common

import jieba
from wordcloud import WordCloud

reviews_top90 = popular_house.sort_values(by=['number_of_reviews', 'reviews_per_month'], ascending=False)
reviews_top90.head()

def jieba_cut(data):
    # Segment each title with jieba, drop stop words, and count the remaining words
    a = []
    worddict = {}
    with open('chineseStopWords.txt', encoding='gbk') as f:
        result = f.read().split()
    for i in data:
        words = jieba.lcut(i)
        word = [x for x in words if x not in result]
        a.extend(word)
    for word in a:
        worddict.setdefault(word, 0)
        worddict[word] += 1
    return worddict

popular_words = jieba_cut(reviews_top90['name'].astype('str'))

# Clean up popular_words, e.g. empty strings and punctuation
def deal_with_meanless_word(data):
    mean_words = {}
    for i in data.keys():
        # Treat single-character tokens as meaningless; ideally, define a custom list of meaningless words instead
        if len(i) > 1:
            mean_words[i] = data[i]
    return mean_words

mean_words = deal_with_meanless_word(popular_words)
mean_words_df = pd.Series(mean_words).sort_values(ascending=False)
mean_words_df_top15 = mean_words_df.head(15)
mean_words_df_top15
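The comment in the code above suggests that a hand-made list of meaningless words would be better than a pure length filter. A minimal sketch of that idea with `collections.Counter`; `count_meaningful`, the token list, and the extra stop word are all hypothetical. The token 'nan' is included because converting a missing title with astype('str') can produce that string.

```python
from collections import Counter

def count_meaningful(tokens, extra_stop_words=frozenset(), min_len=2):
    """Count tokens, skipping short ones and anything in extra_stop_words."""
    return Counter(
        t for t in tokens
        if len(t) >= min_len and t not in extra_stop_words
    )

# Hypothetical, already-segmented tokens; the custom stop list is made up too
tokens = ["地铁", "近", "地铁", "公寓", "的", "整套", "公寓", "地铁", "nan"]
counts = count_meaningful(tokens, extra_stop_words={"nan"})
print(counts.most_common(2))
```

`Counter.most_common(15)` would then give the same kind of top-15 list without building a Series first.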

plt.figure(figsize=(15, 8))
plt.title('Keywords in the titles of the most popular listings')
mean_words_df_top15.plot(kind='bar', ylim=[0, 300])

wordcloud_use = ' '.join(mean_words.keys())
# Strip Latin letters, digits, and punctuation before building the word cloud
resultword = re.sub(r"[A-Za-z0-9\[\`\~\!\@\#\$\^\&\*\=\|\{\}\'\:\;\'\,\[\]\.\<\>\/\?\~\。\@\#\\\&\*\%]", "", wordcloud_use)
w = WordCloud(scale=4, background_color='white', font_path='msyhbd.ttf',
              max_words=100, max_font_size=60, random_state=20).generate(resultword)
plt.imshow(w)

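Looking at the keywords, people are actually pretty practical: the top ten are almost entirely about location and transport. Beijing traffic does jam, after all, so staying next to a subway station never hurts.

As for style, courtyard (庭院), Zen (禅意), and hidden retreat (隐庐) seem to be what people associate with Beijing. But those few "Japanese-style" (日式) friends over there, that one I don't quite get.

And in a screen full of keywords, the one I spotted at a glance was "cat" (猫咪). Good heavens, what kind of heavenly place comes with a cat included! I'm so jealous!!!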

From the host's perspective

listing_dealt.calculated_host_listings_count.describe()

# Bucket each listing by how many listings its host runs
listing_dealt.loc[listing_dealt['calculated_host_listings_count'] <= 5, 'renter_type'] = 'individual'
listing_dealt.loc[(listing_dealt['calculated_host_listings_count'] <= 20) & (listing_dealt['calculated_host_listings_count'] > 5), 'renter_type'] = 'small company'
listing_dealt.loc[(listing_dealt['calculated_host_listings_count'] <= 100) & (listing_dealt['calculated_host_listings_count'] > 20), 'renter_type'] = 'medium company'
listing_dealt.loc[listing_dealt['calculated_host_listings_count'] > 100, 'renter_type'] = 'large company'

renter_type = listing_dealt.groupby('renter_type').count().id.sort_values(ascending=False)
renter_type

plt.figure(figsize=(15, 8))
plt.bar(x=renter_type.index, height=renter_type.values)

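From the chart above we can conclude that most listings are still run by individuals, meaning hosts with no more than 5 listings; next come small companies with 5 to 20 listings, then medium companies with 20 to 100, and only a tiny fraction belong to large companies with more than 100.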
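The same four buckets can also be expressed in one call with `pd.cut`, which keeps the thresholds in a single place. A sketch, assuming English bucket labels; `host_type`, `BINS`, and `LABELS` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Right-closed bins: (0, 5], (5, 20], (20, 100], (100, inf)
BINS = [0, 5, 20, 100, np.inf]
LABELS = ["individual", "small company", "medium company", "large company"]

def host_type(listing_counts: pd.Series) -> pd.Series:
    """Map a host's listing count to one of the four size buckets."""
    return pd.cut(listing_counts, bins=BINS, labels=LABELS)

counts = pd.Series([1, 5, 6, 20, 21, 100, 101])
types = host_type(counts)
print(types.tolist())
```

Because `pd.cut` closes each bin on the right by default, a host with exactly 5, 20, or 100 listings lands in the lower bucket, matching the `<=` comparisons used in the analysis.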

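Host income

# Average price multiplied by the average number of reviews per month
avg_price = listing_dealt[['renter_type', 'price']].groupby('renter_type').mean().reset_index()
avg_rent = listing_dealt[['renter_type', 'reviews_per_month']].groupby('renter_type').mean().reset_index()
revenue = pd.merge(avg_price, avg_rent, on='renter_type', how='left')
revenue['avg_revenue'] = revenue.price * revenue.reviews_per_month
revenue

# Individuals and small companies earn more per listing, and large companies earn the least. Extreme values play a part, of course: the listings under a large company earn very different amounts, and those differences get averaged out.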
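Multiplying two group averages is not the same as averaging price × reviews per listing, and a mean is sensitive to the extreme values mentioned above. As an alternative sketch (not the author's method), the proxy can be computed per listing first and then summarized with both mean and median. `revenue_proxy` and the numbers below are made up for illustration.

```python
import pandas as pd

def revenue_proxy(df: pd.DataFrame) -> pd.DataFrame:
    """Per host type: mean and median of price * reviews_per_month, computed per listing."""
    per_listing = df.assign(est_revenue=df["price"] * df["reviews_per_month"])
    return per_listing.groupby("renter_type")["est_revenue"].agg(["mean", "median"])

# Made-up numbers, for illustration only
toy = pd.DataFrame({
    "renter_type": ["individual", "individual",
                    "large company", "large company", "large company"],
    "price": [400, 600, 300, 300, 9000],
    "reviews_per_month": [2.0, 1.0, 3.0, 1.0, 0.5],
})
summary = revenue_proxy(toy)
print(summary)
```

In the toy data a single 9,000-yuan listing pulls the large-company mean far above its median, which is the kind of distortion the median is meant to expose.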

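Summary

Ummm, a place that comes with a cat is great and all, but my wallet is empty, so in the end I will probably still pick somewhere cheap with a clean toilet. Sob, QAQ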
