Python Data Analysis: Beijing Housing Price Analysis
#!/usr/bin/env python
# coding: utf-8
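# Goal of the analysis: understand Beijing housing prices in recent years as guidance for buying a home
# Per district: number of listings, mean area, mean price
# Per district: mean total price, with/without subway
# Per district, with subway: mean price with/without elevator
# 2017, 2-bedroom 1-living-room 1-kitchen 1-bathroom listings: mean price per district, with/without elevator, with/without subway
# Daily price trend: mean unit price of all listings traded on each day
# 2017: share of listings with total price 2-4 million yuan and unit price 40,000-70,000 yuan
# Import the libraries used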
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the data file
# df = pd.read_csv('./beijing_houst_price.csv')
# Warning: DtypeWarning: Columns (0,6,7,9) have mixed types. Specify dtype option on import or set low_memory=False.
# Reading the mixed-type columns as strings avoids the warning
df = pd.read_csv('./beijing_houst_price.csv', dtype={'id':'str','tradeTime':'str', 'livingRoom':'str', 'drawingRoom':'str', 'bathRoom':'str'})
# Take a quick look at the columns
df.head()
| | id | tradeTime | followers | totalPrice | price | square | livingRoom | drawingRoom | kitchen | bathRoom | floor | buildingType | buildingStructure | ladderRatio | elevator | fiveYearsProperty | subway | district | communityAverage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 101084782030 | 2016-08-09 | 106 | 415.0 | 31680 | 131.00 | 2 | 1 | 1 | 1 | 高 26 | 1.0 | 6 | 0.217 | 1.0 | 0.0 | 1.0 | 7 | 56021.0 |
1 | 101086012217 | 2016-07-28 | 126 | 575.0 | 43436 | 132.38 | 2 | 2 | 1 | 2 | 高 22 | 1.0 | 6 | 0.667 | 1.0 | 1.0 | 0.0 | 7 | 71539.0 |
2 | 101086041636 | 2016-12-11 | 48 | 1030.0 | 52021 | 198.00 | 3 | 2 | 1 | 3 | 中 4 | 4.0 | 6 | 0.500 | 1.0 | 0.0 | 0.0 | 7 | 48160.0 |
3 | 101086406841 | 2016-09-30 | 138 | 297.5 | 22202 | 134.00 | 3 | 1 | 1 | 1 | 底 21 | 1.0 | 6 | 0.273 | 1.0 | 0.0 | 0.0 | 6 | 51238.0 |
4 | 101086920653 | 2016-08-28 | 286 | 392.0 | 48396 | 81.00 | 2 | 1 | 1 | 1 | 中 6 | 4.0 | 2 | 0.333 | 0.0 | 1.0 | 1.0 | 1 | 62588.0 |
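As a minimal sketch of why the `dtype` argument matters, the snippet below reads a tiny in-memory CSV instead of the real file. The three sample rows and the `unknown` token are hypothetical, chosen only to show a column that mixes numeric-looking and non-numeric values.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has 318,851 rows.
csv_text = """id,tradeTime,livingRoom,totalPrice
101084782030,2016-08-09,2,415.0
BJXC91739524,2016-03-12,1,155.0
101086041636,2016-12-11,unknown,1030.0
"""

# Force the mixed-type columns to strings so pandas does not have to guess.
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "str", "tradeTime": "str", "livingRoom": "str"},
)
print(sample.dtypes)
```

Columns not listed in `dtype`, such as `totalPrice` here, are still inferred automatically.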
# Check the number of columns and their dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318851 entries, 0 to 318850
Data columns (total 19 columns):
id 318851 non-null object
tradeTime 318851 non-null object
followers 318851 non-null int64
totalPrice 318851 non-null float64
price 318851 non-null int64
square 318851 non-null float64
livingRoom 318851 non-null object
drawingRoom 318851 non-null object
kitchen 318851 non-null int64
bathRoom 318851 non-null object
floor 318851 non-null object
buildingType 316830 non-null float64
buildingStructure 318851 non-null int64
ladderRatio 318851 non-null float64
elevator 318819 non-null float64
fiveYearsProperty 318819 non-null float64
subway 318819 non-null float64
district 318851 non-null int64
communityAverage 318388 non-null float64
dtypes: float64(8), int64(5), object(6)
memory usage: 46.2+ MB
# Summary statistics for the numeric columns
df.describe()
| | followers | totalPrice | price | square | kitchen | buildingType | buildingStructure | ladderRatio | elevator | fiveYearsProperty | subway | district | communityAverage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 318851.000000 | 318851.000000 | 318851.000000 | 318851.000000 | 318851.000000 | 316830.000000 | 318851.000000 | 3.188510e+05 | 318819.000000 | 318819.000000 | 318819.000000 | 318851.000000 | 318388.000000 |
mean | 16.731508 | 349.030201 | 43530.436379 | 83.240597 | 0.994599 | 3.009790 | 4.451026 | 6.316486e+01 | 0.577055 | 0.645601 | 0.601112 | 6.763564 | 63682.446305 |
std | 34.209185 | 230.780778 | 21709.024204 | 37.234661 | 0.109609 | 1.269857 | 1.901753 | 2.506851e+04 | 0.494028 | 0.478331 | 0.489670 | 2.812616 | 22329.215447 |
min | 0.000000 | 0.100000 | 1.000000 | 6.900000 | 0.000000 | 0.048000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 10847.000000 |
25% | 0.000000 | 205.000000 | 28050.000000 | 57.900000 | 1.000000 | 1.000000 | 2.000000 | 2.500000e-01 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 46339.000000 |
50% | 5.000000 | 294.000000 | 38737.000000 | 74.260000 | 1.000000 | 4.000000 | 6.000000 | 3.330000e-01 | 1.000000 | 1.000000 | 1.000000 | 7.000000 | 59015.000000 |
75% | 18.000000 | 425.500000 | 53819.500000 | 98.710000 | 1.000000 | 4.000000 | 6.000000 | 5.000000e-01 | 1.000000 | 1.000000 | 1.000000 | 8.000000 | 75950.000000 |
max | 1143.000000 | 18130.000000 | 156250.000000 | 1745.500000 | 4.000000 | 4.000000 | 6.000000 | 1.000940e+07 | 1.000000 | 1.000000 | 1.000000 | 13.000000 | 183109.000000 |
# Count the non-null values in each column
df.count()
id 318851
tradeTime 318851
followers 318851
totalPrice 318851
price 318851
square 318851
livingRoom 318851
drawingRoom 318851
kitchen 318851
bathRoom 318851
floor 318851
buildingType 316830
buildingStructure 318851
ladderRatio 318851
elevator 318819
fiveYearsProperty 318819
subway 318819
district 318851
communityAverage 318388
dtype: int64
# Begin data cleaning
# Check for fully duplicated rows
df[df.duplicated()]
# --> no fully duplicated rows
id | tradeTime | followers | totalPrice | price | square | livingRoom | drawingRoom | kitchen | bathRoom | floor | buildingType | buildingStructure | ladderRatio | elevator | fiveYearsProperty | subway | district | communityAverage |
---|
# Check whether the id column contains duplicates
df[df['id'].duplicated()]
# --> no duplicated ids
id | tradeTime | followers | totalPrice | price | square | livingRoom | drawingRoom | kitchen | bathRoom | floor | buildingType | buildingStructure | ladderRatio | elevator | fiveYearsProperty | subway | district | communityAverage |
---|
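Displaying an empty frame works, but a count is easier to read at a glance. The sketch below uses a hypothetical toy frame (not the real data) to show both checks as numbers, plus `drop_duplicates` for the case where repeated ids do exist.

```python
import pandas as pd

# Hypothetical toy data with one fully repeated row and one id reused at a new price.
toy = pd.DataFrame({
    "id": ["a1", "a2", "a2", "a3", "a3"],
    "price": [100, 200, 200, 300, 310],
})

full_dupes = int(toy.duplicated().sum())      # rows identical to an earlier row in every column
id_dupes = int(toy["id"].duplicated().sum())  # rows whose id already appeared earlier
print(full_dupes, id_dupes)

# Keep the first occurrence of each id.
deduped = toy.drop_duplicates(subset="id", keep="first")
```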
# Keep only the columns needed for the analysis goals
# 'id', 'tradeTime', 'totalPrice', 'price', 'square', 'livingRoom', 'drawingRoom', 'kitchen', 'bathRoom', 'floor', 'elevator', 'subway','district', 'communityAverage'
df = df[['id', 'tradeTime', 'totalPrice', 'price', 'square', 'livingRoom', 'drawingRoom', 'kitchen', 'bathRoom', 'floor', 'elevator', 'subway','district', 'communityAverage']].copy() # copy so later column assignments do not act on a slice of the original frame
# Inspect the tradeTime column
df['tradeTime'].value_counts()
2016-02-28 1096
2016-03-06 948
2016-07-31 940
2016-08-31 910
2016-03-05 824
...
2011-02-18 1
2010-08-13 1
2010-11-27 1
2010-01-15 1
2010-03-09 1
Name: tradeTime, Length: 2560, dtype: int64
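# tradeTime spans a long period: very old records have little reference value, and some periods contain too few records to be representative
# The tradeTime column therefore needs cleaning; first convert it to datetime
df['tradeTime'] = pd.to_datetime(df['tradeTime'])
# Check the dtypes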
df.dtypes
id object
tradeTime datetime64[ns]
totalPrice float64
price int64
square float64
livingRoom object
drawingRoom object
kitchen int64
bathRoom object
floor object
elevator float64
subway float64
district int64
communityAverage float64
dtype: object
# Count the records per year
df['year'] = df['tradeTime'].dt.year
df['year'].value_counts()
# 2002, 2003, 2008, 2009, 2010, and 2018 have very few records
2016 90829
2015 69805
2017 43217
2013 38751
2012 37221
2014 32602
2011 6010
2018 221
2010 189
2002 3
2009 1
2008 1
2003 1
Name: year, dtype: int64
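# Drop sparse and outdated years; keep 2013-2017 only
df.drop(df[df['year'] < 2013].index, inplace = True)
df.drop(df[df['year'] > 2017].index, inplace = True)
# Drop listings with totalPrice below 1 million yuan (totalPrice is in units of 10,000 yuan) --> remote location or very small area
df.drop(df[df['totalPrice'] < 100].index, inplace = True)
# Check the data again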
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 272923 entries, 0 to 318850
Data columns (total 15 columns):
id 272923 non-null object
tradeTime 272923 non-null datetime64[ns]
totalPrice 272923 non-null float64
price 272923 non-null int64
square 272923 non-null float64
livingRoom 272923 non-null object
drawingRoom 272923 non-null object
kitchen 272923 non-null int64
bathRoom 272923 non-null object
floor 272923 non-null object
elevator 272917 non-null float64
subway 272917 non-null float64
district 272923 non-null int64
communityAverage 272558 non-null float64
year 272923 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(4), object(5)
memory usage: 33.3+ MB
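The year and price filters can also be written as a single boolean mask instead of successive in-place drops. This is a sketch on a hypothetical five-row frame; the thresholds mirror the ones used in the script (2013 to 2017, `totalPrice` of at least 100).

```python
import pandas as pd

# Hypothetical rows covering the edge cases: too old, too cheap, kept, kept, too recent.
toy = pd.DataFrame({
    "tradeTime": pd.to_datetime(
        ["2010-03-09", "2013-05-01", "2016-08-09", "2017-12-31", "2018-01-02"]
    ),
    "totalPrice": [80.0, 95.0, 415.0, 575.0, 300.0],
})
toy["year"] = toy["tradeTime"].dt.year

# between() is inclusive on both ends by default.
mask = toy["year"].between(2013, 2017) & (toy["totalPrice"] >= 100)
kept = toy[mask].copy()
print(kept)
```

A mask keeps the original frame untouched, which makes it easier to try different thresholds without reloading the data.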
# Do the elevator and subway columns contain nulls?
print(df['elevator'].isnull(), df['subway'].isnull())
0 False
1 False
2 False
3 False
4 False
...
318846 False
318847 False
318848 False
318849 False
318850 False
Name: elevator, Length: 272923, dtype: bool 0 False
1 False
2 False
3 False
4 False
...
318846 False
318847 False
318848 False
318849 False
318850 False
Name: subway, Length: 272923, dtype: bool
# Count the values in elevator and subway, including NaN
print(df['elevator'].value_counts(dropna = False))
print(df['subway'].value_counts(dropna = False))
1.0 157827
0.0 115090
NaN 6
Name: elevator, dtype: int64
1.0 164183
0.0 108734
NaN 6
Name: subway, dtype: int64
# Mark the NaN entries with a sentinel string so they can be located and dropped
df['elevator'] = df['elevator'].fillna('ABCNAN')
df['subway'] = df['subway'].fillna('ABCNAN')
# Check the data
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 272923 entries, 0 to 318850
Data columns (total 15 columns):
id 272923 non-null object
tradeTime 272923 non-null datetime64[ns]
totalPrice 272923 non-null float64
price 272923 non-null int64
square 272923 non-null float64
livingRoom 272923 non-null object
drawingRoom 272923 non-null object
kitchen 272923 non-null int64
bathRoom 272923 non-null object
floor 272923 non-null object
elevator 272923 non-null object
subway 272923 non-null object
district 272923 non-null int64
communityAverage 272558 non-null float64
year 272923 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(7)
memory usage: 33.3+ MB
# Drop the rows flagged with the sentinel in elevator or subway
df.drop(df[df['elevator'] == 'ABCNAN'].index, inplace = True)
df.drop(df[df['subway'] == 'ABCNAN'].index, inplace = True)
# Check the data
df.info()
# communityAverage still has some missing values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 272917 entries, 0 to 318850
Data columns (total 15 columns):
id 272917 non-null object
tradeTime 272917 non-null datetime64[ns]
totalPrice 272917 non-null float64
price 272917 non-null int64
square 272917 non-null float64
livingRoom 272917 non-null object
drawingRoom 272917 non-null object
kitchen 272917 non-null int64
bathRoom 272917 non-null object
floor 272917 non-null object
elevator 272917 non-null object
subway 272917 non-null object
district 272917 non-null int64
communityAverage 272552 non-null float64
year 272917 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(7)
memory usage: 33.3+ MB
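Filling with a sentinel and then dropping it turns `elevator` and `subway` into object columns, as the `info()` output shows. An alternative, sketched here on hypothetical toy data rather than the author's frame, is `dropna` with `subset`, which removes the same kind of rows and leaves the dtype as float.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 have a missing elevator or subway flag.
toy = pd.DataFrame({
    "elevator": [1.0, 0.0, np.nan, 1.0],
    "subway": [1.0, np.nan, np.nan, 0.0],
    "totalPrice": [415.0, 575.0, 297.5, 392.0],
})

# Drop a row if either of the two flag columns is missing.
cleaned = toy.dropna(subset=["elevator", "subway"])
print(cleaned.dtypes)
```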
# communityAverage
df[df['communityAverage'].isnull()] # show the rows where communityAverage is missing
| | id | tradeTime | totalPrice | price | square | livingRoom | drawingRoom | kitchen | bathRoom | floor | elevator | subway | district | communityAverage | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2027 | 101091727692 | 2016-12-05 | 1255.0 | 139290 | 90.10 | 4 | 0 | 0 | 0 | 底 1 | 0 | 1 | 10 | NaN | 2016 |
3902 | 101091913830 | 2016-06-20 | 238.0 | 51830 | 45.92 | 1 | 1 | 1 | 1 | 高 6 | 0 | 1 | 7 | NaN | 2016 |
4982 | 101092003852 | 2016-06-28 | 291.0 | 41195 | 70.64 | 1 | 1 | 1 | 1 | 高 11 | 1 | 1 | 7 | NaN | 2016 |
5809 | 101092065365 | 2016-09-30 | 176.0 | 110000 | 16.00 | 1 | 0 | 0 | 0 | 底 1 | 0 | 1 | 1 | NaN | 2016 |
6088 | 101092088297 | 2016-07-11 | 382.0 | 39024 | 97.89 | 2 | 2 | 1 | 1 | 中 28 | 1 | 1 | 7 | NaN | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
316175 | BJXC91739524 | 2016-03-12 | 155.0 | 115586 | 13.41 | 1 | 0 | 0 | 0 | 底 1 | 0 | 1 | 10 | NaN | 2016 |
317054 | BJXC92150717 | 2016-05-23 | 214.0 | 149442 | 14.32 | 1 | 0 | 0 | 0 | 底 1 | 0 | 0 | 10 | NaN | 2016 |
317133 | BJXC92215207 | 2016-05-22 | 227.0 | 145981 | 15.55 | 1 | 0 | 0 | 0 | 底 1 | 0 | 1 | 10 | NaN | 2016 |
317186 | BJXC92255534 | 2016-05-25 | 191.8 | 49987 | 38.36 | 1 | 1 | 1 | 1 | 底 1 | 0 | 1 | 10 | NaN | 2016 |
317217 | BJXC92289286 | 2016-06-05 | 180.0 | 102390 | 17.58 | 1 | 0 | 0 | 0 | 底 1 | 0 | 1 | 10 | NaN | 2016 |
365 rows × 15 columns
# Fill the missing communityAverage values with the column mean
df['communityAverage'] = df['communityAverage'].fillna(df['communityAverage'].mean())
# Check the data
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 272917 entries, 0 to 318850
Data columns (total 15 columns):
id 272917 non-null object
tradeTime 272917 non-null datetime64[ns]
totalPrice 272917 non-null float64
price 272917 non-null int64
square 272917 non-null float64
livingRoom 272917 non-null object
drawingRoom 272917 non-null object
kitchen 272917 non-null int64
bathRoom 272917 non-null object
floor 272917 non-null object
elevator 272917 non-null object
subway 272917 non-null object
district 272917 non-null int64
communityAverage 272917 non-null float64
year 272917 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(7)
memory usage: 33.3+ MB
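The global mean ignores the large price gap between districts. A possible refinement, which is a different technique from the one used in the script, is to fill each gap with the mean of its own district. The sketch below assumes hypothetical toy values and writes the result to a new column, `communityAverage_filled`, purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in each of two districts.
toy = pd.DataFrame({
    "district": [7, 7, 7, 10, 10],
    "communityAverage": [60000.0, 70000.0, np.nan, 100000.0, np.nan],
})

# transform("mean") returns one value per row: the mean of that row's district.
district_mean = toy.groupby("district")["communityAverage"].transform("mean")
toy["communityAverage_filled"] = toy["communityAverage"].fillna(district_mean)
print(toy)
```

Whether this is preferable depends on how many districts have enough non-missing values to give a meaningful mean.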
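# Rebuild the index
# Dropping rows leaves the original row labels in place; reset_index() generates a continuous index
# (the old labels are kept as an 'index' column, which is why 'index' appears in the describe() output)
df = df.reset_index()
# Data cleaning is done; start the analysis
# Summary statistics
df['year'] = df['year'].astype('str') # cast to string so describe() does not compute statistics on the year
df.describe()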
| | index | totalPrice | price | square | kitchen | district | communityAverage |
---|---|---|---|---|---|---|---|
count | 272917.000000 | 272917.00000 | 272917.000000 | 272917.000000 | 272917.000000 | 272917.000000 | 272917.000000 |
mean | 154336.376825 | 374.59492 | 46617.560471 | 83.670028 | 0.996325 | 6.738968 | 63832.871276 |
std | 94013.407724 | 235.14487 | 21598.793862 | 37.584340 | 0.095351 | 2.798208 | 22298.588824 |
min | 0.000000 | 100.00000 | 4335.000000 | 6.900000 | 0.000000 | 1.000000 | 10847.000000 |
25% | 68616.000000 | 227.00000 | 31075.000000 | 58.060000 | 1.000000 | 6.000000 | 46505.000000 |
50% | 157116.000000 | 316.00000 | 41700.000000 | 74.580000 | 1.000000 | 7.000000 | 59179.000000 |
75% | 234415.000000 | 450.00000 | 57072.000000 | 98.850000 | 1.000000 | 8.000000 | 76223.000000 |
max | 318850.000000 | 18130.00000 | 156250.000000 | 1745.500000 | 4.000000 | 13.000000 | 183109.000000 |
# Per district: number of listings, mean area, mean price
df_dis = df.groupby('district', as_index = False)
df_dis_count = df_dis.count()[['district','id']]
df_dis_count.rename(columns={'id':'num'},inplace = True) # number of listings per district
df_dis_mean_square = df_dis['square'].mean() # mean listing area per district
df_dis_mean_comm = df_dis['communityAverage'].mean() # mean price per district (mean of communityAverage)
df_dis_info = pd.merge(df_dis_count, pd.merge(df_dis_mean_square, df_dis_mean_comm, on = 'district'), on = 'district')
df_dis_info.sort_values('num', ascending = False, inplace = True) # sort the combined table by listing count, descending
df_dis_info
| | district | num | square | communityAverage |
---|---|---|---|---|
6 | 7 | 92720 | 84.822103 | 63003.715434 |
5 | 6 | 33140 | 101.336912 | 43109.573240 |
7 | 8 | 32376 | 79.900437 | 79591.773777 |
9 | 10 | 26899 | 67.598403 | 101684.515659 |
1 | 2 | 24864 | 78.545938 | 54975.301568 |
0 | 1 | 14998 | 72.598140 | 89713.498843 |
3 | 4 | 13062 | 87.911008 | 44792.826757 |
10 | 11 | 11715 | 86.374318 | 44031.363518 |
8 | 9 | 9487 | 75.088564 | 50449.493201 |
12 | 13 | 7369 | 96.919533 | 39418.455028 |
4 | 5 | 2847 | 90.847980 | 36074.011195 |
2 | 3 | 2137 | 97.458615 | 48023.558727 |
11 | 12 | 1303 | 85.445825 | 39053.454139 |
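The same per-district summary can be produced in one pass with named aggregation, which avoids the two merges. This is a sketch on a hypothetical five-row frame; the output column names follow the table above.

```python
import pandas as pd

# Hypothetical toy data for two districts.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "district": [7, 7, 7, 10, 10],
    "square": [80.0, 90.0, 100.0, 60.0, 70.0],
    "communityAverage": [60000.0, 62000.0, 64000.0, 100000.0, 104000.0],
})

# Each keyword is an output column: (source column, aggregation function).
summary = (
    toy.groupby("district", as_index=False)
       .agg(
           num=("id", "count"),
           square=("square", "mean"),
           communityAverage=("communityAverage", "mean"),
       )
       .sort_values("num", ascending=False)
)
print(summary)
```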
df_dis.head()
| | index | id | tradeTime | totalPrice | price | square | livingRoom | drawingRoom | kitchen | bathRoom | floor | elevator | subway | district | communityAverage | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 101084782030 | 2016-08-09 | 415.0 | 31680 | 131.00 | 2 | 1 | 1 | 1 | 高 26 | 1 | 1 | 7 | 56021.0 | 2016 |
1 | 1 | 101086012217 | 2016-07-28 | 575.0 | 43436 | 132.38 | 2 | 2 | 1 | 2 | 高 22 | 1 | 0 | 7 | 71539.0 | 2016 |
2 | 2 | 101086041636 | 2016-12-11 | 1030.0 | 52021 | 198.00 | 3 | 2 | 1 | 3 | 中 4 | 1 | 0 | 7 | 48160.0 | 2016 |
3 | 3 | 101086406841 | 2016-09-30 | 297.5 | 22202 | 134.00 | 3 | 1 | 1 | 1 | 底 21 | 1 | 0 | 6 | 51238.0 | 2016 |
4 | 4 | 101086920653 | 2016-08-28 | 392.0 | 48396 | 81.00 | 2 | 1 | 1 | 1 | 中 6 | 0 | 1 | 1 | 62588.0 | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
291 | 295 | 101090284214 | 2016-07-29 | 355.0 | 32569 | 109.00 | 3 | 2 | 1 | 1 | 中 22 | 1 | 0 | 9 | 47095.0 | 2016 |
397 | 403 | 101090681279 | 2016-07-27 | 395.0 | 29351 | 134.58 | 3 | 2 | 1 | 2 | 底 6 | 0 | 0 | 11 | 40026.0 | 2016 |
472 | 480 | 101090865126 | 2016-09-12 | 290.0 | 43524 | 66.63 | 1 | 2 | 1 | 1 | 底 9 | 1 | 1 | 11 | 39787.0 | 2016 |
578 | 587 | 101091076817 | 2016-08-31 | 410.0 | 33607 | 122.00 | 3 | 2 | 1 | 2 | 高 7 | 0 | 1 | 11 | 42790.0 | 2016 |
580 | 589 | 101091085930 | 2016-06-23 | 395.0 | 36336 | 108.71 | 2 | 2 | 1 | 2 | 低 12 | 1 | 0 | 11 | 43196.0 | 2016 |
65 rows × 16 columns
# Mean total price per district, with/without subway (assuming subway == 1 means the listing has subway access)
df_dis_sub = df[['district', 'subway','totalPrice']]
df_dis_sub = df_dis_sub.groupby(['district', 'subway']).mean()
print(df_dis_sub)
# df_dis_sub_1 = df_dis[df_dis['subway'] == 1]
# df_dis_sub_0 = df_dis[df_dis['subway'] == 0]
# df_dis_sub_0
                 totalPrice
district subway
1        0.0     469.878265
         1.0     465.033473
2        0.0     322.250488
         1.0     315.975104
3        0.0     372.588384
         1.0     257.979536
4        0.0     277.002999
         1.0     281.420831
5        0.0     238.821941
         1.0     281.531037
6        0.0     296.897313
         1.0     312.152069
7        0.0     375.204474
         1.0     401.876686
8        0.0     456.372647
         1.0     463.602992
9        0.0     286.795664
         1.0     291.473189
10       0.0     469.243572
         1.0     486.890207
11       0.0     242.557678
         1.0     264.659448
12       0.0     250.708532
         1.0     426.000000
13       0.0     257.419935
         1.0     231.075452
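To compare the two subway groups side by side, the grouped means can be pivoted with `unstack` so that each district becomes one row. The sketch below uses hypothetical toy prices; the `premium` column name is an illustrative assumption, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data for two districts.
toy = pd.DataFrame({
    "district": [1, 1, 1, 3, 3],
    "subway": [0.0, 1.0, 1.0, 0.0, 1.0],
    "totalPrice": [470.0, 460.0, 470.0, 372.0, 258.0],
})

# Rows: district; columns: subway flag (0.0 / 1.0); values: mean total price.
wide = toy.groupby(["district", "subway"])["totalPrice"].mean().unstack("subway")

# Difference between the with-subway and without-subway means.
wide["premium"] = wide[1.0] - wide[0.0]
print(wide)
```

A negative difference, as in several districts of the real output, suggests that subway access alone does not explain total price; area and location within the district likely matter too.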
# Per district, listings with subway access only: mean total price with/without elevator
df_dis_sub_01 = df[['district', 'subway', 'elevator', 'totalPrice']]
df_dis_sub_1 = df_dis_sub_01[df_dis_sub_01['subway'] == 1]
df_dis_sub_1 = df_dis_sub_1.groupby(['district', 'elevator'], as_index = False)['totalPrice'].mean()
df_dis_sub_1.rename(columns = {'totalPrice':'totalPrice_mean'}, inplace = True)
print(df_dis_sub_1)
district elevator totalPrice_mean
0 1 0.0 415.038504
1 1 1.0 500.240515
2 2 0.0 267.338108
3 2 1.0 334.515961
4 3 0.0 518.142857
5 3 1.0 236.086011
6 4 0.0 239.914759
7 4 1.0 335.179934
8 5 0.0 258.421368
9 5 1.0 287.453998
10 6 0.0 308.621994
11 6 1.0 316.539542
12 7 0.0 302.067409
13 7 1.0 441.639936
14 8 0.0 409.848486
15 8 1.0 512.433305
16 9 0.0 222.063187
17 9 1.0 345.343639
18 10 0.0 461.317588
19 10 1.0 510.121250
20 11 0.0 253.602230
21 11 1.0 268.960810
22 12 1.0 426.000000
23 13 0.0 202.840370
24 13 1.0 281.763431
# 2017, 2-bedroom / 1-living-room / 1-kitchen / 1-bathroom listings: mean total price per district, with/without elevator and with/without subway
df_dis_want = df[['id', 'district','livingRoom', 'drawingRoom', 'kitchen', 'bathRoom', 'subway', 'elevator', 'totalPrice','year']]
print(df_dis_want.info())
df_dis_w = df_dis_want[(df_dis_want['year'] == '2017') & (df_dis_want['livingRoom'] == '2') & (df_dis_want['drawingRoom'] == '1') & (df_dis_want['kitchen'] == 1) & (df_dis_want['bathRoom'] == '1')]
# Note: each condition must match its column's dtype (quoted strings vs. bare integers); this inconsistency is arguably an oversight of the cleaning stage
df_dis_w = df_dis_w.groupby(['district', 'elevator', 'subway'], as_index = False)[['kitchen', 'totalPrice']].mean()
df_dis_w.rename(columns = {'totalPrice':'totalPrice_mean'}, inplace = True)
print(df_dis_w)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272917 entries, 0 to 272916
Data columns (total 10 columns):
id 272917 non-null object
district 272917 non-null int64
livingRoom 272917 non-null object
drawingRoom 272917 non-null object
kitchen 272917 non-null int64
bathRoom 272917 non-null object
subway 272917 non-null object
elevator 272917 non-null object
totalPrice 272917 non-null float64
year 272917 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 20.8+ MB
None
| | district | elevator | subway | kitchen | totalPrice_mean |
---|---|---|---|---|---|
0 | 1 | 0.0 | 0.0 | 1 | 493.103448 |
1 | 1 | 0.0 | 1.0 | 1 | 597.487764 |
2 | 1 | 1.0 | 0.0 | 1 | 656.100000 |
3 | 1 | 1.0 | 1.0 | 1 | 700.313229 |
4 | 2 | 0.0 | 0.0 | 1 | 355.323571 |
5 | 2 | 0.0 | 1.0 | 1 | 381.905593 |
6 | 2 | 1.0 | 0.0 | 1 | 495.709938 |
7 | 2 | 1.0 | 1.0 | 1 | 466.610455 |
8 | 3 | 0.0 | 0.0 | 1 | 358.381250 |
9 | 3 | 0.0 | 1.0 | 1 | 541.500000 |
10 | 3 | 1.0 | 0.0 | 1 | 457.250000 |
11 | 3 | 1.0 | 1.0 | 1 | 448.420000 |
12 | 4 | 0.0 | 0.0 | 1 | 311.772622 |
13 | 4 | 0.0 | 1.0 | 1 | 305.311983 |
14 | 4 | 1.0 | 0.0 | 1 | 439.410204 |
15 | 4 | 1.0 | 1.0 | 1 | 412.669841 |
16 | 5 | 0.0 | 0.0 | 1 | 256.658491 |
17 | 5 | 0.0 | 1.0 | 1 | 316.978571 |
18 | 5 | 1.0 | 0.0 | 1 | 317.132948 |
19 | 5 | 1.0 | 1.0 | 1 | 359.614839 |
20 | 6 | 0.0 | 0.0 | 1 | 352.025395 |
21 | 6 | 0.0 | 1.0 | 1 | 395.474759 |
22 | 6 | 1.0 | 0.0 | 1 | 420.451366 |
23 | 6 | 1.0 | 1.0 | 1 | 439.371875 |
24 | 7 | 0.0 | 0.0 | 1 | 364.164934 |
25 | 7 | 0.0 | 1.0 | 1 | 409.597200 |
26 | 7 | 1.0 | 0.0 | 1 | 554.104437 |
27 | 7 | 1.0 | 1.0 | 1 | 536.223223 |
28 | 8 | 0.0 | 0.0 | 1 | 503.982394 |
29 | 8 | 0.0 | 1.0 | 1 | 532.799109 |
30 | 8 | 1.0 | 0.0 | 1 | 621.156806 |
31 | 8 | 1.0 | 1.0 | 1 | 653.117304 |
32 | 9 | 0.0 | 0.0 | 1 | 315.139793 |
33 | 9 | 0.0 | 1.0 | 1 | 322.747917 |
34 | 9 | 1.0 | 0.0 | 1 | 480.467907 |
35 | 9 | 1.0 | 1.0 | 1 | 445.882243 |
36 | 10 | 0.0 | 0.0 | 1 | 644.037000 |
37 | 10 | 0.0 | 1.0 | 1 | 638.352427 |
38 | 10 | 1.0 | 0.0 | 1 | 741.245455 |
39 | 10 | 1.0 | 1.0 | 1 | 744.362667 |
40 | 11 | 0.0 | 0.0 | 1 | 356.081275 |
41 | 11 | 0.0 | 1.0 | 1 | 389.598276 |
42 | 11 | 1.0 | 0.0 | 1 | 374.167647 |
43 | 11 | 1.0 | 1.0 | 1 | 425.826744 |
44 | 12 | 0.0 | 0.0 | 1 | 298.251190 |
45 | 12 | 1.0 | 0.0 | 1 | 401.925397 |
46 | 12 | 1.0 | 1.0 | 1 | 390.000000 |
47 | 13 | 0.0 | 0.0 | 1 | 303.945556 |
48 | 13 | 0.0 | 1.0 | 1 | 290.141912 |
49 | 13 | 1.0 | 0.0 | 1 | 388.379070 |
50 | 13 | 1.0 | 1.0 | 1 | 409.612766 |
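A long table with 51 rows is hard to scan. As a sketch, `pivot_table` can lay the same means out with one row per district and one column per (elevator, subway) combination, and a group count shows how many listings stand behind each mean. The toy values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; district 12 has a single listing.
toy = pd.DataFrame({
    "district": [1, 1, 1, 1, 12],
    "elevator": [0.0, 0.0, 1.0, 1.0, 1.0],
    "subway": [1.0, 1.0, 1.0, 0.0, 1.0],
    "totalPrice": [590.0, 610.0, 700.0, 650.0, 390.0],
})

# Rows: district; columns: (elevator, subway); values: mean total price.
table = toy.pivot_table(
    index="district",
    columns=["elevator", "subway"],
    values="totalPrice",
    aggfunc="mean",
)

# Number of listings behind each mean.
counts = toy.groupby(["district", "elevator", "subway"]).size()
print(table)
print(counts)
```

Round means such as 390.000000 in the real output may rest on very few listings, which a count like this would confirm or rule out.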
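# Daily trend of the mean price
# Mean unit price of all listings traded on each day
df_day_price = df.groupby('tradeTime')['price'].mean()
df_day_price = df_day_price.sort_index() # sort by the date index
df_day_price.plot() # plot the trend
Clear outliers appear at the beginning of each year. What causes them, or are they simply erroneous values?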
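One way to investigate the question is to look at how many listings stand behind each daily mean, since a day with a single trade can swing the average. This is only a possible explanation, not something the data shown here confirms. The sketch below uses hypothetical toy prices and also computes a monthly mean as a smoother view.

```python
import pandas as pd

# Hypothetical toy data: one expensive trade on an otherwise quiet day.
toy = pd.DataFrame({
    "tradeTime": pd.to_datetime([
        "2016-01-01", "2016-01-02", "2016-01-02",
        "2016-01-02", "2016-02-10", "2016-02-11",
    ]),
    "price": [120000, 40000, 42000, 44000, 43000, 45000],
})

# Daily mean together with the number of trades behind it.
daily = toy.groupby("tradeTime")["price"].agg(["mean", "count"])

# Days whose mean rests on fewer than 2 trades (threshold chosen for illustration).
thin_days = daily[daily["count"] < 2]

# Monthly mean ("MS" = month start) smooths out single-day spikes.
monthly = toy.set_index("tradeTime")["price"].resample("MS").mean()
print(daily)
print(monthly)
```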
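# 2017: share of listings with total price 2-4 million yuan, unit price 40,000-70,000 yuan per square meter, and an elevator (assuming elevator == 1 means an elevator is present)
df_2017 = df[df['year'] == '2017']
# Filter df_2017 rather than df so that the numerator and the denominator both cover 2017 only
num1 = len(df_2017[(df_2017['totalPrice'] > 200) & (df_2017['totalPrice'] < 400) & (df_2017['price'] > 40000) & (df_2017['price'] < 70000) & (df_2017['elevator'] == 1)])
num2 = len(df_2017) # number of 2017 records
want_ratio = num1/num2
print(want_ratio) # share
# The value originally reported here, 0.6929146649957065, used a numerator counted over all years (2013-2017), so it overstates the 2017 share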
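A share like this can also be computed as the mean of a boolean mask, which keeps the numerator and denominator on the same set of rows by construction. The sketch below uses a hypothetical five-row frame with the same thresholds (open intervals, matching the strict `>` and `<` comparisons).

```python
import pandas as pd

# Hypothetical toy data: one 2016 row and four 2017 rows.
toy = pd.DataFrame({
    "year": ["2016", "2017", "2017", "2017", "2017"],
    "totalPrice": [300.0, 300.0, 350.0, 500.0, 250.0],
    "price": [50000, 50000, 60000, 50000, 30000],
    "elevator": [1.0, 1.0, 1.0, 1.0, 0.0],
})

df_2017 = toy[toy["year"] == "2017"]

# inclusive="neither" excludes both endpoints (requires pandas 1.3 or later).
wanted = (
    df_2017["totalPrice"].between(200, 400, inclusive="neither")
    & df_2017["price"].between(40000, 70000, inclusive="neither")
    & (df_2017["elevator"] == 1)
)

# The mean of a boolean Series is the fraction of True values.
want_ratio = wanted.mean()
print(want_ratio)
```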