这两天刚好有些时间,于是跑了一些空间计量模型作为实战练习,使用的包是pysal,原教程点击该链接,主要是阐述了空间异质性、空间依赖的含义以及SLX,SAR,SEM这三个空间计量基本模型,其他的许多变体其实也就是这三个模型的两两结合或三个结合在一起。在本博客中不再阐述空间异质性和空间依赖了,只讲如何用pysal实现SLX,SAR,SEM这三个基本模型,希望了解全部内容的可以看原教程。此外,pysal这个包最好是要更新到最新版,要不然本博客代码跑起来会有bug。

1.数据情况及OLS回归

The Data: San Diego AirBnB,可以点击自己下载。数据是San Diego的airbnb的每晚房价数据,包括房价,这个也就是我们后续分析的因变量,还有一些连续的自变量,比如房间数量、卫生间数量、卧室数量、床数量等等,此外也有一些分类变量,比如neighborhood,此外,还有一些哑变量,比如pg_Apartment等。

import warnings
warnings.filterwarnings("ignore")%matplotlib inlineimport pysal
import spreg
from libpysal import weights
import esda
from scipy import stats
import statsmodels.formula.api as sm
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
db = gpd.read_file('regression_db.geojson')
db.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 6110 entries, 0 to 6109
Data columns (total 20 columns):#   Column              Non-Null Count  Dtype
---  ------              --------------  -----   0   accommodates        6110 non-null   int64   1   bathrooms           6110 non-null   float64 2   bedrooms            6110 non-null   float64 3   beds                6110 non-null   float64 4   neighborhood        6110 non-null   object  5   pool                6110 non-null   int64   6   d2balboa            6110 non-null   float64 7   coastal             6110 non-null   int64   8   price               6110 non-null   float64 9   log_price           6110 non-null   float64 10  id                  6110 non-null   int64   11  pg_Apartment        6110 non-null   int64   12  pg_Condominium      6110 non-null   int64   13  pg_House            6110 non-null   int64   14  pg_Other            6110 non-null   int64   15  pg_Townhouse        6110 non-null   int64   16  rt_Entire_home/apt  6110 non-null   int64   17  rt_Private_room     6110 non-null   int64   18  rt_Shared_room      6110 non-null   int64   19  geometry            6110 non-null   geometry
dtypes: float64(6), geometry(1), int64(12), object(1)
memory usage: 954.8+ KB
db.head(2)
accommodates bathrooms bedrooms beds neighborhood pool d2balboa coastal price log_price id pg_Apartment pg_Condominium pg_House pg_Other pg_Townhouse rt_Entire_home/apt rt_Private_room rt_Shared_room geometry
0 5 2.0 2.0 2.0 North Hills 0 2.972077 0 425.0 6.052089 6 0 0 1 0 0 1 0 0 POINT (-117.12971 32.75399)
1 6 1.0 2.0 4.0 Mission Bay 0 11.501385 1 205.0 5.323010 5570 0 1 0 0 0 1 0 0 POINT (-117.25253 32.78421)

These are the explanatory variables we will use throughout the chapter.

variable_names = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'rt_Private_room', 'rt_Shared_room','pg_Condominium', 'pg_House', 'pg_Other', 'pg_Townhouse'
]

下面的这个代码是直接做一个普通OLS回归,不考虑空间效应,就是把因变量和自变量输进去进行了,用stats也可以,大同小异。

m1 = spreg.OLS(y = db[['log_price']].values, x = db[variable_names].values,name_y='log_price', name_x=variable_names
)

Number of Variables是包括CONSTANT变量在内的,因此一共是11个。

print(m1.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
R-squared           :      0.6683
Adjusted R-squared  :      0.6678
Sum squared residual:    1320.148                F-statistic           :   1229.0564
Sigma-square        :       0.216                Prob(F-statistic)     :           0
S.E. of regression  :       0.465                Log likelihood        :   -3988.895
Sigma-square ML     :       0.216                Akaike info criterion :    7999.790
S.E of regression ML:      0.4648                Schwarz criterion     :    8073.685------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT       4.3883830       0.0161147     272.3217773       0.0000000accommodates       0.0834523       0.0050781      16.4336318       0.0000000bathrooms       0.1923790       0.0109668      17.5419773       0.0000000bedrooms       0.1525221       0.0111323      13.7009195       0.0000000beds      -0.0417231       0.0069383      -6.0134430       0.0000000rt_Private_room      -0.5506868       0.0159046     -34.6244758       0.0000000rt_Shared_room      -1.2383055       0.0384329     -32.2198992       0.0000000pg_Condominium       0.1436347       0.0221499       6.4846529       0.0000000pg_House      -0.0104894       0.0145315      -0.7218393       0.4704209pg_Other       0.1411546       0.0228016       6.1905633       0.0000000pg_Townhouse      -0.0416702       0.0342758      -1.2157316       0.2241342
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           11.964TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        2671.611           0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               10         322.532           0.0000
Koenker-Bassett test             10         135.581           0.0000
================================ END OF REPORT =====================================

下面看一下有沙滩和无沙滩的房子是不是存在异质性,因为我们之前没有考虑任何的空间效应,因此可能回归残差不是白噪声,下面就来验证下是不是说确实数据中是存在空间效应的,空间效应当然有很多种,这里只是举了有没有沙滩这个空间效应。

is_coastal = db.coastal.astype(bool)
coastal = m1.u[is_coastal] # 取出残差中存在coastal的样本的残差
not_coastal = m1.u[~is_coastal]  # 取出残差中不存在coastal的样本的残差
plt.hist(coastal, density=True, label='Coastal')
plt.hist(not_coastal, histtype='step',density=True, linewidth=4, label='Not Coastal'
)
plt.vlines(0,0,1, linestyle=":", color='k', linewidth=4)
plt.legend()
plt.show()

While it appears that the neighborhoods on the coast have only slightly higher average errors (and have lower variance in their prediction errors), the two distributions are significantly distinct from one another when compared using a classic ttt test:

stats.ttest_ind(coastal, not_coastal)
Ttest_indResult(statistic=array([13.98193858]), pvalue=array([9.442438e-44]))

下面建立以下这个样本的空间权重矩阵,关于空间权重矩阵的相关介绍和看我的空间计量01,02,这边使用的距离权重矩阵。

knn = weights.KNN.from_dataframe(db, k=1) # 得到k近邻权重矩阵
knn
<libpysal.weights.distance.KNN at 0xaf77427fc8>
knn.weights
{0: [1.0],1: [1.0],2: [1.0],3: [1.0],4: [1.0],...}

This means that, when we compute the spatial lag of that KNNKNNKNN weight and the residual, we get the residual of the AirBnB listing closest to each observation.

lag_residual = weights.spatial_lag.lag_spatial(knn, m1.u) # 生成空间滞后变量
ax = sns.regplot(m1.u.flatten(),lag_residual.flatten(),line_kws=dict(color='orangered'),ci=None
)
ax.set_xlabel('Model Residuals - $u$')
ax.set_ylabel('Spatial Lag of Model Residuals - $W u$');

lag_residual
array([[-0.59201055],[ 0.65057166],[ 0.00970927],...,[-0.40444691],[ 0.07424779],[-0.30458876]])

从上面这个图实际上我们可以很明显的看到空间效应确实存在于残差中,这个图横坐标是每个样本的残差,纵坐标是每个样本的最近领的残差,如果是随机分布的,那么就是没有任何关系,但是我们发现两者间存在一定的聚集关系,并且回归线系数也显著非0.因此用普通OLS回归实际上是存在一些问题的,下面使用Spatial Regimes来做回归,还是使用spreg来做。

2.Spatial Regimes

公式如下所示,其实理解起来也是相当简单,就是每个地方单独给一个截距项。希望用这个截距项来代表空间异质性。

log⁡Pi=αr+∑kXkiβk−r+ϵi\log{P_i} = \alpha_r + \sum_k \mathbf{X}_{ki}\beta_{k-r} + \epsilon_i logPi​=αr​+k∑​Xki​βk−r​+ϵi​

where we are not only allowing the constant term to vary by region (αr\alpha_rαr​), but also every other parameter (βk−r\beta_{k-r}βk−r​).

To illustrate this approach, we will use the “spatial differentiator” of whether a house is in a coastal neighborhood or not (coastal_neig) to define the regimes. The rationale behind this choice is that renting a house close to the ocean might be a strong enough pull that people might be willing to pay at different rates for each of the house’s characteristics.

m4 = spreg.OLS_Regimes(y = db[['log_price']].values, x = db[variable_names].values,regimes = db['coastal'].tolist(),constant_regi='many',regime_err_sep=False,name_y='log_price', name_x=variable_names
)
print(m4.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES - REGIMES
---------------------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          22
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6088
R-squared           :      0.6853
Adjusted R-squared  :      0.6843
Sum squared residual:    1252.489                F-statistic           :    631.4283
Sigma-square        :       0.206                Prob(F-statistic)     :           0
S.E. of regression  :       0.454                Log likelihood        :   -3828.169
Sigma-square ML     :       0.205                Akaike info criterion :    7700.339
S.E of regression ML:      0.4528                Schwarz criterion     :    7848.128------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------0_CONSTANT       4.4072424       0.0215156     204.8392695       0.00000000_accommodates       0.0901860       0.0064737      13.9311338       0.00000000_bathrooms       0.1433760       0.0142680      10.0487871       0.00000000_bedrooms       0.1129626       0.0138273       8.1695568       0.00000000_beds      -0.0262719       0.0088380      -2.9726102       0.00296440_rt_Private_room      -0.5293343       0.0189179     -27.9805699       0.00000000_rt_Shared_room      -1.2244586       0.0425969     -28.7452834       0.00000000_pg_Condominium       0.1053065       0.0281309       3.7434523       0.00018320_pg_House      -0.0454471       0.0179571      -2.5308637       0.01140320_pg_Other       0.0607526       0.0276365       2.1982715       0.02796730_pg_Townhouse      -0.0103973       0.0456730      -0.2276456       0.81992941_CONSTANT       4.4799043       0.0250938     178.5260014       0.00000001_accommodates       0.0484639       0.0078806       6.1497397       0.00000001_bathrooms       0.2474779       0.0165661      14.9388057       0.00000001_bedrooms       0.1897404       0.0179229      10.5864676       0.00000001_beds      -0.0506077       0.0107429      -4.7107925       0.00000251_rt_Private_room      -0.5586281       0.0283122     -19.7309699       0.00000001_rt_Shared_room      -1.0528541       0.0841745     -12.5079997       0.00000001_pg_Condominium       0.2044470       0.0339434       6.0231780       0.00000001_pg_House       0.0753534       0.0233783       3.2232188       0.00127431_pg_Other       0.2954848       0.0386455       7.6460385       0.00000001_pg_Townhouse      -0.0735077       0.0493672      -1.4889984       0.1365396
------------------------------------------------------------------------------------
Regimes variable: unknownREGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           14.033TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        3977.425           0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               21         443.593           0.0000
Koenker-Bassett test             21         164.276           0.0000REGIMES DIAGNOSTICS - CHOW TESTVARIABLE        DF        VALUE           PROBCONSTANT         1           4.832           0.0279accommodates         1          16.736           0.0000bathrooms         1          22.671           0.0000bedrooms         1          11.504           0.0007beds         1           3.060           0.0802pg_Condominium         1           5.057           0.0245pg_House         1          16.793           0.0000pg_Other         1          24.410           0.0000pg_Townhouse         1           0.881           0.3480rt_Private_room         1           0.740           0.3896rt_Shared_room         1           3.309           0.0689Global test        11         328.869           0.0000
================================ END OF REPORT =====================================

3.Exogenous effects: The SLX Model

公式如下,也就是对于自变量再加上一个空间滞后项,这个模型是最简单的空间计量模型,因为仅仅自变量加了一个空间滞后项,回归方式其实还是OLS,而且解读起来其实也和普通的OLS是一样的,比较简单,具体的这边不阐述了,可以看一下沈体雁老师《空间计量经济学》一书。
log⁡(Pi)=α+∑k=1pXijβj+∑k=1p(∑j=1Nwijxjk)γk+ϵi\log(P_i) = \alpha + \sum^{p}_{k=1}X_{ij}\beta_j + \sum^{p}_{k=1}\left(\sum^{N}_{j=1}w_{ij}x_{jk}\right)\gamma_k + \epsilon_i log(Pi​)=α+k=1∑p​Xij​βj​+k=1∑p​(j=1∑N​wij​xjk​)γk​+ϵi​

where ∑j=1Nwijxjk\sum_{j=1}^N w_{ij}x_{jk}∑j=1N​wij​xjk​ represents the spatial lag of the kkkth explanatory variable.
This can be stated in matrix form using the spatial weights matrix, W\mathbf{W}W, as:
log⁡(Pi)=α+Xβ+WXγ+ϵ\log(P_i) = \alpha + \mathbf{X}\beta + \mathbf{WX}\gamma + \epsilon log(Pi​)=α+Xβ+WXγ+ϵ

This splits the model to focus on two main effects: β\betaβ and γ\gammaγ. The β\betaβ effect describes the change in yiy_iyi​ when XikX_{ik}Xik​ changes by one. 1. The subscript for site iii is important here: since we’re dealing with a W\mathbf{W}W matrix, it’s useful to be clear about where the change occurs. Indeed, this matters for the γ\gammaγ effect, which represents an indirect effect of a change in XiX_iXi​. This can be conceptualized in two ways. First, one could think of γ\gammaγ as simply the effect of a unit change in your average surroundings. This is useful and simple. But, this interpretation ignores where this change might occur. In truth, a change in a variable at site iii will result in a spillover to its surroundings:
when xix_ixi​ changes, so too does the spatial lag of any site near iii. The precise size of this will depend on the structure of W\mathbf{W}W, and can be different for every site. For example, think of a very highly-connected “focal” site in a row-standardized weight matrix. This focal site will not be strongly affected if a neighbor changes by a single unit, since each site only contributes a small amount to the lag at the focal site. Alternatively, consider a site with only one neighbor: its lag will change by exactly the amount its sole neighbor changes. Thus, to discover the exact indirect effect of a change yyy caused by the changeat a specific site xix_ixi​ you would need to compute the change in the spatial lag,and then use that as your change in XXX.

In Python, we can calculate the spatial lag of each variable whose name starts by pg_ by first creating a list of all of those names, and then applying PySAL's lag_spatial to each of them:
下面就是创建了自变量的空间滞后项,在后面的回归中,把自变量的空间滞后项也放进去就可以了。

wx = db.filter(like='pg'
).apply(lambda y: weights.spatial_lag.lag_spatial(knn, y)
).rename(columns=lambda c: 'w_'+c
).drop('w_pg_Apartment', axis=1
)
wx
w_pg_Condominium w_pg_House w_pg_Other w_pg_Townhouse
0 0.00 0.45 0.20 0.00
1 0.30 0.35 0.00 0.05
2 0.05 0.30 0.00 0.05
3 0.00 0.95 0.00 0.05
4 0.05 0.85 0.00 0.00
... ... ... ... ...
6105 0.30 0.05 0.00 0.00
6106 0.05 0.00 0.60 0.00
6107 0.00 0.55 0.25 0.00
6108 0.00 0.05 0.05 0.00
6109 0.05 0.70 0.05 0.00

6110 rows × 4 columns

Once computed, we can run the model using OLS estimation because, in this context, the spatial lags included do not violate any of the assumptions OLS relies on (they are essentially additional exogenous variables):

slx_exog = db[variable_names].join(wx)
m5 = spreg.OLS(db[['log_price']].values, slx_exog.values,name_y='l_price', name_x=slx_exog.columns.tolist()
)
print(m5.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :     l_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          15
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6095
R-squared           :      0.6800
Adjusted R-squared  :      0.6792
Sum squared residual:    1273.933                F-statistic           :    924.9423
Sigma-square        :       0.209                Prob(F-statistic)     :           0
S.E. of regression  :       0.457                Log likelihood        :   -3880.030
Sigma-square ML     :       0.208                Akaike info criterion :    7790.061
S.E of regression ML:      0.4566                Schwarz criterion     :    7890.826------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT       4.3205814       0.0234977     183.8727044       0.0000000accommodates       0.0809972       0.0050046      16.1843874       0.0000000bathrooms       0.1893447       0.0108059      17.5224026       0.0000000bedrooms       0.1635998       0.0109764      14.9047058       0.0000000beds      -0.0451529       0.0068249      -6.6159365       0.0000000rt_Private_room      -0.5293783       0.0157308     -33.6524367       0.0000000rt_Shared_room      -1.2892590       0.0381443     -33.7995105       0.0000000pg_Condominium       0.1063490       0.0221782       4.7952003       0.0000017pg_House       0.0327806       0.0156954       2.0885538       0.0367893pg_Other       0.0861857       0.0239774       3.5944620       0.0003276pg_Townhouse      -0.0277116       0.0338485      -0.8186965       0.4129916w_pg_Condominium       0.5928369       0.0689612       8.5966706       0.0000000w_pg_House      -0.0774462       0.0318830      -2.4290766       0.0151661w_pg_Other       0.4851047       0.0551461       8.7967121       0.0000000w_pg_Townhouse      -0.2724493       0.1223388      -2.2270058       0.0259833
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           14.277TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        2458.006           0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               14         393.052           0.0000
Koenker-Bassett test             14         169.585           0.0000
================================ END OF REPORT =====================================

As an illustration, let’s look at some of the direct/indirect effects. The direct effect of the pg_Condominium variable means that condominiums are typically 11% more expensive (βpg_Condominium=0.1063\beta_{pg\_Condominium}=0.1063βpg_Condominium​=0.1063) than the benchmark
property type, apartments. More relevant to this section, any given house surrounded by condominiums also receives a price premium. But, since pgCondominiumpg_CondominiumpgC​ondominium is a dummy variable,
the spatial lag at site iii represents the percentage of properties near iii that are condominiums, which is between 000 and 111.2
So, a unit change in this variable means that you would increase the condominium percentage by 100%. Thus, a .1.1.1 increase in w_pg_Condominium (a change of ten percentage points) would result in a 5.92% increase in the property house price (βwpg_Condominium=0.6\beta_{w_pg\_Condominium} = 0.6βwp​g_Condominium​=0.6). Similar interpretations can be derived for all other spatially lagged variables to derive the indirect effect of a change in the spatial lag.

However, to compute the indirect change for a given site iii, you may need to examine the predicted values for yiy_iyi​. In this example, since we are using a row-standardized weights matrix with twenty nearest neighbors, the impact of changing xix_ixi​ is the same for all of its neighbors and for any site iii. Thus, the effect is always γ20\frac{\gamma}{20}20γ​, or about 0.02960.02960.0296. However, this would not be the same for many other kinds of weights (like Kernel, Queen, Rook, DistanceBand, or Voronoi), so we will demonstrate how to construct the indirect effect for a specific iii:

First, predicted values for yiy_iyi​ are stored in the predy attribute of any model:

m5.predy
array([[5.43610121],[5.38596868],[4.25377454],...,[4.29145318],[4.89174746],[4.85867698]])

Showing this process below, let’s first change a property to be a condominium. Consider the third observation, which is the first apartment in the data:

# 观察第三行数据
db.loc[2]
accommodates                                                     2
bathrooms                                                        1
bedrooms                                                         1
beds                                                             1
neighborhood                                           North Hills
pool                                                             0
d2balboa                                                   2.49389
coastal                                                          0
price                                                           99
log_price                                                  4.59512
id                                                            9553
pg_Apartment                                                     1
pg_Condominium                                                   0
pg_House                                                         0
pg_Other                                                         0
pg_Townhouse                                                     0
rt_Entire_home/apt                                               0
rt_Private_room                                                  1
rt_Shared_room                                                   0
geometry              POINT (-117.1412083878189 32.75326632438691)
residual                                                  0.287341
Name: 2, dtype: object

This is an apartment. Let’s make a copy of our data and change this apartment into a condominium:

db_scenario = db.copy()
# make Apartment 0 and condo 1
db_scenario.loc[2, ['pg_Apartment', 'pg_Condominium']] = [0,1]

We’ve successfully made the change:

db_scenario.loc[2]
accommodates                                                     2
bathrooms                                                        1
bedrooms                                                         1
beds                                                             1
neighborhood                                           North Hills
pool                                                             0
d2balboa                                                   2.49389
coastal                                                          0
price                                                           99
log_price                                                  4.59512
id                                                            9553
pg_Apartment                                                     0
pg_Condominium                                                   1
pg_House                                                         0
pg_Other                                                         0
pg_Townhouse                                                     0
rt_Entire_home/apt                                               0
rt_Private_room                                                  1
rt_Shared_room                                                   0
geometry              POINT (-117.1412083878189 32.75326632438691)
residual                                                  0.287341
Name: 2, dtype: object

Now, we need to also update the spatial lag variates:

db['pg_Condominium']
0       0
1       1
2       0
3       0
4       0..
6105    0
6106    1
6107    0
6108    0
6109    0
Name: pg_Condominium, Length: 6110, dtype: int64
weights.spatial_lag.lag_spatial(knn, db['pg_Condominium'])  # 空间滞后操作后得到的是每个site i周边是Condominium的占比
array([0.  , 0.3 , 0.05, ..., 0.  , 0.  , 0.05])
wx_scenario = db_scenario.filter(like='pg'
).apply(lambda y: weights.spatial_lag.lag_spatial(knn, y)
).rename(columns=lambda c: 'w_'+c
).drop('w_pg_Apartment', axis=1
)
wx_scenario
w_pg_Condominium w_pg_House w_pg_Other w_pg_Townhouse
0 0.00 0.45 0.20 0.00
1 0.30 0.35 0.00 0.05
2 0.05 0.30 0.00 0.05
3 0.00 0.95 0.00 0.05
4 0.05 0.85 0.00 0.00
... ... ... ... ...
6105 0.30 0.05 0.00 0.00
6106 0.05 0.00 0.60 0.00
6107 0.00 0.55 0.25 0.00
6108 0.00 0.05 0.05 0.00
6109 0.05 0.70 0.05 0.00

6110 rows × 4 columns

And build a new exogenous X\mathbf{X}X matrix, including the a constant 1 as the leading column

slx_exog_scenario = db_scenario[variable_names].join(wx_scenario)
slx_exog_scenario
accommodates bathrooms bedrooms beds rt_Private_room rt_Shared_room pg_Condominium pg_House pg_Other pg_Townhouse w_pg_Condominium w_pg_House w_pg_Other w_pg_Townhouse
0 5 2.0 2.0 2.0 0 0 0 1 0 0 0.00 0.45 0.20 0.00
1 6 1.0 2.0 4.0 0 0 1 0 0 0 0.30 0.35 0.00 0.05
2 2 1.0 1.0 1.0 1 0 1 0 0 0 0.05 0.30 0.00 0.05
3 2 1.0 1.0 1.0 1 0 0 1 0 0 0.00 0.95 0.00 0.05
4 2 1.0 1.0 1.0 1 0 0 1 0 0 0.05 0.85 0.00 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6105 2 1.0 1.0 1.0 1 0 0 0 0 0 0.30 0.05 0.00 0.00
6106 6 2.0 2.0 2.0 0 0 1 0 0 0 0.05 0.00 0.60 0.00
6107 1 1.0 1.0 1.0 1 0 0 1 0 0 0.00 0.55 0.25 0.00
6108 3 1.0 1.0 1.0 0 0 0 0 0 0 0.00 0.05 0.05 0.00
6109 3 1.0 1.0 2.0 0 0 0 1 0 0 0.05 0.70 0.05 0.00

6110 rows × 14 columns

Now, our new prediction (in the scenario where we have changed site 2 from an apartment into a condominium), is:

y_pred_scenario = m5.betas[0] + slx_exog_scenario @ m5.betas[1:]  # @操作是矩阵乘法的含义

This prediction will be exactly the same for all sites, except site 2 and its neighbors. So, the neighbors to site 2 are:

knn.neighbors[2]
[772,2212,139,4653,2786,1218,138,808,1480,4241,1631,3617,2612,1162,135,23,5528,3591,407,6088]

And the effect of changing site 2 into a condominium is associated with the following changes to yiy_iyi​:

# 将和第2行相关的行都取出来,这些行和原来的预测值会有所不同
(y_pred_scenario - m5.predy).loc[[2] + knn.neighbors[2]]
0
2 0.106349
772 0.029642
2212 0.029642
139 0.029642
4653 0.029642
2786 0.029642
1218 0.029642
138 0.029642
808 0.029642
1480 0.029642
4241 0.029642
1631 0.029642
3617 0.029642
2612 0.029642
1162 0.029642
135 0.029642
23 0.029642
5528 0.029642
3591 0.029642
407 0.029642
6088 0.029642

We see the first row, representing the direct effect, is equal exactly to the estimate for pg_Condominium. For the other effects, though, we have only changed w_pg_Condominium by .05.05.05

scenario_near_2 = slx_exog_scenario.loc[knn.neighbors[2], ['w_pg_Condominium']]
orig_near_2 = slx_exog.loc[knn.neighbors[2], ['w_pg_Condominium']]
scenario_near_2.join(orig_near_2, lsuffix='_scenario', rsuffix= '_original')
# 以前对于2的邻居,他们周边的Condominium的概率是‘w_pg_Condominium_original’,现在都加上了0.05
w_pg_Condominium_scenario w_pg_Condominium_original
772 0.10 0.05
2212 0.10 0.05
139 0.10 0.05
4653 0.10 0.05
2786 0.10 0.05
1218 0.10 0.05
138 0.10 0.05
808 0.05 0.00
1480 0.10 0.05
4241 0.10 0.05
1631 0.10 0.05
3617 0.10 0.05
2612 0.10 0.05
1162 0.05 0.00
135 0.05 0.00
23 0.10 0.05
5528 0.05 0.00
3591 0.05 0.00
407 0.05 0.00
6088 0.10 0.05

上面的例子就就很好的解释了一下SLX模型,其实很简单就是先得到自变量的空间滞后项,再放进去跑OLS就可以了,直接效应解读和OLS没有任何区别,但是简介效应解读需要注意一下,对于本例子所选取的KNN权重矩阵,间接效应指的就是某个site周围领域的平均效应,也就是空间依赖性了。

4.SEM

The spatial error model includes a spatial lag in the error term of the equation:

log⁡Pi=α+∑kβkXki+ui\log{P_i} = \alpha + \sum_k \beta_k X_{ki} + u_i logPi​=α+k∑​βk​Xki​+ui​

ui=λulag−i+ϵiu_i = \lambda u_{lag-i} + \epsilon_i ui​=λulag−i​+ϵi​

where ulag−i=∑jwi,juju_{lag-i} = \sum_j w_{i,j} u_julag−i​=∑j​wi,j​uj​.
Although it appears similar, this specification violates the assumptions about the
error term in a classical OLS model. Hence, alternative estimation methods are
required. PySAL incorporates functionality to estimate several of the most
advanced techniques developed by the literature on spatial econometrics. For
example, we can use a general method of moments that account for
heterogeneity (Arraiz et al., 2010):

db[variable_names]
accommodates bathrooms bedrooms beds rt_Private_room rt_Shared_room pg_Condominium pg_House pg_Other pg_Townhouse
0 5 2.0 2.0 2.0 0 0 0 1 0 0
1 6 1.0 2.0 4.0 0 0 1 0 0 0
2 2 1.0 1.0 1.0 1 0 0 0 0 0
3 2 1.0 1.0 1.0 1 0 0 1 0 0
4 2 1.0 1.0 1.0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ...
6105 2 1.0 1.0 1.0 1 0 0 0 0 0
6106 6 2.0 2.0 2.0 0 0 1 0 0 0
6107 1 1.0 1.0 1.0 1 0 0 1 0 0
6108 3 1.0 1.0 1.0 0 0 0 0 0 0
6109 3 1.0 1.0 2.0 0 0 0 1 0 0

6110 rows × 10 columns

knn = weights.KNN.from_dataframe(db, k=20) # 得到k近邻权重矩阵
knn.transform='r'
# knn.weights
m6 = spreg.GM_Error_Het(y = db[['log_price']].values, x = db[variable_names].values,w=knn,name_y='log_price', name_x=variable_names
)
print(m6.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES (HET)
---------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
Pseudo R-squared    :      0.6655
N. of iterations    :           1                Step1c computed       :          No------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT       4.4262033       0.0217088     203.8898741       0.0000000accommodates       0.0695536       0.0063268      10.9934495       0.0000000bathrooms       0.1614101       0.0151312      10.6673765       0.0000000bedrooms       0.1739251       0.0146697      11.8560847       0.0000000beds      -0.0377710       0.0088293      -4.2779023       0.0000189rt_Private_room      -0.4947947       0.0163843     -30.1993140       0.0000000rt_Shared_room      -1.1613985       0.0515304     -22.5381175       0.0000000pg_Condominium       0.1003761       0.0213148       4.7092198       0.0000025pg_House       0.0308334       0.0147100       2.0960849       0.0360747pg_Other       0.0861768       0.0254942       3.3802547       0.0007242pg_Townhouse      -0.0074515       0.0292873      -0.2544285       0.7991646lambda       0.6448728       0.0186626      34.5543740       0.0000000lambda       0.6448728       0.0186626      34.5543740       0.0000000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

5.SAR

SAR的公式如下:
log⁡Pi=α+ρlog⁡Plag−i+∑kβkXki+ϵi\log{P_i} = \alpha + \rho \log{P_{lag-i}} + \sum_k \beta_k X_{ki} + \epsilon_i logPi​=α+ρlogPlag−i​+k∑​βk​Xki​+ϵi​

Although it might not seem very different from the previous equation, this model violates the exogeneity assumption, crucial for OLS to work. Put simply, this occurs when PiP_iPi​ exists on both “sides” of the equals sign.In theory, since PiP_iPi​ is included in computing Plag−iP_{lag-i}Plag−i​, exogeneity is violated. Similarly to the case ofthe spatial error, several techniques have been proposed to overcome this limitation, and PySAL implements several of them. In the example below, we use a two-stage least squares estimation {cite}Anselin_1988, where the spatial lag of all the explanatory variables is used as instrument for the endogenous
lag:

m7 = spreg.GM_Lag(db[['log_price']].values, db[variable_names].values,w=knn, name_y='log_price', name_x=variable_names
)
print(m7.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          12
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6098
Pseudo R-squared    :      0.7057
Spatial Pseudo R-squared:  0.6883------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT       2.7440254       0.0727290      37.7294400       0.0000000accommodates       0.0697596       0.0048157      14.4859187       0.0000000bathrooms       0.1626725       0.0104007      15.6405467       0.0000000bedrooms       0.1604137       0.0104823      15.3033012       0.0000000beds      -0.0365065       0.0065336      -5.5874750       0.0000000rt_Private_room      -0.4981415       0.0151396     -32.9031780       0.0000000rt_Shared_room      -1.1157392       0.0365563     -30.5210777       0.0000000pg_Condominium       0.1072995       0.0209048       5.1327614       0.0000003pg_House      -0.0004017       0.0136828      -0.0293598       0.9765777pg_Other       0.1207503       0.0214771       5.6222929       0.0000000pg_Townhouse      -0.0185543       0.0322730      -0.5749190       0.5653461W_log_price       0.3416482       0.0147787      23.1175620       0.0000000
------------------------------------------------------------------------------------
Instrumented: W_log_price
Instruments: W_accommodates, W_bathrooms, W_bedrooms, W_beds,W_pg_Condominium, W_pg_House, W_pg_Other, W_pg_Townhouse,W_rt_Private_room, W_rt_Shared_room
================================ END OF REPORT =====================================

Similarly to the effects in the SLX regression, changes in the spatial lag regression need to be interpreted with care. Here, w_log_price applies consistently over all observations, and actually changes the effective strength of each of the β\betaβ coefficients. Thus, it is useful to use predictions and scenario-building to predict yyy when changing XXX, which allows you to analyze the direct and indirect components.

参考

https://geographicdata.science/book/notebooks/11_regression.html
沈体雁 et al.空间计量经济学
沈体雁 et al. 空间计量分析软件
本博客代码等资料


  1. Since we use the log price for a yyy variable, our β\betaβ coefficients are still all interpreted as elasticities, meaning that a unit change in the xix_ixi​ variate results in a β\betaβ percent change in the price y_i ↩︎

  2. Discover this for yourself: what is the average of numpy.array([True, True, True, False, False, True)]? ↩︎

03空间计量基础模型之SLX,SAR,SEM相关推荐

  1. 空间计量模型_Stata中的空间计量回归模型应用

    在Stata 15中,推出了最新的空间计量官方命令,均以sp开头,表示 spatial data),可以处理横截面与面板形式的空间数据.本文主要为大家介绍空间计量横截面及面板模型的应用,全文分为两部分 ...

  2. 空间计量模型_Stata空间面板数据模型专题直播丨Stata空间计量3月远程直播

    2月28日19:00-21:00Stata空间计量直播专题课(空间面板数据模型)提供全套资料及课后Q&A 空间面板数据模型的前生今世:静态.动态和具有共同因子约束的空间面板数据模型. 模型选择 ...

  3. 面板空间计量模型(Stata)

    面板空间计量模型(Stata) 文章目录 面板空间计量模型(Stata) @[toc] 1 面板空间自回归模型 2 面板空间误差模型 3 面板空间自相关模型 4 面板空间杜宾模型 5 动态面板空间计量 ...

  4. MATLAB1阶零模型,MATLAB 空间计量模型的实现

    前面也在各大论坛混迹,学习MATLAB的空间计量模型的实现,并且经过痛苦的学习过程,好容易有点收获.最近看论坛里还有很多同行在痛苦的学习,就想分享下我的学习经验.让大家共同进步. (1)首先你需要到h ...

  5. 相干层析模型计算matlab,MATLAB 空间计量模型的实现

    前面也在各大论坛混迹,学习MATLAB的空间计量模型的实现,并且经过痛苦的学习过程,好容易有点收获.最近看论坛里还有很多同行在痛苦的学习,就想分享下我的学习经验.让大家共同进步. (1)首先你需要到h ...

  6. 空间计量模型之stata篇

    本文详细介绍了Stata软件解决包括空间权重矩阵的构建.空间计量模型的选择以及空间计量模型的建立过程等问题. 1.空间权重矩阵的构建 通常情况下,空间权重矩阵有邻接矩阵.地理距离空间权重矩阵.经济距离 ...

  7. 截面空间计量模型(Stata)

    截面空间计量模型(Stata) 文章目录 截面空间计量模型(Stata) @[toc] 1 广义空间自回归模型(SAC) 2 空间误差模型(SEM) 3 空间杜宾模型(SDM) 4 广义空间嵌套模型( ...

  8. 请写出空间计量模型stata代码

    对不起,我不能直接给出完整的代码.但是,我可以提供一些指导,帮助您编写空间计量模型的Stata代码. 首先,空间计量模型的代码依赖于您所使用的空间计量模型的类型,例如空间自回归模型(SAR),空间错误 ...

  9. 空间杜宾模型-多种权重矩阵制作、空间相关性检验、SDM、SEM、SAR模型的命令、相关检验及其结果分析

    一.数据介绍 数据名称:[stata代码]空间杜宾模型相关检验及结果分析 数据说明:包含全面的空间计量步骤--多种权重矩阵制作.空间相关性检验.SDM.SEM.SAR模型的命令.相关检验及其结果分析. ...

最新文章

  1. 「回顾」网易数据基础平台建设
  2. c++求矩阵的秩_一篇文章搞定矩阵相关概念及意义通俗解释汇总
  3. Unix/Linux下文件基本操作[zt]
  4. 60道Python面试题答案精选!找工作前必看
  5. Redis:10---List对象
  6. html文字转语音代码,【JavaScript】实现文本转语音功能
  7. sqlyog怎么查找表_VBA代码解决方案第58讲:在VBA中查找指定工作表的实用方法
  8. Ubuntu12.04使用vi编辑器进入编辑模式按上下键出现乱码
  9. PHP 与 JSP 比较(PHP、ASP、JSP是什么)
  10. 使用wx原生方法扫描获取SN码
  11. log4j2.xsml配置文件详细
  12. Word论文引用和目录生成方法
  13. 视频监控系统上云解决方案EasyCVR集成海康EHome私有协议系列——文件查找操作流程
  14. composer 安装 thinkphp
  15. 献给母亲节的技术大礼包
  16. 腾讯代码安全指南开源,涉及 C/C++、Go 等六门编程语言
  17. 阿里AI实验室新添两员大将,易鑫集团计划融资8亿美元即将完成IPO | 大数据24小时
  18. 8位颜色数据,最大值为255为什么不是256?
  19. [C#] winform中的DataGridView的列宽设置(自动调整列宽)
  20. AHP层次分析法java实现

热门文章

  1. 已解决requests.exceptions.ConnectTimeout: HTTPConnectionPool(host=‘123.96.1.95‘, port=30090): Max retri
  2. oracle数据库之统计分析(方差、标准差、协方差)
  3. Excel 2010 VBA 入门 093 数据处理之建立数组
  4. Hadoop3.2.1 【 YARN 】源码分析 : ContainerManager浅析
  5. 主机识别SDIO接口卡过程
  6. 零基础学前端开发培训
  7. linux 内存溢出排查_linux下valgrind内存问题排查
  8. 为什么有了IP地址还要有MAC地址??
  9. 苏州大学2021年全日制博士学位研究生招生简章
  10. CToolBar的使用总结(转)