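Using Example 7.8 from Chapter 7, "Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables," of Wooldridge's Introductory Econometrics: A Modern Approach, this post shows how to run a regression with dummy variables in Python.

Example 7.8. The file LAWSCH85 contains data on median starting salaries of law school graduates. A key explanatory variable is the law school's rank. Because every school has its own rank, we clearly cannot include a dummy variable for each rank. Instead, we convert rank into rank ranges, which calls for the pandas.cut function.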

1. Importing the data

import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Load the LAWSCH85 data set from the wooldridge package
lawsch85 = woo.dataWoo('lawsch85')
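After loading, it can help to count missing values, because the statsmodels formula interface drops rows with missing data by default, so a regression may use fewer observations than the DataFrame has rows. A minimal sketch on a small made-up frame (the names toy, y and x and all values are hypothetical, not taken from LAWSCH85):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with one missing regressor value
toy = pd.DataFrame({
    'y': [1.0, 2.1, 2.9, 4.2, 5.1, 5.8],
    'x': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# The formula interface silently drops the incomplete row before fitting
fit = smf.ols('y ~ x', data=toy).fit()
n_used = int(fit.nobs)

print(missing_per_column)
print(len(toy), n_used)
```

The same isna().sum() check can be applied to the columns of lawsch85 that enter the regression.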

2. Converting a continuous variable into a categorical variable

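pandas.cut bins a set of continuous values into discrete intervals, which makes it suitable for turning a continuous variable into a categorical one.

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)

Parameters:
x: the one-dimensional array to be binned.
bins: the binning rule.
- int: split the range of x into that many equal-width bins. The range of x is extended by 0.1% on each side so that the minimum and maximum of x are included.
- sequence of scalars: the bin edges; in this case the range of x is not extended.
- IntervalIndex: the exact intervals to use.
right: default True; whether each bin includes its right edge. With bins [1, 2, 3, 4], the intervals are (1,2], (2,3], (3,4]. With right=False the bins are closed on the left instead: [1,2), [2,3), [3,4).
labels: the labels for the bins.
retbins: short for "return bins"; default False. Whether to also return the bin edges.
precision: the precision at which bin labels are stored and displayed; default 3.
include_lowest: whether the first interval includes its left edge; default False. With bins np.arange(0, 101, 10), 0 is excluded by default and the first bin is (0, 10]. With include_lowest=True, 0 is included and the first bin is displayed as (-0.001, 10.0].

Returns:
out: a pandas.Categorical, Series, or ndarray giving the bin that each value of x falls into; if labels is given, the corresponding label is returned instead.
bins: the bin edges, returned only when retbins=True.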
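The effect of right, include_lowest and retbins is easiest to see on a tiny input. The following is a minimal sketch with made-up values (x and edges are illustrative, not from LAWSCH85):

```python
import pandas as pd

x = pd.Series([0, 5, 10, 15, 20])
edges = [0, 10, 20]

# Default right=True: bins are (0, 10] and (10, 20], so 0 falls outside every bin
right_closed = pd.cut(x, edges)

# right=False: bins are [0, 10) and [10, 20), so 20 falls outside every bin
left_closed = pd.cut(x, edges, right=False)

# include_lowest=True: the first bin also captures the lowest edge, 0
with_lowest = pd.cut(x, edges, include_lowest=True)

# An integer bins argument splits the range of x into equal-width bins;
# retbins=True additionally returns the computed edges
two_bins, computed_edges = pd.cut(x, 2, retbins=True)

print(pd.DataFrame({'x': x, 'right_closed': right_closed,
                    'left_closed': left_closed, 'with_lowest': with_lowest}))
print(computed_edges)
```

Values that fall outside every bin come back as NaN, so it is worth checking the result for missing values whenever the edges are set by hand.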

The "rank" column of the lawsch85 DataFrame holds each law school's rank:

print(lawsch85['rank'])
'''
0      128
1      104
2       34
3       49
4       95
      ...
151     17
152     21
153    143
154      3
155    120
Name: rank, Length: 156, dtype: int64
'''

Use pandas.cut to convert rank into rank ranges, stored as the 'rc' column of lawsch85.

# Bin edges for the rank ranges
cutpts = [0, 10, 25, 40, 60, 100, 175]
# Define the categorical variable and set its labels
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins=cutpts,
                        labels=['(0,10]', '(10,25]', '(25,40]', '(40,60]', '(60,100]', '(100,175]'])
print(lawsch85['rc'])
'''
0      (100,175]
1      (100,175]
2        (25,40]
3        (40,60]
4       (60,100]
         ...
151      (10,25]
152      (10,25]
153    (100,175]
154       (0,10]
155    (100,175]
Name: rc, Length: 156, dtype: category
Categories (6, object): ['(0,10]' < '(10,25]' < '(25,40]' < '(40,60]' < '(60,100]' < '(100,175]']
'''
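Typing the labels by hand risks a mismatch with the bin edges. As a hedged alternative, the labels can be derived from cutpts itself and the result checked with value_counts. The sketch below uses only the ten rank values visible in the printout above as a stand-in for the full column (the names ranks, labels and counts are illustrative):

```python
import pandas as pd

cutpts = [0, 10, 25, 40, 60, 100, 175]

# Build '(lo,hi]' labels from consecutive pairs of bin edges
labels = [f'({lo},{hi}]' for lo, hi in zip(cutpts[:-1], cutpts[1:])]

# The ten rank values shown in the printed output, used as sample input
ranks = pd.Series([128, 104, 34, 49, 95, 17, 21, 143, 3, 120], name='rank')

rc = pd.cut(ranks, bins=cutpts, labels=labels)

# Number of schools in each rank range
counts = rc.value_counts(sort=False)

print(labels)
print(counts)
```

On the full data set, a range with very few schools would produce an imprecisely estimated dummy coefficient, so this count is a cheap sanity check before running the regression.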

3. Regression with dummy variables among the regressors

Regress salary on the law school rank ranges and the other variables:

reg = smf.ols(formula='np.log(salary) ~ C(rc)+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)

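Here C() marks a variable as categorical (C stands for "categorical"). If the variable holds strings, C() is optional, because the formula interface treats it as categorical automatically; a 0/1 variable also needs no C(), since entering it as a number already gives a single dummy. If the variable is numeric with more than two distinct values, C() is required to have it treated as categorical; otherwise it enters the regression as a continuous variable.

In this example rc is a pandas categorical with string labels, so formula='np.log(salary) ~ C(rc)+ LSAT + GPA + np.log(libvol) + np.log(cost)' and formula='np.log(salary) ~ rc+ LSAT + GPA + np.log(libvol) + np.log(cost)' give the same estimates.

If rc were instead coded as numbers with more than two distinct values, C() would be required.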
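To make the numeric case concrete, here is a minimal sketch on simulated data (the DataFrame df, its columns group and y, and the coefficients used to generate y are all hypothetical). Without C(), the numeric codes 1, 2, 3 enter as a single slope; with C(), they become two dummies relative to group 1:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical data: a group code taking the values 1, 2, 3
df = pd.DataFrame({'group': rng.integers(1, 4, size=90)})
df['y'] = (1.0 + 0.5 * (df['group'] == 2) - 0.3 * (df['group'] == 3)
           + rng.normal(0, 0.1, size=90))

# Without C(): group is treated as one continuous regressor
numeric_fit = smf.ols('y ~ group', data=df).fit()

# With C(): group is expanded into dummies, with the first level as the base
dummy_fit = smf.ols('y ~ C(group)', data=df).fit()

print(numeric_fit.params.index.tolist())
print(dummy_fit.params.index.tolist())
```

The first model forces the step from group 1 to 2 to equal the step from 2 to 3, which is rarely what is intended for category codes.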

The regression results are:

results = reg.fit()
print(results.summary())
'''
                            OLS Regression Results
==============================================================================
Dep. Variable:         np.log(salary)   R-squared:                       0.911
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     143.2
Date:                Sun, 17 Apr 2022   Prob (F-statistic):           9.45e-62
Time:                        15:12:27   Log-Likelihood:                 146.45
No. Observations:                 136   AIC:                            -272.9
Df Residuals:                     126   BIC:                            -243.8
Df Model:                           9
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              9.8649      0.450     21.930      0.000       8.975      10.755
C(rc)[T.(10,25]]      -0.1060      0.039     -2.739      0.007      -0.183      -0.029
C(rc)[T.(25,40]]      -0.3245      0.044     -7.318      0.000      -0.412      -0.237
C(rc)[T.(40,60]]      -0.4367      0.046     -9.512      0.000      -0.528      -0.346
C(rc)[T.(60,100]]     -0.5680      0.047    -12.038      0.000      -0.661      -0.475
C(rc)[T.(100,175]]    -0.6996      0.053    -13.078      0.000      -0.805      -0.594
LSAT                   0.0057      0.003      1.858      0.066      -0.000       0.012
GPA                    0.0137      0.074      0.185      0.854      -0.133       0.161
np.log(libvol)         0.0364      0.026      1.398      0.165      -0.015       0.088
np.log(cost)           0.0008      0.025      0.033      0.973      -0.049       0.051
==============================================================================
Omnibus:                        9.419   Durbin-Watson:                   1.926
Prob(Omnibus):                  0.009   Jarque-Bera (JB):               20.478
Skew:                           0.100   Prob(JB):                     3.57e-05
Kurtosis:                       4.890   Cond. No.                     9.85e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.85e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
'''
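Because the dependent variable is log(salary), each rank-range coefficient is a difference in log points relative to the omitted range (0,10]. The exact percentage difference is 100·(exp(b) − 1). The sketch below applies that conversion to the coefficients copied from the output above (the dictionary name and layout are illustrative):

```python
import math

# Rank-range coefficients copied from the regression output (base group: (0,10])
coefs = {
    '(10,25]': -0.1060,
    '(25,40]': -0.3245,
    '(40,60]': -0.4367,
    '(60,100]': -0.5680,
    '(100,175]': -0.6996,
}

# Convert each log-point difference into an exact percentage difference
pct_diff = {label: 100 * (math.exp(b) - 1) for label, b in coefs.items()}

for label, pct in pct_diff.items():
    print(f'{label:>10}: {pct:6.1f}% relative to (0,10]')
```

For small coefficients the log-point value is close to the percentage itself, but for large ones such as the (100,175] coefficient the two differ noticeably, so the exact formula is the safer reading.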

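The categorical variable rc has six categories: '(0,10]', '(10,25]', '(25,40]', '(40,60]', '(60,100]' and '(100,175]'. Six categories call for five dummy variables, with the remaining category serving as the base group. If no base group is set, C() picks one automatically (the first category, here '(0,10]', which is why it does not appear in the output above). We can also specify the base group ourselves, in the form C(variable, Treatment(base group)).

For example, to make rank range (10,25] the base group:

reg = smf.ols(formula='np.log(salary) ~ C(rc,Treatment("(10,25]"))+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)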
results = reg.fit()
print(results.summary())
'''
                            OLS Regression Results
==============================================================================
Dep. Variable:         np.log(salary)   R-squared:                       0.911
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     143.2
Date:                Sun, 17 Apr 2022   Prob (F-statistic):           9.45e-62
Time:                        15:28:40   Log-Likelihood:                 146.45
No. Observations:                 136   AIC:                            -272.9
Df Residuals:                     126   BIC:                            -243.8
Df Model:                           9
Covariance Type:            nonrobust
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                    9.7588      0.436     22.388      0.000       8.896      10.621
C(rc, Treatment("(10,25]"))[T.(0,10]]        0.1060      0.039      2.739      0.007       0.029       0.183
C(rc, Treatment("(10,25]"))[T.(25,40]]      -0.2185      0.035     -6.164      0.000      -0.289      -0.148
C(rc, Treatment("(10,25]"))[T.(40,60]]      -0.3307      0.035     -9.480      0.000      -0.400      -0.262
C(rc, Treatment("(10,25]"))[T.(60,100]]     -0.4619      0.034    -13.460      0.000      -0.530      -0.394
C(rc, Treatment("(10,25]"))[T.(100,175]]    -0.5935      0.039    -15.049      0.000      -0.672      -0.515
LSAT                                         0.0057      0.003      1.858      0.066      -0.000       0.012
GPA                                          0.0137      0.074      0.185      0.854      -0.133       0.161
np.log(libvol)                               0.0364      0.026      1.398      0.165      -0.015       0.088
np.log(cost)                                 0.0008      0.025      0.033      0.973      -0.049       0.051
==============================================================================
Omnibus:                        9.419   Durbin-Watson:                   1.926
Prob(Omnibus):                  0.009   Jarque-Bera (JB):               20.478
Skew:                           0.100   Prob(JB):                     3.57e-05
Kurtosis:                       4.890   Cond. No.                     9.48e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.48e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
'''
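Changing the base group does not change the fit (R-squared, the F-statistic and the non-dummy coefficients are identical in both outputs); it only re-expresses the dummy coefficients. Each new coefficient should equal the old one minus the old coefficient of the new base group. A small arithmetic check using the printed numbers (the variable names are illustrative, and the tolerance allows for rounding in the summaries):

```python
# Coefficients with (0,10] as the base group; the base itself is 0 by construction
old = {
    '(0,10]': 0.0,
    '(10,25]': -0.1060,
    '(25,40]': -0.3245,
    '(40,60]': -0.4367,
    '(60,100]': -0.5680,
    '(100,175]': -0.6996,
}

# Coefficients reported with (10,25] as the base group
reported = {
    '(0,10]': 0.1060,
    '(25,40]': -0.2185,
    '(40,60]': -0.3307,
    '(60,100]': -0.4619,
    '(100,175]': -0.5935,
}

# Shift every coefficient by the old coefficient of the new base group
shift = old['(10,25]']
rebased = {k: v - shift for k, v in old.items() if k != '(10,25]'}

for k in reported:
    print(f'{k:>10}: rebased {rebased[k]:8.4f}   reported {reported[k]:8.4f}')

max_gap = max(abs(rebased[k] - reported[k]) for k in reported)
print('largest discrepancy:', round(max_gap, 4))
```

The intercept moves the opposite way: 9.8649 − 0.1060 ≈ 9.7588, matching the second output up to rounding.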

4. Complete code

import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

lawsch85 = woo.dataWoo('lawsch85')

# Bin edges for the rank ranges
cutpts = [0, 10, 25, 40, 60, 100, 175]
# Define the categorical variable and set its labels
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins=cutpts,
                        labels=['(0,10]', '(10,25]', '(25,40]', '(40,60]', '(60,100]', '(100,175]'])

# Regression with the base group chosen automatically
reg = smf.ols(formula='np.log(salary) ~ C(rc)+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)
results = reg.fit()
print(results.summary())

# Regression with rank range (10,25] as the base group
reg = smf.ols(formula='np.log(salary) ~ C(rc,Treatment("(10,25]"))+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)
results = reg.fit()
print(results.summary())
