Introduction to Survival Analysis

In our extremely competitive times, all businesses face the problem of customer churn/retention. To quickly give some context, churn happens when the customer stops using the services of a company (stops purchasing, cancels the subscription, etc.). Retention refers to keeping the clients of a business active (the definition of active highly depends on the business model).

Intuitively, companies want to increase retention by preventing churn. This way, their relationship with the customers is longer and thus potentially more profitable. What is more, in most cases the company’s cost of retaining a customer is much lower than that of acquiring a new customer, for example, via performance marketing. For businesses, the concept of retention is closely connected to customer lifetime value (CLV), which the businesses want to maximize. But that is a topic for another article.

With this article, I want to start a short series focusing on survival analysis, which is often an underestimated, yet very interesting branch of statistical learning. In this article, I provide a general introduction to survival analysis and its building blocks. First I explain the required concepts and then describe different approaches to analyzing time-to-event data. Let’s start!

Introduction to Survival Analysis

Survival analysis is a field of statistics that focuses on analyzing the expected time until a certain event happens. Originally, this branch of statistics developed around measuring the effects of medical treatment on patients’ survival in clinical trials. For example, imagine a group of cancer patients who are administered a certain new form of treatment. Survival analysis can be used for analyzing the results of that treatment in terms of the patients’ life expectancy.

However, survival analysis is not restricted to investigating deaths. It can just as well be used to determine the time until a machine fails or, which may at first sound a bit counterintuitive, until a user of a certain platform converts to a premium service. That is possible because survival analysis focuses on the time until an event happens, without requiring the event to be a negative one. The conditions that apply to the most popular methods of survival analysis are:

  • the event of interest is clearly defined and well-specified, so there is no ambiguity about whether it happened or not,
  • the event can occur only once for each subject. This is clear in the case of death, but applying the analysis to churn is more complicated, as a churned user might be reactivated and churn again.

We have already established that survival analysis is used for modeling time-to-event data, in other words, lifetimes (hence also the name of the Python library which is the go-to tool for this kind of analysis). Generally speaking, we can use survival analysis to try to answer questions like:

  • what percentage of the population will survive past a certain time?
  • among the survivors, what will the death/failure rate be?
  • how do particular characteristics (for example, age, gender, or geographical location) affect the probability of survival?

Having briefly described the general idea of survival analysis, it is time to introduce a few concepts that are crucial for a thorough understanding of the subject.

Censoring

Censoring can be described as the missing data problem in the domain of survival analysis. Observations are censored when the information about their survival time is incomplete. There are different kinds of censoring, such as:

  • right-censoring,
  • interval-censoring,
  • left-censoring.

To keep this section short, we discuss only the kind that is encountered most frequently: right-censoring. Let’s come back to the cancer treatment example. Imagine that the study of the effects of the new medicine lasts 5 years (an arbitrary number, not based on anything). After 5 years, some of the patients may still be alive and thus have not experienced the death event. At the same time, the authors of the study may have lost contact with some patients: they might have relocated to another country, or they might have died without any confirmation ever being received. Those cases are affected by right-censoring, that is, their true survival time is equal to or greater than the observed survival time (the 5 years of the study for the survivors, or the time of last contact for the patients who were lost). The following image illustrates right-censoring.

[Figure: illustration of right-censoring]
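
To make the idea concrete, here is a minimal sketch of how right-censored observations are commonly encoded as a pair: the observed duration and a flag saying whether the event was actually seen. The `Patient` record, its field names, and the three example patients are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STUDY_LENGTH_YEARS = 5.0  # arbitrary, as in the example above


@dataclass
class Patient:
    """Hypothetical record; all field names are illustrative."""
    death_year: Optional[float] = None         # years from enrollment to death, if known
    lost_contact_year: Optional[float] = None  # years from enrollment to last contact, if lost


def to_duration_and_event(p: Patient) -> Tuple[float, bool]:
    """Return (observed duration, event_observed) for one patient."""
    # Follow-up stops at the end of the study or when contact is lost,
    # whichever comes first.
    censor_time = STUDY_LENGTH_YEARS
    if p.lost_contact_year is not None:
        censor_time = min(censor_time, p.lost_contact_year)
    if p.death_year is not None and p.death_year <= censor_time:
        return p.death_year, True    # death observed during follow-up
    return censor_time, False        # right-censored: true time >= censor_time


patients = [
    Patient(death_year=2.5),         # died during the study
    Patient(),                       # still alive when the study ended
    Patient(lost_contact_year=3.0),  # contact lost after 3 years
]
records = [to_duration_and_event(p) for p in patients]
print(records)
```

The second and third patients are both right-censored: all we know is that they survived at least 5 and at least 3 years, respectively.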

The existence of censoring is also the reason why we cannot use simple OLS for problems in survival analysis. OLS fits a regression line by minimizing the sum of squared errors, but for censored observations the true survival times, and hence the error terms, are unknown, so that sum cannot be computed. Simple workarounds, such as treating the censoring date as the date of the death event or dropping the censored observations, can severely bias the results.
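
The size of that bias is easy to see in a small simulation. The sketch below is only an illustration under assumptions of my own: lifetimes are exponentially distributed with a true mean of 10 years, and every subject still alive after 5 years is right-censored. It compares the two naive workarounds with a censoring-aware estimate (for exponential lifetimes, the maximum likelihood estimate of the mean is the total observed time divided by the number of observed events).

```python
import random

random.seed(0)
TRUE_MEAN = 10.0     # assumed true mean survival time, in years
STUDY_LENGTH = 5.0   # follow-up ends here; longer lifetimes are right-censored

true_times = [random.expovariate(1 / TRUE_MEAN) for _ in range(20_000)]
observed = [min(t, STUDY_LENGTH) for t in true_times]
event = [t <= STUDY_LENGTH for t in true_times]

# Naive workaround 1: treat the censoring time as if it were the time of death.
naive_as_event = sum(observed) / len(observed)

# Naive workaround 2: drop the censored observations altogether.
deaths = [d for d, e in zip(observed, event) if e]
naive_dropped = sum(deaths) / len(deaths)

# Censoring-aware estimate for exponential lifetimes:
# total time at risk divided by the number of observed events.
censoring_aware = sum(observed) / sum(event)

print(f"censoring time as death: {naive_as_event:.2f}")
print(f"censored rows dropped:   {naive_dropped:.2f}")
print(f"censoring-aware:         {censoring_aware:.2f}")
```

Under these assumptions the two naive averages land far below the true mean of 10 years (their theoretical expectations are roughly 3.9 and 2.3 years), while the censoring-aware estimate stays close to it.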

For information regarding different kinds of censoring, please go here.

The Survival Function

The survival function is a function of time (t) and can be represented as

$$S(t) = \Pr(T > t)$$

where Pr() stands for the probability and T for the time of the event of interest for a random observation from the sample. We can interpret the survival function as the probability of the event of interest (for example, the death event) not occurring by the time t.

The survival function takes values in the range between 0 and 1 (inclusive) and is a non-increasing function of t.
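
As a quick illustration, when nothing is censored the survival function can be estimated as the fraction of subjects whose event time exceeds t. The event times below are made up for the example, and the function name is my own.

```python
def empirical_survival(event_times, t):
    """Fraction of subjects whose event time is strictly greater than t."""
    return sum(1 for x in event_times if x > t) / len(event_times)


event_times = [2, 3, 3, 5, 8]   # made-up, fully observed event times
grid = [0, 1, 2, 3, 4, 5, 8]
curve = [empirical_survival(event_times, t) for t in grid]
print(curve)  # [1.0, 1.0, 0.8, 0.4, 0.4, 0.2, 0.0]
```

The values start at 1, end at 0, and never increase, in line with the properties listed above.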

The Hazard Function

We can think of the hazard function (or hazard rate) as the probability of the subject experiencing the event of interest within a small (or to be more precise, infinitesimal) interval of time, assuming that the subject has survived up until the beginning of the said interval. The hazard function can be represented as:

$$h(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt}$$

where the expression in the numerator is the conditional probability of the event of interest occurring in the given time interval, provided it has not happened before. dt in the denominator is the width of the considered interval of time. When we divide the former by the latter, we effectively obtain the rate of the event’s occurrence per unit of time. Lastly, by taking the limit as the width of the interval goes to zero, we end up with the instantaneous rate of occurrence, that is, the risk of the event happening at a particular point in time.

You might wonder why the hazard rate is defined using this small interval of time. The reason for that lies in the fact that the probability of a continuous random variable being equal to a particular value is zero. That is why we need to consider the probability of the event happening in a very small interval of time.

Technical note: to be theoretically correct, it is important to mention that the hazard function is not actually a probability, and the name hazard rate is the more fitting one. Even though the expression in the numerator is a probability, dividing it by dt in the denominator can result in a hazard rate greater than 1 (it is still bounded below by 0).
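
A small numerical check may help here. The sketch assumes exponentially distributed event times with a rate of 2 events per unit of time (my choice, purely for illustration) and approximates the limit with a small but finite dt.

```python
import math


def survival(t, rate):
    """S(t) = exp(-rate * t) for an exponentially distributed event time."""
    return math.exp(-rate * t)


def conditional_prob(t, dt, rate):
    """Pr(t <= T < t + dt | T >= t), computed from the survival function."""
    return (survival(t, rate) - survival(t + dt, rate)) / survival(t, rate)


rate = 2.0   # assumed: events per unit of time
dt = 1e-6    # small interval standing in for the limit dt -> 0

prob = conditional_prob(1.0, dt, rate)   # a genuine probability, tiny here
hazard = prob / dt                       # the rate: close to 2.0, so above 1

print(prob, hazard)
```

The numerator stays between 0 and 1, while the ratio approaches the rate of 2, which is why the hazard is better read as a rate than as a probability.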

Lastly, the survival and hazard functions are related to each other as specified by the following formula:

$$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$

To give the equation a bit of context, the integral in the brackets is called the cumulative hazard and can be interpreted as the sum of the risks the subject faces going from time-point 0 to t.
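
The relationship can be checked numerically. The sketch below assumes a Weibull hazard with shape 1.5 and scale 2 (arbitrary values chosen for illustration), integrates it with the trapezoidal rule to get the cumulative hazard, and compares the resulting survival probability with the Weibull closed form exp(-(t/scale)^shape).

```python
import math

SHAPE, SCALE = 1.5, 2.0   # assumed Weibull parameters, for illustration only


def hazard(t):
    """Weibull hazard: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def cumulative_hazard(t, steps=10_000):
    """Trapezoidal approximation of the integral of hazard(u) from 0 to t."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * width) for i in range(1, steps))
    return total * width


t = 3.0
survival_from_hazard = math.exp(-cumulative_hazard(t))
survival_closed_form = math.exp(-((t / SCALE) ** SHAPE))
print(survival_from_hazard, survival_closed_form)
```

The two numbers agree up to the integration error, which is what the formula predicts.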

Different approaches to Survival Analysis

As survival analysis is an entire domain of different statistical methods for working with time-to-event series, there are naturally many different approaches we could follow. On a high level, we could split them into three main groups:

  • Non-parametric: with these approaches, we make no assumptions about the underlying distribution of the data. Perhaps the most popular example from this group is the Kaplan-Meier curve, which, in short, is a method of estimating and plotting the survival probability as a function of time.

  • Semi-parametric: as you might have guessed, this group sits between the two extremes and makes very few assumptions. Most importantly, there are no assumptions about the shape of the hazard function/rate. The most popular method from this group is the Cox regression, which we can use to identify the relationship between the hazard function and a set of explanatory variables (predictors).

  • Parametric: you might have encountered this approach during your studies. The idea is to use a statistical distribution (popular choices include the exponential, log-normal, Weibull, or Lomax distributions) to estimate how long a subject will survive. Often, we use maximum likelihood estimation (MLE) to fit the distribution’s parameters to the data.
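
To give a feel for the non-parametric approach, here is a hand-rolled sketch of the Kaplan-Meier (product-limit) estimator on made-up data. It is meant only to show the mechanics; in practice you would reach for a dedicated survival analysis library rather than this function, whose name and inputs are my own.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each distinct event time."""
    event_times = sorted({d for d, e in zip(durations, event_observed) if e})
    curve, survival = [], 1.0
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) that happen exactly at time t.
        deaths = sum(1 for d, e in zip(durations, event_observed) if e and d == t)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up data: False marks a right-censored observation.
durations = [1, 2, 2, 3, 4, 5, 5]
event_observed = [True, True, False, True, False, True, False]

km_curve = kaplan_meier(durations, event_observed)
for time, prob in km_curve:
    print(time, round(prob, 3))
```

Censored subjects never cause a drop in the curve themselves, but they count in the at-risk group until they leave, which is how the estimator uses the partial information they carry.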

The methods mentioned in this short list are by no means exhaustive and there are many more interesting approaches to analyzing time-to-event data using machine- or deep-learning-based techniques. I will try to cover the most interesting ones in the following posts, so stay tuned :)

Conclusions

In this article, I tried to provide a brief yet thorough introduction to the domain of survival analysis. I believe that this area is often overlooked when talking about different data science solutions. However, by using some simple (or not so simple at all!) solutions we can provide valuable insights for the company or stakeholders and generate real added value.

This article is only the beginning of a short series, and I will keep on adding the following parts below. In case you have questions or suggestions, please let me know in the comments or reach out on Twitter.

Original article: https://towardsdatascience.com/introduction-to-survival-analysis-6f7e19c31d96
