Customer Segmentation: K-Means Clustering & A/B Testing

Context

I have been working in Advertising, specifically Digital Media and Performance, for nearly 3 years, and customer behaviour analysis is one of the core focuses of my day-to-day job. With the help of different analytics platforms (e.g. Google Analytics, Adobe Analytics), my life has become easier than before, since these platforms come with built-in segmentation functionality that analyses user behaviour across dimensions and metrics.

However, despite the convenience provided, I was hoping to leverage Machine Learning to do customer segmentation in a way that is scalable and applicable to other optimizations in Data Science (e.g. A/B Testing). Then, I came across the dataset provided by Google Analytics for a Kaggle competition and decided to use it for this project.

Feel free to check out the dataset here if you’re keen! Beware that the dataset has several sub-datasets and each has more than 900k rows!

A. Exploratory Data Analysis (EDA)

This always remains an essential step in every Data Science project: making sure the dataset is clean and properly pre-processed before it is used for modelling.

First of all, let’s import all the necessary libraries and read the csv file:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df_raw = pd.read_csv("google-analytics.csv")
df_raw.head()

1. Flatten JSON Fields

As you can see, the raw dataset above is a bit “messy” and not digestible at all since some variables are formatted as JSON fields which compress different values of different sub-variables into one field. For example, for geoNetwork variable, we can tell that there are several sub-variables such as continent, subContinent, etc. that are grouped together.

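To illustrate what flattening does, here's a toy example (hypothetical values, not taken from the dataset) of json_normalize expanding one such field into separate columns:

from pandas import json_normalize

geo = [{"continent": "Asia", "subContinent": "Southeast Asia", "country": "Vietnam"}]
json_normalize(geo)
# -> a one-row DataFrame with columns: continent, subContinent, country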

Thanks to the help of a Kaggler, I was able to convert these variables into more digestible ones by flattening those JSON fields:

import os
import json
from pandas import json_normalize

def load_df(csv_path="google-analytics.csv", nrows=None):
    json_columns = ['device', 'geoNetwork', 'totals', 'trafficSource']
    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in json_columns},
                     dtype={'fullVisitorId': 'str'}, nrows=nrows)
    for column in json_columns:
        column_converted = json_normalize(df[column])
        column_converted.columns = [f"{column}_{subcolumn}" for subcolumn in column_converted.columns]
        df = df.drop(column, axis=1).merge(column_converted, right_index=True, left_index=True)
    return df

After flattening those JSON fields, we are able to see a much cleaner dataset, especially those JSON variables split into sub-variables (e.g. device split into device_browser, device_browserVersion, etc.).

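The later snippets assume the flattened frame was loaded through this helper; a minimal call (not shown explicitly in the original post) would be:

df = load_df("google-analytics.csv")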

2. Data Re-formatting & Grouping

For this project, I have chosen the variables that I believe have a stronger impact on, or correlation with, user behaviour:

df = df.loc[:, ['channelGrouping', 'date', 'fullVisitorId', 'sessionId', 'visitId', 'visitNumber',
                'device_browser', 'device_operatingSystem', 'device_isMobile', 'geoNetwork_country',
                'trafficSource_source', 'totals_visits', 'totals_hits', 'totals_pageviews',
                'totals_bounces', 'totals_transactionRevenue']]
df = df.fillna(value=0)
df.head()

Moving on: the new dataset has fewer variables, but they vary in data type, so I took some time to analyze each and every variable to ensure the data is “clean enough” prior to modelling. Below are some quick examples of un-clean data being cleaned:

#Format the values
df.channelGrouping.unique()
df.channelGrouping = df.channelGrouping.replace("(Other)", "Others")

#Convert boolean type to string
df.device_isMobile.unique()
df.device_isMobile = df.device_isMobile.astype(str)
df.loc[df.device_isMobile == "False", "device"] = "Desktop"
df.loc[df.device_isMobile == "True", "device"] = "Mobile"

#Categorize similar values
df['traffic_source'] = df.trafficSource_source
main_traffic_source = ["google", "baidu", "bing", "yahoo", ...., "pinterest", "yandex"]
df.traffic_source[df.traffic_source.str.contains("google")] = "google"
df.traffic_source[df.traffic_source.str.contains("baidu")] = "baidu"
df.traffic_source[df.traffic_source.str.contains("bing")] = "bing"
df.traffic_source[df.traffic_source.str.contains("yahoo")] = "yahoo"
.....
df.traffic_source[~df.traffic_source.isin(main_traffic_source)] = "Others"

After re-formatting, I found that the number of unique fullVisitorId values is smaller than the total number of rows in the dataset, meaning some fullVisitorIds were recorded multiple times. Hence, I proceeded to group the variables by fullVisitorId and sort by Revenue:

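A quick way to verify this, assuming the column names above, is to compare the number of unique visitor IDs with the number of rows:

print(df.fullVisitorId.nunique(), len(df))  #fewer unique IDs than rows -> some visitors appear more than once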

df_groupby = (df.groupby(['fullVisitorId', 'channelGrouping', 'geoNetwork_country', 'traffic_source',
                          'device', 'device_browser', 'device_operatingSystem'])
                .agg({'totals_hits': 'sum', 'totals_pageviews': 'sum', 'totals_bounces': 'sum',
                      'totals_transactionRevenue': 'sum'})
                .reset_index())
df_groupby = df_groupby.sort_values(by='totals_transactionRevenue', ascending=False).reset_index(drop=True)
df.groupby() and df.sort_values()

3. Outlier Handling

The last step of any EDA process that cannot be overlooked is detecting and handling outliers in the dataset. The reason is that outliers, especially the extreme ones, affect the performance of a machine learning model, mostly negatively. That said, we need to either remove those outliers from the dataset or convert them (e.g. to the mean or mode) so that they fall within the range in which the majority of the data points lie:

#Seaborn Boxplot to see how far outliers lie compared to the rest
sns.boxplot(df_groupby.totals_transactionRevenue)
sns.boxplot()

As you can see, most of the data points in Revenue lie below USD 200,000 and there's only one extreme outlier that hits nearly USD 600,000. If we don't remove this outlier, the model will take it into account as well, producing a less objective result.

So let's go ahead and remove it, and do the same for the other variables. Just a quick note: there are several methods for dealing with outliers (such as the interquartile range). However, in my case there's only one, so I just went ahead and defined a cut-off that I believe fits well:

df_groupby = df_groupby.loc[df_groupby.totals_transactionRevenue < 200000]
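
For reference, here's a minimal sketch of the interquartile-range approach mentioned above, in case you have more than one outlier to handle (the 1.5 × IQR multiplier is the common convention, not something specified in this project):

q1 = df_groupby.totals_transactionRevenue.quantile(0.25)
q3 = df_groupby.totals_transactionRevenue.quantile(0.75)
iqr = q3 - q1

#keep only rows whose revenue falls within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_groupby = df_groupby.loc[df_groupby.totals_transactionRevenue.between(lower, upper)]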

B. K-Means Clustering

What is K-Means Clustering and how does it help with customer segmentation?

Clustering is the best-known unsupervised learning technique; it finds structure in unlabeled data by identifying similar groups/clusters, most commonly with the help of K-Means.

K-Means addresses two things: (1) K: the number of clusters (groups) we expect to find in the dataset, and (2) Means: the average distance between the data points and their cluster centers (centroids), which we try to minimize.

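To make the "Means" part concrete, here's a minimal sketch (toy data, not from this project) of the quantity K-Means minimises: the total squared distance from each point to its nearest centroid, which is what scikit-learn later reports as inertia_:

import numpy as np

#toy 2-D points and two hypothetical centroids
points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])
centroids = np.array([[1.25, 1.9], [8.1, 7.95]])

#squared distance from every point to every centroid, then keep the nearest one
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
inertia = sq_dists.min(axis=1).sum()
print(inertia)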

Also, one thing to note is that K-Means comes with several initialization variants, typically:

  1. init = ‘random’: randomly selects the initial centroids of each cluster

  2. init = ‘k-means++’: selects only the 1st centroid at random, while the remaining centroids are placed as far away from the already-chosen ones as possible

In this project, I'll use the second option to ensure that the clusters are well-separated from one another:

from sklearn.cluster import KMeans

data = df_groupby.iloc[:, 7:]
kmeans = KMeans(n_clusters=3, init="k-means++")
kmeans.fit(data)
labels = kmeans.predict(data)
labels = pd.DataFrame(data=labels, index=df_groupby.index, columns=["labels"])

Before applying the algorithm, we need to define "n_clusters", which is the number of groups we expect to get out of the modelling. In this case, I arbitrarily set n_clusters = 3. Then, I went ahead and visualized how the dataset is grouped using 2 variables: Revenue and PageViews:

#Combine the cluster labels with the grouped data (this step is assumed; it isn't shown in the original post)
df_kmeans = pd.concat([df_groupby, labels], axis=1)

plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 0],
            df_kmeans.totals_pageviews[df_kmeans.labels == 0], c='blue')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 1],
            df_kmeans.totals_pageviews[df_kmeans.labels == 1], c='green')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 2],
            df_kmeans.totals_pageviews[df_kmeans.labels == 2], c='orange')
plt.show()

As you can see, the x-axis represents Revenue while the y-axis represents PageViews. After modelling, we can see a certain degree of separation between the 3 clusters. However, I was not sure whether 3 is the "right" number of clusters. That said, we can rely on an attribute of the fitted K-Means model, inertia_, which is the sum of squared distances from each sample to its closest centroid. In particular, we will compare the inertia for cluster counts ranging from 1 to 10, in my case, and see how low it gets and how far we should go:

#Find the best number of clusters
num_clusters = [x for x in range(1, 10)]
inertia = []

for i in num_clusters:
    model = KMeans(n_clusters=i, init="k-means++")
    model.fit(data)
    inertia.append(model.inertia_)

plt.plot(num_clusters, inertia)
plt.show()
model.inertia_

From the chart above, inertia starts to fall more slowly from the 4th or 5th cluster onwards, meaning additional clusters bring diminishing returns, so I decided to go with "n_clusters=4":

#Re-fit with 4 clusters and attach the new labels (this step is assumed; it isn't shown in the original post)
kmeans_n4 = KMeans(n_clusters=4, init="k-means++")
labels_n4 = pd.DataFrame(kmeans_n4.fit_predict(data), index=df_groupby.index, columns=["labels"])
df_kmeans_n4 = pd.concat([df_groupby, labels_n4], axis=1)

plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 0],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 0], c='blue')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 1],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 1], c='green')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 2],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 2], c='orange')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 3],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 3], c='red')
plt.xlabel("Page Views")
plt.ylabel("Revenue")
plt.show()
Switch PageViews to x-axis and Revenue to y-axis

The clusters now look a lot more distinguishable from one another:

  1. Cluster 0 (Blue): high PageViews yet little-to-no Revenue
  2. Cluster 1 (Red): medium PageViews, low Revenue
  3. Cluster 2 (Orange): medium PageViews, medium Revenue
  4. Cluster 3 (Green): unclear trend in PageViews, high Revenue

Except for clusters 0 and 3, whose patterns are unclear and beyond our control, clusters 1 and 2 can tell a story here as they seem to share some similarities.

To understand which factors might impact each cluster, I segmented each cluster by Channel, Device and Operating System:

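The charts below come from the original analysis; a minimal sketch of how such a per-cluster breakdown could be produced (assuming the df_kmeans_n4 frame from the previous step) is:

#total revenue per cluster, split by channel
cluster_by_channel = (df_kmeans_n4
                      .groupby(['labels', 'channelGrouping'])['totals_transactionRevenue']
                      .sum()
                      .unstack(fill_value=0))
print(cluster_by_channel)

#the same idea applies to device_browser and device_operatingSystem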

Cluster 1
Cluster 2

As seen above, in Cluster 1 the Referral channel contributed the highest Revenue, followed by Direct and Organic Search. In contrast, it's Direct that made the highest contribution in Cluster 2. Similarly, while Macintosh is the dominant operating system in Cluster 1, it's Windows in Cluster 2 that achieved higher revenue. The only similarity between the two clusters is the device browser: Chrome is widely used in both.

Voila! This further segmentation helps us tell which factors (in this case, Channel, Device Browser and Operating System) work better for each cluster, so we can better evaluate our investment moving forward!

C. A/B Testing through Hypothesis Testing

What is A/B Testing and how can Hypothesis Testing complement the process?

A/B Testing is no stranger to those who work in Advertising and Media, since it's one of the powerful techniques that help improve performance with greater cost efficiency. In particular, A/B Testing divides the audience into 2 groups: Test vs Control. Then, we expose the ads or show a different design to the Test group only, and check whether there's any significant discrepancy between the 2 groups: exposed vs un-exposed.

Image credit: https://productcoalition.com/are-you-segmenting-your-a-b-test-results-c5512c6def65?gi=7b445e5ef457
In Advertising, there are a number of automated tools on the market that can easily run A/B Testing in one click. However, I still wanted to try a different method from Data Science that can do the same: Hypothesis Testing. The methodology is pretty much the same, as Hypothesis Testing compares the Null Hypothesis (H0) against the Alternative Hypothesis (H1) and checks whether there's any significant discrepancy between the two!

Assume that I run a promotion campaign that exposes an ad to the Test group. Here’s a quick summary of steps that need to be followed to test the result with Hypothesis Testing:

  1. Sample Size Determination
  2. Pre-requisite Requirements: Normality and Correlation Tests
  3. Hypothesis Testing

For the 1st step, we can rely on Power Analysis, which helps determine the sample size to draw from a population. Power Analysis requires 3 parameters: (1) effect size, (2) power and (3) alpha. If you are looking for details on how Power Analysis works, please refer to the in-depth article I wrote some time ago.

Below is a quick note on each parameter for your understanding:

#Effect Size: (expected mean - actual mean) / actual_std
effect_size = (280000 - df_group1_ab.revenue.mean())/df_group1_ab.revenue.std()  #set expected mean to $280,000
print(effect_size)

#Power
power = 0.9  #the probability of correctly rejecting the null hypothesis when it is false

#Alpha
alpha = 0.05  #the significance level (accepted Type I error rate)

After having 3 parameters ready, we use TTestPower() to determine the sample size:

import statsmodels.stats.power as sms

n = sms.TTestPower().solve_power(effect_size=effect_size, power=power, alpha=alpha)
print(n)

The result is 279, meaning we need to draw 279 data points from each group: Test and Control. As I don’t have real data, I used np.random.normal to generate a list of revenue data, in this case sample size = 279 for each group:

#Take the samples out of each group: control vs test
#(control_rev and test_rev are assumed to be the revenue series of the Control and Test groups; they aren't shown in the original post)
control_sample = np.random.normal(control_rev.mean(), control_rev.std(), size=279)
test_sample = np.random.normal(test_rev.mean(), test_rev.std(), size=279)

Moving to the 2nd step, we need to ensure the samples are (1) normally distributed and (2) independent (not correlated). Again, if you want a refresher on the tests used in this step, refer to my article linked above. In short, we are going to use (1) the Shapiro test for normality and (2) the Pearson test for correlation.

#Step 2. Pre-requisite: Normality, Correlation
from scipy.stats import shapiro, pearsonr

stat1, p1 = shapiro(control_sample)
stat2, p2 = shapiro(test_sample)
print(p1, p2)

stat3, p3 = pearsonr(control_sample, test_sample)
print(p3)

The Shapiro p-values are 0.129 and 0.539 for the Control and Test groups respectively, both > 0.05. Hence, we fail to reject the null hypothesis and can treat both groups as normally distributed.

The Pearson p-value is 0.98, which is > 0.05, meaning there is no significant correlation and we can treat the 2 groups as independent of each other.

The final step is here! As there are 2 groups to be tested against each other (Test vs Control), we use an independent-samples T-Test to see if there's any significant discrepancy in Revenue after running the A/B Test:

#Step 3. Hypothesis Testing
from scipy.stats import ttest_ind

tstat, p4 = ttest_ind(control_sample, test_sample)
print(p4)

The resulting p-value is 0.35, which is > 0.05. Hence, the A/B Test indicates that the Test Group exposed to the ads doesn't show any significant superiority over the Control Group with no ad exposure.

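As a reminder of how the p-value is read (with alpha = 0.05 as defined earlier), the decision rule is simply:

if p4 < alpha:
    print("Reject H0: the Test group's revenue differs significantly from the Control group's")
else:
    print("Fail to reject H0: no significant difference detected")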

Voila! That’s the end of this project — Customer Segmentation & A/B Testing! I hope you find this article useful and easy to follow.

Do look out for my upcoming projects in Data Science and Machine Learning in the near future! In the meantime feel free to check out my Github here for the complete repository:

GitHub: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07

Thanks!

Originally published at: https://towardsdatascience.com/customer-segmentation-k-means-clustering-a-b-testing-bd26a94462dd
