“It is difficult to make predictions, especially about the future.”

~Niels Bohr

Everything is better interpreted through data, and data-driven decision making is crucial for success in any industry.

This has been true since time immemorial. The difference now is that we have developed a healthier outlook on data, we have far more data available than ever before, and we have at our disposal computing power that was previously unimaginable.

In this situation, computing power and data should be leveraged to make better decisions and solve business problems.

In my project, I chose to provide recommendations for opening new eateries in California. I provided a concrete list of recommendations to invest in: eatery types (such as Japanese restaurant, dessert shop, etc.) and the respective counties were suggested.

In this post, I will go over the full process of a Data Science project.

Data Sources

To solve this problem, data from four sources was used:

Location data titled “California Counties”, published by the Government of California on the California Open Data Portal, for the geographical locations.
The Foursquare API, for information about established restaurants and other relevant details about them.
County-wise population data from the US Government Census site.
County-wise Real GDP data provided by the Bureau of Economic Analysis, U.S. Department of Commerce.
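Combining the county-level sources usually comes down to a join on the county name. The following is a minimal sketch of that step, assuming pandas is used; the column names (`county`, `population`, `real_gdp`) and all values are hypothetical placeholders, not the project's actual data.

```python
import pandas as pd

# Hypothetical toy tables standing in for the census and BEA extracts.
population = pd.DataFrame({
    "county": ["County A", "County B", "County C"],
    "population": [12_000, 450_000, 3_100_000],
})
gdp = pd.DataFrame({
    "county": ["County A", "County B", "County D"],
    "real_gdp": [0.5, 22.0, 9.0],  # arbitrary units
})

# An inner join keeps only counties present in both tables;
# validate= raises an error if a county name is duplicated on either side.
merged = population.merge(gdp, on="county", how="inner", validate="one_to_one")
print(merged)
```

An inner join silently drops counties whose names do not match across sources, so it is worth comparing the row counts before and after the merge.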

Exploratory Data Analysis

After cleaning the data (which often feels like more than 90% of a Data Scientist’s job), meaningful insights were gained from it.

City Centers of California’s Counties, source: Author

It was also found that the GDPs of the counties are strongly correlated with their populations, which makes counties with high GDP and high population attractive destinations for investment.

Strong Correlation Between GDP and Population of Californian Counties, source: Author
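A correlation like this can be quantified with the Pearson coefficient. Below is a small sketch using NumPy; the numbers are made-up illustrative values, not the county data from the project, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical toy values (not the project's data).
population = np.array([10_000, 50_000, 200_000, 1_000_000, 3_000_000], dtype=float)
real_gdp = np.array([0.4, 2.1, 9.5, 60.0, 210.0], dtype=float)  # arbitrary units


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two 1-D arrays."""
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the correlation between x and y.
    return float(np.corrcoef(x, y)[0, 1])


r = pearson_r(population, real_gdp)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship, which is what the scatter plot above suggests for the real data.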

Number of Eateries in Each County (capped at 50 by Foursquare), source: Author

With the information provided by the Foursquare API, a list of the ten most common venues was obtained for each county. This list is used later in decision making.
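The ranking step can be sketched in plain Python as follows. It assumes the venue data has already been reduced to (county, venue category) pairs; the rows below are hypothetical, and `most_common_venues` is an illustrative helper, not a Foursquare API call.

```python
from collections import Counter, defaultdict

# Hypothetical (county, venue category) pairs, one per venue returned.
venues = [
    ("County A", "Pizza Place"),
    ("County A", "Pizza Place"),
    ("County A", "Pizza Place"),
    ("County A", "Coffee Shop"),
    ("County A", "Coffee Shop"),
    ("County A", "Bakery"),
    ("County B", "Taco Place"),
    ("County B", "Taco Place"),
    ("County B", "Coffee Shop"),
]


def most_common_venues(rows, top_n=10):
    """Map each county to its top_n venue categories, most frequent first."""
    counts = defaultdict(Counter)
    for county, category in rows:
        counts[county][category] += 1
    return {
        county: [category for category, _ in counter.most_common(top_n)]
        for county, counter in counts.items()
    }


top = most_common_venues(venues, top_n=2)
print(top)
```

With the real data, `top_n=10` would give the ten most common venue categories per county described above.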

Applying Machine Learning Model

Choosing Algorithm

The business problem is to look for eatery types and locations to invest in. The data is not labeled, which makes this a classic application of unsupervised learning.

The aim is not to predict a value or a class, nor to give the stakeholders a single investment recommendation. The goal is to suggest a list of likely venues.

This can be achieved by clustering the counties based on GDP and population. KMeans clustering is a simple and well-suited statistical learning algorithm for this.

The Scikit-learn library’s implementation of the KMeans clustering algorithm was used.
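As a rough illustration of what such a clustering step can look like, here is a sketch with scikit-learn's `KMeans`. The feature values are hypothetical, and the standardization with `StandardScaler` is an assumption of this sketch (the post does not say whether the features were scaled); k = 4 matches the value chosen in the next section.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features, one row per county: [population, real GDP].
features = np.array([
    [5_000, 0.2], [8_000, 0.3], [12_000, 0.5],   # small counties
    [1_000_000, 60.0], [1_100_000, 65.0],        # mid-range counties
    [3_000_000, 250.0], [3_200_000, 270.0],      # large counties
    [10_000_000, 700.0],                         # a single very large county
])

# Assumption: standardize so both features contribute on a comparable scale.
scaled = StandardScaler().fit_transform(features)

# n_init and random_state are set explicitly for reproducible results.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(scaled)
print(labels)
```

The cluster numbers returned by KMeans are arbitrary identifiers, so they need to be interpreted by inspecting the members of each cluster rather than by their numeric order.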

Choosing k

To choose the best k for clustering, the elbow method was employed.

Inertia vs. Values of k Plot, source: Author

As is evident from the graph, the best k is 4. Hence, the clustering algorithm was applied with k = 4, forming 4 clusters of counties based on their population and GDP.
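The elbow method amounts to fitting KMeans for a range of k values and recording the inertia (within-cluster sum of squared distances) of each fit. A minimal sketch, again on hypothetical toy features and with feature scaling as an assumption of the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features, one row per county: [population, real GDP].
features = np.array([
    [5_000, 0.2], [8_000, 0.3], [12_000, 0.5],
    [1_000_000, 60.0], [1_100_000, 65.0],
    [3_000_000, 250.0], [3_200_000, 270.0],
    [10_000_000, 700.0],
])
scaled = StandardScaler().fit_transform(features)

# Fit one model per candidate k and keep its inertia.
inertia_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertia_by_k[k] = model.inertia_

for k, inertia in inertia_by_k.items():
    print(f"k={k}: inertia={inertia:.4f}")
```

Plotting these values against k gives the kind of curve shown above; the "elbow" is the k after which adding more clusters reduces the inertia only marginally.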

Results

Four clusters of counties were formed. Upon examination, it was found that Los Angeles county formed a cluster by itself (cluster-2) due to its exceptionally high GDP and population. Counties in another cluster had high GDP and high population, but nowhere near those of Los Angeles county; Orange, Santa Clara, and San Diego are the three counties in this cluster (cluster-3). Then there are counties with low GDP and low population, such as Plumas, Nevada, Sierra, etc., in one cluster (cluster-1), and counties with mid-range GDP and population, such as Sacramento, Riverside, etc., in another cluster (cluster-4).

Resulting Clusters on a Map, source: Author

In clusters 2 and 3 we have counties with a high population and high GDP. In these counties, investing in any eatery is likely to be profitable, though it is advisable to invest in an eatery type that is not among the top 3 most common venues.

In cluster 4, the population and GDP of the counties are higher than those of the counties in cluster 1 but lower than those of the counties in clusters 2 or 3. Investment in these counties is preferred after counties in cluster 2 and cluster 3, in that order. Investment should be made in uncommon eateries so that they face less competition.

Cluster 1 is dominated by counties with lower populations. Investment in these counties is the least advised and should be considered only after counties in clusters 2, 3, and 4. Investment in the most common eatery types is not advised at all.

After suggesting investment options, a table was formed for each cluster listing, for each county, the eatery types that are not among its three most common types.

Table for Counties and Investment Recommendations in Cluster 3
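Building such a table can be sketched as a simple filter over each county's ranked venue list. The example below assumes a mapping from county to venue categories ordered from most to least common; the county names, categories, and the `recommend` helper are all hypothetical.

```python
# Hypothetical ranked venue categories per county, most common first.
ranked_venues = {
    "County A": ["Pizza Place", "Coffee Shop", "Bakery", "Sushi Restaurant", "Dessert Shop"],
    "County B": ["Taco Place", "Coffee Shop", "Burger Joint", "Ice Cream Shop"],
}


def recommend(ranked, skip_top=3):
    """Drop the skip_top most common categories and keep the rest per county."""
    return {county: categories[skip_top:] for county, categories in ranked.items()}


recommendations = recommend(ranked_venues)
for county, categories in recommendations.items():
    print(f"{county}: {', '.join(categories)}")
```

Skipping the three most common categories reflects the recommendation above to avoid eatery types that already face the most competition in a county.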

Full Report Link: PDF in GitHub Repository
Notebook with Full Code: NB Viewer

Feel free to comment, provide feedback, or criticize.

Connect with me on LinkedIn or Twitter.

This blog post is related to the Applied Data Science Capstone Project offered by IBM through Coursera.

Translated from: https://medium.com/beginning-data-science/investing-in-a-new-eastery-in-california-a-data-driven-approach-e91229e0289e
