A Step-by-Step Guide for Creating an Authentic Data Science Portfolio Project

As an aspiring data scientist, building interesting portfolio projects is key to showcasing your skills. When I learned coding and data science as a business student through online courses, I disliked that the datasets were either made up of fake data or had been solved many times before, like Boston House Prices or the Titanic dataset on Kaggle.

In this blog post, I want to show you how I develop interesting data science project ideas and implement them step by step, using my exploration of Germany’s biggest frequent flyer forum, Vielfliegertreff, as the example. If you are short on time, feel free to skip to the Conclusion & TLDR.

Step 1: Choose a relevant topic you are passionate about

As a first step, I think about a potential project that fulfills the following three requirements to make it the most interesting and enjoyable:

  1. It solves my own problem or answers a burning question of mine
  2. It is connected to some recent event, which makes it relevant or especially interesting
  3. It has not been solved or covered before

As these ideas are still quite abstract, let me give you a rundown of how my three projects fulfilled the requirements:

Overview of my own data science portfolio projects fulfilling the three outlined requirements.

As a beginner, do not strive for perfection. Instead, choose something you are genuinely curious about and write down all the questions you want to explore in your topic.

Step 2: Start scraping together your own dataset

If you followed my third requirement, there will be no publicly available dataset, and you will have to scrape the data together yourself. Having scraped a couple of websites, I now use three major frameworks for different scenarios:

Overview of the 3 major frameworks I use for scraping.

For Vielfliegertreff, I used Scrapy as the framework for the following reasons:

  1. There were no JavaScript-enabled elements hiding data.
  2. The website structure was complex: I had to go from each forum subject to all of its threads, and from each thread to all of its post pages. With Scrapy you can implement this kind of logic in an organized way by yielding requests that lead to new callback functions.
  3. There were quite a lot of posts, so crawling the entire forum was definitely going to take some time. Scrapy lets you scrape websites asynchronously at an incredible speed.

To give you an idea of how powerful Scrapy is, I ran a quick benchmark on my MacBook Pro (13-inch, 2018, Four Thunderbolt 3 Ports) with a 2.3 GHz Quad-Core Intel Core i5, which was able to scrape around 3,000 pages per minute:

Scrapy scraping benchmark. (Image by Author)

To be nice and to avoid getting blocked, it is important that you scrape gently, for example by enabling Scrapy’s auto-throttle feature. I also saved all data to a SQLite database via an item pipeline to avoid duplicates, and turned on logging of each URL request so that I would not put more load on the server if I stopped and restarted the scraping process.
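A rough sketch of what such a setup can look like is shown below. `AUTOTHROTTLE_ENABLED`, `JOBDIR` and `ITEM_PIPELINES` are real Scrapy settings, but using `JOBDIR` for resuming a crawl is only one possible option and an assumption here, not necessarily what I configured. The `SQLitePipeline` class, the module path, the table layout and the field names are hypothetical.

```python
import sqlite3

# --- settings.py (sketch) ---
AUTOTHROTTLE_ENABLED = True      # let Scrapy adapt the delay to server load
JOBDIR = "crawl_state"           # assumption: persist crawl state between restarts
ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}  # hypothetical path


# --- pipelines.py (sketch) ---
class SQLitePipeline:
    """Store each scraped post once, keyed by its post id."""

    def __init__(self, db_path="posts.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS posts ("
            "post_id TEXT PRIMARY KEY, author TEXT, body TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE skips rows whose primary key already exists,
        # so re-scraped posts do not create duplicates.
        self.conn.execute(
            "INSERT OR IGNORE INTO posts (post_id, author, body) VALUES (?, ?, ?)",
            (item["post_id"], item.get("author"), item.get("body")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

Letting the database enforce uniqueness through the primary key keeps the pipeline short, at the cost of silently keeping the first version of a post if it was edited later.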

Knowing how to scrape gives you the freedom to collect datasets by yourself. It also teaches you important concepts about how the internet works, what a request is, and how HTML and XPath are structured.

For my project, I ended up with 1.47 GB of data, which amounted to close to 1 million forum posts.

Step 3: Cleaning your dataset

With your own messy, scraped dataset comes the most challenging part of the portfolio project: data cleaning, where data scientists spend on average 60% of their time:

Source: CrowdFlower 2016

Unlike clean Kaggle datasets, your own dataset lets you build skills in data cleaning and show a future employer that you are ready to deal with messy real-life datasets. Additionally, you can explore and take advantage of the Python ecosystem by leveraging libraries that solve common data cleaning tasks others have already tackled.

For my dataset from Vielfliegertreff, there were a couple of common tasks, such as turning the dates into pandas timestamps, converting numbers from strings into actual numeric data types, and cleaning very messy HTML post text into something readable and usable for NLP tasks. While some tasks are a bit more complicated, I would like to share my top 3 favourite libraries that solved some of my common data cleaning problems:

  1. dateparser: Easily parse localized dates in almost any string format commonly found on web pages.
  2. clean-text: Preprocess your scraped data to create a normalized text representation. It is also great at removing personally identifiable information, such as emails or phone numbers.
  3. fuzzywuzzy: Fuzzy string matching like a boss.
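To illustrate the three common tasks mentioned above (dates, numbers and HTML text), here is a small sketch that uses only pandas and the standard library rather than the three libraries from the list. The sample rows, the column names and the date format are made up for illustration; the real forum data looked different and was messier.

```python
import html
import re

import pandas as pd

# Hypothetical raw rows, shaped like what a forum scrape might return.
raw = pd.DataFrame(
    {
        "created": ["12.03.2020, 14:05", "01.01.2009, 09:30"],
        "likes": ["1.234", "7"],  # German-style thousands separator
        "body": ["<p>Gr&uuml;&szlig;e aus Frankfurt</p>", "<div>Hello<br>world</div>"],
    }
)


def strip_html(text: str) -> str:
    """Drop tags, decode entities and collapse whitespace (a rough regex approach)."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(html.unescape(no_tags).split())


clean = pd.DataFrame(
    {
        # Strings -> pandas timestamps, using an explicit day-first format.
        "created": pd.to_datetime(raw["created"], format="%d.%m.%Y, %H:%M"),
        # Remove the thousands separator, then convert to a numeric dtype.
        "likes": pd.to_numeric(raw["likes"].str.replace(".", "", regex=False)),
        # Messy HTML -> plain text.
        "body": raw["body"].map(strip_html),
    }
)
print(clean.dtypes)
```

Hand-written formats like this break as soon as the site mixes date styles or nests markup in unexpected ways, which is where dedicated libraries start to pay off.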

Step 4: Data exploration and analysis

When completing the Data Science Nanodegree on Udacity, I came across the Cross-Industry Standard Process for Data Mining (CRISP-DM), which I thought was quite an interesting framework to structure your work in a systematic way.

With our current flow, we implicitly followed CRISP-DM for our project:

We expressed business understanding by coming up with the following questions in Step 1:

  1. How is COVID-19 impacting online frequent flyer forums like Vielfliegertreff?
  2. What are some of the best posts in the forums?
  3. Who are the experts that I should follow as a new joiner?
  4. What are some of the worst or best things people say about airlines or airports?

With the scraped data, we are now able to translate our initial business questions from above into specific exploratory data questions:

  1. How many posts are published each month? Did the number of posts decrease at the beginning of 2020 with COVID-19? Is there also some indication that fewer people joined the platform because they were not able to travel?
  2. Which are the top 10 posts by number of likes?
  3. Who posts the most and also receives, on average, the most likes per post? These are the users I should follow regularly to see the best content.
  4. Could sentiment analysis on every post, combined with named entity recognition to identify cities, airports and airlines, surface interesting positive or negative comments?
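The first three questions map fairly directly onto pandas operations. The sketch below runs on a tiny made-up sample; the column names (`post_id`, `author`, `created`, `likes`) are assumptions for illustration and not the schema of my actual dataset.

```python
import pandas as pd

# Tiny made-up sample; column names are assumptions, not the real schema.
posts = pd.DataFrame(
    {
        "post_id": [1, 2, 3, 4, 5],
        "author": ["anna", "ben", "anna", "cara", "anna"],
        "created": pd.to_datetime(
            ["2019-12-03", "2019-12-20", "2020-01-15", "2020-03-02", "2020-03-28"]
        ),
        "likes": [4, 10, 1, 0, 7],
    }
)

# Question 1: number of posts per calendar month.
posts_per_month = posts.groupby(posts["created"].dt.to_period("M")).size()

# Question 2: the ten most-liked posts.
top_posts = posts.nlargest(10, "likes")

# Question 3: who posts the most, and how many likes do they get on average?
user_stats = (
    posts.groupby("author")
    .agg(n_posts=("post_id", "count"), mean_likes=("likes", "mean"))
    .sort_values(["n_posts", "mean_likes"], ascending=False)
)

print(posts_per_month)
print(top_posts[["post_id", "likes"]])
print(user_stats)
```

Note that months without any posts simply do not appear in `posts_per_month`, so for a chart you may want to reindex over the full range of months first.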

For the Vielfliegertreff project, one can definitely say that the number of posts has been declining over the years. With COVID-19, we can clearly see a rapid decrease in posts from January 2020 onwards, when Europe was shutting down and closing borders, which also heavily affected travel:

Posts created by month. (Chart by Author)

User sign-ups have also gone down over the years, and the forum seems to have lost more and more of the rapid growth it saw after its start in January 2009:

User sign-ups by month. (Chart by Author)

Last but not least, I wanted to check what the most liked post was about. Unfortunately, it is in German, but it was indeed a very interesting post, in which a German guy was allowed to spend some time on a US aircraft carrier and experienced a catapult take-off in a C-2 airplane. The post has some very nice pictures and interesting details. Feel free to check it out here if you can understand some German:

(fleckenmann)

Step 5: Share your work via a blog post or web app

Once you are done with those steps, you can go one step further and create a model that classifies or predicts certain data points. For this project, I did not go on to apply machine learning, although I had some interesting ideas about classifying the sentiment of posts in connection with certain airlines.

In another project, however, I built a price prediction model that lets a user get a price estimate for any type of tractor. The model was then deployed with the awesome Streamlit framework and can be found here (be patient, it might load a bit slowly).

Another way to share your work is, as I do, through blog posts on Medium, Hackernoon, KDnuggets or other popular websites. When writing blog posts about portfolio projects or other topics, such as awesome interactive AI applications, I always try to make them as fun, visual and interactive as possible. Here are some of my top tips:

  • Include nice pictures for easy understanding and to break up some of the long text
  • Include interactive elements, like tweets or videos, that let the reader interact
  • Swap boring tables or charts for interactive ones using tools and frameworks like Airtable or Plotly

Conclusion & TLDR

Come up with a blog post idea that answers a burning question you had or solves your own problem. Ideally, the topic is timely and has not been analysed by anyone else before. Based on your experience and on the website’s structure and complexity, choose the framework that best matches the scraping job. During data cleaning, leverage existing libraries to solve painful tasks like parsing timestamps or cleaning text. Finally, choose how you can best share your work. Either an interactive deployed model or dashboard, or a well-written Medium blog post, can differentiate you from other applicants on the journey to becoming a data scientist.

As always, feel free to share with me some great data science resources or some of your best portfolio projects!

Translated from: https://towardsdatascience.com/a-step-by-step-guide-for-creating-an-authentic-data-science-portfolio-project-aa641c2f2403
