6 Ways to Make Your Data Science Projects Stand Out

The global COVID-19 pandemic has left many people with a lot of time to work on their data science project portfolios. With everyone applying to jobs, how can you make sure that yours stands out? Read on to find out.

1. Use more unique data

Iris, Galton, Titanic, Northwind Traders, Superstore, Go Data Warehouse. While you were studying data science in school, you no doubt came across at least one of these data sets. There is a reason for that: they help demonstrate concepts like clustering, regression, logistic regression, database structures, data visualization, and report building. Each data set is clean and small, but that isn’t all they have in common: everyone has worked with these data sets. No new or exciting projects are being built on these training data sets. No recruiter is going to look at your Titanic project (one of the most popular data sets on Kaggle) and say, “We need this person on our team.”

We live in the data age, which means there is no shortage of data sets that are easily available for download. Get your data from somewhere more exciting than Kaggle or the data sets you learned machine learning on. A good place to branch out is Data.gov. In 2013, President Obama signed an executive order making open and machine-readable data the new default for government information. This means that a wealth of searchable information is ready to download right from Data.gov. Federal student loan program data, federal aid to states data, and accidental drug-related deaths are just a few of the over 200,000 data sets available for your use. Just make sure to look at the metadata provided with the file so you understand what you are working with.

Want to make things a little more personal? Use your own data! Anything you do can be turned into data. Many gyms are closed during stay-at-home orders, so maybe you’re working out from home. All of those exercises you are doing can be tracked. Look at how many reps you are doing, which muscle groups you are working on, and which days you are working out. The best part about using your own data is that you are the subject matter expert. You may end up with smaller data sets to work with, but you will have a much deeper understanding of how the data was captured, and you control whether new variables or dimensions get added to it.

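As a rough illustration of the workout idea, here is a minimal Python sketch that summarizes a hand-entered exercise log. The log entries, field order, and function names are all made up for this example; a real log would come from whatever notebook, spreadsheet, or app you track your workouts in.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical log entries: (day, exercise, muscle group, reps).
WORKOUT_LOG = [
    (date(2020, 4, 6), "push-up", "chest", 30),
    (date(2020, 4, 6), "squat", "legs", 40),
    (date(2020, 4, 8), "push-up", "chest", 35),
    (date(2020, 4, 8), "plank row", "back", 20),
    (date(2020, 4, 11), "squat", "legs", 45),
]


def reps_by_muscle_group(log):
    """Total reps per muscle group across the whole log."""
    totals = defaultdict(int)
    for _day, _exercise, group, reps in log:
        totals[group] += reps
    return dict(totals)


def workout_days_by_weekday(log):
    """Count distinct workout days for each weekday name."""
    distinct_days = {entry[0] for entry in log}
    counts = defaultdict(int)
    for day in distinct_days:
        counts[WEEKDAYS[day.weekday()]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(reps_by_muscle_group(WORKOUT_LOG))
    print(workout_days_by_weekday(WORKOUT_LOG))
```

Because you designed the log yourself, adding a new dimension (weight used, time of day, how you felt) is just one more field in each entry.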

None of these sound advanced enough? Take a look at web scraping. Web scraping is the automated process of collecting unstructured data from the internet. You will have to write code in a language such as R or Python to capture the data, and you will have to do your own research about the values you capture and how the website you are scraping got those values. The end result will be much more unique, but it will also take more work to learn about the data and clean what you collected.

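To give a feel for what scraping code looks like, the sketch below pulls the rows out of an HTML table using only Python's standard library. The HTML snippet and its values are invented for illustration; against a real site you would first download the page (for example with `urllib.request`), check that the site's terms allow scraping, and expect messier markup than this.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Made-up page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>High (F)</th></tr>
  <tr><td>Hamilton</td><td>78</td></tr>
  <tr><td>St. George's</td><td> 77 </td></tr>
</table>
"""


def scrape_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableCellParser()
    parser.feed(html)
    return parser.rows


if __name__ == "__main__":
    for row in scrape_rows(SAMPLE_HTML):
        print(row)
```

Note that everything comes back as text, including the numbers, which is part of the extra cleaning work that scraped data brings with it.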

2. Do a data cleaning project

Speaking of data cleaning, real world data is disgusting; be sure to wear your face mask while working with it. Jokes aside, when someone asks for a model that uses data to predict customer churn, there is almost never a clean, ready-to-use data source to build that model from. Most classes will not prepare you to handle the sorts of dirty data that organizations have available. This is a critical skill that you need to showcase in at least one of your projects.

Cleaning data can involve many tasks. A good place to start is understanding the data. Government and publicly available data will often have a data dictionary containing descriptions of each dimension, measure, observation, and table in the data. This will help you understand what data was collected, when it was collected, and who collected it.

Understanding what you are looking at is a key to data validation. Once you know what a variable is, you may be able to use the data dictionary, common sense, or a subject matter expert to determine which values don’t make sense. For example, temperatures should fall in a certain range of values. If the data dictionary specifies the units of measurement as Kelvin, any 0 or negative values would be suspect. If it is temperature data from Bermuda, warmer temperatures would make sense, and anything too hot or too cold would be suspect. For something like manufacturing welding temperatures, you may want to look to a professor or engineering student to give you more guidance. The key in this step is to find values that don’t look right.

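The temperature checks described above can be sketched in a few lines of Python. The readings are invented, and the 5 to 40 degree Celsius window for Bermuda is only an assumed plausibility range for illustration; in a real project that range should come from the data dictionary or a subject matter expert.

```python
def nonpositive_kelvin(values):
    """Flag (index, value) pairs that are zero or negative on the Kelvin scale."""
    return [(i, v) for i, v in enumerate(values) if v <= 0]


def out_of_range(values, low, high):
    """Flag (index, value) pairs that fall outside the inclusive range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Made-up readings for illustration only.
kelvin_readings = [293.15, 0.0, 288.7, -4.0]
bermuda_celsius = [24.5, 27.1, -12.0, 55.0, 19.8]

suspect_kelvin = nonpositive_kelvin(kelvin_readings)
# Assumed plausibility window; replace with a range an expert would stand behind.
suspect_bermuda = out_of_range(bermuda_celsius, low=5.0, high=40.0)

if __name__ == "__main__":
    print("Suspect Kelvin readings:", suspect_kelvin)
    print("Suspect Bermuda readings:", suspect_bermuda)
```

Flagging is deliberately separate from fixing here: the code only surfaces values that don’t look right, and deciding what to do with them is a judgment call for you and your experts.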

Another area to look into is how to handle missing values. Like data validation, context matters when handling missing values. If you are looking at financial loan data for cars and a loan is in good standing and has a missing value for repossession status, you won’t be worried about that value being missing. If your project involves psychological assessments and you are missing answers to a lot of questions, you may take a different course of action, like eliminating the observation. Sometimes missing values make sense in your context. As with data validation, work with your subject matter experts and peers to understand what to do with missing values.

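To make the point about context concrete, here is a small Python sketch of the two situations described above. The field names (`standing`, `repossession_status`), the sample records, and the 50% cutoff for dropping an assessment are all assumptions chosen for illustration, not rules to apply blindly.

```python
def unexplained_missing_repossession(loans):
    """Return ids of loans whose missing repossession status is NOT explained.

    A missing status on a loan in good standing is expected, so only loans
    that are not in good standing get flagged.
    """
    return [
        loan["id"]
        for loan in loans
        if loan["repossession_status"] is None and loan["standing"] != "good"
    ]


def drop_sparse_responses(responses, max_missing_share=0.5):
    """Keep only assessments whose share of unanswered questions is acceptable."""
    kept = []
    for answers in responses:
        missing = sum(answer is None for answer in answers)
        if missing / len(answers) <= max_missing_share:
            kept.append(answers)
    return kept


# Made-up records for illustration only.
LOANS = [
    {"id": 1, "standing": "good", "repossession_status": None},
    {"id": 2, "standing": "default", "repossession_status": None},
    {"id": 3, "standing": "default", "repossession_status": "repossessed"},
]
ASSESSMENTS = [
    [3, 4, None, 5],           # one gap: keep
    [None, None, None, 2],     # mostly unanswered: drop
    [1, 2, 3, 4],              # complete: keep
]

if __name__ == "__main__":
    print(unexplained_missing_repossession(LOANS))
    print(drop_sparse_responses(ASSESSMENTS))
```

The same missing marker gets opposite treatment in the two functions, which is the whole point: the rule comes from the context, not from the data type.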

3. Use version control

When doing data science projects, you likely spend a lot of time working with others, and this is just one of the many reasons to use version control. If you are not familiar with version control, it keeps all of the code in one place and allows multiple people to work on it at the same time. Everyone can add their contributions and review others’ code. If someone leaves the company, there isn’t a fight with the IT department to move the most recent code to someone else’s machine. All code lives in a central, organized repository.

Another issue that plagues students is naming versions of files. Never again will you have 20 versions of a file and have to remember how “project_02_final.py” is different from “project_12_final_done_finished.py.” With version control, every version has comments, and you can do a line-by-line comparison with any other version of your code to see what was deleted and added. You don’t even have to worry about an old version getting deleted; you can always roll back to an older version of the code.

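If you have never seen a line-by-line comparison, the Python sketch below imitates the idea with the standard library's `difflib` module. This is only an illustration of what a diff reports; a version control system does this for you across every saved version, and the two "versions" of the script here are invented.

```python
import difflib

# Two made-up versions of the same small script, one line per list item.
OLD_VERSION = [
    "import csv",
    "DEBUG = True",
    "rows = load('data.csv')",
    "print(len(rows))",
]
NEW_VERSION = [
    "import csv",
    "rows = load('data.csv')",
    "rows = clean(rows)",
    "print(len(rows))",
]


def line_changes(old, new):
    """Return (added, removed) lines between two versions of a file."""
    added, removed = [], []
    for line in difflib.ndiff(old, new):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


if __name__ == "__main__":
    added, removed = line_changes(OLD_VERSION, NEW_VERSION)
    print("Added:", added)
    print("Removed:", removed)
```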

The best part about version control software is that it is easy to use once you get started. It can come as a standalone program, be integrated into your favorite integrated development environment (IDE), or be used through a command line interface. Many systems have additional features that allow you to create a website based on your repository, test builds of your code, and share your code by embedding it in other places. It is not only a method to keep your project organized, but it can also be used to start presenting your projects.

4. Build your presentation layer

The term data scientist is still relatively new, and it isn’t uncommon for executives to not fully understand data concepts. Regardless of whether or not they have a deep understanding of data science and machine learning, they need a way to quickly get the most important information your projects have to offer. This is your presentation layer.

Most aspiring data scientists don’t focus enough on communication skills. Explaining how you used parallel computing in the cloud to train and test a model that you then deployed using a custom API for someone else to use won’t hold anyone’s attention. Management won’t always be as interested in how you did it as in why you did it. That isn’t to say the technical aspects are unimportant, but keep your audience in mind. The business side will want to see how the bottom line is impacted by your model, while the IT side will care more about its actual implementation.

Your presentation layer can take many different forms. Once again, keep your audience in mind. Some people only want spreadsheets and tables while others want graphs and colors. Some expect it to be delivered by email and others want to see all of their figures from different areas on the same page. Take the tools you have access to and create different presentation layers with different audiences in mind.

5. Experiment with various technologies

One of the most difficult things I experienced right out of college was seeing job openings at companies I wanted to work for and not having any of the relevant technology skills from the posting. A lot of companies struggle with onboarding and internal training. It is so much easier for them to hire someone who already knows what they need. Being familiar with a broader range of technologies will help in your search. Many organizations will even use multiple technologies together to piece together their data pipelines and data science environments.

There is the old joke that data science is 80% prepping data and 20% building models. There is truth to this. I have worked with data scientists who spent the vast majority of their time on a project getting code to run correctly on a server instead of their desktop. I have interviewed at smaller companies where data scientists are expected to work very closely with the data engineers to build the data structures that gather data and support their models. There is a huge benefit to using multiple languages and software packages to build different parts of your data science projects.

There is a wealth of resources out there, and learning a new technology or programming language doesn’t have to be difficult or expensive. Sure, there are professionally led trainings that can cost upwards of $1,500, but there are also free videos for almost anything, and students can get licenses and training for many expensive software packages for free just for having a .edu email address. Search the software vendor’s website to see what is available. Once you learn one type of software or language, it usually isn’t too hard to get into another similar one. If you can use Power BI, Tableau isn’t too difficult to use.

6. Combine everything into a pipeline

Speaking of multiple tools, why not use a bunch of them together? As a student, you often get your data in a simple format, a .csv for example. It isn’t impossible to get .csv files when working for an organization, but data isn’t typically stored that way. Data will come from multiple sources, have multiple layers for transformation and storage, and touch many different technologies along the way. A small pipeline for your project shows that you understand how these parts work together.

A pipeline can be as simple as loading files into a database, selecting the data to bring it into your analysis tool, and sending the results to your presentation layer. It doesn’t have to be pretty or fully automated; it just needs to demonstrate your understanding of how data pipelines work. Everyone knows how to read a .csv file, but not everyone can integrate the data, analysis, and presentation processes into a coherent project that reflects how organizations do it.

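As a minimal sketch of such a pipeline, the Python below loads a small file into a database, queries it for analysis, and formats the result as a plain-text report. Everything here is a stand-in chosen for illustration: an in-memory SQLite database plays the role of the organization's database, the sales figures are invented, and the "presentation layer" is just a string you could email or print.

```python
import csv
import io
import sqlite3

# Made-up source file; in practice this would be read from disk or an export.
RAW_CSV = """region,units,unit_price
North,10,2.50
South,4,3.00
North,6,2.50
"""


def load(conn, csv_text):
    """Step 1: load the raw file into a database table."""
    conn.execute("CREATE TABLE sales (region TEXT, units INTEGER, unit_price REAL)")
    rows = [
        (r["region"], int(r["units"]), float(r["unit_price"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def analyze(conn):
    """Step 2: select and aggregate the data needed for the analysis."""
    query = (
        "SELECT region, SUM(units * unit_price) "
        "FROM sales GROUP BY region ORDER BY region"
    )
    return conn.execute(query).fetchall()


def present(results):
    """Step 3: turn the results into something a reader can consume."""
    lines = ["Revenue by region"]
    lines += [f"{region}: {revenue:.2f}" for region, revenue in results]
    return "\n".join(lines)


def run_pipeline(csv_text=RAW_CSV):
    """Run load, analyze, and present in order against a throwaway database."""
    conn = sqlite3.connect(":memory:")
    try:
        load(conn, csv_text)
        return present(analyze(conn))
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_pipeline())
```

Keeping each stage in its own function makes it easy to swap one piece later, for example replacing the text report with a dashboard, without touching the rest.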

Summary

This is by no means a comprehensive list, but it should give you ideas about how to get your data science projects noticed! Keep in mind that your data science projects are a direct reflection of your skills. Showcase what you are good at by applying those skills to projects that reflect the complexity of real world data science. Be sure to use interesting data, clean it up, use version control to stay organized, communicate your message effectively, use varied technologies, and combine everything for a truly well-rounded data science project.

Keep an eye out for more articles about how I applied these ideas in a recent data science pipeline that I created! They will include my motivations for choosing a particular problem and issues I encountered along the way.

Translated from: https://towardsdatascience.com/6-ways-to-make-your-data-science-projects-stand-out-1eca46f5f76f
