Acing a Data Science Job Interview (Common Data Scientist Interview Questions)

During my time as a Data Scientist, I had the chance to interview my fair share of candidates for data-related roles. While doing this, I started noticing a pattern: some kinds of (simple) mistakes were overwhelmingly frequent among candidates! In striking disagreement with a famous quote by Tolstoy, it seems to me that “most unhappy mistakes in case studies look alike”.

In my mind, I started picturing the kind of candidate that I would hire in a heartbeat. No, not a Rockstar/Guru/Evangelist with 12 years of professional experience managing Kubernetes clusters and working with Hadoop/Spark, while simultaneously contributing to TensorFlow’s development, obtaining 2 PhDs, and publishing at least 3 Deep Learning papers per year. Nope; I would just instantly be struck by a person who at least does not make the kind of mistakes I am about to describe… And I can imagine the same happening in other companies, with other interviewers.

Although this is a personal and quite opinionated list, I hope these few tips and tricks can be of some help to people at the start of their data science career! I am putting here only the more DS-related things that came to my mind, but of course writing Pythonic, readable, and expressive code is also something that will immensely please whoever is interviewing you!

Sloppy use of Pandas

Let’s face it: for most of your day-to-day tasks as a data scientist you will be manipulating tables, slicing them, grouping them by the values contained in a column, applying transformations to them, and so on. This almost automatically implies that Pandas is one of the most important foundational tools for a data scientist, and if you are able to showcase some mastery with it, well, people will take you quite seriously.

Conversely, if you systematically do very low-level manipulations on your DataFrames where a built-in Pandas method exists, you will potentially raise all kinds of red flags.

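To make the contrast concrete, here is a minimal sketch that computes a per-group average twice: once with hand-written bookkeeping over the rows, and once with the built-in `groupby`. The DataFrame and its column names (`city`, `price`) are invented for the example.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "city": ["Rome", "Rome", "Oslo", "Oslo", "Oslo"],
    "price": [10.0, 14.0, 30.0, 20.0, 40.0],
})

# Low-level version: manual bookkeeping, row by row.
totals, counts = {}, {}
for _, row in df.iterrows():
    totals[row["city"]] = totals.get(row["city"], 0.0) + row["price"]
    counts[row["city"]] = counts.get(row["city"], 0) + 1
manual_means = {city: totals[city] / counts[city] for city in totals}

# Built-in version: the same result in one readable line.
builtin_means = df.groupby("city")["price"].mean()
```

Both versions give the same numbers, but the second one states the intent directly and is the one an interviewer is likely hoping to see.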

Here are a few tricks to improve with Pandas:

USE IT!

Whenever you have to do any manipulation of a DataFrame or Series, stop for a couple of minutes and read the docs to check whether there are already built-in methods that can save you 90% of the work. Even if you don’t find them, in the process of reading through the documentation you will learn tons of stuff that will very likely come in handy in the future.

Read tutorials written by trustworthy people and see how they approach common operations. In particular, Part II of Tom Augspurger’s Modern Pandas tutorial is quite a good place to start. Even better, read not just Part II but the whole series. Also, this talk by Vincent D. Warmerdam is worth looking at.

If you have to perform some complicated, maybe not built-in, transformation of your data, consider wrapping it in a function! After you do that, .pipe(...) and .apply(...) are your friends.

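As a rough sketch of what wrapping a transformation in a function looks like, the snippet below chains a built-in filter, a custom step via `.pipe(...)`, and an element-wise `.apply(...)`. The data, the column names, and the helpers `add_price_per_unit` and `label_price` are all hypothetical.

```python
import pandas as pd

def add_price_per_unit(df: pd.DataFrame, price_col: str, qty_col: str) -> pd.DataFrame:
    """Return a copy of df with an extra price_per_unit column."""
    return df.assign(price_per_unit=df[price_col] / df[qty_col])

def label_price(value: float) -> str:
    """Toy labelling rule, chosen only for the example."""
    return "cheap" if value < 4 else "pricey"

# Hypothetical order data.
orders = pd.DataFrame({"price": [10.0, 9.0, 8.0], "quantity": [2, 3, 0]})

result = (
    orders
    .query("quantity > 0")                           # built-in filtering
    .pipe(add_price_per_unit, "price", "quantity")   # custom step, same chain
    .assign(label=lambda d: d["price_per_unit"].apply(label_price))
)
```

Because every step returns a new DataFrame, the whole transformation reads top to bottom and `orders` itself is left untouched.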

Final tip: do not use inplace=True anywhere. Contrary to popular belief, it doesn’t bring any performance bonus and it naturally makes you write unclear code, as it hinders your ability to chain methods. Hopefully this feature will be discontinued sometime in the future.

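The chaining problem is easy to see in a small sketch (toy data, invented column names): a method called with `inplace=True` returns `None`, so nothing can follow it in a chain.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
df = pd.DataFrame({"b": [3.0, None, 1.0], "a": [1, 2, 3]})

# With inplace=True the method mutates its object and returns None,
# so there is nothing to chain the next step onto.
scratch = df.copy()
returned = scratch.dropna(inplace=True)

# Without it, each call returns a new DataFrame and the steps read in order.
clean = (
    df
    .dropna()
    .sort_values("b")
    .reset_index(drop=True)
)
```

Here `returned` is `None`, while `clean` holds the finished result and `df` still contains the original three rows.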

Information leaking from the test set

The test set is sacred; while building models or selecting the best one you have so far, it should not even be looked at. Think about it: the reason we have a test set in the first place is that we want an unbiased estimate of the generalization error of a model. If we allow ourselves a sneak peek into “the future” (i.e., data that we fundamentally should not have access to during training and model building), it’s almost guaranteed that we will be influenced by it and bias our error estimates.

Although I’ve never seen anybody directly fit a model on the test set, candidates quite commonly performed hyperparameter tuning and model selection by looking at some metric on the test set. Please do not do that; set aside part of the training data as a validation set instead or, even better, perform cross-validation.

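One possible way to set this up with scikit-learn is sketched below. The synthetic dataset, the choice of logistic regression, and the grid of `C` values are assumptions made for the example, not part of the original case studies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real case study.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split once, then leave the test set alone until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are chosen by 5-fold cross-validation on the training data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# The test set is touched exactly once, to report the final estimate.
test_accuracy = search.score(X_test, y_test)
```

The point of the design is that `X_test` never influences which value of `C` is picked; it only appears in the last line.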

Another quite common thing which causes leakage of information from the test set is fitting scalers (like sklearn.preprocessing.StandardScaler) or oversampling routines (e.g., imblearn.over_sampling.SMOTE) on the whole dataset. Again, feature engineering, resampling, and so on are part of how a model is built and trained: keep the test set out of it.

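A common way to avoid this kind of leak is to put the preprocessing inside a scikit-learn pipeline, so that the scaler is fit only on the data the model itself is fit on. The sketch below uses synthetic data and logistic regression purely as stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real case study.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Leaky version (avoid): StandardScaler().fit(X) would use test rows
# to compute the means and variances.

# Safer version: the scaler lives inside the pipeline, so during
# cross-validation it is re-fit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit: the scaler statistics come from the training rows alone.
model.fit(X_train, y_train)
scaler = model.named_steps["standardscaler"]
```

For resamplers such as SMOTE, the imbalanced-learn package ships its own pipeline class that plays the same role, since resampling must happen only at fit time.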

Flaw of averages

Although summary statistics, like averages, quantiles, and so on, are useful to get a first impression of the data, don’t make the mistake of reducing distributions to a single number when this doesn’t make sense. A classic cautionary example to showcase this is Anscombe’s quartet, but my favorite is the Datasaurus Dozen.

More often than not, the distribution of your data points matters more than their average value, and especially in some applications the shape of the tails of your distributions is what at the end of the day governs decisions.

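A quick numerical sketch of the point about tails: the two synthetic samples below are constructed to have nearly the same mean, yet their 99th percentiles differ by a wide margin. The distributions and their parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic samples built to share (almost) the same mean of 1.0.
narrow = rng.normal(loc=1.0, scale=0.1, size=100_000)
heavy = rng.exponential(scale=1.0, size=100_000)

means = (narrow.mean(), heavy.mean())

# The 99th percentile tells a very different story about the tails.
p99 = (np.quantile(narrow, 0.99), np.quantile(heavy, 0.99))
```

Anyone deciding on, say, capacity or risk limits from the mean alone would treat these two cases as identical, which they clearly are not.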

If you show that you take these kinds of issues into consideration, and don’t even blink when somebody mentions Jensen’s inequality, only good things can happen.

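For readers who would blink: Jensen’s inequality says that for a convex function f, the average of f(X) is at least f of the average of X. The sketch below checks this numerically with a made-up quadratic cost applied to a hypothetical demand variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily demand, exponentially distributed with mean 10.
demand = rng.exponential(scale=10.0, size=100_000)

def cost(x):
    """A convex (quadratic) cost, chosen purely for illustration."""
    return 0.5 * x ** 2

cost_of_average = cost(demand.mean())   # f(E[X])
average_cost = cost(demand).mean()      # E[f(X)]
```

Plugging the average demand into the cost function noticeably underestimates the average cost, which is exactly the trap of reducing a distribution to a single number too early.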

Blind use of libraries

When you are given a case study, you often have an advantage you can capitalize on: you choose the model(s) to use. That means that you can anticipate some of the questions interviewers might ask you!

For example, if you end up using an XGBClassifier for your task, try to understand how it works, as deeply as you can. Everyone knows it’s based on decision trees, but which other “ingredients” do you need for it? Do you know how XGBoost handles missing values? Could you explain Bagging and Boosting in layman’s terms?

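One way to prepare for the bagging versus boosting question is to write both by hand once. The sketch below is a deliberately simplified illustration on synthetic data using scikit-learn decision trees; it is plain bagging and plain gradient boosting with squared loss, not XGBoost’s actual algorithm, and all function names and parameter values are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem: a noisy sine wave.
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

def bagging_predict(X, y, X_new, n_trees=50):
    """Bagging: fit independent trees on bootstrap resamples, then average."""
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # rows sampled with replacement
        tree = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)

def boosting_predict(X, y, X_new, n_trees=50, learning_rate=0.1):
    """Boosting: fit shallow trees in sequence, each on the current residuals."""
    pred_train = np.full(len(y), y.mean())
    pred_new = np.full(len(X_new), y.mean())
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred_train)
        pred_train += learning_rate * tree.predict(X)
        pred_new += learning_rate * tree.predict(X_new)
    return pred_new

# Compare both against the noiseless curve and a constant baseline.
X_grid = np.linspace(-3, 3, 200).reshape(-1, 1)
truth = np.sin(X_grid[:, 0])
mse_bagging = np.mean((bagging_predict(X, y, X_grid) - truth) ** 2)
mse_boosting = np.mean((boosting_predict(X, y, X_grid) - truth) ** 2)
mse_constant = np.mean((y.mean() - truth) ** 2)
```

In layman’s terms: bagging asks many independent trees and averages their answers, while boosting builds trees one after another, each one correcting the mistakes left by the previous ones.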

Even if you end up using linear regression, you should have a clear idea about what is happening under the hood, and the meaning behind the parameters you set. If you say “I set the learning rate to X”, and somebody follows with “What’s a learning rate?”, it’s quite bad if you cannot at least spend a few words on it.

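To be able to spend those few words, it helps to have seen the learning rate do its job at least once. Below is a minimal sketch of a least-squares line fit by plain gradient descent; the synthetic data, the helper name `fit_line`, and the chosen learning rate are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: true slope 3, true intercept 1, a little noise.
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

def fit_line(x, y, learning_rate, n_steps=500):
    """Least-squares line fit by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0
    for _ in range(n_steps):
        residual = slope * x + intercept - y
        # Gradients of mean(residual**2) w.r.t. slope and intercept.
        grad_slope = 2.0 * np.mean(residual * x)
        grad_intercept = 2.0 * np.mean(residual)
        # The learning rate scales how far each step moves against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

slope, intercept = fit_line(x, y, learning_rate=0.1)
```

In short, the learning rate is the step size of the optimizer: too small and training crawls, too large and the updates can overshoot the minimum.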

Poor visualization choices

Choosing the correct options for your plots goes a long way too. Ultimately, I think the most common mistakes here are due to poor choice of normalization or not using the correct scales for the axes.

Let’s look at an example: a snippet of code that just creates two arrays, a and b, with samples from an exponential distribution, and then plots them as histograms with a plain call to plt.hist([a, b]).

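A minimal sketch of such a snippet is shown here. The sample sizes and the random seed are assumptions, chosen so that one group is ten times larger than the other.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Same exponential distribution, very different sample sizes (sizes are assumed).
a = rng.exponential(size=10_000)
b = rng.exponential(size=1_000)

# Raw counts per bin: the larger group dominates the picture.
counts, bins, _ = plt.hist([a, b])
```

With raw counts, the bars for `b` are dwarfed by those for `a` simply because `a` has more samples.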

I saw some variation of this an enormous number of times; basically, what we would really like to do is compare the distribution of something between two groups, but in this plot we are only showing raw counts of observed values. If one of the groups has more samples than the other, a plot like this is useless for getting an idea of the underlying distributions. A better choice would be to normalize what we are displaying in a sensible way: in this case, just setting the parameter density=True rescales the raw counts of each group into a probability density, so that each histogram integrates to 1 regardless of its sample size.

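A sketch of the normalized version, again with assumed sample sizes and seed, could look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Two samples from the same exponential distribution (sizes are assumed).
a = rng.exponential(size=10_000)
b = rng.exponential(size=1_000)

# density=True rescales each group separately so that its bars integrate to 1.
heights, bins, _ = plt.hist([a, b], density=True)

# Sanity check: area under each histogram (height times bin width).
areas = (np.asarray(heights) * np.diff(bins)).sum(axis=1)
```

Both entries of `areas` come out as 1, which is what makes the two groups directly comparable.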

Nice! Now we can explicitly see that, after all, a and b are samples from the same distribution. There is still something that I dislike here: a lot of white space, and the fact that for values of a or b larger than 4, I cannot really see any bar clearly. Luckily, logarithms have been a common mathematical operation since 1614… So common that plt.hist(...) even has a dedicated keyword argument, log=True, that just switches our linear y-axis to a logarithmic one.

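Putting the two options together, a sketch of the final call (with assumed sample sizes and seed) might be:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Two samples from the same exponential distribution (sizes are assumed).
a = rng.exponential(size=10_000)
b = rng.exponential(size=1_000)

# Normalize each group and draw the y-axis on a logarithmic scale,
# so the sparse bars in the tail become visible.
heights, bins, _ = plt.hist([a, b], density=True, log=True)

yscale = plt.gca().get_yscale()
```

On a log scale an exponential density falls off roughly as a straight line, so the tails of the two groups can be compared by eye.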

Notice that this is by no means a “perfect” plot: our axes are unlabeled, there is no legend, and it just looks kinda ugly! But hey, at least we can extract insights that we would have never been able to see with just a call to plt.hist([a,b]).

Conclusion

What all the above-listed mistakes have in common is that they are easily avoidable with some thought and knowledge of the subject, so my advice for your next data science case study is: relax, focus, try to be one step ahead of whatever mind game they’re playing with you, and Google for stuff (a lot!). Interviewing can be stressful, but if both parties are fair (especially the people interviewing and coming up with assignments) it’s almost never time wasted.

Any feedback on this article would be much appreciated; did I miss anything that you think is particularly important?

To conclude, I wish you all the best in your career, whatever job you happen to be doing now! Maybe see you at an interview :-)

Translated from: https://towardsdatascience.com/acing-a-data-science-job-interview-b37e8b68869b
