Data Science Interview Refresher

Refresh before the big interview

Introduction

I crafted this study guide from multiple sources to make it as comprehensive as possible. This guide helped me prepare for both the technical and behavioral aspects of the hiring firm’s interview. Please note, this guide is not exhaustive or a beginner’s tutorial, but is meant to serve as a refresher for common data science terminology and algorithms, and to help you prepare some answers to behavioral questions. For more detailed explanations with tons of references and resources, I would highly recommend this guide.

The sources I used are other Medium articles (hyperlinked throughout), online academic journals, and personal notes from my undergraduate Applied Machine Learning course at William & Mary.

Guide Overview

  1. General Questions & Answers
  2. Common Terminology
  3. Machine Learning Algorithms & Concepts
  4. Additional Personal Preparation
  5. Questions to Ask After the Interview

Section 1: General Questions & Answers

For the first section of my guide, I prepared some common questions the interviewer might ask related to data science as a profession as well as common behavioral questions.

What is data science? Possible Answer: Data science is turning data into actionable insights. These insights help a company make more money, optimize workflows to become more efficient, and make calculated data-driven business decisions.

What makes a good data scientist? Possible Answer: Being able to communicate technical problems and solutions to a non-technical person is the key to being a good data scientist. The act of turning disparate data points into actionable insights can be complicated; communicating these results is paramount for a meaningful impact on the business.

Other common behavioral questions to think through:

  • Tell me about yourself.
  • What is a strength and a weakness of yours?
  • Where do you see yourself in 5–10 years?
  • Tell me about a time you had to explain a technical problem to a non-technical person.
  • Why Company [x]?

Section 2: Common Terminology

In this next section, I cover some common data science terms an interviewer may ask you to define to check your foundational knowledge.

  • Overfit vs. Underfit: With an overfit model, you get very accurate predictions on the training data but less precise predictions on the test and real-world data. Overfitting occurs when the model is overly complex and captures the noise of the data. An underfit model is overly simple; it does not find the data’s underlying patterns, so inaccurate predictions appear in both the training and test results. Underfitting can be caused by insufficient data (not covering all relevant combinations) or improper randomization.

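    As a quick illustration (my own sketch, not from the original article; assumes scikit-learn is installed), the train/test gap shows up clearly if you grow a decision tree deeper and deeper:

    # A depth-1 tree underfits (both scores low); an unlimited-depth tree
    # overfits (train accuracy near 1.0, test accuracy noticeably lower).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    for depth in (1, 3, None):  # None = grow until leaves are pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))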

  • Normalization vs. Standardization vs. Regularization: Normalization re-scales the range of your parameter’s values to a set range (e.g., [0,1] or [-1,1]). Standardization re-scales data to a mean of 0 and a standard deviation of 1 (unit variance); on this scale, outliers typically lie more than ±3 standard deviations from the mean. Regularization is the idea of building a penalty into your model to prevent overfitting: because we do not want an overfit model, we add a term that penalizes model complexity, i.e., the L1-norm (Lasso) and the L2-norm (Ridge).

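    All three are one-liners in practice; a minimal sketch of my own (assuming scikit-learn and NumPy are available):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0], [5.0], [10.0]])
    print(MinMaxScaler().fit_transform(X))    # normalization: re-scaled to [0, 1]
    print(StandardScaler().fit_transform(X))  # standardization: mean 0, std 1

    y = np.array([1.0, 2.0, 3.0])
    Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks coefficients
    Lasso(alpha=0.1).fit(X, y)  # L1 penalty can zero coefficients out entirely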

  • Bias vs. Variance: Bias can lead to oversimplification/underfitting; it is the gap between the predicted value and the actual value. Variance is how scattered your predictions are in relation to each other. High variance can lead to overfitting because the model follows outliers, or noise, within the data. Overall, as model complexity goes up, variance goes up (overfitting increases); as model complexity goes down, bias goes up (underfitting increases). The objective is to find the balance in this trade-off.

  • Supervised vs. Unsupervised Learning: Supervised learning trains a model using labeled data; we know the dependent variable and the values we are trying to predict, e.g., Random Forest, SVM, and Linear Regression. Unsupervised learning does not use labeled data to train a model; you do not know the dependent variable, e.g., K-means Clustering.

  • Classification vs. Regression: Classification predicts a discrete class label output, e.g., High/Medium/Low → multi-class classification, or Positive/Negative → binary classification. Regression predicts a continuous quantity output, e.g., temperature.

  • Entropy: A measure of randomness/variance. The higher the value, the harder it is to draw conclusions; an entropy of 0 means perfect classification. A greedy algorithm seeks to homogenize the data quickly by reducing entropy.

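    Shannon entropy is easy to compute directly; a minimal sketch of my own (assuming NumPy), using H = -sum(p_i * log2(p_i)) over the class proportions p_i:

    import numpy as np

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    print(entropy([1, 1, 1, 1]))  # 0.0 -> perfectly homogeneous group
    print(entropy([1, 1, 0, 0]))  # 1.0 -> maximum uncertainty for two classes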

  • Git (version control): Commit — add changes to the local repository; Push — transfer commits to a remote repository; Pull Request — request that the content of a specified branch be merged into the master/main branch.

  • Type I vs. Type II Error: Type I → False Positive (the value was classified as positive but is actually negative). Type II → False Negative (the value was classified as negative but is actually positive).

  • Receiver Operating Characteristic (ROC): A performance measure for classification problems. The ROC curve is a graphical representation of the contrast between the true positive and false positive rates at various thresholds. An Area Under the Curve (AUC) value of 0.5 means the model is no better than random; a value less than 0.5 means the model is predicting the opposite/reciprocal.

FP = False Positive, FN = False Negative, TP = True Positive, TN = True Negative, P = All Positives, N = All Negatives (a quick sketch using these definitions follows the list below)

  • Error Rate = (FP+FN)/(P+N)
  • Accuracy = (TP+TN)/(P+N)
  • Precision = TP/(TP+FP)
  • Sensitivity/Recall (True Positive Rate) = TP/P
  • Specificity (True Negative Rate) = TN/N
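
As promised above, a minimal sketch of my own (assuming scikit-learn) that computes each metric from a confusion matrix, plus the ROC AUC:

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]   # thresholded predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("error rate :", (fp + fn) / (tp + tn + fp + fn))
print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))
print("precision  :", tp / (tp + fp))
print("recall     :", tp / (tp + fn))   # TP / P
print("specificity:", tn / (tn + fp))   # TN / N
print("ROC AUC    :", roc_auc_score(y_true, y_score))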

Section 3: Machine Learning Algorithms

This section discusses common ML algorithms at a high level. I have included the points I would expect an interviewer would want you to know.

Decision Trees

  • Primarily used for solving classification problems, but they can also be used for regression problems.
  • The trees start with a root node, which is followed by splits that produce branches. The branches then link to leaves, which form decision points.
  • A final categorization is made when a leaf no longer generates new branches and results in a terminal node.
  • The aim is to keep the decision tree as small as possible. Select a starting variable that optimally splits the data into homogeneous groups, minimizing the entropy in each branch (see the sketch below).
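
A minimal sketch of my own (assuming scikit-learn): criterion="entropy" makes each split minimize entropy, and export_text prints the root, branches, and terminal leaves.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))  # root node -> branches -> leaf (terminal) nodes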

Boosting

  • Combines weak models into a single strong model. Most commonly, individual decision trees are aggregated together.

  • Constructs multiple decision trees using a randomized selection of inputs, sampled with replacement, for each tree. Note: the algorithm does not always have to do this; e.g., standard gradient boosting uses all of the data to build the gradient at each step of the boosting process.

  • The initial decision tree (or other base algorithm) is a poor predictor of the data, but it is iterated upon using weights.
  • Higher weights are assigned to cases the previous tree misclassified; correct classifications get less weight. This type of boosting is known as AdaBoost, as it iterates over refined sample weights.

  • Boosting tends to cause overfitting because it repeatedly re-fits to previous mistakes. For more complex data with outliers, random forest is better (see below).

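A minimal AdaBoost sketch of my own (assuming scikit-learn); the default base learner is a depth-1 decision "stump", re-weighted at each iteration:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boost, X, y).mean())  # cross-validated accuracy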

Gradient Boosting

  • Rather than selecting random combinations of variables for each tree (or other base algorithm), gradient boosting selects variables that improve accuracy with each new tree.

  • The algorithm minimizes a defined loss function by calculating the gradient at each iteration.

  • The trees are grown sequentially (non-parallel) as each tree is derived from the previous tree.

  • This process is repeated until the error is low. The final result is the weighted average of the total predictions.
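
A minimal sketch of my own (assuming scikit-learn); the trees are grown sequentially, each one fit to the gradient of the loss left by the previous stage:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                random_state=0).fit(X_train, y_train)
print(gbm.score(X_test, y_test))  # R^2 on held-out data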

Bagging

  • Short for Bootstrap Aggregating

  • Constructs multiple decision trees (or other base algorithms) using a randomized selection of inputs, sampled with replacement, for each tree. The results are combined by averaging (regression) or voting (classification).

  • The bootstrap component comes from the variation and randomness inherent in sampling the different tree inputs.
  • This method is effective for dealing with outliers and lowering the variance found with a single decision tree.

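A minimal sketch of my own (assuming scikit-learn): each tree is fit on a bootstrap sample drawn with replacement, and the trees vote on the final class:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))  # majority vote across the 50 bootstrapped trees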

Random Forest

  • Related to bagging, since both algorithms utilize bootstrap sampling and randomization.
  • Artificially limits the choice of variables by capping the number of variables considered: the trees cannot consider all n variables at each partition.

  • This means the final tree is variable because of the randomness.
  • Random Forest is a supervised learning technique built from many weak learners (the individual trees), all trained together in parallel (non-sequentially).

  • 100–150 decision trees is a recommended starting point.
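
A minimal sketch of my own (assuming scikit-learn), with ~150 trees; max_features caps how many variables each split may consider, which decorrelates the trees:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=150, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))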

Neural Networks

  • Use when a task is easy for a human but difficult for a computer. Also useful when you have large datasets with complex patterns.
  • The first layer is the input data (values, text, pixels, etc.) and is divided into nodes. Each node sends information to the next layer via the network’s edges.
  • Each edge in the network has a numeric weight that can be changed based on experience. If the weighted sum at a node satisfies a threshold, set by the activation function, a neuron at the next layer is activated.

  • If it does not meet the threshold, the function does not fire, resulting in an all-or-nothing arrangement.
  • The weights at the edges are unique, which prevents identical results.
  • Neural networks can be supervised. Here, the algorithm compares its results to the actual output and calculates the cost, then reduces the cost until the predictions match the actual output. This is achieved by incrementally tweaking the network’s weights until the lowest cost is obtained.

  • Backpropagation: errors are rolled in reverse, from the output layer on the right back to the input layer on the left, to update the weights.

  • Convolutional (CNN): made for images. Takes a fixed-size input and returns a fixed-size output. Ideal for image and video processing.

  • Recurrent (RNN): made for sequences. Handles arbitrary-length input/output data and can use internal memory to process arbitrary sequences of inputs. RNNs use time-series information (i.e., what I spoke last will impact what I speak next). Ideal for text and speech analysis.

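A minimal sketch of my own (assuming scikit-learn): a small feed-forward network with one hidden layer, trained by backpropagation to drive down the cost (log-loss):

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                    max_iter=500, random_state=0).fit(X, y)
print(net.loss_)  # final training cost after the weight updates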

K-Nearest Neighbors (KNN)

  • Supervised algorithm for classifying new data points.

  • Similar to a voting system or popularity contest.
  • Choosing the number of neighbors (k) is crucial. Too low a k makes the model sensitive to noise (high variance), while too high a k over-smooths and makes the algorithm computationally expensive. Set k to an odd number so you do not have ties in the voting process.

  • Not good for large datasets.
  • E.g., suppose we are classifying new points as Red or Blue: if k=3 and 2 of the 3 points around a new point are Blue, it becomes Blue. Keep doing this, adding new points, until they have all been classified (grouped).

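A minimal sketch of my own (assuming scikit-learn), with an odd k to avoid ties:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(X[:2]))  # majority vote of each point's 3 nearest neighbors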

K-means Clustering

  • Unsupervised algorithm for identifying groups of new data points.

  • Select k initial centroids. Points are then assigned to the nearest centroid using Euclidean distance, and each centroid is updated to reflect its cluster’s mean. Once no data points switch between clusters, the algorithm is complete.

  • Use a scree plot (elbow method) to determine k. It compares the Sum of Squared Errors (SSE) for each total cluster count. SSE drops as more clusters are found. You can also set k = sqrt(n/2).

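A minimal elbow-method sketch of my own (assuming scikit-learn); inertia_ is the SSE, which drops as more clusters are found:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # look for the "elbow" where the SSE curve flattens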

Support Vector Machine (SVM)

  • Supervised algorithm used for both regression and classification. If you have n features in your training data set, SVM tries to plot it in n-dimensional space, with the value of each feature being the value of a particular coordinate.

  • SVM uses hyperplanes to separate out different classes based on the provided kernel function. Hyperplanes are decision boundaries that help classify the data points.

  • The dimension of the hyperplane depends upon the number of features. If the number of input features is 2, the hyperplane is just a line; if it is 3, the hyperplane becomes a two-dimensional plane. It becomes difficult to visualize when the number of features exceeds 3.
  • Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier.

  • In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane.

  • Different kernels: linear (base), sigmoid, radial, polynomial, and more. Non-linear kernels abstract the data into higher dimensions where it can be separated with a plane or something else, rather than a line.
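
A minimal sketch of my own (assuming scikit-learn) on concentric-circle data, where a linear kernel fails but a radial (RBF) kernel separates the classes:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    print(kernel, SVC(kernel=kernel).fit(X, y).score(X, y))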

Principal Components Analysis (PCA)

  • Use when you want to reduce the number of variables and ensure the resulting variables are independent of one another.
  • Technique for feature extraction — it combines the input variables in a specific way, then drops the “least important” variables, while still retaining the most valuable parts of all of the variables.

  • Each of the “new” variables after PCA is independent of the others; this occurs because PCA builds them as orthogonal linear combinations of the original variables.
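
A minimal sketch of my own (assuming scikit-learn), reducing 4 features to 2 uncorrelated components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # share of variance each component keeps
X_reduced = pca.transform(X)          # 4 original features -> 2 components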

Section 4: Additional Personal Preparation

In addition to knowing the terminology and algorithms above, I would also recommend researching and preparing for the following:

  • #1 Most Important — Past Projects: This is definitely the most important personal preparation you will need to complete. Whether you are securing a college internship or applying for your 5th new job, you will need to be able to clearly articulate past projects you worked on. These days, experience is held in higher regard than degrees, so adeptly communicating the work you have done will not only showcase your communication skills, but will also truly demonstrate your understanding of your work and your capabilities. If you do not have projects you can reference or speak to well, consider starting a new one in your free time with a compelling data set and meaningful analysis.

  • Know Your Resume: Make sure to know your resume inside and out. I had a copy in front of me during the video interview to reference, just in case. It is important to know your skills and accomplishments, to be able to answer any questions on them, and to provide examples showcasing them in action. Be sure not to have exaggerated anything on your resume; the risk is much higher than the reward.

  • Company Information: Know the size of the company, where they are headquartered, their other office locations, their core values and mission, the type of work they do, and who their customers/clients are. You can find a lot of this info on the company’s website. I would encourage you to explore their site for anything else that may be useful to reference in the interview — this will show a lot of initiative and preparation on your end. Demonstrating to the interviewer that you are a good fit for the company is paramount, so be sure you know their mission and understand the company culture.

  • Case Studies (if applicable): The particular company I applied for had data science case studies on their website that were curated by their employees. This was useful for a few reasons. I was able to 1) understand more of the work the company does, 2) briefly reference these case studies in the interview to show initiative, 3) match the keywords and language used in these case studies with my interview dialogue.

Section 5: Questions to Ask

These are some questions I think show interest and are important to ask at the end of the interview. Always ask questions.

  • What is the day-to-day work like for this role?
  • How many people are on the team, and what are their roles? E.g., data scientists, full-stack developers, data engineers, business analysts, etc.
  • How do you like working at Company [x], and what is it that you like?
  • [Consulting-specific] Where is the client located, and how often would I need to be at their office? How often will I have direct communication with them?
  • What are the next steps in the interview process?
  • When can I expect to hear from you?

Interviewing for a new data science role can be complicated. You not only have to know the technical aspect of the job, but you need to be prepared to demonstrate your fit and knowledge of the company, answer behavioral questions, show competence and initiative, and more — all within an hour or so. Don’t forget to follow up with a thank you note to your interviewers and to negotiate your salary/benefits if an offer is extended! I hope this refresher serves you well as you prepare for your next interview.

Luke Schwenke is a Data Scientist at Booz Allen Hamilton in McLean, VA. Booz is a top public sector consulting firm in the United States, providing management and information technology consulting services to the government. Prior to joining the firm, Luke worked as a Data Scientist with Markel Corporation, a Fortune 500 specialty insurance company.

Luke is a graduate of William & Mary with a B.S. in Data Science (Econometrics concentration), and an Arabic Language & Literature minor. You can connect with him on LinkedIn.

Translated from: https://towardsdatascience.com/data-science-interview-refresher-34b254c497a
