Problem Solving as a Data Scientist: A Case Study

There are two myths about how data scientists solve problems. The first is that the problem already exists in a well-defined form, so the data scientist's only challenge is to pick an algorithm and put it into production. The second is that data scientists always reach for the most advanced algorithms, as if a fancier model meant a better solution. Neither is entirely groundless, but they reflect two common misunderstandings of how data scientists work: one overemphasizes the “execution” side, and the other overstates the “algorithm” part.

Obviously, neither myth describes how we actually solve problems. From my perspective, problem-solving for a data scientist is:

  • more about “how to abstract the problem out of the business context”, not just “being handed a specific task”
  • more about “solving the problem with an algorithm”, not just “using the best algorithm to solve a problem”
  • more about “iteratively delivering business value”, not just “implementing the code and calling it a day”.

With that said, I observe that the problem-solving process usually involves four stages. I would like to share what those four stages are, how they work in action through a case study, and how we can get there with the right mindsets.

The story starts with, once upon a time …

My first job was at a company that operates an automotive pricing and information website, and it went through its initial public offering (IPO) in May 2014. It was a great experience, and I vividly remember everyone around me cheering that day for the birth of a public company. As a public company, our revenue started to receive a lot of attention, especially with the first quarterly earnings report coming out in August. In early July, the director of the revenue department came to the data scientists' seating area, and he did not look like he had good news to share.

“We are in trouble: a percentage of the sales revenue cannot be credited appropriately. We need your help.”

Here is some relevant context. The company generates revenue by bringing more sales to car dealers. To earn the commission we deserve, we need to match each vehicle sale to the correct customer. If our data providers tell us which customer bought which vehicle, the matching is done and no extra effort is needed. The problem was that one data provider decided to stop providing 1-to-1 sale records: sales had to be reported in batches (a “batch” is visualized below), which makes it much harder, and much less certain, to know which customer bought which car.

The revenue team was caught off guard by this change. After spending the previous month trying to solve the problem, they could recover only 2% of the sales from that data provider manually. This would have been bad news for the first earnings call, so they came to the data scientists for help. It was clearly an urgent problem, so we jumped right on it.

Stage 1. Understand the problem, and then redefine it using mathematical terms

This is the first stage of problem-solving in data science. For the “understand the problem” part, one needs to identify the pain points clearly enough that, once they are resolved, the problem goes away. The “redefine the problem” part is usually the reason a problem needs a data scientist's help.

For the specific problem raised by our revenue team: we cannot assign each sold vehicle to a customer, so we lose the revenue.

The pain point is: finding who purchased a vehicle in a given batch is manual and inaccurate. With thousands of batches needing their sales matched, it is very time-consuming and not sustainable.

The “redefined” problem in mathematical terms is: given a batch with customers C1, C2, …, Cn, along with the sold vehicle information V1, V2, …, Vm, we need an automated solution that accurately identifies the matching pairs (Ci, Vj) reflecting the actual purchasing events.

Stage 2. Decompose the problem, identify a logical algorithmic solution, and then build it out

With the redefined problem, we can see this is a “matching” exercise under constraints, with the customers and vehicles in a batch given. So I decomposed the problem further into two steps:

  • Step 1. calculate the likelihood that a customer made the purchase, given the vehicle: P(C|V)
  • Step 2. based on that likelihood, attribute each car to the most likely customer in the batch

Now we can identify the solution for each step.

Step 1. Probability calculation

For simplicity, let’s assume there are three customers (c1, c2, c3) in this batch, and one vehicle (v1) is reported as a sale.

  • P(C=c1) represents the likelihood that c1 buys any car. Assuming no prior knowledge about each customer, their likelihood of buying any car should be the same: P(C=c1) = P(C=c2) = P(C=c3), which equals a constant (e.g. 1/3 in this situation)
  • P(V=v1) is the likelihood that v1 is sold. Given that it shows up in this batch, this should be 1 (100% likelihood of being sold)

Since exactly one customer makes the purchase, this probability can be expanded into:

P(V=v1) = P(C=c1, V=v1) + P(C=c2, V=v1) + P(C=c3, V=v1) = 1.0

For each term, the following identity holds:

P(C=c1, V=v1) = P(C=c1|V=v1) * P(V=v1) = P(V=v1|C=c1) * P(C=c1)

Because P(V=v1) is fixed and the prior P(C) is the same constant for every customer, P(C=c1|V=v1) is proportional to P(V=v1|C=c1). Normalizing over the three customers gives the formula for the probability calculation:

P(C=c1|V=v1) = P(V=v1|C=c1) / (P(V=v1|C=c1) + P(V=v1|C=c2) + P(V=v1|C=c3))

The key, then, is to estimate P(V|C) for each customer. In words, the formula says: the likelihood that a vehicle was purchased by a specific customer is proportional to the likelihood that the customer would buy this specific vehicle.

The formula above may look too “mathematical”, so let me put it into an intuitive context. Suppose three people are in a room: a musician, an athlete, and a data scientist. You are told that a violin in the room belongs to one of them. Who do you think owns the violin? This is pretty straightforward, right? Since a musician is highly likely to own a violin, and an athlete or a data scientist is less likely to, the violin much more likely belongs to the musician. The “mathematical” thinking process is illustrated below.
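
To make the violin example concrete, here is a minimal sketch of the normalization in the formula above. The ownership likelihoods are made-up numbers for illustration only, the function name `posterior_over_owners` is hypothetical, and equal priors are assumed, as in the text.

```python
def posterior_over_owners(likelihoods):
    """Normalize P(item | person) into P(person | item), assuming equal priors."""
    total = sum(likelihoods.values())
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    return {person: p / total for person, p in likelihoods.items()}


# Made-up likelihoods of owning a violin, for illustration only.
violin_likelihoods = {"musician": 0.30, "athlete": 0.02, "data scientist": 0.02}

posterior = posterior_over_owners(violin_likelihoods)
most_likely_owner = max(posterior, key=posterior.get)
print(most_likely_owner, posterior)
```

With these assumed numbers the musician takes most of the probability mass, which matches the intuition. The same function applies unchanged to customers and a sold vehicle once P(V|C) is available.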

Now, let’s put the probabilities into a business context. On an online automotive pricing platform, each customer needs to generate at least one vehicle quote, so we assume a customer can be reasonably represented by the vehicles he or she quoted. P(V|C) can then be learned from the historical data the company has already accumulated, including who generated a vehicle quote and when, and which vehicle they eventually bought. I will not elaborate on the details, but the key point is that we can learn P(V|C), and then calculate the needed probability P(C|V) in each batch.
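
The article does not describe how P(V|C) was actually learned, so the following is only one possible sketch, not the model used in the project. It assumes each customer is summarized by a single quoted vehicle type and estimates P(bought type | quoted type) from historical counts with additive smoothing. The function name, the `alpha` parameter, and the toy history are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_purchase_model(history, vehicle_types, alpha=1.0):
    """Estimate P(bought type | quoted type) from (quoted, bought) pairs.

    Uses additive (Laplace) smoothing with strength `alpha` so unseen
    combinations still get a small non-zero probability.
    """
    counts = defaultdict(Counter)
    for quoted, bought in history:
        counts[quoted][bought] += 1

    model = {}
    for quoted in vehicle_types:
        total = sum(counts[quoted].values()) + alpha * len(vehicle_types)
        model[quoted] = {
            bought: (counts[quoted][bought] + alpha) / total
            for bought in vehicle_types
        }
    return model


# Toy history, invented for illustration: (quoted type, purchased type).
history = [("sedan", "sedan"), ("sedan", "sedan"), ("sedan", "suv"),
           ("suv", "suv"), ("suv", "suv"), ("truck", "truck")]
types = ["sedan", "suv", "truck"]
model = fit_purchase_model(history, types)
```

A production model would likely use many more vehicle features and all of a customer's quotes. The point here is only that the probability can be estimated from quote and purchase history.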

Step 2. Vehicle attribution

Once we have the probability of each vehicle having been sold to each customer, the second step is the attribution process. If there is only one sold vehicle in the batch, the process is trivial; if there are multiple sold vehicles in the batch, either of the following approaches would work:

  • (direct attribution) use only the calculated probability P(C|V), and always attribute a vehicle to the customer with the highest likelihood. Under this approach, it is possible to attribute two vehicles to the same customer.
  • (round-robin way) assume each customer buys at most one vehicle: once a vehicle is attributed to a customer, both are removed before the next round of vehicle attribution.
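
The two attribution approaches can be sketched as follows. The input is assumed to be a mapping from each sold vehicle to its posterior P(C|V) over the customers in the batch. The text does not say in which order the round-robin variant processes vehicles, so picking the highest remaining probability first is an assumption of this sketch, and both function names are hypothetical.

```python
def direct_attribution(posteriors):
    """Give every vehicle to its most likely customer (customers may repeat)."""
    return {vehicle: max(probs, key=probs.get)
            for vehicle, probs in posteriors.items()}


def round_robin_attribution(posteriors):
    """Greedy one-vehicle-per-customer attribution.

    Repeatedly picks the (vehicle, customer) pair with the highest remaining
    probability, then removes both before the next round.
    """
    remaining = {v: dict(probs) for v, probs in posteriors.items()}
    assignment = {}
    while remaining and any(remaining.values()):
        vehicle, customer = max(
            ((v, c) for v, probs in remaining.items() for c in probs),
            key=lambda pair: remaining[pair[0]][pair[1]],
        )
        assignment[vehicle] = customer
        del remaining[vehicle]
        for probs in remaining.values():
            probs.pop(customer, None)
    return assignment


# Made-up posteriors for two sold vehicles in a batch of three customers.
posteriors = {
    "v1": {"c1": 0.6, "c2": 0.3, "c3": 0.1},
    "v2": {"c1": 0.5, "c2": 0.4, "c3": 0.1},
}
direct = direct_attribution(posteriors)
one_each = round_robin_attribution(posteriors)
```

In this example direct attribution gives both vehicles to c1, while the round-robin variant gives v1 to c1 and then v2 to c2.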

Now we have designed a two-step algorithm to solve the key challenge, and it’s time to test the performance! Given the historical quote and sales data, it is straightforward to simulate the process of “creating random batches”, “attaching sales to the batch”, and then trying to “recover sales from the given batch information”. Such a simulation provides a way to evaluate the model’s performance, and we estimated that more than 50% of sales could be recovered with high precision (>95%). We deployed the model on the real dataset, and the results matched our expectations well.
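
A backtest of this kind could look roughly like the sketch below. It is not the project's simulation: the toy data, the likelihood function, the confidence threshold of 0.8, and the simplification that every customer's own purchase is evaluated one at a time are all assumptions made for illustration.

```python
import random


def simulate_recovery(customers, true_vehicle, likelihood, batch_size=3,
                      threshold=0.8, seed=0):
    """Batch customers at random, hide who bought what, then try to
    recover each sale from the batch alone.

    Returns (recovery_rate, precision): the share of sales attributed at
    all, and the share of attributed sales that were correct.
    """
    rng = random.Random(seed)
    shuffled = customers[:]
    rng.shuffle(shuffled)
    batches = [shuffled[i:i + batch_size]
               for i in range(0, len(shuffled), batch_size)]

    attributed = correct = total = 0
    for batch in batches:
        for buyer in batch:  # evaluate each customer's sale in turn
            vehicle = true_vehicle[buyer]
            total += 1
            scores = {c: likelihood(vehicle, c) for c in batch}
            norm = sum(scores.values())
            if norm == 0:
                continue
            guess = max(scores, key=scores.get)
            if scores[guess] / norm >= threshold:  # attribute only when confident
                attributed += 1
                correct += guess == buyer
    recovery_rate = attributed / total if total else 0.0
    precision = correct / attributed if attributed else 0.0
    return recovery_rate, precision


# Toy data, invented for illustration: each customer quoted (and bought) one type.
quoted_type = {f"c{i}": ("sedan", "suv", "truck")[i % 3] for i in range(12)}


def toy_likelihood(vehicle, customer):
    """Hypothetical P(V|C): high when the vehicle matches the customer's quote."""
    return 0.9 if quoted_type[customer] == vehicle else 0.05


recovery_rate, precision = simulate_recovery(
    list(quoted_type), quoted_type, toy_likelihood)
```

Raising the threshold trades recovery rate for precision, which is the same trade-off behind the “more than 50% recovered at >95% precision” figure reported above.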

The revenue team was very happy with this solution: compared to the ~2% recovery rate, more than 50% is over 25X! From a business impact perspective, this revenue went directly to the bottom line of our first quarterly earnings report, and the value contributed by the Data Science team was significant.

Stage 3. Think deeper, and seek opportunities to make further improvements

We ran the above solution for an extra month and saw that the performance was pretty consistent, so it was time to think about what comes next. We recovered 50% of sales, but what about the remaining 50%? Would it be possible to improve the algorithm further to get there?

We data scientists tend to focus too much on the algorithm details. In this case, there were some discussions about how to better model P(V|C): should we use a deep learning model to improve this probability, and so on. In my experience, however, such purely algorithmic improvements usually yield only incremental gains, and they were unlikely to close the remaining 50% gap.

So I started a deeper conversation with the revenue team to figure out what was missing from our understanding of the problem. It turned out that we could control how the customers are grouped into batches! Although there are some restrictions (e.g. customers in a batch have to generate quotes from the same dealership), this gave us the freedom to optimize further, and I saw it as the direction for closing the gap on the remaining 50% of sales.

Why was I confident in this direction? Think about this situation: you have 4 people to be batched, and each batch holds 2 people. The best batching strategy is to put the most different people in the same batch, so that once an item is returned, the attribution is more accurate. The following visualization shows the concept. On the left side, if you put two musicians in one batch and two athletes in the other, it’s very hard to know who owns the violin or the basketball. On the right side, if each batch has one musician and one athlete, it is much easier to tell, with high confidence, that Musician A owns the violin and Athlete D owns the basketball.
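
The effect of batching can be checked numerically with a small sketch. The ownership likelihoods are made-up numbers, the helper `top_posterior` is hypothetical, and equal priors within a batch are assumed.

```python
def top_posterior(item, batch, likelihood):
    """Highest P(person | item) in the batch, assuming equal priors."""
    scores = [likelihood[person_type][item] for _, person_type in batch]
    return max(scores) / sum(scores)


# Made-up likelihoods of owning each item, for illustration only.
likelihood = {
    "musician": {"violin": 0.30, "basketball": 0.02},
    "athlete": {"violin": 0.02, "basketball": 0.30},
}

same_type_batch = [("A", "musician"), ("B", "musician")]
mixed_batch = [("A", "musician"), ("D", "athlete")]

confidence_same = top_posterior("violin", same_type_batch, likelihood)
confidence_mixed = top_posterior("violin", mixed_batch, likelihood)
print(confidence_same, confidence_mixed)
```

With two musicians in the batch the best guess for the violin's owner is a coin flip, while in the mixed batch the musician is the clear answer under these assumed numbers.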

To materialize the above concept, two steps are required:

  • (similarity definition) how should we define customer-to-customer similarity, and then a batch’s entropy as the objective function to optimize?
  • (batch optimization) based on those similarities, how should we design an optimization strategy to achieve optimal batches?

Step 1. Similarity definition

In the first-stage solution, we already found a way to calculate P(V|C). Here, I make a direct generalization: the similarity between two customers is proportional to the average likelihood that each customer would purchase the other’s quoted vehicles. If each customer quoted only one vehicle (c1 quoted v1, and c2 quoted v2), then a simplified version looks as follows:

Similarity(c1, c2) = 0.5 * (P(V=v1|C=c2) + P(V=v2|C=c1))

Once we have the pairwise similarity between customers, we can define the entropy of a batch as the sum of the mutual pairwise similarities between the customers in the batch. Now we have an objective function to optimize: we want batches with maximum entropy.
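
The similarity and batch score can be sketched as below, assuming each customer quoted exactly one vehicle and that `p_v_given_c(vehicle, customer)` is some already-fitted estimate of P(V|C). All names are hypothetical. Read literally, a sum of similarities is largest when the customers in a batch are alike, which sits uneasily with the earlier goal of mixing the most different customers. The sketch therefore also shows a dissimilarity-based score as one assumed reading under which “higher is better” matches that goal.

```python
from itertools import combinations


def similarity(c1, c2, quoted, p_v_given_c):
    """Average likelihood of each customer buying the other's quoted vehicle."""
    return 0.5 * (p_v_given_c(quoted[c1], c2) + p_v_given_c(quoted[c2], c1))


def batch_similarity_sum(batch, quoted, p_v_given_c):
    """Sum of pairwise similarities, the batch score as literally defined in the text."""
    return sum(similarity(a, b, quoted, p_v_given_c)
               for a, b in combinations(batch, 2))


def batch_diversity(batch, quoted, p_v_given_c):
    """Assumed alternative: sum of pairwise dissimilarities (1 - similarity)."""
    return sum(1.0 - similarity(a, b, quoted, p_v_given_c)
               for a, b in combinations(batch, 2))


# Toy setup, invented for illustration.
quoted = {"c1": "sedan", "c2": "sedan", "c3": "truck"}


def toy_p(vehicle, customer):
    """Hypothetical P(V|C): high when the vehicle matches the customer's quote."""
    return 0.9 if quoted[customer] == vehicle else 0.05


alike = similarity("c1", "c2", quoted, toy_p)
different = similarity("c1", "c3", quoted, toy_p)
```

Here the two sedan shoppers are far more similar to each other than either is to the truck shopper, so a batch that pairs c1 with c3 is the easier one to attribute.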

Step 2. Batch optimization

After reading some similar studies, I decided to use the 2-opt algorithm, which is a simple local search algorithm for solving the traveling salesman problem.

The basic concept of the 2-opt algorithm is as follows: in every step, two edges are randomly picked and a “swap” is attempted. If the objective function improves after the swap, the swap is executed; otherwise, two new edges are picked. The algorithm continues until the objective function converges or the maximum number of iterations is reached. The following figure illustrates two edges (red) being picked and swapped into new edges (blue), achieving a shorter distance.
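
For reference, a randomized 2-opt for the TSP can be sketched in a few lines. This version stops only on a fixed iteration budget rather than checking convergence, and the four example points are invented for illustration.

```python
import math
import random


def tour_length(tour, points):
    """Total length of the closed tour visiting points in the given order."""
    return sum(math.dist(points[tour[k]], points[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))


def two_opt(points, max_iter=10_000, seed=0):
    """Randomized 2-opt: reverse a random segment, keep it if the tour shortens."""
    rng = random.Random(seed)
    tour = list(range(len(points)))
    best = tour_length(tour, points)
    for _ in range(max_iter):
        i, j = sorted(rng.sample(range(len(tour)), 2))
        # Reversing tour[i+1..j] replaces edges (i, i+1) and (j, j+1).
        candidate = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
        length = tour_length(candidate, points)
        if length < best:
            tour, best = candidate, length
    return tour, best


# Four corners of a unit square, listed so that the initial tour crosses itself.
points = [(0, 0), (1, 1), (1, 0), (0, 1)]
tour, best = two_opt(points)
```

Swapping the two crossing diagonals for two sides of the square is exactly the kind of move shown in the figure: the tour untangles and gets shorter.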

To apply the 2-opt algorithm in my case, I made analogies to the traveling salesman problem (TSP):

  • In TSP, two edges are randomly selected; in my case, two batches are randomly selected, and then one randomly picked customer from each batch is exchanged
  • In TSP, the total distance is the objective function, the shorter the better; in my case, the entropy of all batches is the objective function, the higher the better.
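
The analogy above can be sketched as a swap-based local search over batches. This is an illustration under assumptions, not the project's code: `batch_score` is any per-batch objective where higher is better, the dealership restriction mentioned earlier is not modeled, and the toy score (number of distinct customer types in a batch) is hypothetical.

```python
import random


def optimize_batches(batches, batch_score, max_iter=10_000, seed=0):
    """2-opt-style local search over batches.

    Each step picks two batches and one customer from each at random,
    swaps them, and keeps the swap only if the total score increases.
    """
    rng = random.Random(seed)
    batches = [list(b) for b in batches]
    total = sum(batch_score(b) for b in batches)
    for _ in range(max_iter):
        a, b = rng.sample(range(len(batches)), 2)
        i = rng.randrange(len(batches[a]))
        j = rng.randrange(len(batches[b]))
        before = batch_score(batches[a]) + batch_score(batches[b])
        batches[a][i], batches[b][j] = batches[b][j], batches[a][i]
        after = batch_score(batches[a]) + batch_score(batches[b])
        if after > before:
            total += after - before
        else:
            # Undo the swap when it does not improve the objective.
            batches[a][i], batches[b][j] = batches[b][j], batches[a][i]
    return batches, total


def distinct_types(batch):
    """Toy score: customer type is the first letter of the id (m or a)."""
    return len({customer[0] for customer in batch})


start = [["m1", "m2"], ["a1", "a2"]]
optimized, score = optimize_batches(start, distinct_types)
```

Starting from one all-musician batch and one all-athlete batch, the first accepted swap already produces two mixed batches, and no later swap can improve on that.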

Great, we have all the elements to optimize the batches! After implementing the algorithm, we backtested it over the existing data and found that more than 85% of sales could be recovered. In the following month, when we applied it to the real dataset, the recovery rate was at a similar level. The approach worked, as expected!

Stage 4. Engineer the solution to make it extendable and maintainable

What I described above is mainly the algorithm design part. In parallel, there is the engineering development part, and it is not enough to simply write the code and expect it to be extendable and maintainable.

As the project evolved, we gradually noticed a pattern of dependencies across the modules we needed. A vehicle is represented by many features, a customer is represented by a set of vehicles, and a batch is represented by a set of customers. With this high-level representation, we can build the dependency lineage as Vehicle -> Customer -> Batch.
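
One way to express the Vehicle -> Customer -> Batch lineage in code is with small data classes, as in this sketch. The field names and example values are illustrative assumptions, not the project's actual schema, and the built-in generic annotations require Python 3.9 or later.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vehicle:
    """A vehicle described by its features (feature names are illustrative)."""
    make: str
    model: str
    year: int


@dataclass
class Customer:
    """A customer represented by the set of vehicles they quoted."""
    customer_id: str
    quoted_vehicles: set[Vehicle] = field(default_factory=set)


@dataclass
class Batch:
    """A batch represented by the customers grouped into it."""
    batch_id: str
    customers: list[Customer] = field(default_factory=list)


# Placeholder values, for illustration only.
car = Vehicle("ExampleMake", "ExampleModel", 2014)
customer = Customer("c1", {car})
batch = Batch("b1", [customer])
```

Because each layer depends only on the one before it, a change to vehicle features does not ripple into the batching logic, which is what keeps the modules independently maintainable.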

Meanwhile, as a data product, the system needs to evolve: it must update the needed parameters and continually evaluate performance along the way. Hence the architecture was designed in the following way:

With this architecture, what the data scientists need to do on a regular basis is:

  • re-train the model for P(V|C) to ensure it incorporates the most recent customer purchasing behavior
  • simulate the whole process, including both batch optimization and sales attribution, to ensure the system performance stays above a threshold
  • run the monthly batch optimization to prepare data for our revenue team, and the sales attribution to match customers to sales

Now we have built a sustainable, maintainable data product. Because the data science team had established a good reputation, in the following year we were heavily involved in the redesign of the sales matching system, which further expanded the data science footprint across the company. Thanks to this architecture’s operational excellence, it freed up resources for us to seek the next challenge.

The general problem-solving flow with the right mindset

The data science area is quite broad, and designing algorithmic data products is only one of many potential kinds of projects. Other commonly seen data science projects include experimentation design, causal inference, deep-dive analysis to drive strategic changes, and so on. Although they may not strictly follow or even need all the stages I listed above, the four-stage flow still helps lay out a way to think about problem-solving in general:

  • Stage 1 (problem identification) helps you focus on the key question and not lose track while diving deep into data
  • Stage 2 (first logical solution) gets you a quick win and keeps the momentum to build trust with business partners
  • Stage 3 (iterative improvement) helps you move the solution further ahead and be the owner of the area
  • Stage 4 (operational excellence) helps you remove tech debt and frees you from mundane maintenance work going forward

The four-stage flow is not a strict rule one must follow; it is more a natural outcome when a data scientist has the right mindsets while facing any incoming challenge. In my opinion, these mindsets are:

  • Business-driven, not algorithm-driven. Look at the big picture and see how data science fits in the business, understand why data science is needed and how it delivers value. Don’t be too attached to any specific algorithm: “if all you have is a hammer, everything looks like a nail”.

  • Owning the problem, not just taking orders. Being the owner of a problem means one will be proactive in thinking about how to solve it now, solve it better, and solve it with less effort. One would not stop at a sub-optimal solution and consider it done.

  • Open-minded, and always be learning. As an interdisciplinary field, data science overlaps with statistics, computer science, operational research, psychology, economics, marketing, sales, and more! It’s almost impossible to know all the areas ahead of time, so be open-minded and keep learning along the way. There could always be a better solution than the one you already know.

I hope you find the above helpful: happy problem solving, the data science way.

— — — — — — — — — — — — — —

If you enjoyed this article, help spread the word by liking, sharing, and commenting. Pan is currently a Data Science Manager at LinkedIn. You can read previous posts and follow him on LinkedIn.

Here are two previous articles sharing Pan’s Data Science experience:

My First Data Science Project

How to innovate in Data Science

Translated from: https://towardsdatascience.com/problem-solving-as-data-scientist-a-case-study-49296d8cd7b7
