Industrial strength Natural Language Processing

by Kavita Ganesan

I have spent much of my career as a graduate student researcher and now as a Data Scientist in industry. One thing I have come to realize is that the vast majority of solutions proposed, both in academic research papers and in the workplace, are just not meant to ship — they just don’t scale!

And when I say scale, I mean:

  • Handling real-world use cases
  • Handling large amounts of data
  • Ease of deployment in a production environment

Some of these approaches either work on extremely narrow use cases, or have a tough time generating results in a timely manner.

More often than not, the problem lies in the approach that was used; yet when things go wrong, we tend to declare the problem “unsolvable”. Remember, there will almost always be more than one way to solve a Natural Language Processing (NLP) or Data Science problem. Optimizing your choices will increase your chance of success in deploying your models to production.

Over the past decade I have shipped solutions that serve real users. From this experience, I now follow a set of best practices that maximizes my chance of success every time I start a new NLP project.

In this article, I will share some of these with you. I swear by these principles, and I hope they come in handy for you as well.

1. KISS please!

KISS (Keep it simple, stupid). When solving NLP problems, this seems like common sense.

But I can’t say this enough: choose techniques and pipelines that are easy to understand and maintain. Avoid complex ones that only you understand, sometimes only partially.

In a lot of NLP applications, you would typically notice one of two things:

  1. Deep pre-processing layers, OR
  2. Complex neural network architectures that are just hard to grasp, let alone train, maintain and improve on iteratively.

The first question to ask yourself is whether you need all of those layers of pre-processing.

Do you really need part-of-speech tagging, chunking, entity resolution, lemmatization, and so on? What if you strip out a few layers? How does this affect the performance of your models?
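
One way to answer this is to measure it: train the same simple model on raw text and on heavily pre-processed text, and compare the scores. Below is a minimal sketch of that kind of check, assuming scikit-learn and a labeled corpus; the texts, labels and lemmatize names are placeholders, not anything from the original project.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def evaluate(texts, labels, preprocessor=None):
        # Same simple model; only the pre-processing step changes.
        pipeline = make_pipeline(
            TfidfVectorizer(preprocessor=preprocessor),  # None = default, essentially raw text
            LogisticRegression(max_iter=1000),
        )
        return cross_val_score(pipeline, texts, labels, cv=5).mean()

    # baseline = evaluate(texts, labels)                # no extra pre-processing
    # lemmatized = evaluate(texts, labels, lemmatize)   # with a heavier layer
    # If the two scores are close, the extra layer may not be earning its keep.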

With access to massive amounts of data, you can often actually let the evidence in the data guide your model.

Think of Word2Vec. The success of Word2Vec is in its simplicity. You use large amounts of data to draw meaning — using the data itself. Layers? What layers?

When it comes to Deep Learning, use it wisely. Not all problems benefit from Deep Learning. For the problems that do, use the architectures that are easy to understand and improve on.

For example, for a programming language classification task, I just used a two-layer Artificial Neural Network and realized big wins in terms of training speed and accuracy.

In addition, adding a new programming language is pretty seamless, as long as you have data to feed into the model.
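
For illustration only, here is a rough sketch of what such a setup could look like. This is not the actual GitHub model; it is a minimal scikit-learn example, and the character n-gram features and the snippets/languages names are my own assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    # Character n-grams pick up syntax cues (keywords, braces, operators)
    # without any language-specific pre-processing.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=50000),
        MLPClassifier(hidden_layer_sizes=(128,), max_iter=200),  # small, easy-to-train network
    )

    # snippets, languages: labeled source-code snippets (assumed to exist)
    # model.fit(snippets, languages)
    # Supporting a new language then only requires adding labeled snippets and retraining.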

I could have complicated the model to gain some social currency by using a really complex RNN architecture straight from a research paper. But I ended up starting simple just to see how far this would get me, and now I’m at the point where I can say, what’s the need to add more complexity?

2. When in doubt, use a time-tested approach

With every NLP/text mining problem, your options are plenty. There will always be more than one way to accomplish the same task.

For example, in finding similar documents, you could use a simple bag-of-words approach and compute document similarities using the resulting tf-idf vector.

Alternatively, you could do something fancier by generating an embedding for each document and computing similarities between those embeddings.
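
The bag-of-words route, for instance, takes only a few lines with scikit-learn. A toy sketch with made-up documents, purely for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "a cat was sitting on a mat",
        "stock prices fell sharply today",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)  # documents -> sparse tf-idf vectors
    similarities = cosine_similarity(tfidf)        # pairwise document similarity matrix
    print(similarities.round(2))                   # the two cat sentences score highest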

Which should you use? It actually depends on several things:

  1. Which of these methods has a higher chance of success in practice? (Hint: We see tf-idf being used all the time for information retrieval, and it is super fast. How does the embedding option compare?)
  2. Which of these do I understand better? Remember, the more you understand something, the better your chance of tuning it and getting it to work the way you expect it to.
  3. Do I have the necessary tools/data to implement either of these?

Some of these questions can be easily answered with a bit of literature search.

But you could also reach out to experts such as university professors or other data scientists who have worked on similar problems to give you a recommendation. Occasionally, I run my ideas by my peers who are in the same field to make sure I am thinking about problems and potential solutions correctly, before diving right in.

As you get more and more projects under your belt, the intuition factor kicks in. You will develop a very strong sense about what’s going to work and what’s not.

3. Understand your end-point extremely well

My work on topics for GitHub initially started off as a way to generate topics for repository recommendations. Those topics would never have been exposed to the user. They were only intended to be used internally to compute repo-to-repo similarity.

During development, people got really excited and suggested that these should be exposed to users directly. My immediate response was “Heck, no!”. But people wondered, why not?

Very simple: that was not the intended use of those topics. The level of noise you can tolerate in something used only internally is much higher than in something shown to users as suggestions, externally.

So in the case of topics, I actually spent three additional months improving the work so that it could be exposed to users.

I can’t say this enough, but you need to know what your end goal is so that you are actually working towards providing a solution that addresses the problem.

Fuzziness in the end goal you are trying to achieve can result in either a complete redo, or months of extra work tuning and tweaking your models to do the right thing.

4. Pay attention to your data quality

“Garbage in, garbage out” is true in every sense of the word when it comes to machine learning and NLP.

If you are trying to make predictions of sentiment classes (positive versus negative) and your positive examples contain a large number of negative comments and vice versa, your classifier is going to be confused.

Imagine if I told you 1+2=3, then the next time told you 1+2=4, and the time after that told you 1+2=3 again. Ugh, wouldn’t you be confused? It’s the same analogy.

Also, if you have 90% positive examples and 10% negative ones, how well do you think your classifier is going to perform on negative comments? It’s probably going to say every comment is a positive comment.

Class imbalance and lack of diversity in your data can be a real problem. The more diverse your training data, the better it will generalize.
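
As a first step, it helps to simply measure the skew and, if needed, compensate for it. A minimal sketch follows; the labels variable is a placeholder, and class weighting is just one mitigation among several (resampling is another).

    from collections import Counter
    from sklearn.linear_model import LogisticRegression

    # labels: your training labels (assumed to exist)
    # print(Counter(labels))  # e.g. Counter({'positive': 9000, 'negative': 1000}) -> heavily skewed

    # Weight classes inversely to their frequency so the minority class
    # still influences the decision boundary.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)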

This was very evident in one of my research projects on clinical text segmentation. When we forced variety in training examples, the results clearly improved.

While over-processing your data may be unnecessary, under-processing it may also be detrimental.

Let’s take Tweets for example. Tweets are highly noisy. You may have out-of-vocabulary words like looooooove and abbreviations like lgtm.

To make sense of any of this, you would probably need to bring these back to their normal form first. Without that, you would fall right into the garbage-in-garbage-out trap, especially if you are dealing with a fairly small dataset.
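
A light normalization pass can already go a long way. A minimal sketch is below; the abbreviation table and the example tweet are made up, and a real system would use a much larger lexicon.

    import re

    ABBREVIATIONS = {"lgtm": "looks good to me"}  # tiny illustrative lookup, not exhaustive

    def normalize_tweet(text):
        text = text.lower()
        # Collapse characters repeated 3+ times: "looooooove" -> "loove"
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)
        # Expand known abbreviations, token by token
        return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())

    print(normalize_tweet("LGTM I looooooove it"))  # -> "looks good to me i loove it"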

5. Don’t completely believe your quantitative results.

Numbers can sometimes lie.

For example, in a text summarization project, the overlap between your machine-generated summary and the human-curated summary may be 100%.

However, when you actually visually inspect the machine and human summaries, you might find something astonishing.

The human says: “this is a great example of a bad summary”. The machine says: “example great this is summary a bad a of”
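
A plain unigram-overlap score cannot tell these two apart, as a quick check shows. This uses a simple set-overlap metric for illustration, not any particular evaluation library.

    human = "this is a great example of a bad summary"
    machine = "example great this is summary a bad a of"

    overlap = len(set(human.split()) & set(machine.split())) / len(set(human.split()))
    print(overlap)  # 1.0 -- every reference word also appears in the machine output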

And your overlap score would still be 100%. See my point? Quantitative evaluation alone is not enough!

You need to visually inspect your results — and lots of them. Try to intuitively understand the problems that you are seeing. That’s one excellent way of getting more ideas on how to tweak your algorithm, or ditch it altogether.

In the summarization example, the problem was obvious: the word arrangement needs a lot of work!

6. Think about cost and scalability.

Have you ever thought about what it would take to deploy your model in a production environment?

  • What are your data dependencies?
  • How long does your model take to run?
  • How about time to predict or generate results?
  • Also, what are the memory and computation requirements of your approach when you scale up to the real number of data points that it would be handling?

All of these have a direct impact on whether you can afford to use your proposed approach and, secondly, whether you will be able to handle a production load.

If your model is GPU bound, make sure that you are able to afford the cost of serving such a model.

The earlier you think about cost and scalability, the higher your chance of success in getting your models deployed.

In my projects, I always instrument the time it takes to train, classify and process different loads, to approximate how well the solutions that I am developing would hold up in a production environment.
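
A tiny timing harness is usually all this takes. A sketch is below; the model and data names in the commented usage are placeholders.

    import time

    def timed(label, fn, *args, **kwargs):
        # Log how long a single stage takes at a given load.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{label}: {time.perf_counter() - start:.2f}s")
        return result

    # Hypothetical usage, assuming model, train_docs, train_labels and test_docs exist:
    # timed("train on 10k docs", model.fit, train_docs, train_labels)
    # timed("predict on 100k docs", model.predict, test_docs)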

Long story short…

The prototypes you develop don’t at all have to be throwaway prototypes. They can be the start of a really powerful production-level solution if you plan ahead.

Think about your end-point and how the output from your approach will be consumed and used. Don’t over-complicate your solution. You will not go wrong if you KISS and pick a technique that fits the problem instead of forcing your problem to fit your chosen technique!

I write about Text Mining, NLP and Machine Learning from an applied perspective. Follow my blog to keep learning.

This article was originally published at kavita-ganesan.com

Source: https://www.freecodecamp.org/news/industrial-strength-natural-language-processing-de2588b6b1ed/
