Industrial strength Natural Language Processing

by Kavita Ganesan

I have spent much of my career as a graduate student researcher and now as a Data Scientist in industry. One thing I have come to realize is that the vast majority of solutions proposed, both in academic research papers and in the workplace, are just not meant to ship — they just don’t scale!

And when I say scale, I mean:

  • Handling real-world use cases
  • Handling large amounts of data
  • Ease of deployment in a production environment

Some of these approaches either work on extremely narrow use cases, or have a tough time generating results in a timely manner.

More often than not, the problem lies in the approach that was used; yet when things go wrong, we tend to declare the problem “unsolvable”. Remember, there will almost always be more than one way to solve a Natural Language Processing (NLP) or Data Science problem. Optimizing your choices will increase your chance of success in deploying your models to production.

Over the past decade I have shipped solutions that serve real users. From this experience, I now follow a set of best practices that maximizes my chance of success every time I start a new NLP project.

In this article, I will share some of these with you. I swear by these principles, and I hope they come in handy for you as well.

1. KISS please!

KISS (Keep it simple, stupid). When solving NLP problems, this seems like common sense.

But I can’t say this enough: choose techniques and pipelines that are easy to understand and maintain. Avoid complex ones that only you understand, sometimes only partially.

In a lot of NLP applications, you would typically notice one of two things:

  1. Deep pre-processing layers, OR
  2. Complex neural network architectures that are just hard to grasp, let alone train, maintain and improve on iteratively.

The first question to ask yourself is whether you need all of those layers of pre-processing.

Do you really need part-of-speech tagging, chunking, entity resolution, lemmatization, and so on? What if you strip out a few layers? How does this affect the performance of your models?
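
One way to answer this is to measure it: train the same simple model on raw text and on heavily pre-processed text, and compare the scores. Below is a minimal sketch of that kind of check, assuming scikit-learn and a labeled corpus; the texts, labels and lemmatize names are placeholders, not anything from the original project.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def evaluate(texts, labels, preprocessor=None):
        # Same simple model; only the pre-processing step changes.
        pipeline = make_pipeline(
            TfidfVectorizer(preprocessor=preprocessor),  # None = default, essentially raw text
            LogisticRegression(max_iter=1000),
        )
        return cross_val_score(pipeline, texts, labels, cv=5).mean()

    # baseline = evaluate(texts, labels)                # no extra pre-processing
    # lemmatized = evaluate(texts, labels, lemmatize)   # with a heavier layer
    # If the two scores are close, the extra layer may not be earning its keep.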

With access to massive amounts of data, you can often actually let the evidence in the data guide your model.

Think of Word2Vec. The success of Word2Vec is in its simplicity. You use large amounts of data to draw meaning — using the data itself. Layers? What layers?

When it comes to Deep Learning, use it wisely. Not all problems benefit from Deep Learning. For the problems that do, use the architectures that are easy to understand and improve on.

For example, for a programming language classification task, I just used a two-layer Artificial Neural Network and realized big wins in terms of training speed and accuracy.

In addition, adding a new programming language is pretty seamless, as long as you have data to feed into the model.
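
For illustration only, here is a rough sketch of what such a setup could look like. This is not the actual GitHub model; it is a minimal scikit-learn example, and the character n-gram features and the snippets/languages names are my own assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    # Character n-grams pick up syntax cues (keywords, braces, operators)
    # without any language-specific pre-processing.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=50000),
        MLPClassifier(hidden_layer_sizes=(128,), max_iter=200),  # small, easy-to-train network
    )

    # snippets, languages: labeled source-code snippets (assumed to exist)
    # model.fit(snippets, languages)
    # Supporting a new language then only requires adding labeled snippets and retraining.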

I could have complicated the model to gain some social currency by using a really complex RNN architecture straight from a research paper. But I ended up starting simple just to see how far this would get me, and now I’m at the point where I can say, what’s the need to add more complexity?

2. When in doubt, use a time-tested approach

With every NLP/text mining problem, your options are plenty. There will always be more than one way to accomplish the same task.

For example, in finding similar documents, you could use a simple bag-of-words approach and compute document similarities using the resulting tf-idf vector.

Alternatively, you could do something fancier by generating an embedding for each document and computing similarities between those embeddings.
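
The bag-of-words route, for instance, takes only a few lines with scikit-learn. A toy sketch with made-up documents, purely for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "a cat was sitting on a mat",
        "stock prices fell sharply today",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)  # documents -> sparse tf-idf vectors
    similarities = cosine_similarity(tfidf)        # pairwise document similarity matrix
    print(similarities.round(2))                   # the two cat sentences score highest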

Which should you use? It actually depends on several things:

  1. Which of these methods has a higher chance of success in practice? (Hint: We see tf-idf being used all the time for information retrieval, and it is super fast. How does the embedding option compare?)
  2. Which of these do I understand better? Remember, the more you understand something, the better your chance of tuning it and getting it to work the way you expect it to.
  3. Do I have the necessary tools/data to implement either of these?

Some of these questions can be easily answered with a bit of literature search.

But you could also reach out to experts such as university professors or other data scientists who have worked on similar problems to give you a recommendation. Occasionally, I run my ideas by my peers who are in the same field to make sure I am thinking about problems and potential solutions correctly, before diving right in.

As you get more and more projects under your belt, the intuition factor kicks in. You will develop a very strong sense about what’s going to work and what’s not.

3. Understand your end-point extremely well

My work on topics for GitHub initially started off as a way to generate topics for repository recommendations. Those topics would never have been exposed to the user. They were only intended to be used internally to compute repo-to-repo similarity.

During development, people got really excited and suggested that these should be exposed to users directly. My immediate response was “Heck, no!”. But people wondered, why not?

Very simple: that was not the intended use of those topics. The level of noise you can tolerate in something used only internally is much higher than in something shown to users as suggestions, externally.

So in the case of topics, I actually spent three additional months improving the work so that it could be exposed to users.

I can’t say this enough, but you need to know what your end goal is so that you are actually working towards providing a solution that addresses the problem.

Fuzziness in the end goal you are trying to achieve can result in either a complete redo, or months of extra work tuning and tweaking your models to do the right thing.

4. Pay attention to your data quality

“Garbage in, garbage out” is true in every sense of the word when it comes to machine learning and NLP.

If you are trying to make predictions of sentiment classes (positive versus negative) and your positive examples contain a large number of negative comments and vice versa, your classifier is going to be confused.

Imagine if I told you 1+2=3, then the next time told you 1+2=4, and the time after that told you 1+2=3 again. Ugh, wouldn’t you be confused? It’s the same analogy.

Also, if you have 90% positive examples and 10% negative ones, how well do you think your classifier is going to perform on negative comments? It’s probably going to say every comment is a positive comment.

Class imbalance and lack of diversity in your data can be a real problem. The more diverse your training data, the better it will generalize.
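
As a first step, it helps to simply measure the skew and, if needed, compensate for it. A minimal sketch follows; the labels variable is a placeholder, and class weighting is just one mitigation among several (resampling is another).

    from collections import Counter
    from sklearn.linear_model import LogisticRegression

    # labels: your training labels (assumed to exist)
    # print(Counter(labels))  # e.g. Counter({'positive': 9000, 'negative': 1000}) -> heavily skewed

    # Weight classes inversely to their frequency so the minority class
    # still influences the decision boundary.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)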

This was very evident in one of my research projects on clinical text segmentation. When we forced variety in training examples, the results clearly improved.

While over-processing your data may be unnecessary, under-processing it may also be detrimental.

Let’s take Tweets for example. Tweets are highly noisy. You may have out-of-vocabulary words like looooooove and abbreviations like lgtm.

To make sense of any of this, you would probably need to bring these back to their normal form first. Without that, you would fall right into the garbage-in-garbage-out trap, especially if you are dealing with a fairly small dataset.
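
A light normalization pass can already go a long way. A minimal sketch is below; the abbreviation table and the example tweet are made up, and a real system would use a much larger lexicon.

    import re

    ABBREVIATIONS = {"lgtm": "looks good to me"}  # tiny illustrative lookup, not exhaustive

    def normalize_tweet(text):
        text = text.lower()
        # Collapse characters repeated 3+ times: "looooooove" -> "loove"
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)
        # Expand known abbreviations, token by token
        return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())

    print(normalize_tweet("LGTM I looooooove it"))  # -> "looks good to me i loove it"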

5. Don’t completely believe your quantitative results.

Numbers can sometimes lie.

For example, in a text summarization project, the overlap between your machine-generated summary and the human-curated summary may be 100%.

However, when you actually visually inspect the machine and human summaries, you might find something astonishing.

The human says: “this is a great example of a bad summary”. The machine says: “example great this is summary a bad a of”
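
A plain unigram-overlap score cannot tell these two apart, as a quick check shows. This uses a simple set-overlap metric for illustration, not any particular evaluation library.

    human = "this is a great example of a bad summary"
    machine = "example great this is summary a bad a of"

    overlap = len(set(human.split()) & set(machine.split())) / len(set(human.split()))
    print(overlap)  # 1.0 -- every reference word also appears in the machine output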

And your overlap score would still be 100%. See my point? Quantitative evaluation alone is not enough!

You need to visually inspect your results — and lots of them. Try to intuitively understand the problems that you are seeing. That’s one excellent way of getting more ideas on how to tweak your algorithm, or ditch it altogether.

In the summarization example, the problem was obvious: the word arrangement needs a lot of work!

6. Think about cost and scalability.

Have you ever thought about what it would take to deploy your model in a production environment?

  • What are your data dependencies?
  • How long does your model take to run?
  • How about time to predict or generate results?
  • Also, what are the memory and computation requirements of your approach when you scale up to the real number of data points that it would be handling?

All of these have a direct impact on whether you can afford to use your proposed approach and, secondly, whether you will be able to handle a production load.

If your model is GPU bound, make sure that you are able to afford the cost of serving such a model.

The earlier you think about cost and scalability, the higher your chance of success in getting your models deployed.

In my projects, I always instrument the time it takes to train, classify and process different loads, to approximate how well the solutions that I am developing would hold up in a production environment.
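
A tiny timing harness is usually all this takes. A sketch is below; the model and data names in the commented usage are placeholders.

    import time

    def timed(label, fn, *args, **kwargs):
        # Log how long a single stage takes at a given load.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{label}: {time.perf_counter() - start:.2f}s")
        return result

    # Hypothetical usage, assuming model, train_docs, train_labels and test_docs exist:
    # timed("train on 10k docs", model.fit, train_docs, train_labels)
    # timed("predict on 100k docs", model.predict, test_docs)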

Long story short…

The prototypes you develop don’t at all have to be throwaway prototypes. They can be the start of a really powerful production-level solution if you plan ahead.

Think about your end-point and how the output from your approach will be consumed and used. Don’t over-complicate your solution. You will not go wrong if you KISS and pick a technique that fits the problem instead of forcing your problem to fit your chosen technique!

I write about Text Mining, NLP and Machine Learning from an applied perspective. Follow my blog to keep learning.

This article was originally published at kavita-ganesan.com

Source: https://www.freecodecamp.org/news/industrial-strength-natural-language-processing-de2588b6b1ed/
