Fake News Detection Using NLP

Nothing New Under The Sun

Fake News is being talked about by everyone, from your best friend to your parents; perhaps even your goldfish are whispering in the corners of the tank. It's even being covered by Real News at an alarming clip. Dictionary.com even listed 'misinformation' as its Word of the Year in 2018. However, this isn't a particularly new problem, right? After all, Jonathan Swift wrote in 1710, "Falsehood flies, and the Truth comes limping after it." And then there is the more famous quote, often attributed to Mark Twain: "A lie can travel halfway around the world before the truth can get its boots on."


So, if this isn’t a new problem, why is it one of the most talked-about topics today?


Mo Data, Mo Problems

We won't dive into all the intricate reasons here. However, one of the most obvious reasons makes it easy to understand why this problem has proliferated and permeated into every crevice of our lives. In a word: accessibility.


It isn't hard to see how quickly things can get out of hand. The ability for almost anyone, anywhere in the world, to publish or share an article, video, or podcast comes at a cost: "It would take me at least 20 minutes to verify that this story is true! Nah, I'll just retweet it because it looks true enough."


So, let’s review a project, then discuss a bit more about where things are and where they could be going.


Detect Fake News Using NLP

We will be using two datasets for this project: one real news dataset and one fake news dataset. Let's take a look at the first five observations of each.

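As a rough sketch of that step (the file names True.csv and Fake.csv are my assumption, not something shown in the article; point them at wherever the two CSVs actually live), loading both sets and peeking at the first five rows looks something like this:

import pandas as pd

# Hypothetical file names for the two sets.
real_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")

# First five observations of each set.
print(real_df.head())
print(fake_df.head())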

We can see the datasets are not that different, which I feel is a metaphor for this entire issue. Well, anyway, we have a title, text, subject, and date for each observation in both datasets.


But I'm sure you can already see some of the differences. Take a look at the text for the real articles. Each text string starts with the location of the story, followed by the name of the news outlet. Before running this through an algorithm, we can already see that there is a key difference between the datasets.


After we add a new column to each dataframe to distinguish whether an article is real or fake, we concatenate them into a single dataframe. Looking at a sample of the new dataframe, we see another difference: none of the real articles have ALL CAPS in the title. However, in this cross-section, we can see that three of the five fake articles have all caps in the title. Also, we can see that the subjects are similar, but not the same.

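A minimal sketch of that labeling-and-stacking step. The column name category and the 1 = real / 0 = fake encoding are inferred from the word-cloud snippet later in the article, not stated explicitly here:

# Tag each set with a label, then stack them into one dataframe.
real_df["category"] = 1   # real news
fake_df["category"] = 0   # fake news

df = pd.concat([real_df, fake_df], ignore_index=True)
df.sample(5)   # a random cross-section like the one described above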

We have 21,417 and 23,481 observations for real and fake news, respectively. So, we have a decently balanced dataset, with no null values in the 44,898 observations.

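Two quick checks along those lines, assuming the combined dataframe from the previous step:

# Class balance and a completeness check.
print(df["category"].value_counts())   # 21,417 real vs 23,481 fake
print(df.isnull().sum())               # expect zeros across the board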

Chart-attack

Now that we’re done wrangling and doing some basic prep/exploration with the data, let’s bust out some of those visualization libraries.


As we've already discussed, the dataset we're dealing with is decently balanced. Even with the class counts this close, we will still stratify the data when we do our train-test split.

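A hedged sketch of a stratified split; the test size and random seed here are illustrative choices, not the author's:

from sklearn.model_selection import train_test_split

# Stratifying on the label keeps the real/fake ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"],
    test_size=0.2, stratify=df["category"], random_state=42,
)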

After reviewing the dataframe, I started to wonder about the subject column. My initial hypothesis was that it might be a useful column to include in our analysis. With that in mind, we did a value count of the totals per subject.

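That value count is a one-liner on the combined dataframe:

# Articles per subject label, across both real and fake news.
print(df["subject"].value_counts())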

Even though there is low representation in a few of the categories, I still thought this might be something meaningful to consider when doing our analysis. However, taking it one step further…

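One way to take that step, as a sketch rather than the exact call used: a simple cross-tabulation of subject against the real/fake label.

# Break the subject counts out by label.
print(pd.crosstab(df["subject"], df["category"]))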

Ah, there it is. Look at us continuing to find patterns! Who needs those Machine Learning algorithms anyway? We do. We all do. But, mostly me.


It does seem odd that the real news articles only have two different subject categories; it's the most unbalanced thing about our data. Perhaps all the articles were pulled from a few sources, and from a few categories within them. All the more reason to use as much data as possible for this type of work.


This points to the bigger picture of what makes exploratory data analysis fun and insightful. We’ve found some good insights, even about the data scraping process used.


That cloud is in the shape of a duck

Before we get to everyone’s favorite NLP graphic, the Word Cloud, we’ll need to clean up the dataframe a bit.


All the cleaning steps have been broken out into individual functions. When you're writing functions, you have the option to create One Function To Rule Them All, the Mother Of All Functions. That isn't necessarily bad. However, I find that approach more difficult to debug and maintain. In this example, if we wanted to change the language of the stop-words, or if any individual function didn't work, we'd be able to change just that function without other issues potentially cropping up.

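The actual cleaning functions live in the project repo; the sketch below only illustrates the idea, and the function names and the NLTK English stop-word list are my assumptions rather than the author's exact code:

import re
import string
from nltk.corpus import stopwords   # assumes nltk.download("stopwords") has been run

STOP_WORDS = set(stopwords.words("english"))

def remove_brackets(text):
    # Drop anything wrapped in square brackets, e.g. "[VIDEO]".
    return re.sub(r"\[.*?\]", "", text)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def clean_text(text):
    # Small, single-purpose steps chained together -- easy to swap out or debug individually.
    return remove_stopwords(remove_punctuation(remove_brackets(text)))

df["text"] = df["text"].apply(clean_text)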

We've removed English stop-words, removed some bracket punctuation, and so on, and now we can make a few word-clouds. We can use this little piece of code to make them.


import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

plt.figure(figsize=(20, 20))

# Text from the real news articles
wc = WordCloud(max_words=2000, width=1600, height=800,
               stopwords=STOPWORDS).generate(" ".join(df[df.category == 1].text))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Real News Word Cloud
Fake News Word Cloud

We can certainly see some differences between these two word-clouds. I prefer a more custom approach when making word-clouds. In the case of these two data sets, I went with these:

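The exact styling used for the clouds shown in the article isn't spelled out, so treat the settings below as illustrative; a white background and a fixed colormap are one way to get a more custom look (this reuses the WordCloud, STOPWORDS, and plt imports from the snippet above):

custom_wc = WordCloud(
    max_words=500,
    width=1600, height=800,
    background_color="white",
    colormap="viridis",
    stopwords=STOPWORDS,
).generate(" ".join(df[df.category == 0].text))   # fake articles this time

plt.figure(figsize=(20, 10))
plt.imshow(custom_wc, interpolation="bilinear")
plt.axis("off")
plt.show()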

Some of the words have similar representations. This isn’t too surprising, since most of the articles were listed as being of a political nature. A few standout differences for me are Hillary Clinton and Obama in the fake news word-cloud.


Accurate, but does it matter?

If you want to see all the code used during the modeling process, head over to GitHub. Here are the results:


So, we’re left with a fairly accurate model using basic NLP libraries and techniques.

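The full modeling code lives in the GitHub repo linked above; as a rough idea of what a baseline built from basic NLP libraries can look like, here is a hedged sketch using TF-IDF features and logistic regression (my illustrative choices, not necessarily the exact model used):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Bag-of-words baseline: TF-IDF features feeding a linear classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_df=0.7)),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))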

At the start of this article, we talked briefly about fake news, misinformation, and propaganda; none of it is new. We can see that some basic techniques can produce decent results without much tuning. So, what makes it so difficult for platforms like Facebook to root out fake news shared on their platforms?


Well, a tremendous number of things. In Facebook’s own words,


[false] news is harmful to our community, it makes the world less informed, and it erodes trust. It’s not a new phenomenon, and all of us — tech companies, media companies, newsrooms, teachers — have a responsibility to do our part in addressing it. At Facebook, we’re working to fight the spread of false news in three key areas:


disrupting economic incentives because most false news is financially motivated;


building new products to curb the spread of false news; and


helping people make more informed decisions when they encounter false news.


Across all of their platforms, they are attempting to find a suitable solution with a multi-pronged approach, though there may be more measures they aren't sharing publicly. Even with world-class data teams at every tech company around the world, the elusiveness of fake news will continue to challenge what seems logically possible: its eradication. Or, at the very least, the ability to find as much of it as possible and remove it from their platforms.


The fairly straightforward nature of this project can lead to a misguided understanding of the problem and, more importantly, of the solution.


Watching from the sidelines to see what strategies get implemented to curtail this major issue is exciting. Though joining one of these amazing teams to help bring those strategies to life would be a bit more exciting.


LinkedIn

Connect with me on LinkedIn: https://www.linkedin.com/in/wchasethompson


Translated from: https://medium.com/swlh/fake-news-detection-using-nlp-e744a6909276
