Text Summarization for Clustering Documents

This is part 2 of a series on analyzing healthcare chart notes using Natural Language Processing (NLP).

In the first part, we talked about cleaning the text and extracting sections of the chart notes that might be useful for further annotation by analysts. This reduces the time they spend manually going through an entire chart note when they are only looking for “allergies” or “social history”.

NLP Tasks:

1. Pre-processing and Cleaning
2. Text Summarization (we are here)
3. Topic Modeling using Latent Dirichlet Allocation (LDA)
4. Clustering

If you want to try the entire code yourself or follow along, go to my published Jupyter notebook on GitHub: https://github.com/gaurikatyagi/Natural-Language-Processing/blob/master/Introdution%20to%20NLP-Clustering%20Text.ipynb

DATA:

Source: https://mimic.physionet.org/about/mimic/

Doctors take notes on their computers, and 80% of what they capture is unstructured, which makes processing the information even more difficult. Let’s not forget that interpreting healthcare jargon is not an easy task either: it requires a lot of context. Let’s see what we have:

Text Summarization

spaCy isn’t great at Named Entity Recognition (NER) on healthcare documents. See below:

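```python
import pandas as pd
from IPython.display import display, HTML

# `nlp` (the loaded spaCy pipeline) and `notes_data` (the chart-notes DataFrame)
# are defined earlier in the notebook.
doc = nlp(notes_data["TEXT"][178])

# Collect every entity spaCy detected, together with its predicted label
text_label_df = pd.DataFrame({
    "label": [ent.label_ for ent in doc.ents],
    "text": [ent.text for ent in doc.ents],
})
display(HTML(text_label_df.head(10).to_html()))
```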

Image by Author: Poor job at named entity recognition in healthcare jargon

But that does not mean it cannot be used to summarize our text. It is still great at identifying dependencies in the text using part-of-speech tagging. Let’s see:

```python
# Process the text
doc = nlp(notes_data["TEXT"][174][:100])
print(notes_data["TEXT"][174][:100], "\n")

# Iterate over the tokens in the doc, skipping determiners,
# punctuation, whitespace and conjunctions
for token in doc:
    if not (token.pos_ in ("DET", "PUNCT", "SPACE") or "CONJ" in token.pos_):
        print(token.text, token.pos_)
        print("lemma:", token.lemma_)
        print("dependency:", token.dep_, "- ", token.head.orth_)
        print("prefix:", token.prefix_)
        print("suffix:", token.suffix_)
```

Image by Author: Dependency identification

So, we can summarize the text based on dependency tracking. YAYYYYY!!!
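
The summarization code itself is not reproduced in this article, so what follows is only a minimal sketch of the idea, not the notebook's implementation. It assumes each token has already been parsed into its text, part-of-speech tag, dependency label and head index, the way spaCy exposes them. The `Token` tuple, the `depth` and `summarize` functions, the `max_depth` cutoff and the hand-tagged sample sentence are all hypothetical. The sketch drops the same low-information tags as the filter in the previous snippet and keeps only the words that sit close to the root of the dependency tree.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for the spaCy token attributes printed earlier."""
    text: str
    pos: str   # coarse part-of-speech tag, like token.pos_
    dep: str   # dependency label, like token.dep_
    head: int  # index of the head token; the root points at itself


# Same low-information tags that the token loop skips
SKIP_POS = {"DET", "PUNCT", "SPACE"}


def depth(tokens: List[Token], i: int) -> int:
    """Count the hops from token i up to the root of the dependency tree."""
    hops = 0
    # The length guard prevents an endless loop on malformed (cyclic) input
    while tokens[i].head != i and hops < len(tokens):
        i = tokens[i].head
        hops += 1
    return hops


def summarize(tokens: List[Token], max_depth: int = 2) -> str:
    """Keep content words within max_depth hops of the root, in original order.

    Assumption: words near the root carry the gist of the sentence.
    """
    kept = [
        tok.text
        for i, tok in enumerate(tokens)
        if tok.pos not in SKIP_POS
        and "CONJ" not in tok.pos
        and depth(tokens, i) <= max_depth
    ]
    return " ".join(kept)


# Sample sentence with tags written by hand, not produced by a parser
sentence = [
    Token("The", "DET", "det", 1),
    Token("patient", "NOUN", "nsubj", 2),
    Token("denies", "VERB", "ROOT", 2),
    Token("any", "DET", "det", 5),
    Token("chest", "NOUN", "compound", 5),
    Token("pain", "NOUN", "dobj", 2),
    Token("and", "CCONJ", "cc", 5),
    Token("nausea", "NOUN", "conj", 5),
    Token(".", "PUNCT", "punct", 2),
]

print(summarize(sentence))               # patient denies chest pain nausea
print(summarize(sentence, max_depth=1))  # patient denies pain
```

With real spaCy output, the same fields would come from `token.pos_`, `token.dep_` and `token.head.i`. Lowering `max_depth` gives a shorter, coarser summary.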

Here are the results for the summary! (By the way, I tried zooming out of my Jupyter notebook to show you the text difference, but still failed to capture the chart notes in their entirety. I’ll paste these separately as well, or you can check my output on the GitHub page mentioned at the top.)

Isn’t it great how we could distill the gist of an entire document into concise, crisp phrases? These summaries will be used for topic modeling (section 3) and document clustering (section 4).

Translated from: https://towardsdatascience.com/text-summarization-for-clustering-documents-2e074da6437a
