Abstractive Summarization for Data Augmentation

A Creative Solution to Imbalanced Class Distribution

Imbalanced class distribution is a common problem in Machine Learning. I was recently confronted with this issue when training a sentiment classification model: certain categories were far more prevalent than others, and the predictive quality of the model suffered. The first technique I used to address this was random under-sampling, wherein I randomly sampled a subset of rows from each category up to a ceiling threshold. I selected a ceiling that reasonably balanced the top three classes. Although a small improvement was observed, the model was still far from optimal.
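
For illustration, here is a minimal sketch of that kind of ceiling-based under-sampling, assuming the labels live in one-hot encoded columns (the helper name and random seed are my own):

import pandas as pd

def undersample_to_ceiling(df, features, ceiling, seed=42):
    # Hypothetical helper: for each one-hot encoded label column, keep at
    # most `ceiling` rows carrying that label, then drop rows that were
    # selected under more than one label.
    parts = []
    for feature in features:
        rows = df[df[feature] == 1]
        if len(rows) > ceiling:
            rows = rows.sample(n=ceiling, random_state=seed)
        parts.append(rows)
    combined = pd.concat(parts)
    return combined[~combined.index.duplicated(keep='first')]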


I needed a way to deal with the under-represented classes. I could not rely on traditional techniques used in multi-class classification such as sample and class weighting, as I was working with a multi-label dataset. It became evident that I would need to leverage oversampling in this situation.


A technique such as SMOTE (Synthetic Minority Over-sampling Technique) can be effective for oversampling, although the problem again becomes a bit more difficult with multi-label datasets. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) has been proposed [1], but the high dimensional nature of the numerical vectors created from text can sometimes make other forms of data augmentation more appealing.
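
For reference, this is roughly what SMOTE looks like with the imbalanced-learn library on a single-label, multi-class problem, which is exactly the setting the multi-label caveat above is about (the synthetic data here is only for illustration):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced, single-label dataset
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)
# SMOTE synthesizes new minority-class samples by interpolating between
# existing ones and their nearest neighbours
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)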


Photo by Christian Wagner on Unsplash

Transformers to the Rescue!

If you decided to read this article, it is safe to assume that you are aware of the latest advances in Natural Language Processing bequeathed by the mighty Transformers. The exceptional developers at Hugging Face in particular have opened the door to this world through their open source contributions. One of their more recent releases implements a breakthrough in Transfer Learning called the Text-to-Text Transfer Transformer or T5 model, originally presented by Raffel et al. in their paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [2].


T5 allows us to execute various NLP tasks by specifying prefixes to the input text. In my case, I was interested in Abstractive Summarization, so I made use of the summarize prefix.
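
As a quick standalone illustration of the prefix mechanism using the Hugging Face transformers library (the model size and generation settings here are arbitrary choices, separate from the class-based code shown later):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# The task is selected purely by the "summarize: " prefix
input_ids = tokenizer.encode(
    "summarize: " + "The long review text you want to condense goes here...",
    return_tensors='pt'
)
summary_ids = model.generate(input_ids, num_beams=4, min_length=10,
                             max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))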


Text-to-Text Transfer Transformer [2]

Abstractive Summarization

Abstractive Summarization, put simply, is a technique by which a chunk of text is fed to an NLP model and a novel summary of that text is returned. This should not be confused with Extractive Summarization, where sentences are embedded and a clustering algorithm is run to find those closest to the clusters' centroids; that is, existing sentences are returned. Abstractive Summarization seemed particularly appealing as a Data Augmentation technique because of its ability to generate novel yet realistic sentences of text.
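
To make the contrast concrete, here is a rough sketch of that extractive approach, using TF-IDF sentence vectors and k-means purely for illustration (the embedding and clustering choices are mine, not the article's):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def extractive_summary(sentences, n_clusters=3):
    # Embed each sentence, cluster the embeddings, and return the existing
    # sentence closest to each cluster centroid
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(vectors)
    summary = []
    for center in km.cluster_centers_:
        idx = int(np.argmin(np.linalg.norm(vectors - center, axis=1)))
        summary.append(sentences[idx])
    return summary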


Algorithm

Here are the steps I took to use Abstractive Summarization for Data Augmentation, including code segments illustrating the solution.


I first needed to determine how many rows each under-represented class required. The number of rows to add for each feature is thus calculated with a ceiling threshold, and we refer to these as the append_counts. Features with counts above the ceiling are not appended. In particular, if a given feature has 1000 rows and the ceiling is 100, its append count will be 0. The following methods trivially achieve this in the situation where features have been one-hot encoded:


def get_feature_counts(self, df):
    shape_array = {}
    for feature in self.features:
        shape_array[feature] = df[feature].sum()
    return shape_array

def get_append_counts(self, df):
    append_counts = {}
    feature_counts = self.get_feature_counts(df)
    for feature in self.features:
        if feature_counts[feature] >= self.threshold:
            count = 0
        else:
            count = self.threshold - feature_counts[feature]
        append_counts[feature] = count
    return append_counts

For each feature, a loop runs from the current append index up to the append count specified for that feature. This append_index variable, along with a tasks array, is introduced to allow for multiprocessing, which we will discuss shortly.


counts = self.get_append_counts(self.df)

# Create append dataframe with length of all rows to be appended
self.df_append = pd.DataFrame(
    index=np.arange(sum(counts.values())),
    columns=self.df.columns
)

# Creating array of tasks for multiprocessing
tasks = []

# Set all feature values to 0
for feature in self.features:
    self.df_append[feature] = 0

for feature in self.features:
    num_to_append = counts[feature]
    for num in range(
            self.append_index,
            self.append_index + num_to_append
    ):
        tasks.append(
            self.process_abstractive_summarization(feature, num)
        )
    # Updating index for insertion into shared appended dataframe
    # to preserve indexing for multiprocessing
    self.append_index += num_to_append

An Abstractive Summarization is calculated for a sample of the specified size drawn from the rows that have only the given feature, and is added to the append DataFrame with its respective feature one-hot encoded.


df_feature = self.df[
    (self.df[feature] == 1) &
    (self.df[self.features].sum(axis=1) == 1)
]
df_sample = df_feature.sample(self.num_samples, replace=True)
text_to_summarize = ' '.join(
    df_sample[:self.num_samples]['review_text']
)
new_text = self.get_abstractive_summarization(text_to_summarize)
self.df_append.at[num, 'text'] = new_text
self.df_append.at[num, feature] = 1

The Abstractive Summarization itself is generated in the following way:


t5_prepared_text = "summarize: " + text_to_summarize

if self.device.type == 'cpu':
    tokenized_text = self.tokenizer.encode(
        t5_prepared_text,
        return_tensors=self.return_tensors
    ).to(self.device)
else:
    tokenized_text = self.tokenizer.encode(
        t5_prepared_text,
        return_tensors=self.return_tensors
    )

summary_ids = self.model.generate(
    tokenized_text,
    num_beams=self.num_beams,
    no_repeat_ngram_size=self.no_repeat_ngram_size,
    min_length=self.min_length,
    max_length=self.max_length,
    early_stopping=self.early_stopping
)

output = self.tokenizer.decode(
    summary_ids[0],
    skip_special_tokens=self.skip_special_tokens
)

In initial tests the summarization calls to the T5 model were extremely time-consuming, reaching up to 25 seconds even on a GCP instance with an NVIDIA Tesla P100. Clearly this needed to be addressed to make this a feasible solution for data augmentation.


Photo by Brad Neathery on Unsplash

Multiprocessing

I introduced a multiprocessing option, whereby the calls for Abstractive Summarization are stored in a tasks array that is later passed to a sub-routine running the calls in parallel with the multiprocessing library. This resulted in a dramatic decrease in runtime. I must thank David Foster for his succinct Stack Overflow contribution [3]!


running_tasks = [Process(target=task) for task in tasks]
for running_task in running_tasks:
    running_task.start()
for running_task in running_tasks:
    running_task.join()

Simplified Solution

To make things easier for everybody, I packaged this into a library called absum. It can be installed through pip: pip install absum. One can also download it directly from the repository.


Running the code on your own dataset is then simply a matter of importing the library’s Augmentor class and running its abs_sum_augment method as follows:


import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)
augmentor = Augmentor(df)
df_augmented = augmentor.abs_sum_augment()
df_augmented.to_csv(
    csv.replace('.csv', '-augmented.csv'),
    encoding='utf-8',
    index=False
)

absum uses the Hugging Face T5 model by default, but is designed in a modular way to allow you to use any pre-trained or out-of-the-box Transformer models capable of Abstractive Summarization. It is format agnostic, expecting only a DataFrame containing text and one-hot encoded features. If additional columns are present that you do not wish to be considered, you have the option to pass in specific one-hot encoded features as a comma-separated string to the features parameter.
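
For example, continuing the snippet above, one might restrict augmentation to specific label columns (a hypothetical call; the column names are made up, while the comma-separated string format follows the description here):

# feature_a and feature_b are placeholder one-hot column names
augmentor = Augmentor(df, features="feature_a,feature_b")
df_augmented = augmentor.abs_sum_augment()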


Also of special note are the min_length and max_length parameters, which determine the size of the resulting summarizations. One trick I found useful is to find the average character count of the text data you’re working with and start with something a bit lower for the minimum length while slightly padding it for the maximum. All available parameters are detailed in the documentation.
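
Continuing the same snippet, a rough sketch of that heuristic might look like the following (the text column name comes from the earlier code, the scaling factors are arbitrary, and the exact units and defaults for min_length and max_length are detailed in the documentation):

avg_len = int(df['review_text'].str.len().mean())  # average character count
augmentor = Augmentor(
    df,
    min_length=int(avg_len * 0.8),  # start a bit lower than the average
    max_length=int(avg_len * 1.2)   # pad it slightly for the maximum
)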


Feel free to add any suggestions for improvement in the comments or, better yet, in a PR. Happy coding!


Translated from: https://towardsdatascience.com/abstractive-summarization-for-data-augmentation-1423d8ec079e
