Snorkel AI: A Programmatic Approach to Labeling Training Data

OpenAI’s recent release of the GPT-3 API set the Twitterverse on fire, with developers using its massive language model to generate everything from text to code:


GPT-3 Generating React Code


While the hype and attention around GPT-3 were justified, another exciting development in AI went largely unnoticed. On July 14th, Snorkel AI, which spun out of the Stanford AI Lab in 2019, emerged from stealth with $15 million in funding from Greylock, GV, and In-Q-Tel. The launch showcased Snorkel Flow, a machine learning platform that programmatically labels and prepares training data to speed up building and deploying ML models. Snorkel’s early users include Google, Intel, Apple, and Stanford Medicine. Although the project is still in its infancy, Snorkel’s approach may be an important breakthrough for enterprise AI and for the broad adoption of AI/ML across verticals.


Hand Labeling Training Data Bottleneck

Despite massive improvements to machine learning frameworks (e.g. TensorFlow, PyTorch, Keras), hardware (e.g. GPUs, TPUs), and research across the board (e.g. AlphaGo, GPT-3), preparing training data remains largely a manual process. From drawing bounding boxes on images to annotating audio files, data scientists have to either label large volumes of data by hand or crowdsource the task to an army of contract workers. The problem is exacerbated by the massive amounts of data needed to train deep learning models. This bottleneck is becoming more apparent now that deep learning is more accessible than ever, thanks to open-source tools and cloud-hosted workloads.


Image source: https://github.com/opencv/cvat/raw/0.2.0/cvat/apps/documentation/static/documentation/images/cvat.jpg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=74718749

Compare the manual labeling process to the significant improvements in other stages of machine learning. We now have AutoML and other programmatic methods to expedite feature engineering and hyperparameter tuning. Infrastructure provisioning tools and cloud architectures make deploying those models easier than ever. Hand labeling, on the other hand, does not scale, at least not cost-efficiently, especially when expertise and privacy are involved (e.g. medical images, financial statements). It is also extremely error-prone.


The team at Snorkel argues that the process of manually labeling training data is fundamentally broken. The success of today’s ML milestones belies a hidden cost. For example, it took Dr. Fei-Fei Li and her team at Stanford AI Lab two years to create ImageNet, the foundational dataset that led to incredible research at Google, Clarifai, DeepMind, Baidu, and Huawei. For data to truly become the new oil of the digital economy, dataset labeling has to become an iterative development process.


“Despite spending billions of dollars on AI, few organizations have been able to use it as widely and effectively as they want to. This is because available solutions either ignore the most important part of AI today — the labeled training data that fuels modern approaches — or rely on armies of human labelers to produce it.” — Alex Ratner, CEO of Snorkel AI


Programmatic Labeling of Data

Snorkel AI’s founding team, Alex Ratner (Assistant Professor at the University of Washington) and Chris Ré (Associate Professor at Stanford and 2015 MacArthur Genius Fellow), sought to use rule-based systems to programmatically label data and build quality training sets. The idea is to let the programmer or domain expert define labeling functions that generate labeled data. The example used on the Snorkel blog involves legal documents: a legal analyst writes a function that tags a document as “Employment contracts” if “employment” appears in its title.
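To make the legal-document example concrete, here is a minimal sketch of a labeling function in plain Python. It does not use Snorkel's actual API; the document shape (a dict with a `title` key), the label constants, and the helper `apply_labeling_function` are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical label values for this sketch (not taken from Snorkel).
EMPLOYMENT = "Employment contracts"
ABSTAIN = None  # the function has no opinion about this document


def lf_title_mentions_employment(doc: dict) -> Optional[str]:
    """Tag a document as an employment contract if its title says so."""
    title = doc.get("title", "")
    return EMPLOYMENT if "employment" in title.lower() else ABSTAIN


def apply_labeling_function(
    lf: Callable[[dict], Optional[str]], docs: List[dict]
) -> List[Optional[str]]:
    """Run one labeling function over every document."""
    return [lf(doc) for doc in docs]


# Two made-up documents: one matches the rule, one does not.
docs = [
    {"title": "Employment Agreement: Jane Doe"},
    {"title": "Office Lease Contract"},
]
labels = apply_labeling_function(lf_title_mentions_employment, docs)
```

The function abstains rather than guessing when the rule does not fire, so a single rule labels only the documents it is written for and leaves the rest to other rules.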


While rule-based systems have the advantage of being simple and direct, they are also brittle: they lack the robustness to deal with inputs outside the defined rules (e.g. what happens if the title says “contract” instead of “employment”?). So Snorkel Flow takes the labels produced by the labeling functions, treats them as noisy, and uses weak supervision models to “denoise” them and generalize beyond the rules. In essence, it combines human-defined rules and machine learning algorithms to create a more robust labeled dataset.
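One simple way to see how several noisy rules can back each other up is a majority vote over their outputs. The sketch below is a deliberately simplified stand-in, not the weak supervision model Snorkel uses; the three labeling functions, the integer label values, and the sample documents are all hypothetical.

```python
from collections import Counter
from typing import List

# Hypothetical label values for this sketch.
EMPLOYMENT, OTHER, ABSTAIN = 1, 0, -1


def lf_title_employment(doc: dict) -> int:
    return EMPLOYMENT if "employment" in doc["title"].lower() else ABSTAIN


def lf_body_employer_employee(doc: dict) -> int:
    body = doc["body"].lower()
    return EMPLOYMENT if "employer" in body and "employee" in body else ABSTAIN


def lf_title_lease(doc: dict) -> int:
    return OTHER if "lease" in doc["title"].lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_title_employment, lf_body_employer_employee, lf_title_lease]


def majority_vote(votes: List[int]) -> int:
    """Combine votes, ignoring abstentions; abstain on ties or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN).most_common()
    if not counts:
        return ABSTAIN
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return counts[0][0]


docs = [
    {"title": "Employment Agreement", "body": "The employer agrees to pay the employee."},
    {"title": "Services Contract", "body": "The employer shall provide the employee with equipment."},
    {"title": "Office Lease", "body": "The tenant shall pay rent monthly."},
    {"title": "Employment Office Lease", "body": "The tenant shall pay rent."},
    {"title": "Meeting Notes", "body": ""},
]
combined = [majority_vote([lf(doc) for lf in LABELING_FUNCTIONS]) for doc in docs]
```

The second document shows the brittleness case from the text: the title rule misses it because it says "contract", but the body rule still catches it. A majority vote weights every rule equally, which is the main thing a learned weak supervision model would improve on.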


Snorkel Flow

While the original, open-source Snorkel project focused on programmatic approaches to training data, Snorkel Flow goes further and tackles building an end-to-end ML solution. This includes data augmentation (e.g. adding rotated or blurred images), data slicing (i.e. taking subsets of data), and the operational components needed for a production-grade ML application (e.g. monitoring, alerting, infrastructure as code).
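As a rough illustration of the first two ideas, the sketch below augments a tiny dataset with rotated copies and then takes a slice of it. It is plain Python over nested lists, not Snorkel Flow's interface; the dataset, the function names, and the slicing rule are assumptions chosen only for the example.

```python
from typing import Callable, List

Image = List[List[int]]  # a tiny grayscale image as rows of pixel values


def rotate_90_clockwise(image: Image) -> Image:
    """Return a copy of the image rotated 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_with_rotations(dataset: List[dict]) -> List[dict]:
    """Data augmentation: keep each example and add a rotated copy with the same label."""
    augmented = []
    for example in dataset:
        augmented.append(example)
        augmented.append(
            {"image": rotate_90_clockwise(example["image"]), "label": example["label"]}
        )
    return augmented


def take_slice(dataset: List[dict], predicate: Callable[[dict], bool]) -> List[dict]:
    """Data slicing: select the subset of examples that satisfy a predicate."""
    return [example for example in dataset if predicate(example)]


# Made-up dataset with two labeled images.
dataset = [
    {"image": [[1, 2], [3, 4]], "label": "cat"},
    {"image": [[5, 6, 7]], "label": "dog"},
]
augmented = augment_with_rotations(dataset)

# Hypothetical slice: images that consist of a single row.
single_row = take_slice(augmented, lambda ex: len(ex["image"]) == 1)
```

Slices like this are useful for checking how a model behaves on a specific subset of the data rather than only on the dataset as a whole.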


Access to Snorkel Flow is currently limited (you need to request a demo), but the fundamental idea behind Snorkel looks like the missing link for democratizing machine learning beyond the tech behemoths. Keep an eye on this startup to see how it reshapes the ML industry in the years to come.


Translated from: https://towardsdatascience.com/snorkel-ai-programmatic-approach-to-labeling-training-data-11973cf14f70
