Pseudo-English: Typing Practice with Machine Learning

Articles in this series:

  1. Introduction
  2. Pseudo-English (You are here)
  3. Keyboard Input (Coming soon)
  4. Web Workers (Coming soon)

The finished project is located here: https://www.bayanbennett.com/projects/rnn-typing-practice

Objective

Generate English-looking words using a recurrent neural network.

Trivial Methods

Before settling on ML, I first had to convince myself that the trivial methods would not provide adequate results.

Random Letters

const getRandom = (distribution) => {
  const randomIndex = Math.floor(Math.random() * distribution.length);
  return distribution[randomIndex];
};

const alphabet = "abcdefghijklmnopqrstuvwxyz";
const randomLetter = getRandom(alphabet);

Unsurprisingly, the output bears no resemblance to English words, and the generated character sequences were painful to type. Here are a few examples of five-letter words:

snyam   iqunm   nbspl   onrmx   wjavb   nmlgj
arkpt   ppqjn   zgwce   nhnxl   rwpud   uqhuq
yjwpt   vlxaw   uxibk   rfkqa   hepxb   uvxaw

Weighted Random Letters

What if we generated sequences that had the same distribution of letters as English? I obtained the letter frequencies from Wikipedia and created a JSON file that mapped each letter of the alphabet to its relative frequency.

// letter-frequencies.json
{
  "a": 0.08497,
  "b": 0.01492,
  "c": 0.02202,
  "d": 0.04253,
  "e": 0.11162,
  "f": 0.02228,
  "g": 0.02015,
  "h": 0.06094,
  "i": 0.07546,
  "j": 0.00153,
  "k": 0.01292,
  "l": 0.04025,
  "m": 0.02406,
  "n": 0.06749,
  "o": 0.07507,
  "p": 0.01929,
  "q": 0.00095,
  "r": 0.07587,
  "s": 0.06327,
  "t": 0.09356,
  "u": 0.02758,
  "v": 0.00978,
  "w": 0.02560,
  "x": 0.00150,
  "y": 0.01994,
  "z": 0.00077
}

The idea here is to create a large sequence of letters whose distribution closely matches the frequencies above. Math.random has a uniform distribution, so when we select random letters from that sequence, the probability of picking a given letter matches its frequency.

const TARGET_DISTRIBUTION_LENGTH = 1e4; // 10,000
const letterFrequencyMap = require("./letter-frequencies.json");
const letterFrequencyEntries = Object.entries(letterFrequencyMap);

const reduceLetterDistribution = (result, [letter, frequency]) => {
  const num = Math.round(TARGET_DISTRIBUTION_LENGTH * frequency);
  const letters = letter.repeat(num);
  return result.concat(letters);
};

const letterDistribution = letterFrequencyEntries
  .reduce(reduceLetterDistribution, "");

const randomLetter = getRandom(letterDistribution);

The increase in the number of vowels was noticeable, but the generated sequences still failed to resemble English words. Here are a few examples of five-letter words:

aoitv   aertc   cereb   dettt   rtrsl   ararm
oftoi   rurtd   ehwra   rnfdr   rdden   kidda
nieri   eeond   cntoe   rirtp   srnye   enshk

Markov Chains

This would have been the next logical step: creating transition probabilities for pairs of letters in sequence. This was the point where I decided to go straight to RNNs. If anyone would like to implement this approach, I'd be interested in seeing the results.

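For the curious, here is a minimal sketch of what that approach might look like: a first-order chain that counts character-to-character transitions over a word list and then samples one character at a time. This is hypothetical code of my own, not something from the original project; wordList stands in for any array of lowercase words.

// Hypothetical first-order Markov chain (not from the original project).
// "^" marks the start of a word and "$" the end.
const buildTransitions = (wordList) => {
  const counts = {};
  for (const word of wordList) {
    const chars = `^${word}$`;
    for (let i = 0; i < chars.length - 1; i++) {
      const from = chars[i];
      const to = chars[i + 1];
      counts[from] = counts[from] ?? {};
      counts[from][to] = (counts[from][to] ?? 0) + 1;
    }
  }
  return counts;
};

// Sample the next character in proportion to its observed frequency
const sampleNext = (transitions, current) => {
  const entries = Object.entries(transitions[current]);
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  let r = Math.random() * total;
  for (const [char, n] of entries) {
    r -= n;
    if (r <= 0) return char;
  }
  return entries[entries.length - 1][0]; // floating-point fallback
};

const generateMarkovWord = (transitions, maxLength = 20) => {
  let word = "";
  let current = "^";
  while (word.length < maxLength) {
    const next = sampleNext(transitions, current);
    if (next === "$") break;
    word += next;
    current = next;
  }
  return word;
};

// Usage: const t = buildTransitions(["cat", "cart", "care"]);
//        generateMarkovWord(t); // e.g. "care", "cat", "carat"...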

Recurrent Neural Networks

Neural networks are usually memoryless: the system has no information from previous steps. RNNs are a type of neural network where the previous state of the network is an input to the current step.

  • Input: a character
  • Output: a tensor with the probabilities for the next character

Neural networks are inherently bad at processing inputs of varying length, though there are ways around this (like positional encoding in transformers). With RNNs, the inputs are consistent in size: a single character. Natural language processing has a natural affinity for RNNs, as languages are unidirectional (LTR or RTL) and the order of the characters is important. In other words, although the words united and untied only have two characters swapped, they have opposite meanings.

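To make that output concrete: at each step the network emits one probability per character in its vocabulary, and the next character is chosen by sampling from that distribution. A toy illustration in plain JavaScript (the probabilities below are made up):

// Made-up distribution over a tiny 3-character vocabulary; the real model
// emits one probability per character in its 28-character vocabulary.
const probs = [0.7, 0.2, 0.1];

// Pick an index with probability proportional to its weight
// (assumes the probabilities sum to 1)
const sampleIndex = (probabilities) => {
  let r = Math.random();
  for (let i = 0; i < probabilities.length; i++) {
    r -= probabilities[i];
    if (r <= 0) return i;
  }
  return probabilities.length - 1; // guard against floating-point drift
};

sampleIndex(probs); // 0 about 70% of the time, 1 about 20%, 2 about 10%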

The model below is based on the TensorFlow "Text generation with an RNN" tutorial.

Input Layer with Embedding

This was the first time I had encountered an embedding layer. It was a fascinating concept, and I was excited to start using it.

I wrote a short post summarizing embeddings here: https://bayanbennett.com/posts/embeddings-in-machine-learning

const generateEmbeddingLayer = (batchSize, outputDim) =>
  tf.layers.embedding({
    inputDim: vocabSize,
    outputDim,
    // treat integer 0 (our null-character padding) as a masked padding value
    maskZero: true,
    // sequence length is left as null so it can vary
    batchInputShape: [batchSize, null],
  });

Gated Recurrent Unit (GRU)

I don’t have enough knowledge to justify why a GRU was chosen, so I deferred to the implementation in the aforementioned TensorFlow tutorial.

const generateRnnLayer = (units) =>
  tf.layers.gru({
    units,
    // emit an output at every timestep, not just the final one
    returnSequences: true,
    recurrentInitializer: "glorotUniform",
    // softmax makes each timestep's output a probability distribution
    activation: "softmax",
  });

Putting it all together

Since we are sequentially feeding the output of one layer into the input of another layer, tf.Sequential is the class of model that we should use.

const generateModel = (embeddingDim, rnnUnits, batchSize) => {
  const layers = [
    generateEmbeddingLayer(batchSize, embeddingDim),
    generateRnnLayer(rnnUnits),
  ];
  return tf.sequential({ layers });
};
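As a quick sanity check, the model can be instantiated and inspected with model.summary(). The dimensions below are arbitrary, for illustration only (vocabSize must already be in scope, as in the helper module shown later); in the actual training code further down, both embeddingDim and rnnUnits are set to vocabSize.

// Illustrative only: embeddingDim = 64, rnnUnits = 128, batchSize = 32
const model = generateModel(64, 128, 32);
// embedding: [32, null] character ints -> [32, null, 64] vectors
// gru:       [32, null, 64] -> [32, null, 128] outputs per timestep
model.summary();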

Training Data

I used Princeton’s WordNet 3.1 data set as a source for words.

“WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets)…” — Princeton University “About WordNet.” WordNet. Princeton University. 2010.

Since I was only interested in the words, I parsed each file and extracted only the words. Words with spaces were split into separate words. Words that matched the following criteria were also removed (a sketch of this filtering follows the list):

  • Words with diacritics
  • Single character words
  • Words with numbers
  • Roman numerals
  • Duplicate words
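
The article doesn't include its parsing code, so the sketch below is a hypothetical reconstruction of the filtering step; the regular expressions are naive heuristics of my own, not the author's (the Roman-numeral check in particular is crude and would also drop a few real words such as mix):

// Hypothetical filtering sketch; heuristics are illustrative, not the author's
const cleanWords = (rawWords) => {
  const seen = new Set(); // removes duplicate words
  const result = [];
  for (const raw of rawWords) {
    for (const word of raw.split(" ")) { // words with spaces become separate words
      if (word.length < 2) continue;     // single-character words
      if (/[^a-z]/.test(word)) continue; // numbers, diacritics, punctuation
      if (/^[ivxlcdm]+$/.test(word) && word.length <= 3) continue; // crude Roman-numeral guard
      if (seen.has(word)) continue;      // duplicates
      seen.add(word);
      result.push(word);
    }
  }
  return result;
};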

Dataset Generator

Both tf.LayersModel and tf.Sequential have the .fitDataset method, which is a convenient way of, well, fitting a dataset. We need to create a tf.data.Dataset, but first here are some helper functions:

// utils.js
const characters = Array.from("\0 abcdefghijklmnopqrstuvwxyz");
const mapCharToInt = Object.fromEntries(
  characters.map((char, index) => [char, index])
);
const vocabSize = characters.length;
const int2Char = (int) => characters[int];
const char2Int = (char) => mapCharToInt[char];

// dataset.js
const wordsJson = require("./wordnet-3.1/word-set.json");
const wordsArray = Array.from(wordsJson);

// add 1 to max length to accommodate a single space that follows each word
const maxLength = wordsArray.reduce((max, s) => Math.max(max, s.length), 0) + 1;

const data = wordsArray.map((word) => {
  const paddedWordInt = word
    .concat(" ")
    .padEnd(maxLength, "\0")
    .split("")
    .map(char2Int);
  return { input: paddedWordInt, expected: paddedWordInt.slice(1).concat(0) };
});

function* dataGenerator() {
  for (let { input, expected } of data) {
    /* If I try to make the tensors inside `wordsArray.map`,
     * I get an error on the second epoch of training */
    yield { xs: tf.tensor1d(input), ys: tf.tensor1d(expected) };
  }
}

module.exports.dataset = tf.data.generator(dataGenerator);

Note that we need all the inputs to be the same length, so we pad every word with null characters, which are converted to the integer 0 by the char2Int function.

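To make the input/target alignment concrete, here is what one training pair looks like for the word cat, using a maxLength of 5 for brevity:

// "cat" -> "cat \0" after appending a space and padding to maxLength 5.
// With characters = "\0 abcdefghijklmnopqrstuvwxyz" ("\0" = 0, " " = 1, "a" = 2, ...):
const input    = [4, 2, 21, 1, 0]; // c, a, t, space, \0
const expected = [2, 21, 1, 0, 0]; // input shifted left by one, with 0 appended
// At position i the model sees input[i] and is trained to predict expected[i],
// the character that follows it.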

Generating and compiling the model

Here it is, the moment we’ve been building towards:

const BATCH_SIZE = 500;
// shuffle with a buffer of 10 batches; drop the final, smaller batch so every
// batch matches the fixed batchInputShape
const batchedData = dataset.shuffle(10 * BATCH_SIZE).batch(BATCH_SIZE, false);
const model = generateModel(vocabSize, vocabSize, BATCH_SIZE);
const optimizer = tf.train.rmsprop(1e-2);

model.compile({
  optimizer,
  loss: "sparseCategoricalCrossentropy",
  metrics: tf.metrics.sparseCategoricalAccuracy,
});

model.fitDataset(batchedData, { epochs: 100 });

A batch size of 500 was selected, as that was around the largest I could fit without running out of memory.

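The article shows the training loop but not the sampling step that produced the examples below. Here is a minimal sketch of one way it could work, assuming the helpers above (generateModel, char2Int, int2Char, vocabSize) and a finished training run; the seed character and word-length cap are arbitrary choices of mine:

// Rebuild the model with a batch size of 1 for inference (the trained model's
// batchInputShape is fixed at BATCH_SIZE) and copy in the trained weights.
const inferenceModel = generateModel(vocabSize, vocabSize, 1);
inferenceModel.setWeights(model.getWeights());

const generateWord = (maxChars = 12) => {
  const seq = [char2Int("a")]; // arbitrary seed character
  let word = "";
  for (let i = 0; i < maxChars; i++) {
    // feed the whole sequence so far; keep the probabilities of the last step
    const probs = inferenceModel
      .predict(tf.tensor2d([seq], [1, seq.length], "int32")) // [1, len, vocabSize]
      .squeeze([0])                                // [len, vocabSize]
      .slice([seq.length - 1, 0], [1, -1])         // last timestep: [1, vocabSize]
      .squeeze();                                  // [vocabSize]
    // normalized = true because the softmax output is already a distribution
    const next = tf.multinomial(probs, 1, undefined, true).dataSync()[0];
    if (next === char2Int(" ") || next === 0) break; // space or \0 ends the word
    word += int2Char(next);
    seq.push(next);
  }
  return word;
};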

Examples

ineco uno kam whya qunaben qunobinxexaela sadinon zaninab mecoomasphanonyus lyatra fema inimo unenones

It’s not perfect, but it produces words that vaguely appear to come from another Romance or Germanic language. The model.json and weights.bin files are only 44 kB. This is important since simpler models generally run inference faster and are light enough for the end user to download without affecting perceived page performance.

The next step is where the fun begins: building a typing practice web app!

Originally from: https://levelup.gitconnected.com/pseudo-english-typing-practice-with-machine-learning-5700eb4dc54
