Ref: Basics of Using Pre-trained GloVe Vectors in Python.

1. Download

Get the download URL from the official GloVe website.

# Download the GloVe file
import urllib.request  # "import urllib" alone does not guarantee that urllib.request is loaded

urllib.request.urlretrieve('https://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip', "glove.840B.300d.zip")
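
The archive is large, so it can help to skip the download when the file is already on disk. A minimal sketch of that idea follows; the helper name download_if_missing and the ".part" suffix are made up for illustration and are not part of the referenced article.

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Download url to dest_path unless a non-empty file is already there.

    Returns True if a download happened, False if the existing file was reused.
    """
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    tmp_path = dest_path + ".part"  # hypothetical suffix for the partial file
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, dest_path)  # rename only after the download finished
    return True
```

Calling download_if_missing with the URL and file name used above should then fetch the zip only on the first run.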

2. Unzip the GloVe file

For zipping and unzipping in general, see: Compressing and Extracting zip Files in Python

# Unzip the GloVe file
import zipfile


def unzip_file(zip_src, dst_dir):
    # Extract every member of zip_src into dst_dir
    if zipfile.is_zipfile(zip_src):
        with zipfile.ZipFile(zip_src, 'r') as fz:
            for file in fz.namelist():
                fz.extract(file, dst_dir)
    else:
        print('This is not zip')


unzip_file('./glove.840B.300d.zip', './glove.840B.300d')
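
To check what actually came out of the archive, the extraction can also return the extracted file names and sizes. The sketch below uses zipfile's extractall in place of the per-member loop; the function name unzip_and_list is an assumption for illustration.

```python
import os
import zipfile


def unzip_and_list(zip_src, dst_dir):
    """Extract zip_src into dst_dir and return {member name: size in bytes}."""
    if not zipfile.is_zipfile(zip_src):
        raise ValueError(f"{zip_src} is not a zip file")
    with zipfile.ZipFile(zip_src, "r") as fz:
        fz.extractall(dst_dir)  # same effect as extracting each member in a loop
        names = [info.filename for info in fz.infolist() if not info.is_dir()]
    return {name: os.path.getsize(os.path.join(dst_dir, name)) for name in names}
```

Raising an exception instead of printing makes a corrupt or partial download harder to miss.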

After unzipping:

glove.twitter.27B.zip unpacks into 4 files with different embedding dimensions: 25, 50, 100, and 200.

Note: gensim.downloader also offers many datasets and word-vector resources:

import gensim.downloader as api
api.info()

3. Extract the vocabulary and word vectors, and save them

This step reads the file line by line, so it takes a while.

# Extract the word-embedding matrix
import pickle

import numpy as np

dir_glove = './glove.840B.300d/glove.840B.300d.txt'
dim = 300
words = ['PAD']
vectors = [np.zeros(dim, dtype=np.float32)]  # row 0 is the all-zero PAD vector
with open(dir_glove, 'r', encoding="utf-8") as f:
    for line in f:
        values = line.rstrip().split(' ')
        # A few tokens in glove.840B.300d contain spaces, so take the vector from the end of the line
        words.append(' '.join(values[:-dim]))
        vectors.append(np.asarray(values[-dim:], dtype=np.float32))
# Stack once at the end; np.concatenate inside the loop would copy the whole matrix on every line
embeds = np.stack(vectors)

# zip() returns an iterator in Python 3, so build real dicts
w2idx_dict = dict(zip(words, range(len(words))))
idx2w_dict = dict(zip(range(len(words)), words))
assert embeds.shape[0] == len(words)

glove_data = {'w2idx_dict': w2idx_dict, 'idx2w_dict': idx2w_dict, 'embed_matrix': embeds}
# pickle.dump needs an open binary file, not a file name
with open('glove_data_840B_300d.pkl', 'wb') as f:
    pickle.dump(glove_data, f)
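
Once the pickle exists, later runs can load it instead of parsing the text file again. The sketch below assumes w2idx_dict was saved as a plain dict and embed_matrix as a NumPy array. The helper names are made up for illustration, and sending unknown words to row 0 (the all-zero 'PAD' row) is an assumption, because the saved vocabulary has no dedicated unknown-word entry.

```python
import pickle

import numpy as np


def load_glove_data(pkl_path):
    """Load the saved dict with keys w2idx_dict, idx2w_dict, embed_matrix."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


def embed_tokens(tokens, glove_data, fallback_idx=0):
    """Return a (len(tokens), dim) array; unknown tokens use row fallback_idx.

    fallback_idx=0 (the all-zero 'PAD' row) is an assumption of this sketch.
    """
    w2idx = glove_data["w2idx_dict"]
    ids = np.array([w2idx.get(tok, fallback_idx) for tok in tokens], dtype=np.int64)
    return glove_data["embed_matrix"][ids]


def cosine(u, v):
    """Cosine similarity of two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0
```

For example, embed_tokens(['king', 'queen'], load_glove_data('glove_data_840B_300d.pkl')) would return a 2 x 300 array whose two rows can be compared with cosine.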

4. Or extract the vocabulary directly from the GloVe file:

Ref: GitHub Pytorch-RNN-text-classification

from collections import Counter

import pandas as pd
import torch
import torchwordemb

import util as ut  # helper module from the referenced repository


class VocabBuilder(object):
    '''
    Read file and create word_to_index dictionary.
    This can truncate low-frequency words with min_sample option.
    '''
    def __init__(self, path_file=None):
        # word count
        self.word_count = VocabBuilder.count_from_file(path_file)
        self.word_to_index = {}

    @staticmethod
    def count_from_file(path_file, tokenizer=ut._tokenize):
        """
        count word frequencies in a file.
        Args:
            path_file:
        Returns:
            dict: {word_n :count_n, ...}
        """
        df = pd.read_csv(path_file, delimiter='\t')
        # tokenize
        df['body'] = df['body'].apply(tokenizer)
        # count
        word_count = Counter([tkn for sample in df['body'].values.tolist() for tkn in sample])
        print('Original Vocab size:{}'.format(len(word_count)))
        return word_count

    def get_word_index(self, min_sample=1, padding_marker='__PADDING__', unknown_marker='__UNK__',):
        """
        create word-to-index mapping. Padding and unknown take the first 2 indices (0 and 1).
        Args:
            min_sample: for Truncation
            padding_marker: padding mark
            unknown_marker: unknown-word mark
        Returns:
            dict: {word_n: index_n, ... }
        """
        # truncate low fq word (zip(...)[0] only worked in Python 2)
        tokens = [tkn for tkn, cnt in self.word_count.items() if cnt >= min_sample]
        # insert padding and unknown
        self.word_to_index = {tkn: i for i, tkn in enumerate([padding_marker, unknown_marker] + sorted(tokens))}
        print('Truncated vocab size:{} (removed:{})'.format(len(self.word_to_index),
                                                            len(self.word_count) - len(tokens)))
        return self.word_to_index, None


class GloveVocabBuilder(object):

    def __init__(self, path_glove):
        self.vec = None
        self.vocab = None
        self.path_glove = path_glove

    def get_word_index(self, padding_marker='__PADDING__', unknown_marker='__UNK__',):
        _vocab, _vec = torchwordemb.load_glove_text(self.path_glove)
        # shift every GloVe index by 2 to make room for padding (0) and unknown (1)
        vocab = {padding_marker: 0, unknown_marker: 1}
        for tkn, indx in _vocab.items():
            vocab[tkn] = indx + 2
        # padding row stays all zeros; unknown row is drawn from a standard normal
        vec_2 = torch.zeros((2, _vec.size(1)))
        vec_2[1].normal_()
        self.vec = torch.cat((vec_2, _vec))
        self.vocab = vocab
        return self.vocab, self.vec


# create vocab
def CreatVocab_from_Glove(glove_path):
    print("===> creating vocabs ...")
    v_builder = GloveVocabBuilder(path_glove=glove_path)
    d_word_index, embed = v_builder.get_word_index()
    print(embed.shape)  # torch.Size([1193517, 25])
    return d_word_index, embed


glove_path = './glove.twitter.27B/glove.twitter.27B.25d.txt'
d_word_index, embed = CreatVocab_from_Glove(glove_path)
# ===> creating vocabs ...
# torch.Size([1193517, 25])
# d_word_index is a dict with about 1.19 million words
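
With padding at index 0 and unknown at index 1, a tokenized sentence can be turned into a fixed-length row of indices. The sketch below is pure Python; the function name encode_and_pad and the toy vocabulary are made up for illustration and do not come from the referenced repository.

```python
def encode_and_pad(tokens, word_to_index, max_len,
                   padding_marker="__PADDING__", unknown_marker="__UNK__"):
    """Map tokens to indices, truncate to max_len, and right-pad with the padding index."""
    pad_idx = word_to_index[padding_marker]
    unk_idx = word_to_index[unknown_marker]
    ids = [word_to_index.get(tok, unk_idx) for tok in tokens[:max_len]]
    return ids + [pad_idx] * (max_len - len(ids))


# Toy vocabulary laid out like the builders above: padding=0, unknown=1, real words from 2.
toy_vocab = {"__PADDING__": 0, "__UNK__": 1, "good": 2, "movie": 3}
print(encode_and_pad(["good", "movie", "tonight"], toy_vocab, max_len=5))
# prints [2, 3, 1, 0, 0]
```

Rows like this, built with d_word_index, could then be fed to an embedding layer initialized from embed, for instance via torch.nn.Embedding.from_pretrained(embed, padding_idx=0).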
