Classifying the 20 Newsgroups Dataset with a Keras LSTM

1. Introduction to the 20 Newsgroups dataset

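The 20 Newsgroups dataset is one of the standard international benchmarks for research on text classification, text mining, and information retrieval. It contains roughly 20,000 newsgroup documents, split almost evenly across 20 newsgroups, each covering a different topic. Some newsgroups are very closely related (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are entirely unrelated (e.g. misc.forsale / soc.religion.christian). The 20 newsgroups are: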

comp.graphics

comp.os.ms-windows.misc

comp.sys.ibm.pc.hardware

comp.sys.mac.hardware

comp.windows.x

rec.autos

rec.motorcycles

rec.sport.baseball

rec.sport.hockey

sci.crypt

sci.electronics

sci.med

sci.space

misc.forsale

talk.politics.misc

talk.politics.guns

talk.politics.mideast

talk.religion.misc

alt.atheism

soc.religion.christian

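The dataset is distributed in three versions. The first, 19997, is the original, unmodified version. The second, bydate, is sorted by date and split into a training set (60%) and a test set (40%); it contains no duplicate documents and omits the newsgroup-identifying headers (Newsgroups, Path, Followup-To, Date). The third, 18828, contains no duplicate documents and keeps only the From and Subject headers.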

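20news-19997.tar.gz - the original 20 Newsgroups dataset
20news-bydate.tar.gz - sorted by date; duplicates and newsgroup-identifying headers removed (18,846 documents)
20news-18828.tar.gz - duplicates removed; only the From and Subject headers kept (18,828 documents)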

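scikit-learn offers two ways to load the dataset. The first, sklearn.datasets.fetch_20newsgroups, returns the raw texts, which can then be passed to a text feature extractor such as sklearn.feature_extraction.text.CountVectorizer with custom parameters. The second, sklearn.datasets.fetch_20newsgroups_vectorized, returns features that have already been extracted, so no feature extractor is needed.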
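As a rough illustration of the first loading route, the sketch below wraps fetch_20newsgroups in a small function and counts documents per newsgroup. The helper names load_raw_20newsgroups and documents_per_group are made up for this example. The first call to fetch_20newsgroups downloads the data, so it needs scikit-learn and network access; for that reason the loader is defined here but not called.

```python
from collections import Counter


def load_raw_20newsgroups(subset="train"):
    """Fetch raw posts via scikit-learn (downloads the data on first call)."""
    # Imported lazily so the counting helper below works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(subset=subset)
    return bunch.data, bunch.target, bunch.target_names


def documents_per_group(targets, target_names):
    """Count how many documents fall into each newsgroup."""
    counts = Counter(target_names[t] for t in targets)
    return dict(counts)


# Tiny hand-made example instead of the real download
example_counts = documents_per_group([0, 1, 1], ["alt.atheism", "comp.graphics"])
print(example_counts)
```

With the real data, one would call load_raw_20newsgroups() and pass its second and third return values to documents_per_group.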

2. Load the pretrained word vectors (GloVe, 100 dimensions)

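# Imports used throughout this post (Keras 2.x import paths)
import os
import sys

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

BASE_DIR = './data'
GLOVE_DIR = BASE_DIR + '/glove.6B/'
TEXT_DATA_DIR = BASE_DIR + '/20_newsgroup/'
MAX_SEQUENCE_LENGTH = 1000  # every document is padded or truncated to this many tokens
MAX_NB_WORDS = 20000        # vocabulary size kept by the tokenizer
EMBEDDING_DIM = 100         # must match the GloVe file loaded below
VALIDATION_SPLIT = 0.2      # fraction of the data held out for validation
batch_size = 32

print('Indexing word vectors.')
# Map each word to its 100-dimensional GloVe vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))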
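To make the file format concrete: each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up 3-dimensional entries from an in-memory buffer using only the standard library; parse_glove is a hypothetical helper name, and plain Python lists stand in for NumPy arrays.

```python
import io


def parse_glove(lines):
    """Build {word: vector} from lines of the form 'word v1 v2 ... vN'."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = [float(v) for v in values[1:]]
    return index


# Two made-up 3-dimensional entries standing in for glove.6B.100d.txt
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.0\n")
vectors = parse_glove(sample)
print(len(vectors), vectors["cat"])
```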

3. Load the dataset along with its labels; each document is labeled with an integer ID according to the folder it is stored in

print('Processing text dataset')
texts = []         # raw document texts
labels_index = {}  # newsgroup (folder) name -> integer label ID
labels = []        # label ID of each document, aligned with texts
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id  # assign one ID per folder
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():  # document files have purely numeric names
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                texts.append(f.read())
                f.close()
                labels.append(label_id)
print('Found %s texts.' % len(texts))
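The labeling rule above (one integer per folder, assigned in sorted order) can be tried on a throwaway directory tree. This is a minimal sketch: index_labels is a hypothetical helper, and the three folder names are just a sample of the real newsgroups.

```python
import os
import tempfile


def index_labels(root):
    """Assign consecutive integer IDs to the sub-folders of root, in sorted order."""
    labels_index = {}
    for name in sorted(os.listdir(root)):
        if os.path.isdir(os.path.join(root, name)):
            labels_index[name] = len(labels_index)
    return labels_index


with tempfile.TemporaryDirectory() as root:
    for group in ("sci.space", "alt.atheism", "rec.autos"):
        os.mkdir(os.path.join(root, group))
    # A stray top-level file is ignored because it is not a directory
    open(os.path.join(root, "README"), "w").close()
    label_ids = index_labels(root)

print(label_ids)
```

Because the IDs depend on the sorted folder names, the same directory layout always yields the same label mapping.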

4. Vectorize the text data

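# num_words replaces the nb_words argument of Keras 1.x
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)  # one list of word indices per document
word_index = tokenizer.word_index                # word -> index, most frequent word first
print('Found %s unique tokens.' % len(word_index))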
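The idea behind this step can be sketched in plain Python: words are ranked by frequency, the most frequent word gets index 1, and only words whose index is below num_words survive when texts are converted to sequences. This is a simplified stand-in, not the Keras implementation; it only lowercases and splits on whitespace, whereas the Keras Tokenizer also filters punctuation. The function names are made up for the example.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace words by indices, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)
print(word_index)
print(sequences)
```

Note that the word index still lists every word seen; the vocabulary limit is applied only when sequences are produced, which is why the tutorial code reports the full number of unique tokens.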

5. Build the training and validation sets: pad the sequences, one-hot encode the labels, then shuffle and split the data

# Pad or truncate every sequence to MAX_SEQUENCE_LENGTH and one-hot encode the labels
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Shuffle documents and labels with the same permutation
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

# Hold out the last VALIDATION_SPLIT fraction of the shuffled data for validation
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
print('Preparing embedding matrix.')
print(nb_validation_samples)
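Two details of this step are easy to miss, so here is a small standard-library sketch of them. By default, pad_sequences pads and truncates at the front of each sequence, and the validation size is the integer part of VALIDATION_SPLIT times the number of documents. The helpers pad_pre and split_sizes are hypothetical names, and the document count of 19,997 assumes the original version of the dataset.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with value, or keep only the last maxlen items (maxlen > 0)."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


def split_sizes(n_samples, validation_split):
    """Return (n_train, n_val) when the tail of the data is held out."""
    n_val = int(validation_split * n_samples)
    return n_samples - n_val, n_val


short = pad_pre([1, 2, 3], 5)           # zeros are added at the front
long = pad_pre([1, 2, 3, 4, 5, 6], 5)   # the oldest token is dropped
sizes = split_sizes(19997, 0.2)
print(short, long, sizes)
```

The resulting 3,999 validation samples agree with the sample count shown in the training log at the end of this post.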

6. Build the LSTM model and evaluate it

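# Row i of the matrix holds the GloVe vector of the word with index i;
# words missing from GloVe keep an all-zero row.
nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)

# trainable=False: the weights come from pretrained GloVe vectors, so they are
# kept frozen. The Keras 1.x dropout=0.2 argument of Embedding no longer exists
# in Keras 2.x and is omitted here.
embedding_layer = Embedding(nb_words + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

print('Build model...')
model = Sequential()
model.add(embedding_layer)
# 100 output units; dropout / recurrent_dropout replace dropout_W / dropout_U
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# One softmax output per newsgroup. A Dense(1) + sigmoid layer in front of it
# would squeeze the LSTM output through a single unit, so it is left out.
model.add(Dense(len(labels_index)))
model.add(Activation('softmax'))
# One-hot labels over 20 classes call for categorical_crossentropy. With
# binary_crossentropy, 'accuracy' is computed per label element and is
# inflated (0.95 even when every output is below 0.5).
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,  # epochs replaces nb_epoch from Keras 1.x
          validation_data=(x_val, y_val))
score, acc = model.evaluate(x_val, y_val, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)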
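For a 20-class problem with one-hot labels, the usual pairing is a softmax output layer with categorical cross-entropy. The following is a minimal sketch of both in plain Python, with made-up scores; it is meant to show what the loss measures, not how Keras computes it internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


# Three made-up class scores; class 0 is the true class
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probs)

# A model that spreads its probability evenly over 20 classes
uniform_loss = categorical_crossentropy([1] + [0] * 19, [1 / 20] * 20)
print(round(loss, 3), round(uniform_loss, 3))
```

The uniform case gives ln(20), roughly 3.0, which is a handy reference point: a categorical cross-entropy well below that value means the model is doing better than guessing evenly across the 20 newsgroups.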

7. Training results (partial output)

3936/3999 [============================>.] - ETA: 0s
3968/3999 [============================>.] - ETA: 0s
3999/3999 [==============================] - 52s 13ms/step
Test score: 0.18472591743048325
Test sccuracy: 0.9499999885411226
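A caveat on reading an accuracy of 0.95 on a 20-class problem: when a Keras 2.x model is compiled with loss='binary_crossentropy', metrics=['accuracy'] resolves to element-wise binary accuracy, which scores each of the 20 label positions separately. Since 19 of the 20 positions in a one-hot label are zero, a model whose outputs all stay below 0.5 already scores 0.95. The sketch below, in plain Python with made-up predictions, contrasts this with per-sample categorical accuracy; both functions are simplified re-implementations written for illustration.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual label elements matched after thresholding."""
    hits = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            hits += int((p > threshold) == bool(t))
            total += 1
    return hits / total


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose highest-scoring class is the true class."""
    hits = sum(row_t.index(max(row_t)) == row_p.index(max(row_p))
               for row_t, row_p in zip(y_true, y_pred))
    return hits / len(y_true)


n_classes = 20
# 20 made-up samples, one per class, with one-hot labels
y_true = [[1 if j == i else 0 for j in range(n_classes)] for i in range(n_classes)]
# An uninformative model: every sample gets the same scores, slightly favoring class 0
y_pred = [[0.0595] + [0.0495] * (n_classes - 1) for _ in range(n_classes)]

bin_acc = binary_accuracy(y_true, y_pred)
cat_acc = categorical_accuracy(y_true, y_pred)
print(bin_acc, cat_acc)
```

Here the same predictions score 0.95 under binary accuracy and 0.05 under categorical accuracy, so a reported 0.95 under binary_crossentropy says little about how often the predicted newsgroup is correct.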