Twitter Dataset Processing

In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous. While there is no limit to the range of information conveyed by tweets and texts, these short messages are often used to share opinions and sentiments that people have about what is going on in the world around them.

Opinion mining (also known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine.

Both lexicon-based and machine-learning-based approaches will be used for emoticon-based sentiment analysis. We start with machine-learning-based clustering; in the machine-learning approach we use both supervised and unsupervised learning methods. The collected Twitter data are given as input to the system. The system classifies each tweet as positive, negative, or neutral, and also outputs the number of positive, negative, and neutral tweets for each emoticon separately. In addition, the polarity of each tweet is determined.

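As a rough illustration of this positive/negative/neutral labelling, the sketch below uses TextBlob (which also appears later in the pre-processing code); the zero cut-offs between the three classes are an assumed convention, not the original system's rule.

from textblob import TextBlob

def label_tweet(text):
    # TextBlob polarity is a float in [-1.0, 1.0]; the 0.0 thresholds are assumed
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

print(label_tweet("I love this phone :)"))   # Positive
print(label_tweet("Worst service ever :("))  # Negative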

Collection of Data

To collect the Twitter data, we have to carry out some data mining. In that process, we created our own application with the help of the Twitter API and used it to collect a large number of tweets. To do this, we create a developer account and register our app. We then receive a consumer key and a consumer secret, which are used in the application settings; from the configuration page of the app we also obtain an access token and an access token secret, which give the application access to Twitter on behalf of the account (these credentials appear as placeholders in the collection script below). The process is divided into two sub-processes. This is discussed in the next subsection.

Accessing Twitter Data and Streaming

To build the application and interact with Twitter services, we use the REST API provided by Twitter through a Python-based client (Tweepy). The API object is our entry point for most of the operations we can perform with Twitter, and it provides access to different types of data. In this way we can easily collect tweets (and more) and store them in the system. By default the data arrive in JSON format; we convert them to plain text (txt) files for easier handling.

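A tiny sketch of the JSON-to-text conversion mentioned above, assuming the tweets were saved as one JSON object per line (the file names are illustrative):

import json

# keep only the tweet text, one tweet per line
with open("tweets.json") as src, open("tweets.txt", "w") as dst:
    for line in src:
        tweet = json.loads(line)
        dst.write(tweet.get("text", "") + "\n")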

In case we want to “keep the connection open” and gather all upcoming tweets about a particular event, the Streaming API is what we need. By extending and customizing a stream listener, we process the incoming data as it arrives. In this way we can gather a large number of tweets, which is especially useful for live events with worldwide coverage.

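Below is a minimal sketch of such a listener, assuming tweepy 3.x (in tweepy 4.x, StreamListener was folded into tweepy.Stream); the TweetSaver class and the output file name are illustrative rather than the original implementation.

import json
import tweepy

class TweetSaver(tweepy.StreamListener):
    def __init__(self, out_path="stream_tweets.txt"):
        super().__init__()
        self.out_path = out_path

    def on_status(self, status):
        # append the raw JSON of each incoming tweet to a text file
        with open(self.out_path, "a") as f:
            f.write(json.dumps(status._json) + "\n")
        return True

    def on_error(self, status_code):
        # stop streaming when Twitter signals rate limiting
        if status_code == 420:
            return False

# auth is the tweepy.OAuthHandler instance created with the credentials below
# stream = tweepy.Stream(auth=auth, listener=TweetSaver())
# stream.filter(track=["#worldcup"], languages=["en"])

The full collection script used in this article follows; it relies on the search API together with the AYLIEN Text API for sentiment scoring.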

# Twitter Sentiment Analysis
import sys
import csv
import tweepy
import matplotlib.pyplot as plt
from collections import Counter
from aylienapiclient import textapi  # assumed import to match the client calls below

if sys.version_info[0] < 3:
    input = raw_input

## Twitter credentials
consumer_key = "------------"
consumer_secret = "------------"
access_token = "----------"
access_token_secret = "-----------"

## AYLIEN Text API credentials (placeholders, added to make the script self-contained)
application_id = "------------"
application_key = "------------"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
    lang="en",
    q=query + " -rt",
    count=number,
    result_type="recent"
)
print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)
with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(f=csvfile, fieldnames=["Tweet", "Sentiment"])
    csv_writer.writeheader()
    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    ## tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        # strip non-ASCII characters and decode back to str (bug fix for Python 3)
        tidy_tweet = tweet.strip().encode('ascii', 'ignore').decode('ascii')
        if len(tidy_tweet) == 0:
            print('Empty Tweet')
            continue
        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({
            'Tweet': response['text'],
            'Sentiment': response['polarity']
        })
        print("Analyzed Tweet {}".format(c))

Data Pre-Processing and Cleaning

The data pre-processing step performs the necessary pre-processing and cleaning on the collected dataset. The collected tweets have several key attributes: text (the text of the tweet itself), created_at (the date of creation), favorite_count and retweet_count (the number of favourites and retweets), and favourited and retweeted (booleans stating whether the authenticated user has favourited or retweeted this tweet), among others. We apply an extensive set of pre-processing steps to reduce the size of the feature set and make it suitable for learning algorithms. The cleaning method is based on dictionary methods.

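As a minimal sketch, the snippet below pulls these key attributes out of tweets saved as one JSON object per line (the file name follows the streaming sketch above; the field names are standard Twitter tweet-object attributes):

import json

FIELDS = ["text", "created_at", "favorite_count", "retweet_count", "favorited", "retweeted"]

records = []
with open("stream_tweets.txt") as f:
    for line in f:
        tweet = json.loads(line)
        # keep only the attributes used in the analysis
        records.append({field: tweet.get(field) for field in FIELDS})

print(records[0])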

Data obtained from Twitter usually contains a lot of HTML entities such as &lt;, &gt; and &amp; that get embedded in the original data, so it is necessary to get rid of them. One approach is to remove them directly with specific regular expressions. Here, we use Python's HTML parsing facilities, which can convert these entities back into standard characters: for example, &lt; is converted to “<” and &amp; is converted to “&”. After this, we remove the remaining special HTML characters and links. Decoding is the process of transforming information from complex symbols into simple, easier-to-understand characters; the collected data comes in different encodings such as Latin-1 and UTF-8.

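A small sketch of this step, using Python's standard html module in place of the older HTMLParser interface (the link-stripping pattern is an assumption):

import html
import re

raw = "Check this out &lt;3 &amp; RT https://t.co/abc123"

# convert HTML entities such as &lt; and &amp; back into normal characters
text = html.unescape(raw)

# drop links with a simple (assumed) URL pattern
text = re.sub(r"http\S+", "", text).strip()

print(text)  # Check this out <3 & RT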

The Twitter datasets also contain other information such as retweet markers, hashtags, usernames and modified tweets. All of this is ignored and removed from the dataset.

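A hedged sketch of stripping these markers with regular expressions (the patterns are illustrative, not the original author's code):

import re

def strip_twitter_markup(text):
    text = re.sub(r"^RT\s+", "", text)        # leading retweet marker
    text = re.sub(r"@\w+:?", "", text)        # @usernames (with optional colon)
    text = re.sub(r"#", "", text)             # keep hashtag words, drop the '#' sign
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(strip_twitter_markup("RT @user: loving the #WorldCup final!"))
# loving the WorldCup final!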

# Tokenization, bag-of-words, and dictionary-based word lookup helpers
from nltk import word_tokenize, pos_tag, pos_tag_sents
from nltk.corpus import wordnet, words
from nltk.tokenize import sent_tokenize
# import for bag of words
import numpy as np
# for regular expressions
import re
# TextBlob dependency
from textblob import TextBlob
from textblob import Word
# set to string
from ast import literal_eval
# project-local helpers (sentencecounter is a module from this project's source tree)
from sentencecounter import no_sentences, getline, gettempwords
import os

def getsysets(word):
    # wordnet from nltk.corpus will not work with TextBlob objects
    syns = wordnet.synsets(word)
    # syns[0].name(), syns[0].lemmas()[0].name(), syns[0].definition(), syns[0].examples()
    return syns

def getsynonyms(word):
    synonyms = []
    for syn in wordnet.synsets(word):
        for l in syn.lemmas():
            synonyms.append(l.name())
    return set(synonyms)

def extract_words(sentence):
    ignore_words = ['a']
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w not in ignore_words]

def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        words.extend(extract_words(sentence))
    return sorted(set(words))

def bagofwords(sentence, words):
    # frequency word count
    sentence_words = extract_words(sentence)
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1
    return np.array(bag)

def tokenizer(sentences):
    token = word_tokenize(sentences)
    print("#" * 100)
    print(sent_tokenize(sentences))
    print(token)
    print("#" * 100)
    return token

def createposfile(filename, word):
    f = open(filename, 'w')
    f.writelines(word + '\n')

def createnegfile(filename, word):
    f = open(filename, 'w')
    f.writelines(word)

def getsortedsynonyms(word):
    return sorted(getsynonyms(word))

def getlengthofarray(word):
    return len(getsortedsynonyms(word))

def readposfile():
    return open('list of positive words.txt')

def searchword(word, srcfile):
    # record the word if it (or, recursively, one of its synonyms) appears in the
    # positive-word dictionary; otherwise write it back to the source file
    if word in open('list of positive words.txt').read():
        createposfile('destinationnegfile.txt', word)
    else:
        for i in range(0, getlengthofarray(word)):
            searchword(getsortedsynonyms(word)[i], srcfile)
        f = open(srcfile, 'w')
        f.writelines(word)

print('#' * 50)
print(readposfile())
print('#' * 50)
# example calls:
# sentences = ["Machine learning is great", "Natural Language Processing is a complex field"]
# vocabulary = tokenize_sentences(sentences)
# searchword('lol', 'a.txt')
# print(sorted(getsynonyms('good'))[2])
# for word in word_tokenize(getline()):
#     searchword(word, 'a.txt')

Stop words are generally thought of as a “single set of words” that carry little meaning on their own, and we do not want them taking up space in our database, so we remove them using NLTK and a stop-word dictionary. Punctuation marks should be handled according to their priority: for example, “.”, “,” and “?” are important punctuation marks that should be retained, while the others need to be removed. We should also remove duplicate tweets, which we have already done. Sometimes it is better to remove duplicates based on a set of unique identifiers: for example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero.

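A minimal sketch of the stop-word step with NLTK (which punctuation marks to keep follows the example above and should be treated as an assumption):

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# assumes the NLTK 'stopwords' and 'punkt' resources have been downloaded
KEEP_PUNCT = {".", ",", "?"}                  # punctuation we choose to retain
STOP_WORDS = set(stopwords.words("english"))  # NLTK's stop-word dictionary

def remove_stopwords(text):
    tokens = word_tokenize(text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS
            and (t not in string.punctuation or t in KEEP_PUNCT)]

print(remove_stopwords("Is this the best phone I have ever owned? Yes!"))
# ['best', 'phone', 'ever', 'owned', '?', 'yes']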

Thank you for reading.

I hope you found this data cleaning guide helpful. Please leave any comments to let us know your thoughts.

To read the previous part of the series:

https://medium.com/@sayanmondal2098/sentimental-analysis-of-twitter-emoji-64432793b76f

Translated from: https://medium.com/swlh/twitter-data-cleaning-and-preprocessing-for-data-science-3ca0ea80e5cd
