Module 3 Text processing and data cleaning

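This post covers the complete content of Grok Module 3. It is fairly long, so please be patient and read it through. The last subsection of each major section is a programming quiz that counts toward the final grade, and it will be explained in detail.
Module 3 has six major sections: 1. Introduction 2. Transforming data 3. Filtering data 4. Filtering and transforming 5. Advanced filtering and transforming 6. Alternative transforms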


Contents

  • Module 3 Text processing and data cleaning
  • Preface
  • Introduction
  • Pattern 1: Transforming data
    • Transforming data
    • Breaking it down
    • Transform each word
    • String methods recap
    • Stripping whitespace
    • Strip it! (graded)
  • Pattern 2: Filtering data
    • Breaking it down
    • In or Out
    • Character in string
    • Filter (graded)
  • Pattern 3: Filtering and transforming
  • Summary

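Preface

Writing this took real effort, so please do not plagiarize. Quoting is fine as long as you cite the source.
I will do my best to cover every point. If you find mistakes or omissions, please leave a comment or send me a message.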


Introduction

In this module we will learn how to process text-based data. We start by looking at how to write programs that open and read from text files.

From there, we will concentrate on two important concepts in the field of text processing: transforming and filtering. These two tasks are routinely applied in data cleaning and data mining.

The patterns for this module are:
1. Transforming data
2. Filtering data
3. Filtering and transforming
4. Advanced filtering and transforming



Pattern 1: Transforming data

Transforming data

Below we have a text file that contains the beginning of our novel (only the first six lines are shown).

pride_and_prejudice.txt

It
Is
A
Truth
Universally
Acknowledged

We want to transform each word so that it contains only lower case characters.

for word in open("pride_and_prejudice.txt"):
    word_new = word.lower()
    print(word_new)

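When you run this code, you should see the following output (each line read from the file keeps its trailing newline and print adds another, so a blank line appears after each word; the blank lines are omitted here):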

it
is
a
truth
universally
acknowledged


Breaking it down

The first line in our example program starts a loop that runs through each line of the file. This is the standard syntax for reading a file line by line.

for word in open("pride_and_prejudice.txt"):

The loop variable plays a special role in the for statement: each line from the file is assigned to this variable in turn.
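A minimal sketch of this loop, assuming a small stand-in file: the name sample_words.txt and its three words are made up for illustration, and the file is written to a temporary directory so the example runs on its own. It prints each value of the loop variable with repr to show that every line arrives with its trailing newline still attached.

```python
import os
import tempfile

# Hypothetical stand-in for the novel file: three words, one per line.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_words.txt")
with open(path, "w") as f:
    f.write("It\nIs\nA\n")

seen = []
with open(path) as f:      # 'with' closes the file once the loop is done
    for word in f:         # the loop variable receives one line per pass
        seen.append(word)
        print(repr(word))  # repr makes the trailing "\n" visible
```

The course code passes open(...) straight to the for statement; the with block used here is an optional variation that also closes the file afterwards.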


Transform each word

Inside the indented block of the for loop we can do anything we want with our word variable.

for word in open("pride_and_prejudice.txt"):
    word_new = word.lower()
    print(word_new)

Here we created a new variable called word_new which contains a transformed version of the original word variable. Calling .lower() on a string returns a copy of it converted to lower case characters.
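A sketch of the same transform on in-memory lines. The list named lines stands in for the file, an assumption made so the example runs without pride_and_prejudice.txt. Because each line already ends in a newline, print is given end="" here to avoid an extra blank line; that argument is an addition of this sketch, not part of the course code.

```python
# Stand-in for the file: each item is a line as the file loop would deliver it.
lines = ["It\n", "Is\n", "A\n", "Truth\n", "Universally\n", "Acknowledged\n"]

lowered = []
for word in lines:
    word_new = word.lower()    # returns a new string; word itself is unchanged
    lowered.append(word_new)
    print(word_new, end="")    # the line already carries its own newline
```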

Here’s another example of how we can transform words in a file. For the following small file, we want to print out the length of each word. We can do this by using the len function:

example.txt

one
two
three

for word in open("example.txt"):
    length = len(word)
    print(length)

When you run this code, you should get the following output:

4
4
5
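The first two lengths may look off by one. A short sketch shows where the counts come from, assuming the file's lines arrive as "one\n", "two\n" and "three" (the last line having no trailing newline, which is what the output above implies). The list named lines stands in for example.txt, and rstrip is the string method covered in the next section.

```python
# Assumed contents of example.txt, as the file loop would deliver them.
lines = ["one\n", "two\n", "three"]

raw_lengths = [len(word) for word in lines]                 # newline is counted
clean_lengths = [len(word.rstrip("\n")) for word in lines]  # newline removed first

print(raw_lengths)    # [4, 4, 5]
print(clean_lengths)  # [3, 3, 5]
```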


String methods recap

The lower() call in the example above is a string method. It converts all characters in a string to lower case. String methods are called using this general form:
"string".method(args)

A very useful string method in the context of file processing is the rstrip method. It allows us to strip characters from the end of a string. This can, for example, be used to remove the newline (\n) character:

s1 = "line 1\n"
s2 = "line 2"
s1_stripped = s1.rstrip("\n")
print(s1_stripped)
print(s2)

Stripping whitespace

If we call rstrip and pass in a particular character as the argument, then every consecutive occurrence of that character at the right side (end) of the string is stripped; occurrences elsewhere in the string are left alone.

Python has two other string methods that work similarly to rstrip:

  • lstrip — same as rstrip, but strips from the left side (beginning) of the string; and
  • strip — same as lstrip, but strips from both sides of the string.
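The three methods can be compared side by side. This is a small sketch with made-up strings; the variable names padded and line are illustrative only. It also shows the no-argument form, which strips whitespace rather than a specific character.

```python
padded = "xxhelloxx"

print(padded.rstrip("x"))  # xxhello
print(padded.lstrip("x"))  # helloxx
print(padded.strip("x"))   # hello

# With no argument, all three methods strip whitespace
# (spaces, tabs, newlines) instead of a specific character.
line = "  some text \n"
print(repr(line.rstrip()))  # '  some text'
print(repr(line.strip()))   # 'some text'
```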



Strip it! (graded)

Link: USYD (University of Sydney) DATA1002 detailed assignment walkthrough, Module 3



Pattern 2: Filtering data

Using the following text file (only the head of the file is shown),

pride_and_prejudice.txt

I
certainly
have
had
my
of

we want to keep those words with more than six characters, thereby filtering out all the short words.

for word in open("pride_and_prejudice.txt"):
    if len(word.rstrip("\n")) > 6:
        print(word)

When you run this code, you should see the following output:

certainly
pretend
extraordinary


Breaking it down

Here we check whether the word contains more than 6 characters using the len function, which returns the number of characters in a string. We strip off the newline character with rstrip("\n") before we calculate the length, like we did before. You can think of Python evaluating from the innermost instruction to the outermost. Here, rstrip gets executed first, and then len gets executed with the result of rstrip.
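The inside-out evaluation can be spelled out in two separate steps. The sketch below is one way to write the filter as a reusable function; the name long_words, its threshold parameter, and the in-memory sample list are assumptions made for illustration, not part of the course code.

```python
def long_words(lines, threshold=6):
    """Return the words in lines that have more than threshold characters."""
    kept = []
    for word in lines:
        stripped = word.rstrip("\n")    # innermost step: drop the newline
        if len(stripped) > threshold:   # outer step: measure what is left
            kept.append(stripped)
    return kept

# Lines as a file loop would deliver them (the head of the file shown above).
sample = ["I\n", "certainly\n", "have\n", "had\n", "my\n", "of\n"]
print(long_words(sample))  # ['certainly']
```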

Why do we need to strip?

Remember that each line in a file (except possibly the last one) ends in a newline character. This character gets counted as well when you call the len function on the string that contains the data from that line. Compare:

word1 = "Carriage\n"
word2 = "Carriage"
print(len(word1))
print(len(word2))
# output:
# 9
# 8

For this reason (and many others) we should always use rstrip("\n") in filtering statements.


In or Out

Boundary cases

A programmer needs to be very careful and precise in stating the condition in any if statement. It is good practice to check your code on examples that fall exactly on the boundary between the cases that are printed and those that are filtered out, and also on cases that fall just on either side of the boundary.

For example:

if len(word.rstrip("\n")) <= 9:

if len(word.rstrip("\n")) < 9:

Code that “filters out the words with more than 9 characters” (the first condition above) and code that “filters out the words with 9 or more characters” (the second condition) will do different things when they are given a word with exactly 9 characters.
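A quick way to see the difference is to run both conditions on words just below, exactly on, and just above the boundary. In this sketch the function names at_most_9 and fewer_than_9 and the three test words are chosen for illustration.

```python
def at_most_9(word):
    # Keeps words of up to 9 characters; filters out words with more than 9.
    return len(word.rstrip("\n")) <= 9

def fewer_than_9(word):
    # Keeps words of up to 8 characters; filters out words with 9 or more.
    return len(word.rstrip("\n")) < 9

# Test words with 8, 9 and 10 characters respectively.
for word in ["prettily\n", "certainly\n", "acquainted\n"]:
    print(len(word.rstrip("\n")), at_most_9(word), fewer_than_9(word))
# 8 True True
# 9 True False
# 10 False False
```

Only the 9-character word is treated differently by the two conditions, which is why the boundary case is the one worth testing.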


Character in string

This filtering technique checks whether a certain character (or sequence of characters) is present within a string. For this, we use the in keyword like this:

letter in word

This expression checks whether the value of a variable called letter is among the characters present in the value of a variable called word, and it evaluates to either True or False.

letter = "p"
word = "accept"
print(not letter in word)
print(letter in word and len(word)==6)
# output:
# False
# True
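Used inside a file-style loop, the in test becomes a filter. The following sketch assumes a short in-memory list of lines in place of a file; the list contents and the name with_p are made up for illustration.

```python
# Stand-in for a file: each item is one line, newline included.
lines = ["accept\n", "truth\n", "pride\n", "is\n"]

with_p = []
for word in lines:
    stripped = word.rstrip("\n")
    if "p" in stripped:          # keep only words containing the letter p
        with_p.append(stripped)
        print(stripped)

# 'in' also works with longer substrings, not just single characters.
print("rid" in "pride")  # True
print("dir" in "pride")  # False
```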

Filter (graded)

Link: USYD (University of Sydney) DATA1002 detailed assignment walkthrough, Module 3



Pattern 3: Filtering and transforming

We want to be able to apply both filtering and transforming at the same time, using the following text file as an example (the file is long, so only the head is shown):

pride_and_prejudice.txt
However
little
known
the
feelings
or

Problem: Suppose we want to find all words that end in the character “e” and then find out how long these words are.

Approach: We first have to filter out the words which don’t end with an “e”, and then transform each word that wasn’t filtered out into a number representing its length.

for word in open('pride_and_prejudice.txt'):
    if word.rstrip('\n').endswith('e'):
        length = len(word.rstrip('\n'))
        print(length)  # out
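To check the combined pattern without the full novel, here is a sketch that runs it on the six words shown above. The in-memory list named lines stands in for the file, and collecting the results in a list named lengths is an addition of this sketch.

```python
# The head of the file shown above, as the file loop would deliver it.
lines = ["However\n", "little\n", "known\n", "the\n", "feelings\n", "or\n"]

lengths = []
for word in lines:
    stripped = word.rstrip("\n")        # strip once, reuse for both steps
    if stripped.endswith("e"):          # filter: keep words ending in "e"
        lengths.append(len(stripped))   # transform: word -> its length

print(lengths)  # [6, 3]
```

Only "little" and "the" end in "e", so the result holds their lengths, 6 and 3.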

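Summary

That is everything for today: a walkthrough of the Module 3 course material. The last subsection of each section is graded, so please read carefully and ask whenever something is unclear. Good luck getting high marks and passing the course.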
