Reading and Writing Data in Text Format

import pandas as pd
import numpy as np
from pandas import Series,DataFrame
!type "E:\python_study_files\python\pydata-book-2nd-edition\examples\ex1.csv"
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
df = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex1.csv")
# Equivalent to:
df = pd.read_table(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex1.csv", sep=',')
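The paths above are machine-specific, so here is a portable sketch of the same read_csv/read_table equivalence, with the contents of ex1.csv inlined through an in-memory io.StringIO buffer:

```python
from io import StringIO

import pandas as pd

# Same contents as ex1.csv, held in memory so the example runs anywhere
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

df1 = pd.read_csv(StringIO(csv_text))
# read_table with sep=',' parses identically
df2 = pd.read_table(StringIO(csv_text), sep=',')
print(df1)
```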
# For files without a header row:
pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex2.csv", header=None)
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# Assign custom column names
pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex2.csv", names=['a','b','c','d','message'])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
names = ['a','b','c','d','message']
pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex2.csv", names=names, index_col='message')
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
# Build a hierarchical index by passing a list of column numbers or names
parsed = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\csv_mindex.csv", index_col=['key1','key2'])
parsed
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16
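A self-contained version of the hierarchical-index read, with the csv_mindex.csv contents inlined via io.StringIO (the values match the output above):

```python
from io import StringIO

import pandas as pd

csv_text = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\none,b,3,4\none,c,5,6\none,d,7,8\n"
    "two,a,9,10\ntwo,b,11,12\ntwo,c,13,14\ntwo,d,15,16\n"
)
parsed = pd.read_csv(StringIO(csv_text), index_col=['key1', 'key2'])
# Rows are now addressed by (key1, key2) tuples
print(parsed.loc[('two', 'c'), 'value1'])  # 13
```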
list(open(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex3.txt"))
['            A         B         C\n','aaa -0.264438 -1.026059 -0.619500\n','bbb  0.927272  0.302904 -0.032399\n','ccc -0.264273 -0.386314 -0.217601\n','ddd -0.871858 -0.348382  1.100491\n']
# The fields here are separated by a variable number of whitespace characters,
# which the regular expression \s+ handles
result = pd.read_table(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex3.txt", sep=r'\s+')
result
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
# Skip selected rows while reading
pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex4.csv", skiprows=[0,2,3])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# Data containing missing values
result = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex5.csv")
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
pd.isnull(result)
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

pandas recognizes a set of commonly occurring sentinel values, such as NA, -1.#IND, NULL, and the empty string.

# na_values accepts a list of strings to treat as missing values
result = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex5.csv", na_values=['NULL'])
pd.isnull(result)
something a b c d message
0 False False False False False True
1 False False False True False True
2 False False False False False False
# A dict can specify different NA sentinels for each column
sentinels = {'message':['foo','NA'],'something':['two']}
pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex5.csv", na_values=sentinels)
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN
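Since ex5.csv lives at a machine-specific path, here is the same per-column sentinel idea on an in-memory copy of the data:

```python
from io import StringIO

import pandas as pd

csv_text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)
# Per-column NA sentinels: 'foo' and 'NA' in message, 'two' in something
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(StringIO(csv_text), na_values=sentinels)
print(result['message'].isnull().tolist())  # [True, False, True]
```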
# Side note: displaying the book's tables of read_csv/read_table parameters as local screenshots
import matplotlib.pyplot as plt
from pylab import *
img = plt.imread('read_csv或read_table函数的参数.png')
imshow(img)
<matplotlib.image.AxesImage at 0x1e8fd724130>

[figure: screenshot of the read_csv/read_table function parameters, part 1]

img = plt.imread('read_csv或read_table函数的参数2.png')
imshow(img)
<matplotlib.image.AxesImage at 0x1e8fd712770>

[figure: screenshot of the read_csv/read_table function parameters, part 2]

Reading Text Files in Pieces

result = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex6.csv")
# Read only the first few rows
pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex6.csv", nrows=5)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
# Read the file in chunks
chunker = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex6.csv", chunksize=1000)
chunker
<pandas.io.parsers.readers.TextFileReader at 0x2eac66984f0>
# read_csv returns a TextFileReader object that iterates over the file chunk by chunk
# Aggregate the value counts of the 'key' column across all chunks
tot = Series([], dtype='float64')
for piece in chunker:
    # Series.add with fill_value=0 sums the counts, treating labels missing from one side as 0
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

The get_chunk method lets you read pieces of an arbitrary size.
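A minimal sketch of get_chunk on an in-memory file; unlike iterating with a fixed chunksize, each call can request a different number of rows:

```python
from io import StringIO

import pandas as pd

csv_text = "key\n" + "\n".join(list("ABCDEFGHIJ"))
# iterator=True returns a TextFileReader without fixing a chunk size
reader = pd.read_csv(StringIO(csv_text), iterator=True)
first = reader.get_chunk(3)   # first 3 rows
second = reader.get_chunk(5)  # next 5 rows
print(len(first), len(second))  # 3 5
```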

Writing Data to Text Format

data = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex5.csv")
# Write the data out to a comma-separated file
data.to_csv(r"E:\python_study_files\ipython_data_analysis\out.csv")
import sys
data.to_csv(sys.stdout,sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
data.to_csv(sys.stdout,na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,columns=['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
# Series also has a to_csv method
# Generate dates from 2000-01-01 through 2000-01-07
dates = pd.date_range('1/1/2000',periods=7)
ts = Series(np.arange(7),index=dates)
ts.to_csv("E:\\python_study_files\\ipython_data_analysis\\tseries.csv")
!type "E:\python_study_files\ipython_data_analysis\tseries.csv"
,0
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
dates
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04','2000-01-05', '2000-01-06', '2000-01-07'],dtype='datetime64[ns]', freq='D')
# read_csv replaces the old equivalent:
# Series.from_csv(r"E:\python_study_files\ipython_data_analysis\tseries.csv", parse_dates=True)
# from_csv has been removed, so use read_csv instead
pd.read_csv("E:\\python_study_files\\ipython_data_analysis\\tseries.csv",parse_dates=True)
Unnamed: 0 0
0 2000-01-01 0
1 2000-01-02 1
2 2000-01-03 2
3 2000-01-04 3
4 2000-01-05 4
5 2000-01-06 5
6 2000-01-07 6
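The from_csv replacement can be checked with an in-memory round trip; passing index_col=0 with parse_dates=True restores the DatetimeIndex that the flat read above left behind as an 'Unnamed: 0' column:

```python
from io import StringIO

import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)

buf = StringIO()
ts.to_csv(buf, header=['value'])
buf.seek(0)

# index_col=0 + parse_dates=True recovers the dates as the index
ts2 = pd.read_csv(buf, index_col=0, parse_dates=True)['value']
print(ts2.iloc[3])  # 3
```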

Working with Delimited Formats Manually

!type "E:\python_study_files\python\pydata-book-2nd-edition\examples\ex7.csv"
"a","b","c"
"1","2","3"
"1","2","3"
import csv
f = open(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex7.csv")
reader = csv.reader(f)  # equivalent to csv.reader(f, delimiter=',')
reader
<_csv.reader at 0x1c28677b160>
for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
lines = list(csv.reader(open(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex7.csv")))
header,values = lines[0],lines[1:]
data_dict = {h: v for h, v in zip(header,zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
zip(*values)
<zip at 0x1c28671d0c0>
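What zip(*values) does here: it transposes the row-oriented list into column tuples, which the dict comprehension then pairs with the header. A self-contained run with the ex7.csv contents inlined:

```python
import csv
from io import StringIO

# Same contents as ex7.csv, inlined so the example is self-contained
lines = list(csv.reader(StringIO('"a","b","c"\n"1","2","3"\n"1","2","3"\n')))
header, values = lines[0], lines[1:]

# zip(*values) flips rows into columns: [('1','1'), ('2','2'), ('3','3')]
columns = list(zip(*values))
data_dict = {h: v for h, v in zip(header, columns)}
print(data_dict)  # {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
```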

# Define the dialect before using it; a csv.Dialect subclass bundles the CSV
# formatting options (this is the book's my_dialect; omitting it raises NameError)
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

with open('mydata.csv','w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))

JSON Data

obj =  """
{"name":"Wes",
"places_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},{"name":"Katie","age":33,"pet":"Cisco"}]
}
"""
import json
# Convert a JSON string to a Python object
result = json.loads(obj)
result
{'name': 'Wes','places_lived': ['United States', 'Spain', 'Germany'],'pet': None,'siblings': [{'name': 'Scott', 'age': 25, 'pet': 'Zuko'},{'name': 'Katie', 'age': 33, 'pet': 'Cisco'}]}
# Convert a Python object back to a JSON string
asjson = json.dumps(result)
siblings = DataFrame(result['siblings'],columns=['name','age'])
siblings
name age
0 Scott 25
1 Katie 33
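An alternative to picking columns by hand is pd.json_normalize, which flattens a list of records directly; a sketch on the same result object:

```python
import json

import pandas as pd

obj = """
{"name":"Wes",
"places_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},{"name":"Katie","age":33,"pet":"Cisco"}]
}
"""
result = json.loads(obj)

# record_path pulls each dict in result['siblings'] into its own row
siblings = pd.json_normalize(result, record_path='siblings')
print(siblings[['name', 'age']])
```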

XML and HTML: Web Scraping

from lxml.html import parse
import urllib.request
parsed = parse(urllib.request.urlopen('http://www.stats.gov.cn/tjsj/ndsj/2021/indexch.htm'))
doc = parsed.getroot()
# Get all HTML tags of a particular type
links = doc.findall('.//a')
links
[]
# Use each element's get method (for the URL) and text_content method (for the display text).
# Note that links came back empty here, so links[28] only works against the book's original page
lnk = links[28]
lnk.get('href')
lnk.text_content()
# A list comprehension collects every URL in the document
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
# The TextParser class can be used for automatic type conversion
from pandas.io.parsers import TextParser
from lxml import objectify
path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
data = []
skip_fields = ['PARENT_SEQ','INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']
# root.INDICATOR is a generator yielding each <INDICATOR> XML element
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
perf = DataFrame(data)
perf
AGENCY_NAME INDICATOR_NAME DESCRIPTION PERIOD_YEAR PERIOD_MONTH CATEGORY FREQUENCY INDICATOR_UNIT YTD_TARGET YTD_ACTUAL MONTHLY_TARGET MONTHLY_ACTUAL
0 Metro-North Railroad On-Time Performance (West of Hudson) Percent of commuter trains that arrive at thei... 2008 1 Service Indicators M % 95.0 96.9 95.0 96.9
1 Metro-North Railroad On-Time Performance (West of Hudson) Percent of commuter trains that arrive at thei... 2008 2 Service Indicators M % 95.0 96.0 95.0 95.0
2 Metro-North Railroad On-Time Performance (West of Hudson) Percent of commuter trains that arrive at thei... 2008 3 Service Indicators M % 95.0 96.3 95.0 96.9
3 Metro-North Railroad On-Time Performance (West of Hudson) Percent of commuter trains that arrive at thei... 2008 4 Service Indicators M % 95.0 96.8 95.0 98.3
4 Metro-North Railroad On-Time Performance (West of Hudson) Percent of commuter trains that arrive at thei... 2008 5 Service Indicators M % 95.0 96.6 95.0 95.8
... ... ... ... ... ... ... ... ... ... ... ... ...
643 Metro-North Railroad Escalator Availability Percent of the time that escalators are operat... 2011 8 Service Indicators M % 97.0 97.0
644 Metro-North Railroad Escalator Availability Percent of the time that escalators are operat... 2011 9 Service Indicators M % 97.0 97.0
645 Metro-North Railroad Escalator Availability Percent of the time that escalators are operat... 2011 10 Service Indicators M % 97.0 97.0
646 Metro-North Railroad Escalator Availability Percent of the time that escalators are operat... 2011 11 Service Indicators M % 97.0 97.0
647 Metro-North Railroad Escalator Availability Percent of the time that escalators are operat... 2011 12 Service Indicators M % 97.0 97.0

648 rows × 12 columns

Binary Data Formats

pickle is best used only for short-term storage, since there is no guarantee the format will stay stable over time.

frame = pd.read_csv(r"E:\python_study_files\python\pydata-book-2nd-edition\examples\ex1.csv")
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# frame.save('frame_pickle')
# The old save method is now to_pickle; likewise, load is now read_pickle
frame.to_pickle('frame_pickle')  # serialize in binary format
# Read the binary data back
pd.read_pickle('frame_pickle')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Using the HDF5 Format

For very large datasets, look at PyTables and h5py, since many data analysis problems are I/O-bound rather than CPU-bound. Note that HDF5 works best as a "write once, read many" store.

store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store
---------------------------------------------------------------------------
ImportError: Missing optional dependency 'tables'. Use pip or conda to install tables.

Reading Microsoft Excel Files

import openpyxl
xls_file = pd.ExcelFile('data.xlsx')
table = xls_file.parse('Sheet1')

Using HTML and Web APIs

import requests
url = 'http://www.json.org.cn/resource/json-in-javascript.htm'
resp = requests.get(url)
resp
<Response [200]>
import json
data = json.loads(resp.text)
data.keys()
---------------------------------------------------------------------------
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
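The JSONDecodeError above occurs because that URL serves an HTML page, not JSON; json.loads needs a response body that is actually JSON. A sketch of the intended pattern with the response text simulated (no real endpoint assumed, no network needed):

```python
import json

import pandas as pd

# Simulated body of an API response that returns JSON records
resp_text = '[{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]'

data = json.loads(resp_text)  # with requests this would be json.loads(resp.text)
issues = pd.DataFrame(data, columns=['id', 'title'])
print(issues)
```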

Interacting with Databases

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20),b VARCHAR(20),
c REAL,        d INTEGER
);"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
# Insert a few rows
data = [('Atlanta','Georgia',1.25,6),('Tallahassee','Florida',2.6,3),('Sacramento','California',1.7,5)]
stmt = "INSERT INTO test VALUES(?,?,?,?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),('Tallahassee', 'Florida', 2.6, 3),('Sacramento', 'California', 1.7, 5)]
cursor.description
(('a', None, None, None, None, None, None),('b', None, None, None, None, None, None),('c', None, None, None, None, None, None),('d', None, None, None, None, None, None))
# zip objects aren't subscriptable in Python 3, so zip(*cursor.description)[0]
# raises TypeError; extract the column names with a list comprehension instead
DataFrame(rows, columns=[x[0] for x in cursor.description])
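pandas can also do the query-to-DataFrame step in one call with read_sql_query; a self-contained sketch against an in-memory SQLite database:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)')
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
con.executemany('INSERT INTO test VALUES (?,?,?,?)', data)
con.commit()

# read_sql_query pulls the rows and the column names in one step
frame = pd.read_sql_query('SELECT * FROM test', con)
print(frame)
```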

Storing and Loading Data in MongoDB

# import pymongo
# con = pymongo.Connection('localhost', port=27017)  # Connection was removed; use MongoClient
from pymongo import MongoClient
con = MongoClient('localhost',port=27017)
