Pandas Tutorial Series (1): Creating, Reading, and Writing

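In most data analysis projects, the first step is reading a data file. In this tutorial you will create Series and DataFrame objects, both by hand and by reading data files.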

import pandas as pd

Creating data

pandas has two core objects: the DataFrame and the Series.

A DataFrame is a table. Each entry in it is identified by a row and a column.

For example, here is a simple DataFrame:

pd.DataFrame({'Yes':[50, 21], 'No':[131, 2]})

	Yes	No
0	50	131
1	21	2

In the example above, the entry at "0, No" (row 0, column No) has the value 131.
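As a minimal sketch of how a row label and a column name together identify one entry, the lookup described above can be written with .loc. The variable names df and value are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# Select the entry in the row labelled 0 and the column labelled 'No'
value = df.loc[0, 'No']
print(value)
```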

DataFrame entries are not limited to integers. For instance, the entries of the following DataFrame are strings.

pd.DataFrame({'Bob':['I like it', 'It was awful'], 'Sue':['Pretty good', 'Bland']})

	Bob	Sue
0	I like it	Pretty good
1	It was awful	Bland

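We are using the pd.DataFrame constructor to create these DataFrame objects. Passing it a dictionary, as here, is the most common way to build one.

The keys of the dictionary become the column names, while the row labels default to 0, 1, 2, ...

When you do not want the default row labels, you can pass an index: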

pd.DataFrame({'Bob':['I liked it', 'It was awful'],
              'Sue':['Pretty good', 'Bland']},
              index=['Product A', 'Product B'])

	Bob	Sue
Product A	I liked it	Pretty good
Product B	It was awful	Bland
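Once rows have custom labels, those labels can be used to look rows up. The following is a small sketch of that idea; the variable names reviews and row are chosen for illustration only.

```python
import pandas as pd

reviews = pd.DataFrame({'Bob': ['I liked it', 'It was awful'],
                        'Sue': ['Pretty good', 'Bland']},
                       index=['Product A', 'Product B'])

# The row labels are stored in the DataFrame's index
print(list(reviews.index))

# .loc with a single row label returns that row as a Series
row = reviews.loc['Product A']
print(row['Sue'])
```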

A Series is a sequence of data values. If a DataFrame is a table, a Series can be thought of as a list.

pd.Series([1,2,3,4,5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

Note that a Series created this way is a single column, not a row! To label that column, pass the name parameter; to label the rows, pass the index parameter.

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
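To illustrate how name and index relate a Series to a DataFrame, here is a short sketch. It uses to_frame(), which is not mentioned in the tutorial itself, to show that the Series name becomes the column label; the variable names are illustrative.

```python
import pandas as pd

sales = pd.Series([30, 35, 40],
                  index=['2015 Sales', '2016 Sales', '2017 Sales'],
                  name='Product A')

# Row labels from `index` can be used to look up individual values
print(sales['2016 Sales'])

# Converting to a one-column DataFrame: the Series name becomes the column label
frame = sales.to_frame()
print(list(frame.columns))
```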

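In practice we rarely create data by hand. Most of the time we read data that already exists.

Data can be stored in many different file formats. The most common is the CSV file.

When you open a CSV file, it usually looks something like this: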

Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11

A CSV file is a table whose values are separated by commas. CSV stands for "comma-separated values".
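As a rough sketch of how CSV text maps onto a DataFrame, the snippet below parses a small table from an in-memory string instead of a file. The trailing commas from the sample above are left out here, which is an assumption made to keep the parsed table at exactly three columns; io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Same values as the sample above, without the trailing commas (an assumption)
csv_text = "Product A,Product B,Product C\n30,21,9\n35,34,1\n41,11,11\n"

# read_csv accepts a file path or any file-like object
products = pd.read_csv(io.StringIO(csv_text))
print(products.shape)
print(products['Product B'].tolist())
```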

Now we will read the data from a file into a DataFrame:

wine_reviews = pd.read_csv("data/wine-reviews/winemag-data-130k-v2.csv")

We can use the shape attribute to check how large the DataFrame is:

wine_reviews.shape

(129971, 14)

Our DataFrame has nearly 130,000 records (129,971), each with 14 columns. That comes to about 1.8 million entries.
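The count of entries is simply rows times columns. A minimal sketch on a tiny, made-up DataFrame (the name small and its values are illustrative) shows the relationship between shape and the size attribute, followed by the same arithmetic for the shape reported above.

```python
import pandas as pd

small = pd.DataFrame({'points': [87, 87, 87], 'price': [None, 15.0, 14.0]})

# shape is a (rows, columns) tuple; size is the total number of entries
n_rows, n_cols = small.shape
print(n_rows * n_cols == small.size)

# The same arithmetic for a (129971, 14) DataFrame
total_entries = 129971 * 14
print(total_entries)
```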

We can use the head() method to look at the first five rows:

wine_reviews.head()

	Unnamed: 0	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

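The pandas read_csv function has more than 30 optional parameters. For example, you can see that the CSV file above already contains an index column, which pandas loaded as an ordinary column named "Unnamed: 0". If you want to use that column as the index instead of having pandas create an additional one, specify the index_col parameter, and pandas will use that column as the index.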

wine_reviews = pd.read_csv("data/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks
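The effect of index_col can be reproduced on a tiny in-memory CSV. This is a sketch under the assumption that the file's first column has an empty header, as the wine reviews file appears to; the sample rows are abbreviated from the output above.

```python
import io
import pandas as pd

# First column has no header name, like the built-in index in the file above
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Without index_col, the unnamed column is loaded as ordinary data
default = pd.read_csv(io.StringIO(csv_text))
print(list(default.columns))

# With index_col=0, the first column becomes the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```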

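Now let's look at some other data formats you are likely to run into, starting with Excel files, which have the extension XLS or XLSX. An Excel file is itself made up of a set of named sheets, so when reading one you typically pass one additional parameter, the sheet name (by default read_excel loads the first sheet), as in the following example: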

[Figure: the file as it appears when opened in Excel]

wic = pd.read_excel("data/publicassistance/sub-est2016_all.xlsx", sheet_name='sub-est2016_all')
wic.head()

	SUMLEV	STATE	PLACE	FUNCSTAT	NAME	STNAME	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015	POPESTIMATE2016
0	40	1	0	A	Alabama	Alabama	4779736	4780131	4785492	4799918	4815960	4829479	4843214	4853875	4863300
1	162	1	124	A	Abbeville city	Alabama	2688	2688	2683	2685	2647	2631	2619	2616	2603
2	162	1	460	A	Adamsville city	Alabama	4522	4522	4517	4495	4472	4447	4428	4395	4360
3	162	1	484	A	Addison town	Alabama	758	756	754	753	748	748	747	740	738
4	162	1	676	A	Akron town	Alabama	356	356	355	345	345	342	337	337	334

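Excel files are often formatted for human readers rather than for machines, whereas CSV files are well suited to machine reading.

Now let's move on to another common format: SQL databases.

SQL databases are among the most common ways of storing data on the web. From Python you can open a connection to a database and read data through it. We will use sqlite3 as the example.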

import sqlite3
conn = sqlite3.connect("data/188-million-us-wildfires/FPA_FOD_20170508.sqlite")

SQL databases work quite differently: you operate on them by writing SQL statements. pandas provides a convenient interface for this:

fires = pd.read_sql_query("SELECT * FROM fires", conn)

We can then preview the first few rows of the result:

fires.head()
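Because the wildfire database is a large external file, here is a self-contained sketch of the same pattern against an in-memory SQLite database. The table contents and the column names fire_name and fire_year are made up for illustration and are not taken from the real dataset.

```python
import sqlite3
import pandas as pd

# ":memory:" creates a temporary database that lives only in RAM
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fires (fire_name TEXT, fire_year INTEGER)")
conn.executemany("INSERT INTO fires VALUES (?, ?)",
                 [("Alpha", 2005), ("Bravo", 2010), ("Charlie", 2010)])
conn.commit()

# Any SELECT statement works; the result comes back as a DataFrame
recent = pd.read_sql_query(
    "SELECT fire_name FROM fires WHERE fire_year = 2010 ORDER BY fire_name",
    conn)
print(recent)
conn.close()
```

Filtering in the SQL statement, as here, avoids loading rows into memory that you do not need.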

Writing data to a file is usually much simpler than reading it, because pandas takes care of the whole process for you.

Writing data to a CSV file:

wine_reviews.head().to_csv("wine_reviews.csv")
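By default to_csv writes the row index as the first column. The sketch below, on a small made-up DataFrame, writes to an in-memory buffer rather than a file, reads it back with index_col=0, and also shows index=False for leaving the index out.

```python
import io
import pandas as pd

df = pd.DataFrame({'points': [87, 90]}, index=['a', 'b'])

# Write to an in-memory buffer (stand-in for a file path), then read it back
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.index.tolist(), restored['points'].tolist())

# With no path, to_csv returns the CSV text; index=False omits the index column
no_index = df.to_csv(index=False)
print(no_index.splitlines())
```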

Writing data to an Excel file:

wic.to_excel('wic.xlsx', sheet_name='Total Women')

Writing data to a SQLite database file:

conn = sqlite3.connect("fires.sqlite")
fires.head(10).to_sql("fires", conn)
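One thing to be aware of: to_sql raises an error by default if the target table already exists, so running the write twice fails. The sketch below uses an in-memory database and made-up rows to show the if_exists parameter; index=False is used here so the row index is not stored as an extra column.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({'fire_name': ['Alpha', 'Bravo'], 'fire_year': [2005, 2010]})

# First write creates the table
df.to_sql("fires", conn, index=False)

# The default if_exists="fail" would raise here; "replace" overwrites the table
df.to_sql("fires", conn, index=False, if_exists="replace")

restored = pd.read_sql_query("SELECT * FROM fires", conn)
print(restored)
conn.close()
```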