Pandas系列教程(二): 索引，选择，赋值

import pandas as pd
reviews = pd.read_csv("data/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

能够选择一张表中的某些数据是一个极其重要的操作，如果你连这个都不会的话，后面的处理根本就不可能实现了。

Python语言自带的索引，切片方法

原生的Python提供了很多优秀的方法来索引数据，pandas继承了这些方法，考虑这个DataFrame

reviews

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

129971 rows × 13 columns

在Python中，我们可以通过将其作为属性访问来访问对象的属性。例如，book对象可能有title属性，我们可以通过调用book.title来访问它。 pandas DataFrame中的列以相同的方式工作。

因此，要访问我们评论的国家/地区属性，我们可以使用：

reviews.country

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

如果我们在Python中有一个dict对象，我们可以使用索引（[]）运算符访问它的值。同样，我们可以对pandas DataFrame列执行相同的操作。它“正常”：

reviews['country']

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

这是从pandas DataFrame中选择特定列的两种方法。它们中的任何一个没有说谁比谁更好，操作 []确实具有可以处理其中包含保留字符的列名的优点（例如，如果我们有一个country providence列，索引中有空格或者其他字符，那么很显然，第一种方法就失效了）。

Pandas的Series看起来就像一个特殊的字典，我们可以获取其中一个明确的值：

reviews['country'][0]

'Italy'

基于索引来选择数据

上面的两种方法非常好，因为他使得操作就像在处理原生的Python一样，而pandas也有其自己特殊的方法，loc和iloc，当涉及到一些高级操作时，你可能会用到这些方法。

上面的两种方法都是得到列的值，如果我们对某一行感兴趣呢？这个时候有两种方法，一种是iloc方法，另一种方法是loc方法。loc是指location的意思，iloc中的i是指integer。这两者的区别如下：

loc works on labels in the index.
iloc works on the positions in the index (so it only takes integers).

也就是说loc是根据index来索引，比如下边的df定义了一个index，那么loc就根据这个index来索引对应的行。iloc并不是根据index来索引，而是根据行号来索引，行号从0开始，逐次加1。

reviews.iloc[0]

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

reviews.iloc[:, 0]

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

reviews.iloc[:3, 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

reviews.iloc[1:3, 0]

1    Portugal
2          US
Name: country, dtype: object

reviews.iloc[[0, 1, 2], 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

reviews.iloc[-5:]

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
129966	Germany	Notes of honeysuckle and cantaloupe sweeten th...	Brauneberger Juffer-Sonnenuhr Spätlese	90	28.0	Mosel	NaN	NaN	Anna Lee C. Iijima	NaN	Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...	Riesling	Dr. H. Thanisch (Erben Müller-Burggraef)
129967	US	Citation is given as much as a decade of bottl...	NaN	90	75.0	Oregon	Oregon	Oregon Other	Paul Gregutt	@paulgwine	Citation 2004 Pinot Noir (Oregon)	Pinot Noir	Citation
129968	France	Well-drained gravel soil gives this wine its c...	Kritt	90	30.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Gresser 2013 Kritt Gewurztraminer (Als...	Gewürztraminer	Domaine Gresser
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

根据标签来索引

不是根据数据的位置，而是根据数据的索引，举例来说，如果我们想要reviews的第一个元素：

reviews.loc[0, 'country']

'Italy'

iloc在概念上比loc更简单，因为它忽略了数据集的索引。当我们使用iloc时，我们将数据集视为一个大矩阵（列表列表），我们必须按位置索引。相比之下，loc使用索引中的信息来完成其工作。由于您的数据集通常具有有意义的索引，因此使用loc通常更容易。例如，这是一个使用loc更容易的操作：

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

	taster_name	taster_twitter_handle	points
0	Kerin O’Keefe	@kerinokeefe	87
1	Roger Voss	@vossroger	87
...	...	...	...
129969	Roger Voss	@vossroger	90
129970	Roger Voss	@vossroger	90

129971 rows × 3 columns

在loc和iloc之间选择或转换时，有一个值得记住的“问题”，即两种方法使用略有不同的索引方案。

iloc使用Python stdlib索引方案，其中包含范围的第一个元素，排除最后一个元素。因此0:10将选择条目0，...，9。同时，loc包含索引。因此0:10将选择条目0，...，10。

为什么要改变？请记住，loc可以索引任何stdlib类型：例如，字符串。如果我们有一个带有索引值Apples，...，Potatoes，...的DataFrame，并且我们想要选择“苹果和土豆之间的所有字母水果选择”，那么索引df会更加方便。 loc ['Apples'：'Potatoes']比索引像df.loc ['Apples'，'Potatoet]（t在字母表中的s之后）。

当DataFrame索引是简单的数字列表时，例如，这尤其令人困惑。 0，...，1000。在这种情况下，df.iloc [0：1000]将返回1000个条目，而df.loc [0：1000]将返回1001个条目！要使用loc获取1000个元素，您需要降低一个并请求df.iloc [0：999]。

否则，使用loc的语义与iloc的语义相同。

操纵索引

基于标签的选择从索引中的标签获得其权力。关键的是，我们使用的索引不是一成不变的。我们可以以我们认为合适的任何方式操纵索引。

set_index方法可用于完成工作。以下是set_index到title字段时会发生的情况：

reviews.set_index("title")

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	variety	winery
title
Nicosia 2013 Vulkà Bianco (Etna)	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	White Blend	Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Portuguese Red	Quinta dos Avidagos
...	...	...	...	...	...	...	...	...	...	...	...	...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Pinot Gris	Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace)	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Gewürztraminer	Domaine Schoffit

129971 rows × 12 columns

如果您可以为数据集提供比当前数据集更好的索引，则执行set_index非常有用。

条件选择

到目前为止，我们一直使用DataFrame本身的结构属性索引各种数据。但是，为了对数据做有趣的事情，我们经常需要根据条件提出问题。

例如，假设我们特别关注意大利生产的优质葡萄酒。

我们可以先询问每种葡萄酒是否为意大利葡萄酒：

reviews.country == 'Italy'

0          True
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

此操作根据每条记录的国家/地区生成一系列真/假布尔值。然后可以在loc内部使用此结果来选择相关数据：

reviews.loc[reviews.country == 'Italy']

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
6	Italy	Here's a bright, informal red that opens with ...	Belsito	87	16.0	Sicily & Sardinia	Vittoria	NaN	Kerin O’Keefe	@kerinokeefe	Terre di Giurfo 2013 Belsito Frappato (Vittoria)	Frappato	Terre di Giurfo
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129961	Italy	Intense aromas of wild cherry, baking spice, t...	NaN	90	30.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	COS 2013 Frappato (Sicilia)	Frappato	COS
129962	Italy	Blackberry, cassis, grilled herb and toasted a...	Sàgana Tenuta San Giacomo	90	40.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...	Nero d'Avola	Cusumano

19540 rows × 13 columns

这个DataFrame有大约20,000行。原来有~130,000。这意味着大约15％的葡萄酒来自意大利。

我们还想知道哪些比平均水平更好。葡萄酒的评分为80至100分，因此这可能意味着葡萄酒至少累积90分。

我们可以使用＆符号（＆）将两个问题放在一起：

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
120	Italy	Slightly backward, particularly given the vint...	Bricco Rocche Prapó	92	70.0	Piedmont	Barolo	NaN	NaN	NaN	Ceretto 2003 Bricco Rocche Prapó (Barolo)	Nebbiolo	Ceretto
130	Italy	At the first it was quite muted and subdued, b...	Bricco Rocche Brunate	91	70.0	Piedmont	Barolo	NaN	NaN	NaN	Ceretto 2003 Bricco Rocche Brunate (Barolo)	Nebbiolo	Ceretto
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129961	Italy	Intense aromas of wild cherry, baking spice, t...	NaN	90	30.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	COS 2013 Frappato (Sicilia)	Frappato	COS
129962	Italy	Blackberry, cassis, grilled herb and toasted a...	Sàgana Tenuta San Giacomo	90	40.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...	Nero d'Avola	Cusumano

6648 rows × 13 columns

假设我们将购买任何在意大利制造或评级高于平均水平的葡萄酒。为此，我们使用管道（|）：

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
6	Italy	Here's a bright, informal red that opens with ...	Belsito	87	16.0	Sicily & Sardinia	Vittoria	NaN	Kerin O’Keefe	@kerinokeefe	Terre di Giurfo 2013 Belsito Frappato (Vittoria)	Frappato	Terre di Giurfo
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

61937 rows × 13 columns

pandas附带了一些预先建立的条件选择器，其中两个我们将在这里重点介绍。第一个是isin。 isin允许您选择值“在...中”的数据列表。例如，以下是我们如何使用它来选择仅来自意大利或法国的葡萄酒：

reviews.loc[reviews.country.isin(['Italy', 'France'])]

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
6	Italy	Here's a bright, informal red that opens with ...	Belsito	87	16.0	Sicily & Sardinia	Vittoria	NaN	Kerin O’Keefe	@kerinokeefe	Terre di Giurfo 2013 Belsito Frappato (Vittoria)	Frappato	Terre di Giurfo
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

41633 rows × 13 columns

第二个是isnull（和它的伴侣notnull）。这些方法可以突出显示非空（NaN）的值。例如，要过滤掉数据集中缺少价格标签的葡萄酒，我们将采取以下措施：

reviews.loc[reviews.price.notnull()]

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

120975 rows × 13 columns

赋值

另一方面，将数据分配给DataFrame很容易。您可以指定一个常量值：

reviews['critic'] = 'everyone'
reviews['critic']

0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

或者使用可迭代的值：

reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']

0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64

Pandas系列教程(二): 索引，选择， 赋值