Pandas系列教程(二): 索引,选择, 赋值
1 | import pandas as pd |
能够选择一张表中的某些数据是一个极其重要的操作,如果你连这个都不会的话,后面的处理根本就不可能实现了。
Python语言自带的索引,切片方法
原生的Python提供了很多优秀的方法来索引数据,pandas继承了这些方法,考虑这个DataFrame
1 | reviews |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
在Python中,我们可以通过将其作为属性访问来访问对象的属性。 例如,book对象可能有title属性,我们可以通过调用book.title来访问它。 pandas DataFrame中的列以相同的方式工作。
因此,要访问我们评论的国家/地区属性,我们可以使用:
1 | reviews.country |
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
如果我们在Python中有一个dict对象,我们可以使用索引([])运算符访问它的值。 同样,我们可以对pandas DataFrame列执行相同的操作。 它“正常”:
1 | reviews['country'] |
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
这是从pandas DataFrame中选择特定列的两种方法。 它们中的任何一个没有说谁比谁更好,操作 []确实具有可以处理其中包含保留字符的列名的优点(例如,如果我们有一个country providence列,索引中有空格或者其他字符,那么很显然,第一种方法就失效了)。
Pandas的Series看起来就像一个特殊的字典,我们可以获取其中一个明确的值:
1 | reviews['country'][0] |
'Italy'
基于索引来选择数据
上面的两种方法非常好,因为他使得操作就像在处理原生的Python一样,而pandas也有其自己特殊的方法,loc和iloc,当涉及到一些高级操作时,你可能会用到这些方法。
上面的两种方法都是得到列的值,如果我们对某一行感兴趣呢?这个时候有两种方法,一种是iloc方法,另一种方法是loc方法。loc是指location的意思,iloc中的i是指integer。这两者的区别如下:
- loc works on labels in the index.
- iloc works on the positions in the index (so it only takes integers).
也就是说loc是根据index来索引,比如下边的df定义了一个index,那么loc就根据这个index来索引对应的行。iloc并不是根据index来索引,而是根据行号来索引,行号从0开始,逐次加1。
1 | reviews.iloc[0] |
country Italy
description Aromas include tropical fruit, broom, brimston...
...
variety White Blend
winery Nicosia
Name: 0, Length: 13, dtype: object
1 | reviews.iloc[:, 0] |
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
1 | reviews.iloc[:3, 0] |
0 Italy
1 Portugal
2 US
Name: country, dtype: object
1 | reviews.iloc[1:3, 0] |
1 Portugal
2 US
Name: country, dtype: object
1 | reviews.iloc[[0, 1, 2], 0] |
0 Italy
1 Portugal
2 US
Name: country, dtype: object
1 | reviews.iloc[-5:] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
根据标签来索引
不是根据数据的位置,而是根据数据的索引,举例来说,如果我们想要reviews的第一个元素:
1 | reviews.loc[0, 'country'] |
'Italy'
iloc在概念上比loc更简单,因为它忽略了数据集的索引。 当我们使用iloc时,我们将数据集视为一个大矩阵(列表列表),我们必须按位置索引。 相比之下,loc使用索引中的信息来完成其工作。 由于您的数据集通常具有有意义的索引,因此使用loc通常更容易。 例如,这是一个使用loc更容易的操作:
1 | reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']] |
taster_name | taster_twitter_handle | points | |
---|---|---|---|
0 | Kerin O’Keefe | @kerinokeefe | 87 |
1 | Roger Voss | @vossroger | 87 |
... | ... | ... | ... |
129969 | Roger Voss | @vossroger | 90 |
129970 | Roger Voss | @vossroger | 90 |
129971 rows × 3 columns
在loc和iloc之间选择或转换时,有一个值得记住的“问题”,即两种方法使用略有不同的索引方案。
iloc使用Python stdlib索引方案,其中包含范围的第一个元素,排除最后一个元素。因此0:10将选择条目0,...,9。同时,loc包含索引。因此0:10将选择条目0,...,10。
为什么要改变?请记住,loc可以索引任何stdlib类型:例如,字符串。如果我们有一个带有索引值Apples,...,Potatoes,...的DataFrame,并且我们想要选择“苹果和土豆之间的所有字母水果选择”,那么索引df会更加方便。 loc ['Apples':'Potatoes']比索引像df.loc ['Apples','Potatoet](t在字母表中的s之后)。
当DataFrame索引是简单的数字列表时,例如,这尤其令人困惑。 0,...,1000。在这种情况下,df.iloc [0:1000]将返回1000个条目,而df.loc [0:1000]将返回1001个条目!要使用loc获取1000个元素,您需要降低一个并请求df.iloc [0:999]。
否则,使用loc的语义与iloc的语义相同。
操纵索引
基于标签的选择从索引中的标签获得其权力。 关键的是,我们使用的索引不是一成不变的。 我们可以以我们认为合适的任何方式操纵索引。
set_index方法可用于完成工作。 以下是set_index到title字段时会发生的情况:
1 | reviews.set_index("title") |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
title | ||||||||||||
Nicosia 2013 Vulkà Bianco (Etna) | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | White Blend | Nicosia |
Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Pinot Gris | Domaine Marcel Deiss |
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace) | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Gewürztraminer | Domaine Schoffit |
129971 rows × 12 columns
如果您可以为数据集提供比当前数据集更好的索引,则执行set_index非常有用。
条件选择
到目前为止,我们一直使用DataFrame本身的结构属性索引各种数据。 但是,为了对数据做有趣的事情,我们经常需要根据条件提出问题。
例如,假设我们特别关注意大利生产的优质葡萄酒。
我们可以先询问每种葡萄酒是否为意大利葡萄酒:
1 | reviews.country == 'Italy' |
0 True
1 False
...
129969 False
129970 False
Name: country, Length: 129971, dtype: bool
此操作根据每条记录的国家/地区生成一系列真/假布尔值。 然后可以在loc内部使用此结果来选择相关数据:
1 | reviews.loc[reviews.country == 'Italy'] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129961 | Italy | Intense aromas of wild cherry, baking spice, t... | NaN | 90 | 30.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | COS 2013 Frappato (Sicilia) | Frappato | COS |
129962 | Italy | Blackberry, cassis, grilled herb and toasted a... | Sàgana Tenuta San Giacomo | 90 | 40.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... | Nero d'Avola | Cusumano |
19540 rows × 13 columns
这个DataFrame有大约20,000行。 原来有~130,000。 这意味着大约15%的葡萄酒来自意大利。
我们还想知道哪些比平均水平更好。 葡萄酒的评分为80至100分,因此这可能意味着葡萄酒至少累积90分。
我们可以使用&符号(&)将两个问题放在一起:
1 | reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | Italy | Slightly backward, particularly given the vint... | Bricco Rocche Prapó | 92 | 70.0 | Piedmont | Barolo | NaN | NaN | NaN | Ceretto 2003 Bricco Rocche Prapó (Barolo) | Nebbiolo | Ceretto |
130 | Italy | At the first it was quite muted and subdued, b... | Bricco Rocche Brunate | 91 | 70.0 | Piedmont | Barolo | NaN | NaN | NaN | Ceretto 2003 Bricco Rocche Brunate (Barolo) | Nebbiolo | Ceretto |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129961 | Italy | Intense aromas of wild cherry, baking spice, t... | NaN | 90 | 30.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | COS 2013 Frappato (Sicilia) | Frappato | COS |
129962 | Italy | Blackberry, cassis, grilled herb and toasted a... | Sàgana Tenuta San Giacomo | 90 | 40.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... | Nero d'Avola | Cusumano |
6648 rows × 13 columns
假设我们将购买任何在意大利制造或评级高于平均水平的葡萄酒。 为此,我们使用管道(|):
1 | reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
61937 rows × 13 columns
pandas附带了一些预先建立的条件选择器,其中两个我们将在这里重点介绍。 第一个是isin。 isin允许您选择值“在...中”的数据列表。 例如,以下是我们如何使用它来选择仅来自意大利或法国的葡萄酒:
1 | reviews.loc[reviews.country.isin(['Italy', 'France'])] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
41633 rows × 13 columns
第二个是isnull(和它的伴侣notnull)。 这些方法可以突出显示非空(NaN)的值。 例如,要过滤掉数据集中缺少价格标签的葡萄酒,我们将采取以下措施:
1 | reviews.loc[reviews.price.notnull()] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
120975 rows × 13 columns
赋值
另一方面,将数据分配给DataFrame很容易。 您可以指定一个常量值:
1 | reviews['critic'] = 'everyone' |
0 everyone
1 everyone
...
129969 everyone
129970 everyone
Name: critic, Length: 129971, dtype: object
或者使用可迭代的值:
1 | reviews['index_backwards'] = range(len(reviews), 0, -1) |
0 129971
1 129970
...
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int64
- 标题: Pandas系列教程(二): 索引,选择, 赋值
- 作者: Oliver xu
- 创建于 : 2018-11-30 22:22:58
- 更新于 : 2024-11-20 21:07:04
- 链接: https://blog.oliverxu.cn/2018/11/30/Pandas系列教程-二-索引,选择,-赋值/
- 版权声明: 本文章采用 CC BY-NC-SA 4.0 进行许可。