pandas tricks that only 1% of people know, passed on to you - Python Tutorial

资源魔 2020-10-09 22:47:07

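Today the Python tutorial column introduces some pandas operations.

pandas has a very powerful feature called the accessor. You can think of it as an attribute interface that exposes additional methods. That description is still quite abstract, so let's work through some code and examples.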

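>>> import pandas as pd
>>> pd.Series._accessors
{'cat', 'str', 'dt'}

Looking up the _accessors attribute of the Series data structure gives us three objects: cat, str, and dt. (More recent pandas versions also list a sparse accessor.)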

.cat: for categorical data
.str: for string (object) data
.dt: for datetime-like data
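
As a minimal sketch of how the three accessors are reached, the snippet below builds one small Series per data type and calls one attribute or method on each. The sample values and variable names are made up for illustration; the last part shows that pandas refuses to give a .str accessor to a numeric Series.

```python
import pandas as pd

# Illustrative data: one Series per accessor
strings = pd.Series(['a', 'b'])
dates = pd.Series(pd.to_datetime(['2020-01-01', '2020-06-30']))
cats = pd.Series(['x', 'y', 'x'], dtype='category')

upper = strings.str.upper()   # string methods live under .str
years = dates.dt.year         # datetime components live under .dt
codes = cats.cat.codes        # category internals live under .cat

# An accessor is only available when the dtype matches
try:
    pd.Series([1, 2, 3]).str.upper()
    wrong_dtype_error = None
except AttributeError as exc:
    wrong_dtype_error = exc
```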

Let's look at how each of these three objects is used in turn.

Using the str accessor

Series data type: str (string)

# Define a Series
>>> addr = pd.Series([
...     'Washington, D.C. 20003',
...     'Brooklyn, NY 11211-1755',
...     'Omaha, NE 68154',
...     'Pittsburgh, PA 15211'
... ]) 

>>> addr.str.upper()
0     WASHINGTON, D.C. 20003
1    BROOKLYN, NY 11211-1755
2            OMAHA, NE 68154
3       PITTSBURGH, PA 15211
dtype: object

>>> addr.str.count(r'\d') 
0    5
1    9
2    5
3    5
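dtype: int64

Notes on the two str methods used above:

Series.str.upper: converts every string in the Series to uppercase;
Series.str.count: counts how many times a regular-expression pattern matches in each string of the Series (here r'\d', so the number of digits);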
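
To make the behaviour of Series.str.count concrete, here is a small sketch with made-up data. It assumes nothing beyond pandas itself: the pattern is treated as a regular expression, the count is computed per element, and a missing value stays missing instead of raising an error.

```python
import pandas as pd

# Two address strings plus one missing value (illustrative data)
s = pd.Series(['Omaha, NE 68154', 'Brooklyn, NY 11211-1755', None])

digits = s.str.count(r'\d')        # digit characters per string
capitals = s.str.count(r'[A-Z]')   # uppercase letters per string
```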

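It's not hard to see that this looks a lot like ordinary Python string handling. Indeed, pandas lets you work just as simply; the difference is that you operate on an entire column of strings at once. Still using the dataset above, here is another operation:

>>> regex = (r'(?P<city>[A-Za-z ]+), '      # one or more letters (or spaces)
...          r'(?P<state>[A-Z]{2}) '        # two uppercase letters
...          r'(?P<zip>\d{5}(?:-\d{4})?)')  # 5 digits plus an optional 4-digit extension
...
>>> # regex=False makes '.' a literal dot rather than a regex wildcard
>>> addr.str.replace('.', '', regex=False).str.extract(regex)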
         city state         zip
0  Washington    DC       20003
1    Brooklyn    NY  11211-1755
2       Omaha    NE       68154
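3  Pittsburgh    PA       15211

Notes on the two str methods used above:

Series.str.replace: replaces the specified substring in each string of the Series;
Series.str.extract: extracts pieces of each string using a regular expression;

This usage is a bit more involved, because it is clearly a chained call. replace first swaps "." for "", i.e. removes it; then extract applies a regular expression with three named groups (city, state, and zip) to pull out the data, turning the original Series into a DataFrame.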
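
The chained call can also be written out step by step. The sketch below uses a shortened, partly made-up address list to show two details worth knowing: the named groups become the DataFrame's column names, and a row that does not match the pattern yields missing values rather than an error.

```python
import pandas as pd

# One real-looking address and one string that cannot match (illustrative)
addr = pd.Series(['Washington, D.C. 20003', 'not an address'])

regex = (r'(?P<city>[A-Za-z ]+), '
         r'(?P<state>[A-Z]{2}) '
         r'(?P<zip>\d{5}(?:-\d{4})?)')

# Step 1: remove literal dots so that "D.C." becomes "DC"
cleaned = addr.str.replace('.', '', regex=False)

# Step 2: one column per named group, one row per input string
parts = cleaned.str.extract(regex)
```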

Of course, beyond the usage above, other commonly used methods include .rstrip, .contains, and .split. The following code lists the complete set of str attributes:

>>> [i for i in dir(pd.Series.str) if not i.startswith('_')]
['capitalize',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'slice',
 'slice_replace',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'wrap',
 'zfill']

There are a lot of attributes. If you are interested in the details of any of them, try them out for yourself.
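
As a starting point for that kind of experimentation, here is a short sketch of three of the methods mentioned earlier (rstrip, contains, split) on made-up address strings. The variable names are only for illustration.

```python
import pandas as pd

# The first string has trailing spaces on purpose (illustrative data)
addr = pd.Series(['Brooklyn, NY 11211-1755  ', 'Omaha, NE 68154'])

trimmed = addr.str.rstrip()                       # drop trailing whitespace
in_ny = trimmed.str.contains('NY', regex=False)   # boolean mask per row
pieces = trimmed.str.split(', ')                  # list of parts per row
cities = pieces.str[0]                            # first part of each list
```

The boolean mask can be used directly for filtering, for example trimmed[in_ny].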

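Using the dt accessor

Series data type: datetime

Because the data needs to be of datetime type, the example below uses pandas' date_range() to generate a set of dates and demonstrate the dt accessor.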

>>> daterng = pd.Series(pd.date_range('2017', periods=9, freq='Q'))
>>> daterng
0   2017-03-31
1   2017-06-30
2   2017-09-30
3   2017-12-31
4   2018-03-31
5   2018-06-30
6   2018-09-30
7   2018-12-31
8   2019-03-31
dtype: datetime64[ns]

>>> daterng.dt.day_name()
0      Friday
1      Friday
2    Saturday
3      Sunday
4    Saturday
5    Saturday
6      Sunday
7      Monday
8      Sunday
dtype: object

>>> # Dates in the second half of the year
>>> daterng[daterng.dt.quarter > 2]
2   2017-09-30
3   2017-12-31
6   2018-09-30
7   2018-12-31
dtype: datetime64[ns]

>>> daterng[daterng.dt.is_year_end]
3   2017-12-31
7   2018-12-31
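dtype: datetime64[ns]

Notes on the three dt attributes and methods above:

Series.dt.day_name(): returns the name of the weekday for each date;
Series.dt.quarter: returns the quarter of the year each date falls in;
Series.dt.is_year_end: indicates whether each date is the last day of the year;

The other methods are likewise transformations based on datetime, letting you view dates at a coarser or finer granularity.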
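
A few of those other dt members, sketched on three hand-picked dates (the dates and variable names are chosen only for illustration): year and month give a coarser view, strftime produces formatted labels, and is_quarter_end is a boolean check similar to is_year_end.

```python
import pandas as pd

# Three quarter-end dates (illustrative data)
dates = pd.Series(pd.to_datetime(['2017-03-31', '2017-12-31', '2018-06-30']))

years = dates.dt.year                    # coarse view: calendar year
months = dates.dt.month                  # finer view: month number
labels = dates.dt.strftime('%Y-%m')      # formatted strings
quarter_end = dates.dt.is_quarter_end    # True on the last day of a quarter
```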

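Using the cat accessor

Series data type: Category

Before discussing the cat accessor, a word about the Category data type itself, which is very powerful. Although we don't routinely process gigabytes of data in memory, we do run into cases where a few lines of code take a long time to execute. One benefit of Category data is that it can save a good deal of both time and space. Let's learn through a few examples.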

>>> colors = pd.Series([
...     'periwinkle',
...     'mint green',
...     'burnt orange',
...     'periwinkle',
...     'burnt orange',
...     'rose',
...     'rose',
...     'mint green',
...     'rose',
...     'navy'
... ])
...
>>> import sys
>>> colors.apply(sys.getsizeof)
0    59
1    59
2    61
3    59
4    61
5    53
6    53
7    59
8    53
9    53
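dtype: int64

Above, we used sys.getsizeof to show the memory footprint; the numbers are bytes.
There is another way to compute memory usage, memory_usage(), which is used later.

Now let's map the unique values of colors above to a set of integers and look at the memory used again.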

>>> mapper = {v: k for k, v in enumerate(colors.unique())}
>>> mapper
{'periwinkle': 0, 'mint green': 1, 'burnt orange': 2, 'rose': 3, 'navy': 4}

>>> as_int = colors.map(mapper)
>>> as_int
0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
dtype: int64

>>> as_int.apply(sys.getsizeof)
0    24
1    28
2    28
3    24
4    28
5    28
6    28
7    28
8    28
9    28
dtype: int64

Note: the integer mapping above can also be done more simply with pd.factorize().
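
A minimal sketch of that alternative, using the same color list: pd.factorize returns the integer codes together with the unique values in order of first appearance, so no hand-built mapper dictionary is needed.

```python
import pandas as pd

colors = pd.Series([
    'periwinkle', 'mint green', 'burnt orange', 'periwinkle',
    'burnt orange', 'rose', 'rose', 'mint green', 'rose', 'navy',
])

# codes: one integer per element; uniques: the values those integers stand for
codes, uniques = pd.factorize(colors)
```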

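We find that the memory used is about half of that used with the object dtype. In fact, this is similar to how the Category data type works internally.

Difference in memory usage: the memory used by a Categorical is proportional to the number of categories plus the length of the data; in contrast, the memory used by object dtype is a constant times the length of the data.

Below is a comparison of object memory usage and category memory usage.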

>>> colors.memory_usage(index=False, deep=True)
650
>>> colors.astype('category').memory_usage(index=False, deep=True)
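495

The results above are the memory used with object and with Category. The savings are not as good as we might have imagined. Note, however, that Category memory scales as described: when the dataset is large but the number of unique category values is small, Category can cut memory usage by a factor of 10 or more, as in the following case with more data: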

>>> manycolors = colors.repeat(10)
>>> len(manycolors) / manycolors.nunique() 
20.0

>>> manycolors.memory_usage(index=False, deep=True)
6500
>>> manycolors.astype('category').memory_usage(index=False, deep=True)
585

As you can see, after the amount of data grows tenfold, Category uses more than 10 times less memory than object.
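
To see how the saving changes with data size, the sketch below repeats a small set of colors a varying number of times and compares the two memory figures. The helper name savings_ratio and the repeat counts are assumptions for illustration, and the exact byte counts depend on the Python and pandas versions, so only the trend is meaningful.

```python
import pandas as pd

# Five unique values stored as plain Python strings (object dtype)
colors = pd.Series(
    ['periwinkle', 'mint green', 'burnt orange', 'rose', 'navy'],
    dtype=object,
)

def savings_ratio(repeats):
    """Return object memory divided by category memory (hypothetical helper)."""
    many = colors.repeat(repeats)
    as_object = many.memory_usage(index=False, deep=True)
    as_category = many.astype('category').memory_usage(index=False, deep=True)
    return as_object / as_category

# The more the same few values repeat, the larger the ratio becomes
ratios = {n: savings_ratio(n) for n in (1, 10, 100, 1000)}
```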

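Besides the memory savings, another benefit is a significant gain in computational efficiency. For a Series of Category type, str operations are carried out on the unique values in .cat.categories rather than on every element of the original Series. In other words, each unique value is processed only once, and the result is then mapped back onto all elements that share that value.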
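
The same idea can be written out explicitly. This is a sketch with made-up data rather than a description of pandas internals: the string method is applied to the two unique categories only, and rename_categories with a callable carries the result over to every element while keeping the category dtype.

```python
import pandas as pd

colors = pd.Series(['rose', 'navy', 'rose', 'rose', 'navy'])
ccolors = colors.astype('category')

# Operate on the unique values only (two strings instead of five)
upper_categories = ccolors.cat.categories.str.upper()

# Apply the function once per category; the result is still categorical
renamed = ccolors.cat.rename_categories(str.upper)

# Calling .str directly on the categorical Series gives the same values
direct = ccolors.str.upper()
```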

For the Category data type, you can use the cat accessor, along with its attributes and methods, to work with Category data.

>>> ccolors = colors.astype('category')
>>> ccolors.cat.categories
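Index(['burnt orange', 'mint green', 'navy', 'periwinkle', 'rose'], dtype='object')

In fact, for the integer mapping at the beginning, we can first reorder with reorder_categories and then use cat.codes to obtain the integer mapping, achieving the same result.

>>> # list(mapper) passes the keys explicitly, in their original order
>>> ccolors.cat.reorder_categories(list(mapper)).cat.codes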
0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
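dtype: int8

The dtype is NumPy's int8 (-128 to 127). As you can see, a single byte per value is enough to hold all the codes in memory. Our initial approach used int64 by default, whereas pandas is smart enough to store the Category codes in the smallest suitable type.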
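
A quick sketch of that size selection, using made-up data: with only a handful of categories the codes fit in int8, while 200 distinct categories no longer fit in the int8 range and pandas moves up to int16.

```python
import numpy as np
import pandas as pd

few = pd.Series(['a', 'b', 'c']).astype('category')   # 3 categories
many = pd.Series(range(200)).astype('category')       # 200 categories

# Value range that a single signed byte can represent
int8_range = (np.iinfo(np.int8).min, np.iinfo(np.int8).max)

few_dtype = few.cat.codes.dtype
many_dtype = many.cat.codes.dtype
```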

Let's see what other attributes and methods cat offers. The cat attributes below are basically all about inspecting and manipulating Category data.

>>> [i for i in dir(ccolors.cat) if not i.startswith('_')]
['add_categories',
 'as_ordered',
 'as_unordered',
 'categories',
 'codes',
 'ordered',
 'remove_categories',
 'remove_unused_categories',
 'rename_categories',
 'reorder_categories',
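 'set_categories']

However, Category data is not very flexible. For example, to insert a value that was not present before, you first have to add it to the .categories container and only then assign the value.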

>>> ccolors.iloc[5] = 'a new color'
# ...
ValueError: Cannot setitem on a Categorical with a new category,
set the categories first

>>> ccolors = ccolors.cat.add_categories(['a new color'])
>>> ccolors.iloc[5] = 'a new color'  
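
The two steps can be wrapped in a small helper. The function name assign_category_value is hypothetical, not part of pandas, and this is only one possible way to do it: the helper works on a copy, registers the category when it is missing, and then assigns the value.

```python
import pandas as pd

def assign_category_value(series, position, value):
    """Return a copy of a categorical Series with `value` set at `position`.

    Hypothetical helper: adds `value` as a new category first if needed.
    """
    result = series.copy()
    if value not in result.cat.categories:
        result = result.cat.add_categories([value])
    result.iloc[position] = value
    return result

ccolors = pd.Series(['rose', 'navy', 'rose'], dtype='category')
updated = assign_category_value(ccolors, 1, 'a new color')
```

Because the helper returns a new Series, the original ccolors keeps its categories unchanged.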