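Pandas GroupBy Examples
Any groupby operation involves the following steps on the original object:
Splitting the object
Applying a function
Combining the results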
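The three steps can be sketched by hand to show what groupby automates. This is a minimal illustration on hypothetical toy data (the `key` and `value` columns are assumptions, not part of the tutorial's dataset).

```python
import pandas as pd

# Hypothetical toy data, not taken from the tutorial's dataset.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# 1. Split: build one sub-frame per distinct key.
pieces = {key: df[df["key"] == key] for key in df["key"].unique()}

# 2. Apply: compute a summary (here, the sum) on each piece.
sums = {key: piece["value"].sum() for key, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name="value").sort_index()

# groupby performs the same three steps in a single call.
automatic = df.groupby("key")["value"].sum()

print(manual)
print(automatic)
```

Both results hold the same per-group sums; the groupby version also labels the index with the key's name.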
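In many situations, we split the data into groups and then apply some function to each subset. In the apply step, we can perform the following operations:
Aggregation - computing summary statistics
Transformation - performing a group-specific operation
Filtration - discarding data based on some condition
Let us now create a DataFrame object and perform all of these operations on it.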
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df)
The output is as follows (older pandas versions sort the columns alphabetically, as shown here; newer versions keep the dictionary's insertion order):
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
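A pandas object can be split into groups in several ways, for example:
obj.groupby('key')
obj.groupby(['key1', 'key2'])
obj.groupby(key, axis=1)  # group along columns; the axis argument is deprecated in recent pandas releases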
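As a rough sketch of the first two forms, plus grouping by a computed Series, the snippet below uses a four-row subset of the tutorial's data. The variable names and the even-year key are illustrative assumptions.

```python
import pandas as pd

# A small subset of the tutorial's data, used here only for illustration.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Kings"],
    "Year": [2014, 2015, 2014, 2014],
    "Points": [876, 789, 863, 741],
})

by_team = df.groupby("Team")                   # a single column as key
by_team_year = df.groupby(["Team", "Year"])    # several columns as key
by_even_year = df.groupby(df["Year"] % 2 == 0) # a computed boolean Series as key

# ngroups reports how many groups each split produced.
print(by_team.ngroups, by_team_year.ngroups, by_even_year.ngroups)
```

No computation happens on the data until an aggregation, transformation, or filter is applied to the resulting groupby objects.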
Now let us see how to apply grouping to the DataFrame object.
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team'))
This prints the DataFrameGroupBy object itself (its representation, including a memory address that varies between runs), not the grouped data. To view the groups, use the groups attribute:
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team').groups)
The output is as follows (from an older pandas version; newer versions display Index rather than Int64Index):
{'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Devils': Int64Index([2, 3], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}
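Note that 'Kings' and 'kings' land in separate groups because group keys are compared case-sensitively. If that split is unintended (an assumption; the tutorial does not say either way), one possible approach is to group by a normalized copy of the column, sketched here with `str.title()`:

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings",
             "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
})

# Group by a case-normalized Series instead of the raw column.
# str.title() turns 'kings' into 'Kings'; the original column is left unchanged.
normalized = df["Team"].str.title()
sizes = df.groupby(normalized).size()

print(sizes)
```

With the normalized key, the single 'kings' row is counted together with the three 'Kings' rows.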
Group by multiple columns
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df.groupby(['Team', 'Year']).groups)
The output is as follows:
{('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64')}
With the groupby object in hand, we can iterate over it in a way similar to itertools.groupby: each iteration yields the group name and the rows belonging to that group.
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')

for name, group in grouped:
    print(name)
    print(group)
The output is as follows:
2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017
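Because each iteration yields a (name, sub-frame) pair, the loop can also collect per-group results rather than print them. A small sketch, using only the Year and Points columns of the dataset (the `totals` dictionary is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Collect the total points per year while iterating over the groups.
totals = {}
for year, group in df.groupby("Year"):
    # `group` is an ordinary DataFrame holding only that year's rows.
    totals[int(year)] = int(group["Points"].sum())

print(totals)
```

In practice `df.groupby("Year")["Points"].sum()` gives the same totals without an explicit loop; iteration is mostly useful when each group needs custom handling.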
By default, the label name of the groupby object is the same as the group name.
Using the get_group() method, we can select a group.
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped.get_group(2014))
The output is as follows:
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
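When the grouping uses several columns, the group name passed to get_group() is a tuple with one value per key. The sketch below uses a four-row subset of the data and checks for a group's existence through the groups mapping before selecting it.

```python
import pandas as pd

# Four rows from the tutorial's dataset.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Year": [2014, 2015, 2014, 2015],
    "Points": [876, 789, 863, 673],
})

grouped = df.groupby(["Team", "Year"])

# With two grouping keys, each group is identified by a (Team, Year) tuple.
riders_2015 = grouped.get_group(("Riders", 2015))
print(riders_2015)

# get_group raises KeyError for an unknown name, so test membership first.
has_kings_2014 = ("Kings", 2014) in grouped.groups
print(has_kings_2014)
```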
Aggregation functions return an aggregated value for each group. Once a group object is created, several aggregation operations can be performed on the grouped data.
An obvious approach is aggregation through the aggregate method or its equivalent, agg.
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
# the string 'mean' selects the built-in mean; np.mean gives the same result
# but triggers a FutureWarning in recent pandas versions
print(grouped['Points'].agg('mean'))
The output is as follows:
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
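For common statistics, the groupby object also exposes direct methods such as mean(), which give the same result as the corresponding agg call. A brief sketch on a four-row subset of the data:

```python
import pandas as pd

# Four rows from the tutorial's dataset.
df = pd.DataFrame({
    "Year": [2014, 2015, 2014, 2015],
    "Points": [876, 789, 863, 673],
})

grouped = df.groupby("Year")["Points"]

via_agg = grouped.agg("mean")   # aggregation by function name
via_method = grouped.mean()     # equivalent shortcut method

# Passing several names computes several statistics at once.
summary = grouped.agg(["min", "max", "mean"])

print(via_agg.equals(via_method))
print(summary)
```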
Another way to view the size of each group is to apply the size() function.
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
# np.size counts the rows in each group, column by column
print(grouped.agg(np.size))
The output is as follows:
        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
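The groupby object's own size() method returns one row count per group as a Series. It differs from count(), which skips missing values. The sketch below uses a hypothetical missing Points entry (not present in the tutorial's data) to show the difference.

```python
import pandas as pd

# Hypothetical data: the second Riders row has a missing Points value.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils"],
    "Points": [876, None, 863],
})

grouped = df.groupby("Team")

sizes = grouped.size()              # rows per group, missing values included
counts = grouped["Points"].count()  # non-missing Points values per group

print(sizes)
print(counts)
```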
With a grouped Series, you can also pass a list or dictionary of functions to aggregate with, and get a DataFrame as output:
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
# function names as strings; 'std' is the sample standard deviation (ddof=1)
print(grouped['Points'].agg(['sum', 'mean', 'std']))
The output is as follows:
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
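On a grouped DataFrame, a dictionary can map each column to its own aggregation. A small sketch on a four-row subset of the data; the choice of sum for Points and min for Rank is only illustrative.

```python
import pandas as pd

# Four rows from the tutorial's dataset.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Rank": [1, 2, 2, 3],
    "Points": [876, 789, 863, 673],
})

# Each dictionary key is a column name; each value names the function to apply.
per_column = df.groupby("Team").agg({"Points": "sum", "Rank": "min"})

print(per_column)
```

The result has one row per team and one column per dictionary entry.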
A transformation on a group or a column returns an object that is indexed the same as the object being grouped, so it has the same size. The transform function should therefore return a result that is the same size as the group chunk.
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
# standardize each column within its group, then scale by 10
score = lambda x: (x - x.mean()) / x.std() * 10
print(grouped.transform(score))
The output is as follows (row 5 is NaN because the 'kings' group has a single row, so its standard deviation is undefined):
       Points       Rank       Year
0   12.843272 -15.000000 -11.618950
1    3.020286   5.000000  -3.872983
2    7.071068  -7.071068  -7.071068
3   -7.071068   7.071068   7.071068
4   -8.608621  11.547005 -10.910895
5         NaN        NaN        NaN
6   -2.360428  -5.773503   2.182179
7   10.969049  -5.773503   8.728716
8   -7.705963   5.000000   3.872983
9   -7.071068   7.071068  -7.071068
10   7.071068  -7.071068   7.071068
11  -8.157595   5.000000  11.618950
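Because the transformed result keeps the original index, it can be assigned straight back as a new column. A minimal sketch on a four-row subset of the data, where `TeamMean` and `Diff` are illustrative column names:

```python
import pandas as pd

# Four rows from the tutorial's dataset.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Points": [876, 789, 863, 673],
})

# transform('mean') broadcasts each group's mean back to every row of that group.
df["TeamMean"] = df.groupby("Team")["Points"].transform("mean")

# The new column aligns row by row, so ordinary arithmetic works.
df["Diff"] = df["Points"] - df["TeamMean"]

print(df)
```

An aggregation would return one row per team instead, so it could not be assigned to the original frame this way.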
Filtration filters the data on a defined condition and returns a subset of the data. The filter() function is used to filter the data; the example below keeps only the teams that appear three or more times.
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                     'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# keep every row whose team has at least three rows
print(df.groupby('Team').filter(lambda x: len(x) >= 3))
The output is as follows:
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017
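The filter condition can use any per-group statistic, not only the group size. As a sketch on a four-row subset of the data, the following keeps teams whose mean score exceeds a hypothetical threshold of 800:

```python
import pandas as pd

# Four rows from the tutorial's dataset.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Points": [876, 789, 863, 673],
})

# The function receives each group as a DataFrame and must return True or False.
# The threshold of 800 is an arbitrary value chosen for illustration.
strong = df.groupby("Team").filter(lambda g: g["Points"].mean() > 800)

print(strong)
```

Riders average 832.5 and are kept; Devils average 768.0 and are dropped, including the Devils row that scored 863 on its own.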