Pandas的数据清洗
Pandas数据分析:数据清洗与处理工具 #生活技巧# #工作学习技巧# #编程语言学习路径#
http://liao.cpython.org/pandas20/
http://liao.cpython.org/pandas21/
构造一个NaN的df
val = np.arange(10,38).reshape(7,4) col = 'a b c d'.split() idx = 'this is just a fake practise today'.split() df = pd.DataFrame(val,index = idx, columns = col) df['e'] = np.nan df.at['is','a'] = 100 df.at['a','c']= 300 df.loc['d'] = np.nan df['f'] = np.nan df Out[74]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 NaN NaN just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN d NaN NaN NaN NaN NaN NaN
123456789101112131415161718192021查看缺失的数据:查看缺失数据的情况可以用pandas的isnull函数看看有那些字段上的数据缺失,可以用sum进行统计。也可使用notnull和count函数统计非NaN数据量。
df.isnull() Out[76]: a b c d e f this False False False False True True is False False False False True True just False False False False True True a False False False False True True fake False False False False True True practise False False False False True True today False False False False True True d True True True True True True 1234567891011
df.isnull().sum() Out[77]: a 1 b 1 c 1 d 1 e 8 f 8 dtype: int6412345678
df.isnull().sum().sum() Out[78]: 201
每列非空的数据个数统计:
df.count() Out[79]: a 7 b 7 c 7 d 7 e 0 f 0 dtype: int6412345678
和isnull()的反义词:
df.notnull() Out[80]: a b c d e f this True True True True False False is True True True True False False just True True True True False False a True True True True False False fake True True True True False False practise True True True True False False today True True True True False False d False False False False False False12345678910
布尔选择删除NaN: 删除NaN(主要针对列Series),在pandas里可以使用布尔选择或者dropna函数删除DataFrame的某Series列里的数据,但不会影响DataFrame本身。
构造一个df
df.at['is','e'] = 100 df.at['d','e'] = 520 a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 NaN just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN d NaN NaN NaN NaN 520.0 NaN 1234567891011
取得布尔值:
df.e.notnull() Out[88]: this False is True just False a False fake False practise False today False d True Name: e, dtype: bool df.e[df.e.notnull()] Out[89]: is 100.0 d 520.0 Name: e, dtype: float64
12345678910111213141516dropna函数删除DataFrame的某Series列里的数据,但不会影响DataFrame本身:
df.e.dropna() Out[92]: is 100.0 d 520.0 Name: e, dtype: float64 df Out[93]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 NaN just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN d NaN NaN NaN NaN 520.0 NaN
12345678910111213141516在df.dropna() 某行有任何列含有na,则这行删除,原df不变
df Out[97]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN d NaN NaN NaN NaN 520.0 NaN df.dropna() Out[96]: a b c d e f is 100.0 15.0 16.0 17.0 100.0 999.0
123456789101112131415如果仅想删除行里全都是NaN的行,可以在dropna函数里使用how= 'all’参数。
df Out[99]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN d NaN NaN NaN NaN NaN NaN df.dropna(how = 'all') Out[100]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN
123456789101112131415161718192021设定阈值删除:如果想保留含有若干非空值的行或者列可以使用thresh参数。
df Out[102]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN d NaN NaN NaN NaN NaN NaN '列上至少有7个非NaN的列留下:' df.dropna(axis = 1,thresh = 7) Out[103]: a b c d this 10.0 11.0 12.0 13.0 is 100.0 15.0 16.0 17.0 just 18.0 19.0 20.0 21.0 a 22.0 23.0 300.0 25.0 fake 26.0 27.0 28.0 29.0 practise 30.0 31.0 32.0 33.0 today 34.0 35.0 36.0 37.0 d NaN NaN NaN NaN
123456789101112131415161718192021222324行上至少有4列非空值的留下:
df.dropna(axis = 0, thresh = 4) Out[108]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 NaN NaN123456789
填充NaN数据 .fillna(0)
df.fillna(0) Out[109]: a b c d e f this 10.0 11.0 12.0 13.0 0.0 0.0 is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 0.0 0.0 a 22.0 23.0 300.0 25.0 0.0 0.0 fake 26.0 27.0 28.0 29.0 0.0 0.0 practise 30.0 31.0 32.0 33.0 0.0 0.0 today 34.0 35.0 36.0 37.0 0.0 0.0 d 0.0 0.0 0.0 0.0 0.0 0.012345678910
前向或后向数据填充:
method = ‘ffill’ : 是用每一列/行前面的值填充后面的空白
method = ‘bfill’: 是用每一列/行后面的值填充前面的空白
df Out[124]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 88.0 NaN d NaN NaN NaN NaN NaN NaN df.f.fillna(method = 'bfill') Out[125]: this 999.0 is 999.0 just NaN a NaN fake NaN practise NaN today NaN d NaN Name: f, dtype: float64 df.f.fillna(method = 'ffill') Out[126]: this NaN is 999.0 just 999.0 a 999.0 fake 999.0 practise 999.0 today 999.0 d 999.0 Name: f, dtype: float64
12345678910111213141516171819202122232425262728293031323334列和行的不同填充规则:默认是在列的方向上填充列的数据:
df Out[131]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 88.0 NaN d NaN NaN 666.0 666.0 666.0 666.0 df.fillna(method = 'ffill') Out[132]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 100.0 999.0 a 22.0 23.0 300.0 25.0 100.0 999.0 fake 26.0 27.0 28.0 29.0 100.0 999.0 practise 30.0 31.0 32.0 33.0 100.0 999.0 today 34.0 35.0 36.0 37.0 88.0 999.0 d 34.0 35.0 666.0 666.0 666.0 666.0 df.fillna(method = 'ffill',axis = 0) Out[134]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 100.0 999.0 a 22.0 23.0 300.0 25.0 100.0 999.0 fake 26.0 27.0 28.0 29.0 100.0 999.0 practise 30.0 31.0 32.0 33.0 100.0 999.0 today 34.0 35.0 36.0 37.0 88.0 999.0 d 34.0 35.0 666.0 666.0 666.0 666.0 df.fillna(method = 'ffill',axis = 1) Out[136]: a b c d e f this 10.0 11.0 12.0 13.0 13.0 13.0 is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 21.0 21.0 a 22.0 23.0 300.0 25.0 25.0 25.0 fake 26.0 27.0 28.0 29.0 29.0 29.0 practise 30.0 31.0 32.0 33.0 33.0 33.0 today 34.0 35.0 36.0 37.0 88.0 88.0 d NaN NaN 666.0 666.0 666.0 666.0
12345678910111213141516171819202122232425262728293031323334353637383940414243444546用Series填充某列某行值:可以给出一个Series数据(label和DataFrame的label有相同的)去填充DataFrame对应行(label)的数据。
df Out[137]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 NaN NaN a 22.0 23.0 300.0 25.0 NaN NaN fake 26.0 27.0 28.0 29.0 NaN NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 88.0 NaN d NaN NaN 666.0 666.0 666.0 666.0 fill = pd.Series([5,6,7],index = ['just','a','fake']) df['e'].fillna(fill,inplace = True) df Out[140]: a b c d e f this 10.0 11.0 12.0 13.0 NaN NaN is 100.0 15.0 16.0 17.0 100.0 999.0 just 18.0 19.0 20.0 21.0 5.0 NaN a 22.0 23.0 300.0 25.0 6.0 NaN fake 26.0 27.0 28.0 29.0 7.0 NaN practise 30.0 31.0 32.0 33.0 NaN NaN today 34.0 35.0 36.0 37.0 88.0 NaN d NaN NaN 666.0 666.0 666.0 666.0
12345678910111213141516171819202122232425网址:Pandas的数据清洗 https://www.yuejiaxmz.com/news/view/529419
相关内容
Python pandas 数据清洗(二)利用Python进行数据分析——Pandas(2)
Pandas数据清洗
从0到1数据分析实战学习笔记(二)数据清洗
大数据清洗随手记(一)
数据清洗实战:工业生产数据的深入探讨
Pandas 数据处理(一)缺失值处理
大数据环境下的高效数据清洗策略
数据清洗:最佳实践与工具推荐
数据挖掘:数据清洗——数据噪声处理