博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
机器学习基础 --- pandas的基本使用
阅读量:6808 次
发布时间:2019-06-26

本文共 17427 字,大约阅读时间需要 58 分钟。

一、pandas的简介

  Python Data Analysis Library 或 pandas 是基于NumPy 的一种工具,该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。

pandas的数据结构:

  Series:一维数组,与Numpy中的一维array类似。二者与Python基本的数据结构List也很相近,其区别是:List中的元素可以是不同的数据类型,而Array和Series中则只允许存储相同的数据类型,这样可以更有效的使用内存,提高运算效率。

  Time- Series:以时间为索引的Series。

  DataFrame:二维的表格型数据结构。很多功能与R中的data.frame类似。可以将DataFrame理解为Series的容器。以下的内容主要以DataFrame为主。

  Panel :三维的数组,可以理解为DataFrame的容器。

  本文主要介绍DateFrame和Series,其中DataFrame充电介绍。

  本文中用到的数据文件地址:

  本文只是结合实例介绍pandas的基本使用,若要详细深入学习,请参阅。

二、pandas中的DateFrame

  使用pandas我们可以很方便的对二维表结构进行一些常规操作。 

1. 使用pandas读取csv(或excel等)文件

import pandasfood_info = pandas.read_csv("food_info.csv")          # 读取csv文件# 读取Excel文件使用pandas.read_excel()即可print(type(food_info))           # food_info为一个DataFrame对象print(food_info.dtypes)          # 各项数据的类型
NDB_No int64Shrt_Desc objectWater_(g) float64Energ_Kcal int64Protein_(g) float64Lipid_Tot_(g) float64Ash_(g) float64Carbohydrt_(g) float64Fiber_TD_(g) float64Sugar_Tot_(g) float64Calcium_(mg) float64Iron_(mg) float64Magnesium_(mg) float64Phosphorus_(mg) float64Potassium_(mg) float64Sodium_(mg) float64Zinc_(mg) float64Copper_(mg) float64Manganese_(mg) float64Selenium_(mcg) float64Vit_C_(mg) float64Thiamin_(mg) float64Riboflavin_(mg) float64Niacin_(mg) float64Vit_B6_(mg) float64Vit_B12_(mcg) float64Vit_A_IU float64Vit_A_RAE float64Vit_E_(mg) float64Vit_D_mcg float64Vit_D_IU float64Vit_K_(mcg) float64FA_Sat_(g) float64FA_Mono_(g) float64FA_Poly_(g) float64Cholestrl_(mg) float64dtype: object
输出
 

2.  获取数据

food_info.head(10)     # 获取前10行数据,默认获取5行# first_rows = food_info.head()# first_rows# food_info.tail(8)     # 获取尾8行数据,默认获取5行   # print(food_info.tail())print(food_info.columns)    # 获取foodinfo的各字段名(即表头)# print(food_info.shape)    # 获取结构  比如此文件时8618行×36列
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',       'Cholestrl_(mg)'],      dtype='object')
输出1:
 
# print(food_info.loc[0])     # 获取第0行数据print(food_info.loc[6000])          # 获取第6000行数据# food_info.loc[10000] # 获取第10000行数据,超过数据文件本身长度,报错KeyError: 'the label [10000] is not in the [index]'
NDB_No                                                         18995Shrt_Desc          KELLOGG'S EGGO BISCUIT SCRAMBLERS BACON EGG & CHSWater_(g)                                                       42.9Energ_Kcal                                                       258Protein_(g)                                                      8.8Lipid_Tot_(g)                                                    7.9Ash_(g)                                                          NaNCarbohydrt_(g)                                                  38.3Fiber_TD_(g)                                                     2.1Sugar_Tot_(g)                                                    4.7Calcium_(mg)                                                     124Iron_(mg)                                                        2.7Magnesium_(mg)                                                    14Phosphorus_(mg)                                                  215Potassium_(mg)                                                   225Sodium_(mg)                                                      610Zinc_(mg)                                                        0.5Copper_(mg)                                                      NaNManganese_(mg)                                                   NaNSelenium_(mcg)                                                   NaNVit_C_(mg)                                                       NaNThiamin_(mg)                                                     0.3Riboflavin_(mg)                                                 0.26Niacin_(mg)                                                      2.4Vit_B6_(mg)                                                     0.02Vit_B12_(mcg)                                                    0.1Vit_A_IU                                                         NaNVit_A_RAE                                                        NaNVit_E_(mg)                                                         0Vit_D_mcg                                                          0Vit_D_IU                                                           0Vit_K_(mcg)                                                      NaNFA_Sat_(g)                                                       4.1FA_Mono_(g)                                                      1.5FA_Poly_(g)                                                      1.1Cholestrl_(mg)                                                    27Name: 6000, dtype: object
输出2
# food_info.loc[3:6]        # 获取第3到6行数据two_five_ten = [2,5,10]     print(food_info.loc[two_five_ten])   # 获取第2,5,10数据
NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \2     1003  BUTTER OIL ANHYDROUS       0.24         876         0.28   5     1006           CHEESE BRIE      48.42         334        20.75   10    1011          CHEESE COLBY      38.20         394        23.76       Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \2           99.48     0.00            0.00           0.0           0.00   5           27.68     2.70            0.45           0.0           0.45   10          32.11     3.36            2.57           0.0           0.52            ...        Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  \2        ...          3069.0      840.0        2.80        1.8      73.0   5        ...           592.0      174.0        0.24        0.5      20.0   10       ...           994.0      264.0        0.28        0.6      24.0       Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)  2           8.6      61.924       28.732        3.694           256.0  5           2.3      17.410        8.013        0.826           100.0  10          2.7      20.218        9.280        0.953            95.0
输出3
 
# food_info['Shrt_Desc']     # 获取字段名为'Shrt_Desc'的这一列ndb_col = food_info['NDB_No']    # 获取字段名为'NDB_No'的这一列# print(ndb_col)col_name = 'Shrt_Desc'print(food_info[col_name])
0                                        BUTTER WITH SALT1                                BUTTER WHIPPED WITH SALT2                                    BUTTER OIL ANHYDROUS3                                             CHEESE BLUE4                                            CHEESE BRICK5                                             CHEESE BRIE6                                        CHEESE CAMEMBERT7                                          CHEESE CARAWAY8                                          CHEESE CHEDDAR9                                         CHEESE CHESHIRE10                                           CHEESE COLBY11                    CHEESE COTTAGE CRMD LRG OR SML CURD12                            CHEESE COTTAGE CRMD W/FRUIT13       CHEESE COTTAGE NONFAT UNCRMD DRY LRG OR SML CURD14                       CHEESE COTTAGE LOWFAT 2% MILKFAT15                       CHEESE COTTAGE LOWFAT 1% MILKFAT16                                           CHEESE CREAM17                                            CHEESE EDAM18                                            CHEESE FETA19                                         CHEESE FONTINA20                                         CHEESE GJETOST21                                           CHEESE GOUDA22                                         CHEESE GRUYERE23                                       CHEESE LIMBURGER24                                        CHEESE MONTEREY25                             CHEESE MOZZARELLA WHL MILK26                    CHEESE MOZZARELLA WHL MILK LO MOIST27                       CHEESE MOZZARELLA PART SKIM MILK28                   CHEESE MOZZARELLA LO MOIST PART-SKIM29                                        CHEESE MUENSTER                              ...                        8588           BABYFOOD CRL RICE W/ PEARS & APPL DRY INST8589                       BABYFOOD BANANA NO TAPIOCA STR8590                       BABYFOOD BANANA APPL DSSRT STR8591         SNACKS TORTILLA CHIPS LT (BAKED W/ LESS OIL)8592    CEREALS RTE POST HONEY BUNCHES OF OATS HONEY RSTD8593                           POPCORN MICROWAVE LOFAT&NA8594                         BABYFOOD FRUIT SUPREME DSSRT8595                                 CHEESE SWISS LOW FAT8596               BREAKFAST BAR CORN FLAKE CRUST W/FRUIT8597                              CHEESE MOZZARELLA LO NA8598                             MAYONNAISE DRSNG NO CHOL8599                            OIL CORN PEANUT AND OLIVE8600                     SWEETENERS TABLETOP FRUCTOSE LIQ8601                                CHEESE FOOD IMITATION8602                                  CELERY FLAKES DRIED8603             PUDDINGS CHOC FLAVOR LO CAL INST DRY MIX8604                      BABYFOOD GRAPE JUC NO SUGAR CND8605                     JELLIES RED SUGAR HOME PRESERVED8606                           PIE FILLINGS BLUEBERRY CND8607                 COCKTAIL MIX NON-ALCOHOLIC CONCD FRZ8608              PUDDINGS CHOC FLAVOR LO CAL REG DRY MIX8609    PUDDINGS ALL FLAVORS XCPT CHOC LO CAL REG DRY MIX8610    PUDDINGS ALL FLAVORS XCPT CHOC LO CAL INST DRY...8611                                   VITAL WHEAT GLUTEN8612                                        FROG LEGS RAW8613                                      MACKEREL SALTED8614                           SCALLOP (BAY&SEA) CKD STMD8615                                           SYRUP CANE8616                                            SNAIL RAW8617                                     TURTLE GREEN RAWName: Shrt_Desc, Length: 8618, dtype: object
输出4
 
columns = ['Water_(g)', 'Shrt_Desc']   zinc_copper = food_info[columns]      # 获取字段名为'Water_(g)', 'Shrt_Desc'的这两列print(zinc_copper) # 获取以"(mg)"结尾的各列数据col_names = food_info.columns.tolist()# print(col_names)milligram_columns = []for items in col_names:    if items.endswith("(mg)"):        milligram_columns.append(items)milligram_df = food_info[milligram_columns]print(milligram_df)

 

3. 对数据的简单处理:

import pandasfood_info = pandas.read_csv('food_info.csv')# food_info.head(3)# print(food_info.shape) # print(food_info['Iron_(mg)'])# Iron_(mg)这一列的单位是mg,将其转为mg,对其值除以1000div_1000 = food_info['Iron_(mg)'] / 1000# print(div_1000) # 对每行数据中的其中两列进行计算water_energy = food_info['Water_(g)'] * food_info['Energ_Kcal']  # print(food_info.shape)# DateFrame结构插入一列,字段名为'water_energy',值为water_energy的数据food_info['water_energy'] = water_energy# print(food_info[['Water_(g)', 'Energ_Kcal', 'water_energy']])# print(food_info.shape)# 求某列的最大值max_calories = food_info['Energ_Kcal'].max()# print(max_calories)# 对指定字段排序,inplace=False将排序后的结果生成一个新的DataFrame,inplace=True则在原来的基础上进行排序,默认升序排序# food_info.sort_values('Sodium_(mg)', inplace=True)# print(food_info['Sodium_(mg)'])a = food_info.sort_values('Sodium_(mg)', inplace=False, ascending=False)  # ascending=False 使用降序排序# print(food_info['Sodium_(mg)'])# print(a['Sodium_(mg)'])

 

4. 对数据的常规操作

import pandas as pdimport numpy as nptitanic_survival = pd.read_csv('titanic_train.csv')# titanic_survival.head()age = titanic_survival['Age']# print(age.loc[0:10])age_is_null = pd.isnull(age)    # 迭代判断值是否为空,结果可以作为一个索引# print(age_is_null)age_null_true = age[age_is_null]   # 获取值为空的数据集# print(age_null_true)print(len(age_null_true))     # 判断一共有多少个空数据# 求平均值,应用不为空的数据集求good_ages = age[age_is_null == False]     # 获取值不为空的数据集# print(good_ages)correct_mean_age = sum(good_ages) / len(good_ages)   # 求平均print(correct_mean_age)# 或者使用pandas内置的求均值函数,自动去除空数据correct_mean_age = age.mean()   # 求平均,将空值舍弃print(correct_mean_age)# pivot_table方法默认求平均值,如果需求是求平均aggfunc参数可以不写# index tells the method which column to group by# values is the column that we want to apply the calculation to# aggfunc specifies the calculation we want to performpassenger_surival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)  # 对index相同的分别求平均值print(passenger_surival)# 分组对多列求和# port_stats = titanic_survival.pivot_table(index="Embarked", values=['Fare', "Survived"], aggfunc=np.sum)  # ,分别对价格和存活人数求和# print(port_stats)# 丢弃空值数据drop_na_columns = titanic_survival.dropna(axis=1, inplace=False)    # axis=1,以行为判断依据,数据为空,则从Dataframe中丢弃,inplace=False返回一个新的Dataframe对象,否则对当前对象做操作# print(drop_na_columns)new_titanic_survival = titanic_survival.dropna(axis=0, subset=['Age', 'Sex'], inplace=False)  # axis=0,以列为判断依据,需要指定判断列的字段,数据为空,则从Dataframe中丢弃# print(new_titanic_survival)# 具体定位到某行某列row_index_83_age = titanic_survival.loc[83, 'Age']row_index_766_pclass = titanic_survival.loc[766, 'Pclass']print(row_index_83_age)print(row_index_766_pclass)new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)   # 每行的年龄按降序排序    print(new_titanic_survival[0:10])print('------------------------>')titanic_reindexed = new_titanic_survival.reset_index(drop=True)    # 重置每行的索引值print(titanic_reindexed[0:20])# 自定义函数,对每行或每列逐个使用def null_count(column):    column_null = pd.isnull(column)    null = column[column_null]    return len(null)column_null_count = titanic_survival.apply(null_count, axis=0)    # 通过自定义函数,统计每列为空的个数print(column_null_count)def which_class(row):    pclass = row['Pclass']     if pclass == 1:        return 'First Class'    elif pclass == 2:        return 'Second Class'    elif pclass == 3:        return 'Third Class'    else:        return 'Unknow'classes = titanic_survival.apply(which_class, axis=1)    # 通过自定义函数,替换每行的Pclass值, 注意axis=1print(classes)

 

5. 配合numpy将数据载入后进行预处理

import pandas as pdimport numpy as npfandango = pd.read_csv('fandango_score_comparison.csv')# print(type(fandango))# 返回一个新的dataframe,返回的新数据以设定的值为index,并将丢弃index值为空的数据,drop=True,丢弃为索引的列,否则不丢弃fandango_films = fandango.set_index('FILM', drop=False)# fandango_films# print(fandango_films.index)# 按索引获取数据fandango_films["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]fandango_films.loc["Avengers: Age of Ultron (2015)" : "Hot Tub Time Machine 2 (2015)"]fandango_films.loc['Southpaw (2015)']movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']fandango_films.loc[movies]# def func(coloumn):#     return np.std(coloumn)types = fandango_films.dtypes# print(types)float_columns = types[types.values == 'float64'].index  # 获取特定类型的数据的索引# print(float_columns)float_df = fandango_films[float_columns]        # 获取特定类型的数据# print(float_df.dtypes)# float_df# print(float_df)deviations = float_df.apply(lambda x: np.std(x))    # 计算每列标准差print(deviations)# print('----------------------->')# print(float_df.apply(func))# help(np.std)rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]print(rt_mt_user.apply(np.std, axis=1))   # 计算每行数据标准差# rt_mt_user.apply(np.std, axis=0)

三、DataFrame中的Series

   Series为DateFrame中一行或一列的数据结构

1. 获取一个Series对象

import pandas as pdfrom pandas import Seriesfandango = pd.read_csv('fandango_score_comparison.csv')series_film = fandango['FILM']   # 获取fandango中FILM这一列# print(type(series_film))print(series_film[0:5])series_rt = fandango['RottenTomatoes']  # 获取fandango中RottenTomatoes这一列print(series_rt[0:5])

 

2. 对Series对象的一些常规操作

file_names = series_film.values  # 获取series_film的所有值,返回值为一个
# print(type(file_names))# print(file_names)rt_scores = series_rt.values# print(rt_scores)series_custom = Series(rt_scores, index=file_names) # 构建一个新的Series, index为file_names, value为rt_scores# help(Series)print(series_custom[['Top Five (2014)', 'Night at the Museum: Secret of the Tomb (2014)']]) # 以index获取数据# print(type(series_custom))print('--------------------------------->')print(series_custom[5:10]) # 切片操作# print(series_custom[["'71 (2015)"]])original_index = series_custom.index.tolist() # 获取所有的index值并将其转为list# print(original_index)sorted_index = sorted(original_index) # 对list排序# print(sort_index)sorted_by_index = series_custom.reindex(sorted_index) # 以排过序的list重新为series_custom设置索引print(sorted_by_index)sc2 = series_custom.sort_index() # 以index按升序排序整个series_custom# print(sc2)sc3 = series_custom.sort_values(ascending=False) # 以values按降序排序整个series_customprint(sc3)import numpy as np# print(np.add(series_custom, series_custom)) #将series_custom当成一个矩阵,使用numpy进行计算print(np.sin(series_custom))print(np.max(series_custom))# series_custom > 50series_greater_than_50 = series_custom[series_custom > 50] # 获取series_custom的值大于50的数据# series_greater_than_50criteria_one = series_custom > 50criteria_two = series_custom < 75both_criteria = series_custom[criteria_one & criteria_two] # 获取series_custom的值大于50且小于75的数据print(both_criteria)rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])rt_mean = (rt_critics + rt_users) / 2 # 将rt_critics 和 rt_users的值相加除以2print(rt_mean)

 

转载地址:http://pyhwl.baihongyu.com/

你可能感兴趣的文章
CentOS 6.4下Squid代理服务器的安装与配置
查看>>
CentOS 6.5安装YouCompleteMe使用vim C/C++语法自动补全
查看>>
利用php利用root权限执行shell脚本必须进行以下几个步骤
查看>>
Storm原理及单机安装指南
查看>>
Linux一些有用的操作
查看>>
IAT 注入ImportInject(dll)
查看>>
[C++]面向对象部分——类
查看>>
Websense:别让移动设备触痛企业的安全神经
查看>>
DHCP服务的介绍及配置详解
查看>>
脚本实现为一系列账号生成随机密码
查看>>
mysql高级管理-note
查看>>
linux登录显示 Error in service module错误
查看>>
娱乐篇第十期:互联网的事情you意思(十)
查看>>
MyBatis基础:MyBatis入门(1)
查看>>
Linux文件的复制、删除和移动命令
查看>>
vmware workstation虚拟环境安装及创建虚拟机
查看>>
360搜索引擎能否给苦逼的站长们带来希望?
查看>>
yarn管理命令
查看>>
SSH KEY免密码验证
查看>>
我的友情链接
查看>>