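We use the training dataset from the Alibaba Tianchi "Imagine Computing Innovation Technology Competition, Track 1: Customer Experience Prediction Algorithm for Edge Cloud Content Delivery Networks" as an example.
File size: 1002 MB
Rows: 9,000,000
Columns: 18 (2 string, 5 integer, 11 float)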
We read the training dataset's CSV file using three methods:
Reading with the read_csv function of the pandas library (the data type of each field is specified explicitly, so that pandas does not spend time inferring types):

    import os

    import pandas as pd

    # self.path is the directory that holds the dataset (the snippet runs inside a class method).
    df = pd.read_csv(
        os.path.join(self.path, "training_dataset.csv"),
        dtype={
            "domain_name": str,
            "node_name": str,
            "avg_fbt_time": int,
            "tcp_conntime": int,
            "inner_network_rtt": int,
            "io_await_avg": int,
            "io_await_max": int,
            "synack1_ratio": float,
            "icmp_lossrate": float,
            "icmp_rtt": float,
            "ratio_499_5xx": float,
            "inner_network_droprate": float,
            "cpu_util": float,
            "mem_util": float,
            "io_util_avg": float,
            "io_util_max": float,
            "ng_traf_level": float,
            "buffer_rate": float,
        },
    )

Read time: 15.8557 seconds
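To illustrate what the dtype argument does, here is a minimal sketch on a tiny in-memory sample. The three-row data and the variable names `sample`, `inferred`, and `typed` are made up for illustration; only the column names are borrowed from the dataset. It assumes pandas is installed.

```python
import io

import pandas as pd

# Hypothetical three-row sample reusing a few column names from the dataset.
sample = io.StringIO(
    "domain_name,avg_fbt_time,cpu_util\n"
    "a.example.com,120,1\n"
    "b.example.com,95,2\n"
    "c.example.com,143,3\n"
)

# Without dtype, pandas infers cpu_util as an integer column here,
# because every value in the sample happens to be a whole number.
inferred = pd.read_csv(sample)

# With dtype, each listed column is parsed as the requested type.
sample.seek(0)
typed = pd.read_csv(
    sample,
    dtype={"domain_name": str, "avg_fbt_time": int, "cpu_util": float},
)

print(inferred.dtypes)
print(typed.dtypes)
```

Besides skipping inference, declaring the types up front keeps a column such as cpu_util as float even when a particular file happens to contain only whole numbers.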
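Reading with the reader class of the csv library:

    import os
    from csv import reader

    with open(os.path.join(self.path, "training_dataset.csv"), "r", encoding="UTF-8") as file:
        data = reader(file)
        # Iterate over every row without processing it, to measure read time only.
        for row in data:
            pass

Read time: 17.8881 seconds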
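Reading with the DictReader class of the csv library:

    import os
    from csv import DictReader

    with open(os.path.join(self.path, "training_dataset.csv"), "r", encoding="UTF-8") as file:
        data = DictReader(file)
        # Iterate over every row without processing it, to measure read time only.
        for row in data:
            pass

Read time: 30.9297 seconds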
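The original timing code is not shown, so the following is only a sketch of how the two csv-based measurements could be reproduced on a small synthetic file. The helpers `write_sample_csv`, `count_with_reader`, `count_with_dictreader`, and `timed` are hypothetical names, and the 10,000-row sample stands in for the real dataset. Absolute times will differ from the figures above.

```python
import csv
import os
import tempfile
import time


def write_sample_csv(path, n_rows=10_000):
    """Write a small synthetic CSV (a stand-in for the real training file)."""
    with open(path, "w", encoding="UTF-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["domain_name", "avg_fbt_time", "cpu_util"])
        for i in range(n_rows):
            writer.writerow([f"domain_{i % 10}", i, i / n_rows])


def count_with_reader(path):
    """Iterate with csv.reader; each row is a list of strings, header included."""
    with open(path, "r", encoding="UTF-8", newline="") as f:
        return sum(1 for _ in csv.reader(f))


def count_with_dictreader(path):
    """Iterate with csv.DictReader; each row is a dict keyed by the header."""
    with open(path, "r", encoding="UTF-8", newline="") as f:
        return sum(1 for _ in csv.DictReader(f))


def timed(func, *args):
    """Return (result, elapsed seconds) measured with time.perf_counter."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    write_sample_csv(sample_path)
    reader_rows, reader_secs = timed(count_with_reader, sample_path)
    dict_rows, dict_secs = timed(count_with_dictreader, sample_path)

print(f"csv.reader:     {reader_rows} rows in {reader_secs:.4f} s")
print(f"csv.DictReader: {dict_rows} rows in {dict_secs:.4f} s")
```

Note that csv.reader yields the header as an ordinary row, while csv.DictReader consumes it to build the keys of each row dict, so the two row counts differ by one.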
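In this test, the read_csv function of the pandas library (15.86 seconds) was slightly faster than the reader class of the csv library (17.89 seconds) and took roughly half the time of the DictReader class of the csv library (30.93 seconds).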