Published 2022-03-31 · NLP / Python · 2 min read (about 237 words)

1_Data Exploration

Step 1: take a quick look at the training and test sets

Use pyplot to plot a histogram of the sentence-length distribution.

# -*- coding:utf-8 -*-
import os
import numpy as np
import matplotlib.pyplot as plt


def length_process(data_dir):
    train_dir = data_dir
    with open(train_dir, "r", encoding="utf-8") as f:
        tmp_x = []
        for i, line in enumerate(f):
            if i == 0:
                continue  # skip the first line

            # Line format: 0,7442 27878 9601 235 4004 ， 9601 4004 ， 8194 2281 10893,5-30
            # row number, ASCII comma, clause, full-width comma, clause, ASCII comma, row number
            sent_ = line.strip().split(",")[1]
            sent_ = [item.strip() for item in sent_.split("，")]
            # Re-join the clauses and split on whitespace; split() without an
            # argument avoids counting empty strings as tokens.
            sent_ = " ".join(sent_).split()

            tmp_x.append(len(sent_))

        n, bins, patches = plt.hist(x=tmp_x, bins="auto", alpha=0.7, rwidth=0.85)
        plt.grid(axis="y", alpha=0.75)
        plt.xlabel("sentence length")
        plt.ylabel("Frequency")
        plt.title("Histogram: sentence length")
        maxfreq = n.max()
        # Set the y-axis upper limit to the next multiple of 10 above maxfreq
        plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)
        print("maxfreq = {0}".format(maxfreq))
        plt.show()


if __name__ == "__main__":
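    # Project root: two levels above this script's directory.
    # abspath guards against a relative __file__ when the script is run directly.
    rootdir = os.sep.join(os.path.dirname(os.path.abspath(__file__)).split(os.sep)[:-2])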
    train_dir = os.path.join(rootdir, "dataset/datagrand_2021_train.csv")
    test_dir = os.path.join(rootdir, "dataset/datagrand_2021_test.csv")

    length_process(train_dir)
    length_process(test_dir)
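
The parsing step inside `length_process` can be checked on its own, without opening a plot window. The following is a minimal sketch, not the author's code: `sentence_length` and `length_summary` are hypothetical helper names, and it assumes every row follows the format quoted in the comment above (text in the second comma-separated field, clauses separated by full-width commas). The nearest-rank percentile is one simple choice for summarising the lengths.

```python
def sentence_length(line):
    """Count the tokens in one CSV row shaped like the sample in the script."""
    # Assumption: the sentence is always the second ASCII-comma-separated field.
    text = line.strip().split(",")[1]
    # Full-width commas separate clauses and are not counted as tokens.
    clauses = [clause.strip() for clause in text.split("，")]
    return len(" ".join(clauses).split())


def length_summary(lengths):
    """Return min, median, 95th percentile and max using nearest-rank percentiles."""
    ordered = sorted(lengths)
    if not ordered:
        raise ValueError("no lengths to summarise")

    def nearest_rank(percent):
        # Integer arithmetic for ceil(percent / 100 * n), avoiding float rounding.
        rank = max(1, (percent * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {
        "min": ordered[0],
        "median": nearest_rank(50),
        "p95": nearest_rank(95),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    sample = "0,7442 27878 9601 235 4004 ， 9601 4004 ， 8194 2281 10893,5-30"
    print(sentence_length(sample))  # 10
    print(length_summary([3, 10, 7, 5]))
```

A summary like this complements the histogram: the 95th-percentile length is a common starting point when choosing a maximum sequence length for a model, though the right cutoff depends on the task.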

https://dustofstars.github.io/NLP/Python/1-数据探索/

Author: Gavin

Published: 2022-03-31

Updated: 2022-03-31

License: CC BY-NC-SA 4.0

#Python NLP
