Data Mining II: Automatic Summarization


astipsy, January 11, 2020, Python

Contents

Summary
Automatic Summarization
Principle
Algorithm Steps

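Summary

A summary is a short, coherent text that fully and accurately reflects the central content of a document.

Automatic Summarization

Automatic summarization uses a computer to extract a summary from the original document automatically.

Principle

The method relies on cosine similarity: each text is represented as a vector of word counts, and two texts are compared by the cosine of the angle between their vectors.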
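As a small illustration of the principle, the following sketch computes cosine similarity for plain Python lists. The function name `cosine_similarity` and the three count vectors are made up for this example and are not part of the original code. Note that scikit-learn's `pairwise_distances(..., metric="cosine")` returns the cosine distance, which is 1 minus this value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector matches nothing
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word counts over the vocabulary ["data", "mining", "summary"]
article = [2, 1, 1]
sentence_a = [1, 1, 0]
sentence_b = [0, 0, 1]

print(cosine_similarity(article, sentence_a))  # about 0.866
print(cosine_similarity(article, sentence_b))  # about 0.408
```

Here `sentence_a` shares more of the article's vocabulary than `sentence_b`, so it scores higher and would be the better summary candidate.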

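Algorithm Steps

Obtain the article to be summarized
Count the word frequencies of the article
Split the article into sentences, using Chinese punctuation marks as boundaries
Compute the cosine similarity between each sentence and the whole article
Take the sentence with the highest similarity as the summary of the article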
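The steps above can be sketched on a toy input using only the standard library. This is an illustration under stated assumptions rather than the article's implementation: `tokenize`, `cosine`, and `summarize` are hypothetical helpers, the sample text is English, and a simple regex tokenizer stands in for a real Chinese word segmenter such as jieba.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; Chinese text would need a segmenter
    return re.findall(r"\w+", text.lower())


def cosine(c1, c2):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def summarize(text):
    # Step 2: word frequencies of the whole article
    article_counts = Counter(tokenize(text))
    # Step 3: split into sentences on sentence-ending punctuation
    parts = re.split(r"[。？！.?!\n]\s*", text)
    sentences = [s.strip() for s in parts if s.strip()]
    if not sentences:
        return ""
    # Step 4: similarity between each sentence and the article
    scored = [(cosine(Counter(tokenize(s)), article_counts), s) for s in sentences]
    # Step 5: the most similar sentence is the summary
    return max(scored, key=lambda pair: pair[0])[1]


text = "Cats chase mice. Cats sleep all day. Dogs bark."
print(summarize(text))  # Cats sleep all day
```

In this toy input the second sentence wins because it shares the most words with the article as a whole. Raw counts tend to favor longer sentences, which is worth keeping in mind when reading the results.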

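To sum up: the steps are largely the same as for similar-article recommendation. The difference is the unit of comparison: article recommendation works at the level of words, whereas summarization works at the level of sentences.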

import codecs
import os

import jieba
import numpy
import pandas
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

contents = []
paths = []
clauses = []
summarys = []

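# Stop words
stopwords = pandas.read_csv(
    "./StopwordsCN.txt",
    encoding='utf8',
    index_col=False,
    quoting=3,
    sep="\t"
)

# Initialize the term-count vectorizer
# (CountVectorizer yields raw word counts, not TF-IDF weights)
count_vectorizer = CountVectorizer(
    stop_words=list(stopwords['stopword'].values),
    # min_df=1 keeps every term; recent scikit-learn versions reject an integer 0
    min_df=1, token_pattern=r"\b\w+\b"
)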

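for root, _dirs, files in os.walk("./SogouC.mini/Sample"):
    for file in files:
        # File path
        file_path = os.path.join(root, file)
        # Read the file content (the with block closes the stream)
        with codecs.open(file_path, "r", "utf-8") as f:
            file_content = f.read()

        # Candidates: the whole article first, then its sentences,
        # split on Chinese sentence-ending punctuation and newlines
        candidates = [file_content] + re.split(r'[。？！\n]\s*', file_content)

        # Keep candidates longer than 10 characters and segment them with
        # jieba, so that CountVectorizer receives space-separated words
        clause = []
        segments = []
        for ci in candidates:
            if len(ci.strip()) > 10:
                clause.append(ci)
                segments.append(" ".join(jieba.cut(ci)))

        # Nothing to compare if no sentence passed the length filter
        if len(clause) < 2:
            continue

        # Term-count vectors: row 0 is the article, the rest are sentences
        clause_tfidf = count_vectorizer.fit_transform(segments)
        # Cosine distance (1 - cosine similarity) between all pairs of rows
        distance_matrix = pairwise_distances(
            clause_tfidf,
            metric="cosine"
        )
        # Sentence indices ordered by ascending distance to the article,
        # i.e. descending similarity
        distance_matrix_index = numpy.argsort(distance_matrix[0, 1:])
        # The sentence most similar to the article becomes the summary
        summary = clause[1 + distance_matrix_index[0]]

        contents.append(file_content)
        paths.append(file_path)
        summarys.append(summary)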

data_frame_all = pandas.DataFrame({
    'path': paths,
    'content': contents,
    'summary': summarys
})
