Machine Reading Comprehension for Document-level Person Aspect Term Extraction

LIU Ziyun; ZHANG Shiqi; CHEN Wenliang

doi:10.13451/j.sxu.ns.2024026

Journal of Shanxi University (Natural Science Edition), 2025, Vol. 48, Issue 3: 470-480. DOI: 10.13451/j.sxu.ns.2024026

Information Science

School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China

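About the author:

LIU Ziyun (1999-), male, from Ulanhot, Hinggan League, Inner Mongolia; master's student; research interest: knowledge graph construction. E-mail: zyliu129@stu.suda.edu.cn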

Corresponding author:

CHEN Wenliang, E-mail: wlchen@suda.edu.cn

Machine Reading Comprehension for Document-level Person Aspect Term Extraction

Abstract

Person aspect term extraction aims to extract a person's attributes, such as gender and nationality, from a textual profile of that person. Existing methods typically train a sequence labeling model on distantly supervised data to obtain the extraction model. This approach has two weaknesses. On the data side, the annotations are inaccurate and the values of different attributes overlap. On the model side, it lacks scalability and generalizability. To address these problems, this paper recasts the task as a machine reading comprehension (MRC) problem, in which the person attribute-value table is filled in by reading the person profile. To this end, the paper constructs a document-level person attribute extraction dataset from a person encyclopedia, based on the reading comprehension framework, and adopts two baseline models: bidirectional encoder representations from transformers-machine reading comprehension (BERT-MRC) and bidirectional encoder representations from transformers-conditional random field-machine reading comprehension (BERT-CRF-MRC). The results show that BERT-CRF-MRC scores three percentage points higher than BERT-MRC in F1 on average. BERT-CRF-MRC reaches an average F1 of about 92% on short person profiles and about 75% on long person profiles. The newly constructed data and code are publicly available on GitHub.
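
The reformulation described in the abstract, filling an attribute-value table by asking one question per attribute about the same profile, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the question templates, the `toy_tagger` stand-in for a trained BERT-MRC or BERT-CRF-MRC model, the example profile, and the whitespace tokenization are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical question templates, one per attribute; the paper's real queries are not shown here.
QUESTION_TEMPLATES = {
    "gender": "What is the gender of {name}?",
    "nationality": "What is the nationality of {name}?",
}


@dataclass
class MrcExample:
    attribute: str
    question: str
    context: str


def build_examples(name, profile):
    """Create one (question, context) pair per attribute for the same profile."""
    return [
        MrcExample(attr, template.format(name=name), profile)
        for attr, template in QUESTION_TEMPLATES.items()
    ]


def decode_bio(tokens, tags):
    """Turn a B/I/O tag sequence into a list of answer strings."""
    answers, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                answers.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                answers.append(" ".join(current))
            current = []
    if current:
        answers.append(" ".join(current))
    return answers


def toy_tagger(question, tokens):
    """Stand-in for a trained model: tags tokens found in a tiny keyword lexicon."""
    lexicon = {"gender": {"male", "female"}, "nationality": {"Chinese", "French"}}
    key = "gender" if "gender" in question else "nationality"
    return ["B" if token.strip(".,") in lexicon[key] else "O" for token in tokens]


def fill_table(name, profile, tagger):
    """Fill the attribute-value table by querying the tagger once per attribute."""
    tokens = profile.split()  # simplification; Chinese text would need character-level tokens
    return {
        example.attribute: decode_bio(tokens, tagger(example.question, tokens))
        for example in build_examples(name, profile)
    }


table = fill_table("Li Ming", "Li Ming is a male Chinese physicist.", toy_tagger)
print(table)
```

Because each attribute has its own question, the values of different attributes are decoded in separate passes and cannot collide in a single tag sequence, and a new attribute only needs a new template rather than a new label set.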

Key words

aspect term extraction / machine reading comprehension (MRC) / annotated data

CLC number

TP391

Cite this article

LIU Ziyun, ZHANG Shiqi, CHEN Wenliang. Machine Reading Comprehension for Document-level Person Aspect Term Extraction[J]. Journal of Shanxi University (Natural Science Edition), 2025, 48(3): 470-480. https://doi.org/10.13451/j.sxu.ns.2024026

Funding

National Natural Science Foundation of China (62376177)

Received: 2023-10-15
Accepted: 2023-12-15
Published: 2025-05-25
Issue date: 2025-06-13
