Machine Reading Comprehension for Document-level Person Aspect Term Extraction

LIU Ziyun, ZHANG Shiqi, CHEN Wenliang

Journal of Shanxi University (Natural Science Edition), 2025, Vol. 48, Issue 3: 470-480. DOI: 10.13451/j.sxu.ns.2024026
Information Science


Abstract

Person aspect term extraction, also called person attribute extraction, aims to extract a person's attributes, such as gender and nationality, from a biographical profile. Existing methods typically train a sequence labeling model on distantly supervised data. This approach has two weaknesses: the data suffer from inaccurate annotations and from overlapping values across different attributes, and the resulting models lack scalability and generalizability. To address these problems, this paper recasts the task as a machine reading comprehension (MRC) problem, in which a person's attribute table is filled in by reading the person's profile. To this end, the paper constructs a document-level person attribute extraction dataset in the reading comprehension format from encyclopedia entries on persons, and adopts two baseline models: bidirectional encoder representations from transformers-machine reading comprehension (BERT-MRC) and bidirectional encoder representations from transformers-conditional random field-machine reading comprehension (BERT-CRF-MRC). Experiments show that BERT-CRF-MRC scores three percentage points higher in F1 than BERT-MRC on average. BERT-CRF-MRC reaches an average F1 of about 92% on short person profiles and about 75% on long person profiles. The newly constructed data and the code are publicly available on GitHub.
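The reformulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the question templates, the `MRCExample` container, the character-level BIO tagging, and the exact-match span F1 are all assumptions chosen for illustration, and the answer spans are assumed to come from manual annotation rather than from distant supervision. Each attribute in the table becomes one question over the same profile, and an attribute with no answer in the profile gets an empty span list.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical question templates, one per attribute (not taken from the paper).
QUESTION_TEMPLATES = {
    "gender": "What is the gender of {name}?",
    "nationality": "What is the nationality of {name}?",
}


@dataclass
class MRCExample:
    attribute: str
    question: str
    context: str
    answers: list[tuple[int, int]]  # character spans [start, end) in context


def build_examples(name, profile, annotations):
    """Create one MRC example per attribute from a profile.

    `annotations` maps an attribute to its annotated answer spans;
    attributes missing from it are treated as unanswerable.
    """
    return [
        MRCExample(attr, template.format(name=name), profile,
                   list(annotations.get(attr, [])))
        for attr, template in QUESTION_TEMPLATES.items()
    ]


def bio_tags(example):
    """Character-level BIO tags over the context (a label format a CRF layer could score)."""
    tags = ["O"] * len(example.context)
    for start, end in example.answers:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def decode_spans(tags):
    """Recover [start, end) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if tag != "I" and start is not None:
            spans.append((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def span_f1(predicted, gold):
    """Exact-match span F1; returns 0.0 when no predicted span matches a gold span."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, supporting a new attribute only requires adding a question template, whereas a sequence labeling model needs a new label set and retraining data for every attribute added. Because each question is tagged independently, values of different attributes that overlap in the text no longer compete for the same label.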

Keywords

aspect term extraction / machine reading comprehension (MRC) / annotated data

CLC number

TP391

Cite this article

LIU Ziyun, ZHANG Shiqi, CHEN Wenliang. Machine Reading Comprehension for Document-level Person Aspect Term Extraction[J]. Journal of Shanxi University (Natural Science Edition), 2025, 48(3): 470-480. https://doi.org/10.13451/j.sxu.ns.2024026


Funding

National Natural Science Foundation of China (62376177)
