Machine Reading Comprehension for Document-level Person Aspect Term Extraction

LIU Ziyun; ZHANG Shiqi; CHEN Wenliang

doi:10.13451/j.sxu.ns.2024026

Journal of Shanxi University (Natural Science Edition), 2025, Vol. 48, Issue 3: 470-480. DOI: 10.13451/j.sxu.ns.2024026

Information Sciences

Abstract

Person aspect term extraction aims to extract attributes of individuals, such as gender and nationality, from their descriptions. Existing methods typically train sequence labeling models on distantly supervised data. However, this approach suffers from inaccurate annotations and from overlapping values of different attributes in the data, and the resulting models lack scalability and generalizability. To address these problems, this paper recasts the task as a machine reading comprehension (MRC) problem: filling in a person attribute-value table by reading the person's profile. Within this reading comprehension framework, the paper constructs a person attribute recognition dataset from a person encyclopedia and builds two baseline models: bidirectional encoder representations from transformers-machine reading comprehension (BERT-MRC) and bidirectional encoder representations from transformers-conditional random field-machine reading comprehension (BERT-CRF-MRC). BERT-CRF-MRC outperforms BERT-MRC by three percentage points in F1 score on average, reaching an average F1 of about 92% on short-text person profiles and about 75% on long-text person profiles. The constructed data and code are publicly available on GitHub.
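The MRC reformulation described in the abstract can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the question templates, the `MRCExample` class, and the character-level BIO tagging scheme are all assumptions made for illustration. The sketch pairs one question per attribute with the same profile text, so that each attribute gets its own label sequence and values of different attributes can overlap without conflict.

```python
from dataclasses import dataclass

# Hypothetical question templates, one per attribute.
# The wording used in the paper's dataset is not given in the abstract.
QUERY_TEMPLATES = {
    "gender": "What is the gender of the person?",
    "nationality": "What is the nationality of the person?",
}


@dataclass
class MRCExample:
    attribute: str
    question: str
    context: str
    labels: list  # one BIO tag per context character


def bio_labels(context, values):
    """Tag every occurrence of each answer value with B/I; all else is O."""
    tags = ["O"] * len(context)
    for value in values:
        if not value:
            continue
        start = context.find(value)
        while start != -1:
            end = start + len(value)
            # Skip occurrences that overlap a span that is already tagged.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = context.find(value, end)
    return tags


def build_examples(profile, table):
    """Create one (question, context) instance per attribute."""
    examples = []
    for attribute, question in QUERY_TEMPLATES.items():
        values = table.get(attribute, [])
        labels = bio_labels(profile, values)
        examples.append(MRCExample(attribute, question, profile, labels))
    return examples


def decode_spans(context, tags):
    """Recover answer strings from a BIO tag sequence."""
    spans, current = [], None
    for ch, tag in zip(context, tags):
        if tag == "B":
            if current is not None:
                spans.append(current)
            current = ch
        elif tag == "I" and current is not None:
            current += ch
        else:
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans
```

In a full pipeline, each question and context pair would be encoded together by BERT, and a tagging layer (a CRF in the BERT-CRF-MRC variant) would predict the label sequence that `decode_spans` turns back into attribute values. An attribute with no value in the profile simply yields an all-`O` sequence, which leaves that table cell empty.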

Key words

aspect term extraction / MRC / annotated data

Cite this article

LIU Ziyun, ZHANG Shiqi, CHEN Wenliang. Machine Reading Comprehension for Document-level Person Aspect Term Extraction. Journal of Shanxi University (Natural Science Edition), 2025, 48(3): 470-480. https://doi.org/10.13451/j.sxu.ns.2024026

Received: 15 Oct 2023
Accepted: 15 Dec 2023
Published: 25 May 2025
Issue Date: 13 Jun 2025
