A high-quality dataset construction method for text mining in materials science

刘悦; 刘大晖; 葛献远; 杨正伟; 马舒畅; 邹喆乂; 施思齐

doi:10.7498/aps.72.20222316

Abstract
The peer-reviewed scientific literature holds a large body of historical data and empirical knowledge, stored as text, that is an important reference for materials design and development. Although text mining can explore and exploit this information efficiently, the difficulty of acquiring high-quality textual data has hindered its wider application in materials science. Herein, we analyze textual data quality issues in the materials domain, and the related research, from the dual perspectives of quality and quantity, and propose a pipeline for constructing high-quality datasets for text mining in materials science. In this pipeline, a traceable automatic literature acquisition scheme ensures that the source of the textual data can be traced. A preprocessing method driven by downstream tasks and tailored to the characteristics of materials texts then improves the quality of the pre-annotated corpora. On this basis, we define an annotation scheme, derived from the materials science tetrahedron and applicable across material systems, to complete high-quality annotation of the corpora. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to increase the quantity of materials text data. Experimental results on datasets covering different material systems show that the method effectively improves the prediction accuracy of downstream text mining models; on the named entity recognition task for NASICON-type solid electrolyte materials, the F1 score reaches 84%. This study provides theoretical guidance and solutions for the deeper application of text mining in materials science, and is expected to advance materials design and development driven bidirectionally by data and knowledge.

Keywords: text mining in materials science / data augmentation / data quality
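
The abstract reports an F1 score of 84% for named entity recognition but does not state here how the score is computed. As a minimal sketch, assuming the common exact-match, entity-level definition (a predicted entity counts only when both its span and its label match the annotation), the metric could be computed as follows. The function name `entity_f1`, the `(start, end, label)` span tuples, and the labels are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# (start token index, end token index, entity label); an illustrative format.
Span = Tuple[int, int, str]


def entity_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1 under exact matching of span boundaries and label."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one sentence (labels are assumptions).
gold = {(0, 2, "MATERIAL"), (5, 6, "PROPERTY"), (8, 9, "VALUE")}
predicted = {(0, 2, "MATERIAL"), (5, 6, "PROPERTY"), (10, 11, "VALUE")}

# Two of three entities match on each side, so precision = recall = F1 = 2/3.
score = entity_f1(gold, predicted)
```

Under this definition a prediction with the right label but a shifted boundary earns no credit, which is why careful span annotation matters for the reported score.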