Abstract
Scientific literature serves as a high-quality corpus that supports a wide range of Natural Language Processing (NLP) research. However, existing datasets are centered on English, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset containing the titles, abstracts, keywords and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. CSL can serve as a Chinese corpus, and its semi-structured fields provide natural annotations from which many supervised NLP tasks can be constructed. Based on CSL, we present a benchmark to evaluate model performance on scientific-domain tasks, namely summarization, keyword generation and text classification. We analyze the behavior of existing text-to-text models on these evaluation tasks and reveal the challenges of Chinese scientific NLP, providing a valuable reference for future research. Data and code are available at https://github.com/ydli-ai/CSL .
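The abstract notes that the semi-structured fields (title, abstract, keywords, academic field) naturally define supervised tasks. As a rough illustration only, the Python sketch below maps one record onto the three benchmark-style tasks; the field names (`title`, `abstract`, `keywords`, `discipline`), the JSON-lines file `csl_data.jsonl`, and the exact task formulations are assumptions for this sketch, not the dataset's official schema, which is documented in the linked repository.

```python
# Hypothetical sketch: field names and JSON-lines layout are assumptions;
# see https://github.com/ydli-ai/CSL for the actual data format and tasks.
import json


def load_records(path):
    """Yield one paper record per JSON line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def build_examples(record):
    """Map one semi-structured record onto three supervised task formats."""
    return {
        # Summarization: generate the title from the abstract.
        "summarization": {"source": record["abstract"], "target": record["title"]},
        # Keyword generation: produce the author keywords from the abstract.
        "keyword_generation": {
            "source": record["abstract"],
            "target": ",".join(record["keywords"]),
        },
        # Text classification: predict the academic field label.
        "classification": {"text": record["abstract"], "label": record["discipline"]},
    }


if __name__ == "__main__":
    for rec in load_records("csl_data.jsonl"):  # hypothetical file name
        examples = build_examples(rec)
        print(examples["summarization"]["target"])
        break
```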
Original language | English
---|---
Pages (from-to) | 3917-3923
Number of pages | 7
Journal | Proceedings - International Conference on Computational Linguistics, COLING
Volume | 29
Issue number | 1
Publication status | Published - 2022
Externally published | Yes
Event | 29th International Conference on Computational Linguistics, COLING 2022 - Gyeongju, Korea, Republic of. Duration: 12 Oct 2022 → 17 Oct 2022
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Theoretical Computer Science