Training Datasets

WG4 / Data-to-text NLG training datasets

Dataset name and brief description, including purpose	Authors/creators	Link
MSVD-Turkish: The first large-scale video captioning dataset for the Turkish language, obtained by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish.	Begum Citamak, Ozan Caglayan, Menekse Kuyu, Erkut Erdem, Aykut Erdem, Pranava Madhyastha, and Lucia Specia.	MSVD-Turkish

WG4 / Data-to-text NLG training datasets

Data-to-text NLG systems require training data. Below is a list of freely available datasets, created with different methodologies (automatic construction, crowdsourcing, etc.) and targeting different NLG sub-tasks.

Name	Paper	Year	Link
WebNLG 2017	Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017). Creating Training Corpora for NLG Micro-Planners. ACL.	2017	https://webnlg-challenge.loria.fr/challenge_2017/
WebNLG 2020	Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017). Creating Training Corpora for NLG Micro-Planners. ACL.	2020	https://webnlg-challenge.loria.fr/challenge_2020/
KBGen	Banik, E., Gardent, C., & Kow, E. (2013). The KBGen Challenge. ENLG.	2013	http://www.kbgen.org
E2E NLG Challenge	Dusek, O., Novikova, J., & Rieser, V. (2020). Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge. Comput. Speech Lang., 59, 123-156.	2017	http://www.macs.hw.ac.uk/InteractionLab/E2E/
MultiWOZ 2.2	Zang, X., Rastogi, A., Zhang, J., & Chen, J. (2020). MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines. ArXiv, abs/2007.12720.	2020	https://github.com/budzianowski/multiwoz
ToTTo	Parikh, Ankur P., et al. “ToTTo: A controlled table-to-text generation dataset.” arXiv preprint arXiv:2004.14373 (2020).	2020	https://paperswithcode.com/dataset/totto
RotoWire	Wiseman, Sam, Stuart M. Shieber, and Alexander M. Rush. “Challenges in Data-to-Document Generation.” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.	2017	https://github.com/harvardnlp/boxscore-data/blob/master/rotowire.tar.bz2
WikiBio	Lebret, Rémi, David Grangier, and Michael Auli. “Neural text generation from structured data with application to the biography domain.” arXiv preprint arXiv:1603.07771 (2016).	2016	https://paperswithcode.com/dataset/wikibio
WEATHER GOV
ROBOCUP
Logic2Text	Chen, Zhiyu, et al. “Logic2Text: High-Fidelity Natural Language Generation from Logical Forms.” arXiv preprint arXiv:2004.14579 (2020).	2020	https://paperswithcode.com/dataset/logic2text
DART	Nan, Linyong, et al. “DART: Open-domain structured data record to text generation.” arXiv preprint arXiv:2007.02871 (2020).	2020	https://paperswithcode.com/dataset/dart
ENT-DESC	Cheng, Liying, et al. “ENT-DESC: Entity Description Generation by Exploring Knowledge Graph.” arXiv preprint arXiv:2004.14813 (2020).	2020	https://paperswithcode.com/dataset/ent-desc
GEM (Generation, Evaluation, and Metrics)	Gehrmann, Sebastian, et al. “The GEM benchmark: Natural language generation, its evaluation and metrics.” arXiv preprint arXiv:2102.01672 (2021).	2021	https://paperswithcode.com/dataset/gem
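
For readers new to these resources, the sketch below illustrates what a data-to-text training instance can look like: structured input (here, subject-predicate-object triples) paired with a reference text. It is only a minimal illustration under stated assumptions. The record is invented and not taken from any dataset listed above, the `linearize` helper is hypothetical, and the `<S>`, `<P>`, `<O>` markers are an arbitrary choice; each dataset defines its own format and fields.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def linearize(triples: List[Triple]) -> str:
    """Flatten (subject, predicate, object) triples into a single input string.

    The <S>/<P>/<O> markers are an assumption made for this sketch,
    not a convention prescribed by any of the listed datasets.
    """
    parts = [f"<S> {subj} <P> {pred} <O> {obj}" for subj, pred, obj in triples]
    return " ".join(parts)


# Hypothetical record, invented for illustration only.
example: Dict[str, object] = {
    "triples": [
        ("Example_City", "country", "Exampleland"),
        ("Example_City", "population", "120000"),
    ],
    "text": "Example City is a city in Exampleland with a population of 120000.",
}

# A training pair for a sequence-to-sequence model: (input string, target text).
source = linearize(example["triples"])
training_pair = (source, example["text"])
print(training_pair)
```

Table-based datasets would need a different flattening step (for example, over cells and headers), so this helper should be read as one possible preprocessing choice rather than a general recipe.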