WG4 / Data-to-text NLG training datasets
Dataset name and brief description, including purpose | Authors/creators | Link |
MSVD-Turkish: The first large-scale video captioning dataset for the Turkish language, obtained by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. | Begum Citamak, Ozan Caglayan, Menekse Kuyu, Erkut Erdem, Aykut Erdem, Pranava Madhyastha, and Lucia Specia | MSVD-Turkish |
Data-to-text NLG systems require training data. The table below lists freely available datasets that were created with different methodologies (automatic extraction, crowdsourcing, etc.) and that target different NLG sub-tasks.
Name | Paper | Year | Link |
WebNLG 2017 | Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017). Creating Training Corpora for NLG Micro-Planners. ACL. | 2017 | https://webnlg-challenge.loria.fr/challenge_2017/ |
WebNLG 2020 | Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017). Creating Training Corpora for NLG Micro-Planners. ACL. | 2020 | https://webnlg-challenge.loria.fr/challenge_2020/ |
KBGen | Banik, E., Gardent, C., & Kow, E. (2013). The KBGen Challenge. ENLG. | 2013 | http://www.kbgen.org |
E2E NLG Challenge | Dusek, O., Novikova, J., & Rieser, V. (2020). Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge. Comput. Speech Lang., 59, 123-156. | 2017 | http://www.macs.hw.ac.uk/InteractionLab/E2E/ |
MultiWOZ 2.2 | Zang, X., Rastogi, A., Zhang, J., & Chen, J. (2020). MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines. ArXiv, abs/2007.12720. | 2020 | https://github.com/budzianowski/multiwoz |
ToTTo | Parikh, Ankur P., et al. “ToTTo: A controlled table-to-text generation dataset.” arXiv preprint arXiv:2004.14373 (2020). | 2020 | https://paperswithcode.com/dataset/totto |
RotoWire | Wiseman, Sam, Stuart M. Shieber, and Alexander M. Rush. “Challenges in Data-to-Document Generation.” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. | 2017 | https://github.com/harvardnlp/boxscore-data/blob/master/rotowire.tar.bz2 |
WikiBio | Lebret, Rémi, David Grangier, and Michael Auli. “Neural text generation from structured data with application to the biography domain.” arXiv preprint arXiv:1603.07771 (2016). | 2016 | https://paperswithcode.com/dataset/wikibio |
WEATHER GOV | | | |
ROBOCUP | | | |
Logic2Text | Chen, Zhiyu, et al. “Logic2Text: High-Fidelity Natural Language Generation from Logical Forms.” arXiv preprint arXiv:2004.14579 (2020). | 2020 | https://paperswithcode.com/dataset/logic2text |
DART | Nan, Linyong, et al. “DART: Open-domain structured data record to text generation.” arXiv preprint arXiv:2007.02871 (2020). | 2020 | https://paperswithcode.com/dataset/dart |
ENT-DESC | Cheng, Liying, et al. “ENT-DESC: Entity Description Generation by Exploring Knowledge Graph.” arXiv preprint arXiv:2004.14813 (2020). | 2020 | https://paperswithcode.com/dataset/ent-desc |
GEM (Generation, Evaluation, and Metrics) | Gehrmann, Sebastian, et al. “The GEM benchmark: Natural language generation, its evaluation and metrics.” arXiv preprint arXiv:2102.01672 (2021). | 2021 | https://paperswithcode.com/dataset/gem |
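
For readers new to the task, the following is a minimal sketch of what a single data-to-text training instance might look like, assuming a triple-based input of the kind used by knowledge-graph-style datasets. The class name `DataToTextExample`, the `linearize` helper, the `<S>`/`<P>`/`<O>` marker tokens, and the example record are all hypothetical and chosen only for illustration; they do not reproduce the file format of any dataset listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A (subject, predicate, object) triple; hypothetical representation.
Triple = Tuple[str, str, str]


@dataclass
class DataToTextExample:
    """One training instance: structured input plus a human-written text."""
    triples: List[Triple]
    reference: str


def linearize(triples: List[Triple]) -> str:
    """Flatten triples into a single string a sequence model could consume.

    The <S>/<P>/<O> markers are an illustrative convention, not a standard.
    """
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)


# Invented record, not taken from any real dataset.
example = DataToTextExample(
    triples=[
        ("Example_Cafe", "cuisine", "Italian"),
        ("Example_Cafe", "area", "riverside"),
    ],
    reference="Example Cafe serves Italian food in the riverside area.",
)

source = linearize(example.triples)
print(source)
print(example.reference)
```

A model would be trained to map `source` to `example.reference`. Each dataset defines its own input structure (tables, logical forms, dialogue acts), so the linearization step has to be adapted to the dataset in question.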