About
The Noah-Wukong dataset is a large-scale multimodal Chinese dataset.
- The dataset contains 100 Million <image, text> pairs
- Images in the dataset are filtered by size (> 200px in both dimensions) and aspect ratio (between 1/3 and 3)
- Text in the dataset is filtered by language, length, and frequency. Privacy and sensitive words are also taken into consideration.
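The image criteria above can be illustrated with a minimal sketch. This is not the dataset's actual filtering code: the function name `passes_image_filter` and its parameters are hypothetical, and it assumes the aspect ratio is width divided by height with both bounds (1/3 and 3) inclusive. The text filters are not sketched because their thresholds are not stated here.

```python
def passes_image_filter(width: int, height: int,
                        min_side: int = 200, max_ratio: int = 3) -> bool:
    """Return True if an image meets the size and aspect-ratio criteria.

    Hypothetical helper: assumes "> 200px" is a strict bound on both
    dimensions and that the 1/3 to 3 aspect-ratio range is inclusive.
    """
    # Both dimensions must be strictly larger than min_side pixels.
    if width <= min_side or height <= min_side:
        return False
    # 1/max_ratio <= width/height <= max_ratio, written with integer
    # arithmetic to avoid floating-point edge cases at the bounds.
    return width <= max_ratio * height and height <= max_ratio * width


# Example: keep only the (width, height) pairs that pass the filter.
candidates = [(640, 480), (200, 500), (300, 1000), (900, 300)]
kept = [size for size in candidates if passes_image_filter(*size)]
```

Here `(200, 500)` is rejected because its width is not greater than 200, and `(300, 1000)` is rejected because its aspect ratio falls below 1/3.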
Examples
Announcement
2022/01/30 Dataset is now available for download
2022/02/14 Paper for the Noah-Wukong dataset is released on arXiv
2022/03/28 Benchmark models are now released
2022/03/28 Wukong-Test is available for download
2022/05/11 Chinese labels for the dataset are now available for download
2022/07/05 A PyTorch version of the implementation code is now available
Citation
@misc{gu2022wukong,
  title={Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework},
  author={Jiaxi Gu and Xiaojun Meng and Guansong Lu and Lu Hou and Minzhe Niu and Hang Xu and Xiaodan Liang and Wei Zhang and Xin Jiang and Chunjing Xu},
  year={2022},
  eprint={2202.06767},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}