About

The Noah-Wukong dataset is a large-scale multi-modality Chinese dataset.

The dataset contains 100 million <image, text> pairs.
Images in the dataset are filtered by size (more than 200px in both dimensions) and aspect ratio (between 1/3 and 3).
Text in the dataset is filtered by language, length, and frequency. Privacy and sensitive words are also taken into consideration.
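
The image criteria above can be sketched as a simple predicate. This is only an illustration, not the authors' filtering code: the function name `passes_image_filter` and its parameters are hypothetical, the aspect ratio is assumed to be width divided by height, and treating the ratio bounds as inclusive is an assumption, since the description does not say.

```python
def passes_image_filter(
    width: int,
    height: int,
    min_side: int = 200,         # both dimensions must exceed this (in pixels)
    min_ratio: float = 1 / 3,    # lower bound on width / height (assumed inclusive)
    max_ratio: float = 3.0,      # upper bound on width / height (assumed inclusive)
) -> bool:
    """Return True if an image of the given size meets the stated criteria."""
    # Size check: strictly greater than 200px for both dimensions.
    if width <= min_side or height <= min_side:
        return False
    # Aspect-ratio check: width / height must lie between 1/3 and 3.
    ratio = width / height
    return min_ratio <= ratio <= max_ratio


if __name__ == "__main__":
    for size in [(640, 480), (200, 300), (1000, 250), (300, 1000)]:
        print(size, passes_image_filter(*size))
```

An image of 640x480 passes, 200x300 fails the size check, and 1000x250 fails the aspect-ratio check (ratio 4).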

Examples

Announcement

2022/01/30 Dataset is now available for download

2022/02/14 The paper for the Noah-Wukong dataset is released on arXiv

2022/03/28 Benchmark models are now released

2022/03/28 Wukong-Test is available for download

2022/05/11 Chinese labels for the dataset are now available for download

2022/07/05 A PyTorch version of the implementation code is now available

Citation

@misc{gu2022wukong,
      title={Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework}, 
      author={Jiaxi Gu and Xiaojun Meng and Guansong Lu and Lu Hou and Minzhe Niu and Hang Xu and Xiaodan Liang and Wei Zhang and Xin Jiang and Chunjing Xu},
      year={2022},
      eprint={2202.06767},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}