About

The Noah-Wukong dataset is a large-scale multi-modality Chinese dataset.
  • The dataset contains 100 million <image, text> pairs.
  • Images in the dataset are filtered by size (more than 200px in both dimensions) and by aspect ratio (between 1/3 and 3).
  • Text in the dataset is filtered by language, length and frequency. Privacy and sensitive words are also taken into consideration.
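
The image criteria above can be sketched as a simple predicate. This is only an illustration of the stated thresholds, not the filtering code used to build the dataset: the function name `keep_image` and its parameters are hypothetical, and treating the aspect-ratio bounds as inclusive is an assumption.

```python
def keep_image(width: int, height: int,
               min_side: int = 200, max_ratio: int = 3) -> bool:
    """Return True if an image passes the size and aspect-ratio criteria.

    Hypothetical helper: the thresholds come from the description above,
    but inclusive aspect-ratio bounds are an assumption.
    """
    # Both dimensions must be strictly greater than min_side pixels.
    if width <= min_side or height <= min_side:
        return False
    # Aspect ratio within [1/max_ratio, max_ratio], checked with integer
    # arithmetic to avoid floating-point rounding at the boundaries.
    return width <= max_ratio * height and height <= max_ratio * width


if __name__ == "__main__":
    for size in [(400, 300), (200, 300), (1000, 250), (750, 250)]:
        print(size, keep_image(*size))
```

Because the allowed range is symmetric (1/3 to 3), it does not matter whether the ratio is taken as width/height or height/width.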

Announcement

2022/01/30 The dataset is now available for download.
2022/02/14 The paper for the Noah-Wukong dataset is released on arXiv.
2022/03/28 Benchmark models are now released.
2022/03/28 Wukong-Test is available for download.
2022/05/11 Chinese labels for the dataset are now available for download.
2022/07/05 A PyTorch implementation of the code is now available.

Citation

@misc{gu2022wukong,
      title={Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework}, 
      author={Jiaxi Gu and Xiaojun Meng and Guansong Lu and Lu Hou and Minzhe Niu and Hang Xu and Xiaodan Liang and Wei Zhang and Xin Jiang and Chunjing Xu},
      year={2022},
      eprint={2202.06767},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}