The data is hosted on Google Drive and Baidu Yunpan. The extraction code for Baidu Yunpan is noah.
Download link for Wukong100m: [Google Drive]
Download link for Wukong100m: [Baidu Yunpan]
Download link for Wukong-Test: [Google Drive]
Download link for Wukong-Test: [Baidu Yunpan]

Data organization

The whole dataset is split into 256 files, each containing around 80,000 <image, text> pairs. After unzipping the download, the files under the data root directory look like this:

    ├─ wukong_100m_0.csv
    ├─ wukong_100m_1.csv
    ├─ wukong_100m_2.csv
    ├─ ....
    └─ wukong_100m_255.csv
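
As a quick sanity check after unzipping, the layout above can be verified with a short script. This is only a minimal sketch: the helper names (`shard_paths`, `missing_shards`, `iter_rows`) are hypothetical, UTF-8 encoding is an assumption, and the column layout of the CSV files is not described in this section, so rows are returned as raw lists of strings.

```python
import csv
from pathlib import Path

NUM_SHARDS = 256  # number of CSV files stated in the text above


def shard_paths(data_root, num_shards=NUM_SHARDS):
    """Return the expected CSV paths wukong_100m_0.csv .. wukong_100m_{n-1}.csv."""
    root = Path(data_root)
    return [root / f"wukong_100m_{i}.csv" for i in range(num_shards)]


def missing_shards(data_root, num_shards=NUM_SHARDS):
    """Return the names of expected CSV files that are not present under data_root."""
    return [p.name for p in shard_paths(data_root, num_shards) if not p.is_file()]


def iter_rows(path):
    """Yield each CSV row as a list of strings.

    Assumption: the files are UTF-8 encoded. Whether the first row is a
    header is not stated here, so no row is skipped.
    """
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.reader(f)
```

For example, `missing_shards("path/to/data_root")` returns an empty list when all 256 files are in place, and `iter_rows(shard_paths("path/to/data_root")[0])` streams the first file row by row without loading it into memory.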