KuaiSearch is a large-scale e-commerce search dataset and full-stack benchmark system built from real user search interactions on the Kuaishou platform. It covers the three core stages of modern industrial search pipelines: Recall, Relevance, and Ranking. Each stage provides multiple algorithmic baselines, allowing researchers to systematically evaluate and compare methods.
📄 Paper: KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance

Authors: Yupeng Li*, Ben Chen*, Mingyue Cheng, Zhiding Liu, Xuxin Zhang, Chenyi Lei, Wenwu Ou

- Overview
- Dataset Statistics
- Installation
- Data Preparation
- Usage
- Supported Models
- Benchmark Results
- Notes
- Citation
- References
KuaiSearch provides a large-scale e-commerce search dataset together with a complete benchmark system covering three modular, independently trainable stages:
| Stage | Description | Methods |
|---|---|---|
| 🔍 Recall | Retrieve candidate documents from a large corpus | BM25, DPR (Dense Retrieval), Generative Retrieval (GR) |
| ✅ Relevance | Score query–document semantic relevance | Cross-Encoder, Bi-Encoder Embedding, GR |
| 📊 Ranking | Learn-to-rank candidates with user features | DNN, Wide&Deep, DCNv1, DCNv2, DIN |
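
To make the Recall stage concrete, here is a minimal, self-contained sketch of BM25 scoring over a toy corpus. It is not the implementation used by this repository's scripts; the function name `bm25_scores`, the whitespace tokenization, the example product titles, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Return one BM25 score per document for a tokenized query.

    Illustrative sketch only: k1 and b are common defaults, not values
    taken from the KuaiSearch benchmark configuration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical product titles, tokenized on whitespace.
docs = [
    "wireless bluetooth headphones".split(),
    "running shoes for men".split(),
    "bluetooth speaker portable".split(),
]
scores = bm25_scores("bluetooth headphones".split(), docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [0, 2, 1]
```

The first title matches both query terms, the third matches one, and the second matches none, which gives the ordering shown. Real product titles on a Chinese-language platform would need a proper tokenizer rather than whitespace splitting.
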
KuaiSearch is, to the best of our knowledge, the largest e-commerce search dataset currently available, built upon real user search interactions from the Kuaishou platform. It retains authentic user queries and natural-language product texts, covers cold-start users and long-tail products, and spans all three key stages of the search pipeline.
| Dataset | # Users | # Items | # Queries | Text Form |
|---|---|---|---|---|
| Amazon | 192,403 | 63,001 | 3,221 | text (heuristic queries) |
| JDsearch | 173,831 | 12,872,736 | 171,728 | anonymized |
| KuaiSearch-Lite | 102,086 | 6,634,118 | 555,553 | text |
| KuaiSearch | 331,930 | 18,605,582 | 2,574,949 | text |

| Table | Size | Key Fields |
|---|---|---|
| User | 331,930 | user_id, gender, age, location |
| Item | 18,605,582 | item_id, title, brand, seller, category L1/L2/L3 |
| Recall | 2,574,949 | user_id, session_id, query, impressed_item_ids, clicked_item_ids, purchased_item_ids |
| Ranking | 81,401,477 | user_id, user stats, session_id, query, search_entrance, recent clicked/purchased items, target item features, is_clicked, is_purchased |
| Relevance | 46,422 | query, title, brand_name, seller_name, attribute, score (0–3) |
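
As an illustration of how the Recall table's fields could feed an evaluation, the sketch below computes Recall@K for one session, treating clicked or purchased items as the relevant set. The record layout (a dict with list-valued ID fields), the example values, and the helper name `recall_at_k` are assumptions; the actual on-disk format and the metric definitions used by the benchmark scripts may differ.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        # No positives for this session; return 0.0 by convention here.
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical record using field names from the Recall table.
record = {
    "query": "bluetooth headphones",
    "impressed_item_ids": [101, 102, 103, 104],
    "clicked_item_ids": [102, 104],
    "purchased_item_ids": [104],
}

# Hypothetical ranked output of some retriever for this query.
retrieved = [104, 999, 102, 101]

click_recall = recall_at_k(retrieved, record["clicked_item_ids"], k=2)
purchase_recall = recall_at_k(retrieved, record["purchased_item_ids"], k=2)
print(click_recall, purchase_recall)  # 0.5 1.0
```

Sessions without any click or purchase are scored 0.0 in this sketch; an evaluation could equally skip them, so that choice should be checked against the benchmark's own protocol.
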
KuaiSearch-Lite is a lightweight subset designed for rapid model validation and ablation studies. All experiments in the paper are conducted on KuaiSearch-Lite.
- Python 3.8+
- CUDA 11.7+
pip install -r requirements.txt

The dataset is publicly available on HuggingFace:
# Install HuggingFace Hub CLI (if not already installed)
pip install huggingface_hub
# Download KuaiSearch dataset
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='benchen4395/KuaiSearch',
repo_type='dataset',
local_dir='./data'
)
"

Or manually download from: https://huggingface.co/datasets/benchen4395/KuaiSearch
⚠️ All commands must be run from the KuaiSearch/ project root.
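
One way to confirm the working directory before launching anything is a small check like the following. It is only a sketch: the helper `missing_scripts` is hypothetical, and it assumes the shell scripts live in a `scripts/` folder directly under the project root, as the command paths in this README suggest.

```python
from pathlib import Path

# A few script names copied from the commands in this README.
EXPECTED_SCRIPTS = [
    "recall_data_process.sh",
    "relevance_data_process.sh",
    "ranking_data_process.sh",
    "ranking_train.sh",
]


def missing_scripts(root="."):
    """Return the expected scripts that are absent under <root>/scripts."""
    scripts_dir = Path(root) / "scripts"
    return [name for name in EXPECTED_SCRIPTS
            if not (scripts_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_scripts()
    if missing:
        print("Not in the project root? Missing:", ", ".join(missing))
    else:
        print("Working directory looks like the project root.")
```
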
bash scripts/recall_data_process.sh
bash scripts/recall_bm25_eval.sh
bash scripts/recall_doc2query.sh
bash scripts/recall_docT5query_eval.sh
bash scripts/recall_dpr.sh
bash scripts/recall_gr.sh

bash scripts/relevance_data_process.sh
bash scripts/relevance_crossencoder.sh
bash scripts/relevance_embedding.sh
bash scripts/relevance_gr.sh

bash scripts/ranking_data_process.sh
# Default model: DCNv1
bash scripts/ranking_train.sh

If you use KuaiSearch in your research, please cite our paper:
@article{li2026kuaisearch,
title = {KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance},
author = {Yupeng Li and Ben Chen and Mingyue Cheng and Zhiding Liu and Xuxin Zhang and Chenyi Lei and Wenwu Ou},
journal = {arXiv preprint arXiv:2602.11518},
year = {2026},
url = {https://arxiv.org/abs/2602.11518}
}