KuaiSearch

KuaiSearch is a large-scale e-commerce search dataset and full-stack benchmark system built from real user search interactions on the Kuaishou platform. It covers the three core stages of modern industrial search pipelines: Recall, Relevance, and Ranking. Each stage provides multiple algorithmic baselines, allowing researchers to systematically evaluate and compare methods

! All the codes and datasets are now public !

📄 Paper: KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance Yupeng Li*, Ben Chen*, Mingyue Cheng, Zhiding Liu, Xuxin Zhang, Chenyi Lei, Wenwu Ou

🤗 Dataset: huggingface.co/datasets/benchen4395/KuaiSearch

Overview

KuaiSearch provides a large-scale e-commerce search dataset together with a complete benchmark system covering three modular, independently trainable stages:

Stage	Description	Methods
🔍 Recall	Retrieve candidate documents from a large corpus	BM25, DPR (Dense Retrieval), Generative Retrieval (GR)
✅ Relevance	Score query–document semantic relevance	Cross-Encoder, Bi-Encoder Embedding, GR
📊 Ranking	Learn-to-rank candidates with user features	DNN, Wide&Deep, DCNv1, DCNv2, DIN

KuaiSearch is, to the best of our knowledge, the largest e-commerce search dataset currently available, built upon real user search interactions from the Kuaishou platform. It retains authentic user queries and natural-language product texts, covers cold-start users and long-tail products, and spans all three key stages of the search pipeline.

Dataset Statistics

Scale Comparison with Existing Datasets

Dataset	# Users	# Items	# Queries	Text Form
Amazon	192,403	63,001	3,221	text (heuristic queries)
JDsearch	173,831	12,872,736	171,728	anonymized
KuaiSearch-Lite	102,086	6,634,118	555,553	text
KuaiSearch	331,930	18,605,582	2,574,949	text

Data Schema

Table	Size	Key Fields
User	331,930	`user_id`, `gender`, `age`, `location`
Item	18,605,582	`item_id`, `title`, `brand`, `seller`, `category L1/L2/L3`
Recall	2,574,949	`user_id`, `session_id`, `query`, `impressed_item_ids`, `clicked_item_ids`, `purchased_item_ids`
Ranking	81,401,477	`user_id`, user stats, `session_id`, `query`, `search_entrance`, recent clicked/purchased items, target item features, `is_clicked`, `is_purchased`
Relevance	46,422	`query`, `title`, `brand_name`, `seller_name`, `attribute`, `score` (0–3)

KuaiSearch-Lite is a lightweight subset designed for rapid model validation and ablation studies. All experiments in the paper are conducted on KuaiSearch-Lite.

Installation

Requirements

Python 3.8+
CUDA 11.7+

Install Dependencies

pip install -r requirements.txt

Data Preparation

Download the Dataset

The dataset is publicly available on HuggingFace:

# Install HuggingFace Hub CLI (if not already installed)
pip install huggingface_hub

# Download KuaiSearch dataset
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='benchen4395/KuaiSearch',
    repo_type='dataset',
    local_dir='./data'
)
"

Or manually download from: https://huggingface.co/datasets/benchen4395/KuaiSearch

Usage

⚠️ All commands must be run from the KuaiSearch/ project root.

Recall

Step 1: Data Preprocessing

bash scripts/recall_data_process.sh

Method 1: BM25

bash scripts/recall_bm25_eval.sh

Method 2: DocT5Query

bash scripts/recall_doc2query.sh
bash scripts/recall_docT5query_eval.sh

Method 3: Dense Retrieval (DPR)

bash scripts/recall_dpr.sh

Method 4: Generative Retrieval (GR)

bash scripts/recall_gr.sh

Relevance

Step 1: Data Preprocessing

bash scripts/relevance_data_process.sh

Method 1: Cross-Encoder

bash scripts/relevance_crossencoder.sh

Method 2: Bi-Encoder (Embedding)

bash scripts/relevance_embedding.sh

Method 3: Generative Relevance (GR)

bash scripts/relevance_gr.sh

Ranking

Step 1: Data Preprocessing

bash scripts/ranking_data_process.sh

Step 2: Train

# Default model: DCNv1
bash scripts/ranking_train.sh

Citation

If you use KuaiSearch in your research, please cite our paper:

@article{li2026kuaisearch,
  title     = {KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance},
  author    = {Yupeng Li and Ben Chen and Mingyue Cheng and Zhiding Liu and Xuxin Zhang and Chenyi Lei and Wenwu Ou},
  journal   = {arXiv preprint arXiv:2602.11518},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.11518}
}

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
demo		demo
ranking		ranking
recall		recall
relevance		relevance
scripts		scripts
static		static
.gitignore		.gitignore
README.md		README.md
detail.png		detail.png
fig1_product_inter_freq.png		fig1_product_inter_freq.png
fig2_user_history_len_sessions.png		fig2_user_history_len_sessions.png
homepage.png		homepage.png
index.html		index.html
mall.png		mall.png
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KuaiSearch

! All the codes and datasets are now public !

Table of Contents

Overview

Dataset Statistics

Scale Comparison with Existing Datasets

Data Schema

Installation

Requirements

Install Dependencies

Data Preparation

Download the Dataset

Usage

Recall

Step 1: Data Preprocessing

Method 1: BM25

Method 2: DocT5Query

Method 3: Dense Retrieval (DPR)

Method 4: Generative Retrieval (GR)

Relevance

Step 1: Data Preprocessing

Method 1: Cross-Encoder

Method 2: Bi-Encoder (Embedding)

Method 3: Generative Relevance (GR)

Ranking

Step 1: Data Preprocessing

Step 2: Train

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

KuaiSearch

! All the codes and datasets are now public !

Table of Contents

Overview

Dataset Statistics

Scale Comparison with Existing Datasets

Data Schema

Installation

Requirements

Install Dependencies

Data Preparation

Download the Dataset

Usage

Recall

Step 1: Data Preprocessing

Method 1: BM25

Method 2: DocT5Query

Method 3: Dense Retrieval (DPR)

Method 4: Generative Retrieval (GR)

Relevance

Step 1: Data Preprocessing

Method 1: Cross-Encoder

Method 2: Bi-Encoder (Embedding)

Method 3: Generative Relevance (GR)

Ranking

Step 1: Data Preprocessing

Step 2: Train

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages