Official implementation of "Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models"
First, download the pre-processed `.parquet` dataset (available on Hugging Face) and place it in the `data/processed/` directory.
Then, execute the following command to verify the dataset statistics:
```bash
python scripts/check_stats.py +data=sources
```

You should see a comprehensive dataset statistics report like this:

```
========================================
DATASET STATISTICS: fed_pii_v1.parquet
========================================
[Total Samples]: 16194
split
train    14575
test      1619
Name: count, dtype: int64
[Samples with PII]: 11633
[Breakdown by Task & Split]
                       count  with_pii  pii_ratio
split task
test  exam               224        77     34.38%
      jud_read_compre    350       350    100.00%
      jud_sum            265       265    100.00%
      leg_case_cls       403       121     30.02%
      sim_case_match     377       377    100.00%
train exam              2020       623     30.84%
      jud_read_compre   3148      3143     99.84%
      jud_sum           2386      2385     99.96%
      leg_case_cls      3625       898     24.77%
      sim_case_match    3396      3394     99.94%
========================================
```
Note on Raw Data Processing: If you are interested in how the raw data was processed into the final `.parquet` format, see `scripts/make_data.py`. The original raw data is not publicly provided due to privacy and licensing constraints; this script is included strictly for reference and transparency.
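As a sanity check independent of `check_stats.py`, the per-task PII breakdown above can be recomputed with a few lines of pandas. The sketch below uses a toy stand-in DataFrame, and the column names (`split`, `task`, `has_pii`) are assumptions about the schema, not confirmed from the repository:

```python
import pandas as pd

# Toy stand-in for the real file; in practice you would load it with:
# df = pd.read_parquet("data/processed/fed_pii_v1.parquet")
df = pd.DataFrame({
    "split":   ["train", "train", "train", "test"],
    "task":    ["exam", "exam", "jud_sum", "exam"],
    "has_pii": [True, False, True, True],
})

# Mirror the [Breakdown by Task & Split] table: sample count, PII count, ratio.
stats = df.groupby(["split", "task"]).agg(
    count=("has_pii", "size"),
    with_pii=("has_pii", "sum"),
)
stats["pii_ratio"] = (stats["with_pii"] / stats["count"]).map("{:.2%}".format)
print(stats)
```

The same `groupby` over the full dataset should reproduce the numbers in the report above.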
To simulate a federated learning environment, we partition the dataset across multiple simulated clients. Note that our partitioning algorithms are adapted from the open-source implementations in the FedLegal/FedLab repositories.
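As a rough illustration of cluster-based Non-IID partitioning (a sketch, not the repository's exact algorithm): samples are clustered on their semantic embeddings, and each cluster becomes one client's local shard. Random vectors stand in for the real encoder embeddings here:

```python
import numpy as np

# Minimal k-means over embeddings, written out with numpy for self-containment.
def kmeans_labels(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # copy via fancy indexing
    for _ in range(iters):
        # Distance of every sample to every center, then nearest-center assignment.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # stand-in for semantic embeddings
n_clients = 4

# Each cluster's sample indices become one simulated client's shard.
labels = kmeans_labels(embeddings, n_clients)
client_indices = {c: np.flatnonzero(labels == c) for c in range(n_clients)}
```

Because clusters group semantically similar samples, each client ends up with a skewed (Non-IID) slice of the data rather than a uniform random split.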
Prerequisite: The partitioning process relies on an encoding-clustering algorithm. You must download the `xlm-roberta-longformer-base-16384` model in advance to compute the semantic embeddings.
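Sentence-level embeddings from an encoder like this are commonly obtained by mean-pooling its token representations over non-padding positions. A minimal numpy sketch of masked mean pooling (the shapes are illustrative, not the model's actual dimensions):

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors where attention_mask == 1.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5, 8))                    # fake encoder hidden states
m = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # second sequence has no padding
emb = mean_pool(h, m)                             # (2, 8) sentence embeddings
```

In a real run, `hidden_states` would come from the longformer's last layer and `attention_mask` from its tokenizer.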
Before proceeding, create a custom local configuration file under `configs/local/` (e.g., `private.yaml`) to specify your local model paths:

```yaml
# @package paths
model_root_external: "/path/to/your/models"
```

Once configured, run the following two-step process:
```bash
# Step 1: Unpack the datasets into an interim directory across tasks
python scripts/data_preparation.py

# Step 2: Generate Non-IID partitioned sample indices based on semantic clustering
python scripts/partition_data.py +partitioner=clustering +local=private
```

The codebase for Federated Supervised Fine-Tuning, Evaluation, and PII Extraction Attacks is currently undergoing a privacy review and architectural refactoring. The remaining code will be released once this process is complete.
```bibtex
@inproceedings{hu-etal-2025-simple,
    title = "Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models",
    author = "Hu, Yingqi and
      Zhang, Zhuo and
      Zhang, Jingyuan and
      Wang, Jinghua and
      Wang, Qifan and
      Qu, Lizhen and
      Xu, Zenglin",
    booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-ijcnlp.113/",
    pages = "1808--1827",
    ISBN = "979-8-89176-303-6",
}
```
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
