Official implementation of "Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models"
First, download the pre-processed `.parquet` dataset (available on Hugging Face) and place it in the `data/processed/` directory.
Then, execute the following command to verify the dataset statistics:
```bash
python scripts/check_stats.py +data=sources
```

You should see a comprehensive dataset statistics report like this:

```
========================================
DATASET STATISTICS: fed_pii_v1.parquet
========================================
[Total Samples]: 16194
split
train    14575
test      1619
Name: count, dtype: int64
[Samples with PII]: 11633
[Breakdown by Task & Split]
                       count  with_pii  pii_ratio
split task
test  exam               224        77     34.38%
      jud_read_compre    350       350    100.00%
      jud_sum            265       265    100.00%
      leg_case_cls       403       121     30.02%
      sim_case_match     377       377    100.00%
train exam              2020       623     30.84%
      jud_read_compre   3148      3143     99.84%
      jud_sum           2386      2385     99.96%
      leg_case_cls      3625       898     24.77%
      sim_case_match    3396      3394     99.94%
========================================
```
Note on Raw Data Processing: If you are interested in how the raw data was processed into the final `.parquet` format, see `scripts/make_data.py`. The original raw data is not publicly provided due to privacy and licensing constraints; this script is included strictly for reference and transparency.
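As a sanity check independent of `check_stats.py`, the per-task PII breakdown above can be recomputed with a few lines of pandas. The sketch below uses a toy stand-in DataFrame, and the column names (`split`, `task`, `has_pii`) are assumptions about the schema, not confirmed from the repository:

```python
import pandas as pd

# Toy stand-in for the real file; in practice you would load it with:
# df = pd.read_parquet("data/processed/fed_pii_v1.parquet")
df = pd.DataFrame({
    "split":   ["train", "train", "train", "test"],
    "task":    ["exam", "exam", "jud_sum", "exam"],
    "has_pii": [True, False, True, True],
})

# Mirror the [Breakdown by Task & Split] table: sample count, PII count, ratio.
stats = df.groupby(["split", "task"]).agg(
    count=("has_pii", "size"),
    with_pii=("has_pii", "sum"),
)
stats["pii_ratio"] = (stats["with_pii"] / stats["count"]).map("{:.2%}".format)
print(stats)
```

The same `groupby` over the full dataset should reproduce the numbers in the report above.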
To simulate a federated learning environment, we partition the dataset across multiple simulated clients. Note that our partitioning algorithms are adapted from the open-source implementations in the FedLegal/FedLab repositories.
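As a rough illustration of cluster-based Non-IID partitioning (a sketch, not the repository's exact algorithm): samples are clustered on their semantic embeddings, and each cluster becomes one client's local shard. Random vectors stand in for the real encoder embeddings here:

```python
import numpy as np

# Minimal k-means over embeddings, written out with numpy for self-containment.
def kmeans_labels(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # copy via fancy indexing
    for _ in range(iters):
        # Distance of every sample to every center, then nearest-center assignment.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # stand-in for semantic embeddings
n_clients = 4

# Each cluster's sample indices become one simulated client's shard.
labels = kmeans_labels(embeddings, n_clients)
client_indices = {c: np.flatnonzero(labels == c) for c in range(n_clients)}
```

Because clusters group semantically similar samples, each client ends up with a skewed (Non-IID) slice of the data rather than a uniform random split.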
Prerequisite: The partitioning process relies on an encoding-clustering algorithm. You must download the `xlm-roberta-longformer-base-16384` model in advance to compute the semantic embeddings.
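Sentence-level embeddings from an encoder like this are commonly obtained by mean-pooling its token representations over non-padding positions. A minimal numpy sketch of masked mean pooling (the shapes are illustrative, not the model's actual dimensions):

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors where attention_mask == 1.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5, 8))                    # fake encoder hidden states
m = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # second sequence has no padding
emb = mean_pool(h, m)                             # (2, 8) sentence embeddings
```

In a real run, `hidden_states` would come from the longformer's last layer and `attention_mask` from its tokenizer.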
Before proceeding, create a custom local configuration file under `configs/local/` (e.g., `private.yaml`) to specify your local model paths:

```yaml
# @package paths
model_root_external: "/path/to/your/models"
```

Once configured, run the following two-step process:
```bash
# Step 1: Unpack the datasets into an interim directory across tasks
python scripts/data_preparation.py

# Step 2: Generate Non-IID partitioned sample indices based on semantic clustering
python scripts/partition_data.py +partitioner=clustering +local=private
```

The codebase for Federated Supervised Fine-Tuning, Evaluation, and PII Extraction Attacks is currently undergoing a privacy review and architectural refactoring. The remaining code will be released once this process is complete.
```bibtex
@inproceedings{hu-etal-2025-simple,
    title = "Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models",
    author = "Hu, Yingqi and
      Zhang, Zhuo and
      Zhang, Jingyuan and
      Wang, Jinghua and
      Wang, Qifan and
      Qu, Lizhen and
      Xu, Zenglin",
    booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-ijcnlp.113/",
    pages = "1808--1827",
    ISBN = "979-8-89176-303-6",
}
```
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
