โ— PHANTOM
๐Ÿ‡ฎ๐Ÿ‡ณ IN
โœ•
Skip to content

Official implementation of "Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models"

License

Notifications You must be signed in to change notification settings

SMILELab-FL/FedPII

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FedPII

Official implementation of "Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models"

🚀 Getting Started

1. Prepare Dataset

First, download the pre-processed .parquet dataset (available on Hugging Face) and place it in the data/processed/ directory.

Then, execute the following command to verify the dataset statistics:

python scripts/check_stats.py +data=sources

You should see a dataset statistics report like the following:

========================================
  DATASET STATISTICS: fed_pii_v1.parquet
========================================

[Total Samples]: 16194
split
train    14575
test      1619
Name: count, dtype: int64

[Samples with PII]: 11633

[Breakdown by Task & Split]
                       count  with_pii pii_ratio
split task                                      
test  exam               224        77    34.38%
      jud_read_compre    350       350   100.00%
      jud_sum            265       265   100.00%
      leg_case_cls       403       121    30.02%
      sim_case_match     377       377   100.00%
train exam              2020       623    30.84%
      jud_read_compre   3148      3143    99.84%
      jud_sum           2386      2385    99.96%
      leg_case_cls      3625       898    24.77%
      sim_case_match    3396      3394    99.94%
========================================
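For reference, a breakdown like the one above can be produced with a few lines of grouping logic. The sketch below is purely illustrative and assumes a simplified record schema; the field names (`split`, `task`, `pii_spans`) are hypothetical, not the repository's actual column names:

```python
from collections import defaultdict

def dataset_stats(samples):
    """Return {(split, task): (count, with_pii, ratio_str)}.

    Illustrative only: assumes each sample is a dict with hypothetical
    "split", "task", and "pii_spans" fields.
    """
    counts = defaultdict(lambda: [0, 0])  # (split, task) -> [count, with_pii]
    for s in samples:
        key = (s["split"], s["task"])
        counts[key][0] += 1
        if s["pii_spans"]:  # sample contains at least one PII annotation
            counts[key][1] += 1
    return {
        key: (n, pii, f"{100.0 * pii / n:.2f}%")
        for key, (n, pii) in sorted(counts.items())
    }

samples = [
    {"split": "test", "task": "exam", "pii_spans": ["NAME"]},
    {"split": "test", "task": "exam", "pii_spans": []},
    {"split": "train", "task": "jud_sum", "pii_spans": ["ID"]},
]
print(dataset_stats(samples))
```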

Note on Raw Data Processing: If you are interested in how the raw data was processed into the final .parquet format, you can refer to scripts/make_data.py. Please note that the original raw data is not publicly provided due to privacy and licensing constraints. This script is included strictly for reference and transparency.

2. Federated Learning (FL) Partition

To simulate a federated learning environment, we partition the dataset across multiple simulated clients. Note that our partitioning algorithms are adapted from the open-source implementations in the FedLegal/FedLab repositories.

Prerequisite: The partitioning process relies on an encoding-clustering algorithm. You must download the xlm-roberta-longformer-base-16384 model in advance to compute the semantic embeddings.

Before proceeding, create a custom local configuration file under configs/local/ (e.g., private.yaml) to specify your local model paths:

# @package paths
model_root_external: "/path/to/your/models"
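Hydra merges this file into the `paths` package at run time (via the `# @package paths` directive), so scripts can resolve local model directories against `model_root_external`. A minimal sketch of that resolution, using a hypothetical `resolve_model_dir` helper that is not part of the repository:

```python
from pathlib import Path

def resolve_model_dir(model_root_external: str, model_name: str) -> Path:
    # Illustrative helper: join the configured external model root with a
    # model directory name, e.g. the encoder used for semantic embeddings.
    return Path(model_root_external) / model_name

print(resolve_model_dir("/path/to/your/models",
                        "xlm-roberta-longformer-base-16384"))
```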

Once configured, run the following two-step process:

# Step 1: Unpack the per-task datasets into an interim directory
python scripts/data_preparation.py

# Step 2: Generate Non-IID partitioned sample indices based on semantic clustering
python scripts/partition_data.py +partitioner=clustering +local=private
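Conceptually, the clustering-based partition embeds each sample, clusters the embeddings, and assigns each cluster to a simulated client, yielding a Non-IID split. The sketch below illustrates the idea with toy 2-D embeddings and a plain k-means loop; it is an assumption-laden simplification, not the repository's actual algorithm:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Toy k-means on small point lists; stands in for the repository's
    # encoding-clustering step, which uses semantic embeddings instead.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centers[c]))
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def partition_by_cluster(embeddings, num_clients):
    """Map each sample index to a client id via its cluster label."""
    labels = kmeans(embeddings, num_clients)
    clients = {c: [] for c in range(num_clients)}
    for idx, label in enumerate(labels):
        clients[label].append(idx)
    return clients

# Two well-separated toy "semantic" groups -> two simulated clients.
emb = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
parts = partition_by_cluster(emb, num_clients=2)
print(parts)
```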

3. 🚧 Notice: Training & Attack Pipeline

The codebase for Federated Supervised Fine-Tuning, Evaluation, and PII Extraction Attacks is currently undergoing a privacy review and architectural refactoring. The remaining code will be released once this process is complete.

Citation

@inproceedings{hu-etal-2025-simple,
    title = "Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models",
    author = "Hu, Yingqi  and
      Zhang, Zhuo  and
      Zhang, Jingyuan  and
      Wang, Jinghua  and
      Wang, Qifan  and
      Qu, Lizhen  and
      Xu, Zenglin",
    booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-ijcnlp.113/",
    pages = "1808--1827",
    ISBN = "979-8-89176-303-6",
}

License


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

