Workshop on Responsibly Enabling Data for Foundation Models @ COLM 2026

About the Workshop

As foundation models scale, available training data sources have been rapidly depleted. However, several valuable forms of data, such as medical records and legal and financial documents, are restricted from use in model training because of their sensitive nature. In addition, the strong reasoning capabilities of current generative models have opened the possibility of highly personalizable AI applications, but these remain bottlenecked by limited access to high-quality user data. Hence, it is of immense value to responsibly unlock these data sources (for example, through data transformation or constrained training paradigms) or to generate synthetic alternatives. In this workshop, we aim to bring together domain experts in data, privacy, model training, and legal policy to advance the frontier of responsibly leveraging such sensitive data with foundation models.

Topics of interest include (but are not limited to):

Data Transformation: De-identification, Anonymization, Pseudonymization.
Synthetic Data Generation: Controlled Regeneration, Data Diversity.
Novel Training Paradigms: Differential Privacy (DP), Federated Learning, Architectural Solutions.
Evaluation & Auditing: Privacy attack benchmarks, Utility-Privacy tradeoffs.
Policy: Compliance, New regulations on data sharing.

Call for Papers

We invite long papers presenting novel research contributions (up to 8 pages) as well as short papers (up to 4 pages) reporting preliminary studies or negative results.

Submissions are managed via OpenReview. Accepted papers are non-archival, and concurrent submissions are allowed. Please follow the COLM 2026 template.