Data Provenance Standards. The first cross-industry metadata standards to bring transparency to the origin of datasets used in both traditional data and AI applications.
Project launched: 11.30.23
Page updated: 07.09.24

For AI to create value for business and society, the data that trains and feeds models must be trustworthy.

Trust in data starts with transparency into provenance: assessing where data comes from, how it was created, and whether it can legally be used. Yet the ecosystem needs a common language to provide that transparency.

This is why we developed the first cross-industry data provenance standards.

Why data provenance

Provenance matters. Understanding the sources of food, water, medicine, and capital is expected and essential in our society to gauge quality and trust. The same is now needed for the fuel of our increasingly knowledge- and AI-centric world: data.

Learn more about the value of the Standards from IBM, Mastercard, Transcarent, and Vendia
These practical standards, co-created by senior practitioners across industry, are designed to help evaluate whether AI workflows align with ever-changing regulations while also helping generate increased business value.
— ROB THOMAS, Senior Vice President Software and Chief Commercial Officer, IBM
How the Standards were created

The standards were derived from use cases across 15 different industries that outlined the data provenance challenges faced within business today. These were then synthesized, refined, and validated by a team of practitioners: chief technology officers, chief data officers, and leaders in data governance, data acquisition, data quality, privacy, legal and compliance.

AARP · American Express · Deloitte · Howso · Humana · IBM · Kenvue · Mastercard · Nielsen · Nike · Pfizer · Regions · Transcarent · UPS · Walmart · Warby Parker
Companies like ours feel a deep responsibility to ensure new value creation, as well as trust and transparency of data with all of our customers and stakeholders. Data provenance is critical to those efforts.
— KEN FINNERTY, President, IT & Data Analytics, UPS
Adopting the Standards

The Data Provenance Standards were designed for adoption across business, which is why they were built with three aspects in mind: increasing business value, easing implementation, and complying with new and emerging regulation.

Christine Pierce, Chief Data Officer, Audience Measurement, Nielsen
“As technology and AI are rapidly transforming industries, organizations need a blueprint for evaluating the underlying data that fuels these algorithms. Through the collaboration of experts across multiple industries and disciplines, the D&TA Data Provenance Standards meet this need. The standards promote trust and transparency by surfacing critical metadata elements in a consistent way, helping practitioners make informed decisions about the suitability of data sources and applications.”
Genevy Dimitrion, VP, Data Strategy & Governance, Humana
I am excited to see snapshot 1.0.0 of the Data & Trust Alliance’s Data Provenance Standards, which mark a significant milestone in ensuring data transparency and accountability. At Humana, we are committed to upholding the highest standards of data integrity, and these standards will enhance the trust and reliability of the data we produce and consume across the enterprise to allow us to deliver value to the individuals we serve.
Lee Cox, Vice President, Integrated Governance & Market Readiness, Office of Privacy and Responsible Technology, IBM
“The lack of data provenance consistency from one dataset to another is a pain point for organizations that build and use AI. This will be further accentuated as regulatory frameworks around the world require data origin disclosures. It is a game changer to have organizations agree on a consistent methodology to use end-to-end across the data ecosystem.”
Mallory Freeman, Ph.D., VP, Enterprise Data and Analytics, UPS
The new Data Provenance Standards are key to making data more reliable, not just for us at UPS, but for our customers and their supply chains. We’ve strengthened our own standards while collaborating with forward-thinking leaders across industries, and companies and consumers around the world will benefit from this work.
Michael Meehan, General Counsel and Chief Legal Officer, Howso
Data provenance standards are important for the entire data ecosystem. Beyond simplifying ingestion and use of data, use of the D&TA Data Provenance Standards, particularly by upstream data providers, will allow analysis of appropriateness, consent, and quality of aggregated datasets in a way that we have not previously had.
Travis Carpenter, Senior Vice President, Data Quality and Sources, Mastercard
“Trust in the data is based on our knowing that the data was sourced appropriately, is of good quality and has the consents necessary to be used. These Data Provenance Standards are an important step forward to ensure metadata about the sourcing, quality, and permissions are provided in a consistent manner, eliminating manual efforts which can introduce business risk.”
Case study: How IBM is using the Data Provenance Standards

In early 2024, IBM tested the standards as part of their clearance process for datasets used to train foundation models. They saw improvements in both efficiency (time to clearance) and overall data quality.

Read the IBM case study ->
What’s in v1.0.0

The Data Provenance Standards consist of 22 metadata fields, grouped into 3 standards: Source, Provenance, and Use.

Source
  • Standards version used

  • Dataset title/name

  • Unique metadata identifier

  • Metadata location (unique URL of the current dataset)

  • Dataset issuer

  • Description of the dataset

Provenance
  • Source metadata for dataset

  • Source (if different from Issuer)

  • Data origin geography

  • Dataset issue date

  • Date of previously issued version of the dataset (if applicable)

  • Range of dates for data generation

  • Method

  • Data format

Use
  • Confidentiality classification

  • Consent documentation location

  • Privacy enhancing technologies (PETs) or tools applied?

  • Data processing geography inclusion/exclusion

  • Data storage geography inclusion/exclusion

  • License to use

  • Intended data use

  • Proprietary data presence
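The three groups above can be pictured as a single machine-readable metadata record published alongside a dataset. The sketch below is a hypothetical Python/JSON example: the field keys, values, and structure are illustrative assumptions for this page, not the official v1.0.0 field names or schema.

```python
import json

# Hypothetical metadata record covering the Source, Provenance, and Use groups.
# All keys and values are illustrative; consult the published standards for
# the exact field names and formats.
record = {
    "source": {
        "standards_version": "1.0.0",
        "dataset_title": "Retail Transactions Sample",
        "metadata_id": "urn:example:dataset:12345",  # unique metadata identifier
        "metadata_location": "https://example.com/datasets/12345/metadata.json",
        "dataset_issuer": "Example Corp",
        "description": "Anonymized retail transactions for demand forecasting.",
    },
    "provenance": {
        "source": None,  # omitted when the source is the same as the issuer
        "origin_geography": ["US", "CA"],
        "issue_date": "2024-07-09",
        "previous_issue_date": None,  # no prior version of this dataset
        "generation_date_range": {"start": "2023-01-01", "end": "2023-12-31"},
        "method": "Point-of-sale collection",
        "data_format": "CSV",
    },
    "use": {
        "confidentiality": "restricted",
        "consent_location": "https://example.com/datasets/12345/consent",
        "pets_applied": ["k-anonymization"],
        "processing_geography": {"include": ["US"], "exclude": []},
        "storage_geography": {"include": ["US"], "exclude": []},
        "license": "Proprietary - internal use only",
        "intended_use": "Training demand-forecasting models",
        "proprietary_data_present": True,
    },
}

# A data supplier would publish this as JSON at the metadata location
# so that downstream consumers can read it without special tooling.
print(json.dumps(record, indent=2))
```

Because the record is plain JSON, it round-trips through standard serialization and can be read by any consumer without dedicated infrastructure.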

FAQs
Why Data Provenance Standards?

Data provenance standards establish criteria for documenting the origin and lifecycle of data, which includes details on how data is collected and how it may be used. This transparency is crucial for driving accuracy, reliability and trustworthiness of data, across all industries.

Who developed these standards and with what purpose?


The Data Provenance Standards were co-developed by 19 different member organizations at the Data & Trust Alliance (D&TA), which represents some of the most significant users of data and AI across industries today. The group was made up of chief technology officers, chief data officers and leaders in data acquisition, data governance, data strategy, data quality, legal and compliance. The goal was to establish a uniform approach to increasing transparency in datasets, thus increasing the integrity of and trust in data and the AI that it feeds.

Are these standards applicable to all industries?


Yes, they are designed to apply to all industries. While we believe they are especially beneficial for sectors like healthcare, finance and technology, any industry that handles data and uses it for AI can benefit from implementing these standards.

What size companies should implement these standards?


All companies, regardless of size, are encouraged to adopt these standards. Implementing them not only strengthens trust in your data integrity within the broader data ecosystem, but can also signal that your business is a leader in data transparency and reliability efforts. Adoption can serve as a competitive advantage, showcasing your commitment to best practices in data management and increasing the value of your data and AI in the marketplace.

How were these standards tested and validated?

Our standards underwent testing through diverse scenarios, including traditional data acquisition, synthetic data tracking in AdTech, and governance for large language models. This cross-industry approach, spanning companies of all sizes, was complemented by validation from industry experts and governance groups. This inclusive process ensured our standards quickly evolved to meet a broad range of stakeholder needs, making them robust and relevant. 

What are the benefits of implementing Data Provenance Standards?


Adopting these standards enhances data transparency, which produces a range of business value. At minimum, the transparency can lead to efficiency and cost savings by decreasing time spent on data acquisition, cleanup, and pre-processing. It also improves quality assurance and security, which are foundational to innovating and building new value with data. Finally, this transparency can help organizations seeking to comply with existing and emerging data protection laws and provisions in AI regulation. In sum, these standards are designed to increase trust in data use and sharing between organizations and with consumers.

How do these standards improve data security and compliance?


By providing a clear history of data provenance and its appropriate use, these standards help in auditing and monitoring data use, thus supporting data security and regulatory compliance efforts.

What are the technical prerequisites for implementing the standards?

For data suppliers, the primary technical requirement is a website where you can publish metadata, which can be captured either through our web interface or a standalone spreadsheet. As a data consumer, there is no need for any technical infrastructure to read the metadata. However, if you intend to pass data downstream to other consumers, you will need the same infrastructure as a data producer, including capabilities for data logging and tracking to ensure compliance and maintain the integrity of the data provenance.

Are there specific technologies or platforms required?

No specific technologies are required, but for future-proofing, you might consider making your systems capable of integrating with APIs and services that support metadata management and audit trails.

How do these standards align with GDPR, CCPA, and other data protection laws?

These standards complement data protection laws by enhancing the ability to test compliance through clear data lineage and permitted use, including data storage and processing requirements and documentation. They do not guarantee compliance with any particular data protection law.

What are the legal implications of not adopting these standards?


There are no legal requirements for adopting these standards. However, adoption can form a part of data privacy compliance efforts.

What is the first step in adopting these standards?


The initial step towards adopting the standards varies by organization, but generally starts with understanding your current position relative to the new standards. Here’s a streamlined approach to get started:

  1. Assessment: Evaluate your existing data management practices to identify discrepancies and areas for improvement compared to the standards.

  2. Engagement and socialization: Host a community of practice meeting to discuss the standards. If you’d like to invite insights from the Data & Trust Alliance (D&TA), get in touch and we can attend your event. Or consider conducting an interview with an executive sponsor in your organization to discuss the value and adoption of the standards during an all-hands or company-wide meeting.

  3. Integration and implementation:

    1. Incorporate the standards into key business processes, such as those led by Project Management Offices (PMO) and Data Governance Boards.

    2. Develop and disseminate internal policies that align with these standards.

    3. Adjust procurement strategies to include standards compliance.

    4. Budget for the necessary resources to adopt the standards. 

    5. Utilize internal communication channels like Slack to keep discussions about standards active.

  4. Deploy as a proof of concept and generate supporting metrics:

    1. Launch a proof of concept with a data provider to build confidence within your team or broader organization.

    2. Incorporate the standards into audit and compliance routines.

    3. Apply the standards directly to your data systems for enhanced transparency and trust.

    4. Analyze how your data catalog aligns with the D&TA standards.

    5. Integrate these standards into governance tools like Collibra, Informatica, or Databricks.

    6. Define and implement metrics to monitor adherence to these standards, including third-party compliance.

How long does it typically take to implement these standards?


The timeline will vary, but we estimate that it could generally range from a few months to more than a year, depending on the size of your organization and the complexity of your data systems.

What internal resources will companies need to allocate?


Resources may include, among others, data, analytics and IT staff for system integration, legal and compliance teams for regulatory alignment, and training personnel.

What are common challenges companies face during the implementation?


Challenges include aligning existing systems with the standards, training employees, and managing the initial costs.

How can a company measure the success of implementing these standards?


Success can be measured by improved compliance audit results, reduced audit issues, increased efficiency and efficacy in assessing whether data is fit for purpose, as well as enhanced data partner trust.

Are there benchmarks or metrics that should be monitored?


Key metrics include the number of data corrections needed, audit trail completeness and compliance audit results.
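A completeness benchmark like the one mentioned above can be operationalized by scoring each dataset's metadata against a checklist of expected fields. The sketch below is a hypothetical Python example: the field list, record shape, and scoring rule are assumptions for illustration, not part of the standards themselves.

```python
# Illustrative completeness metric: the fraction of expected metadata
# fields that are populated in a (flattened) metadata record. The field
# names below are hypothetical, not the official v1.0.0 field names.
REQUIRED_FIELDS = [
    "standards_version", "dataset_title", "metadata_id", "dataset_issuer",
    "issue_date", "origin_geography", "license", "intended_use",
]

def completeness(record: dict) -> float:
    """Return the share of required fields that are present and non-empty."""
    filled = sum(
        1 for field in REQUIRED_FIELDS
        if record.get(field) not in (None, "", [], {})
    )
    return filled / len(REQUIRED_FIELDS)

example = {
    "standards_version": "1.0.0",
    "dataset_title": "Retail Transactions Sample",
    "metadata_id": "urn:example:dataset:12345",
    "dataset_issuer": "Example Corp",
    "issue_date": "2024-07-09",
    "origin_geography": ["US"],
    "license": "",          # missing: empty string counts as unpopulated
    "intended_use": None,   # missing
}

print(f"completeness: {completeness(example):.0%}")  # 6 of 8 fields -> 75%
```

Tracked over time and across data suppliers, a score like this gives governance teams a simple trend line for audit-trail completeness and third-party compliance.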

How can companies stay informed about updates or changes to the standards and connect with others who have adopted them?

Companies can join the D&TA Community of Practice on LinkedIn and participate in related community events and discussions. We foster our community of adopters and encourage everyone to share experiences, best practices and support. Engagement provides networking opportunities, insights into effective implementation strategies and peer support.

Please visit our Community of Practice on LinkedIn. This will move over to the OASIS-supported community over the next few months. Please note that views expressed by adopters in the D&TA Community of Practice represent the views and opinions of such adopters, and do not necessarily represent the views or opinions of the D&TA, and the D&TA makes no representations or warranties of any kind with respect to those views.

Is there a community or network of companies that have adopted these standards?


Yes, the D&TA facilitates a community where adopters can share experiences, best practices, and support. Please visit the D&TA Community of Practice on LinkedIn.

How will these standards be updated over time?

The standards will be updated through feedback from an open practitioner community, managed by the standards body OASIS. The community will be a place for use cases, feedback, troubleshooting, and the eventual evolution of the standards. At the time of publication, we expect that the current v1.0.0 will remain stable for at least 12 to 18 months.