Arms Race for AI Training Data
As I discussed in an earlier blog post, there is an arms race to gather training data for AI models, but organizations do not always recognize that their data is being used for free. This happens when applications with embedded AI, so-called Shadow AI, use company data to train their models.
In this example, we demonstrate the use of data valuation methodologies and AI agents to identify top vendors who are likely using a bank’s data for “free.” The analysis is based on a number of assumptions and should be the starting point for further research and vendor negotiations.

Start With an Inventory of COTS Apps
We started with an inventory of commercial-off-the-shelf (COTS) apps. The dataset was a simple CSV file with application_name, application_description, and vendor_name. The vendor names in this example are illustrative and were not necessarily part of the analysis dataset.
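
A minimal sketch of loading such an inventory in Python, assuming a CSV with the three columns named above. The sample rows, the app names, the vendor names, and the `load_inventory` helper are all made up for illustration and are not from the analysis dataset.

```python
import csv
import io

# Hypothetical sample rows; a real inventory would be read from a file instead.
SAMPLE_CSV = """application_name,application_description,vendor_name
LedgerPro,Core banking ledger and account servicing,Vendor A
ChatAssist,Customer service chatbot with embedded AI,Vendor B
"""

REQUIRED_COLUMNS = {"application_name", "application_description", "vendor_name"}


def load_inventory(handle):
    """Read the COTS inventory and check that the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"inventory is missing columns: {sorted(missing)}")
    return list(reader)


inventory = load_inventory(io.StringIO(SAMPLE_CSV))
print(len(inventory), "apps loaded")
```

Validating the column names up front keeps later steps, such as tagging apps to categories, from failing on a malformed export.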

Value Aggregate Company Data Based on the EDM Council’s Data ROI Methodology
We estimated that the bank’s data was worth 30.1 percent of market capitalization based on the EDM Council’s Data ROI methodology. Back in the day, I co-chaired the EDM Council’s Data ROI Working Group, which brought together more than 100 practitioners across industries. In the current analysis, we applied an additional deflator of 50 percent to be extremely conservative and to account for the bank’s lines of business. Based on our assumptions, the bank’s data was worth approximately $527 million in aggregate.
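
The arithmetic behind this estimate can be sketched as follows. The 30.1 percent share and the 50 percent deflator come from the text; the market capitalization of $3,500 million is a hypothetical input chosen only so that the result lands near the $527 million figure, and reading the deflator as "multiply by one minus the deflator" is also an assumption.

```python
DATA_SHARE_OF_MARKET_CAP = 0.301  # from the text
DEFLATOR = 0.50                   # from the text


def aggregate_data_value(market_cap_m, data_share=DATA_SHARE_OF_MARKET_CAP,
                         deflator=DEFLATOR):
    """Return the deflated aggregate data value in $ millions."""
    return market_cap_m * data_share * (1 - deflator)


# Hypothetical market capitalization in $ millions (not stated in the text).
market_cap_m = 3500.0
value_m = aggregate_data_value(market_cap_m)
print(f"Aggregate data value: about ${round(value_m)} million")
```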

Allocate Data Value to Each Category of COTS Apps
We used our business judgement to allocate the overall data value to individual categories of COTS apps (“data products”). For example, Corebanking was 40 percent of overall data value or approximately $211 million.

Allocate Data Value to Each COTS App
We tagged each COTS app to a single data product category. We then used a simple average to compute the data value per COTS app by data product category. For example, there were 43 corebanking COTS apps with an average data value of $4.9 million.
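
The allocation and simple average can be reproduced with a few lines of Python. Only the Corebanking share (40 percent), the app count (43), and the $527 million aggregate come from the text; the dictionary layout and the `value_per_app` helper are illustrative assumptions.

```python
AGGREGATE_DATA_VALUE_M = 527.0  # $ millions, from the text

# Only the Corebanking figures are given in the text; other categories
# would be added here with their own shares and app counts.
category_share = {"Corebanking": 0.40}
app_count = {"Corebanking": 43}


def value_per_app(category):
    """Return (category value, simple average value per app) in $ millions."""
    category_value = AGGREGATE_DATA_VALUE_M * category_share[category]
    return category_value, category_value / app_count[category]


category_value, per_app = value_per_app("Corebanking")
print(f"Corebanking: ${category_value:.0f}M total, ${per_app:.1f}M per app")
```

This gives roughly $211 million for the category and $4.9 million per app, matching the figures above.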

Research Shadow AI Usage by Vendors with YDC_AIGOV Agents
We researched shadow AI usage by vendors with the YDC_AIGOV agents. These agents discovered apps with embedded AI and highlighted their AI data usage policies.

Adjust COTS App Data Value for Embedded AI and Data Usage Policies
We adjusted the data value of each COTS app downward for embedded AI and data usage policies based on the following algorithm:



Summarize Data Value Captured for AI Training by Vendor
Based on our analysis, COTS vendors as a whole were capturing a potential $199 million in data value for AI training. The top five vendors were capturing approximately $104 million of that value.
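
A rollup like this amounts to summing the adjusted per-app values by vendor and ranking the totals. The sketch below assumes a list of (vendor, adjusted value) pairs; the vendor names, the values, and the `summarize_by_vendor` helper are hypothetical and do not reproduce the figures from the analysis.

```python
from collections import defaultdict

# Hypothetical (vendor, adjusted data value in $ millions) pairs, one per app.
adjusted_values = [
    ("Vendor A", 4.9),
    ("Vendor A", 2.0),
    ("Vendor B", 3.5),
    ("Vendor C", 1.0),
    ("Vendor B", 0.5),
]


def summarize_by_vendor(rows, top_n=5):
    """Sum adjusted values per vendor; return the top_n vendors and the grand total."""
    totals = defaultdict(float)
    for vendor, value in rows:
        totals[vendor] += value
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n], sum(totals.values())


top_vendors, total = summarize_by_vendor(adjusted_values, top_n=2)
for vendor, value in top_vendors:
    print(f"{vendor}: ${value:.1f}M")
print(f"All vendors: ${total:.1f}M")
```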

Chief Data Officers Need to Explore Avenues to Unlock Data Value
Maybe it’s time for Chief Data Officers to explore another avenue to unlock the value of their data?
- Improve negotiating posture with procurement teams
- Update Vendor Master Services Agreements (MSAs) to add clauses restricting the usage of data for AI training
- Get vendors to formally license AI training data with the appropriate safeguards
- Get something back even if it’s free tickets to the vendor’s user conference
- Align with third-party risk management and EU Digital Operational Resilience Act (DORA) compliance