Data Validation

What Is Data Validation?

Data validation is the process of verifying the accuracy, integrity and quality of a set of data before it is used. This can apply to all forms of data, for example, specific text, addresses, dates, and more.

Data makes up the foundation of all solutions, and it goes without saying that for a solution to be effective, the data must be accurate. In Web3, developers, analysts and network participants all rely on data to keep blockchains going. For these players, using valid data is crucial to prevent errors, inconsistencies, risks to users and compromises to a project’s integrity.

The Need for Validity in Web3

Many roadblocks in the Web3 space can be solved by having streamlined, public access to valid data. One is that as blockchains scale, the amount of data they produce becomes overwhelming, so much so that few nodes can keep the entire chain state at hand. As a result, many nodes rely on shared snapshots and trust that they are fully correct and up to date, which leaves room for error.

Ethereum is in the same boat: it provides no incentives for running full nodes, which will soon limit the public resources available for the chain’s historical data. To get access to a full node, a user needs to either run their own or pay a provider for data that should be publicly accessible.

Another major issue data validation addresses is the oracle problem. When projects source off-chain data, oracles are their go-to tool because they provide an easy access point to deterministic Web2 data. However, routing large amounts of data on-chain through an oracle creates a single point of failure.

Since oracles don’t typically have a built-in, truly decentralized validation feature, there is no guarantee that the data they provide is true or hasn’t been manipulated. What can happen, and has already happened quite frequently, is that instead of targeting a protocol directly, an attacker targets the data the protocol sources from an oracle. This is often an easier way for attackers to manipulate a situation in their favor.

With malicious events like these showing no sign of dying down, validation solutions are starting to appear. However, proper data validation is much easier said than done.

Validation Challenges and Inefficiencies

Properly validating data is more complicated than it may seem, since each piece of data involved in executing functions within and across blockchains needs to be validated and kept in sync.

The easiest and most common way to implement data validation is through a centralized server, so that a single entity decides whether a piece of data is accurate. This promotes high-speed performance by eliminating the need to reach consensus across the globe. However, centralization also leaves significant room for errors and malicious actors.

If a validation process is centralized, there is no incentive for other actors to check that the main actor’s work is correct. It also means there is only one actor a hacker needs to take over to gain complete control over the decision-making. With decentralization, hackers need to take over more than 50% of an entire network of nodes to gain control, which lowers the hacking risk and significantly reduces bias and validation errors.

A Decentralized Solution 

The fundamental tenet of Web3 is decentralization, which distributes authority, trust and other virtues across network users and stakeholders. Since actions must travel to every corner of the globe, full decentralization does cause a small delay, but when it comes to validating data, decentralization matters more than lightning-fast performance.

In general, there is no one-size-fits-all way to determine whether a piece of data is valid; developers need to create custom validation methods per data set. What’s lacking is a way to manage these different runtimes and ensure that all data sets are properly sourced and validated quickly and efficiently.

A decentralized Proof of Stake (PoS) data lake could solve this by providing data pools that execute a runtime, i.e., the code responsible for relaying the data, which also includes an abstract implementation of a validation function. That function simply returns true if the data is valid and false if it is not. The chain then computes the result for each data bundle (valid, invalid or dropped) and only keeps track of the valid bundles, so that it only provides access to correct data.
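
To make the idea of a runtime with an abstract validation function more concrete, here is a minimal Python sketch. It is not the code of any particular data lake: the names Runtime, HashMatchRuntime, digest and ValidBundleStore are hypothetical, and the rule "a bundle is valid if it hashes to the same value as the validator's own copy" is only one assumed example of a custom validation method.

```python
import hashlib
import json
from abc import ABC, abstractmethod


class Runtime(ABC):
    """Hypothetical runtime interface: the code a pool runs to check data."""

    @abstractmethod
    def validate(self, proposed: list, local: list) -> bool:
        """Return True if the proposed bundle is valid, False otherwise."""


def digest(bundle: list) -> str:
    # Serialize to canonical JSON so equal content always hashes the same.
    payload = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class HashMatchRuntime(Runtime):
    """Assumed example rule: valid means identical to the validator's copy."""

    def validate(self, proposed: list, local: list) -> bool:
        return digest(proposed) == digest(local)


class ValidBundleStore:
    """Keeps only bundles whose computed result is 'valid'."""

    STATUSES = ("valid", "invalid", "dropped")

    def __init__(self) -> None:
        self._bundles = []

    def record(self, bundle: list, status: str) -> bool:
        if status not in self.STATUSES:
            raise ValueError(f"unknown status: {status}")
        if status == "valid":
            self._bundles.append(bundle)
            return True
        return False  # invalid and dropped bundles are not stored

    def bundles(self) -> list:
        return list(self._bundles)
```

A different data set would plug in its own Runtime subclass with its own validate rule, while the store that serves data to consumers stays the same.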

In each pool, there is a group of nodes, with one randomly selected to be responsible for uploading the data, and the rest accountable for voting on whether that data is valid. Each vote is weighted by how many tokens the node stakes. Once the vote is final, responsibility for uploading the next bundle switches to another randomly selected node. This combats the risk of centralization: if a single node uploaded data at all times, it would be a more attractive target for an attack.
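
The round described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a protocol specification: the Node type and the pick_uploader and tally functions are hypothetical, the uploader is assumed to be drawn uniformly at random, and a bundle is assumed to be "valid" or "invalid" only when more than half of the voters' total stake agrees, with everything else treated as "dropped".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    stake: int  # tokens staked; used as the weight of this node's vote


def pick_uploader(nodes, rng, previous=None):
    """Pick the next uploader at random, skipping the previous one."""
    candidates = [n for n in nodes if n != previous]
    return rng.choice(candidates)


def tally(votes, total_stake):
    """Stake-weighted tally. votes maps Node -> True (valid) / False (invalid).

    Assumed rule: a side wins only with more than half of total_stake;
    otherwise the bundle is dropped.
    """
    stake_valid = sum(n.stake for n, v in votes.items() if v)
    stake_invalid = sum(n.stake for n, v in votes.items() if not v)
    if stake_valid * 2 > total_stake:
        return "valid"
    if stake_invalid * 2 > total_stake:
        return "invalid"
    return "dropped"


if __name__ == "__main__":
    pool = [Node("a", 50), Node("b", 30), Node("c", 20), Node("d", 40)]
    rng = random.Random()
    uploader = pick_uploader(pool, rng)
    voters = [n for n in pool if n != uploader]
    # In this demo every voter approves the bundle.
    result = tally({n: True for n in voters}, sum(n.stake for n in voters))
    print(uploader.name, result)
    # Rotate: the next uploader is drawn from the other nodes.
    print(pick_uploader(pool, rng, previous=uploader).name)
```

Because votes are weighted by stake, a single large staker can outweigh several small ones, which is why the rewards and slashing attached to that stake matter.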

Another key factor in ensuring truly decentralized validation is incentivization via PoS. Since each pool of data relies on nodes to fetch, upload and validate the data, it’s important to promote good behavior via token rewards and punish bad behavior or errors via token slashing. 
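
As a rough illustration of rewards and slashing, the following Python sketch settles one voting round. The settle function, the flat reward of 5 tokens and the 10 percent slash are all hypothetical values chosen for the example; a real network would define its own amounts and rules.

```python
def settle(stakes: dict, votes: dict, outcome: bool,
           reward: int = 5, slash_percent: int = 10) -> dict:
    """Return updated stakes after one round.

    stakes:  node name -> staked tokens
    votes:   node name -> True (voted valid) / False (voted invalid)
    outcome: the final result of the round (True means the bundle was valid)
    """
    updated = dict(stakes)  # leave the caller's dict untouched
    for name, vote in votes.items():
        if vote == outcome:
            # Node agreed with the final result: pay the (assumed) flat reward.
            updated[name] += reward
        else:
            # Node voted against the final result: slash a share of its stake.
            updated[name] -= updated[name] * slash_percent // 100
    return updated


if __name__ == "__main__":
    stakes = {"a": 100, "b": 50, "c": 20}
    votes = {"a": True, "b": True, "c": False}
    print(settle(stakes, votes, outcome=True))
```

Nodes that did not vote are left unchanged in this sketch; whether inactivity should also be penalized is a design decision outside what is described here.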

Web3’s data infrastructure and integrity rely heavily on truly valid data to ensure a scalable and trustless future. As time goes on and more projects recognize how important data validation is, especially in Web3, there will undoubtedly be more aspects taken into account while validating data. The best we can do is continue building and educating around the topic.

Author: Fabian Riewe, Co-Founder and CEO of KYVE, a Web3 data lake solution.

Fabian started his journey as a tech lead in a local ed-tech startup. A hackathon in 2019 sparked his fascination with Web3, and six months later he founded his first successful project, ArVerify, an on-chain KYC system that saw big adoption in the Arweave ecosystem. Shortly after, in 2021, he co-founded KYVE, a decentralized Web3 data lake, with John Letey.