Definition[edit]

Data integrity is the opposite of data corruption.^[4] The overall intent of any data integrity technique is the same: ensure data is recorded exactly as intended (such as a database correctly rejecting mutually exclusive possibilities). Moreover, upon later retrieval, ensure the data is the same as when it was originally recorded. In short, data integrity aims to prevent unintentional changes to information. Data integrity is not to be confused with data security, the discipline of protecting data from unauthorized parties.

Any unintended changes to data as the result of a storage, retrieval or processing operation, including malicious intent, unexpected hardware failure, and human error, is failure of data integrity. If the changes are the result of unauthorized access, it may also be a failure of data security. Depending on the data involved this could manifest itself as benign as a single pixel in an image appearing a different color than was originally recorded, to the loss of vacation pictures or a business-critical database, to even catastrophic loss of human life in a life-critical system.

Integrity types[edit]

Physical integrity[edit]

Physical integrity deals with challenges which are associated with correctly storing and fetching the data itself. Challenges with physical integrity may include electromechanical faults, design flaws, material fatigue, corrosion, power outages, natural disasters, and other special environmental hazards such as ionizing radiation, extreme temperatures, pressures and g-forces. Ensuring physical integrity includes methods such as redundant hardware, an uninterruptible power supply, certain types of RAID arrays, radiation hardened chips, error-correcting memory, use of a clustered file system, using file systems that employ block level checksums such as ZFS, storage arrays that compute parity calculations such as exclusive or or use a cryptographic hash function and even having a watchdog timer on critical subsystems.

Physical integrity often makes extensive use of error detecting algorithms known as error-correcting codes. Human-induced data integrity errors are often detected through the use of simpler checks and algorithms, such as the Damm algorithm or Luhn algorithm. These are used to maintain data integrity after manual transcription from one computer system to another by a human intermediary (e.g. credit card or bank routing numbers). Computer-induced transcription errors can be detected through hash functions.

In production systems, these techniques are used together to ensure various degrees of data integrity. For example, a computer file system may be configured on a fault-tolerant RAID array, but might not provide block-level checksums to detect and prevent silent data corruption. As another example, a database management system might be compliant with the ACID properties, but the RAID controller or hard disk drive's internal write cache might not be.

concerns the concept of a primary key. Entity integrity is an integrity rule which states that every table must have a primary key and that the column or columns chosen to be the primary key should be unique and not null.

Entity integrity

concerns the concept of a foreign key. The referential integrity rule states that any foreign-key value can only be in one of two states. The usual state of affairs is that the foreign-key value refers to a primary key value of some table in the database. Occasionally, and this will depend on the rules of the data owner, a foreign-key value can be null. In this case, we are explicitly saying that either there is no relationship between the objects represented in the database or that this relationship is unknown.

Referential integrity

Domain integrity specifies that all columns in a relational database must be declared upon a defined domain. The primary unit of data in the relational data model is the data item. Such data items are said to be non-decomposable or atomic. A domain is a set of values of the same type. Domains are therefore pools of values from which actual values appearing in the columns of a table are drawn.

User-defined integrity refers to a set of rules specified by a user, which do not belong to the entity, domain and referential integrity categories.

File systems[edit]

Various research results show that neither widespread filesystems (including UFS, Ext, XFS, JFS and NTFS) nor hardware RAID solutions provide sufficient protection against data integrity problems.^[5]^[6]^[7]^[8]^[9]

Some filesystems (including Btrfs and ZFS) provide internal data and metadata checksumming that is used for detecting silent data corruption and improving data integrity. If a corruption is detected that way and internal RAID mechanisms provided by those filesystems are also used, such filesystems can additionally reconstruct corrupted data in a transparent way.^[10] This approach allows improved data integrity protection covering the entire data paths, which is usually known as end-to-end data protection.^[11]

The U.S. has created draft guidance on data integrity for the pharmaceutical manufacturers required to adhere to U.S. Code of Federal Regulations 21 CFR Parts 210–212.^[12] Outside the U.S., similar data integrity guidance has been issued by the United Kingdom (2015), Switzerland (2016), and Australia (2017).^[13]

Food and Drug Administration

Various standards for the manufacture of medical devices address data integrity either directly or indirectly, including , ISO 14155, and ISO 5840.^[14]

ISO 13485

In early 2017, the (FINRA), noting data integrity problems with automated trading and money movement surveillance systems, stated it would make "the development of a data integrity program to monitor the accuracy of the submitted data" a priority.^[15] In early 2018, FINRA said it would expand its approach on data integrity to firms' "technology change management policies and procedures" and Treasury securities reviews.^[16]

Financial Industry Regulatory Authority

Other sectors such as mining and product manufacturing^[18] are increasingly focusing on the importance of data integrity in associated automation and production monitoring assets.

[17]

Cloud storage providers have long faced significant challenges ensuring the integrity or provenance of customer data and tracking violations.^[20]^[21]

Definition[edit]

Integrity types[edit]

Physical integrity[edit]

Entity integrity

Referential integrity

File systems[edit]

Food and Drug Administration

ISO 13485

Financial Industry Regulatory Authority

[17]

[19]

End-to-end data integrity

Message authentication

National Information Assurance Glossary

Single version of the truth