In both the classic triad of information assurance (IA) and the Parkerian Hexad, integrity is a fundamental attribute of information that must be protected. Data integrity refers to the correctness of information; for example, it can mean that data remain consistent with their original and intended state.
Recently, a colleague and several of his research students ran into a problem when they tried to import data from a comma-delimited (CSV) file into their version (2007) of MS-Excel, the widely used spreadsheet program. They found unrecognized characters in the CSV file that showed up as squares with a question mark inside. They asked me for help, and I loaded the CSV into MS-Word 2007, where it was obvious that the characters were TABs, even though they should not have been there given that all of the data were separated by commas.
After we deleted the TABs by using the global replace function (CTRL-H) to locate every ^t character and replace it with nothing, the question arose of how to check the converted data against the original version that had contained the TAB characters. There was no point in applying the supposed correction if it caused discrepancies between the intended version of the data and the modified data.
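For readers who prefer a script to a word processor, the same cleanup and a first sanity check can be sketched in Python. This is only an illustration under assumptions of my own: the function names `strip_tabs` and `rows_with_wrong_width` and the sample data are hypothetical, and the check assumes that the first row is a header whose field count every other row should match.

```python
import csv
import io


def strip_tabs(text: str) -> str:
    """Remove every TAB character, like replacing ^t with nothing."""
    return text.replace("\t", "")


def rows_with_wrong_width(text: str) -> list[tuple[int, int]]:
    """Return (row_number, field_count) for each row whose number of
    fields differs from the header row (assumed to be row 1)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    expected = len(rows[0])
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if len(row) != expected
    ]


# Hypothetical sample: stray TABs, plus a row that is one field short.
sample = "id,name,score\n1,\tAnn,90\n2,Bob\t\n"
cleaned = strip_tabs(sample)
print(rows_with_wrong_width(cleaned))  # [(3, 2)]
```

A check like this does not prove that the cleaned data are correct; it only flags rows that no longer fit the expected arrangement of columns and therefore deserve a closer look.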
Sure enough, we immediately located some places where additional fixes would be required to make the data conform to the intended arrangement of rows and columns. After we found the discrepancies, it became clear that none of the students had ever thought about how to locate differences between two versions of their data.
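One way to locate differences between two versions of a text file is a line-by-line comparison. The sketch below uses Python's standard `difflib` module; it is not the method we used at the time, and the function name `differing_lines` and the sample strings are hypothetical.

```python
import difflib


def differing_lines(original: str, modified: str) -> list[tuple]:
    """Compare two versions line by line and return one entry per
    changed block: (kind, first original line number, old lines, new lines)."""
    old = original.splitlines()
    new = modified.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    report = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":  # kind is "replace", "delete", or "insert"
            report.append((kind, i1 + 1, old[i1:i2], new[j1:j2]))
    return report


# Hypothetical sample: the only change is a TAB removed from line 2.
before = "id,name\n1,\tAnn\n2,Bob\n"
after = "id,name\n1,Ann\n2,Bob\n"
for entry in differing_lines(before, after):
    print(entry)  # ('replace', 2, ['1,\tAnn'], ['1,Ann'])
```

Each reported block shows where the two versions diverge, so an unexpected change stands out from the intended removal of TABs.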