- Phase 1: Parse each file individually and store detected structures in a database.
- Phase 2: Parse each file again, compare its raw data to the structures already detected, and use the matches to identify new structures.
- Phase 3: Compare the detected structures of every file, and every file's raw data, against each other to identify further new structures.
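
To make the three phases concrete, here is a rough sketch in Python. It is not FRET's actual code: the detectors (`detect_ascii_runs`, a shared-prefix check) are simple stand-ins I picked for illustration, the function names are hypothetical, and a plain dict plays the role of the database.

```python
from collections import defaultdict

def detect_ascii_runs(data: bytes, min_len: int = 4):
    """Stand-in detector: (position, length) of printable ASCII runs."""
    runs, start = [], None
    for i, b in enumerate(data):
        if 32 <= b < 127:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs

def phase1(files):
    """Phase 1: parse each file on its own and store what was detected."""
    return {name: detect_ascii_runs(data) for name, data in files.items()}

def phase2(files, db):
    """Phase 2: search each file's raw data for byte patterns already in db."""
    known = {files[n][p:p + l] for n, runs in db.items() for p, l in runs}
    found = defaultdict(list)
    for name, data in files.items():
        for pattern in known:
            pos = data.find(pattern)
            while pos != -1:
                # Only report occurrences Phase 1 did not already record.
                if (pos, len(pattern)) not in db[name]:
                    found[name].append((pos, len(pattern)))
                pos = data.find(pattern, pos + 1)
    return dict(found)

def phase3(files):
    """Phase 3 (stand-in): compare all raw data; here, the longest shared prefix."""
    blobs = list(files.values())
    n = 0
    while all(len(b) > n for b in blobs) and len({b[n] for b in blobs}) == 1:
        n += 1
    return blobs[0][:n] if blobs else b""

# Two made-up buffers standing in for files of an unknown format.
files = {
    "a.bin": b"\x01\x02NAME\x00\x00",
    "b.bin": b"\x01\x02\x00xNAMEy\x00",
}
db = phase1(files)
new_structures = phase2(files, db)
shared_prefix = phase3(files)
```

In this toy run, Phase 1 only sees `xNAMEy` as one string in `b.bin`; Phase 2 notices that `NAME`, already known from `a.bin`, sits inside it; Phase 3 picks up the two bytes both buffers start with.
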

I've also developed (borrowed, really) the following terminology to describe what's happening: GRAM - a data structure or pattern of bytes detected in a file or buffer. Each GRAM has a position, length, type, confidence level and parent. The term is taken from the classical Greek word for a letter. I'm using it because it is the root of the words bigram and trigram, which are used in cryptography when performing statistical analysis. FRET, after all, was inspired by coincidence counting - why not treat an unknown file format like an encrypted file for analysis purposes?
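
As a sketch of what a GRAM record might look like in code, here is a small Python dataclass with the five fields listed above. The field types, the 0.0 to 1.0 confidence scale, the example type labels and the `end` and `depth` helpers are my assumptions for illustration, not a fixed part of FRET.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Gram:
    position: int                     # byte offset within the file or buffer
    length: int                       # number of bytes the structure covers
    type: str                         # hypothetical label, e.g. "magic"
    confidence: float                 # assumed scale: 0.0 (guess) to 1.0 (certain)
    parent: Optional["Gram"] = None   # enclosing GRAM, or None at the top level

    @property
    def end(self) -> int:
        """Offset of the first byte after this GRAM."""
        return self.position + self.length

    def depth(self) -> int:
        """Number of ancestors between this GRAM and the top level."""
        d, node = 0, self.parent
        while node is not None:
            d += 1
            node = node.parent
        return d

# A made-up 16-byte header containing a 4-byte magic number.
header = Gram(position=0, length=16, type="header", confidence=0.9)
magic = Gram(position=0, length=4, type="magic", confidence=0.99, parent=header)
```

The parent link is what lets small GRAMs nest inside larger ones, so a detected field can point back at the record or header it was found in.
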
