Friday, September 2, 2005

FRET v0.0.5 Design Changes


Work is currently progressing on FRET version 0.0.5. The major change in this version is a swap in the order of the Phase 4 and Phase 5 Scans, driven by an improvement in ScanGrind. Previously, when ScanGrind identified a data structure that was common to two Buffers, it created a Group Gram with the offset from the first Buffer. This lost valuable information about the structure's location in the second Buffer. The new version of ScanGrind creates two Buffer Grams, one for each Buffer, and no Group Gram. A Phase 5 Scan will then notice that the two Buffer Grams are identical and will create a Group Gram with frequency 2.
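To make the new behaviour concrete, here is a minimal Python sketch of the grouping step. It is not FRET's code: the BufferGram class, its field names and the group_identical function are hypothetical, and the sketch assumes two Grams are "identical" when their bytes match exactly.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferGram:
    """Hypothetical Buffer Gram: a byte pattern found in one Buffer."""
    buffer_id: int  # which Buffer the pattern was found in
    offset: int     # where the pattern starts in that Buffer
    data: bytes     # the matched bytes


def group_identical(grams):
    """Create one Group Gram per byte pattern seen in two or more Buffers."""
    by_data = defaultdict(list)
    for gram in grams:
        by_data[gram.data].append(gram)

    groups = []
    for data, members in by_data.items():
        buffers = {m.buffer_id for m in members}
        if len(buffers) >= 2:
            groups.append({
                "data": data,
                "frequency": len(buffers),
                # Keeps one offset per Buffer, so no location is lost.
                "offsets": {m.buffer_id: m.offset for m in members},
            })
    return groups


# New-style ScanGrind output: one Buffer Gram per Buffer.
grams = [BufferGram(0, 16, b"HDR1"), BufferGram(1, 48, b"HDR1")]
groups = group_identical(grams)
```

Because each Buffer Gram carries its own offset, the resulting Group Gram can record both locations (16 and 48 here) rather than only the offset from the first Buffer.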

Order of analysis phases modified

Detecting data structures and formats in a buffer or file is a complex task. For this reason the process has been sub-divided into multiple sub-tasks, each of which performs a small piece of analysis. It is through the amalgamation of these functions that a clear picture is derived of the structures within the target data. FRET divides these tasks into 6 Phases. Each small task that performs a specific function in the analysis is called a Scan. Each Phase contains multiple Scans and all the Scans within a Phase share the same characteristics (and interface). This modular design allows for the easy addition, deletion or re-ordering of Scans. As new Scans are added and refined, the program will grow in ability. Each phase of analysis is now briefly described.
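The modular design can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the Pipeline class, the single shared Scan signature (in FRET the interface is shared within a Phase, not necessarily across Phases) and the two placeholder Scans.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a Scan takes the analysis state and returns it.
Scan = Callable[[dict], dict]


class Pipeline:
    """Runs registered Scans phase by phase, in phase order."""

    def __init__(self, phase_count: int = 6):
        self.phases: Dict[int, List[Scan]] = {
            n: [] for n in range(1, phase_count + 1)
        }

    def register(self, phase: int, scan: Scan) -> None:
        self.phases[phase].append(scan)

    def run(self, state: dict) -> dict:
        for phase in sorted(self.phases):
            for scan in self.phases[phase]:
                state = scan(state)
        return state


def find_strings(state: dict) -> dict:      # placeholder Phase 2 Scan
    state["log"].append("phase 2: find_strings")
    return state


def compare_buffers(state: dict) -> dict:   # placeholder Phase 4 Scan
    state["log"].append("phase 4: compare_buffers")
    return state


pipeline = Pipeline()
pipeline.register(4, compare_buffers)  # registration order does not matter
pipeline.register(2, find_strings)
result = pipeline.run({"log": []})
```

With this arrangement, adding, removing or re-ordering a Scan is a change to the registration calls only.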

Before commencing the description of the phases, it is key to remember the aim of FRET. It is designed as a command-line utility (and library) that takes a number of similar raw data files (or buffers), analyses them and outputs a table describing the layout of data in these files. If all the files are different then it will find little in common between them. However, if the data follows the same format, i.e. the values change but the layout is the same, then a lot can be learned from the analysis and a good guess can be made at the file format. In FRET terminology a detected pattern of bytes in a file is called a Gram, a term derived from the Greek word for a letter and commonly used in cryptanalysis.

The analysis of a buffer begins with the Phase 1 Scans. These Scans examine the raw data, identify whether compression or obfuscation is used and transform the data back to its original form. If the later Phases were to try to analyse compressed data, for example, it would be like trying to analyse random data for patterns - some would be found but they would be meaningless. Currently, there are no Phase 1 Scans implemented in FRET. It is probable that the first Scan to be added will uncompress a buffer.
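Since no Phase 1 Scan exists yet, the following Python sketch only illustrates what an uncompress Scan might look like. The function name is hypothetical and zlib is chosen purely as an example format; it relies on the standard library calls zlib.decompress and zlib.error.

```python
import zlib


def phase1_decompress(buffer: bytes) -> bytes:
    """Return the decompressed data if the buffer is a valid zlib stream,
    otherwise return the buffer unchanged."""
    try:
        return zlib.decompress(buffer)
    except zlib.error:
        # Not zlib data (or corrupt): leave it for the later Phases as-is.
        return buffer


original = b"name=value;" * 20
restored = phase1_decompress(zlib.compress(original))
untouched = phase1_decompress(b"plain text")
```

Passing uncompressed data through unchanged means the Scan can be run on every buffer without knowing in advance which ones are compressed.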

Phase 2 Scans look at each file individually and attempt to identify data structures that are file format independent. These include ASCII strings, Unicode strings, fill bytes, x86 instructions and countless other recognisable data formats. Depending on the Scan and the raw data, the confidence that a detected Gram is valid varies, since it may be a false positive. Each Gram is therefore assigned a risk: the probability that it is a randomly occurring pattern in the data and is not valid. This allows Grams to be sanity checked later.
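As an illustration of a Phase 2 Scan, here is a Python sketch that finds runs of printable ASCII and attaches a risk to each. The function name, the dictionary layout and the risk model are all assumptions: the risk is taken to be the chance that that many uniformly random bytes are all printable, (95/256)^length, which is unlikely to be exactly how FRET calculates it.

```python
PRINTABLE = range(0x20, 0x7F)  # the 95 printable ASCII codes


def scan_ascii_strings(buffer: bytes, min_len: int = 4):
    """Return a Gram for every run of at least min_len printable bytes."""
    grams = []
    start = None
    # The appended zero byte is a sentinel that closes a run at the end.
    for i, byte in enumerate(buffer + b"\x00"):
        if byte in PRINTABLE:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                length = i - start
                grams.append({
                    "type": "ascii",
                    "offset": start,
                    "length": length,
                    # Assumed model: uniformly random bytes.
                    "risk": (95 / 256) ** length,
                })
            start = None
    return grams


grams = scan_ascii_strings(b"\x00\x01hello\x00ab\xffworld!")
```

Under this model longer strings receive a lower risk, which matches the intuition that a long printable run is less likely to be an accident than a short one.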

Once a Buffer of data has been analysed and Grams have been detected, it may be possible to use these Grams to detect further patterns in the data. Phase 3 Scans compare detected Grams against the raw data looking for relationships. One example of this is an offset to data in a file - often within a file format there is a table that contains identifiers and offsets to data later in the file. For example, if a string is detected in the file and earlier in the file a value that points to the start of this string is detected, it is possible that this is an offset. It may also be possible to detect fields that describe the length of other fields in a file.
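The offset example can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes offsets are stored as little-endian unsigned integers of 2 or 4 bytes measured from the start of the file; a real Scan would need to try other widths, byte orders and base addresses.

```python
import struct


def find_offset_candidates(buffer: bytes, gram_offset: int, width: int = 4):
    """Return positions before gram_offset whose stored value equals
    gram_offset (little-endian, unsigned)."""
    fmt = {2: "<H", 4: "<I"}[width]
    needle = struct.pack(fmt, gram_offset)
    hits = []
    # Only search the bytes that precede the Gram itself.
    pos = buffer.find(needle, 0, gram_offset)
    while pos != -1:
        hits.append(pos)
        pos = buffer.find(needle, pos + 1, gram_offset)
    return hits


# A 4-byte field holding the value 12, padding, then a string at offset 12.
buffer = struct.pack("<I", 12) + b"\x00" * 8 + b"hello"
candidates = find_offset_candidates(buffer, gram_offset=12)
```

Each hit is only a candidate: small values such as 12 occur by chance, so a result like this would itself deserve a risk.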

At this stage, no more information can be gained from looking at a file in isolation. It is now time to compare it to its peers and this is where the real power of FRET is unleashed. Phase 4 Scans compare the raw data in multiple files, looking for commonalities and differences. Historically, binary diff tools have been used for this type of analysis and they remain valuable, yet there is value in focusing on the similarities between files, not the differences. ScanGrind compares two files looking for all the byte patterns that are the same; the patterns do not have to be in the same place in both files.
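One way to find byte patterns shared by two files regardless of position is to index fixed-length seeds from the first file and extend each match found in the second. The Python sketch below shows that technique; it is not necessarily how ScanGrind works, and the function name and the min_len parameter are hypothetical.

```python
from collections import defaultdict


def common_runs(a: bytes, b: bytes, min_len: int = 4):
    """Return (offset_in_a, offset_in_b, length) for every maximal run of
    at least min_len bytes that appears in both buffers."""
    # Index every min_len-byte seed in the first buffer.
    index = defaultdict(list)
    for i in range(len(a) - min_len + 1):
        index[a[i:i + min_len]].append(i)

    runs = []
    for j in range(len(b) - min_len + 1):
        for i in index.get(b[j:j + min_len], ()):
            # Skip seeds that merely continue a run already reported.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the match to the right as far as it goes.
            n = min_len
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            runs.append((i, j, n))
    return runs


file_a = b"\x00\x01MAGIC\x02\x03"
file_b = b"\xff\xff\xff\xffMAGIC\xee"
runs = common_runs(file_a, file_b)
```

Each reported run corresponds to two Buffer Grams, one at each offset, so neither location is discarded.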

Now you may think it is finished, but no, there are two more phases of analysis left. Firstly, the Phase 5 Scans compare the Grams that have been detected for each Buffer and create new Group Grams. This analysis needs to be fuzzy - a file format may specify a string at a certain location, but the length and exact offset of the string will differ between files. The algorithms for the Phase 5 Scans must compare Grams, looking at factors such as 'approximate' location, preceding Grams, Gram type etc., and make a guess.
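A fuzzy comparison along these lines might look like the Python sketch below. The dictionary keys, the prev_type field and the 16-byte tolerance are all assumptions chosen for illustration; the real Phase 5 algorithms may weigh these factors quite differently.

```python
def similar(g1: dict, g2: dict, max_offset_delta: int = 16) -> bool:
    """Guess whether two Buffer Grams play the same role in the format."""
    return (
        g1["type"] == g2["type"]
        and abs(g1["offset"] - g2["offset"]) <= max_offset_delta
        and g1.get("prev_type") == g2.get("prev_type")
    )


def fuzzy_group(grams_a, grams_b, max_offset_delta: int = 16):
    """Pair each Gram from the first Buffer with the first unused similar
    Gram from the second Buffer and emit a Group Gram for each pair."""
    groups = []
    used = set()
    for ga in grams_a:
        for idx, gb in enumerate(grams_b):
            if idx not in used and similar(ga, gb, max_offset_delta):
                used.add(idx)
                groups.append(
                    {"type": ga["type"], "frequency": 2, "members": [ga, gb]}
                )
                break
    return groups


# The same string field: different length and slightly different offset.
grams_a = [{"type": "ascii", "offset": 32, "length": 5, "prev_type": "fill"}]
grams_b = [{"type": "ascii", "offset": 36, "length": 9, "prev_type": "fill"},
           {"type": "ascii", "offset": 400, "length": 5, "prev_type": "fill"}]
groups = fuzzy_group(grams_a, grams_b)
```

Note that length is deliberately not compared, since a string field is expected to vary in length from file to file.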



Finally, once all this is completed, it is time to clean up the results. Phase 6 Scans iterate through the detected Grams, removing or amalgamating Grams based on a range of criteria. After Phase 6, the Grams for the Buffers and the Group can be passed to the user, who can further process the data using custom scripts or tools.
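As a sketch of what such a clean-up might involve, the Python below drops Grams whose risk is too high and amalgamates overlapping or adjacent Grams of the same type. The two criteria, the 0.01 threshold, the dictionary keys and the choice to keep the lower risk when merging are all assumptions, not FRET's actual rules.

```python
def clean_up(grams, max_risk: float = 0.01):
    """Remove high-risk Grams, then merge same-type Grams that overlap
    or touch. The input list is not modified."""
    kept = sorted(
        (g for g in grams if g["risk"] <= max_risk),
        key=lambda g: g["offset"],
    )
    merged = []
    for gram in kept:
        last = merged[-1] if merged else None
        if (last is not None
                and last["type"] == gram["type"]
                and gram["offset"] <= last["offset"] + last["length"]):
            end = max(last["offset"] + last["length"],
                      gram["offset"] + gram["length"])
            last["length"] = end - last["offset"]
            # Arbitrary choice for this sketch: keep the lower risk.
            last["risk"] = min(last["risk"], gram["risk"])
        else:
            merged.append(dict(gram))  # copy so the input stays untouched
    return merged


grams = [
    {"type": "fill", "offset": 0, "length": 8, "risk": 0.001},
    {"type": "fill", "offset": 8, "length": 8, "risk": 0.002},
    {"type": "ascii", "offset": 20, "length": 2, "risk": 0.14},
]
cleaned = clean_up(grams)
```

Here the two adjacent fill Grams become one 16-byte Gram and the short, high-risk string is discarded.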