8.13.1. Configuration of Duplicate Code Blocks Computation

The settings for how duplicates are located can be adjusted at "System" > "Configure..." > "Duplicate Code". Usually, the default settings are acceptable. To understand how the configuration parameters work, it helps to know how the algorithm works. The main process is as follows:

  • First, candidates for start lines of duplicate code blocks are determined. For this, all lines of all source files are read.

  • If a line is too short (shorter than the number given in the configuration parameter "Minimal Line Length"), it is discarded. This saves memory, because every line that is not discarded may have to be stored in case copies of it occur.

  • Each non-discarded line is space-normalized (i.e., sequences of whitespace characters are replaced by a single space character, and words that are not separated by whitespace characters are separated by a single space character). This normalization makes it possible to detect almost-copied blocks that differ from each other only in their whitespace.

  • Lines that occur too often (more often than the number given in the configuration parameter "Maximal Number of Copies") are discarded. This excludes from the duplicate analysis, for example, preambles that start every file.

  • For each pair of identical lines that results from the steps above, the algorithm checks whether they are the start of a duplicate code block. Only blocks that have a certain minimum length are reported (configuration parameter "Minimal Block Length").

  • Two other parameters allow for a certain "slack" in the comparison so that not only completely identical blocks are found, but also blocks that differ a bit.

    1. The configuration parameter "Maximal Tolerance per Edit" works like this: When two text blocks are compared, the comparison algorithm allows some differences, or "edits". Each single edit may add, remove or change at most the number of lines given by this parameter in order to make the blocks identical. Note that after the edited region, the two blocks must continue identically for at least one line.

    2. The configuration parameter "Maximal Relative Tolerance Percentage" works like this: When comparing two blocks, the number of edited lines in relation to the number of matched lines may never be larger than this percentage.

    The total number of lines in all the edits that occur in a block comparison is the "tolerance" of the comparison. The larger it is, the more lines would have to be changed to turn one block into an exact copy of the other.

  • Up to this point, the algorithm only identifies pairs of start lines of duplicated blocks. The last step in the identification of duplicated blocks is the aggregation: code blocks are considered duplicates of one another not only when they form a result pair in the algorithm above, but also when they are indirectly copies of one another. For example, consider two already identified pairs of duplicated blocks, A, B on the one hand and C, D on the other hand, where the start of B equals the start of C; then A, B, C and D are all considered to be duplicates of the same code block. This aggregation is repeated until no more blocks can be aggregated. The tolerance specified for a code block in the view is the minimal tolerance that occurred during the comparison of the block with other code.
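
The steps above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation. All function and parameter names (normalize_line, candidate_lines, within_tolerance, aggregate) are hypothetical, and several details are assumptions: line length is measured after stripping surrounding whitespace, "words" are split with a simple regular expression, and a block is identified by its start position, so two blocks with the same start are the same key.

```python
import re
from collections import defaultdict


def normalize_line(line):
    # Assumed tokenization: runs of word characters, or single symbols.
    # Joining with one space collapses whitespace and separates
    # adjacent tokens such as "x=1" into "x = 1".
    return " ".join(re.findall(r"\w+|[^\w\s]", line))


def candidate_lines(files, min_line_length, max_copies):
    """Map each normalized line to the places where it occurs.

    files: dict of file name -> list of source lines.
    Lines shorter than min_line_length are dropped before normalization;
    lines occurring more than max_copies times are dropped afterwards.
    """
    occurrences = defaultdict(list)
    for name, lines in files.items():
        for index, line in enumerate(lines):
            if len(line.strip()) < min_line_length:
                continue  # "Minimal Line Length"
            occurrences[normalize_line(line)].append((name, index))
    # Keep only lines that can form a pair and are not too frequent
    # ("Maximal Number of Copies").
    return {text: places for text, places in occurrences.items()
            if 2 <= len(places) <= max_copies}


def within_tolerance(edit_sizes, matched_lines, max_per_edit, max_relative_percent):
    """Check both slack parameters for one block comparison.

    edit_sizes: number of lines touched by each individual edit.
    """
    if any(size > max_per_edit for size in edit_sizes):
        return False  # "Maximal Tolerance per Edit"
    tolerance = sum(edit_sizes)
    # "Maximal Relative Tolerance Percentage", written without division.
    return tolerance * 100 <= max_relative_percent * matched_lines


def aggregate(pairs):
    """Merge duplicate pairs that share a block start into groups (union-find)."""
    parent = {}

    def find(start):
        parent.setdefault(start, start)
        while parent[start] != start:
            parent[start] = parent[parent[start]]  # path halving
            start = parent[start]
        return start

    for first, second in pairs:
        parent[find(first)] = find(second)

    groups = defaultdict(set)
    for start in list(parent):
        groups[find(start)].add(start)
    return list(groups.values())
```

In this sketch, the pairs A, B and C, D from the example above collapse into one group because B and C share a start position and therefore the same key. Raising max_copies or lowering min_line_length enlarges the candidate set, which is why the corresponding parameters influence memory use.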