Clone spotting: Find code clones with their DNA

Find code clones with their DNA

Unidentical twins

Code clones can be elusive. First of all, we’re not talking about the dumb copy/paste textual clones, we’re interested in the harder to see algorithmic kind (we met both in a previous post).
And algorithmic clones are the sneaky ones: they disguise using different variable names, muddy the waters with varying number of statements, and hide behind diverging comments.

Déjà-vu

Let’s start with a simple example: here are two functions that are algorithmic clones.

  • They have different names
  • Out of the two attributes, the first is renamed, the second is the same
  • They share the same algorithm, a sequence of eight ‘if’ statements, returning 0 or 1
  • Finally, all ‘if’ tests have a lot in common, with a few differences (renamed variables)
Algorithmic clones on visually similar code

Have we met before?

Now look at the same first function, next to an attempt at obfuscation.

Unlike the previous example, this is not a case of “clone at first sight”.

And yet they really are the exact same algorithms.

If you are skeptical, keep on reading!

Algorithmic clones on visually different code

Algorithmic DNA matching

Using the control flow chart

As discussed in our Anatomy of a source code post, algorithms can be displayed as a chart, showing statements, conditions, control flow change, etc.

Using this representation, all functions from our previous examples have the exact same chart. Yes, even the obfuscated one.

One control flow chart to describe them all

So this method can be used to focus on the very structure of algorithms,.
To do that, we leave out all names, comments and block delimiters. In a way, we are looking at the algorithm’s DNA, and matching it changes clone spotting into a fun graphical game 😊.

Spot the clone!

Here is a first example of real code.

Look at these two functions, and quickly tell if they are algorithmically similar or not.

At first glance, they seem to be the same functions.

But to be sure, you really need to compare each individual line, check variable names and string changes.

This is no fun.    

These functions look the same, but are they?

Now look at the control flow charts for both functions. It should only take you a few seconds to hit the “Found a Clone” buzzer!

Same control flow graph means same algorithm

Recognizing algorithms from a long way away

Here is another example.

This function is not that huge (under 300 lines), but if you were asked to look for algorithm similarities, time would suddenly feel very long.

Who wants to search repeating algorithmic patterns in this code?

Now look at the corresponding control flow chart. In a few seconds, your eyes will detect similar patterns, and you will just be eager to look into the code and work on these clones!

Visual algorithmic patterns are much easier to detect

Going further

Now that we know how to look for these clones, the next steps are:

  • Search
  • Select
  • Simplify

First, based on the control flow DNA as an algorithmic signature, we can look for all clones in a project, sorting them from most to less similar (using a threshold to only keep significant clones).

Next, we can select clones which are relevant.
Discarding for example those from legacy code (which may be risky to modify), or from code needing to run on separate targets.

Finally, the selected clones can be simplified (by way of factorization).

Here is the result on our previous example.

=+
A code containing clones before and after factorization

The benefits of this factorization are immediate, and durable:

  • Smaller footprint
    The cloned part (repeated three times) now only appears one time, the resulting two functions are smaller, their complexity is reduced.
  • Simpler tests
    It is easier to test smaller functions., and to ensure test coverage.
  • Better maintainability
    In the future, debug and enhancements won’t suffer from a complex function you’ll have to understand before modifying.
  • Optimized debug
    If a bug is found in the factorized code (which used to be cloned), it will only have to be fixed in one place, and there is no risk of forgetting to fix it everywhere this code was cloned.

Looking ahead

Clone detection leading to factorization is more than an exercise of visual recognition.

We started using terms like complexity, coverage, maintainability, risk. For each one, there are fields of study, standards, methods, best practices, tools.

Software development can greatly benefit from these fields, which assist us in writing better code.
In the next posts, let’s continue exploring the building blocks of a greater edifice.
You guessed it: we’re talking about software quality.

Further readings

Share:

Legal Notice

This text is the intellectual property of the author and is copyrighted by coderskitchen.com. You are welcome to reuse the thoughts from this blog post. However, the author must always be mentioned with a link to this post!

Leave a Comment

Related Posts

Int code overflow - coderskitchen
John Paliotta

Why is everything an int?

In C/C++ applications, most scalar variables are defined as ‘int’.  Do these applications deal with lots of large numbers that need 32-bit integers?  Not likely,

Flavien Huynh

Measuring and monitoring technical debt

The technical debt concept has been around for some time now. We can find it in international conferences, in scientific articles, books, and even here

Improving code quality
Flavien Huynh

Software quality: To the rescue!

In this post we’ll show what a healthy relationship with code quality looks like. After our introductory post, software quality (‘SQ’ for friends) might seem