What’s the Difference? Creating Diffs with JGit

Posted on June 16, 2016, in Eclipse, with 20 Comments

In this post, I will dig into the details of how to diff revisions and create patches with JGit, starting from the high-level DiffCommand and working all the way down to the more versatile APIs for finding particular changes in a file.

DiffCommand, Take I

The diff command can be used to compare two revisions and report which files were changed, added, or removed. A revision, in this context, may originate from a commit as well as from the working directory or the index.

The simplest form of creating a diff in JGit looks like this:

git.diff().setOutputStream( System.out ).call();

If invoked without specifying which revisions to compare, the differences between the work directory and the index are determined.

The command prints a textual representation of the diff to the designated output stream:

diff --git a/file.txt b/file.txt
index 19def74..d5fcacb 100644
--- a/file.txt
+++ b/file.txt
@@ -1 +1,2 @@
 existing line
+added line
\ No newline at end of file

In addition, the call() method also returns a list of DiffEntries. These data structures describe the added, removed, and changed files and can also be used to determine the changes within a certain file.
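To make the two kinds of output more tangible, here is a minimal sketch that captures the patch text in a byte array and also walks the returned list. The class name, the temporary directory, and the file contents are made up for illustration; only the JGit calls themselves are taken from the library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.errors.GitAPIException;
import org.eclipse.jgit.diff.DiffEntry;

final class WorkDirDiffExample {

  /** Stages file.txt, changes it in the work directory and returns the patch text. */
  static String diffWorkDirAgainstIndex() throws IOException, GitAPIException {
    // Hypothetical scratch repository, left behind in the temp folder for inspection
    Path dir = Files.createTempDirectory( "jgit-diff" );
    try( Git git = Git.init().setDirectory( dir.toFile() ).call() ) {
      Path file = dir.resolve( "file.txt" );
      Files.write( file, "existing line\n".getBytes( StandardCharsets.UTF_8 ) );
      git.add().addFilepattern( "file.txt" ).call();
      // Change the file after staging so that work directory and index differ
      Files.write( file, "existing line\nadded line\n".getBytes( StandardCharsets.UTF_8 ) );

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      List<DiffEntry> entries = git.diff().setOutputStream( out ).call();
      for( DiffEntry entry : entries ) {
        System.out.println( entry.getChangeType() + " " + entry.getNewPath() );
      }
      return new String( out.toByteArray(), StandardCharsets.UTF_8 );
    }
  }

  public static void main( String[] args ) throws Exception {
    System.out.print( diffWorkDirAgainstIndex() );
  }
}
```

Writing to a ByteArrayOutputStream instead of System.out is just one option; any OutputStream will do.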

But how can two arbitrary revisions be compared? By taking a closer look at the DiffCommand it becomes apparent that it actually compares two trees instead of revisions. And that explains why the working directory and index (which are trees themselves) can also be compared without extra effort.

Consequently, the diff command expects parameters of type AbstractTreeIterator to specify the old and new tree to be compared. Sometimes old and new are also referred to as source and destination or simply a and b. To learn more about what trees in Git are, you may want to read Explore Git Internals with the JGit API.

Tree Iterators

But how do you get hold of a specific tree iterator? Looking at the type hierarchy of AbstractTreeIterator reveals that there are four implementations of interest.

The FileTreeIterator can be used to access the work directory of a repository. It is ready to use once the repository has been passed to its constructor:

AbstractTreeIterator treeIterator = new FileTreeIterator( git.getRepository() );

The DirCacheIterator reveals the contents of the dir cache (aka index) and can be created in a similar way as the FileTreeIterator. Given a repository, we can tell it to read the index and pass this instance to the DirCacheIterator like so:

AbstractTreeIterator treeIterator = new DirCacheIterator( git.getRepository().readDirCache() );

Most interesting, however, is probably the CanonicalTreeParser. It can be configured to parse an arbitrary Git tree object. To do so, it needs to be reset with the id of a tree object from the repository. Once set up, it can be used to iterate over the contents of this tree.

This is best illustrated with the following example:

CanonicalTreeParser treeParser = new CanonicalTreeParser();
ObjectId treeId = repository.resolve( "my-branch^{tree}" );
try( ObjectReader reader = repository.newObjectReader() ) {
  treeParser.reset( reader, treeId );
}

The tree parser is now configured to iterate over the tree of the commit to which my-branch points. Passing a non-existing id or an id that does not point to a tree object will result in an exception.

Beware that it is undefined what resolve() returns if there are multiple matches. For example, the call resolve( "aabbccdde^{tree}" ) may return the wrong tree if there is a branch and an abbreviated commit id with this name. Therefore prefer fully qualified references like refs/heads/my-branch to reference the branch my-branch or refs/tags/my-tag for the tag named my-tag.
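The advice about fully qualified references could be wrapped in a small helper like the one sketched below. The class and method names are hypothetical. The null check relies on resolve() returning null when the expression cannot be resolved.

```java
import java.io.IOException;

import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.lib.ObjectReader;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.treewalk.CanonicalTreeParser;

final class TreeParserExample {

  /** Returns a tree parser for the tip of the given local branch (short name, e.g. "my-branch"). */
  static CanonicalTreeParser treeParserForBranch( Repository repository, String branchName )
    throws IOException
  {
    // The refs/heads/ prefix rules out clashes with tags or abbreviated commit ids
    ObjectId treeId = repository.resolve( "refs/heads/" + branchName + "^{tree}" );
    if( treeId == null ) {
      throw new IllegalArgumentException( "No such branch: " + branchName );
    }
    try( ObjectReader reader = repository.newObjectReader() ) {
      return new CanonicalTreeParser( null, reader, treeId );
    }
  }
}
```

A call such as treeParserForBranch( git.getRepository(), "my-branch" ) then yields an iterator that can be handed to the diff command.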

If the id of a commit is already available in the form of an ObjectId (or AnyObjectId), use the following snippet to obtain the tree id thereof:

try( RevWalk walk = new RevWalk( git.getRepository() ) ) {
  RevCommit commit = walk.parseCommit( commitId );
  ObjectId treeId = commit.getTree().getId();
  try( ObjectReader reader = git.getRepository().newObjectReader() ) {
    return new CanonicalTreeParser( null, reader, treeId );
  }
}

The code assumes that the given object id references a commit and resolves the associated RevCommit, which in turn holds the id of the corresponding tree.

And finally, there is the EmptyTreeIterator that is useful for comparing against an empty tree that has no entries at all. For example, the tree of the first commit of a repository which has no parent commit can be compared against the empty tree.
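As a rough sketch of that use case, the helper below lists everything a commit contains by diffing its tree against the empty tree. It uses the setOldTree() and setNewTree() methods of the DiffCommand that are covered in the next section. The class and method names are invented for this example, and the commit is assumed to be parsed already (as is the case for commits returned by RevWalk.parseCommit() or CommitCommand.call()).

```java
import java.io.IOException;
import java.util.List;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.errors.GitAPIException;
import org.eclipse.jgit.diff.DiffEntry;
import org.eclipse.jgit.lib.ObjectReader;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.treewalk.AbstractTreeIterator;
import org.eclipse.jgit.treewalk.CanonicalTreeParser;
import org.eclipse.jgit.treewalk.EmptyTreeIterator;

final class EmptyTreeDiffExample {

  /** Compares the tree of the given (parsed) commit against a tree without any entries. */
  static List<DiffEntry> diffAgainstEmptyTree( Git git, RevCommit commit )
    throws IOException, GitAPIException
  {
    try( ObjectReader reader = git.getRepository().newObjectReader() ) {
      AbstractTreeIterator newTree = new CanonicalTreeParser( null, reader, commit.getTree() );
      return git.diff()
        .setOldTree( new EmptyTreeIterator() )
        .setNewTree( newTree )
        .setShowNameAndStatusOnly( true ) // only the list of entries is of interest here
        .call();
    }
  }
}
```

Since the old side is empty, every file of the commit should show up as an added entry.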

DiffCommand Revisited

Now that we know how to obtain a tree iterator the rest is simple:

git.diff()
  .setOldTree( oldTreeIterator )
  .setNewTree( newTreeIterator )
  .call();

With the setOldTree() and setNewTree() methods, the trees to be compared can be specified.

Besides these principal properties, several other aspects of the command can be controlled:

  • setPathFilter restricts the scanned files to certain paths within the repository.
  • setSourcePrefix and setDestinationPrefix change the prefix of source (old) and destination (new) paths. The default values are a/ and b/.
  • setContextLines changes the number of context lines, i.e. the number of lines printed before and after a modified line. The default value is three.
  • setProgressMonitor allows you to track progress while the diffs are computed. You can implement your own progress monitor or use one of the predefined ones that come with JGit.
  • setShowNameAndStatusOnly, as the name suggests, skips generating the textual output and just returns the computed list of DiffEntries.
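Put together, a diff between two commits that uses several of these settings might look like the following sketch. The class name, the method name, and the old/ and new/ prefixes are arbitrary choices for illustration, and both commits are assumed to be parsed already.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.errors.GitAPIException;
import org.eclipse.jgit.lib.ObjectReader;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.treewalk.CanonicalTreeParser;
import org.eclipse.jgit.treewalk.filter.PathFilter;

final class DiffCommandOptionsExample {

  /** Returns the patch for a single path between two (parsed) commits. */
  static String diffCommits( Git git, RevCommit oldCommit, RevCommit newCommit, String path )
    throws IOException, GitAPIException
  {
    try( ObjectReader reader = git.getRepository().newObjectReader() ) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      git.diff()
        .setOldTree( new CanonicalTreeParser( null, reader, oldCommit.getTree() ) )
        .setNewTree( new CanonicalTreeParser( null, reader, newCommit.getTree() ) )
        .setPathFilter( PathFilter.create( path ) ) // ignore changes outside of this path
        .setSourcePrefix( "old/" )                  // instead of the default a/
        .setDestinationPrefix( "new/" )             // instead of the default b/
        .setContextLines( 0 )                       // print changed lines only
        .setOutputStream( out )
        .call();
      return new String( out.toByteArray(), StandardCharsets.UTF_8 );
    }
  }
}
```

Note that a fresh DiffCommand is obtained from git.diff() for every invocation.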

Apart from the properties described so far, the DiffCommand also reads the following settings from the [diff] section of the Git configuration:

  • noPrefix: if set to true, the source and destination prefixes are empty by default instead of a/ and b/.
  • renames: if set to true, the command attempts to detect renamed files based on similar content. More on renamed content later.
  • algorithm: the diff algorithm that should be used. JGit currently supports myers and histogram.
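These settings can be edited in the configuration file by hand or set programmatically. The sketch below uses an in-memory Config so that it runs without a repository; this stand-in, as well as the class and method names, is an assumption of the example. For a real repository, the object returned by repository.getConfig() would be passed instead and saved afterwards with save().

```java
import org.eclipse.jgit.lib.Config;

final class DiffConfigExample {

  /** Sets the three [diff] settings described above on the given configuration. */
  static void configureDiff( Config config ) {
    config.setBoolean( "diff", null, "noPrefix", true );
    config.setBoolean( "diff", null, "renames", true );
    config.setString( "diff", null, "algorithm", "histogram" );
  }

  public static void main( String[] args ) {
    // In-memory stand-in for repository.getConfig()
    Config config = new Config();
    configureDiff( config );
    // Prints the configuration in the usual Git config file format
    System.out.print( config.toText() );
  }
}
```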

DiffEntry, Take I

Let’s take a closer look at the principal output of the diff command: the DiffEntry. For each file that was added, removed, or modified, a separate DiffEntry is returned. The getChangeType() method indicates the type of the change, which is either ADD, DELETE, or MODIFY. If a rename detector was involved while scanning for changes, the change type may also be RENAME or COPY.

In addition, a DiffEntry holds information about the old and the new state (including path, mode, and id) of a file. The methods are named accordingly: getOldPath/Mode/Id and getNewPath/Mode/Id. Depending on whether the entry represents an addition or removal, the getNew or getOld methods may return ‘empty’ values. The JavaDoc explains in detail which values are returned. Note that the id references the blob object in the repository database that contains the file’s content.
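One way to deal with the ‘empty’ side of an entry is to pick the path according to the change type, as in this sketch. The class and method names are hypothetical. For an added file, getOldPath() returns the constant DiffEntry.DEV_NULL, and likewise getNewPath() for a deleted file.

```java
import java.util.ArrayList;
import java.util.List;

import org.eclipse.jgit.diff.DiffEntry;

final class DiffEntryExample {

  /** Creates a one-line summary that only mentions the side(s) that actually exist. */
  static String describe( DiffEntry entry ) {
    switch( entry.getChangeType() ) {
      case ADD:
        return "ADD " + entry.getNewPath();    // the old path is DiffEntry.DEV_NULL
      case DELETE:
        return "DELETE " + entry.getOldPath(); // the new path is DiffEntry.DEV_NULL
      case MODIFY:
        return "MODIFY " + entry.getNewPath(); // old and new path are the same
      default:
        // RENAME and COPY carry two different paths
        return entry.getChangeType() + " " + entry.getOldPath() + " -> " + entry.getNewPath();
    }
  }

  static List<String> describeAll( List<DiffEntry> entries ) {
    List<String> result = new ArrayList<>();
    for( DiffEntry entry : entries ) {
      result.add( describe( entry ) );
    }
    return result;
  }
}
```

Passing the result of git.diff().call() to describeAll() gives a compact overview of what has changed.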

Under the Covers of the DiffCommand

In some cases, the DiffCommand may not be sufficient to accomplish the task. Examples are detecting renames and copies when comparing two revisions, or creating customized patches. In these cases, don’t hesitate to take a look under the covers.

The DiffCommand primarily uses the DiffFormatter, which can also be accessed directly to scan for changes and create patches.

Its scan() method expects iterators for the old and new tree and returns a list of DiffEntries. There are also overloaded versions that accept ids of tree objects to be supplied.

A simple example that scans for changes looks like this:

OutputStream outputStream = NullOutputStream.INSTANCE;
try( DiffFormatter formatter = new DiffFormatter( outputStream ) ) {
  formatter.setRepository( git.getRepository() );
  List<DiffEntry> entries = formatter.scan( oldTreeIterator, newTreeIterator );
}

The output stream to be used by format() is specified in the constructor. Since we aren’t interested in the output right now, a null output stream is supplied. With setRepository, the repository that should be scanned is specified. And finally the tree parsers are passed to the scan() method that returns the list of changes between the two of them.

Note that the DiffFormatter needs to be closed explicitly or used in a try-with-resources statement as shown in the example code.

In order to create patches, one of the format() methods can be used. The patch is expressed as instructions to modify the old tree to make it the new tree.

Like the scan() methods, the format() methods accept pointers to or iterators for an old and a new tree. Either side may be null to indicate that the tree has been added or removed. In this case, the diff will be computed against nothing.

The snippet below uses format() to write a patch to the output stream that was passed to the constructor.

OutputStream outputStream = new ByteArrayOutputStream();
try( DiffFormatter formatter = new DiffFormatter( outputStream ) ) {
  formatter.setRepository( git.getRepository() );
  formatter.format( oldTreeIterator, newTreeIterator );
}

There are also overloaded format() methods to print a single DiffEntry or a list of DiffEntries, possibly obtained by a previous call to scan().
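Combining scan() with the single-entry format() makes it possible to produce one patch per file. The following is a sketch under the assumption that both commits are parsed already; the class and method names are invented, and scan() is used in its variant that accepts two tree objects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.eclipse.jgit.diff.DiffEntry;
import org.eclipse.jgit.diff.DiffEntry.ChangeType;
import org.eclipse.jgit.diff.DiffFormatter;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.revwalk.RevCommit;

final class PatchPerEntryExample {

  /** Maps each changed path to the patch text of just that file. */
  static Map<String, String> patchesByPath( Repository repository, RevCommit oldCommit, RevCommit newCommit )
    throws IOException
  {
    Map<String, String> patches = new LinkedHashMap<>();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try( DiffFormatter formatter = new DiffFormatter( out ) ) {
      formatter.setRepository( repository );
      for( DiffEntry entry : formatter.scan( oldCommit.getTree(), newCommit.getTree() ) ) {
        out.reset(); // start with an empty buffer for every entry
        formatter.format( entry );
        formatter.flush();
        boolean deleted = entry.getChangeType() == ChangeType.DELETE;
        String path = deleted ? entry.getOldPath() : entry.getNewPath();
        patches.put( path, new String( out.toByteArray(), StandardCharsets.UTF_8 ) );
      }
    }
    return patches;
  }
}
```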

While the outcome of the above example can also be accomplished with a plain DiffCommand, let’s have a look at what else the DiffFormatter has to offer.

As mentioned earlier, renamed files can be associated while computing diffs. To enable rename detection, the DiffFormatter must be advised to do so with setDetectRenames( true ). Thereafter, the RenameDetector can be obtained for fine-tuning with getRenameDetector().

Remember that Git is a content tracker and does not track renames. Instead, renames are deduced from similar content during a diff operation.
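A minimal sketch of a scan with rename detection could look as follows. The class and method names are made up, the commits are assumed to be parsed, and the rename score of 50 is an arbitrary value chosen only to show where fine-tuning would go.

```java
import java.io.IOException;
import java.util.List;

import org.eclipse.jgit.diff.DiffEntry;
import org.eclipse.jgit.diff.DiffFormatter;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.util.io.DisabledOutputStream;

final class RenameDetectionExample {

  /** Scans for changes between two (parsed) commits and pairs up renamed files. */
  static List<DiffEntry> scanWithRenames( Repository repository, RevCommit oldCommit, RevCommit newCommit )
    throws IOException
  {
    try( DiffFormatter formatter = new DiffFormatter( DisabledOutputStream.INSTANCE ) ) {
      formatter.setRepository( repository );
      // Called after setRepository() because the rename detector needs access to the repository
      formatter.setDetectRenames( true );
      // Optional fine-tuning: minimum similarity (in percent) for two files to count as a rename
      formatter.getRenameDetector().setRenameScore( 50 );
      return formatter.scan( oldCommit.getTree(), newCommit.getTree() );
    }
  }
}
```

With rename detection enabled, a file that was moved without changes should be reported as a single RENAME entry instead of a DELETE and an ADD.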

In addition, the DiffFormatter has several further properties to fine-tune its behavior that are listed below:

  • setAbbreviationLength: the number of digits to print of an object id.
  • setDiffAlgorithm: the algorithm that should be used to construct the diff output.
  • setBinaryFileThreshold: files larger than this size will be treated as though they are binary and not text. Default is 50 MB.
  • setDiffComparator: the comparator used to determine if two lines of text are identical. The comparator can be configured to ignore various types of white space. However, I wasn’t able to let the DiffFormatter ignore all white spaces.
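To get a feeling for what a comparator does, it can be tried in isolation, without a DiffFormatter or a repository. The sketch below feeds two texts directly into JGit’s HistogramDiff, once with the default comparator and once with RawTextComparator.WS_IGNORE_ALL. The class name and the sample texts are made up, and the example makes no claim about how the DiffFormatter itself treats such files.

```java
import java.nio.charset.StandardCharsets;

import org.eclipse.jgit.diff.EditList;
import org.eclipse.jgit.diff.HistogramDiff;
import org.eclipse.jgit.diff.RawText;
import org.eclipse.jgit.diff.RawTextComparator;

final class WhitespaceComparatorExample {

  /** Computes the edits between two texts, using the given comparator to compare lines. */
  static EditList diff( RawTextComparator comparator, String oldText, String newText ) {
    RawText a = new RawText( oldText.getBytes( StandardCharsets.UTF_8 ) );
    RawText b = new RawText( newText.getBytes( StandardCharsets.UTF_8 ) );
    return new HistogramDiff().diff( comparator, a, b );
  }

  public static void main( String[] args ) {
    String oldText = "int x = 1;\n";
    String newText = "int  x=1;\n"; // differs in white space only
    System.out.println( "default:       " + diff( RawTextComparator.DEFAULT, oldText, newText ) );
    System.out.println( "ws ignore all: " + diff( RawTextComparator.WS_IGNORE_ALL, oldText, newText ) );
  }
}
```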

DiffEntry Revisited

If you are interested in the changes that took place in a certain file, you may want to have another look at the DiffEntry and DiffFormatter.

With diffFormatter.toFileHeader(), a so-called FileHeader can be obtained from a given DiffEntry. And through its toEditList() method, a list of edits can be obtained.

The following code sample shows how to obtain the edit list for the first diff entry that results from a scan:

OutputStream outputStream = DisabledOutputStream.INSTANCE;
try( DiffFormatter formatter = new DiffFormatter( outputStream ) ) {
  formatter.setRepository( git.getRepository() );
  List<DiffEntry> entries = formatter.scan( oldTreeIterator, newTreeIterator );
  FileHeader fileHeader = formatter.toFileHeader( entries.get( 0 ) );
  return fileHeader.toEditList();
}

This list can be interpreted as the modifications that need to be applied to the old content in order to transform it into the new content.

Each Edit describes a region that was inserted, deleted, or replaced and the lines that are affected. The lines are counted starting with zero, the end of a region is exclusive, and the boundaries can be queried with getBeginA(), getEndA(), getBeginB(), and getEndB().

For example, given a file with these two lines of content:

line 1
line 3

Inserting line 2 between the two lines would result in an Edit of type INSERT with A(1-1) and B(1-2). In other words, the empty region before index 1 of the old content is replaced by the line at index 1 of the new content. Deleting line 2 again results in the inverse of the inserting Edit: DELETE with A(1-2) and B(1-1). And changing the text of line 2 will yield an Edit of type REPLACE with the same A and B region 1-2.
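The regions can be explored without a repository by running one of JGit’s diff algorithms on two in-memory texts. The sketch below does that with HistogramDiff and then applies the resulting edits to the old lines to arrive at the new lines. The class and its methods are made up for this example and are not part of JGit.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.eclipse.jgit.diff.Edit;
import org.eclipse.jgit.diff.EditList;
import org.eclipse.jgit.diff.HistogramDiff;
import org.eclipse.jgit.diff.RawText;
import org.eclipse.jgit.diff.RawTextComparator;

final class EditListExample {

  static EditList diff( List<String> oldLines, List<String> newLines ) {
    return new HistogramDiff().diff( RawTextComparator.DEFAULT, toRawText( oldLines ), toRawText( newLines ) );
  }

  /** Transforms the old lines into the new lines by following the edits. */
  static List<String> apply( List<String> oldLines, List<String> newLines, EditList edits ) {
    List<String> result = new ArrayList<>();
    int nextOld = 0;
    for( Edit edit : edits ) {
      // Unchanged lines in front of the edit are taken from the old content
      result.addAll( oldLines.subList( nextOld, edit.getBeginA() ) );
      // Region B holds the inserted or replacing lines (empty for a DELETE)
      result.addAll( newLines.subList( edit.getBeginB(), edit.getEndB() ) );
      // Region A is skipped (empty for an INSERT)
      nextOld = edit.getEndA();
    }
    result.addAll( oldLines.subList( nextOld, oldLines.size() ) );
    return result;
  }

  private static RawText toRawText( List<String> lines ) {
    String text = lines.isEmpty() ? "" : String.join( "\n", lines ) + "\n";
    return new RawText( text.getBytes( StandardCharsets.UTF_8 ) );
  }

  public static void main( String[] args ) {
    List<String> oldLines = Arrays.asList( "line 1", "line 3" );
    List<String> newLines = Arrays.asList( "line 1", "line 2", "line 3" );
    EditList edits = diff( oldLines, newLines );
    for( Edit edit : edits ) {
      System.out.println( edit.getType()
        + " A(" + edit.getBeginA() + "-" + edit.getEndA() + ")"
        + " B(" + edit.getBeginB() + "-" + edit.getEndB() + ")" );
    }
    System.out.println( apply( oldLines, newLines, edits ) );
  }
}
```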

Concluding Creating Diffs with JGit

While the DiffCommand is rather straightforward to use, the DiffFormatter has a scary API. But before using JGit in your project, you would certainly isolate yourself from the library anyway, wouldn’t you?!? … and thereby choose a more suitable API.

But apart from that, JGit provides means to accomplish most if not all tasks related to diffs and patches in Git.

The snippets shown throughout the article are excerpts from a collection of learning tests. The full source code can be found here:
https://gist.github.com/rherrmann/5341e735ce197f306949fc58e9aed141

If you would like to experiment with the examples listed here yourself, I recommend setting up JGit with access to the sources and JavaDoc so that you have meaningful context information, content assist, debug sources, etc.

If you have difficulties or further questions, please leave a comment or ask the friendly and helpful JGit community for assistance.

Rüdiger Herrmann

20 Comments so far:

  1. Pranay says:

    Hi,

    I really liked your post but I came across a problem in which I need to get the list of files committed in a particular commit.

    I referred the below link
    https://stackoverflow.com/questions/40590039/how-to-get-the-file-list-for-a-commit-with-jgit?answertab=active#tab-top

    I also tried the following code to get the list of files committed in a particular commit

    try( RevWalk walk = new RevWalk( git.getRepository() ) ) {
    RevCommit commit = walk.parseCommit( commitId );
    ObjectId treeId = commit.getTree().getId();
    try( ObjectReader reader = git.getRepository().newObjectReader() ) {
    return new CanonicalTreeParser( null, reader, tree );
    }
    }

    But unfortunately I was unsuccessful.

    Any help would really be appreciated !!!

    Thanks in advance !!

    • Rüdiger Herrmann says:

      I am glad you found this post useful. If you have a specific question, please use a Q&A forum such as stackoverflow or the like. But in any case, include in your post what you tried, what the expected outcome is, and what the actual outcome is – so that others can reproduce your problem.

  2. Aravind s says:

    How to get two branch final changes alone using jgit

  3. KRK says:

    is there a way to compare two different branches of two different repository.
    example: a repo and b repo are two different repositories and a repo has a1,a2,a3… branches and b repo has b1,b2,b3… branches.. i want to compare a1 and b1..

    • Rüdiger Herrmann says:

      Git (and thus JGit) can only compare branches (commits to be precise) within the same repository.

      However, you can ‘integrate’ branches of repository B into A (or vice versa) and then compare their contents.
      First, you need to add repository B as a remote repository to A. Then you need to fetch the desired branches (or just all branches) of repository B. Now that repository A contains branches of both repositories, you can use regular diff commands to compare them.

  4. Bogdan says:

    Hello there, and thank you for all your nicely exposed info.
    I got into trouble trying to get the diff lines of one file between two revisions: the first time I got it right, but when processing next files jgit shows no diff, even there IS a difference. I suppose I’m missing some call that presumably do some kind of “reset” in the API.
    That is, having three files with differences, I got the set of diffs only for the first encountered file (already proved that changing the revisions works only for the first file):
    //snippet
    List<DiffEntry> diff = git.diff().setOldTree(oldTreeIterator).setNewTree(newTreeIterator)
    .setPathFilter(PathFilter.create(fname)).call();
    for (DiffEntry entry : diff) {
    // after .call() I got diff.size() > 0 only the first time I use this code, the rest is always zero.
    }

    Could you be so kind to indicate me where I am wrong?

    Thank you very much
    Bogdan

    • Rüdiger Herrmann says:

      I can’t see a problem with the code snippet. Note, that in general, JGit commands aren’t meant to be reused. Usually, commands throw an exception when called multiple times, but even if not, their implementation isn’t prepared to be invoked multiple times.

      If you can provide a self-contained code snippet to reproduce the problem, maybe someone is able to help. I’d suggest to ask on SO as your post will be seen by a much larger audience.

      • Bogdan says:

        Your answer shed some light over my mistake. You were right about “not being prepared to be invoked multiple times”. Thus, I Git.open(), and close() each time it was necessary, and now the code gives the expected results.

        Thank you.
        Bogdan

        • Rüdiger Herrmann says:

          Note that you do not need to open() and close() the Git instance. It is just the commands that must not be reused, the DiffCommand in your case.

          Correct usage would look like this:

          Git git = Git.open(…);
          DiffCommand diff1 = git.diff();
          diff1.call();
          DiffCommand diff2 = git.diff();
          diff2.call();

          git.close();

  5. hiram says:

    Hello there
    jgit’s DiffFormatter directly compares the differences between the two branches, which is equivalent to “git diff a b”, but the effect I want is “git diff a...b”. What should I do with jgit?
    (git diff [<options>] <commit>...<commit> [--] [<path>...]
    This form is to view the changes on the branch containing and up to the second <commit>, starting at a common ancestor of both <commit>.)

  6. Sergiy Gnatyuk says:

    Hi,
    Thanks for the article!
    I have a file that is modified. If I execute ‘git diff HEAD’ I receive a diff with modified lines only:
    ########
    diff --git a/1.txt b/1.txt
    index 01e79c3..e4fbf4d 100644
    --- a/1.txt
    +++ b/1.txt
    @@ -1,3 +1,3 @@
    1
    -2
    +22
    3
    ########

    When I try to duplicate this logic with JGit I receive:
    ########
    diff --git a/1.txt b/1.txt
    index 01e79c3..b6af982 100644
    --- a/1.txt
    +++ b/1.txt
    @@ -1,3 +1,3 @@
    -1
    -2
    -3
    +1
    +22
    +3
    ########

    Could you look at my code and explain what I’m doing wrong?
    https://gist.github.com/SurpSG/c7b9d95b797f750cac5b37f59282961d
    The code is written in Kotlin, hope it’s not a problem

    Thanks!

    • Rüdiger Herrmann says:

      My guess would be your repository has changed in between, but without a self-contained example this is impossible to tell.

      • Sergiy Gnatyuk says:

        I’ve published an example here https://github.com/SurpSG/jgit-test

        • Rüdiger Herrmann says:

          I cannot reproduce what you see with the referenced repository. The results from git and JGit are the same.

          Are you running this on Windows? Perhaps some newline setting causes the difference?

          • Sergiy Gnatyuk says:

            I’ve tried to run my example on Ubuntu and it works as expected!
            Do you have any ideas how to solve the issue for windows?

          • Sergiy Gnatyuk says:

            I’ve tried to add this:

            repository.config.setEnum(
            ConfigConstants.CONFIG_CORE_SECTION,
            null,
            ConfigConstants.CONFIG_KEY_AUTOCRLF,
            AutoCRLF.TRUE
            )

            And it works. But I’m not sure that it is the right approach.

            Thanks a lot for your help!

  7. Dan says:

    Perhaps `tree` should be `treeId` in `return new CanonicalTreeParser( null, reader, tree )` from your snippet:

    try( RevWalk walk = new RevWalk( git.getRepository() ) ) {
    RevCommit commit = walk.parseCommit( commitId );
    ObjectId treeId = commit.getTree().getId();
    try( ObjectReader reader = git.getRepository().newObjectReader() ) {
    return new CanonicalTreeParser( null, reader, tree );
    }
    }

  8. Rüdiger Herrmann says:

    Thank you Dan, good catch! I’ve updated the post.