Using the Unix 'join' command to find identical files.

Today I wanted to transfer all my Oracle RDBMS installation files to a remote Linux box for archival purposes. However, the files are huge, and several of them are already there.

How do I find the files that already exist on the remote box and can be skipped during the transfer?

First, on the remote box, I run sha1sum on all files larger than 5M:

% find . -type f -size +5M -exec sha1sum {} \; > sha1sums


4abd1d8b951478d241586b064f531839bcdfc640  ./hz/Oracle R2.iso
79bf96f5c62715d275d37601dcf0b10a1eee1de4  ./hz/sol-10-u11/sol-10-u11-ga-sparc-dvd.iso
c27e5fe2531826192f110e2d9f1f5d73015edac6  ./hz/sol-10-u11/sol-10-u11-ga-x86-dvd.iso
f922c915dec98780ff715521acb1f47576c380c1  ./11.1/
fc51723daced2289afb3a3af22fd720ea9f76c65  ./11.1/
eab215b8bc4ac2b5a0191235ee5e46bb9128e5c8  ./10.2/
9e33ce315a244742f1bd4808e24a3ae5f5826c32  ./10.2/Oracle Database 10g Release 2 ( Solaris
88f5c109a5eae8f6796c9efd8878c238b8988235  ./10.2/
c60a7bbc15996cb6b59e84ca093beaf00a43e12c  ./10.2/Oracle Database 10g Release 2 ( Solaris (SPARC) x64.cpio.gz
2529d24dd66d3de49c2a361d6274967b4b63b27a  ./10.2/10201_database_linux_x86_64.cpio.gz
c0c7ddf5c10ac3b0af5b0d4bc7f00ace5cb8343e  ./10.2/Oracle companion 10g Release 2 ( Linux
b1f9300acedeb7ca38ceacca707c6beb1832f85f  ./10.2/
36376823b06cae66bd3320cf8a64027d1068a8d0  ./10.2/
97ede71b760438411022b57a60a897e5911ec8ba  ./10.2/
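
As a side note, here is a minimal sketch of the same scan wrapped in a function. It assumes GNU find and sha1sum; the function name hash_big_files and its two arguments are made up for illustration. It uses `-exec ... {} +` instead of `\;`, so sha1sum is started once per batch of files rather than once per file.

```shell
# Hypothetical helper: print "hash  path" for every file above a size threshold.
hash_big_files() {
    # $1 = directory to scan, $2 = threshold for find's -size test (e.g. 5M)
    # The subshell keeps the cd local, so paths are printed relative to $1.
    ( cd "$1" && find . -type f -size "+$2" -exec sha1sum {} + )
}
```

Something like `hash_big_files . 5M > sha1sums` should then produce the same kind of list as the command above.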


I do the same on my local box, so now I have two files with SHA1 sums. How do I find the files whose hashes match? This is what the 'join' utility can do. By default, it takes the first column of both files and prints the lines that have the same value in that column:

 % join sha1sums.remote.sorted sha1sums.local.sorted

2d8f8bca5bc144750c0bfd423a2642ff81755c0b ./11.2/ORACLE_11G_v11.2.0.2_R2/Win32/ ./oracle-install/11.2 win32/
5aff7aa0300ce33e17f4cfe28dfb64513b76bd49 ./12.1/oracle_db12102/linux_x86-64/ ./oracle-разгребать/install/12/client/
6211ce847f39f833635509e5af410a42501e090c ./11.2/ORACLE_11G_v11.2.0.2_R2/Win64/ ./oracle-install/11.2 win64/
6a8a24c8a6460d3d9e02d5bf505eba3c1ff52d30 ./11.2/ORACLE_11G_v11.2.0.2_R2/Win32/ ./oracle-install/11.2 win32/


The third column lists the files that can be deleted from my local box, because there is no need to transfer them. (The second column lists the matching files on the remote box.)
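
Paths with spaces make it awkward to cut the third column out of the 'join' output, because 'join' splits fields on blanks. As an alternative to 'join', a minimal awk sketch could print only the local paths whose hash also occurs in the remote list. The function name local_duplicates is hypothetical, and the sketch assumes the usual sha1sum layout: a 40-character hash, two separator characters, then the path.

```shell
# Hypothetical helper: list local paths whose SHA1 also appears in the remote list.
local_duplicates() {
    # $1 = remote checksum list, $2 = local checksum list (sorting is not needed)
    awk '
        # First file: remember every remote hash.
        NR == FNR { seen[$1] = 1; next }
        # Second file: the path starts at character 43, after hash + 2 separators.
        $1 in seen { print substr($0, 43) }
    ' "$1" "$2"
}
```

Running `local_duplicates sha1sums.remote sha1sums.local` (file names assumed) would list the candidates for deletion; it is worth reviewing that list before removing anything.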

One small note: 'join' requires both input files to be sorted on the join column (the first one by default), so sort them beforehand. Consult the 'man' page for 'join' to see how to specify another column.
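
The sorting step can be sketched as follows. This is a minimal example that takes the two unsorted lists as input; the function name join_by_hash is made up. LC_ALL=C is set so that 'sort' and 'join' both use the same byte-wise ordering, regardless of the locale settings on either box.

```shell
# Hypothetical helper: sort both checksum lists, then join them on the hash.
join_by_hash() {
    # $1 = remote checksum list, $2 = local checksum list (both unsorted)
    remote_sorted=$(mktemp)
    local_sorted=$(mktemp)
    LC_ALL=C sort "$1" > "$remote_sorted"
    LC_ALL=C sort "$2" > "$local_sorted"
    # Prints: hash, remaining fields from the remote list, then from the local list.
    LC_ALL=C join "$remote_sorted" "$local_sorted"
    rm -f "$remote_sorted" "$local_sorted"
}
```

For example, `join_by_hash sha1sums.remote sha1sums.local` (file names assumed) prints one line per hash that is present in both lists.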

For finding similar files and directories on a local box, see my DDFF utility.
