Using the Unix 'join' command to find identical files.

Today I wanted to transfer all my Oracle RDBMS installation files to a remote Linux box for archival purposes. However, the files are huge, and several of them are already there.

How can I find the files that are already on the remote box, so they can be skipped during the transfer?

First, on the remote box, I run sha1sum on all files larger than 5M:

% find . -type f -size +5M -exec sha1sum {} \; > sha1sums

...

4abd1d8b951478d241586b064f531839bcdfc640  ./hz/Oracle R2.iso
79bf96f5c62715d275d37601dcf0b10a1eee1de4  ./hz/sol-10-u11/sol-10-u11-ga-sparc-dvd.iso
c27e5fe2531826192f110e2d9f1f5d73015edac6  ./hz/sol-10-u11/sol-10-u11-ga-x86-dvd.iso
f922c915dec98780ff715521acb1f47576c380c1  ./11.1/p12429526_111070_MSWIN-x86-64.zip
fc51723daced2289afb3a3af22fd720ea9f76c65  ./11.1/p12429525_111070_Win32.zip
eab215b8bc4ac2b5a0191235ee5e46bb9128e5c8  ./10.2/p8202632_10205_MSWIN-x86-64.zip
9e33ce315a244742f1bd4808e24a3ae5f5826c32  ./10.2/Oracle Database 10g Release 2 (10.2.0.2) Solaris x86.zip
88f5c109a5eae8f6796c9efd8878c238b8988235  ./10.2/p8202632_10205_WINNT.zip
c60a7bbc15996cb6b59e84ca093beaf00a43e12c  ./10.2/Oracle Database 10g Release 2 (10.2.0.1.0) Solaris (SPARC) x64.cpio.gz
2529d24dd66d3de49c2a361d6274967b4b63b27a  ./10.2/10201_database_linux_x86_64.cpio.gz
c0c7ddf5c10ac3b0af5b0d4bc7f00ace5cb8343e  ./10.2/Oracle companion 10g Release 2 (10.2.0.1.0) Linux x86.zip
b1f9300acedeb7ca38ceacca707c6beb1832f85f  ./10.2/p8202632_10205_Linux-x86-64.zip
36376823b06cae66bd3320cf8a64027d1068a8d0  ./10.2/p12429519_10204_Win32.zip
97ede71b760438411022b57a60a897e5911ec8ba  ./10.2/p6810189_10204_Win32.zip

...

I do the same on my local box, which gives me two files of SHA1 sums. Now, how to find the files with identical hashes? This is exactly what the 'join' utility does. By default, it takes the first column of both files and prints the lines that have the same value in that column:

% join sha1sums.remote.sorted sha1sums.local.sorted

2d8f8bca5bc144750c0bfd423a2642ff81755c0b ./11.2/ORACLE_11G_v11.2.0.2_R2/Win32/win32_11gR2_client.zip ./oracle-install/11.2 win32/win32_11gR2_client.zip
5aff7aa0300ce33e17f4cfe28dfb64513b76bd49 ./12.1/oracle_db12102/linux_x86-64/linux_12102_client32.zip ./oracle-разгребать/install/12/client/linux_12102_client32.zip
6211ce847f39f833635509e5af410a42501e090c ./11.2/ORACLE_11G_v11.2.0.2_R2/Win64/win64_11gR2_client.zip ./oracle-install/11.2 win64/win64_11gR2_client.zip
6a8a24c8a6460d3d9e02d5bf505eba3c1ff52d30 ./11.2/ORACLE_11G_v11.2.0.2_R2/Win32/ofm_webtier_win_11.1.1.2.0_32_disk1_1of1.zip ./oracle-install/11.2 win32/ofm_webtier_win_11.1.1.2.0_32_disk1_1of1.zip

...

The third column lists the files that can be deleted from my local box, since there is no need to transfer them. (The second column lists the same files on the remote box.)
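Several of these paths contain spaces, so the "third column" cannot simply be cut out by field number: 'join' splits fields on blanks by default. One possible workaround, sketched here with made-up hashes and file names, is to turn the two-space separator that sha1sum writes after the hash into a tab, and then tell 'join' to use the tab as delimiter (-t) and to print only the path from the second file (-o 2.2). The helper name to_tsv and the .tsv file names are only for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')

# Made-up checksum lists in sha1sum's "hash  path" layout (two spaces).
printf '%s\n' 'aaa1  ./10.2/Oracle DB.zip' 'bbb2  ./11.1/patch.zip' > sha1sums.remote
printf '%s\n' 'aaa1  ./install/10.2 linux/Oracle DB.zip' 'ccc3  ./tmp/other.zip' > sha1sums.local

# Replace the first two-space run (the one right after the hash) with a tab,
# then sort on the hash field only.
to_tsv() { sed "s/  /$tab/" "$1" | LC_ALL=C sort -t "$tab" -k 1,1; }
to_tsv sha1sums.remote > remote.tsv
to_tsv sha1sums.local  > local.tsv

# -t: tab is the field delimiter; -o 2.2: print field 2 of the second file,
# i.e. the local path, spaces included.
already_there=$(LC_ALL=C join -t "$tab" -o 2.2 remote.tsv local.tsv)
printf '%s\n' "$already_there"
```

This assumes sha1sum was run in its default text mode and that no file name contains a tab or a newline.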

One small note: 'join' requires both input files to be sorted on the join column, so sort them first. Consult the 'man' page for 'join' to see how to join on a different column (the -1 and -2 options).
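The sorting step could look roughly like the sketch below. The hashes and paths are made up for the example. It pins LC_ALL=C for both 'sort' and 'join', on the assumption that the two tools must agree on collation order. The last command is an extra that goes slightly beyond the above: 'join -v 2' prints the lines of the second file that found no partner, that is, the local files that still have to be transferred.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two tiny made-up checksum lists, deliberately unsorted.
printf '%s\n' 'bbb2  ./11.1/patch.zip' 'aaa1  ./10.2/db.zip' > sha1sums.remote
printf '%s\n' 'ccc3  ./tmp/other.zip' 'aaa1  ./install/db.zip' > sha1sums.local

# Sort both inputs with the same collation that join will use.
LC_ALL=C sort sha1sums.remote > sha1sums.remote.sorted
LC_ALL=C sort sha1sums.local  > sha1sums.local.sorted

# Hashes present in both files: these local files are already on the remote box.
matches=$(LC_ALL=C join sha1sums.remote.sorted sha1sums.local.sorted)
printf 'already there: %s\n' "$matches"

# Lines of the local file with no matching hash: still to be transferred.
to_send=$(LC_ALL=C join -v 2 sha1sums.remote.sorted sha1sums.local.sorted)
printf 'to transfer:   %s\n' "$to_send"
```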

For finding similar files and directories on a local box, see my DDFF utility.