[NBLUG/talk] Finding duplicate files

ME dugan at passwall.com
Sun Jul 6 23:45:01 PDT 2003


Lincoln Peters said:
> Is there an easy way to do a recursive search for duplicate files,
> preferably from the shell?  The best I've come up with is:
>
> #!/bin/sh
> for file1 in *
> do
> 	for file2 in *
> 	do
>                 if [ "$file1" != "$file2" ] && [ ! -d "$file1" ] && \
> 		   [ ! -d "$file2" ] && [ "$file1" != - ] && [ "$file2" != - ]
>                 then
>                         cmp -- "$file1" "$file2" &> /dev/null && \
>                                 echo "$file1 and $file2 are identical" &&
> \
>                                 # do what must be done...
>                 fi
>         done
> done
>
> Of course, this doesn't work recursively (I tried using `find`, but it
> didn't handle filenames with spaces properly), and it ends up performing
> every comparison twice.  Is there a more efficient way I could do this?

Here is one off the top of my head. There are probably better ways, but
this uses the locatedb and md5sum to compare fingerprints of files, so it
finds duplicate data even if the filenames have changed:

locate '*' | while IFS= read -r i ; do md5sum "$i" ; done |
tee /tmp/FIND-DUPE1.txt | cut -c1-32 | sort | uniq -d > /tmp/FIND-DUPE2.txt ;
for i in `cat /tmp/FIND-DUPE2.txt` ; do grep "^$i" /tmp/FIND-DUPE1.txt ; done

This was written for "speed" and takes several shortcuts...

I did not test this, but I think it will print the md5sum and full path
of every file whose md5sum matches that of at least one other file, and
nothing for files that are unique.

WATCH OUT! This will likely report a symlink and its target as duplicate
files. An extra conditional check in bash for "regular file, not a symlink"
could fix that, but at the cost of a little more disk access and delay.
(See the bash man page for the various file tests within [ and ].)
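
A minimal sketch of that extra check, in case it helps. The helper names
is_plain_file and checksum_plain_files are made up for illustration. The
one thing to know is that [ -f ] follows symlinks, so it has to be paired
with [ ! -L ] to reject them:

```shell
#!/bin/sh
# is_plain_file (hypothetical helper): succeed only for a regular file
# that is not a symlink. [ -f ] alone follows symlinks, so a link to a
# regular file would pass it; the extra [ ! -L ] rejects links.
is_plain_file() {
    [ -f "$1" ] && [ ! -L "$1" ]
}

# checksum_plain_files (hypothetical helper): read one path per line on
# stdin and print an md5sum line for each path that passes the check.
checksum_plain_files() {
    while IFS= read -r path; do
        if is_plain_file "$path"; then
            md5sum "$path"
        fi
    done
}
```

It would be fed the same way as the loop in the one-liner, for example
locate '*' | checksum_plain_files | tee /tmp/FIND-DUPE1.txt, with the rest
of the pipeline unchanged.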

"find" would actually traverse the disk while locate uses locatedb.
This approach does read the disk once to compute the md5sum of each file,
but after that only the results (saved in temporary files) are scanned,
which avoids a second pass over the disk and permits re-inspection later
if desired.
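
As a rough illustration of that later re-inspection, the saved md5sum
output can be regrouped without touching the original files again. This
sketch assumes GNU uniq (for -w and --all-repeated), and show_dupe_groups
is a made-up name:

```shell
#!/bin/sh
# show_dupe_groups (hypothetical helper): given a file of saved md5sum
# output lines, print only the lines whose first 32 characters (the
# md5sum) occur more than once, with a blank line between groups.
# Assumes GNU uniq, which provides -w and --all-repeated.
show_dupe_groups() {
    sort "$1" | uniq -w32 --all-repeated=separate
}
```

For example, show_dupe_groups /tmp/FIND-DUPE1.txt would list each set of
identical files together.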

The disadvantages of locate are that not everyone has it and that its
database is not "current": files created since the last updatedb run are
missed.

The disadvantage of find is that it is likely slower, since it walks the
disk in addition to the md5sum pass. If find is used, you may find xargs
helpful here.
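
A sketch of what the find and xargs variant might look like, assuming GNU
find, xargs, md5sum and uniq. The function name find_dupes is made up.
The -print0 / -0 pair keeps filenames with spaces intact, and -type f
skips directories and symlinks:

```shell
#!/bin/sh
# find_dupes (hypothetical helper): checksum every regular file under
# the given directory (default: current directory) and print the files
# that share an md5sum, one group per block, blank line between groups.
# Assumes GNU tools: find -print0, xargs -0 -r, uniq -w/--all-repeated.
find_dupes() {
    find "${1:-.}" -type f -print0 |
        xargs -0 -r md5sum |
        sort |
        uniq -w32 --all-repeated=separate
}
```

Running find_dupes /some/dir prints nothing when every file is unique.
Unlike the locate version it is always current, but it has to walk the
directory tree each time.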

Back to homework,
-ME



