[NBLUG/talk] Finding duplicate files

Eric Eisenhart eric at nblug.org
Mon Jul 7 11:12:01 PDT 2003


On Mon, Jul 07, 2003 at 01:25:35AM -0700, Ross Thomas wrote:
> Also handles embedded blanks and tabs.  Misbehaves when new-lines are
> embedded in a file name (sort & uniq aren't that sophisticated).

Actually, the GNU sort has a "-z" option, equivalent to xargs' "-0" option.

One problem here: "ln file1 file2" creates a duplicate that actually
refers to the same file; both names share a single device and inode, so
there's only one copy of the data behind them.
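
You can see it with a quick one-liner; after the "ln" above, both names
report the same device:inode:

perl -le 'print join(":", (stat $_)[0, 1]), " $_" for @ARGV' file1 file2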

Maybe something like this:
find "${1:0.}" -type f -print0 | xargs -0 md5sum | sort -z | uniq -D -w 32 | cut -d" " -f 3, | xargs stat --format="%D:%i %n" | sort -u -k 1,1 | cut -d" " -f 2, | xargs md5sum | sort | uniq -D -w 32

This simply takes what you figured out, pulls out just the filename,
generates a list in the format "device:inode filename", uniques (with
sort) on just the device:inode portion, then extracts those filenames
and runs the same md5sum list on them.

One problem I don't think I can solve in a shell-script pipe, though: for a
multiply-hardlinked file, it's fairly arbitrary which hardlink gets listed.
Ideally it would list the md5sum, device:inode and filenames of all related
files, but only when the duplicate md5sums span more than one device:inode.

Maybe a perl script with a structure like $hash->{md5sum}->{device:inode}
holding an anonymous array of filenames.
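
Something along these lines, as a very rough, untested sketch (the names
are made up for illustration; it reads "md5sum  filename" lines on stdin,
e.g. from "find . -type f -print0 | xargs -0 md5sum"):

#!/usr/bin/perl -w
use strict;

# %files maps md5sum -> "device:inode" -> anonymous array of filenames
my %files;

while (my $line = <STDIN>) {
    chomp $line;
    my ($sum, $name) = split /\s+/, $line, 2;
    my @st = stat $name or next;           # skip anything we can't stat
    my ($dev, $ino) = @st[0, 1];
    push @{ $files{$sum}{"$dev:$ino"} }, $name;
}

for my $sum (sort keys %files) {
    # only interesting when one checksum appears under more than one
    # device:inode; otherwise it's just hardlinks to a single file
    next unless scalar(keys %{ $files{$sum} }) > 1;
    for my $devino (sort keys %{ $files{$sum} }) {
        print "$sum $devino $_\n" for @{ $files{$sum}{$devino} };
    }
}

(That still chokes on newlines in filenames, since it parses md5sum's
output line by line.)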

Hmmm...

I'm attaching a perl version (based on the output of "find2perl") that seems
to handle this; it requires the Digest::MD5 module.  It shouldn't have any
problem with any kind of filename at all, and it uses 3-argument open to
help ensure that.
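
Roughly, the checksumming part looks like this (not the attached script
itself, just the general shape of the Digest::MD5 / 3-argument open bit;
the sub name here is arbitrary):

use Digest::MD5;

# sketch only: returns the hex md5 of a file, or undef if it can't be read
sub md5_of_file {
    my ($name) = @_;
    # 3-argument open treats a name like "-", ">foo" or "|cmd" as a plain
    # filename instead of a redirection or pipe
    open(my $fh, '<', $name) or return undef;
    binmode($fh);
    my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $sum;
}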
-- 
Eric Eisenhart
NBLUG Co-Founder & Vice-President Pro Tempore
The North Bay Linux Users Group
http://nblug.org/
eric at nblug.org, IRC: Freiheit at freenode, AIM: falschfreiheit, ICQ: 48217244
-------------- next part --------------
A non-text attachment was scrubbed...
Name: duplicates.pl
Type: application/x-perl
Size: 1083 bytes
Desc: not available
Url : http://nblug.org/pipermail/talk/attachments/20030707/5bab0fb4/duplicates.bin

