I’m in the process of starting a proper backup solution however over the years I’ve had a few copy-paste home directory from different systems as a quick and dirty solution. Now I have to pay my technical debt and remove the duplicates. I’m looking for a deduplication tool.

  • accept a destination directory
  • source locations should be deleted after the operation
  • if files content is the same then delete the redundant copy
  • if files content is different, move and change the name to avoid name collision I tried doing it in nautilus but it does not look at the files content, only the file name. Eg if two photos have the same content but different name then it will also create a redundant copy.

Edit: Some comments suggested using btrfs’ feature duperemove. This will replace the same file content with points to the same location. This is not what I intend, I intend to remove the redundant files completely.

Edit 2: Another quite cool solution is to use hardlinks. It will replace all occurances of the same data with a hardlink. Then the redundant directories can be traversed and whatever is a link can be deleted. The remaining files will be unique. I’m not going for this myself as I don’t trust my self to write a bug free implementation.

  • lemmyvore@feddit.nl
    link
    fedilink
    English
    arrow-up
    13
    ·
    6 months ago

    Use Borg Backup. It has built-in deduplication — it works with chunks not files and will recognize identical chunks and avoid storing them multiple times. It will deduplicate your files and will find duplicated chunks even in files you didn’t know had duplicates. You can continue to keep your files duplicated or clean them out, it doesn’t matter, the borg backups will be optimized either way.

    • FryAndBender@lemmy.world
      link
      fedilink
      arrow-up
      3
      ·
      6 months ago

      Here are the stats from a backup of 1 server with approx 600gig


                         Original size      Compressed size    Deduplicated size
      

      This archive: 592.44 GB 553.58 GB 13.79 MB All archives: 14.81 TB 13.94 TB 599.58 GB

                         Unique chunks         Total chunks
      

      Chunk index: 2760965 19590945

      13meg… nice