Multithreaded script to mirror AWS S3 buckets

This script copies the contents of one Amazon S3 bucket to another, including between accounts (as long as your access key has the right permissions on both buckets). It uses forks instead of threads because the HTTP library seemed to be unstable with threads (it segfaulted randomly with a large number of workers) but works fine with forks. Forks can use a lot of memory unless your OS uses copy-on-write memory allocation.
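The forks module exposes the familiar threads API (async, join, and Thread::Queue) on top of fork(2), which is why the script below reads exactly like threaded code. Here's a minimal sketch of that worker pattern with the S3 calls stripped out; the worker count and item list are arbitrary:

#!/usr/bin/perl

use warnings;
use strict;

use forks;           # drop-in replacement for "use threads", backed by fork(2)
use Thread::Queue;

my $q = Thread::Queue->new();
my @workers;

for (1..4) {
  push @workers, async {
    # Each async block runs in its own forked process ($$ is that process's PID)
    while (defined(my $item = $q->dequeue())) {
      print "worker $$ handled item $item\n";
    }
  };
}

$q->enqueue(1..20);
$q->enqueue(undef) for @workers;
$_->join for @workers;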

Replace the aws_access_key_id and aws_secret_access_key values, and set frombucket and tobucket. Try a smaller number of workers initially until you see how your OS handles a large number of running perl processes. The script itself doesn't need much bandwidth or a particularly low-latency connection to AWS, and it effectively divides the time a single-threaded mirror would take by the number of workers: with 600 workers it cut the time to mirror a large bucket from days to minutes. It prints a . for every file transferred. In general it won't stop if an error occurs (e.g. permission denied), so check afterwards that the full number of files was copied correctly (see the count-check sketch after the script).

Props to Tim for having the problem in the first place.

#!/usr/bin/perl

use warnings;
use strict;

use forks;
use Thread::Queue;

use Data::Dumper;
use Net::Amazon::S3;

# Autoflush STDOUT so the progress dots appear immediately
$| = 1;

my $s3config = {
  aws_access_key_id     => 'xxxxxxxxxxxxxxxxxxxx',
  aws_secret_access_key => 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
  retry                 => 1,
};
my $config = {
  frombucket            => 'bucket1',
  tobucket              => 'bucket2',
  workers               => 100,
};

# This connection is only used to list the source bucket;
# each worker opens its own after forking
my $s3 = Net::Amazon::S3->new($s3config);

# Collect every key in the source bucket
my @list;
my $bucketlist = $s3->bucket($config->{frombucket})->list_all
  or die $s3->err . ": " . $s3->errstr;
push(@list, $_->{key}) for (@{$bucketlist->{keys}});

# Work queue shared between the parent and the forked workers
my $q = Thread::Queue->new();
my @workers;

for (1..$config->{workers}) {
  push @workers, async {
    # Each worker needs its own S3 connection; handles can't be shared across forks
    my $s3 = Net::Amazon::S3->new($s3config);
    my $bucket = $s3->bucket($config->{tobucket});
    # A dequeued undef is the signal to exit
    while (defined(my $key = $q->dequeue())) {
      # Server-side copy: the object data never passes through this host
      $bucket->copy_key($key, sprintf('/%s/%s', $config->{frombucket}, $key));
      print ".";
    }
  };
}

# Queue every key, then one undef per worker so each exits cleanly
$q->enqueue(@list);
$q->enqueue(undef) for @workers;
$_->join for @workers;
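
Since the script deliberately keeps going past individual failures, a quick count check afterwards is worthwhile. Here's a minimal sketch that prints the number of keys in each bucket so you can compare the totals; it reuses the same placeholder credentials and bucket names as above, and only API calls the script already makes:

#!/usr/bin/perl

use warnings;
use strict;

use Net::Amazon::S3;

# Same placeholder credentials and bucket names as the mirror script
my $s3 = Net::Amazon::S3->new({
  aws_access_key_id     => 'xxxxxxxxxxxxxxxxxxxx',
  aws_secret_access_key => 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
  retry                 => 1,
});

# The totals should match once the mirror has finished cleanly
for my $name ('bucket1', 'bucket2') {
  my $list = $s3->bucket($name)->list_all
    or die $s3->err . ": " . $s3->errstr;
  printf "%s: %d keys\n", $name, scalar @{$list->{keys}};
}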

Comments

  • After spending hours trying to get “aws s3 sync” to work properly to no avail, and since “aws s3 cp” appears to be single-threaded, this script is pretty wonderful, thank you for posting it!

    BTW I see the error “Thread 91 terminated abnormally: Wide character in subroutine entry at /usr/local/Cellar/perl/5.32.1/lib/perl5/site_perl/5.32.1/Net/Amazon/S3/Signature/V4Implementation.pm line 181.” occasionally; I suspect it's because the file has some UTF-8 characters.
