amiantos.net
Howdy!

My name is Brad Root and I'm a software engineer, music aficionado, video game junkie, and occasional unicyclist.

In my spare time, I build open source software, and write about my experiences as a programmer here on this blog.

You can also find me on Twitter, GitHub, and LinkedIn.


Zip Multiple Files for Download from AWS S3

Recently we decided over at Lingo that we want to allow our customers to download multiple asset files at once. We store all assets in AWS S3, which doesn't offer a method of retrieving multiple files at the same time in a single zip file. (Why not!?)

I did some Googling around and found a Python solution that purported to work in the most ideal way: no memory usage, and no disk usage. How can this be? It sounds like magic. Basically the idea is that you stream the files from S3, directly into a zip file which is streaming to S3 as you add the files to it. (Still sounds like black magic.)

Long story short, this did not work in Python. I don't know why, as I followed a couple examples from the internet. Whoever suggested that it did work must not have tested it thoroughly, or was delusional, because it definitely used memory as it was building the zip file. With sufficiently large files, or a large quantity of small files, the whole thing crashed.

Distraught that my precious Python was failing me, I found several sources on the internet that suggested this idea did work when using Node, and worked especially well as an AWS Lambda function. This sounded like a good idea to me, because a Lambda function will definitely not allow you to use too much memory, disk space, or time. Only problem was that I knew little to nothing about Node/JavaScript development. I guess it was time to learn!

So... I learned. Long story short (again), it took a little bit of research, but I came up with this Lambda function which successfully zips up hundreds of files directly into a zip file on S3, without any real memory usage and definitely no disk usage. It took some research, because none of the examples people had posted online actually worked flawlessly from beginning to end. For all I know, this script has some issues, but I haven't run into them yet, and when I do, I will update it to avoid them.

// Lambda S3 Zipper
// http://amiantos.net/zip-multiple-files-on-aws-s3/
//
// Accepts a bundle of data in the format...
// {
// "bucket": "your-bucket",
// "destination_key": "zips/test.zip",
// "files": [
// {
// "uri": "...", (options: S3 file key or URL)
// "filename": "...", (filename of file inside zip)
// "type": "..." (options: [file, url])
// }
// ]
// }
// Saves zip file at "destination_key" location
"use strict";
const AWS = require("aws-sdk");
const awsOptions = {
region: "us-east-1",
httpOptions: {
timeout: 300000 // Matching Lambda function timeout
}
};
const s3 = new AWS.S3(awsOptions);
const archiver = require("archiver");
const stream = require("stream");
const request = require("request");
const yauzl = require("yauzl");
const streamTo = (bucket, key) => {
var passthrough = new stream.PassThrough();
s3.upload(
{
Bucket: bucket,
Key: key,
Body: passthrough,
ContentType: "application/zip",
ServerSideEncryption: "AES256"
},
(err, data) => {
if (err) throw err;
}
);
return passthrough;
};
// Kudos to this person on GitHub for this getStream solution
// https://github.com/aws/aws-sdk-js/issues/2087#issuecomment-474722151
const getStream = (bucket, key) => {
let streamCreated = false;
const passThroughStream = new stream.PassThrough();
passThroughStream.on("newListener", event => {
if (!streamCreated && event == "data") {
const s3Stream = s3
.getObject({ Bucket: bucket, Key: key })
.createReadStream();
s3Stream
.on("error", err => passThroughStream.emit("error", err))
.pipe(passThroughStream);
streamCreated = true;
}
});
return passThroughStream;
};
exports.handler = async (event, context, callback) => {
var bucket = event["bucket"];
var destinationKey = event["destination_key"];
var files = event["files"];
await new Promise(async (resolve, reject) => {
var zipStream = streamTo(bucket, destinationKey);
zipStream.on("close", resolve);
zipStream.on("end", resolve);
zipStream.on("error", reject);
var archive = archiver("zip");
archive.on("error", err => {
throw new Error(err);
});
archive.pipe(zipStream);
for (const file of files) {
if (file["type"] == "file") {
archive.append(getStream(bucket, file["uri"]), {
name: file["filename"]
});
} else if (file["type"] == "url") {
archive.append(request(file["uri"]), { name: file["filename"] });
}
}
archive.finalize();
}).catch(err => {
throw new Error(err);
});
callback(null, {
statusCode: 200,
body: { final_destination: destinationKey }
});
};
view raw index.js hosted with ❤ by GitHub

Compared to Python, Javascript syntax is a horrific mess that makes my eyes bleed. But I can't argue with results, this worked out while the Python solution did not.

If you need an explanation of what this code does, just give yourself a year or two as a professional developer and you'll realize understanding what it does isn't particularly important, but by that time you'll be able to figure it out.

Good luck!