
TLDR; This is a much shorter blog post than usual, as I just summarize things I tested in a few hours over the weekend. My plan is to get back into a writing rhythm this year and come up with more, but shorter, posts.
I am hosting this blog in an S3 bucket. I have used various systems for logging and getting statistics over the years. Over time, all the JavaScript based solutions get added to ad blockers, so it is hard to figure out the number of visits. My current one is Beam Analytics, which is ad blocked as well, so I don’t have any accurate numbers, as I suppose that most tech savvy folks have an ad blocker enabled.
At some point I remembered I had enabled logging in AWS. For many years I have basically had millions of small log files sitting in a bucket untouched. Time to do some cleanup.
As this is not a high traffic blog, I basically end up with millions of files that each contain a single line of traffic, either from CloudFront or S3. Just running a sync to a local file system did not sound like a good idea.
As I recently read the Bun 1.2 release announcement, which comes with natively integrated S3 and SQLite clients, I decided this is an ideal playground.
My plan of action:
- Find all files within a certain prefix
- Copy their contents into an SQLite database
- Concatenate the contents into a single file
- Store that file back into S3
First, creating a new project with Bun is a great experience. No TypeScript hassle, everything just works, no extra configuration because you want to be able to use top level awaits. I now understand why Node is starting to catch up, based on recent release notes.
While starting to dig I noticed that the built-in S3 client only seems to support basic file CRUD actions, but not listing objects, so I used s3-lite-client in addition. Again, having a single command instead of node and npm for handling dependencies just makes a ton of sense to me.
I also added the progress dependency to show CLI progress state, as I hate not seeing whether my actions on several hundred thousand items are being processed or not.
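Pulling in those two dependencies is a single command (the package names below are the ones mentioned above; the exact entries in my package.json may differ):
bun add s3-lite-client progress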
Before building any CLI tool, I need proper argument parsing:
import { parseArgs } from "util";

// parse command line options
const { values, positionals } = parseArgs({
  args: Bun.argv,
  options: {
    help: { type: 'boolean' },
    bucket: { type: 'string' },
    prefix: { type: 'string' },
    file: { type: 'string' },
    phases: { type: 'string' },
  },
  strict: true,
  allowPositionals: true,
});

if (values.help) {
  showHelp()
  process.exit(0)
}
I added a phases parameter, so that I could run a single task like listing S3 objects, reading from S3, writing into a local file or deleting from S3 separately. I also wanted to make sure that long running tasks are incremental and can be picked up again.
I have five phases, but only four of them are run by default - to prevent accidental deletion on S3:
const phasesAsStr = values.phases || 'createDB,listObjects,retrieveObjects,writeIntoFile'
const phases = phasesAsStr.split(',').map(x => x.trim())
This way I can run any of my tasks only when I need to:
import { Database } from "bun:sqlite";

const db = new Database(`s3.sqlite`, { create: true, strict: true });
await db.exec("PRAGMA journal_mode = WAL;");

if (phases.includes('createDB')) {
  await createDatabase(db);
  await insertAndDelete(db);
}
if (phases.includes('listObjects')) {
  await listObjects(db, s3client, prefix)
}
if (phases.includes('retrieveObjects')) {
  await retrieveObjectContents(db)
}
if (phases.includes('writeIntoFile')) {
  await writeIntoFile(db, file)
}
if (phases.includes('s3DeleteFiles')) {
  await s3DeleteFiles(prefix)
}
Why do I use SQLite? Because I can (just checking if you remember the movie Inside Man). Also because SQLite makes it easy to checkpoint your work: every SQL statement is a transaction, so I can easily figure out where to pick up reading all the files from S3 if I stopped my program.
My table is simple, it only consists of path and content fields, with path having a unique index.
I added an insertAndDelete() method to check that insertion and deletion work - which I would probably not need at all, SQLite is a trusted system after all (even though my code is not).
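For context, a minimal sketch of what the createDatabase() step amounts to - the exact statements in my repository may differ slightly:
function createDatabase(db) {
  // single table: the S3 key and the (initially empty) object content
  db.query(`CREATE TABLE IF NOT EXISTS s3_objects (
    path TEXT NOT NULL,
    content TEXT
  )`).run();
  // unique index on path, as described above
  db.query(`CREATE UNIQUE INDEX IF NOT EXISTS idx_s3_objects_path ON s3_objects (path)`).run();
}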
Listing objects looks like this, using the s3-lite-client dependency:
async function listObjects(db, s3client, prefix) {
  let p = progress('Listing objects [[count]]\n');
  for await (const obj of s3client.listObjects({ prefix: prefix })) {
    // bind the key as a parameter instead of interpolating it into the SQL string
    db.query(`INSERT INTO s3_objects (path) VALUES (?)`).run(obj.key)
    p.next(1)
  }
}
Every object’s path gets inserted into SQLite, so the next step of retrieval is just going through that table and downloading each object:
async function retrieveObjectContents(db) {
  // incremental fetch: only select objects without content
  const where = `path LIKE "${prefix}%" AND (content is null or content='')`
  const total = db.query(
    `SELECT COUNT(*) AS count FROM s3_objects WHERE path LIKE "${prefix}%"`
  ).get().count
  const count = db.query(`SELECT COUNT(*) AS count FROM s3_objects WHERE ${where}`).get().count
  const query = db.query(`SELECT path FROM s3_objects WHERE ${where}`)
  const p = progress('Retrieving content [[bar]] [[count]]/[[total]]\n', {total: total});
  // account for already downloaded objects in the progress bar
  p.next(total-count);
  // retrieve contents for each file
  const promises = []
  for (const result of query) {
    const object = bunS3Client.file(result.path)
    if (result.path.endsWith('.gz')) {
      // create promise to read gzipped data from cloudfront logs
    } else {
      // create promise to read data from s3 logs
    }
    // wait for a batch of promises before continuing
    if (promises.length >= 75) {
      await Promise.all(promises);
      const length = promises.splice(0, promises.length).length
      p.next(length)
    }
  }
  // wait for the remaining promises of the last batch
  await Promise.all(promises)
  p.next(promises.length)
}
Let’s go through this step by step and then take a look at the created promise. This requires three SQL statements: the total number of documents for that prefix, the number of already downloaded and stored documents for that prefix, and the query that retrieves, path by path, what still needs to be downloaded from S3.
To introduce some poor man’s concurrency I build up to 75 promises in parallel and then wait for all of them to finish. While this is not the best solution, it was the fastest I could come up with. Preferably you would have some sort of pool that always runs a certain number of tasks in parallel, but whenever I search for that in the node ecosystem it looks way too complex for a small tool like this one.
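That said, a fixed-size pool does not need a dependency. Here is a rough sketch (not what my tool does): a number of workers keep pulling items from a shared iterator, so there are always that many downloads in flight instead of waiting for the slowest one in each batch.
// hypothetical helper, not part of my tool
async function runWithPool<T>(items: Iterable<T>, worker: (item: T) => Promise<void>, size = 75) {
  const iterator = items[Symbol.iterator]();
  // each lane keeps pulling the next item until the iterator is exhausted
  const lane = async () => {
    for (let next = iterator.next(); !next.done; next = iterator.next()) {
      await worker(next.value);
    }
  };
  // start `size` lanes and wait for all of them to drain the iterator
  await Promise.all(Array.from({ length: size }, () => lane()));
}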
The promise for plain text data looks like this:
const promise = object.text()
  .then(text => {
    db.query(`UPDATE s3_objects SET content=? WHERE path=?`).run([text, result.path])
  })
  .catch(error => {
    console.log(`Error fetching object ${result.path}`, error)
  })
promises.push(promise)
For unpacking gzipped data, there is a built-in Bun function, which only seems to exist in a sync variant. For my prototype this was good enough.
const promise = object.arrayBuffer()
  .then(buffer => {
    const dec = new TextDecoder();
    const uncompressed = Bun.gunzipSync(buffer);
    return dec.decode(uncompressed);
  })
  .then(text => {
    db.query(`UPDATE s3_objects SET content=? WHERE path=?`).run([text, result.path])
  })
  .catch(error => {
    console.log(`Error fetching object ${result.path}`, error)
  })
promises.push(promise)
This is basically the workhorse function: downloading all the data and inserting it into the local SQLite database. A nice side effect is that I now have all my data in a single file, which makes it easy to query in the future, if I ever need to.
Now writing to a file is just running a SQL query and dumping its response:
async function writeIntoFile(db, path) {
  const count = db.query(
    `SELECT COUNT(*) AS count FROM s3_objects WHERE path LIKE "${prefix}%" AND content != ''`
  ).get()
  const p = progress(
    'Writing into local file [[bar]] [[count]]/[[total]]\n',
    {total: count.count}
  );
  const query = db.query(
    `SELECT content FROM s3_objects WHERE path LIKE "${prefix}%" AND content != '' ORDER BY path ASC`
  )
  const file = Bun.file(path);
  const writer = file.writer();
  for (const result of query) {
    const line = result.content
    writer.write(line.trim())
    writer.write('\n')
    p.next(1)
  }
  writer.end();
}
Now the final step, which required an additional invocation from the command line, is the deletion of data on S3. This way I could verify that everything worked as expected first:
async function s3DeleteFiles(prefix) {
  const count = db.query(
    `SELECT COUNT(*) AS count FROM s3_objects WHERE path LIKE "${prefix}%"`
  ).get()
  const p = progress(
    `Deleting files on S3 matching '${prefix}' [[bar]] [[count]]/[[total]]\n`,
    {total: count.count}
  );
  const query = db.query(`SELECT path FROM s3_objects WHERE path LIKE "${prefix}%"`)
  const promises = []
  for (const result of query) {
    const object = bunS3Client.file(result.path)
    const promise = object.delete()
      .then(() => { p.next() })
      .catch(error => { console.log(`Error deleting ${result.path}`, error) })
    // collect the promise, so the batching below actually kicks in
    promises.push(promise)
    if (promises.length >= 25) {
      await Promise.all(promises);
      promises.splice(0, promises.length)
    }
  }
  // wait for the remaining deletions of the last batch
  await Promise.all(promises)
}
Again I try to increase concurrency by triggering several promises and then awaiting them together, reducing the total runtime.
I ran my tool like this:
bun run index.ts \
--bucket my-logs-bucket \
--prefix 'logs/cloudfront/www.spinscale.de/FHGD357FKLHF31.2021' \
--file ./cloudfront-www.spinscale.de-2021.txt
# deletion
bun run index.ts \
--bucket my-logs-bucket \
--prefix 'logs/cloudfront/www.spinscale.de/FHGD357FKLHF31.2021' \
--phases s3DeleteFiles
Summary
If you want to take a look at the code in one file, take a look at my rollup-s3-logs repository.
So, after running this, I roughly deleted 2.8 million very small files on my S3 logs bucket. My local SQLite database is around 2.2 GB after running vacuum. Counting all the rows in all files, I end up with roughly 4.4 million rows in my final text files.
OK, enough data, let’s talk about the Bun experience. TLDR; 👍, would do again. It was so nice and easy to get up and running compared to Node and TypeScript (I’m not a Node.js wizard, just an average user), and the core is complete enough to not need a lot of dependencies either - at least for my small two hundred line tool.
I often use the Crystal language for smaller tasks inside and outside of work, which may leave my colleagues confused about what esoteric language I brought in. Bun might be worth a try as a replacement here in the future, so everyone is up to speed quicker.
Also, the built-in ability to build binaries, and even cross compile them, is really nice. As the compilation for my project takes 120ms, I suppose it basically picks the Bun runtime and packs my app plus dependencies into it, as I end up with a 54MB binary on macOS. I also see some downloads of the architecture specific runtime when building for another platform. Compiled executables reduce memory usage and improve startup time, always nice to have.
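If you have not tried it, compiling is a single command; cross compiling just needs a target flag (output file names below are examples):
bun build ./index.ts --compile --outfile rollup-s3-logs
bun build ./index.ts --compile --target=bun-linux-x64 --outfile rollup-s3-logs-linux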
With my data now in a file-per-year format, I can next think about how to analyze it properly. But that’s for another weekend discovery.
Final remarks
If you made it down here, wooow! Thanks for sticking with me. You can follow or contact me on Bluesky, Mastodon, GitHub, LinkedIn or reach me via Email (just to tell me, you read this whole thing :-).
If there is anything to correct, drop me a note, and I am happy to fix and append a note to this post!
Same applies for questions. If you have a question, go ahead and ask!
If you want me to speak about this, drop me an email!