Alexander Reelsen

Backend developer, productivity fan, likes the JVM, full text search, distributed databases & systems

Shorts: Converting & compressing a CSV file
Jan 28, 2025
7 minutes read

TLDR; This is a much shorter blog post than usual, as I just summarize things I tested in a few hours over the weekend. My plan is to get back into some writing rhythm over this year and come up with more but shorter posts.

I started reading a DuckDB post about better compression in more recent parquet versions and wondered how compression would work with my own dataset.

My testing dataset was quite small, so it is not representative of anything substantial. I have collected a little more than two years of PV data in five-minute intervals, around 240k entries. Each entry consists of:

  • A unique identifier of my PV installation, which is always the same
  • A timestamp at second granularity
  • Six floats representing current PV generation, grid consumption, grid push, car charging, current load and battery capacity
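To make this a bit more concrete, a single CSV row looks roughly like this, using the column names from the DuckDB schema further below (the values, and possibly the exact column order, are made up for illustration):

sysSn,date,ppv,load,cbat,feedIn,gridCharge,pchargingPile
XY123456789,2024-06-01 12:05:00,3.2,0.8,95.5,2.4,0.0,0.0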

My CSV file with roughly 240k entries is 14 MB in size, let’s see if we can go below that. Using gzip -9, the file ends up at 2.1 MB. Using zstd --ultra five_mins.csv, we get a 2.4 MB file. Given the fact that we mainly have floating point numbers here, I am still surprised just how much smaller this is than the original file. As the floating point numbers all have at most a single decimal digit, there is some degree of repetition as well.
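For reference, the two compression runs boil down to something like this (the -k flag is my addition so the original CSV sticks around; as far as I know, --ultra on its own only unlocks the zstd levels above 19, you would still add e.g. -22 to actually use them):

gzip -9 -k five_mins.csv    # -> five_mins.csv.gz, ~2.1 MB
zstd --ultra five_mins.csv  # -> five_mins.csv.zst, ~2.4 MB; zstd keeps the input by default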

Then I remembered an Apache project named TsFile that I always wanted to test. I pretty much copy/pasted the example on the website and wrote my data into a file. I ended up with a 1.8 MB file, the winner so far.

Next up was using csv2parquet from the Arrow CLI tools.

csv2parquet -c zstd \
  --delimiter ',' \
  --dictionary true \
  five_mins.csv five_mins.cli.parquet

I ended up with a 1.7 MB file. When using csv2arrow from the Arrow CLI tools I ended up with an 18 MB file, so I assume you need to configure compression manually to make this work, as I don’t expect this format to be bigger than my CSV file.

Then I thought I could try to be smart and use a time series compression algorithm like Gorilla. So I used the next best Java implementation I could find, which was gorilla4j, even though it’s already five years old. This library uses Java bitsets under the hood, and I ended up with a six megabyte file when serializing the bitsets to disk, which is highly inefficient. I probably could have used roaring bitmaps instead. I decided to test out the Fury serializer with compression enabled, however that also only shaved off about 100 kB. So let’s ditch Java for now. Turns out trying to be smart was not so successful 😀

Let’s continue with DuckDB, loading the CSV and then exporting it to parquet.

CREATE TABLE five_mins (
  sysSn VARCHAR,
  date DATE,
  ppv DOUBLE,
  load DOUBLE,
  cbat DOUBLE,
  feedIn DOUBLE,
  gridCharge DOUBLE,
  pchargingPile DOUBLE
);

COPY five_mins FROM 'five_mins.csv' WITH (HEADER, DELIMITER ',');

COPY five_mins TO 'snappy_v1.parquet' (COMPRESSION snappy, PARQUET_VERSION V1);
COPY five_mins TO 'snappy_v2.parquet' (COMPRESSION snappy, PARQUET_VERSION V2);
COPY five_mins TO 'zstd_v1.parquet' (COMPRESSION zstd, PARQUET_VERSION V1);
COPY five_mins TO 'zstd_v2.parquet' (COMPRESSION zstd, PARQUET_VERSION V2);

This gives us four files. The v1 files are each 2.0 MB in size, whereas the v2 files are 1.6 MB in size. A solid 20% reduction just by using a newer parquet version. Snappy or ZSTD compression does not matter.

So, from 14 MB down to 1.6 MB with two DuckDB commands. That’s fair, and hard to beat. A further idea would be to take a look at the LanceDB format, but that’s maybe for another week.

OK, one last thing I want to test. As the float values all have only a single decimal digit, how about storing them as integers instead? Would that make much of a difference? We could simply multiply all values by 10 and then go with an integer column, as integer compression should be much better.

COPY (
  SELECT
    syssn,
    date, 
    CAST(ppv*10 as integer) AS ppv,
    CAST(load*10 as integer) AS load,
    CAST(cbat*10 as integer) AS cbat,
    CAST(feedin*10 as integer) AS feedIn,
    CAST(gridCharge*10 as integer) AS gridCharge,
    CAST(pchargingpile*10 as integer) AS pChargingPile
  FROM five_mins
)
TO 'zstd_v3.parquet' (COMPRESSION zstd, PARQUET_VERSION V2);

Now we end up with a 1.3 MB file, our winner so far, at the expense of having to convert the values back (divide by 10) at query time. So we traded file size for query time. With this we’re at roughly 9% of the original size. In production I would very likely go with the original, unscaled values, so that folks do not accidentally retrieve wrong results in case they don’t know about that background.
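The flip side shows up when reading the file: every query has to scale the values back, for example along these lines (just a sketch; DuckDB can query the parquet file directly):

SELECT
  date,
  ppv / 10.0 AS ppv,
  load / 10.0 AS load,
  cbat / 10.0 AS cbat
FROM 'zstd_v3.parquet'
LIMIT 5;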

While writing these last paragraphs I had one final idea: instead of using a plain DATE column like above, let’s use a timestamp with second resolution in DuckDB and see if the exported data changes.

CREATE TABLE five_mins_ts (
  sysSn VARCHAR,
  date TIMESTAMP_S,
  ppv DOUBLE,
  load DOUBLE,
  cbat DOUBLE,
  feedIn DOUBLE,
  gridCharge DOUBLE,
  pchargingPile DOUBLE
);

COPY five_mins_ts FROM 'five_mins.csv' WITH (HEADER, DELIMITER ',');

COPY five_mins_ts TO 'zstd_v2_ts.parquet' (COMPRESSION zstd, PARQUET_VERSION V2);

Now I ended up with a file size of 974 kB, despite using floats again. When using pqrs to dump the schema, there is something interesting:

❯ ~/.cargo/bin/pqrs schema zstd_v2.parquet
Metadata for file: zstd_v2.parquet

version: 1
num of rows: 241666
created by: DuckDB version v1.1.4-dev4923 (build d0c4cf8a28)
message duckdb_schema {
  OPTIONAL BYTE_ARRAY sysSn (UTF8);
  OPTIONAL INT32 date (DATE);
  OPTIONAL DOUBLE ppv;
  OPTIONAL DOUBLE load;
  OPTIONAL DOUBLE cbat;
  OPTIONAL DOUBLE feedIn;
  OPTIONAL DOUBLE gridCharge;
  OPTIONAL DOUBLE pchargingPile;
}

❯ ~/.cargo/bin/pqrs schema zstd_v2_ts.parquet
Metadata for file: zstd_v2_ts.parquet

version: 1
num of rows: 241666
created by: DuckDB version v1.1.4-dev4923 (build d0c4cf8a28)
message duckdb_schema {
  OPTIONAL BYTE_ARRAY sysSn (UTF8);
  OPTIONAL INT64 date (TIMESTAMP(MICROS,false));
  OPTIONAL DOUBLE ppv;
  OPTIONAL DOUBLE load;
  OPTIONAL DOUBLE cbat;
  OPTIONAL DOUBLE feedIn;
  OPTIONAL DOUBLE gridCharge;
  OPTIONAL DOUBLE pchargingPile;
}

While the data type for the date field is larger (INT64 instead of INT32), it seems that the integer compression worked much better for second-resolution timestamps. This is just guesswork on my side and has not been verified.

Is there a CLI tool that allows me to check the size per column in a parquet file? Turns out there is. Running parquet-tools meta zstd_v2.parquet showed me what I needed. Unfortunately, it also showed me that I had accidentally exported my initial zstd files with snappy compression. After fixing that, the initial export from DuckDB with the timestamp as a date came out at a stunning 953 kB. So, just by going with zstd within parquet we can go from a 14 MB file down to less than one megabyte, ending up at just 7% of the original size, and likely with the ability to query it even faster than a regular CSV file. Nice! And another reminder to always check your assumptions before trying something crazier.
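As a final sanity check that querying stays convenient, DuckDB can read the parquet export directly, something along these lines (again just a sketch, here against the second-resolution export):

SELECT
  date_trunc('day', date) AS day,
  round(avg(ppv), 1) AS avg_ppv
FROM 'zstd_v2_ts.parquet'
GROUP BY day
ORDER BY day
LIMIT 5;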

I’m sure there are even more ways to optimize. But given the fact that all you need is a couple of DuckDB commands to end up with really good compression, compared to anything you would invest time in programming yourself, this sounds like the best solution, plus the ability to query those files really fast. I’m always amazed what kind of powerful tools are just at our fingertips.

Parquet will not go away soon in the world of data engineering, warehouses, lake houses and whatever fancy term comes next. Happy hacking!

Final remarks

If you made it down here, wooow! Thanks for sticking with me. You can follow or contact me on Bluesky, Mastodon, GitHub, LinkedIn or reach me via Email (just to tell me, you read this whole thing :-).

If there is anything to correct, drop me a note, and I am happy to fix and append a note to this post!

The same applies to questions. If you have a question, go ahead and ask!

If you want me to speak about this, drop me an email!

