Show HN: Hyparquet 1.0 – Apache Parquet Parser for the Browser

github.com

6 points by platypii 5 days ago

I started this project 6 months ago when I wanted to look inside datasets from Hugging Face. I was not satisfied with the existing libraries. So, naturally, I built my own from scratch.

Parquet is a very complicated format. It has 22 data types, 9 encodings, and 8 compression codecs. Previous JavaScript parquet libraries were abandoned because of that complexity. However, I can confidently say that Hyparquet is now the most conformant parquet parser in existence. It can open more files than PyArrow and DuckDB. I dare you to find a file that Hyparquet can’t open!

In addition to this broad file support, Hyparquet can efficiently stream parquet data over the network, even cross-domain using CORS. This means you can stream files straight from S3 with no backend.
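
To give a rough idea of why no backend is needed: a parquet file ends with its footer metadata, followed by a 4-byte little-endian footer length and the magic bytes "PAR1", so a client can find the metadata with a couple of small HTTP Range requests. The snippet below is only a minimal sketch of that idea using plain fetch. It is not Hyparquet's actual code, and the helper names (suffixRange, footerLength, fetchFooterBytes) are made up for illustration. It also assumes the server honors Range requests and that its CORS configuration allows the Range header.

```javascript
// Sketch only: plain fetch + HTTP Range requests, not Hyparquet's implementation.
// All function names here are hypothetical.

// Parquet file layout at the end of the file:
// [footer metadata][4-byte little-endian footer length]["PAR1"]
const MAGIC = 'PAR1'

/** Build a Range header value asking for the last `n` bytes of a file. */
function suffixRange(n) {
  return `bytes=-${n}`
}

/** Parse the 8-byte file tail and return the footer metadata length in bytes. */
function footerLength(tail) {
  if (tail.length !== 8) throw new Error('expected exactly 8 bytes')
  const magic = String.fromCharCode(...tail.subarray(4, 8))
  if (magic !== MAGIC) throw new Error('not a parquet file (missing PAR1 magic)')
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength)
  return view.getUint32(0, true) // true = little-endian
}

/** Fetch only the raw footer metadata bytes of a remote parquet file. */
async function fetchFooterBytes(url) {
  // First request: just the last 8 bytes (footer length + magic).
  const tailRes = await fetch(url, { headers: { Range: suffixRange(8) } })
  if (tailRes.status !== 206) throw new Error('server did not honor the Range header')
  const length = footerLength(new Uint8Array(await tailRes.arrayBuffer()))

  // Second request: the footer metadata plus the 8-byte tail.
  const res = await fetch(url, { headers: { Range: suffixRange(length + 8) } })
  if (res.status !== 206) throw new Error('server did not honor the Range header')
  const bytes = new Uint8Array(await res.arrayBuffer())
  return bytes.subarray(0, length) // Thrift-encoded file metadata, still to be decoded
}
```

The metadata lists where each column chunk lives in the file, so a reader can then request only the byte ranges for the columns and row groups it actually needs.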

Hyparquet is open source under the MIT license.

You can launch a local parquet file viewer by running "npx hyperparam"

doppenhe 5 days ago

Could Hyparquet's approach be extended to other data formats beyond Parquet?

  • platypii 5 days ago

    I definitely think that UX is an underappreciated area for machine learning data. I want to make a set of libraries and tools that make it easier for people to work with ML data in the browser. The first step of good data science is to become one with your data.

    I started with parquet because most datasets for modern LLMs are in parquet format. But there are other formats like JSONL which are common too.

doppenhe 5 days ago

Nice, thanks for sharing.