Why Polars

data processing
The key to useful and meaningful data is data that is good quality and tidy. Getting data to be tidy and higher quality is an often undervalued and underestimated activity, at the same time being very labour intensive. But it is essential to have higher quality and tidier data to have better quality results from the analysis. This post explains why we chose to use Polars for data processing tasks.
Published

December 4, 2024

Context and problem statement

When working with data analysis, data is rarely ready to be used ‘as is’. There will always be things that are either very wrong (like text in what should be a number field), slightly wrong (like a single missing or incorrect value), needing to be rearranged or reorganized, or an outlier potentially due to an error in data entry. While we don’t have domain knowledge to know about missing values or the definition or meaning of an outlier, there are some basic things we can check and fix, such as whether the data contradicts any constraints in the definition of the data resource (eg. a number in a date field). There are a lot of tools that can process data, so our question is:

Which tool should we use for processing data in our software and what tool should we recommend users to use when processing needs to happen outside of our software?

Decision drivers

  • Can be installed and used in Python.
  • Is designed and built to be fast for processing data.
  • Has minimal dependencies, so can be installed quickly.
  • Is flexible to reading different data storage formats (e.g. SQL databases, Parquet, CSV).
  • Can write to different storage formats (as with above).

Considered options

In addition to our requirements defined by our principles and our requirements listed above, these are the options we’ve considered:

DuckDB

DuckDB is an in-process online analytical process (OLAP) database tool that has API extensions in many languages including Python. It is a very young tool, version 1.0.0 was released in 2024, although the first version was released in 2019. The developers have set it up as a not-for-profit foundation in the Netherlands.

Benefits

  • Although a very young project it is already in use by a number of large companies and has a large community of users.
    • The code is open source, and the way it has been set up is promising for it to remain that way.
  • It is very fast, and can handle large data sets.
  • Extensive documentation with user guide sections embedded.

Drawbacks

  • To use it to its full potential in Python, users need to have some knowledge of SQL.

Polars

The Polars project started in 2020 and is based in the Netherlands. They have a relatively small team but seem to have a fast growing user base, and very enthusiastic developers. Polars is written in Rust and available for Python. The two main developers are described as enthusiastic about open source.

Benefits

  • Enthusiastic user community, with a lot of support available.
  • Supports several languages, including Python.
  • Handles all common data formats.
  • Written in Rust (a very fast, low level language), so it processes data very fast.
  • Polars has an extensive user guide section.

Drawbacks

  • Relatively young project, so may not have all the features of more established tools.
  • Because it is still new, it isn’t as widely used and doesn’t have too many learning resources available yet.

Pandas

Pandas was initially released in 2008, and is now a well established tool to work with data frames in Python. The tool is written in Python, and is designed to be easy to use.

Benefits

  • Established tool with a large user base.
  • Well supported, and well documented.
  • Plenty of learning resources available from a variety of sources.
  • Works with all relevant languages and data formats.
  • The Pandas website contains a very thorough section on how to get started, including customised tutorials for people that have experience with many other tools.
  • The documentation seems very thorough, and includes a lot of examples.

Drawbacks

  • Can be slow when working with large data sets.
  • Because it was developed more than a decade ago, some of its syntax and design is out-dated.

Decision outcome

We have decided to use Polars for data processing, as it is fast and well documented, and we expect that we may use it elsewhere in the project. It is also a relatively new tool, but it is already well supported with an enthusiastic following, which we expect will continue in future.

Consequences

We don’t expect that we will find any problems with this decision down the line, but as always it is something that we will review on a regular basis. If we find that Polars is not meeting our needs, we will look at other tools that are available.

Resources used for this post