Scratching an itch with Spark

One of the things I love about Spark is how easy it makes working with various data sources. Once you've got the data out of the source and into a Dataset or DataFrame then you can manipulate that data without thought to the original format. Most of the times this is in CSV, Parquet, JSON, or a few others. But occasionally (or quite a lot of the time it seems if you're me) people will have the bulk of their data in Excel.

I'm not going to go into why Excel is a bad format for data storage and distribution, you can find plenty of reasons on-line, or you've likely encountered it yourself (Excel interpreting dates anyone?).

I've been working on a custom data source for Spark recently using the newer DatasourceV2 APIs, which make it easier to write data sources without having to consider RDDs, whilst bringing along a lot of new functionality like predicate push-down.

At the moment I'm still working through a few kinks, and ensuring that I've got as much code and branch coverage as I can, at which point I'll do a deeper dive into the how as I can't see a lot about how to write your own data sources on-line or off-line. But at the moment the data source supports the following features.

Handling Excel 97-2003, 2010, and OOXML files (thanks to Apache POI)

Multi-line headers

Reading data from multiple sheets matching a given regular expression

Handle glob patterns for reading multiple files

Outputting the sheet name in the result set (set via an option)

Schema inference

Cleaning up column names (removing invalid characters)

Handle merged cells (repeats data to all cells in the merged region)

Formula evaluation (for those supported by Apache POI)

Long name, short name, and helper method for loading data

Works in Scala, PySpark, and Spark SQL

This screenshot is querying a sample workbook I put together which contains formula data, I wanted to do this to make sure that the data was extracted as expected. And thanks to the Apache POI library, most formulae people use should be covered.

Code coverage is looking pretty good at the moment, but I want to make sure all of the key parts of the data source are covered, as there's a few "quirks" of Excel which I need to make sure are working as expected.

Hopefully I can jump onto the deeper dive soon, just in case there's anyone else out there crazy enough to want to write their own data source.

Tags:

Scala Spark Excel

Building a REST API in Scala 3 using Iron and Cats

ALIGN CENTER Using Shortcuts in Microsoft Excel 365!

A Simple Shortcut to Paste Name into Formula in MS Excel 365

Excel como Base de Datos de lectura.

How to “Select Table Column” Using Shortcuts in MS Excel 365?

Shortcut to Move to the First cell in the Worksheet in Excel 365!!

How to Format Text in Microsoft Excel 2013?

How to use Excel UNICHAR Function in Office 365?

Scratching an itch with Spark

Tags:

Related Posts