16
Parsing and reading Excel binary data using Pandas
Introduction
Sometimes, it can be necessary to parse data in binary format from an external source in order to work with it.
Usually, when encoding the data from its original source into a binary format, some information that retains the original structure of the data is kept, for example, xlsb
for binary Excel data.
Ideally, when decoding the data back into its original format, we want some form of "typed conversion", where we can access the original data in the format it was originally stored.
Let's see how to do that using the Python library pandas
, motivated by a short real-world example.
An example to set the stage
Let's assume that we have real-world data containing information about the changes in acceleration on the crest of a dam over time, measured by accelerometers installed on the crest itself. This data is recorded in an Excel file containing two columns: t
, representing the time in seconds, and a (mg)
the corresponding acceleration value in micro-g.
Now, let's suppose, for monitoring purposes and displaying chart data, this file is encoded into binary and stored in a MongoDB DB as a binary field, so, when looking inside the database documents, for example, via a managed Atlas database, we get:
Note how it is stored in a binary format in the DB. Document databases are well-suited for storing this style of unstructured data, which is why we show a NoSQL database here, but, a SQL database would work just as fine.
So, now, we have a field in a database that is nothing more than a plain stream of binary data. Let's see how to get a typed decoding from it, so, we can effectively rebuild our original excel data to be used by any client app.
Using pandas+BytesIO to get a typed conversion into the original excel format
Now that we have our binary data, we need to consume it at the client side. In order to do it, we can leverage a combination of two distinct packages: BytesIO
, a standard Python package that can parse a stream of bytes using a specific encoding, and pandas
, which, quoting the official website, " is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.". Note that the read_excel
built-in function from pandas requires installing an additional dependency: openpyxl
.
The trick is to use composition of the function calls, by nesting them inside one another: first we process the binary stream with BytesIO and then, using that parsed stream, we can pass it into pandas, and, in particular, to the read_excel
function.
Let's assume that in our mongoDB database, we have the binary file stored in a list called dataFiles
and we want to parse the first file in the list. Assuming mongoClient
is a configured and ready-to-use client, the chaining code would look like:
import pandas as pd
(...)
excel_data = pd.read_excel(io.BytesIO(mongoClient['dataFiles'][0]))
Now, inside the variable excel_data
we have an array of Nx2 entries, where N is total number of rows in the excel file, which, in this example is 4000, so, when we print the variable to the console, here is the output:
t a (mg)
0 0.00 0.000358
1 0.02 -0.000466
2 0.04 -0.000181
3 0.06 0.000697
4 0.08 0.000618
... ...
3996 79.92 -0.001205
3997 79.94 0.000120
3998 79.96 -0.000061
3999 79.98 -0.000022
4000 80.00 0.000142
exactly as the original excel file format, decoded from the binary format correctly!
Now, if we want to have access to a column, we can actually do it in a typed way, so, printing excel_data['t'][0]
will give us back the value 0.00 as a 64 bit float value, typed, as we wanted.
Conclusion
Using a combination of standard library and open-source libraries, we can see how simple it is to convert binary data to an Excel format and process and operate on it.
The main trick is to use functional composition of different libraries in order to have the data flowing in between them in the required compatible formats, and, finally, being outputted in a typed format (can be either numeric or text, dependending on the data that was in the Excel). Then, the resulting 2D array operates like an hashmap where the keys are the headers from the various Excel columns.
16