Apache Parquet for HL7 FHIR

Jun 21, 2020

Gidon Gershinsky (ggershinsky@apple.com),
Eliot Salant (salant@il.ibm.com),
Lee Surprenant (lmsurpre@us.ibm.com)

The research leading to these results has received funding from the European Community's Horizon 2020 Research and Innovation programme under Grant Agreement No 826284.

FHIR (Fast Healthcare Interoperability Resources) is a new HL7 standard that has quickly gained widespread acceptance in the medical IT field. The standard defines a data exchange protocol and an extensible message structure aimed at joining disparate vendor systems in the healthcare domain. The data is encapsulated in JSON or XML messages that are sent via REST calls, either in regular mode (one message per call, or a few bundled messages per call) or in a bulk export mode. The latter delivers a massive subset (or all) of the stored FHIR data to another location, for replication, for execution of analytic workloads, or for other purposes.

FHIR data is typically stored in databases, though this is not prescribed by the standard, which focuses solely on data exchange, not on storage.

The bulk export section of the FHIR standard uses ND-JSON (newline-delimited JSON) files as the main format for the bulk exchange of FHIR data. An ND-JSON file is simply a text file where each line is a JSON document, which makes it a natural fit for packaging numerous FHIR resources, because each resource is a JSON document (or an XML document that can be directly mapped to JSON).
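
For illustration, two lines of an ND-JSON export file might look like the following (abbreviated, hypothetical resource content):

```
{"resourceType": "Patient", "id": "pat-1", "name": [{"family": "Chalmers", "given": ["Peter"]}]}
{"resourceType": "Patient", "id": "pat-2", "name": [{"family": "Levin", "given": ["Henry"]}]}
```

Each line is a complete JSON document, so the file can be produced and consumed in a streaming fashion, one resource at a time.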

We have been using the HL7 FHIR technology in a joint European Union sponsored research project with several hospitals, IT companies and universities. The project, called ProTego, is focused on secure and efficient implementation of healthcare scenarios; in some use cases, the platform is fully deployed inside a hospital infrastructure, while in others the implementation is split between a private hospital cloud and a public cloud. In this project, we address a number of challenges, such as efficient storage and fast processing of large amounts of FHIR resources, privacy and integrity of healthcare data (on premises and in the cloud), and efficient export of bulk FHIR data from hospital infrastructure to the cloud (and back).

Besides ND-JSON, other file formats may be suitable for use in future FHIR versions for bulk export. One of these formats is Apache Parquet, a popular standard for efficient storage of big tabular data. Since we are already actively involved in the Apache Parquet community work, we decided to experiment with applying Parquet to FHIR data (see our talk at the DevDays conference).

Apache Parquet can be thought of as "CSV on steroids". It stores tabular data, organized into columns and rows, but unlike CSV, Parquet is a highly optimized, column-oriented binary format that enables the following (see the PyArrow sketch after this list):

  • Data encoding: dictionary encoding keeps only a short reference for repeated values, reducing the size of the stored data (several other encoding types are also supported)
  • Compression: gzip, Snappy and other compressors can be applied to further reduce the size of the stored data
  • Columnar projection: only the relevant subset of columns is fetched from the table data storage, reducing the I/O load and improving workload throughput
  • Predicate pushdown: row-based filtering inside the column data further reduces the I/O load (filtering is not done per single row, but per "row group" or "page")
  • Nested schema: hierarchical columns allow storing nested data, such as JSON records
  • Encryption: privacy and integrity protection for sensitive data, with per-column encryption keys and automatic hardware acceleration of crypto operations
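
A minimal sketch of these capabilities, using the PyArrow library (file and column names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small table; the "code" column is highly repetitive, which makes
# dictionary encoding (on by default) very effective.
table = pa.table({
    "patient_id": ["p1", "p2", "p3", "p4"],
    "code":       ["8867-4", "8867-4", "8867-4", "8867-4"],
    "value":      [72.0, 80.5, 64.2, 90.1],
})

# Compression: further shrink the encoded data with Snappy
pq.write_table(table, "observations.parquet", compression="snappy")

# Columnar projection: fetch only the columns the workload needs
subset = pq.read_table("observations.parquet", columns=["patient_id", "value"])

# Predicate pushdown: skip row groups/pages that cannot match the filter
filtered = pq.read_table("observations.parquet", filters=[("value", ">", 75.0)])
```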

Using Parquet instead of traditional formats can lead to a one-to-two orders of magnitude speedup in workload execution time. Beyond performance, Parquet also enables new functionality (such as nested structure support) and built-in protection of sensitive data. Apache Parquet is integrated into virtually every analytic framework: Apache Spark, pandas, Presto, Impala, Hudi, and more.

The ProTego project explores the application of Parquet format in healthcare scenarios.

In this blog, we report the performance results in two independent technical scenarios:

1. Using Parquet (instead of ND-JSON) for the bulk export of FHIR data

2. Using Parquet (instead of databases) for the storage of FHIR data

The first scenario focuses on data exchange and demonstrates the benefits of adding Parquet to the HL7 FHIR standard (as an additional bulk export format, besides ND-JSON).

The second scenario focuses on data storage. This aspect is not covered by the HL7 FHIR standard, which specifies only the data exchange protocol. As mentioned before, FHIR data is typically stored in databases. Here, we make a case for an alternative approach to storing the healthcare data, based on using columnar format (Parquet) files instead of databases for certain types of FHIR resources, as will be explained later.

In both scenarios, the modular Parquet encryption mechanism is leveraged to protect the FHIR data, either at rest (stored) or in flight (exported).

1. FHIR DATA EXPORT

Apache Parquet support for nested columns allows direct mapping of JSON records to Parquet rows, and thus direct conversion of ND-JSON files to Parquet files. Apache Parquet is therefore a viable alternative to ND-JSON for bulk export of HL7 FHIR data. To confirm this, and to measure the performance benefits of using Parquet instead of ND-JSON, we ran a number of experiments in which Apache Spark was used to convert an ND-JSON file full of FHIR resources into a Parquet file.
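
A minimal PySpark sketch of this conversion (file names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fhir-ndjson-to-parquet").getOrCreate()

# Spark reads one JSON document per line and infers a nested schema
df = spark.read.json("observations.ndjson")

# Write the same data, schema included, as a compressed Parquet file
df.write.option("compression", "snappy").parquet("observations.parquet")
```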

The experiments were performed on data derived from the samples at https://www.hl7.org/fhir/downloads.html. To improve the representativeness of the results, we worked with different FHIR resource types and different internal structures: Patients; Observations with and without contained and component elements; Questionnaires with item depth ranging from 1 to 5, some with contained elements. The detailed results of the experiments are described in a report. Here we provide the summary:

- For any type of resource data, Parquet files maintain the correct schema, and their contents are fully identical to the original ND-JSON files (both schema and data are preserved).

- Parquet files are much smaller than ND-JSON files with the same data.

  • Patient files are ~6 times smaller.
  • Observation files are ~40 times smaller.

The reason for the stronger reduction in Observations is that they are less diverse than Patients: typically, only a few elements change from record to record (such as id and value), while the rest (coding, performer, subject) change slowly or do not change at all. This allows the Parquet dictionary encoding mechanism to store the repeated values using only their reference ids.
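
The effect is easy to observe with the Parquet metadata API. In the PyArrow sketch below (illustrative data), a column that repeats the same value occupies far fewer compressed bytes than a unique-per-record column of the same length:

```python
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
table = pa.table({
    "id":     [f"obs-{i}" for i in range(n)],    # unique in every record
    "coding": ["http://loinc.org|8867-4"] * n,   # identical in every record
})
pq.write_table(table, "encoding_demo.parquet")

# Compare the compressed sizes of the two column chunks
meta = pq.ParquetFile("encoding_demo.parquet").metadata.row_group(0)
for i in range(meta.num_columns):
    col = meta.column(i)
    print(col.path_in_schema, col.total_compressed_size)
# The dictionary-encoded "coding" column is orders of magnitude smaller than "id".
```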

- SQL queries run significantly faster on Parquet files than on ND-JSON files.

We believe these results support adding Apache Parquet as one of the HL7 FHIR standard formats for bulk export of data. The possible use cases are:

a. General Purpose Export

A user would be able to request either "ndjson" or "parquet" format within the current API. This does not require any change in the current FHIR Export specification, except for allowing a "parquet" value in the _outputFormat parameter, plus potentially the addition of a _compressionType parameter.
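
A kickoff request could then look like the sketch below (Python; the server URL is hypothetical, and the "application/parquet" output format value and the _compressionType parameter are our proposed additions, not part of the current specification):

```python
import requests

# Bulk-export kickoff, following the FHIR Bulk Data Access pattern;
# the server replies 202 Accepted with a Content-Location to poll.
response = requests.get(
    "https://fhir.example.org/$export",           # hypothetical server base
    params={
        "_outputFormat": "application/parquet",   # proposed value, instead of ndjson
        "_compressionType": "snappy",             # proposed new parameter
    },
    headers={
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",
    },
)
print(response.status_code, response.headers.get("Content-Location"))
```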

The produced Parquet files will contain the same information as the ND-JSON files in the current FHIR spec. The reasons to use Parquet format are:

  • (many) fewer bytes are processed/stored by the FHIR server and sent on the wire to the user, due to Parquet encoding and compression
  • the received files support faster analytics/queries, due to Parquet columnar projection and predicate pushdown support

b. General Purpose Export with Encryption

Similar to the previous use case, but the produced files are encrypted with the standard Parquet encryption. This allows for

  • faster security: instead of a TLS handshake for each file, TLS is performed only once (to fetch the encryption keys), and then the files (with built-in Parquet encryption) are sent over a regular connection
  • safe export to public storage / cloud locations: the receivers get the keys, but the data itself is encrypted and can be stored "as is" in untrusted storage

Encryption will require a certain change in the FHIR Export specification, in the "Security Considerations" section, plus an addition to the request/response parameters.

c. Analytics Friendly Export

The general-purpose export mode, described above, produces Parquet files for accurate transfer of data, without changes in schema or loss of information. The Parquet format already allows columnar projection and predicate pushdown on these files, which accelerate analytic queries. However:

  • the speed of analytic workloads can be further improved if the schema is changed
  • user experience can be improved if the view is simplified and some data elements are removed (see the SQL-on-FHIR project)

The analytics-friendly mode would follow the recommendations of the SQL-on-FHIR project, by dropping certain resource elements and by promoting extensions to first-class fields. This mode can go further, by enabling configurable schema flattening and the additional removal of elements irrelevant to the user's workload. The user should be able to tell a FHIR server what Parquet schema she is interested in (and how the original resources should be mapped to the target schema), or to choose from several predefined schemas/profiles. This will require significant additions to the FHIR Export specification.
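
As an illustration of such flattening, the PySpark sketch below projects a nested Observation into a flat, analytics-friendly table. The column paths assume a typical FHIR Observation structure, and the selection is our own example, not a profile defined by SQL-on-FHIR:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fhir-flatten").getOrCreate()
df = spark.read.parquet("observations.parquet")

# Pull a handful of nested elements up into plain top-level columns
flat = df.select(
    F.col("id"),
    F.col("subject.reference").alias("patient"),
    F.col("code.coding").getItem(0).getField("code").alias("loinc_code"),
    F.col("valueQuantity.value").alias("value"),
    F.col("valueQuantity.unit").alias("unit"),
)
flat.write.parquet("observations_flat.parquet")
```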

2. FHIR DATA STORAGE

In the ProTego project, we go one step beyond using Apache Parquet for export of FHIR data from hospitals to public clouds, and use Parquet also for the storage of massive FHIR resources in the hospitals. FHIR is an optimal format for transmission of high volumes of clinical data, leveraged by researchers and practitioners for healthcare-related analytics. Instead of storing FHIR records in databases, we take the approach adopted by many companies in the big data analytics space, which store the information in files with a columnar format. File storage is inexpensive and scalable. Moreover, the Parquet format is designed to accelerate analytic workloads by implementing advanced data filtering capabilities and compression inside the files.

We have developed an interceptor in the open source IBM FHIR Server that grabs the incoming FHIR messages and writes them to Apache Parquet files. This is not done for every FHIR resource type, but only for the massive data collected in analytic use cases: for example, sensor readings, transmitted as FHIR Observation records, where large amounts of data are collected and analyzed on a per-patient level (or for many patients). Another example is drug prescriptions, transmitted as FHIR MedicationRequest records, which collectively, across a hospital or healthcare system, represent a large amount of data used for medical studies. This kind of information is immutable/append-only, and therefore fits the Parquet model, which does not support fine-grained updates or deletion of individual records.

Today, FHIR resources are typically stored in databases, SQL or NoSQL. Running massive analytics on this data is often done by exporting it from a database to ND-JSON files and running an analytics engine on those files. Sometimes a double export is required, from the database to ND-JSON and then from ND-JSON to Parquet, in order to improve the analytics performance.

Storing big FHIR data directly in the Apache Parquet format allows for direct analytics on this data, without any export. The performance benefits are obvious.

Also, as in the "analytics-friendly" bulk export mode described above, we are considering applying the results of the SQL-on-FHIR project to the export-free FHIR storage in Parquet, in order to further improve the efficiency of analytics on this data and to enhance the SQL user experience. This would involve dropping certain elements, flattening others, and converting extensions into first-class fields, all done upon storing FHIR records in Parquet files.

Moreover, if a use case requires a bulk export of FHIR data to another location (for any purpose, not just analytics), it can be implemented by simply transferring the existing Parquet files. There is no need to extract the FHIR records from a database and package them for remote delivery.

3. FHIR DATA PROTECTION

We take a leading role in the Apache Parquet community work on an encryption mechanism that protects sensitive column data. The specification of this mechanism has recently been adopted by the community as part of the Parquet format standard.

Healthcare data is obviously personal and sensitive. Besides protecting its privacy, it is often important to protect its integrity, so that a malicious party cannot tamper with the contents of medical records, such as sensor readings, prescriptions, medication allergies, etc.

The Apache Parquet encryption mechanism protects both the privacy and the integrity of the data. In the ProTego project, we use it in the storage mechanism, where FHIR data is written directly to Parquet files, and in the transmission mechanism, where FHIR data is bulk-exported from hospitals to public clouds.
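
As a sketch of how this looks from application code, the snippet below uses the PyArrow interface to Parquet modular encryption. The toy KMS client, key identifiers and column names are placeholders for a real key-management integration:

```python
import base64
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

class DemoKmsClient(pe.KmsClient):
    """Toy key 'wrapping' (base64 only); a real client would call a KMS."""
    def __init__(self, kms_connection_config):
        super().__init__()

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes).decode("utf-8")

    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)

crypto_factory = pe.CryptoFactory(lambda config: DemoKmsClient(config))
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key_id",                               # protects file metadata/integrity
    column_keys={"patient_key_id": ["patient_id", "value"]},  # per-column data keys
)
encryption_properties = crypto_factory.file_encryption_properties(
    pe.KmsConnectionConfig(), encryption_config)

# Write an encrypted Parquet file; readers need the matching keys to decrypt
table = pa.table({"patient_id": ["p1", "p2"], "value": [98.6, 99.1]})
with pq.ParquetWriter("observations.parquet.encrypted", table.schema,
                      encryption_properties=encryption_properties) as writer:
    writer.write_table(table)
```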
