<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Dan Homola on Medium]]></title>
        <description><![CDATA[Stories by Dan Homola on Medium]]></description>
        <link>https://medium.com/@DanHomola?source=rss-1b1e4f053e40------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*7O62CuqdPHmvWGGbcSAtMw.jpeg</url>
            <title>Stories by Dan Homola on Medium</title>
            <link>https://medium.com/@DanHomola?source=rss-1b1e4f053e40------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 24 Apr 2026 15:04:01 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@DanHomola/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[FlexConnect: Cross-tenant Benchmarking]]></title>
            <link>https://medium.com/gooddata-developers/flexconnect-cross-tenant-benchmarking-0aacb51182b2?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/0aacb51182b2</guid>
            <category><![CDATA[gooddata]]></category>
            <category><![CDATA[benchmarking]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Thu, 27 Feb 2025 15:55:49 GMT</pubDate>
            <atom:updated>2025-02-27T15:55:49.091Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0DJuOoO7NWNHxZOz" /></figure><p>It is only natural that people and organizations tend to compare themselves to others: it can drive positive change and improvement. For BI solutions that operate on the data of many (often competing) tenants, it can be a valuable selling point to allow the tenants to compare themselves against others. These can be different businesses, departments in the same business, or even individual teams.</p><p>Since each tenant’s data is often sensitive and proprietary, we need to take some extra steps to make the comparison useful without outright releasing the other tenants’ data. In this article, we describe the challenges unique to benchmarking and illustrate how the GoodData FlexConnect data source can be used to overcome them.</p><h3>Benchmarking and its challenges</h3><p>There are two aspects we need to balance when implementing a benchmarking solution:</p><ul><li>Aggregating data across multiple peers</li><li>Picking only relevant peers</li></ul><p>First, we need to aggregate the benchmarking data across multiple peers so that we do not divulge data about any individual peer. We must choose an appropriate granularity (or granularities) on which the aggregation happens. This is very domain-specific, but some common granularities to aggregate peers are:</p><ul><li><strong>Geographic</strong>: same country, continent, etc.</li><li><strong>Industry-based</strong>: same industry</li><li><strong>Aspect-based</strong>: same property (e.g. public vs private companies)</li></ul><p>Second, we need to pick peers that are relevant to the given tenant: comparing to the whole world at once is very rarely useful. Instead, the chosen peers should be in the “same league” as the tenant that is doing the benchmarking.
There can also be compliance concerns at play: some tenants can contractually decline to be included in the benchmarks, and so on.</p><p>All of this can make the algorithm for choosing the peers very complex: often too complex to implement using traditional BI approaches like SQL. We believe that GoodData FlexConnect is a better fit for implementing the benchmarking: it lets us use Python to implement arbitrarily complex benchmarking algorithms while plugging seamlessly into GoodData as “just another data source”.</p><h3>What is FlexConnect</h3><p>FlexConnect is a new way of providing data to be used in GoodData. I like to think of it as “code as a data source” because that is essentially what it does — it allows using arbitrary code to generate data and act as a data source in GoodData.</p><p>The contract it needs to implement is quite simple: the FlexConnect function gets an execution definition, and its job is to return a relevant <a href="https://arrow.apache.org/">Apache Arrow</a> Table. Our <a href="https://www.gooddata.com/blog/flexconnect-build-your-own-data-source/">FlexConnect Architecture article</a> goes into much more detail; I highly recommend reading it next.</p><p>For the purpose of this article, we will focus on the code part of FlexConnect, glossing over the infrastructure side of things.</p><h3>The project</h3><p>To illustrate how FlexConnect can serve benchmarking use cases, we will use the sample project available in the GoodData Trial. It consists of one “global” workspace with data for all the tenants and then several tenant-specific workspaces.</p><p>We want to extend this solution with a simple benchmarking capability using FlexConnect so that tenant workspaces can compare themselves to one another.</p><p>More specifically, we will add the capability to benchmark the average amount of returns across the different product categories.
We will pick the peers by comparing total numbers of orders, choosing those competitors whose order counts are similar to those of the tenant running the benchmark.</p><h3>The solution</h3><p>The solution uses a FlexConnect function to select the appropriate peers based on the selected criteria and then runs the same execution against the global workspace with an extra filter making sure that only the peers are used.</p><p>The schema of the data returned by the function makes sure that no individual peer can be visible: there simply is not a column that would hold that information. Let’s dive into the relevant details.</p><h3>The FlexConnect outline</h3><p>The main steps of the FlexConnect function are as follows:</p><ol><li>Determine which tenant corresponds to the current user</li><li>Use a custom peer selection algorithm to select appropriate peers to get the comparative data</li><li>Call the global workspace in GoodData to get the aggregate data using the peers from the previous step</li></ol><p>The FlexConnect function returns data conforming to the following schema:</p><pre>import pyarrow<br><br>Schema = pyarrow.schema(<br>    [<br>        pyarrow.field(&quot;wdf__product_category&quot;, pyarrow.string()),<br>        pyarrow.field(&quot;mean_number_of_returns&quot;, pyarrow.float64()),<br>    ]<br>)</pre><p>As you can see, the schema contains a benchmarking metric sliced by individual product categories. This gives us very strict control over which granularities of the benchmarking data we want to allow: there is no way a particular competitor would leak here.</p><h3>Current tenant detection</h3><p>First, we need to determine which tenant is the one we are choosing the peers for. Thankfully, each FlexConnect invocation receives the information about which workspace it is being called from.
We can use this to map the workspace to the tenant it corresponds to.</p><p>For simplicity’s sake, we use a simple lookup table in the FlexConnect function itself, but this logic can be as complex as necessary — in real-life scenarios, this is often stored in some data warehouse and you could query for this information (and possibly cache it).</p><pre>from typing import Optional<br><br>import gooddata_flight_server as gf<br>from gooddata_flexconnect import ExecutionContext<br><br>TENANT_LOOKUP = {<br>    &quot;gdc_demo_..1&quot;: &quot;merchant__bigboxretailer&quot;,<br>    &quot;gdc_demo_..2&quot;: &quot;merchant__clothing&quot;,<br>    &quot;gdc_demo_..3&quot;: &quot;merchant__electronics&quot;,<br>}<br><br># This is the API you need to implement; it has to adhere to the PyArrow schema<br>def call(<br>    self,<br>    parameters: dict,<br>    columns: Optional[tuple[str, ...]],<br>    headers: dict[str, list[str]],<br>) -&gt; gf.ArrowData:<br>    execution_context = ExecutionContext.from_parameters(parameters)<br>    tenant = TENANT_LOOKUP.get(execution_context.workspace_id)<br>    peers = self._get_peers(tenant)<br>    return self._get_benchmark_data(<br>        peers, execution_context.report_execution_request<br>    )</pre><h3>Peer selection</h3><p>With the current tenant known, we can then select the peers for the benchmarking. We use a custom SQL query, which we run against the source database. This query selects peers that have similar numbers of orders (we consider competitors that have 80–200% of our order count). Since the underlying database is Snowflake, we use the Snowflake-specific syntax to inject the current tenant into the query.</p><p>Please keep in mind that SQL is used here only for illustration: the peer selection can use any algorithm you want and can be as complex as needed based on business or compliance needs.
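The same 80–200% rule could equally be implemented in plain Python. A minimal in-memory sketch of the selection logic (our own illustration, with made-up order counts; the real implementation below uses SQL):

```python
def select_peers(order_counts: dict[str, int], tenant: str) -> list[str]:
    """Pick peers whose order count is 80-200% of the given tenant's."""
    ours = order_counts[tenant]
    return [
        client
        for client, orders in order_counts.items()
        if client != tenant and 0.8 * ours <= orders <= 2 * ours
    ]
```

Whatever the implementation, the output is simply a list of peer identifiers that the rest of the pipeline treats as opaque.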
E.g., it could contact <a href="https://www.gooddata.com/blog/flexconnect-integrate-any-apis-into-bi/">some external API</a>.</p><pre>import os<br><br>import snowflake.connector<br><br><br>def _get_connection(self) -&gt; snowflake.connector.SnowflakeConnection:<br>    ...  # omitted for brevity<br><br>def _get_peers(self, tenant: str) -&gt; list[str]:<br>    &quot;&quot;&quot;<br>    Get the peers that have a comparable number of orders to the given tenant.<br>    :param tenant: the tenant for which to find peers<br>    :return: list of peers<br>    &quot;&quot;&quot;<br>    with self._get_connection() as conn:<br>        cursor = conn.cursor()<br>        cursor.execute(<br>            &quot;&quot;&quot;<br>        WITH PEER_STATS AS (<br>            SELECT COUNT(*) AS total_orders,<br>                   &quot;wdf__client_id&quot; AS client_id,<br>                   IFF(&quot;wdf__client_id&quot; = %s, &#39;current&#39;, &#39;others&#39;) AS client_type<br>            FROM TIGER.ECOMMERCE_DEMO_DIRECT.&quot;order_lines&quot;<br>            GROUP BY &quot;wdf__client_id&quot;, client_type<br>        ),<br>        RELEVANT_PEERS AS (<br>            SELECT DISTINCT others.client_id<br>            FROM PEER_STATS others CROSS JOIN PEER_STATS curr<br>            WHERE curr.client_type = &#39;current&#39;<br>              AND others.client_type = &#39;others&#39;<br>              AND others.total_orders BETWEEN curr.total_orders * 0.8 AND curr.total_orders * 2<br>        )<br>        SELECT * FROM RELEVANT_PEERS<br>            &quot;&quot;&quot;,<br>            (tenant,),<br>        )<br><br>        records = cursor.fetchall()<br>        return [row[0] for row in records]</pre><h3>Benchmarking data computation</h3><p>Once we have the peers ready, we can query the global GoodData workspace for the benchmarking data.
We can take advantage of the fact that the original execution definition is passed to the FlexConnect function when it is invoked.</p><p>This allows us to keep any filters applied to the report: without this, the benchmarking data would be filtered differently, rendering it meaningless. The relevant part of the code looks like this:</p><pre>import os<br><br>import pyarrow<br>from gooddata_flexconnect import ReportExecutionRequest<br>from gooddata_pandas import GoodPandas<br>from gooddata_sdk import (<br>    Attribute,<br>    ExecutionDefinition,<br>    ObjId,<br>    PositiveAttributeFilter,<br>    SimpleMetric,<br>    TableDimension,<br>)<br><br>GLOBAL_WS = &quot;gdc_demo_...&quot;<br><br>def _get_benchmark_data(<br>    self, peers: list[str], report_execution_request: ReportExecutionRequest<br>) -&gt; pyarrow.Table:<br>    # GoodPandas = GoodData -&gt; Pandas<br>    pandas = GoodPandas(os.getenv(&quot;GOODDATA_HOST&quot;), os.getenv(&quot;GOODDATA_TOKEN&quot;))<br><br>    (frame, metadata) = pandas.data_frames(GLOBAL_WS).for_exec_def(<br>        ExecutionDefinition(<br>            attributes=[Attribute(&quot;product_category&quot;, &quot;product_category&quot;)],<br>            metrics=[<br>                SimpleMetric(<br>                    &quot;return_unit_quantity&quot;,<br>                    ObjId(&quot;return_unit_quantity&quot;, &quot;fact&quot;),<br>                    &quot;avg&quot;,<br>                )<br>            ],<br>            filters=[<br>                *report_execution_request.filters,<br>                # add a filter limiting the scope to the peers<br>                PositiveAttributeFilter(ObjId(&quot;client_id&quot;, &quot;label&quot;), peers),<br>            ],<br>            dimensions=[<br>                TableDimension([&quot;product_category&quot;]),<br>                TableDimension([&quot;measureGroup&quot;]),<br>            ],<br>        )<br>    )<br><br>    frame = frame.reset_index()<br>    frame.columns = \
[&quot;wdf__product_category&quot;, &quot;mean_number_of_returns&quot;]<br><br>    return pyarrow.Table.from_pandas(frame, schema=self.Schema)</pre><h3>Changes to LDM</h3><p>Once the FlexConnect function is running somewhere reachable from GoodData (e.g., AWS Lambda), we can connect it as a data source.</p><p>To be able to connect the dataset from it to the rest of the logical data model, we need to make two changes to the existing model first:</p><ul><li>Promote <em>product category</em> to a standalone dataset</li><li>Apply the workspace data filter (WDF) that exists on <em>product category</em> to both the new and the benchmarking datasets</li></ul><p>Since our benchmarking function is sliceable by product category, we need to promote product category to a standalone dataset. This will allow it to act as a bridge between the benchmarking dataset and the rest of the data.</p><p>We need to apply the WDF that exists on the product category in the model to both the new and the benchmarking datasets. This ensures the benchmark will not leak product categories available to some of the peers but not to the current tenant. This also shows how seamlessly FlexConnect datasets fit into the rest of GoodData: we treat them the same way we would treat any other dataset.</p><p>Let’s have a look at the before and after screenshots of the relevant part of the logical data model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/534/0*ALdffQWYiLSqvPT_" /><figcaption>LDM before the changes</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oHhN30TYqpAdzWYx" /><figcaption>LDM after the changes</figcaption></figure><h3>In Action</h3><p>With these changes in place, we can finally use the benchmark in our analytics!
Below is an example of a simple table comparing the returns of a given tenant to its peers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BIyXYjFKuNPKy0uT" /><figcaption>Example benchmarking insight</figcaption></figure><p>In this particular insight, the tenant sees that their returns for Home Goods are a bit higher than those of their peers, so maybe there is something to be investigated there.</p><p>There is no data for some of the product categories, but that is to be expected: sometimes there are no relevant peers for a given category, so it is completely fine that the benchmark returns nothing for it.</p><h3>Summary</h3><p>Benchmarking is a deceptively complicated problem: we must balance the usefulness of the values against confidentiality requirements. This can be quite hard to implement with traditional data sources. We have outlined a solution based on FlexConnect that offers much greater flexibility both in the peer selection process and the aggregated data computation.</p><h3>Want to Learn More?</h3><p>If you want to learn more about GoodData FlexConnect, I highly recommend you read the aforementioned <a href="https://www.gooddata.com/blog/flexconnect-build-your-own-data-source/">architectural article</a>.</p><p>If you’d like to see more use cases for FlexConnect, check out our <a href="https://www.gooddata.com/blog/flexconnect-ml-in-bi-made-easy/">machine learning</a> or <a href="https://www.gooddata.com/blog/flexconnect-intagrate-nosql-into-bi/">NoSQL articles</a>.</p><hr><p><a href="https://medium.com/gooddata-developers/flexconnect-cross-tenant-benchmarking-0aacb51182b2">FlexConnect: Cross-tenant Benchmarking</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation
by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[FlexConnect: Integrate NoSQL Into BI]]></title>
            <link>https://medium.com/gooddata-developers/flexconnect-integrate-nosql-into-bi-cfe50a5d25ff?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/cfe50a5d25ff</guid>
            <category><![CDATA[nosql]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <category><![CDATA[mongodb]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Wed, 04 Dec 2024 10:35:13 GMT</pubDate>
            <atom:updated>2024-12-04T10:35:13.205Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HqoiAGt7-cDuBOMt" /></figure><p>Over the past several years, NoSQL databases (and document-oriented databases specifically) have seen significant adoption in fields like IoT and E-Commerce. They provide some interesting advantages over more traditional SQL-based databases, including <strong>flexibility, scalability, and ease of use for developers</strong>. However, while flexibility is a benefit, it can make it challenging to connect these databases to analytics platforms.</p><p>In this article, we show how FlexConnect, a new GoodData feature, can be used to connect a MongoDB collection to GoodData using a little Python code. We demonstrate this using the MongoDB <a href="https://www.mongodb.com/docs/atlas/sample-data/sample-mflix">Sample Mflix</a> demo data about movies — so that it is easy to follow. We also take advantage of the MongoDB client’s support for the Apache Arrow format to achieve a much simpler and more efficient integration.</p><p><em>This article is part of the FlexConnect Launch Series. To dive deeper into the overall concept, be sure to check out our </em><a href="https://medium.com/gooddata-developers/flexconnect-build-your-own-data-source-f4f0c8816bea"><em>architectural overview</em></a><em>.
Don’t miss our other articles on topics such as </em><a href="https://medium.com/gooddata-developers/flexconnect-integrate-any-apis-into-bi-b0b9b1e7a230"><em>API ingestion</em></a><em>, NoSQL integration, </em><a href="https://medium.com/gooddata-developers/flexconnect-bridging-kafka-and-bi-10729dfc6384"><em>Kafka connectivity</em></a><em>, </em><a href="https://medium.com/gooddata-developers/flexconnect-ml-in-bi-made-easy-b0067f9b2317"><em>Machine Learning</em></a><em>, and </em><a href="https://medium.com/gooddata-developers/flexconnect-integrating-unity-catalog-into-your-bi-df0d932a2c4e"><em>Unity Catalog</em></a><em>!</em></p><h3>Challenges</h3><p>Unlike most SQL-based databases, document databases can store data with complex structures, including nested objects, lists, and more. The schema of the documents is also not as rigid as in the SQL world. These factors make it quite challenging to fit the data into a <a href="https://en.wikipedia.org/wiki/Star_schema">Star-</a> or <a href="https://en.wikipedia.org/wiki/Snowflake_schema">Snowflake schema</a>, commonly used in BI to provide self-service capabilities to end users. Any solution that aims to connect those two worlds needs to be flexible enough to allow for customization of the mapping logic. This is essential to cope with the dynamic and loosely structured nature of the documents.</p><h3>What Is FlexConnect?</h3><p>FlexConnect is a new way of providing data to be used in GoodData. It enables connections to virtually any data source while integrating with the rest of GoodData, just like any natively supported data source type.</p><p>I like to think of it as “code as a data source,” because that is essentially what it does: it allows the use of arbitrary code to generate data and act as a data source in GoodData.
The contract it needs to implement is quite simple: based on the execution definition, return the relevant data as an Apache Arrow Table over the Arrow Flight RPC protocol (I highly recommend our detailed <a href="https://medium.com/gooddata-developers/arrow-flight-rpc-101-583517ad1539">primer</a> on Arrow Flight RPC). FlexConnect is then deployed anywhere you like — as long as GoodData can reach it and it acts as “just another data source.” The aforementioned architectural overview goes into much more detail, so I highly recommend reading it next. For the purpose of this article, we will focus on the code part of FlexConnect, glossing over the infrastructure side of things.</p><h3>FlexConnect for Movie Collection</h3><p>To showcase FlexConnect’s capabilities, we will use MongoDB, a popular choice among document-oriented databases, as a representative. We will use their <a href="https://www.mongodb.com/docs/atlas/sample-data/sample-mflix">Sample Mflix</a> dataset to create a simple FlexConnect implementation exposing some of the data to GoodData.</p><p>To get started, we clone the <a href="https://github.com/gooddata/gooddata-flexconnect-template">gooddata-flexconnect-template repository</a>. This sets up the necessary boilerplate for us. It is a production-ready setup based on our <a href="https://pypi.org/project/gooddata-flight-server/">gooddata-flight-server</a> package that handles the necessary infrastructure for exposing data using the Apache Arrow Flight RPC protocol, the basis of FlexConnect.</p><h4>Dependencies</h4><p>Once that is done, we can start implementing the MongoDB connector. MongoDB provides a Python client for easy interaction with the database. They also provide a plugin of sorts that adds some Apache Arrow-friendly APIs.
We will be using both, so we need to add those to our project:</p><ul><li><a href="https://pypi.org/project/pymongo/">pymongo</a> — the base client</li><li><a href="https://pypi.org/project/pymongoarrow/">pymongoarrow</a> — the Apache Arrow “plugin”</li></ul><h4>Schemas</h4><p>Next, let’s define the data schema we will expose to GoodData and the schema of the data we will pull from MongoDB. Thanks to pymongoarrow, we can define both at once!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xpRO9YoU1NInMRnZ" /><figcaption>Schema of the mapping from the JSON documents to the GoodData table</figcaption></figure><pre>import pyarrow<br>from pymongoarrow.api import Schema as MongoSchema<br><br># This is the schema the data returned by the MongoDB query will have.<br>DbSchema = MongoSchema(<br>    {<br>        &quot;title&quot;: pyarrow.string(),<br>        &quot;rated&quot;: pyarrow.string(),<br>        &quot;released&quot;: pyarrow.timestamp(&quot;ms&quot;),<br>        &quot;critic_rating&quot;: pyarrow.int64(),<br>        &quot;viewer_rating&quot;: pyarrow.int64(),<br>    }<br>)<br><br># We need to advertise the schema of the data we are going to send to GoodData.<br># This is part of the FlexConnect function contract.<br>Schema = DbSchema.to_arrow()</pre><p>As you can see in the code snippet, we define the schema of the results we want to get from MongoDB and then expose it as a pure Apache Arrow Schema object to GoodData. This ensures that we will not need to convert the results from MongoDB to GoodData-compatible data; we will just pass them along (more details to follow).</p><h4>Report execution code</h4><p>Now, we will focus on the part where FlexConnect handles the report execution requests. These correspond to a user opening a report and requesting data for it.
First, to make things leaner, we define a simple context manager that exposes the “movies” collection and ensures the connection is closed appropriately.</p><pre>import os<br>from collections.abc import Generator<br>from contextlib import contextmanager<br><br>from pymongo import MongoClient<br>from pymongo.synchronous.collection import Collection<br><br>CONNECTION_STRING = os.getenv(&quot;MONGO_CONN_STRING&quot;)<br><br>@staticmethod<br>@contextmanager<br>def _get_movie_collection() -&gt; Generator[Collection, None, None]:<br>    &quot;&quot;&quot;<br>    Get the MongoDB collection with movies and make sure it is closed after use.<br>    &quot;&quot;&quot;<br>    client = MongoClient(CONNECTION_STRING)<br>    try:<br>        db = client.get_database(&quot;sample_mflix&quot;)<br>        yield db.get_collection(&quot;movies&quot;)<br>    finally:<br>        client.close()</pre><p>Next, we move to the code for the report execution. We connect to the MongoDB database and query it for the necessary information.
We use the DbSchema defined earlier to make MongoDB return the data directly in the format we want to pass to GoodData (remember, we made those two schemas identical for this very reason).</p><pre>from typing import Optional<br><br>import gooddata_flight_server as gf<br>from gooddata_flexconnect import ExecutionContext, ExecutionType<br><br><br>def call(<br>    self,<br>    parameters: dict,<br>    columns: Optional[tuple[str, ...]],<br>    headers: dict[str, list[str]],<br>) -&gt; gf.ArrowData:<br>    execution_context = ExecutionContext.from_parameters(parameters)<br>    if execution_context.execution_type == ExecutionType.REPORT:<br>        with self._get_movie_collection() as collection:<br>            return collection.find_arrow_all(<br>                # We can pass the filters directly to the find_arrow_all method<br>                # to optimize the query and avoid unnecessary data transfer.<br>                query=self._report_filters_to_mongo_query(<br>                    execution_context.report_execution_request.filters,<br>                    execution_context.timestamp,<br>                ),<br>                projection={<br>                    # We can project fields as they are in the MongoDB collection<br>                    &quot;title&quot;: &quot;$title&quot;,<br>                    &quot;rated&quot;: &quot;$rated&quot;,<br>                    &quot;released&quot;: &quot;$released&quot;,<br>                    # We can project nested fields as well<br>                    &quot;critic_rating&quot;: &quot;$tomatoes.critic.meter&quot;,<br>                    &quot;viewer_rating&quot;: &quot;$tomatoes.viewer.meter&quot;,<br>                },<br>                schema=self.DbSchema,<br>            )<br>    elif execution_context.execution_type == ExecutionType.LABEL_ELEMENTS:<br>        ...
# will be discussed in the next snippet</pre><p>As you can see, we simply define the appropriate projections to get the data from the correct locations in the documents and return the data directly. pymongoarrow will convert the results from the database to an Apache Arrow Table of the desired shape under the hood.</p><p>You may have noticed that we also define the query to limit the number of results based on the filters from the execution request. While this is optional, it is a good practice as it can make things significantly faster than returning data from all the documents every time. This will be discussed later in a separate chapter. The important thing to note here, though, is that even if you do not filter the data, <strong>GoodData will still ensure all the execution filters are applied correctly</strong>, so filtering here is mainly a means of optimizing things.</p><h4>Label elements code</h4><p>The other type of execution we need to handle is requests for label elements: for example, all the different ratings across all of the movies. These are primarily used to populate the UI label filter pickers. Basically, we get a label’s ID, and we should return all of that label’s distinct values in our data.</p><pre>from typing import Optional<br><br>import gooddata_flight_server as gf<br>import pyarrow<br>from gooddata_flexconnect import ExecutionContext, ExecutionType<br><br><br>def call(<br>    self,<br>    parameters: dict,<br>    columns: Optional[tuple[str, ...]],<br>    headers: dict[str, list[str]],<br>) -&gt; gf.ArrowData:<br>    execution_context = ExecutionContext.from_parameters(parameters)<br>    if execution_context.execution_type == ExecutionType.REPORT:<br>        ...
# see the previous snippet for the implementation<br>    elif execution_context.execution_type == ExecutionType.LABEL_ELEMENTS:<br>        with self._get_movie_collection() as collection:<br>            # Get the label we want to get elements for.<br>            # No need for mapping here: the label has the same name as the field in the MongoDB collection.<br>            label = execution_context.label_elements_execution_request.label<br>            # We can use the distinct method to get unique values of a field.<br>            # There is unfortunately no Arrow-native way to do this, so we need to convert the result to a table.<br>            elems = collection.distinct(<br>                key=label,<br>                filter=self._elements_request_to_mongo_query(<br>                    execution_context.label_elements_execution_request<br>                ),<br>            )<br>            # Add None to the list of elements to represent the null value:<br>            # this will not be returned by the distinct method because it is a part of an index.<br>            elems.append(None)<br>            # We need to return a table with a single column with the label elements.<br>            # This needs to be a subset of the schema we advertise to GoodData.<br>            return pyarrow.Table.from_pydict(<br>                {label: elems},<br>                schema=pyarrow.schema({label: pyarrow.string()}),<br>            )</pre><p>This time, we cannot use a pymongoarrow API as there seems to be no counterpart to the distinct operation, so we need to convert the results to an Apache Arrow Table manually. Taking advantage of the fact that all the labels in our dataset are strings, we can construct the simple one-column schema easily.</p><p>As with the report execution, you can see we use a filter expression to limit the data transferred by reflecting the user filters on the returned elements (typically what they wrote in a search box).
This will be discussed later, but as in the report execution case, even if you skip this, GoodData will apply all the filters anyway.</p><h3>In action within GoodData</h3><p>Once the code is written and FlexConnect is deployed, we can connect it to GoodData and use it for analytics. Using either the UI or the API, we can <a href="https://www.gooddata.com/docs/cloud/connect-data/create-data-sources/flexconnect/">create a new data source</a> of the FLEXCONNECT data source type. Then, using the Logical Data Modeler, we can add our FlexConnect function dataset to the LDM.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/606/0*KlMAj5joQr4wBTg5" /><figcaption>View of the Logical Data Model</figcaption></figure><p>And that is it. Now, we can use the Analytical Designer to create some interesting reports. For example, we can investigate how much the viewer and critic ratings differ based on the release month among PG-13-rated movies in the ’90s.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2tQZDMkmlP0Uxn_o" /><figcaption>Sample report using the FlexConnect data from MongoDB</figcaption></figure><h3>Optimizing the data transfers</h3><p>This section covers the additional steps you can take to make your FlexConnect function more efficient. This is optional, but beneficial. We will cover how to limit the data transferred during the MongoDB communication by leveraging the filters specified in the execution definitions. Keep in mind that the optimizations do not have to be perfect: you can take an iterative approach to them, and GoodData will still ensure all the data is displayed correctly every time.</p><h4>Label elements</h4><p>For label elements, we can keep things simple by respecting the search query the user typed into the search box. This can significantly reduce the data transferred, especially for labels that have many different elements.</p><p>We will show you only some of the optimizations here.
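One piece we will not show in full is the report-filter mapping. To give a flavor of it, here is a rough, hypothetical sketch handling positive attribute filters only; the filter shape is a plain dict we invented for this illustration, not the gooddata_flexconnect API:

```python
def attribute_filters_to_mongo_query(filters: list[dict]) -> dict:
    """Map simplified positive attribute filters to a MongoDB query.

    Assumed (made-up) filter shape: {"label": <field name>, "values": [...]}.
    Each filter becomes a $in condition; MongoDB implicitly AND-s the fields.
    """
    query: dict = {}
    for f in filters:
        query[f["label"]] = {"$in": f["values"]}
    return query
```

A real mapper would also translate date filters into `$gte`/`$lt` range conditions on the relevant timestamp field.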
For complete code and examples of these optimizations, check out our <a href="https://github.com/Mara3l/article-demos/tree/main/flexconnect/mongodb">GitHub repository</a>. This will help you dive deeper into the implementation details and adapt the techniques to your use case.</p><pre>from gooddata_flexconnect import LabelElementsExecutionRequest<br><br><br>@staticmethod<br>def _elements_request_to_mongo_query(<br>    request: LabelElementsExecutionRequest,<br>) -&gt; dict[str, dict]:<br>    &quot;&quot;&quot;<br>    Convert GoodData label elements request to MongoDB query.<br>    &quot;&quot;&quot;<br>    query = {}<br>    if request.pattern_filter:<br>        query[request.label] = {<br>            &quot;$regex&quot;: request.pattern_filter,<br>            &quot;$options&quot;: &quot;i&quot;, # the search should be case-insensitive<br>        }<br>    return query</pre><h4>Report execution</h4><p>For report executions, the filter variability is a bit greater as there are quite a few filter types. However, we can focus only on the ones we expect to be used the most or those that can have the biggest impact.</p><p>The <a href="https://github.com/gooddata/gooddata-article-demos/blob/master/flexconnect/mongodb/src/flexconnect/mongo_flex_connect.py#L152">snippet in the GitHub repo</a> shows how to reflect attribute and date filters in the MongoDB queries for report executions. We use the standard MongoDB <a href="https://www.mongodb.com/docs/manual/reference/operator/query/">query operators</a> and fill them with the appropriate values.</p><h3>Do More With FlexConnect</h3><p>Bridging the gap between the semi-structured world of document databases and the structured world of BI tools is not a trivial task. In this article, we have shown that GoodData FlexConnect makes it straightforward with a bit of Python code. 
It offers great flexibility and is easy to get started with.</p><h3>Learn More</h3><p>FlexConnect is built for developers and companies looking to streamline the integration of diverse data sources into their BI workflows. It gives you the flexibility and control you need to get the job done with ease.</p><p>Explore detailed use cases like <a href="https://medium.com/gooddata-developers/flexconnect-integrate-any-apis-into-bi-b0b9b1e7a230">connecting APIs</a>, <a href="https://medium.com/gooddata-developers/flexconnect-ml-in-bi-made-easy-b0067f9b2317">running local machine learning models</a>, <a href="https://medium.com/gooddata-developers/flexconnect-integrate-nosql-to-bi-cfe50a5d25ff">handling semi-structured NoSQL data</a>, <a href="https://medium.com/@jkadlec/flexconnect-bridging-kafka-and-bi-10729dfc6384">streaming real-time data from Kafka</a>, or <a href="https://medium.com/gooddata-developers/flexconnect-integrating-unity-catalog-into-your-bi-df0d932a2c4e">integrating with Unity Catalog</a> — each with its own step-by-step guide.</p><p>Want the bigger picture? Check out our <a href="https://medium.com/gooddata-developers/flexconnect-build-your-own-data-source-f4f0c8816bea">architecture article</a> on FlexConnect, or connect with us through our <a href="https://gooddata.com/slack">Slack community</a> for support and discussion.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cfe50a5d25ff" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/flexconnect-integrate-nosql-into-bi-cfe50a5d25ff">FlexConnect: Integrate NoSQL Into BI</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[CSV Files in Analytics: Taming the Variability]]></title>
            <link>https://medium.com/gooddata-developers/csv-files-in-analytics-taming-the-variability-34ee7fa74754?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/34ee7fa74754</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[csv]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Tue, 16 Apr 2024 20:03:00 GMT</pubDate>
            <atom:updated>2024-06-05T09:59:13.020Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*B5SpyF1HLczosrES" /></figure><p>There are few formats as ubiquitous as CSV: most applications for which it makes even a smidge of sense to do so, support storing their output as CSV. Apart from its popularity, the format itself has quite a few additional advantages:</p><ul><li>It is <strong>human-readable</strong>, making it easy to create, read, and edit in any text editor, over the terminal, etc.</li><li>It is <strong>approachable</strong> even for non-technical users, they can make sense of the raw contents easily.</li><li>It can be <strong>versioned</strong> using Git and other version control systems.</li><li>It is relatively <strong>condensed</strong> compared to other text-based formats like XML or JSON.</li></ul><p>As with all things, CSV has its downsides, too:</p><ul><li>It is <strong>less efficient</strong> to store. For example, numbers take up space for each digit instead of being stored as a number.</li><li>Being <strong>row-based</strong>, it is quite hard to get data only for certain columns: the whole file needs to be read even if we care about the first two columns, for example.</li><li>There is <strong>no one universal CSV standard</strong>, it has several variants or dialects.</li></ul><p>When adding support for CSV files into <a href="https://medium.com/gooddata-developers/project-longbow-35f5a860bb9b">Longbow</a> — our framework for creating modular data services — it was the last point that was especially challenging. In this article, we describe the approach we took with it.</p><h3>What information do we need to extract</h3><p>Let’s discuss what aspects of the CSV files we need to concern ourselves with when ingesting them into Longbow for further use. 
For each file, we need to derive the following:</p><ol><li>The <strong>encoding</strong> used by the file (ASCII, UTF-8, etc.).</li><li>The <strong>dialect</strong> used by the file (delimiters, quotes, etc.).</li><li>The names of the <strong>columns</strong>.</li><li>The <strong>types</strong> of the data in the columns (integer, string, date, etc.).</li><li>A <strong>preview</strong> of the first several rows so that the user can verify the CSV was parsed correctly.</li></ol><p>We will explore the steps we took for each of these items in more detail in the rest of the article.</p><h3>The ingest process</h3><p>Before diving into the individual steps, let’s take a look at what the process of adding a new file looks like. First, the user uploads the CSV file they want to use to what we call a <strong>staging area</strong>. This is so we can run some analysis on the file using Longbow and show the results to the user. The user can review that the file is parsed correctly, and they can tweak some of the settings. Then, if they are satisfied with the results, they can proceed with confirming the file import. Once they do that, the file is moved from the staging area to the <strong>production area</strong>, and it is then ready for use.</p><h3>Storing the metadata</h3><p>CSV has no dedicated way of storing any kind of metadata in the file itself (apart from somehow including it before the actual data), and we also want to support read-only input files. We therefore had to devise a mechanism for storing the metadata detected in the steps described below, and we ended up with dedicated manifest files. The manifests are located right next to the relevant CSV files and have the same name with the .manifest suffix. They contain JSON-serialized versions of all the configurations we have collected both from the analysis and the user. 
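</p><p>To make this concrete, here is a rough sketch of what such a manifest could contain. All the field names are purely illustrative (this is not Longbow’s actual manifest schema); the real manifests serialize the detected Arrow CSV options and date formats.</p>

```python
import json

# Hypothetical manifest for an "orders.csv" file; every field name here is
# made up for illustration and is not Longbow's actual schema.
manifest = {
    "encoding": "utf_8",
    "read_options": {"skip_rows": 1, "column_names": ["order_id", "created_at"]},
    "parse_options": {"delimiter": ";", "quote_char": "\""},
    "convert_options": {"column_types": {"order_id": "int64", "created_at": "timestamp[s]"}},
    "date_formats": {"created_at": "%Y-%m-%d %H:%M:%S"},
}

# The manifest would live next to the CSV file, e.g. as "orders.csv.manifest".
serialized = json.dumps(manifest, indent=2)
print(serialized)
```

<p>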
Every time a particular CSV file is requested, we first check the manifest and use the configuration stored there to read the actual CSV file.</p><p>The configuration itself consists of options accepted by the <a href="https://arrow.apache.org/docs/python/csv.html">Arrow CSV module</a> (ReadOptions, ParseOptions, and ConvertOptions) that are used as-is when reading the CSV file. We also store information about date formats for any columns that should be interpreted as dates (more on that later).</p><h3>Detecting the encoding</h3><p>The very first step when reading an unknown CSV file (or any text file, for that matter) is to determine the encoding used by the file. This is to avoid any surprises with non-UTF-8 files being interpreted the wrong way. We use the <a href="https://charset-normalizer.readthedocs.io/en/latest/">charset_normalizer</a> package for this purpose. The detected encoding is then used in subsequent reads of the CSV file.</p><h3>Detecting the dialect and column names</h3><p>The next step is to detect the so-called dialect of the CSV file. The dialect describes some of the structural properties of the CSV:</p><ul><li>What is the <strong>separating character</strong> for the individual columns?</li><li>Are there any <strong>quotation marks</strong> used to escape the separators, and if so, how can they be escaped?</li></ul><p>We also need to detect the column names. Some CSV files store the column names in the first row; some do not store them at all, in which case we need to generate the names ourselves.</p><p>We use DuckDB’s <a href="https://duckdb.org/docs/data/csv/auto_detection">sniff_csv function</a> to gather all of this information. It gives us all the structural information about the file, like the delimiters, quotes, etc. It also detects the column headers if there are any, falling back on autogenerated column names. 
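</p><p>As a side note, Python’s standard library ships a much simpler sniffer, csv.Sniffer, which is handy for building an intuition for what dialect detection does. To be clear, this is just an illustration: Longbow uses the DuckDB sniffer, not the stdlib one.</p>

```python
import csv

# A small sample with quoted fields and a ';' separator.
sample = 'name;city\n"Doe, Jane";Prague\n"Roe, Anna";Madrid\n'

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)    # detects the structural properties
print(dialect.delimiter)           # ';'
print(dialect.quotechar)           # '"'
print(sniffer.has_header(sample))  # True: the first row looks like column names
```

<p>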
You can read more about the DuckDB CSV capabilities in their introductory <a href="https://duckdb.org/2023/10/27/csv-sniffer.html">blog post</a>. We also need to make sure that the file we feed into DuckDB is in <strong>UTF-8</strong>. Otherwise, it fails. We make use of the detected encoding and prepare a special copy of the input file just for DuckDB in case the original is not in UTF-8 (or ASCII).</p><pre>def _detect_dialect_and_header_and_column_names(<br>    sample_filename: str,<br>    encoding: str,<br>) -&gt; tuple[CsvDialect, int, list[str]]:<br>    needs_conversion = encoding not in [&quot;utf_8&quot;, &quot;ascii&quot;]<br>    if needs_conversion:<br>        duckdb_input_file = sample_filename + &quot;.utf_8.csv&quot;<br>        # ... convert the sample file to utf-8<br>    else:<br>        duckdb_input_file = sample_filename<br><br>    try:<br>        return _run_duckdb_detection(duckdb_input_file)<br>    finally:<br>        if needs_conversion:<br>            os.unlink(duckdb_input_file)<br><br>def _run_duckdb_detection(<br>    duckdb_input_file: str,<br>) -&gt; tuple[CsvDialect, int, list[str]]:<br>    # use only one thread, we will always run only one query at a time from this<br>    conn = duckdb.connect(&quot;:memory:&quot;, config={&quot;threads&quot;: 1})<br>    query = conn.execute(<br>        &quot;SELECT Delimiter, Quote, Escape, HasHeader, Columns FROM sniff_csv(?)&quot;,<br>        [duckdb_input_file],<br>    )<br>    query_result = query.fetchone()<br>    if not query_result:<br>        raise ValueError(&quot;Unable to detect file dialect.&quot;)<br><br>    (delimiter, quote, escape, has_header, columns) = query_result<br>    # the columns are returned as a list of dictionaries, we need to extract the names<br>    column_names = [col[&quot;name&quot;] for col in columns]<br><br>    # the detection may return \x00 as a delimiter, need to normalize to None<br>    dialect = CsvDialect(<br>        delimiter=delimiter if delimiter != 
&quot;\x00&quot; else None,<br>        quotechar=quote if quote != &quot;\x00&quot; else None,<br>        escapechar=escape if escape != &quot;\x00&quot; else None,<br>    )<br><br>    if has_header:<br>        # the header takes as many rows as there are newlines in the column names<br>        # plus one for the first line itself<br>        header_row_count = sum(col_name.count(&quot;\n&quot;) for col_name in column_names) + 1<br>    else:<br>        header_row_count = 0<br><br>    return dialect, header_row_count, column_names</pre><p><em>Changed on 23rd of May 2024: as of 0.10.3, duckdb now has better support for the </em><em>Columns output of the </em><em>sniff_csv function that allows us to simplify the column name parsing.</em></p><p>Before the sniff_csv was available, we used the <a href="https://clevercsv.readthedocs.io/en/latest/index.html">CleverCSV</a> library for this step. Still, the DuckDB variant performs better (we observed a ten-fold improvement in the overall time) and allowed us to simplify the code since it can detect the dialect and column names in one step.</p><h3>Detecting the data types</h3><p>Having a way to read the file with the schema in hand, we can proceed with determining the actual data type of each column. You might ask, “Why not use the types detected by DuckDB?” or “Why not use the automatic detection that Arrow CSV has?”. There are a few reasons, but the most significant one has to do with the various date formats we want to support. The DuckDB CSV sniffer only supports one date format per file, so if you use one date format in one column and another format in another column, it will not work. Arrow CSV does support different date formats per column, but the set of date formats it supports is limited. 
While it would work great with <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a>-compliant dates, for example, it would not recognize strings like:</p><ul><li><strong>Jan 22, 2023 01:02:03</strong></li><li><strong>01 22 23 01:02:03</strong></li><li><strong>20230122</strong></li></ul><p>as potentially being dates as well. This is not to say the Arrow detection is wrong (after all, the last example may very well be just an integer). We just need to support a wider set of formats.</p><p>You can <a href="https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions.timestamp_parsers">specify</a> which date formats you want Arrow to try, but in case of ambiguity, it will always assume that the first matching format is correct. We want our users to disambiguate the date format manually: only they know which format is the correct one.</p><p>Another limitation of the Arrow CSV approach is that you either get the most precise data type detection (but you need to read the whole file into memory, which obviously does not scale that well), or you can use the batch-based approach, where only <a href="https://arrow.apache.org/docs/python/csv.html#incremental-reading">the first batch of the file is used</a> for the data type detection, making it less precise.</p><p>We want the most precise detection while conserving memory. To that end, our pipeline is constructed a bit differently. First, we tell Arrow to read the file batch by batch and to treat each column as a string so that we avoid any automatic detection performed by Arrow. This is where the column names come in handy: we need them to reference the columns in the Arrow CSV options. Next, we pipe this source into a custom Acero pipeline that allows us to run the analysis extremely quickly on the entire file in a streaming fashion, keeping the memory footprint small.</p><h3>Acero streaming engine</h3><p>What is Acero, you might wonder. 
<a href="https://arrow.apache.org/docs/python/api/acero.html">Acero</a> is an experimental streaming engine for running queries on large data. In Acero, you specify the processing pipeline declaratively, using several building blocks like projections, filters, and aggregations. You can choose from a wide range of predefined compute functions and, crucially, you can also define your own custom functions (User-Defined Functions, or UDFs for short). The UDFs are fairly easy to write: you worry only about the transformations you want to perform, and Acero figures out the rest. What’s more, you can use several languages to do so: we use Python for the data type detection pipeline and Cython for the pipeline we use to read the CSV data using the detected types. If SQL is more up your alley, you can use <a href="https://substrait.io">Substrait</a> to <a href="https://arrow.apache.org/docs/python/api/substrait.html#api-substrait">generate</a> the Acero query plan from an SQL query.</p><h3>The type detection pipeline</h3><p>From a high-level perspective, our type detection pipeline is very simple: it has one source node reading the CSV file and one projection node running the UDF detection algorithm. Ideally, there would also be an aggregation node at the end that would aggregate the results of each projection batch. Unfortunately, Acero does not seem to support UDFs in the aggregation nodes yet, so we run the aggregation in pure Python.</p><p>The detection UDF is run in parallel for every column in isolation and works like this. 
For each batch of values in a column:</p><h4>We detect which values are null or empty.</h4><p>We use <a href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.match_substring_regex.html">regular expressions</a> to detect booleans, integers, and doubles.</p><pre>import pyarrow.compute as pc<br><br>is_boolean_vec = pc.match_substring_regex(<br>    array,<br>    # values taken from the defaults in pyarrow.csv.convert_options<br>    pattern=r&quot;^$|^(true|false|0|1)$&quot;,<br>    ignore_case=True,<br>    memory_pool=ctx.memory_pool,<br>)</pre><h4>We use regular expressions and the strptime function</h4><p>To detect possible date formats (based on a set of supported date formats).</p><h4>We return the following values</h4><ul><li>All the types the values in the batch conform to, <strong>ordered by specificity</strong> (e.g. an integer is more specific than a double).</li><li>All the <strong>date formats</strong> that can be used to parse all non-empty values in the batch as a valid date.</li><li>Whether any of the values in the batch is null or empty.</li><li>Whether all of the values in the batch are null or empty.</li></ul><p>We then aggregate the results for all the batches for each column so that we get the final result:</p><ul><li>The most specific type usable for the column.</li><li>All the date formats that can be used to parse all the non-empty values in all the batches.</li><li>A flag indicating whether the column is nullable: i.e., it contains at least one value that is null or empty.</li></ul><h3>Reading a preview</h3><p>To allow the user to make an informed decision about whether we “understood” the file properly and to allow them to pick the correct date format from those that we detected as suitable, we read a small sample of the data using the options we intend to use once the file is confirmed by the user. 
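</p><p>The date-format part of the detection can be illustrated with a stdlib-only sketch. The list of supported formats below is made up and deliberately tiny (the real set is much wider); the point is that several formats can survive detection for the same column, which is exactly the ambiguity the user resolves in the preview.</p>

```python
from datetime import datetime

# A made-up, deliberately short list of "supported" formats for illustration.
SUPPORTED_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d"]

def candidate_formats(values: list[str]) -> list[str]:
    """Return every supported format that parses all non-empty values."""
    candidates = []
    for fmt in SUPPORTED_FORMATS:
        try:
            for value in values:
                if value:  # empty/null values do not disqualify a format
                    datetime.strptime(value, fmt)
            candidates.append(fmt)
        except ValueError:
            continue
    return candidates

# Both ambiguous formats survive for this column; only the user can pick.
print(candidate_formats(["01/01/2024", "01/02/2024", ""]))
```

<p>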
We return this preview as a part of the response, along with all the options and configurations we detected.</p><p>You might wonder, “Why does the user need to pick a date format?” This is to handle situations where the date values are ambiguous. Imagine a file that only has these two values in a column: <strong>01/01/2024</strong> and <strong>01/02/2024</strong>. Do these correspond to January 1st and 2nd? Or are they January 1st and February 1st? Only the user knows which is the case, so in these (admittedly rare) cases, they need to pick the correct date format for us to use.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0fQO-F_SvHIekBIO" /></figure><h3>Using the CSV file as a source of data</h3><p>Once the user confirms the CSV file is correctly parsed, the file is moved to the production area of the file storage, and a manifest file with all the metadata is created. When there is a computation run that needs to access the CSV data, it uses the metadata in the manifest to prepare a <a href="https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html">RecordBatchReader</a> that runs another Acero pipeline with another UDF for reading the date columns using the correct date format. The UDF is a thin wrapper, written in Cython, around the <a href="https://arrow.apache.org/docs/python/generated/pyarrow.compute.strptime.html">strptime</a> function; it does not fail on empty values but does fail on invalid non-empty values. The default strptime either fails on empty values or returns null for anything it cannot parse, neither of which is what we want.</p><p>The resulting RecordBatchReader can then be consumed by the rest of Longbow, business as usual. There is a dedicated article coming about that particular part of Longbow, so stay tuned!</p><h3>Summary</h3><p>CSV files are one of the most used formats for storing structured data. 
Their relative looseness and simplicity make them easy to produce, but they are also quite challenging to read and parse automatically. We have outlined the way we do it for Longbow, leveraging the DuckDB CSV sniffing functionality and the Apache Arrow capabilities: its CSV module and the Acero streaming engine.</p><h3>Want to learn more?</h3><p>As always, I’m eager to hear what you think about the direction we are taking with our Platform. Feel free to reach out to us on the <a href="https://www.gooddata.com/slack/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=csv_in_analytics_04_2024&amp;utm_content=autor_dan">GoodData community Slack</a>.</p><p>Want to try it out for yourself? Consider using our <a href="https://www.gooddata.com/trial/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=csv_in_analytics_04_2024&amp;utm_content=autor_dan">free trial</a>. Want to learn what else we are cooking at GoodData? Join <a href="https://www.gooddata.com/platform/labs/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=csv_in_analytics_04_2024&amp;utm_content=autor_dan">GoodData Labs</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=34ee7fa74754" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/csv-files-in-analytics-taming-the-variability-34ee7fa74754">CSV Files in Analytics: Taming the Variability</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Flexible caching and storage with Flight RPC]]></title>
            <link>https://medium.com/gooddata-developers/flexible-caching-and-storage-with-flight-rpc-725ec57c37e2?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/725ec57c37e2</guid>
            <category><![CDATA[apache-arrow-flight]]></category>
            <category><![CDATA[cache]]></category>
            <category><![CDATA[apache-arrow]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[gooddata]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Mon, 25 Mar 2024 11:05:02 GMT</pubDate>
            <atom:updated>2024-03-28T08:06:46.514Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*chf6lQgG24BAUXRria-Omg.png" /><figcaption>This is part of a series about <a href="https://www.gooddata.com/platform/flexquery/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=flexible_caching_and_storage_flexquery_03_2027&amp;utm_content=autor_dan">FlexQuery</a> and the <a href="https://medium.com/gooddata-developers/project-longbow-35f5a860bb9b">Longbow engine</a> powering it.</figcaption></figure><p>Storing any kind of large data is tricky. So when we designed our Longbow engine, for our Analytics Stack, we did so with flexible caching and storage in mind.</p><p>Longbow is our framework based on <a href="https://arrow.apache.org/">Apache Arrow</a> and <a href="https://arrow.apache.org/docs/format/Flight.html">Arrow Flight RPC</a>, for creating modular, scalable, and extensible data services. It is essentially the heart of our Analytics Stack ( FlexQuery). For a more bird’s eye view of Longbow and to understand its place within FlexQuery, see the <a href="https://medium.com/gooddata-developers/project-longbow-35f5a860bb9b">architectural</a> and <a href="https://medium.com/gooddata-developers/building-a-modern-data-service-layer-with-apache-arrow-33ace768e3f1">introductory</a> articles in this series.</p><p>Now let’s have a look at the tiered storage service for the Flight RPC in Longbow, its trade-offs and our strategies to tackle them.</p><p>If you want to learn more about the Flight RPC I highly recommend the <a href="https://medium.com/gooddata-developers/arrow-flight-rpc-101-583517ad1539">Arrow Flight RPC 101</a> article written by Lubomir Slivka.</p><h3>The storage trade-offs</h3><p>To set the stage for the Longbow-specific part of the article, let us briefly describe the problems we tackled and some basic motivations. 
When storing data, we need to be mindful of several key aspects:</p><ul><li><strong>Capacity</strong></li><li><strong>Performance</strong></li><li><strong>Durability</strong></li><li><strong>Cost</strong></li></ul><p>Setting them up and balancing them when they go against one another is a matter of carefully defining the requirements of the particular application and can be quite situation-specific.</p><h3>Cost v. Performance</h3><p>The first balancing decision is how much we are willing to spend on increased performance. For example, when storing a piece of data being accessed by many users in some highly visible place in our product, we might favor better performance for a higher cost because the end-user experience is worth it. Conversely, when storing a piece of data that has not been accessed for a while, we might be OK with it living on slower (and cheaper) storage, even at the potential cost of it taking longer to be ready when accessed again. We can also be a bit smarter here and make it so that only the first access will be slower (more on that later).</p><h3>Capacity v. Performance</h3><p>Another balancing decision is where we draw the line on data sizes that are too big for the faster tiers. For example, we might want to put as many pieces of data into memory as possible: memory will likely be the best-performing storage available. At the same time, though, memory tends to be limited and expensive, especially if you work with large data sets or very many of them: you cannot fit everything into memory, and “clogging” the memory with one huge flight instead of using it for thousands of smaller ones would be unwise.</p><h3>Durability</h3><p>We also need to decide on how durable the storage needs to be. Some data changes rarely but is very expensive to compute; it might even be impossible to compute again (e.g. a user-provided CSV file): such data might be a good candidate for durable storage. 
Other types of data may change very often and/or be relatively cheap to compute: non-durable storage might be a better fit there. Also, there can be other non-technical circumstances: compliance or legal requirements might force you to avoid any kind of durable storage altogether.</p><h3>Isolation and multi-tenancy</h3><p>Yet another aspect to consider is whether we want to isolate some data from the rest or handle it differently, especially in a multi-tenant environment. You might want to give a particular tenant a higher storage capacity because they are on some advanced tier of your product. Or you might want to make sure some of the data is automatically removed after a longer period of time for some tenants. The storage solution should give you mechanisms to address these requirements. It is also a monetization lever — yes, even we have to eat something ;-)</p><h3>Storage in Longbow</h3><p>When designing and developing Longbow, we aimed to make it as universally usable as possible. As we have shown in the previous section, flexibility in the storage configuration is paramount to making a Longbow deployment efficient and cost-effective.</p><p>Since Longbow builds on the Flight RPC protocol (we go into much more detail in the <a href="https://medium.com/gooddata-developers/project-longbow-35f5a860bb9b">Project Longbow article</a>), it stores the individual pieces of data (a.k.a. flights) under flight paths. Flight paths are unique identifiers of the related flights. They can be structured to convey some semantic data by having several segments separated by a separator (in our case, a slash).</p><p><em>In this context, you can think of a flight as a file on a filesystem and a flight path as the path to it: a slash-separated string like </em><em>cache/raw/postgres/report123.</em></p><p>The flight paths can be used by the Flight RPC commands to reference the particular flights. 
Flight RPC, however, does not impose any constraints on how exactly the actual flight data should be stored: as long as the data is made available when requested by a Flight RPC command, it does not care where it comes from. We take advantage of this fact and make use of several types of storage to provide the data to Flight RPC.</p><p>The storage types in Longbow are divided into these categories:</p><ul><li><strong>Shard </strong>— local, ephemeral storage<br> — Memory<br> — Memory-mapped disk<br> — Disk</li><li><strong>External</strong>, persistent storage</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JCxsDa1rnngRZQoQnpnvrQ.png" /><figcaption><em>Visualization of the relationship between Longbow shards and different tiers of storage. The shard lingo is taken from database architecture.</em></figcaption></figure><p>As you can see, the lower in the storage hierarchy we go, the slower, larger, and cheaper the storage gets, and vice versa. Also, only the external durable storage survives Longbow shard restarts. To expose these layers and to make them configurable, we built an abstraction on top of them that we call storage classes.</p><h3>Storage classes</h3><p>A storage class encapsulates all the storage-related configuration for a subset of flights. 
It has several important properties:</p><ul><li>It can be applied only to some flights: you can have different storage classes for different types of flights.</li><li>The settings in the storage class can be tiered: you can mix and match different types of storage, and flights can be moved between them.</li></ul><p>Longbow uses the storage class definitions to decide where to physically store an incoming flight and how to manage it.</p><h3>Storage class settings</h3><p>Storage classes have one set of settings that defines the storage class as a whole, another set that defines the different so-called cache tiers, and another set that can govern the limits of several storage classes at once. Let’s focus on the storage-class-wide ones first.</p><h4>Flight path prefix</h4><p>Each storage class is applied to a subset of the flight paths. More specifically, each storage class defines a <strong>flight path prefix</strong>: only flights that share that flight path prefix are affected by that particular storage class. This allows us to handle different types of data differently: by constructing the flight paths systematically, we can resolve the trade-offs described earlier differently for data with different business meanings.</p><p>The storage classes can have flight path prefixes that are substrings of each other: in case multiple prefixes match the given flight path, the storage class with the longest matching prefix is used.</p><p><em>Going back to the filesystem path analogy, you can think of the path prefix as a path to the folder you want to address with the particular settings. Folders deeper in the tree can have their own configuration, overriding that of the parent folders.</em></p><h4>Durability</h4><p>Storage classes must specify the level of storage <strong>durability</strong> they guarantee. Currently, we support three levels of durability:</p><ul><li><em>none</em> — the storage class does not store the flights in any durable storage. 
Data will be lost if the Longbow node handling that particular flight runs out of resources and needs to evict some flights to make room for new ones, crashes, or is restarted.</li><li><em>weak</em> — the storage class will acknowledge the store operation immediately (before storing the data in durable storage). Data can be lost if the Longbow node crashes <em>during</em> the upload to the durable storage. However, the happy-path scenarios can have better performance.</li><li><em>strong</em> — the storage class will only acknowledge the store operation after the data has been fully written to the durable storage. Data will never be lost for acknowledged stores, but the performance may suffer waiting for the durable write to complete.</li></ul><p>These levels allow us to tune the performance and behavior of different data types. For data we cannot afford to lose (for example direct user uploads), we would choose strong durability; for data that can easily be recalculated in the rare case of a shard crash, weak might provide better performance overall.</p><p>Storage classes with any durability specified will store the incoming data in their ephemeral storage first (if it can fit) and then create a copy in the durable storage. Whenever some flight is evicted from the ephemeral storage, it can still be restored there from the durable storage if it is requested again.</p><h4>Durable Storage ID and Prefix</h4><p><strong>Durable storage ID and prefix</strong> tell the storage class which durable storage to use and whether or not to put the data there under some root path (so that you can use the same durable storage for multiple storage classes and keep the data organized). 
This also means different storage classes can use different durable storages; in the future, this could also enable “bring your own storage” types of use cases for your users.</p><h4>Flight Time-To-Live (TTL)</h4><p>With <strong>flight Time-To-Live (TTL)</strong>, storage classes can specify how long “their” flights will remain in the system. After that time passes, flights will be made unavailable.</p><p>In non-durable storage classes, the flights might become unavailable sooner: if resources are running out, the storage class can evict the flights to make space for newer ones.</p><p>In durable storage classes, the TTL is exactly how long the flights will remain available. If the flight is evicted from the ephemeral storage tier (more on that later), it will be restored from the durable storage on next access. After the TTL passes, the flight will also be deleted from the durable storage.</p><h4>Flight replicas</h4><p>Optionally, the <strong>flight replicas</strong> setting directs the storage class to create multiple copies of the flights across different Longbow nodes. This is not meant to be a resilience mechanism; rather, it can be used to improve performance by making the flights available on multiple nodes.</p><h3>Cache tier settings</h3><p>As described earlier, each Longbow shard has several layers of ephemeral storage resources and, if configured, also external durable storage (e.g. AWS S3 or network-attached storage (NAS)). To utilize each of these layers as efficiently as possible, storage class settings can configure the usage of each of these tiers separately. 
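</p><p>To make the durability, eviction, and TTL semantics above concrete, here is a minimal sketch of a durable-backed ephemeral store. All names are our own invention for illustration, not Longbow’s actual API:</p>

```python
import time

class FlightStore:
    """Toy model of a durable-backed ephemeral flight store."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.ephemeral = {}   # fast tier (memory / disk)
        self.durable = {}     # stands in for e.g. AWS S3
        self.stored_at = {}   # flight path -> store timestamp

    def put(self, path, data):
        # "strong" durability: keep a durable copy alongside the fast one
        self.stored_at[path] = time.time()
        self.ephemeral[path] = data
        self.durable[path] = data

    def evict(self, path):
        # eviction only drops the ephemeral copy; the durable copy survives
        self.ephemeral.pop(path, None)

    def get(self, path):
        if path not in self.stored_at:
            return None
        if time.time() - self.stored_at[path] > self.ttl:
            # TTL passed: the flight disappears from every tier
            self.ephemeral.pop(path, None)
            self.durable.pop(path, None)
            del self.stored_at[path]
            return None
        if path not in self.ephemeral and path in self.durable:
            # an evicted flight is restored from durable storage on access
            self.ephemeral[path] = self.durable[path]
        return self.ephemeral.get(path)
```

<p>A flight evicted from the ephemeral tier is transparently restored on the next access, while an expired one is removed from both tiers, mirroring the behavior described for durable storage classes.</p><p>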
There are also policies that automatically move data to a slower tier after it has not been accessed for some time, and vice versa: moving the data to a faster tier after it has been accessed repeatedly over a short period of time.</p><p>The cache tiers each have several configuration options to allow for all of these behaviors.</p><h4>Storage type</h4><p>Tiers must specify the <strong>type of storage</strong> they manage. There are several options:</p><ul><li><em>memory</em> — the data is stored in the memory of the Longbow shard.</li><li><em>disk_mapped</em> — the data is stored on the disk available to the Longbow shard, and memory mapping is used when accessing the data.</li><li><em>disk</em> — the data is stored on the disk available to the Longbow shard. This disk is wiped whenever the shard restarts.</li></ul><p>The storage types are shared by all the tiers of all the storage classes in effect (there is only one memory, after all), so once the storage is running out of space, flights from across the storage classes will be evicted.</p><p>For storage classes with no durability, eviction means deleting the flight forever.</p><p>For storage classes with durability, eviction deletes the flight copy from the ephemeral storage but keeps the copy in durable storage. This means that if the evicted flight is requested, it can be restored from the durable copy back into the more performant ephemeral storage.</p><h4>Max flight size and Upload Spill</h4><p>Each tier can specify the <strong>maximum size of flights</strong> it can accept. This is to prevent situations where one large flight being uploaded would lead to hundreds or thousands of smaller flights being evicted: it is usually preferable to serve a lot of users efficiently with smaller flights than only a handful with a few large ones. There is also a setting allowing flights larger than the limit either to be rejected straight away or to spill over to another storage tier (e.g. 
from memory to the disk instead).</p><p>Since the flight data is streamed during the upload, the final size is not always known ahead of time. The upload therefore starts writing the data to the first tier; only when it exceeds the limit is the already-uploaded data moved to the next tier, with the rest of the upload stream redirected there as well. For cases when the final size is known ahead of time, the client can provide it at the upload start using a specific Flight RPC header, to avoid the potentially wasteful spill process.</p><h4>Priority and Spill</h4><p>When a storage type is running out of resources, it might need to start <strong>evicting</strong> some of the flights to make space for newer ones. By default, eviction is driven by a least-recently-used (LRU) policy, but for situations where more granular control is needed, the storage tiers can specify a priority: flights from a tier with a higher priority will be evicted only after all the flights with lower priority are gone.</p><p>Related to this, there is also a setting that can cause the data to be moved to another storage type instead of being evicted from the ephemeral storage altogether (similarly to the <strong>Upload Spill</strong>).</p><h4>Move after and Promote after</h4><p>Tiers can also specify a time period after which the flights are moved to another (lower) tier. This is to prevent the “last-minute” evictions that would happen when the given storage tier is running out of resources.</p><p>Somewhat inverse to the Move after mechanism, tiers can proactively promote a flight to a higher tier after it has been accessed a defined number of times over a defined period of time. 
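</p><p>The promote-after rule can be sketched as a sliding-window access counter. This is a hypothetical illustration of the idea, not Longbow’s implementation:</p>

```python
import time
from collections import deque

class PromotionTracker:
    """Promote a flight once it sees `threshold` accesses within `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.accesses = {}  # flight path -> deque of access timestamps

    def record_access(self, path, now=None):
        now = time.time() if now is None else now
        q = self.accesses.setdefault(path, deque())
        q.append(now)
        # drop accesses that have fallen out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        # True means: move this flight to a faster tier
        return len(q) >= self.threshold
```

<p>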
This handles the situation where a lot of users start accessing the same “stale” flight at the same time: we expect that even more are coming, so to improve their access times, we move the flight to a faster tier.</p><p>Tuning these two settings allows us to strike a good balance between keeping the flights most likely to be accessed in the fastest cache tier and being able to ingest new flights as well.</p><h3>Limit policies</h3><p>To have even more control, the configuration also offers something we call <strong>Limit policies</strong>. These can affect how many resources in total a particular storage class takes up in the system. By default, without any limit policies, all the storage classes share all the resources available. Limit policies allow you to restrict some storage classes to storing data only up to a limit. You can limit either the amount of data in the non-durable storage alone or the total amount including the durable storage.</p><p><em>For example, you can configure the policies in a way that limits the cache for report computations to 1GB of data in the non-durable storage but leaves its capacity in the durable storage unlimited. Or you can also impose a limit on the durable storage (likely to limit costs).</em></p><p>There are several types of limit policies catering to different types of use cases. One limit policy can be applied to several storage classes, and more than one limit policy can govern a particular storage class.</p><h4>Standard limit policies</h4><p>These are the most basic policies: they set a limit on a storage class as a whole, i.e. on all the flight paths sharing the storage class’ flight path prefix. They are useful for setting a “hard” limit on storage classes that you fine-tune using other limit policy types.</p><h4>Segmented limit policies</h4><p>Segmented limit policies allow you to set limits on individual flight path subtrees governed by a storage class. 
<em>For example, if you have flight paths like cache/postgres and cache/snowflake and a storage class that covers the cache path prefix, you can set up a policy that limits the data from each database type to at most 1GB, and it would automatically cover even database types added in the future.</em></p><h4>Hierarchical limit policies</h4><p>Offering even more control than segmented limit policies, hierarchical limit policies allow you to model more complex use cases by limiting on more than one level of the flight paths.</p><p><em>For example, you have a storage class that manages flights with the cache prefix. On the application level, you make sure that the flight paths for this storage class always include some kind of tenant id as the second segment of the flight path, the identifier of a data source the tenant uses as the third, and finally the cache itself as the fourth: cache/tenant1/dataSource1/cacheId1. Hierarchical limit policies allow you to do things like limiting tenant1 to 10GB in total while also allowing them to put 6GB towards dataSource1 (because they know it produces bigger data), all while keeping all the other tenants at 5GB by default.</em></p><h3>Summary</h3><p>As we have shown, storing any kind of large data certainly isn’t straightforward. It has many facets that need to be carefully considered. Any kind of data storage system needs to be flexible enough to allow the users to tune it according to their needs.</p><p>Longbow comes equipped with a meticulously designed tiered storage system exposed via Flight RPC that enables the users to set it up to cater to whatever use case they might have. 
It takes advantage of several different storage types and plays to their strengths, whether that is size, speed, durability, or cost.</p><p>In effect, Longbow’s cache system provides virtually unlimited storage size thanks to the very cheap durable storage solutions while offering much better performance for the subset of data that is being actively used.</p><h3>Want to learn more?</h3><p>As we mentioned in the introduction, this is part of a series of articles where we take you on a journey of how we built our new analytics stack on top of Apache Arrow and what we learned about it in the process.</p><p>Other parts of the series are about the <a href="https://medium.com/gooddata-developers/building-a-modern-data-service-layer-with-apache-arrow-33ace768e3f1">Building of the Modern Data Service Layer</a>, <a href="https://medium.com/gooddata-developers/project-longbow-35f5a860bb9b">Project Longbow</a>, details about the <a href="https://medium.com/gooddata-developers/arrow-flight-rpc-101-583517ad1539">Flight RPC</a>, and last but not least, how well the <a href="https://medium.com/gooddata-developers/duckdb-meets-apache-arrow-169e917a2d8d">DuckDB quacks with Apache Arrow</a>!</p><p>As you can see in the article, we are opening our platform to an external audience. We not only use (and contribute to) state-of-the-art open-source projects, but we also want to allow external developers to deploy their services into our platform. Ultimately, we are thinking about open-sourcing Longbow. Would you be interested in it? 
Let us know; your opinion matters!</p><p>If you’d like to discuss our analytics stack (or anything else), feel free to join our <a href="https://www.gooddata.com/slack/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=flexible_caching_and_storage_flexquery_03_2027&amp;utm_content=autor_dan">Slack community</a>!</p><p>Want to see how well it all works in practice? Try the <a href="https://www.gooddata.com/trial/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=flexible_caching_and_storage_flexquery_03_2027&amp;utm_content=autor_dan">GoodData free trial</a>! Or if you’d like to try our new experimental features enabled by this new approach (AI, Machine Learning, and much more), feel free to sign up for our <a href="https://www.gooddata.com/platform/labs/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=flexible_caching_and_storage_flexquery_03_2027&amp;utm_content=autor_dan">Labs Environment</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=725ec57c37e2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/flexible-caching-and-storage-with-flight-rpc-725ec57c37e2">Flexible caching and storage with Flight RPC</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Polyglot Apache Arrow: Java and Python Perspective]]></title>
            <link>https://medium.com/gooddata-developers/polyglot-apache-arrow-java-and-python-perspective-bf2ce020e27d?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/bf2ce020e27d</guid>
            <category><![CDATA[java]]></category>
            <category><![CDATA[gooddata]]></category>
            <category><![CDATA[apache-arrow]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Mon, 31 Jul 2023 10:37:45 GMT</pubDate>
            <atom:updated>2023-07-31T10:37:45.676Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IiLe1-JfTrw9o3M32Oeu8g.png" /></figure><p>Apache Arrow, a versatile, analytics-focused in-memory data format, offers the flexibility to work with data seamlessly across multiple programming languages, such as Java and Python. While utilizing Apache Arrow in more than one language may initially seem straightforward — write a feature in one language and ‘translate’ it to another — there are certain nuances and challenges that may arise. In this article, we aim to shed light on some of these challenges we encountered at <a href="https://www.gooddata.com/?utm_medium=blogpost&amp;utm_source=medium.com&amp;utm_campaign=gooddata_apache_arrow_072023&amp;utm_content=autor_dan">GoodData</a>. We do this not to criticize the Arrow project, but rather to provide insights and solutions for fellow developers who may encounter similar obstacles.</p><h3>Issues encountered</h3><p>To keep things organized, let’s split the tricky parts into three categories:</p><ul><li>Missing APIs</li><li>Different API philosophies</li><li>API inconsistencies</li></ul><h3>Missing APIs</h3><h4>Missing cancel in FlightStreamWriter in Python</h4><p>The first difference we encountered has to do with how the client can cancel a put operation.</p><p>There are several reasons the client might want to stop writing halfway. For example, it reads from some underlying datasource batch by batch and sends each batch to the Flight server. It can then happen that the underlying datasource raises an exception. 
The flight client then wants to signal to the server: “I am canceling this put operation, disregard whatever I have sent you so far”.</p><p>In Java, the process is simple: clients call the <a href="https://arrow.apache.org/docs/java/reference/index.html">error</a> method of the ClientStreamListener returned by the <a href="https://arrow.apache.org/docs/java/reference/index.html">startPut</a> method, which will close the stream and signal the cancellation to the server (where <em>server_call_context.is_cancelled()</em> returns True). Consequently, the server can react accordingly.</p><p>In Python, though, the client cannot cancel the flight put operation in a way the server would be able to detect. It can raise an exception, but the server has no way of knowing which of these cases occurred:</p><ol><li>The client is done writing and just did not call done_writing; the data sent by the client should be persisted by the server.</li><li>The client wants to cancel the operation, and any data sent by the client should not be persisted by the server.</li></ol><p>Currently, there is no way of doing this in Python (we’ve opened an <a href="https://github.com/apache/arrow/issues/34431">issue</a> for it). Fortunately for us, we currently do not use the Python client in this context. However, should we start, or decide to make the client public, we would have to handle this somehow.</p><h3>Different API philosophies</h3><h4>Start_put: tuple in Python, listener as an argument in Java</h4><p>Since we are running the Arrow Flight RPC servers in a High-Availability (HA) environment, the logic behind putting a flight there is more involved. First, we need to contact the cluster with the flightPath we would like to put, and it responds with a list of locations of the nodes that can accept that particular path. 
Once we have the list, we try each node in the list and use the first node that successfully accepts the put to perform the operation.</p><p>To avoid duplicating this logic across the applications, we encapsulated it in our clients. First, we implemented this logic in Python.</p><pre>def _start_put(<br>    self,<br>    flight_path: str,<br>    schema: pyarrow.Schema,<br>    options: pyarrow.flight.FlightCallOptions,<br>) -&gt; tuple[pyarrow.flight.FlightStreamWriter, pyarrow.flight.FlightMetadataReader]:<br>    # this method handles the initiation of the do_put operation while also handling situations when the cluster<br>    # changes between the operations (the client that was ok before is now broken): in that case we need to try<br>    # again with another client. the tricky part is that we only know whether the client is ok or not when we<br>    # initiate the do_put operation: only then can we decide to query the shards again<br>    descriptor = pyarrow.flight.FlightDescriptor.for_path(flight_path)<br><br>    writer: Optional[pyarrow.flight.FlightStreamWriter] = None<br>    metadata_reader: Optional[pyarrow.flight.FlightMetadataReader] = None<br>    error: Optional[pyarrow.flight.FlightError] = None<br>    put_started = time.time()<br><br>    while (<br>        writer is None<br>        and time.time() - put_started &lt; self._put_establishing_timeout<br>    ):<br>        try:<br>            error_count = 0<br>            shard_count = 0<br>            for shard in self.shards_for_put(flight_path=flight_path):<br>                try:<br>                    shard_count += 1<br>                    writer, metadata_reader = shard.do_put(<br>                        descriptor=descriptor, schema=schema, options=options<br>                    )<br>                    break<br>                except pyarrow.flight.FlightUnavailableError as e:<br>                    error_count += 1<br>                    error = e<br>            # if all shards failed raise an error 
so that the placement is tried again after backoff, not immediately<br>            if shard_count &gt; 0 and shard_count == error_count:<br>                raise cast(pyarrow.flight.FlightError, error)<br>        except pyarrow.flight.FlightUnavailableError as e:<br>            error = e<br>            time.sleep(0.25)<br><br>    if writer is None or metadata_reader is None:<br>        # if the writer is not assigned, we must have hit an error<br>        assert error is not None<br>        raise error<br><br>    return writer, metadata_reader</pre><p>This is pretty straightforward, so when we wanted to replicate similar logic in Java, we approached it with a “just rewrite this to Java” attitude. As it turns out, this is not so simple, as there are two substantial differences between Java and Python in this area:</p><ol><li>The semantics differ for the writer.</li><li>The API shape is quite different in Java.</li></ol><p>The first difference is the state in which the writer is returned. In Python, when the writer is returned, you can start writing to it right away. In Java, however, you should first check the isReady flag (or, even better, subscribe to <a href="https://arrow.apache.org/docs/java/reference/index.html">setOnReadyHandler</a>) before writing to the stream. This complicates our node probing logic a bit.</p><pre>public StartPutResult startPut(FlightDescriptor descriptor,<br>                               VectorSchemaRoot root,<br>                               CallOption... 
options) throws FlightRuntimeException, InterruptedException {<br>    ClientStreamListener writer = null;<br>    CustomPutListener metadataListener = null;<br>    var startTime = System.currentTimeMillis();<br>    FlightRuntimeException error = null;<br>    var resolveLocationsCalled = false;<br>    while (writer == null &amp;&amp; System.currentTimeMillis() - startTime &lt; putEstablishingTimeout) {<br>        // first, try getting all the locations that can accept the flight<br>        List&lt;Location&gt; locations;<br>        try {<br>            var distributionNodes = getDistribution(descriptor, options);<br>            locations = nodesToLocations(distributionNodes);<br>        } catch (FlightRuntimeException e) {<br>            if (isRecoverable(e)) {<br>                if (isConnectionFailure(e) &amp;&amp; !resolveLocationsCalled) {<br>                    locationProvider.resolveLocations();<br>                    resolveLocationsCalled = true;<br>                }<br>                error = e;<br>                Thread.sleep(putEstablishingGracePeriod); // back off for a bit before retrying<br>                continue;<br>            } else {<br>                throw e;<br>            }<br>        }<br><br>        // then try each location in turn to find the first one that manages to establish the put operation<br>        for (var location : locations) {<br>            try {<br>                metadataListener = new CustomPutListener();<br>                writer = flightClientsAdapter.startPut(location, descriptor, root, metadataListener, options);<br>                Thread.sleep(putEstablishingGracePeriod); // give it some time to get ready, without it, the isReady would never be true<br>                if (writer.isReady()) {<br>   // we have the writer that is ready to accept flights: break out of the cycle<br>                    break;<br>                } else {<br>                    // the writer was not ready in time: clean up and try another<br> 
                   var toCleanUp = writer;<br>                    writer = null;<br>                    metadataListener = null;<br>                    error = null;<br>                    toCleanUp.error(new Exception(&quot;client not ready in time, cleaning up&quot;));<br>                    toCleanUp.getResult();<br>                }<br>            } catch (FlightRuntimeException e) {<br>                // we assume that FlightStatusCode.CANCELLED encountered here always means we hit<br>                // the &quot;client not ready in time&quot; and should therefore try again<br>                var isClientNotReadyInTimeError = e.status().code() == FlightStatusCode.CANCELLED;<br>                var shouldRetry = isClientNotReadyInTimeError || isRecoverable(e);<br>                if (shouldRetry) {<br>                    if (isConnectionFailure(e) &amp;&amp; !resolveLocationsCalled) {<br>                        locationProvider.resolveLocations();<br>                        resolveLocationsCalled = true;<br>                        break;<br>                    }<br>                    error = e;<br>                } else {<br>                    throw e;<br>                }<br>            }<br>        }<br>    }<br><br>    if (writer == null) {<br>        if (error == null) {<br>            throw new ArrowClientNoNodeForPutException();<br>        } else {<br>            throw error;<br>        }<br>    }<br><br>    return new StartPutResult(writer, metadataListener);<br>}</pre><p>As you can see, the logic is more elaborate than its Python counterpart (even though the ideas behind them are the same).</p><p>The second difference is that in Python you get both <a href="https://arrow.apache.org/docs/python/generated/pyarrow.flight.FlightClient.html#pyarrow.flight.FlightClient.do_put">the writer and the metadata_reader</a> for both directions of the bidi channel. 
In Java, however, you only get the writer and you need to provide your own listener implementation (defined by the <a href="https://github.com/apache/arrow/blob/c6dca7d3c9611013a4affbceccb0699d677447d0/java/flight/flight-core/src/main/java/org/apache/arrow/flight/FlightClient.java#L510">PutListener interface</a>). Due to the retry logic we have in place, we need to actually return this listener (and control its creation: we might need more than one, returning the one associated with the node that is used in the end).</p><p>To make providing your own PutListener easier, there are two implementations provided by the Arrow Java client that you can use: <a href="https://github.com/apache/arrow/blob/c6dca7d3c9611013a4affbceccb0699d677447d0/java/flight/flight-core/src/main/java/org/apache/arrow/flight/SyncPutListener.java">SyncPutListener</a> and <a href="https://github.com/apache/arrow/blob/c6dca7d3c9611013a4affbceccb0699d677447d0/java/flight/flight-core/src/main/java/org/apache/arrow/flight/AsyncPutListener.java">AsyncPutListener</a>.</p><p>Since we want to read from the server only after the whole writing process is complete, SyncPutListener seemed like the more fitting choice. We wanted to extend it with some domain-specific methods to make it easier to use in our context, but alas, the SyncPutListener is final and cannot be extended. We could make a new class and compose the SyncPutListener into it, but that seemed like quite a lot of boilerplate. Instead, we chose to extend the extensible AsyncPutListener. 
This class allows you to override only one method and build your logic around it.</p><pre>package com.gooddata.demo;<br><br>import org.apache.arrow.flight.AsyncPutListener;<br>import org.apache.arrow.flight.PutResult;<br><br>import java.nio.charset.StandardCharsets;<br><br>public final class CustomPutListener extends AsyncPutListener {<br>    private boolean hasWaitedForResult = false;<br>    private String syncToken = null;<br><br>    /**<br>     * Returns the syncToken that can be used to enforce read-after-write consistency.<br>     * Must be called after {@link CustomPutListener#waitForComplete}.<br>     */<br>    public String getSyncToken() {<br>        if (!hasWaitedForResult) {<br>            throw new IllegalStateException(&quot;You must call waitForComplete before trying to access the sync token.&quot;);<br>        }<br>        if (syncToken == null || syncToken.isBlank()) {<br>            throw new IllegalStateException(&quot;Server did not send any sync token back.&quot;);<br>        }<br>        return syncToken;<br>    }<br><br>    @Override<br>    public void onNext(PutResult val) {<br>        var metadata = val.getApplicationMetadata();<br>        var charBuffer = StandardCharsets.UTF_8.decode(metadata.nioBuffer());<br>        this.syncToken = charBuffer.toString();<br>    }<br><br>    /**<br>     * Call this method to ensure the writing of the server-sent metadata is finished.<br>     * This must be done before accessing any of the properties provided by this class.<br>     */<br>    public void waitForComplete() {<br>        getResult();<br>        hasWaitedForResult = true;<br>    }<br>}</pre><h4>Mandatory done_writing in Python</h4><p>An especially tricky aspect of the Python client code is the requirement to call the <a href="https://arrow.apache.org/docs/python/generated/pyarrow.flight.FlightStreamWriter.html#pyarrow.flight.FlightStreamWriter.done_writing">done_writing</a> method once writing to the stream is complete. 
Otherwise, the server can get stuck waiting for the next batch that might never come. Below is a simple server code illustrating this:</p><pre>import pyarrow as pa<br>import pyarrow.flight as flight<br><br><br>class Server(flight.FlightServerBase):<br>    def __init__(self, location):<br>        super(Server, self).__init__(location=location)<br><br>    def do_put(self, context, descriptor, reader, writer):<br>        for batch in reader:<br>            print(f&quot;got batch {str(batch.data)}&quot;)<br>        # or even just reader.read_all()<br><br>        print(&quot;sending metadata back&quot;)<br>        writer.write(pa.py_buffer((42).to_bytes(8, byteorder=&quot;big&quot;)))<br><br><br>def main():<br>    server = Server(&quot;grpc://localhost:16661&quot;)<br>    server.serve()<br><br><br>if __name__ == &quot;__main__&quot;:<br>    main()</pre><p>This server works correctly with well-behaved clients. When faced with a not-so-well-behaved one like the one in the following code listing, it will get stuck:</p><pre>import pandas as pd<br>import pyarrow as pa<br>import pyarrow.flight as flight<br><br><br>def main():<br>    client = flight.connect(&quot;grpc://localhost:16661&quot;)<br>    df = pd.DataFrame(<br>        {<br>            &quot;n_legs&quot;: [None, 4, 5, None],<br>            &quot;animals&quot;: [&quot;Flamingo&quot;, &quot;Horse&quot;, None, &quot;Centipede&quot;],<br>        }<br>    )<br>    table = pa.Table.from_pandas(df)<br><br>    writer, reader = client.do_put(<br>        flight.FlightDescriptor.for_path(b&quot;fun/times&quot;), table.schema<br>    )<br><br>    writer.write_table(table)<br><br>    print(&quot;done putting&quot;)<br>    # if this line is uncommented, everything starts working fine<br>    # writer.done_writing()<br><br>    meta = reader.read()<br>    print(int.from_bytes(meta))<br><br><br>if __name__ == &quot;__main__&quot;:<br>    main()</pre><p>Unless the <em>writer.done_writing()</em> is called, both the client and the server 
get stuck: the server waiting for batches from the client, the client waiting for a message from the server.</p><p>The <a href="https://arrow.apache.org/docs/python/generated/pyarrow.flight.FlightStreamWriter.html#pyarrow.flight.FlightStreamWriter.done_writing">documentation</a> does not mention the significance of the <em>done_writing</em> call. It is apparently <a href="https://github.com/apache/arrow/issues/34415">a known issue</a>, one that will hopefully get resolved with the introduction of <a href="https://github.com/apache/arrow/issues/34221">async APIs</a>.</p><p>In Java, this is not an issue: once the client is closed, the stream is closed as well and everything works as expected (although you should also call the completed method, as <a href="https://arrow.apache.org/cookbook/java/flight.html#put-data">mentioned</a> in the Arrow Java Cookbook).</p><h3>API inconsistencies</h3><h4>The mismatch between metadata types: bytes in Python, String in Java</h4><p>The most recently discovered tricky part had to do with custom flight metadata. Quite some time ago, we added a custom metadata field to our flights. This field contains a serialized protobuf command that was used to generate this flight (an opt-in feature useful for debugging).</p><p>In Python, we serialized this protobuf object to bytes using the standard protobuf serialization (using the <a href="https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message.SerializeToString">SerializeToString</a> method of the protobuf-generated object). We stored this serialized object in the flight metadata. 
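</p><p>A detail worth noting: serialized protobuf output is an arbitrary byte sequence that is generally not valid UTF-8. A toy reproduction with stand-in bytes (the real payload is a protobuf command):</p>

```python
# Stand-in for a protobuf message's SerializeToString() output:
# raw serialized bytes routinely contain sequences like these.
command_bytes = bytes([0x08, 0x96, 0x01, 0xC3, 0x28])

# Treated as opaque bytes, the value round-trips without issue...
metadata = {b"command": command_bytes}
assert metadata[b"command"] == command_bytes

# ...but it cannot be decoded as UTF-8 (0x96 is a lone continuation byte)
try:
    command_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert not valid_utf8
```

<p>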
This worked in Python without a problem: both writing and reading the metadata worked (we even had tests for both).</p><p>Some time later we tried to access this metadata field from Java and started getting exceptions:</p><blockquote>java.lang.IllegalArgumentException: Invalid UTF-8: Illegal leading byte in 2 bytes utf.</blockquote><p>After a bit of not-so-easy debugging (especially due to our error handling unintentionally swallowing the original exception, which we have since improved), we discovered that in Java, each of the metadata values must be a String, not a byte array (see the particular <a href="https://github.com/apache/arrow/blob/37975d52e7bf829f9e5a23a65d584523d7616a4d/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java#L100">line</a> that raises the exception).</p><p>To work around this limitation, we now store the command base64-encoded in the schema: this admittedly makes the schema larger, but it is currently the only way around this issue.</p><h3>Conclusion</h3><p>In conclusion, although working with Apache Arrow in multiple languages such as Java and Python offers great flexibility, there are certain challenges and differences that developers need to be aware of. This article has highlighted some of the challenges encountered during the implementation process.</p><p>Despite these challenges, it is important to emphasize that the Apache Arrow project itself is highly appreciated and enjoyable to work with. 
By highlighting the encountered challenges, this article aims to provide awareness and guidance for developers who may face similar difficulties.</p><p>As the Apache Arrow community continues to evolve and improve the project, we expect that some of these challenges will be addressed, leading to smoother cross-language integration and enhanced usability.</p><hr><p><a href="https://medium.com/gooddata-developers/polyglot-apache-arrow-java-and-python-perspective-bf2ce020e27d">Polyglot Apache Arrow: Java and Python Perspective</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using QR codes to facilitate testing config in React Native]]></title>
            <link>https://medium.com/vzp-engineering/using-qr-codes-to-facilitate-testing-config-in-react-native-7e87233a1af7?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/7e87233a1af7</guid>
            <category><![CDATA[user-experience]]></category>
            <category><![CDATA[react-native]]></category>
            <category><![CDATA[qr-code]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Fri, 24 Aug 2018 11:59:53 GMT</pubDate>
            <atom:updated>2018-08-24T11:59:53.846Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wabzVul8gPtcI5JZ88N3Kg.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/gpKe3hmIawg">Rima Kruciene</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure><p>While creating a proof of concept for a mobile app for one of our products in <a href="https://facebook.github.io/react-native/">React Native</a>, we came across an interesting problem to solve — how to allow our testers to switch the URLs of the API endpoints the application gets its data from. This is necessary because we need to test the app with different versions of our APIs, for several reasons:</p><ul><li>When our API changes, we need to test the mobile app against it before releasing it.</li><li>We can manipulate the data on the testing servers to reproduce bugs and edge cases.</li><li>For legal reasons, we cannot let the store reviewers access the production environment, only the testing ones.</li></ul><p>In this article, we describe the evolution of our solution to this problem.</p><h3>The requirements</h3><p>There are several requirements for the solution:</p><ol><li>The app should point to the production API by default.</li><li>The testing environments’ URLs should not be hardcoded in the app.</li><li>The configuration change should be easy to do even for non-developers.</li></ol><p>The first requirement stems from the notion of “correct by default”. In other words, if the user does nothing, the app works with the API we want it to. This rules out making different builds for different environments (besides being a build process nightmare).</p><p>We also do not want the testing API URLs hardcoded in the app, mainly because it would be impossible to add a new testing environment without making a new build of the app.</p><p>The ease of use is also necessary for two reasons. 
First, our testers switch environments often and we want to make the switch process as fast as possible. Second, if we want to make a test release among our employees (the majority of whom are non-technical), they need to be able to switch the environment without any advanced technical knowledge.</p><h3>The baseline solution</h3><p>The first approach we took to this problem was creating a special config file. This file would be placed on the device in a place where the app could detect and read it. This means the app’s files directory on Android and the Documents directory on iOS. When the app detects the file with the proper values, it makes a &quot;secret&quot; developer tools section available where the user can make the switch using a picker.</p><p>This was relatively easy to implement using the wonderful <a href="https://github.com/itinance/react-native-fs">react-native-fs</a> package:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1f90f4728dae9d42f34c8cfbd8cbd9d7/href">https://medium.com/media/1f90f4728dae9d42f34c8cfbd8cbd9d7/href</a></iframe><p>First, we check whether a config file is present and then try to parse it as JSON. A sample config file looks like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9bf464b5690c92c309b1cceef11e79b1/href">https://medium.com/media/9bf464b5690c92c309b1cceef11e79b1/href</a></iframe><p>Notice that the selectedEnv value corresponds to one of the keys of the baseUrls object. This allows us to craft config files with a particular environment already selected, so that the user doesn&#39;t have to make the switch manually. 
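As an illustration, parsing and validating such a config could look like the following sketch (the key names selectedEnv, baseUrls and developerTools follow the article; the URLs and the exact validation rules are made up):

```javascript
// Hypothetical sketch: parse the config file contents, falling back to a
// default production config when the file is malformed or inconsistent.
// Key names follow the article; the URLs are illustrative.
const PRODUCTION_CONFIG = {
  selectedEnv: "production",
  baseUrls: { production: "https://api.example.com" },
  developerTools: false,
};

function parseConfig(rawContents) {
  try {
    const config = JSON.parse(rawContents);
    // selectedEnv must correspond to one of the keys of the baseUrls object
    if (!config.baseUrls || !(config.selectedEnv in config.baseUrls)) {
      return PRODUCTION_CONFIG;
    }
    return config;
  } catch (err) {
    // unreadable or invalid JSON: stay "correct by default"
    return PRODUCTION_CONFIG;
  }
}
```

Falling back to the production config on any problem keeps the app “correct by default”.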
Moreover, by not setting developerTools to true, it enables us to switch the app&#39;s API endpoints without showing the other developer tools at all (making any further manual switch impossible)!</p><p>The configFile methods are called in our configuration handling logic like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3c53c26d338965a9a13d249efb0c9831/href">https://medium.com/media/3c53c26d338965a9a13d249efb0c9831/href</a></iframe><p>We try to access the file, and if it exists, we use its values to override the default settings. We also persist the last environment the user selected. This means they don’t have to set the environment every time they start the app.</p><p>This solution ticks the first two boxes. If there is no config file detected by the app, it defaults to production. Also, the testing environments’ URLs are fully specified in the config file. This means adding a testing environment is as easy as updating the config file.</p><p>The ease-of-use aspect, however, leaves much to be desired, as putting the file in the correct place on the device is not so simple. On Android, this means getting the file onto the device (be it by email or using a USB cable) and using the built-in Files app to copy it into a directory with seemingly cryptic names several layers deep. On iOS, the process means opening iTunes, connecting the device using a cable, and copying the config file using the <a href="https://support.apple.com/en-us/HT201301">File Sharing</a> functionality.</p><p>On both platforms, this is a lot of hassle and can take minutes to complete for an inexperienced user. After some brainstorming, an idea emerged — we’ll use <a href="https://en.wikipedia.org/wiki/QR_code">QR codes</a>!</p><h3>The user-friendly solution</h3><p>At first, the idea of adding support for reading QR codes to a React Native app seemed extremely hard. 
Fortunately, there is a great package already available — <a href="https://github.com/moaazsidat/react-native-qrcode-scanner">react-native-qrcode-scanner</a>, which is built on top of the incredible <a href="https://github.com/react-native-community/react-native-camera">react-native-camera</a>.</p><h3>Installing the packages</h3><p>To make things work, you first need to follow the <a href="https://github.com/react-native-community/react-native-camera#getting-started">react-native-camera installation guide</a>. For us, <a href="https://github.com/react-native-community/react-native-camera#mostly-automatic-install-with-cocoapods">the iOS CocoaPods installation</a> went well; however, we needed to take <a href="https://github.com/react-native-community/react-native-camera#android">the manual path</a> for Android to make it work.</p><p>After making sure your app compiles and runs with the react-native-camera package included, you can install the react-native-qrcode-scanner using npm or yarn. As it is implemented purely in JavaScript, there are no additional installation steps.</p><h3>Hiding the functionality</h3><p>For obvious reasons, we don’t want regular users to know about this functionality; it is meant only for instructed developers and testers. Still, we need to have this feature accessible in production builds. This means we need to put it in some “secret” place. 
There are several ways to achieve this, for example:</p><ol><li>Enabling the feature after a sequence of short and long taps on some element in the app (detecting this could be a nice exercise in <a href="https://en.wikipedia.org/wiki/Automata-based_programming">finite automata programming</a>)</li><li>Enabling the feature after a specific string is input in some unrelated field in the app</li><li>Enabling the feature after a shake gesture on a specific screen</li></ol><p>I’m not going to tell you the exact way we chose; this section was meant merely to point out that you need to think about this when implementing a similar feature.</p><h3>Using the scanner</h3><p>Having installed both packages and prepared the “secret” place, actually using the scanner is really easy.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2d48ad21180649bc2c2306ea058f9237/href">https://medium.com/media/2d48ad21180649bc2c2306ea058f9237/href</a></iframe><p>As we can see, the InconspicuousComponent displays its normal contents, and only when it receives the isDevModeEnabled prop set to true does it render the QRCodeScanner as well, passing it the handler received in props. In the handler, the QR contents are parsed from JSON and a basic validation of the contents is performed. If the validation succeeds, the config is persisted into the config file on the device using the persistConfigFile function introduced earlier and the user is notified of the results by an <a href="https://facebook.github.io/react-native/docs/alert">Alert box</a>.</p><h3>Creating the QR codes</h3><p>The final thing to do is to create the QR code itself. There are many online generators available; we used <a href="https://www.the-qrcode-generator.com/">the-qrcode-generator.com</a>. Simply paste your config file contents in and download the code. 
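One easy way to produce a compact version of the config file contents before pasting them into a generator is a JSON round-trip (a generic sketch, not tied to any particular tool):

```javascript
// Minify pretty-printed JSON by parsing and re-serializing it; this also
// fails fast on invalid JSON before it ends up inside a QR code.
function minifyJson(prettyJson) {
  return JSON.stringify(JSON.parse(prettyJson));
}
```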
The important thing here is to use the <strong>minified</strong> version of the JSON because we want to encode as few characters as possible. Again, there are many tools online to minify a JSON file, or your editor may be able to do it as well.</p><p>Below is a QR code corresponding to the sample config file mentioned earlier:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/1*PtJLfpMSFrTBsd6dMU8oVw.png" /><figcaption>Sample QR code</figcaption></figure><p>You can then distribute the code to your testers along with a description of how to access the “secret” area. And that&#39;s it!</p><h3>Conclusion</h3><p>In this article, we discussed how to allow testers and other instructed users to change the environment of your React Native app. First, we described the basic solution and then illustrated how the process can be made much more user-friendly using QR codes.</p><hr><p><a href="https://medium.com/vzp-engineering/using-qr-codes-to-facilitate-testing-config-in-react-native-7e87233a1af7">Using QR codes to facilitate testing config in React Native</a> was originally published in <a href="https://medium.com/vzp-engineering">VZP Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Setting up Storybook in TypeScript/Sass environment with Webpack 4]]></title>
            <link>https://medium.com/vzp-engineering/setting-up-storybook-in-typescript-sass-environment-with-webpack-4-5a9d4587b397?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/5a9d4587b397</guid>
            <category><![CDATA[react]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[storybook]]></category>
            <category><![CDATA[webpack]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Fri, 10 Aug 2018 13:01:01 GMT</pubDate>
            <atom:updated>2018-08-10T13:01:01.941Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RT-8TwT5byeb4F0uObcBSQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/Oaqk7qqNh_c">Patrick Tomasso</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure><p><a href="https://storybook.js.org/">Storybook</a> is a wonderful tool for developing UI components. It allows you to see your components in all their supported states by providing proper data to them. It supports quite a few libraries (Angular, Vue, React, and others). We have been using it for quite some time in our shared UI elements packages without issues. However, when we tried to add it to our Moje VZP Single Page App it was not exactly easy to make it work. In this article, we will describe the process of setting up the Storybook integration and the issues we encountered.</p><h3>Project stack</h3><p>The previous post on this blog <a href="https://medium.com/vzp-engineering/optimizing-bundle-sizes-of-react-single-page-application-in-typescript-1a0da9eea635">goes into great detail</a> about the Moje VZP project technology stack, so let’s just point out the relevant information here.</p><p>The app is written in <a href="https://reactjs.org/">React</a> using <a href="https://www.typescriptlang.org/">TypeScript</a>. The code is bundled using <a href="https://webpack.js.org/">Webpack 4</a> with TypeScript compiler <em>and</em> <a href="https://babeljs.io/">Babel</a>. The styles are written in <a href="https://sass-lang.com/">Sass</a> and are bundled by Webpack as well.</p><h3>Initial setup</h3><p>The recommended way to set Storybook up is by using their CLI utility — <a href="https://www.npmjs.com/package/@storybook/cli">@storybook/cli</a>. 
While this utility works well for “standard” projects, it has two issues that prevented us from using it:</p><ul><li>it uses <a href="https://www.npmjs.com/">npm</a> instead of the <a href="https://yarnpkg.com/">yarn</a> we use</li><li>it takes neither TypeScript nor Sass into account</li></ul><p>For those two reasons, we had to set Storybook up manually by loosely following the <a href="https://storybook.js.org/basics/guide-react/">slow start guide</a>.</p><p>The first step is to install Storybook itself by running yarn add -D @storybook/react @types/storybook__react. This adds the React version of Storybook and TypeScript type definitions for it (this allows us to write the stories in TypeScript, more on that later on).</p><p>Next, we add a script for starting Storybook to our package.json:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3ea96c3bbd85089a372899178b3c90b1/href">https://medium.com/media/3ea96c3bbd85089a372899178b3c90b1/href</a></iframe><p>Finally, we create the config file .storybook/config.js in our app&#39;s root:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/33c0a60212d66995ffed2bc375232e7b/href">https://medium.com/media/33c0a60212d66995ffed2bc375232e7b/href</a></iframe><p>The file tells Storybook that the story files are located inside the app folder and their names end in .stories.tsx. The reason for not using the default file is that we think it is clearer when the stories are right next to &quot;their&quot; components rather than in some arbitrary folder. So, for example, for the Currency component specified in Currency.tsx there is now a Currency.stories.tsx file with its stories right next to it.</p><p>Notice that the stories will be written in TypeScript. 
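For reference, a config.js along those lines might look like this (a sketch assuming the configure API of Storybook 3/4 and Webpack’s require.context; the app folder name follows the article):

```javascript
// .storybook/config.js (sketch)
import { configure } from "@storybook/react";

// pick up every *.stories.tsx file anywhere under the app folder
const req = require.context("../app", true, /\.stories\.tsx$/);

function loadStories() {
  req.keys().forEach(req);
}

configure(loadStories, module);
```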
In a project with components also written in TypeScript, this makes writing the stories much easier — the prop types are checked, so you are unlikely to pass nonsensical data to the component. This is especially useful with components that display complex data.</p><h3>Making TypeScript work</h3><p>Unfortunately, Storybook does not support TypeScript out of the box. There is a <a href="https://storybook.js.org/configurations/typescript-config">guide</a> for setting it up; however, we ran into some issues using it.</p><p>The guide states that we need to extend the Storybook Webpack config in order to use TypeScript files. However, there are two issues with this:</p><ul><li>the current stable version of Storybook (3) uses Webpack 3, whereas we use Webpack 4</li><li>the guide suggests using awesome-typescript-loader, whereas we use ts-loader</li></ul><p>These two facts together mean that if we wanted to use the ts-loader we already use for bundling our app (which is only compatible with Webpack 4), we would need to have two versions of ts-loader – one for Webpack 4 for bundling our app and one for Webpack 3 to use with Storybook. This is of course not desirable. Fortunately, there is an alpha version of Storybook 4 available that uses Webpack 4.</p><p>Therefore, the first step to make TypeScript work is to install the alpha version — yarn add -D @storybook/react@alpha. 
Next, we can create the .storybook/webpack.config.js:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b15fa43b9a416591a333b5260406d069/href">https://medium.com/media/b15fa43b9a416591a333b5260406d069/href</a></iframe><p>As for the tsconfig.test.json, it extends our &quot;normal&quot; tsconfig.json and overrides some settings needed by Storybook (and Jest as well, hence the name test):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c7a115428297be75b77c6c53d4ab1509/href">https://medium.com/media/c7a115428297be75b77c6c53d4ab1509/href</a></iframe><p>As we can see, it uses the <a href="https://www.typescriptlang.org/docs/handbook/tsconfig-json.html#configuration-inheritance-with-extends">extend functionality</a> to override our standard configuration. It changes the module system and tells TypeScript to compile JSX (our base config uses the esnext module system and does not compile JSX). This makes TypeScript output files that Storybook can use.</p><p>This is all it takes to make TypeScript work (tested with @storybook/react@4.0.0-alpha.16).</p><h3>Making styles work</h3><p>As stated earlier, the styles for our app are written in Sass. This again requires additional configuration in Storybook.</p><p>First, we need to make sure Storybook knows how to process .scss files by updating .storybook/webpack.config.js:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/65e5523a888306938c9bbabdd4d4e0fc/href">https://medium.com/media/65e5523a888306938c9bbabdd4d4e0fc/href</a></iframe><p>As we can see, the setup is similar to the TypeScript one — we add appropriate loaders and add the extension. Note that we did not add loader configuration for CSS as Storybook supports it by default and adding it here would cause errors. 
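To illustrate, such a Storybook webpack config might look roughly like the following sketch (the getSassLoaderConfig helper and tsconfig.test.json come from the article; the helper’s exact path, the merge semantics, and the loader options are assumptions):

```javascript
// Storybook webpack config (sketch; assumes Storybook 4 merges this with
// its own base config -- exact merge semantics depend on the version)
const getSassLoaderConfig = require("../webpack/getSassLoaderConfig"); // hypothetical path

module.exports = {
  module: {
    rules: [
      {
        // compile stories and components with the ts-loader we already use
        test: /\.tsx?$/,
        loader: "ts-loader",
        options: { configFile: "tsconfig.test.json" },
      },
      // Sass handled by the same helper our main Webpack config uses;
      // plain CSS is deliberately omitted -- Storybook handles it by default
      getSassLoaderConfig(),
    ],
  },
  resolve: {
    extensions: [".ts", ".tsx", ".js", ".scss"],
  },
};
```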
The getSassLoaderConfig function is used by our standard Webpack config as well and looks like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fbf508edfddf008b75629a322acd3b93/href">https://medium.com/media/fbf508edfddf008b75629a322acd3b93/href</a></iframe><p>The last thing to do is to include the styles in the stories. This can be done for all the stories at once by updating the .storybook/config.js with an import of the styles&#39; main file:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e9849b926ace76bbd058ee976e2499d5/href">https://medium.com/media/e9849b926ace76bbd058ee976e2499d5/href</a></iframe><h3>Writing components properly</h3><p>Having made Storybook work with our stack, the last thing is to write the stories themselves. There is a <a href="https://storybook.js.org/basics/guide-react/#write-your-stories">guide</a> on how to do it, so we will not go into detail here.</p><p>The important thing is that you must write your components in a way that makes using them in Storybook possible. For us this meant splitting components that use Redux into two, as mocking Redux would be impractical.</p><p>Let’s use HotNews as an example:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/97852be90d7aff6541aa08917b03148f/href">https://medium.com/media/97852be90d7aff6541aa08917b03148f/href</a></iframe><p>This component downloads some news data and displays it. This works well, but we cannot write stories for it because it is connected to Redux, which is not available in Storybook. 
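Conceptually, the remedy is to separate pure display logic from data access. A framework-free sketch with illustrative names (the real components are React components connected to Redux):

```javascript
// Presentational part: a pure function of its data, trivial to use in
// Storybook or tests -- just hand it the relevant data.
function renderHotNews(items) {
  return items.map((item) => `* ${item.title}`).join("\n");
}

// Container part: obtains the data (from Redux / the server in the real
// app, injected here for the sketch) and delegates display to the pure part.
async function hotNewsContainer(fetchNews) {
  const items = await fetchNews();
  return renderHotNews(items);
}
```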
The solution is to split HotNewsList into two components:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ba1e2f27bf6d9093a25028fb7144ee93/href">https://medium.com/media/ba1e2f27bf6d9093a25028fb7144ee93/href</a></iframe><p>The HotNews component no longer handles obtaining the data, just displaying it. Thanks to this change, it can now be used in Storybook; we just provide it with the relevant data. There is no need for the complication of mocking Redux.</p><p>The logic behind getting the data from the server and connecting to Redux was moved to HotNewsContainer, which uses HotNews internally and passes it the data from the server. Also, wherever we previously used HotNews in our app, we now must use HotNewsContainer instead.</p><h3>Summary</h3><p>In this article we’ve shown a way to make Storybook work in an environment with TypeScript, Webpack 4 and Sass. We also discussed how to write components in a way suitable for Storybook. Hopefully, this will inspire you to try Storybook yourself; it really is worth it!</p><hr><p><a href="https://medium.com/vzp-engineering/setting-up-storybook-in-typescript-sass-environment-with-webpack-4-5a9d4587b397">Setting up Storybook in TypeScript/Sass environment with Webpack 4</a> was originally published in <a href="https://medium.com/vzp-engineering">VZP Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimizing Bundle Sizes of React Single Page Application in TypeScript]]></title>
            <link>https://medium.com/vzp-engineering/optimizing-bundle-sizes-of-react-single-page-application-in-typescript-1a0da9eea635?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/1a0da9eea635</guid>
            <category><![CDATA[tree-shaking]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[webpack]]></category>
            <category><![CDATA[code-splitting]]></category>
            <category><![CDATA[typescript]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Fri, 27 Jul 2018 13:21:22 GMT</pubDate>
            <atom:updated>2018-07-30T05:49:51.409Z</atom:updated>
<content:encoded><![CDATA[<h3>Optimizing Bundle Sizes of a React Single Page Application in TypeScript</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5oDOe111QZYBmVcY1rJQ9w.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/DX5r6BNoWVE">Alexander Sinn</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure><p>If you are reading this, you have probably written some Single Page App. You probably also know that their size can get out of hand quickly. In this article we will discuss how we managed to reduce the size of our main business-to-client app using several techniques, including:</p><ul><li>image optimization</li><li>dependency optimization</li><li>tree shaking</li><li>code splitting</li></ul><h3>The application</h3><p>The application is called Moje VZP (My VZP) and its main purpose is to allow our clients to get all the information they need about their health insurance. It is a React Single Page App written in TypeScript. We use Webpack to create the application bundles and transpile the code using TypeScript and Babel (because of the <a href="https://www.npmjs.com/package/babel-preset-env">babel-preset-env</a> that provides polyfills for older browsers). The styles use SCSS and Bootstrap. 
As for the images, there are a few SVG icons and one small PNG banner.</p><h3>The old state</h3><p>Before we started the size optimization efforts, the app consisted of three bundles:</p><ul><li>vendors.js – all the third-party packages (React, lodash, etc.)</li><li>app.js – all the application code</li><li>commons.js – all the CSS and images</li></ul><p>Below is a screenshot of the <a href="https://www.npmjs.com/package/webpack-bundle-analyzer">webpack-bundle-analyzer</a> output (<a href="https://cdn.rawgit.com/no23reason/c49f1ed3b73431464dc0367c42258076/raw/395489442861ed51723de6f5eeb92f38a23d452d/before-optimization.html">interactive version</a>):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qIqDpTSdzqYpK0-jz7iqGQ.png" /><figcaption>Webpack bundles before optimization</figcaption></figure><p>As we can see, the biggest parts of the vendors.js bundle are our @vzp/* packages with their dependencies – <a href="https://www.npmjs.com/package/cleave.js">cleave.js</a> (formatted input) and <a href="https://www.npmjs.com/package/react-intl-tel-input">react-intl-tel-input</a> (specialized phone number input). These total 70.84 kB gzipped. Another big part is lodash (32.2 kB gzipped). We found ways to take these sizes down substantially (see <em>Tree Shaking</em> and <em>Lodash Optimization</em>).</p><p>The app.js bundle is relatively uninteresting; the only thing to note is that it contains the code of the entire app in a single bundle. This means every time the user accesses our app, they download the whole app even with parts they will not (or in some cases even <em>cannot</em>) use. We solved this as well (see <em>Code Splitting</em>).</p><p>Finally, the commons.js bundle contains mainly CSS and images. 
We haven&#39;t managed to make the CSS smaller yet (it uses Bootstrap for historical reasons, so we would need to rewrite it substantially); however, we managed to trim a few kilobytes off the images (see <em>Image Optimization</em>).</p><h3>Image Optimization</h3><p>Let’s start with the most straightforward way to make the app payload smaller — optimizing image sizes. This can be achieved really easily in Webpack thanks to the wonderful <a href="https://www.npmjs.com/package/image-webpack-loader">image-webpack-loader</a>. This loader uses the <a href="https://github.com/imagemin/imagemin">imagemin</a> library to optimize various image formats during the Webpack build. Configuring this is as easy as running yarn add -D image-webpack-loader and updating webpack.config.js:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f228da4638d20759bd99991d47b64261/href">https://medium.com/media/f228da4638d20759bd99991d47b64261/href</a></iframe><p>We disable the image optimization in development mode to make the build as fast as possible when working on the app.</p><p>With this change, the image part of the commons.js bundle went from 7.9 kB gzipped to 5.85 kB gzipped. Not a huge win in absolute numbers, but considering it is a more than 25% decrease for only a few lines of build script, this is a nice save.</p><h3>Tree Shaking</h3><p>The term <em>tree shaking</em> refers to the process of eliminating unused code by leaving out modules that are not imported anywhere. Without tree shaking, when you import a part of a package, the whole package is included as well. This can be seen in our @vzp/validatedform:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1EsBH0_rBza6vdRIoaXJug.png" /><figcaption>validatedform before tree shaking</figcaption></figure><p>Even though the whole application does not contain a single radio button or checkbox, both of those are included in the final bundle. 
This problem manifests itself in several other packages.</p><p>To take advantage of the tree shaking functionality of Webpack, the app must conform to a few restrictions <a href="https://webpack.js.org/guides/tree-shaking/#conclusion">as stated in the docs</a>:</p><blockquote>[1] Use ES2015 module syntax (i.e. import and export).</blockquote><blockquote>[2] Add a “sideEffects” property to your project’s package.json file.</blockquote><blockquote>[3] Include a minifier that supports dead code removal (e.g. the UglifyJSPlugin).</blockquote><p>So that is what we did. First, we had to revise the way our project is built. Previously we used TypeScript to transpile our code to ES5 and Babel to handle the polyfills. We changed that so that TypeScript does only the type checking and Babel handles the transpilation and polyfills. In order to do that we updated our tsconfig.json:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f0edf1467f6b3f495f13248db027f3cb/href">https://medium.com/media/f0edf1467f6b3f495f13248db027f3cb/href</a></iframe><p>As we can see, TypeScript now transpiles the code to ES6 using the ESNext module system. The other three settings are necessary to make things work. Without &quot;moduleResolution&quot;: &quot;node&quot;, the modules are resolved in a way incompatible with ESNext, and both esModuleInterop and allowSyntheticDefaultImports improve interoperability at runtime. 
For example, they allow you to write import _ from &quot;lodash&quot;, where previously you had to write import * as _ from &quot;lodash&quot;, which is not permitted in ESNext.</p><p>The next step was to update .babelrc by setting &quot;modules&quot;: false in babel-preset-env options:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/312951a7477741d350e48ae15dcff4ff/href">https://medium.com/media/312951a7477741d350e48ae15dcff4ff/href</a></iframe><p>Doing so tells Babel to leave the modules alone, allowing Webpack to access the imports and exports in the form it needs in order to tree shake. As a side effect, we observed a decrease in the overall bundle size just by setting this flag (even in projects without tree shaking). There is in fact an <a href="https://github.com/babel/babel-loader/issues/521">open issue</a> in the babel-loader repository to make this value the default.</p><p>These two changes make our app pass the first requirement — using the proper import/export syntax. To pass the second one, we needed to update our @vzp/* packages.</p><p>The update was twofold: first we updated their build process similarly to what we described above so that they are built as ESNext modules, then we added &quot;sideEffects&quot;: false to their package.json files. Before doing so, all the packages were checked to verify they do not contain any side effects, in order to avoid bugs. 
I am glad to say none of our packages contain side effects!</p><p>The third requirement was passed automatically, as Webpack 4 uses Uglify in its production mode by default.</p><p>Now the @vzp/validatedform looks much better:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PD2FTE2XwvPueXVtEWsQIQ.png" /><figcaption>validatedform after tree shaking</figcaption></figure><p>The Radio, Checkbox and other unused input types are not included, showing that the tree shaking works!</p><h3>Lodash optimization</h3><p>With tree shaking enabled, some of the third-party dependencies got smaller and more optimized. However, lodash remained the same. When researching the reason for this, we found out that the problem is in the way we import lodash functions:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/81a76ab000ed57036c7dcb59472b7087/href">https://medium.com/media/81a76ab000ed57036c7dcb59472b7087/href</a></iframe><p>In other words, you need to import specific functions from lodash for tree shaking to work. You could write the imports manually, which gets tedious fast; more importantly, missing just <em>one</em> bad import puts the whole of lodash back in your bundles.</p><p>Luckily, there is a better way — <a href="https://www.npmjs.com/package/babel-plugin-lodash">babel-plugin-lodash</a>. This plugin transforms the wrong imports into the correct ones automatically. All it takes is to yarn add -D babel-plugin-lodash and one update in .babelrc:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e73a29c9bb5c763f240baad946b75d7e/href">https://medium.com/media/e73a29c9bb5c763f240baad946b75d7e/href</a></iframe><p>Another step suggested on the plugin’s website is to use <a href="https://www.npmjs.com/package/lodash-webpack-plugin">lodash-webpack-plugin</a> as well to make lodash even smaller. 
This is achieved by removing some more advanced (and therefore rarely used) lodash features. It is straightforward to set up: yarn add -D lodash-webpack-plugin and update webpack.config.js:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/86e5c871437b54302f677dbc0f20bf26/href">https://medium.com/media/86e5c871437b54302f677dbc0f20bf26/href</a></iframe><p>While the setup was easy, this plugin can get a little tricky to use. The problems arise when you actually use some of the “advanced” features in your code. This fact can go undetected — the Webpack build succeeds, the bundles get smaller, and then your app behaves weirdly or crashes. The plugin allows you to opt in to groups of features by setting the appropriate options in webpack.config.js; however, it can be difficult to determine which feature sets you need. In our case, the only one was flattening, because we use _.flow in several places. Just know that using this plugin can break your app if you are not careful!</p><h3>Code Splitting</h3><p>Code Splitting refers to the process of dividing your app bundle into several smaller ones, each including a part of your app, and only loading them when the user needs them. This leads to faster startup times and saves bandwidth. Also, thanks to <a href="https://en.wikipedia.org/wiki/HTTP/2">HTTP/2</a>, multiple parallel requests are not an issue like they used to be.</p><p>The main challenge of effective code splitting is the question of where to make the split. The most basic form of code splitting is the so-called route-centric one. It follows the same logic by which “normal” webpages work — each route path has its own app code.</p><p>While route-centric splitting is better than nothing, there is an even more effective approach — component-centric splitting. 
The idea is that sometimes you want to be even more granular than just routes.</p><p>For example, our app has a whole section in the Reimbursement History part where clients can issue claims for their health care reimbursements (for example, when they feel their health care provider invoiced something they had not provided). We call this section ClaimForm. ClaimForm consists of a multi-step wizard with forms on every step. The problem is that most of our users do not use this feature (or at least not every time they use our app). Without code splitting, they download the code for ClaimForm even though they are not going to use it.</p><p>To facilitate component-centric code splitting in React apps, the wonderful <a href="https://www.npmjs.com/package/react-loadable">react-loadable</a> package was created. It allows developers to dynamically import parts of their apps in a nice declarative way. There is a really detailed guide on the project&#39;s home page so we will not go into all the details; let&#39;s just see it in action!</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/789fbb3c5ee789488c22585212331d2a/href">https://medium.com/media/789fbb3c5ee789488c22585212331d2a/href</a></iframe><p>As shown in the listing above, the only thing we need is to import the react-loadable package and use it to create a new component. This component is created by calling the Loadable function. Loadable takes a configuration object with several options. The most important ones are:</p><ul><li>loader – which component to make loadable</li><li>loading – which component to display while the loaded one is being loaded</li><li>render – how to render the loaded component (optional in JavaScript, mandatory in TypeScript)</li></ul><p>The first two options are pretty self-explanatory; render is tricky, though. 
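Since the embedded gist may not render here, a sketch based on react-loadable's documented API (the Spinner component and file paths are illustrative assumptions, not the actual project code):

```tsx
import * as React from "react";
import Loadable from "react-loadable";

import { Spinner } from "./Spinner"; // illustrative loading indicator

// The ClaimForm code is only downloaded when this component first renders.
const LoadableClaimForm = Loadable({
  loader: () => import(/* webpackChunkName: "claim-form" */ "./ClaimForm"),
  loading: Spinner,
  // render is needed because ClaimForm is a named export, not a default one
  render: (loaded, props) => <loaded.ClaimForm {...props} />,
});
```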
While according to the docs it is optional, you <em>must</em> specify it if your component is not the default export of its module (i.e. not exported using export default). Because we do not use default exports for components (some of the reasons are mentioned in the great <a href="https://basarat.gitbooks.io/typescript/docs/tips/defaultIsBad.html">TypeScript Deep Dive book</a>), we must specify the render method.</p><p>The resulting component then takes the same props as the original one.</p><p>An attentive reader may have noticed the /* webpackChunkName: &quot;claim-form&quot; */ comment in the dynamic import statement. This is used by Webpack to name the split chunk with a human-readable name. By default, these chunks are just numbered sequentially. To make this comment work, webpack.config.js must be updated:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6be200c17a92c5b1ba9660e479339050/href">https://medium.com/media/6be200c17a92c5b1ba9660e479339050/href</a></iframe><p>As for the build part, the code splitting uses the dynamic import syntax (that import() statement), which Babel does not support by default. Hence, we need to install a <a href="https://www.npmjs.com/package/babel-plugin-syntax-dynamic-import">plugin for that</a> (yarn add -D babel-plugin-syntax-dynamic-import) and enable it in .babelrc:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/50ce4c2f5eadd3d8281ad7c1e357c7f0/href">https://medium.com/media/50ce4c2f5eadd3d8281ad7c1e357c7f0/href</a></iframe><p>And that is all the setup needed! From now on, Webpack will generate separate chunks for our loadable components. What is even better, it will also detect code shared by the chunks and emit chunks with the shared code as well.</p><p>The result is best seen in a picture. 
After splitting on routes (Dashboard, InsuranceHistory and ReimbursementHistory) as well as two forms (ActivationForm and ClaimForm) and one UX element (PeopleSelector), our bundles look like this (<a href="https://cdn.rawgit.com/no23reason/2966a89afd60be277eed370a7fa8c848/raw/9f61e34f09db3e0d549b0d3841f5bc4e1577b00d/after-optimizations.html">interactive version</a>):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*avOx-26DNCQxdB96pdMoAw.png" /><figcaption>Webpack bundles after optimization</figcaption></figure><p>As you can see, there are several new chunks for the components themselves as well as the parts shared by two or more of them. Note that, for example, claim-form.js also includes react-intl-tel-input, as ClaimForm is the only component using it. This means that users who do not use the ClaimForm do not have to download this relatively big package anymore!</p><h3>Savings</h3><p>The main question is: was it worth it? The answer is (unsurprisingly) yes! Below is a comparison of before and after:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/bc99ab0c023bbcbc4fad2d5d10f71e26/href">https://medium.com/media/bc99ab0c023bbcbc4fad2d5d10f71e26/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/31e7816002561a8eecbeab4a1edd9aa8/href">https://medium.com/media/31e7816002561a8eecbeab4a1edd9aa8/href</a></iframe><p>The total savings are 47.35 kB gzipped, which is roughly a 15 % decrease. 
More importantly though, when users initially open our app (on the Dashboard route), they only download 159.47 kB gzipped, which is less than half of the old payload!</p><p>You can explore the bundle visualizations yourself — <a href="https://cdn.rawgit.com/no23reason/c49f1ed3b73431464dc0367c42258076/raw/395489442861ed51723de6f5eeb92f38a23d452d/before-optimization.html">before optimizations</a> and <a href="https://cdn.rawgit.com/no23reason/2966a89afd60be277eed370a7fa8c848/raw/9f61e34f09db3e0d549b0d3841f5bc4e1577b00d/after-optimizations.html">after optimizations</a> — to see what really changed.</p><h3>Further work</h3><p>The steps discussed here are not exhaustive; there is still room for improvement. The biggest opportunity to further slim down the bundles is to colocate the styles with the components (be it using CSS-in-JS or “normal” imports) instead of using one big CSS file. This would enable the code splitting part of Webpack to handle the styles just as effectively as the scripts.</p><h3>Summary</h3><p>In this article, we’ve shown some of the techniques you can use to make your Single Page Apps smaller and more effective. On a real project, we illustrated the impact these can have in practice.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1a0da9eea635" width="1" height="1" alt=""><hr><p><a href="https://medium.com/vzp-engineering/optimizing-bundle-sizes-of-react-single-page-application-in-typescript-1a0da9eea635">Optimizing Bundle Sizes of React Single Page Application in TypeScript</a> was originally published in <a href="https://medium.com/vzp-engineering">VZP Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Encouraging proper API response handling on code level]]></title>
            <link>https://medium.com/vzp-engineering/encouraging-proper-api-response-handling-on-code-level-5e4c00b13d0b?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/5e4c00b13d0b</guid>
            <category><![CDATA[typescript]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[functional-programming]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Fri, 13 Jul 2018 14:15:05 GMT</pubDate>
            <atom:updated>2018-07-13T14:15:05.857Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A9Z7yemZS9liCQcKunNl_w.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/naQdcC4nVgA">Victor Zambrano</a> on <a href="https://unsplash.com/">Unsplash</a></figcaption></figure><p>If you ever developed any application that communicates with a REST API, you probably know that handling the responses can be a bit tricky. It is easy to forget to handle, for example, error states. In this article, we’ll show an approach that we devised to make forgetting this much less probable. The solution described uses TypeScript, but the core concept should be usable in other languages as well.</p><h3>Previous solution</h3><p>Not long after we started developing React Single Page apps, we found out that there are several things to keep track of for every REST API resource:</p><ul><li>whether there is a request pending (in other words, if we are currently waiting for a server response)</li><li>whether there was an error the last time we tried to call the API endpoint</li><li>when the last change of this object happened (usually the time of the last response)</li><li>the contents of the last successful response</li></ul><p>For this, we created a small library called loadable (this was before the popular <a href="https://github.com/jamiebuilds/react-loadable">react-loadable</a> package was published, so the name clashes now). Its main purpose was to facilitate and unify representations of REST resources. The core object called Loadable can be seen below:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9cd81ae7b9b5f759ce4303c51d78c1ef/href">https://medium.com/media/9cd81ae7b9b5f759ce4303c51d78c1ef/href</a></iframe><p>As we can see, it has four properties — one for each of the things to track mentioned above. 
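Since the embedded gist may not render here, a reconstruction of the shape described (the property names are assumptions, not the exact gist):

```typescript
// Sketch of the Loadable<T> object: one property per tracked concern.
interface Loadable<T> {
  isLoading: boolean;   // is a request currently pending?
  error?: Error;        // error from the last attempt, if any
  lastChanged: number;  // timestamp of the last change
  data?: T;             // contents of the last successful response
}
```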
It uses a generic parameter for the data property to allow representation of any REST resource in a type-safe way.</p><p>There was also a function for creating the instances that handled the timestamp assignment:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/db0ae206da7ec612f7c480f0657eee79/href">https://medium.com/media/db0ae206da7ec612f7c480f0657eee79/href</a></iframe><p>There is nothing wrong with this representation itself; in fact, it is used as-is in the new approach (see below). The problem is, there is no standard way to consume this object that would enforce that the consumer handles all the possible states the object can be in. Most of the time, there was code like this when handling Loadable instances in React code:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/10dc286e830d1d573ef036c4319f59b7/href">https://medium.com/media/10dc286e830d1d573ef036c4319f59b7/href</a></iframe><p>This was mostly fine; however, there are two main problems with this:</p><ol><li>forgetting to handle one of the states</li><li>handling of state combinations gets really messy</li></ol><p>It is easy to forget to handle one of the states (for us it was the error most of the time). This leads to suboptimal UX, because when a REST request fails, the app seems to not work at all.</p><p>Also, what if we wanted to display the previously received data <em>as well as</em> the loading indicator (for example, when the user explicitly reloads the resource)? The logic gets really complex in those cases, and it is super easy to make an error in it.</p><p>The new approach aims to alleviate both of these issues.</p><h3>Current solution</h3><p>Recently, I finally read the brilliant <a href="https://mostly-adequate.gitbooks.io/mostly-adequate-guide/content/">Professor Frisby’s Mostly Adequate Guide to Functional Programming</a>. 
Despite its wacky title, it is a really useful explanation of various functional programming concepts. You should definitely check it out if you haven’t already. <a href="https://mostly-adequate.gitbooks.io/mostly-adequate-guide/content/ch08.html">Chapter 8</a> discusses various containers and how they can help you with, for example, error handling. This inspired me to revise our loadable package and attempt to use the approaches described to fix our problems.</p><p>It turns out, both of the problems mentioned above can be solved by a simple form of “pattern matching”. I put it in quotes because it is not <a href="https://en.wikipedia.org/wiki/Pattern_matching">pattern matching</a> per se, though it shares the basic idea.</p><p>Instead of the createLoadable function, there is now a loadable utility object that looks like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3a6c91bd6648c00b774973a59e4f96b5/href">https://medium.com/media/3a6c91bd6648c00b774973a59e4f96b5/href</a></iframe><p>The name of the instance-creating method (of) is inspired by the Guide, but it does the same thing as the createLoadable function used to. The interesting bit is the merge method:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/130a800c7277c01e7201e7f46b6dc2d0/href">https://medium.com/media/130a800c7277c01e7201e7f46b6dc2d0/href</a></iframe><p>As we can see, the merge method uses a configuration object of the LoadablePattern&lt;T&gt; type. This type consists of handlers for all the states that the Loadable instance can be in.</p><p>The important thing to notice is that some of the handlers are marked as required (more precisely, not marked as optional with the ?). 
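A self-contained sketch of the idea (simplified; names and priority order are illustrative, not the exact loadable package API):

```typescript
// Minimal Loadable shape for this sketch (the real one also tracks a timestamp).
type Loadable<T> = { isLoading: boolean; error?: Error; data?: T };

// Handlers for the basic states are required; refinements are optional.
interface LoadablePattern<T, R> {
  empty: () => R;                    // not requested yet
  loading: () => R;                  // request pending, nothing to show yet
  data: (data: T) => R;              // last request succeeded
  error: (error: Error) => R;        // last request failed
  loadingWithData?: (data: T) => R;  // reloading while old data is available
}

// Curried: declare the pattern once, apply it to many Loadable instances.
// The priority of the cases is fixed (opinionated), most specific first.
const merge =
  <T, R>(pattern: LoadablePattern<T, R>) =>
  (l: Loadable<T>): R => {
    if (l.isLoading && l.data !== undefined && pattern.loadingWithData) {
      return pattern.loadingWithData(l.data);
    }
    if (l.isLoading) return pattern.loading();
    if (l.error !== undefined) return pattern.error(l.error);
    if (l.data !== undefined) return pattern.data(l.data);
    return pattern.empty();
  };
```

Because the basic handlers are not optional, forgetting e.g. the error case becomes a compile-time error instead of a runtime surprise.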
This means that every caller of the merge method <em>has to</em> provide handlers for the basic states: loading in progress, data retrieved, error occurred, and the default empty state (when the resource has not been requested for the first time yet). Thus, we solve the first problem: when using the merge method, all the basic states are handled.</p><p>The second problem is solved by encapsulating the decision tree in one place. The caller can also specify more nuanced handlers, for example loadingWithData, which represents a situation when we already had some data but are loading it again – hence we show the current data as well as the loading indicator.</p><p>The matching is opinionated regarding the priority in which the cases are matched. This is necessary, as TypeScript does not have proper pattern matching (<a href="https://github.com/tc39/proposal-pattern-matching">yet</a>) that would allow the consumer to define the priority themselves using the order of the clauses. For our use cases this is fine, though, as it means we handle the different states consistently across the app.</p><p>Using these methods is then as easy as:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/04ebe41e2f8234288864557c9929dc76/href">https://medium.com/media/04ebe41e2f8234288864557c9929dc76/href</a></iframe><p>As we can see, it allows us to extend the initial example by providing the loadingWithData handler while retaining its previous functionality.</p><h3>Design decisions</h3><p>An observant reader might have two questions about this implementation:</p><ol><li>Why is the merge method curried?</li><li>Why is the merge method not implemented as an instance method of the Loadable type?</li></ol><p>The reason for the currying is illustrated in the usage example above. We can declare the matcher once, instead of doing it on every render. 
This keeps the render method as lightweight as possible.</p><p>As to why Loadable is not a class with merge as its instance method (and of as a static method), similarly to, for example, <a href="https://mostly-adequate.gitbooks.io/mostly-adequate-guide/content/appendix_b.html#either">Either</a>, the main reason is serializability.</p><p>Most of our TypeScript applications are React apps that use <a href="https://github.com/reduxjs/redux">Redux</a> as the store management solution (though I really want to look into <a href="https://github.com/mobxjs/mobx">MobX</a> soon). All of the Loadable instances we use are therefore stored in a Redux store. As you may know, it is really beneficial if all the data stored in the store is serializable. This unfortunately means we cannot use the &quot;proper&quot; class approach easily.</p><p>If we were to use loadable in some other environment, I would definitely go for the class approach, as it avoids the Loadable/loadable dichotomy.</p><h3>Summary</h3><p>In this article, we’ve shown how to enforce a certain practice at code level at compile time using functional approaches. Hopefully, this will inspire you to solve some of your code problems in a similar way.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5e4c00b13d0b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/vzp-engineering/encouraging-proper-api-response-handling-on-code-level-5e4c00b13d0b">Encouraging proper API response handling on code level</a> was originally published in <a href="https://medium.com/vzp-engineering">VZP Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Creating accessible progress indicator in React]]></title>
            <link>https://medium.com/vzp-engineering/creating-accessible-progress-indicator-in-react-1ae7eca76633?source=rss-1b1e4f053e40------2</link>
            <guid isPermaLink="false">https://medium.com/p/1ae7eca76633</guid>
            <category><![CDATA[accessibility]]></category>
            <category><![CDATA[react]]></category>
            <dc:creator><![CDATA[Dan Homola]]></dc:creator>
            <pubDate>Fri, 29 Jun 2018 12:01:02 GMT</pubDate>
            <atom:updated>2018-07-25T20:26:12.955Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NvGn7X6U2UGbySN1vYxx2w.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/photos/w0XNdcZxJwg">Matt Artz</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure><p>When creating our registration application, our UX designer came up with the following design (screenshot taken from our storybook):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/912/1*jM3DD0HebzJparQICpT_Bg.png" /></figure><p>It was relatively easy to implement using React, so we wrote it and forgot about it.</p><p>Several weeks later, we submitted the alpha version of the app to our accessibility testing consultant (as we do with all our apps). The results were that, among other minor accessibility problems, the most serious one was that the progress indicator (we call it thermometer) is completely inaccessible to screen readers. Not invisible, even worse — annoying, because when read using a screen reader it produced seemingly irrelevant words.</p><p>The accessibility problems can be split into these points:</p><ul><li>There was no indication this component represents a progress display</li><li>The circles were not numbered and labelled in any way</li><li>The lines connecting the circles did not provide information about how much of the way they were filled</li><li>There was no explanation of what A***** D*** means (this was explained in the rest of the app, so sighted users could easily read it; visually impaired users, however, could not)</li></ul><p>We will discuss how we fixed all these problems in the rest of this article.</p><h3>Declaring the semantic meaning</h3><p>First, let’s focus on how to declare that our component is a multi-step progress indicator. We followed our consultant’s recommendation: The whole component is an ordered list where the list items consist of the circle and optionally the connecting line. 
Below is the source code of the main component:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/984326c1781f00b9c7cab6c197cbd636/href">https://medium.com/media/984326c1781f00b9c7cab6c197cbd636/href</a></iframe><p>As we can see, it renders an ordered list with a descriptive <a href="https://www.w3.org/TR/wai-aria-1.1/#aria-label">aria-label attribute</a> that tells the user what it represents.</p><p>The list takes two props: items that specify the name, icon and other properties of the individual steps, and position – two numbers that indicate which step is currently active and what the percentage progress to the next one is.</p><h3>Rendering the circles</h3><p>Next, let’s see how the Segments (i.e. list items) are rendered. Below is the code for the Segment component:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0c01d52210ee4e85357da14c754364a8/href">https://medium.com/media/0c01d52210ee4e85357da14c754364a8/href</a></iframe><p>The Segment component renders one circle and, if it is not the last one, it renders the connecting line as well. The important thing here is that the li element is marked as aria-current when appropriate. This ensures that the screen reader reads the current step like</p><blockquote><em>Account, current step</em></blockquote><p>ensuring the user knows what the current step is.</p><p>Other than that, the component is relatively straightforward: it computes some derived props for the Point and Line components (see below) based on the current position on the thermometer.</p><p>The Point component represents the circle on the thermometer. It has a name, an icon and a status that indicates whether it is the current step, an already completed one, or one that is still to be visited. 
The whole code is listed below:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/35a783101c53be4261af18d6b54ca04a/href">https://medium.com/media/35a783101c53be4261af18d6b54ca04a/href</a></iframe><p>Point renders a div with some styling that ensures it is a circle of the right color and size depending on the status. The active step is larger than the inactive ones, for example.</p><p>Inside the div, there are two spans – one for screen readers only and one for visual display (hidden from screen readers using the <a href="https://www.w3.org/TR/wai-aria-1.1/#aria-hidden">aria-hidden attribute</a>). The reason for this duplication is that we needed to provide an additional description for the screen reader users. In our case, it was a note that the name of the person being registered is redacted and therefore is spelled A***** D*** instead of Arthur Dent.</p><p>You may wonder why we did not simplify this to something like</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f001a9b6e632a41221acd39d80d17513/href">https://medium.com/media/f001a9b6e632a41221acd39d80d17513/href</a></iframe><p>While this would work relatively well, the screen reader would read the two parts separately. In other words, the user would have to press a button to hear props.description as well. The way we wrote it, both strings get read at once.</p><h3>Rendering the connecting lines</h3><p>The last part we haven’t shown is the lines connecting the circles. This is the responsibility of the Line component:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f1242a27353ed597f559e4533b523215/href">https://medium.com/media/f1242a27353ed597f559e4533b523215/href</a></iframe><p>The visual part of its markup is simple: just a div containing two overlapping divs that emulate the partially filled progress bar. 
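Reconstructed from the description (class names and attribute values are illustrative, not the actual project markup), the rendered Line markup is roughly:

```html
<div role="progressbar"
     aria-label="Account"
     aria-valuemin="0"
     aria-valuemax="100"
     aria-valuenow="50">
  <div class="line-background"></div>
  <div class="line-fill" style="width: 50%;"></div>
</div>
```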
It&#39;s the ARIA part that&#39;s interesting!</p><p>As we can see, the main div has several ARIA attributes, so let&#39;s explain those one by one.</p><p>The role attribute declares what the component represents semantically – in this case a progressbar. There are quite a few <a href="https://www.w3.org/TR/wai-aria-1.1/#roles_categorization">roles available</a> in the WAI-ARIA standard.</p><p>We used aria-label here to give a name to the progress bar. In this case, the title of the related step is used. This ensures the user knows which step the progress bar corresponds to.</p><p>Finally, the aria-valuenow attribute specifies the current progress value, while aria-valuemin and aria-valuemax specify the lowest and highest possible values. These are used by the screen readers to compute the percentage progress of the progress bar. This means that something like</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/19d1036ee9f4d5d13b7d1c0a2a6567bf/href">https://medium.com/media/19d1036ee9f4d5d13b7d1c0a2a6567bf/href</a></iframe><p>would be read as</p><blockquote><em>50 percent</em></blockquote><h3>Conclusion</h3><p>Using a simple real-life example, we’ve shown how to make React components accessible to visually impaired users and hopefully inspired you to revise your own components with this perspective in mind. After all, accessible components are better components.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1ae7eca76633" width="1" height="1" alt=""><hr><p><a href="https://medium.com/vzp-engineering/creating-accessible-progress-indicator-in-react-1ae7eca76633">Creating accessible progress indicator in React</a> was originally published in <a href="https://medium.com/vzp-engineering">VZP Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>