This repository was archived by the owner on Apr 29, 2026. It is now read-only.

Haustle-v/Marina

Marina


✨ Overview

Marina is an enhanced version of PySeekDB that supports scalable feature engineering with Python UDFs, in a style similar to LanceDB's Geneva. It lets you register Python UDFs as generated columns and backfill them asynchronously. You can use either built-in Python functions or open-source models (e.g. BLIP, ViT) to enrich a dataset with new features.

Marina is designed to work with the OceanBase distributed database. It integrates with Ray, offloading UDF computation to the cluster and leveraging Ray's elastic resource scaling, low-latency task scheduling, and automatic fault recovery.

🚀 Architecture

(Architecture diagram)

📖 Demonstration

Step 1: Connect to the database

Open a session with db.connect using an OceanBase-compatible MySQL URI.

from pyseekdb.client import db

conn = db.connect("mysql://root:@127.0.0.1:2881/test?tenant=mysql")

Step 2: Create a table

Declare a PyArrow schema and load rows from a list of dicts. Sample images can be downloaded from the Oxford-IIIT Pet Dataset.

import pyarrow as pa

schema = pa.schema([
    pa.field("image_oss_url", pa.string()),
    pa.field("filename", pa.string()),
])

data = [
    {"image_oss_url": "url1", "filename": "Birman.jpg"},
    {"image_oss_url": "url2", "filename": "Havanese.jpg"},
    {"image_oss_url": "url3", "filename": "Persian.jpg"},
]

tbl = conn.create_table(name="pets", data=data, schema=schema)
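create_table expects each row dict to supply exactly the fields declared in the schema. A quick pure-Python sanity check (the field names below simply mirror the schema above; this helper is not part of Marina's API):

```python
# Field names copied from the pa.schema definition above.
EXPECTED_FIELDS = {"image_oss_url", "filename"}

def validate_rows(rows):
    """Return rows unchanged, raising if any row's keys mismatch the schema."""
    for i, row in enumerate(rows):
        if set(row) != EXPECTED_FIELDS:
            raise ValueError(
                f"row {i} has fields {sorted(row)}, expected {sorted(EXPECTED_FIELDS)}"
            )
    return rows

data = [
    {"image_oss_url": "url1", "filename": "Birman.jpg"},
    {"image_oss_url": "url2", "filename": "Havanese.jpg"},
    {"image_oss_url": "url3", "filename": "Persian.jpg"},
]
validate_rows(data)  # passes silently for well-formed rows
```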

Step 3: Define a Python UDF

The UDF takes an array of OSS URLs, fetches images, applies the pretrained BLIP-Caption model to the batch, and returns an array of descriptive captions.

import io
import numpy as np
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration



def generate_caption_udf_batch(url: np.ndarray) -> np.ndarray:
    # load_images_from_oss_urls is a helper (not shown here) that fetches
    # the raw image bytes for each OSS URL.
    images = load_images_from_oss_urls(url)

    # For demonstration the model is loaded on every call; a production UDF
    # would cache the processor and model between batches.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    pil_images = [Image.open(io.BytesIO(b)).convert("RGB") for b in images]

    with torch.no_grad():
        inputs = processor(images=pil_images, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}
        gen = model.generate(**inputs, max_length=50)
        captions = processor.batch_decode(gen, skip_special_tokens=True)

    # Return one caption per input URL as an object array.
    out = np.empty(len(captions), dtype=np.object_)
    out[:] = captions
    return out
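The UDF above calls load_images_from_oss_urls, which is not shown. A minimal sketch of such a helper (an assumption on our part, not the shipped implementation) that fetches each URL's raw bytes with urllib; a production version would add retries, concurrency, and OSS authentication:

```python
import urllib.request
from typing import Sequence


def load_images_from_oss_urls(urls: Sequence[str]) -> list[bytes]:
    """Fetch each URL and return the raw image bytes, preserving input order.

    Hypothetical stand-in for the helper used by generate_caption_udf_batch.
    Accepts any sequence of URL strings (including a NumPy string array).
    """
    images = []
    for url in urls:
        with urllib.request.urlopen(url, timeout=30) as resp:
            images.append(resp.read())
    return images
```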

Step 4: Add the UDF as a generated column

add_column stores the Python UDF as a database routine and records the resulting routine_id in the column metadata.

tbl.add_column(
    col_name="caption",
    data_type=pa.string(),
    udf=generate_caption_udf_batch,
    udf_name="generate_caption_udf",
    input_columns=["image_oss_url"],
)

Step 5: Backfill the generated column

backfill materializes the column by evaluating the UDF on the Ray cluster.

tbl.backfill(
    col_name="caption",
    num_gpus=2,
    num_batches=4,
)
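The num_batches argument suggests the table's rows are split into chunks that Ray workers process independently. One plausible partitioning scheme, shown purely as a sketch of the idea (Marina's actual scheduler may partition differently):

```python
def partition_rows(row_ids: list, num_batches: int) -> list[list]:
    """Split row ids into num_batches near-equal contiguous chunks.

    Illustrative only: each chunk would be handed to one Ray task, which
    reads the input columns, runs the UDF once on the batch, and writes
    the generated values back to the column.
    """
    base, extra = divmod(len(row_ids), num_batches)
    chunks, start = [], 0
    for i in range(num_batches):
        # The first `extra` chunks absorb one leftover row each.
        size = base + (1 if i < extra else 0)
        chunks.append(row_ids[start:start + size])
        start += size
    return chunks
```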

For a complete runnable example, refer to test_udf/test_geneva_compatibility.py.

💡 Requirements

- Ray: the OceanBase-compatible fork, which uses OceanBase as Ray's Global Control Service (GCS) for better fault tolerance.
- OceanBase: the storage engine (enterprise edition).

About

Marina is to OceanBase what Geneva is to LanceDB: it supports scalable feature engineering using Python UDFs.
