Reading data

Douglas Blank edited this page Nov 15, 2022 · 3 revisions

For the following code examples, it is assumed that Kangas has been imported as follows:

import kangas as kg

DataFrame

Kangas can read a Pandas DataFrame object directly.

import pandas as pd

df = pd.DataFrame(...)
dg = kg.read_dataframe(df)

Likewise, you can easily create a DataFrame from a DataGrid:

df = dg.to_dataframe()

HuggingFace's datasets

HuggingFace's datasets can be loaded into a DataGrid directly because they consist of rows of dictionaries, with images represented as PIL images. DataGrid automatically converts each PIL image into a Kangas Image.

from datasets import load_dataset

dataset = load_dataset("beans", split="train")
dg = kg.DataGrid(dataset)

Grouping on labels in the Kangas UI:

[Screenshot: HuggingFace Dataset]

Kangas can also read annotation data (such as bounding boxes) from HuggingFace datasets. For more information on HuggingFace's datasets, see: https://huggingface.co/datasets

CSV Files

Kangas can read directly from CSV files. Its CSV reader is more nuanced than Pandas CSV reading in that it preserves floats, integers, and dates automatically. The reader also accepts a dictionary of converters (see Converters).

dg = kg.read_csv("samples.csv")
dg = kg.read_csv("https://company.com/samples.csv")
dg = kg.read_csv("https://company.com/samples.csv.zip")

You can also read from a URL. If the file is in an archived format ("zip", "tgz", etc.), Kangas will download, unarchive, and load it.

For more options on reading CSV files, see DataGrid.read_csv()
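
To make the type-preservation point above concrete: Python's standard csv module returns every field as a string, so a loader that keeps integers, floats, and dates has to infer them. The sketch below shows one way such inference could work. It is an illustration only and is not Kangas's implementation; the helper name infer_value and the sample data are hypothetical.

```python
import csv
import io
from datetime import datetime


def infer_value(text):
    """Hypothetical helper: try int, then float, then ISO date; else keep the string."""
    for parse in (int, float, datetime.fromisoformat):
        try:
            return parse(text)
        except ValueError:
            continue
    return text


# In-memory stand-in for a small CSV file (made-up sample data).
sample = io.StringIO("id,score,taken,note\n1,0.75,2022-11-15,first\n")

# csv.DictReader yields strings only; infer_value restores the richer types.
rows = [
    {key: infer_value(value) for key, value in row.items()}
    for row in csv.DictReader(sample)
]
```

With kg.read_csv() none of this is needed, since the text above states that the types are preserved automatically.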

Kaggle datasets

This example uses the dataset from: https://www.kaggle.com/c/dog-breed-identification

Here, we use one DataGrid to read the CSV file, and then construct another that contains the breed and image.

dg = kg.read_csv("labels.csv")
dogs = kg.DataGrid(
    name="Dog Breeds",
    columns=["Breed", "Image"],
)
for row in dg.to_dicts():
    dogs.append([row["breed"], kg.Image("train/" + row["id"] + ".jpg")])

Grouping on "breed" in the Kangas UI gives:

[Screenshot: Kaggle Dataset]

JSON line files

Kangas can read JSON line files as described here: https://jsonlines.org/

A "JSON line file" is a text file containing JSON objects, one per line. This format is useful because you can process one line at a time, rather than needing to read the entire file into memory before deserializing it.

dg = kg.read_json("json_line_file.json")

You can also read a JSON line file from a URL, and if the file is in an archived format ("zip", "tgz", etc.) then it will download, unarchive, and load it.

For more options on reading JSON line files, see DataGrid.read_json()
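
The one-object-per-line property described above can be seen with the standard library alone. This is a minimal sketch of the format, not of how Kangas reads it; the function name iter_json_lines and the sample records are made up for illustration.

```python
import io
import json

# In-memory stand-in for a JSON line file; a real file would be opened with open(path).
sample = io.StringIO(
    '{"id": 1, "label": "cat"}\n'
    '{"id": 2, "label": "dog"}\n'
)


def iter_json_lines(handle):
    """Yield one decoded object per non-empty line, without loading the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(iter_json_lines(sample))
```

Because iter_json_lines is a generator, each line is parsed only when requested, which is what keeps memory use low on large files.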

Converters

Each of the read_ methods, and the DataGrid constructor itself, also takes a parameter named converters. This is a dictionary where each key is a column name and each value is a function of one argument. The function should take the column's raw value and return the converted value.
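
As a rough sketch of the shape such a dictionary takes, the example below defines a few converters and applies them by hand to a single raw row. The column names, the converter functions, and the manual application are all hypothetical and only illustrate the key-to-function mapping; Kangas applies the converters itself when the dictionary is passed in.

```python
# Hypothetical column names and converter functions, for illustration only.
converters = {
    "score": float,                           # "0.75" -> 0.75
    "label": str.strip,                       # " cat " -> "cat"
    "tags": lambda value: value.split("|"),   # "pet|indoor" -> ["pet", "indoor"]
}

# A made-up raw row, as it might come out of a text file.
raw_row = {"score": "0.75", "label": " cat ", "tags": "pet|indoor", "id": "7"}

# Apply each column's converter; columns without one pass through unchanged.
converted = {
    column: converters.get(column, lambda value: value)(value)
    for column, value in raw_row.items()
}

# With Kangas, the dictionary would instead be passed to a reader, for example:
# dg = kg.read_csv("samples.csv", converters=converters)
```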

You can also use the special key "row" in the converters dictionary. Its function receives the entire row as a dictionary, so you can alter one column based on the values of another. For example:

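def huggingface_annotations(row):
    """Row converter: add the boxes listed in "objects" to the row's image."""
    # Label names, indexed by category id (cppe)
    cppe_labels = ["Coverall", "FaceShield", "Gloves", "Goggles", "Mask"]
    if "image" in row and "objects" in row:
        # Only handle rows with a Kangas Image and a dictionary of annotations
        if isinstance(row["image"], kg.Image) and isinstance(row["objects"], dict):
            if ("bbox" in row["objects"]) and ("category" in row["objects"]):
                boxes = row["objects"]["bbox"]
                labels = row["objects"]["category"]
                for box, label in zip(boxes, labels):
                    # Each box is (x, y, width, height); pass two corner points
                    x, y, w, h = box
                    row["image"].add_bounding_boxes(
                        cppe_labels[label], [[x, y], [x + w, y + h]]
                    )

# data is not defined here; it stands for your rows, such as a HuggingFace dataset
dg = kg.DataGrid(data, converters={"row": huggingface_annotations})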

This example reads the contents of the "objects" JSON column and adds them as bounding boxes to the image in the "image" column.
