A sample docker setup for Trino (Presto) with Hive connector

hevoio/trino-hive

Trino (Presto) with Hive connector

The instructions here describe how to run Trino (Presto) against data stored in GCS. This setup has been adapted from this repository: Hive connector over MinIO file storage.

Stack

  • Trino (Presto)
  • Hive GCS connector
  • Hive Metastore with MariaDB persistence

Steps

  • Download the GCS connector and place the jar file in the lib directory.
  • Obtain a GCP service-account credentials JSON file, with the service account having admin permissions on GCS. Place this file in the creds directory.
  • Set the GCP project ID against fs.gs.project.id in metastore-site.xml.
  • Run docker-compose up -d.
  • It should start instances of the database, the metastore, and Trino:
Creating network "trino-hive_trino-network" with driver "bridge"
Creating trino-hive_mariadb_1           ... done
Creating trino-hive_trino-coordinator_1 ... done
Creating trino-hive_hive-metastore_1    ... done
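
The metastore-site.xml entry referenced in the steps above would look something like this (the project ID is a placeholder — substitute your own):

```xml
<property>
  <name>fs.gs.project.id</name>
  <!-- Replace with your GCP project ID -->
  <value>my-gcp-project</value>
</property>
```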

Access trino

docker container exec -it trino-hive_trino-coordinator_1 trino

# The following prompt should appear
trino>

Create Entities

View Catalogs

trino> show catalogs;
 Catalog
---------
 hive
 system
 tpcds
 tpch
(4 rows)

Create a Schema in the Hive Catalog

trino> CREATE SCHEMA hive.hive_gcs WITH (location = 'gs://bucket-test-tj-1/');
CREATE SCHEMA

Create some data in GCS. Ensure that partition directories like partition_name=foo are present in the structure and that each file has 5 columns with headers col_1, col_2, col_3, col_4, col_5.
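
The expected layout can be sketched locally before uploading to GCS. A minimal Java sketch (the partition names and values mirror the sample table below; the class name and file name are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SampleDataLayout {
    /** Writes one CSV file under a Hive-style key=value partition directory. */
    public static Path writeSample(Path root) throws IOException {
        // Partition directories follow the Hive convention: key=value
        Path partition = root.resolve("ping/xing=bar/fing=io");
        Files.createDirectories(partition);
        // Five data columns with a header row, as expected by the table definition
        List<String> lines = List.of(
                "col_1,col_2,col_3,col_4,col_5",
                "91,92,93,94,95",
                "81,82,83,84,85");
        Path file = partition.resolve("data.csv");
        Files.write(file, lines);
        return file;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeSample(Files.createTempDirectory("gcs-sample")));
    }
}
```

The resulting tree (ping/xing=bar/fing=io/data.csv) is what the partitioned table below maps onto.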

Create a partitioned Hive Table

USE hive.hive_gcs;

CREATE TABLE sample_table1 (
  col_1 varchar, col_2 varchar, col_3 varchar, col_4 varchar, col_5 varchar,
  xing varchar, fing varchar
)
WITH (
  format = 'CSV',
  partitioned_by = ARRAY['xing','fing'],
  external_location = 'gs://bucket-test-tj-1/ping',
  skip_header_line_count = 1
);
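
Each partitioned_by column becomes a key=value directory segment under external_location. A small sketch of how a relative object path decomposes into partition values (the path and class name are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPath {
    /** Decomposes Hive-style key=value path segments into partition column values. */
    public static Map<String, String> parse(String relativePath) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (String segment : relativePath.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                partitions.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        // A file under external_location gs://bucket-test-tj-1/ping
        System.out.println(parse("xing=bar/fing=io/data.csv"));
        // prints {xing=bar, fing=io}
    }
}
```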

Update partitions

CALL system.sync_partition_metadata(schema_name=>'hive_gcs', table_name=>'sample_table1', mode=>'FULL');

The FULL mode both registers partitions found in storage and drops partitions no longer present there; the ADD and DROP modes perform only one of the two.

Read Data


trino:hive_gcs> select * from sample_table1 where xing = 'bar' and fing = 'io';

 col_1 | col_2 | col_3 | col_4 | col_5 | xing | fing
-------+-------+-------+-------+-------+------+------
 91    | 92    | 93    | 94    | 95    | bar  | io
 81    | 82    | 83    | 84    | 85    | bar  | io

Sample code

Maven dependency

<dependency>
    <groupId>io.trino</groupId>
    <artifactId>trino-jdbc</artifactId>
    <version>373</version>
</dependency>

Java Code

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrinoJdbcSample {
    public static void main(String[] args) throws Exception {
        // URL format: jdbc:trino://<Trino_Coordinator>/<Catalog>/<Schema>
        try (Connection conn = DriverManager.getConnection("jdbc:trino://localhost:9080/hive/hive_gcs?user=anything")) {
            try (Statement stmt = conn.createStatement()) {
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table1")) {
                    while (rs.next()) {
                        String col1 = rs.getString("col_1");
                        String col2 = rs.getString("col_2");
                        System.out.println(String.format("col_1=%s and col_2=%s", col1, col2));
                    }
                }
            }
        }
    }
}

References

MinIO

MinIO is a multi-cloud object storage abstraction for any cloud. It has S3-like semantics and is compatible with most clouds, so it can be a good alternative when a rich object-storage abstraction is needed. A sample experiment.
