
execute shell expression from within repl? #77

Open
dezren39 opened this issue Dec 17, 2023 · 7 comments

@dezren39

from the repl, i'd like to capture regular data into variables to use later, maybe?

randomNumbers = $(cat /dev/random)
myTaxes = csv $(cat 2012.csv)
# json, xml, kdl, parquet, ?sqlite?

is this like what FileSystem pre-compute examples are?

let topics = $(find /dev/kafka/local/topics/ops.logs.*)

i could do a lot of this upfront outside TypeStream and then pipe into TypeStream, but being able to access the data within the env may unlock some things.

@lucapette
Contributor

when you say regular data like 2012.csv, are you referring to a file available on your machine?

is this like what FileSystem pre-compute examples are?

The general idea behind those notes in the experiments doc is that we'd like to extend TypeStream so you can store things like a "data stream" (like your example in #76) or a "list of paths" in a variable (lists are not supported yet, and neither is * expansion; the latter has already come up in #51).

So if I understand your suggestion correctly, you'd like to integrate "local" (as in your filesystem) data with TypeStream? Can you please provide an example of how you'd use this feature? (I just want to be sure I have enough context to think about this)

@dezren39
Author

dezren39 commented Dec 18, 2023

yeah that's right. I don't have a ton of use-cases upfront, but my thought would be that allowing local structured data to be injected could be useful for filtering/aggregating/enriching purposes. I could join a stream against a local file's objects, keep only the records whose key and subkey match, and return each matched pair.

cat /dev/kafka/tx | inner-join key,subkey $(csv ./directory.csv)

key      subkey  name  type    price    (comes from the stream)
primary  CS0120  Drew  credit  $21.34
primary  CS0123  Drew  debit   $12.34

joined(key, subkey) to

key      subkey  name                   (a separate system stores this info; easiest to pull from s3 once an hour)
primary  CS0122  "Small Corp Ltd."
primary  CS0123  "Business Corp Ltd."

returns

key      subkey  name  type   price   name_1
primary  CS0123  Drew  debit  $12.34  "Business Corp Ltd."

I feel like this could be a useful pattern, though I want to call out that column name collisions (name vs name_1 above) may need better handling or some configurability.

Initially I was thinking of csv/xml/json/kdl data formats: you would load the files up once and then use that data like a small stream. But I also think if you could run a sql query against a sqlite database, either upfront (cached results forever or re-queries/refreshes every N) or per-object, that could produce some interesting mechanics. (You could insert into the sql database too? Idk.)
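
rough sketch of what the two modes could look like from the repl (totally made-up syntax: the sql builtin, the --refresh option, and directory.db are all hypothetical):

# load once, results cached
directory = sql ./directory.db "select key, subkey, name from directory"

# or re-query on an interval (hypothetical refresh option)
directory = sql --refresh 10m ./directory.db "select key, subkey, name from directory"

cat /dev/kafka/tx | inner-join key,subkey directory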

It would be nice if it could hot-reload on change, or operate on a wildcard pulling in all files that match the filter into the stream.

@lucapette
Contributor

this is very intriguing. The challenge here is how to provide a schema for external data sources that don't come with one (csv files are a good example, but sources with a "loose schema" like redis would also fall here). I've been thinking about it a lot and, coincidentally, I have a meeting tomorrow that should help me push this forward.

But there's more than one thing going on in this issue, so let me try to provide some context.

my thought would be that allowing local structured data to be injected could be useful for filtering/aggregating/enriching purposes

this makes a lot of sense to me. As I said, I'd want to make it as easy as possible to provide a schema for this. What you said next may be a good way of making this happen:

if you could run a sql query against a sqlite database, either upfront (cached results forever or re-queries/refreshes every N) or per-object, that could produce some interesting mechanics. (You could insert into the sql database too? Idk.)

Interfacing with relational databases is very high on my personal list of priorities for TypeStream since it comes up a lot and, just now, you provided me with one more reason to focus on this. The basic idea here is to rely on the "filesystem metaphor". It would look like this:

cat /media/sqlite3/my.db/tables/users | join /dev/kafka/cluster/topics/page_views

I wrote about the implications of this approach here.

There are some obvious challenges with the semantics of this (how does a table become a stream? And a stream a table?) but there are very solid solutions out there (like debezium) which, I'm sure, will provide me with the right guidance.

In the context of your examples, what I'm thinking is that, once there's a "db mounting" feature in TypeStream, the workflow could be something like this:

  • move your csv, json, etc. file into a sqlite table (lots of cool ways of doing this but nothing really beats sqlite-utils imo; see the sketch after this list)
  • mount that db into typestream
  • "just" use it as a db

If you're still with me (sorry I wrote a lot, I know 🤦‍♂️), I'd love to hear what you think about this workflow. Also very curious how you'd imagine "mount that db in typestream" would look (I have ideas but don't want to bias anyone toward the way I'm thinking about it)

@jevy

jevy commented Dec 19, 2023

It almost sounds like having a local "lookup table" for aggregating with a stream. That certainly seems useful.
For the schema: I don't think it's unreasonable to have a schema file (like protobuf or whatever) sitting next to the csv file to define the types.
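
For instance, something like this (just an illustration, the sidecar naming is made up):

# hypothetical sidecar schema sitting next to directory.csv
cat > directory.schema.proto <<'EOF'
syntax = "proto3";
message DirectoryEntry {
  string key = 1;
  string subkey = 2;
  string name = 3;
}
EOF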

@lucapette
Contributor

I don't think it's unreasonable to have a schema file (like protobuf or whatever) sitting next to the csv file to define the types.

that's true! I proposed the "sqlite workflow" because I've found that approach quite fast as a way to import/structure data from csv (sqlite-utils is just that good). The question that remains, of course, is how that schema file should look

@dezren39
Author

I think sqlite sounds like a pretty good solution so you don't need a million adapters. I like it. I would prefer the libsql driver be used, so that the 'file' might be a url. 😈

You will want to document how to interact with a live sqlite database from 'another app' on the same file system, reading out of the file safely.
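
(fwiw, the standard answer for this is sqlite's WAL mode, which lets readers proceed while a writer has the file open; directory.db is just the running example:)

# enable write-ahead logging so concurrent readers don't block the writer
sqlite3 directory.db 'PRAGMA journal_mode=WAL;'

# another app can then read safely while the db is being written to
sqlite3 directory.db 'select count(*) from directory;'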

I like the /tables concept. It might be convenient to allow a raw sql select too.

Piping into a sqlite table, auto-creating the table from the schema if it doesn't exist?
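
maybe something like this, mirroring the read path above (purely hypothetical, redirection into a table path doesn't exist):

cat /dev/kafka/local/topics/tx > /media/sqlite3/directory.db/tables/tx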

Sqlite covers all my use cases. I could find uses for regular structured data, but I think most of those could easily be imported into sqlite. It's not my niche, but you may find certain data types really want native columnar parquet (or duckdb?) files.

@lucapette
Contributor

lucapette commented Dec 20, 2023

I think sqlite sounds like a pretty good solution so you don't need a million adapters. I like it. I would prefer the libsql driver be used, so that the 'file' might be a url.

ah that's a nice one. I think this helps me shape the "mounting" concept!

Piping into a sqlite table, auto-creating the table from the schema if it doesn't exist?

Yes, I think this is the "mirror" feature of "cat /media/sqlite3/db/tables/foo" and both make sense. I haven't tried to spike this yet so I can't tell if we'll be able to ship both at the same time (not a big fan of giant pull requests :D) but I'm fully convinced we need both.

Sqlite covers all my use cases. I could find uses for regular structured data, but I think most of those could easily be imported into sqlite. It's not my niche, but you may find certain data types really want native columnar parquet (or duckdb?) files.

I say this a lot... this is what I like most about the TypeStream abstraction: once we lay down the work for "mounting sqlite", TypeStream will have enough infrastructure code that we'll be able to build new integrations (not totally convinced this is the right word but def the best we have) quite quickly.

Exciting times. I'm going to leave this open while I still have a "private roadmap" in my hands, since there's no other place where we're talking publicly about "media mounting".
