Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser trips up on common snowflake query history #78

Open
peteclark3 opened this issue Oct 14, 2021 · 3 comments
Open

Parser trips up on common snowflake query history #78

peteclark3 opened this issue Oct 14, 2021 · 3 comments

Comments

@peteclark3
Copy link

Currently, the parser trips up on many common snowflake query history entries like select query_text from table(information_schema.query_history()); - also queries with the rm @SNOWFLAKE_... syntax... also queries with the keyword recluster ... in the latter case, the error being syntax error at or near "recluster", at index 35 ... I am systematically removing these from analysis prior to sending to the analyzer but just FYI that without doing this, the analyzer throws an exception

@vrajat
Copy link
Member

vrajat commented Oct 14, 2021

Yes. This is a known issue and happy to discuss how to approach parsing Snowflake queries better. Right now, this package uses pglast for parsing. So it parses Postgres syntax. It also does pretty well with Redshift since the dialect is very similar.
A set of parsers are required to support different databases. There are a few options that I have considered and funded through freelance gigs:

  • Use Apache Calcite and add support for custom syntax. This worked pretty well for a gig but it is in Java. Since this package is in Python, Python/Java bridges are a pain to use.
  • Use lark and build a parser. Note that one doesnt have to start from scratch.

Both of these require substantial effort and I am looking for opportunities to fund it.

@peteclark3
Copy link
Author

peteclark3 commented Oct 14, 2021

Ok. Is it fair to say that this package doesn't really support snowflake yet? I see it in the marketing materials - https://tokern.io/data-lineage/ .. but in practice when I try the examples, I never get any result in the graph because it trips up on so many queries. If there is an implementation that does use snowflake effectively, or at least filter out the syntax that pglast doesn't support, I'd be happy to use it... but right now I think I am going to have to try a different package because it's getting to a point where I have so many query syntax exclusions that it's becoming not feasible... (even then, queries are failing and not telling me why.. I'm just getting a bunch of these now):

tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/server.py", line 186, in post
tokern-data-lineage |     chosen_visitor = analyze_dml_query(self._catalog, parsed, source)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/__init__.py", line 54, in analyze_dml_query
tokern-data-lineage |     chosen_visitor = visit_dml_query(parsed, source)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/__init__.py", line 89, in visit_dml_query
tokern-data-lineage |     v(parsed.node)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 197, in __call__
tokern-data-lineage |     result = method(ancestors, node)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/data_lineage/parser/dml_visitor.py", line 254, in visit_SelectStmt
tokern-data-lineage |     super().__call__(node.intoClause)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 190, in __call__
tokern-data-lineage |     ancestors, node = generator.send(None)
tokern-data-lineage |   File "/opt/pysetup/.venv/lib/python3.8/site-packages/pglast/visitors.py", line 221, in iterate
tokern-data-lineage |     raise ValueError('Bad argument, expected a ast.Node instance or a tuple')
tokern-data-lineage | ValueError: Bad argument, expected a ast.Node instance or a tuple

@vrajat
Copy link
Member

vrajat commented Oct 16, 2021

Yes. The open source package as it currently stands has poor Snowflake coverage. This is because there is no good open source Snowflake SQL parser. All other OSS packages have a similar problem.
I have private scripts to parse and I support snowflake as part of my freelance gigs. These scripts have too many hacks to open source.
I am looking for funding to build an open source Snowflake parser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants