
Performance Improvement #144

Open
isuhail-sray opened this issue Oct 18, 2023 · 11 comments

Comments

@isuhail-sray

Is there any way to improve performance or cache results so that parsing and querying large JSON files is faster?

@michaelmior
Collaborator

Can you give an example of the specific performance problem you're having? jsonpath_ng doesn't parse JSON for you; if loading the JSON is your bottleneck, there are parsers considerably faster than Python's built-in json module. If the problem is in jsonpath_ng itself, more details would help.
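
For illustration only, swapping in a faster JSON parser is independent of jsonpath_ng. A minimal sketch, assuming orjson is installed; the file name and path expression here are placeholders, not anything from this thread:

# Sketch: parse the document with orjson instead of the stdlib json module,
# then query the resulting dict with jsonpath_ng as usual.
import orjson
from jsonpath_ng.ext import parse

with open("large.json", "rb") as f:        # placeholder file name
    data = orjson.loads(f.read())          # typically much faster than json.loads

expr = parse("$.items[*].id")              # placeholder query, compiled once
ids = [match.value for match in expr.find(data)]
print(ids)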

@jpetersen23

jpetersen23 commented Dec 13, 2023

I used your library to write a CSV-to-JSON converter with the row headers being JSONPath expressions. It worked well except for the performance. I'm probably using it wrong, but it looks like parse is particularly expensive (I also have a lot of queries). I profiled with cProfile and visualized it with snakeviz.
tmp.prof.zip
[screenshot of profiler output attached]

@jpetersen23

jpetersen23 commented Dec 13, 2023

It looks to be coming from this:

def parse_token_stream(self, token_iterator, start_symbol='jsonpath'):

    # Since PLY has some crufty aspects and dumps files, we try to keep them local
    # However, we need to derive the name of the output Python file :-/
    output_directory = os.path.dirname(__file__)
    try:
        module_name = os.path.splitext(os.path.split(__file__)[1])[0]
    except:
        module_name = __name__

    parsing_table_module = '_'.join([module_name, start_symbol, 'parsetab'])

    # And we regenerate the parse table every time;
    # it doesn't actually take that long!
    new_parser = ply.yacc.yacc(module=self,
                               debug=self.debug,
                               tabmodule = parsing_table_module,
                               outputdir = output_directory,
                               write_tables=0,
                               start = start_symbol,
                               errorlog = logger)

    return new_parser.parse(lexer = IteratorToTokenStream(token_iterator))
[screenshot of profiler output attached]

@jpetersen23

I did more profiling to see whether any specific queries were expensive, but in fact I'm doing 80 path queries and each one takes about ~34 ms.

In total that ends up being ~2765.30 ms.

@michaelmior
Collaborator

@jpetersen23 Can you post the code that's giving you performance problems?

@jpetersen23

jpetersen23 commented Dec 13, 2023

I can't share my actual code or data, but I made a toy example based on it.

from jsonpath_ng.ext import parse
import time

pairs = [
    ("$.metadata.content_release_version", "taco"),
    ("$.id", "taco"),
    ("$.config.priority", "taco"),
    ("$.created_at", "taco"),
    ("$.update_at", "taco"),
    ("$.event_type", "taco"),
    ("$.event_state", "taco"),
    ("$.config.requires_one_of.token[0].thingy_id", "taco"),
    ("$.config.requires_one_of.token[0].amount", "taco"),
    ("$.config.asset_map.event_icon", "taco"),
    ("$.config.asset_map.key_art", "taco"),
    ("$.config.loc_map.desc.namespace", "taco"),
    ("$.config.loc_map.desc.key", "taco"),
    ("$.config.loc_map.title.namespace", "taco"),
    ("$.config.loc_map.title.key", "taco"),
    ("$.config.loc_map.something_desc.namespace", "taco"),
    ("$.config.loc_map.something_desc.key", "taco"),
    ("$.config.challenges.BANANAS_01.event_progress", "taco"),
    ("$.config.challenges.BANANAS_02.event_progress", "taco"),
    ("$.config.challenges.BANANAS_03.event_progress", "taco"),
    ("$.config.challenges.BANANAS_04.event_progress", "taco"),
    ("$.config.challenges.BANANAS_05.event_progress", "taco"),
    ("$.config.challenges.BANANAS_06.event_progress", "taco"),
    ("$.config.challenges.BANANAS_07.event_progress", "taco"),
    ("$.config.challenges.BANANAS_08.event_progress", "taco"),
    ("$.config.challenges.BANANAS_09.event_progress", "taco"),
    ("$.config.challenges.BANANAS_10.event_progress", "taco"),
    ("$.config.challenges.BANANAS_11.event_progress", "taco"),
    ("$.config.challenges.BANANAS_12.event_progress", "taco"),
    ("$.config.challenges.BANANAS_13.event_progress", "taco"),
    ("$.config.challenges.BANANAS_14.event_progress", "taco"),
    ("$.config.challenges.BANANAS_15.event_progress", "taco"),
    ("$.config.challenges.BANANAS_16.event_progress", "taco"),
    ("$.config.challenges.BANANAS_17.event_progress", "taco"),
    ("$.config.challenges.BANANAS_18.event_progress", "taco"),
    ("$.config.challenges.BANANAS_19.event_progress", "taco"),
    ("$.config.challenges.BANANAS_20.event_progress", "taco"),
    ("$.config.challenges.BANANAS_01.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_02.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_03.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_04.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_05.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_06.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_07.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_08.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_09.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_10.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_11.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_12.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_13.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_14.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_15.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_16.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_17.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_18.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_19.auto_assign", "taco"),
    ("$.config.challenges.BANANAS_20.auto_assign", "taco"),
    ("$.config.tiers.\"00\".threshold", "taco"),
    ("$.config.tiers.\"00\".array_type[0].thingy_id", "taco"),
    ("$.config.tiers.\"00\".array_type[0].amount", "taco"),
    ("$.config.tiers.\"01\"", "taco"),
    ("$.config.tiers.\"02\"", "taco"),
    ("$.config.tiers.\"03\"", "taco"),
    ("$.config.tiers.\"04\"", "taco"),
    ("$.config.tiers.\"05\"", "taco"),
    ("$.config.tiers.\"06\"", "taco"),
    ("$.config.tiers.\"07\"", "taco"),
    ("$.config.tiers.\"08\"", "taco"),
    ("$.config.tiers.\"09\"", "taco"),
    ("$.config.tiers.\"10\"", "taco"),
    ("$.config.tiers.\"11\"", "taco"),
    ("$.config.tiers.\"12\"", "taco"),
    ("$.config.tiers.\"13\"", "taco"),
    ("$.config.tiers.\"14\"", "taco"),
    ("$.config.tiers.\"15\"", "taco"),
    ("$.config.tiers.\"16\"", "taco"),
    ("$.config.tiers.\"17\"", "taco"),
    ("$.config.tiers.\"18\"", "taco"),
    ("$.config.tiers.\"19\"", "taco")
]

json_output = {}
parse_total_time = 0
start_time = time.process_time()
for pair in pairs:
    parse_start_time = time.process_time()
    jsonpath_expr = parse(pair[0])
    duration = 1000 * (time.process_time() - parse_start_time)
    parse_total_time += duration

    jsonpath_expr.update_or_create(json_output, pair[1])

total_time = 1000 * (time.process_time() - start_time)

print(f"Parse Time: {parse_total_time}ms. Total Time: {total_time}ms")

Here is the cProfile output for a run of it:
tmp.prof.zip

Parse Time: 2665.3660000000023ms. Total Time: 2672.836ms

I also made a follow-up test comparing it to a Python jq setup.
jpath_toy_example_with_jq.zip

The Python jq version produced equivalent JSON, with the following times:
Parse Time: 133.58500000000004ms. Total Time: 142.75500000000002ms

@lukasjesche

Not sure how viable this is, but I changed the parser class to set up the parser table only once and reused the parser, and the time went from:
Parse Time: 524.3853999999981ms. Total Time: 528.6647000000003ms
down to:
Parse Time: 33.61489999999634ms. Total Time: 41.74550000000021ms
(using the posted example code)
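
For anyone wondering what that kind of change might look like, here is a rough sketch (not the actual patch from this thread): it caches the PLY parser built in parse_token_stream so the LALR table is generated only once per process. It assumes jsonpath_ng's internal IteratorToTokenStream helper and the method quoted earlier.

# Rough sketch only: build the PLY parser once per start symbol and reuse it
# for every subsequent parse() call instead of regenerating the table each time.
import ply.yacc
from jsonpath_ng.parser import IteratorToTokenStream
from jsonpath_ng.ext.parser import ExtentedJsonPathParser


class CachedTableParser(ExtentedJsonPathParser):
    _parsers = {}  # start_symbol -> already-built PLY parser

    def parse_token_stream(self, token_iterator, start_symbol='jsonpath'):
        if start_symbol not in self._parsers:
            # Expensive step: only runs on the first call for each start symbol.
            self._parsers[start_symbol] = ply.yacc.yacc(
                module=self,
                debug=self.debug,
                write_tables=0,
                start=start_symbol,
                errorlog=ply.yacc.NullLogger(),
            )
        return self._parsers[start_symbol].parse(
            lexer=IteratorToTokenStream(token_iterator))

With that in place, a single CachedTableParser() instance can be created once and its parse() reused for every path, so the grammar work happens only on the first call.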

@michaelmior
Collaborator

@lukasjesche That should work, although it does require slight code changes to the example. Instead of calling the parse function each time, you would import ExtentedJsonPathParser (not a typo; the class name is unfortunately misspelled). Try this on the reuse-parse-table branch; for me it gives over a 20x speedup on the example.
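
Applied to the toy example above, that change might look roughly like this (a sketch against the reuse-parse-table branch, not the released package; pairs and json_output refer to the earlier example):

# Sketch: create the parser once and reuse it for every path,
# instead of calling the module-level parse() per path.
from jsonpath_ng.ext.parser import ExtentedJsonPathParser

parser = ExtentedJsonPathParser()              # created once
for path, value in pairs:
    expr = parser.parse(path)                  # no table regeneration per call
    expr.update_or_create(json_output, value)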

@martkopecky

Wow, great findings here. Having read this thread, I decided to start caching my parsers where applicable and went from 14 minutes of processing time down to 7 seconds.
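
One application-level way to do that caching without touching the library, assuming the same path strings recur (as they do when converting many CSV rows with identical headers), is a memoized wrapper around parse. A minimal sketch:

# Sketch: cache compiled expressions on the application side so each
# distinct path string is parsed only once, then reused for every row.
from functools import lru_cache
from jsonpath_ng.ext import parse

@lru_cache(maxsize=None)
def cached_parse(path):
    return parse(path)

expr = cached_parse("$.config.priority")   # parsed once
expr = cached_parse("$.config.priority")   # cache hit on subsequent rows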

@evert061

> Not sure how viable but I changed the parser class to setup the parser table only once and reused the parser and I got the time from: Parse Time: 524.3853999999981ms. Total Time: 528.6647000000003ms down to: Parse Time: 33.61489999999634ms. Total Time: 41.74550000000021ms (using the posted example code)

Hi @lukasjesche, I'm facing a similar performance issue. Could you please post the refactoring you did for this example?

@lukasjesche

@evert061 I just extended the JsonPathParser class as in this commit: 0e20f3d
