Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major Refactoring for Enhanced Flexibility and Performance #90

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

maxime4000
Copy link

@maxime4000 maxime4000 commented Nov 28, 2023

🚨 Warning

This is fully functionnal and shouldn't break anythings that weren't breaking, but it is not fully "finished/stable". I would like to have someone finishing the PR. Open to revert some possibly "controversial" changes, but not interested to fix the multi progressbar as I don't have deep knowledge into playwright and the logger.

Overview

This PR introduces a major refactoring of the codebase, adding support for multiple configurations and enhancing the overall performance and flexibility of the crawler.

Was talked briefly in #82

Key Changes

  • Multiple Configurations: Now supports handling multiple configurations, allowing more versatile use cases.
  • New Flexible Configuration Options:
    • Exclude Selectors: Allows exclusion of content from selectors for targeted data extraction.
    • Max Concurrency: Enables concurrent processing to optimize performance.
    • Config Naming: Useful for organizing multiple configurations.
  • Improved Configuration Parsing: Configs now have default values set within the parser.
  • Output Directory Structure: Changed to output/, improving data management.
  • Multi Progress Bar Interface: Replaced individual page logging with a multi progress bar for clearer progress monitoring.
  • Cleaner Output: Streamlined the output for clarity.
  • Codebase Refactoring: Comprehensive overhaul to improve maintainability and efficiency.
  • Sub Routing Namings: Enhanced routing naming conventions.
  • File Path Changes: Updated output file path to output/data.json.
  • .gitignore Fix: Improved source control management.
  • Prettier Configuration: This update introduces a unique blend of traditional and modern coding styles, focusing on enhancing readability, clarity, and maintainability. A significant aspect of this configuration is the use of tab characters for indentation, prioritizing accessibility and offering better flexibility for visually impaired developers.
    • Can also be reverted by deleting the config file and run npm run prettier or npm run fmt
      • We should delete one of the 2.
      • IMO: fmt

Partially Completed Features and Known Issues

  • Full CLI Support: Haven't use the CLI, should be tested.
  • Terminal Log Issues: INFO and ERROR logs not displayed correctly due to the multi-progress bar feature. They are showing up, but overwritten by the progressbar.
  • Progress Bar Wrapping: Issues with wrapping in narrow terminals.

Current State and Request for Collaboration

The changes made significantly improve the tool's capabilities. However, they are not entirely stable, particularly regarding terminal output. I seek collaboration from the community to resolve these issues, especially the log display discrepancies and progress bar wrapping.

Note :

I am open to make those changes as they might be controversial:

  • Revert Prettier config
  • Disable/Remove Multi Progressbar.
    • Tho, I would prefer if someone would be willing to fix it. Maybe someone that has better understanding of terminal output that I do.
    • My goal was to make understanding the current progress easier, but it introduce bugs with logging coming from playwright... Wasn't expecting this when I start the feature...

# Changes
- Now support multiple configs
- New options for a more flexible config
  - ExcludeSelectors : Allow to remove content from selector.
  - MaxConcurrency : Allow crawler to be concurrent
  - Name : Define the name of the config. Useful when multi config.
- Config now has default value set in the parser.
- Output now is in it's own folder `output/`
- Now display a multi progressbar instead of logging every page.

# Known issues
- INFO, ERROR, are not displayed correctly now that the multibar feature is implemented
- Progressbar is append to next line if terminal width is too small.
toArray now make empty Array if string empty
Copy link
Collaborator

@marcelovicentegc marcelovicentegc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @maxime4000! I see lots of value to these changes! I'm willing to partake to resolve the logging 🤗

My first take would be on the prettier config. Would you mind keeping the tab width 2 or 1 (I'm not sure what's the default)? It occurs to me that it would make the PR review and evolution easier as formatting changes wouldn't be present 🤗

Comment on lines +8 to +20
export async function sequentialAsyncMap<T, R>(
array: T[],
asyncFunction: (value: T, index: number, array: T[]) => Promise<R>,
): Promise<R[]> {
return array.reduce(
async (previousPromise: Promise<R[]>, currentItem: T, index, array) => {
const accumulator = await previousPromise;
const result = await asyncFunction(currentItem, index, array);
return [...accumulator, result];
},
Promise.resolve([] as R[]),
);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not being used anywhere, right?

Suggested change
export async function sequentialAsyncMap<T, R>(
array: T[],
asyncFunction: (value: T, index: number, array: T[]) => Promise<R>,
): Promise<R[]> {
return array.reduce(
async (previousPromise: Promise<R[]>, currentItem: T, index, array) => {
const accumulator = await previousPromise;
const result = await asyncFunction(currentItem, index, array);
return [...accumulator, result];
},
Promise.resolve([] as R[]),
);
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exact! I had few issues with data being saved in the wrong config. I use that one to make it sequential and see what is happening. I found out that the problem was the Dataset/RequestQueue that didn't exist, so I change it back to a parallel afterward. Will remove it as it is not used.

Copy link
Collaborator

@marcelovicentegc marcelovicentegc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after reviewing the code I got to a conclusion:

I'm in for multiple configurations and enhancing the overall performance and flexibility of the crawler, but my suggestion is to address the multi progress bar feature separately, on a different pull request, specially because it introduces possibly undesired behaviours on logging and needs further attention.

This way we can release the multi-config feature and work on enhancing the logging apart.

What do you think @maxime4000?

@marcelovicentegc marcelovicentegc added the enhancement New feature or request label Nov 28, 2023
@maxime4000
Copy link
Author

@marcelovicentegc Thanks I appreciate :)

  1. Prettier Tab Width: The 'tab width' setting in Prettier is not mandatory when using tabs. In such cases, the IDE's setting (if specified) determines the tab size, otherwise it defaults to 4. This means this settings is not important and could be removed. However, for languages like YAML or Python, where tabs are not allowed, Prettier would uses this value to insert 4 spaces. That being said, we could omit this setting and it would be using "the Prettier opiniated" default value. (Said that prettier support that file format in the first place)

  2. Multi Progress Bar Changes: I can cherry pick these changes into a new Draft PR and assign it to you in the following days. However, I won't assist more than giving you hint when releasing the draft PR. (Might answering initial question if you have some)

Caveat of Reverting the Logger

When running multiple crawlers simultaneously, log outputs will mix, making real-time tracking challenging. This issue mainly arises with multiple configurations. But I'm also guessing that if you use this tools with multiple config, you kinda know what you are doing. 😉

Timeline for Changes

I plan to complete thoses changes by tomorrow or the next day. However, I'd prefer to have others review these changes before proceeding further.

The proposed changes are substantial, and I've made some important decisions independently. It's prudent to wait for feedback, especially from someone at BuilderIO, to ensure there's no disagreement or need for discussion about these changes before doing changes.

@subframe7536
Copy link

Maybe the executor should also be changed. tsc is a little slow.
How about using esbuild, like https://github.com/esbuild-kit/esno

@maxime4000
Copy link
Author

maxime4000 commented Nov 30, 2023

@subframe7536 While you do have a point about the speed of tsc and esbuild/swc would be a better transpiller, we could also says that when we run npm run start, there is no need to build and could be using tsx or ts-node instead.

While your point is an improvement, this PR is already doing too much and thus, I won't make that change here. Though, I suggest you to make a separate PR with this change as it would be faster if you are interested to bring that to the project.

@maxime4000
Copy link
Author

I'm sorry @marcelovicentegc, I couldn't make it...

I was optimistic about my own time and deadlines to do the things I promise to do. And I blinked, and it's been a month and couldn't put the time to finish it...

I think about it, I'm anxious about the promise of doing it but can't find the time to do it. Also, the facts are that my customs GPTs using the output produced by this PR wasn't any better than without it. In facts, chatgpt was slower, more prone to "network error" and didn't give better results than without "the knowledge".

Open AI did update their model to use more recent data and it made it feels like this PR was procrastination with my borrowed times as I should be using this time to build the things I want to build instead of improving a "random" tools that will scrape data to build knowledge files that will slightly help an AI to be more "accurate" when the AI is already able to help me without this data and also the output generated from the tools and the improvement I made didn't end up used because the experience was worse when those files were added to the custom GPT...

Motivation has been low as the beneficial gains of using the tools weren't substantial. I came here with a "take it or leave it" mentality, I promise to do better and support my work, but the urgency of finishing the work fall off short as I didn't used that work in the end...

I'm still planning to finish this PR, but I wanted to be transparent about this. I'm on borrowed times with my current life situation, I want to use that time to get the most of it and hopefully secure an independent prosper future. This was procrastination to my current goals. I wish it would have been more useful to me. I will take time to finish it someday, but it's not my priority at the moment.

Reviewing changes is not the same mental load as doing the changes and I'm still open about reviewing contributions to merge those improvements the sooner, but can't give an estimate of when will I do the promised changes.

On that note, I wish everyone who read this a good holiday season, an happy new years and to take time to take care of yourself 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants