Curate HTML w/ Semantic similarity | JinaAI Embeddings v2 (Small) | Curate HTML to Markdown with JinaAI Embedding Processing for Redundancy Removal #89

Closed
wants to merge 21 commits into from

Daethyra
Contributor

This pull request introduces significant enhancements to the conv_html_to_markdown.py module, integrating JinaAI's embeddings v2 (Small) model for advanced text processing. The primary goal is to refine the conversion of HTML content into Markdown format, with an added focus on removing redundant data using semantic analysis. Below is a detailed overview of the workflow, reasoning, steps taken, and the key features of this updated module.

Workflow and Reasoning:

  • Initial Goal: Improve the conversion of HTML to Markdown by eliminating redundant or semantically similar content.
  • Choice of Technology: Used JinaAI's embeddings v2 (Small) model because it is open-source, handles large volumes of text efficiently, is built for semantic text analysis, and supports a maximum context length of 8k tokens.
  • Security Consideration: Implemented trust_remote_code=True cautiously to allow the execution of the model's custom code, acknowledging the potential risks involved. Passing this value is required for the model to run.
  • Thought Process: By integrating the conv_html_to_markdown.py module, we streamline the conversion process and intelligently curate the HTML content, paving the way for more refined and contextually relevant Markdown outputs.
    • Semantic Similarity Threshold: I had the most success with a threshold of 0.8699, chosen after extensive methodical testing against the Hugging Face Pipelines documentation.
    • Docker Integration: conv_html_to_markdown.py is seamlessly integrated into the root dir's Dockerfile, which has also been updated to install all necessary Python packages for the environment.
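
To make the security consideration above concrete, here is a minimal sketch of loading the model with trust_remote_code=True. The function name load_embedding_model and the optional revision argument are illustrative assumptions, not code from this PR; the repository ID assumes the model is published under the jinaai organization on Hugging Face.

```python
from typing import Optional, Tuple

# Assumed Hugging Face repository ID for the model named in this PR.
MODEL_ID = "jinaai/jina-embeddings-v2-small-en"


def load_embedding_model(revision: Optional[str] = None) -> Tuple[object, object]:
    """Load the tokenizer and model (hypothetical helper, not the PR's code).

    trust_remote_code=True lets transformers execute Python code shipped in
    the model repository, so only enable it for repositories you trust.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModel, AutoTokenizer

    kwargs = {}
    if revision is not None:
        # Pinning a revision keeps later upstream changes from being picked up.
        kwargs["revision"] = revision

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, **kwargs)
    model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, **kwargs)
    return tokenizer, model
```

Passing a specific commit hash as revision is one way to limit the exposure that trust_remote_code=True creates, since the remote code cannot change underneath a pinned revision.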

Features Covered:

  • HTML to Markdown conversion with tag stripping and link processing.
  • Semantic redundancy removal using JinaAI embeddings.
  • Batch processing of text for efficient embedding generation.
  • Resilient error handling with detailed logging.

Task List:

  • Convert HTML content to Markdown format.
  • Integrate JinaAI's embeddings for semantic analysis.
  • Implement mean pooling and batch processing for embedding generation.
  • Develop functionality to identify and remove redundant lines based on embeddings.
  • Ensure robust error handling and logging throughout the module.
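
The mean pooling and batch processing items can be sketched in plain Python as follows. This is an illustration of the general technique under stated assumptions, not the module's actual implementation: the names batched and mean_pool are hypothetical, and a real version would operate on torch tensors returned by the model rather than on lists.

```python
from typing import Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def mean_pool(token_vectors: Sequence[Vector], attention_mask: Sequence[int]) -> Vector:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have equal length")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask masks out every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    # Three token vectors; the last one is padding and is excluded.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))  # [2.0, 3.0]
    print([list(b) for b in batched(["a", "b", "c", "d", "e"], 2)])
```

Masking before averaging matters because padded positions would otherwise pull every short line's embedding toward the padding vector.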

Steps Taken:

  1. Module Setup: Initialized the module with BeautifulSoup and markdownify for HTML parsing and Markdown conversion.
  2. Embedding Integration: Integrated AutoTokenizer and AutoModel from the transformers library for embedding processing.
  3. Batch Processing Logic: Developed process_embeddings method to handle embeddings in batches, optimizing for large datasets.
  4. Semantic Analysis: Created remove_redundant_data method to filter out semantically similar content using cosine similarity on embeddings.
  5. Error Handling: Added comprehensive try-except blocks for resilience, particularly in critical methods like convert and curate_content.
  6. Testing and Validation: Conducted thorough testing to ensure the accuracy and efficiency of the embedding-based redundancy removal.
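
Step 4 can be illustrated with a small, dependency-free sketch. It assumes a greedy strategy that keeps the first occurrence of each line and drops any later line whose cosine similarity to an already kept line reaches the threshold; the PR does not spell out the exact strategy, so treat drop_similar_lines as a hypothetical stand-in for remove_redundant_data. The embeddings here are made-up 2-D vectors, not real model output.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_similar_lines(
    lines: Sequence[str],
    embeddings: Sequence[Vector],
    threshold: float = 0.8699,  # the value reported in this PR
) -> List[str]:
    """Keep a line only if it is below `threshold` against every kept line."""
    if len(lines) != len(embeddings):
        raise ValueError("lines and embeddings must have equal length")
    kept_lines: List[str] = []
    kept_vectors: List[Vector] = []
    for line, vector in zip(lines, embeddings):
        if all(cosine_similarity(vector, kept) < threshold for kept in kept_vectors):
            kept_lines.append(line)
            kept_vectors.append(vector)
    return kept_lines


if __name__ == "__main__":
    lines = ["Install the package", "Install this package", "Run the tests"]
    toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # illustrative only
    print(drop_similar_lines(lines, toy_embeddings))
```

With these toy vectors the second line scores about 0.995 against the first and is dropped, while the third line is orthogonal to the first and is kept. Note that this pairwise comparison is quadratic in the number of kept lines.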

Future Improvements and Customization:

The module serves as a foundational framework for further customization and enhancement. Future iterations can explore:

  • Tuning of the semantic similarity threshold for redundancy removal.
  • Wrapping scraped code blocks in Markdown code fences (backticks).
    • Adding a visual divider between chunks or sections.
      • Pagination of some sort could be implemented instead.

This pull request is brought to you by: WELL IT WORKS ON MY MACHINE!

While installing Python, switch to ROOT to avoid installing/using `sudo`
- Switch back before installing `pip` packages to avoid pip warnings
- Added:
  - docstrings
  - granular exception handling
- Ran Black, Flake8, and PyLint against `conv_html_to_markdown`
- Need to change the input file name to BuildIO's default for consistency
	modified:   .gitignore
	new file:   .pylintrc
	modified:   Dockerfile
	renamed:    conv_html_to_markdown.py -> src/conv_html_to_markdown.py
	new file:   tests/test_conv_html_to_markdown.py
Merge release 1.0.0 changes
- out of scope of initial conversion processor
  - html to markdown
- Satisfied with useful:unuseful data ratio
  - Float 0.8699 as threshold for semantic similarity was chosen *intentionally* after extensive methodical testing
- Room for improvement:
  - Wrap code blocks, remove all 'copied'
  - Remove language references
  - Remove `<nsource>`
  - Separate sections with '---' or something similar
    - need visual representation of each chunk while causing minimal noise
  - Introduce structure to house each class
    - Headers and subheaders maybe???
    - Could also just use some basic type of markdown formatting

	modified:   Dockerfile
	modified:   src/conv_html_to_markdown.py
- Refactored load_json function to load_json_files, allowing it to handle multiple JSON files matching a pattern using glob. This change enables the aggregation of data from all matched files. Also, updated main function to reflect the new file loading process and added explanatory comments for clarity.
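
A minimal sketch of what a glob-based loader like the one described in this commit message might look like. The function name load_json_files comes from the commit message, but the body below is an assumption, including the choice to flatten top-level lists and to sort the matched paths for a stable order.

```python
import glob
import json
from typing import Any, List


def load_json_files(pattern: str) -> List[Any]:
    """Load every JSON file matching `pattern` and aggregate the contents."""
    aggregated: List[Any] = []
    # Sorting makes the aggregation order independent of filesystem order.
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as handle:
            data = json.load(handle)
        if isinstance(data, list):
            # A file holding a list of pages contributes each page separately.
            aggregated.extend(data)
        else:
            aggregated.append(data)
    return aggregated
```

A pattern that matches no files yields an empty list rather than an error, so the caller decides whether that case is fatal.
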
@marcelovicentegc
Collaborator

Hey @Daethyra 👋! It occurs to me that these changes could be introduced as a separate package in another repo, as something possibly complementary to the gpt-crawler package.

I'm missing some instructions in this PR:

  • What are the requirements to run these changes? Do people need a HF API key to run this?
  • How does it integrate with gpt-crawler?

@Daethyra
Contributor Author

Daethyra commented Dec 8, 2023

Hey @marcelovicentegc,

Sorry for the late reply. I appreciate you having a look at the PR.

In regard to introducing the Python script as a separate package, I leave that to you -- I'm a noob and don't know what's best for production environments and what's sustainable for you and your team.

To answer your questions,

  • Currently, the conv_html_to_markdown.py module runs as soon as the HTML scraping is finished.
  • Requirements:
    1. Python: ">=3.9, <=3.10.12" | Successfully tested on 3.10 & 3.11
    2. Packages: `pip install -U beautifulsoup4 markdownify transformers torch`
    3. API Key?: No external API keys are required, because jina-embeddings-v2-small-en is open access on HF. It's not like Llama 2, where you need to request access. It's ready to go!
    4. Integration: As normal, from inside the root directory of gpt-crawler, build and run the image. The Dockerfile's build process already points to the Python script.
      • PowerShell: `docker build -t gpt-crawler . ; docker run -it gpt-crawler`
      • Bash: `docker build -t gpt-crawler . && docker run -it gpt-crawler`

Acknowledgements:

  • More enhancements to conv_html_to_markdown.py may be found on my fork's version.
  • I haven't thought of a way to quickly test results from conv_html_to_markdown.py because I rebuild the image every time I update the module's logic.
  • Adding this Python module does not produce ideal results; however, they are still an improvement upon the JSONic data for Custom GPTs, in my opinion.
  • I intend to add the following features at a later date, once Assistant Architect receives its file-based enhancements from the Utilikit:
    [Screenshot: Issues page of Daethyra/gpt-crawler, captured 2023-12-08]

Contributor Author

@Daethyra Daethyra left a comment


  • Because semantic-similarity is the base of this PR, I merged enhancements from the 'main' branch into 'semantic-similarity'

    • Fixes logic errors
  • Throw out changes to config.ts

@Daethyra Daethyra closed this Dec 28, 2023
@Daethyra Daethyra deleted the semantic-similarity branch December 28, 2023 00:14