Improve Online Search: Parallelize Search, Use Jina Reader API by default #832

Merged

Conversation

debanjum
Member

Overview

Khoj will be able to do online search out of the box, even for self-hosted users.

  • Default to the Jina search and reader APIs when no Serper.dev or Olostep API keys are set
  • Run online searches in parallel to process multiple queries faster
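
A minimal sketch of the fallback described above, not the code in this PR. It assumes Serper.dev is the optional search provider and Olostep the optional webpage reader; the environment variable names and both function names are hypothetical, chosen only for illustration.

```python
import os


def pick_search_provider(env=None):
    """Return the search backend to use: Serper.dev if a key is set, else Jina."""
    env = os.environ if env is None else env
    # SERPER_DEV_API_KEY is an assumed variable name for this sketch.
    if env.get("SERPER_DEV_API_KEY"):
        return "serper"
    # Jina needs no API key, so it works as the zero-setup default.
    return "jina"


def pick_webpage_reader(env=None):
    """Return the webpage reader to use: Olostep if a key is set, else Jina."""
    env = os.environ if env is None else env
    # OLOSTEP_API_KEY is an assumed variable name for this sketch.
    if env.get("OLOSTEP_API_KEY"):
        return "olostep"
    return "jina"
```

With neither key set, both functions fall through to Jina, which is what makes search work without any setup.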

Details

  • Jina provides APIs for online search and web page reading that
    require no API key. This provides a good default to enable
    online search for self-hosted users with no additional setup.

  • The Jina search API also returns webpage contents with the results, so
    use those directly when the Jina search API is used instead of
    reading the webpages separately. The step that extracts relevant
    content from a webpage using a chat model is still run via the
    read_webpage_and_extract_content func in this case.

  • Parse search results from Jina search API into same format as
    Serper.dev for accurate rendering of online references by clients

  • Run online searches in parallel with AsyncIO to process multiple
    queries faster
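
A minimal sketch of running several searches concurrently with AsyncIO. `search_one` is a hypothetical stand-in that only simulates a network call; it is not the function used in this PR.

```python
import asyncio


async def search_one(query: str) -> dict:
    """Hypothetical stand-in for one online search request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"query": query, "organic": []}


async def search_all(queries: list[str]) -> dict[str, dict]:
    """Run one search per query concurrently and map each query to its result."""
    # Start every search at once and wait for all of them to finish.
    results = await asyncio.gather(*(search_one(q) for q in queries))
    # gather returns results in input order, so they line up with queries.
    return dict(zip(queries, results))


if __name__ == "__main__":
    print(asyncio.run(search_all(["khoj ai", "jina reader api"])))
```

Because the requests overlap, the total wait is roughly that of the slowest query rather than the sum of all of them.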

@debanjum debanjum force-pushed the improve-online-search-with-parallel-search-jina-ai-fallback branch from 85ad6f6 to 373eff2 Compare June 22, 2024 09:55
Collaborator

@sabaimran sabaimran left a comment

Awesome being able to provide a better out of the box search experience to self-hosted users.

Review threads (all resolved):
documentation/docs/features/online_search.md (outdated)
documentation/docs/features/online_search.md (outdated)
src/khoj/processor/tools/online_search.py
tests/test_offline_chat_director.py (outdated)
@debanjum debanjum force-pushed the improve-online-search-with-parallel-search-jina-ai-fallback branch 3 times, most recently from 9ebf68a to 2e1b112 Compare June 25, 2024 19:35
@debanjum debanjum requested a review from sabaimran June 26, 2024 02:45
@debanjum debanjum force-pushed the improve-online-search-with-parallel-search-jina-ai-fallback branch from 2cba000 to 2e1b112 Compare June 26, 2024 09:41
Jina AI provides a search and webpage reader API that doesn't require
an API key. This provides a good default to enable online search for
self-hosted users with no additional setup.
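
A rough sketch of how keyless requests to Jina could be addressed. The endpoint prefixes below (`r.jina.ai` for reading a page, `s.jina.ai` for search) are assumptions about Jina's public endpoints, and the helper names are hypothetical. The sketch only builds URLs and makes no network calls.

```python
from urllib.parse import quote

# Assumed endpoint prefixes; check Jina's documentation before relying on them.
JINA_READER_URL = "https://r.jina.ai/"
JINA_SEARCH_URL = "https://s.jina.ai/"


def jina_reader_request_url(webpage_url: str) -> str:
    """Build the reader URL by prefixing the target webpage URL."""
    return f"{JINA_READER_URL}{webpage_url}"


def jina_search_request_url(query: str) -> str:
    """Build the search URL with the query percent-encoded into the path."""
    return f"{JINA_SEARCH_URL}{quote(query)}"
```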

The Jina search API also returns webpage contents with the results, so
use those directly when the Jina search API is used instead of
reading the webpages separately. The step that extracts relevant
content from a webpage using a chat model is still run via the
`read_webpage_and_extract_content` func in this case.

Parse search results from the Jina search API into the same format as
Serper.dev for accurate rendering of online references by clients.
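
A minimal sketch of that normalization, not the parser in this PR. The input keys (`title`, `url`, `description`, `content`) and the Serper-style output keys (`organic`, `title`, `link`, `snippet`) are assumptions about the two response shapes, and `jina_to_serper_format` is a hypothetical name.

```python
def jina_to_serper_format(jina_results: list[dict]) -> dict:
    """Map Jina-style search results onto a Serper-style 'organic' list."""
    organic = []
    for item in jina_results:
        organic.append(
            {
                "title": item.get("title", ""),
                "link": item.get("url", ""),
                "snippet": item.get("description", ""),
                # Keep the page content Jina already returned, so the
                # webpage does not need to be read again separately.
                "content": item.get("content"),
            }
        )
    return {"organic": organic}
```

Keeping one output shape means clients can render references the same way regardless of which search provider produced them.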

It is unnecessary as the OpenAI client automatically tries to use the API
key from the OPENAI_API_KEY env var when the api_key field is unset.

Update offline, openai chat actor, director tests to not require
Serper to run the online command tests.

Update documentation for self-hosted online search to mention that no
setup is required by default, but that results can be improved by using
Serper.dev or Olostep.
@debanjum debanjum force-pushed the improve-online-search-with-parallel-search-jina-ai-fallback branch from 2e1b112 to d5ceff2 Compare July 2, 2024 11:49
@debanjum debanjum merged commit c015eeb into master Jul 2, 2024
7 checks passed
@debanjum debanjum deleted the improve-online-search-with-parallel-search-jina-ai-fallback branch July 2, 2024 12:15
2 participants