Merge pull request #131 from ibm-client-engineering/ng5
Updated Rewriting Documents & Chunking Experiments
ng4567 committed Jun 20, 2024
2 parents 4073c1b + 6922b19 commit 45abe79
Showing 4 changed files with 47 additions and 0 deletions.
7 changes: 7 additions & 0 deletions docs/3-Use-Cases/NeuralSeek.mdx
@@ -98,6 +98,13 @@ You can run the different experiments just by changing the Discovery collect
It uses the NeuralSeek API.
Please refer to [Testing Notebook](testing.ipynb) for detailed steps.
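
For readers who skip the notebook, here is a minimal sketch of a batch test against the NeuralSeek REST API using `requests`. The endpoint shape, instance ID, and `apikey` header below are assumptions, so copy the exact URL and credentials from the Integrate tab of your instance:

```python
# Sketch only: send a batch of test questions to NeuralSeek's /seek endpoint.
# The URL shape, instance ID, and "apikey" header are assumptions -- take the
# real values from the Integrate tab of your NeuralSeek instance.
import requests

NS_URL = "https://api.neuralseek.com/v1/YOUR_INSTANCE_ID/seek"  # assumed endpoint shape
HEADERS = {"apikey": "YOUR_NEURALSEEK_APIKEY", "Content-Type": "application/json"}

questions = [
    "What is covered under the travel policy?",   # placeholder questions
    "How do I submit an expense claim?",
]

for q in questions:
    resp = requests.post(NS_URL, headers=HEADERS, json={"question": q}, timeout=60)
    resp.raise_for_status()
    print(q, "->", resp.json().get("answer"))  # assumed "answer" field in the response
```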

## UI Testing

Within the NeuralSeek UI, there is also a feature to send a batch of test questions at once. This can be helpful for users who prefer UI tools, and it returns the answers in a spreadsheet format. To use this feature, navigate to the "upload test questions" section on the home page:
![UI Testing](./assets/test-questions-ui.png)

When submitting your batch of questions, format them using the [template](./assets/q.csv) provided before uploading them.
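
If you have a long list of questions, the template can also be populated programmatically. A minimal sketch, assuming the three-column `ID,Question,Filter` layout of the template and a placeholder question list:

```python
# Sketch only: fill the ID,Question,Filter template with your own test questions.
# The questions below are placeholders; the Filter column is left blank here.
import csv

questions = [
    "What is covered under the travel policy?",
    "How do I submit an expense claim?",
]

with open("q.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Question", "Filter"])
    for i, question in enumerate(questions, start=1):
        writer.writerow([i, question, ""])
```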

## Download Logs
- Proceed to the API section on the Integrate tab
![NS Console Log](./assets/NS_Console_Log.png)
12 changes: 12 additions & 0 deletions docs/3-Use-Cases/assets/q.csv
@@ -0,0 +1,12 @@
ID,Question,Filter
1,,
2,,
3,,
4,,
5,,
6,,
7,,
8,,
9,,
10,,
11,,
Binary file added docs/3-Use-Cases/assets/test-questions-ui.png
28 changes: 28 additions & 0 deletions docs/4-Transition/3-Lessons.mdx
@@ -71,3 +71,31 @@ There are two ways to resolve this problem:
or
2) Replace ```SubnetId: !Ref PublicSubnet1ID``` with a private subnet ID ```SubnetId: subnet-example5b646```

# Rewriting Documents & Chunking Experiments

These are some experiments conducted during the course of this POC to evaluate methods for improving LLM output quality, parsing tables, and chunking documents in alternate ways to optimize NeuralSeek's document retrieval.

## Rewriting Documents
From reviewing many documents, we noticed that PDF documents are often dense and complex. The content can be challenging even for a human to process when searching for a specific piece of information, and sometimes the answers to questions are not explicitly stated in the documents at all. Reformatting and rewriting portions of the documents typically resolves this and improves both the documents themselves and the chatbot responses.


## Chunking

Certain documents contain tables of contents and numbered section headings that make it easy to split them into chunks. We experimented with this on several documents in our knowledge base, splitting them on their section headings. This can be done either with code or manually.

Employing this technique, we achieved a significant increase in the number of times the correct passage was retrieved during a NeuralSeek search. Ensuring that documents are titled correctly is also important for this to work. We only chunked documents by hand, but it would be simple to implement a programmatic solution, for example using regular expressions to split documents on their section headings, as sketched below.
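
A minimal sketch of such a programmatic split, assuming the documents have already been exported to plain text and use numbered headings such as `3.1 Eligibility`; the heading pattern and file names are illustrative only:

```python
# Sketch only: split a plain-text document into chunks on numbered section headings.
# The heading pattern and file names are assumptions -- adjust them to match the
# formatting of your own documents.
import re
from pathlib import Path

HEADING = re.compile(r"^\d+(\.\d+)*\s+.+$", re.MULTILINE)  # e.g. "3.1 Eligibility"

def split_on_headings(text: str) -> list[str]:
    """Return one chunk per numbered section, heading line included."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        return [text]  # no headings found: keep the document whole
    bounds = starts + [len(text)]
    # Note: any preamble before the first heading is dropped in this sketch.
    return [text[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]

doc = Path("policy_document.txt").read_text()  # assumed plain-text export of the PDF
for i, chunk in enumerate(split_on_headings(doc), start=1):
    Path(f"policy_document_part{i}.txt").write_text(chunk)
```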

Once documents were chunked, we replaced the full documents in the Watson Discovery collection with their split components, removing the originals to avoid confusing the RAG retrieval.
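
A sketch of that replacement step using the Watson Discovery v2 Python SDK (`ibm-watson`); the project ID, collection ID, document ID, and file names below are placeholders to swap for your own values:

```python
# Sketch only: replace a full document with its chunked parts in Watson Discovery.
# All IDs, URLs, and file names are placeholders -- take the real values from your
# Discovery project and service credentials.
from ibm_watson import DiscoveryV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

discovery = DiscoveryV2(
    version="2020-08-30",
    authenticator=IAMAuthenticator("YOUR_DISCOVERY_APIKEY"),
)
discovery.set_service_url("https://api.us-south.discovery.watson.cloud.ibm.com")

project_id = "YOUR_PROJECT_ID"
collection_id = "YOUR_COLLECTION_ID"

# Upload each chunk produced by the splitting step above.
for part in ["policy_document_part1.txt", "policy_document_part2.txt"]:
    with open(part, "rb") as f:
        discovery.add_document(
            project_id=project_id,
            collection_id=collection_id,
            file=f,
            filename=part,
            file_content_type="text/plain",
        ).get_result()

# Then delete the original full document so it no longer confuses retrieval.
discovery.delete_document(
    project_id=project_id,
    collection_id=collection_id,
    document_id="FULL_DOCUMENT_ID",
).get_result()
```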

## Recommendation

After our extensive testing, we recommend the following:

1. Use a combination of LLM and curation
An LLM is not a perfect solution for everything in the enterprise world; it can produce subpar answers and incur unnecessary costs. Our recommendation is to combine the LLM with curation: first curate existing Q&A pairs, and any questions that should always receive the same answer, then use the LLM only for new queries. Every LLM query carries a cost, so if an answer already exists in the curated list there is no need to spend extra to have the LLM generate it again. The LLM can search the knowledge repository and propose an answer, which we can then evaluate and curate if needed (see the sketch after this list).

2. Improve upon existing complex documents
The existing documentation contains complex tables and text structures. Avoid tables and images where possible, and express the same information in natural-language text that an LLM can understand. Rewriting documents in general also helps the LLM understand the content: be explicit, avoid section titles that add no context, and spell out acronyms, since the LLM may retrieve passages that contain only an acronym without the context of what it stands for. These techniques will significantly improve chatbot results.

3. Be specific when querying
It is also recommended to provide enough contextual information in a question for the LLM to find the relevant information: spell out acronyms, name the specific type of data you are asking about, and so on. This helps the LLM understand the question and answer it properly.
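
A minimal sketch of the curation-first pattern from recommendation 1; the curated store and the `ask_llm` fallback are hypothetical stand-ins for whatever curation table and LLM endpoint you actually use:

```python
# Sketch only: answer from a curated Q&A store first, fall back to the LLM otherwise.
# CURATED_ANSWERS and ask_llm are illustrative placeholders, not a real API.
from typing import Callable

CURATED_ANSWERS = {
    "what is covered under the travel policy?": "Curated answer text goes here.",
}

def answer(question: str, ask_llm: Callable[[str], str]) -> str:
    key = question.strip().lower()
    if key in CURATED_ANSWERS:      # no LLM cost for questions already curated
        return CURATED_ANSWERS[key]
    candidate = ask_llm(question)   # only genuinely new questions reach the LLM
    # A reviewer can later evaluate the candidate and promote it into CURATED_ANSWERS.
    return candidate
```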
