Releases: aryn-ai/sycamore
v0.1.17
This Sycamore release contains new writers to the Weaviate and Pinecone vector databases, enhancements to the demo UI, and numerous small features and bug fixes.
What's Changed
- Add Sycamore Partitioner example notebook by @jonfritz in #379
- Various link updates and typo fixes in docs by @hsm207 in #381
- Fix notebook link in docs. by @bsowell in #383
- SummarizeImage example in the SycamorePartitionerExample notebook. by @bsowell in #382
- Jonfritz patch 1 updated description by @jonfritz in #384
- Rename ...Request -> ...Call and ...Response -> ...Reply by @alexaryn in #385
- Responsive Demo UI by @sohamkasar19 in #386
- lineage 1/n: add support for metadata. by @eric-anderson in #387
- Set table object to None when no table is found. by @bsowell in #389
- Fix integration tests. by @eric-anderson in #390
- Add GPT-4o support. by @bsowell in #392
- Updates to Demo UI by @sohamkasar19 in #391
- Updates in demo ui for filtering by @sohamkasar19 in #394
- ensure unique uuids post explode via sequence numbers by @HenryL27 in #398
- Fix model deserialization error by @bohou-aryn in #395
- Convert map.py transforms over to base map. by @eric-anderson in #399
- Convert bbox_merge and mark_misc to new Map style by @eric-anderson in #402
- Add TimeTrace and instrument major pieces of code. by @alexaryn in #388
- Use proxy to provide default settings to the UI by @baitsguy in #404
- Update CONTRIBUTING.md to note that integration tests are currently broken. by @eric-anderson in #403
- FIX: Pdf viewer error by @sohamkasar19 in #406
- Convert Merge from Ray Actor to Ray Task. by @alexaryn in #401
- Tools to look at TimeTrace output. by @alexaryn in #396
- Switch Filter and Embed over to BaseMapTransform by @eric-anderson in #405
- Convert classes over to new *Map classes 3/n by @eric-anderson in #408
- Explicitly enumerate notebooks to automatically test. by @bsowell in #335
- Add timing for OpenSearch writer. by @alexaryn in #407
- Add weaviate writer by @HenryL27 in #400
- Switch from uuid1() to uuid4() in explode. by @bsowell in #410
- Add remote model server support by @MarkLindblad in #397
- Refactor drawing code to support additional formats. by @bsowell in #409
- Adjust assertion for batch_size resource_arg. by @bsowell in #411
- Convert over to *Map 4/n by @eric-anderson in #412
- Fix OOM on CPU by reducing default batch size by @MarkLindblad in #415
- Update poetry lock files by @eric-anderson in #414
- Fix bug in runtests, it always detected changes. Add --force to force tests by @eric-anderson in #413
- Conversion to base map 5/n: Partition by @eric-anderson in #416
- Remove generate_map_class_from_callable -- *Map 6/n by @eric-anderson in #417
- Fix bug again. Make JSON encoding work. by @alexaryn in #418
- TimeTrace: add RSS and improve usability by @alexaryn in #419
- Fix split_and_convert_to_image when some pages have no elements. by @bsowell in #422
- remove empty lists from documents in weaviate writer by @HenryL27 in #421
- Switch spread props & ndd to BaseMap -- *Map 7/n by @eric-anderson in #425
- Add show_pages pdf utility for visualizing pdf partitioning. by @bsowell in #424
- TimeTrace: Fallback to RUSAGE_SELF when RUSAGE_THREAD isn't present. by @bsowell in #426
- Switch extract_schema to BaseMap -- *Map 8/8 by @eric-anderson in #427
- fixed typo by @tranade in #420
- Remove base64 from model server response by @MarkLindblad in #423
- Pin setuptools so that we can run DETR model on GPU on Linux. by @alexaryn in #431
- Prettier output for timetrace files when using ttviz & ttanal by @alexaryn in #430
- Add Detr json serializability test by @MarkLindblad in #429
- add term_frequency transform by @HenryL27 in #432
- Pinecone connector by @HenryL27 in #435
- Soham demo UI updates by @sohamkasar19 in #428
- Enable table extraction to use GPU. by @alexaryn in #434
- Initial checkin of speed performance benchmarking script. by @alexaryn in #433
- Add pinned dependencies to the sycamore and rps repos. by @bsowell in #436
- Fix cross-test error by @eric-anderson in #439
- Force dependency consistency by @eric-anderson in #440
- Add utility code for working with tables and showing them as HTML. by @bsowell in #438
- Demo UI eslint prettier config by @sohamkasar19 in #437
- bump sycamore version to 0.1.17 by @HenryL27 in #443
New Contributors
- @MarkLindblad made their first contribution in #397
- @tranade made their first contribution in #420
Full Changelog: v0.1.16...v0.1.17
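The TimeTrace entry above (#426) mentions falling back to RUSAGE_SELF when RUSAGE_THREAD isn't present. As a rough illustration of that kind of fallback, and not the actual TimeTrace code, the following sketch uses Python's standard `resource` module. The function name `cpu_seconds` is hypothetical.

```python
import resource

# RUSAGE_THREAD is only defined on some platforms (e.g. Linux).
# Where it is missing, fall back to process-wide accounting.
# This is an illustrative sketch, not Sycamore's TimeTrace implementation.
RUSAGE_WHO = getattr(resource, "RUSAGE_THREAD", resource.RUSAGE_SELF)


def cpu_seconds() -> float:
    """Return user + system CPU time for the chosen scope (hypothetical helper)."""
    usage = resource.getrusage(RUSAGE_WHO)
    return usage.ru_utime + usage.ru_stime


if __name__ == "__main__":
    print(f"CPU seconds so far: {cpu_seconds():.6f}")
```

Resolving the constant once at import time keeps the per-call path free of platform checks.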
v0.1.16
This release contains support in the SycamorePartitioner for extracting table structure and images, as well as a new transform for summarizing images. It also includes a number of bug fixes and enhancements.
What's Changed
- fix ui error when no title is extracted and we're not in ntsb setting by @HenryL27 in #352
- Fix almost all the pyproject.toml and poetry.lock files to have consistent requirements on python dependencies. by @eric-anderson in #345
- Bind mount to convey SSL cert/key to Jupyter & UI by @alexaryn in #349
- Use real SSL certificate for OpenSearch HTTP. by @alexaryn in #353
- copy lib/poetry-lock into containers to make poetry happy by @HenryL27 in #354
- copy lib/poetry-lock into remote-processor-service too. by @HenryL27 in #355
- copy in all of poetry-lock, not just the pyproject files by @HenryL27 in #356
- Update data model for table structure recognition. by @bsowell in #357
- Put token-protected HTTPS proxy in front of UI proxy. by @alexaryn in #359
- Arxiv switched to HTTP for these PDFs; make it work. by @alexaryn in #360
- Add apt update to UI Dockerfiles. by @alexaryn in #361
- Use chown in our copy commands to make sure all files are owned by app by @eric-anderson in #362
- Add TableStructureExtractor interface and TableTransformer impl. by @bsowell in #358
- fix zsh path by @eric-anderson in #367
- Jupyter container improvements by @eric-anderson in #369
- Don't say localhost if it's not going to work. by @alexaryn in #366
- bump deploy timeout for reranking model from 60 to 120 by @HenryL27 in #363
- ingest all ntsb docs, automatically detect docker v not, spread path … by @HenryL27 in #368
- Fix typos in README by @hsm207 in #370
- Fix default prep script when given an empty directory to import by @HenryL27 in #371
- fix typo by @HenryL27 in #372
- Add the ability to summarize images to partitioned docsets. by @bsowell in #365
- Store element bbox as a tuple rather than BoundingBox. by @bsowell in #374
- Jonfritz patch 1 partition update by @jonfritz in #376
- FIX: Error on initiate conversation without a conversation id by @sohamkasar19 in #375
- Add API docs for the SycamorePartitioner and table extraction. by @bsowell in #373
- Fix malformed text from beautiful soup. by @bohou-aryn in #351
- Handle deserializing JSON documents when elements is None. by @bsowell in #377
- Bump sycamore version to 0.1.16 by @bsowell in #378
New Contributors
- @hsm207 made their first contribution in #370
- @sohamkasar19 made their first contribution in #375
Full Changelog: v0.1.15...v0.1.16
v0.1.15
This release adds support for writing DocSets to jsonl files as well as other incremental features and bug fixes.
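For readers unfamiliar with the format, JSONL stores one JSON object per line. The sketch below shows the general idea using only the Python standard library. It does not use Sycamore's writer API, and the helper name `dump_jsonl` and the sample records are hypothetical.

```python
import io
import json
from typing import Iterable, TextIO


def dump_jsonl(records: Iterable[dict], fp: TextIO) -> None:
    """Write each record as one JSON object per line (hypothetical helper)."""
    for record in records:
        fp.write(json.dumps(record))
        fp.write("\n")


if __name__ == "__main__":
    # Sample records are made up for illustration only.
    docs = [{"doc_id": "a", "text": "first"}, {"doc_id": "b", "text": "second"}]
    buffer = io.StringIO()
    dump_jsonl(docs, buffer)
    print(buffer.getvalue(), end="")
```

Because each line is an independent JSON document, such files can be appended to and read back line by line without loading the whole file.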
What's Changed
- Cache entire Amazon Textract response by @baitsguy in #333
- New query chosen in consultation with Mehul. by @alexaryn in #336
- Fix unit test mocking. by @alexaryn in #338
- Added ability to write JSONL block files. by @alexaryn in #337
- Fix bug in updating a single property and most workarounds for the bug. by @eric-anderson in #341
- Set RPS default version to follow VERSION again by @HenryL27 in #342
- Initial Container ITs by @HenryL27 in #339
- Force to opensearch V2.12.0.0 to make build work by @eric-anderson in #343
- minor fixups to NDD doc by @alexaryn in #346
- Better container integration testing automation by @HenryL27 in #344
- Update NDD notebook with JSON/PDF ingestion options. by @alexaryn in #347
- Bump sycamore version to v0.1.15 by @bsowell in #348
Full Changelog: v0.1.14...v0.1.15
v0.1.14
This release includes CPU support and OCR in the Sycamore Partitioner, caching for better performance and lower cost when using Textract for table extraction, an upgraded version of Ray (2.10), and more.
What's Changed
- mark rps version as latest rc by @HenryL27 in #291
- Cleanup rewriting - cloning doesn't work by @eric-anderson in #292
- Fix integ test import error. by @bsowell in #293
- Change notebook working directory when running outside container. by @bsowell in #294
- Fix bug in undocumented/untested prefix limiting feature. by @eric-anderson in #295
- Implement CachedTextractTableExtractor by @bohou-aryn in #288
- Upgrade the openai Python library to 1.x and guidance to 0.1.x. by @bsowell in #242
- Reorder partitioner output and fix model loading inefficiency by @bohou-aryn in #277
- Refactor sycamore to apps, lib by @HenryL27 in #296
- add averaged_perceptron_tagger to nltk downloads by @HenryL27 in #301
- fix jupyter bind mount path by @HenryL27 in #302
- Make sure filetype property is already set. by @eric-anderson in #298
- initialize messages index on startup by @HenryL27 in #303
- Add demo UI by @HenryL27 in #300
- Address HTML viewer bug when doing sycamore_crawler_http_sort_all by @alexaryn in #304
- Make SycamorePartitioner runnable on CPUs. by @bsowell in #299
- Get all the containers building and working again. by @eric-anderson in #305
- Switch from Exception to RuntimeError by @eric-anderson in #306
- remove submodule steps from plugin checkout in dockerfile because sub… by @HenryL27 in #309
- Fix dockerfile to work post merge by @eric-anderson in #310
- Add some documentation for NDD: Sketcher at ingestion time. by @alexaryn in #307
- Add sketch() after explode() in all our default pipelines. by @alexaryn in #312
- Add remote processor service by @HenryL27 in #311
- use ADD instead of RUN git clone to checkout git repos by @HenryL27 in #313
- Change from nmslib to faiss everywhere. by @alexaryn in #314
- Add tesseract-ocr to container dependencies. by @bsowell in #316
- compile docs with poetry by @HenryL27 in #317
- Add support for OCR in the Sycamore partitioner. by @bsowell in #315
- Setup query-time NDD: pre-create RPS processors, add to pipelines by @alexaryn in #318
- Changes needed for vanilla build of importer and RPS containers. by @alexaryn in #320
- Add shingles to _source to enable query-time near duplicate detection by @alexaryn in #321
- Fix importer to check for user, apply similar fix to crawlers by @eric-anderson in #322
- Remove obsolete files from the quickstart -> sycamore repo merge. by @eric-anderson in #283
- Upgrade to Ray 2.10.0. by @bsowell in #319
- Upgrade guidance to 0.1.13. by @bsowell in #323
- Remove mypy --explicit-package-bases flag and fix issues. by @bsowell in #324
- Update poetry.lock files based on recent sycamore dependency changes. by @bsowell in #325
- Deal with renamed file. by @alexaryn in #329
- Added -anon switch to S3 crawler for public buckets. by @alexaryn in #327
- add docs for RPS by @HenryL27 in #328
- Add Jupyter notebook to demonstrate query-time NDD. by @alexaryn in #326
- Expand NDD doc into separate file. by @alexaryn in #330
- Bump version to 0.1.14. by @bsowell in #332
- Add .profile to container so that we get poetry python not container python by @eric-anderson in #331
- Update dedup.md by @jonfritz in #334
Full Changelog: v0.1.13...v0.1.14
v0.1.13
This release upgrades the Sycamore docker containers to use OpenSearch 2.12 and adds support for SSL. It also includes significant additions to the Sycamore documentation (https://sycamore.readthedocs.io/), and a number of other features and bug fixes.
What's Changed
- Upgrade test workflow to os 2.10 by @baitsguy in #240
- Evaluation code by @baitsguy in #239
- Quickstart: use SSL for all network communication: OpenSearch and Jupyter by @alexaryn in #231
- Update get_started.md by @jonfritz in #241
- Jonfritz patch 1 by @jonfritz in #245
- Upgrade dependencies and address dependabot alerts. by @bsowell in #248
- Update documentation for DocSetWriter and DocSet.write by @bsowell in #246
- Added eval metrics and fixed bugs by @baitsguy in #247
- Fix examples to use https for links to UI by @eric-anderson in #244
- Upgrade opensearch to 2.12 by @HenryL27 in #249
- Straggler comment. by @alexaryn in #250
- Add debug facility to sycamore-opensearch.sh entrypoint script. by @alexaryn in #251
- Address two timing-related problems with SSL/security setup. by @alexaryn in #253
- Henry's fix to detect failure properly for model deployment. by @alexaryn in #254
- Enable DEBUG and NOEXIT environment variables. by @eric-anderson in #255
- Add NOEXIT functionality to die function. by @alexaryn in #257
- By default, disable SSL for Jupyter, to avoid browser cert complaints. by @alexaryn in #256
- Added some debug messages that were missing. by @alexaryn in #259
- Improve recall metrics by @baitsguy in #258
- Suppress parse error that we expect. by @alexaryn in #262
- Remove need for passwordless sudo to run the default import notebook. by @eric-anderson in #264
- Improve http integration test debugging. by @eric-anderson in #263
- Increase default model deployment stability by @HenryL27 in #260
- Modify supplement_text for integrating text from pdfminer by @bohou-aryn in #265
- register model if not found in setup transient by @HenryL27 in #266
- add documentation for reranking by @HenryL27 in #261
- move model and pipeline configurations to python by @HenryL27 in #268
- Bump opensearch ssl startup wait time to 30 tries. by @eric-anderson in #269
- Document map_batch by @eric-anderson in #267
- Cleanup metrics classes + bug fixes by @baitsguy in #271
- Fixes to documentation. gen script to auto-add transforms. by @alexaryn in #272
- Add .sketch() to DocSet to access Sketcher transform directly. by @alexaryn in #273
- Update hybrid_search.md by @jonfritz in #274
- Fix notebooks -- proper protocol, truncate output. by @eric-anderson in #275
- Update docs for schema and property extractors. by @bsowell in #270
- More SSL/container fixes. by @eric-anderson in #276
- Minor doc cleanup: removed not-checked-in files. by @alexaryn in #278
- default opensearch to x86 by @HenryL27 in #279
- build ml-commons locally with correct dependencies by @HenryL27 in #280
- Upgrade the Sycamore version to 0.1.13. by @bsowell in #281
- unset default opensearch platform by @HenryL27 in #282
- Fix bug in dev example. by @eric-anderson in #285
- Added documentation for SSL=1 and general SSL background. by @alexaryn in #286
- Update hardware.md by @jonfritz in #287
- build and install remote processor plugin in opensearch dockerfile by @HenryL27 in #284
- add rps to compose.yaml by @HenryL27 in #289
- lowercase d in docker compose command by @HenryL27 in #290
Full Changelog: v0.1.12...v0.1.13
v0.1.12
This release adds components to Sycamore to enable search and analytics use cases, beyond data preparation. Sycamore can now be deployed using Docker containers, and you can also download the Python libraries for data preparation. The documentation has also been updated to reflect this change in scope.
This release also has other features and bug fixes.
What's Changed
- Correctly handle OpenAI model fallback. by @bsowell in #205
- Upgrade Ray to 2.9.0. by @bsowell in #207
- Convert distance function from average to min. Tune parameters. by @alexaryn in #209
- add nltk download punkt action by @HenryL27 in #211
- Address boundary conditions of sliding window, small docs. Re-tuned. by @alexaryn in #210
- Augment text by @HenryL27 in #208
- Update JSON scan to use Ray JSON reader. by @bsowell in #215
- Add docker image prefix as sycamore-importer. by @bohou-aryn in #216
- Make Textract disabled by default by @bohou-aryn in #218
- Add Aryn trained DETR model for entity detection by @bohou-aryn in #212
- Add some useful utility methods for DocSets. by @bsowell in #217
- Element splitter to prevent text elements with too many tokens. by @alexaryn in #184
- Update to Sycamore documentation for consolidation by @jonfritz in #222
- Jonfritz patch 1 docs by @jonfritz in #224
- Metadata extraction updates by @baitsguy in #220
- Prepare for merge of quickstart into sycamore. by @eric-anderson in #225
- Merge quickstart into sycamore by @eric-anderson in #226
- Endless piles of reformatting to get checks to pass. by @eric-anderson in #227
- Use classic shingles; simplified implementation; added debug; re-tuned by @alexaryn in #214
- Update README.md by @jonfritz in #229
- Jonfritz patch 2 by @jonfritz in #228
- Fix sycamore importer service name by @bohou-aryn in #232
- Jonfritz patch 2 by @jonfritz in #235
- Fix bugs on Deformable-DETR by @bohou-aryn in #236
- Jonfritz patch 1 by @jonfritz in #233
- Create notebook file with default ingest script by @bohou-aryn in #219
- Fix typo in docs, and fix formatting by @eric-anderson in #237
- Bump version to v0.1.12. by @bsowell in #238
Full Changelog: v0.1.11...v0.1.12
v0.1.11
This release removes support for OpenAI's text-davinci-003 model, which will be deprecated on 1/4/23, and replaces it with gpt-3.5-turbo-instruct. All users of sycamore should upgrade.
What's Changed
- Migrate from text-davinci-003 to gpt-3.5-turbo-instruct. by @bsowell in #202
- Bump version to v0.1.11. by @bsowell in #203
Full Changelog: v0.1.10...v0.1.11
v0.1.10
This Sycamore release adds support for near duplicate detection via shingling. It also includes documentation improvements and incremental bug fixes.
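As background, shingling compares documents by the sets of overlapping word sequences they contain. The following is a minimal, generic sketch of word shingles with Jaccard similarity. It is not Sycamore's implementation; the function names and the shingle width `k=3` are assumptions chosen for illustration.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in text (illustrative, not Sycamore's code)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    first = shingles("the quick brown fox jumps")
    second = shingles("the quick brown fox leaps")
    # Two of the four distinct shingles are shared.
    print(jaccard(first, second))  # 0.5
```

A pipeline could then tag or drop documents whose similarity to an earlier document exceeds a chosen threshold; the threshold itself is a tuning decision.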
What's Changed
- Render schema extraction documentation by @mkyl in #194
- Additional documentation for Schema extraction by @mkyl in #195
- Add async-timeout dependency by @eric-anderson in #198
- Add docstrings to all public document methods so they show up on sycamore.readthedocs.io by @eric-anderson in #197
- Near-Duplicate Detection in Sycamore: Document Tagging and Document Dropping by @alexaryn in #199
- Bump version to v0.1.10. by @bsowell in #200
Full Changelog: v0.1.9...v0.1.10
v0.1.9
This Sycamore release adds improved heuristics for partitioning documents. It also includes a new method of automatically inferring entities to extract from unstructured documents, as well as incremental features and bug fixes.
What's Changed
- Change the default merge size to 256. by @eric-anderson in #178
- Simplify running the http crawler. by @eric-anderson in #180
- Fix text chunking for html importing to improve result quality. by @eric-anderson in #185
- Remove docker_compose and opensearch files. They were moved to quickstart. by @eric-anderson in #183
- Change simple_ingest and s3_ingest to use GTE-small embedding model. by @alexaryn in #169
- Remove unneeded mapping in OpenSearch index settings. by @alexaryn in #186
- Added HTML ingest example. Fixed order in S3 ingester. by @alexaryn in #188
- Simple transform to perform regex replacement on Elements. by @alexaryn in #187
- Update README.md by @jonfritz in #179
- Entity Extraction by @mkyl in #161
- Merging/breaking elements based on heuristics including bbox by @alexaryn in #171
- Update aiohttp and cryptography to address dependabot alerts. by @bsowell in #192
- Bump version to v0.1.9. by @bsowell in #191
Full Changelog: v0.1.8...v0.1.9
v0.1.8
This Sycamore release contains code to build Docker containers as well as small improvements and bug fixes.
What's Changed
- Add take_all operator on docsets. by @bsowell in #140
- Merge in crawler by @eric-anderson in #143
- Speed up 'poetry lock'. by @alexaryn in #147
- Merge after extract_entity so that elements don't exceed size limit. by @alexaryn in #148
- Add docker compose yaml files that run sycamore + crawler + arynai/opensearch + demoui by @eric-anderson in #145
- Add the scripts to dockerize opensearch in a way that works with the other sycamore components by @eric-anderson in #146
- Dockerize sycamore importing by @eric-anderson in #154
- Fix bug in dockerization from merge by @eric-anderson in #164
- Upgrade pyarrow version. by @bsowell in #165
- Bump version for 0.1.8 release. by @bsowell in #166
- Update README. by @austintlee in #167
- Mount data volume to demo-ui container by @pparmar30 in #170
- Lots of improvements to get sort benchmark working better by @eric-anderson in #172
- Move library dependencies back under [tool.poetry.dependencies] by @bsowell in #174
- Fixup dockerization -- skip sycamore library & add build-stamps by @eric-anderson in #175
- Update poetry.lock. by @bsowell in #176
- Update S3 crawler Dockerfile to skip library dependencies and add build stamps by @eric-anderson in #177
New Contributors
- @austintlee made their first contribution in #167
Full Changelog: v0.1.7...v0.1.8