nlm-ingestor
This repo provides the server-side code for the llmsherpa API to connect to. It includes parsers for various file formats.
Sorry if I'm completely missing it, but I noticed there is something in the code around chat:
https://github.com/tembo-io/pg_vectorize/blob/main/src/chat....
This would lead me to believe there is some way to invoke not just embeddings but also to query an LLM... which would be crazy powerful. Are there any examples of how to do this?
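For what it's worth, here's a speculative sketch of what calling that chat path might look like from Python. The function name and named arguments (vectorize.rag, agent_name, query) are assumptions based on the chat source file and Tembo's docs, so verify them against the repo:

    # Speculative: vectorize.rag and its named arguments are assumptions,
    # not verified against the repo.
    import psycopg  # psycopg 3

    with psycopg.connect("postgresql://postgres:postgres@localhost:5432/postgres") as conn:
        answer = conn.execute(
            "SELECT vectorize.rag(agent_name => %s, query => %s);",
            ("my_agent", "Which docs mention SSO?"),
        ).fetchone()[0]
        print(answer)  # the LLM's response, grounded in the embedded rows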
> tree-based approach to organize and summarize text data, capturing both high-level and low-level details.
https://twitter.com/parthsarthi03/status/1753199233241674040
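For anyone skimming, a rough sketch of that idea: embed leaf chunks, cluster them, summarize each cluster with an LLM, then recurse on the summaries. The embed() and summarize() helpers here are hypothetical stand-ins, and KMeans replaces the paper's Gaussian-mixture clustering to keep the sketch short:

    # Hedged sketch of the tree-building idea; embed() and summarize() are
    # hypothetical stand-ins for a real embedding model and LLM call, and
    # KMeans replaces the paper's GMM-based clustering.
    import numpy as np
    from sklearn.cluster import KMeans

    def embed(texts):
        # hypothetical embedding model; random vectors keep the sketch runnable
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(texts), 8))

    def summarize(texts):
        # hypothetical LLM summarization; naive truncation stands in
        return " ".join(texts)[:200]

    def build_tree(chunks, levels=2, k=2):
        layers = [chunks]
        for _ in range(levels):
            if len(layers[-1]) <= k:
                break
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(embed(layers[-1]))
            layers.append([
                summarize([c for c, lab in zip(layers[-1], labels) if lab == cluster])
                for cluster in range(k)
            ])
        return layers  # retrieval then searches across every layer, low- and high-level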
nlm-ingestor processes documents, organizing content and improving readability: it handles sections, paragraphs, links, tables, lists, and page continuations; removes redundancies and watermarks; and applies OCR, with additional support for HTML and other formats through Apache Tika:
https://github.com/nlmatics/nlm-ingestor
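If it helps to see the pieces together: run the nlm-ingestor server (e.g., its Docker image), then point llmsherpa's LayoutPDFReader at it. A minimal sketch; the local endpoint URL and method names follow the two projects' READMEs, so double-check them against current versions:

    # Minimal sketch: parse a PDF through a locally running nlm-ingestor
    # server via the llmsherpa client. Endpoint and method names are taken
    # from the projects' READMEs; verify against current versions.
    from llmsherpa.readers import LayoutPDFReader

    # nlm-ingestor's parser endpoint when its Docker image runs locally
    llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"

    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf("sample.pdf")  # hypothetical local file

    # iterate the layout-aware chunks (sections, paragraphs, tables, lists)
    for chunk in doc.chunks():
        print(chunk.to_context_text())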
There's an issue in the pgvector repo from someone with several ~10-20 million-row tables who got acceptable performance with the right hardware and some performance tuning: https://github.com/pgvector/pgvector/issues/455
I'm in the early stages of evaluating pgvector myself, but having used Pinecone, I currently like pgvector better because it is open source. The indexing algorithm is clear, and one can understand and modify the parameters. Furthermore, the database is PostgreSQL, not a proprietary document store; when the other data in the problem is stored relationally, it is very convenient to have the vectors stored that way as well. PostgreSQL also has good observability and metrics. When it comes to flexibility for specialized applications, pgvector seems like the clear winner. But I can definitely see Pinecone's appeal if vector search is not a core component of the problem/business, as it is very easy to use and scales easily.
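To make the "understand and modify the parameters" point concrete, here's a minimal sketch using pgvector's documented HNSW knobs (m and ef_construction at build time, hnsw.ef_search at query time); the table and column names are made up for illustration:

    # Minimal sketch of pgvector HNSW tuning; the items table and embedding
    # column are illustrative. m and ef_construction trade build time and
    # memory for recall; hnsw.ef_search trades query latency for recall.
    import psycopg

    with psycopg.connect("dbname=app") as conn:
        conn.execute(
            "CREATE INDEX IF NOT EXISTS items_embedding_idx "
            "ON items USING hnsw (embedding vector_cosine_ops) "
            "WITH (m = 16, ef_construction = 64);"
        )
        conn.execute("SET hnsw.ef_search = 100;")  # per-session recall/latency knob
        rows = conn.execute(
            "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 10;",
            ("[0.1, 0.2, 0.3]",),  # query vector as a pgvector text literal
        ).fetchall()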
I wrote a C# library to do this, similar to other common chunking approaches like the way LangChain does it: https://github.com/drittich/SemanticSlicer
Given a list of separators (regexes), it goes through them in order and keeps splitting the text by them until each chunk fits within the desired size. By putting the higher-level separators first (e.g., for HTML, splitting by heading tags before paragraph tags), it's a pretty good proxy for maintaining context.
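As a rough illustration of that strategy in Python (not the SemanticSlicer code itself):

    # Rough illustration of priority-ordered separator splitting; not the
    # SemanticSlicer implementation. Real chunkers also re-merge adjacent
    # small pieces up to the size budget, which is omitted here.
    import re

    # higher-level separators first; for HTML you might put tag-based
    # regexes such as r"(?=<h1)" ahead of these
    SEPARATORS = [r"\n\n", r"(?<=[.!?])\s+", r"\s+"]

    def slice_text(text, max_len, seps=SEPARATORS):
        if len(text) <= max_len:
            return [text]
        if not seps:
            # no separators left: hard-cut as a last resort
            return [text[i:i + max_len] for i in range(0, len(text), max_len)]
        head, *rest = seps
        chunks = []
        for piece in re.split(head, text):
            chunks.extend(slice_text(piece, max_len, rest))
        return [c for c in chunks if c]

    print(slice_text("First sentence. Second one.\n\nNew paragraph.", max_len=20))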