Show HN: Open-source ETL framework to sync data from SaaS tools to vector stores

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • sidekick

    Discontinued. Universal APIs for unstructured data: sync documents from SaaS tools to a SQL or vector database, where they can be easily queried by AI applications. [Moved to: https://github.com/psychic-api/psychic] (by ai-sidekick)

  • Hey Hacker News, we launched a few weeks ago as a GPT-powered chatbot for developer docs, and quickly realized that the value of what we’re doing isn’t the chatbot itself. Rather, it’s the time we save developers by automating the extraction of data from their SaaS tools (GitHub, Zendesk, Salesforce, etc.) and helping transform it into contextually relevant chunks that fit into GPT’s context window.

    A lot of companies are building prototypes with GPT right now, and they’re all using some combination of LangChain/LlamaIndex + Weaviate/Pinecone + GPT-3.5/GPT-4 as their stack for retrieval-augmented generation (RAG). This works great for prototypes, but what we learned is that as you scale your RAG app to more users and ingest more sources of content, it becomes a real pain to manage your data pipelines.
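
    The loop itself is simple, which is why prototypes come together so quickly. Here is a minimal, dependency-free sketch of it (embed() is a toy stand-in for a real embedding model, not any vendor’s API):

    ```python
    # A minimal, dependency-free sketch of the RAG loop described above.
    # embed() is a toy stand-in for a real embedding model, not any
    # vendor's API -- it exists only to make the demo runnable.
    import math

    def embed(text: str) -> list[float]:
        # Toy bag-of-letters vector, L2-normalized.
        vec = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                vec[ord(ch) - ord("a")] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def cosine(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))  # vectors are pre-normalized

    # 1. Ingest: chunk the source docs and index each chunk's embedding.
    chunks = [
        "Install the CLI with pip install bentoml.",
        "Deploy a model with bentoml serve.",
    ]
    index = [(c, embed(c)) for c in chunks]

    # 2. Retrieve: rank indexed chunks against the user's query.
    query = "how do I install it?"
    qvec = embed(query)
    best_chunk, _ = max(index, key=lambda item: cosine(qvec, item[1]))

    # 3. Generate: hand the best chunk(s) to the LLM as context.
    prompt = f"Answer using this context:\n{best_chunk}\n\nQuestion: {query}"
    print(prompt)  # in a real app this prompt goes to GPT-3.5/GPT-4
    ```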

    For example, if you want to ingest your developer docs, process them into chunks of <500 tokens, and add those chunks to a vector store, you can build a prototype with LangChain fairly quickly. However, if you want to deploy it to customers like we did for BentoML ([https://www.bentoml.com/](https://www.bentoml.com/)), you’ll quickly realize that a naive chunking method that splits by character/token leads to poor results, and that “delete and re-vectorize everything” when the source docs change doesn’t scale as a data-synchronization strategy.
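
    One common way out of “delete and re-vectorize everything” is to diff by content hash and only re-embed chunks that actually changed. A rough sketch, assuming a hypothetical vector-store client with upsert/delete calls (real stores like Pinecone and Weaviate expose similar operations):

    ```python
    # Sketch of hash-based incremental sync: diff freshly-crawled chunks
    # against what's already indexed and only touch what changed. `store`
    # is a hypothetical client with upsert(id, text) and delete(ids).
    import hashlib

    def chunk_id(source: str, text: str) -> str:
        # Deterministic id: same source + same content -> same id, so an
        # edited chunk surfaces as one delete plus one insert.
        return hashlib.sha256(f"{source}:{text}".encode()).hexdigest()

    def sync(source: str, new_chunks: list[str],
             indexed_ids: set[str], store) -> tuple[int, int]:
        fresh = {chunk_id(source, c): c for c in new_chunks}
        to_add = [cid for cid in fresh if cid not in indexed_ids]    # new or edited
        to_del = [cid for cid in indexed_ids if cid not in fresh]    # removed or edited
        for cid in to_add:
            store.upsert(cid, fresh[cid])  # embed and write only what changed
        if to_del:
            store.delete(to_del)
        return len(to_add), len(to_del)
    ```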

    We took the code we used to build chatbots for our early customers and turned it into an open-source framework for rapidly building new data Connectors and Chunkers. This way developers can use community-built Connectors and Chunkers to start running vector searches on data from any source in a matter of minutes, or write their own in a matter of hours.
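
    To make that concrete, here is roughly what a Connector/Chunker pair could look like. The class names and signatures below are illustrative assumptions rather than the framework’s actual interfaces; the repo defines the real contracts:

    ```python
    # Illustrative Connector/Chunker plugin pair. These names and
    # signatures are assumptions for the sake of example, not sidekick's
    # actual interfaces.
    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    @dataclass
    class Document:
        uri: str
        content: str

    class Connector(ABC):
        @abstractmethod
        def load(self) -> list[Document]:
            """Pull raw documents from one source (a website, a GitHub repo, ...)."""

    class Chunker(ABC):
        @abstractmethod
        def chunk(self, doc: Document) -> list[str]:
            """Split a document into retrieval-sized pieces (e.g. <500 tokens)."""

    class MarkdownHeadingChunker(Chunker):
        """Split on headings so chunks follow document structure rather
        than arbitrary character counts -- the failure mode called out above."""
        def chunk(self, doc: Document) -> list[str]:
            sections, current = [], []
            for line in doc.content.splitlines():
                if line.startswith("#") and current:
                    sections.append("\n".join(current))
                    current = []
                current.append(line)
            if current:
                sections.append("\n".join(current))
            return sections
    ```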

    Here’s a video demo: [https://youtu.be/I2V3Cu8L6wk](https://youtu.be/I2V3Cu8L6wk)

    Here’s a repo with instructions on how to get started: [https://github.com/ai-sidekick/sidekick](https://github.com/ai-sidekick/sidekick)

    You can set up API endpoints to load, chunk, and vectorize data all in one step. Right now it only works with websites and GitHub repos, but we’ll be adding Zendesk and Google Drive integrations soon too.
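
    For a sense of the developer experience, a call to such an endpoint might look something like this. The path, port, and payload here are assumptions for illustration only; the repo’s README documents the real API:

    ```python
    # Hypothetical shape of a one-step load/chunk/vectorize call. The
    # endpoint path, port, and payload fields are illustrative
    # assumptions, not the documented sidekick API.
    import requests

    resp = requests.post(
        "http://localhost:8000/upsert",  # assumed local deployment URL
        json={
            "source_type": "github",     # or "website" -- the two live connectors
            "url": "https://github.com/ai-sidekick/sidekick",
        },
    )
    print(resp.json())  # e.g. a count of chunks written to the vector store
    ```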


Related posts

  • Plaid-like SDK to connect to your users' knowledge bases (Notion, GDrive, etc.)

    1 project | news.ycombinator.com | 5 May 2023
  • Why it finally makes sense to build, not buy, a workplace search product

    3 projects | news.ycombinator.com | 26 Apr 2023
  • Sync data from SaaS tools to a vector database automatically

    1 project | /r/freesoftware | 13 Apr 2023
  • We made a free tool to sync data from Zendesk, Google Docs, or Confluence to a vector database

    1 project | /r/GPT3 | 12 Apr 2023
  • 31-Mar-2023

    3 projects | /r/dailyainews | 11 Apr 2023