paperetl
dagster
Our great sponsors
paperetl | dagster | |
---|---|---|
12 | 46 | |
302 | 9,939 | |
5.0% | 4.7% | |
6.3 | 10.0 | |
4 months ago | 5 days ago | |
Python | Python | |
Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
paperetl
- Show HN: Open-source Rule-based PDF parser for RAG
-
Oracle of Zotero: LLM QA of Your Research Library
Nice project!
I've spent quite a lot of time in the medical/scientific literature space. With regards to LLMs, specifically RAG, how the data is chunked is quite important. With that, I have a couple projects that might be beneficial additions.
paperetl (https://github.com/neuml/paperetl) - supports parsing arXiv, PubMed and integrates with GROBID to handle parsing metadata and text from arbitrary papers.
paperai (https://github.com/neuml/paperai) - builds embeddings databases of medical/scientific papers. Supports LLM prompting, semantic workflows and vector search. Built with txtai (https://github.com/neuml/txtai).
While arbitrary chunking/splitting can work, I've found that integrating parsing that has knowledge of medical/scientific paper structure increases the overall accuracy and experience of downstream applications.
-
[P] Parse research papers into structured data
paperai | paperetl
-
Seeking Advice: How to extract Abstract from scientific journals (.pdfs) 10k+.
paperai and paperetl are a set of projects to consider for this task.
- paperetl: ETL processes for medical and scientific papers
dagster
-
The Dagster Master Plan
I found this example that helped me - https://github.com/dagster-io/dagster/tree/master/examples/project_fully_featured/project_fully_featured
In the meantime, we're collecting solutions and use cases in our GitHub Discussions, and you're welcome to ask any specific questions in there!
-
What are some open-source ML pipeline managers that are easy to use?
I would recommend the following: - https://www.mage.ai/ - https://dagster.io/ - https://www.prefect.io/ - https://metaflow.org/ - https://zenml.io/home
-
Best Orchestration Tool to run dbt projects?
Dagster seemed really cool when I looked into it as an alternative to airflow. I especially like the software defined assets and built-in lineage which I haven't seen in any other tool. However it seems it does not support RBAC which is a pretty big issue if you want a self-service type of architecture, see https://github.com/dagster-io/dagster/issues/2219. It does seem like it's available in their hosted version, but I wanted to run it myself on k8s.
-
dbt Cloud Alternatives?
Dagster? https://dagster.io
-
What's the best thing/library you learned this year ?
One that I haven't seen on here yet: dagster
- Can we take a moment to appreciate how much of dataengineering is open source?
-
Dagger Python SDK: Develop Your CI/CD Pipelines as Code
I wondered how it related to https://dagster.io/
-
Data Engineer Github Profile?
You can find all current, closed, and resolved issues on the “Issues” section and explore them using filters: eg issues for dagster. Look into some of the issues and feel free to ask a question or post your idea: it’s much less toxic here (compared to SO, for example).
-
[D] Should I go with Prefect, Argo or Flyte for Model Training and ML workflow orchestration?
You could also consider Dagster, which aims to improve Apache Airflow's shortcomings. Also, take a look at MyMLOps, where you can get a quick overview of open-source orchestration tools.
What are some alternatives?
Prefect - The easiest way to build, run, and monitor data pipelines at scale.
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Mage - 🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
MLflow - Open source platform for the machine learning lifecycle
meltano
OpenLineage - An Open Standard for lineage metadata collection
streamlit - Streamlit — A faster way to build and share data apps.
ploomber - The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
superset - Apache Superset is a Data Visualization and Data Exploration Platform
fastapi - FastAPI framework, high performance, easy to learn, fast to code, ready for production
hashi-ui - A modern user interface for @hashicorp Consul & Nomad