-
For PDFs, it's entirely a wrapper around PDFMiner (https://pdfminersix.readthedocs.io/en/latest/tutorial/highle...); see https://github.com/microsoft/markitdown/blob/main/src/markit...
So if that's your use case, PDFMiner might be better to integrate with directly!
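For anyone who wants to try that, here is a minimal sketch of calling PDFMiner directly, assuming the `pdfminer.six` package is installed. `extract_text` from `pdfminer.high_level` is the library's high-level entry point; the `pdf_to_text` helper, its `extractor` parameter, and the whitespace cleanup are my own illustration, not anything markitdown does.

```python
from typing import Callable, Optional


def pdf_to_text(path: str, extractor: Optional[Callable[[str], str]] = None) -> str:
    """Extract text from a PDF and tidy it up a little (hypothetical helper)."""
    if extractor is None:
        # Imported lazily so this module still loads without pdfminer.six.
        from pdfminer.high_level import extract_text
        extractor = extract_text

    raw = extractor(path)

    # Assumption: pages are separated by form feeds in the extracted text.
    pages = raw.split("\f")
    cleaned = [
        "\n".join(line.rstrip() for line in page.splitlines()).strip()
        for page in pages
    ]
    # Drop empty pages and separate the rest with a blank line.
    return "\n\n".join(page for page in cleaned if page)
```

Usage would be `print(pdf_to_text("report.pdf"))`, where `report.pdf` is a placeholder path. The `extractor` argument is only there so the cleanup logic can be exercised without a real PDF.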
-
Quite curious how this compares to docling - https://github.com/DS4SD/docling
docling uses an LLM IIRC, so that's already a difference in approach
-
Pandoc (https://pandoc.org) can convert a .docx file to Markdown and to other formats such as Djot and Typst. I don't think Pandoc can convert PowerPoint and Excel files.
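A rough sketch of driving that conversion from a script, assuming the `pandoc` binary is on your PATH and is recent enough to know the `djot` and `typst` output formats. The helper names and the output-extension mapping are made up for illustration; only the `--from`, `--to`, and `--output` options come from Pandoc itself.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Hypothetical mapping from Pandoc output format to a file extension.
EXTENSIONS = {"markdown": ".md", "djot": ".dj", "typst": ".typ"}


def pandoc_command(src: str, to: str = "markdown") -> List[str]:
    """Build the pandoc command line for converting a .docx file."""
    out = Path(src).with_suffix(EXTENSIONS.get(to, "." + to))
    return ["pandoc", src, "--from", "docx", "--to", to, "--output", str(out)]


def convert_docx(src: str, to: str = "markdown") -> None:
    """Run pandoc, failing loudly if it is missing or the conversion errors."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(src, to), check=True)
```

For example, `convert_docx("notes.docx", to="typst")` would write `notes.typ` next to the input, with `notes.docx` being a placeholder file name.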
-
Awesome-Tabular-LLMs
We collect papers about "large language models (LLM) for table-related tasks", e.g., using LLMs for the Table QA task. A curated collection of "Table + LLM" papers.
This is an active area of research: https://github.com/SpursGoZmy/Awesome-Tabular-LLMs is a good starting point!
-
vim-office
read common binary files, such as PDFs and those of Microsoft Office or LibreOffice, in Vim
Looking at its [source], it indeed seems to be a wrapper around Python variants of those. Making the pool smaller can hardly improve the output.
[here] https://github.com/Konfekt/vim-office
-
And the core code mostly calls other libraries for the heavy lifting, e.g. `mammoth`: https://github.com/mwilliamson/python-mammoth
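To give a feel for how thin such a wrapper can be, here is a sketch of the .docx step, assuming the `mammoth` package is installed. `mammoth.convert_to_html` takes a binary file object and returns a result whose `.value` holds the generated HTML; the `docx_to_html` helper and its `convert` parameter are hypothetical, and this is not a copy of markitdown's code.

```python
from typing import Any, BinaryIO, Callable, Optional


def docx_to_html(stream: BinaryIO, convert: Optional[Callable[[Any], Any]] = None) -> str:
    """Convert an open .docx stream to HTML by delegating to mammoth."""
    if convert is None:
        # Imported lazily so this module still loads without mammoth.
        import mammoth
        convert = mammoth.convert_to_html

    result = convert(stream)
    # The result also carries conversion messages; this sketch ignores them.
    return result.value
```

Usage would look like `with open("letter.docx", "rb") as f: html = docx_to_html(f)`, with `letter.docx` as a placeholder. Turning that HTML into Markdown would be a separate step.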