This is missing the "pdfsizeopt" suite, which bundles statically compiled utilities for reducing PDF size.
Static compilation means it will run on most Linux platforms without any extra software installed.
I believe one of its features removes unused characters from embedded fonts.
It really is quite impressive.
> Would love to find a cheaper (local) option vs AWS
How about Tesseract (https://github.com/tesseract-ocr/tesseract)?
There’s even a library for PHP (https://github.com/thiagoalessio/tesseract-ocr-for-php). I haven’t used it. I did use Python's pytesseract, and it works fairly well.
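For anyone who wants to try this locally, here is a minimal sketch that shells out to the Tesseract command-line tool from Python. It assumes the `tesseract` binary is installed and on your PATH. The function names `build_tesseract_command` and `ocr_image` are my own, not part of Tesseract or pytesseract, and pytesseract itself is not used here.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for a Tesseract CLI call (hypothetical helper)."""
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Return recognized text, or None if the tesseract binary is not installed."""
    # Assumption: tesseract is discoverable on PATH; otherwise give up quietly.
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout
```

Calling `ocr_image("scan.png")` (a placeholder path) should give you the text of that image as a string. Keeping the command in its own function makes it easy to check the arguments without having Tesseract installed.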
PDFBox can do this. It’s not part of the CLI but it wouldn’t be too hard to add:
https://github.com/apache/pdfbox/blob/5b00807463279f1002e245...
This tool might be helpful for comparing PDFs: https://github.com/serhack/pdf-diff
I'd like to add this tool to the list: https://pdfsam.org/