Smile
grobid
Metric | Smile | grobid
---|---|---
Mentions | 8 | 11
Stars | 5,904 | 2,958
Growth | - | -
Activity | 9.0 | 9.3
Last commit | 6 days ago | 16 days ago
Language | Java | Java
License | GNU General Public License v3.0 or later | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Smile
- Just want to vent a bit
Although it may be a bit more work, you can do both machine learning and AI in Java. If you are doing deep learning, you can use DeepJavaLibrary (I do work on this one at Amazon). If you are looking for other ML algorithms, I have seen Smile, Tribuo, or some around Spark.
- Anybody here using Java for machine learning?
For deploying a trained model there are a bunch of options that use Java on top of some native runtime, like TF-Java (which I co-lead) and ONNX Runtime; PyTorch also has inference for TorchScript models. Training deep learning models is harder, though you can do it for some of them in DJL. Training more standard ML models is much simpler, either via Tribuo, by using things like LibSVM & XGBoost directly, or with other libraries like SMILE or WEKA.
- What libraries do you use for machine learning and data visualizing in scala?
I use Smile (https://github.com/haifengl/smile) with Ammonite and it feels pretty easy and pleasant to work with. Of course, for purely looking at and exploring data, you're not going to beat Python.
- Python VS Scala
Actually, it does. Scala has Spark for data science and some ML libs like Smile.
- Machine learning on JVM
I used Smile for a while - https://haifengl.github.io/ - it's a fairly small, lightweight Java lib with some very basic algorithms. I used it in particular for clustering. It also provides a Scala API.
grobid
- Show HN: Open-source Rule-based PDF parser for RAG
- How to ingest image based PDFs into private GPT model?
- 🥪 Best Sites For ebooks, articles, research papers etc..🥪
- Free/open-source alternatives to Connected Papers...?
- Seeking Advice: How to extract Abstract from scientific journals (.pdfs) 10k+.
Just use science-parse or GROBID. They have been designed for that exact reason.
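As a rough illustration of the GROBID route, the sketch below posts one PDF at a time to a GROBID server and prints the returned header XML, which is where the abstract lives. It assumes a server is already running locally on the default port 8070 and uses the `/api/processHeaderDocument` endpoint with the PDF in a multipart field named `input`, as described in GROBID's service documentation; check these against the version you run. The class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class GrobidHeaderClient {
    // Assumption: GROBID runs locally on its default port.
    static final String GROBID_URL = "http://localhost:8070/api/processHeaderDocument";

    /** Builds a multipart/form-data body with the PDF in a single "input" part. */
    static byte[] multipartBody(String boundary, String fileName, byte[] pdfBytes) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(pdfBytes);
        out.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Sends one PDF to GROBID and returns the XML describing its header. */
    static String headerTei(HttpClient client, Path pdf) throws IOException, InterruptedException {
        String boundary = "grobid-" + System.nanoTime();
        byte[] body = multipartBody(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest.newBuilder(URI.create(GROBID_URL))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .header("Accept", "application/xml") // ask for TEI XML rather than BibTeX
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("GROBID returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Each command-line argument is treated as the path to one PDF.
        for (String arg : args) {
            System.out.println(headerTei(client, Path.of(arg)));
        }
    }
}
```

For 10k+ files, a loop like the one in `main` would probably need retries and some parallelism; GROBID also publishes its own client libraries, which may be a better fit than hand-rolled HTTP.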
- Project to rebuild papers with plaintext markup languages
I ended up using Grobid, which converts the PDF to a very detailed XML format. It is not a word processing format, though, but a format specifically for representing scientific documents. I don't know whether it would, for example, contain tags for bold or italicized text. The tool works really well, but since you probably cannot use the output XML directly, it will need some postprocessing, which should be relatively simple with XML parsing libraries.
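The postprocessing step mentioned above can be sketched with the XML tools that ship with the JDK. GROBID's output is TEI XML in the `http://www.tei-c.org/ns/1.0` namespace; the sample document, class name, and helper methods below are hypothetical and heavily simplified, so treat this as a starting point, not a description of the full format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TeiPostprocessor {
    // Hypothetical, heavily simplified sample; real GROBID output has many more elements.
    static final String SAMPLE_TEI =
            "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\"><teiHeader>"
          + "<fileDesc><titleStmt><title>A Sample Paper</title></titleStmt></fileDesc>"
          + "<profileDesc><abstract><p>We study  a sample\n problem.</p></abstract></profileDesc>"
          + "</teiHeader></TEI>";

    /** Parses an XML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Returns the whitespace-normalized text of the first element matching a
     * path such as "titleStmt/title". Matching on local-name() sidesteps the
     * need to register the TEI namespace with the XPath engine.
     */
    static String text(Document doc, String elementPath) {
        StringBuilder expr = new StringBuilder("/");
        for (String name : elementPath.split("/")) {
            expr.append("/*[local-name()='").append(name).append("']");
        }
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("normalize-space(" + expr + ")", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad element path: " + elementPath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_TEI);
        System.out.println(text(doc, "titleStmt/title"));      // prints "A Sample Paper"
        System.out.println(text(doc, "profileDesc/abstract")); // prints "We study a sample problem."
    }
}
```

The same `text` helper can be pointed at other parts of the document by changing the element path.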
- [D] What pdf parser do you use for paragraph parsing for huggingface models
A few years ago I evaluated a few open source tools. In the end I focused on GROBID. As usual, whether it works well for your use case depends on the type of document. There is some focus on it being "fast" (if that is a concern).
- Grobid: Machine learning for extracting information from scholarly documents
What are some alternatives?
Apache Spark - A unified analytics engine for large-scale data processing
Parsr - Transforms PDF, Documents and Images into Enriched Structured Data
Deeplearning4j - Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.
Weka
Breeze - Breeze is a numerical processing library for Scala.
CERMINE - Content ExtRactor and MINEr
Apache Flink
ND4S - N-Dimensional Arrays for Scala. Scientific Computing a la Numpy. Based on ND4J.
tensorflow-keras-scala - Scala-based Keras API for the Java bindings to TensorFlow. Mirror of https://codeberg.org/sciss/tensorflow-keras-scala
Apache Mahout - Mirror of Apache Mahout
JSAT - Java Statistical Analysis Tool, a Java library for Machine Learning
H2O - Sparkling Water provides H2O functionality inside a Spark cluster