| | dblink | SynapseML |
|---|---|---|
| Mentions | 1 | 18 |
| Stars | 54 | 4,970 |
| Growth | - | 0.2% |
| Activity | 0.0 | 9.0 |
| Last commit | almost 3 years ago | 5 days ago |
| Language | Scala | Scala |
| License | GNU General Public License v3.0 or later | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
dblink
- [D] Machine Learning and "Record Linkage"
  Fellegi-Sunter is the baseline model in record linkage research. It is implemented in R in fastLink and RecordLinkage, but you will need training data. There are some other options, e.g. dblink, that use Bayesian methods and a latent variable setup, so you don’t need training data.
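For readers unfamiliar with the Fellegi-Sunter model mentioned above, the core idea can be sketched in a few lines: each compared field contributes a log-likelihood-ratio weight based on `m` (the probability that the field agrees for a true match) and `u` (the probability that it agrees for a non-match), and the weights are summed into a match score. This is only a minimal illustration, not the code used by fastLink, RecordLinkage, or dblink; the field names and the `m`/`u` values are made up for the example, and in practice they would be estimated from data.

```python
import math


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log2 likelihood-ratio weight contributed by one comparison field.

    m = P(field agrees | records are a true match)
    u = P(field agrees | records are not a match)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def match_weight(agreement: dict, params: dict) -> float:
    """Sum the per-field weights, treating fields as conditionally independent."""
    return sum(field_weight(agreement[f], *params[f]) for f in agreement)


# Hypothetical (m, u) values chosen for illustration only.
PARAMS = {
    "surname": (0.9, 0.1),
    "birth_year": (0.8, 0.2),
}

score_all_agree = match_weight({"surname": True, "birth_year": True}, PARAMS)
score_all_disagree = match_weight({"surname": False, "birth_year": False}, PARAMS)
```

A record pair with a high total weight is treated as a likely match and one with a low weight as a likely non-match; the thresholds between those regions are a separate modelling choice not shown here.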
SynapseML
- FLaNK Stack Weekly for 12 September 2023
- Microsoft announces new tool for applying ChatGPT and GPT-4 at massive scales
  Release Notes: https://github.com/microsoft/SynapseML/releases/tag/v0.11.0
- Data science in Scala
  b) There are libraries around, e.g. Microsoft SynapseML, LinkedIn Photon ML
- [N] Microsoft Announces New Integrations with OpenAI and MLFlow
- [N] Microsoft Releases new Integrations with OpenAI and MLflow as part of SynapseML
- [P] Microsoft releases SynapseML v0.9.5 with support for speech synthesis, anomaly detection, and geospatial analytics on large-scale data
  Link to Release Notes: https://github.com/microsoft/SynapseML/releases/tag/v0.9.5
- Microsoft releases SynapseML v0.9.5 for distributed geospatial analytics, speech synthesis, and anomaly detection in PySpark.
- [P] SynapseML v0.9.5 announces support for geospatial analytics, speech synthesis, and anomaly detection on large-scale datasets
- Microsoft releases SynapseML v0.9.5 with support for speech synthesis, anomaly detection, and geospatial analytics on Apache Spark
What are some alternatives?
entity-embed - PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
mmlspark - Simple and Distributed Machine Learning [Moved to: https://github.com/microsoft/SynapseML]
splink - Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
isolation-forest - A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.
Tensorflow_scala - TensorFlow API for the Scala Programming Language
sparkMeasure - This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
deequ - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
delight - A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
Breeze - Breeze is a numerical processing library for Scala.
azure-kusto-spark - Apache Spark Connector for Azure Kusto
cobrix - A COBOL parser and Mainframe/EBCDIC data source for Apache Spark