- wikipedia-changes: Repository for the post and talk "Hazelcast + Kibana: best buddies for exploring and visualizing data"
- okhttp-eventsource: Server-sent events (SSE) client implementation for Java, based on OkHttp: http://javadoc.io/doc/com.launchdarkly/okhttp-eventsource
- lingua: The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Now is the time to create a data pipeline to get this data into Hazelcast. If you want to follow along, the project is readily available on GitHub.
Wikipedia provides changes through Server-Sent Events (SSE). In short, with SSE, you register a client with the endpoint, and every time new data comes in, you are notified and can act accordingly. On the JVM, a couple of SSE-compatible clients are available, including Spring WebClient. I chose OkHttp EventSource instead because it's lightweight: it only depends on OkHttp, and its usage is relatively straightforward.
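To make the "you are notified every time new data comes in" part concrete, here is a minimal sketch of how an SSE stream is framed on the wire and turned into callbacks. It is not the OkHttp EventSource API and not the code from the project; it uses only the JDK, and the class name `SseSketch`, the method `parse`, and the sample payload are hypothetical names chosen for illustration. A real client such as OkHttp EventSource also handles the HTTP connection, reconnection, and event IDs, which this sketch leaves out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;

// Hypothetical helper: parses the text/event-stream framing and notifies a handler.
class SseSketch {

    /**
     * Reads the stream line by line and calls handler(eventName, data)
     * once for every complete event, i.e. every time a blank line is reached.
     */
    static void parse(BufferedReader reader, BiConsumer<String, String> handler) {
        String name = "message";               // default event type in the SSE format
        StringBuilder data = new StringBuilder();
        boolean hasData = false;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    // A blank line terminates the current event: dispatch it if it carried data.
                    if (hasData) {
                        handler.accept(name, data.toString());
                    }
                    name = "message";
                    data.setLength(0);
                    hasData = false;
                } else if (line.startsWith(":")) {
                    // Lines starting with a colon are comments (often keep-alives): ignore them.
                } else {
                    int colon = line.indexOf(':');
                    String field = colon < 0 ? line : line.substring(0, colon);
                    String value = colon < 0 ? "" : line.substring(colon + 1);
                    if (value.startsWith(" ")) {
                        value = value.substring(1); // a single leading space is not part of the value
                    }
                    if (field.equals("event")) {
                        name = value;
                    } else if (field.equals("data")) {
                        if (hasData) {
                            data.append('\n');      // several data lines are joined with newlines
                        }
                        data.append(value);
                        hasData = true;
                    }
                    // Other fields (id, retry) are skipped in this sketch.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample stream; a real one would come from the HTTP response body.
        String stream = "data: {\"title\":\"Hazelcast\"}\n\n: keep-alive\n\n";
        parse(new BufferedReader(new StringReader(stream)),
              (name, data) -> System.out.println(name + " -> " + data));
    }
}
```

The handler passed to `parse` plays the role that a message callback plays in a full SSE client: the pipeline code only has to decide what to do with each event's data.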
A linguist can infer the language of the field. It's also possible to use an automated process in the pipeline. A couple of NLP libraries are available in the JVM ecosystem, but I set my sights on Lingua, which focuses on language recognition.
Related posts
- Announcing Lingua 1.2.0 - The most accurate natural language detection library for the JVM, suitable for long and short text alike
- r/argentina is the most popular Spanish-language subreddit on the site
- The most popular languages on Reddit, after analyzing 1M comments: English, German, Spanish, Portuguese, French, Italian, Romanian, Dutch... [OC]
- Using the Lingua Library for Kotlin
- Language Detection - Pre Trained Models