|7 days ago||2 days ago|
|MIT License||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Critical New 0-day Vulnerability in Popular Log4j Library - List of applications
12 projects | dev.to | 13 Dec 2021
Kafka Connect CosmosDB : https://github.com/microsoft/kafka-connect-cosmosdb/blob/0f5d0c9dbf2812400bb480d1ff0672dfa6bb56f0/CHANGELOG.md
Getting started with Kafka Connector for Azure Cosmos DB using Docker
6 projects | dev.to | 6 Jul 2021
The Azure Cosmos DB connector allows you to move data between Azure Cosmos DB and Kafka. It’s available as a source as well as a sink. The Azure Cosmos DB Sink connector writes data from a Kafka topic to an Azure Cosmos DB container and the Source connector writes changes from an Azure Cosmos DB container to a Kafka topic. At the time of writing, the connector is in pre-production mode. You can read more about it on the GitHub repo or install/download it from the Confluent Hub.
gRPC on the client side
9 projects | dev.to | 16 Mar 2023
Other serialization alternatives have a schema validation option: e.g., Avro, Kryo and Protocol Buffers. Interestingly enough, gRPC uses Protobuf to offer RPC across distributed components.
Understanding Azure Event Hubs Capture
2 projects | dev.to | 22 Dec 2022
Apache Avro is a data serialization system. For more information, visit the Apache Avro site.
In One Minute : Hadoop
10 projects | dev.to | 21 Nov 2022
Avro, a data serialization system based on JSON schemas.
Protocol Buffers vs. JSON for data serialization
6 projects | dev.to | 13 Aug 2022
Marshaling objects in modern Java
4 projects | reddit.com/r/java | 23 Jun 2022
If a binary format is OK, use Protocol Buffers or Avro. Note that with binary formats, you need a schema to serialize and deserialize your data. Therefore, you'd probably want a schema registry to store all past and present schemas for later use.
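The idea of keeping every past and present schema around can be sketched in a few lines. This is a toy illustration only: `REGISTRY`, `serialize`, and `deserialize` are hypothetical names, and `struct` layouts stand in for real Avro or Protobuf schemas. Each message is prefixed with a schema id so a reader can look up the layout the writer used.

```python
import struct

# Hypothetical in-memory registry: schema id -> (struct layout, field names).
# A real registry would store Avro/Protobuf schemas; struct layouts stand in here.
REGISTRY = {
    1: ("<i", ["user_id"]),             # version 1: one 4-byte int
    2: ("<id", ["user_id", "score"]),   # version 2: adds an 8-byte double
}

def serialize(schema_id, record):
    """Pack the record with the given schema and prefix it with the schema id."""
    layout, fields = REGISTRY[schema_id]
    payload = struct.pack(layout, *(record[name] for name in fields))
    # 2-byte unsigned schema id, then the payload.
    return struct.pack("<H", schema_id) + payload

def deserialize(data):
    """Read the schema id, look up the layout, and unpack the payload."""
    (schema_id,) = struct.unpack_from("<H", data, 0)
    layout, fields = REGISTRY[schema_id]
    values = struct.unpack_from(layout, data, 2)
    return dict(zip(fields, values))

old_message = serialize(1, {"user_id": 42})
new_message = serialize(2, {"user_id": 42, "score": 9.5})
```

Because the registry still holds version 1, `deserialize(old_message)` keeps working after version 2 is introduced, which is the reason for storing old schemas.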
How-to-Guide: Contributing to Open Source
19 projects | reddit.com/r/dataengineering | 11 Jun 2022
How should I handle storing and reading from large amounts of data in my project?
2 projects | reddit.com/r/rust | 10 Apr 2022
Maybe it will be simpler to serialise all the data in a more compact data format, such as Avro (its readme is in here), a row-based format that seems to be able to use zstd/bzip/xz.
2 projects | dev.to | 18 Jan 2022
When serializing a value, we convert it to a different sequence of bytes. This sequence is often a human-readable string (all the bytes can be read and interpreted by humans as text), but not necessarily. The serialized format can be binary. Binary data (example: an image) is still bytes, but makes use of non-text characters, so it looks like gibberish in a text editor. Binary formats won't make sense unless deserialized by an appropriate program. An example of a human-readable serialization format is JSON. Examples of binary formats are Apache Avro, Protobuf.
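A minimal sketch of that difference, using only the Python standard library. The record and the `LAYOUT` string are made up for illustration; `struct` here is a stand-in for a binary format and is not how Avro or Protobuf encode data.

```python
import json
import struct

record = {"id": 7, "temperature": 21.5}

# Human-readable: JSON text, encoded as UTF-8 bytes.
text_bytes = json.dumps(record).encode("utf-8")

# Binary: a fixed layout of a little-endian 4-byte int followed by an 8-byte double.
# The layout string plays the role of a schema in this sketch.
LAYOUT = "<id"
binary_bytes = struct.pack(LAYOUT, record["id"], record["temperature"])

# The binary bytes only make sense to a program that knows the layout.
decoded_id, decoded_temperature = struct.unpack(LAYOUT, binary_bytes)

print(text_bytes)    # readable in a text editor
print(binary_bytes)  # mostly non-text bytes
```

The JSON bytes carry their own field names, while the binary bytes carry only values, so the reader must already know the layout.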
Dreaming and Breaking Molds – Establishing Best Practices with Scott Haines
3 projects | dev.to | 8 Dec 2021
Scott: It's like a very large row of Avro data that had everything you could possibly ever need. It was like 115 columns. Most things were null, and it became every data type you'd ever want. It's like, is it mobile? Look for mobile_. It's like, this is really crappy. I didn't know about, I guess, the hardships of data engineering at that point. Because this was the first time where I was like, okay, you're on the ground basically pulling data now, and now we're going to do stuff with it. We're going to power our whole entire application with it. And I remember that just being exciting. The gears were turning. I was waking up super early. I wanted to go in to just to work on it more. It was the first thing where it's like, man, that's just like the coolest thing in the whole entire world.
Apache Hudi - The Streaming Data Lake Platform
8 projects | dev.to | 27 Jul 2021
Hudi is designed around the notion of a base file and delta log files that store updates/deltas to a given base file (called a file slice). Their formats are pluggable, with Parquet (columnar access) and HFile (indexed access) being the supported base file formats today. The delta logs encode data in Avro (row-oriented) format for speedier logging (just like Kafka topics, for example). Going forward, we plan to inline any base file format into log blocks in the coming releases, providing columnar access to delta logs depending on block sizes. Future plans also include ORC base/log file formats, unstructured data formats (free-form JSON, images), and even tiered storage layers in event-streaming systems, OLAP engines, and warehouses that work with their native file formats.
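The base-file-plus-delta-log idea can be illustrated conceptually as follows. This is a sketch under stated assumptions, not Hudi's API or on-disk format: `read_file_slice`, the `("upsert" | "delete", key, value)` log entries, and the in-memory dictionaries are all hypothetical stand-ins for Parquet base files and Avro log blocks.

```python
def read_file_slice(base_file, delta_log):
    """Merge a base snapshot with its ordered log of updates and deletes."""
    merged = dict(base_file)  # copy, so the base snapshot is left untouched
    for operation, key, value in delta_log:
        if operation == "upsert":
            merged[key] = value          # insert a new key or overwrite an old one
        elif operation == "delete":
            merged.pop(key, None)        # ignore deletes for missing keys
        else:
            raise ValueError(f"unknown operation: {operation}")
    return merged

# Hypothetical base file contents, keyed by record key.
base_file = {"k1": {"v": 1}, "k2": {"v": 2}}

# Hypothetical delta log, appended in arrival order.
delta_log = [
    ("upsert", "k2", {"v": 20}),
    ("upsert", "k3", {"v": 3}),
    ("delete", "k1", None),
]

snapshot = read_file_slice(base_file, delta_log)
```

Appending to the log is cheap because the base file is never rewritten on each update; the cost moves to read time, where the log is replayed over the base.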
What are some alternatives?
Protobuf - Protocol Buffers - Google's data interchange format
SBE - Simple Binary Encoding (SBE) - High Performance Message Codec
Apache Thrift - Apache Thrift
iceberg - Apache Iceberg
Apache Parquet - Apache Parquet
gRPC - The C based gRPC (C++, Python, Ruby, Objective-C, PHP, C#)
Apache Orc - Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
hudi - Upserts, Deletes And Incremental Processing on Big Data.
Wire - gRPC and protocol buffers for Android, Kotlin, and Java.
tape - A lightning fast, transactional, file-based FIFO for Android and Java.
Big Queue - A big, fast and persistent queue based on memory mapped file.
RocksDB - A library that provides an embeddable, persistent key-value store for fast storage.