-
sparkMeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
As an alternative to the other proposed solutions, you could use the Spark metrics system to extract this information from accumulators. The available metrics include, among others, the total records and bytes written at each stage. Take a look at sparkMeasure, and at an implementation example if you need to roll your own.
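As a rough illustration of the sparkMeasure route, here is a minimal PySpark sketch. It assumes the `sparkmeasure` Python package is installed and the matching sparkMeasure JAR is on the Spark classpath. The helper names `summarize_writes` and `measure_writes` are made up for this example, and the column names `stageId`, `recordsWritten`, and `bytesWritten` are taken on the assumption that they match sparkMeasure's stage metrics schema in your version.

```python
def summarize_writes(rows):
    """Sum records and bytes written per stage.

    `rows` is an iterable of mappings with the keys stageId,
    recordsWritten and bytesWritten (assumed column names).
    Returns {stageId: (records_written, bytes_written)}.
    """
    totals = {}
    for row in rows:
        records, nbytes = totals.get(row["stageId"], (0, 0))
        totals[row["stageId"]] = (
            records + row["recordsWritten"],
            nbytes + row["bytesWritten"],
        )
    return totals


def measure_writes(spark, action):
    """Run `action()` under sparkMeasure and return per-stage write totals.

    Hypothetical helper: `spark` is an active SparkSession and `action`
    is a zero-argument callable that triggers the Spark job to measure.
    """
    # Imported lazily so the pure helper above works without Spark installed.
    from sparkmeasure import StageMetrics

    stage_metrics = StageMetrics(spark)
    stage_metrics.begin()
    try:
        action()
    finally:
        # Stop collecting even if the job fails.
        stage_metrics.end()

    # Expose the collected stage metrics as a DataFrame and keep only
    # the write-related columns.
    df = stage_metrics.create_stagemetrics_DF("PerfStageMetrics")
    rows = [
        r.asDict()
        for r in df.select("stageId", "recordsWritten", "bytesWritten").collect()
    ]
    return summarize_writes(rows)
```

A call might look like `measure_writes(spark, lambda: df.write.parquet(path))`, where `df` and `path` are your own DataFrame and output location. If all you need is a human-readable summary, calling `print_report()` on the `StageMetrics` object after `end()` is probably enough.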