jitsu
monosi
 | jitsu | monosi
---|---|---
Mentions | 13 | 20
Stars | 3,795 | 320
Growth | 1.9% | 1.3%
Activity | 9.8 | 0.0
Last commit | 6 days ago | over 1 year ago
Language | TypeScript | Python
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
jitsu
- Any examples of working activist, socialist, or community-organizing software?
- Lesser Known Features of ClickHouse
You may want to check https://github.com/jitsucom/jitsu. "Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days"
You can create an API endpoint and send your JSON to it. In the "destination" part, it can sync to ClickHouse (one of many choices, such as Redshift and Snowflake) very quickly and flatten the JSON into columns. If a new key is found in the JSON, it will create a new column in ClickHouse.
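To make the flattening idea concrete, here is a minimal sketch of turning a nested JSON event into flat column names and spotting keys that would need a new column. It is not Jitsu's implementation: the underscore separator, the `flatten` and `new_columns` helpers, and the sample event are all assumptions made for illustration, and the real column naming and type mapping may differ.

```python
from typing import Any


def flatten(event: dict[str, Any], prefix: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining nested keys with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in event.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def new_columns(flat_event: dict[str, Any], known_columns: set[str]) -> list[str]:
    """Return keys in the event that the table does not have yet (hypothetical helper)."""
    return sorted(set(flat_event) - known_columns)


# Hypothetical event and table schema, for illustration only.
event = {"user": {"id": 42, "geo": {"country": "DE"}}, "event_type": "pageview"}
row = flatten(event)
missing = new_columns(row, known_columns={"event_type", "user_id"})
print(row)
print(missing)
```

A real pipeline would then issue the schema change for each entry in `missing` before inserting the row.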
- Reference Data Stack for Data-Driven Startups
We also have telemetry set up on our Monosi product, which is collected through Snowplow. As with Airbyte, we chose Snowplow because of its open source offering and its scalable event ingestion framework. There are other open source options to consider, including Jitsu and RudderStack, or closed source options like Segment. Since we started building our product with just a CLI offering, we didn’t need a full CDP solution, so we chose Snowplow.
- Data pipeline suggestions
Ingestion / Extraction: Airbyte, Singer, Jitsu
- Where can I find free data engineering (big data) projects online?
Ingestion / ETL: Airbyte, Singer, Jitsu
Transformation: dbt
Orchestration: Airflow, Dagster
Testing: GreatExpectations
Observability: Monosi
Reverse ETL: Grouparoo, Castled
Visualization: Lightdash, Superset
- Ask HN: Good open source alternatives to Google Analytics?
- Launch HN: Jitsu (YC S20) – Open-Source Segment Alternative
Hey HN! Vlad here with Sergey, Ildar, and Kirill. We are building Jitsu, an open-source Segment alternative (https://github.com/jitsucom/jitsu, https://jitsu.com/). We help companies collect events from their apps, websites, and APIs and send them to databases.
I've been doing data engineering for more than ten years (for half of that time, I didn't know that it's called "data engineering"). Before Jitsu, I was a co-founder and CTO of GetIntent, an ad-tech startup. Although it was ad-tech (I'm sorry for that!), we also built quite a fascinating technology platform. We processed up to 1 million events per second at peak, and all those events needed to be stored somewhere.
We churned through a few data warehouse platforms along the way. In 2013, we started with Hadoop's HDFS and a bunch of map-reduce jobs on top of it. Then, when we decided to allow our customers to run ad-hoc reports, we switched to BigQuery. BigQuery was great, but expensive—especially with some customers obsessively clicking the refresh button. Finally, in 2017 we migrated to self-hosted ClickHouse which in my opinion is still the best analytics database in the world.
All that time, we spent a fair amount of effort to get data to the database. When you're dealing with millions of events per minute, running an INSERT statement per event won't work. What if the DB is down for maintenance? How can you be sure that all 50+ edge nodes are aware of recent DB schema changes? Also, did you know streaming data to BigQuery is costly while batching data is free?
We tried different approaches: first, we would write local log files, sync them to HDFS, and load data to BQ (or ClickHouse) with map-reduce jobs. To improve data freshness, we ditched HDFS and started to send data in batches to the DB directly from edge servers. We experimented with Kafka, but it felt too complex for that task at the time.
I always dreamed about a straightforward service, to which I'd throw JSON objects, and it would take care of the rest: queueing, retrying, updating database schema, etc.
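As a rough illustration of the "throw JSON at it and let it queue and retry" idea, the sketch below buffers events and writes them in batches, retrying with exponential backoff when the sink fails. It is a simplified, single-threaded assumption of how such a buffer could look, not Jitsu's or EventNative's code; `BatchBuffer`, its parameters, and the `flush` callable are hypothetical names.

```python
import time
from typing import Any, Callable

Event = dict[str, Any]


class BatchBuffer:
    """Collects events in memory and writes them to a sink in batches."""

    def __init__(self, flush: Callable[[list[Event]], None], max_size: int = 1000,
                 max_retries: int = 3, backoff_seconds: float = 0.5) -> None:
        self._flush = flush              # e.g. a function doing one bulk INSERT
        self._max_size = max_size
        self._max_retries = max_retries
        self._backoff = backoff_seconds
        self._pending: list[Event] = []

    def add(self, event: Event) -> None:
        """Queue one event; flush automatically once the batch is full."""
        self._pending.append(event)
        if len(self._pending) >= self._max_size:
            self.flush()

    def flush(self) -> bool:
        """Try to write the pending batch; return True if nothing is left queued."""
        if not self._pending:
            return True
        for attempt in range(self._max_retries):
            try:
                self._flush(list(self._pending))
            except Exception:
                # Sink unavailable (e.g. DB maintenance): back off, then retry.
                time.sleep(self._backoff * 2 ** attempt)
            else:
                self._pending.clear()
                return True
        return False  # events stay queued for a later flush attempt

    def pending_count(self) -> int:
        return len(self._pending)
```

A production collector would also persist the queue to disk and flush on a timer, so that a crash or a quiet period does not strand events in memory.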
Then I discovered Segment. I liked it at first. It seemed very developer-friendly with a nice API and excellent documentation. But the pricing model and data delays (the event gets to DB in 12 hours after it has been sent to Segment) killed the whole idea. And it was not open-sourced. In my opinion, being open-source and self-hostable is a must for such a fundamental part of the architecture as data collection.
I left GetIntent and got accepted to YC with a different idea for the Summer 2020 batch. The idea was to build a churn prevention and BI tool for online retailers. It didn't take off, but in the process we made a component to collect customers' app events and put them into a DB. We tried to hack a solution on top of the ELK stack, but I was frustrated with Elasticsearch’s lack of SQL support. Here I was, back to square one: there was still no good open-source event collection service, and we needed to build one, once again.
So we decided to focus solely on that problem. We ditched all the previous code, which was in Java, rewrote the data collection server in Go and hacked together what we called EventNative [1]. It was received very well, and we started to get users.
Over the last 11 months, we've been busy building the UI, adding Connectors (to pull data from external APIs), polishing data warehouse support, adding JavaScript support to transform incoming data, and implementing dozens of other features.
Now we're launching Jitsu, an open-source Segment alternative. With Jitsu, we make it easy to collect data and send it to databases (we support all major players: ClickHouse, Redshift, Snowflake, BigQuery and Postgres). We’re deployed in production at a large gaming publisher, an eSignature service, and many other great companies. We're going for an open-core model. So far we don't have paid features, but soon we'll have some, presumably around things like authorization and data masking. Also, we run Jitsu.Cloud[2], which you can buy if you don’t want to self-host.
Give it a spin: https://github.com/jitsucom/jitsu.
Thank you for reading this story - I hope it was interesting. I would love to read your feedback on Jitsu and answer questions!
I’m just saying this is better:
We are building Jitsu, (https://github.com/jitsucom/jitsu, https://jitsu.com/) We help companies collect events from their apps, websites, and APIs and send them to databases.
Think of us as an open-source Segment alternative.
Thanks! I take it this file is where I can get started to learn more:
https://github.com/jitsucom/jitsu/blob/0aaa74b59eb9d8c885c80...
I see that it instantiates an "AsyncLogger" - does the service wait until data is written to the log prior to returning success to the client?
That's a good question. We're aiming to replace Kafka in some cases. There are many ways people use Kafka, but they can be roughly divided into two buckets:
- Kafka as a company-wide message bus: dozens of (micro)services send data there, and consumers listen for it. No service knows which other services will consume its data. In that case, we're not looking to replace Kafka; we're going to work alongside it. We have a PR about supporting Kafka as a destination [1] (Jitsu sends data to Kafka), and we will support Kafka as a source at some point (PRs are always welcome :))
- Kafka used just as a transport between a web app and a DB. In that case, Jitsu is a perfect replacement.
[1] https://github.com/jitsucom/jitsu/pull/537
P. S. The same applies to Kinesis too
monosi
- Open source data observability tools with UI?
I also found https://github.com/monosidev/monosi, but it seems there has been no activity in the repository since last year.
- Metadata extraction and management
It’s open source, check out the repository here - https://github.com/monosidev/monosi
- How to Monitor Supabase with Monosi
🎉 Congratulations, you've just set up and scheduled a data monitor on your Supabase instance. You can now add more monitors to other tables in your database. Find more information on how to use Monosi here.
- Setting up data monitoring for PostgreSQL
Monosi is an open source data observability and monitoring platform for data teams. It is used to quickly set up monitors on a data store. The monitors run checks for data quality issues and alert on detected anomalies.
Now that you’ve worked through an example using a public PostgreSQL instance, you can further extend this to your own data store. For more information, get started here.
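To give a feel for what "checks for data quality issues" and "detected anomalies" can mean in practice, here is a minimal sketch of one common approach: flagging a metric (for example, a table's daily row count) when it falls far outside its recent history, measured by z-score. This is an assumed illustration, not Monosi's actual monitor logic; `is_anomalous`, the threshold of 3.0, and the sample numbers are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly constant history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily row counts for a monitored table.
daily_row_counts = [1000, 1020, 980, 1010, 990]
print(is_anomalous(daily_row_counts, 1005))
print(is_anomalous(daily_row_counts, 100))
```

A scheduled monitor would compute the metric with a SQL query on each run, append it to the history, and send an alert when the check returns `True`.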
- Sunday Daily Thread: What's everyone working on this week?
Continuing to build out & stabilize Monosi (open source data observability) - https://github.com/monosidev/monosi
- Data pipeline suggestions
Observability: Monosi
- Whats something hot rn or whats going to be next thing we should focus on in data engineering?
Ah ok cool, well I guess you can say a lot of the tools that are becoming big with the modern data stack provide some form of automation. E.g., Fivetran / Airbyte extract data on an automated schedule, then you have dbt with the transformations, and then the reverse ETLs like Hightouch / Census that run on an automated schedule as well. I think it's pretty much becoming a standard thing to have now, e.g. with what I'm building we included a scheduler for automation from the start.
- Where can I find free data engineering (big data) projects online?
Ingestion / ETL: Airbyte, Singer, Jitsu
Transformation: dbt
Orchestration: Airflow, Dagster
Testing: GreatExpectations
Observability: Monosi
Reverse ETL: Grouparoo, Castled
Visualization: Lightdash, Superset
- Is airflow a good pick for monitoring without scheduling?
Got it, yea in terms of monitoring data in snowflake the monosi package can help.
What are some alternatives?
airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Snowplow - The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP
datahub - The Metadata Platform for your Data Stack
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
posthog-ios - PostHog iOS SDK
sqlpad - Web-based SQL editor. Legacy project in maintenance mode.
superset - Apache Superset is a Data Visualization and Data Exploration Platform
growthbook - Open Source Feature Flagging and A/B Testing Platform
Matomo - Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
uBlock-issues - This is the community-maintained issue tracker for uBlock Origin
projects - Sample projects using Ploomber.
castled - Castled is an open source reverse ETL solution that helps you to periodically sync the data in your db/warehouse into sales, marketing, support or custom apps without any help from engineering teams