common-crawl

Open-source projects categorized as common-crawl

Top 6 common-crawl Open-Source Projects

  • StringZilla

    Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖

  • Project mention: Measuring energy usage: regular code vs. SIMD code | news.ycombinator.com | 2024-02-19

    The 3.5x energy-efficiency gap between serial and SIMD code becomes even larger when

    A. you do byte-level processing instead of float words;

    B. you use embedded, IoT, and other low-energy devices.

    A few years ago I compared an Nvidia Jetson Xavier (long before the Orin release), an Intel-based MacBook Pro with a Core i9, and AVX-512-capable CPUs on substring search benchmarks.

    On Xavier one can quite easily disable/enable cores and reconfigure power usage. At peak I got to 4.2 GB/J, which was an 8.3x improvement in efficiency over LibC in substring search operations. The comparison table is still available in the older README: https://github.com/ashvardanian/StringZilla/tree/v2.0.2?tab=...

  • comcrawl

    A python utility for downloading Common Crawl data

  • troll-a

    Drill into WARC web archives

  • Project mention: Show HN: Command line tool for extracting secrets from WARC (Web ARChive) files | news.ycombinator.com | 2023-12-20
  • cc-notebooks

    Various Jupyter notebooks about Common Crawl data

  • url-collector

    An application that crawls the Common Crawl corpus for URLs with the specified file extensions.

  • abracabra

    Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
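To make the idea behind a tool like url-collector concrete, here is a minimal sketch of filtering a list of URLs by file extension. It is not taken from any of the projects above: the function names `has_extension` and `filter_urls` are hypothetical, and the input is assumed to be a plain in-memory list of URL strings rather than actual Common Crawl index records.

```python
import posixpath
from urllib.parse import urlsplit


def has_extension(url, wanted):
    """Return True if the URL's path ends in one of the wanted extensions.

    `wanted` is expected to be a set of lowercase extensions without dots.
    """
    # Only the path component is inspected, so query strings and
    # fragments (e.g. "?x=1" or "#p2") do not affect the result.
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower().lstrip(".")
    return ext in wanted


def filter_urls(urls, extensions):
    """Keep only the URLs whose path has one of the given extensions."""
    # Normalize once: accept "pdf", ".pdf", or "PDF" from the caller.
    wanted = {e.lower().lstrip(".") for e in extensions}
    return [u for u in urls if has_extension(u, wanted)]


if __name__ == "__main__":
    sample = [
        "https://example.com/a.PDF?x=1",
        "https://example.com/index.html",
        "https://example.com/doc.pdf#p2",
        "https://example.com/pdf",
    ]
    for url in filter_urls(sample, ["pdf"]):
        print(url)
```

Matching on the parsed path rather than on the raw string avoids false positives from query parameters, which is one plausible design choice; a real crawler over the corpus might instead filter on MIME type or on index fields.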

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source common-crawl projects? This list will help you:

#  Project        Stars
1  StringZilla    1,819
2  comcrawl       214
3  troll-a        130
4  cc-notebooks   37
5  url-collector  0
6  abracabra      0
