[Discussion] good html tokenization libraries?

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • dragnet

    Just the facts -- web page content extraction

  • You can look at how dragnet preprocesses HTML into blocks (the Blockifier function). It's certainly not the best approach, since it collapses the tree structure into one sequential flow, but I think it's easier to work with once you know how to modify the code for your own needs. The hard way is to use lxml and write the parsing yourself, which is also what happens under the hood in dragnet.
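
    To make the "collapse the tree into one sequential flow" idea concrete, here is a minimal sketch of a block tokenizer. It is not dragnet's actual Blockifier, and it uses Python's standard-library html.parser in place of lxml so it runs without extra dependencies. The names BLOCK_TAGS, SKIP_TAGS, BlockTokenizer, and blockify are hypothetical, and the choice of which tags count as block boundaries is an assumption made for illustration.

    ```python
    from html.parser import HTMLParser

    # Assumption: tags that start or end a text block. Adjust for your data.
    BLOCK_TAGS = {"p", "div", "li", "td", "blockquote", "pre", "br",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    # Assumption: tags whose contents are never treated as page text.
    SKIP_TAGS = {"script", "style"}


    class BlockTokenizer(HTMLParser):
        """Flatten an HTML tree into a sequential list of text blocks."""

        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.blocks = []      # finished blocks, in document order
            self._buf = []        # text pieces of the block being built
            self._skip_depth = 0  # > 0 while inside <script> or <style>

        def _flush(self):
            # Join the buffered pieces and normalize runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.blocks.append(text)
            self._buf = []

        def handle_starttag(self, tag, attrs):
            if tag in SKIP_TAGS:
                self._skip_depth += 1
            elif tag in BLOCK_TAGS:
                self._flush()

        def handle_endtag(self, tag):
            if tag in SKIP_TAGS:
                self._skip_depth = max(0, self._skip_depth - 1)
            elif tag in BLOCK_TAGS:
                self._flush()

        def handle_data(self, data):
            # Inline tags such as <b> or <a> do not split a block.
            if self._skip_depth == 0:
                self._buf.append(data)

        def close(self):
            super().close()
            self._flush()  # emit any trailing text


    def blockify(html):
        """Return the text blocks of an HTML string as a flat list."""
        parser = BlockTokenizer()
        parser.feed(html)
        parser.close()
        return parser.blocks


    if __name__ == "__main__":
        sample = "<div><h1>Title</h1><p>Hello <b>world</b>!</p></div>"
        print(blockify(sample))
    ```

    The trade-off is the one described above: the output is easy to feed into a sequence model, but nesting information (which block sat inside which element) is lost. Keeping the tree would mean walking the parsed document yourself, for example with lxml.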


