- web2text: Source code for the paper "Web2Text: Deep Structured Boilerplate Removal" (full paper at ECIR'18)
The only paper with code that I’m aware of is web2text (https://github.com/dalab/web2text), which is written in Scala. They originally used a CNN. I think their training data was way too small.
This problem is related to document AI. Recently, Google released a model called Pix2Struct. Some of the tasks they considered and datasets they used include:
I have also seen several tools that try to use LLMs for web scraping, though I haven't looked into the details: https://www.reddit.com/r/MachineLearning/comments/12v0vda/p_i_built_a_tool_that_autogenerates_scrapers_for/ and https://github.com/Smyja/blackmaria