| | gumbo-parser | html-parser.ts |
|---|---|---|
| Mentions | 7 | 1 |
| Stars | 5,116 | 1 |
| Growth | - | - |
| Activity | 0.0 | 8.1 |
| Last commit | about 1 year ago | 5 months ago |
| Language | HTML | TypeScript |
| License | Apache License 2.0 | BSD 2-clause "Simplified" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
gumbo-parser
- Gumbo HTML5 parsing library has been discontinued
-
Web Scraping with C++
It uses libcurl and gumbo (https://github.com/google/gumbo-parser). Gumbo is apparently written in pure C99 (interestingly, Curl is written in the even older C89 standard). It would have been more amusing if the article had taken that into account and used C99.
- how to make a C++ web scraper?
-
The computers are fast, but you don't know it
> A standards compliant HTML5 parser is at the bare minimum millions of lines of code.
But https://github.com/google/gumbo-parser is only 34K lines?
-
Markup Language Operations in Nim to extract and remove el
oops... I saw a markup parser and automatically thought XML, but you are right! HTML is actually a whole different beast!
As it turns out, Nim also seems to have an HTML parser [1], but I'm guessing something like Google's gumbo [2] could be more reliable, although you would have to write bindings for Nim.
1: https://nim-lang.org/docs/htmlparser.html
2: https://github.com/google/gumbo-parser
-
What second language to learn after Python?
Well, regarding HTML5, what I've found was libxml (does not support tag-soup HTML5), https://github.com/lexbor/lexbor, for which I was unable to find good documentation (see https://lexbor.com/docs/lexbor/#dom), Apache Xerces (appears not to support tag-soup HTML5 either), and Gumbo, which does not appear to be active or to support selectors and XPath (although there are libraries that add that).
-
Does anyone know of an HTML parser written in C++ that has Node.js interface?
I haven't used any of them, but there are a few wrappers available for Gumbo.
html-parser.ts
-
Gumbo HTML5 parsing library has been discontinued
Feel free to have a look at it: https://github.com/beenotung/html-parser
What are some alternatives?
Xerces-C++ - Apache Xerces-C validating XML parser
necktie - Necktie – a simple DOM binding tool
lexbor - Lexbor is development of an open source HTML Renderer library. https://lexbor.com
dom-proxy - Develop lightweight and declarative UI with automatic dependency tracking without boilerplate code, VDOM, nor compiler
HTML-XML-Operations-Nim - Mark Up Language extraction, removal and copy
cheerio - The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
benchmarks - Some benchmarks of different languages
cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server. [Moved to: https://github.com/cheeriojs/cheerio]
cpr - C++ Requests: Curl for People, a spiritual port of Python Requests.
happy-dom - A JavaScript implementation of a web browser without its graphical user interface
q.nim - Query HTML/XML elements using a CSS3 or jQuery-like selector syntax
htmlparser2 - The fast & forgiving HTML and XML parser