lexbor
oil
| lexbor | oil |
---|---|---|
Mentions | 10 | 234 |
Stars | 881 | 2,717 |
Growth | 1.7% | 1.5% |
Activity | 8.5 | 9.9 |
Last commit | 6 days ago | 5 days ago |
Language | C | Python |
License | Apache License 2.0 | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
lexbor
-
Modest: A fast HTML renderer implemented as a pure C99 library
Project is deprecated in favour of the same developer's lexbor project[0].
-
Created a performance-focused HTML5 parser for Ruby, trying to be API-compatible with Nokogiri
It supports both CSS selectors and XPath like Nokogiri, but with separate engines - parsing and CSS engine by Lexbor, XPath engine by libxml2. (Nokogiri internally converts CSS selectors to XPath syntax, and uses XPath engine for all searches).
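To illustrate the kind of CSS-to-XPath translation the comment mentions, here is a minimal sketch in Python. It is not Nokogiri's actual implementation; the function name `css_to_xpath` is hypothetical, and it only handles tag names, `#id`, `.class`, and the descendant combinator (a space).

```python
import re

# One compound selector: optional tag (or *), then any number of #id / .class parts.
_COMPOUND = re.compile(r"([a-zA-Z][\w-]*|\*)?((?:[#.][\w-]+)*)")


def css_to_xpath(selector: str) -> str:
    """Translate a tiny CSS subset into an XPath expression (illustrative only)."""
    steps = []
    for part in selector.split():
        m = _COMPOUND.fullmatch(part)
        if m is None:
            raise ValueError(f"unsupported selector part: {part!r}")
        tag = m.group(1) or "*"
        predicates = []
        for kind, name in re.findall(r"([#.])([\w-]+)", m.group(2)):
            if kind == "#":
                predicates.append(f"[@id='{name}']")
            else:
                # Class matching must treat @class as a space-separated list.
                predicates.append(
                    f"[contains(concat(' ', normalize-space(@class), ' '), ' {name} ')]"
                )
        steps.append(tag + "".join(predicates))
    if not steps:
        raise ValueError("empty selector")
    # A space in CSS is the descendant combinator, which maps to // in XPath.
    return "//" + "//".join(steps)


print(css_to_xpath("div p"))   # //div//p
print(css_to_xpath("a#top"))   # //a[@id='top']
```

A library with a native CSS engine can skip this translation step and match selectors directly against the DOM, which appears to be the design difference the comment describes.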
- Lexbor: Fast HTML Renderer library in C
-
Andreas Kling (of SerenityOS fame) is building a new Linux browser using SerenityOS libraries
An HTML parser, probably the simplest relatively modern example I could find is 1MB https://github.com/lexbor/lexbor (haven't used it, but might look more into it now that I know it exists.)
- Lexbor: Open-source HTML Renderer library in C
-
The State of Web Scraping in 2021
Lazyweb link: https://github.com/rushter/selectolax
although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_
> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor
---
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Heh, oh, new to the Internet, are you? :-D
-
Libraries for retrieving HTML data from a website
Lexbor is here: https://github.com/lexbor/lexbor
-
What second language to learn after Python?
Well, regarding HTML5, what I've found was libxml (does not support tag-soup HTML5), https://github.com/lexbor/lexbor, for which I was unable to find good documentation (see https://lexbor.com/docs/lexbor/#dom), Apache Xerces (appears not to support tag-soup HTML5 either), and Gumbo, which does not appear to be active or to support selectors and XPath (although there are libraries that add that).
-
You can't parse [X]HTML with regex
I think we've all (mostly?) tried it. It really is the Wild West of the web when you're trying to parse other people's HTML, though.
I've played around with this parser which is extremely quick. https://github.com/lexbor/lexbor
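A small sketch of why regexes struggle here, using only the Python standard library. The class name `DivDepth` is made up for this example, and `html.parser` is a simple event-based parser, not an HTML5-compliant one like lexbor; it just stands in for "a real parser".

```python
import re
from html.parser import HTMLParser


class DivDepth(HTMLParser):
    """Track how deeply <div> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


html = "<div>outer <div>inner</div> tail</div>"

# The non-greedy regex stops at the first </div>, cutting the outer element short.
regex_body = re.search(r"<div>(.*?)</div>", html).group(1)
print(regex_body)  # outer <div>inner

# The parser sees start and end tags as events, so nesting can be tracked.
parser = DivDepth()
parser.feed(html)
print(parser.max_depth)  # 2
```

A greedy regex fails in the other direction (it would swallow sibling elements), which is why nested markup generally needs a parser that keeps a stack or a counter.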
-
How SerpApi sped up data extraction from HTML from 3s to 800ms (or How to profile and optimize Ruby code and C extension)
I’m glad to have the opportunity to contribute to an open-source project that is used by thousands of people. Hopefully, we will speed up Nokogiri (or the XML parser it uses) to match the performance of html5ever or lexbor at some point in the future. 800 ms to extract data from HTML is still too much.
oil
-
Autoconf makes me think we stopped evolving too soon
will prevent almost all of the "silent footguns".
YSH has strict:all and then a bunch of NEW features.
There's been good feedback recently, which has led to many concrete changes. So your experience can definitely influence the language! https://github.com/oilshell/oil/wiki/Where-To-Send-Feedback
-
Basic Things
Regarding writing tools/tests/benchmarks in bash+Python, vs. writing tools in your main language:
I think we might eventually concede that something Debian-like is the “standard development environment” (at least for server side stuff, i.e. not iOS apps)
In this case, bash+Python is a non-issue. It works extremely reliably. That’s actually why I use it! Everything else seems to break, or it’s really slow (node.js is a very common alternative).
- Microsoft conceded this back in ~2017, by building Linux into their kernel with WSL, and providing Ubuntu on top
Yes bash + Python is a disaster on Windows (I have scars from it), but Microsoft agrees that the right place to solve that is in Windows :-)
- Every CI system runs Debian/Ubuntu
- Every hosting provider runs Debian/Ubuntu
- Every online dev env like gitpod.io provides Debian/Ubuntu
This is somewhat related to remote dev envs: https://lobste.rs/s/ucirlx/lapdev_self_hosted_remote_dev
One vision for https://www.oilshell.org/ is that the CI environment is the dev environment is the hosting environment.
Everything is just an equal node in a distributed system. BUT it’s more git like, in that you explicitly sync and work “locally”, wherever that is. You don’t have the network chatter and flakiness of “the cloud”.
Oils has a very large set of monotonically increasing properties too - https://www.oilshell.org/release/0.21.0/quality.html
All that is bash+Python that is run on every commit, and it’s extremely good at catching bugs and perf regressions.
I’m skeptical that any project has that level of quality automation written in pure Rust or Zig. More likely it’s a bunch of cloud services with YAML.
Also a bunch of “hard-coded” toolchains that you can’t script with bespoke code. Like some shell commands in your package.json, which is just a worse way of writing a shell script.
Our quality process is all self-hosted, in the repo, and runs on both Github Actions and sourcehut - https://www.oilshell.org/release/0.21.0/pub/metrics.wwz/line...
bash and Python run perfectly on Github Actions and sourcehut, with zero changes. Containers also do.
(Although we need to unify the CI and release, because the release runs on 2 different real hardware machines, while CI is cloud only.)
Also, a main point of Oils is that bash now has another highly compatible, spec-driven implementation – OSH. Having 2 independent implementations is something newer languages don’t have.
(copy of lobste.rs comment)
-
The secret weapon of Bash power users
in your bashrc to enable it. I've used it for probably ~18 years now.
It also works with https://www.oilshell.org/ since we use GNU readline. Just 'set -o vi' in ~/.config/oils/oshrc
-
Pipexec – Handling pipe of commands like a single command
No other shell does that.
But I didn't know it was called MULTIOS until now. (I guess that's read "mult I/O's"? I have a hard time not reading it as multi-OS :) )
It seems a bit niche to be honest, but it's possible to support in Oils.
---
Oils also uses Unix domain sockets already for the headless shell protocol
https://github.com/oilshell/oil/wiki/Headless-Mode
We could do something like dgsh, but so far I haven't seen a lot of uptake / demand. Every time it's mentioned, somebody kinda wants it, and then it kinda peters out again ... still possible though.
I think flat files work fine for a lot of use cases, and once you add streaming, you also want monitoring, more control over backpressure/queue sizes, etc.
-
Show HN: Hancho – A simple and pleasant build system in ~500 lines of Python
which works well. You don't have to clean when rebuilding variants. IMO this is 100% essential for writing C++ these days. You need a bunch of test binaries, and all tests should be run with ASAN and UBSAN.
---
I wrote a mini-bazel on top of Ninja with these features:
https://www.oilshell.org/blog/2022/10/garbage-collector.html...
So it's ~1700 lines, but for that you get the build macros like asdl_library() generating C++ and Python (the same as proto_library(), a schema language that generates code)
And it also correctly finds dependencies of code generators. So if you change a .py file that is imported by another .py file that is used to generate a C++ header, everything will work. That was one of the trickier bits, with Ninja implicit dependencies.
I also use the Bazel-target syntax like //core/process
This build file example mixes low level Ninja n.rule() and n.build() with high level r.cc_library() and so forth. I find this layering really does make it scale better for bigger projects
https://github.com/oilshell/oil/blob/master/asdl/NINJA_subgr...
Some more description - https://lobste.rs/s/qnb7xt/ninja_is_enough_build_system#c_tu...
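The code-generator dependency idea in this comment can be sketched roughly as follows. This is not the Oils build code: the function names, the `SOURCES` table, and the `codegen` rule name are all assumptions made for illustration. The only Ninja feature relied on is the documented `build out: rule in | implicit_deps` syntax.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def transitive_deps(entry: str, sources: dict[str, str]) -> list[str]:
    """Follow imports from `entry`, keeping only modules present in `sources`."""
    seen = set()
    stack = [entry]
    while stack:
        mod = stack.pop()
        if mod in seen or mod not in sources:
            continue  # already visited, or not part of this repo (e.g. stdlib)
        seen.add(mod)
        stack.extend(imported_modules(sources[mod]))
    seen.discard(entry)
    return sorted(seen)


def ninja_build(output: str, rule: str, script: str, implicit: list[str]) -> str:
    """Format a Ninja build statement; files after '|' are implicit dependencies."""
    line = f"build {output}: {rule} {script}"
    if implicit:
        line += " | " + " ".join(implicit)
    return line


# Hypothetical repo: gen_header.py imports schema.py, which imports util.py.
SOURCES = {
    "gen_header": "import schema\nimport os\n",
    "schema": "from util import fmt\n",
    "util": "def fmt(x):\n    return str(x)\n",
}

deps = transitive_deps("gen_header", SOURCES)
print(ninja_build("types.h", "codegen", "gen_header.py", [d + ".py" for d in deps]))
# build types.h: codegen gen_header.py | schema.py util.py
```

With the imported files listed as implicit dependencies, editing `util.py` makes Ninja consider `types.h` out of date, even though `util.py` never appears on the generator's command line.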
-
Re2c
This is sort of a category error...
re2c is a lexer generator, and YAML and Python are recursive/nested formats.
You can definitely use re2c to lex them, but it's not the whole solution.
I use it for everything possible in https://www.oilshell.org, and it's amazing. It really reduces the amount of fiddly C code you need to parse languages, and it drops in anywhere.
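A rough illustration of the lexer/parser split described above, in Python rather than re2c-generated C. The toy grammar (nested lists of integers) and all names are assumptions for this example. The regex produces a flat token stream, which is the part a lexer generator covers; the nesting is handled by a separate recursive function.

```python
import re

# Lexing: a regular language is enough to split text into tokens.
# Characters that match no token (such as whitespace) are skipped in this sketch.
TOKEN = re.compile(r"\d+|[\[\],]")


def lex(text: str) -> list[str]:
    return TOKEN.findall(text)


def parse(tokens: list[str]):
    """Parse a full token stream into nested Python lists of ints."""
    value, pos = _parse_value(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return value


def _parse_value(tokens, pos):
    # Parsing: recursion tracks nesting, which a flat regex cannot do.
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "[":
        items = []
        pos += 1
        if pos < len(tokens) and tokens[pos] == "]":
            return items, pos + 1
        while True:
            item, pos = _parse_value(tokens, pos)
            items.append(item)
            if pos >= len(tokens):
                raise ValueError("missing ]")
            if tokens[pos] == "]":
                return items, pos + 1
            if tokens[pos] != ",":
                raise ValueError(f"unexpected token {tokens[pos]!r}")
            pos += 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise ValueError(f"unexpected token {tok!r}")


print(lex("[1, [2, 3]]"))         # ['[', '1', ',', '[', '2', ',', '3', ']', ']']
print(parse(lex("[1, [2, 3]]")))  # [1, [2, 3]]
```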
-
Ask HN: Looking for a project to volunteer on? (February 2024)
SEEKING VOLUNTEERS - https://www.oilshell.org/ - https://github.com/oilshell/oil/
I'm looking for people to help fill out the "standard library" for Oils/YSH. We're implementing a shell for Python and JavaScript programmers who avoid shell!
On the surface, this is writing some very simple functions in typed Python. But I've realized that the hardest parts are specifying, TESTING, and documenting what the functions do.
---
The most recent release announcement also asks for help - https://www.oilshell.org/blog/2024/01/release-0.19.0.html (long)
If you find all those details interesting (if maybe overwhelming), you might have a mind for language design, and could be a good person to help.
Surveying what Python and JavaScript do is very helpful, e.g. for the recent Str.replace() function, which is nontrivial (takes a regex or string, replacement template or string)
But there are also very simple methods to get started, like Dict.values() and List.indexOf(). Other people have already contributed code. Examples:
https://github.com/oilshell/oil/commit/58d847008427dba2e60fe...
https://github.com/oilshell/oil/commit/8f38ee36d01162593e935...
This can also be useful to tell if you'll have fun working on the project - https://github.com/oilshell/oil/wiki/Where-Contributors-Have...
More on #help-wanted on Zulip (requires login) - https://oilshell.zulipchat.com/#narrow/stream/417617-help-wa...
Please send a message on Github or Zulip! Or e-mail me andy at oilshell dot org.
-
The rust project has a burnout problem
This is true, but then the corollary is that new PRs need to come with this higher, more rigorous level of test coverage.
And then that becomes a bit of a barrier to contribution -- that's a harness
I often write entirely new test harnesses for features, e.g. for https://www.oilshell.org, many of them linked here. All of these run in the CI - https://www.oilshell.org/release/latest/quality.html
The good thing is that it definitely helps me accept PRs faster. Current contributors are good at this kind of exhaustive testing, but many PRs aren't
- Unix as IDE: Introduction (2012)
- Oils
What are some alternatives?
myhtml - Fast C/C++ HTML 5 Parser. Using threads.
nushell - A new type of shell
selectolax - Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
fish-shell - The user-friendly command line shell.
gumbo-parser - An HTML5 parsing library in pure C99
elvish - Powerful scripting language & Versatile interactive shell
Xerces-C++ - Apache Xerces-C validating XML parser
xonsh - :shell: Python-powered, cross-platform, Unix-gazing shell.
nokogiri-rust - Ruby FFI wrapper around scraper crate to be used instead of Nokogiri. Status: proof of concept.
PowerShell - PowerShell for every system!
pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
ShellCheck - ShellCheck, a static analysis tool for shell scripts