Metric | trafilatura | filemanager |
---|---|---|
Mentions | 13 | 305 |
Stars | 2,853 | 23,791 |
Growth | - | 2.2% |
Activity | 8.7 | 8.8 |
Last commit | 2 days ago | 1 day ago |
Language | Python | Go |
License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
trafilatura
-
Trafilatura: Python tool to gather text on the Web
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
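To give a sense of how much glue even one item on that list needs, here is a minimal standard-library sketch of URL de-duplication. It is not Trafilatura's implementation; the normalization rules and the `normalize_url` / `deduplicate` helper names are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Treat "/post/1" and "/post/1/" as the same page (an assumption that
    # holds for most sites, but not all of them).
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it does not change which document is fetched.
    return urlunsplit((scheme, host, path, parts.query, ""))


def deduplicate(urls):
    """Keep the first occurrence of each URL, comparing normalized forms."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


candidates = [
    "https://example.org/post/1",
    "https://EXAMPLE.org/post/1/",
    "https://example.org/post/1#comments",
    "https://example.org/post/2",
]
unique_urls = deduplicate(candidates)
print(unique_urls)
```

A real crawler would also have to decide how to treat tracking parameters, redirects, and `www.` prefixes, which is part of why a ready-made library is attractive.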
-
Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
The WebScraper tool uses Trafilatura [1] to scrape and parse HTML—nothing too fancy. "Scraping" a React site would require a totally different approach, probably something more akin to Adept's ACT-1 [2].
I run a local chat app built with Griptape and I use it to give me summaries of web pages or answer specific questions all the time :)
1. https://github.com/adbar/trafilatura/
-
Powerful and free scraper with a headless browser under the hood and Readability for parsing
I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
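The "Playwright to scrape, then Trafilatura to parse" idea could be sketched roughly as follows. This assumes both packages are installed and that a headless Chromium build is available to Playwright; the function names and the `networkidle` wait strategy are illustrative choices, not something taken from either project.

```python
def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chromium and return the rendered HTML."""
    # Imported lazily so this module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for network activity to settle so client-side rendering
            # has a chance to finish (a heuristic, not a guarantee).
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def extract_main_text(html: str):
    """Run Trafilatura's main-content extraction on an HTML string."""
    import trafilatura

    # Returns the extracted text, or None if nothing usable was found.
    return trafilatura.extract(html)


def scrape(url: str, fetch=fetch_rendered_html, extract=extract_main_text):
    """Fetch with one tool and parse with another; both steps are swappable."""
    return extract(fetch(url))
```

Calling `scrape("https://example.org/")` would render the page in the browser and hand the resulting HTML to Trafilatura. Keeping the two steps as separate functions makes it easy to swap the fetcher for a plain HTTP client on pages that do not need JavaScript.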
-
I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
Cool! If you care to explain further... :) ... I tried parsing a page using https://github.com/adbar/trafilatura, JSON-stringifying the result, and passing it to https://platform.openai.com/docs/api-reference/embeddings/create. How do I use the response as an input later? <3
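One common way to use embeddings later is to store each text chunk next to its vector, embed the question the same way, and pick the stored chunks whose vectors are most similar. The sketch below shows only that lookup step; the three-dimensional vectors are made-up placeholders standing in for real embedding output, and `top_chunks` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query_vec, index, k=1):
    """Return the k stored texts whose vectors best match the query vector."""
    scored = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# Placeholder data: (chunk text, embedding vector) pairs saved at parse time.
index = [
    ("Pricing starts at $10 per month.", [0.9, 0.1, 0.0]),
    ("The team is based in Berlin.", [0.0, 0.2, 0.9]),
]

# Placeholder for the embedding of a question such as "How much does it cost?"
query_vec = [0.8, 0.2, 0.1]

best = top_chunks(query_vec, index)
print(best)
```

The selected chunks are then typically pasted into the prompt of a chat or completion request as context for answering the question.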
-
Testing fast installation in tear-down environment
I want to test how easy it is to install a package plus special extra dependencies to run a certain script in that package: https://github.com/adbar/trafilatura
- Advice on standard design pattern for comparison test script
- Automate dependency installation
- Issue with sklearn
- Questions about some code
- How does Firefox's Reader View work?
filemanager
-
Ask HN: Online File Repository System?
Check out https://awesome-selfhosted.net/tags/file-transfer---web-base...
I've used https://filebrowser.org/ and it's okay. I've also used Seafile, but my current setup is SFTP clients (Transmit nowadays) and Syncthing if I need the files on multiple computers.
-
Homelab Adventures: Crafting a Personal Tech Playground
File Browser
-
h5ai – modern HTTP web server index
Thanks for sharing. I wasn't aware of dufs and it looks very solid. File Browser[0] is another popular choice, though it's more GUI-oriented for file operations.
[0]: https://filebrowser.org/
-
Ask HN: Spreadsheets like Google Sheets but not from Google?
The OnlyOffice desktop app is a pretty good and free alternative to the Microsoft Office suite. You can simply install it on your local machine for offline access.
OnlyOffice is also self-hostable as a web app for a cloud alternative to Google Sheets.
Filebrowser is a self-hostable alternative to Google Drive.
There's a pull request open to integrate OnlyOffice with Filebrowser for a self-hosted Google Drive + Google Docs setup.
https://github.com/filebrowser/filebrowser/pull/1420
-
Ask HN: What is the best FOSS file sharing protocol/app?
For strictly local use, Google's Nearby share is technically FOSS but the documentation is basically non-existent and a proper Linux implementation is not here yet. Alternatives aren't hard to find though, with Mint's Warpinator or KDE Connect having worked well for me.
For non-local use (everything out of Bluetooth range), you almost have to trust a third party and it really depends on your use case. Want to send your friend a file or host pictures of your birthday for multiple people to download? For the former, Magic Wormhole works great; for the latter, you could spin up a Nextcloud instance or similar (personally I like https://github.com/filebrowser/filebrowser ). Want to regularly send files from device 1 to device 2? Now classic sync solutions like Syncthing become really viable.
If everything else fails, FTP always has your back
-
Finally a decent file browser in Game mode
I have been looking for a file browser which can run in game mode and is reasonably user friendly for simple file operations (copy/delete/rename, etc). Most people recommend Dolphin. It does work, but there are issues: the color scheme looks really weird in game mode, and the context menu does not like game mode either. I got File Browser (https://github.com/filebrowser/filebrowser) working in game mode; it is essentially an Edge app accessing a web server on localhost (running as a user service). It took some time to set up, but the end result is exactly what I wanted.
-
List of your reverse proxied services
File Browser - For access to the files on my NAS
-
Self Hosted File upload service
filebrowser has user management plus sharing capabilities
-
Folder/File sharing with multiple links
Filebrowser supports multiple shares with different expiration dates. It also offers file previews and generates QR codes for the shares.
-
I need help creating a diy nas for under $1000
NextCloud is great for this, but if we're talking about sharing files from your synced project collection, I'd probably recommend Filebrowser instead. You can point it to the same data store that Syncthing is using and it'll make it easy to share the projects. Note that in order to do this you'll need to expose Filebrowser publicly. The simplest way to do this would probably be a Cloudflare tunnel, and for sharing files ad hoc like this I don't see any issues with their TOS. For things like Syncthing, though, you'll still want to do conventional port forwarding. The DIY approach instead of a Cloudflare tunnel would be to port forward, set up a dynamic DNS record, and set up Let's Encrypt certs.
What are some alternatives?
newspaper - newspaper3k is a news, full-text, and article metadata extraction library in Python 3.
Nextcloud - ☁️ Nextcloud server, a safe home for all your data
python-goose - HTML content / article extractor, web scraping lib in Python
Filestash - 🦄 A modern web client for SFTP, S3, FTP, WebDAV, Git, Minio, LDAP, CalDAV, CardDAV, Mysql, Backblaze, ...
TWINT - An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
filegator - Powerful Multi-User File Manager
html2text - Convert HTML to Markdown-formatted text.
OpenMediaVault - openmediavault is the next generation network attached storage (NAS) solution based on Debian Linux. Thanks to the modular design of the framework it can be enhanced via plugins. openmediavault is primarily designed to be used in home environments or small home offices.
Goose3 - A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
h5ai - HTTP web server index for Apache httpd, lighttpd and nginx.
textract - extract text from any document. no muss. no fuss.
tinyfilemanager - Single-file PHP file manager, browser and manage your files efficiently and easily with tinyfilemanager