URLExtract vs yarl
Metric | URLExtract | yarl
---|---|---
Mentions | 1 | 2
Stars | 236 | 1,237
Growth | - | 2.3%
Activity | 5.7 | 9.4
Latest commit | 3 months ago | 16 days ago
Language | Python | Python
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
URLExtract
- Famous HNers and Their Sites
That'd explain some of the holes mentioned in these comments. I think you just want to match any "word" containing ".[valid TLD]" and then exclude invalid URLs ("@" in first part indicating email, etc).
I've been using this[0] Python library which seemed good enough for my needs in some scraping project.
0: https://github.com/lipoja/URLExtract
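The heuristic described in that comment (accept any "word" containing ".[valid TLD]", then reject candidates with "@" in the first part) can be sketched with the standard library alone. This is a minimal illustration, not how URLExtract itself works: `extract_urls`, `VALID_TLDS`, and `TRAILING_PUNCTUATION` are hypothetical names, and the TLD set is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny TLD set for illustration only;
# a real extractor would load the full list of valid TLDs.
VALID_TLDS = {"com", "org", "net", "io", "dev"}

# Characters that often trail a URL in prose without being part of it.
TRAILING_PUNCTUATION = ".,;:!?)]}>'\""


def extract_urls(text, tlds=VALID_TLDS):
    """Return whitespace-delimited words whose host part ends in a known TLD."""
    found = []
    for word in text.split():
        # Trim surrounding brackets, quotes, and sentence punctuation.
        candidate = word.lstrip("([{<\"'").rstrip(TRAILING_PUNCTUATION)
        # Drop an optional scheme such as "https://".
        without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", candidate)
        # The "first part" is everything before the path, query, or fragment.
        first_part = re.split(r"[/?#]", without_scheme, maxsplit=1)[0]
        if "@" in first_part:
            continue  # "@" in the first part indicates an email address
        host = first_part.split(":", 1)[0]  # ignore an optional port
        labels = host.lower().split(".")
        # Require at least "name.tld", no empty labels, and a known TLD.
        if len(labels) >= 2 and all(labels) and labels[-1] in tlds:
            found.append(candidate)
    return found
```

For example, calling `extract_urls("See https://github.com/lipoja/URLExtract, or mail me@example.com")` keeps the GitHub link and skips the email address. The sketch still has holes of the kind the comment alludes to (internationalized domains, URLs without whitespace around them), which is one reason to prefer a maintained library.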
yarl
- Parsing URLs in Python
- What Is a URL: Dangers of inconsistent parsing of URLs
I think it's also worth using special objects instead of strings when handling URLs. Don't try to build URLs with strings, don't try to parse URLs as strings, rely on code that does that well and represents the URL as a special, non-string object. For Python, I really like yarl.
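To illustrate the idea of working with a structured object rather than a raw string, here is a minimal sketch that uses only the standard library's `urllib.parse` as a stand-in; it does not use yarl, which wraps the same idea in a dedicated URL class. The helper name `with_query_param` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def with_query_param(url, key, value):
    """Return `url` with query parameter `key` set to `value`."""
    # Parse once into a structured SplitResult instead of slicing the string.
    parts = urlsplit(url)
    # Decode the existing query into pairs, dropping any old value for `key`.
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    params.append((key, value))
    # urlencode handles the escaping; geturl() reassembles the components.
    return parts._replace(query=urlencode(params)).geturl()
```

Because the value is escaped by `urlencode` rather than concatenated by hand, characters such as `&`, `?`, and `/` in the value cannot accidentally alter the structure of the resulting URL.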
What are some alternatives?
MPKExtractor - Simple extractor script for Diablo Immortal's .MPK files
furl - 🌐 URL parsing and manipulation made easy.
proxy_web_crawler - Automates the process of repeatedly searching for a website via scraped proxy IP and search keywords
webargs - A friendly library for parsing HTTP request arguments, with built-in support for popular web frameworks, including Flask, Django, Bottle, Tornado, Pyramid, webapp2, Falcon, and aiohttp.
office365-audit-log-collector - Collect / retrieve Office365, AzureAD and DLP audit logs and output to PRTG, Azure Log Analytics Workspace, SQL, Graylog, Fluentd, and/or file output.
pyshorteners - 🔌 Generating short URLs with Python has never been easier
url_cleaner - A package for removing tracing parameters from URLs. This package supports automatically updating filtering rules from Adguard.
purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
courlan - Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters
short_url - Python implementation for generating Tiny URL- and bit.ly-like URLs.
bowlrl - A small roguelike
python-tcod - A high-performance Python port of libtcod. Includes the libtcodpy module for backwards compatibility with older projects.