Top 23 beautifulsoup4 Open-Source Projects

Each entry lists the project name, then its mentions, stars, activity score, and primary language, followed by a short description.

JobFunnel

1 1,740 0.0 Python

Scrape job websites into a single spreadsheet with no duplicates.
PornHub-downloader-python

2 750 0.0 Python

Download stuff from PH the easy way.
dude

28 413 9.0 Python

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

Project mention: Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next | /r/webscraping | 2023-05-08

Please check https://github.com/roniemartinez/dude

facebook-post-scraper

3 295 0.0 Python

Facebook Post Scraper 🕵️🖱️
scrape-google-scholar-py

2 75 6.4 Python

Extract data from all Google Scholar pages from a single Python module. NOTE: I'm no longer maintaining this repo. Chrome driver/selectors might need an update.

Project mention: Scrape Google Scholar in R | dev.to | 2023-05-06

scrape-google-scholar-py is an open-source project of mine that aims to extract all the possible data from Google Scholar. In the future I'll port it to R.

Quest

1 75 2.3 Python

This is a web app that integrates GPT-3 with Google searches (by farrael004)
AmazonMe

110 42 8.5 Python

Introducing the AmazonMe web scraper, a tool for extracting data from Amazon.com using the Requests and Beautiful Soup libraries in Python. This scraper allows users to easily navigate and extract information from Amazon's website.
amazon_wishlist_pricewatch

2 29 0.0 Python

Periodically check your public Amazon wishlist for price reductions.
rango

2 23 0.0 Python

Telegram bot to download torrents. (by kaushalpurohit)
telexkcdbot

9 21 0.0 Python

A functional asynchronous Telegram bot for conveniently reading xkcd comics. https://t.me/telexkcdbot
permaculture

1 16 3.7 Python

Permaculture design app built on scraped plant databases. Drag-n-drop GUI with detailed design plan generator.
Letterboxd-friend-ranker

1 12 10.0 Python

Program that computes rankings for a given user and their friends based on Letterboxd ratings.
larentals

13 12 9.5 Python

An interactive map of for-sale & rental property listings in Los Angeles County, updated weekly.

Project mention: New L.A. County rental listings, week of 6-12-2023 | /r/LARentals | 2023-06-12

Also be sure to check out the interactive version of this spreadsheet at WhereToLive.LA courtesy of /u/cheeze_whiz_dot_com, my thanks to him.

Simple-Web-Crawler

1 11 0.0 CSS

A simple web scraper. Given a URL and an HTML tag provided by the user, it scrapes the page, returns the total number of elements fetched, and then displays the results of the scrape.
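
The flow described for Simple-Web-Crawler (take a page's HTML and a tag name, count the matching elements, then show their contents) can be sketched roughly as follows. This is not the project's code. It is a minimal illustration that uses the standard library's html.parser in place of Beautiful Soup so it runs without extra dependencies, works on an HTML string rather than fetching a URL, and assumes the matched elements have explicit closing tags. The names TagCollector and scrape_tag are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text content of every element with a given tag name.

    Hypothetical helper for illustration; assumes explicit closing tags.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self._open = []      # one text buffer per currently open matching element
        self.results = []    # text of each matched element, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._open.append([])

    def handle_data(self, data):
        # Text belongs to every matching element that is currently open.
        for buffer in self._open:
            buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._open:
            self.results.append("".join(self._open.pop()).strip())


def scrape_tag(html, tag):
    """Return (number of matching elements, list of their text contents)."""
    collector = TagCollector(tag)
    collector.feed(html)
    collector.close()
    return len(collector.results), collector.results


if __name__ == "__main__":
    sample = "<ul><li>One</li><li>Two</li></ul><p>Footer</p>"
    count, texts = scrape_tag(sample, "li")
    print(f"Found {count} <li> elements:")
    for text in texts:
        print(" -", text)
```

With Beautiful Soup installed, the same count would typically come from calling find_all with the tag name on a parsed document; the parser subclass above only stands in for that step.
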
Data-extraction-and-text-analysis

1 11 1.7 Jupyter Notebook

The objective of this assignment is to extract the textual data of articles from the given URL and perform text analysis to compute variables.

Project mention: Need help wrt NLP | /r/learnmachinelearning | 2023-07-03

Something like this solution: GitHub Ipynb file

hu-announcement-bot

1 9 7.9 Python

Get the latest from Hacettepe with this amazing Telegram Bot!
simple-web-scraper

1 9 8.6 Python

Simple web scraper to get player data using beautifulsoup4, with PostgreSQL as the database and SQLAlchemy as the ORM.
reactjs-docs-ebook

1 6 10.0 Python

Exports React docs as EPUB ebook.
BANaNAS

2 6 1.9 Svelte

The web app that takes two random pieces of data from around the web and tries to find any correlation, no matter how wild or far-fetched
HackerNEWS-Simplified

5 6 0.0 Python

A more simplified, straightforward, and plain version of Hacker News.
web-scraping-with-python

2 4 7.6 Python

Demonstration of Web Scraping using Selenium Python (Pytest & Pyunit) and Beautiful Soup

Project mention: Pyppeteer Tutorial: The Ultimate Guide to Using Puppeteer with Python | dev.to | 2024-02-05

import asyncio
import pytest
from pyppeteer.errors import PageError
from urllib.parse import quote
import json
import os
import sys
from os import environ
from pyppeteer import connect, launch

exec_platform = os.getenv('EXEC_PLATFORM')

# Get username and access key of the LambdaTest Platform
username = environ.get('LT_USERNAME', None)
access_key = environ.get('LT_ACCESS_KEY', None)

test1_url = 'https://ecommerce-playground.lambdatest.io/'
test2_url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'

# Usecase - 1
# loc_ecomm_1 = ".order-1.col-lg-6 div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) div:nth-of-type(1) > img:nth-of-type(1)"
loc_ecomm_1 = "[aria-label='1 / 2'] div:nth-of-type(1) > [alt='Nikon D300']"
target_url_1 = "https://ecommerce-playground.lambdatest.io/index.php?route=product/product&product_id=63"

# Usecase - 2 (Click on e-commerce sliding banner)
loc_ecomm_2 = "[alt='Canon DSLR camera']"
target_url_2 = "https://ecommerce-playground.lambdatest.io/index.php?route=product/product&product_id=30"

# Usecase - 3 Automating interactions on https://scrapingclub.com/exercise/list_infinite_scroll/
loc_infinite_src_prod1 = ".grid .p-4 [href='/exercise/list_basic_detail/93926-C/']"
target_url_3 = "https://scrapingclub.com/exercise/list_basic_detail/93926-C/"

# Usecase - 4 Automating interactions on https://scrapingclub.com/exercise/list_infinite_scroll/
# when the images are lazy loaded
loc_infinite_src_prod2 = "div:nth-of-type(31) > .p-4 [href='/exercise/list_basic_detail/94967-A/']"
target_url_4 = "https://scrapingclub.com/exercise/list_basic_detail/94967-A/"

# Set timeout in ms
timeOut = 60000


async def scroll_to_element(page, selector):
    # Scroll until the element is detected
    await page.evaluateHandle(
        '''async (selector) => {
            const element = document.querySelector(selector);
            if (element) {
                element.scrollIntoView();
            }
        }''',
        selector
    )
    return selector


async def scroll_carousel(page, scr_count):
    for scr in range(1, scr_count):
        elem_next_button = "#mz-carousel-213240 > ul li:nth-child(" + str(scr) + ")"
        await asyncio.sleep(1)
        elem_next_button = await page.querySelector(elem_next_button)
        await elem_next_button.click()


# Replica of https://github.com/hjsblogger/web-scraping-with-python/blob/
# main/tests/beautiful-soup/test_infinite_scraping.py#L67C5-L80C18
async def scroll_end_of_page(page):
    start_height = await page.evaluate('document.documentElement.scrollHeight')

    while True:
        # Scroll to the bottom of the page
        await page.evaluate(f'window.scrollTo(0, {start_height})')

        # Wait for the content to load
        await asyncio.sleep(1)

        # Get the new scroll height
        scroll_height = await page.evaluate('document.documentElement.scrollHeight')

        if scroll_height == start_height:
            # If heights are the same, we reached the end of the page
            break

        # Add an additional wait
        await asyncio.sleep(2)

        start_height = scroll_height

    # Additional wait after scrolling
    await asyncio.sleep(2)


@pytest.mark.asyncio
@pytest.mark.order(1)
async def test_lazy_load_ecomm_1(page):
    # The time out can be set using the setDefaultNavigationTimeout
    # It is primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)
    await page.goto(test1_url, {'waitUntil': 'load', 'timeout': timeOut})

    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})

    await asyncio.sleep(2)

    if exec_platform == 'local':
        # Scroll until the element is detected
        elem_button = await scroll_to_element(page, loc_ecomm_1)

        # await page.click(elem_button)

        # Wait until the page is loaded
        # https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.waitForNavigation
        navigationPromise = asyncio.ensure_future(page.waitForNavigation())
        await page.click(elem_button)
        await navigationPromise
    elif exec_platform == 'cloud':
        elem_button = await page.waitForSelector(loc_ecomm_1, {'visible': True})
        await asyncio.gather(
            elem_button.click(),
            page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 30000}),
        )

    # Assert if required, since the test is a simple one; we leave as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)

    try:
        assert current_url == target_url_1
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))


@pytest.mark.asyncio
@pytest.mark.order(2)
async def test_lazy_load_ecomm_2(page):
    carousel_len = 4

    # The time out can be set using the setDefaultNavigationTimeout
    # It is primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)
    await page.goto(test1_url, {'waitUntil': 'load', 'timeout': timeOut})

    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})

    await asyncio.sleep(2)

    # Approach 1: Directly click on the third button on the carousel
    # elem_carousel_banner = await page.querySelector("#mz-carousel-213240 > ul li:nth-child(3)")
    # await asyncio.sleep(1)
    # await elem_carousel_banner.click()

    # Approach 2 (Only for demo): Serially click on every button on carousel
    await scroll_carousel(page, carousel_len)

    await asyncio.sleep(1)

    # elem_prod_1 = await page.querySelector(loc_ecomm_2)
    elem_prod_1 = await page.waitForSelector(loc_ecomm_2, {'visible': True})
    await asyncio.gather(
        elem_prod_1.click(),
        page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}),
    )

    # Assert if required, since the test is a simple one; we leave as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)

    try:
        assert current_url == target_url_2
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))


@pytest.mark.asyncio
@pytest.mark.order(3)
async def test_lazy_load_infinite_scroll_1(page):
    # The time out can be set using the setDefaultNavigationTimeout
    # It is primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)
    await page.goto(test2_url, {'waitUntil': 'load', 'timeout': timeOut})

    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})

    await asyncio.sleep(1)

    elem_prod1 = await page.querySelector(loc_infinite_src_prod1)
    await asyncio.gather(
        elem_prod1.click(),
        page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}),
    )

    # await asyncio.sleep(1)
    # await elem_carousel_banner.click()

    # elem_button = scroll_to_element(page, loc_infinite_src_prod1)
    # print(elem_button)
    # await asyncio.sleep(2)
    # await elem_button.click()

    # Assert if required, since the test is a simple one; we leave as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)

    try:
        assert current_url == target_url_3
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))


@pytest.mark.asyncio
@pytest.mark.order(4)
async def test_lazy_load_infinite_scroll_2(page):
    # The time out can be set using the setDefaultNavigationTimeout
    # It is primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)

    # Tested navigation using LambdaTest YouTube channel
    # await page.goto("https://www.youtube.com/@LambdaTest/videos",
    await page.goto(test2_url, {'waitUntil': 'load', 'timeout': timeOut})

    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})

    await asyncio.sleep(1)

    await scroll_end_of_page(page)

    await page.evaluate('window.scrollTo(0, 0)')

    await asyncio.sleep(1)

    # elem_prod = await page.querySelector(loc_infinite_src_prod2)
    # asyncio.sleep(1)
    # await asyncio.gather(
    #     elem_prod.click(),
    #     page.waitForNavigation({'waitUntil': 'load', 'timeout': 60000}),
    # )

    elem_button = await scroll_to_element(page, loc_infinite_src_prod2)

    await asyncio.sleep(1)

    # await page.click(elem_button)
    await asyncio.gather(
        page.click(elem_button),
        page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}),
    )

    # Assert if required, since the test is a simple one; we leave as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)

    try:
        assert current_url == target_url_4
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))

acgn-bot

1 4 8.5 Python

Telegram bot: Checks anime/comic/game/novel websites for updates
beautifulday

1 3 0.0 Python

Learning project for scraping weather from weather.gc.ca. Print out simple or extended weather reports for any Canadian city to a console.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

beautifulsoup4 related posts

I create my first webscraping for yellowpages.com
2 projects | /r/webscraping | 16 Jun 2023
New L.A. County rental listings, week of 6-12-2023
1 project | /r/LARentals | 12 Jun 2023
New L.A. County rental listings, week of 6-12-2023
1 project | /r/LAlist | 12 Jun 2023
We will NOT be participating in the blackout.
1 project | /r/LARentals | 11 Jun 2023
Looking for 2 bedroom apartment/condo in West-East Hollywood.
1 project | /r/LARentals | 8 Jun 2023
Looking for a rental apartment
1 project | /r/LARentals | 11 May 2023
Scrape Google Scholar in R
2 projects | dev.to | 6 May 2023

Index

What are some of the best open-source beautifulsoup4 projects? This list will help you:

	Project	Stars
1	JobFunnel	1,740
2	PornHub-downloader-python	750
3	dude	413
4	facebook-post-scraper	295
5	scrape-google-scholar-py	75
6	Quest	75
7	AmazonMe	42
8	amazon_wishlist_pricewatch	29
9	rango	23
10	telexkcdbot	21
11	permaculture	16
12	Letterboxd-friend-ranker	12
13	larentals	12
14	Simple-Web-Crawler	11
15	Data-extraction-and-text-analysis	11
16	hu-announcement-bot	9
17	simple-web-scraper	9
18	reactjs-docs-ebook	6
19	BANaNAS	6
20	HackerNEWS-Simplified	6
21	web-scraping-with-python	4
22	acgn-bot	4
23	beautifulday	3