Python Beautifulsoup

Open-source Python projects categorized as Beautifulsoup

Top 23 Python Beautifulsoup Projects

  • requests-html

    Pythonic HTML Parsing for Humans™
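
    As a quick taste of the API, here is a minimal sketch (example.com stands in for a real target):

```python
from requests_html import HTMLSession

# Minimal sketch: example.com is a placeholder target
session = HTMLSession()
r = session.get("https://example.com/")

# Set of all links found on the page
print(r.html.links)

# CSS-selector search; first=True returns a single Element (or None)
heading = r.html.find("h1", first=True)
if heading:
    print(heading.text)
```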

  • MechanicalSoup

    A Python library for automating interaction with websites.

  • Project mention: How to scrape a website with Python (Beginner tutorial) | dev.to | 2024-02-22

    MechanicalSoup is a Python library for web scraping that combines the simplicity of Requests with the convenience of BeautifulSoup. It's particularly useful for interacting with web forms, like login pages. Here's a basic example to illustrate how you can use MechanicalSoup for web scraping:
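
    A minimal sketch of that form-filling workflow (the login URL and form field names below are placeholders, not a real site):

```python
import mechanicalsoup

# Placeholder URL and field names; adapt these to the target site's form
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/login")

# Pick the login form and fill in its fields
browser.select_form('form[action="/login"]')
browser["username"] = "my_user"
browser["password"] = "my_password"
browser.submit_selected()

# browser.page is the BeautifulSoup parse of the current page
for link in browser.page.select("a"):
    print(link.get("href"))
```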

  • JobFunnel

    Scrape job websites into a single spreadsheet with no duplicates.

  • tiktok-downloader

    TikTok downloader/scraper using requests & bs4

  • Project mention: Anyone know how to bulk download a tiktok profile with no watermark and in HD? | /r/DataHoarder | 2023-05-30
  • soupsieve

    A modern CSS selector implementation for BeautifulSoup
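
    A small illustrative sketch of calling soupsieve directly; since BeautifulSoup 4.7.0 it is also what backs BeautifulSoup's own select():

```python
import soupsieve as sv
from bs4 import BeautifulSoup

html = '<div><p class="intro">Hello</p><p>World</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Direct soupsieve call: returns a list of matching tags
for tag in sv.select("p.intro", soup):
    print(tag.text)  # Hello

# The same selector via BeautifulSoup's select(), which soupsieve implements
print(soup.select_one("p.intro").text)  # Hello
```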

  • languagepod101-scraper

    Python scraper for Language Pods such as Japanesepod101.com. Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨

  • WhatSoup

    A web scraper that exports your entire WhatsApp chat history.

  • web_to_obsidian

    A Python 3 script that scrapes an HTML/XML page to extract text, then creates Markdown files for Obsidian and the Dataview plugin.
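
    As a generic sketch of that scrape-to-Markdown pattern (not the project's actual code; the URL and output path are illustrative):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; a real run would point at the page to archive
url = "https://example.com/article"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

# Pull a title and the visible paragraph text
title = soup.title.string.strip() if soup.title and soup.title.string else "untitled"
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Write a Markdown note that Obsidian (and Dataview) can index
with open(f"{title}.md", "w", encoding="utf-8") as f:
    f.write(f"# {title}\n\n")
    f.write("\n\n".join(paragraphs) + "\n")
```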

  • reddit-bots

    A collection of Reddit bots that I use to enhance the subreddits I manage.

  • tweet-transcriber

    A Reddit bot that transcribes tweets from comments and submissions links, mirrors their images and replies back with a formatted Markdown message.

  • PythonAutomateCybersecurity

    Course covering Task Automation and Cyber Security in Python

  • Letterboxd-friend-ranker

    Program that scores and ranks a given user and their friends based on their Letterboxd ratings.

  • DDD

    🎧 CLI Python tool for bulk downloading Darknet Diaries podcast. Hate being online? This is the way. (by Psyhackological)

  • Amazon-Product-Information-Scraper

    This Python web-scraping project retrieves product names, prices, review stars, and review counts for a specific product category.

  • tabroom-API

    tournaments.tech's API for scraping tabroom.com

  • Project mention: Debate Land Beta 0.2 is out! | /r/Debate | 2023-06-03

    Now, does a data site for high school debate need all of that? Probably not. But we wanted to create something stunning. Our goal for Debate Land is to create something MaxPreps would be jealous of. And we've never really had a problem with the UI/UX, judging from our feedback; our KPIs have jumped significantly since the redesign from tournaments.tech (which had a simpler layout and design).

  • weheartpy

    A fast, reliable API wrapper for weheartit.com

  • statum

    🗺️ statum, a Twitch streamer-related website. Written in Python + Flask, with MongoDB. Current features include Twitch OAuth integration, personalized dashboard, unique streamer insights & much more.

  • web-scraping-with-python

    Demonstration of Web Scraping using Selenium Python (Pytest & Pyunit) and Beautiful Soup

  • Project mention: Pyppeteer Tutorial: The Ultimate Guide to Using Puppeteer with Python | dev.to | 2024-02-05
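
    Each test below takes a `page` argument, which implies a pytest fixture (typically defined in a conftest.py) that yields a Pyppeteer `Page`; the `@pytest.mark.asyncio` and `@pytest.mark.order` markers likewise assume the pytest-asyncio and pytest-order plugins are installed.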

```python
import asyncio
import json
import os
import sys
from os import environ
from urllib.parse import quote

import pytest
from pyppeteer import connect, launch
from pyppeteer.errors import PageError

exec_platform = os.getenv('EXEC_PLATFORM')

# Get username and access key of the LambdaTest Platform
username = environ.get('LT_USERNAME', None)
access_key = environ.get('LT_ACCESS_KEY', None)

test1_url = 'https://ecommerce-playground.lambdatest.io/'
test2_url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'

# Usecase - 1
# loc_ecomm_1 = ".order-1.col-lg-6 div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) div:nth-of-type(1) > img:nth-of-type(1)"
loc_ecomm_1 = "[aria-label='1 / 2'] div:nth-of-type(1) > [alt='Nikon D300']"
target_url_1 = "https://ecommerce-playground.lambdatest.io/index.php?route=product/product&product_id=63"

# Usecase - 2 (Click on e-commerce sliding banner)
loc_ecomm_2 = "[alt='Canon DSLR camera']"
target_url_2 = "https://ecommerce-playground.lambdatest.io/index.php?route=product/product&product_id=30"

# Usecase - 3: Automating interactions on https://scrapingclub.com/exercise/list_infinite_scroll/
loc_infinite_src_prod1 = ".grid .p-4 [href='/exercise/list_basic_detail/93926-C/']"
target_url_3 = "https://scrapingclub.com/exercise/list_basic_detail/93926-C/"

# Usecase - 4: Automating interactions on https://scrapingclub.com/exercise/list_infinite_scroll/
# when the images are lazy loaded
loc_infinite_src_prod2 = "div:nth-of-type(31) > .p-4 [href='/exercise/list_basic_detail/94967-A/']"
target_url_4 = "https://scrapingclub.com/exercise/list_basic_detail/94967-A/"

# Set timeout in ms
timeOut = 60000


async def scroll_to_element(page, selector):
    # Scroll until the element is detected
    await page.evaluateHandle(
        '''async (selector) => {
            const element = document.querySelector(selector);
            if (element) {
                element.scrollIntoView();
            }
        }''',
        selector
    )
    return selector


async def scroll_carousel(page, scr_count):
    for scr in range(1, scr_count):
        elem_next_button = "#mz-carousel-213240 > ul li:nth-child(" + str(scr) + ")"
        await asyncio.sleep(1)
        elem_next_button = await page.querySelector(elem_next_button)
        await elem_next_button.click()


# Replica of https://github.com/hjsblogger/web-scraping-with-python/blob/
# main/tests/beautiful-soup/test_infinite_scraping.py#L67C5-L80C18
async def scroll_end_of_page(page):
    start_height = await page.evaluate('document.documentElement.scrollHeight')
    while True:
        # Scroll to the bottom of the page
        await page.evaluate(f'window.scrollTo(0, {start_height})')
        # Wait for the content to load
        await asyncio.sleep(1)
        # Get the new scroll height
        scroll_height = await page.evaluate('document.documentElement.scrollHeight')
        if scroll_height == start_height:
            # If heights are the same, we reached the end of the page
            break
        # Add an additional wait
        await asyncio.sleep(2)
        start_height = scroll_height
    # Additional wait after scrolling
    await asyncio.sleep(2)


@pytest.mark.asyncio
@pytest.mark.order(1)
async def test_lazy_load_ecomm_1(page):
    # The timeout can be set using setDefaultNavigationTimeout; it is
    # primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)
    await page.goto(test1_url, {'waitUntil': 'load', 'timeout': timeOut})
    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})
    await asyncio.sleep(2)
    if exec_platform == 'local':
        # Scroll until the element is detected
        elem_button = await scroll_to_element(page, loc_ecomm_1)
        # await page.click(elem_button)
        # Wait until the page is loaded
        # https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.waitForNavigation
        navigationPromise = asyncio.ensure_future(page.waitForNavigation())
        await page.click(elem_button)
        await navigationPromise
    elif exec_platform == 'cloud':
        elem_button = await page.waitForSelector(loc_ecomm_1, {'visible': True})
        await asyncio.gather(
            elem_button.click(),
            page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 30000}),
        )
    # Assert if required; since the test is a simple one, we leave it as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)
    try:
        assert current_url == target_url_1
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))


@pytest.mark.asyncio
@pytest.mark.order(2)
async def test_lazy_load_ecomm_2(page):
    carousel_len = 4
    # The timeout can be set using setDefaultNavigationTimeout; it is
    # primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)
    await page.goto(test1_url, {'waitUntil': 'load', 'timeout': timeOut})
    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})
    await asyncio.sleep(2)
    # Approach 1: Directly click on the third button on the carousel
    # elem_carousel_banner = await page.querySelector("#mz-carousel-213240 > ul li:nth-child(3)")
    # await asyncio.sleep(1)
    # await elem_carousel_banner.click()
    # Approach 2 (Only for demo): Serially click on every button on carousel
    await scroll_carousel(page, carousel_len)
    await asyncio.sleep(1)
    # elem_prod_1 = await page.querySelector(loc_ecomm_2)
    elem_prod_1 = await page.waitForSelector(loc_ecomm_2, {'visible': True})
    await asyncio.gather(
        elem_prod_1.click(),
        page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}),
    )
    # Assert if required; since the test is a simple one, we leave it as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)
    try:
        assert current_url == target_url_2
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))


@pytest.mark.asyncio
@pytest.mark.order(3)
async def test_lazy_load_infinite_scroll_1(page):
    # The timeout can be set using setDefaultNavigationTimeout; it is
    # primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)
    await page.goto(test2_url, {'waitUntil': 'load', 'timeout': timeOut})
    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})
    await asyncio.sleep(1)
    elem_prod1 = await page.querySelector(loc_infinite_src_prod1)
    await asyncio.gather(
        elem_prod1.click(),
        page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}),
    )
    # await asyncio.sleep(1)
    # await elem_carousel_banner.click()
    # elem_button = scroll_to_element(page, loc_infinite_src_prod1)
    # print(elem_button)
    # await asyncio.sleep(2)
    # await elem_button.click()
    # Assert if required; since the test is a simple one, we leave it as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)
    try:
        assert current_url == target_url_3
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))


@pytest.mark.asyncio
@pytest.mark.order(4)
async def test_lazy_load_infinite_scroll_2(page):
    # The timeout can be set using setDefaultNavigationTimeout; it is
    # primarily used for overriding the default page timeout of 30 seconds
    page.setDefaultNavigationTimeout(timeOut)
    # Tested navigation using LambdaTest YouTube channel
    # await page.goto("https://www.youtube.com/@LambdaTest/videos",
    await page.goto(test2_url, {'waitUntil': 'load', 'timeout': timeOut})
    # Set the viewport - Apple MacBook Air 13-inch
    # Reference - https://codekbyte.com/devices-viewport-sizes/
    # await page.setViewport({'width': 1440, 'height': 770})
    await asyncio.sleep(1)
    await scroll_end_of_page(page)
    await page.evaluate('window.scrollTo(0, 0)')
    await asyncio.sleep(1)
    # elem_prod = await page.querySelector(loc_infinite_src_prod2)
    # asyncio.sleep(1)
    # await asyncio.gather(
    #     elem_prod.click(),
    #     page.waitForNavigation({'waitUntil': 'load', 'timeout': 60000}),
    # )
    elem_button = await scroll_to_element(page, loc_infinite_src_prod2)
    await asyncio.sleep(1)
    # await page.click(elem_button)
    await asyncio.gather(
        page.click(elem_button),
        page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}),
    )
    # Assert if required; since the test is a simple one, we leave it as is :D
    current_url = page.url
    print('Current URL is: ' + current_url)
    try:
        assert current_url == target_url_4
        print("Test Success: Product checkout successful")
    except PageError as e:
        print("Test Failure: Could not checkout Product")
        print("Error Code" + str(e))
```

  • python-web-scraping-primjeri

    Web scraping of the sites posta.hr, konzum.hr, index.hr, njuskalo.hr, neostar.com, DasWeltAuto.hr, ...

  • python_portfolio_web_scraper-spotrac

    Python solution to web-scrape contract data from https://www.spotrac.com

  • israbrew

    Beers from various suppliers across the state scraped onto one website

  • GSOC_org_analysis

    Welcome to my first full project!

  • flannelfynet

    Find your Fantano score.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-22.

Index

What are some of the best open-source Beautifulsoup projects in Python? This list will help you:

Rank  Project                               Stars
   1  requests-html                        13,574
   2  MechanicalSoup                        4,545
   3  JobFunnel                             1,740
   4  tiktok-downloader                       263
   5  soupsieve                               187
   6  languagepod101-scraper                  145
   7  WhatSoup                                108
   8  web_to_obsidian                          45
   9  reddit-bots                              23
  10  tweet-transcriber                        19
  11  PythonAutomateCybersecurity              15
  12  Letterboxd-friend-ranker                 12
  13  DDD                                      11
  14  Amazon-Product-Information-Scraper       10
  15  tabroom-API                               8
  16  weheartpy                                 4
  17  statum                                    4
  18  web-scraping-with-python                  4
  19  python-web-scraping-primjeri              3
  20  python_portfolio_web_scraper-spotrac      3
  21  israbrew                                  1
  22  GSOC_org_analysis                         0
  23  flannelfynet                              0