Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality. Learn more →
Top 23 Python beautifulsoup4 Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
dude
dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
-
scrape-google-scholar-py
Extract data from all Google Scholar pages from a single Python module. NOTE: I'm no longer maintaining this repo. Chrome driver/selectors might need and update.
-
AmazonMe
Introducing the AmazonMe webscraper - a powerful tool for extracting data from Amazon.com using the Requests and Beautifulsoup library in Python. This scraper allows users to easily navigate and extract information from Amazon's website.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
telexkcdbot
A functional asynchronous telegram-bot for handy reading xkcd comics. https://t.me/telexkcdbot
-
permaculture
Permaculture design app built on scraped plant databases. Drag-n-drop GUI with detailed design plan generator.
-
larentals
An interactive map of for-sale & rental property listings in Los Angeles County, updated weekly.
-
Letterboxd-friend-ranker
Program that computes, ranks a given user and their friends based on Letterboxd ratings
-
simple-web-scraper
Simple web scraper to get player data using beatiful-soup4 and PostgreSQL as a database. SQLAlchemy as an ORM
-
web-scraping-with-python
Demonstration of Web Scraping using Selenium Python (Pytest & Pyunit) and Beautiful Soup
-
beautifulday
Learning project for scraping weather from weather.gc.ca. Print out simple or extended weather reports for any Canadian city to a console.
-
discord-bot
A Discord Bot that can be used to help manage and add functionality to Discord servers (by sachinlim)
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Also be sure to check out the interactive version of this spreadsheet at WhereToLive.LA courtesy of /u/cheeze_whiz_dot_com, my thanks to him.
Project mention: Pyppeteer Tutorial: The Ultimate Guide to Using Puppeteer with Python | dev.to | 2024-02-05import asyncio import pytest from pyppeteer.errors import PageError from urllib.parse import quote import json import os import sys from os import environ from pyppeteer import connect, launch exec_platform = os.getenv('EXEC_PLATFORM') # Get username and access key of the LambdaTest Platform username = environ.get('LT_USERNAME', None) access_key = environ.get('LT_ACCESS_KEY', None) test1_url = 'https://ecommerce-playground.lambdatest.io/' test2_url = 'https://scrapingclub.com/exercise/list_infinite_scroll/' # Usecase - 1 # loc_ecomm_1 = ".order-1.col-lg-6 div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) > div:nth-of-type(1) div:nth-of-type(1) > img:nth-of-type(1)" loc_ecomm_1 = "[aria-label='1 / 2'] div:nth-of-type(1) > [alt='Nikon D300']" target_url_1 = "https://ecommerce-playground.lambdatest.io/index.php?route=product/product&product_id=63" # Usecase - 2 (Click on e-commerce sliding banner) loc_ecomm_2 = "[alt='Canon DSLR camera']" target_url_2 = "https://ecommerce-playground.lambdatest.io/index.php?route=product/product&product_id=30" # Usecase - 3 Automating interactions on https://scrapingclub.com/exercise/list_infinite_scroll/ loc_infinite_src_prod1 = ".grid .p-4 [href='/exercise/list_basic_detail/93926-C/']" target_url_3 = "https://scrapingclub.com/exercise/list_basic_detail/93926-C/" # Usecase - 4 Automating interactions on https://scrapingclub.com/exercise/list_infinite_scroll/ # when the images are lazy loaded loc_infinite_src_prod2 = "div:nth-of-type(31) > .p-4 [href='/exercise/list_basic_detail/94967-A/']" target_url_4 = "https://scrapingclub.com/exercise/list_basic_detail/94967-A/" # Set timeout in ms timeOut = 60000 async def scroll_to_element(page, selector): # Scroll until the element is detected await page.evaluateHandle( '''async (selector) => { const element = document.querySelector(selector); if (element) { element.scrollIntoView(); } }''', selector ) return selector async def scroll_carousel(page, scr_count): for scr in range(1, scr_count): elem_next_button = "#mz-carousel-213240 > ul li:nth-child(" + str(scr) + ")" await asyncio.sleep(1) elem_next_button = await page.querySelector(elem_next_button) await elem_next_button.click() # Replica of https://github.com/hjsblogger/web-scraping-with-python/blob/ # main/tests/beautiful-soup/test_infinite_scraping.py#L67C5-L80C18 async def scroll_end_of_page(page): start_height = await page.evaluate('document.documentElement.scrollHeight') while True: # Scroll to the bottom of the page await page.evaluate(f'window.scrollTo(0, {start_height})') # Wait for the content to load await asyncio.sleep(1) # Get the new scroll height scroll_height = await page.evaluate('document.documentElement.scrollHeight') if scroll_height == start_height: # If heights are the same, we reached the end of the page break # Add an additional wait await asyncio.sleep(2) start_height = scroll_height # Additional wait after scrolling await asyncio.sleep(2) @pytest.mark.asyncio @pytest.mark.order(1) async def test_lazy_load_ecomm_1(page): # The time out can be set using the setDefaultNavigationTimeout # It is primarily used for overriding the default page timeout of 30 seconds page.setDefaultNavigationTimeout(timeOut) await page.goto(test1_url, {'waitUntil': 'load', 'timeout': timeOut}) # Set the viewport - Apple MacBook Air 13-inch # Reference - https://codekbyte.com/devices-viewport-sizes/ # await page.setViewport({'width': 1440, 'height': 770}) await asyncio.sleep(2) if exec_platform == 'local': # Scroll until the element is detected elem_button = await scroll_to_element(page, loc_ecomm_1) # await page.click(elem_button) # Wait until the page is loaded # https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.waitForNavigation navigationPromise = asyncio.ensure_future(page.waitForNavigation()) await page.click(elem_button) await navigationPromise elif exec_platform == 'cloud': elem_button = await page.waitForSelector(loc_ecomm_1, {'visible': True}) await asyncio.gather( elem_button.click(), page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 30000}), ) # Assert if required, since the test is a simple one; we leave as is :D current_url = page.url print('Current URL is: ' + current_url) try: assert current_url == target_url_1 print("Test Success: Product checkout successful") except PageError as e: print("Test Failure: Could not checkout Product") print("Error Code" + str(e)) @pytest.mark.asyncio @pytest.mark.order(2) async def test_lazy_load_ecomm_2(page): carousel_len = 4 # The time out can be set using the setDefaultNavigationTimeout # It is primarily used for overriding the default page timeout of 30 seconds page.setDefaultNavigationTimeout(timeOut) await page.goto(test1_url, {'waitUntil': 'load', 'timeout': timeOut}) # Set the viewport - Apple MacBook Air 13-inch # Reference - https://codekbyte.com/devices-viewport-sizes/ # await page.setViewport({'width': 1440, 'height': 770}) await asyncio.sleep(2) # Approach 1: Directly click on the third button on the carousel # elem_carousel_banner = await page.querySelector("#mz-carousel-213240 > ul li:nth-child(3)") # await asyncio.sleep(1) # await elem_carousel_banner.click() # Approach 2 (Only for demo): Serially click on every button on carousel await scroll_carousel(page, carousel_len) await asyncio.sleep(1) # elem_prod_1 = await page.querySelector(loc_ecomm_2) elem_prod_1 = await page.waitForSelector(loc_ecomm_2, {'visible': True}) await asyncio.gather( elem_prod_1.click(), page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}), ) # Assert if required, since the test is a simple one; we leave as is :D current_url = page.url print('Current URL is: ' + current_url) try: assert current_url == target_url_2 print("Test Success: Product checkout successful") except PageError as e: print("Test Failure: Could not checkout Product") print("Error Code" + str(e)) @pytest.mark.asyncio @pytest.mark.order(3) async def test_lazy_load_infinite_scroll_1(page): # The time out can be set using the setDefaultNavigationTimeout # It is primarily used for overriding the default page timeout of 30 seconds page.setDefaultNavigationTimeout(timeOut) await page.goto(test2_url, {'waitUntil': 'load', 'timeout': timeOut}) # Set the viewport - Apple MacBook Air 13-inch # Reference - https://codekbyte.com/devices-viewport-sizes/ # await page.setViewport({'width': 1440, 'height': 770}) await asyncio.sleep(1) elem_prod1 = await page.querySelector(loc_infinite_src_prod1) await asyncio.gather( elem_prod1.click(), page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}), ) # await asyncio.sleep(1) # await elem_carousel_banner.click() # elem_button = scroll_to_element(page, loc_infinite_src_prod1) # print(elem_button) # await asyncio.sleep(2) # await elem_button.click() # Assert if required, since the test is a simple one; we leave as is :D current_url = page.url print('Current URL is: ' + current_url) try: assert current_url == target_url_3 print("Test Success: Product checkout successful") except PageError as e: print("Test Failure: Could not checkout Product") print("Error Code" + str(e)) @pytest.mark.asyncio @pytest.mark.order(4) async def test_lazy_load_infinite_scroll_2(page): # The time out can be set using the setDefaultNavigationTimeout # It is primarily used for overriding the default page timeout of 30 seconds page.setDefaultNavigationTimeout(timeOut) # Tested navigation using LambdaTest YouTube channel # await page.goto("https://www.youtube.com/@LambdaTest/videos", await page.goto(test2_url, {'waitUntil': 'load', 'timeout': timeOut}) # Set the viewport - Apple MacBook Air 13-inch # Reference - https://codekbyte.com/devices-viewport-sizes/ # await page.setViewport({'width': 1440, 'height': 770}) await asyncio.sleep(1) await scroll_end_of_page(page) await page.evaluate('window.scrollTo(0, 0)') await asyncio.sleep(1) # elem_prod = await page.querySelector(loc_infinite_src_prod2) # asyncio.sleep(1) # await asyncio.gather( # elem_prod.click(), # page.waitForNavigation({'waitUntil': 'load', 'timeout': 60000}), # ) elem_button = await scroll_to_element(page, loc_infinite_src_prod2) await asyncio.sleep(1) # await page.click(elem_button) await asyncio.gather( page.click(elem_button), page.waitForNavigation({'waitUntil': 'networkidle2', 'timeout': 60000}), ) # Assert if required, since the test is a simple one; we leave as is :D current_url = page.url print('Current URL is: ' + current_url) try: assert current_url == target_url_4 print("Test Success: Product checkout successful") except PageError as e: print("Test Failure: Could not checkout Product") print("Error Code" + str(e))
Python beautifulsoup4 related posts
-
I create my first webscraping for yellowpages.com
-
New L.A. County rental listings, week of 6-12-2023
-
New L.A. County rental listings, week of 6-12-2023
-
We will NOT be participating in the blackout.
-
Looking for 2 bedroom apartment/condo in West-East Hollywood.
-
Looking for a rental apartment
-
Scrape Google Scholar in R
-
A note from our sponsor - InfluxDB
www.influxdata.com | 10 May 2024
Index
What are some of the best open-source beautifulsoup4 projects in Python? This list will help you:
Project | Stars | |
---|---|---|
1 | JobFunnel | 1,740 |
2 | PornHub-downloader-python | 750 |
3 | dude | 414 |
4 | facebook-post-scraper | 295 |
5 | scrape-google-scholar-py | 75 |
6 | Quest | 75 |
7 | AmazonMe | 46 |
8 | amazon_wishlist_pricewatch | 29 |
9 | rango | 23 |
10 | telexkcdbot | 21 |
11 | permaculture | 16 |
12 | larentals | 13 |
13 | Letterboxd-friend-ranker | 12 |
14 | hu-announcement-bot | 9 |
15 | simple-web-scraper | 9 |
16 | dicer | 6 |
17 | HackerNEWS-Simplified | 6 |
18 | reactjs-docs-ebook | 6 |
19 | web-scraping-with-python | 5 |
20 | acgn-bot | 4 |
21 | beautifulday | 3 |
22 | goodreads2libby | 2 |
23 | discord-bot | 1 |
Sponsored