Top 18 Python Crawling Projects

Scrapy

180 50,896 9.6 Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16

newspaper

13 13,720 0.0 Python

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
Grab

0 2,354 3.0 Python

Web Scraping Framework
mlscraper

10 1,225 0.6 Python

🤖 Scrape data from HTML websites automatically by just providing examples
botasaurus

5 899 9.2 Python

The All in One Framework to build Awesome Scrapers.

Project mention: This Week In Python | dev.to | 2024-04-05

botasaurus – The All in One Framework to build Awesome Scrapers

scrapyrt

3 816 6.8 Python

HTTP API for Scrapy spiders
isp-data-pollution

2 566 0.0 Python

ISP Data Pollution to Protect Private Browsing History with Obfuscation
spidermon

2 509 6.9 Python

Scrapy Extension for monitoring spiders execution.
WarcDB

7 383 7.2 Python

WarcDB: Web crawl data as SQLite databases.

Project mention: An Introduction to the WARC File | news.ycombinator.com | 2024-01-29

LinkedInDumper

1 362 6.8 Python

Python 3 script to dump/scrape/extract company employees from LinkedIn API
spidy Web Crawler

0 323 0.0 Python

The simple, easy to use command line web crawler.
telegram-crawler

1 236 9.7 Python

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.
estela

10 153 8.1 Python

estela, an elastic web scraping cluster 🕸
scrapfly-scrapers

4 126 9.4 Python

Web scrapers for popular targets, powered by Scrapfly.io

Project mention: How To Scrape TikTok in 2024 | dev.to | 2024-03-08

TikTok is one of the leading social media platforms, with an enormous amount of traffic. Imagine the insights that scraping it can provide! In this article, we'll explain how to scrape TikTok. We'll extract data from various TikTok sources, such as posts, comments, profiles and search pages, through hidden TikTok APIs and hidden JSON datasets. Let's get started!

Why Scrape TikTok?

The amount of social interaction on TikTok is vast, which allows for gathering insights for different use cases:

- Analyzing trends: Trends on TikTok change fast, making it challenging to stay up to date with users' current preferences. Scraping TikTok can capture these trend changes along with their impact, which helps align marketing strategies with users' interests.
- Lead generation: Scraping TikTok allows businesses to identify marketing opportunities and new customers, for example by finding influencers whose fan base matches the business domain.
- Sentiment analysis: TikTok comments are a good source of text data, which can be fed to sentiment analysis models to gather opinions on a given subject.

Setup

To web scrape TikTok, we'll use a few Python libraries:

- httpx: For sending HTTP requests to TikTok and getting the data in either HTML or JSON.
- parsel: For parsing the HTML and extracting elements using selectors, such as XPath and CSS.
- JMESPath: For parsing and refining the JSON datasets to exclude unnecessary details.
- loguru: For monitoring and logging our TikTok scraper in beautiful terminal outputs.
- scrapfly-sdk: For scraping TikTok pages that require JavaScript rendering and for using the advanced scraping features of ScrapFly.
- asyncio: For increasing our web scraping speed by running our code asynchronously.

Note that asyncio comes pre-installed in Python. Use the following command to install the other packages (the http2 extra is needed because the httpx client below enables HTTP/2):

    pip install "httpx[http2]" parsel jmespath loguru scrapfly-sdk

How to Scrape TikTok Profiles?

Let's start building the first parts of our TikTok scraper by scraping profile pages. Profile pages can include two types of data:

- Main profile data, such as the name, ID, followers and likes count.
- Channel data, if the profile has any posts. It includes data about each video, such as the name, description and view statistics.

Let's begin by scraping the profile data, which can be found under JavaScript script tags. To view this tag, follow the steps below:

1. Go to any profile page on TikTok.
2. Open the browser developer tools by pressing the F12 key.
3. Look for the script tag with the ID __UNIVERSAL_DATA_FOR_REHYDRATION__.

The identified tag contains a comprehensive JSON dataset about the web app, browser and localization details. However, the profile data can be found under the webapp.user-detail key:

[Figure: Hidden JSON data on profile pages]

The above data is commonly referred to as hidden web data. It's the same data shown on the page, before it gets rendered into the HTML.
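To make the idea of hidden web data concrete, here is a minimal offline sketch that pulls a JSON payload out of a script tag using only the Python standard library. The HTML snippet, the `example_user` value and the helper names (`ScriptTextExtractor`, `extract_hidden_json`) are made up for illustration; only the script ID and the key path come from the article, and a live page may be structured differently.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class ScriptTextExtractor(HTMLParser):
    """Collect the text inside the <script> tag that has a given id."""

    def __init__(self, script_id: str):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_hidden_json(html: str, script_id: str) -> Optional[dict]:
    """Return the parsed JSON found in the script tag, or None if it is absent."""
    parser = ScriptTextExtractor(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return json.loads(text) if text else None


# Hypothetical page fragment; real TikTok pages carry a much larger dataset.
SAMPLE_HTML = """
<html><body>
<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">
{"__DEFAULT_SCOPE__": {"webapp.user-detail": {"userInfo": {"user": {"uniqueId": "example_user"}}}}}
</script>
</body></html>
"""

hidden = extract_hidden_json(SAMPLE_HTML, "__UNIVERSAL_DATA_FOR_REHYDRATION__")
user_info = hidden["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"]
print(user_info["user"]["uniqueId"])
```

The article's own scraper uses parsel and XPath for the same step; the standard-library parser is used here only so the sketch runs without extra packages.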
Python:

    import asyncio
    import json
    from typing import List, Dict
    from httpx import AsyncClient, Response
    from parsel import Selector
    from loguru import logger as log

    # initialize an async httpx client
    client = AsyncClient(
        # enable http2
        http2=True,
        # add basic browser like headers to prevent being blocked
        headers={
            "Accept-Language": "en-US,en;q=0.9",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
        },
    )

    def parse_profile(response: Response):
        """parse profile data from hidden scripts on the HTML"""
        assert response.status_code == 200, "request is blocked, use the ScrapFly codetabs"
        selector = Selector(response.text)
        data = selector.xpath("//script[@id='__UNIVERSAL_DATA_FOR_REHYDRATION__']/text()").get()
        profile_data = json.loads(data)["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"]
        return profile_data

    async def scrape_profiles(urls: List[str]) -> List[Dict]:
        """scrape tiktok profiles data from their URLs"""
        to_scrape = [client.get(url) for url in urls]
        data = []
        # scrape the URLs concurrently
        for response in asyncio.as_completed(to_scrape):
            response = await response
            profile_data = parse_profile(response)
            data.append(profile_data)
        log.success(f"scraped {len(data)} profiles from profile pages")
        return data

ScrapFly:

    import asyncio
    import json
    from typing import Dict, List
    from loguru import logger as log
    from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

    SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

    def parse_profile(response: ScrapeApiResponse):
        """parse profile data from hidden scripts on the HTML"""
        selector = response.selector
        data = selector.xpath("//script[@id='__UNIVERSAL_DATA_FOR_REHYDRATION__']/text()").get()
        profile_data = json.loads(data)["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"]
        return profile_data

    async def scrape_profiles(urls: List[str]) -> List[Dict]:
        """scrape tiktok profiles data from their URLs"""
        to_scrape = [ScrapeConfig(url, asp=True, country="US") for url in urls]
        data = []
        # scrape the URLs concurrently
        async for response in SCRAPFLY.concurrent_scrape(to_scrape):
            profile_data = parse_profile(response)
            data.append(profile_data)
        log.success(f"scraped {len(data)} profiles from profile pages")
        return data

Run the code:

    async def run():
        profile_data = await scrape_profiles(
            urls=[
                "https://www.tiktok.com/@oddanimalspecimens"
            ]
        )
        # save the result to a JSON file
        with open("profile_data.json", "w", encoding="utf-8") as file:
            json.dump(profile_data, file, indent=2, ensure_ascii=False)

    if __name__ == "__main__":
        asyncio.run(run())

Let's go through the above code:

- Create an async httpx client with basic browser headers to avoid blocking.
- Define a parse_profile function to select the script tag and parse the profile data.
- Define a scrape_profiles function to request the profile URLs concurrently while parsing the data from each page.

Running the above TikTok scraper will create a JSON file named profile_data.json.
Here is what it looks like:

    [
      {
        "user": {
          "id": "6976999329680589829",
          "shortId": "",
          "uniqueId": "oddanimalspecimens",
          "nickname": "Odd Animal Specimens",
          "avatarLarger": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/7327535918275887147~c5_1080x1080.jpeg?lk3s=a5d48078&x-expires=1709280000&x-signature=1rRtT4jX0Tk5hK6cpSsDcqeU7cM%3D",
          "avatarMedium": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/7327535918275887147~c5_720x720.jpeg?lk3s=a5d48078&x-expires=1709280000&x-signature=WXYAMT%2BIs9YV52R6jrg%2F1ccwdcE%3D",
          "avatarThumb": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/7327535918275887147~c5_100x100.jpeg?lk3s=a5d48078&x-expires=1709280000&x-signature=rURTqWGfKNEiwl42nGtc8ufRIOw%3D",
          "signature": "YOUTUBE: Odd Animal Specimens\nCONTACT: [email protected]",
          "createTime": 0,
          "verified": false,
          "secUid": "MS4wLjABAAAAmiTQjtyN2Q_JQji6RgtgX2fKqOA-gcAAUU4SF9c7ktL3uPoWu0nLpBfqixgacB8u",
          "ftc": false,
          "relation": 0,
          "openFavorite": false,
          "bioLink": {
            "link": "linktr.ee/oddanimalspecimens",
            "risk": 0
          },
          "commentSetting": 0,
          "commerceUserInfo": {
            "commerceUser": false
          },
          "duetSetting": 0,
          "stitchSetting": 0,
          "privateAccount": false,
          "secret": false,
          "isADVirtual": false,
          "roomId": "",
          "uniqueIdModifyTime": 0,
          "ttSeller": false,
          "region": "US",
          "profileTab": {
            "showMusicTab": false,
            "showQuestionTab": true,
            "showPlayListTab": true
          },
          "followingVisibility": 1,
          "recommendReason": "",
          "nowInvitationCardUrl": "",
          "nickNameModifyTime": 0,
          "isEmbedBanned": false,
          "canExpPlaylist": true,
          "profileEmbedPermission": 1,
          "language": "en",
          "eventList": [],
          "suggestAccountBind": false
        },
        "stats": {
          "followerCount": 2600000,
          "followingCount": 6,
          "heart": 44600000,
          "heartCount": 44600000,
          "videoCount": 124,
          "diggCount": 0,
          "friendCount": 3
        },
        "itemList": []
      }
    ]

We can successfully scrape TikTok for profile data. However, we are missing the profile's video data. Let's extract it!

How To Scrape TikTok Channels?
In this section, we'll scrape channel posts. The data we'll scrape is video data, which is only found on profiles with posts. Hence, this profile type is referred to as a channel.

The channel video data is loaded dynamically through JavaScript: scrolling down the page triggers background XHR calls that load more data. These calls are sent to the endpoint /api/post/item_list/, which returns the channel video data in batches.

To scrape channel data, we could request the /api/post/item_list/ endpoint directly. However, this endpoint requires many different parameters, which can be challenging to maintain. Therefore, we'll extract the data from the XHR calls instead.

TikTok allows non-logged-in users to view profile pages. However, it restricts any actions unless you are logged in, meaning that we can't scroll down using mouse actions. Therefore, we'll scroll down using JavaScript code that gets executed upon sending a request:

    function scrollToEnd(i) {
        // check if already at the bottom and stop if there aren't more scrolls
        if (window.innerHeight + window.scrollY >= document.body.scrollHeight) {
            console.log("Reached the bottom.");
            return;
        }

        // scroll down
        window.scrollTo(0, document.body.scrollHeight);

        // set a maximum of 15 iterations
        if (i < 15) {
            setTimeout(() => scrollToEnd(i + 1), 3000);
        } else {
            console.log("Reached the end of iterations.");
        }
    }

    scrollToEnd(0);

Here, we create a JavaScript function that scrolls down and waits 3 seconds between scroll iterations for the XHR requests to finish loading. It has a maximum of 15 scrolls, which is sufficient for most profiles.
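As a rough illustration of what happens to those batches once the XHR calls have been captured, the sketch below merges the itemList arrays from every call that hit the /api/post/item_list/ endpoint. The record shape (a dict with a `url` and a JSON string under `response.body`), the helper name `collect_post_batches` and the sample URLs and IDs are all assumptions made for this example, not output from TikTok.

```python
import json
from typing import Dict, List


def collect_post_batches(
    xhr_calls: List[Dict], endpoint: str = "/api/post/item_list/"
) -> List[Dict]:
    """Merge the itemList batches from all captured calls to the given endpoint."""
    posts: List[Dict] = []
    for call in xhr_calls:
        # skip unrelated background requests
        if endpoint not in call.get("url", ""):
            continue
        body = json.loads(call["response"]["body"])
        # a batch without posts may omit the key or set it to null
        posts.extend(body.get("itemList") or [])
    return posts


# Hypothetical captured calls: two post batches and one unrelated request.
sample_calls = [
    {
        "url": "https://www.tiktok.com/api/post/item_list/?cursor=0",
        "response": {"body": json.dumps({"itemList": [{"id": "1"}, {"id": "2"}]})},
    },
    {
        "url": "https://www.tiktok.com/api/other/endpoint/",
        "response": {"body": "{}"},
    },
    {
        "url": "https://www.tiktok.com/api/post/item_list/?cursor=2",
        "response": {"body": json.dumps({"itemList": [{"id": "3"}]})},
    },
]

posts = collect_post_batches(sample_calls)
print([post["id"] for post in posts])
```

Each scroll produces one more batch, so the number of merged posts grows with the number of scroll iterations the page allows.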
Let's use the above JavaScript code to scrape TikTok channel data from XHR calls:

    import jmespath
    import asyncio
    import json
    from typing import Dict, List
    from loguru import logger as log
    from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

    SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

    js_scroll_function = """
    function scrollToEnd(i) {
        // check if already at the bottom and stop if there aren't more scrolls
        if (window.innerHeight + window.scrollY >= document.body.scrollHeight) {
            console.log("Reached the bottom.");
            return;
        }

        // scroll down
        window.scrollTo(0, document.body.scrollHeight);

        // set a maximum of 15 iterations
        if (i < 15) {
            setTimeout(() => scrollToEnd(i + 1), 3000);
        } else {
            console.log("Reached the end of iterations.");
        }
    }

    scrollToEnd(0);
    """

    def parse_channel(response: ScrapeApiResponse):
        """parse channel video data from XHR calls"""
        # extract the xhr calls and extract the ones for videos
        _xhr_calls = response.scrape_result["browser_data"]["xhr_call"]
        post_calls = [c for c in _xhr_calls if "/api/post/item_list/" in c["url"]]
        post_data = []
        for post_call in post_calls:
            try:
                data = json.loads(post_call["response"]["body"])["itemList"]
            except Exception:
                raise Exception("Post data couldn't load")
            post_data.extend(data)
        # parse all the data using jmespath
        parsed_data = []
        for post in post_data:
            result = jmespath.search(
                """{
                createTime: createTime,
                desc: desc,
                id: id,
                stats: stats,
                contents: contents[].{desc: desc, textExtra: textExtra[].{hashtagName: hashtagName}},
                video: video
                }""",
                post
            )
            parsed_data.append(result)
        return parsed_data

    async def scrape_channel(url: str) -> List[Dict]:
        """scrape video data from a channel (profile with videos)"""
        log.info(f"scraping channel page with the URL {url} for post data")
        response = await SCRAPFLY.async_scrape(ScrapeConfig(
            url, asp=True, country="GB", render_js=True, rendering_wait=2000, js=js_scroll_function
        ))
        data = parse_channel(response)
        log.success(f"scraped {len(data)} posts data")
        return data

Run the code:

    async def run():
        channel_data = await scrape_channel(
            url="https://www.tiktok.com/@oddanimalspecimens"
        )
        # save the result to a JSON file
        with open("channel_data.json", "w", encoding="utf-8") as file:
            json.dump(channel_data, file, indent=2, ensure_ascii=False)

    if __name__ == "__main__":
        asyncio.run(run())

Let's break down the execution flow of the above TikTok web scraping code:

1. A request with a headless browser is sent to the profile page.
2. The JavaScript scroll function gets executed.
3. More channel video data is loaded through background XHR calls.
4. The parse_channel function iterates over the responses of all the XHR calls and saves the video data into the post_data array.
5. The channel data is refined using JMESPath to exclude the unnecessary details.

We have extracted only a small portion of each video's data from the responses we got. The full response includes further details that might be useful. Here is a sample output of the results we got:

    [
      {
        "createTime": 1675963028,
        "desc": "Mouse to Whale Vertebrae - What bone should I do next? How big is a mouse vertebra? How big is a whale vertebrae? A lot bigger, but all vertebrae share the same shape. Specimen use made possible by the University of Michigan Museum of Zoology. #animals #science #learnontiktok ",
        "id": "7198206283571285294",
        "stats": {
          "collectCount": 92400,
          "commentCount": 5464,
          "diggCount": 1500000,
          "playCount": 14000000,
          "shareCount": 11800
        },
        "contents": [
          {
            "desc": "Mouse to Whale Vertebrae - What bone should I do next? How big is a mouse vertebra? How big is a whale vertebrae? A lot bigger, but all vertebrae share the same shape. Specimen use made possible by the University of Michigan Museum of Zoology. #animals #science #learnontiktok ",
            "textExtra": [
              { "hashtagName": "animals" },
              { "hashtagName": "science" },
              { "hashtagName": "learnontiktok" }
            ]
          }
        ],
        "video": {
          "bitrate": 441356,
          "bitrateInfo": [ .... ],
          "codecType": "h264",
          "cover": "https://p16-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/3a2c21cd21ad4410b8ad7ab606aa0f45_1675963028?x-expires=1709287200&x-signature=Iv3PLyTi3PIWT4QUewp6MPnRU9c%3D",
          "definition": "540p",
          "downloadAddr": "https://v16-webapp-prime.tiktok.com/video/tos/maliva/tos-maliva-ve-0068c799-us/ed00b2ad6b9b4248ab0a4dd8494b9cfc/?a=1988&ch=0&cr=3&dr=0&lr=tiktok_m&cd=0%7C0%7C1%7C&cv=1&br=932&bt=466&bti=ODszNWYuMDE6&cs=0&ds=3&ft=4fUEKMvt8Zmo0K4Mi94jVhstrpWrKsd.&mime_type=video_mp4&qs=0&rc=ZTs1ZTw8aTZmZzU8ZGdpNkBpanFrZWk6ZmlsaTMzZzczNEBgLmJgYTQ0NjQxYDQuXi81YSNtMjZocjRvZ2ZgLS1kMS9zcw%3D%3D&btag=e00088000&expire=1709138858&l=20240228104720CEC3E63CBB78C407D3AE&ply_type=2&policy=2&signature=b86d518a02194c8bd389986d95b546a8&tk=tt_chain_token",
          "duration": 16,
          "dynamicCover": "https://p19-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/348b414f005f4e49877e6c5ebe620832_1675963029?x-expires=1709287200&x-signature=xJyE12Y5TPj2IYQJF6zJ6%2FALwVw%3D",
          "encodeUserTag": "",
          "encodedType": "normal",
          "format": "mp4",
          "height": 1024,
          "id": "7198206283571285294",
          "originCover": "https://p16-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/3f677464b38a4457959a7b329002defe_1675963028?x-expires=1709287200&x-signature=KX5gLesyY80rGeHg6ywZnKVOUnY%3D",
          "playAddr": "https://v16-webapp-prime.tiktok.com/video/tos/maliva/tos-maliva-ve-0068c799-us/e9748ee135d04a7da145838ad43daa8e/?a=1988&ch=0&cr=3&dr=0&lr=unwatermarked&cd=0%7C0%7C0%7C&cv=1&br=862&bt=431&bti=ODszNWYuMDE6&cs=0&ds=6&ft=4fUEKMvt8Zmo0K4Mi94jVhstrpWrKsd.&mime_type=video_mp4&qs=0&rc=OzRlNzNnPDtlOTxpZjMzNkBpanFrZWk6ZmlsaTMzZzczNEAzYzI0MC1gNl8xMzUxXmE2YSNtMjZocjRvZ2ZgLS1kMS9zcw%3D%3D&btag=e00088000&expire=1709138858&l=20240228104720CEC3E63CBB78C407D3AE&ply_type=2&policy=2&signature=21ea870dc90edb60928080a6bdbfd23a&tk=tt_chain_token",
          "ratio": "540p",
          "subtitleInfos": [ .... ],
          "videoQuality": "normal",
          "volumeInfo": {
            "Loudness": -15.3,
            "Peak": 0.79433
          },
          "width": 576,
          "zoomCover": {
            "240": "https://p16-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/3a2c21cd21ad4410b8ad7ab606aa0f45_1675963028~tplv-photomode-zoomcover:240:240.avif?x-expires=1709287200&x-signature=UV1mNc2EHUy6rf9eRQvkS%2FX%2BuL8%3D",
            "480": "https://p16-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/3a2c21cd21ad4410b8ad7ab606aa0f45_1675963028~tplv-photomode-zoomcover:480:480.avif?x-expires=1709287200&x-signature=PT%2BCf4%2F4MC70e2VWHJC40TNv%2Fbc%3D",
            "720": "https://p19-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/3a2c21cd21ad4410b8ad7ab606aa0f45_1675963028~tplv-photomode-zoomcover:720:720.avif?x-expires=1709287200&x-signature=3t7Dxca4pBoNYtzoYzui8ZWdALM%3D",
            "960": "https://p16-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/3a2c21cd21ad4410b8ad7ab606aa0f45_1675963028~tplv-photomode-zoomcover:960:960.avif?x-expires=1709287200&x-signature=aKcJ0jxPTQx3YMV5lPLRlLMrkso%3D"
          }
        }
      },
      ....
    ]

The above code extracted data for over a hundred videos with a few lines of code in less than a minute. That's pretty powerful!

How To Scrape TikTok Posts?

Let's continue with our TikTok scraping project. In this section, we'll scrape video data, which represents TikTok posts. Similar to profile pages, post data can be found as hidden data under script tags.
Go to any video on TikTok, inspect the page and search for the following selector, which we have used earlier:

    //script[@id='__UNIVERSAL_DATA_FOR_REHYDRATION__']/text()

The post data in the above script tag looks like this:

[Figure: Hidden data on TikTok posts]

Let's scrape TikTok posts by extracting and parsing the above data.

Python:

    import jmespath
    import asyncio
    import json
    from typing import List, Dict
    from httpx import AsyncClient, Response
    from parsel import Selector
    from loguru import logger as log

    # initialize an async httpx client
    client = AsyncClient(
        # enable http2
        http2=True,
        # add basic browser like headers to prevent being blocked
        headers={
            "Accept-Language": "en-US,en;q=0.9",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
        },
    )

    def parse_post(response: Response) -> Dict:
        """parse hidden post data from HTML"""
        assert response.status_code == 200, "request is blocked, use the ScrapFly codetabs"
        selector = Selector(response.text)
        data = selector.xpath("//script[@id='__UNIVERSAL_DATA_FOR_REHYDRATION__']/text()").get()
        post_data = json.loads(data)["__DEFAULT_SCOPE__"]["webapp.video-detail"]["itemInfo"]["itemStruct"]
        parsed_post_data = jmespath.search(
            """{
            id: id,
            desc: desc,
            createTime: createTime,
            video: video.{duration: duration, ratio: ratio, cover: cover, playAddr: playAddr, downloadAddr: downloadAddr, bitrate: bitrate},
            author: author.{id: id, uniqueId: uniqueId, nickname: nickname, avatarLarger: avatarLarger, signature: signature, verified: verified},
            stats: stats,
            locationCreated: locationCreated,
            diversificationLabels: diversificationLabels,
            suggestedWords: suggestedWords,
            contents: contents[].{textExtra: textExtra[].{hashtagName: hashtagName}}
            }""",
            post_data
        )
        return parsed_post_data

    async def scrape_posts(urls: List[str]) -> List[Dict]:
        """scrape tiktok posts data from their URLs"""
        to_scrape = [client.get(url) for url in urls]
        data = []
        for response in asyncio.as_completed(to_scrape):
            response = await response
            post_data = parse_post(response)
            data.append(post_data)
        log.success(f"scraped {len(data)} posts from post pages")
        return data

ScrapFly:

    import jmespath
    import asyncio
    import json
    from typing import Dict, List
    from loguru import logger as log
    from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

    SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

    def parse_post(response: ScrapeApiResponse) -> Dict:
        """parse hidden post data from HTML"""
        selector = response.selector
        data = selector.xpath("//script[@id='__UNIVERSAL_DATA_FOR_REHYDRATION__']/text()").get()
        post_data = json.loads(data)["__DEFAULT_SCOPE__"]["webapp.video-detail"]["itemInfo"]["itemStruct"]
        parsed_post_data = jmespath.search(
            """{
            id: id,
            desc: desc,
            createTime: createTime,
            video: video.{duration: duration, ratio: ratio, cover: cover, playAddr: playAddr, downloadAddr: downloadAddr, bitrate: bitrate},
            author: author.{id: id, uniqueId: uniqueId, nickname: nickname, avatarLarger: avatarLarger, signature: signature, verified: verified},
            stats: stats,
            locationCreated: locationCreated,
            diversificationLabels: diversificationLabels,
            suggestedWords: suggestedWords,
            contents: contents[].{textExtra: textExtra[].{hashtagName: hashtagName}}
            }""",
            post_data
        )
        return parsed_post_data

    async def scrape_posts(urls: List[str]) -> List[Dict]:
        """scrape tiktok posts data from their URLs"""
        to_scrape = [ScrapeConfig(url, country="US", asp=True) for url in urls]
        data = []
        async for response in SCRAPFLY.concurrent_scrape(to_scrape):
            post_data = parse_post(response)
            data.append(post_data)
        log.success(f"scraped {len(data)} posts from post pages")
        return data

Run the code:

    async def run():
        post_data = await scrape_posts(
            urls=[
                "https://www.tiktok.com/@oddanimalspecimens/video/7198206283571285294"
            ]
        )
        # save the result to a JSON file
        with open("post_data.json", "w", encoding="utf-8") as file:
            json.dump(post_data, file, indent=2, ensure_ascii=False)

    if __name__ == "__main__":
        asyncio.run(run())

In the above code, we define two functions. Let's break them down:

- parse_post: For parsing the post data from the script tag and refining it with JMESPath to extract only the useful details.
- scrape_posts: For scraping multiple post pages concurrently by adding the URLs to a scraping list and requesting them concurrently.

Here is what the created post_data.json file should look like:

    [
      {
        "id": "7198206283571285294",
        "desc": "Mouse to Whale Vertebrae - What bone should I do next? How big is a mouse vertebra? How big is a whale vertebrae? A lot bigger, but all vertebrae share the same shape. Specimen use made possible by the University of Michigan Museum of Zoology. #animals #science #learnontiktok ",
        "createTime": "1675963028",
        "video": {
          "duration": 16,
          "ratio": "540p",
          "cover": "https://p16-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/3a2c21cd21ad4410b8ad7ab606aa0f45_1675963028?x-expires=1709290800&x-signature=YP7J1o2kv1dLnyjv3hqwBBk487g%3D",
          "playAddr": "https://v16-webapp-prime.tiktok.com/video/tos/maliva/tos-maliva-ve-0068c799-us/e9748ee135d04a7da145838ad43daa8e/?a=1988&ch=0&cr=3&dr=0&lr=unwatermarked&cd=0%7C0%7C0%7C&cv=1&br=862&bt=431&bti=ODszNWYuMDE6&cs=0&ds=6&ft=4fUEKMUj8Zmo0Qnqi94jVZgzZpWrKsd.&mime_type=video_mp4&qs=0&rc=OzRlNzNnPDtlOTxpZjMzNkBpanFrZWk6ZmlsaTMzZzczNEAzYzI0MC1gNl8xMzUxXmE2YSNtMjZocjRvZ2ZgLS1kMS9zcw%3D%3D&btag=e00088000&expire=1709142489&l=202402281147513D9DCF4EE8518C173598&ply_type=2&policy=2&signature=c0c4220f863ca89053ec2a71b180f226&tk=tt_chain_token",
          "downloadAddr": "https://v16-webapp-prime.tiktok.com/video/tos/maliva/tos-maliva-ve-0068c799-us/ed00b2ad6b9b4248ab0a4dd8494b9cfc/?a=1988&ch=0&cr=3&dr=0&lr=tiktok_m&cd=0%7C0%7C1%7C&cv=1&br=932&bt=466&bti=ODszNWYuMDE6&cs=0&ds=3&ft=4fUEKMUj8Zmo0Qnqi94jVZgzZpWrKsd.&mime_type=video_mp4&qs=0&rc=ZTs1ZTw8aTZmZzU8ZGdpNkBpanFrZWk6ZmlsaTMzZzczNEBgLmJgYTQ0NjQxYDQuXi81YSNtMjZocjRvZ2ZgLS1kMS9zcw%3D%3D&btag=e00088000&expire=1709142489&l=202402281147513D9DCF4EE8518C173598&ply_type=2&policy=2&signature=779a4044a0768f870abed13e1401608f&tk=tt_chain_token",
          "bitrate": 441356
        },
        "author": {
          "id": "6976999329680589829",
          "uniqueId": "oddanimalspecimens",
          "nickname": "Odd Animal Specimens",
          "avatarLarger": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/7327535918275887147~c5_1080x1080.jpeg?lk3s=a5d48078&x-expires=1709290800&x-signature=F8hu8G4VOFyd%2F0TN7QEZcGLNmW0%3D",
          "signature": "YOUTUBE: Odd Animal Specimens\nCONTACT: [email protected]",
          "verified": false
        },
        "stats": {
          "diggCount": 1500000,
          "shareCount": 11800,
          "commentCount": 5471,
          "playCount": 14000000,
          "collectCount": "92420"
        },
        "locationCreated": "US",
        "diversificationLabels": [
          "Science",
          "Education",
          "Culture & Education & Technology"
        ],
        "suggestedWords": [],
        "contents": [
          {
            "textExtra": [
              { "hashtagName": "animals" },
              { "hashtagName": "science" },
              { "hashtagName": "learnontiktok" }
            ]
          }
        ]
      }
    ]

The above TikTok scraping code has successfully extracted the video data from its page. However, the comments are missing! Let's scrape them in the following section.

How To Scrape TikTok Comments?

The comment data on a post isn't found in hidden parts of the HTML. Instead, it's loaded dynamically through hidden APIs, which get activated through scroll actions. Since a post can have thousands of comments, scraping them through scrolling isn't a practical approach. Therefore, we'll scrape them using the hidden comments API itself.

To locate the hidden comments API, follow the steps below:

1. Open the browser developer tools and select the network tab.
2. Go to any video page on TikTok.
3. Load more comments by scrolling down.

After following the above steps, you will find the API calls used for loading more comments logged:

[Figure: Hidden comments API]

The API request was sent to the endpoint https://www.tiktok.com/api/comment/list/ with many different parameters. However, only a few parameters are required:

    {
        "aweme_id": 7198206283571285294,  # the post ID
        "count": 20,  # number of comments to retrieve in each API call
        "cursor": 0  # the index to start retrieving from
    }

We'll request the comments API endpoint to get the comment data directly in JSON and use the cursor parameter to crawl over comment pages.
Python:

    import jmespath
    import asyncio
    import json
    from urllib.parse import urlencode
    from typing import List, Dict, Optional
    from httpx import AsyncClient, Response
    from loguru import logger as log

    # initialize an async httpx client
    client = AsyncClient(
        # enable http2
        http2=True,
        # add basic browser like headers to prevent being blocked
        headers={
            "Accept-Language": "en-US,en;q=0.9",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "content-type": "application/json"
        },
    )

    def parse_comments(response: Response) -> Dict:
        """parse comments data from the API response"""
        data = json.loads(response.text)
        comments_data = data["comments"]
        total_comments = data["total"]
        parsed_comments = []
        # refine the comments with JMESPath
        for comment in comments_data:
            result = jmespath.search(
                """{
                text: text,
                comment_language: comment_language,
                digg_count: digg_count,
                reply_comment_total: reply_comment_total,
                author_pin: author_pin,
                create_time: create_time,
                cid: cid,
                nickname: user.nickname,
                unique_id: user.unique_id,
                aweme_id: aweme_id
                }""",
                comment
            )
            parsed_comments.append(result)
        return {"comments": parsed_comments, "total_comments": total_comments}

    async def scrape_comments(post_id: int, comments_count: int = 20, max_comments: Optional[int] = None) -> List[Dict]:
        """scrape comments from tiktok posts using hidden APIs"""

        def form_api_url(cursor: int):
            """form the comments API URL and its pagination values"""
            base_url = "https://www.tiktok.com/api/comment/list/?"
            params = {
                "aweme_id": post_id,
                'count': comments_count,
                'cursor': cursor  # the index to start from
            }
            return base_url + urlencode(params)

        log.info("scraping the first comments batch")
        first_page = await client.get(form_api_url(0))
        data = parse_comments(first_page)
        comments_data = data["comments"]
        total_comments = data["total_comments"]

        # get the maximum number of comments to scrape
        if max_comments and max_comments < total_comments:
            total_comments = max_comments

        # scrape the remaining comments concurrently
        log.info(f"scraping comments pagination, remaining {total_comments // comments_count - 1} more pages")
        _other_pages = [
            client.get(form_api_url(cursor=cursor))
            # stop before total_comments so no page beyond the limit is requested
            for cursor in range(comments_count, total_comments, comments_count)
        ]
        for response in asyncio.as_completed(_other_pages):
            response = await response
            data = parse_comments(response)["comments"]
            comments_data.extend(data)

        log.success(f"scraped {len(comments_data)} from the comments API from the post with the ID {post_id}")
        return comments_data

ScrapFly:

    import jmespath
    import asyncio
    import json
    from urllib.parse import urlencode
    from typing import Dict, List, Optional
    from loguru import logger as log
    from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

    SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

    def parse_comments(response: ScrapeApiResponse) -> Dict:
        """parse comments data from the API response"""
        data = json.loads(response.scrape_result["content"])
        comments_data = data["comments"]
        total_comments = data["total"]
        parsed_comments = []
        # refine the comments with JMESPath
        for comment in comments_data:
            result = jmespath.search(
                """{
                text: text,
                comment_language: comment_language,
                digg_count: digg_count,
                reply_comment_total: reply_comment_total,
                author_pin: author_pin,
                create_time: create_time,
                cid: cid,
                nickname: user.nickname,
                unique_id: user.unique_id,
                aweme_id: aweme_id
                }""",
                comment
            )
            parsed_comments.append(result)
        return {"comments": parsed_comments, "total_comments": total_comments}

    async def scrape_comments(post_id: int, comments_count: int = 20, max_comments: Optional[int] = None) -> List[Dict]:
        """scrape comments from tiktok posts using hidden APIs"""

        def form_api_url(cursor: int):
            """form the comments API URL and its pagination values"""
            base_url = "https://www.tiktok.com/api/comment/list/?"
            params = {
                "aweme_id": post_id,
                'count': comments_count,
                'cursor': cursor  # the index to start from
            }
            return base_url + urlencode(params)

        log.info("scraping the first comments batch")
        first_page = await SCRAPFLY.async_scrape(ScrapeConfig(
            form_api_url(cursor=0), country="US", asp=True, headers={
                "content-type": "application/json"
            }
        ))
        data = parse_comments(first_page)
        comments_data = data["comments"]
        total_comments = data["total_comments"]

        # get the maximum number of comments to scrape
        if max_comments and max_comments < total_comments:
            total_comments = max_comments

        # scrape the remaining comments concurrently
        log.info(f"scraping comments pagination, remaining {total_comments // comments_count - 1} more pages")
        _other_pages = [
            ScrapeConfig(form_api_url(cursor=cursor), country="US", asp=True, headers={"content-type": "application/json"})
            # stop before total_comments so no page beyond the limit is requested
            for cursor in range(comments_count, total_comments, comments_count)
        ]
        async for response in SCRAPFLY.concurrent_scrape(_other_pages):
            data = parse_comments(response)["comments"]
            comments_data.extend(data)

        log.success(f"scraped {len(comments_data)} from the comments API from the post with the ID {post_id}")
        return comments_data

Run the code:

    async def run():
        comment_data = await scrape_comments(
            # the post/video id, such as: https://www.tiktok.com/@oddanimalspecimens/video/7198206283571285294
            post_id=7198206283571285294,
            # total comments to scrape, omitting it will scrape all the available comments
            max_comments=100,
            # default is 20, it can be overridden to scrape more comments in each call
            # but it can't be > the total comments on the post
            comments_count=20
        )
        # save the result to a JSON file
        with open("comment_data.json", "w", encoding="utf-8") as file:
            json.dump(comment_data, file, indent=2, ensure_ascii=False)

    if __name__ == "__main__":
        asyncio.run(run())

The above code scrapes TikTok comments data using two main functions:

- scrape_comments: For creating the comments API URL with the desired offset and requesting it to get the comment data in JSON.
- parse_comments: For parsing the comments API responses and extracting the useful data using JMESPath.

Here is a sample output of the comment data we got:

    [
      {
        "text": "Dude give ‘em back",
        "comment_language": "en",
        "digg_count": 72009,
        "reply_comment_total": 131,
        "author_pin": false,
        "create_time": 1675963633,
        "cid": "7198208855277060910",
        "nickname": "GrandMoffJames",
        "unique_id": "grandmoffjames",
        "aweme_id": "7198206283571285294"
      },
      {
        "text": "Dudes got everyone’s back",
        "comment_language": "en",
        "digg_count": 36982,
        "reply_comment_total": 100,
        "author_pin": false,
        "create_time": 1675966520,
        "cid": "7198221275168719662",
        "nickname": "Scott",
        "unique_id": "troutfishmanjr",
        "aweme_id": "7198206283571285294"
      },
      {
        "text": "do human backbone",
        "comment_language": "en",
        "digg_count": 18286,
        "reply_comment_total": 99,
        "author_pin": false,
        "create_time": 1676553505,
        "cid": "7200742421726216987",
        "nickname": "www",
        "unique_id": "ksjekwjkdbw",
        "aweme_id": "7198206283571285294"
      },
      {
        "text": "casually has a backbone in his inventory",
        "comment_language": "en",
        "digg_count": 20627,
        "reply_comment_total": 9,
        "author_pin": false,
        "create_time": 1676106562,
        "cid": "7198822535374734126",
        "nickname": "*",
        "unique_id": "angelonextdoor",
        "aweme_id": "7198206283571285294"
      },
      {
        "text": "😧",
        "comment_language": "",
        "digg_count": 7274,
        "reply_comment_total": 20,
        "author_pin": false,
        "create_time": 1675963217,
        "cid": "7198207091995132698",
        "nickname": "Son Bi’",
        "unique_id": "son_bisss",
        "aweme_id": "7198206283571285294"
      },
      ....
    ]

The above TikTok scraper code can scrape tens of comments in mere seconds. That's because using the hidden TikTok APIs for web scraping is much faster than parsing data from HTML.

How To Scrape TikTok Search?

In this section, we'll proceed with the last piece of our TikTok web scraping code: search pages. The search data is loaded through a hidden API, which we'll use for web scraping. Alternatively, data on search pages can be scraped from background XHR calls, similar to how we scraped the channel data.

To capture the search API, follow the steps below:

1. Open the network tab of the browser developer tools.
2. Use the search box to search for any keyword.
3. Scroll down to load more search results.

After following the above steps, you will find the search API requests logged:

[Figure: Hidden search API]

The above API request was sent to the following endpoint with these parameters:

    search_api_url = "https://www.tiktok.com/api/search/general/full/?"
    parameters = {
        "keyword": "whales",  # the keyword of the search query
        "offset": cursor,  # the index to start from
        "search_id": "2024022710453229C796B3BF936930E248"  # timestamp with random ID
    }

The above parameters are essential for the search query. However, this endpoint requires certain cookie values to authorize the requests, which can be challenging to maintain.
Therefore, we'll utilize the ScrapFly sessions feature to obtain a cookie and reuse it with the search API requests:

    import datetime
    import secrets
    import asyncio
    import json
    import jmespath
    from typing import Dict, List
    from urllib.parse import urlencode, quote
    from loguru import logger as log
    from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

    SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

    def parse_search(response: ScrapeApiResponse) -> List[Dict]:
        """parse search data from the API response"""
        data = json.loads(response.scrape_result["content"])
        search_data = data["data"]
        parsed_search = []
        for item in search_data:
            if item["type"] == 1:  # get the item if it was item only
                result = jmespath.search(
                    """{
                    id: id,
                    desc: desc,
                    createTime: createTime,
                    video: video,
                    author: author,
                    stats: stats,
                    authorStats: authorStats
                    }""",
                    item["item"]
                )
                result["type"] = item["type"]
                parsed_search.append(result)

        # whether there are more search results: 0 or 1. There is no max searches available
        has_more = data["has_more"]
        return parsed_search

    async def obtain_session(url: str) -> str:
        """create a session to save the cookies and authorize the search API"""
        session_id = "tiktok_search_session"
        await SCRAPFLY.async_scrape(ScrapeConfig(
            url, asp=True, country="US", render_js=True, session=session_id
        ))
        return session_id

    async def scrape_search(keyword: str, max_search: int, search_count: int = 12) -> List[Dict]:
        """scrape tiktok search data from the search API"""

        def generate_search_id():
            # get the current datetime and format it as YYYYMMDDHHMMSS
            timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
            # calculate the number of random bytes needed to reach the total length (32);
            # each byte becomes two hex characters
            random_hex_length = (32 - len(timestamp)) // 2
            random_hex = secrets.token_hex(random_hex_length).upper()
            random_id = timestamp + random_hex
            return random_id

        def form_api_url(cursor: int):
            """form the search API URL and its pagination values"""
            base_url = "https://www.tiktok.com/api/search/general/full/?"
            params = {
                # urlencode() already percent-encodes the value, so the raw keyword is passed
                # (wrapping it in quote() first would encode it twice)
                "keyword": keyword,
                "offset": cursor,  # the index to start from
                "search_id": generate_search_id()
            }
            return base_url + urlencode(params)

        log.info("obtaining a session for the search API")
        session_id = await obtain_session(url="https://www.tiktok.com/search?q=" + quote(keyword))

        log.info("scraping the first search batch")
        first_page = await SCRAPFLY.async_scrape(ScrapeConfig(
            form_api_url(cursor=0), asp=True, country="US",
            headers={"content-type": "application/json"},
            session=session_id
        ))
        search_data = parse_search(first_page)

        # scrape the remaining search results concurrently
        log.info(f"scraping search pagination, remaining {max_search // search_count} more pages")
        _other_pages = [
            ScrapeConfig(
                form_api_url(cursor=cursor), asp=True, country="US",
                headers={"content-type": "application/json"},
                session=session_id
            )
            for cursor in range(search_count, max_search + search_count, search_count)
        ]
        async for response in SCRAPFLY.concurrent_scrape(_other_pages):
            data = parse_search(response)
            search_data.extend(data)

        log.success(f"scraped {len(search_data)} from the search API from the keyword {keyword}")
        return search_data

Run the code:

    async def run():
        search_data = await scrape_search(
            keyword="whales",
            max_search=18
        )
        # save the result to a JSON file
        with open("search_data.json", "w", encoding="utf-8") as file:
            json.dump(search_data, file, indent=2, ensure_ascii=False)

    if __name__ == "__main__":
        asyncio.run(run())

Let's break down the execution flow of the above TikTok scraping code:

- A request is sent to the regular search page to obtain the cookie values through the obtain_session function.
- A random search ID is created with the generate_search_id function and used with the requests sent to the search API.
- The first search API URL is created with the form_api_url function.
- A request is sent to the search API with the session key containing the cookies.
- The JSON response of the search API is parsed using the parse_search function. It also filters the response data to only include the video data.

🙋 The above code requests the /search/general/full/ endpoint, which retrieves search results for both profile and video data. To select a specific data type, filter the search results by type in the browser and inspect the network tab to find the corresponding API endpoint.

Here is a sample output of the results we got:

    [
      {
        "id": "7192262480066825515",
        "desc": "Replying to @julsss1324 their songs are described as hauntingly beautiful. Do you find them scary or beautiful? For me it’s peaceful. They remind me of elephants. 🐋🎶💙 @kaimanaoceansafari #whalesounds #whalesong #hawaii #ocean #deepwater #deepsea #thalassophobia #whales #humpbackwhales ",
        "createTime": 1674579130,
        "video": {
          "id": "7192262480066825515",
          "height": 1024,
          "width": 576,
          "duration": 25,
          "ratio": "540p",
          "cover": "https://p16-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/e438558728954c74a761132383865d97_1674579131~tplv-dmt-logom:tos-useast5-i-0068-tx/0bb4cf51c9f445c9a46dc8d5aab20545.image?x-expires=1709215200&x-signature=Xl1W9ELtZ5%2FP4oTEpjqOYsGQcx8%3D",
          "originCover": "https://p19-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/2061429a4535477686769d5f2faeb4f0_1674579131?x-expires=1709215200&x-signature=OJW%2BJnqnYt4L2G2pCryrfh52URI%3D",
          "dynamicCover": "https://p19-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/88b455ffcbc6421999f47ebeb31b962b_1674579131?x-expires=1709215200&x-signature=hDBbwIe0Z8HRVFxLe%2F2JZoeHopU%3D",
          "playAddr": "https://v16-webapp-prime.us.tiktok.com/video/tos/useast5/tos-useast5-pve-0068-tx/809fca40201048c78299afef3b627627/?a=1988&ch=0&cr=3&dr=0&lr=unwatermarked&cd=0%7C0%7C0%7C&cv=1&br=3412&bt=1706&bti=NDU3ZjAwOg%3D%3D&cs=0&ds=6&ft=4KJMyMzm8Zmo0apOi94jV94rdpWrKsd.&mime_type=video_mp4&qs=0&rc=NDU3PDc0PDw7ZGg7ODg0O0BpM2xycGk6ZnYzaTMzZzczNEBgNl4tLjFiNjMxNTVgYjReYSNucGwzcjQwajVgLS1kMS9zcw%3D%3D&btag=e00088000&expire=1709216449&l=202402271420230081AD419FAC9913AB63&ply_type=2&policy=2&signature=1d44696fa49eb5fa609f6b6871445f77&tk=tt_chain_token",
          "downloadAddr": "https://v16-webapp-prime.us.tiktok.com/video/tos/useast5/tos-useast5-pve-0068-tx/c7196f98798e4520834a64666d253cb6/?a=1988&ch=0&cr=3&dr=0&lr=tiktok_m&cd=0%7C0%7C1%7C&cv=1&br=3514&bt=1757&bti=NDU3ZjAwOg%3D%3D&cs=0&ds=3&ft=4KJMyMzm8Zmo0apOi94jV94rdpWrKsd.&mime_type=video_mp4&qs=0&rc=ZTw5Njg0NDo3Njo7PGllOkBpM2xycGk6ZnYzaTMzZzczNEBhYjFiLjA1NmAxMS8uMDIuYSNucGwzcjQwajVgLS1kMS9zcw%3D%3D&btag=e00088000&expire=1709216449&l=202402271420230081AD419FAC9913AB63&ply_type=2&policy=2&signature=1443d976720e418204704f43af4ff0f5&tk=tt_chain_token",
          "shareCover": [
            "",
            "https://p16-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/2061429a4535477686769d5f2faeb4f0_1674579131~tplv-photomode-tiktok-play.jpeg?x-expires=1709647200&x-signature=%2B4dufwEEFxPJU0NX4K4Mm%2FPET6E%3D",
            "https://p16-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/2061429a4535477686769d5f2faeb4f0_1674579131~tplv-photomode-share-play.jpeg?x-expires=1709647200&x-signature=XCorhFJUTCahS8crANfC%2BDSrTbU%3D"
          ],
          "reflowCover": "https://p19-sign.tiktokcdn-us.com/tos-useast5-p-0068-tx/e438558728954c74a761132383865d97_1674579131~tplv-photomode-video-cover:480:480.jpeg?x-expires=1709215200&x-signature=%2BFN9Vq7TxNLLCtJCsMxZIrgjMis%3D",
          "bitrate": 1747435,
          "encodedType": "normal",
          "format": "mp4",
          "videoQuality": "normal",
          "encodeUserTag": ""
        },
        "author": {
          "id": "6763395919847523333",
          "uniqueId": "mermaid.kayleigh",
          "nickname": "mermaid.kayleigh",
          "avatarThumb": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/7310953622576037894~c5_100x100.jpeg?lk3s=a5d48078&x-expires=1709215200&x-signature=0tw66iTdRDhPA4pTHM8e4gjIsNo%3D",
          "avatarMedium": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/7310953622576037894~c5_720x720.jpeg?lk3s=a5d48078&x-expires=1709215200&x-signature=IkaoB24EJoHdsHCinXmaazAWDYo%3D",
          "avatarLarger": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/7310953622576037894~c5_1080x1080.jpeg?lk3s=a5d48078&x-expires=1709215200&x-signature=38KCawETqF%2FdyMX%2FAZg32edHnc4%3D",
          "signature": "Love the ocean with me 💙\nOwner @KaimanaOceanSafari 🤿\nCome dive with me👇🏼",
          "verified": true,
          "secUid": "MS4wLjABAAAAhIICwHiwEKwUg07akDeU_cnM0uE1LAGO-kEQdw3AZ_Rd-zcb-qOR0-1SeZ5D2Che",
          "secret": false,
          "ftc": false,
          "relation": 0,
          "openFavorite": false,
          "commentSetting": 0,
          "duetSetting": 0,
          "stitchSetting": 0,
          "privateAccount": false,
          "downloadSetting": 0
        },
        "stats": {
          "diggCount": 10000000,
          "shareCount": 390800,
          "commentCount": 72100,
          "playCount": 89100000,
          "collectCount": 663400
        },
        "authorStats": {
          "followingCount": 313,
          "followerCount": 2000000,
          "heartCount": 105400000,
          "videoCount": 1283,
          "diggCount": 40800,
          "heart": 105400000
        },
        "type": 1
      },
      ....
    ]

With this last feature, our TikTok scraper is complete. It can scrape profiles, channels, posts, comments and search data!

Bypass TikTok Scraping Blocking With ScrapFly

We can successfully scrape TikTok data from various pages. However, scaling our scraping rate will lead TikTok to block the IP address used. Moreover, it can challenge the requests with CAPTCHAs if the traffic looks suspicious:

[Image: TikTok scraping blocking]

ScrapFly is a web scraping API that allows for scraping TikTok without getting blocked using an anti-scraping protection feature.
It also allows for scraping at scale by providing:

- Residential proxies in over 50 countries - For scraping from almost any geographical location while also preventing IP address throttling and blocking.
- JavaScript rendering - For scraping dynamic web pages through cloud headless browsers without running them yourself.
- Easy to use Python and TypeScript SDKs, as well as Scrapy integration.
- And much more!

ScrapFly service does the heavy lifting for you!

Here is how we can avoid TikTok web scraping blocking using ScrapFly. All we have to do is replace the HTTP client with the ScrapFly client and enable the asp parameter:

    # standard web scraping code
    import httpx
    from parsel import Selector

    response = httpx.get("some tiktok.com URL")
    selector = Selector(response.text)

    # in ScrapFly becomes this 👇
    from scrapfly import ScrapeConfig, ScrapflyClient

    # replaces your HTTP client (httpx in this case)
    scrapfly = ScrapflyClient(key="Your ScrapFly API key")

    response = scrapfly.scrape(ScrapeConfig(
        url="website URL",
        asp=True,  # enable the anti scraping protection to bypass blocking
        country="US",  # set the proxy location to a specific country
        render_js=True  # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
    ))

    # use the built in Parsel selector
    selector = response.selector
    # access the HTML content
    html = response.scrape_result['content']

Sign up to get your API key!

FAQ

To wrap up this guide, let's have a look at some frequently asked questions about web scraping TikTok.

Is there a public API for TikTok?

Yes, TikTok offers public APIs for developers, researchers and communities. These APIs allow access to the public TikTok data found in profiles, videos, and comments, as well as data insights on commercial ads.

Can I scrape TikTok for sentiment analysis?

Yes, TikTok includes opinionated text data found in comments.
These comments can be classified as positive, negative or neutral. TikTok sentiment analysis allows for capturing relations and opinions on a given subject. We have covered using web scraping for sentiment analysis in a previous article.

Web Scraping TikTok Summary

In this guide, we have explained how to scrape TikTok step by step. We have created a TikTok scraper that scrapes different data types:

- Profile and channel data.
- Video and comment data in post pages.
- Video search results from search pages.

We have used some web scraping tricks to scrape TikTok without writing selectors by extracting the data from hidden JavaScript tags and hidden APIs. We have also used ScrapFly to avoid TikTok scraper blocking.
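
The comment and search scrapers in this guide both page through a hidden API by sending an increasing start index (the cursor or offset parameter) in fixed-size steps. The sketch below isolates that offset arithmetic using only the standard library. The helper name page_offsets is hypothetical and is not part of the guide's code or of the ScrapFly SDK. It assumes the first page (offset 0) has already been requested and that the API returns exactly page_size items per page. Note that the guide's own range(count, total + count, count) also yields one more offset at or beyond the total; this sketch stops before it, and whether that extra request is ever needed has not been verified here.

```python
from typing import List


def page_offsets(total: int, page_size: int) -> List[int]:
    """Return the start offsets of every page after the first one.

    The first page starts at offset 0 and is assumed to be fetched separately,
    so the returned offsets begin at page_size and stay below total.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(page_size, total, page_size))


if __name__ == "__main__":
    # 100 comments, 20 per call: four follow-up requests after the first page
    print(page_offsets(total=100, page_size=20))
    # 18 search results, 12 per call: one follow-up request
    print(page_offsets(total=18, page_size=12))
```

Each returned offset would be passed as the cursor (comments) or offset (search) value when building the API URL for that page.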

scrapper

1 103 8.4 Python

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.
sneakpeek

3 35 7.5 Python

Sneakpeek is a framework that helps to quickly and conveniently develop scrapers. It’s the best choice for scrapers that have some specific complex scraping logic that needs to be run on a constant basis (by flulemon)

Project mention: Sneakpeek is a framework that helps to quickly and conveniently develop scrapers. It’s the best choice for scrapers that have some specific complex scraping logic that needs to be run on a constant basis | /r/CKsTechNews | 2023-04-30

XingDumper

1 34 5.1 Python

Python 3 script to dump/scrape/extract company employees from XING API
estela-cli

1 4 5.8 Python

estela Command Line Client 🕸
SaaSHub

www.saashub.com sponsored

SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Crawling related posts

How To Scrape TikTok in 2024
1 project | dev.to | 8 Mar 2024
Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
1 project | news.ycombinator.com | 16 Feb 2024
An Introduction to the WARC File
7 projects | news.ycombinator.com | 29 Jan 2024
Real challenge for web scrapers- Need Help!
1 project | /r/webscraping | 13 Jul 2023
Implementing case sensitive headers in Scrapy (not through `_caseMappings`)
4 projects | /r/scrapy | 3 Jul 2023
Dicas para projetos usando web scraping
1 project | /r/brdev | 27 Jun 2023
Best tools to use for web scraping ??
1 project | /r/learnpython | 25 Jun 2023

Index

What are some of the best open-source Crawling projects in Python? This list will help you:

#	Project	Stars
1	Scrapy	50,896
2	newspaper	13,720
3	Grab	2,354
4	mlscraper	1,225
5	botasaurus	899
6	scrapyrt	816
7	isp-data-pollution	566
8	spidermon	509
9	WarcDB	383
10	LinkedInDumper	362
11	spidy Web Crawler	323
12	telegram-crawler	236
13	estela	153
14	scrapfly-scrapers	126
15	scrapper	103
16	sneakpeek	35
17	XingDumper	34
18	estela-cli	4