Pandas
Scrapy
| | Pandas | Scrapy |
|---|---|---|
| Mentions | 341 | 158 |
| Stars | 37,387 | 46,621 |
| Growth | 1.9% | 2.0% |
| Activity | 10.0 | 9.7 |
| Latest commit | 5 days ago | 2 days ago |
| Language | Python | Python |
| License | BSD 3-clause "New" or "Revised" License | BSD 3-clause "New" or "Revised" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Pandas
-
I keep hitting a wall when trying to install modules in PyCharm...
pip install https://github.com/pandas-dev/pandas/archive/master.zip
-
Reducing size of dependencies
Going through the venv site-packages I found two things:
* Once your .py files are compiled into .pyc or .pyo you can delete the originals, at the cost of debuggability. It won't be a huge change, but it's something.
* Pandas specifically carries a huge tests folder (it's discussed here); you can delete it. This is probably true for other libraries as well, to some extent (they carry non-production code into the package).
-
We are the developers behind pandas, currently preparing for the 2.0 release :) AMA
I'm Patrick Hoefler aka phofl and I'm one of the core team members developing and maintaining pandas (repo, docs), a popular data analysis library.
introducing optional behaviour comes with a huge maintenance cost (I started making such a proposal here, but then withdrew it)
I think this is an interesting question! I've opened https://github.com/pandas-dev/pandas/issues/51751
Personally polars' strictness is making me think about situations when in pandas we end up with object dtype, which we should probably avoid. Here's an example: https://github.com/pandas-dev/pandas/issues/50887 (polars would just error in such a case, which I think is the correct thing to do)
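A minimal illustration of the object-dtype fallback being discussed (a generic mixed-type example, not the exact case from the linked issue):

```python
import pandas as pd

# Mixing ints and a string: pandas won't error, it silently falls
# back to the catch-all object dtype...
s = pd.Series([1, 2, "three"])
print(s.dtype)  # object

# ...which disables vectorized numeric operations and can hide bugs
# until much later. A stricter library like polars raises at
# construction time instead.
```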
-
New to python
If you want to do data analysis or engineering, spend more time on Pandas or PySpark.
-
Join us for an AMA with the developers of pandas, the powerful data analysis toolkit, this Thursday, March 2nd at 5:30 pm UTC to celebrate the upcoming 2.0 release
This Thursday we'll be hosting an AMA with some of the developers of pandas. The AMA will 'officially' start at 5:30pm UTC.
We released the release candidate for 2.0 last week, so the final release is expected shortly, possibly next week. Please help us by testing the RC to make sure everything works :)
-
Extracting git repository data with PyDriller
I chose to do this by using the popular Pandas library for tabular data analysis tasks.
Scrapy
-
fastest web scraping options
You can use automation tools like Selenium or Playwright. You can work with a full-fledged framework such as Scrapy. I also recently discovered selectolax (with its Lexbor backend), which lets you extract data very quickly.
-
How to run a web scraping script every 15 minutes
You may want to check out [estela](https://estela.bitmaker.la/docs/), a spider management solution developed by [Bitmaker](https://bitmaker.la) that allows you to run [Scrapy](https://scrapy.org) spiders.
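If you don't need a full management platform, a cron entry (`*/15 * * * * cd /path/to/project && scrapy crawl myspider`) or a simple loop is enough. A stdlib sketch; the spider name is a placeholder:

```python
import subprocess
import time

def run_periodically(cmd, interval_s, iterations=None):
    """Run `cmd` every `interval_s` seconds, accounting for runtime.

    Sleeping for (interval - elapsed) keeps the schedule fixed even
    when the crawl itself takes a few minutes.
    """
    n = 0
    while iterations is None or n < iterations:
        start = time.monotonic()
        subprocess.run(cmd)  # e.g. ["scrapy", "crawl", "myspider"]
        n += 1
        if iterations is not None and n >= iterations:
            break
        elapsed = time.monotonic() - start
        time.sleep(max(0, interval_s - elapsed))

# run_periodically(["scrapy", "crawl", "myspider"], 15 * 60)
```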
-
How I used Scrapy for my ML Project
I wanted to invest my time and energy in learning the fastest, most efficient one that can scale as my projects get more and more complex: Scrapy. After all, I want my projects to shine so bright in my CV it blinds the recruiter's eyes.
-
How to extract / download all URLs from a site?
Try Scrapy (Python) https://scrapy.org/
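For a single page, the stdlib alone can pull out the links; crawling a whole site (following links, deduplicating, throttling) is where Scrapy earns its keep. A minimal sketch using `html.parser` (the URLs are made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute hrefs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

collector = LinkCollector("https://example.com/docs/")
collector.feed('<a href="/about">About</a> <a href="page2.html">Next</a>')
print(collector.links)
# ['https://example.com/about', 'https://example.com/docs/page2.html']
```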
-
Nairobi Stock Exchange Web Scraper (MongoDB Atlas Hackathon 2022 on DEV)
Scrapy
-
Advanced Web Scraping using Python-Scrapy and Splash
-
Ask HN: Best way to keep the raw HTML of scraped pages?
If you weren't already aware, Scrapy has strong support for this via their HTTPCache middleware; you can choose whether to have it actually behave like a cache, returning already-scraped content on a match, or merely act as a pass-through cache: https://docs.scrapy.org/en/2.7/topics/downloader-middleware....
Their out-of-the-box storage does what the sibling comment describes: SHA1-ing the request and then sharding the output filename by the first two characters: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extension...
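The storage scheme described above (hash the request, shard by the first two hex characters) looks roughly like this. This is a simplified sketch of the idea, not Scrapy's actual request-fingerprinting code:

```python
import hashlib
from pathlib import PurePosixPath

def cache_path(method: str, url: str, body: bytes = b"") -> PurePosixPath:
    """Derive a sharded on-disk path for a request.

    Hashing gives uniform, filesystem-safe names; sharding by the first
    two hex characters caps the top level at 256 subdirectories so no
    single directory grows unboundedly.
    """
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    fingerprint = h.hexdigest()
    return PurePosixPath(fingerprint[:2]) / fingerprint

print(cache_path("GET", "https://example.com/"))
```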
-
Tool to Scrape Manuals and Sensitive PDFs to Generate Stronger Wordlists for Lateral Movement and Initial Access
Surprised at the name of this project given there is an incredibly popular project called scrapy related to web scraping. This project would really benefit from a rebrand.
-
‘Automate the boring stuff’ — but what do you all actually automate with python
-
What are some cool things you've automated with python?
I was looking for a used car. I wrote a scraper using Scrapy that gathered all new offers, filtered by my criteria, every hour. Then it sent me a nicely formatted email.
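The "nicely formatted email" part needs nothing beyond the stdlib. A sketch; the offer fields and addresses are placeholders, and actual sending via smtplib is omitted:

```python
from email.message import EmailMessage

def build_offer_email(offers, sender, recipient):
    """Format scraped car offers as a plain-text digest email."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(offers)} new car offer(s)"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"- {o['title']}: {o['price']} ({o['url']})" for o in offers]
    msg.set_content("New offers matching your criteria:\n\n" + "\n".join(lines))
    return msg

# Sending would be smtplib.SMTP(host).send_message(msg); omitted here.
```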
What are some alternatives?
requests-html - Pythonic HTML Parsing for Humans™
pyspider - A Powerful Spider(Web Crawler) System in Python.
Cubes - Light-weight Python OLAP framework for multi-dimensional data analysis
orange - 🍊 :bar_chart: :bulb: Orange: Interactive data analysis
colly - Elegant Scraper and Crawler Framework for Golang
tensorflow - An Open Source Machine Learning Framework for Everyone
MechanicalSoup - A Python library for automating interaction with websites.
playwright-python - Python version of the Playwright testing and automation library.
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
undetected-chromedriver - Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
pyexcel - Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files
Keras - Deep Learning for humans