Apache Spark - A unified analytics engine for large-scale data processing
Apache Spark is one of the most actively developed open-source projects in big data. The following code examples require a working Spark installation that can execute Python code through the PySpark library, with the example data stored in Amazon S3 (Simple Storage Service). In our case, all of this runs on AWS EMR (Elastic MapReduce).
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
We’ve learned a lot while setting up Spark on AWS EMR. While this post will focus on how to use PySpark with Pandas, let us know in the comments if you’re interested in a future article on how we set up Spark on AWS EMR.
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Pandas user-defined functions (UDFs) are built on top of Apache Arrow. They improve performance by letting developers scale their workloads and use the Pandas API inside Apache Spark: the function body operates on Pandas data structures, while Apache Arrow handles the data exchange between Spark and the Python workers.
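As a minimal sketch of how the pieces fit together (the conversion function and column names here are our own illustration, not something from the articles above): the body of a Pandas UDF is an ordinary function over pandas Series, and Spark uses Arrow to move whole columns between the JVM and Python instead of serializing row by row.

```python
import pandas as pd

# The core of a Pandas UDF is a plain function over pandas Series;
# when invoked from Spark, Apache Arrow transfers the columns between
# the JVM and the Python worker as batches, not row by row.
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

# Applying it to a Spark DataFrame (assumes a running SparkSession and
# a DataFrame `df` with a `temp_c` column):
#
#   from pyspark.sql.functions import pandas_udf
#   from pyspark.sql.types import DoubleType
#
#   to_fahrenheit = pandas_udf(celsius_to_fahrenheit, returnType=DoubleType())
#   df = df.withColumn("temp_f", to_fahrenheit("temp_c"))

# Because the body is plain Pandas code, it also runs locally:
print(celsius_to_fahrenheit(pd.Series([0.0, 100.0])).tolist())  # [32.0, 212.0]
```

This is also what makes Pandas UDFs convenient to test: the function can be exercised on local Series before it is ever registered with Spark.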
Arrow v1.0: After 8 years, a new milestone with a lot of new features
3 projects | news.ycombinator.com | 26 Feb 2021
Trading Algos - 5 Key Metrics and How to Implement Them in Python
4 projects | dev.to | 8 Jan 2022
Test Parquet float16 Support in Pandas
3 projects | dev.to | 14 Dec 2021
How long does it take to become a core member of an open-source community?
2 projects | reddit.com/r/AskComputerScience | 12 Dec 2021
Learning Python on the Job
2 projects | dev.to | 11 Nov 2021