MNN
oneflow
MNN
-
Newbie getting the error "cannot build selected target abi x86 no suitable splits configured"
I found a solution on GitHub: check your app's build.gradle, and in the defaultConfig section add x86 to your ndk abiFilters, e.g. ndk { abiFilters 'armeabi-v7a', 'arm64-v8a', 'x86' }. Hope it will help. You have to find that file and edit it as shown above.
oneflow
-
The Execution Process of a Tensor in Deep Learning Framework[R]
This article focuses on what happens behind the execution of a Tensor in the deep learning framework OneFlow. It takes the operator oneflow.relu as an example to introduce the Interpreter and VM mechanisms that executing this operator relies on.
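For reference, a minimal Python sketch of the kind of call the article traces (assuming a recent oneflow release; the tensor values are just an illustration):

    import oneflow as flow

    # Build a small tensor and run the relu operator; the article traces
    # what the Interpreter and VM do behind this single call.
    x = flow.tensor([-1.0, 0.0, 2.0])
    y = flow.relu(x)
    print(y)  # the negative entry is clamped to 0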
-
Explore MLIR Development Process
This article describes how OneFlow works with MLIR, how to add a graph-level Pass to OneFlow IR, how OneFlow Operations automatically become MLIR Operations, and why OneFlow IR can use MLIR to accelerate computations.
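As a rough illustration of the graph mode that produces the IR these Passes rewrite, a minimal sketch using oneflow's nn.Graph API (the Pass-writing itself lives in C++/MLIR and is not shown here):

    import oneflow as flow
    import oneflow.nn as nn

    class LinearGraph(nn.Graph):
        # Wrapping an eager module in nn.Graph makes OneFlow trace it into a
        # job/IR that graph-level Passes (including MLIR ones) can rewrite.
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(4, 2)

        def build(self, x):
            return self.linear(x)

    g = LinearGraph()
    out = g(flow.randn(1, 4))  # the first call compiles the graph, then executes it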
-
The History of Credit-based Flow Control (Part 1)
The backpressure mechanism, also known as credit-based flow control, is a classic solution to flow control problems in network communication. Its predecessor is the TCP sliding window. The idea is particularly simple and effective: as this article shows, the same principle is applicable to virtually any flow control scheme and is found in the design of many hardware and software systems. In this article, a OneFlow engineer tells the chequered history of this simple idea.
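To make the idea concrete, here is a toy Python sketch of credit-based flow control with hypothetical names (not OneFlow code): the receiver grants credits, and the sender transmits only while it still holds credits.

    from collections import deque

    class CreditSender:
        """Toy credit-based flow control: send only while credits remain."""

        def __init__(self, initial_credits):
            self.credits = initial_credits
            self.pending = deque()

        def submit(self, msg):
            self.pending.append(msg)
            self._drain()

        def on_credit(self, n=1):
            # The receiver returns credits once it has freed buffer space.
            self.credits += n
            self._drain()

        def _drain(self):
            while self.pending and self.credits > 0:
                msg = self.pending.popleft()
                self.credits -= 1
                print("send", msg)  # stand-in for the actual transmit

    sender = CreditSender(initial_credits=2)
    for i in range(4):
        sender.submit(i)   # only the first two go out immediately
    sender.on_credit(2)    # backpressure released, the rest are sent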
-
Optimization of CUDA Elementwise Template Library: Practical, Efficient, and Extensible
Elementwise operation refers to applying a function transformation to every element of a tensor. In deep learning, many operators can be regarded as elementwise operators, such as common activation functions (like ReLU and GELU) and ScalarMultiply (multiplying each element of a tensor by a scalar). For this kind of elementwise operation, OneFlow (https://github.com/Oneflow-Inc/oneflow/) abstracts a CUDA template. This article introduces the design thoughts and optimization techniques behind that CUDA template.
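For readers unfamiliar with the term, a tiny NumPy sketch of what "elementwise" means; OneFlow's CUDA template generates GPU kernels that compute the same kind of thing efficiently:

    import numpy as np

    x = np.array([-1.5, 0.0, 2.0, 3.0])

    relu = np.maximum(x, 0.0)    # ReLU applied to every element
    scalar_multiply = x * 2.5    # ScalarMultiply: each element times a scalar
    # GELU (tanh approximation), also purely elementwise
    gelu = 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))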
-
Pytorch Distributed Parallel Computing or Hpc Research
You can download OneFlow on GitHub, and read the technical documents or the blog on Medium to learn more about OneFlow. If you have any problem with OneFlow, please open an issue on GitHub. (Sorry for the late reply)
-
How to Implement an Efficient LayerNorm CUDA Kernel[R]
Code: https://github.com/Oneflow-Inc/oneflow/
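Since this entry only links to the kernel code, here is the reference computation a LayerNorm kernel has to reproduce, sketched in NumPy (normalize over the last dimension, then scale and shift):

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        # Normalize each row over the last dimension, then apply the affine transform.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps) * gamma + beta

    x = np.random.randn(2, 8).astype(np.float32)
    y = layer_norm(x, gamma=np.ones(8, dtype=np.float32), beta=np.zeros(8, dtype=np.float32))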
-
What an Optimal Point-to-Point Communication Library Should Be?
This series of articles introduces what a point-to-point communication library is and discusses some of the general characteristics of an optimal P2P communication library. Furthermore, it dives into the details of how to design an optimal P2P library and introduces the design of CommNet in OneFlow.
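As a rough sketch of the kind of interface such a library exposes, here is a toy in-process stand-in with hypothetical names (not the actual CommNet API): a sender posts a message to a destination rank, and the receiver posts a matching receive.

    import queue

    class ToyP2P:
        """Toy in-process stand-in for a point-to-point communication library."""

        def __init__(self, world_size):
            # One inbox per (src, dst) pair; a real library would use RDMA/TCP channels.
            self.inboxes = {(s, d): queue.Queue()
                            for s in range(world_size) for d in range(world_size)}

        def send(self, src, dst, payload):
            self.inboxes[(src, dst)].put(payload)

        def recv(self, src, dst):
            return self.inboxes[(src, dst)].get()

    net = ToyP2P(world_size=2)
    net.send(src=0, dst=1, payload=b"tensor bytes")
    print(net.recv(src=0, dst=1))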
-
How to Go Beyond Data Parallelism and Model Parallelism: Starting from GShard
This article lists the papers on GShard, presents background information and the inspiration behind them, and finally evaluates what else can be done to improve on GShard, drawing on similar work that has been done in OneFlow.
OneFlow Paper: https://arxiv.org/abs/2110.15032; Code: https://github.com/Oneflow-Inc/oneflow/
The GShard paper contains two main parts of work, one on parallel APIs and one on Mixture of Experts. The former is the more interesting part and the only one I will discuss. The contribution on parallel APIs is outlined clearly in the abstract of the paper:
GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler.
GShard Paper: https://arxiv.org/pdf/2006.16668.pdf
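OneFlow's counterpart to GShard-style annotation APIs is the global tensor with SBP signatures. A minimal sketch, assuming a recent oneflow release and a two-GPU job started with oneflow's distributed launcher:

    import oneflow as flow

    # Describe where the data lives and how it is laid out across devices:
    # split(0) shards dim 0 across the ranks, broadcast replicates the tensor.
    placement = flow.placement("cuda", ranks=[0, 1])
    x = flow.randn(8, 4).to_global(placement=placement, sbp=flow.sbp.split(0))
    w = flow.randn(4, 4).to_global(placement=placement, sbp=flow.sbp.broadcast)

    # OneFlow infers the output SBP and inserts any needed communication.
    y = flow.matmul(x, w)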
-
How to Implement an Efficient Softmax CUDA Kernel
In deep learning frameworks, all ops computed on the GPU are translated into CUDA kernel functions, and Softmax operations are no exception. Softmax is a widely used op in most networks, and the efficiency of its CUDA kernel implementation can affect the final training speed of many networks. So how can an efficient Softmax CUDA kernel be implemented?
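For reference, the computation such a kernel has to reproduce, in its numerically stable form (a NumPy sketch; the article's subject is how to map this row-wise reduction pattern onto GPU warps, blocks, and shared memory efficiently):

    import numpy as np

    def softmax(x):
        # Subtract the row max before exponentiating to avoid overflow,
        # then normalize by the row sum.
        x_max = x.max(axis=-1, keepdims=True)
        e = np.exp(x - x_max)
        return e / e.sum(axis=-1, keepdims=True)

    logits = np.random.randn(4, 1000).astype(np.float32)
    probs = softmax(logits)  # each row sums to 1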
-
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch
Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a DNN model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Paper: https://arxiv.org/pdf/2110.15032.pdf; Code: https://github.com/Oneflow-Inc/oneflow
What are some alternatives?
Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
tensorflow - An Open Source Machine Learning Framework for Everyone
elbencho - A distributed storage benchmark for file systems, object stores & block devices with support for GPUs
flashlight - A C++ standalone library for machine learning
kompute - General purpose GPU compute framework built on Vulkan to support 1000s of cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usecases. Backed by the Linux Foundation.
OpenMLDB - OpenMLDB is an open-source machine learning database that provides a feature platform enabling consistent features for training and inference.
serving - A flexible, high-performance serving system for machine learning models
ML-examples - Arm Machine Learning tutorials and examples