tinycudann
RecursiveFactorization.jl
tinycudann  RecursiveFactorization.jl  

9  8  
3,602  75  
2.1%    
5.1  6.1  
15 days ago  4 months ago  
C++  Julia  
GNU General Public License v3.0 or later  GNU General Public License v3.0 or later 
Stars  the number of stars that a project has on GitHub. Growth  month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
tinycudann

[D] Have their been any attempts to create a programming language specifically for machine learning?
In the opposite direction from your question is a very interesting project, TinyNN all implemented as close to the metal as possible and very fast: https://github.com/NVlabs/tinycudann

A CUDAfree instant NGP renderer written entirely in Python: Support realtime rendering and camera interaction and consume less than 1GB of VRAM
This repo only implemented the rendering part of the NGP but is more simple and has a lesser amount of code compared to the original (InstantNGP and tinycudann).
 Tiny CUDA Neural Networks: fast C++/CUDA neural network framework
 Making 3D holograms this weekend with the very “Instant” Neural Graphics Primitives by nvidia — made this volume from 100 photos taken with an old iPhone 7 Plus
 NVlabs/tinyCUDAnn: fast C++/CUDA neural network framework

Small Neural networks in Julia 5x faster than PyTorch
...a C++ library with a CUDA backend. But these highperformance building blocks might only be saturating the GPU fully if the data is large enough.
I haven't looked at implementing these things, but I imagine uf you have smaller networks and thus less data, the large building blocks may not be optimal. You may for example want to fuse some operations to reduce memory latency from repeated memory access.
In PyTorch world, there are approaches for small networks as well, there is https://github.com/NVlabs/tinycudann  as far as I understand from the first link in the README, it makes clever use of the CUDA shared memory, which can hold all the weights of a tiny network (but not larger ones).
 [R] Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (Training a NeRF takes 5 seconds!)
 Tiny CUDA Neural Networks
 RealTime Neural Radiance Caching for Path Tracing
RecursiveFactorization.jl

Can Fortran survive another 15 years?
What about the other benchmarks on the same site? https://docs.sciml.ai/SciMLBenchmarksOutput/stable/Bio/BCR/ BCR takes about a hundred seconds and is pretty indicative of systems biological models, coming from 1122 ODEs with 24388 terms that describe a stiff chemical reaction network modeling the BCR signaling network from Barua et al. Or the discrete diffusion models https://docs.sciml.ai/SciMLBenchmarksOutput/stable/Jumps/Dif... which are the justification behind the claims in https://www.biorxiv.org/content/10.1101/2022.07.30.502135v1 that the O(1) scaling methods scale better than O(log n) scaling for large enough models? I mean.
> If you use special routines (BLAS/LAPACK, ...), use them everywhere as the respective community does.
It tests with and with BLAS/LAPACK (which isn't always helpful, which of course you'd see from the benchmarks if you read them). One of the key differences of course though is that there are some pure Julia tools like https://github.com/JuliaLinearAlgebra/RecursiveFactorization... which outperform the respective OpenBLAS/MKL equivalent in many scenarios, and that's one noted factor for the performance boost (and is not trivial to wrap into the interface of the other solvers, so it's not done). There are other benchmarks showing that it's not apples to apples and is instead conservative in many cases, for example https://github.com/SciML/SciPyDiffEq.jl#measuringoverhead showing the SciPyDiffEq handling with the Julia JIT optimizations gives a lower overhead than direct SciPy+Numba, so we use the lower overhead numbers in https://docs.sciml.ai/SciMLBenchmarksOutput/stable/MultiLang....
> you must compile/write whole programs in each of the respective languages to enable full compiler/interpreter optimizations
You do realize that a .so has lower overhead to call from a JIT compiled language than from a static compiled language like C because you can optimize away some of the bindings at the runtime right? https://github.com/dyu/ffioverhead is a measurement of that, and you see LuaJIT and Julia as faster than C and Fortran here. This shouldn't be surprising because it's pretty clear how that works?
I mean yes, someone can always ask for more benchmarks, but now we have a site that's auto updating tons and tons of ODE benchmarks with ODE systems ranging from size 2 to the thousands, with as many things as we can wrap in as many scenarios as we can wrap. And we don't even "win" all of our benchmarks because unlike for you, these benchmarks aren't for winning but for tracking development (somehow for Hacker News folks they ignore the utility part and go straight to language wars...).
If you have a concrete change you think can improve the benchmarks, then please share it at https://github.com/SciML/SciMLBenchmarks.jl. We'll be happy to make and maintain another.
 Yann Lecun: ML would have advanced if other lang had been adopted versus Python

Small Neural networks in Julia 5x faster than PyTorch
Ask them to download Julia and try it, and file an issue if it is not fast enough. We try to have the latest available.
See for example: https://github.com/JuliaLinearAlgebra/RecursiveFactorization...

Why Fortran is easy to learn
Julia defaults to OpenBLAS but libblastrampoline makes it so that `using MKL` flips it to MKL on the fly. See the JuliaCon video for more details on that (https://www.youtube.com/watch?v=t6hptekOR7s). The recursive comparison is against OpenBLAS/LAPACK and MKL, see this PR for some (older) details: https://github.com/YingboMa/RecursiveFactorization.jl/pull/2... . What it really comes down to in the end is that OpenBLAS is rather bad, and MKL is optimized for Intel CPUs but not for AMD CPUs, so when the best CPUs are now all AMD CPUs, having a new set of BLAS tools and mixing that with recursive LAPACK tools is either as good or better on most modern systems. Then we see this in practice even when we build BLAS into Sundials for 1,000 ODE chemical reaction networks (https://benchmarks.sciml.ai/html/Bio/BCR.html).

Julia 1.7 has been released
>I hope those benchmarks are coming in hot
M1 is extremely good for PDEs because of its large cache lines.
https://github.com/SciML/DiffEqOperators.jl/issues/407#issue...
The JuliaSIMD tools which are internally used for BLAS instead of OpenBLAS and MKL (because they tend to outperform standard BLAS's for the operations we use https://github.com/YingboMa/RecursiveFactorization.jl/pull/2...) also generate good code for M1, so that was giving us some powerful use cases right off the bat even before the heroics allowed C/Fortran compilers to fully work on M1.

Why I Use Nim instead of Python for Data Processing
Not necessarily true with Julia. Many libraries like DifferentialEquations.jl are Julia all of the way down because the pure Julia BLAS tools outperform OpenBLAS and MKL in certain areas. For example see:
https://github.com/YingboMa/RecursiveFactorization.jl/pull/2...
So a stiff ODE solve is pure Julia, LUfactorizations and all.

Julia Receives DARPA Award to Accelerate Electronics Simulation by 1,000x
Also, the major point is that BLAS has little to no role played here. Algorithms which just hit BLAS are very suboptimal already. There's a tearing step which reduces the problem to many subproblems which is then more optimally handled by pure Julia numerical linear algebra libraries which greatly outperform OpenBLAS in the regime they are in:
https://github.com/YingboMa/RecursiveFactorization.jl#perfor...
And there are hooks in the differential equation solvers to not use OpenBLAS in many cases for this reason:
https://github.com/SciML/DiffEqBase.jl/blob/master/src/linea...
Instead what this comes out to is more of a deconstructed KLU, except instead of parsing to a single sparse linear solve you can do semiindependent nonlinear solves which are then spawning parallel jobs of small semidense linear solves which are handled by these pure Julia linear algebra libraries.
And that's only a small fraction of the details. But at the end of the day, if someone is thinking "BLAS", they are already about an order of magnitude behind on speed. The algorithms to do this effectively are much more complex than that.
What are some alternatives?
instantngp  Instant neural graphics primitives: lightning fast NeRF and more
SciMLBenchmarks.jl  Scientific machine learning (SciML) benchmarks, AI for science, and (differential) equation solvers. Covers Julia, Python (PyTorch, Jax), MATLAB, R
blis  BLASlike Library Instantiation Software Framework
PrimesResult  The results of the Dave Plummer's Primes Drag Race
diffrax  Numerical differential equation solvers in JAX. Autodifferentiable and GPUcapable. https://docs.kidger.site/diffrax/
Diffractor.jl  Nextgeneration AD
juliaup  Julia installer and version multiplexer
svls  SystemVerilog language server
RecursiveFactorization
SuiteSparse.jl  Development of SuiteSparse.jl, which ships as part of the Julia standard library.
vectorflow
Verilog.jl  Verilog for Julia