cifar10-fast-simple

Train CIFAR10 to 94% accuracy in a few minutes or seconds. A project by 99991, based on https://github.com/davidcpage/cifar10-fast.

Cifar10-fast-simple Alternatives

Similar projects and alternatives to cifar10-fast-simple

  • hlb-CIFAR10

    Train CIFAR-10 in <7 seconds on an A100, the current world record.

NOTE: The number of mentions indicates how often a project is mentioned in common posts, plus user-suggested alternatives. Hence, a higher number suggests a better cifar10-fast-simple alternative or greater similarity.

cifar10-fast-simple reviews and mentions

Posts with mentions or reviews of cifar10-fast-simple. We have used some of these posts to build our list of alternatives and similar projects. The most recent was on 2023-01-29.
  • Show HN: Train CIFAR10 to 94% in under 10 seconds on a single A100
    4 projects | news.ycombinator.com | 29 Jan 2023
    Great point! If I recall correctly, this team (well, nearly all the top teams from DawnBench) took Page's code and wrestled it into the multi-GPU realm. I'm a sucker for simplicity, as much as is reasonable (this codebase currently does not use JIT or any custom kernels! (!!!)), and also making sure that the average practitioner (like me) could do something workable without having to pay tons of money. My computing costs are $50 a month currently, i.e. the cost of Pro Colab. And we were able to break the single-GPU WR, and we're really close to pushing past any of the official multi-GPU submissions (old as they may be!).

    I took David's work in a different direction and just kept it true I think to the original spirit of things. Cycle times for experimentation are king in ML when it comes to the speed of research progress, regardless of what anyone else might tell you. Having tons of hardware may be really flashy and useful for the end product, but it's certainly not needed for much of the lo-fi, day-to-day stuff.

    That said, the A100 is definitely a step up. It is under 2x, though, as we are basically only memory-and-slow-backprop-kernel limited now, not as much by the convolutions (which now are among the shorter operations). Running https://github.com/99991/cifar10-fast-simple on my end gave me 17.2 seconds, vs the 24 seconds that Dave reported on the V100 (though the lovely author of that repo, @99991, was able to get faster speeds on their personal A100 setup). So we're definitely in that weird regime where moving everything to massively scaled matrix multiplies when possible is preferred, and sometimes that's...tricky for a few of these operations.

  • Show HN: Hlb-CIFAR10 0.2.0: New world record (~<12.38s) on single-GPU CIFAR10
    2 projects | news.ycombinator.com | 15 Jan 2023
    Hello everyone,

    After recreating the accuracy/rough speed from David Page's implementation in hlb-CIFAR10 0.1.0 (18.1s on an A100, SXM4, Colab), it was down to some basic NVIDIA kernel profiling to figure out which operations were the long poles in the tent. Perhaps (somewhat?) unsurprisingly, the NCHW <-> NHWC thrash was the worst part, but unfortunately the GhostBatchNorm was a barrier even using the faster-on-Ampere channels_last memory format.
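
    For reference, avoiding that kind of layout thrash comes down to converting the model and its inputs to channels_last once, up front, so no NCHW <-> NHWC conversions get inserted between layers. A minimal PyTorch sketch (the layers below are an illustrative stand-in, not the actual hlb-CIFAR10 network):

        import torch
        import torch.nn as nn

        device = "cuda" if torch.cuda.is_available() else "cpu"

        # Illustrative stand-in for the CIFAR10 net.
        model = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
        ).to(device)

        # Convert parameters and inputs to NHWC (channels_last) once. With a
        # consistent layout, cuDNN can pick the faster Ampere kernels and no
        # NCHW <-> NHWC conversions are needed between layers.
        model = model.to(memory_format=torch.channels_last)
        x = torch.randn(512, 3, 32, 32, device=device)
        x = x.to(memory_format=torch.channels_last)

        out = model(x)
        print(out.is_contiguous(memory_format=torch.channels_last))  # layout preserved end to end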

    A quick note before continuing -- some may find the use of a convolutional network on CIFAR10 to be curious. A quick answer would be that by optimizing well-known problems (especially when the testing loop is incredibly rapid), we get a much clearer picture of the fundamental information-learning limits of systems like this, as well as stable prototypes that can then be translated (potentially somewhat analogously) into other modalities. You can see this practice with a few researchers; Hinton comes to mind, though his work is much more fundamental and experimental than this is. Back to the release notes.

    Ultimately, however, we were able to get a similar level of regularization to the original GhostBatchNorm (called GhostNorm in the code), which allowed us to remove it along with a bunch of tensor-allocation/contiguous-tensor calls, saving almost exactly 5 seconds (!!!!).
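
    For readers unfamiliar with the technique, GhostBatchNorm computes batch-norm statistics over small virtual sub-batches rather than the full batch, trading extra splits and tensor allocations for regularization. A simplified sketch of the idea (David Page's original also keeps separate running statistics per split, which is omitted here):

        import torch
        import torch.nn as nn

        class GhostBatchNorm(nn.BatchNorm2d):
            """BatchNorm whose statistics come from small 'ghost' sub-batches."""

            def __init__(self, num_features, num_splits=16, **kwargs):
                super().__init__(num_features, **kwargs)
                self.num_splits = num_splits

            def forward(self, x):
                if self.training:
                    bn = super().forward  # plain BatchNorm2d forward
                    # Each chunk is normalized with its own mean/variance,
                    # which is where the extra regularization comes from.
                    return torch.cat([bn(chunk) for chunk in x.chunk(self.num_splits, dim=0)])
                return super().forward(x)

        gbn = GhostBatchNorm(64, num_splits=8)
        y = gbn(torch.randn(512, 64, 32, 32))  # same shape out as in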

    Replacing the call to nn.AdaptiveMaxPool2d((1, 1)) with torch.amax(dim=(2, 3)) shaved an additional 0.5 seconds off the clock, bringing us below Thomas Germer (@99991)'s excellently quick implementation of the same base method (https://github.com/99991/cifar10-fast-simple) and giving us the new world record.
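
    The two calls compute the same global spatial maximum, just through different kernels, as a quick equivalence check shows:

        import torch
        import torch.nn as nn

        x = torch.randn(512, 64, 8, 8)

        # Global max pooling via the adaptive-pooling module...
        a = nn.AdaptiveMaxPool2d((1, 1))(x).flatten(1)  # (512, 64)

        # ...and the same reduction as a plain max over the H and W dims.
        b = torch.amax(x, dim=(2, 3))                   # (512, 64)

        print(torch.equal(a, b))  # True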

    This work is pretty simple on its own -- though the various NVIDIA profilers can be daunting to use, and I can post snippets of the simplest approach I've found (via the torch.profiler route) if someone asks/is curious. That said, looking at kernel execution order and times, in conjunction with good research engineering practices, can really and truly do a lot to quickly improve a network.
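
    A minimal version of that torch.profiler pattern looks something like the following (the tiny model and input are placeholders, not the actual training setup):

        import torch
        import torch.nn as nn
        from torch.profiler import profile, ProfilerActivity

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU()).to(device)  # placeholder net
        x = torch.randn(512, 3, 32, 32, device=device)

        activities = [ProfilerActivity.CPU]
        if device == "cuda":
            activities.append(ProfilerActivity.CUDA)

        with profile(activities=activities) as prof:
            for _ in range(10):  # several steps so kernel time dominates startup noise
                model(x)

        # Kernels sorted by device/CPU time; the "long poles" appear at the top.
        sort_key = "self_cuda_time_total" if device == "cuda" else "self_cpu_time_total"
        print(prof.key_averages().table(sort_by=sort_key, row_limit=15))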

    This is what I'm pretty good at doing, so getting to flex a bit on a spare-time project is fun. I'm consistently storing up time saves in a draft bin of sorts and plan to keep releasing them in related/clustered batches as I'm able to polish them to whatever their capabilities seem to be. There is a lot of room to grow, and I think we now definitely have a good chance of hitting the 94% accuracy mark in under ~2 seconds within a few years!

    This work is meant to be a living resume for me; feel free to check out my README.md for more info. I love a lot of aspects of the technical/nitty-gritty side of the fusion of neural network engineering and the edge of research, particularly when it comes to speed, so this is my strong area. I'm certainly happy to answer whatever reasonable questions anyone might have -- let me help with getting this project going for you (or other related stuff -- feel free to ask! <3 :)))) )

  • [R] hlb-CIFAR10 0.2.0: New world record for single-GPU CIFAR10, ~<12.38s with one A100 (SXM4, Colab)
    2 projects | /r/MachineLearning | 15 Jan 2023
    Replacing the call to nn.AdaptiveMaxPool2d((1, 1)) with torch.amax(dim=(2, 3)) shaved an additional 0.5 seconds off the clock, bringing us below Thomas Germer (@99991)'s excellently quick implementation of the same base method (https://github.com/99991/cifar10-fast-simple) and giving us the new world record.

Stats

Basic cifar10-fast-simple repo stats:
  Mentions: 3
  Stars: 19
  Activity: 10.0
  Last commit: over 1 year ago

99991/cifar10-fast-simple is an open source project licensed under the MIT License, which is an OSI-approved license.

The primary programming language of cifar10-fast-simple is Python.

