sandsifter
A64FX
 | sandsifter | A64FX |
---|---|---|
 | 2 | 7 |
Stars | 473 | 435 |
Growth | 0.2% | 0.5% |
Activity | 0.0 | 2.8 |
 | over 5 years ago | 6 months ago |
Language | Python | - |
License | BSD 3-clause "New" or "Revised" License | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
sandsifter
A64FX
-
How about HPC on ARM?
This is their main repo https://github.com/fujitsu/A64FX
-
AMD-Powered Frontier Supercomputer Breaks the Exascale Barrier, Now Fastest in the World
You should check the architecture manual https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.0.pdf
-
How many x86 instructions are there?
> I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is an HPC machine from Fujitsu, for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps).
Fair enough. I have high hopes for SVE, though. The first-faulting memory ops and predicate bisection features look like a vectorization godsend.
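To make the appeal concrete, here is a toy Python model of a first-faulting load driving a strlen-style loop. It is only a sketch of the idea, not SVE code: the names `ldff1`, `break_before`, and `toy_strlen` are made up for illustration, the 4-lane vector length is arbitrary, and reading "predicate bisection" as break-before style predicate partitioning is an assumption.

```python
VL = 4  # hypothetical vector length in lanes; real SVE lengths vary by implementation


def ldff1(memory, base, vl=VL):
    """Toy model of a first-faulting load.

    `memory` is a list standing in for mapped memory; any index past its end
    is treated as unmapped. Returns (lanes, ffr), where ffr[i] is True for
    lanes that were loaded successfully.
    """
    if base >= len(memory):
        # A fault on the first lane is a real fault, as with a scalar load.
        raise MemoryError("fault on first active element")
    lanes, ffr = [], []
    faulted = False
    for i in range(vl):
        addr = base + i
        if not faulted and addr < len(memory):
            lanes.append(memory[addr])
            ffr.append(True)
        else:
            # Later faults are suppressed: this lane and all following
            # lanes are simply marked invalid.
            faulted = True
            lanes.append(0)
            ffr.append(False)
    return lanes, ffr


def break_before(pred, cond):
    """Keep active lanes strictly before the first active lane where cond holds."""
    out, seen = [], False
    for p, c in zip(pred, cond):
        if p and c:
            seen = True
        out.append(p and not seen)
    return out


def toy_strlen(memory, start=0):
    """Count bytes before the first NUL, reading VL lanes at a time."""
    n = start
    while True:
        lanes, ffr = ldff1(memory, n)
        is_nul = [v == 0 for v in lanes]
        before = break_before(ffr, is_nul)
        count = sum(before)
        n += count
        if count < sum(ffr):  # a NUL was found among the valid lanes
            return n - start


print(toy_strlen(list(b"hello\0")))
```

The point of the model is that the loop never needs a scalar tail or a page-boundary check: the load itself reports how many lanes were safe to read, and the predicate partition picks out the lanes before the terminator.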
> There's only so much space in my brain.
I'm still going to attempt a nerd-sniping with the published architecture manual. Fujitsu includes a detailed pipeline description including instruction latencies. Granted it's just one part, and it's an HPC-focused part at that. But it's not every day that this level of detail gets published in the ARM world.
https://github.com/fujitsu/A64FX/tree/master/doc
> I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again.
SVE is blessedly uniform in this regard.
> It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations.
This goes both ways, though. A64FX has two vector execution pipelines and one dedicated predicate execution pipeline. Since the vector pipelines cannot execute predicate ops, I expect it is not difficult to construct cases where code gets starved for predicate execution resources.
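A back-of-the-envelope model shows how that starvation could arise. This is a minimal sketch under strong assumptions that are mine, not from the manual: every op occupies its pipeline for exactly one cycle, issue is perfect, and there are no dependencies. The function names and the op counts are hypothetical; only the two-vector, one-predicate pipeline split comes from the comment above.

```python
from math import ceil


def min_cycles(vector_ops, predicate_ops, vector_pipes=2, predicate_pipes=1):
    """Lower bound on cycles when each op class can only use its own pipelines."""
    vector_cycles = ceil(vector_ops / vector_pipes)
    predicate_cycles = ceil(predicate_ops / predicate_pipes)
    return max(vector_cycles, predicate_cycles)


def bottleneck(vector_ops, predicate_ops, vector_pipes=2, predicate_pipes=1):
    """Report which pipeline class limits throughput in this toy model."""
    vector_cycles = ceil(vector_ops / vector_pipes)
    predicate_cycles = ceil(predicate_ops / predicate_pipes)
    if predicate_cycles > vector_cycles:
        return "predicate"
    if vector_cycles > predicate_cycles:
        return "vector"
    return "balanced"


# A loop body with 8 vector ops and 2 predicate ops is vector-bound.
print(min_cycles(8, 2), bottleneck(8, 2))
# A mask-heavy body with 4 vector ops and 6 predicate ops is predicate-bound.
print(min_cycles(4, 6), bottleneck(4, 6))
```

In this model, any loop body with more than one predicate op per two vector ops is limited by the single predicate pipeline, which is the starvation case described above.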
-
“Is Parallel Programming Hard, and, If So, What Can You Do About It?” v2 Is Out
The A64FX also has hardware barriers for synchronizing cores, which is a pretty GPU-like feature (very common on GPUs, rare on CPUs).
https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Speci...
- Fujitsu A64FX Microarchitecture Manual [pdf]
What are some alternatives?
glm-ucode - GLM uCode dumps
simdjson - Parsing gigabytes of JSON per second: used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks