I started writing a software triangle rasterizer. Not sure exactly why, but I just felt like doing it.
Like many software rasterizer projects I used Fabian Giesen's software rasterizer blog series [0] as the baseline.
For solid color triangles with depth testing my 10+ year old laptop achieve ~3.2 Gpixels/s fill rate, which is above 80% of the available memory bandwidth (~26 GiB/s at 64 bits per pixel), using memset as the baseline comparison.
I used Rust and std::simd. The code can be compiled for SSE2, NEON, AVX2 or AVX512 by changing compiler parameters. The inner loop uses 16-wide vectors (512 bits) although my computer only has 8-wide AVX2, but the compiler deals with that. I used generics so I can change vector width easily and have multiple vector widths in the same binary for benchmarking. 16-wide is about 10-20% faster than 8-wide. I was excited to see that AVX masked store instructions get used even though I did not explicitly write masked stores in the code. I spent a lot of time reading the disassembly of the generated code and it's very tight.
The performance falls off a cliff (170 Mpixels/s) once I introduce a "shader" in the inner loop because it is not SIMD friendly (one pixel at a time, not 16 pixels). But that is fine, I am intending to use this with visibility buffer style rendering (store integer triangle id's in color buffer) and/or software occlusion culling (depth buffer only). Neither technique need anything more than solid colors and z-buffer.
Slightly related note, the pictures in the articles show a handheld digital oscillscope. It's an Owon HDS200 series oscilloscope with signal generator and they are amazing and the lower frequency models are quite inexpensive.
I got myself one earlier this year and it does what it says on the tin. It can also be controlled from a computer via USB serial connection using a text based protocol (albeit poorly documented and a bit buggy). I used some python scripts to program the signal generator and then capture some measurements from the scope to check the frequency responses of some analog electronics circuits for guitar.
There is a small community around, there are a few repos on GitHub for using them and also this very long eevblog thred.
I've built half a dozen similar box guitars, it's such a fun little thing to build.
Do I see a piezo disk under the bridge in the 3rd instrument? Do you use some kind of preamp with it?
I have been recently experimenting with different kinds of piezo pickups [0] and preamp electronics for them. I've figured out a pretty nice JFET based circuit for the preamp and ordered tiny 13x13mm PCBs with tiny SMT components assembled and it works pretty well (but needs a second revision). They mount directly on the volume potentiometer and fit in a small space.
The one thing I haven't figured out yet is grounding the electronics. In a typical electric guitar you ground the electronics by touching the (grounded) strings, but that doesn't work very well (at all) with slide guitar when your left hand has got a bottle neck slide on it (made of glass or ceramic which is an insulator).
Drop a message below if you want to geek out more about home made guitars and/or related electronics. Depending on your location I could also send some preamp PCBs your way.
Happy to stay in touch. I haven't build one of these in a long time. My website and email address are in my bio here on HN.
To answer your question, yes, that's a piezo pickup, if I remember correctly. I don't think I had any kind of pre-amp. I wired it to a 1/8" stereo plug and then connected that to a very small portable guitar amp. It was all very rudimentary. A very distorted sound. I don't remember grounding to be a problem but I played the slide version of this with a slide made from a galvanized pipe, rather than glass.
Nice and concise description of quadtrees implemented with classical pointer chasing data structures.
A faster and (arguably) simpler way to construct quad/octrees is using morton codes, sorting and searching with flat arrays [0]. It's also probably easier to implement.
The gist of it is that you quantize the coordinates, do bit interleaving to construct morton code and then sort. The sorting is using numerical keys so you can use a radix sort for O(n) complexity which is much faster on a GPU (but on single CPU core a comparison based sort will probably win if the array isn't huge).
Now everything on the top half of the space is in the beginning of the array and the bottom half is in the end of the array (assuming yxyxyxyx morton code bits). Each of those halves is then split left/right and then each level below that alternates between vertical and horizontal splits.
To find the split point in the array, look at the first and last entry of the (sub)array you're looking at, and look for the first differing bit with (first ^ last).leading_zeros(). Then binary search for the first entry where that bit is high.
To traverse the quad/octree, repeat this process with the two halves you found. This can be done without recursion in O(1) memory using a fixed size stack because you know the depth of the "recursion" is at most half the number of bits in the morton code.
If you used radix sorting for building the array, you can avoid the binary search if you store the histograms from the counting phase of the sorting. Storing the whole histogram may be too much, but just a few highest order bits can already help.
Although I've found experimentally that for small (less that 10000 objects) just sorting and searching is faster if the whole array fits in L2 cache. On my 2015 laptop a single core can sort 10k objects with 64 bit keys in 1 millisecond. Traversing the tree for frustum culling is about 5x faster than just going through the entire array because a lot of geometry can be discarded very quickly.
With a good comparison based sort rebuilding the array after some changes is mighty fast because modern sorting algorithms are much faster for "almost sorted" inputs.
For range queries ("Find everything in the region" in the article), you can probably get better performance by using the BIGMIN/LITMAX method [1].
Now here's a brain teaser to delight your Friday: the article and the method I describe above is for storing points/centroids, but often you need to store axis aligned boxes instead (AABB). There is a very clever trick for this that I discovered independently but later found in some research papers. Can you come up with a (very) small change in the algorithm above to extend it to AABBs instead of centroids?
Discretize the AABB with a certain length proportional to the morton code grid size so that each sample point will land in a distinct but continuous sequence of morton codes enclosing the AABB, compute the hash of each of those points, then dump all of those morton codes into the sort and reverse lookup the other AABBs with the same morton code :)
The problem comes at scale, when you have many AABBs with different sizes. Memory pressure can be punishing and you are bottlenecked by the network for distributed solutions. Is there a faster way with fewer points? I would love to know!
For each AABB, you take the morton code of the minimum and maximum and store them in the array.
But for sorting and searching, construct a sort key from the pair of morton codes so that the most significant bits are the common prefix of the two morton codes and the least significant bits are the "level" (length of common prefix in bits).
E.g. for a 3d octree with 64 bit keys, you'd have 1 unused bit, 57 = 19 * 3 bits of morton code prefix and 6 bits of level. A 2d quadtree with 32b keys would have 1 unused bit, 26 = 2 * 13 bits of prefix and 5 bits of level.
You could also store just the sort key, but I've stored both AABB morton codes so that I can reconstruct the quantized AABBs for culling.
When you sort by this key, you'll get a linear octree where the beginning of the array is all the AABBs that overlap the subdivision plane, then everything on the left side of the plane and the rest is right side of the plane. When traversing the tree, you need two searches per node to find the entries belonging to this node and to find the split point between the left and the right side.
Having the "level" in the least significant bits works as a tiebreaker in the sorting to ensure that entries closer to the root of the tree are in the beginning of the array.
I wrote about this solution vaguely the last time this link came up (wasn't that long ago either, I think?) and you've filled in all the details and the essential Karras paper link; great post :)
Vulkan gives all the tools to avoid any "lag spikes" from shader compiling. In fact, causing them is much more difficult than OpenGL where they could happen in surprising places (and only on certain hardware).
The issue is two fold:
1. Some engines produce a lot of shader permutations. Some AAA titles can have 60000 different shaders compiled.
2. Some GPU rasterizer states (such as color blending) are implemented as shader epilogues.
In Vulkan 1.0 almost all of the pipeline state had to be pre-baked into a pipeline state object compiled ahead of time. This lead to a "shader permutation explosion" where different states need different pipelines.
This requires the game engine to either a) compile all the pipeline combinations ahead of time (slow loading time) or b) compile them as needed (lag spikes).
The core issue for this was solved years ago and now most of the pipeline states can be added to the command buffers ("dynamic states"). This solves the permutation explosion. But at the same time it opens the door for issue 2: some states (blending in particular) can cause a state-based recompile (like ye olde OpenGL days) at runtime.
The only solution to the second problem is not to use the dynamic states that trigger recompiling. That's basically only blending as far as I know. You can't even have dynamic blend state on all GPUs.
For maximum developer flexibility there's the shader object extension that allows mixing and matching shaders and pipeline states any way you want. This will cause state based recompiles at unpredictable times but it's an opt-in feature and easy to avoid if lag spikes are not wanted.
tl;dr: shader recompilation is easy to avoid in Vulkan but porting legacy engine code or art content may take you off the happy path.
> look up the extension on vulkan.gpuinfo.org and see ... currently 0.3% of all devices support it.
Afaik the extension isn't even finalized yet and they are pre-releasing it to gather feedback.
And you can't use gpuinfo for assessing how widely available something is or isn't. The stats contain reports from old drivers too so the numbers you see are no indication of hardware support.
To assess how widely supported something is, you need to look at gpuinfo, sort by date or driver version and cross reference something like steam hardware survey.
Like many software rasterizer projects I used Fabian Giesen's software rasterizer blog series [0] as the baseline.
For solid color triangles with depth testing my 10+ year old laptop achieve ~3.2 Gpixels/s fill rate, which is above 80% of the available memory bandwidth (~26 GiB/s at 64 bits per pixel), using memset as the baseline comparison.
I used Rust and std::simd. The code can be compiled for SSE2, NEON, AVX2 or AVX512 by changing compiler parameters. The inner loop uses 16-wide vectors (512 bits) although my computer only has 8-wide AVX2, but the compiler deals with that. I used generics so I can change vector width easily and have multiple vector widths in the same binary for benchmarking. 16-wide is about 10-20% faster than 8-wide. I was excited to see that AVX masked store instructions get used even though I did not explicitly write masked stores in the code. I spent a lot of time reading the disassembly of the generated code and it's very tight.
The performance falls off a cliff (170 Mpixels/s) once I introduce a "shader" in the inner loop because it is not SIMD friendly (one pixel at a time, not 16 pixels). But that is fine, I am intending to use this with visibility buffer style rendering (store integer triangle id's in color buffer) and/or software occlusion culling (depth buffer only). Neither technique need anything more than solid colors and z-buffer.
[0] https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlu...
reply