I don't know why I am wasting my breath either since it is not my law that abstr...

tptacek · on Oct 8, 2012

The opposite is often true. The right abstractions can provide a performance benefit, when they force programs to conform to constructs that are easily expressed in performant native code.

The standard C library is full of functions that are slower than equivalent functionality in higher level languages. Or do you think ticking through arrays of hopefully-ASCII bytes, byte by byte, waiting for the 0 byte, is the fastest way to compare two strings?

haberman · on Oct 8, 2012

> Or do you think ticking through arrays of hopefully-ASCII bytes, byte by byte, waiting for the 0 byte, is the fastest way to compare two strings?

I'm sure you must know this, but this is not even remotely how strcmp() is implemented in modern libc's.

beagle3 · on Oct 8, 2012

It is quite close, actually. "not even remotely" is a very strong statement.

There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86), but that doesn't change the fact that in order to compare two 128KB strings which are equal, you actually have to read 2*128KB from memory and compare each single byte (in groups of 8, if you are lucky enough with your alignments and instruction set).
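For readers who have not seen the 8-bytes-at-a-time trick, here is a minimal sketch in C. The names `has_zero_byte` and `streq_words` are made up for illustration, and the sketch assumes both buffers are padded so that reading in whole 8-byte words never runs past the end of the allocation. Real libc implementations handle alignment and page boundaries far more carefully.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero exactly when at least one of the 8 bytes in v is 0x00. */
static int has_zero_byte(uint64_t v)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    return ((v - ones) & ~v & highs) != 0;
}

/* Hypothetical helper: returns 1 if the two NUL-terminated strings are equal.
   ASSUMPTION: both buffers are readable in whole 8-byte words (padded). */
static int streq_words(const char *a, const char *b)
{
    for (;;) {
        uint64_t wa, wb;
        memcpy(&wa, a, 8);   /* fixed-size memcpy: an alignment-safe 8-byte load */
        memcpy(&wb, b, 8);

        if (wa == wb) {
            if (has_zero_byte(wa))
                return 1;    /* words match and the terminator is inside them */
            a += 8;
            b += 8;
            continue;
        }

        /* The words differ: settle it byte by byte so that bytes after the
           terminator are ignored. */
        for (int i = 0; i < 8; i++) {
            if (a[i] != b[i]) return 0;
            if (a[i] == '\0') return 1;
        }
        return 0;            /* not reached: unequal words differ in some byte */
    }
}
```

Note that every byte of two equal strings is still loaded and compared; the trick only reduces the number of instructions per byte.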

Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:

(a) if both strings are interned, it is enough to do a pointer comparison.

(b) if the length is not equal, the strings are not equal - a one word comparison.

(c) if both strings have been hashed before (quite likely), you can tell they are different if their hash is different - a one word comparison.

(d) if length is equal, and hash is (equal or uncomputed), you will have to do the comparison.
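Steps (a) through (d) can be sketched in C roughly as follows. The `struct hstr` layout, the field names, and the use of FNV-1a as the hash are all assumptions made for illustration; this is not CPython's actual string object. The sketch also has no intern table, so the pointer check only catches the "same object" case.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical string record with the length and a cached hash up front. */
struct hstr {
    size_t      len;         /* length in bytes */
    uint64_t    hash;        /* meaningful only when hash_valid is nonzero */
    int         hash_valid;
    const char *data;
};

/* FNV-1a, used here only as a stand-in hash function. */
static uint64_t fnv1a(const char *p, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

static struct hstr hstr_make(const char *s)
{
    struct hstr r = { strlen(s), 0, 0, s };   /* hash not computed yet */
    return r;
}

/* Compute and cache the hash, as a dictionary lookup might do. */
static void hstr_hash(struct hstr *s)
{
    s->hash = fnv1a(s->data, s->len);
    s->hash_valid = 1;
}

static int hstr_equal(const struct hstr *a, const struct hstr *b)
{
    if (a == b || a->data == b->data)
        return 1;                                 /* (a) same object or same bytes */
    if (a->len != b->len)
        return 0;                                 /* (b) one-word length check */
    if (a->hash_valid && b->hash_valid && a->hash != b->hash)
        return 0;                                 /* (c) one-word hash check */
    return memcmp(a->data, b->data, a->len) == 0; /* (d) full comparison */
}
```

Only case (d) touches the string bytes themselves; the earlier checks read a word or two from each record.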

Whether this trade-off is worthwhile depends on your application. If most of your strings are 7 characters or fewer (as is often the case for software dealing with e.g. stock tickers), then the C approach on 64-bit archs wins hands down: you should actually have all the strings in-place because a pointer takes more memory and causes contention. However, if your strings tend to be 100 bytes or above, and many of them have equal prefixes, the Python approach wins hands down.

haberman · on Oct 8, 2012

> There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86)

It's actually 16 bytes at a time on any machine that supports SSE (maybe 32 bytes soon with AVX).

> Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:

Sure, different abstractions have different trade-offs. All of these abstraction possibilities are available to C. strlen() isn't "the C approach," it's just the most common one. Any C application where comparison of long almost-identical strings is important will surely use techniques similar to what Python does. On the other hand, the reverse is not true: Python does not have access to all the same optimizations that a C programmer could draw upon to do string processing.

beagle3 · on Oct 8, 2012

I was mostly replying to your assertion that "that's not even remotely how strcmp is implemented", which, for most definitions of "even remotely", is false.

> All of these abstraction possibilities are available to C

That's a tautology at best, and meaningless at worst. The way strcmp() is implemented, which we discussed above, is not actually available in C.

> Any C application where comparison of long almost-identical strings is important will surely use techniques similar to what Python does.

And similarly, any Python application that requires (insert some uncommon requirement ..) can do what C can with the same kind of help that strcmp() gets - by delegating to the layer that does it best.

> Python does not have access to all the same optimizations that a C programmer could draw upon to do string processing.

Pure Python is more limited than C, true. But specific Python implementations (RPython, PyPy, IronPython) might have better access to some optimizations than specific C implementations.

And there's always the aspect of "what's theoretically possible" and "what happens in practice". The fact that PyPy will dynamically switch from 32-bit to 64-bit to unbounded-long-integer might make a real difference on a 32-bit machine where the code might occasionally require 2048 bits, but overwhelmingly requires just 32 bits.

It is possible to construct pathological cases where there are e.g. pow(2,128) possible type combinations within a function, the exact combination is only known from the data (but is consistent for an entire run) - in which case, PyPy will essentially compile the right program to machine code, whereas you cannot do AOT because of the number of combinations; which means a C program will essentially be an interpreter based on those types.

But I don't care about theoretically constructed pathologies. In practice, especially with time-to-implement constraints, it is not true that a C programmer has all the tools at their disposal that higher level languages have.

haberman · on Oct 8, 2012

> I was mostly replying to your assertion that "that's not even remotely how strcmp is implemented", which, for most definitions of "even remotely", is false.

eglibc's SSE2 implementation of strcmp is just over 5k of machine code, whereas the simple implementation compiles to 56 bytes on x86-64. That was my definition of "not even remotely." I did not mean to imply that it was a fundamentally different algorithm, only that it was a far more sophisticated and optimized implementation of the same algorithm. My apologies if this was unclear or appeared overstated.

> That's a tautology at best, and meaningless at worst.

By "these abstraction possibilities" I meant the ones you mentioned, which is true.

> And similarly, any Python application that requires (insert some uncommon requirement ..) can do what C can with the same kind of help that strcmp() gets - by delegating to the layer that does it best.

That's great and I fully support that. What I am arguing against is high-level language cheerleading that discounts the importance of C (or assembly for that matter). Since you mention PyPy, I have to say that their PR is some of the worst in this regard; some of their "faster than C" claims are actively misleading, like this one that benchmarks some generated string formatting code against sprintf() (which re-parses the format string every time): http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-a...

marshray · on Oct 8, 2012

> > There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86)

> It's actually 16 bytes at a time on any machine that supports SSE (maybe 32 bytes soon with AVX).

Is there a sequence of fewer than 16 instructions to spot a NUL byte inside the 16 byte block?

haberman · on Oct 8, 2012

> Is there a sequence of fewer than 16 instructions to spot a NUL byte inside the 16 byte block?

Yes:

    pxor  %xmm1, %xmm1
    pcmpeqb (mem), %xmm1  // Do 16 byte-wise compares
    pmovmskb %xmm1, %eax  // Move results into the low 16 bits
    test %eax, %eax
    jnz saw_null
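
The same idea can be written with compiler intrinsics instead of raw assembly. This is only a sketch: `first_nul_in_block16` is a made-up name, the caller is assumed to own all 16 bytes at `p`, and the scalar fallback is there so the code still compiles where SSE2 is not available.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first 0x00 byte in the 16-byte block at p,
   or -1 if the block contains none. */
static int first_nul_in_block16(const unsigned char *p)
{
#if defined(__SSE2__)
    __m128i zero  = _mm_setzero_si128();                  /* pxor     */
    __m128i block = _mm_loadu_si128((const __m128i *)p);  /* unaligned 16-byte load */
    __m128i eq    = _mm_cmpeq_epi8(block, zero);          /* pcmpeqb  */
    int mask      = _mm_movemask_epi8(eq);                /* pmovmskb */

    if (mask == 0)
        return -1;
    int i = 0;
    while ((mask & 1) == 0) {   /* bit i of the mask corresponds to byte i */
        mask >>= 1;
        i++;
    }
    return i;
#else
    /* Portable fallback for targets without SSE2. */
    for (int i = 0; i < 16; i++)
        if (p[i] == 0)
            return i;
    return -1;
#endif
}
```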

marshray · on Oct 9, 2012

Very cool.

aptwebapps · on Oct 8, 2012

What's a modern libc? Not being snide or anything: I have no idea. I didn't know where to look so I looked at glibc [1] and that, in fact, does seem to be how it works.

[1] http://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcm...

pbsd · on Oct 8, 2012

You might want to look at the non-generic implementations, like http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...

darklajid · on Oct 8, 2012

This might be the most delicious piece of code ever written (I cannot judge that, really), but we're talking 2307 lines for a string comparison.

I'm impressed, but also scared and amused. I always look with envy at system level guys, lacking the knowledge to play on that level. This, though, comforts me quite a bit. That's just not my definition of 'fun'.

aptwebapps · on Oct 8, 2012

I've never written anything non-trivial in C, and even that was ages ago, so I didn't hope to be able to judge the niceties of different implementations, but I can't even tell what the algorithm is in that one.

pirateking · on Oct 8, 2012

I have heard of musl[1] and Bionic[2]. Interested in hearing about any others.

[1] http://www.musl-libc.org/

[2] http://en.wikipedia.org/wiki/Bionic_(software)

aptwebapps · on Oct 8, 2012

Those both seem to use basically the same algo as the glibc one, although a little more compactly written.

justincormack · on Oct 8, 2012

uClibc and dietlibc. Must recommend musl for code readability and implementation quality.

goggles99 · on Oct 8, 2012

Care to provide an example?

Your statement would only be true if the abstraction implemented something more efficiently than the C developer. This in and of itself says nothing about C or X language for that matter being faster.

homeomorphic · on Oct 8, 2012

It could also be true if the abstraction lays bare optimizations not available (readily enough) to the C compiler. While this wouldn't make X faster than C as languages (what does that mean, anyway?), it could make an X compiler produce faster code than the best C compiler.

goggles99 · on Oct 8, 2012

You are not seeing my point. If the JIT compiler can be made better, then so can the C compiler, right? Until some point where neither can be made better. Then C would win because it does not carry the extra overhead of the other.

CJefferson · on Oct 8, 2012

OK, here is an example. Let's consider 'memcpy'. There are various problems with implementing memcpy in C:

1) You can't assume that pointers are aligned.

2) (related) If the user asks you to copy n bytes, you had better not read byte n+1, even if you don't need it, because maybe it is on a different page which is not available for reading.

So, an optimised memcpy usually first has to do a bit of work to find the 'aligned segment', do that, topped and tailed with some special cases.
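
A rough sketch of that head/body/tail shape in C, assuming nothing about any particular libc. `sketch_memcpy` is a hypothetical name, and the fixed-size `memcpy` calls in the body stand in for the single wide load and store instructions a tuned implementation would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Teaching sketch only; never reads or writes outside the n requested bytes. */
static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: 8 bytes per step. The source may still be unaligned, which the
       fixed-size memcpy load tolerates. */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);
        memcpy(d, &w, 8);
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: whatever is left, one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The head and tail loops are exactly the "special cases" that a runtime with guaranteed alignment and padded sizes could skip.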

A JIT compiler for Javascript (for example) takes issues of alignment out of the hands of the user, so can make any assumptions it wants by keeping such choices away from users.

Of course, we could write a memcpy_aligned (and some systems have such a method), but we can't ever optimise simple memcpy as much as the equivalent in javascript.

ori_b · on Oct 8, 2012

1) Loop versioning. You can check if the pointers are aligned, and then decide whether you want to take the fast path or the slow path. (A better example would be aliasing. Because pointers don't carry the size of the region they point to, we can't check if two pointers alias. Restrict solves this problem with programmer intervention.) I believe that GCC's vectorization code already does this sort of versioning.

Example:

     testl $0xf,%ptr
     jz aligned_path // in reality, fast path would be inlined here for cache locality reasons.
     jnz unaligned_slow_path

2) If your reads are naturally aligned, you will never be able to read a word that starts on one page and ends on another, so working in the largest naturally aligned chunks you can is valid. This is a non-problem. (In fact, glibc takes advantage of this in its assembly implementations of various SSE string functions.)
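
The page-boundary claim in point 2 is easy to check with a little arithmetic. The sketch below assumes a 4096-byte page (`SKETCH_PAGE_SIZE` is an illustrative constant, not a system macro) and uses made-up helper names.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* ASSUMPTION: 4 KiB pages */

/* Nonzero if a read of `size` bytes starting at `addr` spans two pages. */
static int crosses_page(uintptr_t addr, size_t size)
{
    return (addr / SKETCH_PAGE_SIZE) != ((addr + size - 1) / SKETCH_PAGE_SIZE);
}

/* Checks every naturally aligned address across two pages for a given
   power-of-two read size that divides the page size. */
static int any_aligned_read_crosses(size_t size)
{
    for (uintptr_t addr = 0; addr < 2 * SKETCH_PAGE_SIZE; addr += size)
        if (crosses_page(addr, size))
            return 1;
    return 0;
}
```

An aligned 16-byte read at offset 4080 ends at byte 4095, the last byte of the page, while an unaligned one at offset 4090 spills into the next page.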

CJefferson · on Oct 8, 2012

This is true, but a high-level language can avoid having to even make that check, making it "faster than C", in at least some sense of the phrase.

goggles99 · on Oct 9, 2012

If it does not "make that check"... then how does it know? (It does check)

CJefferson · on Oct 10, 2012

A Javascript JIT doesn't have to make the check, because it can simply decide that all memory blocks, and all copies, occur on word boundaries.

One of the problems with optimising C is that you have to assume (in simple terms, obviously the full story is more complicated) whenever a user's function is called, then all of memory might have changed. Even if you have a pointer to a const int, maybe in another part of the code there is another pointer to that int which isn't const, so you have to assume it might have changed.
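
A small C sketch of that const-pointer situation, with hypothetical names throughout. The function only sees a `const int *`, yet the value behind it changes across the call, so the compiler has to reload it.

```c
/* A plain, non-const object that other code can modify. */
static int counter = 1;

/* Modifies the object through a non-const path. */
static void bump(void)
{
    counter++;
}

/* `p` is a pointer to const int, but that only restricts writes through p.
   It does not promise that the pointed-to int stays the same. */
static int read_around_call(const int *p, void (*fn)(void))
{
    int before = *p;
    fn();               /* opaque call: may change *p via another pointer */
    int after = *p;     /* must be loaded again, not reused from `before` */
    return before * 10 + after;
}
```

Calling `read_around_call(&counter, bump)` on a fresh `counter` returns 12: the first read sees 1 and the second sees 2.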

In a language with different semantics, the optimisers can have a much easier job seeing what is going on, and know what can affect what else. This is the reason that Fortran compilers can often be seen beating C compilers.

homeomorphic · on Oct 8, 2012

No, not necessarily. There could be optimization opportunities in the more abstract language that perhaps cannot ever arise in the less abstract.

tptacek · on Oct 8, 2012

Sure: interned strings.