gizmo 5 days ago

What this blog post gets at, indirectly, is optimizing "zero-cost abstractions". In C++ there are many constructs that, when translated directly to machine code, would result in a severe performance penalty, but clever compilers can remove this overhead. (In simple programs. When zero-cost abstractions get stacked on top of each other, compilers give up and performance craters.)

In languages like Smalltalk, Everything Is An Object, and instead of function calls, messages are sent between objects. It's very elegant. It's up to the compiler to figure out that "1 + 1" doesn't actually require object allocations and dispatch of the + message. Instead all these abstractions can be stripped away so you end up with a single machine code instruction for the addition. In practice this is absolutely hopeless: the compiler will not be able to transform code that is dynamic in theory but static in practice into correct and efficient machine code.
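
To make that concrete, here is a rough Python sketch (names made up, illustrative only) of what "1 + 1" means when every integer is an object and arithmetic is message dispatch:

    # Illustrative only: an "everything is an object" integer where addition
    # goes through dynamic method lookup instead of a single ADD instruction.
    class Int:
        def __init__(self, value):
            self.value = value

        def send(self, selector, arg):
            method = getattr(self, selector)   # method lookup at run time
            return method(arg)

        def plus(self, other):
            return Int(self.value + other.value)   # allocates a new object

    result = Int(1).send('plus', Int(1))   # two allocations plus a dispatch for 1 + 1

The optimizer's job is to prove that all of this can collapse into one machine instruction, and in practice it can't.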

This is for two reasons.

1) Language semantics. If you have multiple threads running, then one thread can rewrite the code running on another thread. This means the compiler will be hamstrung in the assumptions it can make about even the simplest for loop. Either you must actually send messages for every integer addition, because another thread might be listening to those messages, or you disallow multi-threading. In either case your performance will be abysmal.

2) Wide instructions. Modern CPUs can do much work in parallel thanks to SIMD. In a world where every integer is a heap object you have no chance of SIMD optimization.

It's instructive to look at JavaScript and Python. JavaScript engines are much faster than Python engines because JavaScript doesn't do threads. This means the compiler can do tons of static code analysis. No such luck with Python. If you want to go fast with Python you use numpy, which just calls out to C. Here too the optimizations become possible because the language semantics that hold optimization efforts back are disregarded once the Python code calls into C.
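
To illustrate that last point (a rough, indicative sketch; assumes NumPy is installed), here is the same reduction as a plain Python loop and as a NumPy call that drops into C:

    import numpy as np

    data = list(range(1_000_000))
    arr = np.arange(1_000_000, dtype=np.int64)

    def python_sum(xs):
        total = 0
        for x in xs:
            total += x          # dynamic dispatch and boxing on every element
        return total

    # arr.sum() runs as a single C loop over packed machine integers,
    # which is also what makes SIMD possible at all.
    assert python_sum(data) == int(arr.sum())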

Most of the code we write in practice is pretty static. Dynamic features like virtual function calls, type introspection, and run-time code generation can be added to static languages without much difficulty. On the other hand, it's extremely hard (if not impossible) to get even 10% of the performance your computer is capable of when using a highly dynamic language. In practice you don't even get 1%. I know this is an unpopular message, but many people have worked hard at making slow languages fast for decades now and it's just not happening.

  • noelwelsh 5 days ago

    > It's instructive to look at JavaScript and Python. JavaScript engines are much faster than Python engines because JavaScript doesn't do threads.

    I don't think this is the primary difference. While JS's semantics are bad for performance, Python's are insane. Two examples:

    - In Python you can change the meaning of operators, so that 1 + 2 could mean something very different from integer addition. This can happen at any point within a program, so you must always guard against the possibility that basic arithmetic---some of the fastest operations on a CPU---has changed meaning. This slows things down considerably. JS is fairly dynamic, but AFAIK it's impossible to change what + means for numbers.

    - Python allows inspecting the stack (https://docs.python.org/3/library/inspect.html#the-interpret...) so at any point in your program you have to be prepared to reflect the stack as Python objects. Bye-bye performance.
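
    A minimal sketch of what that implies (hypothetical function names):

        import inspect

        def who_called_me():
            # The interpreter must be able to materialize the live call stack
            # as Python objects at any moment.
            return inspect.stack()[1].function

        def outer():
            return who_called_me()

        print(outer())   # prints: outer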

    • gizmo 5 days ago

      You can't actually override __add__ in Python for numbers. Numbers in Python are magic for performance reasons. You can look at the Python bytecode to easily see this:

          import dis

          def add_five(a):
              return a + 5

          dis.dis(add_five)
      
      And this returns something like:

          LOAD_FAST     a
          LOAD_CONST    5
          BINARY_OP     +
          RETURN_VALUE
      • gergo_barany 4 days ago

        You're right that you can't override __add__ for numbers, but the bytecode doesn't say this. The bytecode isn't specialized to numbers. It will call the __add__ operation on whatever object you give it. The interpreter will probably have a fast path in its implementation of the BINARY_OP instruction that checks for numbers. But that doesn't mean that the BINARY_OP instruction can only handle numbers.

        Example:

            >>> class foo(object):
            ...     def __add__(x, y):
            ...         print(f'"adding" {x} and {y} by returning {y}+1')
            ...         return y + 1
            ... 
            >>> f = foo()
            >>> f + 3
            "adding" <__main__.foo object at 0x725f470c4e90> and 3 by returning 3+1
            4
            >>> add_five(f)
            "adding" <__main__.foo object at 0x725f470c4e90> and 5 by returning 5+1
            6
  • brians 5 days ago

    I was all set to defend Scheme, which has several extremely high performance implementations—and then realized you probably just meant Smalltalk, an entirely different language, and picked the wrong 1970s minimalist language starting with S.

    • gizmo 5 days ago

      Oops, fixed that. (But as far as I know even performance-oriented Scheme compilers like Bigloo don't generate very efficient code nor do they auto-vectorize. I don't think there is any Scheme compiler that is in the same league as clang performance wise, but I'm happy to be corrected on that.)

      • chuckadams 5 days ago

        The unfortunately-named Stalin compiler boasted pretty good performance of the output (the compiler itself not so much). Hasn't been maintained since R4RS days tho, so probably not useful for any real-world code.

        • gizmo 5 days ago

          15 years ago CPUs cared much more about branches and much less about cache locality and threading. Nowadays it's actually really hard to shovel data into the CPU as quickly as it can churn through it. The kind of optimizations talked about here[1] won't get you to even 1% of the performance your CPU is capable of.

          A Ryzen 5950X can do something like 5 GHz * 4 IPC * 16 cores = 320 billion ops per second. Add SIMD on top of that. It's insane.

          [1] https://cstheory.stackexchange.com/questions/9765/the-stalin...

          • sfn42 5 days ago

            Is this something I should know about as a C# programmer or is it largely handled by the language? I usually don't worry about low level optimization, focusing mostly on writing code that is reasonably efficient with regards to time and space complexity. I know there's a lot of gain to be had from minimizing allocations and such, but for most things it seems pointless to worry about.

            • chuckadams 5 days ago

              It's something you should rely on the language to optimize, but it always helps to not allocate frivolously if you can help it. I'm talking about using StringBuilder rather than concatenation, avoiding unnecessary boxed types, etc. Pooling every last thing doesn't do performance any favors mind you -- the tenured generation is expensive to collect, whereas eden is just a pointer bump.

              • neonsunset 5 days ago

                Generally speaking, transient buffers in performance-sensitive .NET code today follow the pattern of

                    T[]? toReturn = null;
                    // Small requests live on the stack; larger ones rent from the shared pool.
                    Span<T> buffer = length <= threshold
                        ? stackalloc T[threshold]
                        : (toReturn = ArrayPool<T>.Shared.Rent(length));
                    /* logic */
                    if (toReturn != null) ArrayPool<T>.Shared.Return(toReturn);
                
                either directly or via a buffer-like type that does it behind the scenes.

                ArrayPool<T>.Shared is generally well-behaved in terms of GC, and small lengths will not even hit it, being practically free.

                The amortized cost of this is substantially lower than allocating such arrays and then throwing them away: stackalloc, particularly for short lengths, is so cheap it might as well be noise (cheaper even than the still-fast heap allocation of a short array), and once the length passes the threshold chosen to avoid excessive stack pressure, it becomes faster to retrieve a pre-allocated array from a thread-local bucket of the shared array pool.

            • neonsunset 5 days ago

              It might be. If you are writing low-level-ish code in C#, particularly code that uses its SIMD abstractions, this is something you do care about, as it is relevant to extracting maximum instruction-level parallelism from modern deep and wide CPU cores.

              Pretty much the same knowledge that applies to C/C++/Rust applies to C# in such scenarios, save for swapping auto-vectorization considerations for the simpler use of Vector128/256/512<T> (which, in turn, applies to the use of intrinsics in both the former and the latter).

    • ykonstant 5 days ago

      I know that SBCL can achieve C-like performance on some tasks, but I was not aware of a Scheme that did that; what are the implementations? Is any open source?

      • widdershins 5 days ago

        Chez Scheme is probably the fastest well-maintained Scheme implementation. It's open source (MIT license). I would describe the performance as Go-like, rather than C-like, since it's garbage collected. But it's pretty darn good, especially for a dynamically typed language.

        • ykonstant 4 days ago

          Thanks! This actually makes me wonder if the claims of SBCL C-like performance are out-of-date or exaggerated. I'd like to know if the SBCL compiler produces SIMD-enabled, cache-friendly machine code and if so, by what mechanisms/annotations.

  • whizzter 5 days ago

    Python "threads" have no impact on the slowness of Python.

    1: Threads still run under the GIL (global interpreter lock) because the entire object model is thread-unsafe.

    2: The Python interpreter works with full heap objects, while JS interpreters all work with some kind of tagged values (NaN-boxed or SMI bit-tagged).
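
    A rough sketch of the tagging idea (illustrative Python; real engines do this in C++ on raw machine words):

        # Small-integer (SMI) tagging: the low bit says whether a word is an
        # inline integer or a heap pointer, so most integers never touch the heap.
        def tag_smi(n):
            return (n << 1) | 1

        def is_smi(word):
            return word & 1 == 1

        def untag_smi(word):
            return word >> 1

        w = tag_smi(42)
        assert is_smi(w) and untag_smi(w) == 42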

    The biggest thing about Python is that they've never wanted to rock the boat on API compatibility, since their strength is the ecosystem, so CPython is quite tied down. Compare it to PyPy, which is far closer to popular JS implementations.

  • marcosdumay 5 days ago

    About #1, a lot of languages just prohibit this, either by making the changes thread-local or by making the global state read-only for some threads.

    You are probably thinking about Python, but Python is kind of a worst case scenario for compiler optimizations. Almost all languages fare better than it.

    About #2, some languages do pack your data behind the scenes. It's hard to implement, so this is not very common, but the languages more biased toward vector and matrix calculations normally do this.

    • gizmo 5 days ago

      If you can't share memory between threads you can't write high-performance programs. If you have a global interpreter lock you can't write high-performance programs. If your threads can only communicate through channels or IPC you can't have high-performance programs.

      Languages can't pack your data effectively unless you tell the compiler what data types to use. Array<u8> or Array<s32>? Should ints overflow or not? Unless the programmer specifies what should happen, the compiler might pick a data type that is 4x larger than necessary.
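
      A quick way to see the difference (hedged NumPy example; the exact factor depends on the types involved):

          import numpy as np

          a = np.zeros(1_000_000, dtype=np.uint8)   # 1 MB: the programmer said u8
          b = np.zeros(1_000_000, dtype=np.int32)   # 4 MB: same element count, 4x the bytes
          assert b.nbytes == 4 * a.nbytes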

      Despite all the effort put into JavaScript engines in the past decades, the fastest JS code is cross-compiled C code. Quake II runs perfectly in the browser and it's 140k lines of C. But you can't make games of similar complexity in regular JavaScript because the browser would completely choke on it.

      • anamax 3 days ago

        > If you can't share memory between threads you can't write high performance programs.

        I don't know whether shared memory is required for high-performance programs but I do know that python THREADS can share memory.

        The GIL means that only one thread is executing bytecodes at a time. That's it. It doesn't place any other restriction on python programs.

        I write crypto trading code in python. It takes 0.1ms from the time I receive a packet until the time I send the packet containing the trade based on receiving that packet. Yes, that 0.1ms includes the logic to decide whether to trade.

        FWIW, that chain of processing happens to involve two threads that share memory.

        • jrvieira 3 days ago

          0.1ms is hundreds of thousands of instructions on a modern phone...

      • marcosdumay 5 days ago

        You don't need to share the global symbol table to share memory.

bearjaws 5 days ago

Semi-related: I recently started doing a "leetcode.com" challenge every morning with my coffee.

I am appalled that the top 150 interview questions are 80% iteration avoidance (if you are aiming for good timing).

My experience has always been that the problem is how you handle API calls, organize queries, etc... Maybe 5 times in my career I've thought 'oh wow, we need to avoid this for loop'. Then of course there is managing complexity...

brabel 5 days ago

You can get a PhD by doing stuff like this?? I do that kind of thing for fun after work on my hobby projects, and at work I've spent weeks speeding up a single HTTP request path.

  • kragen 5 days ago

    theoretically your result has to be (1) novel and (2) published with such a good explanation that people understand both the result and what's novel about it, as well as how to reproduce it, but yes. the difference between tinkering and science is documentation, and it needs to be public and conscious of the existing discourse to be scholarship

    • gwervc 5 days ago

      The biggest difference is gate-keeping by peer review, which is quite random. Reproducibility is a meme in most fields.

      • jvanderbot 5 days ago

        Even in random noise, with enough samples the signal is measurable. Thus - publish or perish.

      • kragen 5 days ago

        i knew someone was going to reply by pointing out that academia often fails to live up to the standards of scholarship and, where applicable, science. but that's sort of missing the point; i was describing the standards, not how the existing institutions' performance measures up to them

    • brabel 5 days ago

      No offence, but what's novel in this post??? I found it pretty basic, no?

      • OJFord 5 days ago

        This is a blog post, not published peer-reviewed research. You don't 'get a PhD' for a blog post, no.

      • kragen 5 days ago

        not every brick in the taj mahal is beautiful

  • FrostKiwi 5 days ago

    A highly specialized topic of computer science, a tough challenge with no easy solutions, and new discoveries documented for anyone to follow in the future.

    I would be sad if you couldn't get a PhD for stuff like this.

  • myworkinisgood 5 days ago

    There are three things here: 0. A PhD is a starting degree in research, in that it shows you can do research. So, a lot of industry work (after you have some years of experience) can be equivalent to PhD work.

    1. An obvious exaggeration but somewhat true: https://xkcd.com/664/

    2. Apart from tinkering and actually achieving a result, doing a PhD often also means knowing why it makes sense to explore the field. So, in addition to actually achieving a speedup, you have to study the background of all the approaches and be aware of the advantages and disadvantages of each of them.

  • lifthrasiir 5 days ago

    While I don't have a PhD---I do have a master's degree in PL and some published papers though---yes, you can, because a PhD only means that you can do the level of research required for academia, at least in the technical sense. (A PhD also suggests, but does not always imply, that you have enough tacit knowledge about a certain academic area.) You can even publish papers yourself if you can go through the usual academic process like peer review [1].

    [1] Sadly enough, the current prevalence of fake research means that even publishing to arXiv requires an endorsement from an established researcher.

    • gwervc 5 days ago

      > You can even publish papers yourself

      It's becoming harder and harder because of APC fees :(

      • nimish 5 days ago

        Yeah this is kind of nuts:

        >For the majority of IEEE’s fully open access journals, the article processing fee will be $1,995 USD. Some exceptions apply for certain titles, see individual journal author instructions for specific details.

        Reviewers don't get paid, journal editorial board members usually don't get paid, and it's all online so publishing is basically free.

        Rent seeking at its finest!

      • lifthrasiir 5 days ago

        Indeed, that's why I mentioned arXiv.

  • gsck 5 days ago

    The only thing that differentiates having fun from science is writing it down.

    • brabel 5 days ago

      Right, and I do that sometimes by writing blog posts. But as another commenter said, I would still have to:

      * get peer reviewed.

      * mention previous literature (that sucks, but I agree is needed to show an understanding of the topic).

      * introduce novel concepts.

      The barrier is high.

Frieren 5 days ago

It seems I am the only one bothered by the title. "Optimizing for X" is a very common expression, so I read it as "optimizing for loops", as in adding more loops, rather than the intended "optimizing for-loops".

Or am I the only one?

larsga 5 days ago

FWIW, my experience with JSLT https://github.com/schibsted/jslt/ was that I first wrote an AST interpreter. Then I and another person independently wrote bytecode interpreters, but the AST interpreter remains faster than both.

It may be that the JVM is simply better at optimizing one type of code than the other.

  • brabel 5 days ago

    If you're writing the interpreters in Java (or another language compiled to JVM bytecode), your code itself is going to be interpreted and JIT'd on the fly by the JVM, so it's really difficult to come up with real optimisations to your interpreter.

    In my experience, the JVM "likes" dumb code: direct, without abstractions on top. Use static final methods for everything, no inheritance, avoid memory allocation (though the JVM is insanely good at optimising small, short-lived allocations). In a bytecode interpreter you may be tempted to use "proper" type hierarchies with Ops represented by interfaces and things like that. Instead, using an array of objects of a single type that can hold the "variants" of each Op (basically, C-style OOP without vtables) is what makes things go from slow to fast. The code looks very non-idiomatic for Java, of course, but with care can be made pretty readable.

    • robinei 5 days ago

      Ideally, arrays of primitive values like `long` to store the bytecode representation, for properly good performance. Then use bitwise manipulation to access the "fields".
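
      Something like this, sketched in Python for brevity (ints standing in for Java longs; the field layout is made up):

          # Pack an op into one primitive word: opcode in the low 8 bits,
          # operand in the bits above it. Decoding is just masks and shifts.
          def encode(opcode, operand):
              return (operand << 8) | opcode

          def opcode_of(word):
              return word & 0xFF

          def operand_of(word):
              return word >> 8

          code = [encode(1, 42), encode(2, 7)]   # a tiny "bytecode" array
          assert opcode_of(code[0]) == 1 and operand_of(code[0]) == 42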

  • aardvark179 5 days ago

    I would say that AST interpreters are surprisingly fast on the JVM, and byte code interpreters are surprisingly slow. Maybe one day explicit tail calls combined with invoke dynamic will offer us a path to byte code interpreters that the JIT can optimise based on the predictable nature of byte code sequences, but we’re not there yet.

    Edit: I should point out that byte code interpreters still have a place because they tend to represent code in a much smaller form. Again this is something that will change over time as things like project Lilliput shrink object headers, but it’s unlikely to go away entirely.

    • larsga 5 days ago

      What really gave clear performance improvements was generating Java bytecode. In the first attempt I got a 25% speedup, and I'm sure I could have gotten more out of it if I'd spent more time on it.

      • aardvark179 5 days ago

        Oh yeah, absolutely. The trade-off with byte code is that it can feel like threading your language through the class file format’s very particular needle; you have to pay a larger initial cost generating and loading that class, and you have ongoing costs in terms of the metaspace those classes consume.

Thorrez 5 days ago

Is this pushing a stack element for every loop iteration before starting the loop execution? If there are a ton of iterations, won't the stack overflow?

o11c 5 days ago

I'm not sure how useful this is for interpreters in general, given the frankly bizarre design decisions. My normal rant about bad interpreter design assumes things not seen here.

  • kragen 5 days ago

    are you talking about the design decisions that define smalltalk or something else

    • o11c 5 days ago

      I'm not talking about the inner language semantics (which mostly don't matter, other than what optimizations they allow - particularly static typing and SROA), but the implementation of the interpreters themselves.

      • kragen 5 days ago

        what do you think is unusual about the interpreter implementation? from what i read of the post (admittedly not all of it) it sounds like a fairly by-the-book smalltalk-80 blue book thingamabob?

  • moonchild 5 days ago

    i saw this last week and was incredibly confused. aside from being naive it's totally overfit where general approaches are very well known??

diffxx 5 days ago

Premature optimization is the root of all evil. Instead of optimizing the performance of the interpreter, they should optimize the language semantics to be fast by default.

  • pipeline_peak 3 days ago

    An engineer accepts the given conditions and finds a way to work within those boundaries. They don’t pine over things outside their control.