Interface Dispatch

TLDR:
Always measure before you optimize. I introduced a super shitty bug to https://github.com/rekki/go-query because I thought the bottleneck was in the binary search implementation, and gained nothing; calling interface methods in hot for loops adds ~10-15 instructions per call.
Hi,
I want to illustrate why intuition in the field of performance is
often misleading, and just like propaganda, the more you think you are immune to the illusion, the more effect it has on you.
Let me tell you about the time I introduced a terrible, unreproducible, contextual bug in a library we use everywhere.
Cost of simple things
First, we will discuss the cost of simple things, like counting from 1
to a million.
package main

import "testing"

var sum = 0

func BenchmarkNoop(b *testing.B) {
        for i := 0; i < b.N; i++ {
                sum++
        }
}

// go test -bench=.
// BenchmarkNoop-8         873429528                1.36 ns/op
the for loop needs to do an increment, a comparison and a jump on every iteration
for i := 0; i < 1000000; i++ {
}
is something like
0: inc i
1: cmp i, 1000000
2: jge 10
3: ... code
4: jmp 0
10: .. rest of program
you can also think about it implemented with goto:
i = 0
check:
if i >= 1000000
   goto done

for loop code
...
i++
goto check

done:
rest of program
(it is also very interesting to investigate the modern branch
predictors)
In order to mitigate the loop overhead, it is sometimes possible to
unroll the loop, which turns it into something like:
i++
code
i++
code
i++
code
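In Go, a manual 4x unroll of a sum over a slice would look roughly like this (a minimal sketch of mine, assuming len(xs) is a multiple of 4; whether it actually helps is something you have to measure):

func sumUnrolled(xs []int) int {
        s := 0
        // process 4 elements per iteration, paying the loop overhead once per 4 adds
        for i := 0; i < len(xs); i += 4 {
                s += xs[i]
                s += xs[i+1]
                s += xs[i+2]
                s += xs[i+3]
        }
        return s
}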
Of course, not all instructions are created equal. Today it is far from
intuitive what exactly is happening (especially now that people don't
own the computers they run their code on), but for me a good intuition
is:
  • bitwise ops, addition and local jumps are fast
  • division by 2 is just a shift (x >> 1), so it is fast; general division is basically a search https://youtu.be/o4-CwDo2zpg?t=354
  • same with multiplication: by 2 it is just a shift (x << 1), by anything else it is slower
  • almost all other math is slow
  • locality is king
  • calling a method in a hot loop adds ~10 instructions per call: for i := 0; i < n; i++ { f() }
  • atomic operations are ~25 instructions (see the benchmark sketch below)
  • My CPU is ~3GHz, which means it can do roughly 3 * 10^9 simple things
    per second (https://en.wikipedia.org/wiki/Arithmetic_logic_unit); the
    for loop in the benchmark does about 4 things per iteration, so 1.36
    ns per iteration sounds about right.
Now that we are so far abstracted from the machine, it is incredibly
difficult to have good intuition about how much things cost, so I
usually multiply my guess by 5 or so (to be cloud native, haha).
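Since the list above mentions atomics, here is a benchmark sketch for that case (my own illustration, living in the same _test.go file as BenchmarkNoop and using the standard sync/atomic package):

var asum int64

func BenchmarkAtomic(b *testing.B) {
        for i := 0; i < b.N; i++ {
                // one atomic increment per iteration instead of a plain sum++
                atomic.AddInt64(&asum, 1)
        }
}

// go test -bench=Atomic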
    Anyway, let's move to the cost of calling a function.
    Function calling
Calling a function is quite simple. We need to prepare its parameters
and then call it. This involves putting their values on the stack
(at offsets from SP) and then invoking CALL. The function itself uses
those parameters, writes its results back on the stack, and invokes RET.
package main

// tell the go compiler not to inline the function by using the
// //go:noinline pragma comment
//go:noinline
func add(a, b int) (int, int) {
        return a + b, 1000
}
    // go build main.go && go tool objdump -s main.add main
    MOVQ $0x0, 0x18(SP)
    MOVQ $0x0, 0x20(SP)
    MOVQ 0x8(SP), AX
    ADDQ 0x10(SP), AX
    MOVQ AX, 0x18(SP)
    MOVQ $0x3e8, 0x20(SP)
    RET
    
    func main() {
            a, b := add(100, 200)
            add(a, b)
    }
    
    // go build main.go && go tool objdump -s main.main main
    MOVQ $0x64, 0(SP)
    MOVQ $0xc8, 0x8(SP)
    CALL main.add(SB)
    MOVQ 0x10(SP), AX
    MOVQ AX, 0x38(SP)
    MOVQ 0x18(SP), AX
    MOVQ AX, 0x30(SP)
    MOVQ 0x38(SP), AX
    MOVQ AX, 0x28(SP)
    MOVQ 0x30(SP), AX
    MOVQ AX, 0x20(SP)
    MOVQ 0x28(SP), AX
    MOVQ AX, 0(SP)
    MOVQ 0x20(SP), AX
    MOVQ AX, 0x8(SP)
    CALL main.add(SB)
So calling a function with two parameters that returns two values does
at least 6 things (prepare the arguments on the call stack, call,
write the return values back on the stack, return).
    Inlining
Because of the calling overhead, all compiled languages try to inline functions so they don't have to do the CALL/RET and stack preparation work.
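You can ask the Go compiler what it decided to inline (a standard toolchain flag, shown here on the example file from above):

go build -gcflags=-m main.go

The output contains lines like "can inline" and "inlining call to" for eligible functions; the //go:noinline pragma above is what keeps add out of that list.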
First let's discuss how the CPU executes code. In the Von Neumann
architecture (which is pretty much all modern computers), code and data
live in the same memory, so whether something is code or data is in the
eye of the beholder, pretty much the same way it is up to you whether
something is the character 'a' or the integer 97, when 'a' is encoded
as 01100001.
    Let's create super simple 8 bit CPU with two 1 byte general purpose
    registers, R0 and R1:
    1 MOV addr, r0   copy addr in r0
    2 MOV addr, r1   copy addr in r1
    3 MOV r0, addr   copy r0 in addr
    4 MOV r1, addr   copy r1 in addr
    5 MOV r0, $value store $value in r0
    6 MOV r1, $value store $value in r1
    7 ADD  adds r0 and r1 and stores value in r0
    8 JMP  addr jump to given address
    9 HAL  stop
    (from Richard Buckland's 4 bit CPU)
    https://www.youtube.com/watch?v=gTeDX4yAdyU
in our example, we will set the CPU to execute from memory address 0,
and an example program that counts forever would look like this (each
instruction is labeled with the byte address it lives at):
0: MOV r0, $0
2: MOV r1, $1
4: ADD
5: MOV r0, 10
7: JMP 2
    now let's look at the memory layout byte by byte
    addr:0  00000101 // mov r0, $0
    addr:1  00000000 
    addr:2  00000110 // mov r1, $1
    addr:3  00000001 
    addr:4  00000111 // add
    addr:5  00000011 // mov r0, 10
    addr:6  00001010
addr:7  00001000 // jmp 2
addr:8  00000010
    addr:9  00000000 // nothing
    addr:10 00000000 // result of mov r0, 10
As you can see, there is no difference between address 0 and address
10: both are just bytes, and whether a byte is code or data depends only
on whether the CPU ever executes it. The only thing stopping the CPU
from executing addresses 8, 9 and 10 as code is the JMP at address 7,
which sends it back to address 2.
The way the CPU actually executes this is the fetch-decode-execute
cycle (https://en.wikipedia.org/wiki/Instruction_cycle):
    Fetch Stage: The next instruction is fetched from the memory address
    that is currently stored in the program counter and stored into the
    instruction register. At the end of the fetch operation, the PC points
    to the next instruction that will be read at the next cycle.
    
    Decode Stage: During this stage, the encoded instruction presented in
    the instruction register is interpreted by the decoder.
    
    Execute Stage: The control unit of the CPU passes the decoded
    information as a sequence of control signals to the relevant function
    units of the CPU to perform the actions required by the instruction,
    such as reading values from registers, passing them to the ALU to
    perform mathematical or logic functions on them, and writing the
    result back to a register. If the ALU is involved, it sends a
    condition signal back to the CU. The result generated by the operation
    is stored in the main memory or sent to an output device. Based on the
    feedback from the ALU, the PC may be updated to a different address
    from which the next instruction will be fetched.
    
    Repeat Cycle
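To make the cycle concrete, here is a minimal Go sketch of an interpreter for the toy CPU above (my own illustration; the opcode values follow the table, and it runs a fixed number of cycles instead of forever):

package main

import "fmt"

func main() {
        // the memory layout from above, with the JMP operand pointing back to address 2
        mem := [16]byte{
                5, 0, // addr 0: MOV r0, $0
                6, 1, // addr 2: MOV r1, $1
                7, // addr 4: ADD
                3, 10, // addr 5: MOV r0, 10
                8, 2, // addr 7: JMP 2
        }
        var r0, r1 byte
        pc := 0
        for steps := 0; steps < 20; steps++ {
                op := mem[pc] // fetch
                switch op {   // decode + execute
                case 1: // MOV addr, r0
                        r0 = mem[mem[pc+1]]
                        pc += 2
                case 2: // MOV addr, r1
                        r1 = mem[mem[pc+1]]
                        pc += 2
                case 3: // MOV r0, addr
                        mem[mem[pc+1]] = r0
                        pc += 2
                case 4: // MOV r1, addr
                        mem[mem[pc+1]] = r1
                        pc += 2
                case 5: // MOV r0, $value
                        r0 = mem[pc+1]
                        pc += 2
                case 6: // MOV r1, $value
                        r1 = mem[pc+1]
                        pc += 2
                case 7: // ADD
                        r0 += r1
                        pc++
                case 8: // JMP addr
                        pc = int(mem[pc+1])
                case 9: // HAL (stop)
                        fmt.Println("r0:", r0, "counter at addr 10:", mem[10])
                        return
                }
        }
        fmt.Println("r0:", r0, "counter at addr 10:", mem[10])
}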
Accessing RAM takes on the order of 100ns, so if we had to pay 100ns
for every instruction fetch the whole thing would be horribly slow; to
mitigate this, CPUs have a hierarchy of caches (tlb[21], l1, l2, l3).
    L1 cache reference                           0.5 ns
    Executing Instruction                        1   ns
    Branch mispredict                            5   ns
    L2 cache reference                           7   ns 14x L1 cache
    Mutex lock/unlock                           25   ns
    Main memory reference                      100   ns 20x  L2 cache,
                                                        200x L1 cache
The L1 cache is usually split into an instruction cache and a data
cache, and it is on the order of ~32KB per core (how this cache
hierarchy works in modern multi-core CPUs is a topic on its own, but
let's think as if we have one core only).
There are other mitigations for how slow the fetch-execute cycle is,
such as instruction pipelining, vectorization, out-of-order execution,
etc. (you can check them out at the bottom of the instruction cycle
wiki page https://en.wikipedia.org/wiki/Instruction_cycle)
Modern CPUs are incredibly complicated systems, and the model I use is
that in most cases (unless proven otherwise) "it does one thing at a
time" works OK for back-of-the-napkin calculations. (e.g., some modern
hashing algorithms that abuse vectorization can hash incredible amounts
of data)
    So, inlining is a dance.
    I will refer to:
    http://www.cs.technion.ac.il/users/yechiel/c++-faq/inline-and-perf.html
    [9.3] Do inline functions improve performance?
    Yes and no. Sometimes. Maybe.
    There are no simple answers. Inline functions might make the code
    faster, they might make it slower. They might make the executable
    larger, they might make it smaller. They might cause thrashing,
    they might prevent thrashing. And they might be, and often are,
    totally irrelevant to speed.
    
    inline functions might make it faster:
      As shown above, procedural integration might remove a bunch of
      unnecessary instructions, which might make things run faster.
    
    inline functions might make it slower: 
      Too much inlining might cause code bloat, which might cause
      "thrashing" on demand-paged virtual-memory systems. In other
      words, if the executable size is too big, the system might spend
      most of its time going out to disk to fetch the next chunk of
      code.
    
    inline functions might make it larger:
      This is the notion of code bloat, as described above. For
      example, if a system has 100 inline functions each of which
      expands to 100 bytes of executable code and is called in 100
      places, that's an increase of 1MB. Is that 1MB going to cause
      problems? Who knows, but it is possible that that last 1MB could
      cause the system to "thrash," and that could slow things down.
    
    inline functions might make it smaller:
      The compiler often generates more code to push/pop
      registers/parameters than it would by inline-expanding the
      function's body. This happens with very small functions, and it
      also happens with large functions when the optimizer is able to
      remove a lot of redundant code through procedural integration —
      that is, when the optimizer is able to make the large function
      small.
    
    inline functions might cause thrashing:
      Inlining might increase the size of the binary executable, and
      that might cause thrashing.
    
    inline functions might prevent thrashing
      The working set size (number of pages that need to be in memory
      at once) might go down even if the executable size goes up. When
      f() calls g(), the code is often on two distinct pages; when the
      compiler procedurally integrates the code of g() into f(), the
      code is often on the same page.
    
    inline functions might increase the number of cache misses: 
      Inlining might cause an inner loop to span across multiple lines
      of the memory cache, and that might cause thrashing of the
      memory-cache.
    
    inline functions might decrease the number of cache misses:
      Inlining usually improves locality of reference within the
      binary code, which might decrease the number of cache lines
      needed to store the code of an inner loop. This ultimately could
      cause a CPU-bound application to run faster.
    
    inline functions might be irrelevant to speed: 
      Most systems are not CPU-bound. Most systems are I/O-bound,
      database-bound or network-bound, meaning the bottleneck in the
      system's overall performance is the file system, the database or
      the network. Unless your "CPU meter" is pegged at 100%, inline
      functions probably won't make your system faster. (Even in
      CPU-bound systems, inline will help only when used within the
      bottleneck itself, and the bottleneck is typically in only a
      small percentage of the code.)
    
    There are no simple answers: You have to play with it to see what
    is best. Do not settle for simplistic answers like, "Never use
    inline functions" or "Always use inline functions" or "Use inline
    functions if and only if the function is less than N lines of
    code." These one-size-fits-all rules may be easy to write down,
    but they will produce sub-optimal results.
    Interfaces
    Lets say we have a common case where a struct gets an io.Writer
    interface to write into:
    type Database struct {
            writer io.Writer
    }
func (d *Database) SetWriter(f io.Writer) error {
        d.writer = f
        return nil
}
    ....
    func (d *Database) writeBlob(b []byte) error {
            checksum := hash(b)
            _, err := d.writer.Write(checksum)
            if err != nil {
                    return err
            }
        _, err = d.writer.Write(b)
            return err
    }
    ....
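Before going further, it helps to know what d.writer actually is in memory. Here is a simplified sketch (mine, not code from the post) of how the Go runtime represents a non-empty interface value, roughly as in runtime/runtime2.go:

// an interface value is two machine words
type iface struct {
        tab  *itab          // which concrete type this is, plus a table of its method pointers
        data unsafe.Pointer // pointer to the concrete value
}

type itab struct {
        // interface type, concrete type, and the method pointers live here
}

Calling d.writer.Write means loading a function pointer out of tab and doing an indirect CALL through a register; you can see exactly that shape (MOVQ 0x18(AX), AX; CALL AX) in the objdump of Add.Apply below.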
Now, because d.writer can change at runtime, the compiler can never
know what is on the other end of d.writer, so it can never inline the
call even if it wants to (you can imagine the actual os.File.Write is
just doing the write syscall).
Another issue is that the thing on the other end of d.writer could be a
nil pointer, so it has to be checked, and if it is nil (0) we must
panic accordingly.
    The way this works is pretty much:
    if x == nil
       goto panic
    ... code
    
    return x
    
    panic:
    help build a stacktrace
    (pseudo code)
    START:
      CMPQ 0x10(CX), SP
      JBE CALL_PANIC
    
      MOVQ 0x40(SP), AX
      TESTQ AX, AX
      JLE PANIC
      ... work work ...
    TRACE:
      MOVQ CX, 0x50(SP)
      MOVQ 0x28(SP), BP
      ADDQ $0x30, SP
      RET
    PANIC:
      XORL CX, CX // cx = 0
      JMP TRACE
    CALL_PANIC:
      CALL runtime.morestack_noctxt(SB)
      JMP START
That means every call to writeBlob has to check whether d.writer is
nil, because it can be and there is no way for the compiler to know at
compile time, and only then prepare the stack and make the call.
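For example (my own tiny illustration, not code from the post), calling through the interface when nothing was ever set panics at runtime:

var d Database                   // d.writer holds the nil interface value
_ = d.writeBlob([]byte("hello")) // panics: invalid memory address or nil pointer dereference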
    How Slow is Dynamic Dispatch exactly?
    More concrete example:
    package main
    
    type Operation interface {
            Apply() int
    }
    
    type Number struct {
            n int
    }
    
    func (x Number) Apply() int {
            return x.n
    }
    
    type Add struct {
            Operations []Operation
    }
    
    func (x Add) Apply() int {
            r := 0
            for _, v := range x.Operations {
                    r += v.Apply()
            }
            return r
    }
    
    type Sub struct {
            Operations []Operation
    }
    
    func (x Sub) Apply() int {
            r := 0
            for _, v := range x.Operations {
                    r -= v.Apply()
            }
            return r
    }
    
    type AddCustom struct {
            Operations []Number
    }
    
    func (x AddCustom) Apply() int {
            r := 0
            for _, v := range x.Operations {
                    r += v.Apply()
            }
            return r
    }
    
    func main() {
            n := 0
            op := Add{Operations: []Operation{Number{n: 5}, Number{n: 6}}}
            n += op.Apply()
    
            opc := AddCustom{Operations: []Number{Number{n: 5}, Number{n: 6}}}
            n += opc.Apply()
    }
Let's look at main.Add.Apply and main.AddCustom.Apply
    // go build main.go && go tool objdump main
    TEXT main.Add.Apply(SB) main.go
    MOVQ FS:0xfffffff8, CX
    CMPQ 0x10(CX), SP
    JBE 0x4526c7
    SUBQ $0x30, SP
    MOVQ BP, 0x28(SP)
    LEAQ 0x28(SP), BP
    MOVQ 0x40(SP), AX
    TESTQ AX, AX
    JLE 0x4526c3
    MOVQ 0x38(SP), CX
    XORL DX, DX
    XORL BX, BX
    JMP 0x452678
    MOVQ 0x20(SP), SI
    ADDQ $0x10, SI
    MOVQ AX, DX
    MOVQ CX, BX
    MOVQ SI, CX
    MOVQ CX, 0x20(SP)
    MOVQ DX, 0x18(SP)
    MOVQ BX, 0x10(SP)
    MOVQ 0(CX), AX
    MOVQ 0x8(CX), SI
    MOVQ 0x18(AX), AX
    MOVQ SI, 0(SP)
    CALL AX
    MOVQ 0x18(SP), AX
    INCQ AX
    MOVQ 0x10(SP), CX
    ADDQ 0x8(SP), CX
    MOVQ 0x40(SP), DX
    CMPQ DX, AX
    JL 0x452666
    MOVQ CX, 0x50(SP)
    MOVQ 0x28(SP), BP
    ADDQ $0x30, SP
    RET
    XORL CX, CX
    JMP 0x4526b4
    CALL runtime.morestack_noctxt(SB)
    JMP main.Add.Apply(SB)
    
    TEXT main.AddCustom.Apply(SB) main.go
    MOVQ 0x8(SP), AX
    MOVQ 0x10(SP), CX
    XORL DX, DX
    XORL BX, BX
    JMP 0x4526fa
    MOVQ 0(AX)(DX*8), SI
    INCQ DX
    ADDQ SI, BX
    CMPQ CX, DX
    JL 0x4526f0
    MOVQ BX, 0x20(SP)
    RET
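The numbers below come from benchmarks along these lines (a minimal sketch of mine; the original benchmark code is not shown in the post, so the names and setup are assumptions), placed in a _test.go file next to the types above:

var sink int

func BenchmarkInterface(b *testing.B) {
        op := Add{Operations: []Operation{Number{n: 5}, Number{n: 6}}}
        for i := 0; i < b.N; i++ {
                sink += op.Apply() // dynamic dispatch through the Operation interface
        }
}

func BenchmarkInline(b *testing.B) {
        op := AddCustom{Operations: []Number{{n: 5}, {n: 6}}}
        for i := 0; i < b.N; i++ {
                sink += op.Apply() // direct calls on a concrete type, eligible for inlining
        }
}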
    Benchmark:
    goos: linux
    goarch: amd64
    BenchmarkInterface-8    171836106                6.81 ns/op
    BenchmarkInline-8       424364508                2.70 ns/op
    BenchmarkNoop-8         898746903                1.36 ns/op
    PASS
    ok      command-line-arguments  4.673s



    Optimization
    You know the golden rule: always measure first.
When I wrote https://github.com/rekki/go-query I thought the slowest
part was the binary search. I thought: well, the loop can't be
unrolled, and the branchy algorithm is hard on the branch predictor,
so it is probably the slowest part; I will just optimize it. And I did.
Total waste of time. The overhead of the interface dispatch was 70% of
the time; not only did I waste my time, but I introduced a super shitty
bug that was not caught by 90% test coverage, and it did not end up in
production only by sheer luck.
    Anyway, this is a reminder to myself: always measure.
Oh BTW, of course, this whole thing almost never hits you; ~10 extra
instructions per call never matter (almost).
How to measure, though? What is executed in the benchmark is very
likely not the thing being executed in production; for example, I could
have a benchmark that is super fast and nice, but when it runs in prod
the thing gets inlined and starts page thrashing.
There is a whole spectrum of tools to use, from statistics-powered
micro-benchmark tools to flamegraph[19] visualizations, and the whole
topic is a post on its own, but I believe anything you do to measure
will be better than intuition.
Luckily Go's profiling tools are incredible[20] and you can get good
insights very quickly.
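For example (standard toolchain commands, my own illustration), you can grab a CPU profile straight from a benchmark and look at where the time actually goes:

go test -bench=. -cpuprofile cpu.out
go tool pprof -top cpu.out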
BTW, I did have a profile for go-query before I went into optimizing,
and I saw that the cost was in the Next() call, but I did not dig
deeper into it and decided to start with optimizing the binary search
instead of looking carefully at the data. As the poet says: "everywhere
you look, you see what you are looking for".
    --
    Fri 7 Feb 18:08:42 CET 2020,
    Borislav Nikolov (https://github.com/jackdoe)
    References:
    [1] https://github.com/rekki/go-query
    Low level full text search library.
    [2] https://youtu.be/o4-CwDo2zpg?t=354
    Fastware - Andrei Alexandrescu
    [3] https://www.youtube.com/watch?v=Qq_WaiwzOtI
    CppCon 2014: Andrei Alexandrescu "Optimization Tips - Mo' Hustle
    Mo' Problems"
    [4] https://www.youtube.com/watch?v=FJJTYQYB1JQ
CppCon 2019: Andrei Alexandrescu "Speed Is Found In The Minds of
People"
    [13] https://www.youtube.com/watch?v=gTeDX4yAdyU
    Lecture 3: Machine Code - Richard Buckland UNSW
    [14] http://norvig.com/21-days.html#answers
    Basic numbers for back of the napkin calculations
    [15] https://gist.github.com/jboner/2841832
    Basic numbers for back of the napkin calculations
    [17] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
    What Every Programmer Should Know About Memory
    [18] http://www.cs.technion.ac.il/users/yechiel/c++-faq/inline-and-perf.html
    [9.3] Do inline functions improve performance?
    [22] https://godbolt.org
    Compiler Explorer is an interactive online compiler which shows the
    assembly output of compiled C++, Rust, Go (and many more) code.
