In this, my fourth article on the Go programming language, I’m going to tackle the topic of concurrency. Goroutines are one of the language’s main selling points, so I’ll be interested to see how they work. We’ll also look at channels, which are used to communicate between goroutines.
This is the 4th of the 6 articles that currently make up the “All Go” series.
In this article I’m going to continue my look at the Go Language, this time covering the topic of concurrency. One of the main selling points of Go is how concurrency is built into the semantics of the language, so I’m looking forward to seeing how this works in practice, and particularly how it compares to more traditional support for multithreading and multiprocess semantics in other languages.
Let’s jump right in and take a look at goroutines first.
A goroutine is a green thread, or a userspace scheduled thread of execution. This is a bit of an oversimplification, since goroutines can be scheduled to run across multiple separate OS threads, but we’ll discuss the specifics of scheduling later in the article. For now, we’ll just treat each goroutine as its own thread of execution all running within the same address space, and understand that these threads of execution may run concurrently with each other on different processors, but we cannot rely on that.
To create a new goroutine, you simply prefix a function call with the keyword go
— the function is executed in a new goroutine which runs in parallel with the original.
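Something along these lines, as a minimal sketch (the names, tick counts and delays are purely illustrative):

package main

import (
    "fmt"
    "time"
)

// ticker prints a numbered tick every interval, count times in total.
func ticker(name string, count int, interval time.Duration) {
    for i := 0; i < count; i++ {
        time.Sleep(interval)
        fmt.Printf("[%s] tick %d\n", name, i)
    }
}

func main() {
    // Run one ticker concurrently with main...
    go ticker("background", 5, 300*time.Millisecond)
    // ...and call exactly the same function synchronously.
    ticker("foreground", 5, 500*time.Millisecond)
}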
This example shows there’s nothing special about the ticker()
function — it can be called synchronously, or it can be called as a goroutine. You’re allowed to use a function which returns a value, but any return value will be discarded if it’s invoked as a goroutine. If you want to get results out of a goroutine, you generally use channels, which we’ll look at in a moment.
As an aside, in real code there is the time.NewTicker()
function to create an endlessly running ticker, so you probably want to use that rather than implement your own. In these examples I’m trying to implement most functions directly to better illustrate what’s going on.
It’s also quite possible to use go
with anonymous functions.
go func(msg string) {
time.Sleep(1800 * time.Millisecond)
fmt.Println(msg)
}("BOO!")
A couple of other things to note about goroutines. Firstly, being scheduled by the runtime, and sharing a common address space, they’re lightweight — they don’t require any system calls to create or destroy, and their stack reservation starts at 2KB, vs a typical default of 1MB or more for OS threads, so they incur comparatively little memory overhead.
Secondly, the application as a whole will always terminate when the main goroutine exits, regardless of other goroutines that are running. If you want to wait for all your goroutines to finish, you have to put in explicit code to do this. One way to do this seems to be to use a WaitGroup
from the standard library sync
package, or alternatively if you have return channels to receive values in main()
you could just wait on the closure of these channels.
So that’s goroutines — conceptually pretty simple, but things start to get more interesting when we look at how they interact with each other after being created. Go’s concurrency is based on a fork-join model, so in looking at creating goroutines we’ve covered the fork aspect. In much of the rest of the article we’ll be covering the join side of things.
A channel in Go is a unidirectional or bidirectional queue of values with builtin concurrency protection to make it safe for use across different threads. Typically one or more channels will be created to allow goroutines to send inputs and outputs to each other during execution.
Channels can be unbuffered, which means that senders or receivers will block until there’s at least one of each, and then the sender’s item will be passed directly to the receiver, and both will be unblocked.
They can also be buffered, where a fixed-size ring buffer is allocated to store elements. While there is space in this buffer, values from senders are stored in it and the sender won’t block. Receivers will take values from the buffer as they are available. If the buffer is empty then receivers will block; if the buffer is full then senders will block.
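Creating the two kinds looks like this; the capacity argument is the only difference:

unbuffered := make(chan int)  // a send blocks until a receiver takes the value
buffered := make(chan int, 8) // sends only block once 8 values are queued up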
I built a simple example using a goroutine to check for cycles in a directed graph, the code for which is shown below. I’ll run through it piece by piece.
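The first piece is the package header and the edge type, sketched here with illustrative field names:

package main

import (
    "fmt"
    "sync"
    "time"
)

// edge represents a directed edge in the graph from src to dst.
type edge struct {
    src string
    dst string
}

// String makes edges easier to print in the trace output.
func (e edge) String() string {
    return fmt.Sprintf("[%s] -> [%s]", e.src, e.dst)
}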
First we declare an edge
type to represent an edge in the graph — the only method it has is the String()
method to make it easier to print. Next up we implement a function cycleChecker()
which is going to form our goroutine.
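The checker itself, again as a sketch which follows the description below (details such as the sleep duration are illustrative):

// cycleChecker receives edges from the channel and reports any which
// would create a cycle in the graph built from the edges seen so far.
func cycleChecker(edges <-chan edge) {
    // incomingRoutes maps a node to the set of nodes which can reach it.
    incomingRoutes := make(map[string]map[string]struct{})

    fmt.Println("Waiting...")
    time.Sleep(2 * time.Second) // just so we can see the goroutines interleave
    fmt.Println("Awake")

    for e := range edges {
        fmt.Printf("Received edge: %v\n", e)

        // If dst can already reach src, then adding src -> dst closes a cycle.
        if _, ok := incomingRoutes[e.src][e.dst]; ok {
            fmt.Printf("Cycle detected adding %v\n", e)
        }

        // Record src as a node which can reach dst.
        if _, ok := incomingRoutes[e.dst]; !ok {
            incomingRoutes[e.dst] = make(map[string]struct{})
        }
        incomingRoutes[e.dst][e.src] = struct{}{}

        // Everything which can already reach src can now also reach dst.
        for origin := range incomingRoutes[e.src] {
            incomingRoutes[e.dst][origin] = struct{}{}
        }
    }

    fmt.Println("Closing")
}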
This uses a map of sets to record the transitive closure of the directed graph, but with the directions reversed. So, if you look up a node N in the graph, the result will be a set of all the nodes which can reach N by some route of existing edges.
I say it’s a map of “sets”, but since Go doesn’t have a builtin type for a set I used a map[string]struct{}
. There might be a more elegant way to represent this, but it works for now and I didn’t want to spend a long time hunting around for something.
cycleChecker()
takes a <-chan edge
as its sole parameter — a receive-only channel which yields edge
values. It then first creates the incomingRoutes
map with make()
, and then it immediately calls time.Sleep()
— this delay is just so we can see how the execution of the goroutines interleaves. After sleeping, the function then enters a for
…range
loop on the input channel — this will loop around receiving values from the channel until the channel is closed, at which point the loop will exit.
Within the loop we log edges that we receive, to illustrate how things are working, then we do the following on each edge we receive:
- If the src of the new edge is already reachable from the dst, then the addition of this edge will create a new cycle, which we report.
- If dst is already in the map then we add src as a potentially new origin — this operation is idempotent so we don’t care whether src was already there to begin with. If, however, dst is not already in the map then we create a new map[string]struct{} containing src as the only member.
- Finally, we update the set for dst to include not just src but all the places that can reach src — this is because with the addition of the new edge, all those places can now also reach dst as well. Once again, we rely on idempotency of insertion to avoid having to do lots of checks.

The final lines of the function print a message, so we can confirm when it’s terminating. As we’ll see shortly, the Done() call on the WaitGroup which signals this termination to anyone waiting is made by a wrapper in main() rather than by cycleChecker() itself.
The remainder of the code is just a main()
function containing a simple test driver.
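And the test driver in main(), sketched along the same lines (which means the specific line numbers quoted in the discussion below are only approximate):

func main() {
    var wg sync.WaitGroup

    // Unbuffered: each send blocks until cycleChecker receives the edge.
    edges := make(chan edge)

    go func() {
        cycleChecker(edges)
        wg.Done() // signal termination once cycleChecker has returned
    }()
    wg.Add(1) // one goroutine to wait for (Done() can only follow close(edges) below)

    toSend := []edge{
        {"London", "Paris"},
        {"Paris", "Berlin"},
        {"Berlin", "London"},
        {"London", "Berlin"},
        {"London", "New York"},
        {"New York", "Berlin"},
    }

    for i, e := range toSend {
        fmt.Printf("---[%d]\n", i)
        edges <- e
    }
    fmt.Printf("---[%d]\n", len(toSend))

    close(edges)
    fmt.Printf("---[%d]\n", len(toSend)+1)

    wg.Wait()
    fmt.Printf("---[%d]\n", len(toSend)+2)
}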
We declare a sync.WaitGroup
— this essentially acts like a semaphore, starting at zero. We also declare a channel with make(chan edge)
— since we didn’t specify a maximum queue depth then it’s unbuffered, so sending to the channel if there’s an unread value will block. Also note that this is a bidirectional channel — we need to declare that because we want to send to the channel, but we need it receivable to pass into cycleChecker()
. A bidirectional channel can be implicitly converted to a unidirectional one, but not vice versa, hopefully for obvious reasons1.
Once main()
has created the WaitGroup
and the chan
, it starts a cycleChecker()
goroutine and then calls Add(1)
on the WaitGroup
to track that there’s one routine to wait for. You’ll also note it doesn’t call cycleChecker()
directly, but wraps it in an anonymous function which calls wg.Done()
once cycleChecker()
has returned. This is a convenient approach which keeps the WaitGroup
scoped within the calling function and doesn’t pollute the cycleChecker()
signature by passing in details which only the caller needs to care about.
From this point on there are some Println()
calls to produce trace output that we can use to see how the operation of the two goroutines interleave. If you run this code, you should get output as follows.
---[0]
Waiting...
Awake
---[1]
Received edge: [London] -> [Paris]
Received edge: [Paris] -> [Berlin]
---[2]
---[3]
Received edge: [Berlin] -> [London]
Cycle detected adding [Berlin] -> [London]
Received edge: [London] -> [Berlin]
Cycle detected adding [London] -> [Berlin]
---[4]
---[5]
Received edge: [London] -> [New York]
Received edge: [New York] -> [Berlin]
Cycle detected adding [New York] -> [Berlin]
---[6]
---[7]
Closing
---[8]
Here we can see that main()
runs first, right up to the point where we try to send the first edge into the channel, in line 61. At this point the send blocks — this is because we’ve chosen an unbuffered channel, and there isn’t yet a reader ready to accept the value because the cycleChecker()
is stuck in the Sleep()
call.
Next we can see cycleChecker()
wakes up and heads into the loop reading values from the channel — at this point main()
is unblocked and writes its edge into the channel. It is almost straight away blocked again on line 63, however, as it wants to write to the channel again but there’s a pending item which hasn’t been received yet.
Control shifts to cycleChecker()
which receives the value and processes it. As you can see from the output, it also receives the following edge from main()
before control shifts back for main()
to print ---[2]
.
This pattern continues, with the two routines running in parallel and blocking on sending and receiving items as appropriate. I didn’t check whether the two were running on one or two OS threads, but it wouldn’t change any of the blocking behaviour — it might affect the interleaving of output which wasn’t synchronised, however.
Finally, all the edges are sent and main()
calls close()
to close the channel. It then heads into line 75 where it blocks to wait until the WaitGroup
is decremented with a Done()
call.
The cycleChecker()
goroutine exits the for
…range
loop once the channel is closed and returns; the wrapper function then calls Done() on the WaitGroup and the goroutine terminates. This allows main()
to complete and the application terminates.
For comparison, let’s add some buffering to the channel by changing line 53 so that the channel is created with a capacity of four: make(chan edge, 4).
If we do that then the first four items can be stored in the channel without blocking, but then line 69 blocks because the goroutine isn’t ready to read from the channel yet. After this, the routines interleave as the scheduler dictates.
---[0]
---[1]
---[2]
---[3]
---[4]
Waiting...
Awake
---[5]
Received edge: [London] -> [Paris]
Received edge: [Paris] -> [Berlin]
Received edge: [Berlin] -> [London]
---[6]
---[7]
Cycle detected adding [Berlin] -> [London]
Received edge: [London] -> [Berlin]
Cycle detected adding [London] -> [Berlin]
Received edge: [London] -> [New York]
Received edge: [New York] -> [Berlin]
Cycle detected adding [New York] -> [Berlin]
Closing
---[8]
In the example above we already saw the use of myChan <- value
to send a single value
to myChan
. To receive a single value, without using the for
…range
as we did above, you put the <- operator in front of the channel instead, as in value := <-myChan.
Take a look at an example of this below — it contains a flaw, however, so see if you can spot it before reading on.
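Here’s a sketch of the sort of thing I mean:

package main

import "fmt"

func main() {
    c := make(chan int) // unbuffered: a send blocks until something receives

    for i := 1; i <= 3; i++ {
        c <- i // nothing is receiving from c, so this never completes
    }

    for {
        fmt.Println(<-c)
    }
}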
Have you spotted the issue? We’ve used an unbuffered channel, so main()
will block until something receives the value — but nothing else is reading from this channel, so we’re deadlocked. Fortunately for us, however, Go has our back — if you run this code, you’ll see an error like this.
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan send]:
main.main()
/Users/andy/src/simplechannels/simplechannels.go:7 +0x37
exit status 2
Great, so Go has deadlock detection builtin. To fix this problem, we’re going to have to add some buffering to the channel, changing line 6 so the channel is created with some capacity, for example make(chan int, 3).
Now we get the following output.
1
2
3
fatal error: all goroutines are asleep - deadlock!
…
Hmm, pretty useful this deadlock detection. OK, so now the issue is we’re stuck in an infinite loop acquiring values from a channel that will never give us any. To fix this, we can use a 2-value return form of the <-
operator where the second value indicates if any more values are coming. To indicate there are no more values, the sending code calls close()
, as we saw in the cycle detection algorithm above.
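A sketch of the fixed version:

package main

import "fmt"

func main() {
    c := make(chan int, 3)

    for i := 1; i <= 3; i++ {
        c <- i
    }
    close(c) // no more values will be sent

    for {
        value, ok := <-c
        if !ok {
            fmt.Println("Channel closed")
            break
        }
        fmt.Println(value)
    }
}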
As well as the 2-value form of return from <-
, this example illustrates that all remaining buffered values are flushed before the channel’s closure is propagated to the receiver.
This is all great, but leaves a bit of a hole — what if a goroutine wants to receive values from multiple channels?
Go has this covered by using the select
statement. This has a form that’s similar to switch
except that all the case
expressions must be reads from a channel. The example below illustrates this.
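Here’s a sketch of the example; the pinger() messages and delays are illustrative:

package main

import (
    "fmt"
    "time"
)

// pinger sends a single message on its channel after a delay.
func pinger(c chan<- string, name string, delay time.Duration) {
    time.Sleep(delay)
    c <- "ping from " + name
}

func main() {
    c1 := make(chan string)
    c2 := make(chan string)
    c3 := make(chan string)

    go pinger(c1, "one", 300*time.Millisecond)
    go pinger(c2, "two", 100*time.Millisecond)
    go pinger(c3, "three", 200*time.Millisecond)

    var msg string
    for i := 0; i < 3; i++ {
        // Wait for whichever channel delivers a value next.
        select {
        case msg = <-c1:
            fmt.Println("Message on c1")
        case msg = <-c2:
            fmt.Println("Message on c2")
        case msg = <-c3:
            fmt.Println("Message on c3")
        }
        fmt.Println(msg)
    }
}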
We’ll ignore the fact that we could have easily just passed a single channel to all of the pinger()
goroutines — this would work, but it wouldn’t illustrate select
!
As you can see, you can execute different code for different channels within the select
in a similar way to with switch
. In this case, however, we want to store all the results in the same variable msg
for use later.
Incidentally, if we had removed the var msg
declaration and used expressions like msg := <- c1
within the select
block then this would work, but msg
would not be visible outside the scope of the case
block which declared it. Because we wanted it to be visible, we had to declare it outside and use =
instead of :=
.
A useful example of where select
can be helpful is implementing timeouts on waiting for a channel. This can be done using time.After()
, which creates a channel, sets up a goroutine to send a Time
value on it after a delay and then returns the channel.
select {
case msg := <-myChan:
fmt.Printf("Got message: %s\n", msg)
case <-time.After(500 * time.Millisecond):
fmt.Println("Timeout")
}
Another use of the select
keyword is to make reads or sends non-blocking. This is done using a select
with a default
clause — this causes the operations to be non-blocking with the default
clause triggering if none of the others do.
So, a non-blocking read of two channels might look like this.
select {
case msg1 := <-chanOne:
fmt.Printf("Got %s from chanOne\n", msg1)
case msg2 := <-chanTwo:
fmt.Printf("Got %s from chanTwo\n", msg2)
default:
fmt.Println("No messages on either channel")
}
Non-blocking semantics can also be used for sending — this will trigger the default
clause if an item cannot be immediately sent to a channel. For buffered channels this occurs if the buffer is full; for unbuffered channels it occurs if there’s no active reader waiting.
select {
case myChan <- "hello":
fmt.Println("Sent message")
default:
fmt.Println("Failed to say hello")
}
I’m looking at how channels are implemented here, which is more detail than most developers using Go will likely need. If you’re not interested, go ahead and skip it.
To get a better understanding of the semantics of channels I wanted to get an understanding of how they’re currently implemented. Of course, this is only a snapshot right now and is subject to change, but I always like to know what’s going on under the hood when I use core language features, as I find it makes the runtime behaviour easier to anticipate.
Let’s start off by looking at the definition of a chan
structure and then we can drill in to how it’s used.
type hchan struct {
qcount uint // total data in the queue
dataqsiz uint // size of the circular queue
buf unsafe.Pointer // points to an array of dataqsiz elements
elemsize uint16
closed uint32
elemtype *_type // element type
sendx uint // send index
recvx uint // receive index
recvq waitq // list of recv waiters
sendq waitq // list of send waiters
lock mutex
}
type waitq struct {
first *sudog
last *sudog
}
When you call make()
a chunk of memory is allocated to hold the hchan
structure. This same allocation may also be used to hold any buffered elements — this single allocation makes management easier, and is probably also helpful for performance due to locality of reference. If any buffer was declared then the member buf
is updated to point to the part of the allocation just beyond the hchan
structure2.
However, this “single allocation” strategy is only used if the elements to be queued do not themselves contain any pointers, as this simplifies garbage collection. If the elements do contain pointers, a separate allocation is used as you’d expect. If the channel is unbuffered, no additional memory is allocated.
After the memory is allocated and buf
is initialised, one way or another, then elemsize
is set to the memory footprint in bytes of each element in the queue, elemtype
stores the type of those values and dataqsiz
is set to the number of elements to buffer. The lock
mutex
is also initialised.
The remaining values (qcount
, closed
, sendx
, recvx
, sendq
and recvq
) are left at 0
or nil
, as appropriate.
As items are added to the queue, sendx
tracks the index of the next free slot and recvx
tracks the next item to read. Since it’s a ring buffer, these wrap around to the beginning whenever they reach the end.
The implementations of chanrecv()
and chansend()
have some optimisations to check for early returns without taking the mutex in the case of the channel being empty or closed, and they also have some complicated logic to lock out modifications to the stack size in certain critical sections. However, I’m going to ignore those things and just focus on the basic semantics.
There are basically three cases for the receiver:

- If there is a waiting sender then, for an unbuffered channel, the value is copied directly from that sender; for a buffered channel (whose buffer must be full for a sender to be blocked) the receiver takes the oldest item from the buffer and the sender’s value is stored in the freed slot, updating recvx and sendx. Because the channel uses a ring buffer, this essentially converts the oldest item in the queue to the newest item. In either of these cases, the sender is woken up now that its item has been successfully sent.
- If there is no waiting sender but the buffer contains items, the receiver simply takes the item at recvx from the buffer.
- If there is nothing to receive, add the receiver to recvq and block the goroutine if the request is blocking, or just return that the queue is empty if it’s non-blocking.

The gory details are illustrated in the activity diagram below.
The implementation of chansend()
is somewhat symmetrical but actually rather simpler. This is because if there is a waiting receiver, it knows the queue is empty and should send its element directly to the receiver — the receive case with a blocked sender, however, differs between a buffered and an unbuffered channel, as you can see on the lefthand side of the diagram above.
The other simplicity is that receiving from a closed channel must drain any buffer, whereas sending on a closed channel simply triggers a panic.
So to take the corresponding three cases for the sender:

- If there is a waiting receiver, the queue must be empty, so the value is copied directly to that receiver and it is woken up.
- If there is no waiting receiver but there is space in the buffer, the value is stored in the buffer at sendx.
- If the buffer is full, either return that if the call is non-blocking, or add the sender to sendq and block the sender if the call is blocking.

You can see the behaviour in the activity diagram below.
A quick caveat: this section represents my understanding based on reading a number of sources, but since the scheduler in Go is constantly evolving I can’t rule out that some of these sources might have had stale information. I haven’t verified this myself by poring through the Go runtime source code.
The language specification doesn’t talk about scheduling, presumably because it’s not part of the language specification but rather a property of the runtime. It also quite possibly may differ across architectures, and it’s definitely true that it changes over time — for example, Go 1.14 added preemption to the scheduling. However, I’m not going to worry about the history of what changed in which release, but do my best to describe the scheduler as it exists at time of writing.
However, I think it’s still interesting to look at how things work, at least at a high level. As a developer using the language it’s always important to have some idea of how the code you’re writing will be executed, rather than just relying purely on language semantics — otherwise your ability to optimise for space or time is going to be limited, and diagnosing issues such as deadlocks, livelocks or race conditions is going to be harder.
I’m going to assume you’re comfortable with the concept of processes and threads, and that you’ve read the rest of this article to understand goroutines and channels, and when goroutines will block themselves.
The Go Scheduler is responsible for executing a collection of goroutines across a smaller collection of OS threads. Since the point is to avoid unnecessary thread switching, the number of OS threads in use is generally the number of cores on the system3.
You can query and change the number of threads at runtime with runtime.GOMAXPROCS()
, and you can override the number when starting the application by setting the GOMAXPROCS
environment variable. In general it’s probably not worth messing with it unless you have a good reason, however.
When goroutines are first started they go on to a global run queue, ready to be executed. Each CPU core (up to GOMAXPROCS
) has an OS thread with affinity to it, and each such thread has its own local run queue of goroutines to be scheduled on that thread. This is illustrated below — ignore the network poller for now; we’ll discuss that below.
The reason for this structure is that if all goroutines were stored on the single global run queue, there would be a great deal of contention for the mutex protecting it as each OS thread frequently swaps which goroutine it’s running — this would significantly harm parallelisation and hence performance.
It’s important to note that these queues only hold goroutines that are ready to execute — if they become blocked they’re moved elsewhere until they’re unblocked, at which point they return to a run queue.
Each thread runs a goroutine until it blocks or until it’s preempted, which happens after a 10ms timeslice. On Unix this preemption happens using signal SIGURG
4 directed at a thread — I’m not sure what mechanism is used on other platforms. However, the timeslice is only approximate since there are critical sections of the code which cannot be interrupted, so the actual task switch will happen at the end of the critical section.
Most of the time when a task switch happens, the next goroutine is picked from the thread’s local run queue. Since this is generally only done from the single OS thread using that queue, this keeps contention for the queue minimal, although we’ll revisit this in the section on work stealing later.
On its own this would mean new goroutines would never be run — there needs to be a way for them to migrate from the global to the local run queue. Periodically5, therefore, each thread will pick one of the goroutines from the global run queue instead of its local queue. This keeps the mutex overhead of the global queue minimal, while still allowing goroutines to migrate to the local queues promptly.
This scheduling is performed by a separate scheduler routine for each OS thread — that is responsible for selecting the next goroutine to run and jumping into it. If the local run queue for a processor ever becomes empty, it will immediately take a share of the goroutines from the global run queue, if any — we’ll look at this a bit more in a bit.
The local queues are bounded to a maximum of 256 goroutines, so routines beyond this are moved back to the global queue. However, a lot of the time a lot of goroutines are going to be blocked on either channels within the application or network or disk IO outside the application, so I’m not sure that this sort of CPU-intensive contention will be common in practice.
So we’ve seen how goroutines are scheduled on to OS threads, and how they can be preempted and moved around when blocked. However, what happens if we perform a system call where the OS thread itself blocks? There are a couple of relevant factors here — one is handoff, which we’ll discuss in this section, and one is the network poller, which we’ll discuss in the next.
The general approach for handling a blocking system call is to leave the OS thread blocked, and associated with whatever goroutine is making the blocked call, but to assign another thread to run the local run queue on the CPU core just vacated. This new thread can come from a pool of available threads, or potentially be created from scratch, although creating a thread is an overhead to be avoided where possible.
When the blocking call returns, the thread must re-associate itself with a processor — either it steals back a local run queue, and brings the previously blocked goroutine with it; or if that’s not possible, it places the goroutine into the global run queue and places itself in the thread pool to be used again as needed in future.
This sort of handoff is quite expensive, compared to normal goroutine scheduling — after all, this is why goroutines were invented in the first place. If the system call returns very quickly, this overhead will be a waste. As a result, only commonly expensive system calls result in an immediate handoff — for other system calls, the thread is allowed to block, and handoff will only be performed later if the system call takes a non-trivial amount of time.
This check is performed by a background thread called sysmon, which is also what performs periodic deadlock detection and other housekeeping activities.
Handoff is a useful approach for handling arbitrary system calls which can block for potentially significant periods. However, if we used this approach for every blocking IO call, we’d need potentially thousands of threads to cope with a busy server. This is totally contrary to Go’s goal of lightweight concurrency.
Thus for IO which is likely to block, such as reading and writing from sockets, the Go libraries use non-blocking calls to make sure that the OS thread doesn’t block and prevent it from being used for other goroutines. This means that the goroutine that’s waiting for the input needs to be woken up when the IO it’s waiting for is ready.
Operating systems have all sorts of different mechanisms for doing this — two that you might be familiar with are select()
and poll()
calls. Still, these aren’t ubiquitous across all platforms, and they’re also generally less efficient than OS-specific alternatives — epoll on Linux, kqueue on BSDs and MacOS, and IO Completion Ports on Windows. As a result, Go tailors the call that it uses to be the highest performance one available on each platform.
Still, all of these perform essentially the same job — given a set of file descriptors or sockets, they block until at least one of them is available for reading or writing, as specified by the parameters to the function. As a result, all of these file descriptors are waited on in one place in the network poller. When a network event is flagged, the runtime looks up the file descriptor to discover which goroutine has been unblocked by this event, and puts this routine into a run queue — this will be the local run queue if possible, or the global run queue if not.
We’ve talked a little about how the thread-specific schedulers generally pick goroutines from their own local run queues, but periodically will check the global run queue and the goroutines made available as a result of network IO.
There is actually a little more to things, because the Go runtime works quite hard to make sure that no processor’s thread is ever idle if there are goroutines waiting to be run. As a result, if a local run queue becomes empty then the scheduler for that thread engages in work stealing.
So, the scheduler will look for work in the following places:

1. The thread’s own local run queue.
2. The global run queue.
3. The network poller, to see whether any goroutines have been unblocked by IO.
4. The local run queues of other threads, from which work can be stolen.
The first two items we’ve already discussed. The third item involves calling into a core netpoll()
function — whose implementation is platform-dependent — which returns any goroutines made newly runnable by checking for IO. If the thread finds any such routines, it transfers them to its local run queue and starts running them.
As an aside, the sysmon thread also calls netpoll()
if it hasn’t been called for 10ms, and will move any newly runnable goroutines to the global run queue. This is probably going to introduce latencies into IO if you have a lot of CPU-intensive threads, because these goroutines will then need to migrate on to local run queues even after the 10ms delay. Hence if very low latency response to IO is important to you then this is something to bear in mind6.
The final item is the thread’s last resort if it can’t find runnable goroutines anywhere else — it picks a random thread and “steals” around half of its current local run queue, moving the goroutines to its own local run queue. You’ll remember earlier I mentioned that the contention on local run queues is minimal — well this is one example where there might be contention between threads. But I’m guessing this shouldn’t be a particularly frequent occurrence.
Although channels are a flexible primary mechanism for goroutines to synchronise, the Go standard library does offer a few other standard options. This section contains a brief summary of some of them.
I suspect that the community’s advice would be to just use channels where you can, however, as it seems to be regarded as more idiomatic for each goroutine to own state and communicate it, rather than state be shared between goroutines.
sync.Mutex
A simple mutual exclusion lock, where the zero state is initialised and unlocked so it can be used immediately. Typically you should always use this with defer
which we looked at in the previous article.
type MyContainer struct {
lock sync.Mutex
...
}
func (c *MyContainer) manipulate(...) {
c.lock.Lock()
defer c.lock.Unlock()
...
}
Mutexes must not be copied once used, so should always be passed around by pointer where necessary.
sync.RWMutex
A variant of the Mutex
which implements a readers-writer lock, which allows multiple concurrent read locks but only a single write lock. This can be useful in cases where a data structure is frequently read and only infrequently updated, but such updates must be atomic. One example that springs to mind is global application configuration, which is essentially static but might occasionally need to be reloaded if, say, an underlying configuration file is edited.
The Lock()
and Unlock()
methods acquire a write lock, and there are additional RLock()
and RUnlock()
methods to acquire a read lock. No ownership is implied, the lock simply maintains a count of the number of readers, so there’s no requirement that the locking goroutine is the one which does the unlocking — though if you’re not using defer
to do the unlocking, you’ll want to audit your code very carefully to make sure you don’t end up deadlocked.
type AppConfig struct {
lock sync.RWMutex
...
}
func (c *AppConfig) getConfigValues(...) ... {
c.lock.RLock()
defer c.lock.RUnlock()
...
}
func (c *AppConfig) reloadConfig() {
c.lock.Lock()
defer c.lock.Unlock()
...
}
This is all pretty standard stuff, and I’ll just include a couple of other details. Firstly, a blocked Lock() call also prevents any new RLock()
calls from proceeding, to prevent starvation of write locks in the presence of constant overlapping read locks. Secondly, there’s a method RLocker()
which returns a Locker
object which offers Lock()
and Unlock()
calls which map directly to the RLock()
and RUnlock()
calls of the original object — this allows it to be used in contexts where a standard Mutex
or similar object is required, such as with the sync.Cond
object described below.
sync.Cond
Implements a condition variable, which is a synchronisation point among multiple cooperating threads of execution which might want to wait for some event. Each condition variable is associated with a mutex or some other lock — the caller provides the lock object, and any object that provides Lock()
and Unlock()
methods will do. Holding this lock is a prerequisite for calling any of the methods of the condition variable object.
Once you hold the lock, you can use the following methods:
- Signal(): if any goroutines are blocked in a Wait() call, wake one of them up.
- Broadcast(): as Signal() except all waiting goroutines are woken up.
- Wait(): atomically release the lock and suspend the calling goroutine. When it is woken by Signal() or Broadcast(), the lock is automatically reacquired before returning from Wait() to the calling goroutine. As is generally the case with condition variables, however, there’s a window of opportunity where nobody holds the lock so once the lock is reacquired, whatever condition was being signalled should be rechecked.

The Wait()
and Broadcast()
methods are illustrated in the example below.
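Here’s a sketch of that example: a set of worker goroutines all waiting on a single “ready” flag.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var mu sync.Mutex
    cond := sync.NewCond(&mu)
    ready := false

    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            cond.L.Lock()
            // Re-check the condition after every wake-up, since the lock is
            // briefly released between Broadcast() and Wait() returning.
            for !ready {
                cond.Wait()
            }
            cond.L.Unlock()
            fmt.Printf("worker %d released\n", id)
        }(i)
    }

    // Flip the flag under the lock, then wake every waiting goroutine.
    cond.L.Lock()
    ready = true
    cond.L.Unlock()
    cond.Broadcast()

    wg.Wait()
}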
sync.WaitGroup
We saw this in some of the earlier code examples, and it’s essentially a slightly specialised semaphore. A new WaitGroup
has an internal counter which starts at zero. Every time you spawn one or more goroutines, you call Add()
and pass the number of new routines — this increments the counter by whatever number you pass. As an aside, the delta you pass can be zero or negative, but if the overall counter itself goes negative at any point the code immediately panics.
Once a goroutine is completed, you call Done()
which always decrements the counter by one. Finally, you can call Wait()
which will block until the counter returns to zero. You could pass WaitGroup
into a goroutine (always by pointer) but this is a bit of a leaky abstraction, so I’ve been using anonymous wrapper functions to keep the scope limited to the caller.
Essentially the model is that you count how many goroutines you spawn with Add()
, and then you arrange that Done()
is called as each terminates — either with a defer
in the goroutine itself, or in a wrapper function. Your main()
function, or anything else, can then call Wait()
to make sure it doesn’t terminate until all important goroutines are finished.
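Putting that together, the basic pattern looks something like this:

var wg sync.WaitGroup
for i := 0; i < 5; i++ {
    wg.Add(1) // one more goroutine to wait for
    go func(n int) {
        defer wg.Done() // always signal completion, even on an early return
        fmt.Println("worker", n, "finished")
    }(i)
}
wg.Wait() // block until every Done() has been called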
sync.Once
Construct an instance of this and then call its Do()
method, passing a function as a parameter. The Once
object guarantees this function will only be called once for that particular instance of Once
, regardless of the concurrency of the calls to Do()
. Additionally, none of the calls to Do()
will return until the single call to the specified function has returned.
This is typically used to enforce once-only initialisation within code which can be called concurrently from multiple contexts.
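A quick sketch of the pattern (the configuration map is just an illustrative stand-in for whatever needs initialising):

var configOnce sync.Once
var config map[string]string

// getConfig can be called concurrently from anywhere; the initialisation
// inside Do() is guaranteed to run exactly once.
func getConfig() map[string]string {
    configOnce.Do(func() {
        config = map[string]string{"mode": "production"}
    })
    return config
}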
sync.Pool
A generic untyped object pool which is safe for concurrent access. This provides a Put()
method for adding objects to the pool and Get()
to retrieve an arbitrarily selected object from the pool. These methods accept and return interface{} values, rather than the pool being generic, so care is needed to avoid unexpected behaviour at runtime.
When you declare a Pool
you can optionally specify a New()
function which will be used to construct a new value when requesting from an empty pool — otherwise, Get()
will just return nil
in this case.
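As a sketch of how it looks in use, pooling bytes.Buffer values is a common example:

var bufPool = sync.Pool{
    // New is called by Get() when the pool has nothing to hand out.
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func render(msg string) string {
    buf := bufPool.Get().(*bytes.Buffer) // Get() returns interface{}, so assert the type
    defer bufPool.Put(buf)
    buf.Reset() // pooled objects keep their previous contents, so reset before reuse
    buf.WriteString(msg)
    return buf.String()
}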
sync/atomic
This package offers functions for safely manipulating the various integer types that Go offers. For int64
, for example, the library offers the following functions which operate on pointers to an int64
value:
- AddInt64(): atomically increments the int64 by a specified delta.
- CompareAndSwapInt64(): atomically compares the current value against an expected one and, only if they match, replaces it with a new int64 value.
- LoadInt64(): atomically reads the current value.
- StoreInt64(): atomically writes a new value.
- SwapInt64(): atomically writes a new value and returns the previous one.
The library provides versions of these functions for int32
, int64
, uint32
, uint64
and uintptr
. It also provides generic versions of all but the AddX()
functions for use on any type.
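A small sketch of the style of usage:

var counter int64

var wg sync.WaitGroup
for i := 0; i < 10; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        atomic.AddInt64(&counter, 1) // safe concurrent increment
    }()
}
wg.Wait()
fmt.Println(atomic.LoadInt64(&counter)) // always prints 10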
Thus concludes my whirlwind tour7 of concurrency in Go. I was pleasantly surprised by the easy semantics of both goroutines and channels, although I did find the description of channels as “bidirectional” if no direction is given as a little misleading — this would imply to me a full-duplex channel with separate buffers for send and receive directions. However, there is only ever a single buffer, and the direction is only used to determine whether a given piece of code has permission to read, write or both. We don’t refer to a file descriptor as “bidirectional” when it’s open for both reading and writing, for example. I appreciate this is a minor gripe, mind you.
I was also quite pleased with the ease of reading through the source code which implements the channel structures — it was remarkably similar to how I would have written the code in C, and doesn’t suffer from the extreme context-sensitivity that reading, say, C++ code can do8. Admittedly the similarity to C was more acute in this example since the runtime code is going to be of a lower level than most Go code — it used “unsafe” mechanisms like pointer arithmetic, for example. But it gives me more confidence that I’ll be able to make sense of other Go code.
I’ve also found it useful to drill into not only how channels work under the hood, but also find some more details of the Go scheduler. This is useful for assessing expectations about latency to network IO and similar issues, and understand how CPU-intensive and IO-bound goroutines will interact with each other. This is useful to inform things like approaches to buffering within the application, so as to best optimise throughput or latency on network communications. My overall impression is that it seems like the sort of reasonable compromise between competing factors which you’d expect to see after iterating a few times in light of real world experience.
In particular the forced preemption is good to see — this wasn’t always the case in Go, and I’d imagine resource starvation could have been a much more significant risk before this was added, unless the developer specifically wrote code to avoid it. Of course, there are still probably reasons to write code which blocks fairly frequently — if you want to have any hope of a graceful shutdown, for example, you probably don’t want goroutines stuck in intensive tasks for more than a few seconds before they check if they should be shutting down. But forced preemption makes it a lot harder to add a goroutine which accidentally kills the performance of your application.
That wraps it up for this article — all the sections are Done()
and the Wait()
is over. As usual, I hope this has been interesting and/or useful. I’m not yet quite sure what I’ll look into in the next article — I probably should look into code structure and tooling before long, and there are a few other loose ends like error handling and automated testing that it would be useful to look into. Until then, however, have a great day!
This might seem a little odd at first if you’re used to using Unix pipes, where the read and write ends are two separate file descriptors that you can pass around as needed. In the case of Go the channel is a single abstraction, so the only way to do the equivalent of discarding one or other of those file descriptors is to cast the channel to a send-only or receive-only channel. ↩
Doing this requires the sort of direct pointer arithmetic provided by the unsafe
package. It’s generally a poor idea to do this in application code if you care about compatibility across platforms and between versions of Go. ↩
This will be the number of “virtual” cores reported, which may not correspond to the number of physical cores if technologies like hyper-threading are in use. ↩
It’s worth being aware of this if you planned to use SIGURG
for your own purposes, such as being notified of out-of-band data arriving on a socket. Fortunately modern use of SIGURG
is quite rare, which is presumably why they chose it. ↩
1 in every 61 times to be precise, which is presumably chosen to be prime for complicated reasons due to hashing, which I’m not going to worry about too much. ↩
That said, if delays in the order of 10ms are a major problem to you then it’s possible Go isn’t an ideal choice of language for your application — you’re starting to get into the realms where dedicated kernel modules might be justified. ↩
I should apologise for the fact that all my tours are whirlwind tours — they’re not, of course, it’s just that I’m rather lazy when it comes to my writing style. I really should learn to avoid clichés like the plague, they’re a dime a dozen. ↩
What I mean by this is that in vanilla C, if you see an expression or function call then you generally know exactly what the code is doing. In C++, however, features such as operator overloading, method overloading and method overriding can all conspire to make it hard to be able to look at a piece of code and know what it’s going to do. You often have to go look what objects were passed in, which methods they override, etc. This makes it harder to do “keyhole” code examinations, where you can make deductions about what a piece of code is doing in isolation of the context in which it will be called. ↩