FTL usually refers to "faster than light". The tachyon, a theoretical particle that powers certain spaceships in the Star Trek universe, keeps the plot going for decades through multiple series and movie franchises. Today, though, we are going to be talking about running linux applications faster than linux.

Our company, NanoVMs, has been working with unikernels for a while now, but we've mostly been focused on their security properties and had not put much thought into their performance. Some have even correctly pointed out that Nanos is not a traditional "true blue" unikernel as it retains different privilege levels for kernel vs user code. This is because there are certain privileged instructions that allow you to change page mappings, and if you have that capability then all the ASLR and page protections in the world don't matter. You're going to get hacked.

Having said that, there's nothing preventing us from running software faster than most linux distributions, or linux in general. In fact it is almost a guarantee, because linux is a general purpose operating system built to run everything from the latest FPS game to a load-bearing database to inference at the edge. It was also purpose-built to run on bare metal, so it has facilities a unikernel would never incorporate. Over half of linux is device drivers, and it's not a case of you deciding what to pick and choose for your hand-rolled kernel.

Keep in mind these same design concepts (multiple processes, multiple users, interactivity) are what powered the PDP-7. As much trash as we talk on the multiple process model, you kind of need it for bare metal installations, and you want that capability for the laptop or phone you are reading this article on. However, for production server-side applications that always run inside of a virtual machine - which is 99% of everything server-side nowadays - that condition goes away. We don't want users, nor do we want remote interactive access, nor do we want a bunch of random crap running that is not our software. There's already one layer of linux running as the hypervisor - does the guest need to be a heavy-weight GPOS as well? It is these environmental characteristics that allow us to do what we do. Before we go further down the road it should be clear that linux is a general purpose operating system and as such has many use-cases - for example, linux is and will forever be our favorite development environment. But not only do we get security benefits by only allowing one application to run, it also tends to run much, much faster.

There are a lot of unikernel projects out there, but at the end of the day cloning decades of kernel work just takes a ton of time and effort, and it can be years before you see performance gains for commonly used software.

Since OPS is written in Go, and we have numerous other software projects at our company written in Go, we've used Go quite a lot for testing, and thus that's where we've started seeing some results. You should be able to replicate this with most Go releases just by using the latest Nanos release, but if you are using Go 1.14 you'll need to build Nanos from source (or use the nightly release) as we just threw in a SA_ONSTACK fix. A scheduling change was made between Go 1.13 and 1.14. Ordinary application developers don't get much exposure to these types of changes, but when they interface with the operating system we have to deal with them. There's a reason why they call it bleeding edge. :)
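If you're curious what that change looks like from the application's side - and assuming, as seems likely but is my own inference, that it refers to Go 1.14's new signal-based asynchronous preemption - here is a tiny sketch. Before 1.14, a tight loop with no function calls could only be preempted at safe points, so on a single P it could starve everything else; from 1.14 on, the runtime interrupts it with a signal delivered on an alternate signal stack, which is exactly where SA_ONSTACK handling in the kernel underneath starts to matter.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1)

	// A goroutine spinning in a loop with no calls: before Go 1.14 it could
	// hog the only P forever; with async preemption the runtime signals it
	// (on an alternate signal stack) and the main goroutine gets to run.
	go func() {
		for {
		}
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("still scheduled under Go 1.14+ async preemption")
}
```

Under Go 1.13 this program hangs; under 1.14+ it prints and exits, and you can flip `GODEBUG=asyncpreempt=off` to see the old behavior again.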
## Building From Source

To build from source you'll need to clone Nanos and run a simple make. If you haven't installed OPS yet you should do so. From there you can copy your local build into the latest OPS release like so:

```sh
eyberg@box:~/go/src/github.com/nanovms/nanos$ cp output/boot/boot.img ~/.ops/0.1.25/.
eyberg@box:~/go/src/github.com/nanovms/nanos$ cp output/mkfs/bin/mkfs ~/.ops/0.1.25/.
eyberg@box:~/go/src/github.com/nanovms/nanos$ cp output/stage3/bin/stage3.img ~/.ops/0.1.25/
```

It is highly recommended to just use the pre-built releases from OPS if you can.

Now let's use this simple little Go hello world:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	fmt.Println("hello world!")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Welcome to my website!")
	})

	fs := http.FileServer(http.Dir("static/"))
	http.Handle("/static/", http.StripPrefix("/static/", fs))

	http.ListenAndServe("0.0.0.0:8080", nil)
}
```

I'm using this ops config:

```json
{
  "CloudConfig": {
    "ProjectID": "prod-1033",
    "Zone": "us-west2-a",
    "BucketName": "my-bucket"
  }
}
```

Then we build the GCE image:

```sh
eyberg@box:~/y$ cat build.sh
#!/bin/sh
GOOGLE_APPLICATION_CREDENTIALS=~/gcloud.json ops image create \
        -c config.json -t gcp -a hackernoon
```

Let's spin it up:

```sh
eyberg@box:~/y$ cat create-instance.sh
#!/bin/sh
GOOGLE_APPLICATION_CREDENTIALS=~/gcloud.json ops instance create \
        -t gcp -i hackernoon-image -z us-west2-a
```

Now let's spin up two more instances on GCE directly: one for benchmarking using ab and one debian instance for the Go webserver to just sit on.

Don't be dumb like the author and wonder why the latency is two orders of magnitude higher when benchmarking from a different region - use the same region and zone you are using with OPS.

Then let's transfer our little Go app:

```sh
➜  ~ scp -i ~/.ssh/nope eyberg@nsa.com:~/y/hackernoon .
hackernoon                                   100% 7321KB   9.0MB/s   00:00
➜  ~ gcloud beta compute scp --zone "us-west2-a" --project "project-something" hackernoon "gtest":~/.
```

Then install ab:

```sh
eyberg@bench:~$ sudo apt-get install apache2-utils
```

Let's hit both of the instances up:

```sh
eyberg@bench:~$ curl -XGET http://10.240.0.41:8080/
Welcome to my website!
eyberg@bench:~$ curl -XGET http://10.240.0.38:8080/
Welcome to my website!
```

Seems legit.
Also - just so we know who's who: 10.240.0.38 is the Nanos unikernel and 10.240.0.41 is the stock debian instance.

Now let's run with a concurrency of 1:

```sh
eyberg@bench:~$ ab -c 1 -n 100 http://10.240.0.38:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.38 (be patient).....done


Server Software:
Server Hostname:        10.240.0.38
Server Port:            8080

Document Path:          /
Document Length:        22 bytes

Concurrency Level:      1
Time taken for tests:   0.028 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      13900 bytes
HTML transferred:       2200 bytes
Requests per second:    3634.65 [#/sec] (mean)
Time per request:       0.275 [ms] (mean)
Time per request:       0.275 [ms] (mean, across all concurrent requests)
Transfer rate:          493.37 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.0      0       0
Waiting:        0    0   0.0      0       0
Total:          0    0   0.2      0       2

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      2
 100%      2 (longest request)
```

```sh
eyberg@bench:~$ ab -c 1 -n 100 http://10.240.0.41:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.41 (be patient).....done


Server Software:
Server Hostname:        10.240.0.41
Server Port:            8080

Document Path:          /
Document Length:        22 bytes

Concurrency Level:      1
Time taken for tests:   0.037 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      13900 bytes
HTML transferred:       2200 bytes
Requests per second:    2684.06 [#/sec] (mean)
Time per request:       0.373 [ms] (mean)
Time per request:       0.373 [ms] (mean, across all concurrent requests)
Transfer rate:          364.34 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.1      0       1
Waiting:        0    0   0.1      0       1
Total:          0    0   0.2      0       2

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      1
  99%      2
 100%      2 (longest request)
```

Ok - got them warmed up. We see the unikernel outpacing the linux instance just by a bit.
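If you'd rather cross-check ab with something hand-rolled, a crude sequential client in Go gives the same rough picture. To be clear, this is just an illustrative sketch and not what produced the numbers in this post - and note it reuses keep-alive connections, which ab without -k does not, so the absolute numbers will differ. Point the URL at whichever instance you want to measure:

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

func main() {
	const n = 1000
	url := "http://10.240.0.38:8080/" // whichever instance you're testing

	client := &http.Client{Timeout: 5 * time.Second}
	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		// Drain and close the body so the underlying connection can be reused.
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
	}
	elapsed := time.Since(start)
	fmt.Printf("%d requests in %v (%.0f req/s)\n", n, elapsed, float64(n)/elapsed.Seconds())
}
```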
Now let's hit it more:

```sh
eyberg@bench:~$ ab -c 10 -n 1000 http://10.240.0.41:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.41 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        10.240.0.41
Server Port:            8080

Document Path:          /
Document Length:        22 bytes

Concurrency Level:      10
Time taken for tests:   0.087 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      139000 bytes
HTML transferred:       22000 bytes
Requests per second:    11444.27 [#/sec] (mean)
Time per request:       0.874 [ms] (mean)
Time per request:       0.087 [ms] (mean, across all concurrent requests)
Transfer rate:          1553.47 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       2
Processing:     0    1   0.2      1       1
Waiting:        0    1   0.2      1       1
Total:          0    1   0.3      1       3

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      2
 100%      3 (longest request)
```

```sh
eyberg@bench:~$ ab -c 10 -n 1000 http://10.240.0.38:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.38 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        10.240.0.38
Server Port:            8080

Document Path:          /
Document Length:        22 bytes

Concurrency Level:      10
Time taken for tests:   0.055 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      139000 bytes
HTML transferred:       22000 bytes
Requests per second:    18313.68 [#/sec] (mean)
Time per request:       0.546 [ms] (mean)
Time per request:       0.055 [ms] (mean, across all concurrent requests)
Transfer rate:          2485.94 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       2
Processing:     0    0   0.1      0       1
Waiting:        0    0   0.1      0       1
Total:          0    1   0.2      1       2

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      2
 100%      2 (longest request)
```

As you can see this is a fairly decent percentage difference. Now before you get all crazy on the twitters, keep in mind that benchmarking can measure lots of things. To be utterly, painfully clear - these benchmarks are very crude and naive. They are only meant to get you interested, not necessarily prove anything. This was merely looking at a Go webserver's requests/second. Measuring a different language like Rust or Node will produce very different results.

In fact - let's go ahead and do just that.
Let's look at a simple Rust webserver real quick:

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn handle_read(mut stream: &TcpStream) {
    let mut buf = [0u8; 4096];
    match stream.read(&mut buf) {
        Ok(_) => {
            let req_str = String::from_utf8_lossy(&buf);
            // println!("{}", req_str);
        }
        Err(e) => println!("Unable to read stream: {}", e),
    }
}

fn handle_write(mut stream: TcpStream) {
    let response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n<html><body>Hello world</body></html>\r\n";
    match stream.write(response) {
        Ok(_) => {} //println!("Response sent"),
        Err(e) => println!("Failed sending response: {}", e),
    }
}

fn handle_client(stream: TcpStream) {
    handle_read(&stream);
    handle_write(stream);
}

fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").unwrap();
    println!("Listening for connections on port {}", 8080);

    for stream in listener.incoming() {
        match stream {
            Ok(stream) => {
                thread::spawn(|| handle_client(stream));
            }
            Err(e) => {
                println!("Unable to connect: {}", e);
            }
        }
    }
}
```

Using the same config as before we'll upload it to Google and spin up three new instances: one for the rust unikernel, one for a debian instance on the same subnet, and one running debian w/the rust webserver on it. Note: I'm choosing debian for no other reason than that it's the default choice and so would be used quite a lot.

```sh
export GOOGLE_APPLICATION_CREDENTIALS=~/gcloud.json
ops image create -c config.json -a main -i rustz1
ops instance create -z us-west2-a -i rustz1
```

We do a quick live-check:

```sh
eyberg@dtest:~$ curl -XGET http://10.240.0.94:8080/
<html><body>Hello world</body></html>
eyberg@dtest:~$ curl -XGET http://10.240.0.8:8080/
<html><body>Hello world</body></html>
```

For the one running on debian:

```sh
eyberg@dtest:~$ ab -c 1 -n 100 http://10.240.0.94:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.94 (be patient).....done


Server Software:
Server Hostname:        10.240.0.94
Server Port:            8080

Document Path:          /
Document Length:        39 bytes

Concurrency Level:      1
Time taken for tests:   0.025 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      9800 bytes
HTML transferred:       3900 bytes
Requests per second:    3948.98 [#/sec] (mean)
Time per request:       0.253 [ms] (mean)
Time per request:       0.253 [ms] (mean, across all concurrent requests)
Transfer rate:          377.93 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       2
Processing:     0    0   0.0      0       0
Waiting:        0    0   0.0      0       0
Total:          0    0   0.2      0       2

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      2
 100%      2 (longest request)
```

We can see the rust unikernel outperforming just slightly:

```sh
eyberg@dtest:~$ ab -c 1 -n 100 http://10.240.0.8:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.8 (be patient).....done


Server Software:
Server Hostname:        10.240.0.8
Server Port:            8080

Document Path:          /
Document Length:        39 bytes

Concurrency Level:      1
Time taken for tests:   0.021 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      9800 bytes
HTML transferred:       3900 bytes
Requests per second:    4778.97 [#/sec] (mean)
Time per request:       0.209 [ms] (mean)
Time per request:       0.209 [ms] (mean, across all concurrent requests)
Transfer rate:          457.36 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       2
Processing:     0    0   0.0      0       0
Waiting:        0    0   0.0      0       0
Total:          0    0   0.2      0       2

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      2
 100%      2 (longest request)
```

Keep in mind these are 1 vCPU instances, but let's go ahead and up the concurrency:

```sh
eyberg@dtest:~$ ab -c 10 -n 1000 http://10.240.0.94:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.94 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        10.240.0.94
Server Port:            8080

Document Path:          /
Document Length:        39 bytes

Concurrency Level:      10
Time taken for tests:   0.072 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      98000 bytes
HTML transferred:       39000 bytes
Requests per second:    13919.63 [#/sec] (mean)
Time per request:       0.718 [ms] (mean)
Time per request:       0.072 [ms] (mean, across all concurrent requests)
Transfer rate:          1332.15 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       2
Processing:     0    1   0.2      1       2
Waiting:        0    1   0.2      1       2
Total:          0    1   0.2      1       3

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      2
 100%      3 (longest request)
```

Not bad. Now let's check out the rust webserver running under Nanos:

```sh
eyberg@dtest:~$ ab -c 10 -n 1000 http://10.240.0.8:8080/
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.240.0.8 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname:        10.240.0.8
Server Port:            8080

Document Path:          /
Document Length:        39 bytes

Concurrency Level:      10
Time taken for tests:   0.046 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      98000 bytes
HTML transferred:       39000 bytes
Requests per second:    21736.30 [#/sec] (mean)
Time per request:       0.460 [ms] (mean)
Time per request:       0.046 [ms] (mean, across all concurrent requests)
Transfer rate:          2080.23 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.1      0       1
Waiting:        0    0   0.1      0       1
Total:          0    0   0.2      0       2

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      1
  95%      1
  98%      1
  99%      1
 100%      2 (longest request)
```

Well - that's a pretty large difference!
Keep in mind that these are on 1 vCPU. Nanos has SMP support because, even though there are a ton of languages/applications that are inherently single process/single-threaded, that's not what our future holds. Google actually has support for instances with 416 threads! Holy multi-threading, batman.

## Going Deeper

If you are testing and you just run OPS locally without any taps, you'll be using user-mode networking with no hardware acceleration. Both of these will produce far worse results than what you see here. That's why we use gcloud as a neutral testing ground, and since you can upload the image and start the instance in about two minutes it's not really that big of a deal. Likewise, measuring something like filesystem writes is not something we're looking at here. That you'll need to wait for the next blogpost for!

Also keep in mind - Nanos is not linux. A lot of people seem to think we've trimmed down the linux kernel and created something like Alpine. That is most definitely not what we have done - go look at the source.

We are also testing on Google Cloud here. Your results will most assuredly be different on AWS. Again, most of our testing has been done on Google, and even though we can deploy to AWS today I am aware of at least one feature that needs to be implemented to make it go much faster than it does today. Why the difference? The instances we run on AWS use Xen and the ones on Google are KVM based. Everything from clocks to network drivers to storage is different.

This is just the start of the performance work, but it is promising. Some people think that down the road, as the codebase grows, we will see significant slowdowns. I don't think this is realistic, because at the end of the day we are comparing a multi-threaded, single process system to a system that is not just multi-process but massively multi-process. They are just two different beasts. Plus, there are a ton of optimizations on our roadmap that already exist in other general purpose systems such as fbsd and linux that we haven't even started on yet. A simple scroll through the issue tracker enumerates many of those. So if anything I expect these numbers to improve by a lot. Plus, one of the cooler things, imo, that we can do since we know these are unikernels is make app-specific optimizations such as subbing out various schedulers easily.

## The Reality is that we Have a Lot to Look Forward To

Most of the heavy context switching that you read about comes from large, heavily multi-process systems. There are quite a lot of ramifications for software written in this style that sadly just aren't being taught anymore, even though the "hella-core"™ future is extremely important to be aware of. For instance, remember that a process will share a heap between multiple threads, but two processes have to have their own. This is why we see databases and the JVM utilizing things like huge_pages. This is also why forking as a concurrency primitive can be so horribly slow. Also, there are a lot of interfaces in the kernel that exist to facilitate its multi-process environment, and just as many facilities to ensure multiple users don't stomp on each other's memory.

If you have a kernel that is focused on running one and only one application, it is no coincidence that it is going to run a lot faster even without tuning. It's honestly not even fair to compare the two types.
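To make that process-versus-thread point a bit more concrete, here's a rough sketch in Go - an illustration rather than a rigorous benchmark, and it assumes a Linux-style box with /bin/true available - comparing a thousand goroutines sharing one address space against a thousand short-lived processes that each need their own:

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
	"time"
)

func main() {
	// 1000 goroutines: threads of one process, all sharing a single heap
	// and address space.
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() { defer wg.Done() }()
	}
	wg.Wait()
	fmt.Println("1000 goroutines:", time.Since(start))

	// 1000 short-lived processes: each one is forked/exec'd with its own
	// address space, which is where the cost piles up.
	start = time.Now()
	for i := 0; i < 1000; i++ {
		if err := exec.Command("/bin/true").Run(); err != nil {
			fmt.Println("exec failed:", err)
			return
		}
	}
	fmt.Println("1000 processes: ", time.Since(start))
}
```

On any box you try this on, the second number should come out a few orders of magnitude larger than the first, and that is the same tax a many-process design pays over and over.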
Linux has expectations that it could be deployed onto real hardware. We know we will only ever live as a VM, and that allows us to take advantage of that fact. Today it is quite possible to run a VM faster than native linux because of how good hypervisors and virtualization have gotten; the hardware being produced today is actually optimized for hyperscaler deployments (eg: running in a virtualized environment). Tack on the fact that there are quite a few syscalls we don't support nor care to, and we automatically get performance boosts. Then couple that with the fact that a brand new instance of debian or ubuntu or whatever comes with all of this:

```sh
root@bench:/home/eyberg# ps aux | wc -l
72
```

That was debian. Let's look at ubuntu:

```sh
root@instance-1:/home/eyberg# ps aux | wc -l
93
```

Keep in mind these are processes that are already running. We haven't installed anything yet. This is a fresh boot! This is on a single thread instance!! 93 processes are all fighting each other for that one thread. I've always loved this slide graphically showing the true cost.

However, other things stand out immediately on this host:

```sh
root@instance-1:/home/eyberg# ps aux | grep python
root      1332  0.1  0.5 171708 19432 ?   Ssl  00:14   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root      1497  0.1  0.5  65148 21352 ?   Ss   00:14   0:00 /usr/bin/python3 /usr/bin/google_network_daemon
root      1499  0.1  0.5  65112 21228 ?   Ss   00:14   0:00 /usr/bin/python3 /usr/bin/google_clock_skew_daemon
root      1501  0.1  0.5  65448 21476 ?   Ss   00:14   0:00 /usr/bin/python3 /usr/bin/google_accounts_daemon
```

If you think any of these random python programs are going to be fast you might be mistaken. Let's not forget that I'm an active user on this system screwing with the performance for every single command I type. We call them "commands" but what are they really? That's right - yet another program.

On some of the smaller instances you will be throttled too. The f1 and g1 instances, for instance, are labeled wrong in the gcloud dashboard as "1 vcpu" - they really are 0.2 and 0.5 respectively - and after playing with them enough I feel that the 20% and 50% shares are bursty all the time. You can really tell the difference between a shared thread and one that is all yours.

All this is to say that running a general purpose operating system in the cloud works, and everyone does it because those were the only tools we had. However, it's not the only tool in your toolbelt anymore. The server side operating system revolution is long overdue.