The first tests looked interesting, but I have suspicions about the test which strips out most of the work. They say "we parse the request params and return a fixed number." That sounds like they're returning an arbitrary number unrelated to the params; if the compiler notices no real work is being done, it might in some circumstances remove the code entirely, so that you're benchmarking nothing, or it might not. Real services tend to do lots of work that isn't reflected in benchmarks like this: auth, logging, database access, etc.
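As a minimal sketch of the general concern (in Go, with hypothetical names, not the benchmark in question): if a benchmark computes something and never observes the result, the compiler may, depending on what it can see, drop some or all of the work; keeping the result observable (e.g. a package-level sink, or a black-box hint if the language has one) forces it to actually happen.

```go
package bench

import (
	"strconv"
	"testing"
)

// Simulates "parse the request params and return a fixed number":
// the parsed value is discarded, so an optimizing compiler is in
// principle free to remove the parse and measure almost nothing.
func BenchmarkParseDiscarded(b *testing.B) {
	for i := 0; i < b.N; i++ {
		strconv.Atoi("12345") // result ignored
	}
}

// Package-level sink keeps the result observable, so the work
// cannot be optimised away.
var sink int

func BenchmarkParseKept(b *testing.B) {
	for i := 0; i < b.N; i++ {
		n, _ := strconv.Atoi("12345")
		sink = n
	}
}
```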
As usual, benchmarks are incredibly hard to run fairly and even harder to interpret correctly, so they should be at most one data point when deciding between languages or software, and should be weighed in the context of the actual performance required, the existing team, their experience, etc. They can be good for eliminating options, though, if those options are nowhere near the performance requirements. I did find it interesting to compare the different graphs of response time, including outliers (people often ignore the worst-case response times).
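On that last point, a quick sketch of why averages hide outliers (Go, with made-up sample data): the median can look perfectly healthy while the tail and worst case tell a very different story.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the value at quantile q (0..1) from an
// already-sorted slice of latencies.
func percentile(sorted []time.Duration, q float64) time.Duration {
	idx := int(q * float64(len(sorted)-1))
	return sorted[idx]
}

func main() {
	// Hypothetical response-time samples from a load test.
	latencies := []time.Duration{
		3 * time.Millisecond, 4 * time.Millisecond, 4 * time.Millisecond,
		5 * time.Millisecond, 6 * time.Millisecond, 9 * time.Millisecond,
		12 * time.Millisecond, 40 * time.Millisecond, 250 * time.Millisecond,
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })

	fmt.Println("p50:", percentile(latencies, 0.50)) // looks fine
	fmt.Println("p99:", percentile(latencies, 0.99)) // much worse
	fmt.Println("max:", latencies[len(latencies)-1]) // the outlier users actually feel
}
```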