2016-02-15

Server performance 2

So (net server) itself didn't perform too badly, which means there must be another culprit. To find it I would usually use the profiler, but it only works in a single-threaded environment. That means it can't be used on a server program written on top of the (net server) library.

Just giving up would be a very easy way out, but my conscience doesn't allow me to do it (please let me go...). The current HTTP server implementation has 2 layers, Paella and Plato. The first one is the basic HTTP server, the second is the web framework on top of it. At least I can check which one is slow. So I've just tried with a bare Paella server, copy&pasting the example and modifying it a bit like this:
(import (rnrs)
        (net server)
        (paella))

(define config (make-http-server-config :max-thread 10))

(define http-dispatcher
  (make-http-server-dispatcher
    (GET "/benchmark" (http-file-handler "index.html" "text/html"))))

(define server 
  (make-simple-server "8500" (http-server-handler http-dispatcher)
                      :config config))

(server-start! server)
Then I used the same script as before. The result is this:
$ time ./benchmark.sh
./benchmark.sh  4.66s user 3.76s system 335% cpu 2.507 total
Hmmm, the bare server is already slow. So I can assume most of the time is consumed by the server, not the framework.

Listing what the server actually does would help:
  1. Converting socket to buffered port
  2. Parsing HTTP header
  3. Parsing request path.
  4. Parsing query string (if there is one)
  5. Parsing MIME (if there is any)
  6. Parsing cookie (if there is one)
  7. Calling handler
  8. Writing response
  9. Cleaning up
So I started with the second item (port conversion actually improves performance so it can't be removed, unless I write everything from scratch on raw sockets, but that sounds like too much of a pain in the ass). Conclusion first: I made header parsing almost twice as fast (mostly by reducing memory allocation), but it didn't affect the performance of the server at all.

Header parsing occurs once per request, so I dumped the headers cURL sends and carefully checked which procedures take time. It turned out the SRFI-13 related procedures were consuming a lot of time: they have a rich interface, but that requires packing rest arguments. So I replaced them with versions that take no rest arguments. Then, in the same library, the procedure rfc5322-header-ref, which looks up a header value, called string-ci=?, which calls string-foldcase internally. So I changed it to do the case folding only once. And there were a couple more improvements. All of them did indeed improve performance, however calling the header parser 1000 times took only 30ms to begin with, so making it 15ms doesn't change much.
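To illustrate the case-folding change, here is a minimal sketch in plain R6RS. It is not the real rfc5322-header-ref: the name header-ref/ci and the alist representation of headers are assumptions made for this example. The point is only that the lookup key is folded once before the loop, instead of letting string-ci=? fold both strings on every comparison.

```scheme
(import (rnrs))

;; Hypothetical helper for illustration; not the real rfc5322-header-ref.
;; headers is assumed to be an alist of (name . value) string pairs.
(define (header-ref/ci headers name)
  ;; Fold the lookup key once, outside the loop.
  (let ((key (string-foldcase name)))
    (let loop ((hs headers))
      (cond ((null? hs) #f)
            ;; Only the stored header name is folded per comparison.
            ((string=? key (string-foldcase (car (car hs))))
             (cdr (car hs)))
            (else (loop (cdr hs)))))))

;; The headers cURL sent in the benchmark request.
(define sample-headers
  '(("User-Agent" . "curl/7.35.0")
    ("Host" . "localhost:8500")
    ("Accept" . "*/*")))

(display (header-ref/ci sample-headers "host"))
(newline)
```

If the parser stored header names already folded, the fold inside the loop could go away as well, but that would be a further assumption about how the headers are kept.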

Then I started suspecting that the benchmark script itself is actually slow. I'm not sure how fast cURL itself is, but forking it 1000 times and waiting for all the processes didn't sound fast. So I've written the following script:
#!read-macro=sagittarius/bv-string
(import (rnrs)
        (sagittarius socket)
        (sagittarius control)
        (time)
        (util concurrent)
        (getopt))

(define header
  #*"GET /benchmark HTTP/1.1\r\n\
     User-Agent: curl/7.35.0\r\n\
     Host: localhost:8500\r\n\
     Accept: */*\r\n\r\n")

(define (poke)
  (define sock (make-client-socket "localhost" "8500"))
  (socket-send sock header)
  ;; just poking
  (socket-recv sock 256)
  (socket-close sock))

(define (main args)
  (with-args (cdr args)
      ((threads (#\t "threads") #t "10")
       (unit    (#\u "unit")    #t "1000"))
    (let* ((c (string->number unit))
           (t (string->number threads))
           (thread-pool (make-thread-pool t raise)))
      (time (thread-pool-wait-all!
             (dotimes (i (* c t) thread-pool)
               (thread-pool-push-task! thread-pool poke))))
      (thread-pool-release! thread-pool))))
It sends a fixed HTTP request and receives the response (possibly only part of it). The -t option specifies how many threads should be used and the -u option specifies how many requests should be done per thread. So if this also takes time, then my assumption is not correct. Lemme do it with the bare HTTP server:
$ sash bench.scm -t 100 -u 100

;;  (thread-pool-wait-all! (dotimes (i (* c t) thread-pool) (thread-pool-push-task! thread-pool poke)))
;;  4.052414 real    0.670089 user    1.255910 sys
100 threads and 100 requests per thread, so 10000 requests were sent in total. It took about 4 seconds, so roughly 2500 req/s. That's faster than the cURL version.
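The arithmetic behind these figures can be checked with a few lines of R6RS. This is just a sketch; requests-per-second is a hypothetical helper name, and the inputs are the real-time values printed by the two benchmarks.

```scheme
(import (rnrs))

;; Hypothetical helper: throughput from a request count and elapsed seconds.
(define (requests-per-second requests seconds)
  (/ requests seconds))

;; Scheme client: 100 threads x 100 requests in 4.052414s real time.
(display (round (requests-per-second (* 100 100) 4.052414)))
(newline)

;; cURL script against the bare server: 1000 requests in 2.507s total.
(display (round (requests-per-second 1000 2.507)))
(newline)
```

That works out to about 2468 req/s for the Scheme client against about 399 req/s for the cURL script on the same bare server.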

2500 req/s isn't fast, but it's good enough for my purpose. So I'll put this aside for now.

2016-02-14

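Body Remodeling Club: Week 6

This week (last week?) I was down with a cold for two days. Shivering from the cold in bed on Monday night is a fond memory. Normally I kick the blanket off even in winter...

Weigh-in results:

  • Weight: 72.8kg (-0.3kg)
  • Body fat percentage: 23.6% (±0.0%)
  • Muscle percentage: 42.6% (±0.0%)
My weight went down but nothing else changed, so what on earth does that mean? Bone, did I lose bone? I think it simply looks unchanged because the difference is within the margin of error, but it's an unsettling result.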

2016-02-12

Server performance

Sagittarius has a server framework library, (net server), and on top of this library I've written a simple HTTP server and web framework, Paella. I don't use it in demanding situations, so performance doesn't really matter for now. However, if you write something you want to check how good or bad it is, don't you? And yes, I've done a simple benchmark and figured out it's horrible.

I've created a very simple static page with Plato, which is a web application framework bundled with Paella. It just returns an HTML file (although it does have some overhead...). It looks like this:
(library (plato webapp benchmark)
    (export entry-point support-methods)
    (import (rnrs) (paella) (plato) (util file))

  (define (support-methods) '(GET))
  (define (entry-point req)
    (values 200 'file (build-path (plato-current-path (*plato-current-context*))
                                  "index.html")
            '("content-type" "text/html")))
)
The index.html contains 200B of data.
I don't have nice modern HTTP benchmark software like ApacheBench (because I'm lazy), so I just used cURL and shell. The script looks like this:
#!/bin/sh

invoke () {
    curl http://localhost:8500/benchmark > /dev/null 2>&1
}

call () {
    for i in `seq 1 1000`;
    do
        invoke &
    done
}

call
wait
It just creates 1000 background processes and waits for them.

The benchmark is run against the default startup script which Plato generates, so the number of threads is 10. This is the result:
$ time ./benchmark.sh
./benchmark.sh  4.89s user 3.77s system 313% cpu 2.764 total
I've run it a couple of times and the average is approximately 3 seconds per 1000 requests, so roughly 300 req/s. It's slow.

If I run the above benchmark with only 10 requests, the result is like this:
$ time ./benchmark.sh
./benchmark.sh  0.05s user 0.05s system 249% cpu 0.040 total
And 1 request is like this:
$ time ./benchmark.sh
./benchmark.sh  0.01s user 0.01s system 77% cpu 0.025 total
So up to the number of threads, I can assume it does better; at least the time doesn't increase 10 times. But with 100 requests, it takes about 7 times longer.
$ time ./benchmark.sh
./benchmark.sh  0.49s user 0.35s system 285% cpu 0.293 total
Going from 1 to 10 requests roughly doubles the time, but 10 to 100 takes 7 times longer, and 100 to 1000 about 10 times longer. Something doesn't look right to me.
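For reference, the growth factors can be computed directly from the total times reported by `time` above. This is a small R6RS sketch; the names samples and growth-ratios are made up for this example.

```scheme
(import (rnrs))

;; (requests . total-seconds) pairs taken from the `time` output above.
(define samples
  '((1 . 0.025) (10 . 0.040) (100 . 0.293) (1000 . 2.764)))

;; Ratio of each total time to the previous one.
(define (growth-ratios samples)
  (let loop ((prev (car samples)) (rest (cdr samples)) (acc '()))
    (if (null? rest)
        (reverse acc)
        (loop (car rest)
              (cdr rest)
              (cons (/ (cdr (car rest)) (cdr prev)) acc)))))

(for-each (lambda (r) (display r) (newline))
          (growth-ratios samples))
```

The ratios come out at about 1.6, 7.3 and 9.4 for each tenfold increase in requests.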

Why is it so slow, and why does it get slower as the number of requests increases? I think there are a couple of reasons. The (net server) library uses a combination of select(2) and multithreading. When the server accepts a connection, it tries to find the least used thread and pushes the socket to that thread. The thread calls select to see if there's something to read, then invokes the user-defined procedure. After the invocation, it checks whether any socket has been closed and waits for input with select again. So the flow is like this (n = number of threads, m = number of sockets per thread):
  1. Find least used thread. O(nm) (best case O(1) if none of the threads are used)
  2. Push socket to the thread. O(1)
  3. Handling request. O(m)
  4. Cleaning up sockets. O(m)
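The first step can be sketched roughly as follows. This is a simplified model for illustration, not the actual (net server) code: each worker thread is represented as a plain list of the sockets it holds, and least-used-index is a hypothetical name. Counting the sockets of one worker costs O(m), so scanning all n workers costs O(nm), with an early exit as soon as an idle worker is found.

```scheme
(import (rnrs))

;; Simplified model (assumption): a worker is the list of sockets it holds.
;; Returns the index of the worker holding the fewest sockets.
;; workers is assumed to be non-empty.
(define (least-used-index workers)
  (let loop ((i 0) (ws workers) (best 0) (best-count #f))
    (if (null? ws)
        best
        ;; length walks the whole socket list: O(m) per worker.
        (let ((count (length (car ws))))
          (cond ((zero? count) i) ; idle worker found, stop scanning
                ((or (not best-count) (< count best-count))
                 (loop (+ i 1) (cdr ws) i count))
                (else (loop (+ i 1) (cdr ws) best best-count)))))))

;; Three workers holding 3, 1 and 2 sockets.
(display (least-used-index '((s1 s2 s3) (s4) (s5 s6))))
(newline)
```

With the sample data the second worker (index 1) is chosen. When the very first worker is idle the scan stops immediately, which corresponds to the O(1) best case.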
I think the socket dispatching smells slow. So I've made some changes like the following:
  • Adding a load balancing thread which simply manages a priority queue
  • Just asking the queue which thread is least loaded
  • Code cleanup
    • Using (util concurrent shared-queue) instead of manually managing sockets and locks
    • Not assuming that a socket whose write side has been shut down is unused
    • more...
With these changes, the first step takes only O(1). Now benchmark! This is the result:
$ time ./benchmark.sh
./benchmark.sh  4.61s user 3.76s system 317% cpu 2.633 total
YAHOOOOO!!!! 100ms faster!!! ... WHAAAATTTT!???

Well, on average it's 2.6 sec per 1000 requests, so it is a bit faster, by something like 300ms - 400ms. And using (util concurrent) made the server itself more robust (it sometimes hung before). I think the server framework itself is not too bad, but the HTTP server is. So that'd be the next step.

2016-02-07

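Body Remodeling Club: Week 5

For some reason I missed the chance to write last week.

Weigh-in results:
  • Weight: 73.1kg (-0.6kg)
  • Body fat percentage: 23.6% (-0.4%)
  • Muscle percentage: 42.6% (+0.2%)
Almost no visible change. I suppose I really need to lose another 5 kg. Still, does losing only 1 kg in a month mean I'm not serious enough?

I feel the load of my strength training isn't enough, so I've doubled the number of reps, but it still doesn't feel like enough (I don't even get sore). I should probably go to a gym, but I can't find the time. Maybe push-ups with a weight on my back.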