May 09, 2019

Modeling Closed System Performance of a Server with Discrete Event Simulation

Photo by Kevin Ku on Unsplash


Performance testing of web servers requires a large cost and time. Also, when reporting the performance characteristics of a commercial web application system, it is a common practice to report the performance figures for different combinations of concurrency levels, heap sizes, etc. Hence, we need more cost-effective methods for performance testing. Discrete event simulation is one such option.

Although we cannot fully model the operation of a performance test using discrete event simulation (DES), we can identify the general trends in performance, for example, what happens to the 99th percentile latency when a new feature is added to the software. This reduces the cost for unnecessary performance tests.

DES models a system as a discrete sequence of events in time [1]. An event refers to an item that changes the state of the system. DES is widely used in diagnosing process issues, modelling hospital applications, lab test performance improvement ideas, and evaluating capital investment decisions.

There are three methods of performance testing: 1. The closed system model 2. The open system model and 3. The half open system model. In a previous post, we discussed how to simulate an open system using DES. In this article, we will look at how we can model a web server performance test using the closed system model.

In this blog, we will discuss how we can model a closed loop performance test using DES. We first discuss the basic concepts, DES, and closed system model performance tests. Then, we describe the DES package - simply, that we use for our implementations. We then present the code for a single server closed loop performance test. We then extend the simple version of the single server to multiple servers, which has inter-service calls (abstraction for microservices). The python codes for all the methods are available here.

Discrete Event Simulation

Discrete event simulation (DES) simulates the behavior of a process or a system. Modern hardware, which have high-speeds and capacities, have allowed DES to be applied to problems that span across many areas. The system is modeled as a series of events that occur over time, for example, in performance tests, DES is modeled as a series of requests that are emitted by an event source (a workload generating client in our tests).

Simulating a system is more difficult than actually implementing the real system due to difficulties in writing the code and difficulties in debugging. These difficulties come about due to the parallel execution nature of the simulation code. Hence, there have been attempts to build frameworks and languages for DES. There are mainly three major DES paradigms: activity oriented, event oriented, and process oriented. We will not cover these three in detail, since it is beyond the scope of this article, but an interested reader can refer to this for more details.

1. Activity Oriented Paradigm

In this approach, time is broken into small increments, and, at each time point, the code would look around all the activities (or the event that it should act upon). This approach is very slow and wastes CPU cycles.

2. Event Oriented Paradigm

In this approach, rather than incrementing the time stamps as in the activity oriented paradigm, the time counter is advanced to the time of the next event. This approach saves the unnecessary CPU cycles spent in the activity oriented paradigm.

3. Process Oriented Paradigm

In this approach, each simulation activity is modeled as a process. This is the widely used approach in current state-of-the art DES systems. Two widely used open source process oriented packages are Simpy and C++Sim.

The DES framework we use in this article, SimPy, uses a process oriented paradigm, which we will discuss in detail in the following section.


SimPy is a process-based discrete-event simulation framework [2]. Processes in SimPy are implemented using generator functions. In our examples, we use processes to model the web server and concurrent users (one process per each concurrent user). SimPy also has shared resources, for example in our code, we use SimPy resources to model the queues of the servers.

Following are the major concepts of SimPy. We are not going to explain these in detail in this article. An interested reader can refer to the SimPy official documentation for more details.

1. Yield

A SimPy process can be yielded. When a process is yielded, the execution returns from the process for the given event, and returns. The process resumes upon the completion of the event. As a more concrete example, think about a web server which processes requests. Assume these requests have some processing time x. During this x time period, the process returns from its execution. Once the x time duration is completed, it will resume the execution. This provides a good overview of SimPy yield and some examples.

2. Timeout

Timeout is an event that gets executed after a certain delay has passed.

3. Process Interactions

SimPy processes can be used for process interactions. There are two widely used process interactions: 1. waiting for another process to finish and 2. interrupting another process.

4. Shared Resources

SimPy has resources that can be shared among other different resources. In our example code, the queue between the clients and the server is a shared resource called a Pipe.

The Closed System Model

The closed system model is a widely used model for performance testing of web servers. Figure 1 depicts the typical setup for a closed system model for performance testing. At a given concurrency level (which accounts for a number of clients), the workload generator (client) sends the requests to the server. If we specify a concurrency level of 100 users, the workload generating client spawns 100 concurrent threads. Each thread then sends a request to the server. Upon receiving a request, the server processes the request and sends the response back to the client. Now you should be wondering,“can the server process all 100 requests at the same time?” The answer is no. The server has an input queue that queues the requests. Then the server processes the request at the top of the queue and moves to the next. Upon receiving the response from the server, each workload generating client thread sends the next request. This model is called the closed system model.

JMeter is a widely used workload generating client for closed system model performance testing. It generates synthetic workloads. JMeter simulates the virtual users. Refer to this for a complete tutorial on Jmeter performance tests.

Figure 1: Closed System Model Setup


We use the setup depicted in Figure 1 as the model for our DES setup. Although we have a single machine as the workload generator, it represents a set of clients. Figure 2 shows our DES abstraction for the model shown in Figure 1.

Figure 2: DES abstraction

In our DES abstraction, we use N clients (which represent the N concurrent threads in workload generating client) and N cores. The server application runs in an N core machine, and we make the assumption that the server application can handle N number of requests concurrently. We use two queues: input queue and output queue. Each client adds a requests to the input queue. The requests get queued in the queue and each core takes the top most request and processes them. Once the processing is finished for a given request, it is added to the output queue. Then the response is received by the client. Upon receiving its response, each client sends the next request. This loop executes for a specified time period.

Next we will look at the Python SimPy implementation for the above abstraction. Figure 3 shows the code.

Figure 3: Single Server Python Code

You can specify the number of clients (concurrency) at line 11, and the number of cores at line 12. This program has two process methods; client and server.

Client Process

Line 15-24 shows the client process. This code shows the workflow of a single client (thread in workload generator). Parameter env is the Simpy environment in which the process runs. in_pipe is the input queue to the client (to which the server puts the responses). out_pipe is the output pipe, into which the client puts the requests. i is the client identity number.

In line 18, the client process generates a pseudo random number using exponential distribution, using the given processing rate (note that one over processing time is the processing rate). Then in line 19, it keeps track of the time the request is generated. The request is generated in the form of a dictionary, d, which contains the arrival time and the processing time generated previously. Then it puts this request to the out_pipe.

Then the client process waits until it gets the response to its request, as shown in line 22. Here, you should notice that all the responses for each client request is put to the shared in_pipe. Hence there should be a mechanism that will return only the relevant response (response corresponding to its own request) to the client. For that purpose, we use a special pipe as the in_pipe which has the type FilterStore. Using the lambda function in line 22, we get the relevant response (this lambda function checks whether the returned response has the client ID which is specified using i).

In lines 23 and 24, the client calculates and records the round trip response time.

Server Process

Lines 27-39 show the Python code for the server process. env has the same meaning as the client process’s env. in_pipe refers to the queue from which the server fetches the requests. out_pipe is the queue to which server puts the response. (You should note that the in_pipe of the client is the out_pipe of the server and vice versa).

In a never ending while loop, it checks for new requests as shown in line 31. Once it receives a request from the in_pipe queue (line 31), it extracts it using line 32 and 33. Line 34 calculates the time difference between starting the processing of the request and the request creation time. Line 36 records the length of the in_pipe queue. Then at line 38, it yields the processing time. That means the server consumes processing_time, i.e., the amount of time to process the request. Finally the processed request is put to the out_pipe.

Lines 44, 45 and 46 instantiate the env, in_pipe and out_pipe variables. Line 48 creates concurrency number of clients and starts them (which run in parallel). Lines 51 and 52 create num_cores number of server instances. Finally in line 54, we run the simulation for 1000 time steps.

We run this code for three different concurrency levels (100, 200, 500) and for each concurrency level, we vary the number of cores (1, 2, 4) and check the average latency, throughput and 99 percentile latency. Table 1 below shows the results.

Concurrency Number of Cores Average Latency (Time Steps) 99 Percentile Latency (Time Steps) Throughput (Request Per Time Step)
100 1 2496.148 3151.9094 39.35
100 2 1252.6918 1566.3316 80.25
100 4 624.48957 792.1198 159.88
200 2 4985.5102 5917.101204 39.35
200 2 2503.9160 2950.10 80.25
200 4 1248.598 1473.52731 159.88
500 1 12415.056 14242.556 39.35
500 2 6248.57563 6968.6093200 80.25
500 4 3118.48955 3471.48797 159.88

Table 1: Closed System DES Results

We make the following observations in this experiment.

First, we observe that when the level of concurrency increases, the average latency and 99 percentile latency increase. For example, when the number of cores is fixed at 4, when concurrency increases from 100 to 200, the average latency increases from 624 time steps to 1248 time steps. Also, note that the 99 percentile latency (which represents tail latency values), increases from 792 time steps to 1473 time steps. We explain this behavior as follows.

When the concurrency increases, the waiting time for each request increases. In a closed system model, the number of requests in the system is equal to the level of concurrency. Yet when the level of concurrency increases, the amount of time a request has to wait (till it gets scheduled) for processing increases. This leads to an increased average response time and 99 percentile latency.

Second, we identified that the level of concurrency does not impact the throughput. For example, the throughput remains constant at 39.35 when the level of concurrency is varied from 100 to 200 and 500 (while keeping the number of cores constant at 1). This behavior validates the theoretical proof of queueing theory, that in a closed system, throughput is independent of the level of concurrency, and depends only on the service rate [3].

Third, we observe that when the number of cores increases, average latency and 99 percentile latency decreases. For example, when the concurrency is fixed at 500, when the number of cores is increased from one to two, the average latency reduces from 12415 time steps to 6248 time steps, where as, the 99 percentile latency reduces from 14242 to 6968 time steps. This observation can be explained using the following argument.

When the number of cores increases, the requests that are queued in the server input queue, get scheduled faster. Hence the queue waiting times decrease significantly. Hence the response time (total round trip time) decreases.

Finally, we observe that when the number of cores increases, the throughput increases. For example, when the number of cores is increased from 1 to 4, the throughput increases from 39.5 to 159.8, when the concurrency is fixed at 500. We use the above argument to supplement this observation. When the number of cores is increased, the amount of work that are done in a given time step increases. Hence, the total work done (total number of requests processed), in a given time period increases.

Modeling Interservice Calls

In this section, we will see how we can extend the above code to support interservice calls. This is especially useful when you want to model a microservices architecture based system. Figure 4 below shows the DES abstraction for a system with one interservice call.

Figure 4: Interservice Calls DES abstraction

In this abstraction, we use two servers. As in the single server abstraction, the clients put requests into the input queue of the first server. After processing for sometime, server 1 puts the partially served request to the input queue of server 2. Server 2 then processes for completion and puts the response to server 2's output queue. Server 1 forwards this response back to the client.

But, in reality, when modeling this in SimPy, we face a technical limitation. A SimPy process cannot be modeled as true multithreading processes. Hence, we cannot make server 1 listen to the input queue of server 1 (left) and output queue of server 2 (right) at the same time. To deal with this technical limitation, we use the following approach.

We use half the cores of the middle server (in this example server 1) to process the requests and the other half of the cores to forward the traffic from server 2 to the client. That is, we divide the request accepting and processing part, and request forwarding parts into two. This allows us to model the interservice calls in DES.

Figure 5 shows the python code for interservice calls.

Figure 5: Interservice calls, Python code

The client process and the server process codes are similar to our previous python code, except for some minor changes. Here, we generate two pseudo random numbers in the request d, one for server 1 processing time and the other for server 2 processing time (Lines 14-15). We have divided the server 1 process into two methods, server_1_1 (for actual processing) and server_1_2 (for response forwarding). Then in lines 53-55, we allocate half of the cores for request processing and the other half for response forwarding. In this, we have presented the code for two interservice calls and three interservice calls configurations.

Next, lets see how the performance characteristics, average latency, throughput and 99 percentile vary with respect to concurrency and number of interservice calls. We fix the number of cores to 4. Table 2 below shows the results.

Concurrency Number of Interservice Calls Average Latency (Time Steps) Throughput (Requests per Time Step) 99 Percentile Latency (Time Steps)
100 0 6238.893 159.88 7992.514
100 1 12539.17644 79.3 16485.4686
100 2 12653.462 78.54 15825.8872
200 0 12442.118 159.88 14902.873
200 1 24913.124596 79.3 31621.465
200 2 25154.7919 78.54 29512.536
500 0 30830.13843 159.88 35722.7764
500 1 61063.3348 79.3 70531.2472
500 2 61774.7721 78.54 70094.39

Table 2: Interservice calls DES results

Using the above data, we make the following important observations.

First, we observe that when the number of inter service calls increases, the average latency and 99 percentile latency increase. For example, when the number of interservice calls increases from zero to one, the average latency increases from 6238 to 12539 time steps, when the concurrency is fixed at 100. We explain this behavior using the following explanation.

When the number of interservice calls increases, a given request has to stay at increasing the number of queues. For example, when there are no interservice calls, a given request has to reside on two queues, whereas when the number of interservice calls increases to one, this number of queues become four. This increases the total queue waiting time for a given request in the total round trip. This waiting time causes the average latency and 99 percentile latency to increase.

The percentage increase of average latency and 99 percentile is low when increasing from one interservice call to two interservice calls, compared to increasing from zero interservice calls to one interservice calls. For example, when the number of interservice calls increases from zero to one, while keeping the concurrency fixed at 200, the average response time increases from 12422 to 24913, whereas from one interservice call to two interservice calls, the response time increases from 24913 to 25154. This behaviour can be explained as follows.

In our implementation, we break the intermediate server code into two, the first half of the cores for actual processing of requests, and the other half of the cores for backward response forwarding. Hence the number of available cores for processing decreases from 4 to 2 when the number of interservice calls is increased from 0 to 1. When we increase the number of interservice calls from one to two, this effect is still there, yet the comparative percentage difference becomes small. Hence we do not observe a significant increase in the average latency and 99 percentile latency when we increase the number of interservice calls from one to two. This behavior is mainly due to our implementation method.

Finally we observe that when the number of interservice service calls increases, the throughput decreases. For example, when the number of interservice calls increases from zero to one, the throughput decreases from 159.88 to 79.3. This observation can also be explained using our previous argument: when the number of interservice calls increases, the queue waiting times increases significantly. This leads to an increased round trip time response times.


In this article, we discussed how we can model a closed model performance test using DES. We first discussed the basic concepts: DES and closed system performance testing. Second, we described the DES package, SimPy. Third, we presented the code for a single server setup. We then extended the simple version of a single server to multiple servers that has inter service calls (abstraction for micro services). The Python codes for all the methods are available here.

So, now you may have a question, does this really model a web server system performance test? And the answer is partially yes. Confused? What is “partially”?

Well, the web server systems are more complex than an input queue, output queue, and a processing element (server) abstraction. If you think about it, an incoming request to a web server first has to get accepted by the accepting listening socket. Upon acceptance, a new socket connection is opened for the new client. Then depending on the web server architecture (thread model, event based model, SEDA model), the way the request is handled changes. Also, depending on the architecture of the application server, how the request is processed for dynamic content changes. Hence, modeling a web server, exactly as it is, using DES, is a technically difficult task.

So what is the use of modeling a web server closed model performance test using this approach? Well, there is a notable advantage despite its limitations. We can model the general trends of performance characteristics (though we can’t get the exact performance numbers, we can get the patterns like how average latency changes when the processing rate is changed). That way, we can get a rough estimation of how the server will behave in the presence of different factors that we have not yet tested in real deployments.

So what’s next? We can use this approach to model more specific and realistic performance tests. We can use a logging mechanism to get the actual processing time distribution of a system and use it instead of pseudo random numbers. Also, using more sophisticated modeling, we can get closer to a real web server performance test by using queues and servers at multiple levels (network level, TCP level).




[3] Performance Modeling and Design of Computer Systems: Queueing Theory in Action; page 23