MicroStream vs. JPA: An Empirical Investigation

Included in the Proceedings of the 16th Symposium and Summer School On Service-Oriented Computing.

PrePrint @ ResearchGate

SourceCode @ GitHub: https://github.com/fullben/java-persistence-benchmark

This documentation consists of two parts. The first provides additional plots and data that could not be included in the paper due to space constraints. The second summarizes our performance investigation of a single transaction and the impact of different service implementations and database queries on the execution time on the server.

Set Up the Database

All CSV files for data evaluation are included in the latest release of our WSS application:

To set up the database connection, copy .env-example to .env and modify the environment variables. We host the data from our experiment on a machine called h13 in our own private VM cluster. The connection details are given below.

For example, to use our pre-configured database that stores the raw data, you can use the following configuration:

MINION_MONGODB_HOST=h13.pi.uni-bamberg.de
MINION_MONGODB_PORT=27016
MINION_MONGODB_USER=guest
MINION_MONGODB_PASSWORD=guest
MINION_MONGODB_SSL=TRUE

Then, run the following chunk to set the environment variables:

library(dotenv)  # provides load_dot_env()
load_dot_env(file = ".env")

Connect to and disconnect from the database

Afterwards, we can load all database-related functions into the project:

source("src/db.R")

Connect to the database (database netdata, collection netdata_metrics). Please make sure to adjust these values to match your database setup.

# database
database = "netdata"

# collection
collection = "netdata"

# connect
netdata_coll <- db.connect(database, collection)

Make sure that you disconnect after finishing your work:

db.disconnect(netdata_coll)

Set up your environment

During the execution of this R code, we will create temporary files and use the JMeter files generated during the benchmarks. By default, these files are stored in BASE_FOLDER.

BASE_FOLDER = "./raw/"

Finally, we can load all necessary functions for the later data processing and plotting.

source("src/processing.R")
source("src/ggplot2.R")

Data Analysis

CPU and Memory Utilization

In contrast to the Bookstore application, we also include some plots for all WSS implementations: JPA, MS-JACIS, and MS-Sync.

Transaction Times - Boxplots


The two figures show the same content from different perspectives: the first figure shows the request-response time, the second the processing time on the server. Interesting here is the overhead of middleware routing, transmission, etc., which adds up to roughly 80 ms on average per request across all transactions and implementations.

Transaction Processing Anomaly

As already stated in the paper, there was an anomaly when executing the delivery transaction for our JPA solution, depicted on the left in green.

After the first version of the paper, we carried out several additional investigations.

  1. We checked the fragmentation of the database and its indices. After defragmenting the database and re-executing the benchmark, nothing improved.
  2. We checked LAZY versus EAGER data retrieval from the database, with the same result: no improvement.
  3. We checked the implementation of our Spring Boot service class and the native SQL query again:

DeliveryService Implementation I - led to the aforementioned increase in execution time

public DeliveryResponse process(DeliveryRequest req) {
  // Find districts and carrier
  List<DistrictEntity> districts = districtRepository.findByWarehouseId(req.getWarehouseId());
  CarrierEntity carrier = carrierRepository.getById(req.getCarrierId());

  for (DistrictEntity district : districts) {

    OrderEntity order = orderRepository.findOldestUnfulfilledOrderOfDistrict(district.getId()).orElse(null);

    // …

    // Find all order items, set delivery date to now and sum amount
    for (OrderItemEntity orderItem : order.getItems()) {
      // …
    }
    // Save order and items
    order = orderRepository.save(order);

    // Update customer balance and delivery count
    CustomerEntity customer = order.getCustomer();
    // …
    customerRepository.save(customer);
  }
  return new DeliveryResponse(req);
}

All the *Repositories in the source code snippet are generated Spring Data JPA repositories. What is noteworthy here is the large number of database queries issued by our method, especially within the for loop. We call this the object-oriented way of programming here - we are still in Java and that is fine :)

Another important aspect is our custom query findOldestUnfulfilledOrderOfDistrict. Since we need some special SQL features, we implemented it as a native query. The SQL statement is as follows:

SELECT * 
FROM orders 
WHERE fulfilled = false AND district_id = :districtId 
ORDER BY entrydate ASC 
LIMIT 1
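
For context, such a native query is typically declared directly on the Spring Data repository interface via the @Query annotation. The following is only a minimal sketch of how this declaration could look; the interface name, entity ID type, and column mappings are assumptions and may differ from the actual project code:

import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

// Sketch only: entity ID type (String) and table/column names are assumptions
public interface OrderRepository extends JpaRepository<OrderEntity, String> {

  @Query(
      value =
          "SELECT * FROM orders "
              + "WHERE fulfilled = false AND district_id = :districtId "
              + "ORDER BY entrydate ASC LIMIT 1",
      nativeQuery = true)
  Optional<OrderEntity> findOldestUnfulfilledOrderOfDistrict(@Param("districtId") String districtId);
}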

It took a while to arrive at the first assumption: the ordering of the result set (which grows linearly over time, just like the execution time of the query) and the number of database queries might be the cause of our performance issues.

Based on these assumptions, we changed the DeliveryService and the native SQL Query.

DeliveryService Implementation II - reducing the number of queries and improving the SQL statement

public DeliveryResponse process(DeliveryRequest req) {
  // Find districts and carrier
  List<DistrictEntity> districts = districtRepository.findByWarehouseId(req.getWarehouseId());
  CarrierEntity carrier = carrierRepository.getById(req.getCarrierId());

  // Collect the district ids and fetch the oldest unfulfilled order of each district in one query
  List<String> districtIds = districts.stream().map(DistrictEntity::getId).collect(Collectors.toList());
  List<OrderEntity> orders = orderRepository.findOldestUnfulfilledOrderOfDistricts(districtIds);
  Map<String, CustomerEntity> customers = new HashMap<>();

  for (OrderEntity order : orders) {

    // …
    // Update customer balance and delivery count
    CustomerEntity customer = order.getCustomer();
    // …
    customers.putIfAbsent(customer.getId(), customer);
  }

  // Save - batch processing
  orderRepository.saveAll(orders);
  customerRepository.saveAll(customers.values());

  return new DeliveryResponse(req);
}

We call this the batch-oriented way. The idea here is to reduce the I/O time when requesting data from the database.

We additionally adapted the SQL query findOldestUnfulfilledOrderOfDistricts:

 SELECT *
 FROM orders
 WHERE (district_id, entrydate) IN
  (
   SELECT district_id, MIN(entrydate) AS entrydate
   FROM orders
   WHERE district_id IN (:districtIds) AND fulfilled = false
   GROUP BY district_id
  )

This is only a single query that determines the oldest unfulfilled order of each district by using a subquery and aggregate functions from the SQL standard.
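
For completeness, the list-based variant could be declared in a similar way; Spring Data (via Hibernate) expands a bound List parameter into the IN (:districtIds) clause at execution time. Again, this is only a sketch under the same assumptions as above, not the actual project code:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface OrderRepository extends JpaRepository<OrderEntity, String> {

  // Sketch only: the bound List<String> is expanded into the IN (...) list when the query runs
  @Query(
      value =
          "SELECT * FROM orders WHERE (district_id, entrydate) IN "
              + "(SELECT district_id, MIN(entrydate) AS entrydate FROM orders "
              + "WHERE district_id IN (:districtIds) AND fulfilled = false "
              + "GROUP BY district_id)",
      nativeQuery = true)
  List<OrderEntity> findOldestUnfulfilledOrderOfDistricts(@Param("districtIds") List<String> districtIds);
}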

The result was interesting:

We cannot definitively state that the ORDER BY clause caused this linear increase, but it is highly likely. Databases are good at filtering (which can be parallelized well) but worse at sorting. Additionally, in this case we reduced the number of queries from 32 to 5 (we executed our benchmark with 10 districts; in Implementation I this means 2 queries for districts and carrier plus 3 queries per district, while Implementation II always needs 5 queries) :)

DeliveryService Implementation III - Understanding the impact of additional queries

In a last step, we wanted to check what effect additional queries have on application performance. So we combined the object-oriented and the batch-oriented approach to some extent.

Situation for DeliveryService III:
  1 query to get all districts of the corresponding warehouse
  1 query to get the carrier delivering the orders
  N queries to get the oldest unfulfilled order for each district
  1 query to update all orders
  1 query to update all involved customers of the oldest orders

And the corresponding SQL query:

 SELECT *
 FROM orders
 WHERE (district_id, entrydate) IN
  (
   SELECT district_id, MIN(entrydate) AS entrydate
   FROM orders
   WHERE district_id = :districtId AND fulfilled = false
   GROUP BY district_id
  )
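
Since the Java code for this third variant is not shown above, the following is only a rough sketch of how the hybrid could look, reusing the entity and repository names from the earlier snippets; the per-district finder is assumed to be backed by the subquery shown above, and all details are assumptions rather than the actual project code:

public DeliveryResponse process(DeliveryRequest req) {
  // 1 query: all districts of the warehouse
  List<DistrictEntity> districts = districtRepository.findByWarehouseId(req.getWarehouseId());
  // 1 query: the carrier delivering the orders
  CarrierEntity carrier = carrierRepository.getById(req.getCarrierId());

  List<OrderEntity> orders = new ArrayList<>();
  Map<String, CustomerEntity> customers = new HashMap<>();

  for (DistrictEntity district : districts) {
    // N queries: one per district, each returning that district's oldest unfulfilled order
    OrderEntity order =
        orderRepository.findOldestUnfulfilledOrderOfDistrict(district.getId()).orElse(null);
    if (order == null) {
      continue;
    }
    // … set carrier and delivery date, sum the order item amounts as in Implementation I
    orders.add(order);

    // Update customer balance and delivery count (saved in a batch below)
    CustomerEntity customer = order.getCustomer();
    // …
    customers.putIfAbsent(customer.getId(), customer);
  }

  // 1 query: update all orders in a batch
  orderRepository.saveAll(orders);
  // 1 query: update all involved customers in a batch
  customerRepository.saveAll(customers.values());

  return new DeliveryResponse(req);
}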

And the result:

What we see is that the additional 9 queries in our case lead to an average increase of 13.5 ms, or, relatively speaking, 22%. So reducing the number of queries, despite their higher complexity, is worth the effort! :)