Included in the Proceedings of the 16th Symposium and Summer School On Service-Oriented Computing.
PrePrint @ ResearchGate
SourceCode @ GitHub: https://github.com/fullben/java-persistence-benchmark
This documentation consists of two parts. The first provides additional plots and data that are not available in the paper due to space constraints. The second summarizes our performance investigation for a single transaction and the impact of different service implementations and database queries on server-side execution time.
All CSV files for data evaluation are included in the latest release of our WSS application.
To set up the database connection, copy .env-example to .env and modify the environment variables. We are hosting the data from our experiment on a machine called h13 in our own private VM cluster. The connection information follows. For example, to use our pre-configured database that stores the raw data, you can use the following configuration:
MINION_MONGODB_HOST=h13.pi.uni-bamberg.de
MINION_MONGODB_PORT=27016
MINION_MONGODB_USER=guest
MINION_MONGODB_PASSWORD=guest
MINION_MONGODB_SSL=TRUE
Then, run the following chunk to set the environment variables:
library(dotenv) # assumption: load_dot_env() is provided by the dotenv package
load_dot_env(file = ".env")
Afterwards, we can load all database-related functions into the project:
source("src/db.R")
Connect to the database (database netdata, collection netdata_metrics). Please make sure you adjust these values according to your database setup.
# database
database = "netdata"
# collection
collection = "netdata"
# connect
netdata_coll <- db.connect(database, collection)
Make sure that you disconnect after finishing your work:
db.disconnect(netdata_coll)
During the execution of this R code, we will create temporary files and use the JMeter files generated during the benchmarks. By default, these files will be stored in BASE_FOLDER.
BASE_FOLDER <- "./raw/"
Finally, we can load all necessary functions for the later data processing and plotting.
source("src/processing.R")
source("src/ggplot2.R")
In contrast to the Bookstore application, we also include some plots for all WSS implementations: JPA, MS-JACIS, and MS-Sync.
The two figures show the same content: the first shows the request-response time, the second the processing time on the server. Interesting here is the overhead of middleware routing, transmission, etc., which sums up to ~80 ms on average for a single request, across all transactions and implementations.
As already stated in the paper, there was an anomaly when executing the delivery transaction for our JPA solution, depicted on the left in green.
After the first version of the paper, we conducted several investigations. The original implementation of the DeliveryService looks as follows:
public DeliveryResponse process(DeliveryRequest req) {
  // Find districts and carrier
  List<DistrictEntity> districts = districtRepository.findByWarehouseId(req.getWarehouseId());
  CarrierEntity carrier = carrierRepository.getById(req.getCarrierId());
  for (DistrictEntity district : districts) {
    OrderEntity order =
        orderRepository.findOldestUnfulfilledOrderOfDistrict(district.getId()).orElse(null);
    // …
    // Find all order items, set delivery date to now and sum amount
    for (OrderItemEntity orderItem : order.getItems()) {
      // …
    }
    // Save order and items
    order = orderRepository.save(order);
    // Update customer balance and delivery count
    CustomerEntity customer = order.getCustomer();
    // …
    customerRepository.save(customer);
  }
  return new DeliveryResponse(req);
}
All the *Repositories in the source code snippet are generated Spring Data JpaRepositories. What is noteworthy here are the many database queries issued by our method, especially within the for loop. We call this the object-oriented way of programming here - we are still in Java and that is fine :)
Another important aspect is our custom query findOldestUnfulfilledOrderOfDistrict. Since we need some special SQL features, we implemented it as a native query. The SQL statement is as follows:
SELECT *
FROM orders
WHERE fulfilled = false AND district_id = :districtId
ORDER BY entrydate ASC
LIMIT 1
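For illustration, a repository declaration for this query could look roughly like the following sketch (the exact signature, the String ID type, and the imports are assumptions rather than an excerpt from the benchmark sources):
import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface OrderRepository extends JpaRepository<OrderEntity, String> {

  // Oldest unfulfilled order of a single district; called once per district in the loop above
  @Query(
      value =
          "SELECT * FROM orders "
              + "WHERE fulfilled = false AND district_id = :districtId "
              + "ORDER BY entrydate ASC LIMIT 1",
      nativeQuery = true)
  Optional<OrderEntity> findOldestUnfulfilledOrderOfDistrict(@Param("districtId") String districtId);
}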
It took a while to arrive at a first assumption: the ordering of the result set (which grows linearly over time, as does the execution time of the query) and the number of database queries might be the cause of our performance issues.
Based on these assumptions, we changed the DeliveryService and the native SQL query.
public DeliveryResponse process(DeliveryRequest req) {
  // Find districts and carrier
  List<DistrictEntity> districts = districtRepository.findByWarehouseId(req.getWarehouseId());
  CarrierEntity carrier = carrierRepository.getById(req.getCarrierId());
  List<String> districtIds = …;
  List<OrderEntity> orders = orderRepository.findOldestUnfulfilledOrderOfDistricts(districtIds);
  Map<String, CustomerEntity> customers = new HashMap<>();
  for (OrderEntity order : orders) {
    // …
    // Update customer balance and delivery count
    CustomerEntity customer = order.getCustomer();
    // …
    customers.putIfAbsent(customer.getId(), customer);
  }
  // Save - batch processing
  orderRepository.saveAll(orders);
  customerRepository.saveAll(customers.values());
  return new DeliveryResponse(req);
}
We call this the batch-oriented way. The idea here is to reduce the I/O time when requesting data from the database.
We additionally adapted the SQL query findOldestUnfulfilledOrderOfDistricts:
SELECT *
FROM orders
WHERE (district_id, entrydate) IN
  (SELECT district_id, MIN(entrydate) AS entrydate
   FROM orders
   WHERE district_id IN (:districtIds) AND fulfilled = false
   GROUP BY district_id)
This is only one query, which determines the oldest order of each district by using subqueries and aggregate functions from the SQL standard.
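A corresponding repository declaration could look roughly like this sketch (again an assumption rather than an excerpt; it relies on Spring Data JPA expanding the collection parameter into the IN clause of the native query):
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface OrderRepository extends JpaRepository<OrderEntity, String> {

  // One round trip for all districts instead of one query per district
  @Query(
      value =
          "SELECT * FROM orders WHERE (district_id, entrydate) IN "
              + "(SELECT district_id, MIN(entrydate) AS entrydate FROM orders "
              + "WHERE district_id IN (:districtIds) AND fulfilled = false "
              + "GROUP BY district_id)",
      nativeQuery = true)
  List<OrderEntity> findOldestUnfulfilledOrderOfDistricts(@Param("districtIds") List<String> districtIds);
}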
The result was interesting:
We cannot definitively state that the ORDER BY SQL statement caused this linearly increasing execution time, but it is highly likely. Databases are good at filtering (which can be parallelized well) but bad at sorting. Additionally, in this case we reduced the number of queries (we executed our benchmark with 10 districts) from 32 to 5 :) In the object-oriented variant, 1 district query, 1 carrier query, 10 oldest-order lookups, 10 order saves, and 10 customer saves add up to 32 queries, whereas the batch-oriented variant needs only 1 + 1 + 1 + 1 + 1 = 5.
In a last step, we wanted to check what effect additional queries have on the application performance. So we combined the object-oriented and the batch-oriented way to some extent (a sketch of this hybrid variant follows the SQL query below).
Query situation for DeliveryService III:
- 1 query for getting all districts of the corresponding warehouse
- 1 query for getting the carrier to deliver the order
- N queries to get the oldest order, one for each district
- 1 query to update all orders
- 1 query to update all involved customers of the oldest orders
And the corresponding SQL query, executed once per district:
SELECT *
FROM orders
WHERE (district_id, entrydate) IN
  (SELECT district_id, MIN(entrydate) AS entrydate
   FROM orders
   WHERE district_id = :districtId AND fulfilled = false
   GROUP BY district_id)
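A sketch of how this hybrid process method could look, reusing the per-district lookup from the first variant and the batch saves from the second (an illustration under these assumptions, not a verbatim excerpt from the benchmark sources):
public DeliveryResponse process(DeliveryRequest req) {
  // Find districts and carrier (1 query each)
  List<DistrictEntity> districts = districtRepository.findByWarehouseId(req.getWarehouseId());
  CarrierEntity carrier = carrierRepository.getById(req.getCarrierId());

  List<OrderEntity> orders = new ArrayList<>();
  Map<String, CustomerEntity> customers = new HashMap<>();
  // N queries: one oldest-order lookup per district (object-oriented part)
  for (DistrictEntity district : districts) {
    OrderEntity order =
        orderRepository.findOldestUnfulfilledOrderOfDistrict(district.getId()).orElse(null);
    if (order == null) {
      continue;
    }
    // … update the order, its items, and the customer balance as before …
    orders.add(order);
    customers.putIfAbsent(order.getCustomer().getId(), order.getCustomer());
  }

  // 2 queries: batch updates for all orders and all involved customers (batch-oriented part)
  orderRepository.saveAll(orders);
  customerRepository.saveAll(customers.values());
  return new DeliveryResponse(req);
}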
And the result:
What we see is that the additional 9 queries in our case lead to an average increase of 13.5 ms, or 22% in relative terms. So reducing the number of queries, despite their higher complexity, is worth the effort! :)