About Henriette Harmse

I am a software architect with 20 years of experience as a software developer, architect and consultant in a variety of industries (e.g. financial, healthcare, media and mining). I have a PhD in Artificial Intelligence/Data Science. Currently I am working at EMBL-EBI, where I am leading the development of their suite of Ontology Tools.

Using Jena and SHACL to validate RDF Data

RDF enables users to capture data in a way that is intuitive to them. This means that data is often captured without conforming to any schema. It is nevertheless often useful to know that an RDF dataset conforms to some (potentially partial) schema. This is where SHACL (Shapes Constraint Language), a W3C standard, comes into play. It is a language for describing and validating RDF graphs. In this post I will give a brief overview of how to use SHACL to validate RDF data using the Jena implementation of SHACL.

A SHACL Example

We will use an example from the SHACL specification. Assume we have a file person.ttl that contains the following data:

[Figure: Example RDF data]

To validate this data we create a shape definition in personShape.ttl containing:

[Figure: Person shape definition]

A Code Example using Jena

To validate our RDF data using our SHACL shape we will use the Jena implementation of SHACL. Start by adding the SHACL dependency to your Maven pom.xml. Note that you do not need to add Jena separately, since the SHACL POM already includes Jena as a dependency.

[Figure: SHACL Maven dependency]

In the code we will assume the person.ttl and personShape.ttl files are in $Project/src/main/resources/. The code for doing the validation is then as follows:

[Figure: Java code using Jena implementation of SHACL]

Running the Code

Running the code will cause a report.ttl file to be written out to $Project/src/main/resources/. We can determine that our data does not conform by checking the sh:conforms property, which is false in the report. We have 4 violations of our ex:PersonShape:

  1. For ex:Alice the ex:ssn property does not conform to the pattern defined in the shape.
  2. ex:Bob has 2 ex:ssn values, where the shape allows at most one.
  3. ex:Calvin works for a company that is not of type ex:Company.
  4. ex:Calvin has a property ex:birthDate that is not allowed by ex:PersonShape, since the shape is closed by sh:closed true.
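
To make the first violation in the list above concrete, here is a minimal plain-Java sketch of the kind of check that sh:pattern performs. It does not use Jena or SHACL at all. The regular expression and the two sample values are assumptions taken from the person example in the SHACL specification, and the class name SsnPatternCheck is hypothetical.

```java
import java.util.regex.Pattern;

public class SsnPatternCheck {

    // Assumed pattern, as used in the SHACL specification's person example.
    private static final Pattern SSN = Pattern.compile("^\\d{3}-\\d{2}-\\d{4}$");

    // sh:pattern follows SPARQL REGEX semantics, which search for a match
    // anywhere in the value. find() mirrors that, and the anchors ^ and $
    // are what force the whole value to have the expected form.
    public static boolean conforms(String ssn) {
        return SSN.matcher(ssn).find();
    }

    public static void main(String[] args) {
        System.out.println(conforms("987-65-432A")); // false: trailing letter
        System.out.println(conforms("123-45-6789")); // true
    }
}
```

A real SHACL validator reports such a failure as a validation result in the report rather than as a boolean, but the underlying test on the literal value is of this kind.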

A corrected version of our person data may look as follows:

[Figure: Person data that conforms to our person shape]

Conclusion

In this post I have given a brief overview of how SHACL can be used to validate RDF data using the SHACL implementation of Jena. This code example is available at shacl tutorial.

Why does the OWL Reasoner ignore my Constraint?

A frustrating problem often encountered by people with experience in relational databases, when they are introduced to OWL ontologies, is that OWL reasoners seem to ignore constraints. In this post I give examples of this problem, explain why it happens and provide ways to deal with each example.

An Example

A typical example encountered in relational databases is that of modeling orders with orderlines, which can be modeled via Orders and Orderlines tables where the Orderlines table has a foreign key constraint to the Orders table. A related OWL ontology is given in Figure 1. As expected, it defines Order and Orderline classes with a hasOrder object property. That individuals of Orderline are necessarily associated with exactly one order is enforced by Orderline being a subclass of hasOrder exactly 1 owl:Thing.

[Figure 1: Order ontology]

Two Problems

Given the Order ontology, two frustrating and surprising behaviours are: (1) if an Orderline individual is created for which no associated Order individual exists, the reasoner will not report an inconsistency, and (2) if an Orderline individual is created for which two or more Order individuals exist, the reasoner will also not report an inconsistency.

Missing Association Problem

Say we create an individual orderline123 of type Orderline that is not associated with an individual of type Order. In this case the reasoner will not report an inconsistency. The reason for this is the open world assumption. Informally it means that the only inferences the reasoner can make from an ontology are based on information explicitly stated in the ontology, or on what can be derived from explicitly stated information. The absence of a statement is not treated as evidence that the statement is false.

When you state orderline123 is an Orderline, there is no explicit information in the ontology that states that orderline123 is not associated with an individual of Order via the hasOrder property. To make explicit that orderline123 is not in such a relation, you have to define orderline123 as in Figure 2. hasOrder max 0 owl:Thing states that it is known that orderline123 is not associated with any individual via the hasOrder property. With this definition the reasoner will find the ontology inconsistent.

[Figure 2: orderline123 is not in hasOrder association]

Too Many Associated Individuals Problem

Assume we now change the definition of our orderline123 individual to be associated via hasOrder with two individuals of Order, as shown in Figure 3. Again, frustratingly, the reasoner does not find that the ontology is inconsistent. The reason for this is that OWL does not make the unique name assumption. This means that individuals with different names can be assumed by the reasoner to represent a single individual. Here the reasoner will in fact infer that order1 and order2 are the same individual. To force the reasoner to see order1 and order2 as necessarily different, you can state that order1 is different from order2 by adding DifferentFrom: order2 to order1 (or similarly for order2).

[Figure 3: orderline123 has two orders]
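
The outcomes of the two problems above can be mirrored in a small plain-Java toy model. This is only a sketch of the decision, not of how an OWL reasoner works internally, and it uses no OWL library. The class OpenWorldSketch, the method isInconsistent and its parameters are all hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.Set;

public class OpenWorldSketch {

    /**
     * Toy check of "hasOrder exactly 1 owl:Thing" for one Orderline individual.
     *
     * orders          individuals asserted as hasOrder values
     * differentFrom   pairs of individuals asserted to be different
     * assertedMaxZero true if "hasOrder max 0 owl:Thing" is asserted
     */
    public static boolean isInconsistent(List<String> orders,
                                         Set<Set<String>> differentFrom,
                                         boolean assertedMaxZero) {
        // Open world: having no asserted order only means the order is unknown.
        // A clash needs the explicit statement that there is no order at all.
        if (assertedMaxZero) {
            return true;
        }
        // No unique name assumption: two names may denote one individual,
        // so a clash needs two orders that are asserted to be different.
        for (int i = 0; i < orders.size(); i++) {
            for (int j = i + 1; j < orders.size(); j++) {
                String a = orders.get(i);
                String b = orders.get(j);
                if (!a.equals(b) && differentFrom.contains(Set.of(a, b))) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Set<String>> none = Set.of();
        Set<Set<String>> distinct = Set.of(Set.of("order1", "order2"));
        List<String> two = List.of("order1", "order2");

        System.out.println(isInconsistent(List.of(), none, false)); // false
        System.out.println(isInconsistent(List.of(), none, true));  // true
        System.out.println(isInconsistent(two, none, false));       // false
        System.out.println(isInconsistent(two, distinct, false));   // true
    }
}
```

The first and third calls correspond to the two surprising cases, and the second and fourth to the fixes described above.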

Constraint Checking versus Deriving Inferences

The source of the problems described here is the difference between the purposes of a relational database and an OWL reasoner. The main purpose of a relational database is to enable view and edit access to the data in such a way that the integrity of the data is maintained. A relational database will ensure that the data adheres to the constraints of its schema, but it cannot make any claims beyond what is stated by the data it contains. The main purpose of an OWL reasoner is to derive inferences from statements and facts. As an example, from the statement Class: Dog SubclassOf: Animal and the fact Individual: pluto Type: Dog it can be derived that pluto is an Animal, even though the ontology nowhere states explicitly that pluto is an Animal.

Conclusion

Many newcomers to OWL ontologies get tripped up by the difference in purpose of relational databases and OWL ontologies. In this post I explained these pitfalls and how to deal with them.

If you have an ontology modeling problem, you are welcome to leave a comment detailing the problem.

Risk Based Testing

A question that is asked regularly in testing circles is: “When should you stop testing?”. Proponents of code coverage tools may suggest that you can stop testing once some percentage of coverage is achieved. However, what do you do when the budget is severely constrained? An even more difficult situation to address is when a project starts out with a given budget, but during its lifetime the budget gets significantly reduced (e.g. due to an economic downturn). This forces us to provide the highest level of quality for the least amount of money. In this post I will explain how risk based testing can help to answer this question.

Motivation

A mistake that is often made is that the testing effort is distributed equally across the system – both critical and non-critical parts of the system are tested equally. This results in critical parts of the system not being tested sufficiently and non-critical parts being tested to the point of diminishing returns.

A further mistaken mindset of developers is that there is no such thing as a useless test. This may entice developers to add tests for the sake of adding tests. In actual fact every test (unit, integration or system test) has to earn its place in the codebase. If there is no good motivation for a test, the test must be deleted from the codebase. Why is that? Because every test adds to the volume of code that developers have to master and maintain to be productive members of the team. As such, tests add to the overall cost of maintenance of a system. The most cost effective code to maintain is the code that has never been written.

Risk based testing is a testing approach that helps developers and testers to prioritize the testing effort by assigning a relative risk to each software component.

Calculating the Relative Risk of a Software Component

The relative risk of a software component is based on the product of the relative risks assigned to the criteria deemed relevant for the project. Criteria for determining risk will be discussed in the next section. For each criterion a value between 1 and X is assigned, where 1 indicates that the relative risk of the criterion is insignificant and X indicates that the relative risk of the criterion is critical for the given software component.

Assuming we have software components S_1 … S_m and criteria C_1 … C_n, and we assign relative risks A_{i,1} … A_{i,n} for the criteria of software component S_i, the relative risk R_i of each software component is given by:

R_1 = A_{1,1} * … * A_{1,n}

…

R_m = A_{m,1} * … * A_{m,n}

Ordering R_1 … R_m in descending order of their values will cause the software components with the highest relative risk to be at the top of the list. Note that the calculated relative risk for a specific software component by itself is meaningless. The calculated risk of a component is only meaningful if it can be compared to the calculated risk of another component calculated in the same way (that is, using the same criteria and value for X). That is why we refer to the relative risk of a software component.

What value should you choose for X? In general you should choose X to be equal to or greater than the number of criteria you will be using in calculating the risk. If many of the calculated relative risks R_i have the same value, it means you have to increase the value of X.
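
The calculation and ordering described in this section can be sketched in a few lines of Java. This is a minimal illustration only: the class RelativeRisk, its methods, the component names and all scores are made up for the example and do not come from any table in this post. It uses three criteria with X = 3.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RelativeRisk {

    // R_i = A_{i,1} * ... * A_{i,n}, where every score must lie in 1..x.
    public static long risk(int[] scores, int x) {
        long product = 1;
        for (int a : scores) {
            if (a < 1 || a > x) {
                throw new IllegalArgumentException("score " + a + " is outside 1.." + x);
            }
            product *= a;
        }
        return product;
    }

    // Returns the component names ordered by descending relative risk.
    public static List<String> rank(Map<String, int[]> components, int x) {
        return components.entrySet().stream()
                .sorted(Comparator.comparingLong(
                        (Map.Entry<String, int[]> e) -> risk(e.getValue(), x)).reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical scores for three criteria per component.
        Map<String, int[]> components = new LinkedHashMap<>();
        components.put("componentA", new int[] {1, 2, 1}); // risk 2
        components.put("componentB", new int[] {3, 3, 2}); // risk 18
        components.put("componentC", new int[] {2, 1, 3}); // risk 6

        System.out.println(rank(components, 3)); // [componentB, componentC, componentA]
    }
}
```

The individual numbers 2, 18 and 6 carry no meaning on their own. Only the resulting order matters, which is why all components must be scored with the same criteria and the same X.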

What Criteria should be Used?

In this section we give examples of criteria that can be used in calculating the risk of each software component. However, these criteria merely serve as examples. You are free to choose whatever criteria are deemed important for your project.

Some general criteria that you may want to consider using are:

  • Frequency of use: How frequently is this software component used? The more often a software component is used, the higher is its potential for having an adverse effect when it fails.
  • Cost: What will be the cost of this software component failing? The cost can be monetary, but it does not have to be. It can be cost in terms of reputation, loss of clients, or whatever factor that can hurt the business. Here it may be a good idea to get the input of the project sponsor or project owner.
  • Complexity: What is the complexity of this software component? If this software component forms part of a greenfield project, this criterion can express the complexity of this software component in comparison with other newly developed components. In the case where enhancements are made to an existing system, code analysis tools can be used to determine the complexity of software components that need to be changed. For example, when an enhancement requires that 2 classes need to be changed substantially, the class with the higher cyclomatic complexity (the number of linearly independent paths through the code) represents a higher risk.
  • Frequency of change: How often will the software component be changed? If a software component realizes the core value proposition of a business, it is likely that this component will constantly need to be enhanced as the business is trying to stay competitive. If there is little chance that a software component is going to be changed after it has been completed, the tests for that software component may be of limited value in the long term.
  • Number of bugs: For a system that is already in production, the number of bugs that are logged (that can be related back to specific software components) can help give an indication as to where testing effort should be focused.

An Example

As an example, let us assume we are working on a legacy online share trading system, naturally with zero automated tests. As with most legacy systems, a large number of users depend on the system. A steady stream of user requests for enhancements and bugs logged require the system to be frequently updated. These changes often introduce new bugs.

We further assume the system consists of the following modules:

  • A user management, authentication and authorization module through which users and their permissions are managed. It is very seldom that any changes are required to this module.
  • A share price notification engine of which the main purpose is to notify users in real time of changes in share prices. The correct functioning of this engine is critical to the business since it enables users to buy/sell shares as and when needed. Any failure of the engine could mean that users cannot buy/sell shares as needed, which could result in massive losses. However, since the initial problems with the engine have been resolved, bugs are seldom logged for this module.
  • The account management system keeps track of the funds of users as they trade, calculates daily interest earned or charged (a positive balance earns interest and for a negative balance interest is charged) and generates monthly statements of transactions.
  • The mark-to-market process runs each day after the stock exchange has closed to calculate the gains/losses for each account. There are frequent changes to this process as quants are forever fine-tuning the way commission is calculated on trades. Most of the bugs logged are related to this module.

To improve the situation, management wants the development team to add automated tests. How should we approach testing the system?

[Table 1]

If we apply risk based testing as seen in Table 1, it is clear that we need to focus our testing efforts on the mark-to-market process. Let us further assume the mark-to-market process consists of the following steps for each account:

  1. determine prices at which shares have been bought,
  2. determine current prices of shares held,
  3. calculate the profit/loss, and
  4. update the account balance.

A detailed plan of how to approach testing of the mark-to-market process can be drawn up (see Table 2), but instead of listing modules, it lists the software components that the mark-to-market process consists of. These can for example be classes, methods, functions, scripts etc. Table 2 indicates that we need to particularly focus our attention on the calculateProfitLoss software component.

[Table 2]

Advantages/Disadvantages of Risk Based Testing

The advantages of risk based testing are:

  • The highest risk items can be developed and tested first which reduces the overall risk on the project.
  • If testing has to be watered down, a guideline exists for deciding what to test and what not to test.
  • At any given time an indication can be given of the risk of the project by considering the highest risk use cases that have not been tested.
  • It is a valuable tool for communicating risk to management, project sponsors and project owners.

The potential disadvantages of risk based testing are:

  • With a risk based testing approach many parts of the system will go untested. However, this will be an intentional decision rather than an accidental one.
  • The relative risk values that are assigned for each criterion of a software component are likely to be subjective. That means that another person may assign a different relative risk value for a criterion of a software component. However, they are just as subjective as, for example, story points in agile methodology. As with story points, it makes sense to assign relative risk values as part of a team discussion.
  • Some teams feel risk based testing adds to their documentation load. That is, they now have to draw up a complete risk profile before they can start testing. Do I always draw up risk profiles for all my projects? Honestly? No. What I will do is to have a discussion with the team regarding what areas of the system we need to test to death and which areas we can skimp on. If there is some disagreement, then I may write it out. Writing it out is very useful when the testing approach needs to be discussed with management, the project sponsor or the project owner.

Conclusion

This post explained how risk based testing can help to ensure that the testing effort is focused where it will bring the most business value.