What are (Q)SARs?

(Q)SARs ((Quantitative) Structure-Activity Relationships) are models that describe the relationship between the structures of chemical substances and their properties, and use that relationship to predict the properties of other substances.

This could for example be a physico-chemical property or a biological activity, including the ability to cause different types of toxic effects. The models range from simple mathematical equations to advanced 3D computer models.


Over a hundred years ago, a group of researchers discovered a link between the toxicity of small organic chemicals and the way in which they become distributed between oil and water. They also discovered that this link could be described mathematically, thus inventing the first (Q)SAR model. The development of modern (Q)SAR models has increased steadily over the past forty years.

The availability of ever greater and faster computing power has opened up opportunities to develop increasingly sophisticated models. Today, (Q)SARs are used within toxicology and ecotoxicology, with the aim of identifying the harmful effects of chemicals on humans and the environment.

(Q)SAR stands for (Quantitative) Structure-Activity Relationships, i.e. the link between the chemical structure of a substance and its activity. The Q in parentheses indicates that the prediction can be either quantitative (e.g. what dose or concentration is needed before the substance causes an effect) or qualitative (e.g. is the substance carcinogenic: yes or no).

Chemical structures that are similar can also have the same type of effect

The basic hypothesis for the models is that chemical substances that are similar will have the same types of properties. This enables the properties of chemicals to be predicted where no experimental testing data is available, thus reducing the number of animal studies that are necessary in the assessment of chemical substances. Similarly, it can increase the amount of information available concerning a given substance (including information on metabolites/degradation products), thereby saving industry and authorities time and money.

Structure of (Q)SAR models

The training set for the models

All (Q)SAR models are constructed on the basis of a training set. This consists of a number of chemical substances and associated test data for a given effect (e.g. the mortality of fish or cancer in rats). Other descriptors of the chemical substances (such as the distribution coefficient of the substance between octanol and water (log Kow), water solubility, etc.) are often also included. The training set thus supplies the test data for the property that is to be predicted. A mathematical model fitted to this data can then be used to predict the effects of other chemicals that have not been tested.
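As a minimal sketch of how a training set becomes a model, the example below fits a straight line relating log Kow to a toxicity value and then uses it to estimate the toxicity of an untested substance. The data values, the function names and the choice of a one-descriptor linear model are assumptions made purely for illustration; real (Q)SAR models are typically built from far larger training sets and more descriptors.

```python
# Hypothetical training set: (log Kow, measured toxicity expressed as log(1/LC50)).
# The numbers are invented for illustration and are not real test data.
TRAINING_SET = [
    (1.0, 2.1),
    (2.0, 2.9),
    (3.0, 4.2),
    (4.0, 4.8),
]


def fit_linear_qsar(training_set):
    """Fit toxicity = slope * log_kow + intercept by ordinary least squares."""
    n = len(training_set)
    mean_x = sum(x for x, _ in training_set) / n
    mean_y = sum(y for _, y in training_set) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in training_set)
    sxx = sum((x - mean_x) ** 2 for x, _ in training_set)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, log_kow):
    """Estimate toxicity for a substance with the given log Kow."""
    slope, intercept = model
    return slope * log_kow + intercept


model = fit_linear_qsar(TRAINING_SET)
# Estimate for an untested (hypothetical) substance with log Kow = 2.5.
estimate = predict(model, 2.5)
```

The estimate is only meaningful for substances that resemble those in the training set.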

Global and local (Q)SAR models

As a general rule, a (Q)SAR model can only provide credible predictions for substances that to some extent resemble the substances that are included in the training set. Models that are designed to provide predictions for a narrow group of substances with similar chemical structures are known as local models.
Models that are designed to provide predictions for a large number of substances with widely varying chemical structures are called global models.

The applicability domain for the models

The specification of an area of validity ("applicability domain") is a cornerstone in the use of (Q)SARs. It is used to assess whether a (Q)SAR model can provide reliable predictions for a given chemical.

The applicability domain can be subdivided into a structural domain and a descriptor domain, which together delimit the area within which the model can provide reliable predictions. For example, a model could have a structural domain that covers "aliphatic amines" and a descriptor domain that requires log Kow to be between 1 and 6.
If we attempted to derive an estimate for an aliphatic amine with a log Kow of 7, the substance would fall outside the model's descriptor domain and the prediction would therefore be uncertain.
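The aliphatic amine example can be sketched as a simple domain check. The class name, its fields and the idea of passing the structural class as a ready-made label are assumptions for illustration; in practice the structural class would have to be derived from the chemical structure itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicabilityDomain:
    """Hypothetical domain: a set of structural classes plus a log Kow range."""
    structural_classes: frozenset
    log_kow_min: float
    log_kow_max: float

    def check(self, structural_class, log_kow):
        """Return (in_structural_domain, in_descriptor_domain)."""
        in_structural = structural_class in self.structural_classes
        in_descriptor = self.log_kow_min <= log_kow <= self.log_kow_max
        return in_structural, in_descriptor

    def contains(self, structural_class, log_kow):
        """A substance is in the domain only if both sub-domains are satisfied."""
        return all(self.check(structural_class, log_kow))


# Domain from the example in the text: aliphatic amines with log Kow from 1 to 6.
domain = ApplicabilityDomain(frozenset({"aliphatic amine"}), 1.0, 6.0)

inside = domain.contains("aliphatic amine", 3.0)
# Aliphatic amine with log Kow 7: structural domain met, descriptor domain not.
outside_flags = domain.check("aliphatic amine", 7.0)
```

Reporting the two sub-domains separately makes it clear why a prediction is flagged as uncertain.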

For more complicated models, specifying the applicability domain will often be equally complicated, as the models can have different descriptors with different weighting. For this type of model, the domain specification can be built into the computer version of the (Q)SAR model and be specified as, for example, the probability that a given prediction lies within the model's domain.

The precision of (Q)SAR models

(Q)SAR models are assessed according to how good they are at predicting a given property (e.g. bioaccumulation in fish), with a distinction being made between internal performance and external performance.

  • Internal performance:
    Internal performance is specified using goodness-of-fit, which is an indication of how well the model takes into consideration the variation in the training set, and robustness, which is an indication of the model's stability (e.g. how much the model's predictions are affected by removing a chemical from the training set).
  • External performance:
    External performance is measured using three different expressions for the model's predictivity ("power of prediction"), i.e. concordance, sensitivity and specificity.
    Concordance is the proportion of all the model's predictions that are correct and is therefore an overall indicator of the model's accuracy.
    Sensitivity is the proportion of substances that have an effect which the model correctly predicts as having an effect.
    Specificity is the proportion of substances that do not have an effect which the model correctly predicts as not having an effect.
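The three external performance measures can be computed by comparing predictions with test results for substances that were not in the training set. The sketch below assumes a yes/no (qualitative) model; the function name and the example data are hypothetical.

```python
def external_performance(observed, predicted):
    """Return concordance, sensitivity and specificity for yes/no predictions.

    observed and predicted are equal-length sequences of booleans, where True
    means "the substance has the effect".
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")

    pairs = list(zip(observed, predicted))
    true_pos = sum(1 for o, p in pairs if o and p)
    true_neg = sum(1 for o, p in pairs if not o and not p)
    false_pos = sum(1 for o, p in pairs if not o and p)
    false_neg = sum(1 for o, p in pairs if o and not p)

    positives = true_pos + false_neg   # substances that do have the effect
    negatives = true_neg + false_pos   # substances that do not have the effect

    return {
        "concordance": (true_pos + true_neg) / len(pairs),
        # Undefined (None) if the external set lacks positives or negatives.
        "sensitivity": true_pos / positives if positives else None,
        "specificity": true_neg / negatives if negatives else None,
    }


# Invented external test set: 4 substances with the effect, 6 without.
observed = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, True, True]
performance = external_performance(observed, predicted)
```

In this invented example the model is correct for 7 of 10 substances, finds 3 of the 4 substances that have the effect, and correctly clears 4 of the 6 that do not.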

A model's performance should always be viewed in the context of the variation in the test data, since experimental results for the same substance also vary. With regard to precision, the best (Q)SAR models are actually comparable with, or even better than, test data when only predictions within the applicability domain of the models are used.