This Ph.D. thesis investigates an innovative approach to deriving Quantitative Structure-Property/Activity Relationships (QSPRs/QSARs) and discusses its application to several predictive problems. The approach is based on the direct and adaptive treatment of molecular structure by means of a Recursive Neural Network (RNN). Chemical compounds are represented directly through appropriate graph-based structures, so that no numerical descriptors are needed.
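As a rough illustration of what "direct treatment of structure" can mean, the following minimal sketch encodes a labelled tree bottom-up and reads a property off the root state. It is a generic recursive encoder under stated assumptions, not the architecture or training procedure used in this thesis: the names `Node`, `make_weights`, `encode` and `predict`, the one-hot labels, the state size and the random (untrained) weights are all hypothetical.

```python
import math
import random


class Node:
    """A labelled tree node: `label` is a feature vector, `children` is ordered."""

    def __init__(self, label, children=()):
        self.label = list(label)
        self.children = list(children)


def make_weights(n_label, n_state, max_children, seed=0):
    """Random, untrained weights (illustration only; a real model would be fitted)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_label": rand(n_state, n_label),
        "W_child": [rand(n_state, n_state) for _ in range(max_children)],
        "bias": [0.0] * n_state,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_state)],
    }


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def encode(node, w):
    """Node state = tanh(W_label*label + sum_k W_child[k]*state(child_k) + bias)."""
    total = [b + s for b, s in zip(w["bias"], matvec(w["W_label"], node.label))]
    for k, child in enumerate(node.children):
        # Each child position k has its own weight matrix (assumed design choice).
        contribution = matvec(w["W_child"][k], encode(child, w))
        total = [t + c for t, c in zip(total, contribution)]
    return [math.tanh(t) for t in total]


def predict(root, w):
    """Linear read-out of the root state gives the predicted property value."""
    return sum(a * b for a, b in zip(w["w_out"], encode(root, w)))


# Hypothetical one-hot labels for three group types (not the thesis's vocabulary).
A, B, C = [1, 0, 0], [0, 1, 0], [0, 0, 1]
toy_tree = Node(A, [Node(B, [Node(C)]), Node(C)])
weights = make_weights(n_label=3, n_state=4, max_children=2)
print(predict(toy_tree, weights))
```

The point of the sketch is only that the input is the structure itself: no fixed-length descriptor vector is computed beforehand, and the encoding is learned together with the output function once the weights are trained.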
In the first part, the RNN-QSPR method was applied to predict the melting point (Tm) of a set of 126 pyridinium bromides and the glass transition temperature (Tg) of a set of 337 (meth)acrylic homopolymers. Particular emphasis was placed on the representation of cyclic moieties, which can be achieved in different ways by exploiting the flexibility of the structured approach. Various representations were devised, each with different advantages and sampling requirements. Performance did not vary significantly when moving from a more specific representation to a more general one. For the Tm of pyridinium bromides, the best result on the test set of 37 molecules was a mean absolute residual (MAR) of 25 K, a standard error of prediction (S) of 29.6 K and a squared correlation coefficient (R2) of 0.62. For the Tg of poly(meth)acrylates, the best result on the test set of 54 molecules was MAR = 15.8 K, S = 20.4 K and R2 = 0.85.
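For readers unfamiliar with the three figures of merit, the sketch below shows one common way to compute them from experimental and predicted values. Two points are assumptions rather than statements of this thesis: S is taken here as the root mean square of the residuals, with an optional degrees-of-freedom correction `dof`, and R2 is taken as the squared Pearson correlation between experimental and predicted values. The function names and the sample temperatures are made up for illustration and are not data from this work.

```python
import math


def mar(y_true, y_pred):
    """Mean absolute residual."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def std_error(y_true, y_pred, dof=0):
    """Standard error of prediction, assumed to be sqrt(SSE / (n - dof))."""
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sse / (len(y_true) - dof))


def r_squared(y_true, y_pred):
    """Squared Pearson correlation between experimental and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Made-up temperatures in K, for illustration only.
experimental = [300.0, 320.0, 350.0, 400.0]
predicted = [310.0, 315.0, 340.0, 405.0]

print(mar(experimental, predicted))        # 7.5
print(std_error(experimental, predicted))
print(r_squared(experimental, predicted))
```

MAR and S are expressed in the units of the property (K for Tm and Tg), whereas R2 is dimensionless.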
In the second part, the representation used for homopolymers was extended to treat copolymers. A data set containing the Tg of 275 random (meth)acrylic copolymers was investigated, both alone and mixed with homopolymer data. The prediction on copolymers was excellent: for the 57 compounds in the test set, MAR, S and R2 were 4.9 K, 6.1 K and 0.98, respectively. The method also performed well on the total data set comprising homopolymers and copolymers together.
In the last part, the RNN approach was employed to model and predict the toxicity of two sets of aromatic molecules. The first data set involved the median growth impairment concentration (IGC50) of 221 phenols towards Tetrahymena pyriformis. The results were good for the training set, but the performance on the test set (41 molecules) did not match that of other methods in the literature. It must be stressed, however, that those methods employ a priori information encoded in appropriate numerical descriptors, whereas our method does not use any background knowledge. The second data set concerned the median lethal concentration (LC50) of 69 substituted benzenes towards Pimephales promelas. This data set was also investigated by means of a descriptor-based multiple linear regression (MLR) technique. Both calculations performed well, yielding MAR ≈ 0.22, S ≈ 0.25 and R2 ≈ 0.80 on the test set of 18 molecules. RNN and MLR gave very similar results despite their radically different approaches.