Problem statement and challenges
In several applications it is difficult and/or time-consuming to construct models based on first principles. A black-box modelling approach is then a viable alternative, given the considerable progress made in areas such as machine learning, system identification, pattern recognition and statistics, in relation to optimization. Commonly used models include support vector machines and kernel-based learning, graphical models and Bayesian networks, neural networks and others. However, new technologies pose increasing challenges, e.g. the generation of massive data sets, the high dimensionality of the data spaces, the reliability of predictions and the need for interpretability of the estimated models.
The main focus of the working group is data-driven modelling. Different tasks are studied, such as regression, classification, clustering, dimensionality reduction, data visualization, predictive modelling, feature selection, structure detection, data fusion, ranking and survival analysis. Major aims are the development of reliable and generic methodologies, convex formulations and convex relaxations, regularization mechanisms and the incorporation of prior knowledge, and the handling of different model structures, high dimensionality and large data sets.
Objectives and methodology
- Achieving sparseness. Objectives of our work include studying block-wise penalties [1] in both parametric and non-parametric settings, in relation to, e.g., current work on L1 regularization; linking the optimality properties of solutions with their statistical relevance; and establishing test procedures, based on duality arguments, aimed at identifying all the relevant (groups of) covariates. The studied applications will include the use of interpretable models in survival analysis [2].
- Regularization mechanisms and prior knowledge incorporation. In problems of nonlinear system identification, the application of kernel-based models and support vector machines is best established for general black-box models [3]. An open problem is how to incorporate prior knowledge about the system within the optimization formulations. So far this has only been achieved for a limited class of systems, such as Hammerstein systems [4]. Further systematic approaches will be investigated. Improved black-box modelling schemes will also be investigated for the analysis of magnetic resonance spectroscopy with semi-parametric models [5,6], incorporating spatial constraints.
- Optimization-based clustering. Related to spectral clustering [7], clustering over time and the incorporation of prior knowledge will be studied. The methodology will be based on existing links between spectral clustering, kernel methods and least squares support vector machines [8]. In this setting, underlying models are employed which have a feature map representation in the primal and a kernel-based representation in the dual. Applications will be studied in the analysis of network data (e.g. power grid networks, literature networks).
- Large data sets. In most (bio)chemical companies, process optimization and control is limited to data archiving, with database sizes in the terabyte range. Identifying a black-box process model from these data sets requires efficient data clean-up, measurement selection and estimation procedures capable of dealing with these massive amounts of data. The applicability of both parametric and non-parametric models in this context, including the application of convex optimization methods, is challenging [9]. Kernel-based models and support vector machines have been performing well on a wide variety of (smaller scale) problems [10,11]. Further research is needed on their applicability to massive data sets, including estimation in the primal with fixed-size methods.
- Modelling for control. Black-box modelling approaches will be studied for use in control applications. For model-based predictive control, this involves the study of fast on-line updating of parameters and hyper-parameters; for the use of black-box models in statistical process control, new multiple-objective optimization problems need to be explored.
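
To make the block-wise penalties mentioned under "Achieving sparseness" more concrete, the following is a minimal sketch of block soft-thresholding, the proximal operator of a Euclidean-norm penalty on a group of coefficients. This is a standard building block for group-sparse estimation and is not necessarily the formulation used in [1]. The function names, the example coefficients, the grouping and the threshold value are all assumptions chosen for illustration.

```python
import math


def block_soft_threshold(block, threshold):
    """Proximal operator of threshold * ||block||_2 for one group (threshold >= 0)."""
    norm = math.sqrt(sum(v * v for v in block))
    if norm <= threshold:
        # The whole group is set to zero together: this is the block-wise sparsity.
        return [0.0] * len(block)
    scale = 1.0 - threshold / norm
    return [scale * v for v in block]


def prox_group_penalty(coeffs, groups, threshold):
    """Apply block soft-thresholding to each group of coefficient indices."""
    result = list(coeffs)
    for idx in groups:
        shrunk_block = block_soft_threshold([coeffs[i] for i in idx], threshold)
        for i, v in zip(idx, shrunk_block):
            result[i] = v
    return result


# Hypothetical example: four coefficients in two non-overlapping groups.
coeffs = [3.0, 4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
shrunk = prox_group_penalty(coeffs, groups, threshold=1.0)
print(shrunk)
```

In this example the first group has Euclidean norm 5 and is only shrunk towards zero, while the second group has a norm below the threshold and is removed as a whole. With groups of size one the operator reduces to the element-wise soft-thresholding associated with L1 regularization.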




