Valid interpretation of feature relevance for linear data mappings

Linear data transformations constitute essential operations in various machine learning algorithms, ranging from linear regression up to adaptive metric transformation. Often, linear scalings are not only used to improve model accuracy; rather, the feature coefficients provided by the mapping are interpreted as indicators of the relevance of the features for the task at hand. This principle, however, can be misleading, in particular for high-dimensional or correlated features, since it easily marks irrelevant features as relevant or vice versa. In this contribution, we propose a mathematical formalisation of the minimum and maximum feature relevance for a given linear transformation, which can efficiently be solved by means of linear programming. We evaluate the method on several benchmarks, where it becomes apparent that the minimum and maximum relevance closely resemble what is often referred to as weak and strong relevance of the features; hence, unlike the mere scaling provided by the linear mapping, they ensure valid interpretability.


I. INTRODUCTION
Machine learning (ML) methods constitute core technologies in the era of big data [1]: successful applications range from everyday tasks such as spam classification up to advanced biomedical data analysis. Further, today's most significant machine learning models are supported by strong theoretical guarantees such as their universal approximation capability and generalisation ability. Still, there is a long way to go before advanced ML technology can be used directly in complex industrial applications or in settings where a human has to take responsibility for the results. Most popular ML models act as black boxes and do not reveal insight into why a decision has been taken [2]. Hence the accuracy on the given data is the sole information based on which practitioners can decide to use a model. Despite strong theoretical results under idealised assumptions, this can be extremely problematic, since these assumptions are usually not met in practice. Further, black box models are restricted to mere functional inference. Auxiliary information is not extracted, albeit often aimed for, e.g.
in biomedical data analysis. These facts have caused a strong interest in interpretable ML models, with first promising results in specific domains such as biomedical data analysis [3]-[9]. Linear (or locally linear) data transformations constitute a particularly prominent element in machine learning which seemingly combines efficient and well-founded training algorithms with interpretable model components. Global linear models such as ridge regression, linear discriminant analysis, or principal component analysis constitute premier techniques in many application domains, in particular if high data dimensionality is involved [10]. Besides, the very active field of metric learning usually aims for an adaptive quadratic form, which essentially corresponds to a linear transformation of the data. Many different successful approaches have recently been proposed in this context, see e.g. [11], [12]. One of the striking properties of linear models is that they seemingly allow an interpretation of the relevance of input features by inspecting their corresponding weighting; in a few cases, such techniques have led to striking semantic insights into the underlying process [13]. Thus, these models carry the promise of fast and flexible learning algorithms which directly address a simultaneous, quantitative, and interpretable weighting of the given features, provided linear data modelling is appropriate.

1 Those authors contributed equally to this work.
Recent results, however, have shown that the interpretation of linear weights as relevance terms can be extremely misleading, in particular for high-dimensional data [14]: such data likely display correlations of the features, hence relevance terms can be high due to purely statistical effects of the data. Conversely, highly correlated but very important features can be ranked low due to the fact that they share their impact. In the contribution [14], a first cure which partially avoids these effects by an L2 regularisation has been proposed; in particular in the case of feature correlations, the approach still fails to provide efficient bounds for the minimum and maximum feature relevance, hence it offers only a partial solution of the problem.
In this contribution, we propose an L1 regularisation instead, which allows an efficient formalisation of the minimum and maximum feature relevance as a linear programming problem. Since many recent datasets are characterised by their high dimensionality, this constitutes a crucial step towards feature relevance interpretability in many modern domains.
Very high data dimensionality is becoming more and more prominent. For example, in omics studies, many genes are considered simultaneously [15], [16]. Even if having more information may seem beneficial at first glance, this wealth of features can also be problematic. Indeed, machine learning in high-dimensional spaces suffers from the curse of dimensionality [17], [18], also known as the empty space phenomenon. This is due to the fact that the size of a dataset should scale exponentially with its dimensionality, which cannot be achieved in practice. Other counterintuitive phenomena like the concentration of distances [19] occur, which cause distances to be less useful in high-dimensional spaces. Eventually, high-dimensional data are harder to analyse and to visualise for human experts. As argued above, direct feature ranking in linear maps can easily lose its interpretability in this situation.
Feature selection [20] is a common preprocessing for high-dimensional data, and we will compare our modelling to classical feature selection. Feature selection consists of selecting a few relevant features which allow reaching good prediction performance with easy-to-interpret models. For example, least angle regression (LARS) [21], [22] obtains sparse feature subsets for linear regression. Many methods have been proposed for non-linear models, based e.g. on mutual information [23]-[29]. Such solutions improve the performance of subsequently used machine learning algorithms. In our setting, we are not so much interested in a sparse linear representation; rather, we address the question: given a linear mapping, what is the relevance of the features for this mapping, taking into account all possible invariances inherent in the data? Concerning this question, classical feature selection, though very powerful, is not entirely satisfying when it comes to interpretability. Indeed, most feature selection algorithms only provide either a unique subset of features or a path of feature subsets of increasing size. This leaves out an important part of the information. For example, if two relevant features are linearly dependent, the LARS algorithm may arbitrarily include either of them in the feature subset, which may incorrectly suggest that the other feature is irrelevant. Also, most feature selection methods do not specify which features are strictly necessary, which may be interesting to understand the system under study.
These limitations of feature selection can be alleviated using the concept of strong and weak relevance [30]-[32]. Strongly relevant features provide new information, even if all other features are already used. Weakly relevant features may provide new information, but only if certain features (e.g. redundant ones) are not simultaneously considered. In general, the determination of weakly relevant features requires an exhaustive search over all feature subsets [32]. In this paper, we restrict ourselves to linear mappings only, ignoring possible nonlinear effects. We are interested in the relevance of the features for the given mapping, aiming at both strong and weak feature relevance. We do not strictly follow the formal definition of strong and weak feature relevance for linear settings, but rather use a different formalisation which is inspired by these terms and allows efficient modelling. Essentially, we will consider two weight vectors of a given mapping as equivalent if they have the same (or a similar) classification behaviour and the same (or a similar) length of the weight vector, thus accounting for a similar signal-to-noise ratio or generalisation ability, respectively. Then we propose a measurement similar to weak and strong feature relevance by the minimum and maximum weight of a feature in this equivalence class. These bounds give an interpretable interval for the feature relevance. This paper is organised as follows. First, Section II discusses the problem of weak and strong relevance for linear relationships. The concept of bounds for feature relevance is introduced, as well as a simple, generic reference algorithm. Section III proposes a new algorithm to find strongly and weakly relevant features for linear models (and the corresponding feature relevance bounds). Experiments are performed in Section IV, and Section V concludes this paper.

II. DEFINITION AND MEASURE OF FEATURE RELEVANCE
This section defines the concept of feature relevance and discusses a simple algorithm to quantify it, aiming at approximations of the formal concept of weak and strong feature relevance. For linear mappings, a similar mathematical definition is proposed in Section III which resembles the underlying ideas but directly gives rise to an efficient solution.

A. Feature Relevance
The question of what feature relevance means has been extensively discussed, see e.g. the survey [33] and the approaches [34], [35]. The notion of strong and weak feature relevance has been defined in [30]-[32]. Assume the task is to predict a target Y based on d features X_1, ..., X_d, which can be either continuous (regression) or discrete (classification). A variable Y is conditionally independent of a variable X_j given a set of variables S if P(Y | X_j, S) = P(Y | S). This is denoted as Y ⊥⊥ X_j | S. A feature X_j is defined as strongly relevant to predict Y iff

Y ⊥̸⊥ X_j | X^(j),

where X^(j) is the set of all features except X_j. Strongly relevant features are strictly necessary to achieve good prediction, since they contain some information which is not provided by any other feature. Finding these features is particularly interesting to understand the studied process, since these features are likely to play a key role.
A feature X_j is defined as weakly relevant to predict Y iff it is not strongly relevant and

Y ⊥̸⊥ X_j | S

for some feature subset S ⊂ X^(j). A weakly relevant feature is not necessarily useful, since it provides information which is also contained in other features. Indeed, Y ⊥⊥ X_j | X^(j) holds if the feature X_j is not strongly relevant (first part of the definition). This can occur if X_j is redundant with other features, for example. Nonetheless, experts are often still interested in such features: some weakly relevant features are often necessary for a good model accuracy, albeit the choice is not necessarily unique. Further, weakly relevant features are often crucial to understand the complex relationships between the features and the target. One example is explained in [32]: in gene expression analysis, experts 'are primarily interested in identifying all features (genes) that are somehow related to the target variable, which may be a biological state such as "healthy" vs. "diseased"' [36], [37].

B. Searching for Relevant Features
Under reasonable assumptions, generic (but potentially time-consuming) algorithms are proposed in [32] to find strongly and weakly relevant features. We recall these procedures for convenience. Strongly relevant features can be found by selecting all features whose removal lowers the prediction performance. Assume there is given a classifier with prediction error c(S) based on the feature set S. Then these features correspond to the subset

{X_j | c(X^(j)) > c(X) + ε},

where the parameter ε > 0 controls the trade-off between precision and recall [32]. This backward procedure is efficient, since the criterion must only be estimated d times.
Weakly relevant features are much harder to find. When directly testing the definition, one has to consider the O(2^d) possible feature subsets S ⊂ X^(j) for the conditional dependence Y ⊥̸⊥ X_j | S. In practice, such an exhaustive search is not affordable and one has to rely on heuristics to find weakly relevant features. For example, the recursive independence test (RIT) algorithm [32] first finds the features X_j satisfying Y ⊥̸⊥ X_j. Then, it recursively adds all the other features X_{j'} which are pairwise dependent with those features, i.e. X_{j'} ⊥̸⊥ X_j. For each step, a (specific) statistical independence test is required.
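To make the combinatorial nature of this definition concrete, the following sketch tests weak relevance exhaustively on a tiny example. It uses the mean square error of a least-squares fit as a stand-in for a proper conditional independence test; the function names and the threshold `eps` are ours, not from [32].

```python
import itertools
import numpy as np

def mse(X, y, cols):
    # Prediction error c(S) of a least-squares fit on the feature subset S
    if not cols:
        return float(np.var(y))  # empty subset: predict the mean of y
    A = X[:, list(cols)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

def weakly_relevant(X, y, eps=1e-3):
    # Exhaustive O(2^d) test: X_j is weakly relevant iff it is not strongly
    # relevant but improves prediction on SOME subset S of the other features
    d = X.shape[1]
    base = mse(X, y, list(range(d)))
    weak = []
    for j in range(d):
        others = [k for k in range(d) if k != j]
        if mse(X, y, others) > base + eps:
            continue  # strongly relevant, hence not weakly relevant
        for r in range(len(others) + 1):
            if any(mse(X, y, list(S)) - mse(X, y, list(S) + [j]) > eps
                   for S in itertools.combinations(others, r)):
                weak.append(j)
                break
    return weak

# Two identical copies of the informative variable: each is weakly relevant,
# neither is strongly relevant
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])
y = x
print(weakly_relevant(X, y))  # [0, 1]
```

The inner loop over `itertools.combinations` is exactly the exponential search the text warns about; it is only feasible here because d is tiny.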

C. Bounds for Feature Relevance
The algorithms described in Section II-B find sets of relevant features, whereby weakly relevant features can only approximately be determined efficiently. We are interested in a yet different setting: on the one hand, we do not necessarily consider a clear objective such as the classification error; rather, our goal is to interpret the relevance of features for a given linear mapping and data set. On the other hand, we are not only interested in qualitative results indicating a feature as relevant or irrelevant, respectively. Rather, we would like to identify an interval for every feature which quantifies the minimum and maximum relevance the feature might have for the given mapping. Thus, such bounds should not only indicate whether features are strongly or weakly relevant, but also how relevant they are. A non-zero lower bound indicates that a feature is strongly relevant, whereas a large upper bound points out that the feature is at least weakly relevant.
In the following, we will focus on linear relationships, which are common in biomedicine or the social sciences, and particularly interesting for the case of high data dimensionality, i.e. a potentially large number of correlated features. In this section, inspired by the formal notion of strong and weak feature relevance, we propose a generic approach which is suitable for low dimensionalities and which can serve as a basic comparison. Afterwards, in Section III, we propose another, efficient method to compute feature relevance bounds. This is then tested in Section IV.

D. Generic Approach to Compute Feature Relevance Bounds
Using the same idea as the algorithm in [32] which finds strongly relevant features (see Section II-B), the following algorithm computes lower bounds for the feature relevance.
Here, D_{X^(j)} is the dataset restricted to the features X^(j), and c measures the relevance of a feature subset to predict Y.

Algorithm 1 Compute lower bounds for feature relevance
Input: criterion c and dataset D = {(x_i, y_i)}_{i=1...n}
Output: lower bound l_j for each feature X_j
  compute c(D)
  for each feature X_j: compute l_j = c(D_{X^(j)}) − c(D)

Hence, the difference c(D_{X^(j)}) − c(D) can be interpreted as the minimum contribution of X_j to the total relevance. This quantity is used as a lower bound l_j for the relevance of feature X_j. It is non-zero if X_j is strongly relevant.
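A minimal sketch of Alg. 1, assuming c is the mean square error of an ordinary least-squares fit (the function names are ours):

```python
import numpy as np

def mse(X, y, cols):
    # c: mean square error of a least-squares fit on the given columns
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

def lower_bounds(X, y):
    # l_j = c(D restricted to X^(j)) - c(D): the minimum contribution of X_j
    d = X.shape[1]
    base = mse(X, y, list(range(d)))
    return [mse(X, y, [k for k in range(d) if k != j]) - base
            for j in range(d)]

# Tiny illustration: y depends on column 0 only, column 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X[:, 0]
l = lower_bounds(X, y)
# l[0] is large (column 0 is strongly relevant), l[1] is close to zero
```

As in the text, the criterion is evaluated only d + 1 times, which is what makes the lower bounds cheap compared to the upper bounds.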
For the upper bounds, an exhaustive search would be necessary, but is intractable in practice. Instead, a greedy forward-backward search is used in the following algorithm.

Algorithm 2 Compute upper bounds for feature relevance
Input: criterion c, dataset D, and lower bounds l_j for every feature X_j
Output: upper bound u_j for each feature X_j

Here, C and S are the subsets of candidate and selected features, respectively. If c is the mean square error, the quantity c(D_∅) is defined as the target variance. Also, NB FB STEPS is the number of backward and forward steps which are performed. Using greedy algorithms like the above forward-backward search is a standard approach in feature selection. Even if it is not optimal, it often gives good results. The particularity of the above greedy search is that the search criterion is the upper bound itself. In other words, the algorithm searches for the feature subset which allows a given feature to be as useful as possible. The number of steps is deliberately limited because (i) weakly relevant features are unlikely to be highly relevant when a lot of other features are simultaneously considered and (ii) the estimation of c is often less reliable when the dimensionality increases. Also, computing the upper bounds with Alg. 2 requires evaluating the criterion c O(d^2 × NB FB STEPS) times. It is therefore necessary to use a small value for NB FB STEPS. Here, we use NB FB STEPS = 6 as a compromise between accuracy and efficiency.
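The greedy search can be sketched as follows; this is a simplified, forward-only variant of Alg. 2 (the backward steps are omitted), again with MSE as the criterion c and with function names of our own choosing:

```python
import numpy as np

def mse(X, y, cols):
    # c(S): mean square error of a least-squares fit on the feature subset S
    if not cols:
        return float(np.var(y))  # c of the empty set: the target variance
    A = X[:, list(cols)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

def upper_bound(X, y, j, n_steps=6):
    # Greedily search for the subset S on which feature j is as useful as
    # possible, i.e. which maximises the search criterion c(S) - c(S + {j})
    candidates = [k for k in range(X.shape[1]) if k != j]
    S = []
    best = mse(X, y, S) - mse(X, y, S + [j])
    for _ in range(min(n_steps, len(candidates))):
        gains = {k: mse(X, y, S + [k]) - mse(X, y, S + [k, j])
                 for k in candidates}
        k_best = max(gains, key=gains.get)
        if gains[k_best] <= best:
            break  # no extension makes feature j more useful
        S.append(k_best)
        candidates.remove(k_best)
        best = gains[k_best]
    return best

# Columns 1 and 2 are identical copies of the target, column 0 is noise:
# each copy gets a large upper bound although neither is strictly necessary
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([rng.normal(size=200), x, x.copy()])
y = x
```

Note that the search criterion here is the upper bound itself, as in the text, rather than the prediction error of the subset.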
Fig. 1 shows the lower and upper bounds obtained for the diabetes dataset used in the original LARS paper [21]. The 10 features for the 442 patients are the age, the sex, the body mass index (BMI), the blood pressure (BP), and 6 blood serum measurements X_5, ..., X_10. The goal is to predict a measure Y of diabetes progression one year after feature acquisition. Fig. 1 shows that the BMI X_3, the BP X_4 and the serum measurement X_9 are particularly informative; this is confirmed by the results of LARS obtained by Efron et al. [21].

E. Notes on the Error Criterion and the Proposed Algorithms
In this paper, c is the mean square error, since we focus on linear regression. However, the above discussion and the two proposed algorithms remain valid for non-linear regression using e.g. a kNN model as in [32]. Also, other criteria can be used, like the (estimated) conditional entropy c(S) = H(Y | S). Entropies can be estimated with the Kozachenko-Leonenko estimator [25], [26], [38], [39]. Similar approaches exist in feature selection [40], [41], but they do not derive bounds.
The above algorithms have several drawbacks. First, the criterion c has to be computed for each feature subset. Second, when the number of features d increases, the lower bounds tend to zero because of overfitting. Third, the algorithm used for the upper bounds is a heuristic, since forward-backward search is not exhaustive. Finally, the overall computational cost is quadratic w.r.t. the dimensionality d. Nevertheless, these two algorithms can still provide excellent points of comparison in Section IV, since they closely resemble the weak and strong relevance of features.

III. LINEAR BOUNDS
We are interested in the interpretation of a given linear mapping f(x) = ω^T x ∈ R with ω ∈ R^d, which we assume to map to a one-dimensional space for simplicity. Generalisations to higher dimensions, such as present in metric transformations, are immediate (i.e. treat each one-dimensional mapping independently and aggregate the results). We assume that this mapping either comes from a regression or classification task such as ridge regression, LARS, LASSO, or that it arises from a quadratic metric adaptation method, which corresponds to a linear transformation of the data space. For a given linear mapping, the value |ω_j| is often taken as a direct indicator of the relevance of feature X_j, provided the input features have the same scaling, i.e. the values delivered by a linear mapping are directly interpreted. As pointed out in [14], this is highly problematic: for high-dimensional data and hence high feature correlation, the absolute value |ω_j| can be very misleading. The approach [14] bases this observation on a formalisation of the mapping invariances for the given data.
First, we define the central notion of invariance, which will substitute the role of a criterion c. Given a mapping f(x) = ω^T x and data X consisting of a matrix with data vectors x_i, we define that ω is equivalent to ω' iff

ω^T x_i = ω'^T x_i for all data points x_i,

i.e. the mapping of the data is not changed when substituting ω by ω'. Unlike a pre-specified criterion c such as the accuracy, this notion directly relates to the behaviour of the mapping on the given data only. The approach [14] exactly characterises under which condition ω is equivalent to ω': two vectors ω and ω' are equivalent iff the difference vector ω − ω' is contained in the null space of the data covariance matrix XX^T. The covariance matrix has eigenvectors v_i with eigenvalues λ_1 ≥ ... ≥ λ_I > λ_{I+1} = ... = λ_d = 0 sorted according to their size, whereby I denotes the number of non-zero eigenvalues.
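This equivalence can be checked numerically; a small sketch with rank-deficient data (the conventions are ours: rows of X are the data points, so the relevant matrix is X^T X):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))   # 5 samples in 8 dimensions: rank-deficient covariance
w = rng.normal(size=8)

# Eigen-decomposition of the (uncentred) covariance matrix
lam, V = np.linalg.eigh(X.T @ X)
null = V[:, lam < 1e-8 * lam.max()]   # eigenvectors with (numerically) zero eigenvalue

# Adding any null-space combination yields an equivalent weight vector
w2 = w + null @ rng.normal(size=null.shape[1])
print(np.allclose(X @ w, X @ w2))  # True: the mapping on the data is unchanged
```

With 5 samples in 8 dimensions, the null space here is 3-dimensional, so infinitely many weight vectors realise the very same mapping on the data.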
In [14] it is proposed to choose one canonic representative of the equivalence class induced by a given ω before interpreting the values: one considers the vector ω̃ which results from dividing out the null space, i.e. ω̃ = Ψω, where Ψ denotes the matrix which corresponds to the projection onto the eigenvectors with non-zero eigenvalues only, as induced by the eigenvectors v_i of the matrix XX^T. Hence the eigenvectors with eigenvalue zero are divided out. It has been shown in [14] that this choice of a representative corresponds to the vector in the equivalence class with smallest L2 norm. As a result, it is no longer possible to assign a high value ω̃_j to an irrelevant feature based on random effects of the data, i.e. strongly relevant features are identified. While providing a unique representative of every equivalence class, this choice is problematic as concerns the direct interpretability of the values: weakly relevant features share the total relevance uniformly. Hence a feature which is highly correlated to a large number of others is always weighted low, independent of the fact that the information provided by this feature (or any equivalent one) might be of high relevance for the linear mapping prescription. In the following, we propose an alternative choice of representatives which are equivalent to ω but which allow a direct interpretation of the weight vector. Essentially, we will not consider the representative with smallest L2 norm, but use the L1 norm instead. Unlike the former, the latter induces a set of equivalent weights which have minimal L1 norm. We can infer the minimum and maximum relevance of a feature by looking at the minimum and maximum weighting of the feature within this set. Now we formalise this intuition.
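The canonical representative of [14] can be sketched in the same setting; Ψ is the projection onto the eigenvectors with non-zero eigenvalues (variable names are again our own):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))   # rows are data points
w = rng.normal(size=8)

lam, V = np.linalg.eigh(X.T @ X)
keep = lam > 1e-8 * lam.max()          # eigenvectors with non-zero eigenvalue
Psi = V[:, keep] @ V[:, keep].T        # projection matrix
w_canon = Psi @ w                      # representative with smallest L2 norm

print(np.allclose(X @ w, X @ w_canon))               # equivalent mapping
print(np.linalg.norm(w_canon) <= np.linalg.norm(w))  # and no larger L2 norm
```

Projecting out the null-space component never increases the norm, which is exactly why this representative is the minimum-L2 element of the equivalence class.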

A. Formalising the Objective
Given a parameter vector ω of a linear mapping, we are interested in equivalent vectors, i.e. vectors of the form

ω' = ω + Σ_{i=I+1}^d α_i v_i    (4)

for real-valued parameters α_i, which add elements of the null space of the mapping to the vector ω. We want to avoid random scaling effects of the null space; therefore we choose minimum vectors only, similar to the approach [14]. Unlike the L2 norm, however, we use the L1 norm:

μ = min_α |ω + Σ_{i=I+1}^d α_i v_i|_1    (5)

The value of the minimum μ is unique per definition. This is not the case for the corresponding vector ω + Σ_{i=I+1}^d α_i v_i. A very simple case illustrates this fact: assume identical features X_i = X_j with weightings ω_i and ω_j. Then any weighting ω'_i = t·ω_i + (1−t)·ω_j and ω'_j = (1−t)·ω_i + t·ω_j, for t ∈ [0, 1], yields an equivalent vector with the same L1 norm. This observation enables us to formalise a notion of minimum and maximum feature relevance for a given linear mapping: the minimum feature relevance of feature X_j is the smallest value of a weight |ω'_j| such that ω' is equivalent to ω and |ω'|_1 = μ; the maximum feature relevance of feature X_j is the largest such value. In mathematical terms, this corresponds to the following optimisation problems:

min_α |ω_j + Σ_{i=I+1}^d α_i (v_i)_j|  s.t.  |ω + Σ_{i=I+1}^d α_i v_i|_1 = μ    (6)

max_α |ω_j + Σ_{i=I+1}^d α_i (v_i)_j|  s.t.  |ω + Σ_{i=I+1}^d α_i v_i|_1 = μ    (7)

where (v_i)_j refers to component j of v_i. This framework yields a pair (ω_j^min, ω_j^max) for each feature X_j, indicating the minimum and maximum weight of this feature over all equivalent mappings with the same L1 norm. This strongly resembles the notion of strong and weak feature relevance in the special case of linear mappings with the mapping invariance as objective.
Note that this framework does not realise the notion of strong and weak feature relevance in a strict sense, for the following reason: we aim for scaling terms as observed in the linear mapping, which are subject to L1 regularisation. This has the consequence that two features which have the same information content but which are scaled differently are not treated as identical by this formalisation. Rather, the feature with the better signal-to-noise ratio, which corresponds to a smaller scaling of the corresponding weight, is preferred. Qualitative feature selection would treat such variables identically.
There exist natural relaxations of this problem as follows: in Eq. (4), we can incorporate eigenvectors which correspond to small eigenvalues, thus enabling an only approximate preservation of mapping equivalence. Further, we can relax the equality in Eq. (5) to allow values which do not exceed μ + ε instead of μ for some small ε > 0. Such relaxations with small ε values are strongly advisable in practical applications, to take into account noise in the data. We will use these straightforward approximations in the experiments.

B. Reformalisation as Linear Programming Problem
For an algorithmic solution, we rephrase these problems as linear programs (LPs). We reformulate problem (6) as the following equivalent LP:

min_{ω̃, α} ω̃_j
s.t. ω̃_k ≥ ω_k + Σ_{i=I+1}^d α_i (v_i)_k for all k,
     ω̃_k ≥ −(ω_k + Σ_{i=I+1}^d α_i (v_i)_k) for all k,
     Σ_k ω̃_k = μ,

where we introduce a new variable ω̃_k for every k which takes the role of the absolute value |ω_k + Σ_{i=I+1}^d α_i (v_i)_k|, μ is computed in (5), and the variables ω̃_k must be non-negative due to the constraints. For the optimum solution, we can assume that equality holds for one of the two constraints for every k; otherwise, the solution could be improved due to the weaker constraints and the minimisation of the objective.
For problem (7), we use the equivalent formulation

max_{ω̃, α} |ω_j + Σ_{i=I+1}^d α_i (v_i)_j|
s.t. ω̃_k ≥ ω_k + Σ_{i=I+1}^d α_i (v_i)_k for all k,
     ω̃_k ≥ −(ω_k + Σ_{i=I+1}^d α_i (v_i)_k) for all k,
     Σ_k ω̃_k = μ,

where, again, new variables ω̃_k are introduced which take the role of the absolute value |ω_k + Σ_{i=I+1}^d α_i (v_i)_k|: any solution for which equality does not hold for one of the constraints can be improved due to the weaker constraints and the maximisation of the objective. This is not yet an LP, since an absolute value is optimised. For its solution, we can simply solve two LPs where we consider the positive and the negative value of the objective,

max_{ω̃, α} ± (ω_j + Σ_{i=I+1}^d α_i (v_i)_j),

and we add the corresponding non-negativity constraint

± (ω_j + Σ_{i=I+1}^d α_i (v_i)_j) ≥ 0.

At least one of these LPs has a feasible solution, and the final upper bound can be derived thereof as the maximum of the two optimal values. This approach requires solving LP problems containing 2d constraints and I + 1 variables. Standard solvers can be applied.
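These LPs are easy to set up with an off-the-shelf solver. The following sketch uses `scipy.optimize.linprog`; the stacked variable is z = (ω̃, α), the function name and the toy data are our own, the `slack` parameter implements the μ + ε relaxation discussed above, and instead of adding the two sign constraints explicitly we simply take the maximum of the two optima:

```python
import numpy as np
from scipy.optimize import linprog

def relevance_interval(w, V, j, slack=1.0):
    # w: weight vector of the linear mapping; V: d x m matrix whose columns
    # span the null space of the data covariance; j: feature index.
    d, m = len(w), V.shape[1]
    # Inequalities encoding wt_k >= |w_k + (V a)_k| for the variables z = (wt, a)
    A = np.block([[-np.eye(d), V], [-np.eye(d), -V]])
    b = np.concatenate([-w, w])
    bnd = [(0, None)] * d + [(None, None)] * m

    # Eq. (5): mu, the minimal L1 norm within the equivalence class
    c1 = np.concatenate([np.ones(d), np.zeros(m)])
    mu = linprog(c1, A_ub=A, b_ub=b, bounds=bnd).fun

    # Budget constraint sum_k wt_k <= slack * mu (slack > 1 gives the relaxation)
    A2 = np.vstack([A, c1])
    b2 = np.append(b, slack * mu)

    # Problem (6): lower bound, minimise wt_j
    c_lo = np.zeros(d + m)
    c_lo[j] = 1.0
    lo = linprog(c_lo, A_ub=A2, b_ub=b2, bounds=bnd).fun

    # Problem (7): upper bound, maximise |w_j + (V a)_j| via two LPs
    c_a = np.zeros(d + m)
    c_a[d:] = V[j]
    hi_pos = w[j] - linprog(-c_a, A_ub=A2, b_ub=b2, bounds=bnd).fun
    hi_neg = -w[j] - linprog(c_a, A_ub=A2, b_ub=b2, bounds=bnd).fun
    return lo, max(hi_pos, hi_neg)

# Toy example: features 0 and 1 are identical copies, feature 2 is independent,
# so the null-space direction is (1, -1, 0) / sqrt(2)
w = np.array([1.0, 0.0, 1.0])
V = np.array([[1.0], [-1.0], [0.0]]) / np.sqrt(2)
for j in range(3):
    print(j, relevance_interval(w, V, j))
```

On this toy example, feature 2 obtains the tight interval (1, 1), i.e. it is strongly relevant, while the two identical copies each obtain (0, 1): weakly relevant, since they can substitute each other. In practice, V would come from the eigen-decomposition of the data covariance matrix.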

IV. EXPERIMENTS
In this section, results obtained by the linear bounds method and the generic approach are compared. For both methods, the data are normalised beforehand to have zero expectation and unit variance. Further, we consider a relaxed LP, allowing a bound of 1.1·μ instead of μ, and also incorporating eigenvectors with eigenvalues close to zero. We report the number of eigenvectors used for every data set.
Note that the methods investigated in these experiments do not reveal the strong and weak relevance, but rely on the quantitative scaling instead. Still, upper and lower bounds allow us to distinguish three settings:
1) A feature is irrelevant: this corresponds to a small upper bound.
2) A feature is relevant for the mapping but can be substituted by others: this corresponds to a small lower bound and a large upper bound.
3) A feature is relevant and cannot be substituted: this corresponds to a large lower bound.
Albeit cases 2) and 3) are not equivalent to weak and strong feature relevance in the strict sense, we will refer to these settings by these terms in the following.
As a first illustration, Fig. 2 displays the feature relevances of the LP approach for the diabetes dataset discussed in Section II-D. Here, we utilise the eigenvectors corresponding to the 3 smallest eigenvalues. The features X_3 and X_9 are indicated as strongly relevant. Otherwise, the features display similar upper bounds as predicted before, with small differences: the strongly relevant features X_2 and X_4, as detected by the baseline, are not highlighted by the LP technique. This is due to the fact that the resulting map may change slightly, since noise along eigenvectors with small eigenvalues is accepted. Under these conditions, these features are no longer mandatory to explain the mapping. Further, X_1 vanishes for the LP method, which can be attributed to the fact that the same effect on the mapping can be achieved with another feature which has a better signal-to-noise ratio, i.e. the L1 norm would increase when incorporating X_1.

A. Difference between methods
To show a major advantage of the LP method, a toy dataset was generated: unlike iterative feature selection, the LP technique judges the relevance of all features simultaneously. Hence it can better handle settings where a large number of noisy features masks weakly relevant information. In this example, the first twelve dimensions are noisy and only slightly correlated with the target, features X_13 and X_14 are useful but redundant, and the last two dimensions are necessary and independent. The objective of the task is to predict the sum of the last three dimensions. We choose dimensionality 1 for the approximated null space.
Results for both methods are displayed in Fig. 3. The generic method finds the two necessary and independent dimensions. It does not single out the weak relevance of the previous two features. Better results are obtained with the linear programming approach, which disregards the first dimensions completely, shows a full lower bound for the last two features, and correctly indicates the potential relevance of the other two dimensions.
Boston Housing: The Boston Housing dataset [44] concerns housing values in suburbs of Boston, with the median value of owner-occupied homes as the target. The dimensionality of the null space is picked as 3. As displayed in Fig. 4, the features X_6 and X_13, which correspond to the average number of rooms per dwelling and the percentage of lower-status population, are identified as most relevant. The same holds for X_4, X_11 and X_12, but to a lesser degree. Interestingly, features like X_9 (index of accessibility to radial highways) can play an important role, but this information can also be gathered from other features.
Poland Electricity Consumption: This dataset [45], [46] is a time series monitoring the electricity consumption in Poland, based on time windows of size 30. We choose the null space dimensionality as 3, corresponding to the extremely high correlation observed in this time series data. Fig. 5 shows that the last feature is identified by the LP as the most relevant one. This is expected due to the smoothness of the time series. For the LP technique, the feature is marked as strongly relevant, since its substitution would require a too large weighting. Further, for both methods, the cyclicity of the time series is clearly observable, whereby the basic method does not identify any feature but the last one as strongly relevant. Interestingly, the LP technique identifies two consecutive features as relevant for every cycle, since two values allow the estimation of the first-order derivative for better time series prognosis [47].
Santa Fe Laser: This dataset [48], [49] is a time series monitoring the physical process related to a laser, with time windows of size 12; the dimensionality of the null space is chosen as 2. Interestingly, a result which is very similar to the previous one is obtained. The features X_6 and X_12 as well as their immediate predecessors are picked by the LP technique as strongly relevant. As can be seen in Fig. 6, both methods identify the last two features as relevant, but the LP method shows a clearer profile as concerns the past values, which coincides with findings from [47].

V. CONCLUSION
We have addressed the question of in how far weights which arise from a linear transformation, such as a linear classification, regression, or metric scaling, allow a direct interpretation as relevances. We have discussed that this is usually not the case, in particular for high-dimensional data, a setting of particular importance e.g. in the biomedical domain. Inspired by previous work which addresses the null space of the observed data, and by the notion of weak and strong feature relevance, we have developed a framework which yields an efficient quantitative evaluation of the minimum and maximum feature relevance for a given linear mapping. This framework is based on the hypothesis that the objective is the output of the given mapping for the given data, and that only weights which are minimal in L1 norm are of interest. Then, linear programming enables a polynomial-time technique to estimate these relevance intervals.
We have compared the technique to a corresponding baseline which is directly based on forward-backward feature selection. It becomes apparent that the technique closely resembles the notion of weak and strong feature relevance; unlike iterative methods, it does not face problems when dealing with high-dimensional data and many irrelevant features, while still being capable of distinguishing relevant information from mere noise.
So far, we have demonstrated the technique on various benchmarks with very promising results. It will be the subject of future work to test the suitability of this technique for biomedical applications, where relevance intervals will be checked by medical experts. In addition, we are in the process of testing and improving the technique for higher dimensionalities in the range of several hundred or thousand features. For these settings, efficient optimisation techniques will be needed for a feasible LP solution.

Fig. 1. Lower and upper bounds of feature relevance given by Alg. 1 and Alg. 2 for the diabetes dataset. c is the mean square error of a linear regression.

Fig. 2. Lower and upper bounds of feature relevance given by the linear programming method for the diabetes dataset.

Fig. 3. Lower and upper bounds of feature relevance for the toy dataset. The top figure shows the results of the generic approach, the lower one those of the LP method.

Fig. 4. Lower and upper bounds of feature relevance for the Boston Housing dataset. The top figure shows the results of the generic approach, the lower one those of the LP method.

Fig. 5. Lower and upper bounds of feature relevance for the Poland Electricity Consumption dataset. The top figure shows the results of the generic approach, the lower one those of the LP method.

Fig. 6. Lower and upper bounds of feature relevance for the Santa Fe Laser dataset. The top figure shows the results of the generic approach, the lower one those of the LP method.