Data mining in mass spectrometry-based proteomics studies

ABSTRACT
The post-genomic era consists of experimental and computational efforts to meet the challenge of clarifying and understanding the function of genes and their products. Proteomic studies play a key role in this endeavour by complementing other functional genomics approaches; they encompass the large-scale analysis of complex protein mixtures, including the identification and quantification of proteins expressed under different conditions and the determination of their properties, modifications and functions. Understanding how biological processes are regulated at the protein level is crucial to understanding the molecular basis of diseases and often informs the prevention, diagnosis and treatment of diseases. High-throughput technologies are widely used in proteomics to analyze thousands of proteins. Specifically, mass spectrometry (MS) is an analytical technique for characterizing biological samples and is increasingly used in protein studies because of its targeted, non-targeted and high-throughput capabilities. However, as large data sets are created, computational methods such as data mining techniques are required to analyze and interpret the relevant data. More specifically, the application of data mining techniques to large proteomic data sets can assist in many aspects of data interpretation: it can reveal protein-protein interactions, improve protein identification, evaluate the experimental methods used, and facilitate diagnosis and biomarker discovery. With the rapid advances in mass spectrometry instruments and experimental methodologies, MS-based proteomics has become a reliable and necessary tool for elucidating biological processes at the protein level. Over the past decade, we have witnessed a great expansion of our knowledge of human diseases with the adoption of MS-based proteomic technologies, which has led to many interesting discoveries. Here, we review recent advances of data mining in MS-based proteomics in biomedical research. Recent research in many fields shows that proteomics goes beyond the simple cataloguing of proteins in biological systems and is finally reaching its initial potential as an essential tool to aid related disciplines, notably biomedical research. From here, there is great potential for data mining in MS-based proteomics to move beyond basic research, into clinical research and diagnostics.


INTRODUCTION
Proteomics encompasses a broad range of technologies that allow the identification and quantification of proteins in complex biological specimens. Proteomics approaches rely on the ability to detect small changes in protein abundance in an altered state relative to a control or reference condition. Thus, the identification and quantification of differences between two or more physiological states of a biological system can be defined as changes relative to the control sample, determining the up- or down-regulation of a given protein 1 . These approaches have been extensively applied in biomedical research to improve the understanding of diseases, including protein-based biomarker discovery for the early detection and monitoring of different types of cancer 2 , the analysis of abnormal protein phosphorylation patterns associated with diseases 3 and the identification of therapeutic targets 4 .
There are many technologies used to extract protein information from biological samples. These techniques cover a range of approaches and data quality. Commonly used techniques include two-dimensional gel electrophoresis, enzyme-linked immunosorbent assay (ELISA), protein arrays, affinity separation and mass spectrometry (MS) technologies. Many of these methods, such as gel electrophoresis and ELISA, are limited in the number of proteins they can analyze because of their time-consuming procedures. They also require that specific proteins be selected during the design of the study and that suitable antibodies be available; this can be a challenge for non-model organisms. Meanwhile, MS-based proteomics has become a widely used high-throughput method to investigate protein expression and functional regulation. From being able to study only dozens of proteins, state-of-the-art MS proteomic techniques are now able to identify and quantify ten thousand proteins 5 .
MS is used to measure the mass-to-charge (m/z) ratio of molecules. However, the molecules must first be electrically charged and transferred into the gas phase so that they can be manipulated by electromagnetic fields (Figure 1). Electrospray ionization is a commonly used method for the ionization of molecules; other methods are also increasingly popular, including matrix-assisted laser desorption/ionization (MALDI) and surface-enhanced laser desorption ionization (SELDI). Once the molecules have been transferred into the gas phase, their m/z ratios are measured by their motion in an electric or magnetic field; this occurs in the mass analyzer. There are different types of mass analyzers, including quadrupole systems, time-of-flight, ion trap and Fourier transform instruments. Each of these systems has different strengths and weaknesses, such as the m/z range that can be detected and the mass spectrometric resolution. Once measured, the m/z values are displayed as mass spectra, describing the molecules present through peaks at the corresponding m/z values 6 . In recent years, with advances in instrumentation and detection techniques, MS has been applied more widely in various areas, including pharmacology and biomedical practice. However, as the sensitivity, accuracy and throughput of MS analysis have improved, the quantity, dimensionality and complexity of the data sets generated by MS have increased significantly. In order to interpret this huge amount of data efficiently, there is growing interest in applying informatics technology based on data mining algorithms to meet current demand. The aim of this article is to give a brief overview of how data mining algorithms can help process complex MS-based proteomics data, provide valuable molecular insight into different biological specimens, and make MS techniques more versatile and translatable in solving biomedical problems. First, we introduce the field of data mining in proteomics studies and highlight the essential concepts. Then, specific implementations of data mining algorithms are reviewed, ordered by the steps in a typical workflow. Thereafter, challenges related to data standardization, databases and software availability are discussed. Finally, applications of MS-based proteomics in biomedicine, as well as limitations and future perspectives of this approach, are discussed.

MS-BASED PROTEOMICS AND DATA MINING
In proteomics, mass spectrometry is increasingly used because of its specificity and high-throughput capabilities. The most commonly used MS method for protein identification is the "bottom-up" approach. In this approach, the molecules measured are peptides generated by the enzymatic digestion of the proteins in a sample. The resulting spectra of the fragmented peptides, known as tandem MS spectra (MS/MS), are generated such that the peaks describe the amino acids present in the peptides. However, this only provides the identities of the peptides present in the sample after enzymatic digestion; it is therefore still necessary to work back from the known peptides to infer which proteins were originally present in the sample. The "bottom-up" approach contrasts with the "top-down" approach, in which MS is used to directly analyze undigested proteins, by ionization and dissociation of intact proteins in the mass spectrometer. This approach may be more specific than "bottom-up", but it has higher experimental requirements and requires more complex tools to be applicable to a global analysis 7 . Data mining techniques have been widely used to analyze data from many areas of biology; in particular, various machine learning methods have been applied to data generated by genomics, transcriptomics and metabolomics techniques to classify unknown samples and identify genes relevant to a disease state. Currently, similar methods are applied in the field of proteomics and, more specifically, in the analysis of data generated by MS 8 . In many studies, MS generates large data files containing lists of many peaks. Implementing data mining methods is therefore necessary to identify the proteins related to the peaks of interest and to compare samples. In most cases, the analysis of MS data follows the paths summarized in Figure 2.

Basic Steps in MS-Based Proteomics Data Mining
As mentioned above, the use of MS yields a huge amount of data, in which the number of features (peaks) is larger than the number of samples. MS data are typically composed of hundreds to thousands of protein peaks. These data cannot be analyzed manually or managed by standard data mining tools. In search of adequate tools to analyze the available data and extract useful information, proteomics scientists increasingly rely on advanced data mining techniques that can address issues such as high dimensionality and limited data set size. These advanced techniques include machine learning and artificial intelligence. Current practice of data mining in MS-based proteomics includes the following steps: first, the data are modeled using peaks identified through pre-processing and feature selection; then, data sampling is carefully applied to handle the typically small sample sizes of MS data; lastly, the performance of the generated model is evaluated.
The critical phases mentioned above must be treated carefully by proteomics researchers to obtain correct and robust decision models. The steps are repeated iteratively, and changes are made to explore different aspects of the data. Figure 3 describes the typical flowchart of data mining.

Pre-processing
Raw data obtained from MS are often noisy. The purpose of pre-processing is to improve data quality; the results of classification algorithms will be misleading and negatively affected when data quality is poor. Therefore, data pre-processing is crucial in the analysis of raw proteomics data. Many published studies used the software provided by the instrument manufacturer for pre-processing 9 . Such software detects the positions and intensities of the proteins in the samples and performs important pre-processing steps such as baseline subtraction, intensity normalization, alignment and peak detection. Criteria specified by the operator are used to filter the peaks. To date, no study has been conducted to compare the effectiveness of the available noise reduction techniques, and it seems that researchers optimize this step heuristically, in whatever way works best with their own datasets. Although most studies focus on removing low-frequency noise in the spectra, many attempts have also been made to characterize and subtract high-frequency noise components 10 . With the baseline reduction complete, normalization is the next step. Since a peak in a spectrum only describes the relative amount of a protein, normalization is performed to ensure meaningful comparisons between spectra. After pre-processing, the obtained peaks are further analyzed by additional dimensionality reduction techniques.
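To make these pre-processing steps concrete, the following is a minimal Python sketch, assuming a single spectrum held in NumPy arrays and using SciPy's median filter and peak finder; the window size, noise threshold and synthetic spectrum are illustrative assumptions, not values prescribed by any particular study.

import numpy as np
from scipy.signal import medfilt, find_peaks

def preprocess_spectrum(mz, intensity, baseline_window=301, min_snr=3.0):
    """Toy pre-processing of one spectrum: baseline subtraction,
    total-ion-current (TIC) normalization and simple peak detection."""
    # Estimate a slowly varying baseline with a wide median filter and subtract it.
    baseline = medfilt(intensity, kernel_size=baseline_window)
    corrected = np.clip(intensity - baseline, a_min=0.0, a_max=None)

    # Normalize to the total ion current so spectra are comparable across runs.
    tic = corrected.sum()
    normalized = corrected / tic if tic > 0 else corrected

    # Detect peaks rising above a threshold derived from a crude noise estimate.
    noise = np.median(np.abs(normalized - np.median(normalized))) + 1e-12
    peaks, _ = find_peaks(normalized, height=min_snr * noise)
    return mz[peaks], normalized[peaks]

# Example with a synthetic spectrum: two peaks on a sloping baseline.
mz = np.linspace(1000, 2000, 5000)
intensity = 0.01 * mz + 50 * np.exp(-(mz - 1200) ** 2 / 2) + 80 * np.exp(-(mz - 1700) ** 2 / 2)
peak_mz, peak_int = preprocess_spectrum(mz, intensity)
print(peak_mz)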

Feature Extraction
Feature extraction involves the selection of regions of the spectra to find peaks (or features) and is usually done by binning. In this approach, each group of m/z points falling into a bin is described by a single value, such as its average or maximum intensity. Subsequently, the characteristics of these bins, such as bin position on the m/z axis and estimated intensity, are used as features for data mining. The bins can be independent or overlapping, of equal or adaptive size. By changing the bin size and grouping method, the researcher can empirically optimize the feature extraction process. Selecting and using only those features important to the modeling process makes the entire data mining process more accurate and efficient. The data mining algorithm will be faster for a data set composed of fewer, more significant peaks and will give simpler and more meaningful results. Therefore, it is essential to eliminate irrelevant and redundant features to build better models. However, it must be kept in mind that the feature selection process does not always guarantee a correct selection of peaks for the classification problem; it is therefore necessary to validate the selected features as the sample size of the data increases 11 . Advances in machine learning have led to the development of automatic feature selection tools, of which there are currently two types. The first type analyzes each feature independently and removes features one by one depending on the relationship between the feature and the target 12 . Selecting features independently is a simple, straightforward and fast process. However, it often happens that a group of features together is more strongly correlated with the desired output; the assumption of feature independence can therefore be rather limiting. To overcome this limitation, techniques have been proposed in which features are analyzed in groups/subsets 12 . The correlation between groups of features and the target output is considered. Although this process requires a lot of computation, it exploits the interrelations of important features and can uncover critical information that is generally lost during the analysis of independent features.
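The sketch below illustrates binning a peak list into a fixed-length feature vector, followed by a simple univariate (independent) feature filter using scikit-learn's SelectKBest with an ANOVA F-test; the bin width, the number of retained features and the randomly generated peak lists are hypothetical choices for illustration only.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def bin_spectrum(mz, intensity, mz_min=1000.0, mz_max=2000.0, bin_width=10.0):
    """Represent a peak list as a fixed-length vector: each bin holds the
    maximum intensity of the peaks falling inside it."""
    edges = np.arange(mz_min, mz_max + bin_width, bin_width)
    vector = np.zeros(len(edges) - 1)
    idx = np.digitize(mz, edges) - 1
    for i, j in enumerate(idx):
        if 0 <= j < len(vector):
            vector[j] = max(vector[j], intensity[i])
    return vector

# Hypothetical peak lists (m/z, intensity) for a set of labelled samples.
rng = np.random.default_rng(0)
X = np.vstack([bin_spectrum(rng.uniform(1000, 2000, 50), rng.random(50)) for _ in range(40)])
y = np.array([0] * 20 + [1] * 20)

# Keep only the 25 bins whose intensities best separate the two classes.
selector = SelectKBest(score_func=f_classif, k=25)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)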

Classification and Data Modelling
Humans and animals gain the ability to learn through interaction with the environment. Learning from data has long been an area of interest for researchers in statistics and computer science 1 . Machine learning algorithms can infer structure from a sample of data through familiarization and repeated interaction with the data. These algorithms vary in their training techniques, their end goal and the data to which they are suited. A wide range of machine learning algorithms has been developed; some popular algorithms are summarized and compared in Table 1. A learning process normally includes the task of learning and developing rules or functions from the data set provided by the samples. The development of mathematically precise rules and functions to describe data is called data modeling. The developed model identifies the properties of the different classes and what separates them for proper classification. In the next phase, called testing, the developed model is validated with new observations to verify that the model produces accurate results. The learning phase and the model estimation are implemented and described using different learning methods or algorithms. There are two types of machine learning algorithms: supervised and unsupervised. In supervised learning (also called "learning with a teacher"), there is prior knowledge of the class to which each case (sample) belongs. The training data set includes the input values and the associated output classes (provided by the teacher). During the learning phase, the training data are used to decide how features will be selected, weighted and combined to distinguish the classes. The test phase involves applying the weighted features to classify new test data whose class is unknown and which the decision model has never seen before. The goal of classification methods is therefore to build models from the training dataset and to use these models to classify new samples. The learning process creates a model whose predictions come as close as possible to the desired target. If the model is able to correctly classify new data, we have reason to believe that it is a good model 5 . The most widely used supervised learning algorithms are Bayesian classifiers, rule-based learners and Support Vector Machines. In unsupervised learning (also called "learning without a teacher"), the group to which each sample belongs is unknown or ignored and the data are grouped according to similarity measures. The learning process does not involve a teacher and the algorithm must identify the patterns in the data. Unsupervised learning can often lead to more than one possible solution. Artificial Neural Networks are typical examples of methods used in an unsupervised setting in studies analyzing mass spectrometry data 13 . In both learning techniques, the goal is to predict (classify) or describe data by developing data models, which are then used to classify or describe new cases. If the data have only two or three features, classification is easy. However, developing models can be a daunting task if there are many features to analyze. High-dimensional data are not only difficult to visualize, but all possible feature combinations must be taken into account through exhaustive search during the model training phase. A large number of dimensions with very few samples leads to what is often called overfitting or over-training. Overfitted models cannot generalize and classify new cases with the desired accuracy.
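As a concrete illustration of supervised learning with a train/test split, the following sketch fits a Support Vector Machine on a hypothetical matrix of binned peak intensities and evaluates it on held-out samples; the synthetic data and parameter values are assumptions made purely for demonstration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical feature matrix: rows are samples, columns are binned peak intensities;
# y holds the known class of each sample (e.g., 0 = control, 1 = disease).
rng = np.random.default_rng(1)
X = rng.random((60, 200))
X[30:, :10] += 0.5          # make the two classes separable on the first 10 features
y = np.array([0] * 30 + [1] * 30)

# Supervised learning: fit the model on the training split, evaluate on unseen test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))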

Data Sampling
One of the major challenges in applying machine learning algorithms to biomedical data is the validation of a trained model with new test data. The decision modeling process requires that the model be developed by training on a given set of data (training set), followed by validation of the model on another dataset never seen during training (test set). The obvious way to handle this is to split the data into training and test sets before constructing the model, using stratified random sampling. However, medical data are often very difficult and expensive to acquire; as a result, there are often not enough cases available to be divided into training and test subsets. In addition, the noise inherent in most medical data and the complex relationships between features require a sample of sufficient size to model the data effectively. Moreover, the size of the test set controls the statistical power of, and confidence in, the developed decision model. As a result, sophisticated sampling strategies are needed to capitalize on the available data. Cross-validation is one of the most widely used data resampling methods to assess the generalization ability of a predictive model and to prevent overfitting 19 . The data are randomly divided into two sets: the decision model is trained on the first and tested on the second. This random division is repeated several times to reduce selection bias, and the average of all test estimates provides the average error of the model. If the dataset used for training is too small, the model may not be able to predict the test cases well, and a small test set may not yield a validated classifier and may generate a high error rate. As a result, different train-test ratios (for example, 50-50%, 75-25%, etc.) are examined with cross-validation. A common implementation is k-fold cross-validation. The data are partitioned into k disjoint sets; the classifier is trained on k-1 sets and tested on the remaining set. This is done for all k subsets, producing k models, and the estimated error is the average of the k error rates. For example, a 10-fold cross-validation divides the data into 10 groups: nine groups are used for training and testing is performed on the left-out group. This is repeated 10 times until each of the 10 groups has served as a test group. The average test error over the 10 groups is the estimate of the final test error and gives a rough idea of how well the model classifies the data. To conclude, effective classification methods for MS data could contribute to earlier and less invasive diagnosis and also facilitate developments in the bioinformatics field. As protein MS data grow in volume and complexity, improvements in classification methods, in terms of classifier selection and combinations of different classification and pre-processing algorithms, should be emphasized in further work 20 .
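The following is a minimal sketch of 10-fold cross-validation using scikit-learn, assuming the same kind of hypothetical feature matrix as above; a naïve Bayes classifier is used simply as an example model.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: 60 samples described by 200 binned peak intensities.
rng = np.random.default_rng(2)
X = rng.random((60, 200))
X[30:, :10] += 0.5
y = np.array([0] * 30 + [1] * 30)

# 10-fold cross-validation: in each round, 9 folds train the classifier and the
# held-out fold is used for testing; the mean score estimates model quality.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))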

Performance Assessment
The last phase of the data mining process is the assessment of the models developed by the machine learning algorithms described above. Classification accuracy is calculated as the ratio of the number of correctly classified samples to the total number of samples in the test data. However, when the prevalence of one class is higher than that of another, the majority class will distort the result; in such a scenario, accuracy can be misleading 21 .

Table 1: Summary and comparison of popular machine learning algorithms (description; advantages and disadvantages).

Bayesian classifiers
Based on Bayes' theorem with an assumption of independence among the predictors, which makes them particularly useful for large datasets. Despite their simplicity, Bayesian classifiers often work surprisingly well and are widely used because they often outperform more sophisticated classification methods.
Fast and easy to implement, and suitable for datasets with missing values. The main disadvantage is the assumption that attributes are independent of each other.

Rule-based learners 5
Rule-based learning is an umbrella term for any machine learning method that identifies, learns or develops "rules" to store, manipulate or apply knowledge. The defining feature of a rule-based learner is the identification and use of a set of relational rules that collectively represent the knowledge acquired by the system.
The rules generated are easily readable and are suitable for the identification of putative biomarkers; however, there is a possibility of overfitting.

Decision trees 15
The decision tree methodology is a commonly used data mining method for building classification systems based on multiple covariates or for developing prediction algorithms for a target variable. It classifies a population into branch-like segments that form an inverted tree with a root node, internal nodes and terminal nodes.
The output of decision trees can be easily interpreted, although this depends on the algorithm used and the complexity of the tree generated. Decision trees are also well suited to datasets with missing values.

Random forest 16
Random forests are ensemble learning methods for classification, regression and other tasks that work by building a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct the tendency of decision trees to overfit their training set.
This method is efficient on large datasets, can handle large numbers of attributes and is not very sensitive to outliers.

Support Vector Machines (SVMs) 17
SVMs are machine learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each belonging to one of two categories, an SVM learning algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by the widest possible gap. New examples are then mapped into that same space and assigned to a category based on the side of the gap on which they fall.
SVMs use kernels to learn complex functions; however, they can be slow and there are multiple parameters to be chosen by the user.

Artificial Neural Networks (ANNs) 18
ANNs are computational models composed of several simple processing units that communicate by transmitting signals via a large number of weighted connections. Like the human brain, neural networks consist of processing units (artificial neurons) and connections (weights) between them. The processing units pass incoming information along their outgoing connections to other units. The "electrical" information is simulated with specific values stored in the weights, which allow these networks to learn, memorize and create relationships between the data. A very important feature of these networks is their adaptive nature, in which "learning by example" replaces "programming" to solve problems. After training, ANNs can be used to predict the outcome of new, independent input data.
ANNs use a multilayer perceptron to learn complex functions. The output of ANNs is not easily interpretable and training of the model can be very slow.
In two-class problems, there are four possible outcomes when testing the decision model: true positives, true negatives, false positives and false negatives. Sensitivity (true positive rate) is the ratio of the number of correctly classified positive samples to the total number of positive samples. High sensitivity is highly desirable in medical diagnosis, where a prediction that incorrectly indicates that a sick person is healthy has serious consequences. The false positive rate is the probability that a healthy person is wrongly classified as sick; specificity (the true negative rate) is its complement. High specificity is desirable when a false alarm leads to unwanted tests and elaborate treatments. Ideally, for a perfect classification, sensitivity and specificity are both equal to 1 (100%). Clinically acceptable sensitivity and specificity depend on the application. Several studies reported their results using sensitivity and specificity as performance indices 22 . The main limitation of using sensitivity and specificity as the only evaluation indices is their dependence on class prevalence and the decision threshold; it is therefore difficult to directly compare the results of reported studies using only these measures.
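Sensitivity and specificity can be computed directly from the confusion matrix, as in the short sketch below; the labels and predictions are invented for illustration.

from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (1 = disease, 0 = healthy).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")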

Standards and Databases
Driven by improvements in the speed and resolution of MS, the field of proteomics, which involves the large-scale detection and analysis of proteins in cells, tissues and organisms, continues to expand in scale and complexity. There is a resulting growth in datasets of both raw MS files and processed peptide and protein identifications. MS-based proteomics technology is also used increasingly to measure additional protein properties affecting cellular function and disease mechanisms, including post-translational modifications, protein-protein interactions and subcellular and tissue distributions. Consequently, biologists and clinicians need innovative tools to conveniently analyse, visualize and explore such large, complex proteomics data and to integrate them with genomics and other related large-scale datasets. The main challenge for big data mining then would be how to achieve a transition from association studies to causality studies. From this point of view, standardization of data provides a new way for system-wide study and could play a key role in such a transition in the big-data era.
For MS-based proteomics, the Proteomics Standards Initiative of the Human Proteome Organization (HUPO-PSI) plays a pioneering role in the development of standard terminologies, file formats and minimum reporting requirements for MS-based proteomics data 23 . The most common formats are: (i) mzML, which stores raw MS data as well as the peak list of the processed spectrum 24 ; (ii) mzIdentML, which holds information on peptides and proteins obtained from MS data 25 ; and (iii) mzQuantML, which holds detailed quantitative information 26 . While mzML and mzIdentML have been in use for a long time, standards for quantitative data are still rarely applied, mainly because of the lack of support for quantitative standards in popular analysis tools. Large and easily accessible data repositories are essential, as demonstrated by the benefits of data exchange in other research areas such as genomics and transcriptomics. Public databases integrate data obtained from many laboratories and, as a result, the data can be re-analyzed by applying new tools or new algorithms. Several public databases have therefore been developed, for example, the PRIDE Archive 27 , GPMDB 28 , PeptideAtlas 29 , MassIVE 30 and the Human Proteome Map 31 (Table 2). These databases are designed to provide a user-friendly interface, featuring graphical navigation with interactive visualizations that facilitate powerful data exploration in an intuitive manner. Moreover, they also offer a flexible and scalable ecosystem to integrate proteomics data with genomics information, RNA expression and other related, large-scale datasets.
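As an example of working with the mzML standard, the sketch below iterates over MS/MS spectra in a file using the pyteomics library (an assumption; the file name is hypothetical and the dictionary keys follow pyteomics' conventions for mzML controlled-vocabulary terms).

from pyteomics import mzml

# Iterate over the spectra stored in an mzML file (path is hypothetical).
with mzml.read("example.mzML") as reader:
    for spectrum in reader:
        if spectrum.get("ms level") == 2:
            mz_array = spectrum["m/z array"]
            intensity_array = spectrum["intensity array"]
            print(spectrum["id"], len(mz_array), intensity_array.max())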
Because of the nature of biological data, research in the life sciences has, to some extent, had to change its style in the era of big data, e.g., from individual academic exploration to more cooperative study conducted in systematic, standardized and pipelined ways. The main challenges here are to establish interoperable databases, make sustainable tools available to the research community, create tool development centers, construct resources and infrastructure such as cloud computing to serve the huge volume of research, generate standards, vocabularies and ontologies for big biological data, develop new systems of infrastructure and tools, and obtain buy-in from the scientific community. Clearly, the aforementioned challenges can be addressed in a more engineering-oriented manner, and a well-designed experimental system matched to a systematic, standardized data processing pipeline will be an important factor for a successful study.

Software and Tools
Many computer programs have been developed for the analysis of MS-based proteomics data 32,33 . Besides the available software, many useful tools have also been reported for the BioPython 34 , BioJava 35 and BioPerl 36 programming libraries. Furthermore, various bioinformatic tools used for data conversion, quantification, visualization and identification of peptides/proteins have also been catalogued (http://tools.proteomecenter.org/; http://wiki.nbic.nl/index.php/ProteomicsTools; http://www.msutils.org/wiki/pmwiki.php/Main/SoftwareList). Some tools act as components of larger platforms that form complete data processing pipelines (Table 3). Alternatively, there is a large number of software packages for the analysis of quantitative proteomics data, available in both commercial and free distributions (Table 4). This list is intended to serve as a useful reference and guide to the selection and use of different pipelines for quantitative proteomics data analysis, depending on the type of instrument, method or platform used. Some excellent reviews on existing software are available, such as [37][38][39] . For instance, in 37 , three different software platforms, Progenesis, MaxQuant and Proteios, were compared for peptide-level quantification in shotgun proteomics using a spike-in peptide data set with two different spike-in peptide dilution series. The performance of the software workflows was evaluated with different metrics, including the harmonic mean of precision and sensitivity, mean accuracy, coverage and the number of unique peptides found 37 . The comparison suggested that Progenesis performed best, but a noncommercial combination of Proteios with imported features from MaxQuant also performed well 37 . Algorithms usually included within a label-free quantitative shotgun proteomics workflow, such as peak picking and retention time alignment, have also been evaluated and compared separately [40][41][42] . While such separate comparisons are interesting and the evaluation by Chawade et al. 37 is informative, a thorough comparison of multiple workflows at the protein level is still missing, especially in terms of differential expression analysis. Here, some of the most popular free software applications applied to proteomics profiling, biomarker discovery and cluster analysis will be mentioned.

MaxQuant
MaxQuant is a quantitative proteomics software package designed for analyzing large-scale mass spectrometric data sets, developed at the Max Planck Institute of Biochemistry 43 . It supports all main labeling techniques, such as SILAC, dimethyl, TMT and iTRAQ, as well as label-free quantification. MaxQuant is a comprehensive software package that performs several analysis steps: a) Peak detection and scoring of peptides: it detects the mass and intensity of peptide peaks in MS spectra and assembles them into 3D peak hills over the m/z-retention time plane, followed by filtering to identify isotope patterns; b) Mass calibration: it corrects systematic inaccuracies of measured peptide masses and corresponding retention times, achieving high mass accuracy through weighted averaging and mass recalibration; c) Database searching for protein identification: peptide and fragment masses (in the case of MS/MS spectra) are searched in an organism-specific sequence database and are then scored by a probability-based approach termed the peptide score; d) Protein quantification. The software is written in C# and is freely available at http://www.coxdocs.org.

OpenMS
OpenMS is an open-source C++ software library for LC/MS data management and analysis, developed at the Free University of Berlin, the University of Tübingen and ETH Zurich 45 . It provides a large number of tools (more than 200), in the form of command-line programs, to analyze proteomics datasets. These tools can perform the following tasks: a) import, export and conversion of vendor formats and several open, community-driven XML formats; b) pre-processing of spectra: filtering based on various properties, peak picking, baseline and noise filtering; c) MS2 spectrum identification: support for third-party peptide search engines, a customisable and extensible basic search engine, indexing of peptides in custom protein databases with SeqAn, statistical validation via posterior error probability and FDR/q-value calculation, and combination of results from different peptide search engines with ConsensusID; d) visualisation of spectra (at all MS levels), features and peptide identifications in the TOPPView viewer; e) finding RNA and protein-protein cross-links; f) identification of phosphorylation sites with Luciphor. OpenMS is free software and runs under Windows, macOS and Linux.

Sample classification from protein mass spectra
One application of data mining is a novel algorithm for pattern classification from protein mass spectra, a slight variation of nearest-centroid classification called "Peak Probability Contrast" (PPC), first described in the study of Tibshirani et al. 46 . Briefly, PPC works by extracting peaks from each spectrum and then determining the optimal peak-height split point for discriminating between the classes at each site. It then computes the proportion of spectra in each class with peak heights above the split point and uses these proportions to build a nearest-centroid classifier. In particular, when applied to spectra from both diseased and healthy patients, the PPC technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. Compared to other statistical approaches for class prediction, this method performs as well as or better than several methods that require the full spectra, rather than just labeled peaks. The algorithm consists of six sequential steps, as shown in Figure 4. The development of this method, which finds a relatively small number of peak clusters for class prediction, is expected to facilitate the identification of biologically significant and relevant proteins for specific biological states, such as tumor development and progression 47-50 .
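The sketch below is not the authors' implementation but a heavily simplified illustration of the PPC idea: choose a split point per peak, summarize each class by the proportion of spectra whose peak height exceeds the split, and classify new spectra by a nearest-centroid rule on the above-split indicators; the split-point heuristic and the synthetic data are assumptions.

import numpy as np

def fit_ppc(X, y):
    """Very simplified Peak Probability Contrast-style model.
    X: samples x peaks matrix of peak heights; y: binary class labels."""
    split = np.zeros(X.shape[1])
    p0 = np.zeros(X.shape[1])
    p1 = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        # Split point on peak height (here simply the midpoint of the class means).
        split[j] = 0.5 * (X[y == 0, j].mean() + X[y == 1, j].mean())
        p0[j] = (X[y == 0, j] > split[j]).mean()   # proportion above split in class 0
        p1[j] = (X[y == 1, j] > split[j]).mean()   # proportion above split in class 1
    return split, p0, p1

def predict_ppc(x, split, p0, p1):
    """Nearest-centroid rule on the vector of above-split indicators."""
    indicator = (x > split).astype(float)
    d0 = np.abs(indicator - p0).sum()
    d1 = np.abs(indicator - p1).sum()
    return 0 if d0 < d1 else 1

rng = np.random.default_rng(3)
X = rng.random((40, 30))
X[20:, :5] += 0.6                      # synthetic class difference on 5 peaks
y = np.array([0] * 20 + [1] * 20)
split, p0, p1 = fit_ppc(X, y)
print(predict_ppc(X[0], split, p0, p1), predict_ppc(X[-1], split, p0, p1))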

Clustering mass spectra peak-lists
Data mining algorithms are also applied to proteomics data in an attempt to group proteins based on their spectral similarities 51 . Notably, clustering validation methods are used to find the clustering method that most faithfully captures the underlying distribution of the samples. These works also show that the application of clustering algorithms in proteomics can assist in (a) identifying peak features responsible for categorizing samples, (b) formulating hypotheses on the possible function and role of unidentified proteins and (c) revealing proteins that act jointly as biomarkers in a concrete biological state [52][53][54] . The proteomics data on which clustering is performed are the mass spectra peak-lists (not the raw mass spectra) produced by a mass spectrometer. In order to apply cluster analysis, these peak-lists are represented as vectors in a multidimensional space, where each vector element is a feature of a specific mass (e.g., its intensity) or of a group of masses. To deal with the high dimensionality of the generated peak-list vectors, mass "containers" (i.e., contiguous non-overlapping regions along the m/z axis) can be defined before analyzing the samples of an experiment. This binning process performs dimensionality reduction by grouping consecutive masses and selecting a representative feature of those masses for each group (e.g., mean, log or maximum intensity value). Moreover, the peak-list vectors can be preprocessed by scaling or normalization. The suggested clustering algorithms for these data are hierarchical as well as k-means clustering. For a better comprehension of the clustering results, several visualization methods are also exploited (i.e., dendrograms, heatmaps and cluster sets). In the clustering results derived from this method, not only can well separated protein clusters be easily discerned, but also the spectral containers that are most influential in partitioning the proteins into clusters. Furthermore, the presented method offers the option of integrating the identification results for the proteins that are members of each cluster, as well as their Gene Ontology annotation 55 . By exploiting both the identification and the Gene Ontology classification information for most proteins in each cluster, one can attempt to infer the role of unidentified proteins, based on the already known functions of proteins that are identified with high confidence and are found close to unidentified proteins in the same cluster.
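A minimal sketch of clustering binned peak-list vectors with hierarchical and k-means algorithms, assuming SciPy and scikit-learn and using synthetic vectors in place of real peak lists.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Hypothetical binned peak-list vectors: rows are proteins/spectra, columns are m/z bins.
rng = np.random.default_rng(4)
vectors = np.vstack([rng.normal(loc, 0.1, size=(15, 50)) for loc in (0.2, 0.5, 0.8)])

# Hierarchical (Ward) clustering, cut into three clusters.
tree = linkage(vectors, method="ward")
hier_labels = fcluster(tree, t=3, criterion="maxclust")

# k-means clustering as an alternative partitioning of the same vectors.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
print(hier_labels, km_labels)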

Protein-protein interactions prediction using association rules
The work by Kotlyar et al. 56 is among the first attempts to use association rules not only to discover protein-protein interactions, but also to predict whether a given pair of proteins interacts. Predicting interactions with association mining can be viewed as a classification problem in which the consequent of each rule consists of a single item only, the class variable. After the application of association mining, the rules are ranked according to a measure of "interestingness" (e.g., confidence, support) and used for prediction as follows: a given protein pair is predicted to interact if its attributes include the antecedent items of any rule. The presented approach is based on the idea that both direct and indirect evidence (e.g., data coming from experimental and computational methods) can be used to predict interactions reliably and on a proteome-wide scale. In particular, datasets consisting of interacting and non-interacting protein pairs annotated with different types of evidence are first constructed. Then, with the help of association rules, patterns that discriminate between the interacting and the non-interacting proteins are detected. Lastly, using these patterns, the prediction of interactions is achieved, assigning a confidence level to each interaction [57][58][59] .
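The sketch below illustrates the general idea of mining class-constrained association rules with the mlxtend library (an assumption; the evidence attributes and the tiny binary table of protein pairs are invented for illustration and do not reproduce the datasets of Kotlyar et al.).

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical evidence annotations for protein pairs: each row is a pair,
# each column a binary attribute (a type of evidence, or the class "interacts").
data = pd.DataFrame({
    "co_expression":     [1, 1, 0, 1, 0, 1, 1, 0],
    "shared_domain":     [1, 0, 0, 1, 0, 1, 1, 0],
    "same_localization": [1, 1, 1, 1, 0, 0, 1, 0],
    "interacts":         [1, 1, 0, 1, 0, 1, 1, 0],
}).astype(bool)

# Mine frequent itemsets, then keep rules whose consequent is the class item "interacts".
itemsets = apriori(data, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
rules = rules[rules["consequents"].apply(lambda c: c == frozenset({"interacts"}))]
print(rules[["antecedents", "consequents", "support", "confidence"]])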
To conclude, with this approach, different types of evidence for interaction are integrated in order to create rules that act as a classifier for new interaction pairs. Thus, association mining is used to search thoroughly in large datasets for predictive patterns. However, to evaluate the performance of this method and strengthen its applicability, it is important to incorporate additional evidence, perform testing and validation using already known interactions from specific organisms and compare the results to those of other interaction detection methods 60 .

Biomarker discovery
Data mining can also be useful in determining which proteins, from MS data, could be used as biomarkers to differentiate between samples of different classes 61 . Table 5 includes information from investigations on the application of data mining to mass spectrometry data for the identification of the most suitable biomarkers, based on factors such as the ability to test for the proteins in a clinical setting; this includes both identified proteins and mass spectral peaks as biomarkers. Further analysis, following the identification of peptides or proteins as putative biomarkers, is then required, as the proteins identified may not actually be suitable for use as biomarkers. For example, body fluids such as urine and serum (blood) are regarded as the most suitable fluids in which to search for biomarkers because they are easier to obtain for assessment purposes during diagnostic tests and treatments. Also, blood is pumped around the body by the circulatory system and bathes cells, tissues and organs, thus carrying putative protein biomarkers around the body before being processed by the liver and filtered by the kidneys into urine 62 . Table 5 also shows that the number of possible biomarkers identified varies greatly between studies, owing to differing complexities of the data; for example, Ralhan et al. 63 identified only three m/z values as biomarkers, and Fan and Chen 64 formed a panel of five biomarkers, in comparison to Ryberg et al. 65 and Bloemen et al. 66 , who identified 41 and 40 putative biomarkers, respectively. Some found biomarkers that had previously been identified; this is useful both as support for the previous investigations and as validation of the methods being newly applied to the area. Other investigations identified biomarkers that work particularly well together and so formed panels of markers.

Table 5 (excerpt): examples of data mining applied to MS data for biomarker discovery (aim and samples; data mining methods; outcome).

Ovarian cancer biomarker discovery and classification; 37 patients with papillary serous ovarian cancer and 35 controls 69 . Quantification using mzMine; Bayesian classifiers used for biomarker panel analysis with 3-fold cross-validation; SVM tested. A panel of the 3 best biomarkers was identified.

Biomarker discovery for early diabetes mellitus (DM); 942 proteins in healthy volunteer urine and 645 proteins in DM patient urine identified with label-free semi-quantitation 75 . Gene ontology and pathway analysis. Five proteins were significantly associated with diabetic kidney disease and, in total, 344 proteins were significantly associated with DM.

Biomarkers for survival in non-small-cell lung carcinoma (NSCLC) patients receiving immunotherapy; 47 patients with advanced-stage NSCLC 76 . Machine learning. A serum proteomic signature may serve as a biomarker for survival outcome in patients with NSCLC, including patients undergoing immunotherapy.

In Fan and Chen 64 , different panels of biomarkers were compared and those markers that worked best together were identified. The development of panels of biomarkers is useful, as using multiple biomarkers may reduce false positives by removing dependence on individual proteins, and allows proteins that are detected in different diseases to be useful. To discriminate between samples, the majority of the studies applied data mining only to the peaks from the mass spectrometry data that correspond to peptides. To facilitate the development of diagnostic assays and/or inform the underlying biology at a molecular level, peptide biomarkers require further investigation.

Literature mining and pathway analysis
Data mining has been shown to highlight important peptides/MS peaks; however, further analysis is required to determine to which proteins they relate. In the case of data mining applied to quantified proteins, literature mining is also useful for understanding the biological relevance of the proteins identified as potential biomarkers. It may be important to discover more information about interacting proteins and the pathways in which they have a role 77 ; by doing this, it can be determined whether the identified proteins may become useful biomarkers and which processes would be measured. Pathway analysis can be used to narrow down, or provide a focus to, the search for biomarkers by determining which pathways they participate in 78 . Literature mining is also essential for discovering more information after data mining has been applied to MS peaks; however, identification of the proteins to which the peaks relate is first required 70 . Tools such as Ingenuity Pathway Analysis (http://www.ingenuity.com) and DAVID 79 can be used to facilitate literature mining and pathway analysis, or information can be mined directly from article databases such as PubMed (http://www.ncbi.nlm.nih.gov/pubmed).

Limitations
The use of MS in proteomics studies has opened up a number of opportunities; however, there are also technical and conceptual challenges that need to be overcome, and these will vary from study to study. First, it is often impractical to produce large numbers of samples due to time and financial constraints; furthermore, a high-throughput approach is not always required 80 . There is also some difficulty in finding proteins of interest if they are at low abundance compared to other proteins within the sample, which is often the case for proteins that may be suitable as disease biomarkers 81 . Moreover, compared to genome studies, current protein studies often involve few cases, or represent discovery studies that are only intended to prove a principle. Furthermore, although modern machine learning methods are available, their integration into proteomics analysis is rarely performed 82 . Another limitation of proteomics experiments is the experimental design. Appropriate experimental design is important for a successful study, including the amount and type of replication, as well as randomization principles 83 . A weak design cannot even determine whether an observed difference between samples is due to biological variation or simply to a technical factor. The high cost of proteomics experiments often leads to poor experimental design, including inadequate replication and controls, and therefore poor reproducibility 84 .

Future Perspectives
Data mining has been successfully applied to proteomics studies, yet it can still be used for other purposes. For example, rule-based learners, as well as being used for classification, are suitable for the identification of biomarkers, as the attributes that are used frequently in rules are those that are better at discriminating between classes. Rule-based machine learning has also been applied to microarray data to develop gene interaction networks based on genes that are used together in rules 85 . This method could be applied to MS data in the same way, generating networks from groups of proteins that appear together in rules.
There are also other methods that were originally developed for transcriptomic data, such as gene set analysis 86 , that could be modified for application to proteomics. Furthermore, machine learning could be combined with literature data to include background knowledge, which is not necessary for machine learning to be applied, but could improve the data analysis process 87 . Deep learning is a recent and fast-growing field of machine learning. It attempts to model abstraction from large-scale data by employing multi-layered deep neural networks (DNNs), thus making sense of data such as images, sounds, and texts. The early framework for deep learning was built on artificial neural networks (ANNs) in the 1980s, while the real impact of deep learning became apparent in 2006. Since then, deep learning has been applied to a wide range of fields, including automatic speech recognition, image recognition, natural language processing, drug discovery, and bioinformatics 88 .
Peptide identification by fragmentation is a fundamental part of bottom-up mass-spectrometry-based proteomics.
Peptide molecules are fragmented with the aid of one of several techniques, including collision-induced dissociation (CID), higher-energy collisional dissociation (HCD) and electron transfer dissociation, producing a pattern of fragments that is indicative of the amino acid sequence 89 . The frequency with which a peptide backbone bond breaks determines the relative signal intensities in a fragmentation spectrum. Theoretically, the intensities can be calculated by quantum chemistry; however, for molecules as large as peptides, this is too computationally expensive to be practical. Hence, the intensity information contained in fragmentation spectra remains underused in many peptide identification strategies. This problem is an ideal setting in which to employ deep learning: it can learn the relationship between sequence and fragment abundances from a large dataset of training examples, without explicit knowledge of the physical mechanisms behind it. Furthermore, the predictive models do not have to remain black boxes, but can be examined with specialized methods that identify the features, or combinations thereof, that are most relevant for making a prediction. While fragment intensity prediction has been attempted before using a variety of methods, these have had limited success 90,91 . Very recently, Tiwary et al. 92 presented a deep learning method called DeepMass whose accuracy is close to the theoretical limit. Furthermore, they demonstrated its utility by integrating it into data-dependent acquisition (DDA) and data-independent acquisition (DIA) computational proteomics workflows, and the results suggest that both can benefit from the improved spectrum prediction. With such applications of deep learning in the field of mass spectrometry, more accurate methods can significantly increase our ability to identify and characterize known biomarkers in a sample.
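The following toy sketch is not DeepMass, but illustrates the general setup of learning a sequence-to-fragment-intensity mapping with a small neural network; it uses one-hot encoded fixed-length peptides and made-up intensity targets with scikit-learn's MLPRegressor, all of which are assumptions for demonstration only.

import numpy as np
from sklearn.neural_network import MLPRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDE_LEN = 8          # fixed length, purely for illustration

def one_hot(peptide):
    """Encode a peptide sequence as a flat one-hot vector."""
    vec = np.zeros((PEPTIDE_LEN, len(AMINO_ACIDS)))
    for i, aa in enumerate(peptide):
        vec[i, AMINO_ACIDS.index(aa)] = 1.0
    return vec.ravel()

# Synthetic training data: random peptides and made-up fragment intensity vectors
# (one value per backbone cleavage site); a real model would train on measured spectra.
rng = np.random.default_rng(5)
peptides = ["".join(rng.choice(list(AMINO_ACIDS), PEPTIDE_LEN)) for _ in range(500)]
X = np.array([one_hot(p) for p in peptides])
y = rng.random((500, PEPTIDE_LEN - 1))

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(X, y)
print(model.predict(X[:1]).shape)   # predicted intensities for the 7 cleavage sites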

CONCLUSION
Data mining is a data-driven process in which the results obtained depend largely on the data analyzed. The methods employed for feature selection, classification, data sampling and performance evaluation drive the process and alter the final results. Thus, it is recommended to explore more than one technique, to make comparisons and better understand the problem at hand. Furthermore, standardized and optimized methodology is essential for achieving accurate measurements and meaningful analysis. This includes all steps involved, extending from experimental design, specimen collection, storage and handling, through all methods used in analytical chemistry and MS signal processing. Proper bioinformatics, including analytical tools, data storage and sharing, is required for data mining and validation. As proteins are critical biomarkers of disease development and progression, the more we know about them and their relationship to specific diseases, the earlier and more precisely we can intervene. We hope that data mining will enable researchers to characterize disease-relevant protein profiles to build new diagnostic tools and therapeutics. We look forward to the continued application of machine learning and deep learning to proteomics and other fields, to fulfill the mission of making health data useful.