Machine-Learning Methods in Prognosis of Ageing Phenomena in Nuclear Power Plant Components

In the Long-Term Degradation Management (LTDM) project we approach component ageing problems with data-analysis methods. The work includes a literature review of related work. We have used several data sources: water chemistry data from the Halden reactor, simulator data from the HAMBO simulator, and data from a local coffee machine instrumented with sensors. K-means clustering is used in the cluster analysis of nuclear power plant data, and a method for detecting trends in selected clusters is developed. Prognosis models are developed and tested; in our analysis ARIMA models and gamma processes are used. The work focuses on tasks such as classification and time-series prediction, and the methodologies are tested in experiments. The practical applications are implemented with the Jupyter Notebook programming tool and the Python 3 programming language. Failure rates and drifts from normal operating states can be the first symptoms of an approaching fault. The problem is to find data sources with enough transients and events to create prognostic models. Prognosis models that use machine-learning or closely related methods to predict possibly developing ageing features in nuclear power plant data are demonstrated.


I. INTRODUCTION
As many Nuclear Power Plants (NPPs) are approaching the end of their licensed operational lifetime, maintenance is becoming more and more important. It is also expensive to build new power plants, and many countries are giving up NPPs for several reasons. To maintain sufficient energy delivery, it is often necessary to keep the plants running beyond the initially planned operational lifetime. Monitoring the ageing effects of critical and safety-related components at the plant is essential for safe operation in the late phase of operation.
The Long-Term Degradation Management (LTDM) project is studying some of the ageing effects of critical and non-critical components at a plant with data analysis. It is not easy to get data from real NPPs containing ageing information over a long period. Therefore, data sources used here are from simulators, real NPPs, and other somewhat similar processes.
Data storage of process data and other sensor measurements at NPPs can increase the available data to analyse as storage media become cheaper and storage capacities grow. This facilitates data analysis with machine learning and big data technologies to extract new or previously hidden information from the data. Here we have used machine learning methods to extract health indicators from event and transient data, which can be used as input parameters to predictive models. The prognostic models are used to estimate the future health condition of components. With a few exceptions, degradation of equipment develops gradually and monotonically in one direction, towards a deteriorated state. Data can often be separated into one part that corresponds to the physical model of the degradation and a second part that consists of seasonal variations, noise, etc. The data usually need preparation before they are applied to the degradation models, such as noise reduction, removal of outliers, feature extraction and normalisation of the data sets.
We have investigated whether and how the identification of failures and transients over time can be used as health indicators. An increased frequency of failures over time may finally lead to poor performance or a broken component. The second step, after health indicators have been identified and quantified, is to develop the prognostic models that can tell us when a component needs to be repaired or replaced before it breaks. Various prognostic modelling tools have also been applied to process data. For both the classification of events and the prognostic analysis, methods from machine learning libraries have been used.

II. CONDITION MONITORING OF COMPONENT AGEING
The components in a nuclear power plant need to be reliable and safe. Both mechanical and electrical components need attention. It is important to eliminate disturbances and to ensure safe and secure operation.
Most nuclear power plants have a comprehensive testing program where the most important functions of components are checked at fixed time intervals. Important aspects are the monitoring of systems, equipment and their components, follow-up of statistical events, checking, testing and the necessary maintenance actions. It is important to anticipate possible events. When potential problems and needs for changes are observed beforehand, there is enough time to plan and carry out corrective actions without any safety risks or production stops.
Condition Based Monitoring (CBM) is an approach used to address problems in this area. The prediction of the Remaining Useful Lifetime (RUL) is needed in planning the optimal cycle in various component replacement programs. For instance, life-time testing can be used. In accelerated life-time tests, component sustainability and lifetime are tested in controlled conditions. A common model in the estimation of component lifetime is the bathtub curve [1]. The Weibull distribution [2,3] is also used in the estimation of component lifetime. Ageing is mostly a result of combinations of several mechanisms such as thermal ageing, electrical strain and effects, and mechanical stress [4]. Environmental causes include high and low temperatures, sudden temperature variations, humidity, chemical exposure, radiation, pressure changes, and mechanical and biological impurities such as dust and microbes.
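As a brief illustration of the Weibull lifetime model mentioned above, the following sketch evaluates the Weibull hazard rate for three shape values. A shape below one gives a falling failure rate, a shape of one a constant rate, and a shape above one a rising rate, which together resemble the three regions of the bathtub curve. The parameter values are illustrative assumptions and are not taken from the study.

```python
import math

def weibull_hazard(t, shape, scale):
    """Hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1) of a Weibull lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_reliability(t, shape, scale):
    """Survival probability R(t) = exp(-(t / lam) ** k)."""
    return math.exp(-((t / scale) ** shape))

# Illustrative values only (assumed, not from the paper).
times = [1.0, 5.0, 10.0]
early_failures = [weibull_hazard(t, 0.5, 10.0) for t in times]   # k < 1: falling rate
random_failures = [weibull_hazard(t, 1.0, 10.0) for t in times]  # k = 1: constant rate
wear_out = [weibull_hazard(t, 3.0, 10.0) for t in times]         # k > 1: rising rate
```

In a replacement program the shape and scale would be estimated from observed failure times; here they are fixed by hand only to show the qualitative behaviour.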
Certain parts of nuclear power plants are more difficult to instrument than others. For instance, a reactor tank is a very demanding environment for measurements and sensors. It is an environment with high temperatures and pressures, and radiation as well.

III. RELATED WORK
Parts of this work, with a slightly different focus, have been presented in [5]. There we focused more on data structure analysis, fault detection and identification, and also on classification and visualization. In this article we concentrate more on prognosis, in addition to clustering and classification. The review concentrates on methodologies related to those we have used in our study, but to some extent it takes a somewhat broader scope. Some of those methods we have explored in considerable detail, but we do not present the experimental outcomes in this article.
Related topics in the literature are found for instance in studies [6][7][8][9]. A model is made for the increase in the air filter pressure difference during the ageing of a filter [10]. The model sums up cumulative and sporadic aerosol emission, periodic changes during different seasons of the year, and an error term. The main changing variable during different seasons of the year is the air humidity. The goal is to estimate the Remaining Useful Lifetime (RUL), which is an important measure in Condition Based Maintenance (CBM).
Prediction of condenser fouling using machine learning and visualization techniques is presented and analyzed in [11]. Remaining Useful Life (RUL) of choke valves has been estimated in [12]. Predictions based on the gamma process are carried out in this application in oil and gas industry.
We have collected literature on machine learning methods for classifying events in nuclear power plant component degradation management. Condition-based maintenance (CBM) recommends maintenance decisions based on information collected in condition monitoring [13]. There are two important aspects in a maintenance program: diagnostics and prognostics. There is more literature on diagnostics than on prognostics [13]. Prognostic methodologies have entered the literature later. We have also collected literature on long-term degradation and prognosis of nuclear power plant equipment in long-term operation. Equipment reliability and maintenance affect the key elements of competitiveness and security [14]. A proactive maintenance approach helps in eliminating failures of equipment. Early diagnostics and component replacements are needed in planning better maintenance programs. CBM is a decision-making strategy based on real-time diagnosis of failures and prognosis of future equipment health [15]. Nuclear component degradation is identified by a time-frequency ridge pattern in [16].
In [17] a probabilistic neural network (PNN) has been used in classification of severe accident progression scenarios, where the initiating event can be a loss of coolant accident (LOCA), total loss of feedwater (TLOSFW), station blackout (SBO) or steam generator tube rupture (SGTR). Also, a fuzzy neural network (FNN) is used in this study. Machine learning algorithms have been used in data classification in nuclear power plants, e.g., in [18] and [19]. Machine learning is used in prediction [20] and fault detection [21] and identification.
Maintenance-based prognostics of nuclear power plant equipment for long-term operation has been studied in [6]. Failure times of critical equipment can be predicted taking into account potential maintenance actions. The degradation level is defined against the observations for components whose lifetime can be predicted by following the decreasing tendency of the heat transfer coefficient, taking flushing and cleaning into account [6].
In Japan a maintenance strategy and an effective ageing management program, including long-term maintenance plans, have been developed at Tokyo Electric Power Company (TEPCO) [7]. In Korea a long-term asset management strategy for refurbishment and replacement of nuclear power plants has been constructed [8]. Systematic monitoring and proactive measures against ageing mechanisms can reduce unplanned losses due to failure of large components. Long-term degradation profiling in time series for complex physical systems has been studied in [9].
Nondestructive examination (NDE) in detecting material degradation precursors before cracking occurs has been investigated in [22]. The diagnostic-prognostic process for estimating remaining useful lifetime of industrial components has been defined in [23] and a taxonomy of models is presented including data-driven and physical models. Methods for the life expectancy models have been examined.
A survey of applying gamma process in maintenance [24] shows that the method is mainly applied to maintenance decision problems of single components rather than for complete systems. Using ARIMA-based prediction method in prognostics of machine health condition has been examined in [25]. A neural network approach has been used in residual life predictions by using vibration-based degradation signals in [26]. Self-organizing map (SOM) and back propagation neural network methods are introduced for residual life predictions for ball bearings in [27]. Minimum quantization error (MQE) indicator is derived from SOM.
Disposition curves for irradiation-assisted stress corrosion cracking based on international data collection have been presented in [28]. Simulated stress can be caused, e.g., by varying high temperatures or neutron fluence (dose). Measures and constants need to be selected. The screening can be done from the point of view of fracture, fatigue damage or corrosion mechanical damage [29].

IV. DATA SOURCES AND ANALYSIS TOOLS
In the next two chapters we present the data sources used in the experiments, the analysis tool that is used for the realization of our case examples, and the methodology that our experiments are based on. The methodology consists of a selected variety of advanced data-analysis and machine-learning technologies complemented with other statistical methodologies.
The data sources used in the experiments of this article are the following: water chemistry data from the experimental nuclear reactor in Halden, Norway; simulated ageing data from the HAMBO simulator of the Halden Reactor Project, which uses the simulator model of the Forsmark nuclear power plant in Sweden; and local coffee machine data from an instrumented espresso machine at the Institute for Energy Technology in Norway. All datasets in our experiments are normalized.
The first data source used in the experiments is water chemistry data from the Halden reactor. We have named it corrosion data. The corrosion data contain 23 variables, which are mostly concentrations of ions. The data is in two sets. In the first set there are ions, including the weak-acid ions CHOO- and CH3COO-. In the other set there are metals. In the experiment in this article we use both datasets. We have corrosion data from a two-year period and a seven-year period. In this article we analyze the data from the two-year period.
The second data source used in the experiments is from the nuclear power plant model in the HAMBO simulator of the Halden Reactor Project. The model is Forsmark 3 in Sweden. We have four different data sets: an air leak in the condenser, normal data, condenser transient data, and seven-year ageing data including a pump failure. In the experiments in this article, we have used the seven-year ageing data including a pump failure. We have simulated ageing data with noise and without noise. There are fewer than one hundred different variables in this simulator data.
The third data source is local coffee machine data from the Institute for Energy Technology in Halden. A set of sensors has been installed in an old but functioning espresso machine. The data includes physical and electrical measurements from different parts of the machine, including the heater, pumps, valves, etc. The water and steam flow process has one input for water and five outputs for water or steam: two for coffee, one for tea, and two for steam. Data is stored at 0.1-second intervals, and so the number of measurement samples is much larger than in our other data sources. For instance, one of the combined datasets in the analysis includes about six million measurement samples. In this particular dataset there are about 35 measurements.
More features and characteristics of the datasets used in the experiments are described in Chapter 6 in the context of analysing the case examples.
The analysis in this article has been done with the Jupyter Notebook programming tool and the Python 3 programming language. A Spark/Hadoop cluster using several computers was tested to estimate the advantages of parallel computing in time-consuming tasks. A beta version of Zeppelin Notebook with the Python 3 programming language was tried out as well.

V. METHODOLOGY
In this chapter we present some basic ideas behind statistical learning, and how they are utilized in our research work. The methods used are briefly presented and discussed. Examples of using the methods in the research of ageing of nuclear power plant components are presented later in this and the next chapters. The topic is focused on long-term degradation management. One important goal is to use machine learning methods in classifying events at nuclear power plants.
There is a comprehensive set of methodologies available in statistical learning [30] that can be utilized in the classification of data. The basic elements of supervised learning, linear methods for regression, linear methods for classification, and model assessment and selection are to be studied first. Also, neural methods and unsupervised learning should be mentioned as they certainly belong to the group of useful and interesting topics. One taxonomy of machine learning is presented in Table 1 [31].
K-means clustering [32] is a method for finding clusters and cluster centers in a set of unlabeled data. It can be used also for classification of labeled data. Cluster analysis is a kind of segmentation. It relates to grouping and segmenting a collection of objects into subsets or clusters. Within each cluster the objects are closely related. Sometimes the goal is also to arrange the clusters into natural hierarchy. Hierarchical clustering measures dissimilarity between groups of observations. K-means clustering algorithm is a top-down procedure. We have used unsupervised clustering, as opposed to supervised classification where the training set is labeled with cluster numbers.
Normalizing the data helps in getting a suitable fit that can produce reliable prediction results. Sometimes it is also necessary to shuffle the data before dividing it into training set and test set. For handling empty values in data or outliers several possible methods have been used. A method for detecting trends in selected clusters is developed and presented. K-means algorithm is used in clustering and polynomial fit in visualizing trends and predictions for selected variables. Data is normalized before clustering, and outliers and empty values are replaced by mean values when needed. The possible errors here were estimated to be small, almost negligible. Hierarchical clustering can also be combined with the trend method. Next the procedure of this method is presented in detail.

A. METHOD TO DETECT TRENDS IN A CLUSTER
A method to detect trends in a single cluster is developed. First the data is normalized and then clustered with the K-means algorithm. The outliers or missing values are replaced with the mean values of each variable if needed. After that the desired state is defined and selected. In the experiments with corrosion data the desired state is normal data. Normal data means here data including no higher concentration peaks in ions causing corrosion. In other words, we mean clean data without pollution from maintenance actions. In our experiment the desired state includes only a minimal amount of corrosion products.
The cluster index of the desired state is searched. Data is converted to a suitable form for each operation. When the desired cluster has been identified, new time points are produced for it in chronological order, and time series of selected variables are plotted separately. From these plots potential upward or downward trends can be noticed. The same procedure can be repeated for any other cluster when needed.
Next a polynomial fit is produced and then the trends are visualized according to the mathematical result. The polynomial fit can be linear (first order) or of higher order. The fit curve can be extended beyond the last time point as far as we want to make a prediction, which is based on the mathematical trend calculated from the variable values so far. The first order polynomial (linear) fit is mostly used in the prediction.
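The clustering and cluster-selection steps described so far can be sketched as follows. This is a minimal, self-contained illustration and not the code used in the experiments: the two variables are synthetic stand-ins for ion concentrations, the K-means routine is a plain Lloyd iteration with a simple deterministic start (libraries such as scikit-learn use other initialisations), and picking the cluster whose centre lies closest to the origin as the "normal" cluster is an assumption made for this example.

```python
import random

def min_max_normalize(column):
    """Scale one variable to the range 0..1."""
    low, high = min(column), max(column)
    return [(v - low) / (high - low) if high > low else 0.0 for v in column]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100):
    """Plain Lloyd iterations; returns (centers, labels). Needs k >= 2."""
    # Assumption: deterministic start from points spread along the sorted
    # coordinate sums, chosen here only for reproducibility.
    ordered = sorted(points, key=sum)
    centers = [ordered[round(j * (len(points) - 1) / (k - 1))] for j in range(k)]
    labels = []
    for _ in range(max_iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(d) / len(members) for d in zip(*members))
                           if members else centers[c])
        if updated == centers:  # assignments are stable
            break
        centers = updated
    return centers, labels

# Synthetic stand-in data: low "normal" values with short maintenance peaks.
rng = random.Random(1)
fe, so4 = [], []
for i in range(200):
    peak = i % 50 < 3
    fe.append((6.0 if peak else 1.0) + rng.uniform(-0.3, 0.3))
    so4.append((40.0 if peak else 8.0) + rng.uniform(-2.0, 2.0))

points = list(zip(min_max_normalize(fe), min_max_normalize(so4)))
centers, labels = kmeans(points, k=2)

# Desired state: the cluster with the lowest normalized concentrations.
normal = min(range(len(centers)), key=lambda c: sum(v * v for v in centers[c]))
normal_times = [i for i, lab in enumerate(labels) if lab == normal]  # chronological
normal_fe = [fe[i] for i in normal_times]  # time series of one selected variable
```

The list `normal_fe` is the chronologically ordered series of one variable inside the selected cluster, which is the input to the trend fit.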
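The linear fit and its extension past the last time point can be sketched in a few lines. The series below is a hypothetical cluster time series, not measured data, and the least-squares formulas are written out by hand to keep the example self-contained; for higher-order fits a routine such as `numpy.polyfit` would be the usual choice.

```python
def linear_fit(t, y):
    """Ordinary least-squares line y = slope * t + intercept."""
    n = len(t)
    t_mean, y_mean = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

# Hypothetical values of one variable inside the selected cluster,
# re-indexed with consecutive time points 0, 1, 2, ...
values = [0.10 + 0.002 * i + (0.01 if i % 2 else -0.01) for i in range(40)]
time_points = list(range(len(values)))

slope, intercept = linear_fit(time_points, values)

# Extend the fitted line ten steps beyond the last time point.
horizon = 10
prediction = [slope * t + intercept
              for t in range(len(values), len(values) + horizon)]
```

The sign of `slope` indicates an upward or downward trend, and `prediction` is only as trustworthy as the assumption that the trend continues unchanged.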

B. TIME-SERIES PREDICTION
A time series is a sequence in which a metric is recorded over regular time intervals [33]. Forecasting is the next step, in which one wants to predict the future values the series is going to take. The ARIMA method (AutoRegressive Integrated Moving Average) is a forecasting algorithm based on the idea that the information in the past values of the time series alone is used to predict the future values.
The ARIMA model is a class of statistical models for analyzing and forecasting time series data [34]. It is a model that uses the dependent relationship between an observation and some number of lagged observations. It uses differencing of raw observations in order to make the time series stationary. The model uses the dependency between an observation and a residual error from a moving average model applied to lagged observations. In our time-series prediction experiments the ARIMA model [36], Gamma process [37], Gaussian process [38] and Random forest [39] methods have been used. In addition, some simple prediction algorithms have been demonstrated.
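To make the three ingredients concrete, the sketch below implements the simplest non-trivial case by hand: one differencing step (the "I"), one autoregressive lag fitted by least squares (the "AR"), and no moving-average term, i.e. an ARIMA(1,1,0)-like model. It is a deliberately reduced stand-in for illustration only and not the library model used in the experiments; the test series is synthetic and constructed so that its differences follow an exact AR(1) relation.

```python
def difference(series):
    """First differences d[t] = x[t+1] - x[t], used to remove a trend."""
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(diffs):
    """Least-squares estimate of c and phi in d[t] = c + phi * d[t-1] + e[t]."""
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((a - x_mean) ** 2 for a in x)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    phi = sxy / sxx if sxx > 0 else 0.0
    return y_mean - phi * x_mean, phi

def forecast(series, steps):
    """Forecast the differences recursively, then integrate back to levels."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        level += last_diff
        predictions.append(level)
    return predictions

# Synthetic series whose differences obey d[t] = 0.2 + 0.5 * d[t-1].
series, d = [0.0], 1.0
for _ in range(30):
    series.append(series[-1] + d)
    d = 0.2 + 0.5 * d

c_hat, phi_hat = fit_ar1(difference(series))
future = forecast(series, steps=5)
```

A full ARIMA(p,d,q) implementation additionally estimates the moving-average terms and usually fits all parameters by maximum likelihood, which is what a statistics library provides.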

C. DEGRADATION PROGNOSIS WITH PHYSICAL HEALTH INDICATORS
Physical modeling can sometimes be used in combination with machine learning. As an example, machine learning can be used to pre-process input data before it is fed to physics-driven prognostic models. In our case we have used machine learning to identify transients and other events in process data which can be used as condition indicators in physical models.
The mechanical degradation of machinery and equipment is often handled with time-based replacements of machine parts or other condition-preserving maintenance, like lubrication. In cases where machine parts are relatively expensive, or unplanned halts in production represent a high safety risk, physical health monitoring (PHM) is needed in order to plan for maintenance, or to get spare parts in due time before any unplanned breakdown of the equipment. Typical measurements being monitored are, e.g., vibration of rotating machinery, the temperature in bearings or the change of conductivity in a conductor. These measurements are called physical health indicators. Sometimes the health indicators cannot be measured directly but are calculated from other measurements, none of which can individually be used effectively to measure the degradation. In some cases there are no measurements available, and the rate of degradation is determined from visual inspection and given a score from a predefined scale.
For normal wear and tear, the degradation trend is monotonically increasing or decreasing, depending on what physical health indicator is being monitored.
One of the physical modeling methods we have used is gamma processes [35]. In degradation phenomena, the health indicators, or condition indicators, make small incremental changes over time. The incremental changes will be gamma distributed. It is a one-way process, and degradation trends are therefore always monotonically increasing or decreasing (with a few exceptions).
Gamma processes are often used to estimate remaining useful life of components undergoing degradation because the method requires the estimated function to include the last, and most up-to-date, measurement and requires a monotonic behavior of the data. Measurements or computation of the degradation process may not always be monotonic. Therefore, the data needs to be pre-processed with, e.g., machine learning or averaging algorithms before being used in the prognostic models. The mathematical expression and equation for the gamma process used in our experiments is [5]:
$$ f_{X(t)}(x) = \mathrm{Ga}\big(x \mid v(t), u\big) = \frac{u^{v(t)}}{\Gamma(v(t))}\, x^{v(t)-1} e^{-u x}, \qquad x \ge 0, $$
where u is a constant scale parameter, v(t) is a monotonically increasing, positive defined, right-continuous, real-valued function with initial condition v(0)=0 and Γ is the gamma function. The expectation value is $E(X(t)) = v(t)/u$.
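The behaviour of such a process can be illustrated with a short simulation. The sketch below assumes a power-law shape function v(t) = c t^b with hand-picked values of c, b and u (all hypothetical, not fitted to any data in this study) and draws independent gamma-distributed increments with shape v(t2) - v(t1). Since the expectation is v(t)/u, the parameter u enters the sampler as the inverse of the scale argument of `random.gammavariate`.

```python
import random

def power_law_shape(t, c=2.0, b=1.2):
    """Assumed shape function v(t) = c * t ** b with v(0) = 0."""
    return c * t ** b

def simulate_path(times, u, rng, shape_fn=power_law_shape):
    """One sample path X(t) at increasing times, built from gamma increments."""
    level, previous_shape, path = 0.0, 0.0, []
    for t in times:
        shape = shape_fn(t)
        # Increment over (t_prev, t] ~ Gamma(shape difference, scale 1/u).
        level += rng.gammavariate(shape - previous_shape, 1.0 / u)
        previous_shape = shape
        path.append(level)
    return path

def time_when_mean_reaches(threshold, u, c=2.0, b=1.2):
    """Solve v(t) / u = threshold for the assumed power-law v(t)."""
    return (threshold * u / c) ** (1.0 / b)

rng = random.Random(42)
times = [float(t) for t in range(1, 11)]
u = 4.0
paths = [simulate_path(times, u, rng) for _ in range(2000)]

sample_mean_end = sum(p[-1] for p in paths) / len(paths)
expected_end = power_law_shape(times[-1]) / u   # E(X(t)) = v(t) / u
t_cross = time_when_mean_reaches(5.0, u)        # when the mean hits a threshold
```

Every simulated path is non-decreasing by construction, which is why noisy measurements have to be smoothed into a monotonic series before a gamma process is fitted. Note that `t_cross` is the time at which the mean degradation reaches the threshold, which is a simpler quantity than the full distribution of the first crossing time.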

In [5] we have presented case examples about using Principal Component Analysis (PCA) in differentiating failure data from normal data, and in separating operational stages in process data. In the latter example K-means clustering is combined with PCA. The data sources in these examples are from the Loviisa nuclear power plant in Finland and from the HAMBO simulator of the Halden Project using the Forsmark nuclear power plant (in Sweden) model. It was necessary to add noise to the simulator data to visualize the data densities in the transient scenario.
In [5] we showed with the case examples how to detect anomalies, structures and possible developing ageing features in nuclear power plant data with machine-learning methods. In this article we concentrate more on prognosis. PCA is a powerful tool in data structure detection, but the difficulty is to connect the possible faults to certain components. PCA loadings help somewhat in this issue. The PCA method is efficient especially with big datasets.
In the next subsections we present experiments to demonstrate how our proposed methodology works in practical examples. Three case examples include prognostics models to achieve certain goals in predicting near future behaviour.

A. TRACING CORROSION
The goal in the first experiment is to trace or observe corrosion in nuclear power plant water chemistry data. We have developed a method using well-known algorithms and programmed a procedure, performing also all the required operations on data to carry out this task.
In the first case example we present a method to find trends in a single cluster. The data is from water chemistry analysis from the Halden reactor as explained in Chapter 4. In the primary circuit and the secondary circuit of the Halden reactor the water is purified, and the concentration of all minerals and other pollutants is much smaller than in ordinary water. The concentration of Cl- and SO4 2- ions is the main cause of corrosion and cracking.
Mostly an increase of concentration of these particles occurs after manual maintenance operations, and the pollutants probably come from the cooling water of pumps. Regular cleaning operation is carried out after an increase of pollutants. These effects are seen in the data as peaks (up and down) in all detected concentrations.
The metals in the water are a kind of corrosion product, and their concentrations are closely connected to the corrosion effect. The most important metals in this context are, in order of importance, Fe, Cr, Ni and Cu. The iron concentration is measured with two different methods. These concentrations tend to increase after certain maintenance actions. Fe, Cr and Ni all come from the same material, steel, and therefore it would be logical that they would somewhat correlate.
In Fig. 1 the normal data cluster (light-green dots) is in the lower-left corner. The other clusters represent concentration peaks for both variables after maintenance actions (dots in dark blue and blue) and the delays before returning to the normal state (dots in purple and yellow). The clusters nearest to the normal data can include smaller maintenance peaks. All values in Fig. 1 are normalized. As the water is purified, the concentrations are mostly relatively small.
The concentration unit in the original data is ppb (parts per billion, meaning the American billion, 10^9). With iron (Fe) ions the value range is from 0 to 10 including the maintenance peaks. With sulfate ions (SO4 2-) the normal value range is from 0 to 20, and the highest values in the maintenance peaks are around 50. The scatter plot of iron concentration and sulfate concentration is clustered into five clusters with the K-means algorithm, see Fig. 1 (upper plot). The labeled normal data cluster is identified from the whole dataset and the samples are picked in chronological order and plotted into a time series plot. A linear fit and other low-order polynomial fits are calculated for the time series to find possible trends in this part of the data, see Fig. 1 (lower plot). The linear fit in particular can be used in short-term prediction of upcoming data points in the near future. From the calculated trends we cannot draw any strong conclusions about corrosion. We made similar experiments with the seven-year corrosion data, but no significant difference was noticed compared to the two-year data examined here. The reason why we present results from the two-year data is that the seven-year data included more values that could be considered outliers, at least in the sense of visualizing the results.

B. TIME-SERIES PREDICTION WITH ARIMA MODEL
In the second experiment we make long-term forecasts for ageing-related variables or variables that behave in a somewhat similar manner. We have tried out several advanced prognosis methods, and after comparison we present the most promising ones. The data sources in these experiments are ageing data from the HAMBO simulator of the Halden Reactor Project, including a gradually developing pump failure, and data from the local experimental coffee machine, as explained in Chapter 4.
Time-series prediction has been tested with several methodologies, beginning with simple algorithms such as the naïve forecast, simple average forecast, moving average forecast, etc. A polynomial fit including prediction has also been tested. Of the somewhat more sophisticated methodologies, the ARIMA model (AutoRegressive Integrated Moving Average), Gamma process, Gaussian process and Random Forest algorithm have been used. The most reliable and promising results in time-series prediction were obtained with the ARIMA model. The ARIMA model performs well in one-step-ahead predictions when the history is updated on every time step. Especially with a larger amount of data this kind of prediction is very accurate. This kind of model is characterized by on-line prediction, which is not always a suitable approach.
It is also possible to use the ARIMA model for predicting several time steps at once. In this article the ARIMA model is used in such off-line prediction. The effects of varying different parameters in the ARIMA model are also tested.
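The three simple algorithms named above are easy to state in code. The following sketch gives one common definition of each; the example history and the window length are arbitrary choices for illustration.

```python
def naive_forecast(history):
    """Next value equals the last observed value."""
    return history[-1]

def average_forecast(history):
    """Next value equals the mean of all observed values."""
    return sum(history) / len(history)

def moving_average_forecast(history, window):
    """Next value equals the mean of the last `window` observed values."""
    recent = history[-window:]
    return sum(recent) / len(recent)

history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
next_naive = naive_forecast(history)                  # 6.0
next_average = average_forecast(history)              # 3.5
next_moving = moving_average_forecast(history, 3)     # 5.0
```

Such forecasts are mainly useful as reference points against which a more elaborate model can be judged.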
In Fig. 2 ARIMA model one-step ahead prediction of a pump mass flow in simulated ageing data is shown. In Fig. 3 comparison of different solver methods in ARIMA model is seen. Root mean square error and CPU time are measured for the prediction in Fig. 2. In this experiment the training set includes 2059 samples and the test set 500 samples (19.5% of the data).
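The difference between one-step-ahead prediction with a continuously updated history and a single multi-step forecast can be sketched with a generic evaluation loop. To keep the example self-contained, a simple drift model (last value plus average historical slope) stands in for the ARIMA model, and the series is synthetic; the measurement of RMSE and CPU time mirrors the quantities reported for the solver comparison but uses none of its data.

```python
import math
import time

def drift_forecast(history, steps=1):
    """Stand-in model: extend the average historical slope from the last value."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (h + 1) for h in range(steps)]

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def walk_forward(train, test, model):
    """One-step-ahead predictions; the true value is appended after each step."""
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(model(history, 1)[0])
        history.append(observed)
    return predictions

# Synthetic, slowly accelerating decline (illustrative only).
series = [1.0 - 0.001 * i - 0.00001 * i * i for i in range(120)]
train, test = series[:100], series[100:]

start = time.process_time()
one_step = walk_forward(train, test, drift_forecast)
cpu_seconds = time.process_time() - start

multi_step = drift_forecast(train, steps=len(test))  # forecast all steps at once

rmse_one_step = rmse(test, one_step)
rmse_multi_step = rmse(test, multi_step)
```

Because the one-step loop sees every new observation, its error stays small, whereas the multi-step forecast accumulates error over the horizon. The same loop works with any model function that accepts a history and a number of steps.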
The solver methods in Fig. 3 are the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (bfgs), limited-memory BFGS (lbfgs), Newton's method (newton), the Nelder-Mead simplex method (nm), the conjugate gradient algorithm (cg), the Newton conjugate gradient algorithm (ncg) and Powell's conjugate direction method (powell).
In Fig. 4 a longer-term prediction for a pump mass flow in simulated ageing data is presented. The training set and test set are shown in the same figure with the prediction, including a ±10% confidence interval. The ARIMA model succeeded better in this task than the Gamma process or Gaussian process. The division into training set and test set is the same as in the example in Fig. 2, so the test set includes 19.5% of the data. The long forecast includes 500 time-steps. The mean value of the training set (normalized values) is 0.393 and the standard deviation of the same set is 0.027. The last point in the forecast is 0.323. The prediction error is about 2%.
A similar longer ARIMA model prediction for a temperature in the coffee machine data is presented in Fig. 5. The mean square error is calculated and displayed as well. In this figure the data is also normalized. In this task the ARIMA model produced a better and more reliable prediction than the Random Forest (RF) method. Here the test set includes 100000 samples (17% of the data). The long forecast includes 100000 time-steps. The prediction error is under 2%.
Figure 5. ARIMA model prediction for the test set of the temperature before feedwater pump in the coffee-machine normalized data including calculated Mean Square Error (MSE) for the deviation between test set and forecast. The confidence interval here is ±10%.
As there is no prior reference result, no comparison to a baseline is done. There are multiple measures of the accuracy of the model fit, such as ME, MAE, RMSE, MPE, MAPE, MASE, etc. All indicators are aggregations of two types of error: bias (wrong model, accurate fit) and variance (right model, inaccurate fit). In the metrics of the solver methods RMSE is used. The chosen ARIMA model structure comes from a Python library, see the first line in Fig. 6. The same figure shows the Python code of the basic lines of the long ARIMA forecast. A somewhat similar structure is used in the other ARIMA model examples as well. In the solver comparison a slightly more complex code structure is needed. Table 2 lists the ARIMA model parameter values used in our experiments.
    # Older statsmodels API; newer releases provide ARIMA in statsmodels.tsa.arima.model
    from statsmodels.tsa.arima_model import ARIMA

    model1 = ARIMA(x1_train1, order=(5,1,0))     # p=5 lags, d=1 differencing, q=0
    model1_fit = model1.fit()
    output1 = model1_fit.forecast()              # one-step-ahead forecast
    # In this older API forecast() returns (values, stderr, conf_int); [0] keeps the values
    forec1 = model1_fit.forecast(steps=500)[0]   # 500-step forecast
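Several of the accuracy measures named above can be written out directly. The sketch below uses the convention error = actual minus forecast and expresses the percentage measures in percent; sign and scaling conventions vary between sources, so these definitions are one common choice rather than the ones necessarily used in the experiments. MASE is left out because it additionally needs the in-sample error of a naive forecast. The numbers are made up for illustration.

```python
import math

def forecast_errors(actual, forecast):
    return [a - f for a, f in zip(actual, forecast)]

def mean_error(actual, forecast):                     # ME: signed bias
    e = forecast_errors(actual, forecast)
    return sum(e) / len(e)

def mean_absolute_error(actual, forecast):            # MAE
    e = forecast_errors(actual, forecast)
    return sum(abs(x) for x in e) / len(e)

def root_mean_square_error(actual, forecast):         # RMSE
    e = forecast_errors(actual, forecast)
    return math.sqrt(sum(x * x for x in e) / len(e))

def mean_percentage_error(actual, forecast):          # MPE, in percent
    e = forecast_errors(actual, forecast)
    return 100.0 * sum(x / a for x, a in zip(e, actual)) / len(e)

def mean_absolute_percentage_error(actual, forecast): # MAPE, in percent
    e = forecast_errors(actual, forecast)
    return 100.0 * sum(abs(x / a) for x, a in zip(e, actual)) / len(e)

actual = [2.0, 4.0, 5.0]
predicted = [1.0, 5.0, 5.0]
me = mean_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_square_error(actual, predicted)
mpe = mean_percentage_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
```

In this small example the signed errors cancel, so ME is zero while MAE and RMSE are not, which shows why a single measure rarely suffices. The percentage measures are undefined when an actual value is zero.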

C. GAMMA PROCESS
In the third experiment we apply a gamma process to forecast the remaining useful lifetime (RUL) of a simulated pump. The pump mass-flow data visualized in Fig. 2 was also applied to a gamma process predictive algorithm. In continuously degrading processes, the small incremental differences from the previous measurement will usually be gamma distributed. Hence, gamma processes can be used to make predictions on monotonically increasing and decreasing functions. The pump mass-flow data has some noise, which is eliminated by averaging the data over rather long time sequences, thereby making the function monotonically increasing. The scaling of the parameters in this case is different from that in Fig. 2. The gamma process determines the parameters for a power-law function fitted to the measurement data, see

VII. DISCUSSION
The focus of this article is on the prognosis of ageing-related phenomena and, to a lesser extent, on clustering and classification of data. One aim is also to visualize possible developing ageing features in nuclear power plant data by using machine learning algorithms and some closely related methods. After introducing the problem domain, some component ageing issues are briefly discussed and literature about related work is examined. In this section we analyse and compare our work with related studies.
Two of the basic themes in this work are Condition-Based Monitoring (CBM) and Remaining Useful Lifetime (RUL). The RUL concept is also discussed in earlier work done in the Halden Project and at IFE, in [10] and in [12]. Principal Component Analysis (PCA) has been the main method for dimensionality reduction in [5], and it was used in data structure exploration as well.
In [6] the heat-transfer coefficient is used to help estimate the component lifetime. It is a roughly monotonically decreasing function, and therefore the challenge is mostly in defining the threshold for a recommended component replacement or repair. Regular cleaning slows down this degradation and needs to be taken into account in the calculations. We plan to use a similar idea in our future prognostic studies. Our gamma process example already moves in this direction.
In [25] an ARIMA model is used in the prognosis of health condition. This idea relates to our health-index approach, and we have also used the ARIMA model successfully in our time-series predictions.
The differences between the methodologies used, and their applicability and advantages for our purposes with the data available, were discussed in Section VI on practical applications. K-means clustering works well in our experiments, and therefore there has not been a need to try other clustering methods such as the Learning Vector Quantisation (LVQ) algorithm [30]. More testing in new directions could still be done. Adaptive nearest-neighbour methods have been proposed for problems with high dimensionality. Association rule analysis could be useful for us as well.
The Self-Organizing Map (SOM) method [40] is sometimes described as a constrained version of K-means clustering. Multidimensional scaling and principal curves are often compared with the SOM method and could be tried out as well.
In Sensors Journal somewhat similar topics are discussed in [41] and [42]. In [41] machine learning and deep learning techniques are used with large datasets for extracting relevant information and making predictions. In [42] machine learning methodology is utilized in automatic teaching of several different motion activities.
In data grouping and classification, common methodologies have been used. There is a wide variety of other potential options. For instance, neural networks offer interesting options for our future experiments. Unsupervised and supervised learning in general also deserve more attention.
Time-series prediction is one of the main focus areas in this article. In general, this field is widely studied and well known. We are working on developing more reliable and accurate prognostic models for component lifetime and remaining useful operation time.
A computing procedure to select and analyse single clusters was developed. The procedure utilizes rather common methodologies such as K-means clustering and polynomial fitting to reveal potential trends in some central parts of the data. In addition, the procedure includes all the data operations needed in our experiments. The method works well in principle, but with this data we could not demonstrate corrosion effects, because the observed variations were too small to be significant.
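The combination of clustering and trend fitting can be sketched as follows. This is an illustrative reconstruction under stated assumptions and not the procedure developed in the project: it uses a plain K-means implementation with a deterministic farthest-point initialisation, selects one cluster, and fits a first-degree polynomial to one variable of that cluster against time. The two synthetic operating regimes, the drift of 0.0002 per step and all function names are hypothetical.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break                      # assignments stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels

def linear_trend(ts, ys):
    """Least-squares slope and intercept (first-degree polynomial fit)."""
    n = len(ts)
    t_mean, y_mean = sum(ts) / n, sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / sxx
    return slope, y_mean - slope * t_mean

# Two synthetic operating regimes; only the second one drifts slowly.
rng = random.Random(42)
points = []
for t in range(400):
    if (t // 50) % 2 == 0:
        points.append((0.2 + rng.gauss(0, 0.01), 0.3 + rng.gauss(0, 0.01)))
    else:
        points.append((0.8 + 0.0002 * t + rng.gauss(0, 0.01),
                       0.7 + rng.gauss(0, 0.01)))

centroids, labels = kmeans(points, k=2)
selected = max(range(2), key=lambda j: centroids[j][0])   # cluster to inspect
times = [t for t, lab in enumerate(labels) if lab == selected]
values = [points[t][0] for t in times]
slope, intercept = linear_trend(times, values)
print("estimated drift per time step in selected cluster:", slope)
```

Fitting the trend inside one cluster removes the jumps between operating states, so a slow drift that would be hidden in the full time series becomes visible. Whether such a drift is significant still has to be judged against the noise level, as noted above for the corrosion data.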
The health index is an important concept in our work. The dataset can be large, diverse and complex. The basic approach is to collect health indicators such as maintenance logs, technical reports and measurements, and to calculate the frequency of process condition states, transients and incidents from process data using machine learning techniques. A suitable combination of methodologies and tools comprises a plausible environment in which, e.g., failures and transients can be detected and analysed.
A methodology we have not yet used for the ageing data is the Kalman filter. Kalman filters are a common and popular method in predictive modelling. A Kalman filter requires a model a priori. The filter takes into account the measurement noise and the process noise, and minimizes the prediction error over repeated cycles of prediction and filtering.
There can be some stability problems when using a Kalman filter: the covariance matrix can lose positive semi-definiteness over time, and detecting outliers is more complicated. The filter is able to take into account the variance of the initial estimate of the state and the variance of the model error. It provides information about the quality of the estimation by providing the variance of the prediction error. The Kalman filter is well suited for online digital processing but requires some computational power.
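The prediction and filtering cycle described above can be shown with a scalar example. The sketch below assumes the simplest possible a priori model, a random-walk state observed with additive noise, and is not taken from the project. The names q (process noise variance), r (measurement noise variance), x0 and p0 (initial estimate and its variance) and the measurement values are hypothetical. In the scalar case the covariance is a single positive number, so the semi-definiteness issue of the matrix case does not arise.

```python
def kalman_1d(measurements, q, r, x0, p0):
    """Scalar Kalman filter for the model x_k = x_{k-1} + w, z_k = x_k + v.

    q: variance of the process noise w (model error)
    r: variance of the measurement noise v
    x0, p0: initial state estimate and its variance
    Returns the filtered estimates and their variances.
    """
    x, p = x0, p0
    estimates, variances = [], []
    for z in measurements:
        # Prediction: the state is carried over, its uncertainty grows.
        p = p + q
        # Filtering: blend prediction and measurement with the Kalman gain.
        gain = p / (p + r)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
        variances.append(p)
    return estimates, variances

# Illustrative normalized measurements of a nearly constant signal.
measurements = [0.39, 0.41, 0.40, 0.38, 0.40, 0.39]
estimates, variances = kalman_1d(measurements, q=1e-5, r=1e-4, x0=0.4, p0=1e-2)
for z, x, p in zip(measurements, estimates, variances):
    print(f"measured {z:.3f}  estimate {x:.4f}  variance {p:.2e}")
```

The returned variances correspond to the quality information mentioned above: they shrink as measurements accumulate and settle at a level determined by the ratio of q to r.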
The Jupyter Notebook tool and the Python programming language were used in the experiments. The tool proved to be convenient and practical. A larger review of applicable methods and tools is part of our project plan, as is the identification of applicable events. Safety-critical components are a focus area. We also experimented with the Zeppelin Notebook environment, which is another promising tool for data analysis, and tested a cluster environment for parallel computing in time-consuming tasks.
In our application examples mostly well-known data-analysis methods are used. The merit of the work comes from applying a carefully selected set of methods to this new application area in our problem domain; the emphasis has not been on developing new methodologies or algorithms. In addition, as mentioned earlier, we have developed a procedure to detect trends in clusters by combining cluster analysis, polynomial fitting and the necessary data operations.
There has been a gap between our analysis and long-term degradation, which is due to the lack of appropriate data. Simulated ageing data has been one important step towards filling this gap. By analyzing and visualizing anomalies, we approach this goal. Validation of the methodology will become easier when we get data that include faults and the fault history over the whole lifetime of a component.
We aim to obtain more data to investigate ageing-related issues in a later phase of the project. We are waiting for more comprehensive water chemistry data and for data on pump failures, steam generator vibrations and crack development in a reactor tank. We have already analyzed simulated data of slowly developing ageing issues of pumps and generators, measured as increasing vibrations, to increase our capability to conduct similar analyses of real data.
Before we can give more solid recommendations about component ageing issues, we need more data, and especially data of sufficient quality with respect to the observed phenomena. We have already been able to show, e.g., how to differentiate anomalies in data with different classification methods.

VIII. CONCLUSION
We presented the problem domain, discussed ageing, and reviewed related literature. We described the most suitable data sources available and the development tools used in our work, and selected the methodologies. The methodologies were tested in experiments, where we demonstrated, e.g., prognosis models based on classification. We applied and presented prediction models and obtained promising results. The main contributions of our work are in structure visualization, anomaly detection, classification-based prognosis models, as well as other prediction models and their results. We also presented a new approach to visualizing corrosion data, connected to the method for detecting trends that may indicate corrosion.