Quantitative investment is the process of building mathematical models, using statistics, information technology, and mathematics, to quantify risks and returns and to implement traditional investment concepts. In the past, however, quantitative investment received little recognition because computing tools were limited. With advances in computer science and quantitative analysis theory, traditional fundamental analysis and investment models built on sample-based statistical techniques can no longer meet the requirements of investors. Quantitative investment strategies based on data mining technology are therefore receiving more and more attention. In this paper, we use MATLAB to capture big data from financial and economic websites, train a neural network model to predict the trend of stock price changes, and finally establish a suitable quantitative stock selection model. The simulation results show that the ideal goals in the stock market can be achieved only by using quantitative stock selection strategies to curb risks and by selecting a suitable investment portfolio.
Keywords: Quantitative investment, data mining, neural network, portfolio
In recent years, with the continuous development of the stock market, more and more attention has been paid to quantitative investment technology [1–3], and quantitative investment systems are gradually maturing. As stock market rules continue to improve, the number of listed stocks and the volume of associated data keep increasing. This large body of complex stock data contains useful information that cannot be found by conventional methods. However, the data mining technology developed in recent years can help us extract information from the vast amount of stock data [4–6]. By analyzing these data, we can obtain the information we want. In terms of factor stock selection, some researchers have successfully proposed quantitative stock selection models based on multiple factors [7,8]. These systems use quantitative methods to analyze transaction data and financial indicators of listed companies and combine them with statistical testing methods to help investors find the most valuable investment portfolio. Although some of these methods are convenient and easy to operate, they ignore the correlation and overlap between factors [9]. Hierarchical clustering with the shortest-distance (single-linkage) criterion can reduce the massive set of stock price series, which both lightens the workload and makes the analysis more intelligent. However, the shortest-distance method tends to keep absorbing samples into the same class, so it is an extreme method. Patel et al. compared four prediction models, namely the artificial neural network (ANN), support vector machine (SVM), random forest, and naive Bayes, and identified the best-performing one [10].
Data mining is the process of extracting hidden, previously unknown, and useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random real-world data [11,12]. The core of data mining is to use algorithms to train on the processed input and output data and obtain a model. The model is then verified, so that it describes the relationship between the input and output data to a certain extent. Finally, the model is applied to newly input data to obtain new output that can be used for interpretation and application [13]. The content of data mining mainly includes association, regression, classification, clustering, prediction, and diagnosis.
A typical BP neural network includes an input layer, one or more hidden layers, and an output layer. Its network structure is shown in Figure 1. The learning process of the BP algorithm consists of forward propagation of the input and back propagation of the error. In forward propagation, the input samples enter at the input layer and are processed by the hidden layer units, and the actual output value of each unit is calculated from the weights and thresholds. If the difference between the actual output and the expected output falls within a predetermined error range, the learning process ends successfully. Otherwise, back propagation passes the network error backward from the output layer and adjusts the weight matrices according to the difference between the actual and expected outputs, so as to reduce the error of the network [14,15].
Figure 1. Structure of BP neural network model 
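First, we define the following variables. The input layer vector is $X=(x_1,\dots,x_n)^T$, the hidden layer output vector is $Y=(y_1,\dots,y_m)^T$, the output layer output vector is $O=(o_1,\dots,o_l)^T$, the expected output vector is $D=(d_1,\dots,d_l)^T$, the weight matrix from the input layer to the hidden layer is $V=(v_{ij})_{n\times m}$, and the weight matrix from the hidden layer to the output layer is $W=(w_{jk})_{m\times l}$. The thresholds are absorbed into the weights by adding a constant input component. The specific implementation steps of the BP neural network are as follows:
Step 1. The initial matrices $V$ and $W$ of the network are chosen according to the range of the activation function. We determine the maximum number of trainings $M$ and the learning accuracy value $\varepsilon$, and choose the activation function:
$$f(x)=\frac{1}{1+e^{-x}} \tag{1}$$
Step 2. After data preprocessing, we select a sample as input and compute the outputs of the hidden layer $Y$ and the output layer $O$:
$$y_j=f(\mathrm{net}_j),\quad \mathrm{net}_j=\sum_{i=1}^{n}v_{ij}x_i,\quad j=1,\dots,m \tag{2}$$
$$o_k=f(\mathrm{net}_k),\quad \mathrm{net}_k=\sum_{j=1}^{m}w_{jk}y_j,\quad k=1,\dots,l \tag{3}$$
Step 3. Calculate the error from the actual output and the expected output of the network:
$$E=\frac{1}{2}\sum_{k=1}^{l}(d_k-o_k)^2 \tag{4}$$
Step 4. Calculate the partial derivative of the error function with respect to every neuron of the output layer and the hidden layer. Since $f'(x)=f(x)(1-f(x))$, the error signals are
$$\delta_k^{o}=-\frac{\partial E}{\partial \mathrm{net}_k}=(d_k-o_k)\,o_k(1-o_k) \tag{5}$$
$$\delta_j^{y}=-\frac{\partial E}{\partial \mathrm{net}_j}=\Big(\sum_{k=1}^{l}\delta_k^{o}w_{jk}\Big)y_j(1-y_j) \tag{6}$$
Step 5. Use the error signals to adjust the connection weights of each layer. Let $w_{jk}$ be the weight from the hidden layer to the output layer, $v_{ij}$ be the weight from the input layer to the hidden layer, and $\eta>0$ be the learning rate:
$$\Delta w_{jk}=\eta\,\delta_k^{o}\,y_j \tag{7}$$
$$\Delta v_{ij}=\eta\,\delta_j^{y}\,x_i \tag{8}$$
Step 6. Calculate the global error over all $P$ training samples, where the superscript $(p)$ denotes the $p$-th sample:
$$E_{\mathrm{total}}=\frac{1}{2P}\sum_{p=1}^{P}\sum_{k=1}^{l}\big(d_k^{(p)}-o_k^{(p)}\big)^2 \tag{9}$$
Step 7. Compare the global error with the precision value. If the global error is less than the given precision value $\varepsilon$, or the number of trainings exceeds the maximum number $M$, the algorithm ends; otherwise, learning continues.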
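Steps 1 to 7 can be sketched in a few lines of code. The following is a minimal illustration in Python (the paper itself uses MATLAB), assuming a single hidden layer, a sigmoid activation, and per-sample gradient-descent updates. The function name `train_bp`, the learning rate `eta`, the hidden-layer size, and the toy logical-OR data are all assumptions made for this example and do not come from the paper.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def train_bp(samples, targets, n_hidden=3, eta=0.5, max_epochs=5000,
             eps=1e-3, seed=0):
    """Train a one-hidden-layer BP network; return (predict, error, epochs)."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0]), len(targets[0])
    # Step 1: random initial weights; the last column of each row acts as
    # the threshold (it multiplies a constant input of 1).
    V = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(x):
        # Step 2: forward propagation through the hidden and output layers.
        xb = list(x) + [1.0]
        yb = [sigmoid(sum(v * xi for v, xi in zip(row, xb))) for row in V] + [1.0]
        o = [sigmoid(sum(w * yj for w, yj in zip(row, yb))) for row in W]
        return xb, yb, o

    def global_error():
        # Steps 3 and 6: squared error averaged over all training samples.
        total = 0.0
        for x, d in zip(samples, targets):
            o = forward(x)[2]
            total += 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o))
        return total / len(samples)

    error, epoch = global_error(), 0
    while error >= eps and epoch < max_epochs:   # Step 7: stopping rule
        for x, d in zip(samples, targets):
            xb, yb, o = forward(x)
            # Step 4: error signals of the output and hidden neurons.
            delta_o = [(dk - ok) * ok * (1 - ok) for dk, ok in zip(d, o)]
            delta_h = [yb[j] * (1 - yb[j]) *
                       sum(delta_o[k] * W[k][j] for k in range(n_out))
                       for j in range(n_hidden)]
            # Step 5: gradient-descent weight updates.
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    W[k][j] += eta * delta_o[k] * yb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    V[j][i] += eta * delta_h[j] * xb[i]
        error, epoch = global_error(), epoch + 1

    return (lambda x: forward(x)[2]), error, epoch


# Toy data (logical OR), purely illustrative.
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [[0], [1], [1], [1]]
predict, final_error, epochs_used = train_bp(inputs, labels)
print(final_error, epochs_used, [round(predict(x)[0], 2) for x in inputs])
```

The hidden-layer error signals are computed before any weight is changed, so both layers are updated from the same forward pass. For stock data, the rows of `inputs` would be the standardized indicators of each stock and `labels` the observed outcome.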
Data is the foundation of data mining. Many financial websites have rich and reliable transaction data, such as Yahoo, Sina and Tencent. Yahoo has an interface with MATLAB, so we use MATLAB to obtain these transaction data from Yahoo. The important function “fetch” in MATLAB is used as follows:
Data = fetch(Connect, 'Security', 'FromDate', 'ToDate')
Here, ‘Connect’ indicates the data source, such as Yahoo; ‘Security’ indicates which stock to obtain data for; ‘FromDate’ and ‘ToDate’ are the start and end of the specified time range. In this paper, we use this method to obtain data for the Shenzhen Stock Exchange stocks numbered 1 to 1000 and save them in Excel. After the data are standardized, training samples and prediction samples are obtained. We then use the neural network model described in Section 2.2 to train on the samples and make predictions.
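The standardization step can be illustrated as follows. This is a hedged Python sketch (the paper works in MATLAB and does not state which standardization it uses); min-max scaling to [0, 1] is assumed here because most values in Table 1 lie in that range. The helper names and the sample numbers are hypothetical.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly into [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: no spread to scale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def normalize_table(rows):
    """Normalize every column of a row-oriented table independently."""
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def split_samples(rows, n_train):
    """Use the first n_train rows for training and the rest for prediction."""
    return rows[:n_train], rows[n_train:]


# Hypothetical raw indicators (rows = stocks, columns = indicators).
raw = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0], [40.0, 200.0]]
scaled = normalize_table(raw)
train_rows, predict_rows = split_samples(scaled, n_train=3)
print(scaled)
```

Scaling each indicator separately keeps indicators with large numeric ranges from dominating the network inputs.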
The model produces a ranking table of all stocks, as shown in Table 1. The ranking is based on the predicted value in the last column, which can be understood as the probability that the stock will rise in the future. In actual trading, we can therefore buy the top-ranked stocks and sell or avoid the bottom-ranked ones. This provides the basis for buying and selling decisions in quantitative stock selection.
Table 1. Sort table of stocks (ranked by the predicted value in the last column)
 65  1         1         1         1         0.217464  0.689387  0.615622  0.933314  1.076462
802  0.649562  0.714952  0.590378  0.669138  0.533305  0.493489  0.119175  0.450005  0.995385
985  0.489474  0.388007  0.219643  0.032438  0.289402  0.922103  0.458649  0.370715  0.985637
582  0.350914  0.507703  0.590378  0.669138  0.58377   0.410922  0.118595  0.226798  0.940392
 66  0.846695  0.593295  0.590378  0.669138  0.551252  0.670699  0.293865  0.605941  0.885136
751  1         1         0.87818   0.881371  0.332703  0.595813  0.626997  0.948292  0.88133
707  0         0.650724  0.302097  0.244671  0.699561  0.544556  0.236814  0.403214  0.830667
819  1         0.888569  0.87818   0.881371  0.613117  0.822664  0.666953  0.978776  0.826818
522  0.343439  0.942634  0.417467  0.456905  0.029334  0.000374  0.035146  0.607885  0.778539
521  0.710836  1         0.302097  0.244671  0.396943  0.315258  0.393372  0.913728  0.75364
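Selecting the top-ranked stocks from such a table is a simple sort. The Python sketch below uses the predicted values from the last column of Table 1 and assumes that the first column is the stock number; the function name `top_k` is hypothetical.

```python
# Predicted values copied from the last column of Table 1, keyed by the
# first column (assumed here to be the stock number).
predicted = {
    522: 0.778539, 65: 1.076462, 707: 0.830667, 802: 0.995385,
    66: 0.885136, 521: 0.75364, 985: 0.985637, 751: 0.88133,
    582: 0.940392, 819: 0.826818,
}


def top_k(scores, k):
    """Return the k keys with the largest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


buy_list = top_k(predicted, 8)
print(buy_list)
```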
In this experiment, we also use historical data to evaluate the model, and the verification method is full set verification. Figure 2 shows the classification accuracy and error rate of the model. The accuracy is clearly higher than the error rate. In finance, an accuracy of 72% is not easy to achieve, so as long as the number of transactions is large enough, the probability of profit is considerable.
Figure 2. Evaluation results of the model 
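The accuracy and error rate reported in Figure 2 can be computed as in the following Python sketch. It assumes that the model output has been turned into up/down class labels and that "full set verification" means scoring the model on the whole historical sample; the labels below are made up for illustration and are not the paper's data.

```python
def accuracy_and_error(predicted, actual):
    """Return (accuracy, error rate) for two equal-length label lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    acc = hits / len(actual)
    return acc, 1.0 - acc


# Hypothetical labels: 1 = the stock rose, 0 = it did not.
pred = [1, 0, 1, 1, 0, 1, 0, 0]
real = [1, 0, 0, 1, 0, 1, 1, 0]
acc, err = accuracy_and_error(pred, real)
print(acc, err)
```

Because the two rates always sum to one, an accuracy above 50% is equivalent to the accuracy exceeding the error rate.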
In this section, we build a portfolio model to determine the best investment weight for each stock. Suppose we want to invest in 8 stocks; we simply select the top 8 from the stock ranking in Table 1 given in the previous section.
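Assume that the investor chooses $n$ kinds of securities to invest in, and that the proportions of the securities in the total investment are $w_1,\dots,w_n$, represented by the vector $w=(w_1,\dots,w_n)^T$. The yields are $r_1,\dots,r_n$ respectively, represented by the vector $r=(r_1,\dots,r_n)^T$. The expected rates of return are $\mu_i=E(r_i)$, $i=1,\dots,n$, represented by the vector $\mu=(\mu_1,\dots,\mu_n)^T$. Then the yield of the securities investment portfolio is the weighted average of the yields of the individual securities:
$$r_p=\sum_{i=1}^{n}w_i r_i=w^T r \tag{10}$$
The expected rate of yield of the portfolio is the weighted average of the expected rates of yield of the individual securities, namely:
$$\mu_p=E(r_p)=\sum_{i=1}^{n}w_i\mu_i=w^T\mu \tag{11}$$
We use the covariance $\sigma_{ij}=\mathrm{cov}(r_i,r_j)$ to indicate the degree of correlation between the yields of the $i$th and $j$th securities. In particular, $\sigma_{ii}=\sigma_i^2$ is the variance of $r_i$. Let $\Sigma$ be the covariance matrix of $r$. That is
$$\Sigma=(\sigma_{ij})_{n\times n}=\begin{pmatrix}\sigma_{11}&\cdots&\sigma_{1n}\\ \vdots&\ddots&\vdots\\ \sigma_{n1}&\cdots&\sigma_{nn}\end{pmatrix} \tag{12}$$
Then, the risk of the portfolio is
$$\sigma_p^2=\mathrm{Var}(r_p)=\sum_{i=1}^{n}\sum_{j=1}^{n}w_i w_j\sigma_{ij}=w^T\Sigma w \tag{13}$$
In order to minimize the investment risk for a given expected rate of return $\mu_p$, we establish the following model, where $e=(1,\dots,1)^T$:
$$\min_{w}\ \sigma_p^2=w^T\Sigma w\quad \text{s.t.}\quad w^T\mu=\mu_p,\quad w^T e=1 \tag{14}$$
Assuming the covariance matrix $\Sigma$ is a positive definite matrix, let
$$A=\begin{pmatrix}\mu^T\\ e^T\end{pmatrix},\qquad b=\begin{pmatrix}\mu_p\\ 1\end{pmatrix} \tag{15}$$
Then, the portfolio model can be transformed into
$$\min_{w}\ w^T\Sigma w\quad \text{s.t.}\quad Aw=b \tag{16}$$
Constructing the Lagrange multiplier function $L(w,\lambda)=w^T\Sigma w-2\lambda^T(Aw-b)$, where $\lambda=(\lambda_1,\lambda_2)^T$.
Let $\partial L/\partial w=0$ and $\partial L/\partial\lambda=0$, that is, $\Sigma w=A^T\lambda$ and $Aw=b$. Provided that $\mu$ is not a multiple of $e$, so that $A\Sigma^{-1}A^T$ is invertible, solving these equations gives
$$w^*=\Sigma^{-1}A^T\big(A\Sigma^{-1}A^T\big)^{-1}b \tag{17}$$
Therefore, $w^*$ is the optimal portfolio weight for a given expected rate of return. Under this weight, the risk of the portfolio is minimized, which is
$$\sigma_p^{*2}=w^{*T}\Sigma w^*=b^T\big(A\Sigma^{-1}A^T\big)^{-1}b \tag{18}$$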
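The closed-form solution can be checked numerically. The sketch below is written in Python rather than the MATLAB used in the paper, and it assumes the standard mean-variance formulation: minimize the portfolio variance subject to a target expected return and weights that sum to one, with no restriction on short selling. The covariance matrix, expected returns, and target return are hypothetical values chosen only for illustration, and the helper names are not from the paper.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in matrix[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def min_variance_weights(cov, mu, target):
    """Weights minimizing w' cov w subject to w' mu = target and sum(w) = 1."""
    ones = [1.0] * len(mu)
    s_mu = solve(cov, mu)        # cov^{-1} mu
    s_one = solve(cov, ones)     # cov^{-1} 1
    # 2x2 system that enforces the two equality constraints.
    gram = [[dot(mu, s_mu), dot(mu, s_one)],
            [dot(ones, s_mu), dot(ones, s_one)]]
    lam = solve(gram, [target, 1.0])
    weights = [lam[0] * a + lam[1] * b for a, b in zip(s_mu, s_one)]
    variance = lam[0] * target + lam[1]   # minimum variance at this target
    return weights, variance


# Hypothetical three-asset example.
cov = [[0.04, 0.006, 0.0],
       [0.006, 0.09, 0.012],
       [0.0, 0.012, 0.16]]
mu = [0.08, 0.12, 0.15]
weights, variance = min_variance_weights(cov, mu, target=0.11)
print(weights, variance)
```

Solving two linear systems with the covariance matrix avoids forming its inverse explicitly. Because the model has only equality constraints, some weights may come out negative, which corresponds to short positions.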
The proposed theoretical portfolio model is verified and simulated with MATLAB. We invest in 8 stocks, namely the top 8 stocks in the ranking of Table 1 given in the previous section. The simulation results are shown in Figure 3 and Figure 4.
Figure 3. Effective frontier curve 
Figure 4. Distribution of investment weight 
Here, we focus on Figure 3, which shows the curve relating risk to return and gives us a basis for deciding which portfolio to choose. Each point on the curve corresponds to a set of investment weights. An investor who seeks high returns and accepts high risk can choose the portfolios at the top of the curve. Most investors, of course, will choose a compromise: a portfolio with a relatively high return and a tolerable level of risk.
Figure 4 shows the allocation of investment weights under different risk appetites. Each value on the abscissa corresponds to one portfolio. The specific weights can also be calculated directly from the model, but the graph makes the differences between portfolios under different risk preferences easier to see, namely the different investment proportion of each stock. Once a risk preference is chosen, the specific investment allocation plan can be read off directly.
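How the weights change along the risk-return curve can be seen in the simplest possible case. The Python sketch below assumes only two assets, for which the target return alone fixes the weights, so no optimization is needed. It is a simplified stand-in for the 8-stock MATLAB simulation, and every number in it is hypothetical.

```python
def two_asset_frontier(mu1, mu2, var1, var2, cov12, targets):
    """For each target return, give (risk, return, w1, w2) of the portfolio."""
    if mu1 == mu2:
        raise ValueError("the two expected returns must differ")
    points = []
    for target in targets:
        w1 = (target - mu2) / (mu1 - mu2)    # solves w1*mu1 + w2*mu2 = target
        w2 = 1.0 - w1                        # weights sum to one
        variance = w1 * w1 * var1 + w2 * w2 * var2 + 2.0 * w1 * w2 * cov12
        points.append((variance ** 0.5, target, w1, w2))
    return points


# Hypothetical inputs: a low-risk asset (1) and a high-risk asset (2).
frontier = two_asset_frontier(mu1=0.06, mu2=0.14, var1=0.01, var2=0.09,
                              cov12=0.003,
                              targets=[0.06, 0.08, 0.10, 0.12, 0.14])
for risk, ret, w1, w2 in frontier:
    print(f"return={ret:.2f}  risk={risk:.4f}  w1={w1:.2f}  w2={w2:.2f}")
```

Moving toward higher target returns shifts weight from the low-risk asset to the high-risk asset, which is the same trade-off an investor makes when picking a point on the curve.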
In the field of quantitative investment, investors are paying increasing attention to quantitative stock selection strategies based on data mining technology. For investors, the key is to design good indicators and improve the accuracy of the model, thereby improving its profitability and making full use of the data and the model. Based on the observation and analysis of the Beidou navigation sector, the stocks with the most investment value in the sector were finally selected. Only by selecting better stocks, using quantitative timing strategies to suppress risks, and then choosing a suitable investment portfolio can the ideal goal of high returns and low risks be achieved in the stock market.
This work has been partially supported by the Key projects of natural science research of the higher education institutions of Anhui (grant no. KJ2016A530).
[1] Ouyang W., Szewczyk S.H. Stock price informativeness on the sensitivity of strategic M&A investment to Q. Review of Quantitative Finance & Accounting, 50(3):745–774, 2018.
[2] Chava S., Wang R., Zou H. Covenants, creditors’ simultaneous equity holdings, and firm investment policies. Journal of Financial and Quantitative Analysis, 54(2):481–512, 2019.
[3] Guo H., Zhang Y., Wu S., Shang L. Investment risk evaluation of existing building energy-saving renovation project for ESCO. Ecological Economy, 27(3):180–189, 2018.
[4] Gan H. Does CEO managerial ability matter? Evidence from corporate investment efficiency. Review of Quantitative Finance & Accounting, 52(4):1085–1118, 2019.
[5] Ferrando A., Preuss C. What finance for what investment? Survey-based evidence for European companies. Econ. Polit., 35:1015–1053, 2018.
[6] Serdar M.A., Serteser M., Ucal Y., et al. An assessment of HbA1c in diabetes mellitus and pre-diabetes diagnosis: a multi-centered data mining study. Applied Biochemistry and Biotechnology, 190(Suppl1):1–13, 2019.
[7] Sorensen E.H., Miller K.L., Ooi C.K. The decision tree approach to stock selection - An evolving tree model performs the best. Journal of Portfolio Management, 27(1):42–52, 2000.
[8] Piotroski J.D. Value investing: The use of historical financial statement information to separate winners from losers. Journal of Accounting Research, 38(2):43–51, 2001.
[9] Fama E.F., French K.R. A five-factor asset pricing model. Journal of Financial Economics, 116(1):1–22, 2015.
[10] Patel J., Shah S., Thakkar P., Kotecha K. Predicting stock and stock price index movement using Trend Deterministic Data Preparation and machine learning technique. Expert Systems with Applications, 42(1):259–268, 2015.
[11] Svefors P., Sysoev O., Ekstrom E.C., et al. Relative importance of prenatal and postnatal determinants of stunting: data mining approaches to the MINIMat cohort Bangladesh. BMJ Open, 9(8):e025154, 2019.
[12] Arabameri A., Pradhan B., Rezaei K. Spatial prediction of gully erosion using ALOS PALSAR data and ensemble bivariate and data mining models. Geosciences Journal, 23:669–686, 2019.
[13] Dong Y., Wang H. Robust output feedback stabilization for uncertain discrete-time stochastic neural networks with time-varying delay. Neural Processing Letters, 51:83–103, 2020.
[14] Li M.X., Yu S.Q., Zhang W. Segmentation of retinal fluid based on deep learning: application of three-dimensional fully convolutional neural networks in optical coherence tomography images. International Journal of Ophthalmology, 12(6):1012–1020, 2019.
[15] Segler M.H.S., Preuss M., Waller M.P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604–610, 2018.
Published on 30/03/20
Accepted on 25/03/20
Submitted on 18/02/20
Volume 36, Issue 1, 2020
DOI: 10.23967/j.rimni.2020.03.006
Licence: CC BY-NC-SA license