Quantitative investment model based on data mining

Abstract

Quantitative investment is the process of building mathematical models, using statistics, information technology, and mathematics, to quantify risk and return and to implement traditional investment concepts. In the past, limited computing tools kept quantitative investment from gaining much recognition. With advances in computer science and quantitative analysis theory, traditional fundamental analysis and models built with sampling-based statistical techniques no longer meet the requirements of investors. Quantitative investment strategies based on data mining technology are therefore receiving more and more attention. In this paper, we use MATLAB to capture large volumes of data from financial and economic websites, train a neural network model to predict the trend of stock price changes, and finally establish a suitable quantitative stock selection model. The simulation results suggest that the ideal goals in the stock market can be achieved only by using quantitative stock selection strategies to curb risk and by selecting a suitable investment portfolio.

Keywords: Quantitative investment, data mining, neural network, portfolio

1. Introduction

In recent years, with the continuous development of the stock market, quantitative investment technology has attracted more and more attention [1-3], and quantitative investment systems are gradually maturing. As stock market rules continue to improve, the number of listed stocks and the volume of associated data keep increasing. This large and complex body of stock data contains useful information that cannot be found through conventional methods. However, the data mining technology developed in recent years can help us mine information from vast amounts of stock data [4-6], and analyzing these data yields the information we need. In terms of factor-based stock selection, some researchers have proposed quantitative stock selection models based on multiple factors [7,8]. These systems use quantitative methods to analyze the transaction data and financial indicators of listed companies and combine them with statistical tests to help investors find the most valuable investment portfolio. Although some of these methods are convenient and easy to operate, they ignore the correlation and overlap between factors [9]. The shortest-distance (single-linkage) hierarchical clustering method can reduce a massive set of stock price series, which both lightens the workload and makes the process more intelligent. However, the shortest-distance method tends to keep absorbing samples into the same class, so it is an extreme method. Patel et al. compared four prediction models, namely artificial neural network (ANN), support vector machine (SVM), random forest, and naive Bayes, and identified the best-performing one [10].

2. Basic theory and method

2.1 Data mining

Data mining is the process of extracting hidden, previously unknown, and useful information and knowledge from a large amount of incomplete, noisy, fuzzy, and random real-world data [11,12]. The core of data mining is to use algorithms to train a model on the processed input and output data. The model is then verified, so that it describes the relationship between the input and output data to a certain extent. Finally, the model is applied to newly input data to obtain a new output that can be used for interpretation and application [13]. Data mining tasks mainly include association, regression, classification, clustering, prediction, and diagnosis.

2.2 Principle of BP neural network

A typical BP (back-propagation) neural network consists of an input layer, one or more hidden layers, and an output layer. Its network structure is shown in Figure 1. The learning process of a BP neural network is mainly composed of forward propagation of the input and back propagation of the error. In forward propagation, samples enter at the input layer and are processed by the hidden layer units, and the actual output value of each unit is calculated from the weights and thresholds. If the difference between the actual output and the expected value falls within a predetermined error range, the learning process ends successfully. Otherwise, back propagation passes the network error backward and modifies the weight matrices according to the actual and expected outputs to reduce the error of the network [14,15].

Figure 1. Structure of BP neural network model

First, we define the following variables: the input layer vector ${\textstyle X=(x_{1},x_{2},\cdots ,x_{n})}$, the hidden layer output vector ${\textstyle H=(h_{1},h_{2},\cdots ,h_{m})}$, the output layer output vector ${\textstyle Y=(y_{1},y_{2},\cdots ,y_{l})}$, the expected output vector ${\textstyle D=(d_{1},d_{2},\cdots ,d_{l})}$, the weight matrix from the input layer to the hidden layer ${\textstyle V=(V_{1},V_{2},\cdots ,V_{m})}$, and the weight matrix from the hidden layer to the output layer ${\textstyle W=(W_{1},W_{2},\cdots ,W_{l})}$. The specific implementation steps of the BP neural network are as follows:

Step 1. The weight matrices ${\textstyle W}$ and ${\textstyle V}$ of the network are initialized according to the range of the activation function. We set the maximum number of training iterations ${\textstyle M}$ and the learning accuracy ${\textstyle e}$, and choose the activation function:

(1)

Step 2. After data preprocessing, we select an input sample and compute the outputs of the hidden layer ${\textstyle h_{j}}$ and the output layer ${\textstyle y_{k}}$:

(2)

(3)

Step 3. Calculate the error from the actual output value ${\textstyle y_{k}}$ and the expected output value ${\textstyle d_{k}}$ of the network:

(4)

Step 4. Calculate the partial derivatives of the error function with respect to each neuron of the hidden layer and the output layer:

(5)

(6)

Step 5. Use the error signal to adjust the connection weights of each layer. Let ${\textstyle w_{jk}^{N+1}}$ be the weight from the hidden layer to the output layer, and ${\textstyle v_{ij}^{N+1}}$ be the weight from the input layer to the hidden layer:

(7)

(8)

Step 6. Calculate the global error:

(9)

Step 7. The global error ${\textstyle E}$ is compared with the precision value ${\textstyle e}$. If the global error is less than ${\textstyle e}$, or the number of training iterations exceeds the maximum ${\textstyle M}$, the algorithm ends; otherwise, learning continues.
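Steps 1 to 7 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper. It assumes a single hidden layer, a sigmoid activation function, half the squared error as the error measure, per-sample weight updates, and a learning rate `eta`; none of these choices is confirmed by the text above. The names `train_bp` and `predict_bp` are hypothetical, and each threshold is stored as an extra weight on a constant input of 1.

```python
import math
import random


def sigmoid(x):
    # Assumed activation function; the text does not state which one is used.
    return 1.0 / (1.0 + math.exp(-x))


def forward(V, W, x):
    """Step 2: forward propagation through hidden and output layers."""
    xb = list(x) + [1.0]                      # constant 1 carries the threshold
    h = [sigmoid(sum(v * xi for v, xi in zip(Vj, xb))) for Vj in V]
    hb = h + [1.0]
    y = [sigmoid(sum(w * hj for w, hj in zip(Wk, hb))) for Wk in W]
    return xb, h, hb, y


def train_bp(samples, targets, m, max_epochs=5000, e=1e-3, eta=0.5, seed=0):
    """Train a network with m hidden units; returns (V, W, error history)."""
    rng = random.Random(seed)
    n, l = len(samples[0]), len(targets[0])
    # Step 1: small random initial weights (the range is an assumption).
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n + 1)] for _ in range(m)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(m + 1)] for _ in range(l)]
    history = []
    for _ in range(max_epochs):               # at most M passes over the data
        E = 0.0
        for x, d in zip(samples, targets):
            xb, h, hb, y = forward(V, W, x)
            # Steps 3 and 6: accumulate the squared error over all samples.
            E += 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y))
            # Step 4: error signals of the output and hidden layers.
            delta_o = [(dk - yk) * yk * (1.0 - yk) for dk, yk in zip(d, y)]
            delta_h = [
                hj * (1.0 - hj) * sum(delta_o[k] * W[k][j] for k in range(l))
                for j, hj in enumerate(h)
            ]
            # Step 5: adjust the weights of both layers.
            for k in range(l):
                for j in range(m + 1):
                    W[k][j] += eta * delta_o[k] * hb[j]
            for j in range(m):
                for i in range(n + 1):
                    V[j][i] += eta * delta_h[j] * xb[i]
        history.append(E)
        if E < e:                              # Step 7: precision reached
            break
    return V, W, history


def predict_bp(V, W, x):
    """Output of the trained network for one input sample."""
    return forward(V, W, x)[3]
```

Calling `train_bp(X, D, m=3)` on a small labelled sample set returns the two weight matrices and the global error after each pass, so the stopping rule of Step 7 can be inspected directly.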

2.3 Simulation experiments to predict stocks

Data is the foundation of data mining. Many financial websites, such as Yahoo, Sina, and Tencent, provide rich and reliable transaction data. Yahoo has an interface with MATLAB, so we use MATLAB to obtain these transaction data from Yahoo. The key MATLAB function “fetch” is used as follows:

Data = fetch(Connect, 'Security', 'FromDate', 'ToDate')

Here, ‘Connect’ indicates the source from which the data is obtained, such as Yahoo; ‘Security’ indicates which stock's data to obtain; and ‘FromDate’ and ‘ToDate’ are the start and end of the specified time range. In this paper, we use this method to obtain data for stocks 1 to 1000 of the Shenzhen Stock Exchange and save them in Excel. After the data is standardized, it is divided into training samples and prediction samples. We then use the neural network model described in Section 2.2 to train on the samples and make predictions.
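The standardization and splitting step might look like the following sketch. The text does not say which standardization is used; min-max scaling of each column to [0, 1] is assumed here only because most values in Table 1 lie in that range. The function names and the split by row position are hypothetical.

```python
def min_max_scale(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so it is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_samples(rows, n_train):
    """First n_train rows become training samples, the rest prediction samples."""
    return rows[:n_train], rows[n_train:]
```

For example, `min_max_scale([[1, 10], [2, 20], [3, 30]])` maps the middle row to `[0.5, 0.5]`.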

The model produces a ranked table of all stocks, as shown in Table 1. The ranking is based on the predicted value in the last column, which can be understood as the probability of future growth of the stock. In actual trading, we can therefore buy the top-ranked stocks and sell the bottom-ranked ones. This provides the basis for buy and sell decisions in quantitative stock selection.

Table 1. Model prediction results (first 10 rows)

65	1	1	1	1	0.217464	0.689387	0.615622	0.933314	1.076462
802	0.649562	0.714952	0.590378	0.669138	0.533305	0.493489	0.119175	0.450005	0.995385
985	0.489474	0.388007	0.219643	0.032438	0.289402	0.922103	0.458649	0.370715	0.985637
582	0.350914	0.507703	0.590378	0.669138	0.58377	0.410922	0.118595	0.226798	0.940392
66	0.846695	0.593295	0.590378	0.669138	0.551252	0.670699	0.293865	0.605941	0.885136
751	1	1	0.87818	0.881371	0.332703	0.595813	0.626997	0.948292	0.88133
707	0	0.650724	0.302097	0.244671	0.699561	0.544556	0.236814	0.403214	0.830667
819	1	0.888569	0.87818	0.881371	0.613117	0.822664	0.666953	0.978776	0.826818
522	0.343439	0.942634	0.417467	0.456905	0.029334	0.000374	0.035146	0.607885	0.778539
521	0.710836	1	0.302097	0.244671	0.396943	0.315258	0.393372	0.913728	0.75364
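Selecting the top-ranked stocks from such a table amounts to sorting by the predicted value. The sketch below uses the first and last columns of Table 1, reading the first column as the stock number, which is an assumption. The helper name `top_k` is hypothetical.

```python
# Stock number -> predicted value, copied from the first and last
# columns of Table 1.
predictions = {
    65: 1.076462, 802: 0.995385, 985: 0.985637, 582: 0.940392,
    66: 0.885136, 751: 0.88133, 707: 0.830667, 819: 0.826818,
    522: 0.778539, 521: 0.75364,
}


def top_k(scores, k):
    """Return the k keys with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


selected = top_k(predictions, 8)
print(selected)  # [65, 802, 985, 582, 66, 751, 707, 819]
```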

In this experiment, we also use historical data to evaluate the model, and the verification method is full-set verification. Figure 2 shows the accuracy and error rate of the model's classification. The accuracy is clearly higher than the error rate. In finance, an accuracy of 72% is not easy to achieve, so as long as the number of transactions is large enough, the probability of making a profit is considerable.


Figure 2. Evaluation results of the model
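The accuracy and error rate shown in Figure 2 are simple ratios. The sketch below assumes each prediction is a binary label (1 for a rise, 0 for a fall) compared with the realized direction; the label encoding and the function name are assumptions, and the example labels are made up for illustration.

```python
def accuracy_and_error(predicted, actual):
    """Return (accuracy, error rate) for two equally long label lists."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    accuracy = correct / len(actual)
    return accuracy, 1.0 - accuracy


# Hypothetical labels: three of four predictions match.
acc, err = accuracy_and_error([1, 0, 1, 1], [1, 0, 0, 1])
print(acc, err)  # 0.75 0.25
```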

3. Portfolio model

In this section, we build a portfolio model to determine the best investment weight for each stock. Suppose we want to invest in 8 stocks; we simply select the top 8 from the ranking in Table 1.

Assume that the investor chooses ${\textstyle n}$ kinds of securities to invest in, and that the proportions of the securities in the total investment are ${\textstyle w_{1},w_{2},\cdots ,w_{n}}$, written as the vector ${\textstyle W={\left(w_{1},w_{2},\cdots ,w_{n}\right)}^{T}}$. The yields are ${\textstyle r_{1},r_{2},\cdots ,r_{n}}$, written as the vector ${\textstyle R={\left(r_{1},r_{2},\cdots ,r_{n}\right)}^{T}}$. The expected rates of return are ${\textstyle u_{1},u_{2},\cdots ,u_{n}}$, written as the vector ${\textstyle U={\left(u_{1},u_{2},\cdots ,u_{n}\right)}^{T}}$. Then the yield ${\textstyle r}$ of the securities investment portfolio is the weighted average of the yields of the individual securities:

${\textstyle r=\sum _{i=1}^{n}w_{i}r_{i}=W^{T}R}$

(10)

The expected rate of return ${\textstyle u_{p}}$ of the portfolio is the weighted average of the expected rates of return of the individual securities, namely:

${\textstyle u_{p}=\sum _{i=1}^{n}w_{i}u_{i}=W^{T}U}$

(11)

We use the covariance ${\textstyle {\sigma }_{ij}={\sigma }_{ji}=cov(r_{i},r_{j})}$ to indicate the degree of correlation between the yields of the i-th and j-th securities. In particular, ${\textstyle {\sigma }_{ii}={\sigma }_{i}^{2}=D(r_{i})}$. Let ${\textstyle E={\left({\sigma }_{ij}\right)}_{n\times n}}$ be the covariance matrix of ${\textstyle R}$. That is,

${\textstyle E=\left[{\begin{array}{ccc}{\sigma }_{11}&\cdots &{\sigma }_{1n}\\\vdots &\ddots &\vdots \\{\sigma }_{n1}&\cdots &{\sigma }_{nn}\end{array}}\right]}$

(12)
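In practice the entries of the covariance matrix have to be estimated from historical yields. The text does not say which estimator is used; the sketch below assumes the usual sample covariance with an n−1 denominator, and the function name is hypothetical.

```python
def covariance_matrix(series):
    """Sample covariance matrix of yield series.

    series[i] is the list of historical yields of security i;
    all series are assumed to have the same length (at least 2).
    """
    t = len(series[0])
    n = len(series)
    means = [sum(s) / t for s in series]
    return [
        [
            sum((series[i][k] - means[i]) * (series[j][k] - means[j])
                for k in range(t)) / (t - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]
```

The diagonal of the result holds the variances of the individual securities, and the matrix is symmetric by construction.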

Then, the risk ${\textstyle {\sigma }_{p}^{2}}$ of the portfolio is

${\textstyle {\sigma }_{p}^{2}=\sum _{i=1}^{n}\sum _{j=1}^{n}w_{i}w_{j}{\sigma }_{ij}=W^{T}EW}$

(13)

In order to minimize the investment risk as much as possible, we establish the following model:

(14)

Assuming the covariance matrix is a positive definite matrix, let

(15)

Then, the portfolio model can be transformed into

${\textstyle \min W^{T}EW\quad {\text{s.t.}}\quad AW=B}$

(16)

We construct the Lagrangian function ${\textstyle L=W^{T}EW+\lambda ^{T}(AW-B)}$, where ${\textstyle \lambda =[\lambda _{1},\lambda _{2}]^{T}}$.

Let ${\textstyle \displaystyle {\frac {\partial L}{\partial \lambda }}=0,\displaystyle {\frac {\partial L}{\partial W}}=0,}$ that is

(17)
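Written out for the Lagrangian defined above, the two stationarity conditions and their solution can be sketched as follows. The algebra uses only the symmetry and positive definiteness of the covariance matrix, plus the additional assumption that the rows of the constraint matrix A are linearly independent, so that the 2×2 matrix being inverted is nonsingular.

```latex
\begin{aligned}
\frac{\partial L}{\partial W} &= 2EW + A^{T}\lambda = 0
  &&\Longrightarrow\quad W = -\tfrac{1}{2}E^{-1}A^{T}\lambda ,\\
\frac{\partial L}{\partial \lambda} &= AW - B = 0
  &&\Longrightarrow\quad -\tfrac{1}{2}AE^{-1}A^{T}\lambda = B ,\\
\lambda &= -2\left(AE^{-1}A^{T}\right)^{-1}B .
\end{aligned}
```

Substituting this multiplier back into the first line eliminates the factor of −1/2 and leaves the weights expressed through E, A, and B only.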

Therefore, ${\textstyle W=E^{-1}A^{T}{\left(AE^{-1}A^{T}\right)}^{-1}B}$ is the optimal portfolio weight for a given expected rate of return. Under this weight, the risk of the portfolio is minimized, which is

${\textstyle {\sigma }_{p}^{2}=W^{T}EW=B^{T}{\left(AE^{-1}A^{T}\right)}^{-1}B}$

(18)
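The closed-form weights can be checked numerically with a short sketch. It assumes that A stacks the expected-return vector and a row of ones, and that B holds the target return and 1. This corresponds to the two constraints "reach the target return" and "weights sum to one", and matches the two multipliers, but the exact form is not visible in the text above. All function names and the numbers in the usage note are hypothetical.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def min_variance_weights(E, U, u_p):
    """Weights W = E^-1 A^T (A E^-1 A^T)^-1 B and the resulting risk.

    Assumes A = [U^T; 1^T] and B = (u_p, 1)^T, with E symmetric
    positive definite and U not a multiple of the ones vector.
    """
    n = len(U)
    eu = solve(E, U)              # E^-1 U
    e1 = solve(E, [1.0] * n)      # E^-1 1
    # Entries of the symmetric 2x2 matrix G = A E^-1 A^T.
    a = sum(u * x for u, x in zip(U, eu))
    b = sum(u * x for u, x in zip(U, e1))
    c = sum(e1)
    det = a * c - b * b
    # mu = G^-1 B, written out for the 2x2 case.
    mu1 = (c * u_p - b) / det
    mu2 = (a - b * u_p) / det
    W = [mu1 * x + mu2 * y for x, y in zip(eu, e1)]
    risk = u_p * mu1 + mu2        # B^T G^-1 B
    return W, risk
```

For instance, with three hypothetical securities, `min_variance_weights([[0.04, 0.01, 0.0], [0.01, 0.09, 0.02], [0.0, 0.02, 0.16]], [0.08, 0.12, 0.15], 0.10)` returns weights that sum to one and reach the target return of 0.10. Repeating the call over a range of target returns and recording the risk gives the points of a risk-return curve.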

4. Simulation results and analysis

The proposed portfolio model is verified by simulation in MATLAB. We invest in 8 stocks, selecting the top 8 from the ranking in Table 1, which are denoted $P_{1},P_{2},\cdots ,P_{8}$ respectively. The simulation results are shown in Figures 3 and 4.

Figure 3. Efficient frontier curve

Figure 4. Distribution of investment weight

Here, we focus on Figure 3, which shows the curve relating risk and return and provides a basis for deciding which portfolio to choose. Each point on the curve corresponds to a set of investment weights. An investor who seeks high returns and does not fear high risk can choose the portfolios at the top of the curve. Most investors, however, will choose a compromise: a higher return at a level of risk they can tolerate.

Figure 4 shows the allocation of investment weights for different risk appetites. Each value on the abscissa corresponds to one portfolio. The specific weights can also be calculated directly from the model, but the graph makes the differences between portfolios under different risk preferences more intuitive: the proportion invested in each stock differs. Once a risk preference is chosen, the specific investment allocation can be read off directly.

5. Conclusion

In the field of quantitative investment, quantitative stock selection strategies based on data mining technology have drawn investors' attention. For investors, the key is to design good indicators and improve the accuracy of the model, thereby improving its profitability and making full use of the data and the model. Based on the observation and analysis of the Beidou navigation sector, the stocks with the most investment value in the sector were finally selected. To achieve the ideal goal of high returns and low risk in the stock market, investors should select better stocks, use quantitative timing strategies to suppress risk, and then choose a suitable investment portfolio.

Acknowledgement

This work has been partially supported by the Key projects of natural science research of the higher education institutions of Anhui (grant no. KJ2016A530).

References

[1] Ouyang W., Szewczyk S.H. Stock price informativeness on the sensitivity of strategic M&A investment to Q. Review of Quantitative Finance & Accounting, 50(3):745-774, 2018.

[2] Chava S., Wang R., Zou H. Covenants, creditors’ simultaneous equity holdings, and firm investment policies. Journal of Financial and Quantitative Analysis, 54(2):481-512, 2019.

[3] Guo H., Zhang Y., Wu S., Shang L. Investment risk evaluation of existing building energy-saving renovation project for ESCO. Ecological Economy, 27(3):180-189, 2018.

[4] Gan H. Does CEO managerial ability matter? Evidence from corporate investment efficiency. Review of Quantitative Finance & Accounting, 52(4):1085-1118, 2019.

[5] Ferrando A., Preuss C. What finance for what investment? Survey-based evidence for European companies. Econ. Polit., 35:1015–1053, 2018.

[6] Serdar M.A., Serteser M., Ucal Y., et al. An assessment of HbA1c in diabetes mellitus and pre-diabetes diagnosis: a multi-centered data mining study. Applied Biochemistry and Biotechnology, 190(Suppl1):1-13, 2019.

[7] Sorensen E.H., Miller K.L., Ooi C.K. The decision tree approach to stock selection-An evolving tree model performs the best. Journal of Portfolio Management, 27(1):42-52, 2000.

[8] Piotroski J.D. Value investing: The use of historical financial statement information to separate winners from losers. Journal of Accounting Research, 38(2):43-51, 2001.

[9] Fama E.F., French K.R. A five-factor asset pricing model. Journal of Financial Economics, 116(1):1-22, 2015.

[10] Patel J., Shah S., Thakkar P., Kotecha K. Predicting stock and stock price index movement using Trend Deterministic Data Preparation and machine learning technique. Expert Systems with Applications, 42(1):259-268, 2015.

[11] Svefors P., Sysoev O., Ekstrom E.C., et al. Relative importance of prenatal and postnatal determinants of stunting: data mining approaches to the MINIMat cohort Bangladesh. BMJ Open, 9(8):e025154, 2019.

[12] Arabameri A., Pradhan B., Rezaei K. Spatial prediction of gully erosion using ALOS PALSAR data and ensemble bivariate and data mining models. Geosciences Journal, 23:669–686, 2019.

[13] Dong Y., Wang H. Robust output feedback stabilization for uncertain discrete-time stochastic neural networks with time-varying delay. Neural Processing Letters, 51:83–103, 2020.

[14] Li M.X., Yu S.Q., Zhang W. Segmentation of retinal fluid based on deep learning: application of three-dimensional fully convolutional neural networks in optical coherence tomography images. International Journal of Ophthalmology, 12(6):1012-1020, 2019.

[15] Segler M.H.S., Preuss M., Waller M.P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604-610, 2018.