
    Initializing the Weights in a Multilayer Network with a Quadratic Sigmoid Function

    Abstract  A new method of initializing the weights in back-propagation networks using the quadratic threshold activation function with one hidden layer is developed. The weights can be obtained directly, even without further learning. The method relies on a general-position assumption on the distribution of the patterns and can be applied to many pattern recognition tasks. Finally, simulation results are presented, showing that this initialization technique generally results in a drastic reduction in training time and in the number of hidden neurons.

1 Introduction

    Using neural networks for pattern recognition applications is increasingly attractive. The main advantages of neural networks are their massive parallelism, adaptability, and noise tolerance (B. Widrow and R. Winter, 1988) (J. J. Hopfield, 1982) (T. J. Sejnowski and C. R. Rosenberg, 1986). One of the most popular neural networks is the back-propagation (BP) or multilayer perceptron (MLP) network. The most commonly used activation functions in BP are the hard-limited threshold function and the sigmoid function.

    Using the hard-limited activation function in every neuron, the upper bound on the number of hidden neurons in a single-hidden-layer network required for solving a general-position two-class classification problem is ⌈K/n⌉, where K is the number of patterns and n is the input dimension. Without the general-position constraint, the upper bound on the number of hidden neurons is K − 1.

    Recently, a new quadratic threshold activation function was proposed (C. C. Chiang, 1993). By using it in each neuron, it is shown that the upper bound on the number of hidden neurons required for solving a given two-class classification problem can be reduced by one half compared with conventional multilayer perceptrons that use the hard-limited threshold function. The results are given in Table 1. Since the quadratic function is a little more complicated than the hard-limited threshold function, learning is much more difficult for a BP network with the quadratic function: both the learning time and the convergence properties are not good enough to obtain effective results, as can be observed in typical simulations. To relieve this learning difficulty, a new method for initializing the weights in BP networks using the quadratic threshold activation function with one hidden layer is presented. The method is based on Gauss elimination and is applicable to many classification tasks. The paper is organized as follows: the basic quadratic threshold function is described in Section 2; the new initialization method is addressed in Section 3; and finally, simulation results are shown in Section 4.

     Activation function    General position    Not in general position
     Hard-limited           ⌈K/n⌉               K − 1
     QTF                    ⌈K/2n⌉              ⌈K/2⌉

     Table 1. Number of hidden neurons required

2 Quadratic Threshold Function

    The quadratic threshold function (QTF) is defined as (C. C. Chiang, 1993)


    Quadratic Threshold Function:

        f(net, θ) = 0,  if net² > θ²
                    1,  if net² ≤ θ²

    In (C. C. Chiang, 1993), an upper bound on the number of hidden neurons required for implementing an arbitrary dichotomy on a K-element set S in E^n is derived, under the constraint that S is in general position.

    Definition 1  A K-element set S in E^n is in general position if no (j+1) elements of S lie in a (j−1)-dimensional linear variety, for any j where 2 ≤ j ≤ n.

    Proposition 1 (S. C. Huang and Y. F. Huang, 1991, Proposition 1)  Let S be a finite set in E^n and let S be in general position. Then, for any J-element subset S₁ of S, where 1 ≤ J ≤ n, there is a hyperplane, which is an (n−1)-dimensional linear variety of E^n, containing S₁ and no other elements of S.

    In (E. B. Baum), Baum proved that if all the elements of a K-element set S in E^n are in general position, then a single-hidden-layer MLP with ⌈K/n⌉ hidden neurons using the hard-limited threshold function can implement arbitrary dichotomies defined on S. In (C. C. Chiang, 1993), it is proved that a two-layered (one hidden layer) MLP with at most ⌈K/2n⌉ hidden neurons, which use the QTF, is capable of implementing arbitrary dichotomies of a K-element set S in E^n if S is in general position.

    Since the quadratic threshold function is non-differentiable, we use, to ease the derivation, the quadratic sigmoid function (QSF) defined as follows:

    Quadratic Sigmoid Function:

        f(net, θ) = 1 / (1 + exp(net² − θ²))
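
    As a minimal illustration (our own sketch, not the paper's code; the names qtf and qsf are ours), the two activation functions can be written as:

```python
import numpy as np

def qtf(net, theta):
    """Quadratic threshold function: 1 if net**2 <= theta**2, else 0."""
    return np.where(net**2 <= theta**2, 1.0, 0.0)

def qsf(net, theta):
    """Quadratic sigmoid function: a differentiable version of the QTF."""
    return 1.0 / (1.0 + np.exp(net**2 - theta**2))
```

    When net² is well below θ² the QSF output approaches 1, and when net² is well above θ² it approaches 0, matching the hard QTF in the limit.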

3 Description of this Method

    Consider a classification problem consisting in assigning K vectors of R^n to 2 predetermined classes. Let the given training set H = {x₁, x₂, …, x_K} = {H₀, H₁} be partitioned into K₀ ≤ K training vectors in subset H₀ corresponding to class 0 and K₁ ≤ K training vectors in subset H₁ corresponding to class 1, where K₀ + K₁ = K, and let H₀ = {p¹, p², …, p^{K₀}}, H₁ = {q¹, q², …, q^{K₁}}. The classification can be implemented by a two-layer neural network with N₀ + 1 input units, N₁ + 1 hidden neurons, and N₂ output units, as illustrated in Figure 1 (N₀ = n, N₁ = ⌈K/2n⌉, N₂ = 1).


    According to Proposition 1, for any n-element subset S₁' of S₁, which contains elements belonging to, for example, class 1, there is a hyperplane, which is an (n−1)-dimensional linear variety of E^n, containing S₁' and no other elements of S. We can use Gauss elimination to solve the linear equations of the hyperplane that the n patterns of class 1 lie on.
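
    As an illustration of this step (our own sketch with hypothetical names, not the authors' code), fixing the bias weight of the hyperplane to 1, as the initialization below does, turns the hyperplane condition into an n-by-n linear system that Gauss elimination can solve:

```python
import numpy as np

def hyperplane_weights(patterns):
    """Weights w of a hyperplane w @ x + 1 = 0 passing through the given
    n patterns (rows of an n-by-n array)."""
    P = np.asarray(patterns, dtype=float)
    n = P.shape[0]
    # Solve P @ w = -1; np.linalg.solve performs the elimination for us.
    # Assumes P is nonsingular, i.e. the hyperplane through the patterns
    # does not pass through the origin.
    return np.linalg.solve(P, -np.ones(n))
```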

    Let us now describe the initialization scheme more formally. Let the input and hidden layers each have a bias unit with a constant activation value of 1, and let the weight matrices W^(1) and W^(2) connect the input layer to the hidden layer and the hidden layer to the output layer, respectively.

        W^(1) = ( W^(1)_{1,0}   W^(1)_{1,1}   …   W^(1)_{1,n}
                  W^(1)_{2,0}   W^(1)_{2,1}   …   W^(1)_{2,n}
                      ⋮             ⋮                 ⋮
                  W^(1)_{N₁,0}  W^(1)_{N₁,1}  …   W^(1)_{N₁,n} )

    where W^(1)_{j,i} represents the connection from the i-th input unit to the j-th hidden neuron, for j = 1, 2, …, N₁, i = 0, 1, …, n; W^(2)_{1,j} represents the connection from the j-th hidden unit to the output neuron, where j = 0, 1, 2, …, N₁; and

        W^(2) = (W^(2)_{1,0}, W^(2)_{1,1}, …, W^(2)_{1,N₁}).

    Let θ^(1) and θ^(2) be the vectors of θ values of the hidden and output layers, respectively; θ^(1)_j and θ^(2)_1 represent the θ value of the j-th hidden neuron (j = 1, 2, …, N₁) and of the output neuron, respectively:

        θ^(1) = (θ^(1)_1, θ^(1)_2, …, θ^(1)_{N₁});    θ^(2) = (θ^(2)_1).

     If K₀ ≥ K₁, we use the set of patterns in subset H₁ corresponding to class 1 to obtain the weight values for the hidden neurons, so the number of hidden neurons is ⌈K₁/n⌉ ≤ ⌈K/2n⌉; let N₁ = ⌈K₁/n⌉. For all j ∈ {1, 2, …, N₁} and i ∈ {1, 2, …, n}:

        W^(1)_{j,0} = 1          (1)

        W^(2)_{1,j} = 1          (2)

        W^(2)_{1,0} = −1         (3)

        θ^(1)_j = β              (4)

        θ^(2)_1 = β              (5)

    where 0 < β < 1 is a very small positive number.

     In order for each j-th hidden neuron to represent a different set of n input patterns, we solve the following equations to get W^(1)_{j,i}, for each j = 1, 2, …, ⌈K₁/n⌉, i = 1, 2, …, n:

        Σ_{i=1}^{n} W^(1)_{j,i} q_i^{(j−1)n+1} + W^(1)_{j,0} = 0
        Σ_{i=1}^{n} W^(1)_{j,i} q_i^{(j−1)n+2} + W^(1)_{j,0} = 0          (6)
            ⋮
        Σ_{i=1}^{n} W^(1)_{j,i} q_i^{(j−1)n+n} + W^(1)_{j,0} = 0

    If K₁/n is an integer (i.e., K₁/n = ⌈K₁/n⌉ = N₁), we are done here. If K₁/n is not an integer (i.e., ⌊K₁/n⌋ = ⌈K₁/n⌉ − 1 = N₁ − 1), the remaining K₁ − ⌊K₁/n⌋·n patterns are then represented by the N₁-th hidden neuron using the following formula, for j = N₁, i = 1, 2, …, K₁ − ⌊K₁/n⌋·n:

        Σ_{i=1}^{n} W^(1)_{j,i} q_i^{⌊K₁/n⌋·n+1} + W^(1)_{j,0} = 0
        Σ_{i=1}^{n} W^(1)_{j,i} q_i^{⌊K₁/n⌋·n+2} + W^(1)_{j,0} = 0          (7)
            ⋮
        Σ_{i=1}^{n} W^(1)_{j,i} q_i^{K₁} + W^(1)_{j,0} = 0

    These equations can be solved for W^(1)_{N₁,i}, i = 1, 2, …, K₁ − ⌊K₁/n⌋·n. Let the other n − K₁ + ⌊K₁/n⌋·n weights connecting to the N₁-th hidden neuron be zero.

     If K₀ < K₁, we use the set of patterns in subset H₀ corresponding to class 0 to obtain the weight values for the hidden neurons, so the number of hidden neurons is ⌈K₀/n⌉ < ⌈K/2n⌉. Everything is the same except that we let N₁ = ⌈K₀/n⌉,

        W^(2)_{1,0} = 0          (8)

    and q is replaced by p. We now solve the following equations to get W^(1)_{j,i}, for each j = 1, 2, …, ⌈K₀/n⌉, i = 1, 2, …, n:

        Σ_{i=1}^{n} W^(1)_{j,i} p_i^{(j−1)n+1} + W^(1)_{j,0} = 0
        Σ_{i=1}^{n} W^(1)_{j,i} p_i^{(j−1)n+2} + W^(1)_{j,0} = 0          (9)
            ⋮
        Σ_{i=1}^{n} W^(1)_{j,i} p_i^{(j−1)n+n} + W^(1)_{j,0} = 0

    If K₀/n is not an integer, the weights connecting the inputs and the N₁-th hidden neuron can be obtained in the same way as for the case K₀ ≥ K₁.

     To test the results, when K₀ ≥ K₁, with these initial values of the weights, an input x ∈ H₁ such that

        W^(1)_j · x + W^(1)_{j,0} = 0

    means that x lies on the hyperplane that the j-th hidden neuron represents. This input will cause only the j-th hidden unit to have an activation value close to 1, since

        net^(1)_j = W^(1)_j · x + W^(1)_{j,0} = 0,

    where net^(1)_j represents the net input of the j-th hidden neuron, and therefore

        (net^(1)_j)² = 0 ≤ (θ^(1)_j)².

    The other hidden neurons t ≠ j will not get activated, because

        net^(1)_t = W^(1)_t · x + W^(1)_{t,0} ≠ 0,

    and

        if net^(1)_t > 0: net^(1)_t ≥ θ^(1)_t;    if net^(1)_t < 0: net^(1)_t ≤ −θ^(1)_t.

    Consequently, provided the activation of the j-th hidden neuron is


    close to 1 with a QSF and all other hidden neurons are 0, the output neuron will also get activated, since

        net^(2)_1 = W^(2)_1 · y + W^(2)_{1,0} = Σ_i y_i − 1 ≈ 0,

    where y represents the output vector of the hidden layer and net^(2)_1 represents the net input of the output neuron. If K₀ < K₁, there is a similar situation.
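
    To make this check concrete, a toy forward pass under the initial weights (our own sketch, reusing the shapes returned by the initialization sketch above) could be written as:

```python
import numpy as np

def forward(x, W1, theta1, W2, theta2, hard=True):
    """Forward pass of the initialized two-layer network.
    With hard=True the QTF is used, so the check is exact;
    with hard=False the QSF is used, as during training."""
    def act(net, theta):
        if hard:
            return (net**2 <= theta**2).astype(float)
        return 1.0 / (1.0 + np.exp(net**2 - theta**2))

    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    y = act(W1 @ x_aug, theta1)           # hidden-layer activations
    y_aug = np.concatenate(([1.0], y))
    out = act(W2 @ y_aug, theta2)         # output activation
    return y, out
```

    Under the QTF, a class-1 pattern lying on the j-th hyperplane turns on only the j-th hidden unit (assuming the other net inputs exceed β in magnitude), so the output net input is −1 + 1 = 0 and the output unit fires; a pattern lying on none of the hyperplanes leaves every hidden unit off and the output net input at −1, so the output stays off.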

4 Simulations

    This section provides the results of applying the above initialization method to three simple problems. The first is the XOR problem, the second is the parity problem, and the third is the gray-zone problem. They are trained with this algorithm using a two-layer neural network. All the simulations have been performed using BP with a momentum term. In this algorithm, the weights and θs are updated after each training pattern, according to the following equations (C. C. Chiang, 1993):

        ΔW^(1)_{j,i}(t) = −μ₁ ∂E/∂W^(1)_{j,i}(t−1) + α₁ ΔW^(1)_{j,i}(t−1)

        ΔW^(2)_{1,j}(t) = −μ₁ ∂E/∂W^(2)_{1,j}(t−1) + α₁ ΔW^(2)_{1,j}(t−1)

        Δθ^(1)_j(t) = −μ₂ ∂E/∂θ^(1)_j(t−1) + α₂ Δθ^(1)_j(t−1)

        Δθ^(2)_1(t) = −μ₂ ∂E/∂θ^(2)_1(t−1) + α₂ Δθ^(2)_1(t−1)

    where j = 0, 1, 2, …, N₁; i = 0, 1, …, n; E = ½(o − d)²; o = the actual output; d = the desired output; t = a discrete time index; α₁ = the momentum coefficient for weights; α₂ = the momentum coefficient for θs; μ₁ = the learning rate for weights; μ₂ = the learning rate for θs.
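
    A compact way to apply these update rules (a sketch with hypothetical names; the same form serves both the weights and the θs) is:

```python
def momentum_step(param, grad, prev_delta, mu, alpha):
    """One update: delta(t) = -mu * dE/dparam + alpha * delta(t-1)."""
    delta = -mu * grad + alpha * prev_delta
    return param + delta, delta

# e.g. W1, dW1 = momentum_step(W1, dE_dW1, dW1, mu1, alpha1)
#      theta1, dtheta1 = momentum_step(theta1, dE_dtheta1, dtheta1, mu2, alpha2)
```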

     The exclusive-or (XOR) function has been widely used as a benchmark example to test a neural network's learning ability. Figure 2 depicts the learning curves of the XOR problem for our method, random QSF BP, and random sigmoid BP. The learning curves of the random QSF BP and the random sigmoid BP are the average results of 50 training runs using 50 different sets of initial random weights, all between −0.1 and +0.1, to relieve the effect of different initial weights. In the simulations, the parameters were chosen as follows: μ₁ = 0.3, μ₂ = 2.0, α₁ = 0.3, α₂ = 0.005, β = 0.09. The figure shows the evaluated mean squared error decreasing with the epoch number. An "epoch" means one presentation of the whole training set. As shown in Figure 2, our learning speed is far superior to that of the random QSF BP and the random sigmoid BP with the same learning rate, even though we use a smaller number of hidden neurons.

     The parity problem contains 32 8-dimensional binary training patterns. If the sum of the bits of the input pattern is odd, the output is 1; otherwise it is 0. For example, presentation of 00001111 should result in an output of '0' and presentation of 01000110 should result in an output of '1'. Figure 3 shows the learning curves of the parity problem for our method, random QSF BP, and conventional BP. In the same way, the learning curves of the random QSF BP and conventional BP are the average results of 50 training runs using 50 different sets of initial random weights, all between −0.1 and +0.1. In the simulations, the parameters were chosen as follows: μ₁ = 0.01, μ₂ = 2.0, α₁ = 0.001, α₂ = 0.005, β = 0.09. With the same learning rate and fewer hidden neurons, our learning speed is far superior to the others.
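
    For reference, the parity target described above can be generated with a few lines (our own illustration, not the authors' code):

```python
def parity_label(bits):
    """Return 1 if the pattern contains an odd number of 1-bits, else 0."""
    return sum(int(b) for b in bits) % 2

assert parity_label("00001111") == 0   # four ones  -> even -> 0
assert parity_label("01000110") == 1   # three ones -> odd  -> 1
```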


     Figure 4 shows the distribution of the training patterns of the gray-zone problem on the 2-dimensional plane. Circles and crosses represent class 0 and class 1, respectively. Figure 5 shows the learning curves for our method, random QSF BP, and random sigmoid BP (μ₁ = 0.2, μ₂ = 0.9, α₁ = 0.02, α₂ = 0.005, β = 0.04). Figure 6 shows the result of using our method after 200 epochs. The white zone is class 0 (i.e., the output is smaller than 0.2), while the black zone is class 1 (i.e., the output is larger than 0.8). Some data belongs to neither class 0 nor class 1 (i.e., the output is between 0.2 and 0.8), as shown in the gray zone. Therefore, with training patterns such as those in Figure 4, the ambiguous ones are grouped into the gray zone, so our method can handle ambiguous training patterns.

5 Conclusion

    In this paper, we propose a new method to initialize the weights of QSF BP neural networks. According to the theoretical results, the upper bound on the number of hidden neurons is reduced to half of that of the conventional single-hidden-layer MLP. According to the simulation results, the initialized neural network is far superior to the conventional BP and random QSF BP in both learning speed and network size.


Exercises

    3.1 Note that MLP networks using the quadratic threshold activation function defined as

        f(net, θ) = 0,  if net² > θ²
                    1,  if net² ≤ θ²

    can implement dichotomy problems. The patterns shown in Figure P1 have to be classified into two categories using a layered network. Figure P2 is the two-layer classifier using the quadratic threshold activation function. The values in the circles are the values of θ corresponding to the neurons. Figure P3 is the
