Discussion of the community detection algorithm based on statistical inference

Summary

This paper aims to solve the model and parameters with the discussion from the algorithm characteristics of model, the analysis of each algorithm, solving the difficulties, problems and development direction. The paper tries to analyze and summarize the evolution law of algorithm and solution thinking. Some improvements are given on the existing community detection algorithm based on statistical inference.

Keywords

Statistical inference ; Community detection ; Probabilistic model

Introduction

The goal of the community detection is to resolve the modular community structure in complex network with the information contained in the graph topology structure. This is the key of network analysis. At present, it has been widely used in the fields of sociology, biology, physics and computer science. It is very important for people to understand the characteristics of complex system. A good community detection algorithm can find a variety of network structure and deal with all kinds of network (including a directed, undirected or weighted network). Its time complexity and space complexity can be controlled in large network. It must has a reliable theoretical basis and not be only empirically based heuristic method.

A community detection method based on statistical inference can identify the structure of the network with structural equivalence and regular equivalence, and fit the observed network with the generated model to obtain the nodes division and the structure of the network. The community detection method based on statistical inference has a complete probability theory and interpretation, and can better meet the standard of community detection algorithm. This paper reviews the research status of community detection model based on statistical inference and the main problems. The principle and application of each model are analyzed in detail. Finally, this paper discusses the future development prospects and problems of community detection method based on statistical inference.

Community detection model based on statistical inference

According to the different community elements in the generative model, it can be divided into vertex community and linked community (Ahn et al., 2010 and Evans and Lambiotte, 2009 ). Nodes are assigned to each community in vertex community, and the link is assigned to each community in linked community. Because the edges of the vertices can be assigned to different communities, the idea of linked community is easy to explain the phenomenon of overlapping communities. The following are discussed in detail from the following aspects, such as the modelling idea, the network generation process, the characteristics of the network (direction, overlap), the complexity of the problem solving and so on.

Statistical inference model based on vertex community

Statistical inference model based on vertex community includes: PPM (Planted partition model), NMM (Newmans mixture model), MMM (mixed membership model) MMSBM (mixed membership stochastic block model) and DCSBM (degree-corrected stochastic block model).

Partition model

The model of planted area model is used to generate the model of benchmark test network, which belongs to the special random block model. The diagonal elements and the non diagonal elements of the random block matrix are respectively representing the link probability of the nodes in the community and the link probability of the community (Condon and Karp, 2001 ). The model turns community detection problem into a statistical inference problem for the “Function”, which can be used to find the non-overlapping of the traditional community, the complexity is higher, but the speed is relatively fast on the sparse graph.

Mixed model

The mixed model of Newman is used to detect the community that has a similar link pattern (Mungan and Ramasco, 2010 , Vazquez, 2008 and Vazquez, 2009 ). It can identify the traditional community, also can identify the “non-coordinated mixed” structure with the similar link pattern. The model assumes that nodes in the community have a similar link, not caring in the same community. The idea is similar to the random block model, but it has no clear description of the link probability relationship between the communities, and describes the relationship between the community and the node. This method can find the structure of the traditional community and the “non-coordinated mixed” structure, but it can’t explain which kind of structure, the structure can not clearly describe the structure of the network. Its time complexity and space complexity are O (KN), which can be used to find the medium size of the network community.

Mixed membership model

In 2003, Blei et al. proposed the LDA mixed membership model (Psorakis et al., 2010 , Parkkinen et al., 2009 , Blei et al., 2013 , Cohn and Chang, 2000 , Erosheva et al., 2004 , Nallapati et al., 2008 , Yang et al., 2010 , Yang et al., 2009a and Yang et al., 2009b ). The model assumes that each node belongs to each class, with a probability that the membership degree vector describes the probability of each class, each vector is independent of the distribution, and the value of the vector is represented by the probability of the data. Observation object belongs to a number of classes, compared to the mixed model and simple random block model, which is more close to reality. This model mainly deals with the problem of link analysis to the network, and the existing research shows that it can improve the clustering results with the content of nodes, and the time complexity is O (N² K). The current model can only deal with the problem of community detection in medium scale sparse networks.

Mixed Membership Degree Random Block Model

The mixed membership degree model and the random block model are combined with (Airoldi et al., 2008 and Airoldi and Fienberg, 2006 ), and the model is established. The model combines the global parameters (the block link matrix) and the local parameters (the mixed membership of the link), so as to solve the problem of the function of the pair. MMSB on the assumption that the nodes are more community and the community membership degree vector is more close to the reality, and the matrix of the membership degree of the nodes can be obtained quickly by using the variable Bayesian algorithm. The disadvantage is that for the node assignment community of ideas is not easy to extend into the hierarchical model, also network of two nodes belonging to the similarity between bigger and more easy to create a link, it implies the overlap region of the edge density higher than non-overlap region edges, which in many cases can not reflect the characteristics of real networks. The time complexity of this kind of model is O (KN² ), which is suitable for modelling of small scale.

Fusion Node Degree of Random Block Model

The paper proves that the random block model without considering the degree of the node degree is easy to be combined with the large sum of nodes and the smaller community (Karrer and Newman, 2011a and Karrer and Newman, 2011b ). The proposed model considers the effect of node degree on the network, and can identify the real structure of the network. In order to simplify the model, it is assumed that the network contains multiple edges and self loops, which is almost not affected by the large sparse graphs, but it is convenient to compute. Karrer et al. also designed a fast Monte Carlo iterative algorithm, which time complexity is O (K² ). The model can deal with large non-multiple networks, which is a non-overlapping community detection algorithm, and can be used to improve the performance of the original model in the model of overlapping community detection model and mixed membership model. The model can generate the non realistic degree sequence, and can not represent the multi-scale community structure.

Statistical Inference Model Based On Link Community

The main statistical Inference Model Based On Link Community are: SPAEM(Simple Probabilistic Algorithm for Community Detection Employing Expectation Maximization), SBMLC (stochastic block model for link community), GSBM (general stochastic block model).

Simple probabilistic algorithm expectation maximization

The SPAEM model is a special kind of symmetric joint link model. It is considered that there are many types of links in the network (Ren et al., 2009 ). It estimates community membership and community distribution and probability of community point node with maximum likelihood parameter. EM algorithm will not stop to iterate until convergence. Compared to the modular community detection method, it can better identify the different size, different degrees of the sequence of asymmetric network in the community; Compared to the Newman hybrid model, it is found that the effect of the traditional community is more excellent. SPAEM also provides a solution to the optimal class number selection. Using the minimum description length scheme can get a compromise between the maximum likelihood and the number of classes.

Stochastic block model for link community

Authors (Karrer and Newman, 2011a and Karrer and Newman, 2011b ) proposed a method based on the global overlapping community of the linked community. Different from the existing heuristic link community detection method, it is not only assignment community for the link, but also to overcome the shortcomings which the existing community could not find a link to different scales, different structure of community by fitting the observed data through the formation of the model. The model assumes that the network is a non-direction multiple network, and one is colored in a variety of colors, each color corresponds to a community (a role that represents a node in a social network). The model can calculate the probability of the nodes belonging to the community by the node to the color edge. It can also be used to find the non-overlapping community, that is, the community which has the largest node to the color edge is selected as the community.

General stochastic block model

The general stochastic block model (Duan et al., 2011 ) can find a variety of intrinsic structural rules to the network without any prior information. It is assumed that the network nodes are divided into several groups, and all nodes in any two groups have similar link patterns. The model sets the group assignment as an implicit variable, and the relationship between groups is modelled as a block matrix, which is used to fit the observed data.

Analysis based on statistical inference of community detection model

Qualitative comparison of models

In order to facilitate the user to choose a community detection algorithm based on statistical inference, the following selection of various models of several representative algorithms, from many aspects of the comparison, the results see Table 1 . Some of the symbols in the time complexity: N represents the number of network nodes, L represents the number of edges of the network, K represents the number of communities.

Table 1. Comparison of typical models of community detection based on statistical inference.
Category	Model	Structure	Type	Scale	Overlap	Time complexity
Vertex community	PPM	Traditional	Non-direction	N	N	O (N)
	NMM	Generalized	All	N	A	O (KL)
	MMM	Traditional	Direction	N	Y	O (KN² )
	MMSB	Traditional	Direction	N	Y	O (KN² )
	DCSBM	Generalized	Non-direction	Y	N	O (NK² )

Link community	SPAEM	Traditional	Non-direction	N	Y	O (KL)
	SBMLC	Generalized	Non-direction	Y	A	O (KN)
	GSB	Generalized	All	N	A	O (LK² )

Analysis

The advantages of community detection model based on statistical inference include: it can be used to realize overlapping and non-overlapping community detection; it can find the relationship matrix between the community and the generalized community. But the model is still in the research stage, and there are many problems need to be solved. The analysis should be as follows.

There is no standard network data set for the generalized community. The advantage of statistical inference model is that it can identify the actual network in the broad community. The existing benchmark data set is mainly aimed at the traditional community detection method, which cannot be used to test the statistical inference model and can be found in many kinds of structures.
The method of overlapping community evaluation needs to be improved: the statistical inference model is obtained by the edge or nodes membership degree, although the existing NMI and the module function have been extended to the overlapping community metrics, but can not effectively measure the accuracy rate of the joint membership degree.
It is not effective to link community thought and network potential. It can effectively deal with the problem of overlapping in the community. The hierarchical structure is the actual phenomenon in the network. The existing statistical inference model cannot consider the network structure.
The complexity of the algorithm is high: the current generation model uses EM algorithm to study the parameters of the algorithm, the number of iterations, the amount of each iteration, and EM algorithm to overcome the problem of local optimum, the operation efficiency is very low.
The number selection problem of the model: the community detection also belongs to the graph clustering problem. The existing SPAEM and GSB all adopt the minimum description length method to select the appropriate number, but also need the user to specify the number of classes.

Conclusions

In reality, many complex networks have complex structures, which can help us to understand the complex systems better. The community detection model based on statistical inference is trying to use the network “latent” structure to generate observation network, and use Bayesian inference to solve the problem of community detection. In addition, the method based on the stochastic block model can not only identify the node community assignment, but also find the interaction between the communities. This feature allows us to grasp the structure of the entire network from the global interaction matrix. The research of this kind of model is also related to Gibbss algorithm, belief propagation algorithm, variational inference and so on. Research Based on statistical inference of community detection model approximate reasoning not only has important significance to the community detection, the theory research and application of the approximate solution method has the positive significance.

Conflict of interest

The authors declare that there is no conflict of interest.

Acknowledgement

We thank the support of VŠB-TUO activities with China with financial support from the Moravian-Silesian Region and the support of beforehand Research Foundation on National Natural Science Foundation from Shijiazhuang University of Economics .

References

Ahn et al., 2010 Y.Y. Ahn, J.P. Bagrow, S. Lehmann; Link communities reveal multiscale complexity in networks; Nature, 466 (7307) (2010), pp. 761–764
Airoldi and Fienberg, 2006 E.M. Airoldi, S.E. Fienberg, et al.; Mixed membership stochastic block models for relational data with application to protein–protein interactions; Proc. Int. Biom. Soc. (2006)
Airoldi et al., 2008 E.M. Airoldi, D.M. Blei, S.E. Fienberg, et al.; Mixed membership stochastic block models; J. Mach. Learn. Res., 9 (2008), pp. 1981–2014
Blei et al., 2013 D.M. Blei, A.Y. Ng, M.I. Jordan; Latent dirichlet allocation; J. Mach. Learn. Res., 3 (2013), pp. 993–1022
Cohn and Chang, 2000 D. Cohn, H. Chang; Learning to Probabilistically Identify Authoritative Documents Citeseer; (2000), pp. 167–174
Condon and Karp, 2001 A. Condon, R.M. Karp; Algorithms for graph partitioning on the planted partition model; Random Struct. Algorithms, 18 (2) (2001), pp. 116–140
Duan et al., 2011 D. Duan, Y. Li, R. Li, et al.; MEI: mutual enhanced infinite generative model for simultaneous community and topic detection; Discov. Sci. (2011), pp. 9–106
Erosheva et al., 2004 E. Erosheva, S. Fienberg, J. Lafferty; Mixed-membership models of scientific publications; Proc. Natl. Acad. Sci. U. S. A., 101 (2004), pp. 5220–5223
Evans and Lambiotte, 2009 T.S. Evans, R. Lambiotte; Line graphs, link partitions, and overlapping communities; Phys. Rev. E, 80 (1) (2009), p. 06105
Karrer and Newman, 2011a B.B.B. Karrer, M.E.J. Newman; An efficient and principled method for detecting communities in networks; Phys. Rev. E, 84 (3) (2011), p. 036103
Karrer and Newman, 2011b B. Karrer, M.E.J. Newman; Stochastic block models and community structure in networks; Phys. Rev. E, 83 (1) (2011), p. 016107
Mungan and Ramasco, 2010 M. Mungan, J.J. Ramasco; Stability of maximum likelihood based clustering methods: exploring the backbone of classifications; J. Stat. Mech., 4 (2010), p. 04028
Nallapati et al., 2008 R.M. Nallapati, A. Ahmed, E.P. Xing, et al.; Joint latent topic models for text and citations; ACM (2008), pp. 542–550
Parkkinen et al., 2009 J. Parkkinen, J. Sinkkonen, A. Gyenge, et al.; A block model suitable for sparse graphs; Proceeding of the 7th International Workshop on Mining and Learning with Graphs (2009)
Psorakis et al., 2010 I. Psorakis, S. Roberts, B. Sheldon; Partitioning in networks via Bayesian Non-negative Matrix Factorization; The 24th Conference on NIPS (2010)
Ren et al., 2009 W. Ren, G. Yan, X. Liao, et al.; Simple probalitic algorithm for detecting community structure; Phys. Rev. E, 79 (3) (2009), p. 036111
Vazquez, 2008 A. Vazquez; Population stratification using a statistical model on hypergraphs; Phys. Rev. E, 77 (6) (2008), p. 066106
Vazquez, 2009 A. Vazquez; Finding hypergraph communities: a Bayesian approach and variational solution; J. Stat. Mech.: Theory Exp. (2009), p. 07006
Yang et al., 2009a T. Yang, R. Jin, Y. Chi, et al.; Combining link and content for community detection: a discriminative approach; KDD (2009), pp. 927–936
Yang et al., 2009b T. Yang, Y. Chi, S. Zhu, et al.; A Bayesian framework for community detection integrating content and link; BUAI (2009), pp. 615–622
Yang et al., 2010 T. Yang, Y. Chi, S. Zhu, et al.; Directed network community detection: a popularity and productivity link model; SDM (2010), pp. 742–753

Summary

Keywords

Introduction

Community detection model based on statistical inference

Statistical inference model based on vertex community

Partition model

Mixed model

Mixed membership model

Mixed Membership Degree Random Block Model

Fusion Node Degree of Random Block Model

Statistical Inference Model Based On Link Community

Simple probabilistic algorithm expectation maximization

Stochastic block model for link community

General stochastic block model

Analysis based on statistical inference of community detection model

Qualitative comparison of models

Analysis

Conclusions

Conflict of interest

Acknowledgement

References

Document information

Document Score

Share this document

Keywords

claim authorship