Matching user accounts based on location verification across social networks

Abstract

Social networks have become the main platform for people to obtain information and connect with each other. Matching user accounts across networks helps build better user profiles and benefits many applications, and it has attracted much attention from both industry and academia. At present, cross-platform user identification methods fall into three categories: those based on basic user attribute information, on online social structure relationships, and on user behavior. Research on mobile social networks is a kind of dynamic mixed-data analysis. Because data are highly heterogeneous across platforms, and because users’ concealment behavior leaves information incomplete or untrue, the recognition rate of existing algorithms is relatively low. This paper presents a new method for matching user accounts based on location verification. First, the self-center network algorithm is applied to find cross-network edges in the respective networks of the two users to be matched, and the result is taken as the initial similarity value of the two users. Second, the longitude, latitude and time coordinates of nodes on each single platform are used to correct the similarity: five specific time points are selected within 24 hours, records within an error range of 10 min of each time point are accepted, and the great circle distance is used to compare locations. Third, because a user may not log in to a given social platform during a given period, a convolutional neural network is adopted to label the trajectory. Finally, all users in the whole network are identified by iteration. Experimental results on artificial random networks and real social networks show that the proposed algorithm achieves high precision and recall.

Keywords: Social networking service, across social networks, matching user accounts, location verification

1. Introduction

In recent years, personal life information has become increasingly bound to mobile intelligent terminals [1,2]. By March 2019, 47% of the world’s population had Internet access. Internet penetration was 81 percent (1 billion users) in developed countries, 40.1 percent (2.5 billion users) in developing countries and 15.2 percent in underdeveloped countries. Of all Internet users, 98% use at least one social network, with an average of 7.6 online accounts per user [3]. Globally, the highest numbers of online accounts per user were in Latin America (8.8) and Asia (8.1) [4]. Because the data volume of mobile intelligent terminals is very large, and especially because the problem of cross-platform collaborative processing of different accounts has not been solved, the IT industry’s service cost for mobile intelligent terminals is high [5,6].

People continuously generate large amounts of social data through mobile networks, including interaction information, emotional evolution, community composition and movement trajectories. Through reasonable modeling and mining, different social relations and distribution characteristics can be revealed. Cross-platform user identification can effectively address the situation in which a user’s accounts are scattered across isolated social networks [7,8]. Identifying the same user in different social networks therefore has great practical significance for commercial marketing, network security, privacy protection and other areas [9,10]. Research on mixed-data modeling of mobile social networks mainly includes two parts: preprocessing of user location data from multiple social platforms, and modeling and analysis of multi-source social relationships. This paper proposes a cross-network user identity linking method combined with location coordinate verification.

2. Related research

2.1 Research of cross-platform user identification

Cross-platform user identification is mainly divided into real identity identification of network users, geographic location identification of social network users, and cross-network co-user linking. Real identity identification is generally done by the social platform itself, although some scholars identify real users through big data analysis. For example, a social network identity recognition method was proposed that uses a location-based social network and users’ relationships to identify users and infer their address, information and interests [11]. Geo-location prediction for a social network user mainly aims to obtain the user’s geographic location from social data. For example, friendship relations on Facebook have been used to predict users’ geographic locations, with the conclusion that when a user’s relationship network contains more than 5 locatable users, that user’s location can be predicted effectively [12]. Cross-network co-user linking is the process of finding the same user in different types of social networks. Generally speaking, people join different social networking sites in order to enjoy different experiences and services. On the one hand, cross-network co-user linking is useful for public opinion analysis and control; on the other hand, it helps mine knowledge associations in multi-source data for personalized services. The multi-source nature of social media is mainly reflected in the heterogeneous user behavior information from different social media networks [13]. Understanding this multi-source phenomenon is therefore of great significance for the analysis of social media and for in-depth applications of social media big data.

2.2 Major problems in cross-network co-user linking

Cross-platform user identification methods can be divided into three categories: those based on basic user attribute information, on online social structure relationships, and on user behavior [14]. Specifically, the data analysis relies mainly on attribute information such as names, on friend relationships, and on user behavior. The main difficulties are missing user information, false information, and heterogeneous data caused by platform differences.

Social network users intentionally hide some information for fear of privacy disclosure. In the dataset, at least 85% of users were missing two key attributes, few people filled out options such as interests and hobbies, and fewer than 5 percent of users filled out all options. Such severe missing information challenges user identification based on basic user attribute information. Email authentication and mobile phone SMS authentication are the two basic registration methods in some social software. At the same time, users may not provide real information when registering. They generally extend one core word in different ways, such as adding a surname or a number, to register on different platforms. Many women do not fill in their real age, and some men give a false age or gender, so the recognition algorithm cannot identify interference from malicious users.

Social platforms differ markedly from one another in order to provide varied user experiences [15]. The same user therefore fills in different information on different platforms, which also makes users harder to identify. For example, users publish and follow content in different fields on different platforms: they follow celebrity gossip on Weibo, subscribe to public health accounts and groups on WeChat, and discuss their work in QQ work groups. Alternatively, a user may update one social platform in real time, for example sharing impressions of a movie on WeChat, while collecting photos and posting them to other platforms irregularly, or adding pictures at random when something moves them. This makes timestamps ineffective as a measure of user identity.

It can be seen that missing and inaccurate information poses great challenges to cross-platform user identification based on user attributes, and that users’ differing behavior on different platforms also makes it difficult for identity measures based on events, semantics or even time evolution to achieve ideal results.

2.3 Research progress of cross-network co-user linking

At present, research on cross-network co-user linking mainly uses explicit attribute information such as user names and images. A user-name-based method was proposed to determine user identity: a quantitative analysis of user name attributes yields identity results for further linking, and confirmatory experiments were carried out on large-scale real data sets [16]. A cross-network co-user identification algorithm based on information entropy was also proposed: it analyzes the data types and physical meanings of the attribute items and applies a different similarity calculation to each accordingly, assigns attribute weights by information entropy to exploit the potential information of each attribute, and finally fuses the per-attribute results to decide whether two accounts match [17]. However, recognition based on attribute information alone is limited, so many studies of friend topology relations have appeared. The three degrees of influence principle was used to build an inference model that identifies users through complete subgraphs of the social network. Xu et al. [18] proposed a cross-network co-user linking algorithm based on a weighted hypergraph to solve the problem that existing friend-based algorithms make little use of heterogeneous relationships in social networks. Most studies, however, fail to apply location information to cross-platform co-user linking. For this reason, this paper proposes a cross-network co-user linking method combined with location coordinate verification.

3. Cross-network user identity link method based on location verification

3.1 Link method based on friend relationship

Definition 1 (User identity). A virtual account is denoted $v_{i}$, where $v_{i}$ represents the network account of user $i$.

Definition 2 (Labeled nodes). A node that has already been identified is a labeled node, denoted $cv_{i}$. The network graphs can be abstracted into an unweighted, undirected topology, which can be described mathematically as follows:

(1)

where $V_{i}$ is the set of user accounts of network $G_{i}$, and each node $v_{i}$ is an account in network $G_{i}$. $E_{i}$ is the set of edges within a single network, marked “internal”: $e_{i}^{1,2}$ is an edge between $v_{i}^{1}$ and $v_{i}^{2}$ in network $G_{i}$, written $(v_{i}^{1}, internal, v_{i}^{2})$, and means that the two users are friends online. $C_{i,j}$ is the set of edges across different networks, marked “cross”: $C_{i,j}^{1,2}$ is an edge between $v_{i}^{1}$ in $G_{i}$ and $v_{j}^{2}$ in $G_{j}$, written $(v_{i}^{1}, cross, v_{j}^{2})$, and means that the two accounts belong to the same person in the real world.

Based on the above definitions, cross-network identity identification can be expressed as follows: to identify a person across different networks, find the multiple accounts that the person holds in those networks, thus forming the “connection mapping” $\phi(v_{i})=\ldots=\phi(v_{j})$ between the accounts. The inputs are the social network topologies of the single networks, and the outputs are the labeled nodes $cv_{i}$ and the cross edge set $C_{i,j}$.

The existing self-center network algorithm finds and matches the cross-network connections formed between labeled nodes in the self-center networks of two candidate nodes, and the result is used as the similarity value of the two nodes. The flowchart is shown in Figure 1. If the similarity value is greater than a certain threshold, the pair is judged a successful match; otherwise the match fails. After $K$ iterations, the matched nodes can serve as new labeled nodes. When no new labeled nodes are added to the network, the iteration stops.

Figure 1. The existing self-center network algorithm
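The iteration described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_graph`, `similarity`, `iterative_match`), the use of a simple count of shared labeled friends as the similarity value, the greedy one-to-one assignment and the default threshold are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]  # account -> set of friend accounts


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an unweighted, undirected graph from "internal" edges."""
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def similarity(u: str, v: str, g1: Graph, g2: Graph,
               labeled: Dict[str, str]) -> int:
    """Count friends of u whose labeled counterpart is a friend of v."""
    return sum(1 for f in g1[u] if labeled.get(f) in g2[v])


def iterative_match(g1: Graph, g2: Graph, seeds: Dict[str, str],
                    threshold: int = 1, max_iter: int = 10) -> Dict[str, str]:
    """Grow the labeled set until an iteration adds no new matches."""
    labeled = dict(seeds)  # g1 account -> matched g2 account
    for _ in range(max_iter):
        taken = set(labeled.values())
        new_pairs: Dict[str, str] = {}
        for u in sorted(g1):
            if u in labeled:
                continue
            candidates = [v for v in sorted(g2) if v not in taken]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda v: similarity(u, v, g1, g2, labeled))
            # Accept only if the similarity exceeds the threshold.
            if similarity(u, best, g1, g2, labeled) > threshold:
                new_pairs[u] = best
                taken.add(best)
        if not new_pairs:  # no new labeled nodes: stop iterating
            break
        labeled.update(new_pairs)
    return labeled


edges_1 = [("a1", "b1"), ("a1", "c1"), ("b1", "c1"), ("b1", "d1"), ("c1", "d1")]
edges_2 = [("a2", "b2"), ("a2", "c2"), ("b2", "c2"), ("b2", "d2"), ("c2", "d2")]
result = iterative_match(build_graph(edges_1), build_graph(edges_2),
                         seeds={"a1": "a2", "b1": "b2"})
print(result)
```

In this toy example the two seed pairs allow `c1` to be matched in the first pass, and the newly labeled `c1` then provides enough evidence to match `d1` in the second pass, which is the effect the iteration is meant to achieve.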

3.2 Position coordinates and distance

Definition 3 (Location information). The location information on a single network platform is composed of latitude, longitude and time coordinates, $L_{i}=(lat_{i},lon_{i},t_{i})$. For similar nodes, the location information at time $t$ should be selected and recorded:

(2)

where $[t-\Delta t,t+\Delta t]$ is the permitted range and $\Delta t$ equals 10 min. The time points $t$ are chosen mainly from the obvious online peak times of social network users.
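As a hedged illustration of the time-window selection, the sketch below picks, for a chosen time point $t$, a location record whose timestamp lies within $\Delta t$ = 10 min of $t$. The record layout, the function name `select_at` and the tie-breaking rule (take the record closest to $t$) are assumptions, not details given in the paper.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Record = Tuple[float, float, datetime]  # (lat, lon, timestamp)


def select_at(records: List[Record], t: datetime,
              delta: timedelta = timedelta(minutes=10)) -> Optional[Record]:
    """Return the record closest to t within [t - delta, t + delta], or None."""
    in_window = [r for r in records if abs(r[2] - t) <= delta]
    if not in_window:
        return None  # the user left no usable record around this time point
    return min(in_window, key=lambda r: abs(r[2] - t))


records = [
    (45.75, 126.63, datetime(2019, 3, 1, 11, 55)),
    (45.76, 126.64, datetime(2019, 3, 1, 12, 7)),
    (45.80, 126.70, datetime(2019, 3, 1, 12, 30)),
]
noon = select_at(records, datetime(2019, 3, 1, 12, 0))
evening = select_at(records, datetime(2019, 3, 1, 20, 0))
```

A `None` result corresponds to the case where the user was not active on the platform around the selected time point.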

Definition 4 (Distance between geographic locations). $l_{i}$ and $l_{j}$ represent the geographic locations of user $i$ and user $j$, respectively. $(lat_{i}, lon_{i})$ are the latitude and longitude coordinates of location $l_{i}$, and $(lat_{j}, lon_{j})$ are those of $l_{j}$. The distance between the two Global Positioning System (GPS) coordinates is calculated as the great circle distance, that is, the length of the shortest path from one point to the other on the sphere. The calculation is as follows:

$$d\left(l_{i},l_{j}\right)=2R\times \arcsin \sqrt{\sin^{2}\left(\frac{lat_{i}-lat_{j}}{2}\right)+\cos(lat_{i})\cos(lat_{j})\sin^{2}\left(\frac{lon_{i}-lon_{j}}{2}\right)}$$

(3)

where $R$ is the Earth’s radius (6371 km), so $d(l_{i},l_{j})$ is in kilometers.
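The great circle distance of Eq. (3) can be computed with the short Python sketch below. It assumes that coordinates are stored in decimal degrees and therefore converts them to radians before applying the trigonometric functions; the function name `great_circle_km` is chosen here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # R in Eq. (3)


def great_circle_km(lat_i: float, lon_i: float,
                    lat_j: float, lon_j: float) -> float:
    """Great circle (haversine) distance in kilometers, inputs in degrees."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    d_lat = phi_i - phi_j
    d_lon = math.radians(lon_i - lon_j)
    h = (math.sin(d_lat / 2) ** 2
         + math.cos(phi_i) * math.cos(phi_j) * math.sin(d_lon / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error before taking arcsin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
```

Two points on the equator separated by 90 degrees of longitude give a quarter of the Earth's circumference, which is a convenient sanity check for the formula.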

3.3 Track labeling and similarity

It cannot be excluded that a user did not log in to a given social platform during a given period. Therefore, a convolutional neural network is adopted to label the trajectory in the first stage: the trajectory features are the input and the labeled nodes are the output. Trajectory labeling can be regarded as a multi-label classification problem, so several labels can apply to the same trajectory at once. Accordingly, the components of the supervision vector corresponding to the categories of the sample are set to 1, and the activation function of the output layer is the logistic function. The outputs of the output layer are sorted, and the top-ranked categories are taken as the categories predicted by the convolutional neural network for the sample. Training the convolutional neural network consists of two steps:

Step 1: Forward propagation phase: ① Take a sample ($X$, $Y_{p}$) from the sample set and input $X$ into the network; ② Calculate the corresponding actual output $O_{p}$. At this stage, the information is transformed from the input layer to the output layer. This is also the process that the network performs when it runs normally after training.

Step 2: Backward propagation phase: ① Calculate the difference between the actual output $O_{p}$ and the corresponding ideal output $Y_{p}$; ② Adjust the weight matrix so as to minimize the error. The back propagation algorithm is based on gradient descent and is divided into two steps: forward propagation produces the outputs and computes the error, and back propagation adjusts the weights. For sample $n$, the error is:

$$E^{n}=\frac{1}{2}\sum_{k=1}^{c}\left(t_{k}^{n}-y_{k}^{n}\right)^{2}$$

(4)

where $c$ is the number of sample categories, $t_{k}^{n}$ is the target value of the $k$-th dimension of the $n$-th sample (the classification label), and $y_{k}^{n}$ is the $k$-th dimension of the output corresponding to the $n$-th sample. Specifically, trajectory fitting is performed at common time points, which means that the problem of trajectories missing from some platforms at the same time is ignored.
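To make the output layer and the per-sample error concrete, the sketch below evaluates a logistic output, a per-sample error and the ranking of output categories in plain Python. It assumes that the per-sample error is the usual squared error, half the sum over the $c$ categories of the squared difference between target and output; the function names and the example values are illustrative only, and no network weights or training loop are shown.

```python
import math
from typing import List, Sequence


def logistic(z: float) -> float:
    """Logistic activation used at the output layer."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_error(targets: Sequence[float], outputs: Sequence[float]) -> float:
    """Squared error for one sample, summed over the c category dimensions."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))


def rank_labels(outputs: Sequence[float], top_k: int) -> List[int]:
    """Indices of the top_k categories after sorting outputs in descending order."""
    order = sorted(range(len(outputs)), key=lambda k: outputs[k], reverse=True)
    return order[:top_k]


# Multi-label target: categories 0 and 2 both apply to this trajectory.
targets = [1.0, 0.0, 1.0]
outputs = [logistic(z) for z in (0.0, 0.0, 0.0)]  # an untrained network
error = sample_error(targets, outputs)
predicted = rank_labels([0.2, 0.9, 0.6], top_k=2)
```

Because several components of the target vector can equal 1 at once, each output is squashed independently by the logistic function rather than normalized across categories.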

4. Experiment

To verify the practicability of the algorithm, real data were extracted from Facebook and Twitter by collecting the accounts in the friend lists of 14 students who have registered accounts on both networks. On Facebook, 308 accounts and 877 internal links were found; on Twitter, 896 accounts and 3,541 internal links were found. There are 139 cross links between the two networks, meaning that 139 pairs of accounts can be mapped to the same offline user according to the network structure; of these, only 107 pairs remain valid after position verification, as shown in Table 1.

Table 1. Cross-network user statistic based on seed users

Social network	Nodes	Edges	Cross edges	Location-confirmed edges
Facebook	308	877	139	107
Twitter	896	3541	139	107
Total	1204	4418	278	214

Some nodes were randomly selected from the cross edges as prior labeled nodes; their number ranged from 20 to 80 in this experiment. Precision, recall and the comprehensive evaluation index F1 were adopted as the criteria for measuring the performance of the algorithm. They are defined as follows:

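$$P=\frac{tp}{tp+fp}$$

(5)

$$R=\frac{tp}{tp+fn}$$

(6)

$$F1=\frac{2\times P\times R}{P+R}$$

(7)

where $tp$ is the number of correctly matched accounts, $fp$ is the number of wrongly matched accounts, and $fn$ is the number of matching accounts that were missed; the recall is thus the rate of correct identification.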
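The three criteria can be computed as in the following sketch, which uses the standard definitions of precision, recall and F1 in terms of $tp$, $fp$ and $fn$. The function names are illustrative, and returning 0 when a denominator is zero is a convention assumed here.

```python
from typing import Tuple


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from match counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r, f1(p, r)


# Example: 8 correct matches, 2 wrong matches, 8 missed matches.
precision, recall, f_score = prf(tp=8, fp=2, fn=8)
```

As a consistency check, `f1(0.533, 0.234)` rounds to 0.325, which agrees with the F value reported for random selection with 20 prior tags in Table 2.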

In the experiment, prior labeled nodes were chosen by random selection, by degree ranking and by PageRank ranking, respectively. Degree ranking chooses the nodes with the largest degree. For PageRank, the edge weights of the weighted network are the number of common labeled (tag) nodes divided by the number of common neighbors. The PRF values for the different prior node selection methods, computed on user pairs after position-based verification, are shown in Table 2.

Table 2. Statistics of PRF values based on different prior node selection methods

Data source	No. of prior tags	Random			Degree ranking			PageRank
		P	R	F	P	R	F	P	R	F
Facebook Twitter	20	0.533	0.234	0.325	0.834	0.167	0.278	0.864	0.158	0.267
	30	0.587	0.284	0.383	0.871	0.201	0.327	0.901	0.197	0.323
	40	0.612	0.315	0.416	0.895	0.284	0.431	0.912	0.291	0.441
	50	0.687	0.346	0.460	0.921	0.306	0.459	0.894	0.289	0.437
	60	0.724	0.452	0.557	0.935	0.392	0.552	0.931	0.346	0.505
	70	0.852	0.514	0.641	0.969	0.483	0.645	0.952	0.492	0.649
	80	0.783	0.562	0.654	0.934	0.496	0.648	0.963	0.504	0.662

In Table 2, prior node selection by degree ranking and by PageRank ranking gives higher precision, whereas random selection gives better recall; there is no significant difference in the comprehensive evaluation index F1.
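The three prior-node selection strategies compared in Table 2 can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: the function names are hypothetical, the PageRank shown is the plain unweighted power iteration with a damping factor of 0.85 (the weighting by common labels and common neighbors used in the experiment is not reproduced), and ties are broken by account name.

```python
import random
from typing import Dict, Iterable, List, Set

Graph = Dict[str, Set[str]]  # account -> set of friend accounts


def by_random(candidates: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Pick k prior nodes uniformly at random (seeded for repeatability)."""
    return random.Random(seed).sample(sorted(candidates), k)


def by_degree(graph: Graph, candidates: Iterable[str], k: int) -> List[str]:
    """Pick the k candidates with the largest degree."""
    return sorted(candidates, key=lambda n: (-len(graph[n]), n))[:k]


def pagerank(graph: Graph, damping: float = 0.85, iters: int = 50) -> Dict[str, float]:
    """Unweighted PageRank by power iteration on an undirected graph."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in graph}
        for v, nbrs in graph.items():
            if nbrs:
                share = damping * rank[v] / len(nbrs)
                for w in nbrs:
                    new[w] += share
            else:  # isolated node: spread its rank evenly
                for w in graph:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank


def by_pagerank(graph: Graph, candidates: Iterable[str], k: int) -> List[str]:
    """Pick the k candidates with the highest PageRank score."""
    scores = pagerank(graph)
    return sorted(candidates, key=lambda n: (-scores[n], n))[:k]


star = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
top_degree = by_degree(star, star, 1)
top_pagerank = by_pagerank(star, star, 1)
```

On the small star graph both ranking strategies select the hub `h`, which illustrates why they tend to favor well-connected accounts as prior nodes, whereas random selection spreads the prior nodes over the whole candidate set.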

To analyze the effect of position verification, 30, 50 and 70 prior nodes were drawn from the 139 original cross edges given by the social network structure and from the 107 cross edges remaining after position verification, respectively. The prior nodes were all selected by the degree ranking method. The comparison results are shown in Table 3.

Table 3. Statistics of PRF values before and after combined position verification

No. of prior tags	Network based			Location verification
	P	R	F	P	R	F
30	0.871	0.201	0.327	0.892	0.310	0.460
50	0.921	0.306	0.459	0.946	0.417	0.579
70	0.969	0.483	0.645	0.971	0.532	0.687

Compared with the original network structure method, the method combined with location verification improves precision, recall and the comprehensive evaluation index to varying degrees, and can therefore link online and offline user identities across networks more accurately.

5. Conclusions

Starting from the fact that social network users register on different platforms, this paper reviewed three existing approaches to cross-network user identity linking, based on basic user attribute information, online social structure relationships and user behavior. A cross-network user identity linking method combined with position coordinate verification was then proposed: it uses the network structure to find the initial cross-network edges and screens matching nodes through position verification. Experiments on a real social network verified that the algorithm achieves high precision and recall.

Acknowledgements

This work was supported in part by the Heilongjiang province social science research and planning project (18TQB100), in part by the Natural Science Foundation of China under Grant (61672179), in part by the Harbin Youth Reserve Talent Project under Grant (2017RAQXJ102), and in part by Heilongjiang province Postdoctoral fund (LBH-Z16053).

References

[1] Lu Di, Han Yinli, Xu Yue. Development and management of intelligent media under the background of post-mobile Internet. Modern Communication, 40(05):9-13, 2018.

[2] Xu Jinghong, Duan Zening, Hou Weipeng, et al. Data sharing and privacy protection under the business model of mobile internet. Information Theory and Practice, 41(01):50-54, 2018.

[3] Ravi Vatrapu, Raghava Rao Mukkamala, Abid Hussain, et al. Social set analysis: A set theoretical approach to big data analytics. IEEE Access, 4:2542-2571, 2016.

[4] Hongyang Zhao, Huan Zhou, Chengjue Yuan, et al. Social discovery: Exploring the correlation among three-dimensional social relationships. Journal of Communications and Networks, 17(2):126-132, 2015.

[5] Liu D., Wu Q.Y. Cross-platform user profile matching in online social networks. Applied Mechanics and Materials, 380-384:1955-1958, 2013.

[6] Mehess S.J. Finding the missing links: A comparison of social network analysis methods. Dissertations & Theses Gradworks, 2016.

[7] Xiaoping Zhou, Xun Liang, Haiyan Zhang, et al. Cross-platform identification of anonymous identical users in multiple social media networks. IEEE Transactions on Knowledge and Data Engineering, 28(2):411-424, 2016.

[8] X. Zhou, X. Liang, H. Zhang, et al. Cross-platform identification of anonymous identical users in multiple social media networks. IEEE Trans. Knowl. Data Eng., 28(2):411-424, 2016.

[9] Rao Yuan, Wu Lianwei, Zhang Junyi. A survey of information propaganda mechanism under the cross-medium. Scientia Sinica Informationis, 47(12):1623-1645, 2017.

[10] Hadeel S. Humadde, Alia K. Abdul-Hassan, Bashar S. Mahdi. Proposed user identification algorithm across social network using hybrid techniques. 2nd Scientific Conference of Computer Sciences (SCCS), 27-28, 2019.

[11] Hu Kaixian, Liang Ying, Xu Hongbo, et al. A method for social network user identity feature recognition. Journal of Computer Research and Development, 53(11):2630-2644, 2016.

[12] Backstrom L., Sun E., Marlow C. Find me if you can: Improving geographical prediction with social and spatial proximity. Proc of the 19th Int Conf on World Wide Web, New York, ACM, 61-70, 2010.

[13] Sang Jitao, Lu Dongyuan, Xu Changsheng. Overlapped user-based cross-network analysis: Exploring variety in big social media data. Chinese Science Bulletin, 59(36):3554-3560+1, 2014.

[14] Wang Q., Shen D.R., Feng S., et al. Identifying users across social networks based on global view features with crowdsourcing. Journal of Software, 29(3):811-823, 2018.

[15] Liu Dong, Wu Quanyuan, Han Weihong, et al. User identification across multiple websites based on username features. Chinese Journal of Computers, 38(10):2028-2040, 2015.

[16] Wu Zheng, Yu Hongtao, Liu Shuxin, et al. User identification across multiple social networks based on information entropy. Journal of Computer Applications, 37(08):2374-2380, 2017.

[17] Hu Kaixian, Liang Ying, Su Lixin, et al. Method for social network user feature recognition based on clique. Pattern Recognition and Artificial Intelligence, 29(08):698-708, 2016.

[18] Xu Qian, Chen Hongchang, Wu Zheng, et al. User identification method across social networks based on weighted hypergraph. Journal of Computer Applications, 37(12):3435-3441+3471, 2017.