
Abstract

The influence of users on an online Forum should not be determined simply by the global network topology but rather in the corresponding local network given by the user's active range and semantic relations. Current analysis methods mostly focus on urgent topics while ignoring persistent topics, but persistent topics often have important implications for public opinion analysis. Therefore, this paper explores key person analysis in persistent topics on online Forum based on semantics. First, the interaction data are partitioned into subsets according to month, and the Latent Dirichlet Allocation (LDA) model and a filtering strategy are used to identify the topics from each partition. Then, we try to associate each topic with one from the adjacent time slice that fulfills the criterion of having a high similarity degree. On the basis of such topics, persistent topics are defined as those that exist for a sufficient number of periods. Following this, the paper provides the commitment function update criteria for the persistent topic social network (PTSN) based on semantics and the sentiment weighted node position (SWNP) to identify the key persons who have the most influence in the field. Finally, emotional tendency analysis is applied to correct the results. Experiments on real data sets validate the effectiveness of these methods.

Keywords: Online Forum, persistent topic, key person, social network analysis

1. Introduction

As an electronic information service system on the Internet, an online Forum provides a public electronic forum on which each user can post messages and put forward views [1]. Online Forum gathers many users who are willing to share their experiences, information and ideas, and a user can browse others’ information and publish his/her own to form a thread through a unique registration ID [2].

Social network analysis (SNA) [3] can help us obtain the implicit characteristics of the users and of information dissemination in a numerical manner. Forum topics are mainly divided into two categories: (1) emergency topics, which are characterized by a short duration with intense discussion; (2) persistent topics, characterized by a long duration and typically closely related to people's livelihood. Most studies have focused on the former, such as research on the discovery and prediction of online Forum hot topics and false information dissemination after emergencies [4]. There are two core issues that must be solved to identify key users in persistent livelihood topics: (1) the extraction of persistent topics and (2) the identification of key users. To solve the first issue, we combine the time dimension and apply the latent Dirichlet allocation (LDA) topic model and the short text similarity assessment model to discover the persistent topics [5]. To solve the second, SNA provides a series of node metrics (e.g., centrality, prestige, trust and connectivity). The node position assessment proposed by Przemysław Kazienko is a very effective method for analysis, but it is more suitable for the global network and ignores semantic factors. Therefore, we provide the sentiment weighted node position (SWNP) algorithm and apply it to the persistent topic network to rank the users' influence.

The algorithm must solve several problems. First, it must ensure that the extracted topics are consistent with the clustering results, so the algorithm uses the LDA model and the short text similarity assessment model for screening and gathering related posts, while adopting adjacent time slice cross matching to ensure topic sustainability on the timeline. After cataloging the posts, the corresponding participants and the reply relations, the persistent topic social network can be built and expressed as PTSN = (V, E), where V and E represent the nodes and their relationships in the local network, respectively. The algorithm then identifies the critical nodes in the local network, which have the greatest amount of influence on the specific topic and other users. After attempting different methods on real three-year online Forum data, the SWNP is provided and compared to typical methods.

The rest of the paper is organized as follows. We briefly review related work in Section 2. We then present an overview of LDA and the short text similarity assessment model in Section 3. In Section 4, we propose persistent topic key person analysis in online Forum software, with detailed explanations. We discuss detailed experimental results on the research corpus in Section 5, and we conclude this paper in Section 6.

2. Related work

2.1 SNA in online Forum

Online Forum is an important platform for information dissemination. A user publishes a post to express his/her views on a given event, and others can browse the posts and create his/her own to form a thread through a unique registration ID [2]. A very important element of posting is the ability to add comments, which enables discussions. Accessibility to posts is generally open, so anyone may read or comment. Online Forum is always busy with activity: every day, a large number of new users will register, and thousands of new posts and millions of new comments are written. The lifetime of posts is very short, and the relationships between users are very dynamic and temporal, providing a large amount of semantic information to explore intensely [6].

Research on online Forum is primarily rooted in public opinion guidance, sociology, linguistics and psychology, while data mining techniques are less frequently employed. However, nearly all online Forum websites record some basic statistics, which lend themselves well to data analysis and important findings. This network model, consisting of boards, posts and comments, can be analyzed by SNA to find the most important or influential users. Around such users, groups that share similar interests will form.

There are many types of online Forum: campus online Forum, commercial online Forum, professional online Forum, emotional online Forum and individual online Forum. We chose the comprehensive Tianya forum as the basis for our research because persistent livelihood topics are more likely to occur in this active social online Forum. There is some research on the Tianya forum datasets, such as the opinion leader algorithm based on users’ interests, but the accuracy depends on the quality of the interest field [7].

2.2 Topic discovery

Some results have been achieved in the network topology and topic propagation models, but they are still new. Previous studies can be mainly divided into three categories: (1) the first type of research mainly focuses on the distribution of users to reveal their dynamic characteristics. (2) The second type of research mainly focuses on the topics of discovery and prediction. Wang [8] improved the information diffusion model based on topic influence, and proposed the topic diffusion trend prediction method based on the reply matrix. (3) The third type of research studies semantic communities for user characteristic analysis. Bu et al. [9] proposed a sock puppet detection algorithm that combines authorship-identification techniques with link analysis.

Compared to research around sudden hot issues, few studies consider persistent livelihood topic discovery, evolution and traceability. With the rapid dissemination of information, people's livelihood topics will continue to ferment, and they will inevitably have an impact on the management of networked public opinion without the necessary regulation and counseling.

2.3 Key person extraction

There are two separate approaches to key person extraction in social networks: those based on context roles and those based on social network structure. The most common key person extraction methods rely on various centrality measures for each separate node. However, these algorithms lack a holistic view, and the node position in the social community is determined by its neighborhoods, such as indegree prestige and degree centrality. Other algorithms are more global, such as proximity prestige, rank prestige, node position, eccentricity and closeness centrality. Much of this research has been applied to different domains (e.g., influence spread, public opinion analysis, and terrorist group analysis) [10].

In fact, the user influence is not solely determined by the overall network topology but confirmed by the local network structure and semantic relationships among active users. No existing algorithm can meet this demand, and because the entire network is not the best choice, the influence field must be determined before the key person may be extracted. The PTSN is a semantic-based local network, so we propose a node position algorithm combined with semantic information to identify key persons.

3. Persistent topic extraction in social network

To obtain the persistent livelihood topic in online Forum, two basic methods are introduced here. The first is the LDA model for extracting topics, and the second is the short text similarity assessment model to distinguish persistent topics and emergency ones.

3.1 LDA

In statistical natural language processing, one common way of modeling the contributions of different topics to a document is to treat each topic as a probability distribution over words, viewing a document as a probabilistic mixture of these topics. Given D documents containing Z topics expressed over W unique words, the corpus can be represented as a sequence of word instances {w_1, ..., w_N}, where each w_i belongs to some document d_i, and z_i is a latent variable indicating the topic from which the i-th word was drawn. The complete probability generative model is defined as follows:

P(w_i) = \sum_{j=1}^{Z} P(w_i \mid z_i = j) P(z_i = j)    (1)

Here, the hyperparameters α and β are mainly used to control the sparsity of the distribution. According to this model, every word w_i will be assigned to a latent topic z_i.

In a corpus, the goal of LDA is to extract the latent topics by evaluating the posterior distribution P(z | w). The sum in the denominator involves Z^N terms, where N is the total number of word instances in the corpus. However, it does not factorize, so Gibbs sampling is now widely adopted. Gibbs sampling estimates the probability of a word belonging to a topic according to the topic distribution of the other words. At the beginning of the sampling, every word is randomly assigned to a topic as the initial state of a Markov chain. Each state of the chain is an assignment of values to the variables being sampled. After enough iterations, the chain approaches the target distribution and the current values are recorded as the expected probability distribution. In the end, it obtains each topic z_j together with the probability P(w | z_j) with which a word w appears in z_j.
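As a rough illustration of this step, the sketch below extracts topics with LDA using hyperparameters matching those used later in Section 5 (α = 0.5, β = 0.1, Z = 50, 1000 iterations); the toy corpus, the tokenization and the use of the gensim library (which applies variational inference rather than Gibbs sampling) are assumptions made only for illustration.

```python
# Minimal LDA topic-extraction sketch; corpus, tokenizer and library are assumptions.
from gensim import corpora, models

posts = [
    "housing price policy discussion",
    "college entrance exam and employment",
    "graduation season job hunting",
]
tokenized = [p.lower().split() for p in posts]          # toy tokenization
dictionary = corpora.Dictionary(tokenized)              # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# The hyperparameters play the same role as in the paper, even though gensim
# uses online variational Bayes instead of Gibbs sampling.
lda = models.LdaModel(
    bow_corpus,
    id2word=dictionary,
    num_topics=50,       # Z = 50
    alpha=0.5,           # document-topic sparsity
    eta=0.1,             # beta: topic-word sparsity
    iterations=1000,
    random_state=0,
)

# Keywords and probabilities P(w | z) for one topic, as needed by Eq. (2) below.
print(lda.show_topic(0, topn=10))
```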

3.2 Short text similarity assessment model

Quan provides a short text similarity computing method based on probabilistic topics [11]. The algorithm applies a topic model to the short text feature vectors, then determines the semantic similarity by computing the cosine between the vectors. We adapt the model to online Forum title texts using a minimum threshold, which requires less computing cost.

The model analyzes topics in two adjacent time periods, so let the former topics be Z^t = {z_1^t, ..., z_m^t} with their corresponding topic vectors (keywords and probabilities), and the later ones be Z^{t+1} = {z_1^{t+1}, ..., z_n^{t+1}}. The existing similarity formula is not suitable, so Eq. (2) is used for this work. To obtain the high similarity degree topics in adjacent time slots, m × n similarity calculations are needed, i.e., each topic is required to be matched with all topics in the other time period

sim(z_a^t, z_b^{t+1}) = \sum_{w \in z_a^t \cap z_b^{t+1}} \min\left(P(w \mid z_a^t), P(w \mid z_b^{t+1})\right)    (2)

where sim(z_a^t, z_b^{t+1}) is the similarity degree of topics z_a^t and z_b^{t+1}, which equals the sum of the minimum probabilities of the words appearing in both topics. If the similarity is larger than the threshold σ_1, the two topics are considered similar. If a topic continues over several periods in this way, it can be considered a persistent topic.

Meanwhile, the size of a topic z in a certain period can be measured by Eq. (3)

rel(p, z) = \sum_{w \in title(p) \cap z} P(w \mid z)    (3)


Here, the title of post p is matched against the keywords of topic z, and the sum of the probabilities of the successfully matched keywords is the relevancy rel(p, z). If rel(p, z) is greater than σ_2, then the post is related to the topic. The thresholds σ_1 and σ_2 will be confirmed in the experiment.
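To make Eqs. (2) and (3) concrete, a minimal sketch follows; it assumes topics are plain dictionaries mapping keywords to P(w | z), and the helper names, toy topics and threshold values are illustrative (the thresholds mirror those fixed later in Section 5).

```python
# Illustrative helpers for Eq. (2) and Eq. (3); topic = {word: P(word | topic)}.

def topic_similarity(topic_a, topic_b):
    """Eq. (2): sum of the minimum probabilities of shared keywords."""
    shared = set(topic_a) & set(topic_b)
    return sum(min(topic_a[w], topic_b[w]) for w in shared)

def post_relevancy(title_words, topic):
    """Eq. (3): sum of P(w | topic) over keywords appearing in the post title."""
    return sum(topic[w] for w in set(title_words) if w in topic)

# Toy topics from two adjacent months (keyword -> probability).
topic_may  = {"exam": 0.07, "graduation": 0.06, "job": 0.05}
topic_june = {"exam": 0.06, "graduation": 0.05, "salary": 0.04}

sigma_1, sigma_2 = 0.09, 0.05            # thresholds used in Section 5
print(topic_similarity(topic_may, topic_june) >= sigma_1)   # True: same topic
print(post_relevancy("graduation job fair".split(), topic_may) >= sigma_2)
```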

4. Analysis of key person in persistent topic with online Forum

Two important issues in social network analysis are individual role and social position. Analysis of key persons in persistent topics with online Forum is further considered.

Due to the time characteristics, the gathered data should be partitioned into subsequent periods with the same length, which are labeled from 0 to T, and these periods may be separate or partly overlapping. In the experimental studies (see Section 5), we assumed that they have a length of 30 days.

The LDA was used to obtain the topics in each period and extract the persistent topic across multi-periods through the similarity assessment. Then, for each persistent topic, the social network was generated and the fundamental SNA measures were calculated to identify the key person.

In the first step, the interaction data are partitioned into subsets by month, and the LDA and filtering strategy are used to identify the topics from each partition. Then, the algorithm attempts to associate one topic with another from the neighboring period while fulfilling the criterion of having a similarity degree larger than σ_1. On the basis of this comparison, the persistent topics that exist for a sufficient number of periods are defined. Following this, the algorithm uses sentiment weighted node positions in the interaction data to identify the key persons who have the most influence in the field.

The algorithm consists of six subsequent steps:

Step 1. The gathered text stream should be partitioned into subsequent periods with the same length.

Step 2. Extract topics, and then record the relevant posts, users, reply rates, etc. To achieve this, the LDA algorithm described in Section 3.1 is used. The topics are obtained in every time slice.

Step 3. Simplify the topics using the filtering strategy. For a given period t, after the attribute filter is applied and the topic set is identified, each topic contains its keywords and their corresponding probabilities. A topic is retained once it meets one of the following filtering strategies (a small filtering sketch follows this list):

(1) The number of posts related to the topic (relevancy rel(p, z) larger than σ_2) is greater than or equal to 10. σ_2 is 0.05 in Section 5; that is, a post is related if its title contains a keyword of the topic.

(2) The total number of users involved in the topic is greater than or equal to 10% of the active users of the period;

(3) The topic's “hotness” (click times divided by the number of active users) is greater than or equal to 10%;

(4) The response rate (the total participation of users divided by the total number of clicks) is greater than or equal to 30%.
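A compact version of this filter might look as follows; the topic record fields (related_posts, users, clicks, participations) and the helper name are assumptions made only for illustration.

```python
# Hypothetical topic record produced after Step 2; field names are illustrative.
def keep_topic(topic, active_users_in_period):
    """Apply the four filtering strategies of Step 3 (a topic is kept if ANY holds)."""
    related_posts = topic["related_posts"]          # posts with rel(p, z) >= sigma_2
    n_users       = len(topic["users"])             # users involved in the topic
    hotness       = topic["clicks"] / active_users_in_period
    response_rate = topic["participations"] / topic["clicks"]
    return (
        len(related_posts) >= 10
        or n_users >= 0.10 * active_users_in_period
        or hotness >= 0.10
        or response_rate >= 0.30
    )
```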

Step 4. The topics in adjacent time slices are cross matched. To achieve this, the short text similarity assessment model described in Section 3.2 is used. The threshold σ_1 is 0.09 in Section 5.

Step 5. Identify persistent topics that exist for a minimum period of time. Urgent topics have small time spans and simple network evolutions, and they do not belong to the persistent topics that this article focuses on. Ephemeral topics do not last for more than two periods, but some may occur at the junction of two periods, so a minimum number of periods is defined for topic longevity. In these experiments, it is assumed to be three periods (i.e., three months).

A persistent topic, which consists of similar topics during the adjacent periods t, t+1, ..., t+k, together with the numbers of topic-related posts and users, is defined as follows:

PT = \{z^{t}, z^{t+1}, \dots, z^{t+k}\},\quad Posts(PT) = \bigcup_{p=t}^{t+k} Posts(z^{p}),\quad Users(PT) = \bigcup_{p=t}^{t+k} Users(z^{p})    (4)
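Steps 4 and 5 together amount to chaining similar topics across adjacent months. The sketch below illustrates this under the assumption that each month is represented as a list of topic dictionaries and reuses the topic_similarity helper from the Section 3.2 sketch; the data layout and function names are illustrative, not the paper's implementation.

```python
# Chain topics across adjacent time slices (Steps 4-5); data layout is assumed.
def persistent_topics(monthly_topics, sigma_1=0.09, min_periods=3):
    """monthly_topics: list over months, each a list of {word: P(w|z)} dicts."""
    chains = []
    for month_idx, topics in enumerate(monthly_topics):
        for topic in topics:
            extended = False
            for chain in chains:
                last_month, last_topic = chain[-1]
                if (last_month == month_idx - 1
                        and topic_similarity(last_topic, topic) >= sigma_1):
                    chain.append((month_idx, topic))
                    extended = True
                    break
            if not extended:
                chains.append([(month_idx, topic)])
    # Keep only chains spanning at least min_periods adjacent months.
    return [c for c in chains if len(c) >= min_periods]
```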


Step 6. The key persons in the persistent topic are identified using SWNP. First, the PTSN is built. The traditional node position algorithm has an experimental basis for large-scale data [12] but does not consider interest, topic and sentiment factors, so this paper provides the SWNP to estimate the importance of a node in a local network.

Every term/phrase is manually assigned a value between 0 and 1 according to its tone. Oppressive terms range between 0.5 and 1, and a higher value corresponds to a greater degree of oppression. Supportive terms range between 0 and 0.5, and a smaller value corresponds to a greater degree of support. If the phrase is neutral, it is assigned a value of 0.5.

For a given comment from one ID to the other, we can determine the implicit orientation by counting the number of positive or negative words in it (if there are several emotional words in one comment, we take the average).

sentiment_{i,j} = \frac{\sum_{k=1}^{n^j} O_{k,j}}{n^j}    (5)

where O_{k,j} is the weight of the k-th emotional word in the comments from user i to user j, and n^j is the number of emotional words in all the comments from i to j. A sentiment_{i,j} greater than 0.5 indicates a negative emotional tendency with a negative commitment function, and a value less than 0.5 indicates a positive commitment function. The node position can then be redefined as follows:

SWNP(x) = (1 - \varepsilon) + \varepsilon \sum_{y \in N(x)} SWNP(y) \cdot C(y \rightarrow x)    (6)

where N(x) is the set of x's nearest neighbors, i.e., nodes that are in a direct relation to x; C(y → x) is the commitment function; ε is a constant coefficient in the range [0,1], and its value denotes the openness of the node position measure to external influences: a smaller value indicates that x's node position is more static and independent, while a larger value means that the node position is more influenced by others.

The value of the commitment function in PTSN must satisfy the following set of criteria:

(1) The value of commitment is from the range [-1, 1].

(2) The sum of the absolute values of all commitments must be equal to 1 for each node in the network: \sum_{y \in PTSN} |C(x \rightarrow y)| = 1.

(3) The commitment to oneself is C(x \rightarrow x) = 0.

(4) If there is no relationship from x to y, then C(x \rightarrow y) = 0.

(5) If a member x is not active with respect to anybody and other members y_1, ..., y_{m_x} are active with respect to x, then instead of satisfying the above criterion 4, the commitment value is distributed equally among all of x's acquaintances y_j, i.e., C(x \rightarrow y_j) = 1/m_x.

Some comments in online Forum are presented without a clear view, so based on this consideration, we believe that comments labeled with strong emotions tend to communicate more information and therefore should attract greater attention. As such, if A(x → y) denotes the reply relationship from x to y, we assume that comments with strong emotions should transmit a greater commitment than just a passing glance. There are three specific cases:

(1) if sentiment_{x,y} falls in the strong negative (0.8, 1] or strong positive [0, 0.2) range, the comment count receives the largest emotional weight when computing the weighted activity A'(x → y);

(2) if sentiment_{x,y} falls in the general negative (0.6, 0.8] or general positive [0.2, 0.4) range, a smaller emotional weight is applied;

(3) if sentiment_{x,y} belongs to the relatively neutral range [0.4, 0.6], A'(x → y) = A(x → y);

where A(x → y) is the total response number from x to y, and A'(x → y) is the comment count after emotional weighting.

The value of the commitment function can be evaluated as the normalized sum of all activities from x to y in relation to all activities of x:

C(x \rightarrow y) = \frac{A(x \rightarrow y)}{\sum_{j=1}^{m} A(x \rightarrow y_j)}    (7)

where m is the number of all nodes within the PTSN, and A(x → y) is the function that denotes the activity of node x directed to node y, such as the number of comments from x to y. Using the emotionally weighted A'(x → y) instead of A(x → y) is more conducive to finding the important nodes in the semantic network.
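Putting Eqs. (5)-(7) together, the iterative computation can be sketched as follows. The graph representation, the sign convention for negative commitments and the emotional multipliers STRONG_W and GENERAL_W are illustrative assumptions; the paper does not state the exact multiplier values.

```python
# Sketch of the SWNP iteration (Eqs. 5-7); data layout and multipliers are assumed.
STRONG_W, GENERAL_W = 2.0, 1.5     # hypothetical weights for emotional comments

def weighted_activity(n_comments, sentiment):
    """A'(x->y): comment count amplified by emotion; the sign is negative
    when the sentiment score indicates a negative tendency (> 0.5)."""
    if sentiment > 0.8 or sentiment < 0.2:       # strong negative / positive
        w = STRONG_W
    elif sentiment > 0.6 or sentiment < 0.4:     # general negative / positive
        w = GENERAL_W
    else:                                        # relatively neutral
        w = 1.0
    return (-1.0 if sentiment > 0.5 else 1.0) * w * n_comments

def commitments(activity):
    """Eq. (7): C(x->y) = A'(x->y) / sum_j |A'(x->y_j)|; the sign carries the sentiment."""
    c = {}
    for x, targets in activity.items():
        total = sum(abs(a) for a in targets.values())
        c[x] = {y: a / total for y, a in targets.items()} if total else {}
    return c

def swnp(nodes, commitment, eps=0.7, tau=1e-5, max_iter=1000):
    """Eq. (6): SWNP(x) = (1 - eps) + eps * sum_y SWNP(y) * C(y->x)."""
    pos = {x: 1.0 for x in nodes}                # initialise every node with 1
    for _ in range(max_iter):
        new = {}
        for x in nodes:
            inflow = sum(pos[y] * commitment.get(y, {}).get(x, 0.0) for y in nodes)
            new[x] = (1 - eps) + eps * inflow
        if max(abs(new[x] - pos[x]) for x in nodes) < tau:   # stop condition tau
            return new
        pos = new
    return pos

# Toy example: activity[x][y] is the signed, emotionally weighted reply count x -> y.
activity = {"u1": {"u2": weighted_activity(3, 0.9), "u3": weighted_activity(1, 0.5)},
            "u2": {"u3": weighted_activity(2, 0.1)},
            "u3": {}}
print(swnp(["u1", "u2", "u3"], commitments(activity)))
```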

5. Experiments

5.1 Data set

The dataset is from the Tianya forum (http://focus.tianya.cn), which is a popular bulletin-board service in China. It includes more than 300 boards, and the total number of registered user identifications (IDs) exceeds 32 million. Since its introduction in 1999, it has become the leading social-networking site in China due to its openness and freedom. We selected the Tianya By-talk board and collected data between January 2011 and December 2013 including 325288 users, 102756 posts and 4524756 replies. Among all the users, 12701 of them wrote at least 1 post in the period, 3724 wrote at least 2 posts and 573 at least 5 posts. Taking into consideration the users who wrote at least 1 post, the average number of posts for each user was 8.09. Most of the users’ behavior consisted of replying to posts or even just browsing; the average number of comments for all users equaled 13.91, which was still greater than 8.09.

The largest hot topic post has 6571731 clicks and 66274 comments, and 71929 posts have more than 5 comments. In 2011 through 2013, 176346 users wrote at least one comment, 110261 wrote more than one comment, and the most active user posted 10276 comments. Considering only posts that have at least 5 comments, the average number of comments per post was 62.91. In 2011, users wrote 10324 posts and 400571 comments (38.8 comments/post); in 2012, they wrote 31146 posts and 1326819 comments (42.6 comments/post); and in 2013, they wrote 61286 posts and 2797366 comments (45.6 comments/post).

5.2 Identification of the topics in specific periods

We used the LDA to identify the topics in specific months, setting α = 0.5, β = 0.1, the topic number Z = 50 and the number of Gibbs sampling iterations to 1000. Not all of each month's topics are related to the livelihood issues that this article focuses on, so those topics are omitted by the attribute filter described in Section 4.

After applying this attribute filter, there were a total of 978 topics, with an average of 27 topics per month. The minimum occurred in the 10th month with 9 topics, and the maximum was in the 6th month with 37 topics. To analyze the size of each topic, Figure 1 shows the statistics on the number of topic-related posts. Setting σ_2 = 0.05 retains more valid data for extracting persistent topics; that is, if a title contains a keyword related to a certain topic, it will be retained. Eighty-two percent of the retained topics ranged in size from 61 to 150 related posts.

Figure 1. The number of related posts for each topic

5.3 Identification of the persistent topic

The next analysis concerned the identification of the persistent topics, which must exist over a given period. The number of persistent topics is affected by σ_1. The keyword of a topic always has a frequency of approximately 0.05, so a similarity of about 0.1 means the topics share at least two keywords, and it can then be reasonably certain that they are in fact the same. Experiments have shown that an important turning point occurs at σ_1 = 0.09, corresponding to 18 relatively persistent topics. Manual validation confirmed that these persistent topics have high accuracy and quality.

There are 18 persistent topics with 4637 related posts. A total of 91281 users (28% of total users) were involved in the following analysis, which greatly reduces the data size for further analysis. There are 257 related posts per persistent topic on average, and according to the minimum period (three months), that is only about 86 posts per topic per month. This number is less than the size of the general topics retained in Section 5.2, which also reflects that persistent topics do not have a high post rate, click rate or response rate and instead have their own characteristic of long duration.

5.4 Analysis of duration time of the persistent topic

Thirteen persistent topics (72%) lasted for 3 months, which is the minimum duration necessary to consider a topic as persistent in our analysis. Four persistent topics lasted exactly 4 months, and the longest lasted 5 months. The distribution of persistent topics is relatively uniform; only in May 2013 (the 29th month) and June 2013 (the 30th month) were there four co-existing persistent topics. Data analysis found that this was during the time of graduation season and the university entrance exam. Additionally, youth films such as “So Young” and singing reality shows such as “X Factor” and “Chinese Idol” caused such topics to remain hot and evolve continuously around this time, although topic evolution is beyond our research.

At the same time, the obtained persistent topics have high diversity, for there is little overlap within the same period. Though two topics separated by an interval of time may be similar, they are apparently two different events. Issues concerning graduation, college entrance examinations and employment repeat themselves every year in different fashions, although this type of topic evolution analysis is not within the scope of this study. Therefore, this algorithm ensures diversity among the persistent topics.

5.5 Persistent topic social network (PTSN)

The goal of the next analysis is to count the posts and the users in each persistent topic. Table 1 shows the basic information of the 18 persistent topics; there are 257 posts and 5071 users per persistent topic on average. A social network PTSN = (V, E) can be built for each persistent topic, where V is a finite set of registered users who take part in the topic (i.e., the IDs), and E is a finite set of social relationships (i.e., posts and replies).

Table 1. The basic information of the persistent topic
No. Periods Posts Users
1 3 246 4835
2 3 202 5124
3 3 316 6147
4 4 340 5410
5 3 198 4105
6 3 279 4716
7 3 248 6124
8 5 336 4398
9 3 187 5627
10 3 268 4981
11 3 227 5671
12 4 249 5019
13 4 342 3957
14 3 179 6105
15 4 283 4281
16 3 305 6289
17 3 269 5042
18 3 163 3450


5.6 Node position iterative data processing

The experiments revealed that the number of iterations necessary to calculate the node positions for all users in each PTSN depends on the value of the parameter ε (Eq. (6)): the greater the value of ε, the greater the number of iterations (Figure 2). Each node in a PTSN was initialized with SWNP = 1, and the stop condition was τ = 0.00001. The iterative processing of SWNP uses six different values of ε (0.01, 0.1, 0.3, 0.5, 0.7 and 0.9) for comparative analysis. Because the given 18 PTSNs have similar sizes, their tendencies are similar.

Figure 2. The number of iterations in relation to ε


The experiments revealed that the SWNP does not increase the number of iterations or the processing time compared with the original node position. Because the sentiment analysis only gives every comment a one-off score to determine its emotional inclination (positive or negative), linearly enhancing the corresponding comments without changing the iteration processing simply adds a linear time complexity to the iterative process. For a clearer demonstration, the SWNP value generally refers to the absolute value except where particularly emphasized. Next, the distribution characteristics of the SWNP are analyzed to discover the important nodes.

5.7 Distribution characteristics of SWNP

Experiments analyze the distribution characteristics of SWNP in the 18 PTSNs, and Figure 3 gives the average SWNP and its standard deviation in the No.16 and No.18 PTSN with different ε. The average SWNP does not depend on ε, and it can be formally demonstrated that the average SWNP equals approximately 1 in all cases. On the other hand, the standard deviation differs substantially depending on ε: the greater the ε, the greater the standard deviation. Namely, the SWNP values increase disproportionately with bigger ε, which has been confirmed by the experimental data.

Figure 3. Average SWNP and its standard deviation in relation to ε


The distribution characteristics of SWNP are determined by the network topology structure; for example, the standard deviation variation tendency of No.18 is more noticeable than that of No.16. This result indicates a greater spread of SWNP in the No.18 PTSN, as there are a few nodes with extremely high values. It can also be noted that the SWNP of over 81% of users is less than 1. This means that only a few members exceed the average value, which equals 1. It also shows that the difference in members' SWNP increases for greater ε, and this is valid for all 18 PTSNs. The No.18 PTSN has the most obvious change in standard deviation: when ε = 0.9, fewer than 1% of users have SWNP > 1, and these users are clearly important. Figure 4 shows the percentage of users with SWNP ≥ 1 and SWNP ≥ 2 within the No.18 and No.16 PTSN in relation to ε.

Figure 4. The percentage of users with SWNP ≥ 1 and SWNP ≥ 2 within the No.18 and No.16 PTSN in relation to ε


It can be seen that the different PTSNs have the same SWNP distribution trend, with the proportion of SWNP ≥ 1 nodes decreasing and that of SWNP ≥ 2 nodes increasing. The average percentage of nodes with SWNP ≥ 2 is 4.7% over all 18 PTSNs (No.16 with 7.54% and No.18 with 0.57%). This conclusion can help us identify the important nodes in persistent topic social networks. The percentages of users with SWNP ≥ 1 and SWNP ≥ 2 are 3.12% and 0.49% in the No.18 PTSN when ε = 0.7, so it can be asserted that those 3.12% of users are active users and the 0.49% are key persons in this topic. In fact, the greater the ε, the more distinguishable the results, but the larger number of iterations directly influences the processing time. Generally, the parameter is determined by the network scale; however, the nodes with high SWNP values do not necessarily represent key persons, as the adjacent nodes may pass a lot of negative energy (if the commitment function is less than 0). Therefore, sentiment analysis is needed to actually identify the key persons.
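A small helper reproducing this kind of threshold statistic might look as follows; it reuses the swnp(), commitments() and activity objects from the sketch in Section 4, and the function name is an assumption made for illustration.

```python
# Percentage of users above SWNP thresholds, as plotted in Figure 4 (illustrative).
def share_above(swnp_values, threshold):
    """Fraction of nodes whose absolute SWNP reaches the given threshold."""
    vals = list(swnp_values)
    return sum(1 for v in vals if abs(v) >= threshold) / len(vals)

positions = swnp(["u1", "u2", "u3"], commitments(activity), eps=0.7)
print(share_above(positions.values(), 1.0))   # candidate active users
print(share_above(positions.values(), 2.0))   # candidate key persons
```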

5.8 The Top N key persons in PTSN

Extracting the Top N key persons in a PTSN is achieved through a node ranking process based on the importance degree. The algorithm sorts the nodes according to SWNP, and then modifies the list using the emotional attributes. The comparison algorithms mainly used are IDC (Indegree Prestige Centrality), ODC (Outdegree Prestige Centrality) and PR (PageRank). IDC is based on the indegree, so it takes into account the number of members that are adjacent to a particular member of the community, as follows: IDC(x) = i(x)/(m - 1), where m is the number of nodes in the network, and i(x) is the number of members from the first level neighborhood that are adjacent to x. In other words, more prominent people receive more nominations from members of the community. ODC takes into account the outdegree of member x, i.e., the edges that originate from the given node, as follows: ODC(x) = o(x)/(m - 1), where o(x) is the number of first level neighbors of x. Users who have low outdegree centrality are not very open to the external world and do not communicate with many members. ODC and IDC are the simplest and most intuitive measures that can be used in network analysis. Google uses PR to rank the pages in its search engine, measuring the importance of a particular page to the others. Table 2 gives the top 10 important nodes obtained using the different methods in the No.18 PTSN with 3450 nodes.

Table 2. Top 10 users in No.18 PTSN
Pos. | SWNP ε=0.01 | SWNP ε=0.1 | SWNP ε=0.3 | SWNP ε=0.5 | SWNP ε=0.7 | SWNP ε=0.9 | IDC | ODC | PR
1 | 122756 (1.834) | 22614 (5.634) | 307146 (12.458) | 307146 (15.762) | 8961 (20.546) | 8961 (25.874) | 14864 (0.214) | 7996 (0.130) | 22614 (0.0133)
2 | 235523 (1.627) | 307146 (5.301) | 8961 (12.041) | 8961 (15.240) | 307146 (20.121) | 307146 (21.371) | 248153 (0.197) | 200416 (0.124) | 70064 (0.0105)
3 | 57681 (1.526) | 8961 (4.982) | 20547 (11.878) | 20547 (15.046) | 20547 (16.824) | 196349 (18.627) | 84134 (0.182) | 14267 (0.120) | 89712 (0.0092)
4 | 22614 (1.475) | 20547 (4.870) | 22614 (11.526) | 22614 (14.872) | 22614 (16.345) | 276482 (18.064) | 33224 (0.176) | 14864 (0.106) | 6401 (0.0088)
5 | 307146 (1.404) | 57681 (4.633) | 276482 (10.954) | 57681 (14.534) | 57681 (15.015) | 20547 (17.349) | 313375 (0.172) | 9246 (0.095) | 85216 (0.0080)
6 | 8961 (1.377) | 276482 (4.315) | 235523 (10.467) | 122756 (13.801) | 122756 (15.246) | 235523 (16.202) | 51229 (0.154) | 81820 (0.087) | 578 (0.0078)
7 | 276482 (1.306) | 122756 (4.157) | 122756 (10.348) | 235523 (13.008) | 235523 (15.205) | 70064 (15.977) | 52166 (0.143) | 241357 (0.084) | 3601 (0.0076)
8 | 20547 (1.288) | 235523 (3.946) | 57681 (8.002) | 276482 (11.328) | 196349 (14.548) | 57681 (15.279) | 7996 (0.132) | 120608 (0.079) | 14027 (0.0073)
9 | 196349 (1.270) | 196349 (3.415) | 70064 (7.856) | 70064 (11.340) | 314627 (13.851) | 122756 (14.675) | 921712 (0.121) | 122412 (0.070) | 39240 (0.0070)
10 | 70064 (1.256) | 70064 (3.097) | 196349 (6.912) | 196349 (9.282) | 70064 (11.067) | 314627 (12.544) | 810204 (0.117) | 15246 (0.059) | 317540 (0.0067)

Each cell shows the user ID with its score in parentheses.


The important node ranking is relatively stable across different values of ε. As the simplest and most intuitive measures that can be used in network analysis, the ODC and IDC have low accuracy. The node sort result of PR is a good one, but there are two main shortcomings: (1) without the commitment function, all links in PR have the same weight and importance; the PR is distributed by outdegree and gives no consideration to the strength of the interaction. (2) There is no sentiment analysis to identify effective opinion leaders. After ranking, we analyzed the ratio of negative emotions for a selected node (e.g., ID 22614). Since 73% of its commitment functions are less than 0, the node is an active user but not a positive advocate; recognizing this helps to control the spread of false information as well as public opinion analysis and other follow-up work.

The SWNP identifies key persons within a specific topic, so it cannot be evaluated by typical methods, such as Google's search engine or user ranking lists computed from click rates. To further confirm the stability of the algorithm, the top 10 users in the different PTSNs were used to analyze their community duties and real occupational information. Checking and calculating through manual verification, a high level of accuracy is maintained.
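For reference, the baseline measures compared above can be computed directly from the reply graph; the sketch below uses the networkx library as an assumed implementation and a toy graph rather than the Tianya data.

```python
# Baseline centralities used for comparison (IDC, ODC, PageRank); illustrative only.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u1")])
m = g.number_of_nodes()

idc = {x: g.in_degree(x) / (m - 1) for x in g}    # IDC(x) = i(x) / (m - 1)
odc = {x: g.out_degree(x) / (m - 1) for x in g}   # ODC(x) = o(x) / (m - 1)
pr = nx.pagerank(g)                               # PR as used by Google

# Top N ranking by each measure, analogous to Table 2.
top = lambda scores, n=3: sorted(scores, key=scores.get, reverse=True)[:n]
print(top(idc), top(odc), top(pr))
```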

6. Conclusions

Two main independent approaches are provided in this paper for identifying key persons in online Forum: (i) discovery of the persistent topics and (ii) extraction of the key persons using SWNP. Identifying persistent topics mainly combines the LDA model and the similarity model on the timeline. SWNP is a new method of node position analysis, which takes into account both the node positions of the neighbors and the strength and emotional tendency of the connections between network nodes. The data are from the Tianya forum, as indicated in Section 5. The experiments show that the number of persistent topics is far less than that of urgent topics, and most of them exist for approximately 3 months with a uniform distribution on the timeline. In the established PTSN, the high influence persons are extracted through the iterative calculation and have been analyzed by contrast experiments and manual verification. The weighted sentiment in SWNP mainly reflects that the emotional intensity can be converted into the number of comments, which changes the value of the commitment function and the iterative results. In addition, negative emotions can be used to alter the notion of the key persons to a certain extent, for example in discovering different factions, online water armies and false advertisement publishers.

Funding

The work was supported by the Self-funded project of Harbin Science and Technology Bureau (2022ZCZJCG023, 2023ZCZJCG006), the Harbin Social Science Federation special topic (2023HSKZ10), the Research Project of the Education Department of Jilin Province (No. JJKH20210674KJ and No. JJKH20220445KJ) and the 2022 Science and Technology Department of Jilin Province (20230101243JC).

References

[1] Kulunk A., Kalkan S.C., Bakirci A., et al. Session-based recommender system for social networks' Forum platform. 2020 28th Signal Processing and Communications Applications Conference (SIU), pp. 1-4, 2020.

[2] Lu D., Lixin D. Sentiment analysis in Chinese BBS. Intelligence Computation and Evolutionary Computation, pp. 869-873, 2013.

[3] Wasserman S., Faust K. Social network analysis: Methods and applications. Cambridge University Press, New York, 1994.

[4] Liu H., Li B.W. Hot topic detection research of internet public opinion based on affinity propagation clustering. In: He X., Hua E., Lin Y., Liu X. (eds) Computer Informatics Cybernetics and Applications, Vol. 107, pp. 261-269, 2012.

[5] Zhang F., Si G.Y., Pi L. Study on rumor spreading model based on evolution game. Journal of System Simulation, pp. 1772-1775, 2011.

[6] Gregory A.L., Piff P.K. Finding uncommon ground: Extremist online forum engagement predicts integrative complexity. PLOS ONE, 16(1):e0245651, 2021.

[7] Liu J., Cao Z., Cui K., Xie F. Identifying important users in sina microblog. Multimedia 2012 Fourth International Conference on IEEE, Nanjing, China, pp. 839-842, 2012.

[8] Wang W. The study of topic diffusion state presentation and trend prediction within BBS. Procedia Engineering, 29:2995-3001, 2012.

[9] Bu Z., Xia Z., Wang J. A sock puppet detection algorithm on virtual spaces. Knowledge-Based Systems, 37:366-377, 2013.

[10] Carrington P., Scott J., Wasserman S. Models and methods in social network analysis. Cambridge University Press, Cambridge, 2005.

[11] Quan X.J., Liu G., Lu Z., Ni X.L., Liu W.Y. Short text similarity based on probabilistic topics. Knowledge and Information Systems, 25:473-491, 2010.

[12] Kazienko P., Musiał K., Zgrzywa A. Evaluation of node position based on email communication. Control and Cybernetics, 38(1):67-86, 2009.

Document information

Published on 21/12/23
Accepted on 16/12/23
Submitted on 11/09/23

Volume 39, Issue 4, 2023
DOI: 10.23967/j.rimni.2023.12.001
Licence: CC BY-NC-SA license
