
Sentiment leaning of influential communities in social networks

Abstract

Social media and social networks help shape the debate on societal and policy issues, but the dynamics of this process are not well understood. As a case study, we monitor Twitter activity on a wide range of environmental issues. First, we identify influential users and communities by means of a network analysis of the retweets. Second, we carry out a content-based classification of the communities according to the main interests and profile of their most influential users. Third, we perform sentiment analysis of the tweets to identify the leaning of each community towards a set of common topics, including some controversial issues. This novel combination of network, content-based, and sentiment analysis allows for a better characterization of groups and their leanings in complex social networks.

Introduction

Environmental and sustainability issues are among the major societal concerns today. The formulation of environmental policies is often a result of the interaction between antagonistic interest groups, including policy makers (governments and international organizations), advocacy groups representing the interest of specific industry sectors, and civic activists. The motivation for this research is to contribute to a better understanding of the dynamics of advocacy and activism around policy issues. We expect that the results will help policymakers in monitoring the response of various interest groups to the proposed regulations and policy targets.

The explosive growth of social media and user-generated contents on the Web provides a potentially relevant and rich source of data. This work is based on data from Twitter [1], a social networking and micro blogging service with over 270 million monthly active users, generating over 500 million tweets per day.

We collect a broad range of tweets related to the environmental issues and address the following research questions:

  • Can one identify influential communities and environmental topics of interest?

  • Are there differences in their leanings towards various environmental topics?

Our results indicate that there are observable differences in sentiment leanings towards various environmental issues between the major communities.

There are several aspects of Twitter data analysis that are relevant for this research. On the one hand, Twitter is a social network, and several types of networks can be constructed from the data, e.g., follower, mention, or retweet networks. Network analysis algorithms then yield interesting network properties, such as communities, modularity, and various centralities. On the other hand, Twitter data can also be analyzed for its contents, by applying text mining and sentiment analysis algorithms. A novelty of our research is that we combine both types of analysis. We detect influential communities, identify discussion topics, and assign sentiment of the communities towards selected topics.

Users on Twitter interact in three different ways: 1) a user follows posts of other users, 2) a user can respond to other users’ tweets by mentioning them, and 3) a user can forward interesting tweets by retweeting them. Based on these three interaction types, Cha et al. [2] define three measures of influence of the user on Twitter: indegree influence (the number of followers, indicating the size of his audience), mention influence (the number of mentions of the user, indicating his ability to engage others in conversation), and retweet influence (the number of retweets, indicating the ability of the user to write content of interest to be forwarded to others). They find that mention and retweet influence are correlated, but that indegree alone reveals little about the user’s actual influence. This is also known as the million follower fallacy [3]. They show that having an active audience that mentions or retweets the user matters more for influence than the number of followers. Suh et al. [4] analyze factors which have a positive impact on the number of retweets: URLs, hashtags, the number of followers and followees, and the age of the account, but not the number of past tweets. Bakshy et al. [5] quantify the influence on Twitter by tracking the diffusion of URLs through retweet cascades. They find that the longest retweet cascades tend to be generated by users who were the most influential in the past.

Closely related to our research is the work by Conover et al. [6], albeit applied to the problem of political polarization. They construct both retweet and mention networks from political tweets and apply community detection. It turns out that the retweet network exhibits clear community segregation (to the left- and right-leaning users), while the mention network is dominated by a single community. In [7], they compare the predictive accuracy of the community-based model to two content-based (full text tweets and hashtags-only) models. The community-based model constructed from the retweet network clearly outperforms the content-based models (with the accuracy of 95 vs. 91 %).

The above research indicates that the retweet influence seems to be the most promising measure of influence on Twitter, and that community detection in the retweet network will likely yield the most influential communities. However, in the environmental domain, the community segregation is not as clear as in the political domain. We therefore characterize communities not only by their influential members, but also by their prevalent discussion topics and sentiment.

Sentiment analysis has been applied to Twitter in several domains [8], most notably for stock market predictions [9], and in political elections. There has been some controversy over whether Twitter analysis can be used to predict the outcome of elections; Gayo-Avello gives a survey of various studies [10]. We have successfully applied Twitter sentiment analysis to monitor the Slovenian presidential election in 2012 and the Bulgarian parliamentary elections in 2013 [11]. Most of the other approaches are based on tweet volume or simple sentiment analysis by counting positive and negative sentiment words in tweets. In contrast, we apply supervised machine learning, in particular SVM classification [12]. The training data comes either from manually annotated tweets (which are problem-specific and of high quality, but expensive in terms of resources needed), or from generic, smiley-based tweets [13] (which are of lower quality, but very extensive).

This paper is based on our preliminary work, presented in a workshop proceeding [14], and, in several aspects, extends the proposed methodology. First, the experiments capture 1 year of Twitter data and hence analyze twice the original amount of data. Second, the structural properties of most prominent communities discussing environmental topics are examined. Third, content filtering is enhanced by similarity calculation in a multi-dimensional vector space. Finally, a custom sentiment model, trained on manually labeled domain-specific tweets, is applied to produce better sentiment classification results.

The paper is organized as follows. In the “Methodology: discovering influential communities and their sentiment” section, we present the network and content analysis employed in our work. We describe the Twitter data acquisition and construction of the retweet network. We use a standard community detection algorithm and define the Twitter user and community influence measures. A standard text mining approach is used to identify topics discussed by the major communities. For sentiment analysis, we construct a binary SVM classifier with neutral zone, from three different sets of training data. The “Results and discussion” section describes the outcomes of the experiments. First, we analyze the structural properties of the most influential communities, in terms of their internal and external influence, and balance of the influence distribution. We identify categories of influential communities (e.g., environmental activists, news media, skeptics, celebrities) and the topics of their interests. Sentiment classification is applied to the tweets of different communities, and sentiment leaning of the communities towards different topics is analyzed. We highlight interesting findings and some unexpected results. We conclude with plans for future work.

Methodology: discovering influential communities and their sentiment

We have monitored Twitter for the entire year 2014. We use the Twitter Search API and define a wide range of queries to select tweets related to environmental and energy topics (see Table 6 in the Appendix for the full list of queries). The collected environmental tweets are then used to construct a social network and identify influential users and communities, as well as their topics of interest and sentiment. The process of identifying community interests and their leanings consists of three steps. First, the network of users retweeting each other is constructed, and the densely connected communities are detected. Second, the content published by these communities is analyzed to reveal the communities’ interests. Finally, sentiment analysis is performed to assess the sentiment leaning of the communities with respect to different topics of interest.

Network structure and influence measures

We explore which Twitter users share similar content on environmental topics. To model this phenomenon, we construct a retweet network, connecting users who are in a retweet relation, i.e., an undirected edge between two users indicates either one user retweeted the other or vice versa. The network is constructed from 30.5 million tweets about environmental topics, acquired between January 1, 2014 and December 31, 2014. The network consists of 3.7 million users (nodes) linked by 9.7 million retweet relations.

The largest part of the network consists of one large connected component of 3.4 million users; the remaining components each have fewer than 1000 users. In the largest component, we want to find groups of users that share similar views on environmental topics. If we assume that retweeting is a proxy for expressing agreement with the published content, the retweet network can be regarded as consisting of the connections between users who agree on a certain topic. Therefore, the problem translates into partitioning the network into so-called communities. In the field of complex networks, the notion of “community” corresponds, loosely speaking, to a subset of nodes that are more densely connected among themselves than with the nodes outside the subset. Several definitions of community and methods to detect communities have been proposed in the literature (see [15] for a review).

We apply a standard community detection algorithm, the Louvain method [16], to our retweet network. The method partitions the network nodes in a way that maximizes the network’s modularity. Modularity is a measure of community density in networks. It measures the fraction of edges falling within groups of a given network partitioning as compared to the expected fraction of edges in these groups, given a random distribution of links in the network [17]. Among the available detection algorithms in the optimization-based class, the Louvain method is one of the few methods that are suitable: (i) to analyze large networks with good scalability properties and (ii) to avoid ex-ante assumptions on their size [18].
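To make the modularity objective concrete, the following minimal sketch evaluates the modularity of a given partition of a small undirected, unweighted graph. It does not implement the Louvain method itself, which greedily searches for a partition that maximizes this quantity; the function name `modularity` and the toy graph are illustrative assumptions, not part of the original study.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = 0                            # total number of edges
    internal = defaultdict(int)      # edges with both endpoints in community c
    degree_sum = defaultdict(int)    # sum of node degrees in community c
    for u, v in edges:
        m += 1
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Observed fraction of intra-community edges minus the fraction
    # expected if links were placed at random with the same degrees.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(edges, split)    # one community per triangle
q_merged = modularity(edges, merged)  # everything in a single community
```

In this toy case, splitting the graph into its two triangles scores higher than lumping all nodes together, which is the kind of partition a modularity-maximizing method is expected to prefer.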

Further, we propose an approach to identify the most influential users in the network, i.e., users whose content is apparently approved and shared the most. Let the retweet network be represented as a directed graph \(G\) with edges \(E(G)\). A directed edge \(e_{u,v}\) from the user \(u\) to the user \(v\) indicates that contents of the user \(u\) have been retweeted by the user \(v\). Let \(w(e_{u,v})\) be the weight of the edge \(e_{u,v}\), indicating the number of times that the user \(v\) retweeted the contents of the user \(u\). Then the user influence \(I(u)\) is defined as

$$ I(u) = \sum_{ e_{u,v} \in E(G) }{w(e_{u,v})} $$
((1))

The differences in the structure of the detected communities \(C_1,\ldots,C_n\) are examined through the influence of the users of a particular community \(C_k\). We address this by measuring the intra- and inter-community influence of each community, as well as by measuring the distribution of influence among the community’s users.

Community influence is defined as the cumulative influence of all its users,

$$ I(C) = \sum_{u \in C}{I(u)}=\sum_{u \in C}{\left(\sum_{e_{u,v} \in E(G)}{w(e_{u,v})}\right)} $$
((2))

It can be divided into the influence that the community users have within their own community and the influence they exert outside their community. Hence, we define intra-community influence \(I_{in}\) and inter-community influence \(I_{out}\) as:

$$ I_{in}(C) = \sum_{u \in C}{I_{in}(u)} = \sum_{u \in C} {\left(\sum_{\substack{ e_{u,v} \in E(G)\\ v \in C}}{w(e_{u,v})}\right)} $$
((3))
$$ I_{out}(C) = \sum_{u \in C}{I_{out}(u)} = \sum_{u \in C} {\left(\sum_{\substack{ e_{u,v} \in E(G)\\ v \notin C}}{w(e_{u,v})}\right)} $$
((4))

The ratio between these two measures, \(I_{out}/I_{in}\), reveals the extent to which a community is influential outside its “borders” versus its internal content exchange.
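The influence measures in Eqs. (1), (3), and (4) can be sketched in a few lines of code. The snippet below is only an illustration under stated assumptions: retweets are given as hypothetical `(u, v, w)` triples meaning that user `v` retweeted user `u` a total of `w` times, and the community assignment is supplied as a plain dictionary.

```python
from collections import defaultdict

def user_influence(weighted_edges):
    """I(u) as in Eq. (1): total number of times u's content was retweeted."""
    influence = defaultdict(int)
    for u, _v, w in weighted_edges:
        influence[u] += w
    return dict(influence)

def community_influence(weighted_edges, community_of):
    """Return {community: (I_in, I_out)} as in Eqs. (3) and (4)."""
    i_in = {c: 0 for c in set(community_of.values())}
    i_out = dict(i_in)
    for u, v, w in weighted_edges:
        c = community_of[u]            # influence is credited to the author u
        if community_of[v] == c:
            i_in[c] += w               # retweeter belongs to the same community
        else:
            i_out[c] += w              # retweeter belongs to another community
    return {c: (i_in[c], i_out[c]) for c in i_in}

# Toy data (hypothetical users and counts).
edges = [("u1", "u2", 5), ("u1", "u3", 2), ("u2", "u1", 1), ("u3", "u1", 4)]
community_of = {"u1": "A", "u2": "A", "u3": "B"}

per_user = user_influence(edges)
per_community = community_influence(edges, community_of)
ratio_a = per_community["A"][1] / per_community["A"][0]   # I_out / I_in for A
```

Here community A receives 6 retweets from its own members and 2 from outside, so its ratio of external to internal influence is 1/3.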

Furthermore, to measure the distribution of user influence within a community, we use the Herfindahl-Hirschman index (HHI), commonly used in economics to measure the amount of competition among leading companies in an industry with respect to their market share [19]. When applied in the context of community structure, we look at the \(N\) leading users \(u_i\), \(i \in \{1,\ldots,N\}\), in a community \(C\) in terms of their normalized intra-community influence \(r_{i} = I_{\textit {in}}(u_{i})/\sum _{j=1}^{N}I_{\textit {in}}(u_{j})\). Hence, the Herfindahl-Hirschman index is defined as

$$ HHI(C) = \sum_{i = 1}^{N}{{r_{i}^{2}}} = \sum_{i = 1}^{N}{\left(\frac{I_{in}(u_{i})}{\sum_{j=1}^{N}I_{in}(u_{j})}\right)^{2}} $$
((5))

The sum of squared influence ratios ranges from \(1/N\) to 1, where lower values indicate a dispersed and more balanced influence distribution, whereas higher values reflect community influence concentrated in only a few strongly influential users.
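As a small worked illustration of Eq. (5), the sketch below computes the index for two hypothetical lists of intra-community influence values. The function name `hhi` and the numbers are assumptions chosen only to show the two extremes of the range.

```python
def hhi(intra_influences):
    """Herfindahl-Hirschman index of a list of I_in values, as in Eq. (5)."""
    total = sum(intra_influences)
    if total <= 0:
        raise ValueError("at least one user must have positive influence")
    return sum((x / total) ** 2 for x in intra_influences)

# Four equally influential users: the index reaches its minimum 1/N.
balanced = hhi([10, 10, 10, 10])

# One dominant user: the index approaches its maximum of 1.
concentrated = hhi([97, 1, 1, 1])
```

With four equal users the index equals 1/4, while a single user holding 97 % of the influence pushes it above 0.94.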

Content identification and filtering

The retweet relation can be considered as an agreement between users on the published content. Hence, the retweet network reveals which users support similar interests, without looking into the actual content. On the other hand, to identify the content and to see what different groups of users are talking about, we adopt a standard text mining approach as follows.

  1. For each group of users \(g_i\), \(i \in \{1,\ldots,N\}\), create a document \(d_i\) that aggregates all the content which the users of the group \(g_i\) have published.

  2. The vocabulary (i.e., the set of terms) used by groups \(\{g_1,\ldots,g_N\}\) is obtained from the documents \(\{d_1,\ldots,d_N\}\). Term frequency \(TF_i(t)\) denotes the number of appearances of a term \(t\) in a document \(d_i\).

  3. For each term \(t\) from the vocabulary, document frequency \(DF(t)\) is the number of documents in which \(t\) appears.

  4. For each of the documents \(\{d_1,\ldots,d_N\}\), construct a bag of words (BoW) vector where each term value in the vector is the TFiDF value of the term \(t\) from the vocabulary:

    $$ TFiDF_{i}(t) = TF_{i}(t)\cdot \log{\frac{N}{DF(t)}} $$
    ((6))

    Term frequency-inverse document frequency (TFiDF) is a standard and widely used measure of importance of a term \(t\) to a document in a collection of documents [20].
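The four steps above can be sketched as follows. This is a minimal illustration, assuming that each group's content has already been tokenized into a list of terms and that the logarithm in Eq. (6) is the natural logarithm (the base is not stated in the text); the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists, one aggregated document per group."""
    n = len(documents)
    tf = [Counter(doc) for doc in documents]                  # TF_i(t)
    df = Counter(term for counts in tf for term in counts)    # DF(t)
    # Eq. (6); natural logarithm is an assumption of this sketch.
    return [{t: c * math.log(n / df[t]) for t, c in counts.items()}
            for counts in tf]

def top_terms(vector, k):
    """The k highest-ranked terms of a TFiDF vector (ties broken by name)."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _score in ranked[:k]]

# Toy documents (hypothetical), one per group of users.
docs = [
    ["climate", "solar", "solar", "energy"],
    ["climate", "fracking", "energy", "fracking"],
    ["climate", "hoax", "hoax", "hoax"],
]
vectors = tfidf_vectors(docs)
characteristic = [top_terms(v, 1) for v in vectors]
```

A term used by every group, such as "climate" here, gets a TFiDF value of zero, so only terms that distinguish a group from the others rise to the top of its ranking.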

We use this adopted text mining approach to identify the terms that are the most distinctive and therefore the most characteristic for the content tweeted by different groups of users. More specifically, we use the detected retweet communities as the groups of users. Next, we employ the above procedure to summarize and represent the most characteristic topics in the content of each community. Such content identification and representation is done by displaying only the selected number of the highest TFiDF ranked terms from a BoW vector constructed for a selected community. In this way, we are able to get a readable and reliable overview of the specific interests and topics discussed in the observed communities.

On the other hand, for the purpose of identifying the leaning of different communities towards specific topics of interest, we have to retrieve the individual tweets forming a certain topic. We employ a filtering procedure based on document similarity, to obtain tweets that revolve around a specified topic (query). In this case, each tweet from the dataset is treated as an individual document and is transformed into a BoW vector. Hence, the filtering works as follows.

  1. The vocabulary \(V\) of a specific domain is obtained from all unique tweets acquired for the targeted domain. From \(V\), the base of the document vector space is constructed by standard text preprocessing (stemming, stop-word removal, n-grams), resulting in terms \(t_1,\ldots,t_n\).

  2. For each tweet \(tw_i\), \(i \in \{1,\ldots,m\}\), from the dataset \(D\), a BoW vector \(v_i\) of term frequencies \(TF_i(t)\) for each term \(t\) in \(tw_i\) is constructed and normalized.

  3. A BoW model of the examined domain can be represented by a matrix \(M\) with rows \(v_i\) for each \(tw_i \in D\).

  4. The dataset \(D\) is filtered according to a query that is transformed into a normalized BoW vector \(q\).

  5. Similarity between the query \(q\) and tweets \(tw_i \in D\) is calculated as \(s = M \cdot q\):

    $$ \left[ \begin{array}{c} s_{1} \\ \vdots \\ s_{m} \end{array} \right] = \left[ \begin{array}{ccc} m_{1,1} & \cdots & m_{1,n} \\ \vdots & & \vdots \\ m_{m,1} & \cdots & m_{m,n} \\ \end{array}\right] \cdot \left[ \begin{array}{c} q_{1} \\ \vdots \\ q_{n} \end{array} \right] $$
    ((7))

    where \(s_i\), \(i \in \{1,\ldots,m\}\), is the cosine similarity¹ between the query vector \(q\) and \(v_i\) representing tweet \(tw_i\), and \(m_{i,j}\) is the (normalized) term frequency of term \(t_j\) in tweet \(tw_i\).

Given a query \(q\) and the calculated similarity vector \(s\), the filter returns tweets \(tw_i\) for the indices \(i\) where \(s_i\) is greater than a given threshold. Note that, since the number of terms (\(n\)) and especially the number of tweets (\(m\)) can be very large, in practice the computations are performed with sparse representations of vectors and matrices.
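The filtering procedure can be illustrated with a short sketch that uses dictionaries as sparse vectors. It assumes that tweets and the query are already preprocessed into token lists and that "normalized" means scaling to unit Euclidean length, in which case the dot product is the cosine similarity; the function names, toy tweets, and threshold values are hypothetical.

```python
import math
from collections import Counter

def normalized_bow(tokens):
    """Sparse term-frequency vector scaled to unit Euclidean length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def filter_by_query(tweets, query_tokens, threshold):
    """Indices i of tweets whose similarity s_i to the query exceeds threshold."""
    q = normalized_bow(query_tokens)
    selected = []
    for i, tokens in enumerate(tweets):
        v = normalized_bow(tokens)
        # Sparse dot product: only terms present in the tweet contribute.
        s = sum(weight * q.get(term, 0.0) for term, weight in v.items())
        if s > threshold:
            selected.append(i)
    return selected

# Toy tweets (hypothetical), already tokenized.
tweets = [
    ["fracking", "ban", "now"],
    ["solar", "power", "grows"],
    ["fracking", "fracking", "water"],
]
loose = filter_by_query(tweets, ["fracking"], threshold=0.5)
strict = filter_by_query(tweets, ["fracking"], threshold=0.6)
```

Raising the threshold keeps only the tweets that are dominated by the query terms, which is how the filter trades recall for precision.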

Sentiment analysis

Our goal is to measure the collective attitude of a Twitter community towards a certain topic. The first step is to measure the sentiment of each individual tweet posted by the community. To perform Twitter sentiment analysis, we construct a sentiment classifier from the training data. We employ the Support Vector Machine (SVM) algorithm [12], and in particular its SVMperf [21–23] implementation. The SVM algorithm requires a labeled collection of instances to build a model. We have collected three labeled Twitter datasets which differ in terms of size, discussion topics, and labeling method. We train three corresponding sentiment models and compare their performance on the same test set. The best sentiment classification model is then used in the rest of our analyses.

The first dataset consists of 1.6 million positively and negatively labeled tweets collected by Stanford University [13]². The labeling of the tweets is based on the presence of positive (e.g., “:)”) or negative (e.g., “:(”) emoticons, which were then removed from the dataset for training. Although such an approach does not provide the highest labeling quality, it is a reasonable and inexpensive substitute for manual tweet labeling [24]. The tweets in this dataset are general and not focused on any specific domain.

The second dataset consists of general English tweets too, but the tweet labels were obtained by manual annotation. In this dataset, there are 25,721 positive, 23,250 negative, and 37,951 neutral hand-labeled tweets.

The tweets in the third dataset are a uniformly sampled subset of our environmental tweets and are therefore highly domain-specific. This dataset consists of 2,850 positive, 5,569 negative, and 11,439 neutral hand-labeled tweets, from January to December 2014. We randomly choose 20 % of these tweets (preserving the labeling distribution of the whole dataset) as a test set, used for evaluating the trained sentiment models. The remaining 80 % of the tweets from the domain-specific dataset are used for training the domain-specific sentiment model.

Sentiment models are built only from the positive and negative tweets. However, the classification covers three categories: positive, negative, and neutral as well. A tweet is classified as positive (negative) if its distance from the SVM hyperplane is higher than the average distance of positive (negative, respectively) training examples from the hyperplane. Otherwise, i.e., if it is too close to the hyperplane, it is classified as neutral. Similar approaches to adapting the binary SVM classifier to the three-class setting were already applied in our previous studies [24, 25].
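The neutral-zone rule can be sketched independently of any particular SVM library. The snippet below assumes that a trained binary classifier already provides a signed decision value for each tweet (positive on the positive side of the hyperplane), and that the two thresholds are the mean decision values of the positive and negative training examples; the function names and the numbers are hypothetical.

```python
def neutral_zone_thresholds(train_scores, train_labels):
    """Average signed distance of positive and of negative training examples."""
    pos = [s for s, y in zip(train_scores, train_labels) if y == 1]
    neg = [s for s, y in zip(train_scores, train_labels) if y == -1]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def classify(score, pos_threshold, neg_threshold):
    """Map a signed decision value to positive (1), negative (-1), or neutral (0)."""
    if score > pos_threshold:
        return 1
    if score < neg_threshold:
        return -1
    return 0            # too close to the hyperplane: neutral zone

# Toy decision values (hypothetical) for six training tweets.
train_scores = [2.0, 1.0, 0.0, -0.5, -1.5, -1.0]
train_labels = [1, 1, 1, -1, -1, -1]
pos_t, neg_t = neutral_zone_thresholds(train_scores, train_labels)

labels = [classify(s, pos_t, neg_t) for s in (1.5, 0.3, -0.9, -1.2)]
```

Only tweets that lie farther from the hyperplane than the average training example of the respective class receive a polar label; everything in between is treated as neutral.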

Twitter messages are preprocessed using both standard and Twitter-specific techniques. Standard preprocessing [26] includes tokenization, stemming, unigram and bigram construction, removing terms which do not appear at least twice in the corpus, and construction of term frequency (TF) feature vectors.³ Additionally, Twitter-specific preprocessing [8, 13, 24] transforms usernames and hashtags and collapses repeated letters.

We build three sentiment models (smiley-labeled general, hand-labeled general, and hand-labeled domain-specific) using the corresponding preprocessed positive and negative tweets, and test their performance on the separate test set described above. In Table 1, we report the results in terms of macro-averaged error rate [27] and in terms of macro-averaged F-score of positive and negative classes [28]. We are particularly interested in the correct classification of the positive and negative tweets.
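For readers unfamiliar with the second measure, the sketch below computes the F-score separately for the positive and negative classes and averages the two, ignoring the neutral class in the average. It is a minimal illustration with hypothetical labels and function names, and it does not reproduce the macro-averaged error rate of [27].

```python
def f1_for_class(y_true, y_pred, cls):
    """F-score (harmonic mean of precision and recall) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1_pos_neg(y_true, y_pred):
    """Average of the F-scores of the positive (1) and negative (-1) classes."""
    return (f1_for_class(y_true, y_pred, 1)
            + f1_for_class(y_true, y_pred, -1)) / 2

# Toy labels (hypothetical): 1 = positive, -1 = negative, 0 = neutral.
y_true = [1, 1, -1, -1, 0, 0]
y_pred = [1, 0, -1, 1, 0, -1]
score = macro_f1_pos_neg(y_true, y_pred)
```

Neutral tweets still affect the result, since a neutral tweet predicted as positive or negative counts as a false positive for that class.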

Table 1 The evaluation results of smiley-labeled general, hand-labeled general, and hand-labeled domain-specific sentiment models on the test dataset in terms of the macro-averaged error rate and the macro-averaged F-score of positive and negative classes

As can be seen from Table 1, the best performing sentiment model is the hand-labeled domain-specific one as it achieved the lowest error rate and the highest macro-averaged F-score on the test set. Note that this model is trained on only 6,735 tweets, while the other two models employed substantially more tweets (1.6 million for the smiley-labeled general model and 48,971 for the hand-labeled general model). Therefore, the results indicate that the high-quality domain-specific tweets produce better sentiment models even if the number of such tweets is lower. For the rest of our study, we use the hand-labeled domain-specific sentiment model trained using the complete hand-labeled domain-specific dataset.

The sentiment of different communities regarding a specific topic is calculated as follows. First, for each community, the tweets posted by its users are selected. Second, the sentiment of each tweet is determined and weighted by its retweet count. Third, the weighted negative and positive sentiment of tweets is aggregated for each user and summed over all users in the community. Finally, the leaning of a community towards a specific topic is computed as the polarity of the aggregated weighted sentiment multiplied by the ratio of sentiment carrying tweets (subjectivity) of the respective community. The polarity and subjectivity measures are adapted from [29]. The pseudo-code for community sentiment computation is presented in Algorithm 1.
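A minimal sketch of this computation for a single community and topic is given below. The exact definitions of polarity and subjectivity are taken from [29] and are not spelled out in the text, so the sketch assumes polarity = (P − N)/(P + N) over the retweet-weighted positive and negative sentiment, and subjectivity as the share of non-neutral tweets; the weighting by the raw retweet count, the function name, and the toy data are likewise assumptions.

```python
def community_leaning(tweets):
    """tweets: iterable of (sentiment, retweet_count), sentiment in {-1, 0, 1}.

    Summing over tweets is equivalent to aggregating per user first and then
    summing over all users of the community.
    """
    pos = neg = 0             # retweet-weighted positive / negative sentiment
    subjective = total = 0    # counts of sentiment-carrying and all tweets
    for sentiment, retweets in tweets:
        total += 1
        if sentiment == 0:
            continue
        subjective += 1
        if sentiment > 0:
            pos += retweets
        else:
            neg += retweets
    if total == 0 or pos + neg == 0:
        return 0.0            # no evidence of any leaning
    polarity = (pos - neg) / (pos + neg)     # assumed form, in [-1, 1]
    subjectivity = subjective / total        # assumed form, in [0, 1]
    return polarity * subjectivity

# Toy community (hypothetical): four tweets on one topic.
leaning = community_leaning([(1, 30), (-1, 10), (0, 50), (0, 5)])
```

In this example the weighted polarity is 0.5 and half of the tweets carry sentiment, giving a mildly positive leaning of 0.25.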

Results and discussion

We present the results of the proposed methodology for identifying interest groups and their leaning towards different environmental topics, in terms of network and community structure, content categorization and identification, and sentiment analysis.

Network and community structure

We analyze a retweet network of 3.7 million users linked by 9.7 million retweet relations. In Fig. 1 we present the distribution of out-degree and influence I (as defined by Equation 1) for the nodes of the network. Community detection results in over 125,000 communities. Their size distribution is presented in Fig. 2. Notice that both plots are in log–log scale; even by visual inspection, the distributions display a fat tail, in the sense that they deviate strongly to the right from a Gaussian distribution. This means that, in line with the empirical literature on social networks, nodes with very high degree and communities of very large size occur much more frequently than they would in a Gaussian scenario.

Fig. 1 Out-degree (orange) and influence (green) distribution in the retweet network

Fig. 2 Distribution of community sizes using logarithmic binning, as defined in [30]

We focus our analysis on communities of considerable size, which also produced a sufficient amount of tweets for meaningful content identification and sentiment analysis. This results in 12 communities, each with more than 50,000 users, and with at least 10,000 unique tweets.

The analysis in terms of community influence and its distribution among their users reveals significant structural differences among the largest communities. Results are presented in Table 2. The ratio between the inter- and intra-community influence, \(I_{out}(C)\) and \(I_{in}(C)\), shows that the majority of communities are strongly introverted, as their influence outside their “borders” represents less than a quarter of the impact they have. However, there are two communities (k=1 and 4) that have almost a third of their influence outside the community, and one where its external influence is almost as high as its internal influence (k=5).

Table 2 Structural properties of the 12 largest communities

The distribution of influence within communities, as measured by the Herfindahl-Hirschman index (HHI), also shows interesting differences among communities. The lowest values of HHI are around 0.03, for communities k=6, 9, 10, and 11. Hence, these are the communities that have the lowest inequality in terms of \(I_{in}\) among their 50 most influential users. In contrast, communities k=8 and 12 have the highest inequality among their 50 most influential users. It is interesting to notice that community k=6, with the lowest inequality, is also the second most introverted. Other than that, we find no obvious relation between HHI and the relative inter-community influence.

In Fig. 3, we present the relation between the user influence, out-degree, and the number of unique tweets, for the top three most influential users of the nine selected communities. The selection is explained in the subsequent section. The figure shows the magnitude of the top users in different communities and is consistent with the inequality measured by HHI. On the other hand, there is no obvious relation between the tweet volume and the influence of the users. It seems that higher out-degree is accompanied by higher influence, which can also be seen in Fig. 1.

Fig. 3 Influence (bubble size), out-degree (number of retweeting users), and the number of unique tweets for the top three most influential users of the nine selected communities

Community content

A preliminary community categorization was performed by looking at the Twitter profiles of their most influential users and the contents of their tweets. We find that the communities could roughly be classified into six categories. Table 3 presents the community categories and examples of the most influential users in these categories.

Table 3 Community categories and their most influential users

The community categorization reveals that for our further investigations we can ignore certain categories of communities. First, in the “Humor” community, the presence of an actual leaning or sentiment towards a certain topic is questionable (every topic can be made fun of using positive or negative words), and it is hard to automatically identify the correct polarity due to the frequent use of irony and sarcasm. Second, we also ignore a smaller community in the category “Other” that we are unable to strictly categorize.

One community from the “Environmental” category is also not included, because it contains numerous content duplicates as a result of marketing and spamming. The final selection includes three communities from the “Environmental” category (labeled as “Env 1”, “Env 2”, and “Env 3”), three from “News” (“News 1”, “News 2”, and “News 3”), the “Indian” community (“India”), one “Celebrities” community (“Celebrity”), and the “Skeptics” community (“Skeptic”). The network of these nine communities is outlined in Fig. 4. Each community is represented with its own color and the size of the nodes is proportional to the user’s influence. The presented network layout shows a relatively clear segregation between the communities.

Fig. 4 Subgraph of the retweet network induced on the nine selected communities. Only users with influence larger than 100 retweets are displayed. The size of the nodes is proportional to the user influence and individual communities are distinguished by color

We analyze the content tweeted by a community in terms of (i) hashtags and (ii) plain text. Hashtags can represent entities in the tweet and/or user-inserted labels of a tweet, indicating the topic or broader context of the tweet. Content analysis in terms of hashtags, using the approach presented in section “Content identification and filtering”, is therefore expected to show the characteristic entities and topics of interest in a selected community. On the other hand, plain text analysis is more appropriate for identification of actions, attitude, and phrases that are most distinctive for a particular community. The results of content analysis are presented in Table 4.

Table 4 Characteristic content of the nine influential communities, selected on the basis of the largest number of unique tweets (in parenthesis are the most influential users)

The most characteristic content of each community, as shown by the results in Table 4, reasonably distinguishes the communities of different categories. The hashtag content analysis supports the membership of the communities with the most influential users “ClimateReality” and “climateprogress” in the “Environmental” category, which are therefore from now on labeled “Env 1” and “Env 2”, respectively. The next two largest communities include topics present in the news in the United Kingdom and the United States of America, and are hence called “News 1” and “News 2”, respectively. The analysis also reveals that the users retweeting “JunkScience” belong to the “Skeptic” community. Local topics from “India” are apparent from the hashtags of the next community. Similarly, the hashtags of the Ian Somerhalder Foundation (#isf) and their opinions point to the “Celebrity” community. Hashtag analysis of the last two communities shows interest in Canadian political and environmental issues, hence “News 3”, and in environmental problems and political topics in Australia, therefore “Env 3”.

On the other hand, the results of the plain text analysis mostly show more specific topics that are shared in the observed communities. The top terms or phrases (n-grams) in the “Env 1”, “Env 2”, “News 1”, “News 2”, and “Celebrity” communities reflect their interest in the promotion of alternative, renewable, and environmentally friendly energy sources, in contrast to the controversial energy supply solution provided by fracking, as well as in raising awareness of global pollution. The two most distinctive topics that surface from the content of the “Skeptic” community are “man-made global-warming” and “conducts dangerous human experiments”. The former is related to the community’s skepticism regarding human-caused global warming, and the latter is about an article published by the “Investor’s Business Daily” newspaper [31] that criticizes an allegedly harmful experiment by the Environmental Protection Agency (EPA). The plain text content results for the communities “India”, “News 3”, and “Env 3” show less specific topics, with the main focus on the local political situation, or environmental and energy policies.

Community sentiment

Finally, we investigate the sentiment leaning of the most content-rich communities. In our dataset of over 30 million environmental tweets, there are almost 3.2 million unique tweets. We label them with the SVM sentiment model, described in the “Sentiment analysis” section, as positive (1), neutral (0), or negative (−1). Only 31 % of the unique tweets are labeled as subjective, i.e., non-neutral. Among these sentiment-carrying tweets, 52 % are positive and 48 % are negative.

We analyze the sentiment leanings towards selected topics related to the environmental issues. The selection is based on three major groups of topics that are of interest to environmental policy makers: energy sources and energy generation, environmental side effects, and actions or initiatives for solving the environmental issues. We separate the first group into four topics: renewable or green energy sources, nuclear energy, fossil fuels, and fracking, as a separate controversial topic. The second group is represented by the broader topic of global warming and climate change, more general pollution and contamination, and its more specific variant about emissions of greenhouse gases (CO2 and methane). The last group is separated into recycling and waste management, and environmental policies and initiatives.

The nine communities selected for investigation produce over two thirds of the unique tweets in our dataset. We use the approach presented in the “Content identification and filtering” section to filter these 2.1 million tweets by the nine topics defined above. Table 5 presents the queries used in the filtering process to describe a particular topic. The number of tweets filtered by topic for each community is shown in Fig. 5.

Fig. 5 The number of unique tweets for each selected topic published by each of the major communities

Table 5 Selected environmental topics and the associated queries for tweet filtering

The sentiment of a community towards a selected topic is computed from the tweets on that topic, tweeted by that particular community, as proposed in the “Sentiment analysis” section, Algorithm 1. The results of the community sentiment analysis on different environmental topics are presented in Fig. 6. Community leaning towards a specific topic is computed as the difference between the community sentiment on this topic and the community’s average sentiment in our dataset. In Figs. 5 and 6, the topics of interest are in descending order from left to right by their average sentiment over all the communities.
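The leaning computation described above can be illustrated with a minimal sketch. Algorithm 1 is not reproduced in this section, so the sketch assumes, purely for illustration, that the community sentiment is the mean of the tweet labels in {−1, 0, 1}; the function names and the label lists are hypothetical.

```python
from statistics import mean


def community_sentiment(labels):
    """Average sentiment of a list of tweet labels in {-1, 0, 1}.

    Assumption: a plain mean stands in for Algorithm 1 of the paper.
    """
    if not labels:
        raise ValueError("at least one labeled tweet is required")
    return mean(labels)


def sentiment_leaning(topic_labels, all_labels):
    """Leaning = sentiment on the topic minus the community's average sentiment."""
    return community_sentiment(topic_labels) - community_sentiment(all_labels)


# Hypothetical labels for one community (not taken from the dataset).
all_labels = [1, 0, 0, -1, 1, 0, -1, 0]   # all tweets of the community
topic_labels = [1, 1, 0, 1]               # its tweets on one topic
leaning = sentiment_leaning(topic_labels, all_labels)
```

A positive value means the community is more positive about the topic than it is on average, and a negative value means the opposite. Subtracting the community average makes communities with different baseline sentiment comparable.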

Fig. 6 Sentiment leaning of the nine communities towards different environmental topics

The first interesting finding is that the results of the sentiment analysis are in accordance with the commonly accepted attitudes towards different environmental topics. All communities show a positive leaning towards “green energy” and “recycling”, and a negative leaning towards “fossil fuels”, “climate change”, “pollution”, and “fracking”, except for two outlier communities that we examine separately. Regarding “emissions”, “nuclear energy”, and “policies”, the sentiment leanings are less unanimous, which is to some extent also expected. These results indicate that the domain-specific sentiment model produces reasonable results.

Observing individual communities, we find that most of them follow the same trend; however, there are two notable exceptions: the “Skeptic” and the “Celebrity” communities. The “Skeptic” community is strongly segregated from the rest (see Fig. 4), and its sentiment leanings show the greatest deviations from those of the other communities (see Fig. 6). It is the only community with a positive sentiment leaning towards the topics “fossil fuels” and “fracking”, which is considerably different from all other communities. These results indicate that the preferences of this community diverge from the interests of the other communities.

The “Celebrity” community is dominated by “iansomerhalder”, one of the most influential users overall (see Fig. 3). Despite its high influence, the community produces a very low number of original tweets (less than 1 % of all the unique tweets, see Table 4). Its influence emerges from the large number of retweets, due to the large number of followers of “iansomerhalder”. This hints at the possibility of engaging high-profile celebrities who are committed to environmental issues in the promotion and spreading of influential content.

This is exactly what can be observed for the topics “emissions” and “pollution”. The extremely positive sentiment leaning towards these topics is predominantly (60 and 78 %, respectively) due to only three tweets by the two most influential users of the “Celebrity” community: “iansomerhalder” and “LeoDiCaprio”. They express their happiness and thankfulness regarding the “action to limit carbon pollution” and “cutting carbon pollution”, which will “clean up our air and tackle climate disruption”, as they put it. This explains the distinctively positive leaning for the topics “emissions” and “pollution”. On the other hand, the “Celebrity” community seems to be the least in favor of “fracking”.

Conclusions

The paper contributes to the research on complex networks in social media by combining a structural and content-based analysis of Twitter data. From structural properties of the retweet network, we identify influential users and communities. From the contents of their tweets, we characterize discussion topics and their sentiment. Sentiment of different communities shows perceivable differences in their leanings towards different topics. We have identified two communities that considerably diverge from the rest, “Skeptic” with the most different sentiment leanings on several topics, and “Celebrity” with a low number of original tweets, but highly influential, with the potential to spread interesting information.

Our previous research in sentiment analysis of Twitter data in politics and the stock market suggests that different vocabularies are used in different domains and that high-quality expert labeling of domain-specific tweets yields better sentiment models. The comparison of the three sentiment models (smiley-based general, hand-labeled general, and hand-labeled domain-specific) presented in this paper confirms our intuition: the hand-labeled domain-specific model yields a lower error rate and a higher combination of precision and recall (F-score) than the other two models. However, more extensive evaluations are required to determine the amount of hand-labeled tweets needed to approach the “maximum” performance, e.g., the inter-annotator agreement.

Another line of future research is the construction of more sophisticated SVM classifiers. In the case of smiley-based training data, only positive and negative tweets were available, and a binary SVM classifier was extended with a neutral zone to allow for the three-class classification. However, in the case of hand-labeled tweets, there are three sets of training data available: positive, neutral, and negative tweets, so we are dealing with a multiclass problem. Further, we can assume that the classes are ordered (neutral is between the positive and negative), and therefore, we are faced with the problem of ordinal regression [32], instead of binary classification. In the future, we plan to exploit various extensions of an SVM to deal with the multiclass [33] and ordinal regression problems.
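The idea of extending a binary classifier with a neutral zone can be sketched as follows. This is only an illustration under stated assumptions: the zone is taken to be symmetric around the separating hyperplane, the input is assumed to be the signed decision value of an already trained binary SVM, and the function name and the width of 0.5 are hypothetical rather than the values used in the paper.

```python
def classify_with_neutral_zone(score, zone=0.5):
    """Map a signed binary-SVM decision value to a three-class label.

    score: signed distance from the separating hyperplane (assumed input).
    zone:  half-width of the neutral zone (hypothetical value).
    Returns 1 (positive), 0 (neutral), or -1 (negative).
    """
    if zone < 0:
        raise ValueError("zone must be non-negative")
    if score > zone:
        return 1
    if score < -zone:
        return -1
    return 0  # too close to the hyperplane to be called subjective


# Hypothetical decision values for four tweets.
scores = [1.8, 0.2, -0.1, -2.3]
labels = [classify_with_neutral_zone(s) for s in scores]
```

With a zone of zero the function reduces to the ordinary binary decision, and widening the zone trades subjective predictions for neutral ones. An ordinal or multiclass SVM, as discussed above, would instead learn the class boundaries from the neutral training examples.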

In this paper, we present a general methodology for combining a structural and content-based analysis of Twitter networks, and then apply it to 1 year of Twitter data about environmental topics. There are several plans for future work. On the one hand, we plan to study the temporal aspects of community formation and sentiment spreading. In addition to the retweet networks, we will also construct mention networks (which model mutual engagement of users in conversations). We will investigate various spreading models and study the differences in sentiment spreading in such multilayer (retweet and mention) networks.

We are also collecting Twitter data in several other interesting domains: stock market, EU commission and parliament members, and lobbying organizations. The application of the presented structural and content-based analysis to these new domains will result in complex ‘Twitter’ networks. On the other hand, networks between the same entities can also be constructed by other means, such as correlations between stock returns, national and party membership of politicians, vote similarity, and ownership between the companies. The research challenge for the future is the comparison between the Twitter-induced and other types of networks, and the mutual interplay and property spreading between these multilayer networks.

Endnotes

1 Cosine similarity is a measure of similarity between vectors \(\mathbf{a}\) and \(\mathbf{b}\). It is calculated as the normalized dot product between the two vectors: \(\text{sim}(\mathbf{a}, \mathbf{b}) = \cos(\angle(\mathbf{a}, \mathbf{b})) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| \cdot |\mathbf{b}|}\).
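As a small illustration of the formula in endnote 1, the following sketch computes the cosine similarity of two dense vectors given as Python lists. The function name and the example vectors are hypothetical, and the sketch is not the implementation used in the paper.

```python
import math


def cosine_similarity(a, b):
    """Normalized dot product of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical vectors: parallel vectors score 1, orthogonal vectors score 0.
parallel = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```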

2 The dataset was obtained from the “For Academics” section at http://help.sentiment140.com/for-students.

3 The approach to feature vector construction was implemented using the LATINO (Link Analysis and Text Mining Toolbox) software library, available at http://source.ijs.si/mgrcar/latino.

Appendix

Our dataset of over 30 million tweets on environmental topics was acquired using the Twitter Search API [34]. Table 6 shows the list of search queries used.

Table 6 Queries for the “Environmental dataset” acquisition from the Twitter Search API

References

  1. Dorsey, J, Williams, E, Stone, B, Glass, N: Twitter online social networking service. http://www.twitter.com/. Accessed: Feb 15, 2015.

  2. Cha, M, Haddadi, H, Benevenuto, F, Gummadi, PK: Measuring user influence in twitter: the million follower fallacy. ICWSM. 10, 10–17 (2010).

  3. Avnit, A: The million followers fallacy. Pravda Media Group, Tel Aviv, Israel (2009).

  4. Suh, B, Hong, L, Pirolli, P, Chi, EH: Want to be retweeted? Large scale analytics on factors impacting retweet in twitter network. In: 2010 IEEE Second Intl. Conf. on Social Computing, pp. 177–184. IEEE, Piscataway, New Jersey (2010).

  5. Bakshy, E, Hofman, JM, Mason, WA, Watts, DJ: Everyone’s an influencer: quantifying influence on twitter. In: Proc. Fourth ACM Intl. Conf. on Web Search and Data Mining, pp. 65–74. ACM, New York City, New York (2011).

  6. Conover, M, Ratkiewicz, J, Francisco, M, Gonçalves, B, Menczer, F, Flammini, A: Political polarization on twitter. In: Proc. Fifth Intl. Conf. on Weblogs and Social Media (ICWSM). AAAI, Palo Alto, California (2011).

  7. Conover, MD, Gonçalves, B, Ratkiewicz, J, Flammini, A, Menczer, F: Predicting the political alignment of twitter users. In: Privacy, Security, Risk and Trust, 2011 IEEE Third Intl. Conf. on Social Computing, pp. 192–199. IEEE, Piscataway, New Jersey (2011).

  8. Agarwal, A, Xie, B, Vovsha, I, Rambow, O, Passonneau, R: Sentiment analysis of twitter data. In: Proceedings of the Workshop on Languages in Social Media, pp. 30–38. Association for Computational Linguistics, Stroudsburg, PA, USA (2011).

  9. Bollen, J, Mao, H, Zeng, X: Twitter mood predicts the stock market. J. Comput. Sci. 2(1), 1–8 (2011).

  10. Gayo-Avello, D: A meta-analysis of state-of-the-art electoral prediction from twitter data. Soc. Sci. Comput. Rev. 31(6), 649–679 (2013).

  11. Smailović, J: Sentiment Analysis in Streams of Microblogging Posts. PhD thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia (2014).

  12. Vapnik, VN: The Nature of Statistical Learning Theory. Springer, New York, NY, USA (1995).

  13. Go, A, Bhayani, R, Huang, L: Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1–12 (2009).

  14. Sluban, B, Smailović, J, Juršič, M, Mozetič, I, Battiston, S: Community sentiment on environmental topics in social networks. In: Proceeding of the Tenth International Conference on Signal-Image Technology & Internet-Based Systems, pp. 376–382. IEEE Computer Society, Washington, DC, USA (2014).

  15. Fortunato, S: Community detection in graphs. Phys. Rep. 486, 75–174 (2010).

  16. Blondel, VD, Guillaume, J-L, Lambiotte, R, Lefebvre, E: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), 10008 (2008).

  17. Newman, MEJ: Modularity and community structure in networks. Proc. Natl. Acad. Sci. U. S. A. 103(23), 8577–8582 (2006).

  18. Lancichinetti, A, Fortunato, S: Community detection algorithms: a comparative analysis. Phys. Rev. E. 80(5), 056117 (2009).

  19. Werden, GJ: Using the Herfindahl–Hirschman index. In: Phlips, L (ed.) Applied Industrial Economics, pp. 368–374. Cambridge University Press, Cambridge, UK (1998).

  20. Feldman, R, Sanger, J: Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, New York, NY, USA (2006).

  21. Joachims, T: A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 377–384. ACM, New York City, New York (2005).

  22. Joachims, T: Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226. ACM, New York City, New York (2006).

  23. Joachims, T, Yu, C-NJ: Sparse kernel SVMs via cutting-plane training. Mach. Learn. 76(2-3), 179–193 (2009).

  24. Smailović, J, Grčar, M, Lavrač, N, Žnidaršič, M: Stream-based active learning for sentiment analysis in the financial domain. Inf. Sci. 285, 181–203 (2014).

  25. Smailović, J, Grčar, M, Lavrač, N, Žnidaršič, M: Predictive sentiment analysis of tweets: A stock market application. In: Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data. Lecture Notes in Computer Science, pp. 77–88. Springer, Berlin Heidelberg (2013).

  26. Feldman, R, Sanger, J: Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, New York, NY, USA (2006).

  27. Baccianella, S, Esuli, A, Sebastiani, F: Evaluation measures for ordinal regression. In: Intelligent Systems Design and Applications, 2009. ISDA’09. Ninth International Conference On, pp. 283–287. IEEE, Piscataway, New Jersey (2009).

  28. Kiritchenko, S, Zhu, X, Mohammad, SM: Sentiment analysis of short informal texts. J. Artif. Intell. Res. 50, 723–762 (2014).

  29. Zhang, W, Skiena, S: Trading strategies to exploit blog and news sentiment. In: Proc. Fourth Intl. AAAI Conf. on Weblogs and Social Media (ICWSM), pp. 375–378. AAAI, Palo Alto, California (2010).

  30. Newman, MEJ: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46(5), 323–351 (2005).

  31. Obama’s EPA Conducts Dangerous Human Experiments. Investors.com. http://news.investors.com/ibd-editorials/040414-696061-epa-conducts-pollution-experiments-on-humans.htm. Accessed: Sep 5, 2014.

  32. Cardoso, JS, Da Costa, JFP: Learning to classify ordinal data: the data replication method. J. Mach. Learn. Res. 8, 1393–1429 (2007).

  33. Crammer, K, Singer, Y: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2002).

  34. Twitter search API. Twitter, Inc. https://dev.twitter.com/rest/public/search. Accessed: Jan 1, 2014.

Acknowledgements

This work was supported in part by the European Commission under the FP7 projects SIMPOL (Financial Systems SIMulation and POLicy Modelling, grant no. 610704) and MULTIPLEX (Foundational Research on MULTIlevel comPLEX networks and systems, grant no. 317532), by the H2020 project DOLFINS (Distributed Global Financial Systems for Society, grant no. 640772), and by the Slovenian Research Agency programme Knowledge Technologies (grant no. P2-103). We thank Matjaž Juršič for his help with the construction of retweet networks, and Petra Kralj Novak, Miha Grčar, and Martin Žnidaršič for their help with sentiment models, their evaluation, and tweet preprocessing.

Author information

Corresponding author

Correspondence to Borut Sluban.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

BS and IM conceived and designed the experiments. BS and JS performed the experiments. BS, JS, IM, and SB analyzed the data and results and wrote the paper. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Sluban, B., Smailović, J., Battiston, S. et al. Sentiment leaning of influential communities in social networks. Computational Social Networks 2, 9 (2015). https://doi.org/10.1186/s40649-015-0016-5

  • DOI: https://doi.org/10.1186/s40649-015-0016-5

Keywords