There is a common stereotype that western people are more active on social media than eastern people. But how true is this statement? Let's find out!
We analyze how profile features are distributed across users of different languages and look for notable patterns. Starting from the statistics that are already visible to everyone on the Twitter platform (number of followers, number of tweets), we derive additional features, such as the average daily number of tweets and an internationality coefficient, to go beyond the superficial differences between cultures. We then group users by language into two broad categories, Western and Eastern; most of the analysis consists of comparing and describing the notable differences between these two groups. We also try to confirm the behavior analysis in our reference paper by replicating two of its propositions, the 80-20 rule and circadian rhythms, this time for each group (Western and Eastern) separately. We also check which languages the users of each language prefer to follow. Finally, we apply machine learning algorithms, in both a supervised and an unsupervised framework, to see whether the behavior of the two groups is distinct enough to predict a user's culture (Western or Eastern) from the features of the user's Twitter profile alone.
EgoAlterProfiles: This dataset contains profiles of two kinds of users: egos and alters. The ego users are uniformly sampled through the Twitter API, and the alter users are obtained by adding the followers and followees of the ego users. There are 34,006 egos and over 2,000,000 alters in this dataset. For each user, ego or alter, it includes Twitter profile data: follower count, friend count, status count, and the language used by the user.
EgoNetwork: This dataset contains the follow relationships on Twitter between egos and alters. There are over 4,000,000 follow relationships in this dataset.
Note that the dataset contains only ego-alter follow relationships, not ego-ego or alter-alter ones.
EgoTimelines: This dataset contains information about the tweets posted by the egos during a certain period, including creation time, hashtag usage, mentioned IDs, etc.
1. Do users of different languages differ in their profile statistics (e.g. activeness, popularity)?
2. Do users from different cultures (western and eastern) behave differently on Twitter?
3. Can we predict the culture (western or eastern) of a user from the features of the user's Twitter profile and of the follower-followee network?
The following table presents the mapping between the language code and the language of the user. It is a useful reference when comparing data from the various plots in further sections.
Please note that this is not the complete list of languages used by the users in the original dataset, but only a subset of languages relevant to the dataset used for our plots.
Before we start our data story, we need to take a look at the language composition of the dataset. Since alters in this dataset are not randomly sampled, we only consider egos in this section. We also filter out inactive users, i.e. those who have not posted any tweets or have no followers, before computing the language composition of the data.
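The filtering and counting step described above can be sketched as follows. This is a minimal illustration on made-up records: the field names (`lang`, `followers`, `statuses`) and the sample values are assumptions, not the dataset's actual columns.

```python
from collections import Counter

# Hypothetical ego profiles; field names and values are illustrative only.
egos = [
    {"id": 1, "lang": "en", "followers": 120, "statuses": 340},
    {"id": 2, "lang": "ja", "followers": 45, "statuses": 2100},
    {"id": 3, "lang": "en", "followers": 0, "statuses": 15},   # no followers: dropped
    {"id": 4, "lang": "es", "followers": 30, "statuses": 0},   # no tweets: dropped
    {"id": 5, "lang": "en", "followers": 8, "statuses": 12},
]


def language_composition(profiles):
    """Return each language's share among active users.

    A user counts as active with at least one tweet and at least one follower.
    """
    active = [p for p in profiles if p["statuses"] > 0 and p["followers"] > 0]
    if not active:
        return {}
    counts = Counter(p["lang"] for p in active)
    total = sum(counts.values())
    # most_common() orders languages from the largest share to the smallest
    return {lang: n / total for lang, n in counts.most_common()}


composition = language_composition(egos)
print(composition)
```

Only the egos are passed in, since the alters are not randomly sampled.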
Our refined dataset contains 7 western languages and 5 eastern languages, with 10,428 and 2,297 users respectively. The figures show that English users form a large majority, and that the top 3 languages (English, Japanese, and Spanish) account for almost 75% of all users. Among eastern languages, Japanese users form the majority.
To address this question, we must first define 'friends' in our dataset: friends are an ego's followers that the ego has mentioned at least twice. The box plots below show the number of followers and friends for each language, and the bar plots show the corresponding mean values.
From the plots, the following results stand out:
1. Arabic has the highest average number of followers.
2. Korean has the lowest average number of followers.
3. Arabic has the highest average number of friends.
4. Korean has the lowest average number of friends.
Arabic-language users are quite popular online, while Korean-language users tend to have less of an online presence.
Switching to a cultural perspective, do the users from the western culture really have higher popularity than users from eastern culture on Twitter? To answer this question, we group the data by culture according to the languages spoken in both regions.
Surprisingly, the results show that eastern culture users have more followers than western culture users, while both cultures have a similar number of friends.
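Grouping by culture amounts to mapping each language code to a group and averaging within the group. The sketch below assumes a partial, illustrative mapping; the full list of 7 western and 5 eastern languages is not reproduced here, and the assignment of each code to a group is an assumption made for the example.

```python
from collections import defaultdict
from statistics import mean

# Illustrative and partial mapping (assumption): not the full list used in the analysis.
CULTURE = {
    "en": "western", "es": "western", "de": "western", "pt": "western",
    "ja": "eastern", "ko": "eastern", "ar": "eastern",
}


def mean_by_culture(profiles, field):
    """Average `field` over users of each culture, skipping unmapped languages."""
    groups = defaultdict(list)
    for profile in profiles:
        culture = CULTURE.get(profile["lang"])
        if culture is not None:
            groups[culture].append(profile[field])
    return {culture: mean(values) for culture, values in groups.items()}


# Made-up follower counts, for illustration only
sample = [
    {"lang": "en", "followers": 100},
    {"lang": "de", "followers": 50},
    {"lang": "ja", "followers": 200},
    {"lang": "ar", "followers": 400},
    {"lang": "xx", "followers": 999},  # unmapped language: ignored
]
followers_by_culture = mean_by_culture(sample, "followers")
print(followers_by_culture)
```

Because follower counts are typically heavy-tailed, comparing medians alongside means may be worthwhile; `statistics.median` can be substituted for `mean` in the same function.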
First, we define activeness using two features: total tweet count and average daily tweet count. The box plots below show the total and daily tweet counts for each language, and the bar plots show the corresponding mean values.
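The text does not spell out how the average daily tweet count is computed. One plausible reading, shown here purely as an assumption, divides the total tweet count by the account age in days at the time the profile was collected. The parameter names `created_on` and `snapshot_on` are hypothetical.

```python
from datetime import date


def average_daily_tweets(statuses_count, created_on, snapshot_on):
    """Total tweets divided by account age in days (assumed definition).

    The age is clamped to at least one day to avoid dividing by zero
    for accounts created on the snapshot date.
    """
    age_days = max((snapshot_on - created_on).days, 1)
    return statuses_count / age_days


# 730 tweets over exactly two non-leap years (730 days)
rate = average_daily_tweets(730, date(2014, 1, 1), date(2016, 1, 1))
print(rate)
```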
From the plots, the following results stand out:
1. Japanese has the highest average number of statuses.
2. German has the lowest average number of statuses.
3. Japanese has the highest average number of daily tweets.
4. German has the lowest average number of daily tweets.
Japanese-language users spend more time on Twitter, while German-language users tend to be less active online.
We also analyze it from a cultural perspective:
Surprisingly again, eastern culture users are more active on Twitter than western culture users.
In this section, we analyze whether eastern and western culture users exhibit different community usage behavior on Twitter. To do so, we focus on three separate questions.
The 80-20 rule originally referred to the observation that 80% of Italy’s wealth belonged to only 20% of the population. From our earlier replication, we already know that Twitter users follow the 80-20 rule. In this analysis, we separate users into western and eastern groups and check whether each group still follows the 80-20 rule, or at least a power-law distribution in general.
Parameter N means that only users who have posted more than N tweets are included. The choice of N does not change the distribution qualitatively. The distribution of the number of tweets per user is quite similar for both groups, and the 80-20 rule remains qualitatively plausible.
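A simple way to check the 80-20 rule numerically is to measure the share of all tweets produced by the most active 20% of users, after applying the threshold N. The function below is a sketch on made-up counts, not the procedure used for the plots; the name `top_share` and its parameters are hypothetical.

```python
def top_share(tweet_counts, top_fraction=0.2, min_tweets=0):
    """Share of all tweets produced by the most active `top_fraction` of users.

    Only users with more than `min_tweets` tweets are kept, which plays
    the role of parameter N in the text.
    """
    kept = sorted((c for c in tweet_counts if c > min_tweets), reverse=True)
    if not kept:
        return 0.0
    n_top = max(1, round(len(kept) * top_fraction))
    return sum(kept[:n_top]) / sum(kept)


# Made-up tweet counts for ten users
counts = [400, 300, 40, 30, 25, 20, 15, 10, 5, 5]
share = top_share(counts)
print(f"top 20% of users produce {share:.0%} of the tweets")
```

Running the same function separately on the western and eastern groups, and for several values of `min_tweets`, gives a quick numeric counterpart to the qualitative comparison.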
The plots for the eastern and western cultures look quite different, but both still seem to follow the circadian rhythm pattern suggested in the paper.
Interestingly, a peak in the number of tweets is visible around 22h for both west and east, but the east shows an equally high peak around 13h. The number of active users gradually increases after 4h for both west and east, but the east shows a drastic dip in the number of active users between 15h and 21h.
Moreover, western users are less active on weekends, while eastern users do not show this pattern.
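The hourly and weekend patterns discussed above can be summarized from tweet timestamps as follows. This sketch assumes the timestamps have already been converted to each user's local time, which the text does not specify; the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def hourly_and_weekend_profile(timestamps):
    """Return (share of tweets per hour 0..23, share of tweets on Sat/Sun)."""
    if not timestamps:
        raise ValueError("need at least one timestamp")
    total = len(timestamps)
    hours = Counter(ts.hour for ts in timestamps)
    per_hour = [hours.get(h, 0) / total for h in range(24)]
    # datetime.weekday(): Monday is 0, so Saturday and Sunday are 5 and 6
    weekend = sum(1 for ts in timestamps if ts.weekday() >= 5) / total
    return per_hour, weekend


# Made-up local-time timestamps (assumption: already timezone-adjusted)
tweets = [
    datetime(2016, 3, 7, 13, 5),    # Monday
    datetime(2016, 3, 7, 22, 30),   # Monday
    datetime(2016, 3, 8, 22, 10),   # Tuesday
    datetime(2016, 3, 5, 22, 45),   # Saturday
]
per_hour, weekend_share = hourly_and_weekend_profile(tweets)
print(per_hour[22], weekend_share)
```

If tweets were spread evenly over the week, the weekend share would be 2/7, which gives a natural reference point for the weekend comparison.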
In this section, we examine whether users of different languages have a specific 'following' preference. To do so, we aggregate the follow relationships between users of different languages, which reveals the following pattern of each language group. Here, we only consider pairs in which the ego is the follower.
From the figures, we learn that users tend to follow users who use the same language as themselves, just as we expected. However, we also find some other interesting facts:
- Users of all languages have a high tendency to follow English accounts.
- Western-language users show more interest in English-language users than eastern-language users do.
- Portuguese and Spanish language users are more likely to follow each other than other language pairs are.
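The aggregation of follow relationships by language can be sketched as a row-normalized matrix: for each ego language, the fraction of follows that go to each followee language. The edge list and the `lang_of` lookup below are made-up stand-ins for the network and profile data.

```python
from collections import Counter, defaultdict


def follow_preferences(edges, lang_of):
    """Row-normalized follow matrix between languages.

    `edges` holds (ego_id, followee_id) pairs in which the ego is the follower.
    Pairs with an unknown language on either side are skipped.
    """
    counts = defaultdict(Counter)
    for ego, followee in edges:
        if ego in lang_of and followee in lang_of:
            counts[lang_of[ego]][lang_of[followee]] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


# Made-up users and follow edges, for illustration only
lang_of = {1: "pt", 2: "pt", 3: "es", 4: "en", 5: "ja"}
edges = [(1, 2), (1, 3), (1, 4), (1, 4), (5, 4)]
prefs = follow_preferences(edges, lang_of)
print(prefs["pt"])
```

Normalizing each row makes languages with very different user counts comparable, although the raw shares still reflect how many accounts exist in each target language.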
Next, we introduce another feature: internationality. The internationality coefficient is computed for each ego, defined as the number of followers using a different language than the ego divided by the total number of followers.
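As a minimal sketch of this definition, the coefficient for one ego can be computed from the ego's language and the list of its followers' languages. The function name and inputs are hypothetical, and returning `None` for an ego without followers is a choice made for this example.

```python
def internationality(ego_lang, follower_langs):
    """Fraction of followers whose language differs from the ego's.

    Returns None when the ego has no followers, since the ratio is undefined.
    """
    if not follower_langs:
        return None
    different = sum(1 for lang in follower_langs if lang != ego_lang)
    return different / len(follower_langs)


# Made-up examples
coef_de = internationality("de", ["de", "en", "en", "fr"])
coef_ja = internationality("ja", ["ja", "ja", "ja", "en"])
print(coef_de, coef_ja)
```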
From the figure, we observe that German-language users have the largest internationality coefficient, which means they show the highest interest in following accounts in languages other than their own. On the other hand, Japanese speakers have the smallest internationality coefficient, which means they are less likely to follow accounts in other languages and instead tend to follow other Japanese users.
Can we predict the culture of a user given their overall activity on Twitter (without access to their content)? Let's try to solve this problem in both a supervised and an unsupervised way, using all the features extracted earlier:
- number of followers,
- number of friends,
- number of tweets,
- average daily tweets,
In the supervised framework, we split the dataset in two parts: we train our model on 70% of the users and use the remaining 30% to test its performance. We standardize the dataset; further analysis shows that a polynomial expansion does not seem to improve prediction performance. Using Logistic Regression and Random Forest, we obtained a test accuracy of up to 82%. The 6 coefficients of the logistic regression describe the importance of each regressor, while t-tests on the coefficients show no statistical evidence that some features are useful, namely the number of followers, the number of friends, and the average daily tweets; for the last one, this may be because it is strongly correlated with the total number of tweets published and the year of subscription.
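The supervised setup can be sketched with scikit-learn as below. The data are synthetic stand-ins for the 6 user features, and the stratified split, the random seeds, and the hyperparameters are assumptions made for the example, not the settings used in the analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 1000 users, 6 features
n_users = 1000
X = rng.normal(size=(n_users, 6))
# Hypothetical labels (1 = eastern, 0 = western), loosely driven by two features
signal = X[:, 2] + 0.5 * X[:, 5] + rng.normal(scale=1.0, size=n_users)
y = (signal > 0.8).astype(int)

# 70% of the users for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    # Always predicts the larger class; a reference point for accuracy
    "majority_baseline": DummyClassifier(strategy="most_frequent"),
    # Standardization is fitted on the training split only, inside the pipeline
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test split

print(scores)
```

The majority-class baseline is included because the two groups may be imbalanced: the refined ego set described earlier has 10,428 western and 2,297 eastern users, roughly 82% western. If the classifier's data has a similar balance, accuracy is best read against that baseline.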
An even more challenging problem is to separate these two cultures (western and eastern) from scratch, i.e. without a trained model. If we were able to cluster the users into these two classes without knowing that they exist, it would mean that the classes are very distinct and exist in the real world (outside our dataset). We applied a few variants of k-means (k=2), but with almost no success. In fact, even when using different dimensionality reductions (PCA, t-SNE, the 2 most representative features) to visualize the sample, there seems to be no spatial pattern at all along which the two groups could be split automatically.
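One way to quantify how well k-means recovers the two groups is to compare the cluster assignments with the known labels after clustering, for instance with the adjusted Rand index. The sketch below uses synthetic, heavily overlapping groups; the data, the group sizes, and the choice of metric are assumptions for illustration, not the evaluation used in the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two overlapping groups of users with 6 features each
X = np.vstack([
    rng.normal(0.0, 1.0, size=(800, 6)),   # hypothetical "western" users
    rng.normal(0.3, 1.0, size=(200, 6)),   # hypothetical "eastern" users
])
true_culture = np.array([0] * 800 + [1] * 200)

# The labels are never shown to k-means; they are used only for evaluation
X_std = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)

# Adjusted Rand index: 1.0 means perfect agreement, values near 0 mean
# the clustering matches the labels no better than chance
ari = adjusted_rand_score(true_culture, clusters)
print(f"adjusted Rand index: {ari:.3f}")
```

The adjusted Rand index is invariant to which cluster is called 0 or 1, which matters because k-means assigns cluster numbers arbitrarily.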
From our results, we can see that eastern culture users are more active and more popular on Twitter. Among all languages, Arabic-language users have the most friends and followers, while Korean-language users have the fewest. Japanese-language users have the highest average daily tweets, while German-language users have the lowest. Based on this alone, Arabic-language users are the most popular on Twitter and Korean-language users the least popular, while Japanese users are the most active and German users the least active.
The 80-20 rule: Both eastern and western cultures follow the 80-20 rule.
Circadian Rhythm: Both eastern and western cultures follow the circadian rhythm. However, eastern users are more active during lunch and late at night, while western users are the most active around 8 p.m.
Following Preference: People tend to follow accounts that use the same language as themselves. Moreover, users of all languages have a high tendency to follow English accounts.
Yes, in a supervised way with an accuracy of 0.82; not at all in an unsupervised way. So what does it mean?
IT REMINDS US THAT MACHINE LEARNING IS NOT MAGIC!
We can't predict patterns that don't exist and, similarly, there are patterns that can't be explained without the right data. The 6 features that we analyzed throughout this report capture the different cultural attitudes on Twitter and, to some extent, allow us to infer the culture with a certain accuracy (0.82) through supervised training. However, not all the variance is explained, and we are not able to solve this problem at all in an unsupervised way. This led us to question whether these cultural differences are really significant.
It is interesting to point out these results because this problem is not so different from the judgments we make every day while scrolling the feed on a social network. Are we using the right (sufficient and representative) data to profile a user? Do these personalities really reflect the reality on the test set? Maybe there are better datasets to work with. See `REAL_LIFE.csv`.