Italians on Twitter

Personality traits and interactions






The main high-level results of the research are presented in an article (in Italian) published on Nova (Il Sole 24 Ore)

An English translation of the article is also available

Domenico Bianco

Francesco Grisolia

Mauro Mario Gentile

30/6/2017

Project outline – aims and tools


Classification of the personality traits of the Italian Twitter community



BIG5

  • Characteristics:
    • Five psychological traits
    • Lexical hypothesis
    • Nomothetic model
  • Pros:
    • Concise description
    • Usable with different types of text
    • Generalizability
  • Cons:
    • Tendency to abstraction
    • Limited to language
    • Limited sensitivity to cultural differences

LIWC

  • Characteristics:
    • Textual analysis program
    • Psychological traits detection
    • Lexicon-based (dictionaries)
  • Pros:
    • Automated analysis
    • Value of function words
    • Significance of results
  • Cons:
    • Predefined categories
    • Binary classification (vs gradient)
    • No account of language use

Model application to Italian Twitter users




Data collection

In two months, using 60 Twitter Apps, we collected and processed:

  • 14.2 million unique Italian accounts

    • Scraping of Socialbakers: the 1000 most followed Italian users
    • Crawling of the followers and followed accounts of the above-mentioned group
    • REST APIs to collect general information about their followers
    • Scraping of selected followers, to accelerate the data collection process
    • Snowballing, to increase the size/statistical significance of our database
    • Filter on Italian users, extraction of unique accounts
    • Collection (every 5 minutes, for a month) of trending topics and engaged users. Inclusion rate: 99.5%
  • 1,7 billion tweets

    • Crawling of up to 3200 tweets for each user
    • Extraction of mentions and the original authors of the retweets
    • Detection of the most frequent tweet location for each user
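
The mention and retweet-author extraction and the per-user location step can be sketched as follows. This is only a minimal illustration working on plain tweet text and assuming the classic "RT @user:" retweet prefix; the function names and sample data are hypothetical, and the original pipeline may well have used the structured fields returned by the Twitter API instead.

```python
import re
from collections import Counter

# Twitter handles: 1 to 15 ASCII letters, digits or underscores.
MENTION_RE = re.compile(r"(?<!\w)@([A-Za-z0-9_]{1,15})")
RETWEET_RE = re.compile(r"^RT @([A-Za-z0-9_]{1,15}):")


def retweet_author(text):
    """Return the original author of a classic 'RT @user:' retweet, else None."""
    match = RETWEET_RE.match(text)
    return match.group(1) if match else None


def mentions(text):
    """Return the handles mentioned in a tweet, excluding a retweeted author."""
    found = MENTION_RE.findall(text)
    # In a retweet the first handle is the original author, not a mention.
    return found[1:] if retweet_author(text) else found


def most_frequent_location(locations):
    """Return the most common non-empty location, or None if there is none."""
    counts = Counter(loc for loc in locations if loc)
    return counts.most_common(1)[0][0] if counts else None
```

For example, `mentions("RT @alice: ciao @bob")` yields only `bob`, while `retweet_author` on the same text yields `alice`.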

Target Universe and its characterization


INFERENCE IN TWO DIMENSIONS:

  • Sex: 12 million users, 1.7% classification error on a sample of 5,500 users
  • Geolocation: 2.46 million users, 2.4% classification error on a sample of 1,500 users

OCEAN scores were computed only for users whose tweets had at least 70 matches with the LIWC dictionary, in order to replicate the IBM operational conditions and to provide the highest model reliability

OCEAN: classification of 2.55 million users
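
The 70-match eligibility rule can be illustrated with a short sketch. The dictionary below is a hypothetical toy set (the real LIWC dictionary is far larger and organised in categories), and the tokenisation is an assumption, not the one used in the project.

```python
import re

MIN_MATCHES = 70  # threshold stated in the slides


def count_dictionary_matches(tweets, dictionary):
    """Count the tokens, across all of a user's tweets, found in the dictionary."""
    total = 0
    for tweet in tweets:
        # Lower-case the text and keep runs of letters only (assumed tokeniser).
        tokens = re.findall(r"[^\W\d_]+", tweet.lower())
        total += sum(1 for token in tokens if token in dictionary)
    return total


def is_scorable(tweets, dictionary, min_matches=MIN_MATCHES):
    """True when the user has enough dictionary matches to receive OCEAN scores."""
    return count_dictionary_matches(tweets, dictionary) >= min_matches
```

Users below the threshold are simply left without OCEAN scores in this sketch.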

Details on sex determination

Initial approach: searching for common first names as substrings of the name field

  • Good results. But... what is the sex of the following accounts?

    • Carlo Maria
    • Carlo&Maria
    • Carlotta? Mariano?
    • Ada Merlo is fine... but Lega Nord Padania?
    • Hotel Rosa
  • More structured approach, in phases:

    • 0) filtering: removal of accounts containing "&" or an isolated "e" (Italian for "and"). Carlo&Maria removed
    • 0b) business account filtering: removal of accounts containing hotel, club, fans, circolo, istituto, etc. Hotel Rosa removed
    • 1) classification of common compound names: Carlo Maria -> M
    • 2) name match: search for exact matches of first names: Ada Merlo -> F, but AdaMerlo ignored. Carlotta -> F and Mariano -> M
    • 2b) resolution of the ambiguities left by phase 2: Antonello De Maria -> M
    • 3) split on capital letters: AdaMerlo -> F; LegaNord Padania ignored
    • 4) contains, on long names only: Mariarossetti -> M, but LegaNord ignored
  • Machine learning on the remaining accounts: 75% accuracy
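
Phases 0 to 3 above can be sketched as a cascade of rules. This is a hedged illustration, not the authors' code: the lexicons are tiny hypothetical stand-ins, the "first matching token wins" rule used for phase 2b is an assumption, and phase 4 and the machine-learning fallback are left out.

```python
import re

# Toy lexicons: hypothetical stand-ins for the much larger lists a real run needs.
FIRST_NAMES = {"carlo": "M", "maria": "F", "ada": "F", "carlotta": "F",
               "mariano": "M", "antonello": "M"}
COMPOUND_NAMES = {"carlo maria": "M"}
BUSINESS_WORDS = {"hotel", "club", "fans", "circolo", "istituto"}


def classify_by_name(name):
    """Return 'M', 'F', or None when the account is filtered out or unmatched."""
    tokens = name.lower().split()

    # Phase 0: "&" or an isolated "e" suggests a shared account.
    if "&" in name or "e" in tokens:
        return None
    # Phase 0b: business accounts.
    if BUSINESS_WORDS.intersection(tokens):
        return None
    # Phase 1: common compound names, checked on the first two tokens.
    compound = " ".join(tokens[:2])
    if compound in COMPOUND_NAMES:
        return COMPOUND_NAMES[compound]
    # Phases 2 and 2b: exact token match; the first matching token wins
    # (assumed tie-break, so "Antonello De Maria" resolves to M).
    for token in tokens:
        if token in FIRST_NAMES:
            return FIRST_NAMES[token]
    # Phase 3: split on capital letters, then retry the exact match.
    for part in re.findall(r"[A-Z][^A-Z\s]*", name):
        if part.lower() in FIRST_NAMES:
            return FIRST_NAMES[part.lower()]
    return None
```

Ordering the rules from the most to the least restrictive keeps the cheap, high-precision filters first and leaves only the hard cases to the later phases.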

OCEAN traits

Openness

  • High score:
    • Open to new experiences, ideas, cultures
    • Sensitive, imaginative
    • Curious, tolerant, progressive
  • Low score:
    • Less creative, more authoritarian
    • Conformist, less open to change

Conscientiousness

  • High score:
    • Organised, inclined to planning
    • Trustworthy, coherent
    • Oriented to results and long-term goals
  • Low score:
    • Relaxed, spontaneous, creative
    • Less bound by rules and plans

Extroversion

  • High score:
    • Friendly, active, talkative
    • Sociable, stimulated by the environment
    • Prone to expressing positive emotions
  • Low score:
    • Reserved
    • Enjoying solitude

Agreeableness

  • High score:
    • Trustful
    • Inclined to positive social relations
    • Cooperative, able to adjust to other people’s needs
  • Low score:
    • Assertive
    • Able to communicate unpleasant truths

Neuroticism

  • High score:
    • Inclined to mood swings and negative emotions
    • Stress-prone, nervous, prone to depression
    • Feelings of helplessness and vulnerability
  • Low score:
    • Calm
    • Self-confident
Word embeddings


Significant examples


  • O: Renzi, government, Europe, all, new, seen, done
  • O: (I) want, (I) have to, (I) can
  • C: google, android, iphone, sport, football, yesterday, today, tomorrow, evening
  • C: (I) hope, (I) believe, (I) think, ahahaha, ahaha
  • E: liam, zayn, one direction, heart, love
  • E: Renzi, government, Europe, italians, google, android, iphone
  • A: good, good morning
  • A: (I) want, (I) can, (I) have to, (I) would like
  • N: hate, disgust (how gross), shit, ahahahah, ahahah
  • N: collected_coins, food_controlled, collected_taken, life, history, cinema, theatre, movie

Frequent LIWC-OCEAN matches





No substantial differences among the most frequent LIWC matches for each trait, apart from a few exceptions

Differences can be observed in the less frequent words

Operational simplification: 25-50-25% discretisation





  • Two of the five distributions approximate normal curves
  • Three are asymmetrical (two slightly, one strongly)
  • Distinction by sex not particularly significant
  • Distribution of values on 3 levels, deliberately unbalanced to emphasise the extremes
  • 243 (3^5) possible combinations of the 5 traits on 3 levels: each combination is a psychological profile
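
A minimal sketch of the 25-50-25% discretisation and of the resulting profile space. It assumes the cut points are the 25th and 75th percentiles of each trait's scores; the function name and level labels are hypothetical.

```python
from itertools import product
from statistics import quantiles

LEVELS = ("low", "medium", "high")


def discretise(scores):
    """Map raw trait scores to low/medium/high using the quartile cut points."""
    q1, _, q3 = quantiles(scores, n=4)  # 25th, 50th and 75th percentiles
    return ["low" if s < q1 else "high" if s > q3 else "medium" for s in scores]


# A psychological profile is one level for each of the five OCEAN traits.
PROFILES = list(product(LEVELS, repeat=5))
```

Applied to 100 evenly spread scores, `discretise` labels 25 of them low, 50 medium and 25 high, and `PROFILES` contains 3^5 = 243 tuples.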

Users distribution according to psychological profiles


Real vs Predicted Distribution

Number of extreme traits    Real     Predicted
5                           6.3%     3.1%
4                           18.2%    15.6%
3                           25.9%    31.3%
2                           17.7%    31.3%
1                           15.7%    15.6%
0                           16.0%    3.1%
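
The "Predicted" column is consistent with a simple binomial model, sketched here as an interpretation rather than as the authors' stated method: if the five traits were mutually independent and each fell in an extreme (high or low) level with probability 0.25 + 0.25 = 0.5, the number of extreme traits would follow a Binomial(5, 0.5) distribution.

```python
from math import comb


def predicted_extremes(n_traits=5, p_extreme=0.5):
    """Binomial probability of observing k extreme traits, for k = 0..n_traits."""
    return {
        k: comb(n_traits, k) * p_extreme**k * (1 - p_extreme) ** (n_traits - k)
        for k in range(n_traits + 1)
    }
```

With the default arguments this gives 1/32 = 3.125% for 0 or 5 extremes, 5/32 = 15.625% for 1 or 4, and 10/32 = 31.25% for 2 or 3, matching the slide's figures up to rounding. The observed shares are much heavier at both ends (16.0% and 6.3% against 3.1%), which is what one would expect if the traits are not independent.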

Men vs Women


Number of extreme traits    M        W
5                           6.4%     6.3%
4                           17.9%    18.2%
3                           25.1%    25.9%
2                           17.8%    17.7%
1                           16.2%    15.8%
0                           16.6%    16.0%
  • Traits are not mutually independent
  • No significant differences by sex

OCEAN Geographical distribution



Common beliefs confirmed by the data


  • The area in which Twitter users seem to be the most organized and planning-oriented: the North-east
  • The most extroverted and neurotic area: the South and the Islands
  • The most open-minded area: the Centre
  • Organisation and cooperation havens: Aosta Valley and Trentino-Alto Adige
  • The most extroverted regions: Apulia, Sicily and Calabria
  • The most emotionally unstable regions: Calabria, Friuli-Venezia Giulia, Liguria

Celebrities: atypical psychological profiles






  • Strong psychological characterization: polarized values, not corresponding to the distribution predicted for a random sample (25-50-25%)

  • Identification of similar categories:
    • "Impersonal" categories: emotionally stable, open to new ideas, planning-oriented (e.g. corporate or institutional Twitter accounts)
    • Communication and show business: social network stars stand out for high scores in emotional instability (neuroticism)
    • Hybrid profiles for politicians: a communication style in between the emotional tone of show business and the impersonal one of corporations and institutions
    • Musicians and athletes: trustful and results-oriented, with almost identical distributions of their respective scores

Overview: cluster centroids as stereotypes




  • The Balanced – average scores in all five traits:
    Pope Francis, Valentino Rossi, Emma Marrone, Fiorello, Simona Ventura

  • The Dispersives – prone to mood swings, scarcely cooperative, barely planning-oriented:
    Luca Bizzarri, Gerry Scotti, Mario Balotelli, Giuseppe Cruciani, Vittorio Feltri, Alessandro Gassman, Pierluigi Battista

  • The Focused – results-oriented, emotionally stable, open to new experiences:
    Jovanotti, Ligabue, Barbara D’Urso, Samantha Cristoforetti, Giorgio Chiellini, Gianluigi Buffon

  • The Innovators – Introverted, assertive and open to new ideas:
    Matteo Renzi, Beppe Grillo, Roberto Saviano, Marco Travaglio, Selvaggia Lucarelli, Nichi Vendola

  • The Conservatives – Extroverted, cooperative, scarcely inclined to change:
    Michelle Hunziker, Ezio Greggio, Antonella Clerici, Flavia Pennetta, Maria De Filippi

Stereotypes Interactions



  • Mirror effect for the Innovators, the Focused and the Dispersives: they follow people belonging to their own stereotype
  • Popularity of the Dispersives: almost every stereotype, including the Dispersives themselves, mostly retweets and mentions them
  • Stability of the Conservatives: they predominantly retweet and mention other Conservatives
  • Attraction of diversity: the Balanced mostly follow, retweet and mention the Dispersives
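
Observations of this kind can be derived from a stereotype-by-stereotype interaction table. The sketch below is a hypothetical illustration: the function name, the edge format and the user-to-stereotype mapping are assumptions, and the same routine would be run separately on follow, retweet and mention edges.

```python
from collections import Counter, defaultdict


def interaction_shares(edges, stereotype_of):
    """Row-normalised shares of interactions between stereotypes.

    edges: iterable of (source_user, target_user) pairs.
    stereotype_of: dict mapping a user to the stereotype assigned to them.
    """
    counts = defaultdict(Counter)
    for source, target in edges:
        # Skip interactions involving users without an assigned stereotype.
        if source in stereotype_of and target in stereotype_of:
            counts[stereotype_of[source]][stereotype_of[target]] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }
```

A "mirror effect" then shows up as a row whose largest share sits on its own stereotype, while a stereotype that is popular with everyone shows up as a column with high shares in most rows.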

Further developments


  • Current model validation:
    • Use of Big5 questionnaires within a representative sample of the Italian Twitter network
  • Extension of the LIWC dictionary:
    • Addition of N-grams
    • Inclusion of Emoticons
  • Semantic Analysis
    • Gradient analysis, beyond binarism
    • Assigning weights according to language use
    • Idiomatic expressions

Special thanks to:
Prof. Ferragina, professor of Information Retrieval
and to
Prof. Alessandro Lenci, professor of Computational Linguistics
at the University of Pisa

