OK here goes for today’s paper for the #FLReadingGrp
Bamman, Eisenstein, & Schnoebelen (2014) Gender identity and lexical variation in social media. J. of Sociolinguistics
I’ve read this paper *a lot*. It’s a cracking paper and well worth a thorough read, and a re-read, by anyone interested in language and identity, whether you’re interested in forensic linguistics or not.
These guys are not forensic linguists and I don’t know them at all(!)
I like it because it uses essentially computational, or at least quantitative, methods to demonstrate that gender identities
(and by extension other aspects of identity)
are social and performed – rather than inherent and innate.
Having said it’s a quantitative paper, this shouldn’t put you off if you’re scared of stats or if you’re a principally qualitative researcher.
Understanding the stats methods isn’t required (and you can skim them) but the findings matter a lot.
There’s a single graph you’ve got to ‘get’
but we’ll walk through it slowly...
so…
We start with a statement of intent – Bamman et al are going to use a corpus of tweets from 14,000 individuals on Twitter.
They will use this corpus to show that treating social variables like gender as “immutable and essential categories of people” “gives an oversimplified and misleading picture of how language conveys personal identity”
And they are going to do this using the big data, text-analytics methods that are commonly used to,
for example,
create a model which can be used to predict your gender from some text that you’ve written.
A fairly unsophisticated example of an app that does this is here:
http://www.hackerfactor.com/GenderGuesser.php
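To give a feel for how crude these apps can be, here’s a minimal Python sketch of the word-list-and-weights approach. This is only an illustration of the general idea – the word lists, weights and function name are invented for the example, and it is not the actual Gender Guesser algorithm:

    # Toy word-list "gender guesser": sum hand-picked weights for words
    # associated with each gender and report whichever total is larger.
    # The word lists and weights are invented purely for illustration.
    FEMALE_WORDS = {"love": 2, "omg": 3, "cute": 2, "so": 1}
    MALE_WORDS = {"mate": 2, "football": 2, "beer": 2, "the": 1}

    def guess_gender(text):
        words = text.lower().split()
        female_score = sum(FEMALE_WORDS.get(w, 0) for w in words)
        male_score = sum(MALE_WORDS.get(w, 0) for w in words)
        if female_score == male_score:
            return "unknown"
        return "female" if female_score > male_score else "male"

    print(guess_gender("omg i love this, so cute"))   # -> female

Real systems are fancier, but the core move – count ‘gendered’ words and pick the bigger pile – is exactly the essentialist shortcut Bamman et al are pushing back on.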
Bamman et al review previous work on language and gender.
They critique quantitative work that “disregards theoretical arguments and qualitative
evidence that gender can be enacted through a diversity of styles and stances”
and which disregards individuals
“whose word usage defies aggregate language-gender statistics”
In other words, men who write in a language style that the previous quantitative work would identify as ‘female’ (and vice versa).
They touch on the relatively small-scale / qualitative / sociolinguistic studies that show gender to be “constructed, maintained, and disrupted by linguistic practices”
– these demonstrate that identities, including gender identities, are performed
They contrast these studies with large-scale / corpus / computational studies which often take an essentialist and instrumentalist view of gender
and which attempt to predict gender as a "latent characteristic" of the analysed texts.
such studies assert, for example, that male texts exhibit more information-rich linguistic features,
and female texts exhibit a language style that is more involved and interactional
[waves @ShlomoArgamon whose work figures large in this section]
having reviewed these literatures they describe the corpus
they describe what Twitter is
[this is a century old paper from 2014]
and how they collected tweets:
14000 US users;
actively engaging with their social networks;
identified as male or female by username.
Each of their collection criteria is rigorously defined and defended in a lengthy section, which you can read if you’re feeling critical.
Next they build a computational linguistic model based on lexical markers of gender.
That is to say, they look at which words the male users use more than the female users and vice versa – and use this to see if they can accurately predict which group an individual belongs to.
This is a well-trodden path, taken by those interested in computational profiling or predicting gender.
Bamman et al take a slightly alternative route at some stages but this is essentially replication.
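If it helps to picture what that kind of replication looks like in practice, here’s a minimal, hedged sketch using scikit-learn – a bag-of-words logistic regression trained on a handful of made-up tweets with gender labels taken from usernames. The example data is invented; the real study works with millions of tweets and far more care:

    # Minimal sketch of the standard profiling set-up: learn which words
    # separate the two author groups, then predict group membership for
    # unseen text. Toy data only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    tweets = [
        "off to the football with the lads",
        "omg so excited for tonight xx",
        "pint with the boys later",
        "love my hair today, so happy",
    ]
    labels = ["male", "female", "male", "female"]  # inferred from usernames

    model = make_pipeline(CountVectorizer(), LogisticRegression())
    model.fit(tweets, labels)

    print(model.predict(["going to the match with my mates"]))
    print(model.predict_proba(["going to the match with my mates"]))

The interesting part of the paper isn’t this model itself, but what happens when its confidence is set against the social networks of the people being classified.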
There’s a fairly computational / statistical section explaining their methods,
if you don’t know this stuff you could skim through quickly,
or you could read it carefully and learn stuff.
Bamman et al compare their model, and the words and categories it picks out as important, with previous work like this.
and we get the expected results…
Pronouns, emotion terms, emoticons etc. are associated with female authors.
Negations, swears and taboo words are examples of male terms…
They also do some work to extend this set into less traditional word classes associated with Computer Mediated Communication (CMC) and Twitter, including hashtags and abbreviations,
showing some CMC features to be more male, others more female.
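The comparison behind lists like these is essentially relative frequency: how often does a word class turn up in one group’s tweets versus the other’s? A tiny sketch, with invented mini-corpora (and a CMC feature, hashtags, thrown in), just to show the shape of the calculation:

    # Compare the rate of a word class (pronouns, hashtags, ...) across
    # two groups of tweets. The two "corpora" here are invented examples.
    PRONOUNS = {"i", "you", "she", "he", "we", "they", "me", "my"}

    def rate(tokens, test):
        return sum(1 for t in tokens if test(t)) / len(tokens)

    female_tokens = "omg i love you so much :) #blessed i cant even".split()
    male_tokens = "no way mate that ref was shocking not having it".split()

    for name, tokens in [("female", female_tokens), ("male", male_tokens)]:
        print(name,
              "pronouns:", round(rate(tokens, lambda t: t in PRONOUNS), 2),
              "hashtags:", round(rate(tokens, lambda t: t.startswith("#")), 2))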
So they’ve now got a model that can predict various things including gender…
They test this model in various ways… clustering similar authors by topic, style etc… and examining which clusters are more male or female.
They note that “While most of the clusters are strongly gendered, none are 100% male or female.”
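One way to picture that clustering step: represent each author by their word usage, cluster, then count the gender mix inside each cluster. A rough sketch with scikit-learn and four invented ‘authors’ – not the paper’s actual clustering method, just the general move:

    # Cluster authors by word usage, then inspect each cluster's gender mix.
    # Tiny invented author texts; the paper clusters ~14,000 real users.
    from collections import Counter
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    author_texts = [
        "football beer lads match pint",
        "makeup hair nails love cute",
        "football match goals penalty",
        "love hair omg cute happy",
    ]
    author_genders = ["male", "female", "male", "female"]

    X = TfidfVectorizer().fit_transform(author_texts)
    cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    for c in sorted(set(cluster_ids)):
        mix = Counter(g for g, cid in zip(author_genders, cluster_ids) if cid == c)
        print("cluster", c, dict(mix))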
There’s interesting stuff here but I’m going to skip a little.
This then is where they get a bit cleverer than their predecessors in this kind of work.
Having built the model they then turn to the social networks of their users on Twitter.
They separate users in terms of "homophily" – in other words, how alike you are to your network.
They create groups of male / female users, with mostly male contacts, 50:50 male:female contacts, and mostly female contacts.
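A rough sketch of that split, assuming we already know the gender of each contact (the paper derives this from users’ networks on Twitter; the thresholds below are mine, invented purely to illustrate the bucketing):

    # For each author, compute the share of female contacts in their network
    # and assign them to a bucket. Thresholds are invented for illustration.
    def female_share(contact_genders):
        return sum(1 for g in contact_genders if g == "female") / len(contact_genders)

    def bucket(share):
        if share >= 0.7:
            return "mostly female contacts"
        if share <= 0.3:
            return "mostly male contacts"
        return "roughly 50:50 contacts"

    network = ["female", "female", "male", "female", "female"]
    share = female_share(network)
    print(share, "->", bucket(share))   # 0.8 -> mostly female contacts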
And here comes the payoff -
they find a strong correlation between the use of 'gendered language' and the gender skew of these differentiated social networks.
And here’s the bottom panel of that graph (female authors), which I’ve drawn on.
Compare the two bits I’ve circled for the group of authors with female-associated usernames.
The x-axis shows how well the model is performing for each group.
The y-axis shows what proportion of each woman’s social network is female.
In the top-right circle are those authors whose social network comprises about 80% other female users.
Bottom left are the authors whose networks are about 60% male, 40% female (as per the y-axis).
So this left-hand group is the one the classifier struggles with – they represent the authors the model is least confident about classifying as female.
The bottom decile on the y-axis.
Female Twitter users who interact mostly with male users are hardest to classify as using female language...
One interpretation here could be that if you are a woman with mostly male friends in your social network,
then you use more ‘male’ language.
This is important.
Top right is the group of female Twitter users with a mostly female network.
Here the predictive model performs really well.
These users are most likely to be identified as female with a high level of confidence.
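To make the shape of that relationship concrete, here’s a tiny synthetic sketch: invent some authors whose ‘confidence the author is female’ score loosely tracks the share of women in their network, then measure the correlation. The numbers are fabricated noise, purely to show what ‘a strong correlation between gendered language and network skew’ means mechanically – they are not the paper’s data:

    # Synthetic illustration of the correlation the paper reports between
    # classifier confidence and the gender make-up of an author's network.
    # All numbers are made up; only the calculation is the point.
    # (statistics.correlation needs Python 3.10+)
    import random
    from statistics import correlation

    random.seed(0)
    network_female_share = [random.random() for _ in range(1000)]
    # pretend the model's confidence tracks the network share, plus noise
    model_confidence = [min(max(s + random.gauss(0, 0.15), 0), 1)
                        for s in network_female_share]

    print("Pearson r:", round(correlation(network_female_share, model_confidence), 2))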
This study raises lots of fascinating questions about, and interpretations of,
the methods used here,
the ‘standard’ methods for computational profiling of gender (e.g. as used in the gender guesser app),
and *the nature of gender itself*.
Here’s a random assortment of some of the thoughts this article sparks:
My language style is social not individual
– if, as a man, I speak to women all day, I’ll adopt a more female language style.
A possibly pejorative term for this in e.g. a male hairdresser is ‘camp’ (this is independent of anything about sexuality).
As a sociolinguist this seems obvious from small-scale studies, but Bamman et al have demonstrated it in a computational, big(ish)-data study.
This is pleasantly affirming as a linguist
Again this is strong evidence that gender isn’t within me – it’s not one of the “immutable and essential categories of people”.
My gender is social not individual.
Again Bamman et al have found a new way to confirm what I theorise as a linguist
If computational attempts at gender profiling are inadvertently more focused on social gender than individual gender, this may indeed be a critique of the kind of study @michaelerard pointed to…
Bamman et al point out that such studies should be interpreted with caution
But more than this,
I think we can use this insight in forensic linguistics, beyond gender profiling,
if we can profile to describe the social group someone belongs to…
… for example, can we profile the language use in child abuse chatrooms?
If so, then we might be able to identify that a particular individual participates in these communities of practice.
We can show where they hang out.
Every contact leaves a trace?
Don Foster, of Unabomber fame and much critiqued in the forensic linguistics community, wrote in Author Unknown: “You are what you read”.
Michael Hoey, from a corpus background, talks of lexical priming.
Bamman et al may be empirically demonstrating this insight
and we can use it.
There are so many questions here –
Bamman et al raise a few of them in their discussion.
It’s an absolutely brilliant paper with implications well beyond forensic linguistics
– and absolutely essential reading for every forensic linguist with an interest in authorship.
That’s a wrap. #FLReadingGrp will regroup next week with at least another two papers.
I’ll pick the first one on Sunday night - but as always I’m open to suggestions.