I’m ploughing through the 630 pages of three-way comparison papers for the draft EU data protection Regulation as it currently stands, and I’ve spotted a problem in the definitions that raises some interesting questions.

Currently, the European Parliament’s wording for “Special Categories of Data” (i.e. sensitive personal data) is:

1. The processing of personal data, revealing race or ethnic origin, political opinions, religion or philosophical beliefs, sexual orientation or gender identity, trade-union membership and activities, and the processing of genetic or biometric data or data concerning health or sex life or, administrative sanctions, judgments, criminal or suspected offences, convictions or related security measures shall be prohibited.

Of note: “gender identity” has been included as a special category along with “sexual orientation” . . .

This is a good, logical addition in many ways. Clearly, “gender identity” has been included to protect the fundamental human rights of trans people and to prevent discrimination. However, this laudable inclusion presents an interesting definitional bind considering how we construct (Humanities word) or model (data geek word) gender as a concept.

Gender identity refers to an individual’s sense of self in relation to the concept or social construct of gender — how we understand and model “gender” (generally, do I identify as “male” or do I identify as “female” . . . or do I not feel this binary construct describes me adequately?). There is a specific definitional difference between “sex” and “gender”, but these two things are often conflated and we make a lot of social assumptions related to sex and gender. “Sex” refers to biological or physiological characteristics.

ISO/IEC 5218 has a model for describing the representation of human sexes:

  • 0 = not known,
  • 1 = male,
  • 2 = female,
  • 9 = not applicable.

“Gender”, in contrast, refers to the social or cultural constructs attached to different sexes. These could be expected behaviours, activities, roles, etc. But “gender” is basically what being identified as a particular sex “means” aside from having a particular set of chromosomes or genitalia. When I say I’m a woman, I’m not just stating my biological sex (female), I’m expressing my gender identity, which has a lot more information and culturally-bound expectations attached. A short-hand way of describing the difference is contrasting “male/female” with “masculine/feminine”.

When we determine sex, we tend to also assign gender to match. But if a trans woman says she is a woman, she is stating her gender identity, even if a government birth certificate might use ISO/IEC 5218 and code her biological sex as “1”. Here we come up against one problem with using this model of “sex” and conflating “sex” and “gender” in identifying documents. If you’re using the ISO model on an identification document, are you describing sex or gender?
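One way to avoid that ambiguity in a data model is to hold the two things in separate fields. The sketch below is only an illustration: the `SexCode` values are the ISO/IEC 5218 codes listed above, while `PersonRecord` and its field names are hypothetical and not taken from any standard.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class SexCode(IntEnum):
    """ISO/IEC 5218 codes for the representation of human sexes."""
    NOT_KNOWN = 0
    MALE = 1
    FEMALE = 2
    NOT_APPLICABLE = 9


@dataclass
class PersonRecord:
    # Hypothetical record layout, for illustration only.
    name: str
    # What a document such as a birth certificate recorded.
    recorded_sex: SexCode = SexCode.NOT_KNOWN
    # Self-described gender identity: free text, and optional so that
    # a form is not obliged to collect it.
    gender_identity: Optional[str] = None


# A trans woman whose birth certificate was coded "1".
person = PersonRecord(
    name="Example Person",
    recorded_sex=SexCode.MALE,
    gender_identity="woman",
)
```

Keeping the fields apart means a system never has to guess which of the two a single “M/F” column was meant to describe.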

Other models, still more medically based, have attempted to describe the complexity a bit more clearly.

A WHO model includes the following “genders”:

  1. Male
  2. Female
  3. Unknown: No physical exam.
  4. Indeterminate: Physical exam taken, but inconclusive.
  5. Male once Female: Male gender re-assignment.
  6. Female once male: Female gender re-assignment.
  7. Male been both: Gender re-assignment twice.
  8. Female been both: Gender re-assignment twice.
  9. Hermaphrodite: Gender neutral, with some male and some female physical genital attributes.

This is still a biological/physiological characteristic based model, and it conflates sex with gender. The codes defined would be useful knowledge for a medical examination of a patient when it’s important to know physiological information and whether the patient has had a particular surgical operation or not, but it does not describe “gender identity”, nor is it a particularly useful model for non-medical contexts considering that, in general, enquiring after a person’s surgical status uninvited is extremely invasive and rude.

Trans people often distinguish their gender identity from the gender they were assigned at birth due to physical sex characteristics: “assigned male at birth” / “assigned female at birth”. So “trans” refers to a disjunct between the person’s gender identity and the gender they were assigned at birth. Those whose gender identity is the same as the identity they were assigned at birth are then “cis” gender (thanks to word-play borrowing from chemistry terms). This simplistic binary definition still doesn’t take into account the percentage of the population who were born intersex. (As a side note, Germany has introduced an “indeterminate”, “gender: blank” option for the sex of a child on birth certificates, to enable children born intersex to choose their expressed gender later.)

Therefore, on a medical information level, we have at the bare minimum “assigned male at birth” / “assigned female at birth” / “intersex” or indeterminate. On an experienced gender identity level, we have the binary Male / Female, but we also have non-binary gender identities.1 Social networking sites have recently recognized and are processing this complexity in categorizations: Google now has “male/female/other” options on Google+, while Facebook has something like 50 custom options that people can use to describe their identity.

And again, this may or may not be different from “gender expression” for many reasons. For instance, after months of intrusive media speculation, Bruce Jenner recently came out publicly as a trans woman but still (at least currently) uses the pronoun “he” and the masculine-gendered name Bruce. So, at least some of his forms of public gender expression are “male”-identified, even though, as he says, “Yes, for all intents and purposes, I’m a woman”. (This of course is particular to Jenner’s position in the public eye at present and may change.)

So what does “gender identity” mean in legally defining a special category of personal data? Again, we can probably assume that this inclusion is meant to specify that information identifying whether you are cis or trans gendered should be considered particularly sensitive personal data. But in English, so far, we don’t actually have a clear definitional category that distinguishes the cis/trans/non-binary question from the male/female/non-binary question. (Partially because our language hasn’t fully caught up with recognitions of its limitations, partially perhaps because socially speaking, that question is extremely intrusive and isn’t our business to ask . . . which directly relates to privacy rights and special categories of data.)

We can see the consequences of this lack of definitional clarity in, for instance, the American Psychological Association’s definition of gender identity as “one’s sense of oneself as male, female, or transgender”. “Gender identity” is describing identity on more than one axis: it describes identification as Male / Female / (non-binary) and identification as trans / cis / intersex / non-binary. The consistency (or inconsistency) of gender-type coding in data management is an age-old challenge in more complex environments (medical, law enforcement, etc.). If you’re looking at “gender identity” in the broad sense of whether someone is identified as Male or Female, making “gender identity” a special category of sensitive personal data suddenly makes that category much more stringent: whether you’re male or female is considered sensitive personal data.
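To make the “more than one axis” point concrete, here is a rough sketch that models the two axes as independent fields instead of one flat list. All of the names (`GenderAxis`, `TransStatusAxis`, `GenderIdentity`, `recorded_axes`) are hypothetical, and the value lists simply mirror the ones in the paragraph above; this is not a proposal for how the Regulation should be read.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class GenderAxis(Enum):
    MALE = "male"
    FEMALE = "female"
    NON_BINARY = "non-binary"


class TransStatusAxis(Enum):
    CIS = "cis"
    TRANS = "trans"
    INTERSEX = "intersex"
    NON_BINARY = "non-binary"


@dataclass(frozen=True)
class GenderIdentity:
    # Each axis is optional: a form may ask about one, both, or neither.
    gender: Optional[GenderAxis] = None
    trans_status: Optional[TransStatusAxis] = None


def recorded_axes(identity: GenderIdentity) -> List[str]:
    """Return the names of the axes on which a value was recorded."""
    axes = []
    if identity.gender is not None:
        axes.append("gender")
    if identity.trans_status is not None:
        axes.append("trans_status")
    return axes


# A standard "M/F" form only ever fills in the first axis.
from_standard_form = GenderIdentity(gender=GenderAxis.FEMALE)
```

A single list such as “male, female, or transgender” cannot express a value on both axes at once, whereas two fields can, and they also let a system say which axis it is actually holding data on.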

How will this work out in practice? When Castlebridge Associates teaches, we give a practical definitional shorthand for “sensitive personal data” by mentioning that these categories need an extra duty of care because they have been used prejudicially to discriminate against or even persecute people, and that a good rule of thumb is that if it’s something you couldn’t ask at a job interview, it is probably sensitive personal data.

What about the old binary “M/F” question then? (Noting, of course, that this excludes people who are non-binary.) That’s certainly something generally known at a job interview, but it has been found that people’s expressed binary gender has in fact been used prejudicially to discriminate against people. For instance, a Princeton study found that major symphony orchestras’ hiring practices were in fact gender biased, and having blind auditions in which musicians auditioned behind a screen so their gender was not known increased the probability that women would advance to the next round by fifty percent. So you could argue theoretically that the broader definition of “Gender Identity” does fit the definitions for special categories of sensitive data as well . . . although this would be extremely difficult to regulate.

If it is actually the case that Gender Identity in the broad sense is a special category of data, as could clearly be argued, there go a lot of standard forms. What reason do businesses generally have to obtain this incredibly commonly obtained (special?) category of data? Also, if gender identity can often be easily and reasonably determined by gendered names such as Katherine, what about obtaining names? Gender expression, at least, is a very public socially defining characteristic. If the draft legislation only intends to include the more stringent cis/trans/non-binary sense of “gender identity”, how does one model that definition in exclusion of more general gender identity?

So, the simple drafted inclusion of a two-word necessary category to be considered sensitive personal data in Data Protection legislation turns out to be an utter rabbit hole of examining how we identify, describe, and express gender. What are the practical implications for the quality of data models, the quality of information presentation, the design of forms (hard copy and electronic), and the definition of analytics queries? How do we make sure that data is designed and modelled so that sensitive personal data is adequately protected?

A simpler version of this definitional bind crops up just a few words later, as the definition of special categories of data includes “biometric data” — considering that “biometric data” has been defined to include “facial images” (again for very good reasons, considering that passport photos are indeed considered biometric data). But if facial images are biometric data and therefore a special category of data, whither Facebook? CCTV? GoPro drones? Lifelogging (unless you have the ability to automatically blur out the faces of people who have not consented to be recorded)?

There will be interesting times ahead in the design of data!

1 As a grammar nerd, this is also one of the reasons I prefer 3rd person singular “they” to the clunky “he or she” still preferred by many grammar traditionalists (including someone in the Council of Ministers drafting team).

[Editor’s Note: Since this blog post was first published, Bruce Jenner is now Caitlyn and goes by “she” instead of “he”. Goes to prove our point really about how the data model needs to be able to handle changing states over time!]
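A minimal sketch of what “handling changing states over time” could look like is to keep dated entries instead of overwriting a single value. Everything here is an assumption made for illustration: the class names, the fields, and the placeholder person and dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class NameAndPronounEntry:
    valid_from: date
    name: str
    pronoun: str


@dataclass
class PersonHistory:
    entries: List[NameAndPronounEntry] = field(default_factory=list)

    def record(self, entry: NameAndPronounEntry) -> None:
        """Add an entry and keep the history ordered by start date."""
        self.entries.append(entry)
        self.entries.sort(key=lambda e: e.valid_from)

    def as_of(self, when: date) -> Optional[NameAndPronounEntry]:
        """Return the entry in force on `when`, or None if none applies."""
        current = None
        for entry in self.entries:
            if entry.valid_from <= when:
                current = entry
        return current


# Placeholder data, not a real person or real dates.
history = PersonHistory()
history.record(NameAndPronounEntry(date(2000, 1, 1), "First Name", "he"))
history.record(NameAndPronounEntry(date(2010, 6, 1), "Second Name", "she"))

latest = history.as_of(date(2020, 1, 1))
```

Note that the history itself then reveals a change, so under the reading discussed above the earlier entries would arguably be sensitive personal data in their own right.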