I envy the data mining tools used on shows like Criminal Minds. Penelope Garcia is my spirit animal. On a recent episode of Hawaii 5-0, they used software that could identify an author through samples of their writing. Of course this should be possible – maybe even easy. But in my experience, it’s also very unreliable.

First, you have to feed the monster some writing samples. A considerable number of words is needed to train the beast for accuracy; a 300-word post isn’t nearly enough. A terrorist’s “manifesto” might be, but you’d need other, similar samples before you could identify him. A few random Facebook posts would not, in my opinion, be sufficient for accuracy.

Writing is not DNA, a fingerprint, or an ear. Professional writers are quite capable of writing in a bland “house style” when collaborating on technical work with other writers. We may let our hair down when writing a blog post, or we may pull out all the rhetorical stops, determined to be clear and persuasive. When it comes to a published novel, we may have multiple editors making us sound more polished than we have any right to sound. Could you identify me from two samples: a sonnet and a cereal box blurb? I think not.
What I hope you could do, or AI could do, is to rule out imposters. I hope that my family and friends know my patterns well enough to know what could not ever be me, because I’d gnaw off my fingertips before writing that badly. (Frankly, I trust their human intelligence, in that regard, more than I do any algorithm.) I have asked friends to check up on me, to be sure I have not had a stroke and am not being held hostage, desperately sending a coded S.O.S. through egregiously bad writing. But I think we’re a long way from definitively identifying an author through style and word choice, alone.
I am tickled purple, but I’m not buying the assessment of my writing skills and style from “I Write Like.” It used to be fairly consistent at guessing me to be Cory Doctorow. Tonight, I fed it three different blog posts. The first was Civility, and this is the result I got:
Completely chuffed, I fed it another piece of my writing, I Have No Opinion at All on Jian, and got:
What fun! How could I resist another try? The third time, with Blogging is Like Baton Twirling, I got this:
I decided to quit while I was ahead.
My friend Jonathan Bailey, founder of PlagiarismToday, informed me that there’s another AI site that comes close to doing what they showed on Hawaii 5-0. I can’t help but think of “Shall we play a game?” from the movie WarGames when facing off against Emma. First, you feed Emma a goodly chunk of your writing. This is what constitutes “machine learning,” by the way – not so much “learning” as teaching. Or, “force-feeding the machine so many samples of your personal patterns it can’t help but recognize them in a dark alley.” Then, once Emma’s feeling cocky and thinks she knows you, you face off against the machine – you feed it other samples to see if it really does recognize you. A snippet of your own writing, an excerpt from a Stephen King novel, a little teaser of Oscar Wilde… Emma guesses, and you confirm or deny in a scored, head-to-head competition. Go on, click that and give it a try – see if you can outsmart the machine. (Our current score: Emma 3, Holly 2.) AI is good at pattern recognition and word choice, but it’s still far too easily fooled, and likely to be abused.
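That train-then-quiz loop – feed the machine your patterns, then test it on unseen snippets – is the core of stylometric authorship attribution. I have no idea what Emma actually runs under the hood; this is just a minimal sketch of the general idea, matching an unknown snippet to whichever training profile has the most similar word frequencies. The author names and sample texts are invented for illustration, and real systems would need thousands of words per author.

```python
import math
import re
from collections import Counter

def word_freqs(text):
    """Build a normalized word-frequency vector for a text sample."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    mag = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return dot / mag if mag else 0.0

def guess_author(profiles, snippet):
    """Return the training profile most similar to the unseen snippet."""
    snippet_vec = word_freqs(snippet)
    return max(profiles, key=lambda name: cosine(profiles[name], snippet_vec))

# Hypothetical training corpora -- far too small for real accuracy.
profiles = {
    "holly": word_freqs("I am tickled purple and completely chuffed by the machine."),
    "impostor": word_freqs("The quarterly report shows synergy across all verticals."),
}
print(guess_author(profiles, "Frankly, the machine left me tickled purple."))
```

With corpora this tiny, the “guess” is little more than shared vocabulary – which is exactly why a 300-word post isn’t nearly enough.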
Other interesting AI insights can be found here, with the Personality Insights service – you can see what IBM Watson thinks your writing reveals about your personality. It’s interesting that one sample has me strongly empathetic, and another says I’m a little bit inconsiderate and people might think me “indirect.” That last assessment came from my post about semicolons. I suppose I was a bit unsympathetic to those who dislike them. Too bad. I’m surprised it didn’t say “people might describe you as being uncomfortably direct, perhaps even ‘blunt,’ and ‘overly attached to the Oxford comma.'” The rest of the samples were from my short, fictional posts. Make of that what you will, but I think both assessments are fair, depending on the context.
They’re not wrong. I think I tend to be more outgoing, more extroverted, on Twitter and Facebook than I generally am, face-to-face. My fiction has that score at 31% – just to the left of the center line. It also says I’m a lot less conscientious, maybe even inconsiderate, judging by my fiction – so which is real and which is…fiction? The rest is pretty consistent with this, and they nailed it – my favorite movie is “The Sound of Music,” and I generally can’t stand country music. When shopping for myself, I couldn’t give a rat’s ass what everyone’s buying and recommending on social media, either, and I would really rather not wear some designer’s signature on mine.
See if you can fool the AI when it comes to your gender. These algorithms appear to have improved, with time. This makes sense; in their early days, the Internet population was heavily weighted male. With more and more diverse women (not just the outlier techies and mommy-bloggers, anymore) and with more LGBTQ voices online, they have more wide-ranging samples with which to train the machine.
When I fed Hacker Factor Gender Guesser my post I Have No Opinion at All on Jian, written at the height of the #YesAllWomen movement (a precursor to the #MeToo movement, for those who weren’t paying attention yet), I got the following results:
It couldn’t quite make up its mind, and that’s as it should be. In my opinion, a writer’s gender shouldn’t scream itself from the rafters, regardless of the topic. It should never distract or detract from the writing. The goal is to have a broad appeal to readers, and my blog’s readers, as well as my Twitter followers, are an even mix of male and female.
I like the explanation and notes given for how HFGG arrives at its conclusions (don’t be offended if it calls you a “weak male”), and I think the analysis is good. The fact that it no longer has me pegged, fairly consistently, as “Weak MALE” makes me wonder if its algorithms, its training sets, or my writing have changed over the years. It was implemented in 2006.
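Guessers in this family generally work from weighted keyword lists: certain common words are scored as leaning feminine or masculine, and the totals are compared. The sketch below shows that scoring style in miniature – the specific words and weights here are invented for illustration, not HFGG’s actual tables, which are larger and empirically derived.

```python
import re

# Illustrative weight tables -- NOT the real HFGG data.
FEMININE = {"with": 52, "if": 47, "she": 96, "her": 87, "not": 27}
MASCULINE = {"the": 17, "a": 6, "it": 8, "around": 42, "what": 35}

def gender_score(text):
    """Sum feminine and masculine keyword weights over the text."""
    words = re.findall(r"[a-z]+", text.lower())
    f = sum(FEMININE.get(w, 0) for w in words)
    m = sum(MASCULINE.get(w, 0) for w in words)
    return f, m

f, m = gender_score("She walked around the garden with her dog.")
print("feminine" if f > m else "masculine", f, m)
```

A scheme like this is easy to nudge in either direction just by swapping a few function words, which may be why a deliberately neutral writer can leave it unable to make up its mind.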
Unfortunately, it is sometimes advantageous, as a professional writer, to “pass” for male. One infamous example can be found in Why James Chartrand Wears Women’s Underpants. I’ve heard that many of the most popular writers of gay male erotica are women, and many of the most popular writers of romance novels for women are men. Ultimately, a writer needs enough empathy to “fool” the reader into believing the gender of the characters, without drawing any attention whatsoever to his or her own. I’ve never invented a male persona, but I like to think I could. I haven’t fooled Gender Analyzer v5, yet, but it’s only 88% sure I’m a woman. It’s still more certain about me than HFGG, but I am less certain about its methods. For all I know, my blogs are in one of its training sets, and those consist only of about 11,000 blogs. I’m not sure if that’s whole blogs or individual posts. But it’s not wrong.
I debated whether to mention this or not. Statistics are fun, and often useless. The Age Analyzer seems to think there’s about a 70% chance I’m under 65, but I might be 100. I like to think it’s my extensive vocabulary giving it that impression. I see no explanation of the underlying assumptions, and I have difficulty trusting “AI algorithms” or “machine learning” without a personal glimpse under the hood. It thinks I’m about equally likely to fall into either the 18-25 range or the 51-65 range. (You begin to see just how magical Penelope Garcia’s spot-on insights really are?) Let this serve as a cautionary tale to marketers: on any given day, I relate to just about any generation from age 8 to 100, in almost equal proportions. Odds are good, though, that I’m not under age 8. Why? Doesn’t everyone teach their six-year-olds words like “antidisestablishmentarianism” and “floccinaucinihilipilification”?