Gender Recognition on Dutch Tweets

However, his Twitter network contains mostly female friends.

In addition, the recognition is of course also influenced by our particular selection of authors, as we will see shortly. Rather than using fixed hyperparameters, we let the control shell choose them automatically in a grid search procedure, based on development data.
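The grid search over hyperparameters mentioned above can be illustrated with a minimal sketch. It is not the control shell used in the experiments: the callback `train_and_score`, the parameter names `c` and `epsilon`, and the toy scoring function are all assumptions made for illustration. The only idea carried over from the text is that every combination is tried and the one scoring best on development data is kept.

```python
from itertools import product


def grid_search(param_grid, train_and_score):
    """Try every hyperparameter combination and keep the best one.

    param_grid maps a parameter name to the list of values to try.
    train_and_score is a hypothetical callback that trains a model with the
    given settings and returns its accuracy on the development data.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:  # strict comparison: ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


def toy_dev_accuracy(params):
    """Toy stand-in for 'train, then evaluate on development data'."""
    return 1.0 - abs(params["c"] - 1.0) / 10 - abs(params["epsilon"] - 0.1)


# Hypothetical grid; the parameter names and values are illustrative only.
grid = {"c": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 0.5]}
best, score = grid_search(grid, toy_dev_accuracy)
print(best, score)
```

In a real setup the callback would train one of the learners on the training portion and score it on held-out development data, so that the test data never influences the choice of settings.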

Trigrams: Three adjacent tokens.

For LP, this is by design. This is in accordance with the hypothesis just suggested for the token n-grams, as normalization too brings the character n-grams closer to token unigrams.
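To make the n-gram feature types concrete, the following sketch extracts token n-grams (a trigram being three adjacent tokens) and character n-grams from a short example. The function names and the example tweet are assumptions; in particular, the way character n-grams are delimited here (a sliding window over a single string) is only one possible choice and may differ from the extraction used in the experiments.

```python
def token_ngrams(tokens, n):
    """Return all sequences of n adjacent tokens, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all substrings of n adjacent characters, in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Invented example tweet, already tokenized on whitespace.
tokens = "ik heb een nieuwe game".split()

unigrams = token_ngrams(tokens, 1)
trigrams = token_ngrams(tokens, 3)
char_trigrams = char_ngrams("game", 3)

print(trigrams)
print(char_trigrams)
```

A text shorter than n tokens or characters simply yields no n-grams of that size.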

We first describe the features we used (Section 4).

The most obvious male is author [...], with a resounding [...]. Looking at his texts, we indeed see a prototypical young male Twitter user: [...]

Their highest score when using just text features was [...].

We also varied the recognition features provided to the techniques, using both character and token n-grams.

The conclusion is not so much, however, that humans are also not perfect at guessing age on the basis of language use, but rather that there is a distinction between the biological and the social identity of authors, and language use is more likely to represent the social one (cf. [...]).

The position in the plot represents the relative number of men and women who used the token at least once somewhere in their tweets. Finally, as the use of capitalization and diacritics is quite haphazard in the tweets, the tokenizer strips all words of diacritics and transforms them to lower case.
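The normalization step described above (stripping diacritics and lowercasing) can be sketched with Python's standard `unicodedata` module. This is an illustration under assumptions, not the tokenizer used for the corpus: the function name `normalize_token` is hypothetical, and the sketch handles only diacritics that Unicode represents as combining marks after canonical decomposition.

```python
import unicodedata


def normalize_token(token):
    """Strip diacritics from a token and convert it to lower case."""
    # NFD splits an accented character into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    # Drop the combining marks and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


examples = ["Één", "CAFÉ", "coördinatie", "game"]
normalized = [normalize_token(word) for word in examples]
print(normalized)
```

After this step, spelling variants that differ only in capitalization or accents map to the same token, which reduces the number of distinct features.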

The men, on the other hand, seem to be more interested in computers, leading to important content words like software and game, and correspondingly more determiners and prepositions.

Normalized 1-gram: about [...] features.

In this section, we want to investigate how strong this dependency may have been.

This meant that, if we still wanted to use k-nn, we would have to reduce the dimensionality of our feature vectors. When running the underlying systems 7.

When looking at his tweets, we [...]

Experimental Data and Evaluation

In this section, we first describe the corpus that we used in our experiments (Section 3).

In fact, for all the token n-grams, it would seem that the further one goes away from the unigrams, the worse the accuracy gets.

Confidence scores for gender assignment with regard to the female and male profiles built by SVR on the basis of token unigrams.

And, obviously, it is unknown to which degree the information that is present is true. Another interesting group of authors is formed by the misclassified ones. We achieved the best results, [...] This may support our hypothesis that all feature types are doing more or less the same.