Reviewing the Reviews
In this part, we will think about the social consequences of deploying machine
learning models. Recall the food reviews dataset from the Week 3 Homework. The
code for this question loads weights from a pretrained logistic regression
model that classifies reviews as positive or negative based on a bag-of-words
representation of each review. It then prints out the weights associated with
certain words: ["yummy", "Indian", "Mexican", "Chinese", "European", "gross"].
Using this Colab
Notebook, please investigate the weights of the words that are printed out.
What do you notice? Try out other words by changing the list of words passed
in. Consider trying words from the list below; do the weights match your
expectations of what they should be? ["disgusting", "favourite", "caffeine", "stinks"]
You can also download the code here.
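If you want to poke at the weights outside of Colab, the snippet below is a minimal sketch of the kind of inspection the notebook performs. The file names, the stored vocabulary format, and the use of NumPy here are assumptions for illustration, not the notebook's actual code.

```python
# Minimal sketch of inspecting per-word weights (file and variable names are
# hypothetical, not taken from the actual Colab notebook).
import numpy as np

# Assume the pretrained model stored a weight vector and a vocabulary that
# maps each word to its column index in the bag-of-words representation.
weights = np.load("review_classifier_weights.npy")      # shape: (vocab_size,)
vocab = np.load("vocab.npy", allow_pickle=True).item()  # dict: word -> index

words = ["yummy", "Indian", "Mexican", "Chinese", "European", "gross"]
for word in words:
    idx = vocab.get(word.lower())
    if idx is None:
        print(f"{word}: not in vocabulary")
    else:
        print(f"{word}: weight = {weights[idx]:+.3f}")
```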
3A) How will a classifier with these weights handle “This is Indian food”
relative to “This is European food”? Why does this happen? Consider the
following statistics about the number of positive and negative reviews
containing each of these words.
| Word | Positive reviews | Negative reviews |
| --- | --- | --- |
| yummy | 96 | 28 |
| Indian | 2 | 1 |
| Mexican | 1 | 1 |
| Chinese | 1 | 2 |
| European | 0 | 1 |
| gross | 20 | 69 |
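As a reminder of the mechanics behind 3A: a bag-of-words logistic regression scores a sentence by summing the weights of the words it contains, adding the bias, and passing the result through a sigmoid. The sketch below illustrates this with made-up placeholder weights, not the actual weights from the pretrained model.

```python
# Illustrative only: how a bag-of-words logistic regression scores the two
# sentences in 3A. The weight values below are placeholders.
import math

weights = {"this": 0.0, "is": 0.0, "food": 0.1, "indian": 0.4, "european": -0.3}
bias = 0.0

def score(sentence):
    # Sum the weights of the words present, add the bias, squash with a sigmoid.
    z = bias + sum(weights.get(w, 0.0) for w in sentence.lower().split())
    return 1.0 / (1.0 + math.exp(-z))

print(score("This is Indian food"))    # higher "positive" probability
print(score("This is European food"))  # lower "positive" probability
```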
3B) When is this behavior desirable? When is it not? What if we're helping
Yelp build a restaurant recommendation system? What if we're doing a research
project in which we are trying to understand how different cuisines are
perceived?
3C) What changes could we make to
- the dataset,
- the training procedure, or
- post-processing
to achieve behavior different from what is happening in (a)?
3D) If we made the changes you came up with in (c), how would this affect
performance on
- the training set?
- the test set?
- a different set of food reviews?
3E) Is it important that we used logistic regression in this problem? Or would
the lessons we learned apply to other linear classifiers?
Food for Thought
In considering the questions above, you can see that machine learning doesn’t
end with training a model and showing that it has high accuracy. We also need
to think about whether the behavior of our models is fair. Here we were looking
at food reviews, and we already started to see evidence of bias. Now what if we
were building a classifier that looks at a resume and decides whether or not to
interview someone? Some companies have tried this:
Hiring Algorithms
“After an audit of the algorithm, the resume screening company found that the
algorithm found two factors to be most indicative of job performance: their name
was Jared, and whether they played high school lacrosse.”
Discussion Guide
- A) A neutral statement about Indian food scores higher than the equally neutral statement about European food. Both statements are entirely neutral, so this is not desirable behavior. It is likely due to how infrequently these terms appear in the training data and to existing correlations between each term and review sentiment.
- B) Yelp would probably not want its recommendations to distinguish between cuisines in this way, whereas a researcher studying how different cuisines are perceived would want to see exactly this signal.
- C)
    - Balance the number of positive and negative training examples containing words relating to ethnicity.
    - Remove all words denoting ethnicity, or any other characteristic on which we want to enforce neutrality, from the bag-of-words encoding (see the sketch after this list).
    - In post-processing, flag classifications of statements containing protected words, ignore the model's output for those cases, and handle them manually.
- D) The model would likely do somewhat worse on both the training set and the test set, because we removed a seemingly important correlation; it could do better on a different held-out set of food reviews where reviewers don’t share the same preferences.
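The second mitigation listed under (C), removing words on which we want to enforce neutrality from the bag-of-words encoding, might look something like the sketch below. The word list and the use of scikit-learn's CountVectorizer are assumptions for illustration, not the notebook's actual preprocessing.

```python
# Sketch: drop words we want the model to be neutral about from the
# bag-of-words vocabulary before training, so the classifier can never put
# weight on them. Word list and vectorizer choice are assumptions.
from sklearn.feature_extraction.text import CountVectorizer

PROTECTED_WORDS = {"indian", "mexican", "chinese", "european"}  # extend as needed

reviews = [
    "This Indian food is yummy",
    "This European food is gross",
]

# Passing the protected terms as stop words removes them during tokenization,
# so their columns never exist in the bag-of-words matrix.
vectorizer = CountVectorizer(stop_words=list(PROTECTED_WORDS))
X = vectorizer.fit_transform(reviews)
print(vectorizer.get_feature_names_out())  # protected words are absent
```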