Attention supervision for multiple heads: average or summation? #27

lucasresck · 2023-05-25T22:52:07Z

Dear authors,

The paper states that the final attention supervision loss is the average of the cross-entropy losses computed on the attention weights of each attention head. However, in

HateXplain/Models/bertModels.py

Line 57 in 01d7422

    
           loss_att +=self.lam*masked_cross_entropy(attention_weights,attention_vals,attention_mask)

the per-head losses are accumulated with `+=` and never divided by the number of heads, so this looks like a summation rather than an average.

I am concerned about this detail because of the $\lambda$ hyperparameter. If one implements the loss as an average (as the paper says), the effective $\lambda$ is divided by the number of heads (e.g., 12) relative to this code, so the $\lambda$ values reported in the paper may not reproduce the same behavior.
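To make the concern concrete, here is a minimal sketch of the two variants in plain Python. It is not the repository's code: the function names, the per-head loss values, and the value of `lam` are all made up for illustration, and each entry of `head_losses` stands in for one call to `masked_cross_entropy` on a single head.

```python
import math

def attention_loss_sum(head_losses, lam):
    """Accumulate lam * loss over heads, mirroring the `+=` pattern in the issue."""
    total = 0.0
    for loss in head_losses:
        total += lam * loss
    return total

def attention_loss_mean(head_losses, lam):
    """Average the per-head losses, as the paper's description suggests."""
    return lam * sum(head_losses) / len(head_losses)

# Hypothetical per-head losses for 12 heads (illustrative numbers only).
head_losses = [0.5 + 0.1 * h for h in range(12)]
num_heads = len(head_losses)
lam = 2.0  # hypothetical value, not taken from the paper

summed = attention_loss_sum(head_losses, lam)
averaged = attention_loss_mean(head_losses, lam)

# Averaging with lam * num_heads gives the same value as summing with lam.
rescaled = attention_loss_mean(head_losses, lam * num_heads)

print(f"sum: {summed:.4f}  mean: {averaged:.4f}  mean with lam*H: {rescaled:.4f}")
```

Under these assumptions the summed loss is exactly `num_heads` times the averaged one, so a $\lambda$ tuned with the summation would presumably need to be multiplied by the number of heads to have the same effect in an averaged implementation.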

Did I get it right? I would appreciate any clarification on this matter.

Thank you very much! 😊
