Co-occurrence network distance parameter: Jaccard, Euclid or Cosine? #1108

Hello,

What matters is the size of the unit of analysis, not the size of the dataset as a whole.

If you use small units like "sentences" or "paragraphs", I recommend Jaccard. The Jaccard coefficient only looks at whether the codes co-occurred (1) or not (0) in each unit. The only values are "0" and "1"; counts such as 5 or 10 are not taken into account.
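To make the "only 0 and 1" point concrete, here is a minimal sketch in plain Python. It is a textbook Jaccard calculation, not KH Coder's actual implementation, and the function name `jaccard` and the example counts are made up for illustration. Each list holds how many times a code appeared in each unit (e.g. each sentence).

```python
def jaccard(counts_a, counts_b):
    """Jaccard coefficient between two codes, given per-unit occurrence counts.

    Illustrative sketch only: counts are reduced to present/absent first.
    """
    # Binarize: any count above zero just means "the code occurred in this unit".
    present_a = [c > 0 for c in counts_a]
    present_b = [c > 0 for c in counts_b]
    # Units where both codes occur, and units where at least one occurs.
    both = sum(a and b for a, b in zip(present_a, present_b))
    either = sum(a or b for a, b in zip(present_a, present_b))
    return both / either if either else 0.0


# Hypothetical counts over five sentences.
code_a = [1, 0, 1, 1, 0]
code_b = [1, 1, 0, 1, 0]
# Same pattern of presence as code_a, but with larger counts.
code_a_heavy = [10, 0, 5, 1, 0]

print(jaccard(code_a, code_b))        # 2 shared units out of 4 -> 0.5
print(jaccard(code_a_heavy, code_b))  # still 0.5: the counts 10 and 5 are ignored
```

With short units like sentences, a code rarely occurs more than once per unit anyway, so little is lost by ignoring the counts.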

On the other hand, Cosine is preferable when using larger units, such as a whole speech or piece of writing. Cosine is similar to the correlation coefficient: it distinguishes between 0, 1, 5, and 10. When using larger units, we want to distinguish not only between occurring (1) and not occurring (0), but also between occurring many times (10) and occurring a moderate number of times (5).
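As a rough illustration of how Cosine uses the counts, here is a minimal sketch in plain Python. It is the standard cosine similarity formula, which I am assuming is close enough to show the idea; it is not taken from KH Coder's source code, and the function name `cosine` and the example counts are hypothetical. Each list holds how many times a code appeared in each unit (e.g. each speech).

```python
import math


def cosine(counts_a, counts_b):
    """Cosine similarity between two codes, given per-unit occurrence counts.

    Illustrative sketch only: the raw counts are used, not just presence/absence.
    """
    dot = sum(a * b for a, b in zip(counts_a, counts_b))
    norm_a = math.sqrt(sum(a * a for a in counts_a))
    norm_b = math.sqrt(sum(b * b for b in counts_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a code that never occurs is not similar to anything
    return dot / (norm_a * norm_b)


# Hypothetical counts over three long documents.
code_a = [10, 0, 5]
code_b = [10, 1, 5]  # rises and falls together with code_a
code_c = [1, 1, 1]   # occurs once everywhere, regardless of code_a

# Reduced to present/absent, code_b and code_c would both be [1, 1, 1],
# so a 0/1 measure could not tell them apart. Cosine can:
print(cosine(code_a, code_b) > cosine(code_a, code_c))  # True
```

Here code_b follows code_a's counts closely while code_c does not, and Cosine scores the first pair higher even though both pairs look identical once the counts are reduced to 0 and 1.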

Euclid has similar style…

Answer selected by tchiby
Labels
question Further information is requested
3 participants
This discussion was converted from issue #1107 on June 21, 2023 03:43.