
Fix computation of Word2Vec loss & add loss value to logging string #2135

Open · wants to merge 20 commits into base: develop

Conversation

alreadytaikeune

The current computation of word2vec losses has flaws. I address them in this PR.

This PR rewrites the computation of the loss for both CBOW and SG Word2Vec. The loss that is computed and reported is the running average NCE loss within the epoch. This means that for each new epoch, the counters are reset to 0 and a new average is computed. This was not the case before: the loss was accumulated over the whole training run (but the total_examples used to average it was only incremented during one epoch), which is not very informative, besides being incorrect in the implementation. Moreover, reporting the average loss over an epoch seems closer to the spirit of what the previous code was trying to achieve.

The computation of the word2vec loss was flawed in many ways:

  • race condition on the running_training_loss parameter (updated concurrently in a
    GIL-free portion of the code)
  • As mentioned above, there was an inconsistency between accumulating the loss over the whole run and resetting the averaging factor at each epoch.
  • Even with the points above fixed, the following remains: the dividing factor for the average in the SG case is wrong. The averaging factor for SG should not be the effective words, but the effective samples (a new variable I introduce), because the loss is incremented once for each positive example that is sampled for an effective word (a sketch of the intended bookkeeping follows this list).
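
Below is a minimal sketch (not gensim's actual code; all names are hypothetical) of the bookkeeping this PR aims for: the counters are reset at each epoch, and the average is taken over effective samples rather than effective words.

class LossTracker:
    def __init__(self):
        self.running_training_loss = 0.0
        self.effective_samples = 0

    def start_epoch(self):
        # reset per-epoch counters so the average only covers the current epoch
        self.running_training_loss = 0.0
        self.effective_samples = 0

    def add_batch(self, batch_loss, batch_samples):
        # batch_samples counts positive examples (one per sampled context word in SG),
        # not effective words, so the average uses the right denominator
        self.running_training_loss += batch_loss
        self.effective_samples += batch_samples

    def average_loss(self):
        return self.running_training_loss / max(self.effective_samples, 1)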

Additionally, I add logging of the current value of the loss in the progress logger when compute_loss is set to True, and I add a parameter to the word2vec_standalone script to trigger the reporting of the loss.

As a hint that the implementation is now correct, one can look at the first reported values of the loss, when the word embeddings are still relatively uniformly distributed. In this situation, the expected value of the NCE loss (for Skip-Gram) should be -(N+1)\log(\sigma(0)), where N is the number of negative samples. That is 5.545 for N=7 and 14.556 for N=20, which corresponds to the loss values reported by the current implementation.
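
As a quick numeric check of that formula (a standalone sketch, not part of the PR): since sigmoid(0) = 0.5, the expected initial loss per positive sample is (N+1) * log(2).

import numpy as np

def expected_initial_sg_loss(negative):
    # -(N+1) * log(sigmoid(0)) == (N+1) * log(2), assuming near-uniform embeddings
    return -(negative + 1) * np.log(0.5)

print(round(expected_initial_sg_loss(7), 3))   # 5.545
print(round(expected_initial_sg_loss(20), 3))  # 14.556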

This commit re-writes the computation of the loss for both CBOW and SG
Word2Vec. The loss that is computed and reported is the running average NCE
loss within the epoch. This means that for each new epoch, the counters are
reset to 0, and the new average is computed. This was not the case before,
and the loss was incremented during the whole training, which is not
very informative, besides being incorrect in the implementation (see below).

The computation of the word2vec loss was flawed in many ways:
- race condition on the running_training_loss parameter (updated concurrently in a
  GIL-free portion of the code)
- incorrect dividing factor for the average in the case of SG. The averaging
  factor in the case of SG should not be the effective words, but the effective
  samples (a new variable I introduce), because the loss is incremented as many
  times as there are positive examples that are sampled for an effective word.

Additionally, I add the logging of the current value of the loss in the progress
logger, when compute_loss is set to True, and I add a parameter to the
word2vec_standalone script to trigger the reporting of the loss.
@piskvorky
Owner

piskvorky commented Jul 19, 2018

Thanks @alreadytaikeune. The training loss is a recent addition and may contain errors. What is your use-case for it? How did you find out about this issue?

Re. your PR: what are the performance implications of your changes? Can you post a before/after benchmark?

@alreadytaikeune
Author

Thanks for your reply. I started digging into it because I needed to compare the evolution and values of the loss between two word2vec implementations. Therefore I needed to understand exactly how the loss was computed in gensim, which led me to uncover some issues and propose these changes. Are the issues I pointed out clear to you? I don't know if I expressed them well enough.

As to the benchmark, I haven't really done any, since the changes are very minor and should not impact performance, but I'll do one if needed. Do you have some standard tools for this purpose that I can use, or do I need to write my own?

Finally, it seems the documentation has a hard time building. From what I understand, it is the lines

-loss 
                If present, the loss will be computed and printed during training

in the word2vec_standalone.py that are the cause. The problem is that there isn't any type after loss, because it is only a flag. I am not sure what the sphinx syntax is for this kind of stuff. Any idea?

@alreadytaikeune
Author

@piskvorky Everything seems to be fixed now.

@piskvorky
Owner

piskvorky commented Jul 22, 2018

Thanks! Can you run the benchmark too? Simply train a model on the text9 corpus with/without this PR, using 4 parallel workers and computing / not computing the loss. Then post the four logs (old+loss, old-loss, new+loss, new-loss) to gist and add the links here in a comment.

There'll be some minor variance in the timings, even with the same parameters; that's expected. But we're interested in a sanity check here (<<10% difference).

@alreadytaikeune
Author

alreadytaikeune commented Jul 23, 2018

@piskvorky

I have run the tests you asked for in the following environment:

FROM python:2
RUN mkdir -p /home/gensim_user
WORKDIR /home/gensim_user
RUN apt-get update && apt-get install -y wget build-essential 
RUN pip install numpy scipy six smart_open

Using the script below:

#! /bin/bash

mkdir -p data
cur_dir=$(pwd)

if [ ! -f data/text9 ]; then
  if [ ! -f data/enwik9.zip ]; then
    wget -c http://mattmahoney.net/dc/enwik9.zip -P data
  fi
  if [ ! -f data/enwik9 ]; then
    unzip data/enwik9.zip -d data
  fi
  perl wikifil.pl data/enwik9 > data/text9
fi

GENSIM=/home/gensim_user/gensim
ARGS="-train data/text9 -output /tmp/test -window 5 -negative 5 -threads 4 -min_count 5 -iter 5 -cbow 0"
SCRIPT=/usr/local/lib/python2.7/site-packages/gensim-3.5.0-py2.7-linux-x86_64.egg/gensim/scripts/word2vec_standalone.py
if [ -d $GENSIM ]; then
  rm -r $GENSIM
fi
cd /home/gensim_user/ && git clone https://github.com/RaRe-Technologies/gensim && cd gensim &&\
git checkout 96444a7d0357fc836641ec32c65eb2fbffbee68d && python setup.py install
cd $cur_dir
echo "RUNNING WORD2VEC BASE NO LOSS"
python $SCRIPT $ARGS 2>&1 | tee logs_base_word2vec_no_loss
pip uninstall -y gensim
cd /home/gensim_user/ && git clone https://github.com/alreadytaikeune/gensim && cd gensim && git checkout develop && python setup.py install
cd $cur_dir
echo "RUNNING WORD2VEC NEW NO LOSS"
python $SCRIPT $ARGS 2>&1 |tee logs_new_word2vec_no_loss
echo "RUNNING WORD2VEC NEW+LOSS"
python $SCRIPT $ARGS -loss 2>&1 |tee logs_new_word2vec_loss

The results of the run can be found here:
(Base no loss): https://gist.github.com/alreadytaikeune/554f21e95aa73ed414482d07b2e6314b
(New no loss): https://gist.github.com/alreadytaikeune/a184e88c059e038e2023170bb29a1eb7
(New loss): https://gist.github.com/alreadytaikeune/62ac3ca2da0d5d860404f0993acc81ac

I've only run them once, because it takes a bit of time. If we need more solid proof, we should probably settle on a smaller corpus and average over more runs. Anyway, it seems that computing the loss adds around 3% to the total runtime, while the runtime without loss computation is unaffected (ignoring small variations between runs), as expected.

I haven't run the benchmark on the base code with the loss computation, because if we agree that the previous computation is flawed, I don't really see the relevance. Maybe we should discuss this point more, to double-check my reasoning and implementation.

@piskvorky
Owner

Thanks a lot @alreadytaikeune ! That sounds good to me.

Let's wait for @menshikh-iv 's final verdict & merge, once he gets back from holiday.

Contributor

@menshikh-iv menshikh-iv left a comment

Wow, nice catch @alreadytaikeune 👍 please make the needed fixes and I think we can merge it :)

@@ -124,7 +124,17 @@ def _clear_post_train(self):
raise NotImplementedError()

def _do_train_job(self, data_iterable, job_parameters, thread_private_mem):
"""Train a single batch. Return 2-tuple `(effective word count, total word count)`."""
"""Train a single batch. Return 3-tuple
Contributor

Please follow numpy style for docstrings; more links are available here


for callback in self.callbacks:
callback.on_batch_end(self)

progress_queue.put((len(data_iterable), tally, raw_tally)) # report back progress
# report back progress
progress_queue.put(
Contributor

no need to break the line (we use a 120-character limit for gensim), here and everywhere

@@ -260,6 +281,7 @@ def _log_train_end(self, raw_word_count, trained_word_count, total_elapsed, job_

def _log_epoch_progress(self, progress_queue, job_queue, cur_epoch=0, total_examples=None, total_words=None,
report_delay=1.0):

Contributor

please revert

Contributor

still here

@@ -491,8 +491,8 @@ def train_batch_sg(model, sentences, alpha, _work, compute_loss):
cdef int negative = model.negative
cdef int sample = (model.vocabulary.sample != 0)

cdef int _compute_loss = (1 if compute_loss else 0)
cdef REAL_t _running_training_loss = model.running_training_loss
cdef int _compute_loss = (1 if compute_loss is True else 0)
Contributor

Why? The old variant works correctly (same for cbow)

model.running_training_loss = _running_training_loss
return effective_words
model.running_training_loss += _running_training_loss
return effective_words, effective_words
Contributor

It is worth writing a comment explaining why the same value is returned twice

Author

I added the reason for that in the new docstring.

tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
return tally, self._raw_word_count(sentences)
(tally, effective_samples) = train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
return tally, self._raw_word_count(sentences), effective_samples
Contributor

Need to update the docstrings everywhere when you change the return type

Collaborator

@alreadytaikeune Still not done, please check

@@ -966,6 +989,9 @@ def train(self, sentences=None, input_streams=None, total_examples=None, total_w
total_words=total_words, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)

def get_latest_training_loss(self):
return 0
Contributor

should raise NotImplementedError (the loss feature works only for w2v now, not for d2v/fasttext/etc), or maybe return -1?

Author

I changed it to raise an exception. I think -1 would be confusing, since it is not the value that will be displayed; the value that is displayed is the running training loss divided by the number of samples. That is why I chose to return 0 rather than -1: -1 would make it look like the loss is decreasing as the number of processed words increases.

else:
# Model doesn't implement the samples tallying. We assume
# that the number of samples is the effective words tally. This
# gives coherent outputs with previous implementations
Contributor

please add a TODO here - the if/else should be removed once compute_loss is implemented for all models

cur_epoch + 1, 100.0 * example_count / total_examples, trained_word_count / elapsed,
utils.qsize(job_queue), utils.qsize(progress_queue)
)
if self.compute_loss:
Contributor

can you fully refactor this function please (make it clearer & shorter, not if { if { } else { } } else { if { } else { } } )?

Hint (a sketch follows below):

  • build the logging message template first
  • collect the needed parameters into a list/tuple
  • pass them with *my_parameters
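
For illustration, one possible shape of that refactor (a rough sketch only; the method name and the loss argument are hypothetical, and it assumes the module-level logger and gensim's utils.qsize helper are available, as in base_any2vec.py):

def _log_epoch_progress_line(self, cur_epoch, example_count, total_examples,
                             trained_word_count, elapsed, job_queue, progress_queue, loss=None):
    # build the template and its parameters once, instead of nested if/else branches
    msg = "EPOCH %i - PROGRESS: at %.2f%% examples, %.0f words/s, in_qsize %i, out_qsize %i"
    params = [
        cur_epoch + 1,
        100.0 * example_count / total_examples,
        trained_word_count / elapsed,
        utils.qsize(job_queue),
        utils.qsize(progress_queue),
    ]
    if loss is not None:  # only when compute_loss=True
        msg += ", loss %.3f"
        params.append(loss)
    logger.info(msg, *params)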

@alreadytaikeune
Author

I have made the changes you requested; however, I am a bit surprised by the outcomes of the tests. They all run smoothly on my machine and most of them pass in Travis, except the py35-win one, for a reason that doesn't seem related at all to my changes.

@menshikh-iv menshikh-iv changed the title Fixing the computation of Word2Vec losses. Fix computation of Word2Vec loss & add loss value to logging string Aug 14, 2018
Contributor

@menshikh-iv menshikh-iv left a comment

Good work @alreadytaikeune 👍
I think we need to merge #2127 first (and afterwards the current PR, with an additional fix for the new "mode" of training).

BTW, I checked py35-win - it's not your fault, it's a failure of a non-deterministic test; I re-ran the build.

@@ -260,6 +281,7 @@ def _log_train_end(self, raw_word_count, trained_word_count, total_elapsed, job_

def _log_epoch_progress(self, progress_queue, job_queue, cur_epoch=0, total_examples=None, total_words=None,
report_delay=1.0):

Contributor

still here

)
div = total_words

msg = "EPOCH %i - PROGRESS: at %.2f%% examples, %.0f words/s, in_qsize %i, out_qsize %i"
Contributor

This can be PROGRESS: at %.2f%% words (not only examples)

Author

You are right. Good catch. I'll fix it.

if self.sg:
tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss)
(tally, effective_samples) = train_batch_sg(self, sentences, alpha, work, self.compute_loss)
Contributor

() not needed here (and same below)

@menshikh-iv
Contributor

@alreadytaikeune thanks, looks good!

Next steps:

@alreadytaikeune I'll ping you when we're done with #2127.

@menshikh-iv
Contributor

menshikh-iv commented Sep 24, 2018

Hi @alreadytaikeune, we finally merged & released #2127, please do the steps mentioned in #2135 (comment) and I'll merge your PR too 🌟

In word2vec_inner.pyx, functions now use the new config object while still returning the number of samples.

In base_any2vec, logging includes the new loss values (the addition of this branch).
@alreadytaikeune
Author

Hi, I've completed the steps you mentioned, but when running the tests it seems to me that one is hanging, in the doc2vec test set. I don't really have a clue why; it wasn't the case before the merge.

@menshikh-iv
Contributor

menshikh-iv commented Oct 2, 2018

@alreadytaikeune I see no commits after my comment & a merge conflict in the PR; please make the needed changes (or maybe you forgot to push your changes?).
About any test hangs - please reproduce them here first.

@alreadytaikeune
Author

alreadytaikeune commented Oct 18, 2018

Hey, sorry for the delay, I was on holiday. Yes, I hadn't pushed my changes; I wanted to test them locally first. But here they are.

@alreadytaikeune
Author

Ah, it seems the CI server experiences the same stalling issues as I do. I am not sure when I will have time to investigate that, though...

@menshikh-iv
Contributor

menshikh-iv commented Jan 9, 2019

@alreadytaikeune CI issues fixed, hang reproduced in Travis, do you have time to fix & finalize?

@mpenkov
Collaborator

mpenkov commented Feb 21, 2019

Don't worry about the conflicts. I'll take care of them during the merge: they are in autogenerated code, so it's easy to resolve.

Regarding tests: I think the actual loss output may be difficult to test (and not worth it) because it is emitted via logs. What is worth unit testing is the new behavior you added to the training functions (counting and returning the effective number of samples). Can you add tests to cover that new behavior?

For example:

  • word2vec.py:train_batch_sg
  • word2vec.py:train_batch_cbow
  • word2vec.py:score_sentence_sg
  • word2vec.py:_do_train_job
  • ... and their counterparts in the Cython code.

You can use the same testing logic and data for Python/Cython (they should be 100% compatible).
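
For instance, a rough sketch of such a test might look like the following (hypothetical test name; it assumes this PR's _do_train_job returns a 3-tuple (tally, raw_word_count, effective_samples) and uses the internal _get_thread_working_mem helper, which may change between gensim versions):

import unittest
from gensim.models.word2vec import Word2Vec

SENTENCES = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
] * 10

class TestEffectiveSamples(unittest.TestCase):
    def test_sg_samples_at_least_words(self):
        model = Word2Vec(SENTENCES, min_count=1, sg=1, negative=5, compute_loss=True)
        work = model._get_thread_working_mem()  # (work, neu1) thread-private buffers
        tally, raw_words, samples = model._do_train_job(SENTENCES, model.alpha, work)
        # in SG, each effective word yields one or more positive (centre, context) samples
        self.assertGreaterEqual(samples, tally)
        self.assertGreater(model.get_latest_training_loss(), 0.0)

if __name__ == "__main__":
    unittest.main()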

@mpenkov
Collaborator

mpenkov commented Apr 28, 2019

@alreadytaikeune What is the status of this PR? Are you able to finish it?

@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Apr 28, 2019
@tridelt

tridelt commented Aug 19, 2019

@alreadytaikeune What do you think of the current status of this PR?
Best regards from Austria

@alreadytaikeune
Author

alreadytaikeune commented Aug 19, 2019

Hello @mpenkov and @tridelt. Sorry for not being more involved in this PR. I've had plenty on my plate in my job, and in my personal life as well, and didn't feel like spending free/family time working on tests. Although I'm not comfortable with leaving this as is either. I'll do my best in the coming days to address your comments. Sorry for not handling this earlier. Best.

@alreadytaikeune
Author

Hello, @mpenkov. I found a bit of time to throw myself back into the PR. I thought I would rebase my develop branch on the current one and work on the tests from there, but it was not a good idea... anyway, I've restored my branch as it was before.

Just one thing: when rebasing, it seemed to me that all the non-cython implementations of word2vec were removed. Is that correct? Regarding your last comment about checks, I imagine I should now only care about writing tests for the cython functions.

Sorry about the mess...

@gojomo
Collaborator

gojomo commented Nov 12, 2019

@alreadytaikeune Yes, the non-cython variants of many of the algorithms have been eliminated to reduce duplication/maintenance effort – so work going forward need only concern itself with the cython versions!

@alreadytaikeune
Author

Alright, thank you @gojomo. I also see that word2vec.py became a script as well as a module, and it does the same thing as scripts/word2vec_standalone.py, the file I had originally modified. Which one should I focus on?

@gojomo
Collaborator

gojomo commented Nov 13, 2019

AFAIK, word2vec.py has always implemented a main & been runnable as a script. Unsure of the reason for the (partial) overlap with scripts/word2vec_standalone.py – & neither version has been updated recently to keep up with newer Word2Vec initialization options (like ns_exponent or corpus_file). I suspect gensim usage is far more via imported code (vs command-line invocation), and new functions only get added to the main paths when someone specifically needs them. @piskvorky or @mpenkov would have to comment on whether both, or just one or the other, should be updated going forward.

@piskvorky
Owner

piskvorky commented Nov 13, 2019

@gojomo is right. I'm not aware of any changes to the CLI version of word2vec in Gensim. IIRC, the command-line version was created to match the CLI of the original C tool by Google, including the CLI parameter names and defaults. I don't think it's been updated since.

I didn't remember what word2vec_standalone.py was. I tracked that script down to PR #593, which doesn't talk about its motivation. But a standalone script makes more sense to me, and is also cleaner (for example avoids pickling issues when run as __main__).

So my preference would be to get rid of __main__ in word2vec.py, and keep word2vec_standalone.py. We definitely don't need both.

@alreadytaikeune
Author

Thank you for the clarification. So, going forward, I'll just write tests for the cython-only implementations (which is a bit confusing, since I'm still working on a version with both implementations, which is why I wanted to rebase in the first place), and leave it to you to clarify what should be kept and what should be dropped between word2vec.py and word2vec_standalone.py.

@gojomo gojomo mentioned this pull request Dec 27, 2019
@mpenkov mpenkov added this to In progress in gensim-4.0 via automation Jun 10, 2020
@geopapa11

I was wondering about the status of this PR. Will it be fixed in gensim 4.0.0? Does the erroneous loss computation happen only in Word2Vec, or in other classes too (e.g., FastText)? Thanks in advance

@piskvorky
Owner

Yes, ideally we want to clean this up for 4.0 too. @alreadytaikeune will you be able to finish this PR?

Anyone else helping is welcome.

@geopapa11

Thank you very much @piskvorky! Does the wrong loss computation affect other classes too (e.g., FastText)?

@piskvorky
Owner

I don't remember, but I think the loss tallying was idiosyncratic – different for each class, missing from some models.

@geopapa11

Thanks @piskvorky, that's very helpful. On an unrelated topic, do you plan on (or see value in) supporting different initialization schemes for the embeddings? Right now you only support uniform initialization, I think within [-0.5*embedding_size, 0.5*embedding_size]. I have created other types of initializations as well ('glorot_uniform', 'lecun_uniform', 'he_uniform', 'glorot_normal', 'lecun_normal', 'he_normal'). If you find it might add something to the tooling, I can give some ideas in a different PR.

@piskvorky
Owner

Possibly; @Witiko did some work on weight initialization. That's off-topic for this ticket though.

@gojomo
Collaborator

gojomo commented Dec 14, 2020

FYI #2617 is a sort of 'umbrella issue' mentioning/referencing all the tangled loss issues. Ideally we'd want our measure/reporting to be similar to the output in Facebook's FastText logging. (Right now Gensim's FastText hasn't ever tracked loss, and whatever Gensim's Word2Vec is doing doesn't report similar tallies to FB FastText-in-plain-Word2Vec mode - so there's some discrepancy to be tracked down & rationalized.)

Regarding alternate weight initializations, it shouldn't be hard to externally re-initialize (or include an option to not initialize, for when a user is planning to do this themselves) between vocabulary discovery and training. We could add some standard options if there's a strong case for how they improve things. (Do they noticeably hasten convergence?)
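
As an illustration of that external re-initialization (a rough sketch using the gensim 3.x API; the Glorot-style limit here is just an example, not something gensim provides):

import numpy as np
from gensim.models import Word2Vec

sentences = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

model = Word2Vec(size=100, min_count=1)   # no corpus passed yet, so no training happens
model.build_vocab(sentences)              # vocabulary discovery only

# overwrite the default uniform init with a custom scheme before training
fan_in, fan_out = len(model.wv.vocab), model.wv.vector_size
limit = np.sqrt(6.0 / (fan_in + fan_out))
model.wv.vectors = np.random.uniform(
    -limit, limit, model.wv.vectors.shape).astype(np.float32)

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)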

@geopapa11

FYI #2617 is a sort of 'umbrella issue' mentioned/referencing all the tangled loss-issues. Ideally we'd want our measure/reporting to be similar to the output in Facebook's FastText logging. (Right now Gensim's FastText hasn't ever tracked loss, and whatever Gensim's Word2Vec is doing doesn't report similar tallies to FB FastText-in-plain-Word2Vec mode - so there's some discrepancy to be tracked-down & rationalized.)

Regarding this, how are you updating the embedding weights on every iteration if you don't compute the loss? Do you use some kind of ready-made formula for the gradient that's applied directly to update the embedding weights?

Regarding alternate weight initializations, it shouldn't be hard to externally re-initialize (or include an option to not initialize for when a user is planning to do this themselves) between vocabulary-discovery and training. We could add some standard options if there's a strong case for how they improve things. (Do they noticeably hasten convergence?)

Not sure if they improve or change convergence in a meaningful way. I just mentioned it because embeddings as defined in other libraries (e.g., TensorFlow) offer different initialization schemes:
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding (see embeddings_initializer argument)
https://www.tensorflow.org/api_docs/python/tf/keras/initializers

Also, your initialization, which is more like [-0.5*fan_out, 0.5*fan_out], does not really correspond to one of the better-known ones (He, Glorot, LeCun), but maybe that's ok! :-)

@Witiko
Contributor

Witiko commented Dec 17, 2020

Possibly; @Witiko did some work on weight initialization. That's out of topic for this ticket though.

@piskvorky @geopapa11 I evaluated different initializations of the positional model of Mikolov et al. (2018) and Grave et al. (2018), since no reference implementation existed and it's unclear how the weights should be initialized; see pages 59 through 62 in the RASLAN 2020 proceedings.

@mpenkov mpenkov removed this from In progress in gensim-4.0 Feb 25, 2021
@nicocheh

I was wondering about the status of this PR. Is there any news? @mpenkov

@mpenkov
Collaborator

mpenkov commented Aug 25, 2022

Not that I'm aware of. I think we're waiting for the original contributor to finish the PR: #2135 (comment)
