Maximum Vocabulary Size #42

Open
vikram-gupta opened this issue Jul 15, 2016 · 3 comments

vikram-gupta commented Jul 15, 2016

Hi @macournoyer

We are currently replacing words with the "unknown" token once the number of unique words we have encountered reaches the vocab size:

if self.maxVocabSize > 0 and self.wordsCount >= self.maxVocabSize then
    -- We've reached the maximum size for the vocab. Replace w/ unknown token
    return self.unknownToken
end

I think we might get better results if we replaced words based on their frequency in the corpus rather than on their order of occurrence: rank the words by frequency and map the least frequent ones to the unknown token until we are within the vocab size (see the sketch below). What do you think?
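
To make it concrete, here is a minimal sketch of the pruning I have in mind. All the names here (trainingLines, maxVocabSize, unknownToken) are placeholders for illustration, not the actual neuralconvo fields:

-- Hypothetical sketch: frequency-based vocabulary pruning.
-- Pass 1: count word frequencies over the training lines only.
local wordFreq = {}
for _, line in ipairs(trainingLines) do
    for word in line:gmatch("%S+") do
        wordFreq[word] = (wordFreq[word] or 0) + 1
    end
end

-- Pass 2: keep the maxVocabSize most frequent words; everything
-- else maps to the unknown token.
local words = {}
for word in pairs(wordFreq) do table.insert(words, word) end
table.sort(words, function(a, b) return wordFreq[a] > wordFreq[b] end)

local keep = {}
for i = 1, math.min(maxVocabSize, #words) do
    keep[words[i]] = true
end

local function tokenOf(word)
    return keep[word] and word or unknownToken
end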

This might be one reason for the inferior results we see when we restrict the vocabulary.

macournoyer (Owner) commented

Yes, it might be the reason. But restricting based on frequency ends up being a lot more difficult to implement, since you have to rewrite all the examples: word IDs change when you remove a word from the vocab.
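
Roughly, that rewrite step would look like the sketch below. The names (examples, oldId2newId, unknownId) are made up for illustration; assume each example is a 1-D tensor of word IDs:

-- Hypothetical sketch: rewriting stored examples after pruning.
-- oldId2newId maps a surviving word's old ID to its new, compacted ID;
-- pruned words fall back to the unknown token's new ID.
local function remapExample(ids, oldId2newId, unknownId)
    for i = 1, ids:size(1) do
        ids[i] = oldId2newId[ids[i]] or unknownId
    end
end

-- Every example in the dataset has to be touched once:
for _, ids in ipairs(examples) do
    remapExample(ids, oldId2newId, unknownId)
end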


vikram-gupta commented Jul 15, 2016

@macournoyer

I think we could take a pass over the dataset (only the lines used for training) to count word frequencies, and then map the least frequent words to the unknown token until we hit the vocabulary size.

I think @chenb67 has already done it in the PR.

We would not have to rewrite the examples if the dataset size and vocabulary size were the same; otherwise, we would have to!
If it improves the accuracy, it is worthwhile, I guess :)


mtanana commented Jul 16, 2016

Hey guys - I have a fork that does that: TorchNeuralConvo

And that's basically how I did it (order the vocab by count and then replace).

But there are some tricks if you want to stay within the LuaJIT memory limits while still loading huge files.
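
For anyone curious, one common workaround (a sketch only, not necessarily exactly what the fork does): stream the corpus line by line with io.lines instead of reading the whole file, and keep the encoded IDs in torch tensors, whose storage is C-allocated and so doesn't count against LuaJIT's garbage-collected heap (limited to roughly 1-2 GB on most builds). The names word2id and unknownId below are placeholders:

-- Hypothetical sketch: encoding a huge corpus within LuaJIT's limits.
-- io.lines streams one line at a time, and torch tensor storage lives
-- outside the LuaJIT GC heap.
local function encodeCorpus(path, word2id, unknownId)
    local encoded = {}
    for line in io.lines(path) do
        local ids = {}
        for word in line:gmatch("%S+") do
            table.insert(ids, word2id[word] or unknownId)
        end
        if #ids > 0 then
            table.insert(encoded, torch.IntTensor(ids))
        end
    end
    return encoded
end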
