"comparison of Float with NaN failed"...and GSL is Installed #8

Open
tra38 opened this issue Sep 2, 2017 · 2 comments

Comments

@tra38
Owner

tra38 commented Sep 2, 2017

While trying to fix an unrelated issue, I experimented with the code from #5, but using ZombieWriter::MachineLearning rather than ZombieWriter::Randomization.

require 'zombie_writer'

zombie = ZombieWriter::MachineLearning.new

zombie.add_string(content: "This is filler text that I invented.This is also a paragraph that could be used")
zombie.add_string(content: "This post is amazing. Please take a look")
zombie.add_string(content: "For all sports fan, you must watch this video. Hey you have to check this out.")

array = zombie.generate_articles

p array

#/Users/tariqali/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/kmeans-clusterer-0.11.4/lib/kmeans-clusterer.rb:237:in `sort_by': comparison of Float with NaN failed (ArgumentError)

The culprit is the third string. Classifier-Reborn computed its lsi_norm as a vector of NaNs:

 "For all sports fan, you must watch this video. Hey you have to check this out.\n"=>
  #<ClassifierReborn::ContentNode:0x007fdec4b25ae8
   @categories=[],
   @lsi_norm=GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ],
   @lsi_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
   @raw_norm=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
   @raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
   @word_hash={:for=>1, :sport=>1, :fan=>1, :must=>1, :watch=>1, :video=>1, :hei=>1, :check=>1, :out=>1}>}
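One possible reading of the dump above: the displayed entries of @lsi_vector are all zero, and if the whole vector is zero, dividing it by its own length yields NaN in every slot. The sketch below reproduces both halves of the failure in plain Ruby. The `normalize` and `sort_error` helpers are hypothetical stand-ins written for this illustration; they are not classifier-reborn's or kmeans-clusterer's actual code, and plain arrays are used in place of GSL::Vector.

```ruby
# Hypothetical stand-in for a normalization step: divide each component
# by the vector's Euclidean length. Not classifier-reborn's real code.
def normalize(vector)
  length = Math.sqrt(vector.sum { |x| x * x })
  vector.map { |x| x / length }
end

# Returns the error message raised while sorting by score,
# or nil when the sort succeeds.
def sort_error(scored_items)
  scored_items.sort_by { |score, _label| score }
  nil
rescue ArgumentError => e
  e.message
end

# A zero vector has length 0.0, and 0.0 / 0.0 is NaN in IEEE 754 arithmetic.
zero_norm = normalize([0.0, 0.0, 0.0])
puts zero_norm.all?(&:nan?) # prints true

# NaN cannot be ordered against any Float, so sorting by a NaN key raises.
message = sort_error([[0.5, :a], [Float::NAN, :b], [0.2, :c]])
puts message.nil? ? "sorted without error" : "sort failed: #{message}"
```

If this reading is right, the ArgumentError from `sort_by` is only a symptom; the NaNs are produced earlier, at the point where the zero vector is normalized.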

Changing the third string slightly resolves the issue.

zombie = ZombieWriter::MachineLearning.new

zombie.add_string(content: "This is filler text that I invented.This is also a paragraph that could be used")
zombie.add_string(content: "This post is amazing. Please take a look")
zombie.add_string(content: "For all sports fan, you must watch this video. Hey you have to check this out. Filler, filler, filler.")

array = zombie.generate_articles

p array
 "For all sports fan, you must watch this video. Hey you have to check this out. Filler, filler, filler.\n"=>
  #<ClassifierReborn::ContentNode:0x007fd931432fd0
   @categories=[],
   @lsi_norm=GSL::Vector
[ 6.205e-01 1.432e-01 1.432e-01 1.432e-01 1.432e-01 1.432e-01 0.000e+00 ... ],
   @lsi_vector=GSL::Vector
[ 6.593e-01 1.522e-01 1.522e-01 1.522e-01 1.522e-01 1.522e-01 0.000e+00 ... ],
   @raw_norm=GSL::Vector
[ 5.547e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
   @raw_vector=GSL::Vector
[ 6.272e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
   @word_hash={:for=>1, :sport=>1, :fan=>1, :must=>1, :watch=>1, :video=>1, :hei=>1, :check=>1, :out=>1, :filler=>3}>}

But why? Both scenarios have a populated @word_hash, so it isn't clear why one string ended up with a vector of NaNs and the other didn't. Is it because, in the second scenario, the third string shares a word ("filler") with the first string? I will have to research this issue more carefully and decide how to handle this error gracefully.
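The overlap hypothesis can at least be checked against the three sample strings. The sketch below is a rough approximation only: the `STOPWORDS` list, `content_words`, and `shared_words` are hypothetical helpers made up for this check, and they do not reproduce classifier-reborn's own stemming or stopword handling.

```ruby
require 'set'

# Hand-picked stopwords for this illustration only (an assumption);
# classifier-reborn uses its own stopword list and a stemmer.
STOPWORDS = Set.new(%w[this is a i that you all have to]).freeze

# Lowercase the text, split it into alphabetic tokens, drop stopwords.
def content_words(text)
  text.downcase.scan(/[a-z]+/).reject { |word| STOPWORDS.include?(word) }.uniq
end

# Words that appear in both texts after the filtering above.
def shared_words(text_a, text_b)
  content_words(text_a) & content_words(text_b)
end

first    = "This is filler text that I invented.This is also a paragraph that could be used"
second   = "This post is amazing. Please take a look"
original = "For all sports fan, you must watch this video. Hey you have to check this out."
modified = original + " Filler, filler, filler."

p shared_words(first, original)  # => []
p shared_words(second, original) # => []
p shared_words(first, modified)  # => ["filler"]
```

Under these assumptions, the original third string has no content words in common with the other two, while the modified one shares "filler" with the first. That is consistent with the hypothesis but does not prove it.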

This problem is unlikely to come up in real-world use: if you add long passages to ZombieWriter, there are bound to be a few overlapping words that classifier-reborn can detect. But it could happen, which is why I need to figure out how to fix it.
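One way to handle this gracefully, offered only as a sketch and not as the fix that ZombieWriter adopted, is to set aside any paragraph whose vector contains NaN before the vectors reach the clusterer. `finite_vector?` and `split_clusterable` are hypothetical names, and plain arrays stand in for GSL::Vector here.

```ruby
# True when every component is a finite number (no NaN, no Infinity).
def finite_vector?(vector)
  vector.all? { |x| x.respond_to?(:finite?) && x.finite? }
end

# Split a { text => vector } hash into entries that are safe to
# cluster and entries that are not. Hypothetical helper.
def split_clusterable(vectors_by_text)
  usable, unusable = vectors_by_text.partition do |_text, vector|
    finite_vector?(vector)
  end
  [usable.to_h, unusable.to_h]
end

vectors = {
  "first paragraph"  => [0.62, 0.14, 0.0],
  "second paragraph" => [Float::NAN, Float::NAN, Float::NAN]
}

usable, unusable = split_clusterable(vectors)
p usable.keys   # => ["first paragraph"]
p unusable.keys # => ["second paragraph"]
```

What to do with the set-aside paragraphs (drop them, or give each one its own article) is a separate design decision that this sketch leaves open.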

@mahaina

mahaina commented Jul 6, 2018

Same problem here. Hope to see an answer.

@tra38
Owner Author

tra38 commented Jul 15, 2018

Hi @mahaina. I'll see if I can work on this issue, probably in the next two weeks. If you have a sample corpus where this error occurs reliably, please send it over so that I can use it as test material (it's not strictly necessary, since I can work with the existing corpus in the OP). Right now I'm using the three sentences from the OP, which reproduce the error reliably, but your corpus might have some unique characteristics as well.
