Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert quotes in input files that can foil Boost prog opts #644

Merged
merged 1 commit into from
Sep 29, 2017

Conversation

briadam
Copy link
Contributor

@briadam briadam commented Sep 29, 2017

(Merging this will leave one failure in test_intercomm0, at least on my system, when BPO is enabled. My understanding, subject to stray cosmic rays, is that this test is expected to be stochastic.)

Partially addresses #642 by partially reverting
"Make boost po opt-in, convert all test input files to getpot format"

This reverts the input files modified in
commit 00477bd.

Conflicts:
configure.ac

Partially addresses libqueso#642 by partially reverting
  "Make boost po opt-in, convert all test input files to getpot format"

This reverts the input files modified in
  commit 00477bd.

Conflicts:
	configure.ac
@briadam
Copy link
Contributor Author

briadam commented Sep 29, 2017

Hmm, Travis reports test_uqGaussianVectorRVClass fails. Which for me is a stochastic fail that I haven't seen in a while. Is this expected?

@briadam
Copy link
Contributor Author

briadam commented Sep 29, 2017

Well it appears the seed for that test is taken from /dev/random, so I guess unlikely that the run would be identical, but the stochsatic test appears to have a tolerance such that it should pass cross-platform?

@roystgnr
Copy link
Member

Sadly, there is no such thing as "a tolerance such that it should pass cross-platform", only "a tolerance such that it should pass cross-platform N% of the time". Hence the pass with clang and failure with gcc this time.

I don't know offhand how to "kick" a Travis check other than by doing an interactive rebase, minor commit message change, and force push. You can try that if you like.

I'll build and test this branch on a couple of my own machines, and if it passes on both then I'll merge.

@briadam
Copy link
Contributor Author

briadam commented Sep 29, 2017

This worked to restart it, once I signed into Travis using my GitHub account: https://stackoverflow.com/questions/17606874/trigger-a-travis-ci-rebuild-without-pushing-a-commit

@roystgnr
Copy link
Member

There might actually be something seriously wrong here. The failure seems outright deterministic for me on one system.

$ wc $(find -name sfp_gravity_p_seq.m)
  50100   50300 1203450 ./test/output_test_intercomm0_gravity_2/sfp_gravity_p_seq.m
  51102   51306 1227519 ./test/output_test_intercomm0_gravity_1/sfp_gravity_p_seq.m

They mostly match in the diff, but the latter file has a thousand samples which the former file doesn't.

What's especially weird is that they have TEN TIMES as many samples as the same runs on a different system, and neither matches what I'd have expected to see - doesn't the input file ask for ip_mh_rawChain_size of 2e4, not 5e3 or 5e4??

@briadam
Copy link
Contributor Author

briadam commented Sep 29, 2017

Yes, I saw that behavior as well, though I think I saw it with test_intercomm0, not test_uqGaussianVectorRVClass. The latter passes in all my local builds. I also thought something is actually wrong, given that it consistently PASSed with GetPot and FAILed with BPO. However, I stopped digging when you told me it was an inconsistent regression test. I'm probably not qualified to root cause it, but can attempt to dig back in if needed, especially given some pointers on where to look.

@codecov
Copy link

codecov bot commented Sep 29, 2017

Codecov Report

Merging #644 into dev will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #644      +/-   ##
==========================================
- Coverage   74.87%   74.84%   -0.03%     
==========================================
  Files         312      312              
  Lines       23789    23789              
==========================================
- Hits        17812    17805       -7     
- Misses       5977     5984       +7
Impacted Files Coverage Δ
src/contrib/getpot/getpot.h 33.23% <0%> (-1.54%) ⬇️
src/stats/src/MetropolisHastingsSGOptions.C 68.61% <0%> (+0.72%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51eeaea...6391f8b. Read the comment docs.

@roystgnr
Copy link
Member

Sorry, didn't mean for any confusion, it actually is test_intercomm0 that's failing (on one system) for me. test_uqGaussianVectorRVClass is working fine with every test configuration I throw at it.

@briadam
Copy link
Contributor Author

briadam commented Sep 29, 2017

Not sure what your team development practices are w.r.t. this, but it seems that test_intercomm0 w/BPO is a pre-existing instability on dev and not affected by this PR, so we could merge it. Would you rather track and resolve in this issue #642 / PR #644, #640, or should I create a new issue? I know it sucks to have to keep track of second-order failures like "this PR doesn't make the existing failures any worse," so I'm happy to handle however you like.

@roystgnr
Copy link
Member

Yeah, test_intercomm0 is a failure on dev for me too. (Although again, only on one of the systems I tried...) I'll try bisecting it (although I swear it was working for me on the gpmsa_new_functional branch, which is based on dev!) and swapping around MPI stacks and see what I can figure out, but IMHO there's no reason it should hold up this PR.

@roystgnr roystgnr merged commit e764b94 into libqueso:dev Sep 29, 2017
@briadam briadam deleted the bpo_single_quotes branch September 29, 2017 21:35
@dmcdougall
Copy link
Member

My understanding, subject to stray cosmic rays, is that this test is expected to be stochastic.

No, test_intercomm0 should pass every time.

@dmcdougall
Copy link
Member

dmcdougall commented Sep 30, 2017

They mostly match in the diff, but the latter file has a thousand samples which the former file doesn't.

Chain size is 20000 and filtered chain lag is 20. 20000 / 20 = 1000.

What's especially weird is that they have TEN TIMES as many samples as the same runs on a different system, and neither matches what I'd have expected to see

QUESO appends output to output files, not truncate. I suspect this clearly broken rm command is the cause of the problem.

@dmcdougall
Copy link
Member

Oops, this is the url I meant to paste in my previous comment: https://github.com/libqueso/queso/blob/dev/test/test_intercomm0/test_intercomm0_gravity_run.sh#L6

@dmcdougall
Copy link
Member

Disregard above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants