Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

respiting of a subjoin always fails #2372

Open
heistera opened this issue Jul 6, 2024 · 10 comments
Open

respiting of a subjoin always fails #2372

heistera opened this issue Jul 6, 2024 · 10 comments

Comments

@heistera
Copy link

heistera commented Jul 6, 2024

Dear developers! Using the latest released ganga version 8.7.8 a respiting of a job always fails. Earlier versions did not have this issue. Could you please have a look? Thx!

Ganga In [21]: jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 ))
ERROR    !!Unknown/Unexpected ERROR!!
ERROR    If you're able to reproduce this please report this to the Ganga developers!
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[21], line 1
----> 1 jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 ))

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/GPIDev/Base/Proxy.py:734, in proxy_wrap.<locals>.proxy_wrapped(*args, **kwargs)
    732 s_kwargs = dict((name, stripProxy(a)) for name, a in kwargs.items())
    733 try:
--> 734     r = f(*s_args, **s_kwargs)
    735 except KeyboardInterrupt:
    736     logger.error("Command was interrupted by a Ctrl+C event!")

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/GPIDev/Lib/Job/Job.py:1865, in Job.resplit(self, new_splitter)
   1862     i += 1
   1863 if isType(mJob.subjobs, SubJobJsonList):
   1864     # If we have loaded subjobs from disk, append then new jobs to the SubJobXMLList
-> 1865     mJob.subjobs.appendJobs(new_dict, mJob)
   1866 elif not isType(mJob.subjobs, (list, GangaList)):
   1867     # If we don't have a GangaList, turn the subjobs into one and append the new ones
   1868     temp_list = GangaList()

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/Core/GangaRepository/SubJobXMLList.py:129, in SubJobXMLList.appendJobs(self, job_dict, parent)
    127 for k in job_dict.keys():
    128     if k in self._cachedJobs.keys():
--> 129         raise RepositoryError("Subjob key %s is already in the subjob XML cache!" % k)
    130 # Now set the parent for the new subjobs
    131 for k, v in job_dict.items():

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/Core/exceptions.py:142, in RepositoryError.__init__(self, repo, what)
    140 from GangaCore.Utility.logging import getLogger
    141 logger = getLogger()
--> 142 logger.error("A severe error occurred in the Repository '%s': %s" % (repo.registry.name, what))
    143 logger.error('If you believe the problem has been solved, type "reactivate()" to re-enable ')
    144 try:

AttributeError: 'str' object has no attribute 'registry'

[12:35:07]
Ganga In [22]: jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 ))
ERROR    JobError: You can only resplit subjobs in the failed, killed or completed status!

[12:35:48]
@mesmith75
Copy link
Contributor

mesmith75 commented Jul 6, 2024

This works fine for me - it looks like you have doing something odd to your job repository.
Perhaps this hints as to what:

 raise RepositoryError("Subjob key %s is already in the subjob XML cache!" % k)

@heistera
Copy link
Author

heistera commented Jul 6, 2024

I've double checked it: with ganga version 8.6.10 it works as expected ...

@heistera
Copy link
Author

heistera commented Jul 6, 2024

Ganga In [6]: jobs(31).subjobs(683).resplit(SplitByFiles( filesPerJob = 1 ))
Ganga Out [6]: 1

[13:06:31]
Ganga In [7]:

@mesmith75
Copy link
Contributor

So we fixed the error in the exception raising - maybe you could try again with the DEV version?

@heistera
Copy link
Author

Thanks for the fix and sorry for the late reply. I cannot test exactly the same subjob, because this subjoin is in a state which does not allow resplitting:
Ganga In [1]: jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 ))
ERROR JobError: You can only resplit subjobs in the failed, killed or completed status!

For another subjob of the same job the respiting is in progress for about 30min right now. This is possibly another issue of the ganga 8.7.X versions ... for large amount of jobs, subjoins all 8.7.X versions I've used the processing times for commands are incredibly slow ... since I will stop working now I keep the ganga session open and will report back what will have happened ...

@heistera
Copy link
Author

Still by now the command to resplit is stuck:
[20:19:44]
Ganga In [4]: jobs(31).subjobs(56).resplit(SplitByFiles( filesPerJob = 1 ))

My provider will change IP address of my home router soon and will stop my lxplus session ...

@mesmith75
Copy link
Contributor

mesmith75 commented Jul 13, 2024

What do you mean? How exactly have you managed to produce this? Have you had a look at what the debug say. We need more information otherwise we cannot help. It works fine for me, so there must be something particular about your setup.

@heistera
Copy link
Author

I have a lot of jobs in my session and I am trying to be very careful when e.g. ending a ganga session. Using the newer ganga version for the amount of jobs I am processing takes me a lot of patience, because ganga is very, very slow. I cannot remove the ganga session output, because I need it to be able to find out the URLs of the data at the storage elements. So I've used ganga on a different cluster than lxplus to have a more reactive session ... perhaps it would be good to develop some sort of stress test using a lot of jobs for testing new ganga versions ...?

@mesmith75
Copy link
Contributor

mesmith75 commented Jul 15, 2024

I use Ganga myself to submit many thousands of jobs for my own analyses and have had no issues. We work very hard to test it and make it faster - in fact the upgrade from 8.6.9 -> 8.7.0 brought significant speed increases.

Could you start Ganga in debug mode, try to resplit and send the output?

@heistera
Copy link
Author

I have started ganga in debug more. Here are the results and sorry for the delay.

1.) ganga start takes long. A while ganga was stuck during the startup here is the printout I saw:
2024-07-18 20:24:39,388 MainThread RegistrySlice ::do_select :292 DEBUG : NOT Actually Selected
2024-07-18 20:24:39,388 MainThread Registry ::_locked_load :554 DEBUG : _locked_load
2024-07-18 20:24:39,388 MainThread GangaRepositoryXML ::load :1032 DEBUG : Loading Repo object(s): [30]
2024-07-18 20:24:39,561 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created2024-07-18 20:24:39,770 MainThread GangaRepositoryXML ::_load_xml_from_obj :941 DEBUG : Loading XML file for ID: 30 took 0.3006861209869385 sec
2024-07-18 20:24:39,770 MainThread GangaRepositoryXML ::_load_xml_from_obj :949 DEBUG : Checking children: 30

  1. During the longish start I saw tons of these messages:
    h/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/782/data for job 34
    2024-07-18 20:27:34,688 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
    2024-07-18 20:27:34,707 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: Inputfiles not copied to WN for GaudiExec #783
    2024-07-18 20:27:34,708 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: Inputfiles not copied to WN for GaudiExec #783 from disk
    2024-07-18 20:27:34,708 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/783/data for job 34
    2024-07-18 20:27:34,718 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
    2024-07-18 20:27:34,724 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: just a test #784
    2024-07-18 20:27:34,724 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: just a test #784 from disk
    2024-07-18 20:27:34,724 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/784/data for job 34
    2024-07-18 20:27:34,748 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
    2024-07-18 20:27:34,771 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: Cleanup xml/sessionlock implementation #785
    2024-07-18 20:27:34,771 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: Cleanup xml/sessionlock implementation #785 from disk
    2024-07-18 20:27:34,771 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/785/data for job 34
    2024-07-18 20:27:34,805 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
    2024-07-18 20:27:34,829 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: More checks needed in DIRACBase #786
    2024-07-18 20:27:34,830 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: More checks needed in DIRACBase #786 from disk
    2024-07-18 20:27:34,830 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/786/data for job 34
    2024-07-18 20:27:34,919 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
    2024-07-18 20:27:34,957 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: Organisation of tests #787
    2024-07-18 20:27:34,957 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: Organisation of tests #787 from disk
    2024-07-18 20:27:34,957 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/787/data for job 34
    2024-07-18 20:27:34,992 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
    2024-07-18 20:27:35,040 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: work to stabilize GangaDirac API #788
    2024-07-18 20:27:35,040 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: work to stabilize GangaDirac API #788 from disk

  2. Resplitting of the job worked after some time but then the ganga session seems to have died ... all commands I issued ended with no printout ... I pasted the output of the session as I saw on my screen:
    Ganga In [3]: jobs(30).subjobs(641).resplit(SplitByFiles( filesPerJob = 1 ))
    Generating proxy...
    Enter Certificate password: *************
    Added VOMS attribute /lhcb/Role=user
    Uploading proxy..
    Proxy generated:
    subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister/CN=3575311884/CN=2728184313
    issuer : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister/CN=3575311884
    identity : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister
    timeleft : 23:53:59
    DIRAC group : lhcb_user
    path : /tmp/x509up_u7443
    username : heistera
    properties : NormalUser, PrivateLimitedDelegation
    VOMS : True
    VOMS fqan : ['/lhcb/Role=user']

Proxies uploaded:
DN | Until (GMT)
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister | 2025/06/09 17:57
[20:35:42]
Ganga In [4]: jobs(3)
[20:36:10]
Ganga In [5]: jobs(30)
[20:36:17]
Ganga In [6]: jobs
[20:37:10]
Ganga In [7]: jobs
Ganga Out [7]: [20:39:46]
Ganga In [8]:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants