Abandoned Tasks?

Message boards : Number crunching : Abandoned Tasks?

Charles Dennett
Joined: 18 Dec 14
Posts: 88
Credit: 3,342,826
RAC: 0
Message 2515 - Posted: 3 Aug 2015, 11:55:47 UTC

I just noticed on one of my crunchers this morning that all the tasks had been abandoned and returned as an error and given out to a different user. They were not old. The oldest task had been given to me on 1 Aug 2015, 21:27:42 UTC. The newest task had come in at 3 Aug 2015, 7:20:02 UTC. All these tasks (50 of them, the max for this system) were abandoned and returned at 3 Aug 2015, 7:47:49 UTC, less than half an hour after the newest task had arrived.

Any idea why this happened?

Charlie
____________

Charles Dennett
Joined: 18 Dec 14
Posts: 88
Credit: 3,342,826
RAC: 0
Message 2516 - Posted: 3 Aug 2015, 12:29:04 UTC

Well, I did find one thing. These were abandoned when my system (running Linux as all my crunchers do) performed its weekly task of rotating log files. Rotating happens every week but has never in the past caused any task to be abandoned. Rotating the boinc log files involves stopping boinc, renaming the current log files, creating new log files and restarting boinc. (As a side effect, because these tasks do not checkpoint, any cpu activity on active tasks is lost and crunching starts from the beginning.)

This has never in the past caused a problem on any of my 4 crunchers. Log rotation is a normal Linux function and the boinc package includes the little config file for rotating boinc logs. In other words, it's not something I invented.

When boinc restarted, it logged the usual activity when it starts, but I also found this one line:

03-Aug-2015 03:47:45 [FiND@Home] Sending scheduler request: Project initialization.

Note, my time is Eastern Daylight Time in the US, which is 4 hours behind UTC. In fact, the other two projects I run also had a similar log entry. I have no idea why boinc would decide to reinitialize all my projects on the one cruncher. Hmm. Strange.
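As a quick check of the timing, here is a minimal Python sketch that converts the local log timestamp to UTC using the fixed 4-hour offset mentioned above. The function name `log_time_to_utc` is made up for illustration, and the format string assumes an English locale for the month abbreviation.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Eastern Daylight Time (UTC-4), as stated in the post.
EDT = timezone(timedelta(hours=-4), "EDT")


def log_time_to_utc(stamp: str) -> datetime:
    """Convert a client log timestamp such as '03-Aug-2015 03:47:45'
    (local EDT) to an aware datetime in UTC.
    Assumes English month abbreviations (%b is locale dependent)."""
    local = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S").replace(tzinfo=EDT)
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    request_utc = log_time_to_utc("03-Aug-2015 03:47:45")
    # Time the tasks were marked abandoned, taken from the first post.
    abandoned_utc = datetime(2015, 8, 3, 7, 47, 49, tzinfo=timezone.utc)
    print(request_utc.isoformat())
    print((abandoned_utc - request_utc).total_seconds(), "seconds apart")
```

Under that assumption, the scheduler request was sent a few seconds before the server recorded the tasks as abandoned.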

Charlie
____________

Charles Dennett
Joined: 18 Dec 14
Posts: 88
Credit: 3,342,826
RAC: 0
Message 2517 - Posted: 3 Aug 2015, 12:43:58 UTC

A couple more interesting log entries that are not in any other boinc log on any of my systems. Interesting ones are in bold.

03-Aug-2015 03:47:03 [---] Starting BOINC client version 7.2.42 for i686-pc-linux-gnu
03-Aug-2015 03:47:03 [---] log flags: file_xfer, sched_ops, task
03-Aug-2015 03:47:03 [---] Libraries: libcurl/7.40.0 NSS/3.19.1 Basic ECC zlib/1.2.8 libidn/1.29 libssh2/1.5.0
03-Aug-2015 03:47:03 [---] Data directory: /var/lib/boinc
03-Aug-2015 03:47:03 [---] No usable GPUs found
03-Aug-2015 03:47:03 [---] Using state file client_state_next.xml
03-Aug-2015 03:47:03 [---] Host name: ash

03-Aug-2015 03:47:03 [---] Config: report completed tasks immediately
03-Aug-2015 03:47:03 [---] Config: GUI RPC allowed from any host
03-Aug-2015 03:47:03 [---] Config: GUI RPCs allowed from:
03-Aug-2015 03:47:03 [---] oak
03-Aug-2015 03:47:03 [---] 192.168.1.99
03-Aug-2015 03:47:03 [---] Version change (0.0.0 -> 7.2.42)
03-Aug-2015 03:47:03 [NRG] URL http://boinc.med.usherbrooke.ca/nrg/; Computer ID not assigned yet; resource share 100
03-Aug-2015 03:47:03 [DENIS@home] URL http://denis.usj.es/denisathome/; Computer ID not assigned yet; resource share 0
03-Aug-2015 03:47:03 [FiND@Home] URL http://findah.ucd.ie/; Computer ID not assigned yet; resource share 100


So, the client_state.xml file was apparently not present when boinc was stopped. Perhaps stopping it interrupted something else that was going on?

Charlie
____________

Thyme Lawn
Joined: 25 Oct 12
Posts: 55
Credit: 371,176
RAC: 292
Message 2525 - Posted: 6 Aug 2015, 14:56:11 UTC - in response to Message 2517.

client_state_next.xml is a temporary stage in updating client_state.xml as follows:


  • write client_state_next.xml
  • rename client_state.xml to client_state_prev.xml
  • rename client_state_next.xml to client_state.xml
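
The three steps above can be sketched in Python roughly as follows. This is only an illustration of the write-then-rename pattern, not the client's actual code; the function name `update_state_file` and the flush/fsync calls are my own assumptions.

```python
import os


def update_state_file(data_dir: str, contents: str) -> None:
    """Illustrative write-then-rename update of client_state.xml.
    Hypothetical helper, not taken from the BOINC source."""
    next_path = os.path.join(data_dir, "client_state_next.xml")
    cur_path = os.path.join(data_dir, "client_state.xml")
    prev_path = os.path.join(data_dir, "client_state_prev.xml")

    # Step 1: write the new state to client_state_next.xml.
    with open(next_path, "w", encoding="utf-8") as f:
        f.write(contents)
        f.flush()
        os.fsync(f.fileno())  # assumption: force the data to disk first

    # Step 2: rename client_state.xml to client_state_prev.xml
    # (skipped on the very first write, when no current file exists).
    if os.path.exists(cur_path):
        os.replace(cur_path, prev_path)

    # Step 3: rename client_state_next.xml to client_state.xml.
    os.replace(next_path, cur_path)
```

If a process following this pattern is stopped after step 1 begins but before step 3 completes, a client_state_next.xml file is left behind in the data directory.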


When BOINC starts up it checks state file validity by confirming that it starts with a <client_state> tag and contains a </client_state> tag. The order of checking is client_state_next.xml, client_state.xml and client_state_prev.xml, with the first valid one being used.
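
A rough Python sketch of that startup check, to show why a file can pass it and still be missing most of its contents. The names `looks_valid` and `pick_state_file` are hypothetical, and ignoring leading whitespace before the opening tag is an assumption on my part.

```python
import os
from typing import Optional

# Order of checking as described above.
CANDIDATES = (
    "client_state_next.xml",
    "client_state.xml",
    "client_state_prev.xml",
)


def looks_valid(path: str) -> bool:
    """True if the file starts with <client_state> and contains
    </client_state>. Nothing in between is inspected."""
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
    except OSError:
        return False  # missing or unreadable file
    # Assumption: leading whitespace before the opening tag is ignored.
    return text.lstrip().startswith("<client_state>") and "</client_state>" in text


def pick_state_file(data_dir: str) -> Optional[str]:
    """Return the name of the first candidate that passes the check,
    or None if none of them does."""
    for name in CANDIDATES:
        if looks_valid(os.path.join(data_dir, name)):
            return name
    return None
```

With a check like this, a client_state_next.xml holding only the opening and closing tags would be chosen ahead of a complete client_state.xml.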

Your 4 highlighted messages suggest that client_state_next.xml contained the validity tags but didn't contain the core client version number settings or a host id for NRG, DENIS@home or FiND@Home. The missing host ids might be because the projects were totally missing from the file and picked up from the account files, but in any case that would have caused your client to reattach to those projects. Presumably this one detected it as a recognised host, reused the original host id and marked the in progress tasks as abandoned.

As to why client_state_next.xml was incomplete in that way, I've absolutely no idea ...
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Charles Dennett
Joined: 18 Dec 14
Posts: 88
Credit: 3,342,826
RAC: 0
Message 2526 - Posted: 6 Aug 2015, 15:53:28 UTC - in response to Message 2525.

I'm chalking it up to a timing issue between stopping boinc for log rotation and whatever boinc was doing at the instant it was told to stop. As I noted, this has never happened in the past.

Charlie
____________

Richard Haselgrove
Joined: 30 May 15
Posts: 25
Credit: 1,979,129
RAC: 1,584
Message 2528 - Posted: 6 Aug 2015, 23:43:40 UTC

If client_state.xml was damaged in the way that Thyme_Lawn describes, it is likely that the <rpc_seqno> - "Remote Procedure Call - sequence number" - value was lost as well. There is a long-standing bug in the current server code which causes work in progress to be marked as abandoned when an <rpc_seqno> anomaly is detected.

We got the bug fixed about a month ago: "don't mark jobs as abandoned if request lists jobs". But it's a rare problem, and I wouldn't advocate updating the server code just for this: the BOINC project is having significant problems with its servers and code repositories just at the moment, and it would make more sense to wait until they've settled down a bit.



Copyright © 2017 Dr Anthony Chubb