Client Errors

Message boards : Number crunching : Client Errors

The Pirate
Joined: 22 Sep 05
Posts: 20
Credit: 7,090,933
RAC: 0
Message 663 - Posted: 27 Sep 2005, 21:54:29 UTC
Last modified: 27 Sep 2005, 21:56:35 UTC

Last night I added three more Windows computers to this project. One runs 32-bit Windows with dual AMD MP 2100s and one runs 64-bit Windows with dual Opteron 275s. Both multiprocessor computers are getting some client errors on Rosetta@Home, although some WUs complete OK. On S@H, P@H, E@H and LHC they run just fine.

Housing and Food Services
Joined: 1 Jul 05
Posts: 85
Credit: 155,098,531
RAC: 0
Message 664 - Posted: 27 Sep 2005, 22:04:21 UTC

Do you have the option checked to leave Rosetta in memory? As has been noted in several threads, that seems to fix a couple problems for folks working on multiple projects.
The Pirate
Joined: 22 Sep 05
Posts: 20
Credit: 7,090,933
RAC: 0
Message 667 - Posted: 27 Sep 2005, 23:11:06 UTC

I just set it to stay in memory. I'll see if that helps.

Red Squirrel
Joined: 26 Sep 05
Posts: 13
Credit: 3,613
RAC: 0
Message 690 - Posted: 28 Sep 2005, 11:15:38 UTC

I just got a Client Error on my first Rosetta WU. I think it crashed as it got to the end of the model. It had reached the 83% completed mark and had run for about 50 minutes of its next hourly slot when it crashed. Here's the relevant part of the message log:
27/09/2005 18:56:13|rosetta@home|Resuming result 1btn__abrelax_11836_0 using rosetta version 4.77
27/09/2005 19:50:01|rosetta@home|Unrecoverable error for result 1btn__abrelax_11836_0 ( - exit code -1073741819 (0xc0000005))
27/09/2005 19:50:01||request_reschedule_cpus: process exited
27/09/2005 19:50:01|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds
27/09/2005 19:50:01|rosetta@home|Computation for result 1btn__abrelax_11836_0 finished
27/09/2005 19:50:01|climateprediction.net|Resuming result 0rru_000056379_0 using hadsm3 version 4.10
27/09/2005 19:51:02|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
27/09/2005 19:51:02|rosetta@home|Requesting 0 seconds of work, returning 1 results
27/09/2005 19:51:04|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded

I am using Windows XP on an Athlon XP 2000+ with 512MB memory, and, yes, I do have it set to remain in memory when it's preempted.
Anybody got any ideas?
Thanks in advance,
Alan
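
For anyone comparing logs: the two numbers in the "exit code" message are the same value written two ways. The decimal figure is the exit status read as a signed 32-bit integer, and 0xc0000005 is the same bit pattern read as unsigned, which on Windows is the access violation status. A quick sketch of the conversion follows (Python is used purely for illustration, and the function name is made up for this example):

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status value."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which turns a negative
    # two's-complement value into its unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# The exit code reported in the message log above.
print(to_unsigned_hex(-1073741819))  # 0xc0000005
```

So a log that shows only the decimal number is reporting the same failure as one that shows only the hex code.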

UBT - Halifax--lad
Joined: 17 Sep 05
Posts: 157
Credit: 2,687
RAC: 0
Message 706 - Posted: 28 Sep 2005, 17:16:11 UTC

I have recently had this problem too: my last 2 work units have failed due to a computational error.
J D K
Joined: 23 Sep 05
Posts: 168
Credit: 101,266
RAC: 0
Message 719 - Posted: 28 Sep 2005, 20:21:30 UTC

Check out what the WIKI has to say.
BOINC Wiki

The Pirate
Joined: 22 Sep 05
Posts: 20
Credit: 7,090,933
RAC: 0
Message 721 - Posted: 28 Sep 2005, 21:59:17 UTC

I am also getting the "exit code -1073741819 (0xc0000005)" error, but only on my multi-processor PCs.
Neither computer is overclocked or overheating. Rosetta@home is the ONLY app getting this error, and both machines get the same one; all the other apps (Seti@home, Einstein@home, LHC@home, ProteinPredictor@home and SetiBeta@home) are running without errors. At this point, I am disinclined to start swapping things around. Of course, I have seen stranger things happen in the past.

UBT - Halifax--lad
Joined: 17 Sep 05
Posts: 157
Credit: 2,687
RAC: 0
Message 722 - Posted: 28 Sep 2005, 22:14:22 UTC

Something has got to have been changed in the WUs by the developers, or a bug has managed to creep into the WUs for some other reason.

It's happening to too many people and too often at the moment, and all my other projects are working correctly.
makaveli001
Joined: 24 Sep 05
Posts: 1
Credit: 6,909
RAC: 0
Message 725 - Posted: 29 Sep 2005, 0:19:55 UTC

I am getting client errors on both of my systems as well. The WUs seem to stop at 83.33% for a while and then just error out.

Jason




Keith E. Laidig
Volunteer moderator
Project developer
Joined: 1 Jul 05
Posts: 154
Credit: 117,189,961
RAC: 0
Message 741 - Posted: 29 Sep 2005, 4:02:00 UTC
Last modified: 29 Sep 2005, 4:17:24 UTC

Wouldn't you know it, just as DK is on vacation the s**t hits the fan...
As soon as he's back we'll drop this problem in his lap. We appreciate the feedback.

PlaNed
Joined: 25 Sep 05
Posts: 3
Credit: 37,334
RAC: 0
Message 761 - Posted: 29 Sep 2005, 12:56:47 UTC

Two computers (Barton, 256 MB RAM, WinXP SP2).
Over two days, only 6 WUs completed OK and 16 ended in a client error!
I am stopping for a few days...
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 763 - Posted: 29 Sep 2005, 14:00:22 UTC - in response to Message 741.  

Wouldn't you know it, just as DK is on vacation the s**t hits the fan...
As soon as he's back we'll drop this problem in his lap. We appreciate the feedback.


My percent completed seems to hang, but the work itself continues to progress to completion.

Glad DK took a vacation. All things considered, things imho seem to be going quite well for such a new project. Better than some of your BOINC project competition! ;)
Regards,
Bob P.
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 766 - Posted: 29 Sep 2005, 14:32:22 UTC
Last modified: 29 Sep 2005, 14:45:29 UTC

Just for balance...

I have processed something like 50 work units without any errors over the past 2-3 days (on 5 machines, all with different specs, several versions of Windows). Two of the errors I had earlier were most likely due to extreme overclocking experiments.

The last one that caused errors for me was https://boinc.bakerlab.org/rosetta/workunit.php?wuid=55702

Perhaps there was a "bad batch" of work units in that range. The one mentioned above is the ONLY one that crashed on me recently and the ONLY one in that range (i.e. I have no other WU's with an id in the 50000-60000 range)

Perhaps others are also running more projects on their machines than I do and thus run into what looks like memory management bugs that are not an issue on my machines (not enough RAM to hold every work unit in memory, so the biggest one, Rosetta, suffers?)

One of my machines has only 256MB RAM. I run only Rosetta on that one and it's doing fine. The others have 512MB or 1GB and some of those run just one extra project at a time, to conserve memory.
*** Join BOINC@Australia today ***
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 767 - Posted: 29 Sep 2005, 14:50:40 UTC

If I had to guess, and it is only a guess, the problem might be that the wu currently going out are for a bigger protein that requires more memory. Before David left he set up calculations for five different proteins so we could get a broad overview of the landscapes you are mapping out. The memory requirements obviously increase with protein length because there is more for the computer to model. If we can figure out what protein is causing the problem, and also figure out how to remove the corresponding wu, we will do it today.

Does it seem like the current WUs are taking more memory? Ultimately, I wonder if it is possible to direct different WUs to different hosts depending on the size of the protein being modeled and the available machine memory. Unfortunately, there isn't a lot we can do to significantly reduce the memory required--there are a lot of atoms to model!
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 768 - Posted: 29 Sep 2005, 14:58:24 UTC - in response to Message 767.  

Ultimately, I wonder if it is possible to direct different WUs to different hosts depending on the size of the protein being modeled and the available machine memory. Unfortunately, there isn't a lot we can do to significantly reduce the memory required--there are a lot of atoms to model!


That is a great idea!

I think it might be possible because Einstein@home has mentioned that in the future they might send different workunits to different machines based on the requirements of the individual workunits. This would also make a project run more efficiently, which is to everyone's benefit. :)
Regards,
Bob P.
Peter M. Nielsen
Joined: 17 Sep 05
Posts: 10
Credit: 423
RAC: 0
Message 772 - Posted: 29 Sep 2005, 15:50:07 UTC - in response to Message 768.  

Ultimately, I wonder if it is possible to direct different WUs to different hosts depending on the size of the protein being modeled and the available machine memory. Unfortunately, there isn't a lot we can do to significantly reduce the memory required--there are a lot of atoms to model!


You can define a minimum RAM requirement for the client the WU is sent to, but I don't think you can do the opposite.

- Peter
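
To make the minimum-RAM idea discussed above concrete, here is a rough sketch of how a per-work-unit memory requirement could be matched against a host's RAM. This is not the BOINC scheduler's actual code or API: the class names, the `eligible_work` function, the `reserve_mb` headroom and all of the memory figures are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    name: str
    memory_needed_mb: int  # hypothetical per-protein memory estimate


@dataclass
class Host:
    name: str
    ram_mb: int


def eligible_work(host, work_units, reserve_mb=128):
    """Return the work units whose estimated memory fits on this host.

    reserve_mb is an assumed amount of RAM left for the OS and other projects.
    """
    budget = host.ram_mb - reserve_mb
    return [wu for wu in work_units if wu.memory_needed_mb <= budget]


# Made-up queue: one small protein and one large protein.
queue = [WorkUnit("small_protein", 100), WorkUnit("large_protein", 300)]

low_ram = eligible_work(Host("low_ram_box", 256), queue)
high_ram = eligible_work(Host("high_ram_box", 1024), queue)
print([wu.name for wu in low_ram])   # only the small protein fits
print([wu.name for wu in high_ram])  # both fit
```

Note that this only expresses a minimum: a host with plenty of RAM is still eligible for the small work units, which matches the point that you can set a floor but not the opposite.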

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 783 - Posted: 29 Sep 2005, 20:12:46 UTC

Just a data point: so far I have not had a work unit crash, at least not according to the work listed in my account. Both systems, however, have at least 1 GB of RAM, with the PowerMac actually having 2.5 GB... if it IS a RAM size problem, that MAY be a clue...

Other points, I do not "tweak" the systems much if at all, by that I mean, about the only thing I will do is "flush" the Results completed but not reported.
Tangent
Joined: 17 Sep 05
Posts: 4
Credit: 18,859
RAC: 0
Message 786 - Posted: 29 Sep 2005, 21:47:26 UTC

I've not had any errors with the new app. I'm running Rosetta on 2 systems; both have 448MB of memory (512MB installed with 64MB shared for the on-board video). I do run a 2GB swap file on both machines and have not seen any noticeable performance degradation.


J D K
Joined: 23 Sep 05
Posts: 168
Credit: 101,266
RAC: 0
Message 787 - Posted: 29 Sep 2005, 21:58:01 UTC

Just had my first ***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005)
840ee 3.2 HT 1 gig ram soon to be 3 gig.

All 4 had the error about the same time......


Result ID 148432
Name 1btn__abrelax_no_cst_06187_1
Workunit 80003


Result ID 148435
Name 1btn__abrelax_no_cst_06414_1
Workunit 80684

Result ID 148430
Name 1btn__abrelax_no_cst_05998_1
Workunit 79439

Result ID 148332
Name 1btn__abrelax_no_cst_06255_1
Workunit 80207

BOINC Wiki

Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 793 - Posted: 30 Sep 2005, 0:12:02 UTC

Same thing on a dual Xeon HT box, running 4 Rosetta processes, W2K Adv Server SP4 :

9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_15688_0 ( - exit code -1073741819 (0xc0000005))
9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16013_0 ( - exit code -1073741819 (0xc0000005))
9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16374_0 ( - exit code -1073741819 (0xc0000005))
9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16963_0 ( - exit code -1073741819 (0xc0000005))


Nothing else happened at that time that was recorded in the event logs.
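
If you want to check whether several results died at the same moment, as in the log above, the message log lines can be picked apart with a short script. This is only a sketch: it assumes the `timestamp|project|message` layout and the "Unrecoverable error" wording exactly as they appear in this thread, and the function and variable names are made up.

```python
import re
from collections import Counter

# Matches the "Unrecoverable error" message as it is worded in this thread.
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def failed_results(log_text):
    """Return (timestamp, project, result, exit_code) for each failure line."""
    failures = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp|project|message
        if len(parts) != 3:
            continue
        match = ERROR_RE.search(parts[2])
        if match:
            failures.append(
                (parts[0], parts[1], match["result"], int(match["code"]))
            )
    return failures


# Two failure lines from the post above plus one unrelated line.
SAMPLE = """\
9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_15688_0 ( - exit code -1073741819 (0xc0000005))
9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16013_0 ( - exit code -1073741819 (0xc0000005))
27/09/2005 19:50:01||request_reschedule_cpus: process exited
"""

failures = failed_results(SAMPLE)
per_timestamp = Counter(timestamp for timestamp, _, _, _ in failures)
print(per_timestamp)
```

Several failures sharing one timestamp, as here, hint at something that hit all the processes at once rather than one bad work unit.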



Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)



©2024 University of Washington
https://www.bakerlab.org