Message boards : Number crunching : Client Errors
Author | Message |
---|---|
The Pirate Send message Joined: 22 Sep 05 Posts: 20 Credit: 7,090,933 RAC: 0 |
Last night I added three more Windows computers to this project. One runs 32-bit Windows with dual AMD MP 2100s and one runs 64-bit Windows with dual Opteron 275s. Both multiprocessor computers are getting some client errors on Rosetta@Home, although some WUs complete OK. On S@H, P@H, E@H and LHC they run just fine. |
Housing and Food Services Send message Joined: 1 Jul 05 Posts: 85 Credit: 155,098,531 RAC: 0 |
Do you have the option checked to leave Rosetta in memory? As has been noted in several threads, that seems to fix a couple problems for folks working on multiple projects. |
The Pirate Send message Joined: 22 Sep 05 Posts: 20 Credit: 7,090,933 RAC: 0 |
|
Red Squirrel Send message Joined: 26 Sep 05 Posts: 13 Credit: 3,613 RAC: 0 |
I just got a Client Error on my first Rosetta WU. I think it crashed as it got to the end of the model. It had reached the 83% completed mark and had run for about 50 mins of its next hourly slot when it crashed. Here's the relevant part of the message log:- 27/09/2005 18:56:13|rosetta@home|Resuming result 1btn__abrelax_11836_0 using rosetta version 4.77 27/09/2005 19:50:01|rosetta@home|Unrecoverable error for result 1btn__abrelax_11836_0 ( - exit code -1073741819 (0xc0000005)) 27/09/2005 19:50:01||request_reschedule_cpus: process exited 27/09/2005 19:50:01|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds 27/09/2005 19:50:01|rosetta@home|Computation for result 1btn__abrelax_11836_0 finished 27/09/2005 19:50:01|climateprediction.net|Resuming result 0rru_000056379_0 using hadsm3 version 4.10 27/09/2005 19:51:02|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 27/09/2005 19:51:02|rosetta@home|Requesting 0 seconds of work, returning 1 results 27/09/2005 19:51:04|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded I am using Windows XP on an Athlon XP 2000+ with 512MB memory, and, yes, I do have it set to remain in memory when it's preempted. Anybody got any ideas? Thanks in advance, Alan |
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
I have recently had this problem too. The last 2 work units have failed due to a computational error. Join us in Chat (see the forum) Click the Sig Join UBT |
J D K Send message Joined: 23 Sep 05 Posts: 168 Credit: 101,266 RAC: 0 |
|
The Pirate Send message Joined: 22 Sep 05 Posts: 20 Credit: 7,090,933 RAC: 0 |
I am also getting the "Exit Code -1073741819 (0xc0000005)" error, but only on my multiprocessor PCs. Neither computer is overclocked or overheating. Rosetta@home is the ONLY app that is getting this error, and both computers get the same error. All other apps (Seti@home, Einstein@home, LHC@home, ProteinPredictor@home and SetiBeta@home) are running without errors, so at this point I am disinclined to start swapping things around. Of course, I have seen stranger things happen in the past. |
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
Something must have been changed in the WUs by the developers, or a bug has managed to creep into the WUs for some other reason. It's happening to too many people and too often at the moment, and all my other projects are working correctly. Join us in Chat (see the forum) Click the Sig Join UBT |
makaveli001 Send message Joined: 24 Sep 05 Posts: 1 Credit: 6,909 RAC: 0 |
I am getting client errors on both of my systems as well. The WUs seem to stop at 83.33% for a while and then just error out. Jason |
Keith E. Laidig Volunteer moderator Project developer Send message Joined: 1 Jul 05 Posts: 154 Credit: 117,189,961 RAC: 0 |
Wouldn't you know it, just as DK is on vacation the s**t hits the fan... As soon as he's back we'll drop this problem in his lap. We appreciate the feedback. |
PlaNed Send message Joined: 25 Sep 05 Posts: 3 Credit: 37,334 RAC: 0 |
Two computers (Barton, 256 MB RAM, WinXP SP2). Over two days only 6 WUs were OK and 16 were client errors! I am stopping for a few days... |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
Wouldn't you know it, just as DK is on vacation the s**t hits the fan... My percent completed seems to hang, but the work itself continues to progress to completion. Glad DK took a vacation. All things considered, things imho seem to be going quite well for such a new project. Better than some of your BOINC project competition! ;) Regards, Bob P. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
Just for balance... I have processed something like 50 work units without any errors over the past 2-3 days (on 5 machines, all with different specs, several versions of Windows). Two of the errors I had earlier were most likely due to extreme overclocking experiments. The last one that caused errors for me was https://boinc.bakerlab.org/rosetta/workunit.php?wuid=55702 Perhaps there was a "bad batch" of work units in that range. The one mentioned above is the ONLY one that crashed on me recently and the ONLY one in that range (i.e. I have no other WU's with an id in the 50000-60000 range) Perhaps others are also running more projects on their machines than I do and thus run into what looks like memory management bugs that are not an issue on my machines (not enough RAM to hold every work unit in memory, so the biggest one, Rosetta, suffers?) One of my machines has only 256MB RAM. I run only Rosetta on that one and it's doing fine. The others have 512MB or 1GB and some of those run just one extra project at a time, to conserve memory. *** Join BOINC@Australia today *** |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
If I had to guess, and it is only a guess, the problem might be that the WUs currently going out are for a bigger protein that requires more memory. Before David left he set up calculations for five different proteins so we could get a broad overview of the landscapes you are mapping out. The memory requirements obviously increase with protein length because there is more for the computer to model. If we can figure out which protein is causing the problem, and also figure out how to remove the corresponding WUs, we will do it today. Does it seem like current WUs are taking more memory? Ultimately, I wonder if it is possible to direct different WUs to different hosts depending on the size of the protein being modeled and the available machine memory. Unfortunately, there isn't a lot we can do to significantly reduce the memory required: there are a lot of atoms to model! |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
Ultimately, I wonder if it is possible to direct different WUs to different hosts depending on the size of the protein being modeled and the available machine memory. Unfortunately, there isn't a lot we can do to significantly reduce the memory required: there are a lot of atoms to model! That is a great idea! I think it might be possible because Einstein@home has mentioned that in the future they might send different workunits to different machines based on the requirements of the individual workunits. This would also make a project run more efficiently, which is to everyone's benefit. :) Regards, Bob P. |
Peter M. Nielsen Send message Joined: 17 Sep 05 Posts: 10 Credit: 423 RAC: 0 |
Ultimately, I wonder if it is possible to direct different WUs to different hosts depending on the size of the protein being modeled and the available machine memory. Unfortunately, there isn't a lot we can do to significantly reduce the memory required: there are a lot of atoms to model! You can define a minimum RAM requirement for the client the WU is sent to, but I don't think you can do the opposite. - Peter |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Just a data point... to this point I have not had a work unit crash, at least not according to the work listed in my account. Both systems, however, have at least 1G RAM, with the PowerMac actually having 2.5G ... if it IS a RAM size problem that MAY be a clue ... Other points: I do not "tweak" the systems much, if at all. By that I mean about the only thing I will do is "flush" the results completed but not reported. |
Tangent Send message Joined: 17 Sep 05 Posts: 4 Credit: 18,859 RAC: 0 |
I've not had any errors with the new app. I'm running Rosetta on 2 systems, both have 448MB of memory (512MB installed with 64MB shared for the on-board video). I do run a 2GB swap file on both machines and have not seen any noticeable performance degradation. |
J D K Send message Joined: 23 Sep 05 Posts: 168 Credit: 101,266 RAC: 0 |
Just had my first ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) 840ee 3.2 HT 1 gig ram soon to be 3 gig. All 4 had the error about the same time...... Result ID 148432 Name 1btn__abrelax_no_cst_06187_1 Workunit 80003 Result ID 148435 Name 1btn__abrelax_no_cst_06414_1 Workunit 80684 Result ID 148430 Name 1btn__abrelax_no_cst_05998_1 Workunit 79439 Result ID 148332 Name 1btn__abrelax_no_cst_06255_1 Workunit 80207 BOINC Wiki |
Angus Send message Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0 |
Same thing on a dual Xeon HT box, running 4 Rosetta processes, W2K Adv Server SP4 : 9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_15688_0 ( - exit code -1073741819 (0xc0000005)) 9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16013_0 ( - exit code -1073741819 (0xc0000005)) 9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16374_0 ( - exit code -1073741819 (0xc0000005)) 9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16963_0 ( - exit code -1073741819 (0xc0000005)) Nothing else happened at that time that was recorded in the event logs. Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White) |
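
A note on the exit code quoted throughout this thread: -1073741819 and 0xc0000005 are the same 32-bit value, read once as signed and once as unsigned, and 0xc0000005 is the Windows access violation status that J D K's post reports. The Python sketch below pulls the failed result names out of message-log lines like the ones Angus posted and shows the conversion. The regular expression and the names `to_ntstatus` and `failed_results` are assumptions made for illustration, based only on the log lines quoted here. They are not part of BOINC or Rosetta.

```python
import re

# Assumed pattern, inferred from the "Unrecoverable error" lines quoted
# in this thread. Other BOINC client versions may word the line differently.
ERROR_LINE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+)"
)


def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def failed_results(log_text: str) -> list[tuple[str, str]]:
    """Return (result name, hex status) for every error line in the log text."""
    return [
        (m["result"], to_ntstatus(int(m["code"])))
        for m in ERROR_LINE.finditer(log_text)
    ]


# Two of the lines from the post above, used as sample input.
sample = (
    "9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result "
    "1btn__abrelax_no_cst_15688_0 ( - exit code -1073741819 (0xc0000005))\n"
    "9/29/2005 6:01:34 AM|rosetta@home|Unrecoverable error for result "
    "1btn__abrelax_no_cst_16013_0 ( - exit code -1073741819 (0xc0000005))\n"
)

for name, status in failed_results(sample):
    print(name, status)
```

Running something like this over a saved message log gives a quick list of which results failed and whether they all share the same status, which is handy when comparing notes in a thread like this one.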