Message boards : Cafe Rosetta : Losing WU progress
Author | Message |
---|---|
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
Hello, I have been running Rosetta@Home COVID-19 WUs on really low-end hardware. That computer has an AMD Turion X2 TL-60 inside, and each WU takes 21 hours to complete. At home I don't feel comfortable leaving old hardware running overnight, so I have to shut down that PC after 13-15 hours. Sadly, it seems that after the shutdown I lost all progress on those two WUs and they simply started from scratch. I run the latest Linux Mint there. Is there anything I can do about this? Cosmology@Home WUs work just fine and save their progress. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It sounds like perhaps the WU did not reach the completion of the first model. The end of a model is always a checkpoint, where work is saved. Some WUs have checkpoints within a model as well. Can you look at the properties of the WUs and see the time since their last checkpoint as compared to CPU time? Rosetta Moderator: Mod.Sense |
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
CPU time is equal to CPU time since checkpoint. Right now it's at 3:40:40. It seems that previous progress was completely discarded. Also it's weird that my other PC had working checkpoints. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Different types of work units checkpoint at different times. The one thing they all have in common is that at the end of a model, a checkpoint is taken. Any time you shut down BOINC, the work since the last checkpoint is lost. The objective is that this would normally be less than 15 minutes of work. When the task restarts, it restores the checkpoint and continues from there. In your case, unfortunately, it sounds like a checkpoint was never reached. Taking so long to reach a checkpoint is very unusual. I can only suggest looking for the other usual bottlenecks: memory, the BOINC settings for how much memory can be used, leave tasks in memory when suspended, don't use BOINC as a screensaver, don't leave the graphics display running, etc. If the problem continues, rest assured that the task will be detected and ended. The message says something like "too many restarts with no progress". But it takes 4 or 5 times. Since the project is out of work, there won't be others to replace them. If you can, it would be interesting to see how it looks if you leave the machine on overnight. Rosetta Moderator: Mod.Sense |
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
I will try to leave that computer running. I don't use it for anything else, and it once was a DV6000 laptop. Now it's a halftop. There's no screen, battery, keyboard, or touchpad. The only plastic left is the part that holds everything together. It was also upgraded from a 1.8GHz Sempron to a 2GHz Turion X2, from 1GB of DDR2 to 4GB, and from an 80GB HDD to a 120GB SSD. So naturally I now use it as a "desktop", meaning that I plug in a keyboard/touchpad (Logitech K400), connect a monitor via VGA, and use an old Android phone as a WiFi adapter. Not having an internal display means that I can't see how to enter the BIOS, and the lack of an internal keyboard means that support for external ones is finicky. So I can't really access the BIOS anymore, and if I unplug the machine from power all settings are lost. I think that by default there's only 32MB assigned to the GPU (GeForce 6150 Go), which might be problematic nowadays. Not sure if it could be related to not saving progress. Not sure about the virtualization and security settings (NX bit maybe). Hopefully this information helps a bit. |
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
Small update: The whole thing froze. BOINC window in fullscreen froze, Mint menus froze. PC doesn't respond to lost internet connection. Only mouse cursor moves. It looks like it won't resolve itself, but I will keep it running and won't try to close BOINC window. There's only 1 hour and 28 minutes work left. That's so frustrating. |
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
And I had to restart the machine after all. During the night it became completely unresponsive. After the restart both Rosetta WUs failed ("computation error"). However, it seems that both are 100% complete. |
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
I'm really sorry for ruining those 2 WUs. After that incident I set Rosetta to not get any new tasks, but during that freeze it downloaded one WU and I noticed that it's 95% complete. There's not much left, so I will keep it going. Checkpoint functionality is again nonexistent on that WU; however, Cosmology's WU seems to have working checkpoints. |
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
So, that one WU was crunched successfully. However, checkpoints didn't work. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Can you give a link to the task? Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
The red spirit Send message Joined: 22 Nov 15 Posts: 10 Credit: 214,036 RAC: 557 |
"Can you give a link to the task?" Here they are (the failed ones): https://boinc.bakerlab.org/rosetta/result.php?resultid=1139781657 https://boinc.bakerlab.org/rosetta/result.php?resultid=1139781737 And here's the successful one: https://boinc.bakerlab.org/rosetta/result.php?resultid=1140448033 |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
This has been mentioned before. The Rosetta i686 applications haven't been working correctly since last week. They only give 20 credits upon completion and only produce a single decoy. This also explains why checkpointing isn't working midway (I've observed this as well, by the way). You can correct this by restricting the client to 64-bit applications, which you do by disabling alt_platforms as per the guide: https://boinc.berkeley.edu/wiki/Client_configuration |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
I also have a few theories about your freeze.
|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
So would the alt_platform look like this then?? <alt_platform>(Nope, I had it wrong)</alt_platform> Rosetta Moderator: Mod.Sense |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
This is what /var/lib/boinc-client/cc_config.xml looks like over here: <cc_config> <log_flags> <task>1</task> <file_xfer>1</file_xfer> <sched_ops>1</sched_ops> </log_flags> <options> <no_alt_platform>1</no_alt_platform> </options> </cc_config> After restarting the boinc daemon (systemctl restart boinc-client), this will completely remove the alt_platform line from your client_state.xml and all future scheduler requests. |
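
For anyone who would rather not hand-edit the file, bkil's change can be sketched as a small Python helper. This is only an illustration under a few assumptions: the function name `ensure_no_alt_platform` is made up for this example, and it works on the XML text rather than on the file itself, so reading and writing /var/lib/boinc-client/cc_config.xml (which usually needs root) and restarting the daemon are left to you.

```python
import xml.etree.ElementTree as ET


def ensure_no_alt_platform(xml_text: str) -> str:
    """Return cc_config XML text with <no_alt_platform>1</no_alt_platform>
    present under <options>. Hypothetical helper, not part of BOINC."""
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    # Create <options> if the existing config does not have one yet.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    # Add the flag if it is missing, then force its value to 1.
    flag = options.find("no_alt_platform")
    if flag is None:
        flag = ET.SubElement(options, "no_alt_platform")
    flag.text = "1"

    # Pretty-print where the standard library supports it (Python 3.9+).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A config that only has log flags, similar to a default install.
    sample = "<cc_config><log_flags><task>1</task></log_flags></cc_config>"
    print(ensure_no_alt_platform(sample))
```

Existing entries such as the log flags are left alone, and running the helper twice does not add a second flag. After saving the result, restart the client as described above (systemctl restart boinc-client) so the change takes effect.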
©2024 University of Washington
https://www.bakerlab.org