Message boards : Cafe Rosetta : Other projects.
| Author | Message |
|---|---|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
I took a very rare visit to the WCG forums to see what was being said, and the credit increase was almost completely ignored. All recent credits have been withdrawn and, having re-read their forum thread, I think they actually had it sussed quite early on, but I couldn't follow the discussion well enough first time around. Let's hope their second attempt goes better (and mine, tbf). My BOINC WCG credits are back down from 54.5m to where they started before the weekend - 35,655,547
|
|
mikey Send message Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 564 |
On a Linux box (mine is Linux Mint 22.3) there's an easy way to poll for new Rosetta tasks. https://www.sidock.si/sidock/ is doing some Chemistry-related stuff as well |
|
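The post above mentions polling for new Rosetta tasks from a Linux box without naming the tool. One common way is the boinccmd CLI that ships with the BOINC client; a minimal sketch (the project URL and this particular invocation are my assumptions, not necessarily what the poster meant):

```python
import subprocess

# Assumed Rosetta@home project URL; use whatever URL your client is attached to.
ROSETTA_URL = "https://boinc.bakerlab.org/rosetta/"

def update_command(project_url):
    # Build the boinccmd invocation that asks the local BOINC client
    # to contact the project's scheduler for new work.
    return ["boinccmd", "--project", project_url, "update"]

def poll_for_tasks(project_url=ROSETTA_URL):
    # Requires a running BOINC client; boinccmd talks to it over local RPC.
    subprocess.run(update_command(project_url), check=True)
```

Run from cron or a shell loop, this just nudges the client to ask the scheduler for work more often than its own backoff would.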
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
SiDock has just come back. All transfers uploaded, tasks sent and 8 tasks received back. WCG still down and Rosetta still hit-and-hope
|
|
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 442 Credit: 15,696,256 RAC: 13,458 |
A new update from WCG :- April 9, 2026 - BOINC web traffic has been blocked at the load balancer for maintenance; all BOINC scheduler requests, downloads and uploads will be met with HTTP 503 codes until maintenance is completed. We expect completion between April 10th and April 11th, but no earlier than 18:00 UTC on April 10th, to allow projected file transfers and migration of sharded database table records between Citus Postgres workers to complete. We will update here and on the forums if we expect extended maintenance over the weekend. Volunteers should expect that a successful rollout during this maintenance window will increase workunit availability and put another dent in the validation backlog. |
|
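A 503 during a maintenance window like this is a signal to back off and retry later, which the BOINC client already does on its own. As a hedged illustration of the idea only (not WCG's or BOINC's actual code), a retry loop with exponential backoff might look like:

```python
import time

def fetch_with_backoff(fetch, retries=5, base_delay=60, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 503, backing off between tries.

    fetch is any callable returning an HTTP status code; in practice it would
    wrap an upload or scheduler request. sleep is injectable for testing.
    """
    for attempt in range(retries):
        status = fetch()
        if status != 503:
            return status
        # Server is in maintenance: wait 60s, 120s, 240s, ... before retrying.
        sleep(base_delay * (2 ** attempt))
    return 503
```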
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
A new update from WCG Not back yet, so I'll keep my fingers crossed for tomorrow. Meanwhile, SiDock went back down again not long after my first results were sent back, and some of the tasks that were sent back initially when it came back up missed their deadline and weren't credited, from the look of things. Crunching tasks is like pulling teeth at the moment, even if we finally get sent some work to do. Very frustrating
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
A new update from WCG <sigh>
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
WCG April 13, 2026 We are aware of the website and forum issues - looking into it. Our certificates are valid. We aren't back online yet - we are still waiting for some answers from UT and UHN about the cause of this issue. We will update once we have some answers.
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
WCG April 13, 2026 All my outstanding WCG transfers went through yesterday, but tasks weren't able to upload. This evening they uploaded. No new tasks have come down yet, though. In the meantime I'm still only sneaking the occasional one or two Rosetta tasks - not all of which run successfully either - while completed SiDock tasks are still failing to upload, and every attempt leads to a 24hr backoff
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
All my outstanding WCG transfers went through yesterday, but tasks weren't able to upload. Tasks coming down now - 138 at a first grab for me
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
All my outstanding WCG transfers went through yesterday, but tasks weren't able to upload. WCG tasks are still coming down. However, nothing seems to be transferring back, so tasks are piling up with a status of 'uploading'. Glancing at the WCG forums, I now realise a huge credit update has gone to Boincstats. Boincstats seems to have an issue between hosts and teams: I've had an update totalling 2.85m across 3 hosts, but my team update is only 987k
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
However, nothing seems to be transferring back, so tasks are piling up with a status of 'uploading' Transfers went through about an hour ago and new tasks started arriving just now. Some downloads failed - I suspect because many hosts are all polling at the same time. I got 143 tasks, then a further 20, so availability seems high, for the moment at least
|
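Simultaneous polling by many hosts (the thundering herd the post suspects) is usually softened by adding random jitter to each client's retry time. A small sketch of the idea, with made-up numbers rather than BOINC's real scheduler-request logic:

```python
import random

def polling_delay(base=300.0, jitter=0.5, rng=random.random):
    # Spread the next poll across [base, base * (1 + jitter)] seconds
    # so hosts don't all hit the server at the same instant.
    return base * (1 + jitter * rng())
```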
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
However, nothing seems to be transferring back, so tasks are piling up with a status of 'uploading' Now getting "HTTP service unavailable" again, so it's not currently reliable. Perhaps give it an extra few hours before piling in for new tasks, although uploading transfers seems OK for the moment
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
Sorry - a bit late on this from WCG

April 17, 2026 More problems - this time issues with the data center.

Website/Forum Outage - possibly due to a service interruption in our cloud environment that hosting is working to fix. It prevents us from accessing our running instances for maintenance, and may be responsible for other issues, although this is currently unclear. Hosting provided an ETA of 1h at 19:00 UTC today, April 17th, 2026, and we will keep volunteers posted as we get information and attempt to come back online.

BOINC Backend Outage - ongoing after a brief success window on Wednesday, April 15th, 2026. We enabled the BOINC stats dump after seeing the new architecture handle load and fixing the 503 upload issue. However, in attempting to fix the 404s on the download path by rebuilding the input files and writing them to the tmpfs cache from which downloads are served, we pushed several nodes into various failure states: SUnreclaim slab memory exhaustion from the overhead of writing each file to tmpfs en masse, and ill-advised queries run against Postgres without EXPLAIN and without paging results to disk, which backed up everything else waiting on Postgres and caused a soft lockup on the node. There were also issues with the io_method = 'io_uring' setting in combination with our network-attached volumes for the Postgres datadir - a setting we may have to change and evaluate before restarting.

The naive "should be back up tonight" note in the forums by the WCG tech was based on having recovered from one of these soft lockups many times in the past few weeks, before understanding the reason for the initial crash, or causing the later crashes on other nodes while attempting various methods of safely regenerating workunits that were throwing 404s on download after being assigned by the scheduler.

We will bring the system online as soon as we are confident the results the scheduler will assign have download URLs with files at those paths, and the cluster is stable again.
|
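The update above partly blames large queries run without paging their results. The usual remedy is to stream rows in bounded pages (with a real driver, a server-side cursor or keyset pagination) so memory stays flat regardless of result-set size. A generic sketch, where fetch_page is a hypothetical stand-in for any offset/limit query:

```python
def fetch_in_pages(fetch_page, page_size=1000):
    """Yield rows page by page instead of materialising the whole result set.

    fetch_page(offset, limit) stands in for an OFFSET/LIMIT query; each call
    returns at most `limit` rows, and an empty page signals the end.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += page_size
```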
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
WCG has stuttered to a restart in the last few hours. It's not going smoothly, but over 3-4 hours tasks have been coming down - some running, some aborted - eventually getting uploaded, with more tasks coming down after. Hopefully it settles down shortly, but I'm not betting on it just yet
|
|
Michael E. Send message Joined: 5 Apr 08 Posts: 17 Credit: 2,032,556 RAC: 1,688 |
Anybody got an idea about what's going on over at WCG? You might look at the Operational Status tab here: https://www.cs.toronto.edu/~juris/jlab/wcg.html WCG has been unpredictable for months. They have MCM tasks available today. See: https://wuprop.statseb.fr/active_projects.py |
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
WCG has stuttered to a restart in the last few hours. It does seem to have settled down, and it looks like there's been a credit update at Boincstats - hopefully correct this time
|
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2581 Credit: 47,219,985 RAC: 1,303 |
April 29, 2026 BOINC traffic resumed around 12:30 UTC on April 25th, 2026, and the cluster is currently stable as of this writing (18:00 UTC on April 29th, 2026). The issues at the data center were resolved, recovery was successful, and the issues causing large numbers of 404s on download and 503s on upload - unrelated to the issue at the data center - were both resolved as well. We implemented back-pressure for the validators, juggled some services like the backfill validations off the database cluster nodes to other nodes in the cluster, further tuned Postgres for our workload, and modified some BOINC components to harden the cluster against future outages.

BOINC stats export to https://download.worldcommunitygrid.org/boinc/stats resumed - the server status page will follow once ARP1 and MAM1 are up and running again.

IN_PROGRESS results do not display on the website - until the credit_flusher batch upserts those rows into the legacy MariaDB database after validation, these workunits are not visible to the website APIs. We are working to add to the Results API a fetch-and-cache from the new BOINC database Postgres cluster so that these IN_PROGRESS results can be seen on the website and retrieved from the APIs. This fix will likely coincide with improvements to let users whose result sets are large enough to time out the API see, or at least download, their results, plus fixes and tooltips for the new Summary feature on the Results page.

Data Sharing radio button does not work - thank you for the report. Working to fix this.

403 Forbidden - frustrating forum users. When we fixed the team challenge registration, the issue causing 403 Forbidden on that page was the updated mod-security rules for apache2 on the load balancer server. We will start there and look at the mod-security rules from the load balancer through to the container that hosts the website behind HAProxy, which also has its own set of rules, and hopefully provide relief soon.

When will ARP1 be released? - the current blocker is the geographical split with overlapping edges between regions to match our partitioned backend, so that downloads and uploads will be routed to a mostly contiguous geographic region of the overall sub-Saharan region for which the project is predicting the weather. As this involves fetching ARP1 results across the boundaries of those mostly contiguous regions, so that "halo" domains can have their next generations of workunit inputs generated from the completed work of all their neighbours within, and across, the partition border, it requires more development and testing. Now that we are seeing stability in the new architecture, this is a priority, and we will update as we get a better sense of the exact timing. We may release workunits that sit completely within a geographic partition to test the ARP1 BOINC components in general, before we can announce that the project is back up and running.

When will MAM1 and the GPU build "MAMG" workunits be released? - while MAM1 and the corresponding beta30 project are up to date with the MCM1 pipeline and could have work issued at any time, we are exploring the options afforded to us by upgrading our BOINC build, such as the BOINC Universal Docker App "BUDA" (https://github.com/BOINC/boinc/wiki/BUDA-overview), and also using Mojo with the existing LibTorch support within the application to run on newer AMD and NVIDIA GPUs (https://docs.modular.com/max/develop/custom-kernels-pytorch). We expect to start sending out batches of MAM1 workunits for the mt CPU build through the beta30 app, and possibly the MAM1_9999900+ testing range, this week, barring some new blocker, and will update on GPU support as we release new builds through the beta30 application.
|
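The update mentions implementing back-pressure for the validators. In its simplest form that just means a bounded queue: producers are refused (or blocked) when consumers fall behind, instead of letting work pile up without limit. A minimal sketch of the concept, not WCG's actual implementation:

```python
import queue

def try_enqueue(work_queue, item):
    # Non-blocking put: returns False when the queue is full, telling
    # the producer to slow down - that refusal is the back-pressure.
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False

# A bounded queue feeding hypothetical validator workers; maxsize caps
# how far producers can run ahead of validation.
validator_queue = queue.Queue(maxsize=2)
```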
©2026 University of Washington
https://www.bakerlab.org