| Author | Message |
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
And yet another :-
October 3, 2025
We are aware of the issue with the scheduler returning "Another scheduler instance is running for this host" and have identified the cause in the config.xml template we adapted for the new containerized environment. We will fix it once we have confirmed that the new event-driven validation and assimilation pipelines are working correctly.
Uploads are being processed normally; we've confirmed that the new architecture for the containerized file_upload_handler pool behind Apache is correctly producing to the per-application Kafka (Redpanda) topics, storing the event and result data in separate queues on the local broker's partition.
As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07; validations should resume over the weekend, and initial releases of batches will be intermittent.
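(As an aside, for anyone curious what "producing to per-application topics" might look like in practice: here is a minimal sketch assuming confluent_kafka, with made-up topic names, event fields, and broker address - not WCG's actual code.)

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})   # assumed broker address


def on_upload(app_name: str, result_name: str, file_path: str, size_bytes: int) -> None:
    """Publish one upload event to the per-application topic, keyed by result
    name so all events for a result land on the same partition."""
    event = {
        "result_name": result_name,
        "file_path": file_path,
        "size_bytes": size_bytes,
    }
    producer.produce(
        f"uploads.{app_name.lower()}",             # e.g. uploads.mcm1 (assumed naming)
        key=result_name,
        value=json.dumps(event).encode("utf-8"),
    )


on_upload("MCM1", "MCM1_0000123_0001_0", "/uploads/3a/MCM1_0000123_0001_0", 48213)
producer.flush()
```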
I've been away a few days and can confirm the above from my logs:
30/09/2025 12:04:54 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: HTTP service unavailable
30/09/2025 23:42:51 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: Error 403
...
03/10/2025 09:24:03 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: Error 403
03/10/2025 13:19:00 | World Community Grid | Another scheduler instance is running for this host
|
|
Bryn Mawr
Joined: 26 Dec 18 Posts: 430 Credit: 14,933,398 RAC: 10
|
Further progress :-
October 7, 2025
We have resolved the issue with the BOINC scheduler configuration causing "Another scheduler instance is running for this host". Users should be able to report tasks. We will post an update as soon as we begin creating new workunits; we are still working to stand up the rest of the BOINC backend architecture.
The website went down briefly as we brought the scheduler online. We have adjusted the HAProxy configuration, and we will continue to adjust the Apache/HAProxy config if the website stops responding again.
Still debugging issues with the new Kafka-based validation workflow. It works together with HAProxy routing rules that partition BOINC downloads and uploads by assigning each server an equal share of hex buckets from the directory hierarchy BOINC expects (https://github.com/BOINC/boinc/wiki/DirHierarchy), and with the new file_upload_handler we wrote, which emits events to Kafka so we can batch and respond to them in parallel.
This removes the need for multiple round trips to the database for row-wise operations and polling; those become simple batch applications of state after consuming workunits ready for validation from the relevant Kafka topic for that application. It also allows us to perform validation and assimilation in the same process, at least for the projects we run ourselves (MCM1, MAM1, ARP1).
While the Kafka/Redpanda learning curve was significant, we have successfully transitioned to an event-driven, in-memory, partitioned architecture that should let us keep pace with the upcoming GPU-enabled MAM1 application.
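(To illustrate the bucket split described above: a rough sketch of mapping a filename to a dir_hier-style hex bucket and then to one of the backend nodes. The md5-based hash is only a stand-in for BOINC's real filename hash, and the fanout and node count are assumptions.)

```python
import hashlib

FANOUT = 1024    # assumed uldl_dir_fanout value
NODES = 6        # six backend servers, per the updates


def bucket_for(filename: str) -> str:
    """Hex bucket directory the file falls into (md5 here is only a stand-in
    for BOINC's actual dir_hier filename hash)."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16) % FANOUT
    return f"{h:x}"


def node_for(filename: str) -> int:
    """Backend node that owns the file's bucket; HAProxy would route the
    upload or download to this host."""
    return int(bucket_for(filename), 16) % NODES


name = "MCM1_0000123_0001_0"
print(bucket_for(name), node_for(name))
```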
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Err...
October 15, 2025
Testing the validators right now; there have been a lot of iterations on these.
As soon as the validator works, we will deploy across the six partitions and clear the backlog. Then we can check the transitioner interaction. If that is all good, we can finally start sending new work.
Going to finalize object storage for the archive, instead of the previous tape backup.
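(If the archive lands on S3-compatible object storage, the upload path could be as simple as this sketch; the endpoint, bucket name, and key layout are placeholders, not anything WCG has announced.)

```python
import boto3

# Placeholder endpoint/credentials for an S3-compatible store; names are made up.
s3 = boto3.client("s3", endpoint_url="https://objectstore.example.org")


def archive(local_path: str, key: str, bucket: str = "wcg-archive") -> None:
    """Copy one result file into the archive bucket."""
    s3.upload_file(local_path, bucket, key)


archive("/uploads/3a/MCM1_0000123_0001_0", "mcm1/2025/10/MCM1_0000123_0001_0")
```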
October 18, 2025
We are sending out small batches of workunits starting tonight, with batch IDs in the range 9999900+ for MCM1, to test the new distributed, partition-aware, app-specific create_work daemons that batch-upsert workunits. The few volunteers who get these workunits before we start releasing larger batches (as we gain confidence that the new system is working as expected) may notice that they have a much smaller number of signatures and run much faster than normal. These are still meaningful workunits, but key parameters such as the number of signatures to test per workunit were reduced so we could get feedback quickly.
Similar to ARP1, we have moved all workunit templating and preparation to WCG servers for MCM1. We did this for the MAM1 beta (beta30) already, but here we were able to move the per-batch rendering of workunit templates directly into the create_work daemon's C++ code: it consumes a protobuf schema from Kafka/Redpanda's schema registry and hydrates it to produce all workunits for the batch, according to the desired parameters it consumes from the "plan" topic via Kafka. Hence "app-specific" above. It then updates the BOINC database in bulk instead of calling BOINC's create_work() function. Metadata is local, partitioned, and replicated in Kafka for durability; each node writes a batch's files to its 1/6th of the buckets from the BOINC dir_hier fanout directory and commits its 1/6th of the batch records to the database in non-overlapping ranges per 10k-workunit batch.
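(Roughly what "commits 1/6th of the batch records in non-overlapping ranges" could look like, sketched in Python rather than the daemon's actual C++; the node count, batch size, workunit naming, and column list are assumptions, not the full BOINC schema handling.)

```python
import mysql.connector  # BOINC's database is MySQL

NODE_ID, NODES = 2, 6      # this node's index and the cluster size (assumed)
BATCH_SIZE = 10_000        # batches of 10k workunits, per the update


def my_range(batch_size: int, node_id: int, nodes: int) -> range:
    """Contiguous, non-overlapping slice of workunit indexes owned by this node."""
    per_node = batch_size // nodes
    start = node_id * per_node
    end = batch_size if node_id == nodes - 1 else start + per_node
    return range(start, end)


def insert_slice(batch_id: int, plan: dict) -> None:
    """Bulk-insert this node's slice of a batch into the workunit table.
    The column list is illustrative, not the full BOINC workunit schema."""
    rows = [
        (f"MCM1_{batch_id}_{i:05d}", plan["appid"], plan["rsc_fpops_est"])
        for i in my_range(BATCH_SIZE, NODE_ID, NODES)
    ]
    db = mysql.connector.connect(host="db", user="boinc", password="...", database="wcg")
    cur = db.cursor()
    cur.executemany(
        "INSERT INTO workunit (name, appid, rsc_fpops_est) VALUES (%s, %s, %s)",
        rows,
    )
    db.commit()


# Plan parameters would come from the "plan" topic; hard-coded here for illustration.
insert_slice(9999901, {"appid": 134, "rsc_fpops_est": 2.0e12})
```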
The new validators are working and deployed. In our new distributed, partitioned approach, validators process only the workunits local to their host: uploads are partitioned according to the fanout directory BOINC assigns and routed by HAProxy to the backend node that owns those fanout buckets. We split the buckets between nodes, so instead of only using them to fan out across the filesystem and avoid massive numbers of files in a single BOINC upload path, we fan out across the cluster. We read and write these buckets in tmpfs, so Apache serves downloads and accepts uploads in memory and validators read in memory, while Kafka/Redpanda gets a copy of uploads in a disk-persisted, replicated topic for durability; if a node goes down and we lose its in-memory cache of downloads and uploads, we can replay and recover.
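(A sketch of that replay/recovery path: re-read the disk-persisted upload topic from the beginning and rebuild the node's tmpfs copy. The topic name, partition ownership, and event layout are assumptions.)

```python
import os
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

TMPFS_ROOT = "/mnt/uploads-tmpfs"   # assumed in-memory mount for this node's buckets
LOCAL_PARTITION = 3                 # the Kafka partition this node owns (assumed)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "upload-replay",
    "enable.auto.commit": False,
})
# Re-read this node's partition of the durable upload topic from the start.
consumer.assign([TopicPartition("uploads.mcm1", LOCAL_PARTITION, OFFSET_BEGINNING)])

while True:
    msg = consumer.poll(5.0)
    if msg is None:        # caught up: nothing more to replay
        break
    if msg.error():
        continue
    # Assumes key = "bucket/result_name" and value = the raw upload bytes.
    dest = os.path.join(TMPFS_ROOT, msg.key().decode())
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as f:
        f.write(msg.value())

consumer.close()
```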
Each validator subscribes to a Kafka topic containing the count of uploads (a reduction over the upload events the new file_upload_handlers emit, restricted to that partition's local buckets) along with the file locations pertaining to a workunit's pair of uploads, and emits success or failure to another topic for downstream "assimilation". We have written and are testing a batch applier that collects successful validation events on each partition and batch-updates the BOINC database so that the transitioner and scheduler can work together to evaluate the state of those workunits. Once we are confident the batch updates from the applier work as expected, users should start seeing workunits pending validation clear to valid.
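(The batch applier idea, sketched with assumed topic names, event fields, and an illustrative UPDATE: accumulate validation successes and apply them to the BOINC database in batches instead of one round trip per result.)

```python
import json
import mysql.connector
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "batch-applier",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["validated.mcm1"])

db = mysql.connector.connect(host="db", user="boinc", password="...", database="wcg")
pending = []  # (granted_credit, result_id) tuples waiting to be applied


def flush():
    """Apply accumulated validation successes in one bulk UPDATE, then commit offsets."""
    if not pending:
        return
    cur = db.cursor()
    # validate_state = 1 is BOINC's VALIDATE_STATE_VALID; columns shown are illustrative.
    cur.executemany(
        "UPDATE result SET validate_state = 1, granted_credit = %s WHERE id = %s",
        pending,
    )
    db.commit()
    consumer.commit()  # only advance the consumer once the DB write is durable
    pending.clear()


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        flush()  # idle or error: apply whatever has accumulated
        continue
    ev = json.loads(msg.value())
    pending.append((ev["credit"], ev["result_id"]))
    if len(pending) >= 500:
        flush()
```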
We are not running file_deleter or db_purge at the moment; they need to be rearchitected to match the new setup, or at minimum assessed to make sure it makes sense to start them unchanged. We have no concerns about running out of space in the database or on disk right now, only about making mistakes, so we will assess what, if anything, needs to change about file_deleter and db_purge soon, but not yet. Likely they will also take advantage of per-workunit event data from Redpanda/Kafka instead of just talking to the BOINC database, and will operate on local partitions across the cluster.
Because we are producing events for every workunit's full lifecycle to Kafka topics, we have a level of visibility and control we were never able to achieve with the legacy system. We were also able to set up Prometheus node_exporter, tap into the docker stats endpoints on each node across the cluster, and do likewise for Redpanda/Kafka with the helpful https://github.com/redpanda-data/observability repo, getting a Grafana dashboard going that will let us do many things, such as serve up server status pages and improve the stats pages.
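(The "docker stats endpoints" bit can be reproduced at home with the Docker SDK for Python; this sketch just prints memory per container, whereas in their setup the numbers would presumably be scraped into Prometheus/Grafana.)

```python
import docker

client = docker.from_env()
for c in client.containers.list():
    s = c.stats(stream=False)                   # one-shot stats snapshot from the Docker API
    mem_bytes = s.get("memory_stats", {}).get("usage", 0)
    print(f"{c.name}: {mem_bytes / 1e6:.1f} MB resident")
```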
October 21, 2025
Finally stress testing rather than correctness testing.
Sent a batch of 100,000 workunits (fast-running, not full size, in case something crashed).
Thank you for your patience and continued support.
21/10/2025 20:49:14 | World Community Grid | Sending scheduler request: To fetch work.
21/10/2025 20:49:14 | World Community Grid | Requesting new tasks for CPU and NVIDIA GPU
21/10/2025 20:49:24 | World Community Grid | Scheduler request completed: got 163 new tasks
Already completed 2x16 tasks without noticing!
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Already completed 2x16 tasks without noticing!
163 tasks completed, none further came down
Except 25 minutes later I somehow picked up 60 Rosetta tasks - no idea where they came from
|
|
just1vet
Joined: 13 Nov 05 Posts: 5 Credit: 6,356,895 RAC: 82
|
Looks like WCG tried a 100k batch yesterday. Think they are having problems processing the returned ones. Haven't seen anything today. At least they are making some progress in getting it lined out. I would rather feed my machine medical units than space projects.
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Looks like WCG tried a 100k batch yesterday. Think they are having problems processing the returned ones. Haven't seen anything today. At least they are making some progress in getting it lined out. I would rather feed my machine medical units than space projects.
Agreed. None have turned up on my work PC or my 2nd home PC
On the plus side, I'm now getting SiDock tasks coming through, so things are more generally looking up
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
October 24, 2025
We have paused uploads and release of test batches while we work on the validation throughput issue.
We think we have identified the root cause of the low validation rate for the MCM1 test batches, and we will send a few more test batches to confirm the fix works.
If this is the final fix and we see the expected validation rate for pairs of MCM1 uploads for new batches, we will replay the Kafka consumer on the upload events fired for test batches received earlier in the week, and this should idempotently allow the new batch assimilator to process those validations and assign credit.
If the above goes well, we will schedule regular MCM1 batches to resume instead of the test batches.
As volunteers have noted, we have not yet reconciled uploads of regular MCM1 results submitted before we began sending test batches, and before the migration, but we have those files and will be able to do this in a batch update once the path for new workunits is working as described above.
Naturally, we will resume ARP and MAM only after these issues are fully resolved.
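(For the curious, "replaying the Kafka consumer" on earlier upload events might look something like this sketch: look up the offsets at a chosen timestamp and consume forward from there. The topic name, partition count, timestamp, and downstream hook are assumptions; as the update says, it only works safely because the downstream processing is idempotent.)

```python
from datetime import datetime, timezone
from confluent_kafka import Consumer, TopicPartition

# Replay upload events from 20 Oct onward (timestamp in milliseconds).
REPLAY_FROM_MS = int(datetime(2025, 10, 20, tzinfo=timezone.utc).timestamp() * 1000)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "mcm1-validation-replay",
    "enable.auto.commit": False,
})

# Ask the broker which offset corresponds to that timestamp on each partition.
parts = [TopicPartition("uploads.mcm1", p, REPLAY_FROM_MS) for p in range(6)]
start_offsets = consumer.offsets_for_times(parts, 10.0)
consumer.assign(start_offsets)


def handle_upload_event(raw: bytes) -> None:
    """Hypothetical hook back into the validation/assimilation pipeline;
    it must be idempotent so already-credited results become no-ops."""
    ...


while True:
    msg = consumer.poll(2.0)
    if msg is None:       # caught up to the end of the replay window
        break
    if msg.error():
        continue
    handle_upload_event(msg.value())

consumer.close()
```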
Checking this, I have 1468 tasks pending validation - about 600 from August and about 850 since the restart
Checking validations, only 1 task returned since the restart has been validated. 1 out of ~600
Let's hope that gets fixed soon...
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
October 24, 2025
We have paused uploads and release of test batches while we work on the validation throughput issue.
We think we have identified the root cause of the low validation rate for the MCM1 test batches, and we will send a few more test batches to confirm the fix works.
If this is the final fix and we see the expected validation rate for pairs of MCM1 uploads for new batches, we will replay the Kafka consumer on the upload events fired for test batches received earlier in the week, and this should idempotently allow the new batch assimilator to process those validations and assign credit.
If the above goes well, we will schedule regular MCM1 batches to resume instead of the test batches.
As volunteers have noted, we have not yet reconciled uploads of regular MCM1 results submitted before we began sending test batches, and before the migration, but we have those files and will be able to do this in a batch update once the path for new workunits is working as described above.
Naturally, we will resume ARP and MAM only after these issues are fully resolved.
Checking this, I have 1468 tasks pending validation - about 600 from August and about 850 since the restart
Checking validations, only 1 task returned since the restart has been validated. 1 out of ~600
Let's hope that gets fixed soon...
Returning home last night to see persistent transient HTTP errors uploading my last 16 tasks to WCG
This morning they'd all cleared and 181 more tasks came down
Remembering the last update (above), I thought I'd recheck where I was on validation: about 620 tasks returned since the restart have now validated - up from 1 - so it has been fixed quickly, though the older tasks remain, as stated. I have a total of 848 tasks pending validation, with ~200 of those since the restart
Happening slowly, but definitely progressing
Edit: Which is all just as well as SiDock tasks have run out and I'm down to my last 3
But I've only just noticed these new WCG tasks only run for 5-6 minutes each.
This must be what they mean by short-running test batches. I've only just understood that.
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Edit: Which is all just as well as SiDock tasks have run out and I'm down to my last 3
Some SiDock tasks are back
Post restart WCG tasks are continuing to validate, if quite slowly - 181 currently pending, down from 198 at the previous count
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
October 28, 2025
We have fixed the main validation throughput issues with the new Kafka-based workflow and reprocessed uploads from around the time we started sending out test batches. We are reviewing the Kafka topics and BOINC database to see whether the volunteer reports of test workunits where both results were uploaded but no validation/assimilation occurred during the reprocess point to another bug, and if so, whether it is severe enough to block regular MCM1_024% batch distribution until it is resolved.
In reviewing the transitioner implementation (which we had intended to start yesterday, to begin triggering resends for test batches), we found that the new paradigm for storing the configuration details required to populate resends in the result table needed to be incorporated into key functions. We are testing these relatively minor changes to the transitioner now.
Our plan is to deploy the updated transitioner, verify that resends work, and verify that it times out expired workunits. Depending on how that and the review of the "missed validations" noted above go, we may then be ready to resume MCM1 batches in the normal range.
Regarding uploads that span the downtime for migration, we will reconcile validation and credit for these workunits as soon as the production path for MCM1 described above is running. We should be able to use the new components to do that, after walking the filesystems where those uploads live, double-checking the list of what needs validation and crediting against the database, and pursuing a similar "reprocessing" path to the one that worked well for re-attempting validation and crediting of the test MCM1 batches.
Then we will begin testing beta30/MAM1 and ARP1 using the new system, which we expect to progress much faster now that we have ironed out the logic with MCM1.
Stats updates will be restarted as soon as the MCM1 workflow is stable; that will include the daily export to https://download.worldcommunitygrid.org/boinc/stats/
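(The filesystem-walk reconciliation mentioned above, sketched: walk the upload fanout directories and keep only files whose result rows are still unvalidated in the database. Paths, the query, and state values are assumptions.)

```python
import os
import mysql.connector

UPLOAD_ROOT = "/uploads"   # assumed root of the BOINC fanout directories
db = mysql.connector.connect(host="db", user="boinc", password="...", database="wcg")


def still_unvalidated(names):
    """Return the subset of result names whose rows still await validation
    (validate_state = 0 is BOINC's VALIDATE_STATE_INIT)."""
    if not names:
        return set()
    cur = db.cursor()
    placeholders = ",".join(["%s"] * len(names))
    cur.execute(
        f"SELECT name FROM result WHERE validate_state = 0 AND name IN ({placeholders})",
        list(names),
    )
    return {row[0] for row in cur.fetchall()}


to_reprocess = []
for bucket in os.listdir(UPLOAD_ROOT):
    bucket_dir = os.path.join(UPLOAD_ROOT, bucket)
    if not os.path.isdir(bucket_dir):
        continue
    files = os.listdir(bucket_dir)
    for name in still_unvalidated(files):
        to_reprocess.append(os.path.join(bucket_dir, name))

# to_reprocess can then be fed to the same "reprocessing" path used for the test batches.
print(f"{len(to_reprocess)} pre-migration uploads still need validation/credit")
```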
Only 13 more WCG tasks validated since I last mentioned it - hardly any.
Some new tasks started coming down yesterday, which are running for as long as they used to for me - 77mins in my case
I think I've grabbed just 15 of them and returned 6 already. No idea if that's a limited number of new tasks because I'm pretty stacked with SiDock tasks atm and my buffer looks full
Good to see mention of stats export. Boincstats hasn't shown any new credits yet since WCG came back up. I'm short about 50k credits
|
|