| Author | Message |
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
And yet another :-
October 3, 2025
We are aware of the issue with the scheduler returning "Another scheduler instance is running for this host" and have identified the cause in the config.xml template we adapted for the new containerized environment. We will fix it once we have confirmed that the new event-driven validation and assimilation pipelines are working correctly.
Uploads are being processed normally; we've confirmed that the new architecture for the containerized file_upload_handler pool behind Apache is correctly producing to the per-application Kafka (Redpanda) topics, storing the event and result data in separate queues on the local broker's partition.
As a result, there will be at least one more weekend sprint. Tentatively, we expect to be producing new workunits next week for MCM1, ARP1, and MAM1 beta version 7.07; validations should resume over the weekend, and initial releases of batches will be intermittent.
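(As an aside, for anyone curious what "producing to per-application topics" might look like in practice: here is a minimal sketch assuming confluent_kafka, with made-up topic names, event fields, and broker address - not WCG's actual code.)

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})   # assumed broker address


def on_upload(app_name: str, result_name: str, file_path: str, size_bytes: int) -> None:
    """Publish one upload event to the per-application topic, keyed by result
    name so all events for a result land on the same partition."""
    event = {
        "result_name": result_name,
        "file_path": file_path,
        "size_bytes": size_bytes,
    }
    producer.produce(
        f"uploads.{app_name.lower()}",             # e.g. uploads.mcm1 (assumed naming)
        key=result_name,
        value=json.dumps(event).encode("utf-8"),
    )


on_upload("MCM1", "MCM1_0000123_0001_0", "/uploads/3a/MCM1_0000123_0001_0", 48213)
producer.flush()
```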
I've been away a few days and can confirm the above from my logs:
30/09/2025 12:04:54 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: HTTP service unavailable
30/09/2025 23:42:51 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: Error 403
...
03/10/2025 09:24:03 | World Community Grid | Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: Error 403
03/10/2025 13:19:00 | World Community Grid | Another scheduler instance is running for this host
|
|
Bryn Mawr
Joined: 26 Dec 18 Posts: 430 Credit: 14,933,398 RAC: 10
|
Further progress :-
October 7, 2025
We have resolved the issue with the BOINC scheduler configuration causing "Another scheduler instance is running for this host". Users should be able to report tasks. We will post an update as soon as we begin creating new workunits; we are still working to stand up the rest of the BOINC backend architecture.
The website went down briefly as we brought the scheduler online. We have adjusted the HAProxy configuration, and we will continue to adjust the Apache/HAProxy config if the website stops responding again.
Still debugging issues with the new Kafka-based validation workflow. It works together with HAProxy routing rules that partition BOINC downloads and uploads by assigning each server an equal share of hex buckets from the directory hierarchy BOINC expects (https://github.com/BOINC/boinc/wiki/DirHierarchy), and with the new file_upload_handler we wrote, which emits events to Kafka so we can batch and respond to them in parallel.
This removes the need for multiple round trips to the database for row-wise operations and polling; those become simple batch applications of state after consuming workunits ready for validation from the relevant Kafka topic for that application. It also allows us to perform validation and assimilation in the same process, at least for the projects we run ourselves (MCM1, MAM1, ARP1).
While the Kafka/Redpanda learning curve was significant, we have successfully transitioned to an event-driven, in-memory, partitioned architecture that should let us keep pace with the upcoming GPU-enabled MAM1 application.
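(To illustrate the bucket split described above: a rough sketch of mapping a filename to a dir_hier-style hex bucket and then to one of the backend nodes. The md5-based hash is only a stand-in for BOINC's real filename hash, and the fanout and node count are assumptions.)

```python
import hashlib

FANOUT = 1024    # assumed uldl_dir_fanout value
NODES = 6        # six backend servers, per the updates


def bucket_for(filename: str) -> str:
    """Hex bucket directory the file falls into (md5 here is only a stand-in
    for BOINC's actual dir_hier filename hash)."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16) % FANOUT
    return f"{h:x}"


def node_for(filename: str) -> int:
    """Backend node that owns the file's bucket; HAProxy would route the
    upload or download to this host."""
    return int(bucket_for(filename), 16) % NODES


name = "MCM1_0000123_0001_0"
print(bucket_for(name), node_for(name))
```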
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Err...
October 15, 2025
Testing the validators right now; there have been a lot of iterations on these.
As soon as the validator works, we will deploy across the six partitions and clear the backlog. Then we can check the transitioner interaction. If that is all good, we can finally start sending new work.
Going to finalize object storage for the archive, instead of the previous tape backup.
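(If the archive lands on S3-compatible object storage, the upload path could be as simple as this sketch; the endpoint, bucket name, and key layout are placeholders, not anything WCG has announced.)

```python
import boto3

# Placeholder endpoint/credentials for an S3-compatible store; names are made up.
s3 = boto3.client("s3", endpoint_url="https://objectstore.example.org")


def archive(local_path: str, key: str, bucket: str = "wcg-archive") -> None:
    """Copy one result file into the archive bucket."""
    s3.upload_file(local_path, bucket, key)


archive("/uploads/3a/MCM1_0000123_0001_0", "mcm1/2025/10/MCM1_0000123_0001_0")
```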
October 18, 2025
We are sending out small batches of workunits starting tonight, with batch IDs in the range 9999900+ for MCM1, to test the new distributed, partition-aware, app-specific create_work daemons that batch-upsert workunits. The few volunteers who get these workunits before we start releasing larger batches (as we gain confidence that the new system is working as expected) may notice that they have a much smaller number of signatures and run much faster than normal. These are still meaningful workunits, but key parameters such as the number of signatures to test per workunit were reduced so we could get feedback quickly.
Similar to ARP1, we have moved all workunit templating and preparation to WCG servers for MCM1. We did this for the MAM1 beta (beta30) already, but here we were able to move the per-batch rendering of workunit templates directly into the create_work daemon's C++ code: it consumes a protobuf schema from Kafka/Redpanda's schema registry and hydrates it to produce all workunits for the batch, according to the desired parameters it consumes from the "plan" topic via Kafka. Hence "app-specific" above. It then updates the BOINC database in bulk instead of calling BOINC's create_work() function. Metadata is local, partitioned, and replicated in Kafka for durability; each node writes a batch's files to its 1/6th of the buckets from the BOINC dir_hier fanout directory and commits its 1/6th of the batch records to the database in non-overlapping ranges per 10k-workunit batch.
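(Roughly what "commits 1/6th of the batch records in non-overlapping ranges" could look like, sketched in Python rather than the daemon's actual C++; the node count, batch size, workunit naming, and column list are assumptions, not the full BOINC schema handling.)

```python
import mysql.connector  # BOINC's database is MySQL

NODE_ID, NODES = 2, 6      # this node's index and the cluster size (assumed)
BATCH_SIZE = 10_000        # batches of 10k workunits, per the update


def my_range(batch_size: int, node_id: int, nodes: int) -> range:
    """Contiguous, non-overlapping slice of workunit indexes owned by this node."""
    per_node = batch_size // nodes
    start = node_id * per_node
    end = batch_size if node_id == nodes - 1 else start + per_node
    return range(start, end)


def insert_slice(batch_id: int, plan: dict) -> None:
    """Bulk-insert this node's slice of a batch into the workunit table.
    The column list is illustrative, not the full BOINC workunit schema."""
    rows = [
        (f"MCM1_{batch_id}_{i:05d}", plan["appid"], plan["rsc_fpops_est"])
        for i in my_range(BATCH_SIZE, NODE_ID, NODES)
    ]
    db = mysql.connector.connect(host="db", user="boinc", password="...", database="wcg")
    cur = db.cursor()
    cur.executemany(
        "INSERT INTO workunit (name, appid, rsc_fpops_est) VALUES (%s, %s, %s)",
        rows,
    )
    db.commit()


# Plan parameters would come from the "plan" topic; hard-coded here for illustration.
insert_slice(9999901, {"appid": 134, "rsc_fpops_est": 2.0e12})
```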
The new validators are working and deployed. In our new distributed, partitioned approach, validators process only the workunits local to their host: uploads are partitioned according to the fanout directory BOINC assigns and routed by HAProxy to the backend node that owns those fanout buckets. We split the buckets between nodes, so instead of only using them to fan out across the filesystem and avoid massive numbers of files in a single BOINC upload path, we fan out across the cluster. We read and write these buckets in tmpfs, so Apache serves downloads and accepts uploads in memory and validators read in memory, while Kafka/Redpanda gets a copy of uploads in a disk-persisted, replicated topic for durability; if a node goes down and we lose its in-memory cache of downloads and uploads, we can replay and recover.
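(A sketch of that replay/recovery path: re-read the disk-persisted upload topic from the beginning and rebuild the node's tmpfs copy. The topic name, partition ownership, and event layout are assumptions.)

```python
import os
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

TMPFS_ROOT = "/mnt/uploads-tmpfs"   # assumed in-memory mount for this node's buckets
LOCAL_PARTITION = 3                 # the Kafka partition this node owns (assumed)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "upload-replay",
    "enable.auto.commit": False,
})
# Re-read this node's partition of the durable upload topic from the start.
consumer.assign([TopicPartition("uploads.mcm1", LOCAL_PARTITION, OFFSET_BEGINNING)])

while True:
    msg = consumer.poll(5.0)
    if msg is None:        # caught up: nothing more to replay
        break
    if msg.error():
        continue
    # Assumes key = "bucket/result_name" and value = the raw upload bytes.
    dest = os.path.join(TMPFS_ROOT, msg.key().decode())
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as f:
        f.write(msg.value())

consumer.close()
```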
Each validator subscribes to a Kafka topic containing the count of uploads (a reduction over the upload events the new file_upload_handlers emit, restricted to that partition's local buckets) along with the file locations pertaining to a workunit's pair of uploads, and emits success or failure to another topic for downstream "assimilation". We have written and are testing a batch applier that collects successful validation events on each partition and batch-updates the BOINC database so that the transitioner and scheduler can work together to evaluate the state of those workunits. Once we are confident the batch updates from the applier work as expected, users should start seeing workunits pending validation clear to valid.
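(The batch applier idea, sketched with assumed topic names, event fields, and an illustrative UPDATE: accumulate validation successes and apply them to the BOINC database in batches instead of one round trip per result.)

```python
import json
import mysql.connector
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "batch-applier",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["validated.mcm1"])

db = mysql.connector.connect(host="db", user="boinc", password="...", database="wcg")
pending = []  # (granted_credit, result_id) tuples waiting to be applied


def flush():
    """Apply accumulated validation successes in one bulk UPDATE, then commit offsets."""
    if not pending:
        return
    cur = db.cursor()
    # validate_state = 1 is BOINC's VALIDATE_STATE_VALID; columns shown are illustrative.
    cur.executemany(
        "UPDATE result SET validate_state = 1, granted_credit = %s WHERE id = %s",
        pending,
    )
    db.commit()
    consumer.commit()  # only advance the consumer once the DB write is durable
    pending.clear()


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        flush()  # idle or error: apply whatever has accumulated
        continue
    ev = json.loads(msg.value())
    pending.append((ev["credit"], ev["result_id"]))
    if len(pending) >= 500:
        flush()
```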
We are not running file_deleter or db_purge at the moment; they need to be rearchitected to match the new setup, or at minimum assessed to make sure it makes sense to start them unchanged. We have no concerns about running out of space in the database or on disk right now, only about making mistakes, so we will assess what, if anything, needs to change about file_deleter and db_purge soon, but not yet. Likely they will also take advantage of per-workunit event data from Redpanda/Kafka instead of just talking to the BOINC database, and will operate on local partitions across the cluster.
Because we are producing events for every workunit's full lifecycle to Kafka topics, we have a level of visibility and control we were never able to achieve with the legacy system. We were also able to set up Prometheus node_exporter, tap into the docker stats endpoints on each node across the cluster, and do likewise for Redpanda/Kafka with the helpful https://github.com/redpanda-data/observability repo, getting a Grafana dashboard going that will let us do many things, such as serve up server status pages and improve the stats pages.
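(The "docker stats endpoints" bit can be reproduced at home with the Docker SDK for Python; this sketch just prints memory per container, whereas in their setup the numbers would presumably be scraped into Prometheus/Grafana.)

```python
import docker

client = docker.from_env()
for c in client.containers.list():
    s = c.stats(stream=False)                   # one-shot stats snapshot from the Docker API
    mem_bytes = s.get("memory_stats", {}).get("usage", 0)
    print(f"{c.name}: {mem_bytes / 1e6:.1f} MB resident")
```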
October 21, 2025
Finally stress testing rather than correctness testing.
Sent a batch of 100,000 workunits (fast-running, not full size, in case something crashed).
Thank you for your patience and continued support.
21/10/2025 20:49:14 | World Community Grid | Sending scheduler request: To fetch work.
21/10/2025 20:49:14 | World Community Grid | Requesting new tasks for CPU and NVIDIA GPU
21/10/2025 20:49:24 | World Community Grid | Scheduler request completed: got 163 new tasks
Already completed 2x16 tasks without noticing!
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Already completed 2x16 tasks without noticing!
163 tasks completed, none further came down
Except 25 minutes later I somehow picked up 60 Rosetta tasks - no idea where they came from
|
|
just1vet
Joined: 13 Nov 05 Posts: 5 Credit: 6,356,895 RAC: 82
|
Looks like WCG tried a 100k batch yesterday. Think they are having problems processing the returned ones. Haven't seen anything today. At least they are making some progress in getting it lined out. I would rather feed my machine medical units than space projects.
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Looks like WCG tried a 100k batch yesterday. Think they are having problems processing the returned ones. Haven't seen anything today. At least they are making some progress in getting it lined out. I would rather feed my machine medical units than space projects.
Agreed. None have turned up on my work PC or my 2nd home PC
On the plus side, I'm now getting SiDock tasks coming through, so things are more generally looking up
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
October 24, 2025
We have paused uploads and release of test batches while we work on the validation throughput issue.
We think we have identified the root cause of the low validation rate for the MCM1 test batches, and we will send a few more test batches to confirm the fix works.
If this is the final fix and we see the expected validation rate for pairs of MCM1 uploads for new batches, we will replay the Kafka consumer on the upload events fired for test batches received earlier in the week, and this should idempotently allow the new batch assimilator to process those validations and assign credit.
If the above goes well, we will schedule regular MCM1 batches to resume instead of the test batches.
As volunteers have noted, we have not yet reconciled uploads of regular MCM1 results submitted before we began sending test batches, and before the migration, but we have those files and will be able to do this in a batch update once the path for new workunits is working as described above.
Naturally, we will resume ARP and MAM only after these issues are fully resolved.
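(For the curious, "replaying the Kafka consumer" on earlier upload events might look something like this sketch: look up the offsets at a chosen timestamp and consume forward from there. The topic name, partition count, timestamp, and downstream hook are assumptions; as the update says, it only works safely because the downstream processing is idempotent.)

```python
from datetime import datetime, timezone
from confluent_kafka import Consumer, TopicPartition

# Replay upload events from 20 Oct onward (timestamp in milliseconds).
REPLAY_FROM_MS = int(datetime(2025, 10, 20, tzinfo=timezone.utc).timestamp() * 1000)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "mcm1-validation-replay",
    "enable.auto.commit": False,
})

# Ask the broker which offset corresponds to that timestamp on each partition.
parts = [TopicPartition("uploads.mcm1", p, REPLAY_FROM_MS) for p in range(6)]
start_offsets = consumer.offsets_for_times(parts, 10.0)
consumer.assign(start_offsets)


def handle_upload_event(raw: bytes) -> None:
    """Hypothetical hook back into the validation/assimilation pipeline;
    it must be idempotent so already-credited results become no-ops."""
    ...


while True:
    msg = consumer.poll(2.0)
    if msg is None:       # caught up to the end of the replay window
        break
    if msg.error():
        continue
    handle_upload_event(msg.value())

consumer.close()
```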
Checking this, I have 1468 tasks pending validation - about 600 from August and about 850 since the restart
Checking validations, only 1 task returned since the restart has been validated. 1 out of ~600
Let's hope that gets fixed soon...
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
October 24, 2025
We have paused uploads and release of test batches while we work on the validation throughput issue.
We think we have identified the root cause of the low validation rate for the MCM1 test batches, and we will send a few more test batches to confirm the fix works.
If this is the final fix and we see the expected validation rate for pairs of MCM1 uploads for new batches, we will replay the Kafka consumer on the upload events fired for test batches received earlier in the week, and this should idempotently allow the new batch assimilator to process those validations and assign credit.
If the above goes well, we will schedule regular MCM1 batches to resume instead of the test batches.
As volunteers have noted, we have not yet reconciled uploads of regular MCM1 results submitted before we began sending test batches, and before the migration, but we have those files and will be able to do this in a batch update once the path for new workunits is working as described above.
Naturally, we will resume ARP and MAM only after these issues are fully resolved.
Checking this, I have 1468 tasks pending validation - about 600 from August and about 850 since the restart
Checking validations, only 1 task returned since the restart has been validated. 1 out of ~600
Let's hope that gets fixed soon...
Returning home last night to see persistent transient HTTP errors uploading my last 16 tasks to WCG
This morning they'd all cleared and 181 more tasks came down
Remembering the last update (above), I thought I'd recheck where I was on validation: about 620 tasks returned since the restart have now validated - up from 1 - so it has been fixed quickly, though the older tasks remain, as stated. I have a total of 848 tasks pending validation, with ~200 of those since the restart
Happening slowly, but definitely progressing
Edit: Which is all just as well as SiDock tasks have run out and I'm down to my last 3
But I've only just noticed these new WCG tasks only run for 5-6 minutes each.
This must be what they mean by short-running test batches. I've only just understood that.
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
Edit: Which is all just as well as SiDock tasks have run out and I'm down to my last 3
Some SiDock tasks are back
Post restart WCG tasks are continuing to validate, if quite slowly - 181 currently pending, down from 198 at the previous count
|
|
Sid Celery
Joined: 11 Feb 08 Posts: 2478 Credit: 46,506,558 RAC: 2,258
|
October 28, 2025
We have fixed the main validation throughput issues with the new Kafka-based workflow and reprocessed uploads from around the time we started sending out test batches. We are reviewing the Kafka topics and BOINC database to see whether the volunteer reports of test workunits where both results were uploaded but no validation/assimilation occurred during the reprocess point to another bug, and if so, whether it is severe enough to block regular MCM1_024% batch distribution until it is resolved.
In reviewing the transitioner implementation (which we had intended to start yesterday, to begin triggering resends for test batches), we found that the new paradigm for storing the configuration details required to populate resends in the result table needed to be incorporated into key functions. We are testing these relatively minor changes to the transitioner now.
Our plan is to deploy the updated transitioner, verify that resends work, and verify that it times out expired workunits. Depending on how that and the review of the "missed validations" noted above go, we may then be ready to resume MCM1 batches in the normal range.
Regarding uploads that span the downtime for migration, we will reconcile validation and credit for these workunits as soon as the production path for MCM1 described above is running. We should be able to use the new components to do that, after walking the filesystems where those uploads live, double-checking the list of what needs validation and crediting against the database, and pursuing a similar "reprocessing" path to the one that worked well for re-attempting validation and crediting of the test MCM1 batches.
Then we will begin testing beta30/MAM1 and ARP1 using the new system, which we expect to progress much faster now that we have ironed out the logic with MCM1.
Stats updates will be restarted as soon as the MCM1 workflow is stable; that will include the daily export to https://download.worldcommunitygrid.org/boinc/stats/
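(The filesystem-walk reconciliation mentioned above, sketched: walk the upload fanout directories and keep only files whose result rows are still unvalidated in the database. Paths, the query, and state values are assumptions.)

```python
import os
import mysql.connector

UPLOAD_ROOT = "/uploads"   # assumed root of the BOINC fanout directories
db = mysql.connector.connect(host="db", user="boinc", password="...", database="wcg")


def still_unvalidated(names):
    """Return the subset of result names whose rows still await validation
    (validate_state = 0 is BOINC's VALIDATE_STATE_INIT)."""
    if not names:
        return set()
    cur = db.cursor()
    placeholders = ",".join(["%s"] * len(names))
    cur.execute(
        f"SELECT name FROM result WHERE validate_state = 0 AND name IN ({placeholders})",
        list(names),
    )
    return {row[0] for row in cur.fetchall()}


to_reprocess = []
for bucket in os.listdir(UPLOAD_ROOT):
    bucket_dir = os.path.join(UPLOAD_ROOT, bucket)
    if not os.path.isdir(bucket_dir):
        continue
    files = os.listdir(bucket_dir)
    for name in still_unvalidated(files):
        to_reprocess.append(os.path.join(bucket_dir, name))

# to_reprocess can then be fed to the same "reprocessing" path used for the test batches.
print(f"{len(to_reprocess)} pre-migration uploads still need validation/credit")
```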
Only 13 more WCG tasks validated since I last mentioned it - hardly any.
Some new tasks started coming down yesterday, which are running for as long as they used to for me - 77mins in my case
I think I've grabbed just 15 of them and returned 6 already. No idea if that's a limited number of new tasks because I'm pretty stacked with SiDock tasks atm and my buffer looks full
Good to see mention of stats export. Boincstats hasn't shown any new credits yet since WCG came back up. I'm short about 50k credits
|
|