Message boards : Cafe Rosetta : A couple of "old school" IT suggestions
Author | Message |
---|---|
Alan Roberts Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
Back in the prehistoric times when I supported computers used by research groups, there were these archaic devices called "pagers." All sorts of software conditions (missing processes, low or failed storage, systems down) and environmental conditions (HVAC faults, power failures) would page the on-call member of the team, and if he or she failed to respond, the page would eventually shift to their backup. Remote access was a primitive thing done with dial-up modems, so sometimes things could be fixed from home; other times it was a drive to the servers, where you might or might not be meeting the vendor's field engineer on-site. Strangely enough, the on-call schedule provided coverage across holidays once the team had puzzled out who was likely to be available on what dates. These days, when every cell phone on the planet can receive text messages, it seems it would require no additional hardware and little development cost to have any "Not Running" fault on the Server Status page also send an alert message to Rosetta's IT people. Since all the really cool people have smart phones that can surf the web, another thought comes to mind... It would seem to be in the self-interest of the researchers with active projects on Rosetta to visit the web site every couple of days and check whether "work is happening." I'll admit I am somewhat confused, because the front page claims 567,925 queued jobs while Server Status reports only 1,281 jobs pending, but someone must understand which number is accurate. Not really griping, just tossing out some ideas to avoid future failures, Alan |
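To make the paging idea concrete, here is a minimal sketch of the escalation Alan describes: poll a snapshot of the status page, text the on-call person for any daemon reported as "Not Running", and shift the page to the backup if nobody responds. Everything here is an assumption for illustration: the status page is modeled as a plain `{daemon: state}` dict, the names `alice`, `bob`, `feeder` and `make_work` are hypothetical, and `send_text` only records messages instead of talking to a real SMS gateway.

```python
from dataclasses import dataclass, field


@dataclass
class OnCall:
    primary: str
    backup: str
    escalate_after: int = 3  # consecutive unacknowledged checks before the backup is paged


@dataclass
class Pager:
    schedule: OnCall
    sent: list = field(default_factory=list)     # (recipient, message) pairs, for inspection
    unacked: dict = field(default_factory=dict)  # daemon -> consecutive "Not Running" checks

    def send_text(self, who: str, msg: str) -> None:
        # Hypothetical stand-in for a real SMS or e-mail gateway; it only records the message.
        self.sent.append((who, msg))

    def check(self, status: dict, acknowledged: frozenset = frozenset()) -> None:
        """Run one polling pass over a {daemon: state} snapshot of the status page."""
        for daemon, state in status.items():
            if state != "Not Running" or daemon in acknowledged:
                # Recovered, or someone has responded: reset the escalation counter.
                self.unacked.pop(daemon, None)
                continue
            count = self.unacked.get(daemon, 0) + 1
            self.unacked[daemon] = count
            # Page the primary first; shift to the backup once the fault goes unanswered.
            if count < self.schedule.escalate_after:
                who = self.schedule.primary
            else:
                who = self.schedule.backup
            self.send_text(who, f"{daemon} is Not Running (check {count})")


if __name__ == "__main__":
    pager = Pager(OnCall(primary="alice", backup="bob"))
    snapshot = {"feeder": "Running", "make_work": "Not Running"}
    for _ in range(3):
        pager.check(snapshot)
    for recipient, message in pager.sent:
        print(recipient, "<-", message)
```

The first two alerts go to the primary and the third goes to the backup; a later snapshot showing the daemon as "Running" (or an acknowledgement) resets its counter.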
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, there are two figures for the number of work units available: one on the homepage, and another on the server status page. And yes, it is confusing. Basically, the server status page shows all of the work that has been created within BOINC. The number on the homepage is a queue of work that is READY to be created within BOINC. And there is this little make-work task in between the two that takes work from the queue and creates it within BOINC. This is the task that has apparently not been functioning these last few days. The few work units shown on the server status page all seem to be coming from tasks that are passing their expiration dates without a result, rather than from the queue of pending work. Rosetta Moderator: Mod.Sense |
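A toy model may help show why the two figures can diverge so sharply. This is a sketch under stated assumptions, not how Rosetta's server actually works: the homepage figure is modeled as `queued`, the server status figure as `unsent`, and the make-work step moves up to `batch` jobs from one to the other per interval. The starting counts come from Alan's post; the per-interval rates (`batch`, `requested`, `expired`) are made-up illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkCounts:
    queued: int  # homepage figure: work READY to be created within BOINC
    unsent: int  # server status figure: work already created within BOINC, waiting to be sent


def tick(c: WorkCounts, make_work_running: bool, batch: int,
         requested: int, expired: int) -> WorkCounts:
    """Advance the toy model by one interval.

    batch     -- queued jobs the make-work step converts per interval (assumed rate)
    requested -- tasks volunteer clients ask for per interval (assumed rate)
    expired   -- tasks that passed their deadline without a result and are reissued
    """
    created = min(batch, c.queued) if make_work_running else 0
    # Reissued expired tasks are the only new supply when the make-work step is down.
    available = c.unsent + created + expired
    sent = min(requested, available)
    return WorkCounts(queued=c.queued - created, unsent=available - sent)


if __name__ == "__main__":
    counts = WorkCounts(queued=567_925, unsent=1_281)
    for interval in range(1, 4):
        counts = tick(counts, make_work_running=False, batch=1_000,
                      requested=1_000, expired=200)
        print(interval, counts)
```

With the make-work step stopped, the homepage figure never moves while the server status figure drains toward the trickle of reissued expired tasks, which matches the symptom described above.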
Alan Roberts Joined: 7 Jun 06 Posts: 61 Credit: 6,901,926 RAC: 0 |
Yes, there are two figures for number of work units available. ... Mod.Sense, thanks for the explanation of the difference. I guess this means anyone doing a status check would need to glance a little more carefully at the home page (noticing, perhaps, that "Credits last 24h" has plummeted) to realize that work isn't happening. Regards, Alan |
joseps Joined: 25 Jun 06 Posts: 72 Credit: 8,173,820 RAC: 0 |
Have patience. This is a volunteer-run sort of operation, not a for-profit business. A major supplier would not let this happen. All volunteers are like consumers: if one supplier is down, we simply switch to the next supplier. That's why I have temporarily switched to WCG for work. I turned off my five computers when I went on vacation. When I returned today, I could not upload work. I need work units to run my computers. joseps |
darwincollins Joined: 1 Oct 09 Posts: 7 Credit: 5,586,679 RAC: 0 |
I wouldn't excuse it as "non-profit" vs. "for-profit". I know of several non-profit and volunteer-run organizations that run much tighter ships than my day job. Even at my day job (government non-profit), the IT staff who don't give a flip still wouldn't have equipment (officially) down for days. In this case, the Rosetta folks are probably doing the best they can, and may be doing a lot of on-the-job training about SANs, etc., to get us back running. As for the clients, we need to realize that they can be just as dedicated. If they run multiple projects, they may not notice any downtime. If a client was crunching solely for Rosetta, then they now have to decide whether to remain faithful or "add projects" to BOINC. As the days of downtime continue, some clients will come to perceive a lack of dedication or motivation and will move on to other projects altogether. |
©2024 University of Washington
https://www.bakerlab.org