A couple of "old school" IT suggestions

Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 64728 - Posted: 2 Jan 2010, 14:57:44 UTC

Back in the prehistoric times when I supported computers used by research groups, there were these archaic devices called "pagers." All sorts of software conditions (missing processes, low/failed storage issues, systems down) and environmental conditions (HVAC faults, power failures) would page the on-call member of the team, and if he/she failed to respond the page would eventually shift to their backup.

Remote access was a primitive thing done with dial-up modems, so sometimes things could be fixed from home; other times it meant a drive to the servers, where you might or might not be meeting the vendor's field engineer on-site.

Strangely enough, the on-call schedule provided coverage across holidays once the team had puzzled out who was likely to be available on what dates.

In these days when every cell phone on the planet can receive text messages, it seems to require no additional hardware and not much development cost to have any Server Status "Not Running" fault also send an alert message to Rosetta's IT people.
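The paging-with-backup idea could be sketched roughly as follows. This is only an illustration, not Rosetta's actual setup: the daemon names, the `Contact` type, the phone numbers, and the `send` and `acknowledged` callbacks are all hypothetical stand-ins for whatever status feed and text-message gateway the project would really use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Contact:
    name: str
    phone: str  # hypothetical number a text message would go to


def failed_daemons(status: Dict[str, str]) -> List[str]:
    """Return the names of daemons reported as "Not Running", sorted."""
    return sorted(name for name, state in status.items() if state == "Not Running")


def page_with_escalation(
    message: str,
    rota: List[Contact],
    send: Callable[[Contact, str], None],
    acknowledged: Callable[[Contact], bool],
) -> Optional[Contact]:
    """Page each contact in rota order; stop at the first who acknowledges."""
    for contact in rota:
        send(contact, message)
        if acknowledged(contact):
            return contact
    return None  # nobody responded


# Example with made-up data: the primary ignores the page, the backup answers.
status = {"scheduler": "Running", "work_generator": "Not Running"}
rota = [Contact("primary", "555-0100"), Contact("backup", "555-0101")]
sent = []  # records (contact name, message) instead of sending real texts
down = failed_daemons(status)
responder = None
if down:
    responder = page_with_escalation(
        "Not Running: " + ", ".join(down),
        rota,
        send=lambda c, msg: sent.append((c.name, msg)),
        acknowledged=lambda c: c.name == "backup",
    )
```

In a real deployment `send` would hand the message to an SMS or email gateway, and `acknowledged` would wait some agreed interval for a reply before moving on to the backup.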


Since all the really cool people have smart phones that can surf the web, another thought comes to mind ... It would seem to be in the self-interest of the researchers with active projects on Rosetta to visit the web site every couple of days and check to see if "work is happening."

I'll admit I am somewhat confused because the front page claims 567,925 queued jobs, while Server Status reports only 1,281 jobs pending, but someone must understand which number is accurate.


Not really griping, just tossing out some ideas to avoid future failures,
Alan

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64734 - Posted: 2 Jan 2010, 18:28:43 UTC

Yes, there are two figures for number of work units available. One on the homepage, and another on the server status page. And yes, it is confusing. Basically, the server status page shows all of the work that has been created within BOINC. The number on the homepage is a queue of work that is READY to be created within BOINC. And there is this little make work task in between the two that takes work from the queue and creates it within BOINC. This is the task that is apparently not functioning these last few days. The few shown on the server status page all seem to be coming from tasks that are crossing their expiration dates without a result, rather than from the queue of pending work.
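To make the two counts concrete, a status check along these lines could flag the situation where plenty of work is queued but little of it has reached BOINC. This is a rough sketch only: the function name and the `min_pending` threshold are assumptions, not anything the project actually runs. The two counts are the ones quoted earlier in this thread.

```python
def work_creation_looks_stalled(queued: int, pending: int, min_pending: int = 5000) -> bool:
    """True when work is waiting in the queue but few jobs exist within BOINC.

    min_pending is a hypothetical threshold chosen purely for illustration.
    """
    return queued > 0 and pending < min_pending


# Figures quoted in this thread: homepage queue vs. server status page.
stalled = work_creation_looks_stalled(queued=567_925, pending=1_281)
```

A single snapshot like this can give false alarms, so a more careful check would sample both numbers over time and look for a queue that is not draining.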
Rosetta Moderator: Mod.Sense
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 64735 - Posted: 2 Jan 2010, 18:42:59 UTC - in response to Message 64734.  

Yes, there are two figures for number of work units available. ...


Mod.Sense, thanks for the explanation of the difference. I guess this means it would take a slightly more careful glance at the home page (noticing that Credits last 24h has plummeted, perhaps) for anyone doing a status check to realize that work isn't happening.

Regards,
Alan


joseps

Joined: 25 Jun 06
Posts: 72
Credit: 8,173,820
RAC: 0
Message 64738 - Posted: 2 Jan 2010, 20:03:39 UTC

Have patience. This is a volunteer-work sort of organization, not a for-profit business. A major supplier would not let this happen.
All volunteers are like consumers. If one supplier is down, we simply switch to the next supplier. That's why I just switched temporarily to WCG for work.
I turned off my 5 computers when I went on vacation. When I returned today, I could not upload work. I need work units to keep the computers running.
joseps
darwincollins

Joined: 1 Oct 09
Posts: 7
Credit: 5,586,679
RAC: 0
Message 64806 - Posted: 5 Jan 2010, 1:50:12 UTC

I wouldn't excuse it as 'non-profit' vs 'for-profit'. I know of several non-profit and volunteer-run organizations that run much tighter ships than my day job. Even at my day job (government non-profit), the IT staff that don't give a flip still wouldn't have equipment (officially) down for days.

In this case, the Rosetta folks are probably doing the best that they can, and may be doing a lot of on-the-job training about SANs, etc., to get us back running.

As for the clients, we need to realize that they can be just as dedicated.
If they run multiple projects, they may not notice any downtime.

If the client was pushing (solely) for Rosetta, then they now have to decide whether to remain faithful or 'add projects' to BOINC.

As the days of downtime continue, some clients will get the perception that there is a lack of dedication or motivation, and so will move on to other projects altogether.



