Posts by Steven Meyer

11)

Message boards : Number crunching : Bug Report - Random Reboots

( Message 5300 )
Posted 2923 days ago by Steven Meyer
There seem to be some bug(s) that may be causing my computer to randomly reboot.

This is the situation:

I have been running S@H on my Q6600 for about a year. During this time I have not seen any random reboots.

Recently, I upgraded the NVidia video driver in order to support the optimized CUDA app from S@H.

Subsequently, I started to notice some occasional random reboots about 3-4 times per week.

I decided that it was probably some issue with the new NVidia video driver, but was not bothered enough by it to do something about it yet.

Then, for reasons having to do with improving the through-put of this computer by letting it use mostly optimized apps, I set D@H to "No new Tasks" on this computer.

Eventually, it ran out of D@H tasks to do and was running only S@H for several days.

Today I temporarily set D@H to allow new tasks, so that there would be a few in the queue for the times when S@H is having problems servicing their client computers.

I got 6 D@H tasks, and 3 of them started running immediately.

The computer immediately had another random reboot, which I had not seen while there were no D@H tasks to be run.

Now I have set D@H to "No new Tasks" once again, and suspended those tasks that were downloaded.

We shall see if there are any more random reboots while D@H is not running.

I wonder if there are any switches that I can set that will cause the D@H apps to write more debugging info to the log so that the cause of the random reboots can be determined.
12)

Message boards : Web site : New Docking@Home Website

( Message 5164 )
Posted 2946 days ago by Steven Meyer
The links from the Notification section of the "Your Docking@Home account" page are returning ...

Not Found

The requested URL /account/community/forum/thread.php was not found on this server.
Apache/2.2.6 (Fedora) Server at docking.cis.udel.edu Port 80

For example, "High Priority" Strikes Again.


This is still not working.
Click HERE to see an example of the error messages.
13)

Message boards : Number crunching : Wrong numbers in "max # of error/total/success tasks"

( Message 5090 )
Posted 2967 days ago by Steven Meyer
As you haven't changed the numbers, am I correct that you will never resend non-valid completed WUs ever?

In connection with this thread I think that's right to assume. Does that mean you don't need all WUs crunched to achieve your result?


Hi, I want to address the question "Does that mean you don't need all WUs crunched to achieve your result?" We consider every WU important; however, a WU is relevant not as a single result but as part of a larger set of results. We post-process the set of results and identify tendencies. The more results we have, the higher the probability that they converge toward a correct answer.

In other words, we replaced redundancy (that resulted in wasted cycles) with a powerful post-processing analysis that allows us to identify outliers (possible results affected by errors) and convergent results (tentative answers).

Michela


Michela, some time ago, when I was first starting to crunch D@H WU, I was sent a large number of WU with a short deadline. Since D@H is not my only project, the work overload caused D@H to run in "High Priority" thus shutting down the other project. In order to reduce the work overload, I aborted about half of the D@H WU.

Then I checked one of the aborted WU on the web site and saw this line:

max # of error/total/success tasks 0, 1, 1

Since the abort was counted as an error, all of the aborted Work Units will never be sent out again.

This may or may not be an issue, given your post-processing.

Now, however, I see that the number of success tasks has been set to 2, but the error and total numbers are unchanged.

max # of error/total/success tasks 0, 1, 2

It might make sense to change the counts to be 1 for errors, 1 or 2 for total, and 2 for success so that an abort will not prevent the WU from being reissued.

D@H settings of errors 0, total 1, success 2 will cause the WU to be abandoned with one error or any 2 results.

Again, maybe this is OK with your post-processing...
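
To make the counting concrete, here is a minimal sketch of the rule as I read it. It assumes each limit is an "exceeds" threshold (a WU is given up on once a count goes above its maximum), which matches the behaviour described above but is not taken from the D@H server code. The names WorkunitLimits and abandon_reason are made up for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkunitLimits:
    """Hypothetical container for the three per-WU limits shown on the web site."""
    max_error: int
    max_total: int
    max_success: int


def abandon_reason(limits: WorkunitLimits, errors: int, successes: int) -> Optional[str]:
    """Return why the WU would stop being reissued, or None if it can still go out.

    Assumption: a limit is violated when the count exceeds it.
    An aborted task is counted as an error, as observed in this thread.
    """
    total = errors + successes
    if errors > limits.max_error:
        return "too many error results"
    if total > limits.max_total:
        return "too many total results"
    if successes > limits.max_success:
        return "too many success results"
    return None


current = WorkunitLimits(max_error=0, max_total=1, max_success=2)
proposed = WorkunitLimits(max_error=1, max_total=2, max_success=2)

# One abort under the current settings ends the WU; under the proposed ones it does not.
print(abandon_reason(current, errors=1, successes=0))
print(abandon_reason(proposed, errors=1, successes=0))
```

Under these assumptions a single abort is enough to retire a WU with the current 0, 1, 2 settings, while the 1, 2, 2 settings would let it be sent out once more.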
14)

Message boards : Number crunching : When a work unit fails, the computer doesn't keep going. . .

( Message 5089 )
Posted 2967 days ago by Steven Meyer
I came in this morning to find all my computers were giving this error:

charmm34_6.15_windows_intelx86 has encountered a problem and needs to close. We are sorry for the inconvenience.

The tasks giving this error had been running much longer than normal (14 hours instead of 3 hours). Even worse, they didn't just accept the error and move on to the next work unit. Thus, all of my processors have been sitting idle for a really long time.

Is the source of these failing work units known, and is there some way, when one fails, for the program to just move on to the next unit? It's gonna be a while before the RAC recovers from this (16 processors down for a significant chunk of time and returning only a handful of errored work units that don't receive credit).


This message
such-and-so-program has encountered a problem and needs to close
is from the Windows Operating System, and thus is outside the control of BOINC, or the client programs from D@H. The operating system puts the dialog box on the screen in order to inform you of the failure of the program and then waits for you to respond by clicking on the "OK" button. (IMO this should really be a "Bummer" button!) In any case, the operating system is really patient and will wait forever for you to respond.

Although the message box is outside the control of D@H, I would think that the program developers at D@H will be interested in why their program raised such an error on so many computers at about the same time since it is likely that there is a bug in their code.

The other possibility is that some other program that is running on all of your computers is the cause of the failure by stepping on something needed by the D@H program.

Note: That other program could be a computer virus or worm. Do be sure to check your computers for infections.

Can you think of something else that was running at the same time on all of those computers?

Note too: That other program could be your virus scanner! It could be, for example, that the virus scanner will open some file in order to scan it for viruses with an exclusive lock, which prevents other programs from opening the file until the virus scanner is done with the file. If the D@H code tries to open the file and does not handle the failure to open the file, then that can be an error that will be raised to the operating system and may result in the message that you saw.
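
As an illustration of what "handling the failure to open the file" could look like, here is a small sketch in Python. It is not the D@H application code (I have not seen it); the function name open_with_retry, the retry count, and the delay are all made up for the example. It only shows the general idea of retrying for a short while when another program holds the file, instead of letting the error go unhandled.

```python
import time


def open_with_retry(path, mode="rb", attempts=5, delay_seconds=0.5):
    """Open a file, retrying briefly if access is denied.

    Assumption: a file held with an exclusive lock by another process
    (for example a virus scanner) shows up as PermissionError, which is
    what Python typically raises for a Windows sharing violation.
    Other errors, such as a missing file, are not retried.
    """
    for attempt in range(1, attempts + 1):
        try:
            return open(path, mode)
        except PermissionError:
            if attempt == attempts:
                raise  # give up and report the failure to the caller
            time.sleep(delay_seconds)  # wait for the other program to finish
```

A caller would still need to catch the final exception and fail the task cleanly, so that the next work unit can start instead of a dialog box waiting on the screen.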
15)

Message boards : Web site : Web Site Mix-up?

( Message 5077 )
Posted 2982 days ago by Steven Meyer
Now it appears that my own account is the "Default" account on the Server Status page. I logged out, then clicked on the Server Status link, and it told me that I was logged on. However, a click on the "My Account" link showed the login page, indicating that I was not logged on.
16)

Message boards : Number crunching : "Too many error results"

( Message 5076 )
Posted 2982 days ago by Steven Meyer
Timing and scheduling are part of the problem, but the biggest part is that a single user can cause work units to never be calculated by simply aborting them.

There are two solutions that I can think of . . .

  • Increase the number of error results allowed on each WU to at least 2
  • - OR -
  • Do not count an Abort as an Error.

17)

Message boards : Web site : New Docking@Home Website

( Message 5075 )
Posted 2982 days ago by Steven Meyer
The links from the Notification section of the "Your Docking@Home account" page are returning ...

Not Found

The requested URL /account/community/forum/thread.php was not found on this server.
Apache/2.2.6 (Fedora) Server at docking.cis.udel.edu Port 80

For example, "High Priority" Strikes Again.
18)

Message boards : Number crunching : "Too many error results"

( Message 5061 )
Posted 2988 days ago by Steven Meyer
This WU, and many more like it, were sent to my computer with a very short deadline, resulting in everything running at "High Priority" in order to try to get them all done before the deadline. In order to reduce the work overload, I aborted about half of the tasks.

There are two problems here.

  • "High Priority" causes D@H to incur major debt with respect to other projects.
  • The abort of a task gets counted as an "Error". Since a WU is allowed no more than 1 error, the WU will not be sent out again.

19)

Message boards : Number crunching : "High Priority" Strikes Again

( Message 5057 )
Posted 2990 days ago by Steven Meyer
I recently started crunching for Docking@Home as a second project, S@H being my first.

Docking@Home has repeatedly fetched a large stack of work units, all of which are due in such a short amount of time that all of them are required to run "High Priority" in order to get all of them done in time.

This cheats other projects out of their share of CPU time, and puts D@H into a large amount of Debt in relationship to other projects.

In order to remedy the problem by reducing the work load from D@H, I have had to abort dozens of D@H work units.

Something needs to be done to reduce the number of D@H work units fetched or else increase the time allowed to complete them so that the work load is not so heavy that "High Priority" is required to get the job done in time.
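
A rough back-of-the-envelope check shows why this happens. The sketch below is a simplification and not how the BOINC client actually schedules; the function name and every number in it are illustrative assumptions. It just compares the CPU-hours a batch needs with the CPU-hours the project would get at its normal share before the deadline.

```python
def needs_high_priority(num_tasks, hours_per_task, cores, project_share, hours_to_deadline):
    """Return True if the batch cannot finish on time at the project's normal share.

    Simplified model (an assumption, not the real BOINC scheduler):
    the project gets project_share of every core until the deadline.
    """
    cpu_hours_needed = num_tasks * hours_per_task
    cpu_hours_available = cores * project_share * hours_to_deadline
    return cpu_hours_needed > cpu_hours_available


# Illustrative numbers only: 20 tasks of 3 hours each on a 4-core machine,
# with D@H at a 50% share.
print(needs_high_priority(20, 3.0, cores=4, project_share=0.5, hours_to_deadline=24))  # True
print(needs_high_priority(20, 3.0, cores=4, project_share=0.5, hours_to_deadline=48))  # False
```

With these made-up numbers, either doubling the deadline or halving the batch size brings the work back within the normal share, which is the kind of change being asked for here.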
20)

Message boards : Web site : Web Site Mix-up?

( Message 5048 )
Posted 2993 days ago by Steven Meyer
That shouldn't happen, let me ask Brian

Thank you for letting us know


Happened again this morning with a different name:

Hello! You are logged in as:
deee


