Shorter work units or BOINC gone silly?
Message boards : Number crunching : Shorter work units or BOINC gone silly?
My 2.4GHz Pentium 4 has suddenly decided to download over 30 work units. This is odd because:
--- 21/11/2006 18:47:07 Running CPU benchmarks
--- 21/11/2006 18:48:06 Benchmark results:
--- 21/11/2006 18:48:06 Number of CPUs: 1
--- 21/11/2006 18:48:06 1078 floating point MIPS (Whetstone) per CPU
--- 21/11/2006 18:48:06 2177 integer MIPS (Dhrystone) per CPU
--- 21/11/2006 18:48:06 Finished CPU benchmarks
--- 21/11/2006 18:48:07 Resuming computation
--- 21/11/2006 18:48:07 Rescheduling CPU: Resuming computation
--- 21/11/2006 18:48:07 Resuming network activity
Docking@Home 21/11/2006 18:48:07 Resuming task 1tng_mod0001_11944_308664_2 using charmm version 503
Docking@Home 21/11/2006 19:03:35 Sending scheduler request to http://docking.utep.edu/docking_cgi/cgi
Docking@Home 21/11/2006 19:03:35 Reason: To fetch work
Docking@Home 21/11/2006 19:03:35 Requesting 635712704 seconds of new work

I set it to "no new work" as soon as I saw it and will abort all but two of these WUs so they can be re-issued, then re-install BOINC. Something's obviously amiss. Glad I happened to be looking at that PC as it happened. Imagine 20 years' worth of work with a 5-day deadline - I don't think this CPU will last that long ;D

Has anyone seen this strange behaviour before?
____________
Join the #1 Aussie Alliance on Docking@Home
ID: 1538 | Rating: 0
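A quick sanity check on the size of that request (a minimal sketch in Python; the only input is the 635712704 figure from the log above):

```python
# Convert the logged work request into years of CPU time.
requested_seconds = 635_712_704               # "Requesting 635712704 seconds of new work"
seconds_per_year = 365.25 * 24 * 3600         # about 31.6 million seconds
print(requested_seconds / seconds_per_year)   # ~20.1 years -- hence the joke above
```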
Yes, the BOINC Client can do weird things @ times ... IMO
ID: 1539 | Rating: 0
> Yes, the BOINC Client can do weird things @ times ... IMO

Yeah, has to be BOINC. A clean install is needed when these WUs are finished. Your supercharged X6800 might have some hope of completing 33 WUs in 5 days, but my poor old P4 hasn't got a hope ;D
____________
Join the #1 Aussie Alliance on Docking@Home
ID: 1540 | Rating: 0
It seems that many hosts have downloaded more results than they could compute in time; mine did too. I suspect the deadline is too short. In another thread, "No windows work?", I asked Andre to work on this issue...
ID: 1541 | Rating: 0
> Your supercharged X6800 might have some hope of completing 33 WUs in 5 days, but my poor old P4 hasn't got a hope ;D

By my calculations, my X6800EE should be able to do 75 in a 5-day span. The E6600s I have attached aren't too far behind the X6800; they should be able to get over 65 done. They all downloaded about 50 WUs yesterday, so I should be able to get them done with room to spare ... I'll probably throw 2 or 3 more E6600s at the Project in the next week or so, after I see how the WUs I already have work out ... :)
ID: 1542 | Rating: 0
> Your supercharged X6800 might have some hope of completing 33 WUs in 5 days, but my poor old P4 hasn't got a hope ;D

I'm a bit afraid that these super-fast machines will get all the workunits and prevent PIII machines from reproducing errors, so the cause can't be tracked down... lol
ID: 1543 | Rating: 0
> I'm a bit afraid that these super-fast machines will get all the workunits and prevent PIII machines from reproducing errors, so the cause can't be tracked down... lol

Naaaaa, right now I don't even have a call in for more work, and I've lowered my connection time too. I really don't need that many WUs at one time; I can't process for any other projects when I get that many all at once ...
ID: 1544 | Rating: 0
> It seems that many hosts have downloaded more results than they could compute in time; mine did too. I suspect the deadline is too short. In another thread, "No windows work?", I asked Andre to work on this issue...

Maybe this has to do with the fact that for some time you all got the 'committed to other platforms' message, and the BOINC client is trying to catch up whenever it gets a chance. It still doesn't make sense to download more work than can ever be completed before the deadline, though... Can an experienced boincer explain a bit more about the work fetch policy, maybe?
____________
D@H the greatest project in the world... a while from now!
ID: 1564 | Rating: 0
> Can an experienced boincer explain a bit more about the work fetch policy, maybe?

Well, I'm not a real authority on it, but from running the BOINC projects since the onset of the Beta Seti project I've had a lot of experience learning how to adjust my preferences to get more or less work from a project. A lot of variables come into play: your resource share, your connection time, your debt to the other projects you are attached to, and your benchmarks. Then, if you're over-inflating your benchmarks, you can throw all that out, because you're going to get more work than the PC is capable of processing if the other settings are too high. John McCloud II can explain it in more detail; he's the authority on most of this BOINC stuff. Don't know if he is attached to the Project yet ...
ID: 1575 | Rating: 0
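A rough sketch of how the settings listed in the post above combine into a work request (illustrative only; the variable names are made up here, and the real client also weights the request by inter-project debt):

```python
def seconds_to_request(connect_interval_days, resource_share, hours_on_per_day, n_cpus):
    """Roughly how much work a client might ask for in one scheduler request.

    Illustrative sketch only -- the real BOINC client also accounts for
    debt to other projects and for work already queued on the host.
    """
    available_hours = connect_interval_days * hours_on_per_day * n_cpus
    return available_hours * resource_share * 3600  # in seconds

# Inflated benchmarks compound the problem: the server sends more tasks
# per requested second because it thinks each task will finish sooner.
```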
There is a bug in 5.4.11 and earlier where a task that is downloading does not count as present on the system, and therefore more work will be requested until some actually completes download.
ID: 1577 | Rating: 0
Sorry about butchering the name there, John; at least I got the first name right anyway ... ;)
ID: 1579 | Rating: 0
Hundreds of WUs? Been there, done that (at uFluids and SZTAKI - not here yet)
* the only difference: those WUs were short
ID: 1582 | Rating: 0
This was a problem for many of us when Docking first opened the alpha stage. My computer downloaded 300 WUs in the first few minutes after attaching to the project. BOINC (or my system, or some code in the app, whatever is responsible for making this judgement) initially interpreted the WUs as taking 5 minutes to crunch, as opposed to their more accurate time of 3 hours. Even though my "connect every _" was set to 0.1 days, the WUs just poured in, to the point that my system was lagging (which is when I noticed it and shut it down). Most of us had no choice but to abort a lot of work.
ID: 1591 | Rating: 0
> This was a problem for many of us when Docking first opened the alpha stage. My computer downloaded 300 WUs in the first few minutes after attaching to the project. BOINC (or my system, or some code in the app, whatever is responsible for making this judgement) initially interpreted the WUs as taking 5 minutes to crunch, as opposed to their more accurate time of 3 hours. Even though my "connect every _" was set to 0.1 days, the WUs just poured in, to the point that my system was lagging (which is when I noticed it and shut it down). Most of us had no choice but to abort a lot of work.

The developers set an estimate of the floating point operations required to complete a typical result. Therefore it is on the server side where the cause of the problem lies.
ID: 1595 | Rating: 1
> The developers set an estimate of the floating point operations required to complete a typical result. Therefore it is on the server side where the cause of the problem lies.

We calculate the estimate based on the runtime of a result on a 3 GHz Linux box. I assume that when the estimated runtime in my BOINC Manager equals the actual runtime, the FP estimate must be correct and will scale correctly to all other platforms. If that is not the case, could there be a problem in BOINC? If this is not the correct way of getting a good FP estimate, please, somebody step forward and tell us how to do this :-)
____________
D@H the greatest project in the world... a while from now!
ID: 1600 | Rating: 0
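In outline, the estimation scheme described in that post works something like this (a sketch under the assumptions stated there; rsc_fpops_est is the real BOINC workunit field, but the reference-box numbers are invented for illustration):

```python
# Server side: derive the floating-point-operations estimate from a
# timed run on the 3 GHz Linux reference box (numbers are illustrative).
reference_runtime_s = 6300           # measured runtime of one result (assumed)
reference_flops = 1.5e9              # reference box Whetstone rating (assumed)
rsc_fpops_est = reference_runtime_s * reference_flops

# Client side: BOINC predicts the runtime from its own Whetstone benchmark.
host_flops = 1.078e9                 # the P4 in the first post: 1078 Whetstone MIPS
predicted_runtime_s = rsc_fpops_est / host_flops
print(predicted_runtime_s / 3600)    # predicted hours; reality was far longer
```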
> We calculate the estimate based on the runtime of a result on a 3 GHz Linux box. I assume that when the estimated runtime in my BOINC Manager equals the actual runtime, the FP estimate must be correct and will scale correctly to all other platforms. If that is not the case, could there be a problem in BOINC? If this is not the correct way of getting a good FP estimate, please, somebody step forward and tell us how to do this :-)

It seems that you picked a system that is a bit more efficient than average at these computations. Of the machines that I have attached, most are in the range of 5 to 6, which means the calculations are taking 5 to 6 times as long as expected. Not all computers are exactly as efficient as expected on a given calculation. Therefore, the Duration Correction Factor was developed to deal with these discrepancies. Some of the things that matter but are not measured in the benchmarks are: L1 and L2 cache sizes and speeds, memory bandwidth, and memory size (too small can cause thrashing and very slow results - see BURP and Render@Home for worst-case examples). If all of the DCFs for the project average about 1, then this is the best that can be expected. The real problems arise when a project is off by more than an order of magnitude in its estimates. If the actual average is around 5, it is not too bad, but it could be improved.
ID: 1602 | Rating: 0
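A minimal sketch of the Duration Correction Factor mechanism described above (simplified: the real client raises the factor quickly when tasks run long and lowers it only gradually, whereas this version just nudges it toward the observed ratio):

```python
def update_dcf(dcf, actual_runtime_s, estimated_runtime_s):
    """Nudge the Duration Correction Factor toward the observed ratio.

    Simplified sketch -- the real BOINC client adjusts asymmetrically.
    """
    observed_ratio = actual_runtime_s / estimated_runtime_s
    return dcf + 0.1 * (observed_ratio - dcf)

# The client then shows: estimated runtime = rsc_fpops_est / host_flops * dcf,
# so a project-wide DCF average of ~5 means the estimates are 5x too low.
```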
Hi John,

> It seems that you picked a system that is a bit more efficient than average at these computations. Of the machines that I have attached, most are in the range of 5 to 6, which means the calculations are taking 5 to 6 times as long as expected.

____________
D@H the greatest project in the world... a while from now!
ID: 1603 | Rating: 0
> Hi John,

Yes, I believe that is probable. The factor to change the fpops estimate by should be about the average of the current duration_correction_factors that you have in your database / client. After you do this, it will take some time for the DCF values to reach equilibrium again, as they will be headed down. BTW, the worst estimates I have seen from any of the projects were off by a factor of about 1000 on the low side, and about E+70 on the high side. So yours is not too bad for an initial estimate.
ID: 1604 | Rating: 0
Will completed WUs that finish past the deadline still count or should I just abort those? I also had received too many for me to complete even with my machine running 24/7. I don't want to waste time if it doesn't matter. Thanks.
ID: 1611 | Rating: 0
> Will completed WUs that finish past the deadline still count or should I just abort those? I also had received too many for me to complete even with my machine running 24/7. I don't want to waste time if it doesn't matter. Thanks.

I fear not, unfortunately. If the deadline has passed, the server produces and distributes a new result, regarding the late result as "no reply".
____________
I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions.
ID: 1612 | Rating: 0
10-4 I just didn't want to waste the time if the result didn't matter.
ID: 1613 | Rating: 0
> Will completed WUs that finish past the deadline still count or should I just abort those? I also had received too many for me to complete even with my machine running 24/7. I don't want to waste time if it doesn't matter. Thanks.

Actually, it depends on how late the return is, how fast the server is at generating new tasks when a result is late, how fast they are handed out, and how fast they are returned. If the late work is returned before the replacement is actually sent, the replacement will not be sent. If the late work is returned before the replacement is verified, then the late work will at least contribute towards the verification. If the late work is returned after a quorum is met and verified, it counts for nothing.
ID: 1616 | Rating: 3
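That policy can be summarised in pseudocode (a sketch of the rules exactly as stated in the post above, not the actual scheduler source):

```python
def late_result_outcome(replacement_sent: bool, quorum_verified: bool) -> str:
    """Outcome for a result returned after its deadline, per the post above."""
    if not replacement_sent:
        return "replacement never goes out; the late result is used normally"
    if not quorum_verified:
        return "late result still contributes toward verification"
    return "quorum already met and verified; the late result counts for nothing"
```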
> Will completed WUs that finish past the deadline still count or should I just abort those? I also had received too many for me to complete even with my machine running 24/7. I don't want to waste time if it doesn't matter. Thanks.

Thanks for the information. BTW, everyone can use the rating function... I tested it just now :)
____________
I'm a volunteer participant; my views are not necessarily those of Docking@Home or its participating institutions.
ID: 1618 | Rating: 1
> We calculate the estimate based on the runtime of a result on a 3 GHz Linux box. I assume that when the estimated runtime in my BOINC Manager equals the actual runtime, the FP estimate must be correct and will scale correctly to all other platforms. If that is not the case, could there be a problem in BOINC?

This is still a problem. I just re-attached my 3.4GHz Pentium 4 with HT, running Windows. It typically takes 10 hours per work unit, and BOINC asked for 1 day's worth. Your server sent 19 work units, which will take this computer 4 full days if I suspend everything else on it and don't switch it off at night. Docking has only a 20% share of resources and the computer is only running 18 hours a day, so I only needed 1 work unit.

BOINC is certainly partly to blame - it asked for two days' worth (1 day per logical CPU) when it should have asked for about 7 hours of work (20% of 18 hours a day available on each logical CPU). But the WU time estimates are still way off the mark. Had BOINC requested only 7 hours' worth, your scheduler (which seems to think they will take 1:45 each, when in reality they take 10 hours) might still have sent 4 work units (about 40 hours of crunching, rather than the 7 needed).

The fastest I've completed a valid WU on my Athlon XP 3000+ was 4 hours (running Linux) and 6.5 hours (running Windows). Even taking into account that these WUs take longer under Windows than Linux, it's still too big a discrepancy. My fastest PC (Athlon 64 at 2.6GHz with 500MHz dual-channel DDR) still takes over 5 hours per work unit under Windows...

Just for the record, I'm running the stock BOINC client with low benchmarks, so the discrepancy in estimated time to complete is not a matter of inflated benchmarks. Might it be worth comparing the benchmarks on your 3GHz Linux system and my 3.4GHz Windows system?
____________
Join the #1 Aussie Alliance on Docking@Home
ID: 1717 | Rating: 0
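Spelling out the arithmetic in that post (same figures Yoda quotes):

```python
resource_share = 0.20        # Docking's share on this host
hours_on_per_day = 18
logical_cpus = 2             # 3.4 GHz P4 with HyperThreading
connect_interval_days = 1

hours_needed = connect_interval_days * hours_on_per_day * logical_cpus * resource_share
print(hours_needed)          # 7.2 -- the "about 7 hours" in the post

# Even a correct 7-hour request yields ~4 tasks at the server's
# 1.75 h/WU estimate, i.e. ~40 hours of real work at 10 h each.
print(hours_needed / 1.75)   # ~4 work units
```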
> BOINC is certainly partly to blame - it asked for two days' worth (1 day per logical CPU) when it should have asked for about 7 hours of work (20% of 18 hours a day available on each logical CPU).

This will be fixed in 5.8.x (the next released version). The code is already in the client, but other problems are holding up the release.
ID: 1718 | Rating: 0
Thanks for the info Yoda.
____________
D@H the greatest project in the world... a while from now!
ID: 1719 | Rating: 0
The worst problem only exhibits itself until the client has completed its first result. After that point, the Duration Correction Factor will be large enough to prevent download of vast quantities of work.
ID: 1720 | Rating: 0
I think over-estimating the time taken to complete a work unit is less likely to cause problems than under-estimating. Doubling the estimate is certainly a step in the right direction.
ID: 1721 | Rating: 0
Hello Webmaster Yoda, I found this information by Cold Shot in another thread that may help with your Linux problem on Ubuntu 6.10:
ID: 1722 | Rating: 0
> Hello Webmaster Yoda, I found this information by Cold Shot in another thread that may help with your Linux problem on Ubuntu 6.10:

I used the official BOINC 5.4.9 (I didn't see a later one, other than development versions). ulimit was already set to unlimited (I checked that), but I guess I could specify it just to be sure. I am trying to resurrect an old hard drive to see if the lack of a swap file (running just on a RAM disk) has anything to do with getting this same error.
____________
Join the #1 Aussie Alliance on Docking@Home
ID: 1723 | Rating: 0
Hello Webmaster Yoda.
ID: 1725 | Rating: 0
Thanks David, I tried it under VMWare as well, but it (and Linux) doesn't like my wireless network, so I gave up on that idea.
ID: 1729 | Rating: 0
> Thanks David, I tried it under VMWare as well, but it (and Linux) doesn't like my wireless network, so I gave up on that idea.

Did you have an active firewall on the VMware network adapter(s)? I've turned it off on both my VMware adapters, because WinXP didn't seem to work well while it was turned on. Norton just did not come up, and VMware was complaining about the network not being active. Turning the firewall off on those cleared my problem. ;-)
ID: 1732 | Rating: 0
The checkpointing method and period will be the first thing we work on when Charmm c33b1 is released by the Charmm developers. Hopefully this won't take too long any more, and it will bring the disk activity down to an acceptable level (although I'm slowly getting desperate...). Also, we currently write a lot of debug information to the charmm logfile, which is another cause of lots of disk activity; since we are in alpha we have to do this to find problems more easily. As soon as we have most of our pressing problems solved, we can cut back on this too.

Thanks,
Andre
____________
D@H the greatest project in the world... a while from now!
ID: 1736 | Rating: 0
Thanks Andre. I'll keep an eye on the news and message boards.
____________
Join the #1 Aussie Alliance on Docking@Home
ID: 1741 | Rating: 0
Hello Webmaster Yoda. I got Linux running on VMware under Windows, and indeed it runs faster (D@H).
ID: 1758 | Rating: 0