|Summary:|Make yum feel snappier by caching repodata files|
|Product:|Fedora|
|Reporter:|Sigge Kotliar <sigge>|
|Component:|yum|
|Assignee:|Jeremy Katz <katzj>|
|Status:|CLOSED RAWHIDE|
|Doc Type:|Enhancement|
|Last Closed:|2005-09-21 18:06:29 UTC|
Description Sigge Kotliar 2005-07-17 17:08:25 UTC
Hi! I apologise in advance for this entry being very long, but I hope it is worth the read.

As it stands, every call to yum downloads one small file per repo. This file presumably indicates whether the repo has changed since the last check: if yes, yum downloads a full repo list; if no, it reuses the old one. Although these files are each only ~1 kB, each one is a separate HTTP request to a separate server. On my machine this takes approximately one second per repo; with six repos, that is a bit more than six seconds of waiting before any real action happens. As I will show below, this check can usually be avoided, making yum feel a lot faster. Even today, with all the great speed improvements yum has had, some people still complain that yum is "slow". By slow they do not mean slow at calculating dependencies, but slow from the moment a command is issued to the moment it executes, largely because of these unnecessary calls to HTTP servers.

Some use cases:

1) The user runs "yum check-update". Servers are checked, updates listed. The user then decides to update packages a, b, and c. When they issue "yum update a b c", the HTTP requests for the repodata files are sent again, causing a couple of seconds of unnecessary wait.
2) The user wants to remove package foo and issues "yum remove foo". It is only a remove; there is no need to fetch the latest repo data, just calculate dependencies from what is already stored. (I personally always use the -C option to skip fetching repo data.)
3) The user issues "yum install foo"; repos are updated, everything is fine. But if the user then wants to install bar, the repos update again. Why? Nothing has changed; isn't it safe to assume the repos are still current?
4) The user wants info on package foo and issues "yum info foo". They wait six unnecessary seconds before the info is shown; reading the info on package bar takes another six seconds.
5) The "base" repo never changes. Ever. Yet it gets "pinged" every time I run "yum anything".

My suggestions for improvement:

a) Whenever the repomd.xml files are fetched, the time of the fetch is stored, and no new download of repomd.xml is attempted for a period X. (X perhaps needs some discussion; I'd say 12 hours or so.)
b) Actions that do not strictly need server interaction, like yum remove and yum info, should run with something equivalent to the -C option and not contact any server unless it is needed.
c) A command-line option like "--force-update" is created to force yum to fetch the latest repomd.xml files. This is for "power users" who want the newest data, but it is not the default behaviour, as most users want a quick response.

Benefits of the change I propose:

* Yum feels snappier. People are no longer afraid of using yum for simple tasks such as installing a downloaded rpm package, removing a package without dependencies, or reading package info.
* A large number of configured repos no longer bogs yum down: no more one-plus seconds of wait per repo.
* People will use rpm -Uvh and rpm -e less, and yum install/yum remove more, now that yum does not make them wait a couple of seconds on every command. Perhaps regular users will never need to run rpm commands directly again, in favour of yum for all rpm-related tasks.
* Less strain on the servers. Probably not that big a difference, but still.
* Yum frontends become much, much faster. Right now, tools like yumex (and perhaps, in the future, pup) have to: 1) do a check-update to get the list, 2) perform the action, 3) do another check-update to see the list again. This means 3 seconds wasted PER REPO. That's a lot of seconds. I'm sure usability people have figures on how long is too long, but this surely must be too long.

Cons:

* Users will sometimes act on slightly out-of-date repodata when they use yum. However, this is already the case with mirrors not being in perfect sync all the time, and it should not matter for the average user.

All of this shouldn't be too big a hassle for you yum developers to implement, and I'm sure the benefits will outweigh the work of implementing it. And wouldn't it be great to brag that "yum processing time reduced to half" or something?
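Suggestion (a) above is essentially a time-based cache check on the stored repomd.xml. A minimal sketch of the idea, in Python since that is what yum is written in (all names here — `cached_copy_is_fresh`, `fetch_repomd`, the injected `download` callable, the 12-hour constant — are hypothetical, not actual yum code):

```python
import os
import time

# Hypothetical sketch of suggestion (a): skip re-downloading repomd.xml
# when the cached copy is younger than a configurable expiry window.
METADATA_EXPIRE = 12 * 60 * 60  # 12 hours: the "X" discussed above


def cached_copy_is_fresh(path, max_age=METADATA_EXPIRE):
    """True if `path` exists and was fetched less than max_age seconds ago."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:              # file missing: must download
        return False
    return age < max_age


def fetch_repomd(repo_id, cache_dir, download):
    """Fetch repomd.xml for repo_id only when the cached copy has expired.

    `download` is a callable performing the actual HTTP fetch; it is
    injected here so the sketch stays self-contained.
    """
    path = os.path.join(cache_dir, repo_id, "repomd.xml")
    if cached_copy_is_fresh(path):
        return path              # reuse cache: no HTTP round trip at all
    os.makedirs(os.path.dirname(path), exist_ok=True)
    download(path)               # one HTTP request, only when needed
    os.utime(path, None)         # record the fetch time for the next check
    return path
```

The proposed --force-update flag (suggestion c) would then simply bypass `cached_copy_is_fresh` and download unconditionally.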
Comment 1 Seth Vidal 2005-07-22 15:47:55 UTC
I think the recommended implementation would end up with extremely frustrated users. We'd be better off getting the last-changed info from the repository server and comparing that to the repomd.xml we have on disk - but even then we'll still need to contact the repository.
Comment 2 Sigge Kotliar 2005-07-22 21:18:47 UTC
Well, I can agree that it would be a slight frustration, but I think the current frustration of waiting ~5 seconds every time is worse. Apt (yes, the other packaging system, sorry for bringing it up) needs an apt-get update before any command to get the latest data. I'm not suggesting the same system, but frustrated users could do a "yum check-update" to get the absolutely latest packages. Also, as the system works now with mirrors, you can quite often get two different versions of the "yum check-update" list if you run it twice; I've had cases with three different versions on different mirrors. So the frustration you talk about is already there - now we'd at least lessen it. Just getting the last-changed header would be an improvement, I think, but an extremely small one: getting 1 kB of data takes virtually no time, even on a modem; it's the connecting part that takes time. Just my 2 cents, but I really believe this is the way to go.
Comment 3 Jeremy Katz 2005-09-21 18:06:29 UTC
The two-stage process used by apt is a good way of ensuring that users don't actually get updates (since they forget to do the first part, or other such things). Getting the repodata every time is really the only way to be assured of consistency.
Comment 4 Sigge Kotliar 2005-09-21 21:34:31 UTC
Jeremy: with all due respect, I don't think you read my suggestion properly. I'm not suggesting an apt-like two-stage process. What I'm suggesting is that if a repodata file was stored less than, say, an hour ago, yum doesn't fetch a new one. No users would miss any updates, because the sync difference between different mirrors is already bigger than this one-hour retention of repodata files would be. It would also help tremendously for everyone doing consecutive requests, i.e. "yum install app-a; yum install app-b" - why two separate connections are needed there, I don't see. With commands like yum search, yum info, and yum remove, not even one is needed unless the repodata file is terribly out of sync, since the names of the rpms on the server rarely change - only the versions. Please review this again; I really do believe in this.
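The point about search/info/remove can be made concrete: commands could be classified by whether fresh repodata is actually required, and everything else could run as if -C had been given. A minimal sketch (the command names come from this report; the classification itself and the function name are my assumptions, not yum's actual logic):

```python
# Hypothetical classification following suggestion (b): only commands that
# resolve against remote repos trigger a metadata refresh; the rest behave
# as if run with -C (cache-only).
NEEDS_FRESH_METADATA = {"install", "update", "check-update"}
WORKS_FROM_CACHE = {"remove", "info", "search", "list"}


def should_refresh(command, force_update=False):
    """Decide whether this invocation must contact the repo servers."""
    if force_update:                     # the proposed --force-update flag
        return True
    return command in NEEDS_FRESH_METADATA
```

Under this scheme "yum remove foo" never touches the network, while --force-update restores today's always-refresh behaviour for power users.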
Comment 5 Seth Vidal 2005-11-07 03:15:39 UTC
So we figured out a way to do this w/o going crazy. Refiling this as rawhide - it will hit rawhide in a future yum release. Not quite how you wanted it done, but I think it will cover the problem.
Comment 6 Sigge Kotliar 2005-11-08 14:23:20 UTC
Just downloaded and tried out yum-2.4.0-9. Is this the version you are talking about? Remove operations are now really speedy; on this point I'm really happy. But "yum install something; yum install somethingelse" still pings each repo twice - once per command. The same goes for search: on my system the difference was about 6 seconds between "yum search verylongquery" and "yum -C search verylongquery". With the total time spent being 20 or 26 seconds this may not seem like much, but for modem users, or people with bad connections, it would be even greater. Perhaps this is something that can be further improved with the new plugin architecture?
Comment 7 Seth Vidal 2005-11-08 19:06:07 UTC
The version in rawhide does not have that patch.
Comment 8 Sigge Kotliar 2005-11-08 23:37:20 UTC
oh. ok then =) Will wait for the next rawhide release of yum then, and get back to you then.
Comment 9 Seth Vidal 2005-11-08 23:38:33 UTC
wait until yum 2.4.1 hits rawhide.