Jeff Mitchell

Amarok, KDE, and all that good stuff

The Collection Scanner’s Ultimate Speed Bump

November 3, 2009 - 18:19

A couple of weeks ago I spent a long time looking at ways to increase scanning speed. Yes, again. I had written previously (here and here) on changes I'd made to increase the speed of scanning collections, and I'd made a lot of other changes that I didn't write about. These certainly helped in many situations, and made significant differences for particular users, but overall we still had a large speed bottleneck: the number of queries issued per track to the database.

This number was at least three. At the very minimum, assuming the artist, album, genre, composer, and year had been cached, you had a lookup for the uniqueid, a lookup to check whether the track was part of a compilation, and an insert. This number could be significantly higher in some situations, and these database queries were a large part of the reason that a scan might run, finish disk I/O, and then spend quite a long time eating up CPU time before it was done.

I looked at increasing the number of values I was pulling in using UNION queries, which was currently up to five values (depending on whether they were already cached). I also looked at prepared statements, which it turns out were not likely to give us a large speed bump and are an absolute and utter pain in the ass when using the C API as we are. (We use the C API for historical reasons; also, the official C++ API is not easily available in many distributions, and there are many unofficial contenders.)

Finally, I looked at ripping the guts out of the scan result processor and doing what several of us devs had decided would probably be the best way to improve speed, but was at the same time the most challenging: replacing the SQL statements with local memory storage, populating these structures at the beginning of a scan and re-populating the database at the end. It wasn't easy. I selected data structures very carefully for speed, looking at how often each query would be run and the time complexities of insertion and lookup for various types of structures. In addition, because of a few complex queries using joins, the behavior we needed could only be emulated by using very complex data structures (one of them is a QHash<QPair<int, QString>, QStringList*>; another is a QHash<QString, QLinkedList<QStringList*> *>). I also wanted to minimize memory usage. I did some rough back-of-the-envelope calculations which indicated that with my design, for the vast, vast majority of collections, memory usage during the scan was likely to go up by only a couple of megabytes at the very most, which would be reclaimed at the end.
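To make the idea concrete, here is a minimal sketch of the approach in Python rather than the C++/Qt of the real code. It is not Amarok's implementation: the class name, the row layout (album_id, name, artist_id), and the method names are all hypothetical. It only mirrors the shape of the two structures mentioned above, one hash keyed by an (int, string) pair and one hash from a string to a list of rows.

```python
from collections import defaultdict


class ScanCache:
    """In-memory stand-in for per-track SQL lookups during a scan (hypothetical)."""

    def __init__(self, album_rows):
        # album_rows: (album_id, name, artist_id) tuples read from the
        # database once, before the scan starts.
        self.albums_by_key = {}                  # (artist_id, name) -> row
        self.albums_by_name = defaultdict(list)  # name -> [rows]
        self.next_album_id = 1
        for row in album_rows:
            self._index(row)

    def _index(self, row):
        album_id, name, artist_id = row
        self.albums_by_key[(artist_id, name)] = row
        self.albums_by_name[name].append(row)
        self.next_album_id = max(self.next_album_id, album_id + 1)

    def album_id(self, artist_id, name):
        """Return the id for (artist, album), creating the row in memory if needed."""
        row = self.albums_by_key.get((artist_id, name))
        if row is None:
            row = (self.next_album_id, name, artist_id)
            self._index(row)
        return row[0]
```

Both lookups are average-case constant time, which is the property that matters when they replace a query issued once per track. New rows live only in memory until the end of the scan, when they would be written back to the database.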

In addition, I implemented logic to drastically decrease the total queries needed for insertion. Since I'm constructing the final queries from the data structures, I could easily insert multiple values at once. So now the database is queried at the start to see the maximum query size it will accept; then values for insertion are appended to the query up until this value is hit, at which point the query is run and a new query begun. By default I think this is one megabyte for most installations, which means that instead of possibly thousands upon thousands of insertion queries (depending on the size of your collection), you'd have to have an absolutely extraordinarily large collection to have more than 25 or so queries for the entire scan. Each query incurs a round-trip delay sending data to and receiving responses from the database, and the database must also parse each query every time it's run (reducing this is the purpose of prepared statements), so this makes a drastic difference. To put it another way, the number of queries per track has gone from a minimum of three to an asymptotic value of zero.
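The batching logic described above can be sketched roughly as follows. This is an illustration in Python, not the scanner's actual code; the function name and its parameters are made up for the example, and the rows are assumed to be already-escaped SQL value tuples. In MySQL the size limit would presumably come from the max_allowed_packet variable.

```python
def batch_inserts(table, columns, rows, max_query_bytes):
    """Yield multi-row INSERT statements, each at most max_query_bytes long.

    rows must already be escaped SQL value tuples such as "(1, 'a')".
    A single row that is larger than the limit is still emitted on its own.
    """
    prefix = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
    query = prefix
    for row in rows:
        candidate_len = len((query + ", " + row).encode("utf-8"))
        if query != prefix and candidate_len > max_query_bytes:
            yield query  # limit reached: send this one, start another
            query = prefix
        query += row if query == prefix else ", " + row
    if query != prefix:
        yield query  # flush whatever is left at the end of the scan
```

Each yielded string would be sent to the database as one query, so the round trip and the parse are paid once per batch instead of once per track.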

Overall, the work seems to have paid off rather nicely. Benchmark reports that I got back indicated anywhere from 30% to 300% gains in the total time for a scan, depending on size of collection and your particular I/O throughput/CPU/etc.

I call this the Ultimate Speed Bump in the title because there is far less that can be done from this point on to increase speed, at least as far as I can currently tell -- this was the Big Mama, the elephant in the room. But hopefully it will increase scan processing enough that further increases won't really be so much of a necessity anymore.

One other data point: the database is now case-sensitive. This has been a very long-standing feature request (we wanted it too), and thankfully it wasn't much work to implement it and change the appropriate points in the rest of the code.

(By the by, this will all be in 2.2.1 -- enjoy!)

Categories: Planet Amarok

Award-winning professor Philip Bourne to speak at Camp KDE

October 26, 2009 - 20:04

Philip Bourne, the 2009 Benjamin Franklin Award-winning computational biologist, will be speaking at Camp KDE 2010.

Professor Bourne is well-known in his field for contributions to open-source bioinformatics software and is a leading advocate of open access to data. Quoting from the UCSD News Center:

Bourne is co-founder of SciVee, the Web 2.0 resource dedicated to the dissemination of scientific research and science-specific research networking. Launched in late 2007 as a collaboration between the National Science Foundation and the San Diego Supercomputer Center (SDSC) at UC San Diego, SciVee has been used by hundreds of thousands of students and professional scientists as a means of learning and sharing their research through online science videos that supplement peer-reviewed journal articles, stimulate discussion, and promote collaboration. SciVee earlier this month announced a number of significant upgrades to its site, along with the addition of 32 new science categories.

Open data access and open source share similar goals, and Professor Bourne's discussion of his experiences with open data access should prove informative and interesting to all.

Categories: Planet Amarok

Speed never gets old. At least, in software.

October 15, 2009 - 03:17

As my regular readers -- such as they may be, considering how rarely I post -- may know, I've been doing all sorts of things to speed up scanning your files into Amarok's local collection. I have some really nice news on that front: a few goodies that will be very useful to you, especially if you have very flat collections. These improvements are in the forthcoming 2.2.1.

First up, and the lesser of the two: If a directory is encountered multiple times during a scan, the scanner will now only scan it once. There are a few corner cases where this could happen (for instance, if you have a top-level directory specified as well as one of its subdirectories, and the subdirectory's name changes, changing both its and its parent's mtime). Not too useful for most cases, but it was implemented while working on...

Second, and the biggie: When doing an incremental recursive scan, Amarok will now no longer scan subdirectories whose mtimes have not changed. This is big news if you have a large, flat collection (which I dislike myself but some really enjoy...to each their own). For instance, in a setup like 3,000 files in your main folder, plus a few thousand files each in a few subfolders for a total of 10,000 tracks, if you added a single file to the top-level folder and had incremental recursive scanning turned on (the default), you'd cause a rescan of your entire music collection -- all 10,000 tracks or so. Now, if you add a single file to the top-level folder, you'll only cause a rescan of 3,000 tracks...which is still a lot, but even more of a reason to use a hierarchy.

This should help scanning time even for those with hierarchies, as adding a new album to an artist won't cause the artist's other album folders to be rescanned, or adding a new artist to a genre won't cause all the artists and albums in that genre to rescan.
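Both changes can be illustrated with a short sketch. It is written in Python and is not the collection scanner's real code; the function name and the known_mtimes mapping (directory path to the mtime recorded by the previous scan) are assumptions made for the example.

```python
import os


def dirs_to_rescan(roots, known_mtimes):
    """Return directories whose mtime differs from the one stored last scan.

    known_mtimes maps a resolved directory path to its previously recorded mtime.
    """
    changed, seen = [], set()
    for root in roots:
        for dirpath, _dirnames, _filenames in os.walk(root):
            real = os.path.realpath(dirpath)
            if real in seen:  # reached via another root: handle it only once
                continue
            seen.add(real)
            # Each directory is judged on its own mtime, so a changed parent
            # no longer drags its unchanged subdirectories into the rescan.
            if known_mtimes.get(real) != os.stat(real).st_mtime:
                changed.append(real)
    return changed
```

The walk still descends into unchanged directories, because a directory's mtime only reflects entries added, removed, or renamed directly inside it; a deeper subdirectory can change without its ancestors noticing.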

It's really the proper way to do things, but it required some changes to the data sent between Amarok proper and the collection scanner, so it had to be done carefully (as far as I am currently aware I didn't cause any regressions). Regardless, I'm glad to have it in there and working, and hopefully you will be too.

Categories: Planet Amarok

Say goodbye to history

September 30, 2009 - 00:11

And by that, I mean say goodbye to an historical (read: old) bug.

I'd heard these whispers recently about "whenever I add or remove an album from the collection and do an incremental update, my collection gets messed up." Okay -- we've heard these whispers for a long time, but there were so many other things that had to be fixed first that it wasn't clear whether this was a symptom of a different problem or a problem of its own...and it could always be fixed with a full rescan.

However, with the collection being much more solid these days and with this being the only super-visible bug left that I knew of, I decided to tackle this. It turns out that, once the rest of the collection was behaving, this wasn't that hard to find...it just took a lot of debug tracing, because it wasn't obvious. Since the cause was the wrong field being used in a DB query to remove some tracks during an incremental scan, with anything other than a very tiny test collection it could quickly start to make no logical sense what was happening.

The reason I say the bug is historical is that the code causing it has been there since May 2008. While admitting that provides some fodder for naysayers and haters to harp on Amarok for such visible and longstanding bugs, I prefer to take the other approach: it means that all the various issues users have found since the total rewrite that was 2.0 *are* being found and *are* being solved. Some of them just take some time; it's a *lot* of code. But the proof is in the 2.2 ChangeLog.

I managed to sneak in the fix a day before 2.2 was tagged, so when Amarok 2.2 comes out scanning (both full and incremental) should be really quite solid. In fact, since this has been fixed in git, I've yet to hear of any more problems, just lots of happy users. It's just another way that the (very close!) 2.2 is going to *rock*.

Categories: Planet Amarok

AFT and MusicBrainz track identifiers, redux

September 24, 2009 - 13:08

A bit ago I blogged about how Amarok File Tracking can now use MusicBrainz identifiers to do its stuff.

Then, a little while later, I started getting bug reports of people's music disappearing from their collection, and requested that some of the reporters send me some files. One of the users did so, and I found something curious in his tags (if I had a penny for every time I've personally seen users have something odd or strange in their tags, I'd have...well, a few dollars at least). Several of his files had full MusicBrainz tags -- with absolutely no data populating them, meaning that the MusicBrainz identifier (and all other MB data) for all of those files was ending up the same (blank) and Amarok was treating them as the same file.

It was a quick fix (use generated non-embedded AFT IDs when the MB tags are empty) but it just adds to the evidence that you can never, ever trust users' tags. Also, users who run your Git-based version or betas really rock for finding this stuff before release...so in case I don't say this enough: thanks, users!
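The shape of that fix can be shown in a few lines. This is a hedged Python sketch, not the AFT code itself: the function name, the "mb-" and "gen-" prefixes, and the use of an MD5 hash over the file's leading bytes as the generated ID are all assumptions for illustration.

```python
import hashlib


def unique_id(mb_track_id, file_bytes):
    """Pick a tracking ID, ignoring MusicBrainz tags that are present but empty."""
    mb = (mb_track_id or "").strip()
    if mb:
        return "mb-" + mb
    # Hypothetical generated ID: hash the file's content so that files with
    # blank tags do not all end up sharing the same (empty) identifier.
    return "gen-" + hashlib.md5(file_bytes).hexdigest()
```

The key point is the emptiness check: a tag that exists but holds no data has to be treated the same as a missing tag.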

Categories: Planet Amarok

Camp KDE 2010 Announced!

August 7, 2009 - 12:18

I'm pleased as punch/as a fat cat/etc. to point you to The Dot (http://dot.kde.org), specifically http://dot.kde.org/content/announcing-camp-kde-2010, to see the official announcement and some details. More details will be forthcoming soon (and especially as we get the web site in order). Start clearing your schedule and working on your presentations!

Categories: Planet Amarok

AFT and MusicBrainz track identifiers

August 7, 2009 - 12:15

A heads-up: Amarok File Tracking can now use MusicBrainz track identifiers for its embedded IDs. This means people that have used Picard to tag their files but not amarok_afttagger can still get some embedded AFT goodness! It also enables an interesting "mode" because it essentially enables song tracking vs. actual file tracking (which you may or may not want, depending on your particular needs).

Full details are on the Amarok File Tracking wiki page (/wiki/Amarok_File_Tracking#Using_MusicBrainz_identifiers).

Categories: Planet Amarok

Presenting the KDE network on Facebook

July 22, 2009 - 00:45

Many KDE developers are on Facebook. A while back I wondered if it would be possible to have an official KDE developers' network on Facebook -- after all, there are networks for schools, jobs, cities, and more (and for many developers, KDE is literally or figuratively a job...)

As it turned out, there was a "Kde" network -- but something was odd. To join a work network you have to have an email address affiliated with the network. KDE owns kde.com and kde.org -- so who was this? The only other "KDE" I could find that seemed like it would be legit was the Kentucky Department of Education, and I rather doubted it was them, because they would likely have used all-uppercase KDE as well. So I started an inquiry with Facebook, trying to figure out if either it was someone squatting on our name (and trademark) or whether it was some legit organization -- in which case, would they mind donating the network to us?

After several months of back-and-forth with the people at Facebook, who were very nice (if a bit slow :-) ), I'm happy to say that we've regained the KDE network (properly capitalized) as our own. I still don't know the whole story as to who was there before, and never will due to their privacy policies, but I'll say this:

  • If you were in the "Kde" network before and Facebook asked if you would mind donating it to us, and you did, thanks so much!
  • If someone was simply squatting in the "Kde" network before, then thanks, Facebook, for kicking them out!

To join the network, go to Settings -> Networks, and enter KDE and your kde.org email address in the appropriate fields.

Categories: Planet Amarok

DB changes — call for benchmarkers!

July 18, 2009 - 03:06

I've done some work in trunk over the past week that may have a huge impact on many of you Amarokers. Read on, and if you can do some benchmarks for me, fantastic.

First, the schema/table changes.

  1. We've seen some issues where people have, for whatever reason, ended up with InnoDB tables instead of MyISAM tables. This is probably the result of their DB being created long ago before we were explicitly telling the mysqle startup to skip InnoDB. This mainly causes a problem because some columns cannot be as wide as we'd like them to be when using InnoDB. So, the first thing being done is that an ALTER TABLE is being forced on every table to explicitly convert to MyISAM. In addition, ENGINE parameters are now used during table creation to be more explicit in the future.
  2. Some of you might have seen complaints in the debug output about indexes not being able to be created due to a max key length, which by default in MySQL is 1000 bytes (compile-time option). So, some columns have had their widths adjusted so that all indexes are now successfully created.
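As a rough illustration of both schema changes, the sketch below builds the engine-conversion statements and checks whether an index fits under the key limit. It is Python, not Amarok code; the function names and the table names used when calling it are hypothetical, and the three-bytes-per-character figure assumes utf8 columns.

```python
MAX_KEY_BYTES = 1000  # default key length limit cited above
BYTES_PER_CHAR = 3    # assumption: utf8 columns, up to 3 bytes per character


def convert_statements(tables):
    """Build one ALTER TABLE per table to force the MyISAM engine."""
    return [f"ALTER TABLE {t} ENGINE = MyISAM;" for t in tables]


def index_fits(column_widths):
    """True if an index over columns of these character widths fits the limit."""
    return sum(w * BYTES_PER_CHAR for w in column_widths) <= MAX_KEY_BYTES
```

Under those assumptions a single indexed column can be at most 333 characters wide, and an index over two 255-character columns is far too large, which is the kind of case that calls for narrowing columns.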

Now, the other changes:

As we added more features, scanning got slow. Like, really slow. You'd spend more time running SQL queries than actually scanning your files. So I've been aiming to change that.

Over the past week I've committed changes that remove, per track, anywhere from 1 to 6 SQL queries. The exact number is highly dependent on your file set, but there is a minimum of one fewer SQL query per track. If you've done a lot of file moves and AFT kicks in, it'll be an even more massive speedup. I'm going to try to do some further tuning, but already results are looking positive.

Nikolaj has reported that his scan time went from 68 seconds to 18 seconds -- more than 3x faster. Mikko didn't notice a speedup, but he said that whereas scanning used to peg his CPU at 100%, it no longer does so. What I want to know is: how does this affect *you*?

If you want to help, do the following:

  1. Back up your DB. If you're using external MySQL, do a mysqldump; if you're using internal MySQLe, back up the mysqle folder in the Amarok data directory.
  2. Update to a revision from a week ago...say, 995000.
  3. Wipe your DB.
  4. Start Amarok -- it will do a full scan because of the empty DB. Time it as it does the scan.
  5. Repeat steps 3 & 4, so that you can see what the time is like after caching.
  6. Update to current trunk (at least 998470).
  7. Repeat step 3.
  8. Repeat steps 4 and 5.

Then leave a reply here with your values. If you watch your CPU during each of the scans, report that here too. Thanks!

Categories: Planet Amarok
