Archive for the 'wiki' Category

Mobile gateway search

Saturday, May 24th, 2008

So it turns out that the search function on Wikipedia’s HawHaw-powered mobile gateway hasn’t been working for a long time, not because it wasn’t implemented, but because it was screen-scraping the search results page.

Some little detail of the results layout changed ages ago, breaking it. Nice! Well, I’ve redone it to use the MediaWiki web service API which should be a little more stable.

Search works again, yay!

Even if the correct search result is fifth in the output *cough* :)

Hey, we’re workin’ on it. ;)
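For anyone wondering what "use the API instead of scraping" means in practice, here is a minimal Python sketch. It is not the gateway's actual code: the endpoint URL and the function names are assumptions for illustration, and it only builds a `list=search` query and pulls titles out of the JSON reply.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for illustration; any MediaWiki api.php would do.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(term, limit=10):
    """Build a MediaWiki API full-text search URL (action=query, list=search)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_titles(response_text):
    """Return page titles from a list=search JSON response, in ranked order."""
    data = json.loads(response_text)
    hits = data.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits]
```

Because the reply is structured data rather than page markup, a change to the results page layout no longer breaks the parser.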

More CentralAuth comin’ Tuesday

Saturday, May 24th, 2008

Hey, just to give y’all a heads-up… after a couple months of good testing w/ the sysops & power users, we’re going to widen the CentralAuth rollout to allow everybody on Wikimedia sites to opt-in to the system.

We’re going to keep automatic migration off for now to keep the volume down, as we may want to roll out more helper tools in response to new issues people might have.

UTF-8 support in Firefox 3 location bar

Friday, May 23rd, 2008

I don’t usually repost other blogs, but this is a big usability help for our non-Latin wikis… Firefox 3 is joining Safari and Opera 9 in displaying human-legible Unicode URLs in the location bar.

Woohoo!

RecentChangesCamp!

Friday, May 9th, 2008

About to head out to RecentChangesCamp 2008 in Palo Alto, CA… see y’all there!

Diff bug fixed, hopefully

Saturday, April 26th, 2008

For a long time we’ve had intermittent problems with diffs displaying incorrectly, with lines on the left side mysteriously repeated.

Reports skyrocketed the other day, when the wikidiff2 extension (our C++ reimplementation of MediaWiki’s diff algorithm, about a billion times faster than the PHP one) was upgraded to match upgrades of PHP on our older, Fedora Core-based servers.

I added in some logging hacks to try to track it down, but didn’t get a lot of data points until I tried the simple expedient of running every diff twice — if the results don’t match, log the error.
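The run-it-twice trick is easy to sketch in Python. This is only an illustration of the idea, not the hack that went into production: `diff_fn` stands in for whatever diff engine is under suspicion, and the function and logger names are made up.

```python
import logging

log = logging.getLogger("diffcheck")


def checked_diff(diff_fn, old_text, new_text, report=log.error):
    """Run the same diff twice and report if the two results disagree.

    A deterministic diff must return identical output for identical input,
    so any mismatch points at the runtime environment rather than the data.
    """
    first = diff_fn(old_text, new_text)
    second = diff_fn(old_text, new_text)
    if first != second:
        report("diff mismatch: old=%d chars, new=%d chars",
               len(old_text), len(new_text))
    return first
```

It doubles the cost of every diff, so it is a temporary diagnostic rather than something to leave switched on.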

With a few hundred instances logged, it became clear that the problem was limited to servers running Fedora 4; even-older Fedora 3 boxes were unaffected, as were all our newer Ubuntu boxes. Mysterious problems caused by C++ run-time library mismatches between different Linux releases are not at all uncommon; it looked like we’d installed an FC3 binary on all the machines, and it was intermittently failing on FC4.

I recompiled the extension, this time with separate builds on FC3 and FC4, and haven’t seen any bad diffs come through my log in the last half hour… so far so good! :)

So what’s in the job queue anyway?

Tuesday, April 22nd, 2008

In en.wikipedia.org’s job queue at the moment, breakdown by job type…

job_cmd            count(*)
htmlCacheUpdate      31,147
refreshLinks     10,106,739
renameUser              119

Note that the current system allows for duplicate entries to get put in the queue; the dupes are removed as the first one in the stack gets run. This makes the raw number of refreshLinks entries much higher than it “really” is — Talk:Union Station (Louisville) is listed 9 times, presumably once for each template edit that triggered an “update me!” job.
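To see how much the duplicates inflate the numbers, one could count raw entries against distinct (job type, page) pairs. A small Python sketch, assuming the queue has been dumped as a list of `(job_cmd, title)` tuples (that shape and the function name are assumptions here):

```python
from collections import Counter


def summarize_queue(jobs):
    """Return (raw, unique) per-type counts for (job_cmd, title) pairs.

    raw counts every queued entry; unique counts each page once per job type,
    which is closer to the amount of work that will actually be done.
    """
    jobs = list(jobs)
    raw = Counter(cmd for cmd, _title in jobs)
    unique = Counter(cmd for cmd, _title in set(jobs))
    return raw, unique
```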

Update: Figured out why the queues were growing so big the last few days: the system clock was 7 seconds slow on the database master. This made the replication lag detection misread a 7-second minimum lag on every slave. The job queue batch runners were all sitting waiting for the lag to resolve. :)

Resynced the clock (it presumably drifted during the period when some IPs were broken), and things are moving again.
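A toy model of the failure, in Python. It is deliberately simplified and is not how the real lag check is implemented: it assumes lag is estimated as "slave's clock minus the master's timestamp on the last replicated event", and the 5-second threshold is invented for the example.

```python
def apparent_lag(slave_clock, last_master_timestamp):
    """Estimate replication lag by comparing clocks on two different hosts."""
    return max(0.0, slave_clock - last_master_timestamp)


def should_wait(lag, max_lag=5.0):
    """Batch runners hold off while the slaves look too far behind."""
    return lag > max_lag


# The slave applies an event instantly, but the master's clock is 7 s slow,
# so the event carries a timestamp 7 s in the past.
lag = apparent_lag(slave_clock=1000.0, last_master_timestamp=993.0)
print(lag, should_wait(lag))
```

With a fully caught-up slave the estimate never drops below the clock offset, so anything gated on "lag is below the threshold" waits indefinitely.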

Suggestion search drop-down

Monday, April 21st, 2008

Another in today’s series of fun feature enablings…

The search boxes on Wikimedia wikis now have an AJAX-powered search suggestion drop-down. This calls our JSON OpenSearch suggestion interface, which has been used for some time by Firefox’s search box and Mac OS X 10.5’s Dictionary application, but is now built-in for your viewing pleasure.

(In MediaWiki 1.13 development trunk, turn on $wgEnableMWSuggest to experience this yourself!)
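For client authors: an OpenSearch suggestions reply is a JSON array whose first element is the query and whose second is the list of completions. A minimal Python parser might look like this (the function name is made up, and any optional trailing elements of the reply are simply ignored):

```python
import json


def parse_suggestions(response_text):
    """Split an OpenSearch suggestions reply into (query, completions)."""
    payload = json.loads(response_text)
    query, completions = payload[0], payload[1]
    return query, list(completions)
```

For example, `parse_suggestions('["uni", ["Union Station", "Unicode"]]')` gives back the query `"uni"` and the two completions.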

A similar AJAX-powered search feature has been in MediaWiki for some time, but the user interface for it took over the whole article area, which was a bit distracting, and we never used it ourselves.

Robert Stojnic, the tireless coder who’s put a huge amount of effort into fixing up our Lucene-based search engine over the last months, patched up the front-end to fit more naturally into the existing forms.

The built-in search for suggestions is currently a simple prefix match, so it’ll help you complete words and names, but isn’t smart enough to fill out from a last name or skip “the” etc. Robert’s got a new backend in the works, which will add all those smarts when we’re ready to upgrade the search systems with the new software and a bit beefier hardware.

Prefix matches are a heck of a lot better than nothing, though, and as long as it’s not causing undue server load we’ll keep it on until the new backend’s ready.
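To make the limitation concrete, here is a sketch of a plain prefix match over a sorted title list in Python. It is an illustration only, not the Lucene-based backend: it is case-sensitive, uses a binary search to find the first candidate, and all names are invented.

```python
import bisect


def prefix_matches(sorted_titles, prefix, limit=10):
    """Return up to `limit` titles that start with `prefix`.

    `sorted_titles` must be sorted; all matches then sit in one contiguous
    run beginning at the position where `prefix` would be inserted.
    """
    start = bisect.bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break
        matches.append(title)
    return matches


titles = sorted(["Abraham Lincoln", "Lincoln Memorial", "The Beatles"])
print(prefix_matches(titles, "Linc"))     # finds "Lincoln Memorial" only
print(prefix_matches(titles, "Beatles"))  # finds nothing: no skipping "The"
```

Typing a last name never reaches "Abraham Lincoln", and "Beatles" never reaches "The Beatles", which is exactly the gap a smarter backend would close.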

(If you don’t like the suggestions widget, you can disable them by checking “Disable AJAX suggestions” in the “Search” tab at Special:Preferences.)

HttpOnly cookies

Monday, April 21st, 2008

Thanks to Werdna’s implementation of support, and Tim’s mass upgrade of our older PHP installations, I’ve today enabled the use of HttpOnly cookies on the Wikimedia wikis for our login session data.

“What’s that,” I hear you say, “and why do I want it?”

The HttpOnly marker on cookies tells a supporting browser that the cookie will only be used directly by the web server (sent only with the HTTP requests for each page), so it will hide the cookie from any JavaScript client code which asks for it.

This provides protection against certain kinds of security vulnerabilities — namely, XSS attacks which steal authenticated session and long-term login token cookies.
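On the wire, HttpOnly is just an extra attribute on the Set-Cookie header. A short sketch using Python's standard `http.cookies` module shows the shape of it; the cookie name and value are placeholders, and this is not MediaWiki's own cookie code.

```python
from http.cookies import SimpleCookie


def session_cookie_header(name, value, http_only=True):
    """Build a Set-Cookie header line, optionally flagged HttpOnly."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    if http_only:
        # Asks supporting browsers to keep the cookie out of document.cookie.
        cookie[name]["httponly"] = True
    return cookie.output(header="Set-Cookie:")


print(session_cookie_header("example_session", "abc123"))
```

The browser still sends the cookie with every matching request; the only difference is that page scripts can no longer read it.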

HttpOnly doesn’t fix XSS, not by a long shot, but it does reduce what an attacker can do; particularly nice when we’re soon going to start using global login cookies which will allow a unified account to continue a login session across multiple wikis on different domains.

The same origin policy prevents JavaScript on one subdomain from directly accessing another domain. Keeping the cross-domain session cookies away from compromised JavaScript will help prevent a hypothetical attack on one domain from jumping to other subdomains without the vulnerability.

Unfortunately, this marker isn’t standard; it’s an extension which Microsoft added for Internet Explorer in 6.0 SP1, but support has been slowly creeping into other browsers, finally hitting Firefox somewhere in the 2.0 patch cycle while nobody was looking.

Browsers I tested that currently support HttpOnly cookies:

  • IE/Win 6 SP1 or 7
  • Firefox 2.0.0.5 or later
  • Opera 9.50 beta
  • Konqueror (3.4?)

Other browsers will still expose the cookies to JavaScript, as they always have:

  • Safari 3.1
  • Opera 9.27 (current non-Beta release)
  • Old scary browsers like IE for Mac and Netscape 4 ;)

There’s a rumor that some versions of WebTV fail altogether when the cookies are marked this way, but I have no way to confirm or deny that yet.

Update 2008-05-01: Mac IE turns out to eat HttpOnly cookies…. sometimes… when the moon is just right. :) Added a browser blacklist, so we feed Mac IE regular cookies. Other browsers are still given the benefit of the doubt.
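A User-Agent blacklist of this kind can be sketched in a few lines of Python. The pattern below is an assumption chosen to match typical Mac IE user-agent strings, not a copy of the deployed list, and the names are hypothetical.

```python
import re

# Assumed pattern for illustration: Mac IE identifies itself with an MSIE
# version followed by "Mac_PowerPC".
HTTP_ONLY_BLACKLIST = [
    re.compile(r"MSIE \d+\.\d+; Mac_PowerPC"),
]


def supports_http_only(user_agent):
    """Return False for browsers known to mishandle HttpOnly cookies.

    Unknown browsers get the benefit of the doubt and receive the flag.
    """
    return not any(p.search(user_agent) for p in HTTP_ONLY_BLACKLIST)
```

A blacklist (rather than a whitelist) matches the approach described above: only known-bad browsers are downgraded to regular cookies.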

SUL status update…

Wednesday, April 16th, 2008

Status update…

CentralAuth global logins are still restricted to the sysop beta, but Werdna and Tim have been doing some good work on cleaning things up…

  • Tim’s done a lot of code refactoring to clean up User object behavior
  • Werdna’s added support for global sessions based on Tim’s suggested model. Tim and I have helped with some cleanup on it…
  • I put together a threat assessment of the security impact of global session cookies and some mitigation strategies
  • One of my suggestions was to use HttpOnly mode for session and token cookies, where browsers support them. This will largely block XSS attacks from jumping between subdomains or stealing cookies for reuse by an attacker. Werdna’s added support for HttpOnly cookies under PHP 5.2; currently we can’t deploy this until we finish upgrading some of our machines.
  • I’ve enabled global sessions on secure.wikimedia.org, where there’s a single domain and few other services to increase the attack surface. It _seems_ to mostly work so far. ;)

    Logging out doesn’t quite clear all sessions correctly yet, but so far so good. :)

HTML to PDF, why so hard?

Thursday, March 27th, 2008

I’ve been testing out MediaWiki PDF export using PediaPress’s mwlib & mwlib.rl. This system uses a custom MediaWiki parser written in Python, which then calls out to a PDF generator library to assemble a pretty, printable PDF output file.

The PediaPress folks are responsive to bug reports, but in the long run I worry that this would be a difficult system to maintain. The alternate parser/renderer needs to reimplement not only MediaWiki’s core markup syntax, but support for every current and future parser or media format extension we roll out into production usage.

Something based on the XHTML we already generate would be the most future-proof export system. This could of course be HTML that’s geared specifically for print, say by including higher-resolution images and making use of vector versions of math and SVG more readily, among other things.

Ideally, we’d be able to use common open-source browser engines like Gecko or WebKit for this — engines we already know render our sites pretty well. Unfortunately there doesn’t yet seem to be a standard kit for using them to do headless print export.

I did some scouring around and found a few other HTML-to-PDF options, starting with those used by some MediaWiki extensions…

HTMLDoc

  • GPL/commercial dual-licence; C
  • Used by Pdf Book and Pdf Export extensions.
  • Seems to have absolutely ancient HTML support… no style sheets, no Asian text, etc…
  • Verdict: NO

dompdf

  • LGPL; PHP
  • Used by Pdf Export Dompdf extension.
  • DOM-based HTML & CSS to PDF converter written in PHP… Sounds relatively cute, but development seems to have fallen off in 2006 and support remains incomplete.
  • Verdict: NO

Googling about I stumbled upon some other fun…

Dynalivery Gecko

  • Commercial? Demo?
  • Online demo of an actual use of Gecko as an HTML-to-PDF print server! Seems to be some commercial thing, and the output quality indicates it’s a very old Gecko, with lots of printing bugs.
  • Neat to see it, though!
  • Verdict: NO

PrinceXML

  • Proprietary; server license $3800
  • Great quality and flexibility; this would be a great choice in the commercial world. :) They have some Wikipedia samples done with a custom two-column stylesheet which are quite attractive.
  • Not being open source, alas, is a killer here.
  • Verdict: NO

CSSToXSLFO

  • Public domain; Java
  • Converts XHTML+CSS2 to XSL-FO, which can then be rendered out to PDF using more open-source components. Seems under active development, last release in December 2007.
  • Might be pretty nice, but my last experience playing with XSL-FO via Apache FOP in 2005 or so was very painful, with lots of unsupported layout features.
  • Verdict: try me and see

I love Wikipedia!