<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Wiki data dumps</title>
	<atom:link href="http://leuksman.com/log/2007/10/02/wiki-data-dumps/feed/" rel="self" type="application/rss+xml" />
	<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/</link>
	<description>reticula, electronica, &#38; oddities</description>
	<pubDate>Thu, 20 Nov 2008 23:29:33 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.3</generator>
		<item>
		<title>By: brion</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3944</link>
		<dc:creator>brion</dc:creator>
		<pubDate>Sun, 14 Oct 2007 22:04:21 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3944</guid>
		<description>I've made another post with some notes on how to do &lt;a href="http://leuksman.com/log/2007/10/14/incremental-dumps/" rel="nofollow"&gt;incremental dump generation&lt;/a&gt;...</description>
		<content:encoded><![CDATA[<p>I&#8217;ve made another post with some notes on how to do <a href="http://leuksman.com/log/2007/10/14/incremental-dumps/" rel="nofollow">incremental dump generation</a>&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Felipe Ortega</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3850</link>
		<dc:creator>Felipe Ortega</dc:creator>
		<pubDate>Thu, 11 Oct 2007 09:58:40 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3850</guid>
		<description>I subscribe to all your improvement ideas, Brion, as well as the ideas proposed by Erik.

In case this is useful for you, I am currently finishing a new version of my Python parser. I tested 7zip decompression several times, piped to the parser and then piped to MySQL. I can confirm that 7zip parallelizes decompression much better (I use 2 AMD Santa Rosa 2 GHz, 4 cores in total, 3Ware RAID-6 with 8 Seagate fast SATA-II disks).

Big Wikipedia dumps crash (no matter the language edition) around the 4,500,000 revision mark when piped, and exception control is required (so I have now given up on pipes, and use a library to connect directly to the DB). Therefore, chunks will be more than welcome. Incremental back-ups will be far more flexible than the current solution.

I already take advantage of extended inserts in the parser, to speed up the recovery process even more.</description>
		<content:encoded><![CDATA[<p>I subscribe to all your improvement ideas, Brion, as well as the ideas proposed by Erik.</p>
<p>In case this is useful for you, I am currently finishing a new version of my Python parser. I tested 7zip decompression several times, piped to the parser and then piped to MySQL. I can confirm that 7zip parallelizes decompression much better (I use 2 AMD Santa Rosa 2 GHz, 4 cores in total, 3Ware RAID-6 with 8 Seagate fast SATA-II disks).</p>
<p>Big Wikipedia dumps crash (no matter the language edition) around the 4,500,000 revision mark when piped, and exception control is required (so I have now given up on pipes, and use a library to connect directly to the DB). Therefore, chunks will be more than welcome. Incremental back-ups will be far more flexible than the current solution.</p>
<p>I already take advantage of extended inserts in the parser, to speed up the recovery process even more.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Milos Rancic</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3846</link>
		<dc:creator>Milos Rancic</dc:creator>
		<pubDate>Thu, 11 Oct 2007 03:28:35 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3846</guid>
		<description>Also, note that an uncompressed file bigger than 10GB is not so useful.</description>
		<content:encoded><![CDATA[<p>Also, note that an uncompressed file bigger than 10GB is not so useful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Milos Rancic</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3845</link>
		<dc:creator>Milos Rancic</dc:creator>
		<pubDate>Thu, 11 Oct 2007 03:21:57 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3845</guid>
		<description>Brion, I would be able to analyze data from smaller parts:

- wget part1
- analyze part1
- wget part2
- analyze part2
- ...
- merge data

And sorry for making you wait for a response.</description>
		<content:encoded><![CDATA[<p>Brion, I would be able to analyze data from smaller parts:</p>
<p>- wget part1<br />
- analyze part1<br />
- wget part2<br />
- analyze part2<br />
- &#8230;<br />
- merge data</p>
<p>And sorry for making you wait for a response.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Luca de Alfaro</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3578</link>
		<dc:creator>Luca de Alfaro</dc:creator>
		<pubDate>Thu, 04 Oct 2007 00:47:40 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3578</guid>
		<description>I find smaller, piecewise dumps to be a necessity for my work -- so that if I run some code on a dump, and the code breaks, I don't have to re-start from scratch. 
In fact, I wrote some code to take a single dump and break it into smaller dumps -- go to http://trust.cse.ucsc.edu/ and follow the link to Code. 

From what Brion told me in another discussion, much of the time required for a dump is actually taken up by the compression phase.  The actual dump takes only a couple of days.  
Since the Feb 6 07 dump already has been compressed, I think it would not be necessary to produce a single compressed dump after that -- monthly compressed incremental dumps are all that is required.</description>
		<content:encoded><![CDATA[<p>I find smaller, piecewise dumps to be a necessity for my work &#8212; so that if I run some code on a dump, and the code breaks, I don&#8217;t have to re-start from scratch.<br />
In fact, I wrote some code to take a single dump and break it into smaller dumps &#8212; go to <a href="http://trust.cse.ucsc.edu/" rel="nofollow">http://trust.cse.ucsc.edu/</a> and follow the link to Code. </p>
<p>From what Brion told me in another discussion, much of the time required for a dump is actually taken up by the compression phase.  The actual dump takes only a couple of days.<br />
Since the Feb 6 07 dump already has been compressed, I think it would not be necessary to produce a single compressed dump after that &#8212; monthly compressed incremental dumps are all that is required.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Erik Zachte</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3567</link>
		<dc:creator>Erik Zachte</dc:creator>
		<pubDate>Wed, 03 Oct 2007 18:24:59 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3567</guid>
		<description>Brion, most of your post addresses optimizations on the current process. This is of course very welcome, and hopefully buys us time. But sooner or later the English Wikipedia will have grown again to a size that takes 6 weeks to dump and then longer.

You already built a mechanism to append new data to an existing dump. I assume that saves a lot of time, as it saves tremendously on SQL calls. All those joins should mean quite a few I/Os per revision dumped, right? Mitigated by cached indexes, of course. Yet I wonder:

Q1: Is this incremental dump system used often or does it have practical obstacles?

Q2: As reading a full dump takes only about one day, a normal full dump (now 6 weeks for the English Wikipedia) is mostly SQL calls, right?

--------------

I could see at least two ways to split up the archive dump and make the process more manageable, at the cost of being somewhat more complicated: 

1 Produce chunks that contain exactly one calendar month of data, which is preserved as a separate entity ever after. Every new dump needs to collect data not collected before and complete the latest unfinished month and start a new one. 

A separate queue of article and revision deletions (keys only) is updated on an ongoing basis (appending keys for new deletions). So for every month there is a never-changing dump and a growing delete queue, to be merged on a researcher's PC.

This way every month needs to be downloaded only once, and for each month a deletion queue thousands of times smaller needs to be redownloaded and reapplied occasionally (in an automated fashion, of course).

One downside might be that users do not apply the latest deletions. To this I would say it is unavoidable anyway that users keep old dumps that contain officially banned records, if they wish to do so. The well-behaved way is mainly there to get proper data for research, stats and fork purposes.

I know the total size of the dumps would grow due to less optimal compression, as a bad side effect. But this system greatly shortens both dump and download times.

2 Another way to make the process more restartable would be to produce a new dump every time but write to a new file for every 100,000 articles. The current English dump would result in +/- 25 files. This would not save on dump time, but would greatly simplify disaster recovery/restartability. The process might even be distributed over several servers, each doing a part of the English Wikipedia.

In both scenarios it means either that 
- existing scripts (like wikistats) need to be adapted to handle multiple input files or 
- a new script needs to be built to merge chunks to an archive file as we know it now.

---- 

Unrelated: I sent you a mail several weeks ago to two addresses. Did you get either? If not, please mail me and I will respond to that. Thanks.</description>
		<content:encoded><![CDATA[<p>Brion, most of your post addresses optimizations on the current process. This is of course very welcome, and hopefully buys us time. But sooner or later the English Wikipedia will have grown again to a size that takes 6 weeks to dump and then longer.</p>
<p>You already built a mechanism to append new data to an existing dump. I assume that saves a lot of time, as it saves tremendously on SQL calls. All those joins should mean quite a few I/Os per revision dumped, right? Mitigated by cached indexes, of course. Yet I wonder:</p>
<p>Q1: Is this incremental dump system used often or does it have practical obstacles?</p>
<p>Q2: As reading a full dump takes only about one day, a normal full dump (now 6 weeks for the English Wikipedia) is mostly SQL calls, right?</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8211;</p>
<p>I could see at least two ways to split up the archive dump and make the process more manageable, at the cost of being somewhat more complicated: </p>
<p>1 Produce chunks that contain exactly one calendar month of data, which is preserved as a separate entity ever after. Every new dump needs to collect data not collected before and complete the latest unfinished month and start a new one. </p>
<p>A separate queue of article and revision deletions (keys only) is updated on an ongoing basis (appending keys for new deletions). So for every month there is a never-changing dump and a growing delete queue, to be merged on a researcher&#8217;s PC.</p>
<p>This way every month needs to be downloaded only once, and for each month a deletion queue thousands of times smaller needs to be redownloaded and reapplied occasionally (in an automated fashion, of course).</p>
<p>One downside might be that users do not apply the latest deletions. To this I would say it is unavoidable anyway that users keep old dumps that contain officially banned records, if they wish to do so. The well-behaved way is mainly there to get proper data for research, stats and fork purposes.</p>
<p>I know the total size of the dumps would grow due to less optimal compression, as a bad side effect. But this system greatly shortens both dump and download times.</p>
<p>2 Another way to make the process more restartable would be to produce a new dump every time but write to a new file for every 100,000 articles. The current English dump would result in +/- 25 files. This would not save on dump time, but would greatly simplify disaster recovery/restartability. The process might even be distributed over several servers, each doing a part of the English Wikipedia.</p>
<p>In both scenarios it means either that<br />
- existing scripts (like wikistats) need to be adapted to handle multiple input files or<br />
- a new script needs to be built to merge chunks to an archive file as we know it now.</p>
<p>&#8212;- </p>
<p>Unrelated: I sent you a mail several weeks ago to two addresses. Did you get either? If not, please mail me and I will respond to that. Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brion</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3561</link>
		<dc:creator>brion</dc:creator>
		<pubDate>Wed, 03 Oct 2007 13:23:14 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3561</guid>
		<description>What would you actually do with a small piece at a time, out of curiosity? What sort of technique would you actually follow that would make smaller pieces helpful?

That's not a rhetorical question, I really want to know so we can make this process better for the people using the dumps.</description>
		<content:encoded><![CDATA[<p>What would you actually do with a small piece at a time, out of curiosity? What sort of technique would you actually follow that would make smaller pieces helpful?</p>
<p>That&#8217;s not a rhetorical question, I really want to know so we can make this process better for the people using the dumps.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Milos Rancic</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3556</link>
		<dc:creator>Milos Rancic</dc:creator>
		<pubDate>Wed, 03 Oct 2007 10:11:05 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3556</guid>
		<description>Brion, is it possible to make dumps a little bit more useful? I mean, is it possible to split very large backups? It is not so useful to get a 10 GB 7z file which uncompresses into a single 100 GB file, while my biggest single partition is around 80 GB. OK, I may try some tricks, but...</description>
		<content:encoded><![CDATA[<p>Brion, is it possible to make dumps a little bit more useful? I mean, is it possible to split very large backups? It is not so useful to get a 10 GB 7z file which uncompresses into a single 100 GB file, while my biggest single partition is around 80 GB. OK, I may try some tricks, but&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: NickJ</title>
		<link>http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3542</link>
		<dc:creator>NickJ</dc:creator>
		<pubDate>Wed, 03 Oct 2007 02:24:52 +0000</pubDate>
		<guid isPermaLink="false">http://leuksman.com/log/2007/10/02/wiki-data-dumps/#comment-3542</guid>
		<description>Re: multithreaded 7-zip:
I had assumed it was multithreaded for compression already, from the "2 CPUs" message:
&#62; 7-Zip 4.43 beta  Copyright (c) 1999-2006 Igor Pavlov  2006-09-15
&#62; p7zip Version 4.43 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
(Actually, a Google hit said something about it being multithreaded by default since v4.42, rather than v4.52)... However, as you say, it may only be partial. To test it on Linux, I guess: "aptitude install sysstat", then do a 7-zip compression of something large, and then, to see the load on each CPU every 2 seconds 10 times, run "mpstat -P ALL 2 10". If it shows every CPU at around 100% usage / 0% idle for the average, then presumably 7-zip is fully utilizing all available CPUs / cores? When I tried this with 7zip 4.43 beta on an older 2 CPU x86 machine (compressing with the command line "/usr/bin/7z a -mx=9 -mfb=64 -md=32m -bd test.sql.7z test.sql"), I got this result:
------------------------------
Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
Average:     all   88.55    0.00    0.65    0.00    0.00    0.00    0.00   10.81    253.54
Average:       0   95.51    0.00    0.50    0.00    0.00    0.00    0.00    3.99    250.25
Average:       1   81.66    0.00    0.90    0.00    0.00    0.00    0.00   17.55      3.29
------------------------------
.... so yeah, around 1.89x comp parallelism, whereas ideally it'd be 1.99x. Still, it's better than 1.00x ;-)</description>
		<content:encoded><![CDATA[<p>Re: multithreaded 7-zip:<br />
I had assumed it was multithreaded for compression already, from the &#8220;2 CPUs&#8221; message:<br />
&gt; 7-Zip 4.43 beta  Copyright (c) 1999-2006 Igor Pavlov  2006-09-15<br />
&gt; p7zip Version 4.43 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)<br />
(Actually, a Google hit said something about it being multithreaded by default since v4.42, rather than v4.52)&#8230; However, as you say, it may only be partial. To test it on Linux, I guess: &#8220;aptitude install sysstat&#8221;, then do a 7-zip compression of something large, and then, to see the load on each CPU every 2 seconds 10 times, run &#8220;mpstat -P ALL 2 10&#8221;. If it shows every CPU at around 100% usage / 0% idle for the average, then presumably 7-zip is fully utilizing all available CPUs / cores? When I tried this with 7zip 4.43 beta on an older 2 CPU x86 machine (compressing with the command line &#8220;/usr/bin/7z a -mx=9 -mfb=64 -md=32m -bd test.sql.7z test.sql&#8221;), I got this result:<br />
&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br />
Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s<br />
Average:     all   88.55    0.00    0.65    0.00    0.00    0.00    0.00   10.81    253.54<br />
Average:       0   95.51    0.00    0.50    0.00    0.00    0.00    0.00    3.99    250.25<br />
Average:       1   81.66    0.00    0.90    0.00    0.00    0.00    0.00   17.55      3.29<br />
&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br />
&#8230;. so yeah, around 1.89x comp parallelism, whereas ideally it&#8217;d be 1.99x. Still, it&#8217;s better than 1.00x <img src='http://leuksman.com/log/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
</channel>
</rss>
