<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>erik frey</title>
	<atom:link href="http://fawx.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://fawx.com</link>
	<description></description>
	<lastBuildDate>Tue, 27 Oct 2009 01:11:31 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>An Ode to Set Intersection, Part 1</title>
		<link>http://fawx.com/2009/10/26/an-ode-to-set-intersection-part-1/</link>
		<comments>http://fawx.com/2009/10/26/an-ode-to-set-intersection-part-1/#comments</comments>
		<pubDate>Mon, 26 Oct 2009 21:20:11 +0000</pubDate>
		<dc:creator>Erik Frey</dc:creator>
				<category><![CDATA[tech]]></category>

		<guid isPermaLink="false">http://fawx.com/?p=84</guid>
		<description><![CDATA[
This here is Norman Casagrande, last.fm&#8217;s head researcher and my former mentor.  The facial expression seen here is one of my favorites.  It&#8217;s usually accompanied by some questionably-translated Italian aphorism, &#8220;You know, Erik, that is like the ox saying &#8216;horned&#8217; to the donkey.&#8221;
This smarmy smile also usually means I&#8217;m about to learn something new.  I saw [...]]]></description>
			<content:encoded><![CDATA[<p><img class="size-thumbnail wp-image-85  alignleft" style="border: 1px solid black; padding: 2px;" title="Norman Casagrande" src="http://fawx.com/wp-content/uploads/2009/10/Norman+Casagrande-150x150.jpg" alt="Norman Casagrande, Head of Music Research at Last.fm" width="150" height="150" /></p>
<p>This here is Norman Casagrande, last.fm&#8217;s head researcher and my former mentor.  The facial expression seen here is one of my favorites.  It&#8217;s usually accompanied by some questionably-translated Italian aphorism, &#8220;You know, Erik, that is like the ox saying &#8216;horned&#8217; to the donkey.&#8221;</p>
<p>This smarmy smile also usually means I&#8217;m about to learn something new.  I saw it my very first day at last.fm.  Norman was describing to me various gears and nozzles that power last.fm&#8217;s recommendations, and when he showed me the code that compares two musical items, I stopped him.  I saw my chance to strike.</p>
<h2>Set Intersection</h2>
<p>Set intersection is a delightful little haiku problem in computer science.  It&#8217;s simple, elegant, and is ubiquitous in search engine tech, collaborative filtering mechanisms, various types of matrix math&#8230; probably other places, too!  Its most basic implementation is linear in complexity, and most people are happy with linear performance here.  But as I studied the guts of Norman&#8217;s recommendation engine, I saw my first opportunity to unleash unspeakable power!</p>
<p>&#8220;Norman, what if I told you that there are set intersections much more efficient than this?  In fact, I know of one that uses a search algorithm that is&#8230;&#8221;, I whispered almost breathlessly, &#8220;<em>log-log</em>!&#8221;</p>
<p>&#8220;Ehhh, you can try but I don&#8217;t think it will be faster.  But you throw the spaghetti at the ceiling propeller and we will see what stays there.&#8221;  And out came that infuriating smile.</p>
<p>So I set about to wipe that smile off Norman&#8217;s face.  I was going to write the Rolls-Royce jet engine of all set intersection algorithms.</p>
<h2 style="font-size: 1.5em;">Interpolation Search and Dr. Baeza-Yates</h2>
<p>To write my code I drew from the experience I&#8217;d had at a job interview with Google some time before.  A smart scientist had quizzed me about <a title="Interpolation search - Wikipedia" href="http://en.wikipedia.org/wiki/Interpolation_search">interpolation search</a>, a way of finding elements in sorted sets by making educated guesses as to their location.  He claimed it had contributed greatly to the performance of operations at Google, so I was sure it could do the same for me.</p>
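<p>For the curious, here is a minimal sketch of interpolation search in Python.  It is not the code from that interview or from my repository; the function name <code>interpolation_search</code> is made up for illustration, and it assumes a sorted list of integers that are spread roughly evenly.</p>

```python
def interpolation_search(items, target):
    """Return the index of target in a sorted list of ints, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi and items[lo] <= target <= items[hi]:
        if items[lo] == items[hi]:
            # Every remaining value is identical; avoid dividing by zero.
            return lo if items[lo] == target else -1
        # Educated guess: assume values are spread evenly between lo and hi.
        pos = lo + (target - items[lo]) * (hi - lo) // (items[hi] - items[lo])
        if items[pos] == target:
            return pos
        if items[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

<p>On evenly spread data the guess tends to land close to the target, which is where the log-log behavior comes from.  On badly skewed data the guesses can be poor and the search can degrade toward linear.</p>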
<p>In turn I&#8217;d found a really excellent paper on set intersection, <a href="http://www.springerlink.com/content/yth9h90y94n10l7e/">A Fast Set Intersection Algorithm for Sorted Sequences</a>, by a celebrated, highly-respected dude in the field of data mining, Ricardo Baeza-Yates.  In this paper he proposes the following algorithm:</p>
<ol>
<li>Pick the median element, A, in the smaller set.</li>
<li>Search for its insertion-position element, B, in the larger set.</li>
<li>If A and B are equal, append the element to the result.</li>
<li>Repeat steps 1-3 on non-empty subsets on either side of elements A and B.</li>
</ol>
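<p>The steps above can be sketched roughly as follows.  This is my own illustrative Python, not the paper&#8217;s pseudocode or the code in my repository: the names <code>intersect</code> and <code>_intersect</code> are hypothetical, it assumes both inputs are sorted lists of distinct values, and it uses the standard library&#8217;s binary search (<code>bisect_left</code>) for step 2.</p>

```python
from bisect import bisect_left


def intersect(small, large):
    """Intersect two sorted lists of distinct values; the result is sorted."""
    result = []
    _intersect(small, 0, len(small), large, 0, len(large), result)
    return result


def _intersect(a, a_lo, a_hi, b, b_lo, b_hi, out):
    # Stop when either half-open range [lo, hi) is empty.
    if a_lo >= a_hi or b_lo >= b_hi:
        return
    # Always take the median from the smaller of the two ranges.
    if a_hi - a_lo > b_hi - b_lo:
        a, a_lo, a_hi, b, b_lo, b_hi = b, b_lo, b_hi, a, a_lo, a_hi
    mid = (a_lo + a_hi) // 2
    pivot = a[mid]
    # Step 2: find the pivot's insertion position in the larger range.
    pos = bisect_left(b, pivot, b_lo, b_hi)
    # Recurse on the left side first so the output stays sorted.
    _intersect(a, a_lo, mid, b, b_lo, pos, out)
    found = pos < b_hi and b[pos] == pivot
    if found:
        out.append(pivot)
    # Recurse on the right side, skipping the matched element if any.
    _intersect(a, mid + 1, a_hi, b, pos + 1 if found else pos, b_hi, out)
```

<p>Swapping <code>bisect_left</code> for an interpolation search that returns an insertion position would give the interpolation variant.</p>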
<p>Dr. Baeza-Yates showed that this algorithm does particularly well when one set is much smaller than the other.  Just to be sure, I wrote code to count the number of comparisons for three kinds of set intersection: plain old linear set intersection, Dr. Baeza-Yates&#8217; special set intersection with <a href="http://en.wikipedia.org/wiki/Binary_search">binary search</a>, and his set intersection with interpolation search.</p>
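<p>To give a flavor of what counting comparisons means for the linear baseline, here is a small Python sketch.  It is an assumption about how one might count, not how my profiler actually does it: each loop iteration is treated as a single three-way comparison, and the name <code>linear_intersect_counting</code> is made up for this example.</p>

```python
def linear_intersect_counting(a, b):
    """Merge-style intersection of two sorted lists.

    Returns (result, comparisons), counting one three-way comparison
    per loop iteration (an assumed counting convention).
    """
    i = j = 0
    comparisons = 0
    result = []
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b
        else:
            j += 1  # b[j] cannot appear later in a
    return result, comparisons
```

<p>For example, <code>linear_intersect_counting([1, 3, 5], [3, 4, 5])</code> returns <code>([3, 5], 4)</code>.</p>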
<p>View/download the source for this experiment from my <a href="http://github.com/erikfrey/themas/tree/master/src/set_intersection/">themas github repository</a>.  You can profile the algorithms on uniform random data by building the project and running:</p>
<p><code>hexdump -e '1/4 " %u"' /dev/urandom | ./set_intersection_profile binary compares 100000 0.1 1.0 0.1 20</code></p>
<p>The results were exactly what I&#8217;d hoped, looking at the number of comparisons as I varied the ratio between the two set sizes:</p>
<p style="text-align: center;"><img class="size-full wp-image-115  aligncenter" style="border: 1px solid black; padding: 2px;" title="Set Intersection Comparisons" src="http://fawx.com/wp-content/uploads/2009/10/Set-Intersection-Comparisons.jpg" alt="Set Intersection Comparisons" width="400" height="330" /></p>
<p>Not only did the interpolation intersection use far fewer comparisons in the case where |M| &lt;&lt; |N| (that is, where set M is far smaller than set N), but it performed better than linear everywhere else, too!  Armed with this knowledge, I was ready to take Norman down.</p>
<p>To be continued!</p>
<p><em>p.s.</em> For the observant programmer who is curious as to why the linear intersection&#8217;s line isn&#8217;t flat in the graph above, see <a href="http://github.com/erikfrey/themas/commit/e734e20712404bd403d7e103673d4b253df75247">this commit</a> and this tangentially-related <a href="http://kerneltrap.org/node/4705">article about __builtin_expect</a>.  You&#8217;ll figure it out!</p>
]]></content:encoded>
			<wfw:commentRss>http://fawx.com/2009/10/26/an-ode-to-set-intersection-part-1/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
