DEC 9th 2008

Freebase stores millions of entities and assertions about nearly every topic one can ponder (thanks are owed to their seed dataset – Wikipedia – and their amazing community). The amount of information that Freebase stores is incredible, and is a testament to what can be accomplished with the help of a dedicated community and a little (or a lot) of clever software engineering.

Graphd is the in-house tuple store powering Freebase's backend. Written in C, Graphd runs on Unix-based machines (presumably some Linux distro) and processes commands in a simple, template-based query language called MQL. The query language looks strikingly similar to JSON and Python dictionary syntax, so developers familiar with either should find working with their API a cinch.
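To give a feel for what "template-based" means, here is a minimal sketch in Python. It is not Graphd or the real MQL engine: the entities are hypothetical, the `run_query` and `matches` helpers are my own names, and the `~=` handling is simplified to a case-insensitive substring test. The idea it illustrates is query by example: you hand over a dictionary shaped like the answer you want, with `None` in the slots to be filled in.

```python
# Hypothetical, heavily simplified entities; real Freebase data is far richer.
ENTITIES = [
    {"id": "/en/herman_melville", "type": "/book/author", "name": "Herman Melville"},
    {"id": "/en/hermann_hesse", "type": "/book/author", "name": "Hermann Hesse"},
    {"id": "/en/jane_austen", "type": "/book/author", "name": "Jane Austen"},
    {"id": "/en/herman_cain", "type": "/people/person", "name": "Herman Cain"},
]


def matches(entity, template):
    """Return True if the entity satisfies every constraint in the template."""
    for key, wanted in template.items():
        if key.endswith("~="):
            # Pattern constraint, simplified here to a case-insensitive substring test.
            field = key[:-2]
            if wanted.lower() not in str(entity.get(field, "")).lower():
                return False
        elif wanted is None:
            # None means "please return this field", so it only has to exist.
            if key not in entity:
                return False
        elif entity.get(key) != wanted:
            return False
    return True


def run_query(template, entities=ENTITIES):
    """Fill in the template's None slots for every matching entity."""
    results = []
    for entity in entities:
        if matches(entity, template):
            results.append({k: entity[k] for k, v in template.items() if v is None})
    return results


# "Show me all authors with names containing 'herman'."
query = {"type": "/book/author", "name": None, "name~=": "herman"}
print(run_query(query))
```

The query is an ordinary Python dictionary, which is why the syntax feels so familiar. A real store would answer it from indices rather than by walking every entity.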

On performance, Freebase's Scott Meyer stated on April 9th, 2008 that Graphd can sustain about 200,000 simple queries per minute on a single AMD64 core (against a graph of only 121 million tuples, however). As an example of a simple query, he gave "show me all people who are authors with names containing 'herman'." As of the same date, their graph of 121 million primitives (tuples) consumed about 12 GB on disk, including all index storage.

That works out to a stunning sustained ~3,300 queries/sec on a single AMD64 core, which is nothing to scoff at. However, the question I am finally getting around to is this: can Graphd scale to meet the demands of the Semantic Web? Eventually, Freebase will be much larger. 121 million tuples is nothing compared to the amount of data currently available in RDF (already on the order of billions of assertions).
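The back-of-the-envelope arithmetic behind those numbers can be checked in a few lines of Python. The inputs are the figures quoted above. Reading "12 GB" as decimal gigabytes and scaling disk use linearly to a hypothetical 1 billion tuples are both my own assumptions, so treat the projection as a rough guess rather than a measurement.

```python
QUERIES_PER_MINUTE = 200_000
TUPLES = 121_000_000
DISK_BYTES = 12 * 10**9  # "about 12 GB", read as decimal gigabytes (assumption)

queries_per_second = QUERIES_PER_MINUTE / 60
bytes_per_tuple = DISK_BYTES / TUPLES

# Naive projection: assumes disk use grows linearly with tuple count (assumption).
HYPOTHETICAL_TUPLES = 1_000_000_000
projected_gb = bytes_per_tuple * HYPOTHETICAL_TUPLES / 10**9

print(f"{queries_per_second:.0f} queries/sec")
print(f"{bytes_per_tuple:.0f} bytes per tuple, indices included")
print(f"{projected_gb:.0f} GB for {HYPOTHETICAL_TUPLES:,} tuples")
```

At roughly 100 bytes per tuple with indices included, a graph in the billions would need disk space in the hundreds of gigabytes, far more than fits in RAM on a typical machine.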

I have read in comments that Graphd runs completely in memory (or, perhaps more likely, only its indices do). This explains the amazing performance to a degree. On an AMD64 Phenom quad-core with 2 GB of RAM I can run "simple" operations linearly through a flat file of 17 million Freebase tuples in under 6 seconds (in memory). On a slice of 1 million tuples the test was able to complete its iterations within ~0.003 seconds. The test was written in Python, so it isn't even as quick as Graphd (written in C) has the potential to be.
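For readers who want to try something similar, a linear in-memory scan can be sketched as below. This is not the script behind the timings above. The tuples are synthetic, `make_tuples` and `scan` are illustrative names, and the tuple count is kept small so it runs comfortably on a modest machine. Timings will vary with hardware and Python version.

```python
import time


def make_tuples(n):
    """Build n synthetic (subject, predicate, object) tuples in memory."""
    return [(f"/guid/{i}", "/type/object/name", f"name {i}") for i in range(n)]


def scan(tuples, needle):
    """Linear pass: count the tuples whose object contains the needle."""
    return sum(1 for _, _, obj in tuples if needle in obj)


N = 200_000  # raise this to approach the slice sizes discussed above
tuples = make_tuples(N)

start = time.perf_counter()
hits = scan(tuples, "999")
elapsed = time.perf_counter() - start

print(f"{hits} hits over {N:,} tuples in {elapsed:.3f} s")
```

Every tuple is touched exactly once, so the running time grows in step with the size of the data set. That is fine while everything fits in RAM and painful once it does not.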

The test should illustrate the amazing performance you can achieve when processing entirely in memory. But when you can no longer store your entire set of indices in memory (say, for 3 billion+ tuples), you have to apply some of that clever software engineering to quickly locate data positions regardless of the number or distribution of indices.
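One common way to locate data positions without holding a full index in RAM is a sparse index over a sorted file: keep only every Nth key in memory, binary-search those samples, then read a single block from disk. The sketch below illustrates that general technique. It is not a description of how Graphd works, and the record size, stride, file name, and helper names are all assumptions of mine.

```python
import bisect
import os
import tempfile

RECORD_SIZE = 32  # fixed-width records keep the offset arithmetic trivial
STRIDE = 1000     # keep one key in memory per 1000 records on disk


def write_sorted(path, keys):
    """Write the keys in sorted order as fixed-width, space-padded records."""
    with open(path, "wb") as f:
        for key in sorted(keys):
            f.write(key.encode().ljust(RECORD_SIZE))


def build_sparse_index(path):
    """Sample every STRIDE-th key, so memory use is 1/STRIDE of a full index."""
    samples = []
    total = os.path.getsize(path) // RECORD_SIZE
    with open(path, "rb") as f:
        for pos in range(0, total, STRIDE):
            f.seek(pos * RECORD_SIZE)
            samples.append(f.read(RECORD_SIZE).rstrip())
    return samples


def lookup(path, samples, key):
    """Return the record number of key, or None, reading one block from disk."""
    target = key.encode()
    # The last sample <= target marks the only block that can hold the key.
    block = bisect.bisect_right(samples, target) - 1
    if block < 0:
        return None
    with open(path, "rb") as f:
        f.seek(block * STRIDE * RECORD_SIZE)
        data = f.read(STRIDE * RECORD_SIZE)
    for i in range(0, len(data), RECORD_SIZE):
        if data[i:i + RECORD_SIZE].rstrip() == target:
            return block * STRIDE + i // RECORD_SIZE
    return None


# Usage: 50,000 synthetic keys, zero-padded so string order equals numeric order.
path = os.path.join(tempfile.mkdtemp(), "tuples.idx")
write_sorted(path, (f"/guid/{i:08d}" for i in range(50_000)))
samples = build_sparse_index(path)
print(len(samples), lookup(path, samples, "/guid/00012345"))
```

Only 50 keys live in memory here, yet any of the 50,000 records can be found with one seek and one block read. The trade-off is the stride: a larger stride saves memory but means scanning a bigger block per lookup.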

Can Freebase scale Graphd to meet the demands of the Semantic Web, or will they need to completely redesign the architecture of their backend to reach a scale not originally designed for? I cannot say, but I wish them the best of luck. I think I speak for everyone when I say I would really like to see Graphd open sourced!

PS: Freebase, I promise I'll use the new logo in my posts going forward.

About the author

James Simmons

It's my goal to help bring about the Semantic Web. I also like to explore related topics like natural language processing, information retrieval, and web evolution. I'm the primary author of Semantic Focus and I'm currently working on several Semantic Web projects.
