Can Graphd Scale to Meet Semantic Web Demands?
Published 1 year ago by James Simmons
Freebase stores millions of entities and assertions about nearly every topic one can ponder (thanks are owed to their seed dataset – Wikipedia – and their amazing community). The amount of information that Freebase stores is incredible, and is a testament to what can be accomplished with the help of a dedicated community and a little (or a lot) of clever software engineering.
Graphd is the in-house tuple store powering Freebase's backend. Written in C, Graphd runs on Unix-based machines (presumably some Linux distro) and processes commands in a simple, template-based query language called MQL. The query language looks strikingly similar to JSON and Python dictionary syntax, so developers familiar with either should find working with their API a cinch.
On performance, Freebase's Scott Meyer stated on April 9th, 2008 that Graphd can sustain a throughput of about 200,000 simple queries per minute on a single AMD64 core (querying a graph of only 121 million tuples, however). As an example of a simple query, he gave "show me all people who are authors with names containing 'herman'." Also as of April 9th, 2008, their graph of 121 million primitives (tuples) consumed about 12 GB on disk, including all index storage.
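To make the "herman" example concrete, here is a rough sketch of how such a query might be written as a Python literal and serialised to JSON. The type id "/book/author", the "name~=" pattern constraint and the {"query": ...} envelope are my assumptions about how MQL would express this, not something Scott confirmed.

```python
import json

# Hypothetical MQL-style query written as a Python literal.
# Assumptions: the "/book/author" type id, the "name~=" pattern constraint,
# and the {"query": ...} envelope are illustrative, not taken from the post.
herman_query = [{
    "type": "/book/author",
    "name": None,        # None becomes JSON null: "fill this value in for me"
    "name~=": "herman",  # assumed pattern-match constraint on the name
}]

# The same structure serialised as JSON, which is what would go over the wire.
payload = json.dumps({"query": herman_query})
print(payload)
```

The point is only that the query is a template: you describe the shape of the answer you want and leave blanks for the service to fill in.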
Graphd therefore handles a stunning sustained ~3,300 queries/sec on a single AMD64 core, which is nothing to scoff at. However, the question I am finally getting around to is this: can Graphd scale to meet the demands of the Semantic Web? Eventually, Freebase will be much larger. 121m tuples is nothing compared to the amount of data currently available in RDF (already on the order of billions of assertions).
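For anyone who wants to check the arithmetic, here is a small back-of-the-envelope sketch using only the figures quoted above. The projection to one billion tuples assumes storage grows linearly with tuple count and that "12gb" means decimal gigabytes; both are my assumptions, and real growth would depend on compression and index layout.

```python
# Figures quoted in the post (April 9th, 2008).
QUERIES_PER_MINUTE = 200_000
TUPLES = 121_000_000
DISK_BYTES = 12 * 10**9  # "about 12gb", assumed here to mean decimal gigabytes


def queries_per_second(per_minute):
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


def bytes_per_tuple(disk_bytes, tuples):
    """Average on-disk footprint of one tuple, index storage included."""
    return disk_bytes / tuples


def projected_gb(tuples, per_tuple_bytes):
    """Naive linear projection; ignores compression and index growth."""
    return tuples * per_tuple_bytes / 10**9


qps = queries_per_second(QUERIES_PER_MINUTE)          # about 3,333 per second
per_tuple = bytes_per_tuple(DISK_BYTES, TUPLES)       # about 99 bytes
one_billion = projected_gb(1_000_000_000, per_tuple)  # about 99 GB

print(f"{qps:.0f} queries/sec, {per_tuple:.0f} bytes/tuple, "
      f"{one_billion:.0f} GB for one billion tuples")
```

Roughly 99 bytes per tuple, indices included, means a billion tuples would need on the order of 100 GB on disk under these assumptions, which is well beyond the RAM of a single commodity machine.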
I have read in comments that Graphd runs completely in memory (or, perhaps more likely, only its indices do). This explains the amazing performance to a degree. On an AMD64 Phenom quad-core with 2 GB of RAM I can run "simple" operations linearly through a flat file of 17m Freebase tuples in under 6 seconds (in memory). On a slice of 1m tuples the test was able to complete the iterations within ~0.003 seconds. The test was written in Python, so it is not even as fast as Graphd (written in C) could potentially be.
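A test of this kind can be sketched in a few lines of Python. This is not the script I ran, just a minimal illustration: the tab-separated subject/predicate/object layout, the sample rows and the function names are all assumptions made for the example.

```python
import io
import time

# Hypothetical flat-file layout: one tuple per line, tab-separated
# subject, predicate, object. The rows below are made up for illustration.
SAMPLE = io.StringIO(
    "/guid/1\t/type/object/name\tHerman Melville\n"
    "/guid/2\t/type/object/name\tJane Austen\n"
    "/guid/1\t/type/object/type\t/book/author\n"
    "/guid/3\t/type/object/name\tHermann Hesse\n"
)


def load_tuples(lines):
    """Read every well-formed three-field line into an in-memory list."""
    return [tuple(line.rstrip("\n").split("\t", 2)) for line in lines if line.strip()]


def scan(tuples, predicate, needle):
    """Linear pass: subjects whose value for `predicate` contains `needle`."""
    needle = needle.lower()
    return [s for s, p, o in tuples if p == predicate and needle in o.lower()]


tuples = load_tuples(SAMPLE)

start = time.perf_counter()
matches = scan(tuples, "/type/object/name", "herman")
elapsed = time.perf_counter() - start

print(matches, f"{elapsed:.6f}s")
```

There is no index at all here; every query touches every tuple. That is fine while the whole list fits in RAM, and it is exactly the part that stops working once it does not.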
The test should illustrate the amazing performance you can achieve when processing entirely in memory, but when you can no longer store your entire set of indices in memory (say, for 3b+ tuples) you have to apply some of that clever software engineering to quickly locate data positions regardless of the number or distribution of indices.
Can Freebase scale Graphd to meet the demands of the Semantic Web, or will they need to completely redesign the architecture of their backend to reach a scale it was not originally designed for? I cannot say, but I wish them the best of luck. I think I speak for everyone when I say I would really like to see Graphd open sourced!
PS: Freebase, I promise I'll use the new logo in my posts going forward.

Posted by Scott Meyer on December 9, 2008 at 5:27pm
Can the semantic web scale to meet the demands of graphd?
On today's commodity hardware with our current, relatively modest compression, graphd could handle well over 2 billion tuples. The problem is finding 2 billion tuples worth of high-quality, unambiguous, generally interesting data. If there are a dozen entities claiming to be "Walmart" or "Arnold Schwarzenegger" (or both!) much of the utility of a centralized graph database goes up in smoke and users have to revert to comparing strings (joining based on value instead of identity) and trying to figure out how to determine whether "java" means the language, the island, or the coffee, and which of 3 different estimates for the population of six different Frances they'd like to use.
We're glad you like graphd, and believe me, we'd be absolutely thrilled to be concerned about scalability.
-Scott
Posted by James Simmons on December 10, 2008 at 2:09pm
Hi Scott,
>>On today's commodity hardware with our current, relatively modest compression, graphd could handle well over 2 billion tuples.
Are you using sequential GUIDs simply to ease compression? I imagine some sort of delta encoding or diff representation being used (depending on whether you store the full URI or simply the numeric value of the GUID). Which leads me to my next question: is Graphd actually a column store?
Are you able to process over the compressed indices or does Graphd uncompress indices before processing over them?
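To show what I mean by delta encoding, here is a minimal sketch, assuming ids are non-negative integers kept in sorted order. It stores each gap as a variable-length integer and can answer a membership question by streaming through the bytes, without first rebuilding the whole list. This is only my guess at the general technique; I have no idea whether Graphd does anything like it.

```python
def encode_deltas(sorted_ids):
    """Store each id as the gap from the previous one, as a base-128 varint."""
    out = bytearray()
    prev = 0
    for value in sorted_ids:
        delta = value - prev  # small when ids are sequential
        prev = value
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # high bit set: more bytes follow
            delta >>= 7
        out.append(delta)
    return bytes(out)


def iter_ids(blob):
    """Stream the original ids back out, one at a time."""
    value = 0
    delta = 0
    shift = 0
    for byte in blob:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            value += delta
            yield value
            delta = 0
            shift = 0


def contains(blob, target):
    """Membership test over the compressed form, stopping early."""
    for value in iter_ids(blob):
        if value == target:
            return True
        if value > target:
            return False
    return False


ids = [1000, 1001, 1003, 1500]
blob = encode_deltas(ids)
print(len(blob), list(iter_ids(blob)), contains(blob, 1003))
```

The closer together the ids are, the smaller the gaps and the fewer bytes each one takes, which is why sequential GUIDs would make this kind of scheme attractive.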
Do properties such as http://rdf.freebase.com/ns/type.object.type have a GUID associated with them, and could that GUID be used instead of the human-friendly URI?
For example, http://www.freebase.com/view/en/arnold_schwarzenegger and http://www.freebase.com/view/guid/9202a8c04000641f8000000000006567 both represent Arnold, but how is this handled internally? Is it wrong to use the /en/arnold_schwarzenegger URI as the source of a primitive? Does this bog Graphd down when primitives mix GUIDs with human-friendly URIs?
>>The problem is finding 2 billion tuples worth of high-quality, unambiguous, generally interesting data. If there are a dozen entities claiming to be "Walmart" or "Arnold Schwarzenegger" (or both!) much of the utility of a centralized graph database goes up in smoke [...]
Ambiguity is going to continue being an issue well into the future (we're human, so possibly forever). Does this mean that Freebase has a difficult time dealing with disambiguation, or are there sufficient policies and practices (and code) in place to consolidate a dozen Walmart entities? This is perhaps another way in which Graphd would need to "scale" for the Semantic Web.
Thanks for the reply, I've been very curious about Graphd as of late. By the way, I like that you updated the blog design; it looks much cleaner.
Posted by Scott Meyer on December 11, 2008 at 9:52am
Sequential guids are fundamental to the physical representation of the data and indexes. Human readable (I'm being optimistic) names like http://www.freebase.com/view/en/arnold_schwarzenegger or http://rdf.freebase.com/ns/type.object.type refer to guids. Aside from a nominal translation overhead, there's no penalty for using names instead of guids and, in addition to being more palatable to humans, the indirection present in the name space insulates applications from the (infrequent) identity changes that do occur.
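The indirection described here can be pictured with a small sketch. Everything in it is hypothetical (the class, its methods, and the second guid value are invented for illustration) and it says nothing about how graphd actually implements its name space; it only shows why a name lookup costs one extra step and why applications that use names survive an identity change.

```python
class Namespace:
    """Hypothetical name-to-guid table; not graphd's actual design."""

    def __init__(self):
        self._name_to_guid = {}

    def bind(self, name, guid):
        """Point a human-readable name at a guid, replacing any old binding."""
        self._name_to_guid[name] = guid

    def resolve(self, key):
        """Guids (ints) pass straight through; names cost one dict lookup."""
        if isinstance(key, int):
            return key
        return self._name_to_guid[key]


# The guid from the URL quoted in this thread, read as a hexadecimal number.
ARNOLD = int("9202a8c04000641f8000000000006567", 16)
MERGED = ARNOLD + 1  # made-up guid standing in for a later identity change

ns = Namespace()
ns.bind("/en/arnold_schwarzenegger", ARNOLD)

by_name = ns.resolve("/en/arnold_schwarzenegger")
by_guid = ns.resolve(ARNOLD)

# After an (infrequent) identity change only the binding moves;
# code that asks for the name keeps working unchanged.
ns.bind("/en/arnold_schwarzenegger", MERGED)
after_change = ns.resolve("/en/arnold_schwarzenegger")

print(by_name == by_guid, after_change == MERGED)
```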
Generally, we work directly with the physical representation of the data or index.
Graphd is just a database. It doesn't "care" about ambiguity any more than any other database. The problem with ambiguity (or, more generally, poor data quality) is that it makes applications harder to write and applications are the reason for having a database at all.
-Scott
Posted by James Simmons on December 11, 2008 at 12:22pm
Illuminating :)
Thanks for taking the time to answer my questions. I have a much better idea of how things work on Freebase now.