DEC 18th 2008

Calling All RDF Dumps

Published 1 year ago by James Simmons

Today on the Linking Open Data mailing list, Kingsley Idehen of OpenLink Software announced that he is preparing to load the entire LOD cloud into Virtuoso 6.0 Cluster Edition. The datasets are being added to a table on the ESW wiki, making it convenient for anyone doing Semantic Web research to get a hold of the datasets. Once all the datasets are added we should have a better idea of how much linked data there really is out there. This may also raise the bar for other triple stores and force them to develop methods for storing several billion triples.

Here are his instructions for adding your dataset to the table:

  • Go to: http://esw.w3.org/topic/DataSetRDFDumps
  • Add your data set to the table (if it isn't already listed) or correct erroneous entries
  • Add a URL entry to the "Archive URL" column
  • Add a Publisher URI to the "Publisher / Maintainer" column (used for the construction of Attribution Triples)

If you don't have a URI for yourself, you can get one by registering.
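
As a rough illustration of what an "Attribution Triple" built from the Publisher URI might look like, here is a minimal sketch that emits one N-Triples line. The announcement does not say which predicate is used, so the choice of `dcterms:publisher` is an assumption, and the function name and the example.org URIs are hypothetical placeholders.

```python
# Hypothetical sketch: the predicate below is an assumption, not something
# stated in the announcement or on the wiki page.
DCTERMS_PUBLISHER = "http://purl.org/dc/terms/publisher"


def attribution_triple(dataset_uri: str, publisher_uri: str) -> str:
    """Return one N-Triples line linking a dataset to its publisher."""
    for uri in (dataset_uri, publisher_uri):
        # Reject characters that cannot appear inside <...> in N-Triples.
        if not uri or any(ch in uri for ch in '<>" \n\t'):
            raise ValueError(f"not a usable URI: {uri!r}")
    return f"<{dataset_uri}> <{DCTERMS_PUBLISHER}> <{publisher_uri}> ."


if __name__ == "__main__":
    # Placeholder URIs for illustration only.
    print(attribution_triple(
        "http://example.org/dataset/dump",
        "http://example.org/people/alice#me",
    ))
```

The point is only that a publisher needs a dereferenceable URI before a statement like this can be made about their dataset.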

About the author

James Simmons

It's my goal to help bring about the Semantic Web. I also like to explore related topics like natural language processing, information retrieval, and web evolution. I'm the primary author of Semantic Focus and I'm currently working on several Semantic Web projects.

Comments for this entry:

  1. Posted by Utopiah on December 19, 2008 at 5:55am

    How does this differ from The Map of Data from Sindice (by DERI) at http://sindice.com/map ?

  2. Posted by Kingsley Idehen on December 19, 2008 at 10:04am

    James,

    Thanks for the "shout out" :-)

    Here are some links re. Virtuoso 6.0 (Cluster Edition) for those that may be interested in this incarnation of Virtuoso:

    1. http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1506
    2. http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1494

    Kingsley

  3. Posted by James Simmons on December 19, 2008 at 1:04pm

    @Utopiah:

    Sindice's Map of Data is currently a list of sites and what kind of metadata (RDF, Microformats) those sites are exposing. Giovanni Tummarello of Sindice posted a comment on that post and he said they're working on a live LOD cloud which is similar to what Kingsley is doing. The big difference between the two clouds is that I believe Sindice's will include the embedded metadata from the sites on their current Map of Data page.

    @Kingsley:

    No problem ;) Thanks for including the links. So it looks like right now Virtuoso 6.0 Cluster Edition can store anywhere between 500 million and 1 billion triples on a single server, depending on the heterogeneity of the data. That's great! Much more than I was ever able to squeeze into any RDBMS. 16 GB of RAM for 500 million triples suggests this is done entirely (or nearly entirely) in memory? That would explain the incredible performance (~250,000 single-triple random lookups/sec as long as disk reads are not involved).

    I was stoked when I read that Cluster Edition can be scaled into the trillions of triples! Have you had the opportunity to scale to that level yet? I doubt there are even 1 trillion triples out there yet (in dataset form) so I figure you'd need large amounts of duplicate or generated data to do it. Would that also mean you'd need 1000 to 2000 servers on hand to perform that test?

    James
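
The back-of-the-envelope figures in the comment above can be checked with a short sketch. It assumes "16 GB" means 16 GiB and that per-server capacity stays at 500 million to 1 billion triples as a cluster grows; the constant and function names are made up for illustration.

```python
import math

# Figures quoted in the comment; treating 16 GB as 16 GiB is an assumption.
RAM_BYTES = 16 * 2**30
TRIPLES_PER_SERVER_LOW = 500_000_000
TRIPLES_PER_SERVER_HIGH = 1_000_000_000
TARGET_TRIPLES = 10**12  # one trillion


def bytes_per_triple(ram_bytes: int, triples: int) -> float:
    """RAM available per triple if the whole store had to fit in memory."""
    return ram_bytes / triples


def servers_needed(target_triples: int, triples_per_server: int) -> int:
    """Servers required, assuming capacity per server does not change."""
    return math.ceil(target_triples / triples_per_server)


if __name__ == "__main__":
    print(round(bytes_per_triple(RAM_BYTES, TRIPLES_PER_SERVER_LOW), 1))
    print(servers_needed(TARGET_TRIPLES, TRIPLES_PER_SERVER_HIGH))
    print(servers_needed(TARGET_TRIPLES, TRIPLES_PER_SERVER_LOW))
```

Under these assumptions each triple gets roughly 34 bytes of RAM, and a trillion triples works out to the 1000 to 2000 servers mentioned in the comment.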

  4. Posted by Kingsley Idehen on December 22, 2008 at 7:10am

    @James,

    Re. Trillions of Triples, we will get there as the Linked Data Web grows and other players come into the fold, e.g. Sun, IBM, HP, Apple, and anyone else with a vested interest in showcasing the prowess of their hardware offerings in the cluster-based distributed computing realm.

    BTW - you will be able to build your own clusters on EC2 once we are done. This is another area where service providers, analysts, and researchers will be able to exploit this technology on a "pay as you go" basis.

    The difference between our setup and Sindice is that we aren't fundamentally about RDF document indexing and search. We are about the Serendipitous Discovery of Relevant things, based on the RDF-based Linked Data lookup database we have been constructing for a while now.

    Of course, Sindice has some commonality with this effort, but once we are done with the data loading the differences will be much clearer, and the symbiotic aspects of both efforts will bubble up.

    Kingsley
