The Object Oriented Web - Part 2 - Datahubs
Published 1 year ago by Manuel Vila
To begin with, there is a very simple idea: Websites should themselves notify search engines of their changes. As I touched upon in the previous part of this series, search engines currently take the reverse approach: they crawl the Web constantly, looking for the slightest modification. Don't you think it's silly? Think about the number of Web pages to visit, and imagine the cost of keeping the interval between visits as short as possible. Consequently, it seems difficult to launch a new search engine today. Nevertheless, the advent of the Semantic Web should lead to their multiplication, in a vertical way, as search engines become more and more specialized in specific fields.
Crawling seems to be the "boring" part for search engines, and if they want to stand out from each other it won't be through crawling. The innovation should be in the indexing, ranking, etc. But can we imagine that some day search engines like Google or Yahoo might agree to pool their crawling? Surprisingly, I think so.
But before we go further, we must know what we're talking about. All in all, crawling consists of making a sort of "backup" of the World Wide Web. Somehow, generalist search engines need to own a full copy of the Web. Therefore they need to scan and scan again to get the freshest copy. Worse, vertical search engines have the same problem: even though their favorite domains are limited to a few topics, they still have to examine the whole Web, since the information they are interested in could be anywhere. At least that is the present approach.
How can we improve things? First and foremost, I'm more and more convinced that we need to reverse the process. Search engines should not question Websites; Websites should inform search engines of any changes. Today a Web developer has to set up a robots.txt file (or Sitemaps) to ensure the best possible indexing. Tomorrow he will add a mechanism that informs search engines about any changes that have occurred in his database. This shouldn't be a problem for modern Websites based on the MVC (Model-View-Controller) paradigm: they'd just have to add a "plugin" at the model level. To sum things up, this plugin will be in charge of alerting search engines to any "Create", "Update" or "Delete" action.
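To make the model-level "plugin" a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the NotifyingModel class, the ChangeEvent record and the example.org URL are invented for illustration, and the notify callback simply appends to a list where a real plugin would contact a search engine or relay.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ChangeEvent:
    """Hypothetical change notice: what happened, and to which URL."""
    action: str  # "create", "update" or "delete"
    url: str


class NotifyingModel:
    """Hypothetical in-memory model that reports every change it makes."""

    def __init__(self, base_url: str, notify: Callable[[ChangeEvent], None]):
        self._base_url = base_url.rstrip("/")
        self._notify = notify
        self._rows: Dict[str, dict] = {}

    def save(self, key: str, data: dict) -> None:
        # An existing key means an update, otherwise a create.
        action = "update" if key in self._rows else "create"
        self._rows[key] = data
        self._notify(ChangeEvent(action, f"{self._base_url}/{key}"))

    def delete(self, key: str) -> None:
        # Only report a delete if there was something to delete.
        if key in self._rows:
            del self._rows[key]
            self._notify(ChangeEvent("delete", f"{self._base_url}/{key}"))


# Stand-in for the network: collect the notices in a list.
outbox: List[ChangeEvent] = []
model = NotifyingModel("http://example.org/posts", outbox.append)
model.save("42", {"title": "Hello"})
model.save("42", {"title": "Hello again"})
model.delete("42")

for event in outbox:
    print(event.action, event.url)
```

The point of hooking in at the model level is that the controllers and views never need to know about the notifications.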
In the end, this reversal should enable the development of real-time search engines. Imagine that as you enter your keywords, results appear as the Web changes! By the way, if Websites have to contact search engines, which ones are they going to pick? Are they going to restrict themselves to certain ones? Of course not. Websites should be able to spread their modifications to as many search engines as possible, from the most important to the most specialized, without even having to know them beforehand.
How could this be? I'm thinking about some kind of relays diffusing the "modifications feed" as widely as possible. Let's call them "datahubs" if you want. Datahubs will be linked to each other in a completely decentralized way, and if a Website sends information to a certain datahub, every other datahub will receive the exact same information through a cascade process. Put another way, if I want to create my own datahub, I will only need to connect somewhere, to another datahub, to receive all the changes happening everywhere on the Web. For its part, my datahub will be able to spread the incoming data to other datahubs. Naturally, my datahub will need enough bandwidth to carry all that information.
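The cascade between datahubs could be sketched roughly as follows. This is only one possible reading of the idea, not a specification: the Datahub class, the change identifiers and the "remember what you have already seen" rule are my assumptions, added so that a change does not circulate forever when hubs are linked in a loop.

```python
class Datahub:
    """Hypothetical relay that forwards each change to all of its peers once."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.seen = set()    # identifiers of changes already relayed
        self.received = []   # payloads this hub has accepted

    def connect(self, other):
        # Links are symmetric: each hub relays to the other.
        self.peers.append(other)
        other.peers.append(self)

    def publish(self, change_id, payload, sender=None):
        if change_id in self.seen:
            return  # already relayed, stop the cascade here
        self.seen.add(change_id)
        self.received.append(payload)
        for peer in self.peers:
            if peer is not sender:
                peer.publish(change_id, payload, sender=self)


a, b, c, d = (Datahub(n) for n in "abcd")
a.connect(b)
b.connect(c)
c.connect(a)  # a, b and c form a loop
c.connect(d)  # d only knows c

a.publish("change-1", "post 42 updated")
hubs = [a, b, c, d]
for hub in hubs:
    print(hub.name, hub.received)
```

Note that hub d receives the change although the Website only ever talked to hub a.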
It would be interesting to evaluate the total bandwidth necessary to transfer all the Web modifications in real time, but I think we can estimate that it shouldn't be too high. In fact it should actually be pretty low, and if I had to guess and give an approximate number, I'd say that 1 Gbps would be enough if we stick to textual data! Today we can find hosting companies able to provide this bandwidth for less than $100 a month. Try to figure out the total cost of the constant crawling done by Google, Yahoo and MSN (to name only the main ones) compared to the few dollars necessary to accomplish the same thing with the datahub idea.
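To get a feel for what a 1 Gbps link can carry, here is a back-of-envelope calculation. The 10 KB average size per change notice is purely an assumption for illustration, not a measured figure, so the result says nothing about whether the guess above is right; it only shows what the link would allow.

```python
LINK_BITS_PER_SECOND = 1_000_000_000  # the 1 Gbps guess from the text
AVG_CHANGE_BYTES = 10_000             # assumption: 10 KB of text per change
SECONDS_PER_DAY = 86_400

bytes_per_second = LINK_BITS_PER_SECOND // 8
changes_per_second = bytes_per_second // AVG_CHANGE_BYTES
bytes_per_day = bytes_per_second * SECONDS_PER_DAY
changes_per_day = changes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second:,} bytes per second")
print(f"{changes_per_second:,} changes per second")
print(f"{bytes_per_day / 1e12:.1f} TB per day")
print(f"{changes_per_day:,} changes per day")
```

Under that assumption the link moves 125 MB per second, which is 12,500 change notices per second, or about 10.8 TB and roughly 1.08 billion notices per day.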
Nevertheless, crawling isn't everything. If we wish to create a true search engine, we are going to need to accumulate a very large mass of information in order to perform basic operations such as parsing, indexing, ranking, etc. Consequently, if we want to create a new generalist search engine, we'll still need to think about a huge infrastructure. On the other hand, thanks to the datahub concept, it would actually be very cheap to make vertical search engines specialized in specific fields. Indeed, this would be one of the main features: each datahub could declare the datatypes that it wishes to receive and spread.
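Declaring datatypes could be as simple as a filter at the entrance of each datahub. The sketch below is an assumption about how that might look: the FilteringHub class, the "type" field and the type names are all made up for the example.

```python
class FilteringHub:
    """Hypothetical datahub that only accepts the datatypes it declared."""

    def __init__(self, accepted_types):
        self.accepted_types = set(accepted_types)
        self.inbox = []

    def offer(self, change):
        # Keep the change only if its declared type is one we asked for.
        if change["type"] in self.accepted_types:
            self.inbox.append(change)
            return True
        return False


# A vertical hub interested in recipes only (made-up type names).
recipes_hub = FilteringHub({"recipe"})
feed = [
    {"type": "recipe", "url": "http://example.org/recipes/1"},
    {"type": "blog-post", "url": "http://example.org/posts/42"},
    {"type": "recipe", "url": "http://example.org/recipes/2"},
]
accepted = [recipes_hub.offer(change) for change in feed]
print(accepted)
print([change["url"] for change in recipes_hub.inbox])
```

A vertical search engine plugged into such a hub would only ever see, store and index the slice of the feed it cares about.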
In the end, it appears to me that the necessity for datahubs is obvious, and the potential is so big that I can hardly imagine all the possible applications. But this idea is fairly new to me: I have just started, and I barely know the actual state of research on the matter. Are there any people working on it already? Your feedback is welcome.

Posted by Simon Reinhardt on November 16, 2007 at 1:31pm
The other day I looked at the problem from a very similar view: it occurred to me that lookup of and search for semantic data could actually be done quite well in a peer-to-peer system, and after some googling I saw that there has been some research on that already.
Consider the current ways of lookup and search for RDF resources: if you are given a URI and you see it starts with http: then you'll assume a representation can be retrieved from the associated webserver and that the domain owner has some authority over the resources / identifiers. There might be other webservers containing descriptions of the same resource and you can find them by searching for the URI in a Semantic Web search engine - if it has indexed those descriptions. Alternatively if you don't have URIs but look for certain data (e.g. using certain vocabulary or certain strings), you can use SPARQL endpoints for querying. Still you need some centralised service answering your query. This can be a server containing a certain data set or a server which is specialised in collecting data from a specific field or a server indexing all sorts of data.
If we have learned anything, it is that centralisation isn't the optimal solution for the web. Imagine you have URIs describing products by using GTINs (barcode numbers) and you want to request product information, reviews and prices from all over the web: no-one should have authoritative URIs for those resources. Or you have the URI of a blog post and want to collect all posts on the web which reply to it, be they blog posts, e-mails on mailing lists, forum entries, ... (yay for SIOC!).
Now I don't know how current P2P systems like BitTorrent work, but at uni we once implemented Gnutella and, if I remember correctly, you send a search query off to the net and it hops from node to node, each node decreasing its freshness until it is finally stale and shouldn't be replicated any more, so that one query doesn't spam the whole net. I think the nodes which found matches would send the information about them back to the querying client through a direct connection; the client would then display them and let the user start download connections to the provided IPs.
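The hop-limited flooding described above can be sketched like this. It is a simplified model of the idea from memory, not the actual Gnutella protocol: the flood_query function, the ttl parameter and the toy network are assumptions for illustration, and a real node would also match the query against its local data.

```python
from collections import deque


def flood_query(graph, start, ttl):
    """Return the set of nodes a query reaches before it goes stale.

    graph maps each node to its list of neighbours; ttl is the number
    of hops the query may still travel when it leaves the start node.
    """
    reached = {start}
    queue = deque([(start, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:
            continue  # the query is stale and is not passed on
        for neighbour in graph[node]:
            if neighbour not in reached:
                reached.add(neighbour)
                queue.append((neighbour, remaining - 1))
    return reached


# Toy network: a simple chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(sorted(flood_query(chain, "a", 2)))
print(sorted(flood_query(chain, "a", 3)))
```

With two hops the query sent from a stops at c and never reaches d; with three hops it covers the whole chain.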
Are the requirements for a P2P system for the Semantic Web much different? When I google I rarely look at more than the first two pages, so I don't really care about all results. For semantic searches there might be cases where you really want all results (maybe because they will be scarce anyway because they're so specialised) and queries dying after a few hops will not do. But once there is a match then all that has to be done is send it back: you really could send back the whole resource description immediately and there would be no need for a download process. Alternatively you would send back URLs where the resource descriptions could be looked up. No split downloads of big chunks of data like in file-sharing networks.
I don't know if SPARQL is well suited for that, but having this as a highly distributed alternative to centralised lookups and centralised search engines sounds intriguing to me. The search engines could actually be connected to the P2P net and, as I said, you could still just return URLs which would allow HTTP lookups, so you would get the best of both worlds. And using URNs would be possible. ;-)
Posted by James on November 16, 2007 at 1:51pm
Great article Manuel :)
Google is already working on a system to change the pecking order between Websites and search engines. They filed patents for a system they call Programmable Search Engine:
http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220070038616%22.PGNR.&OS=DN/20070038616&RS=DN/20070038616
(someone should clue the patent office in on prettified URLs)
Essentially Google is planning on moving to a "heart beat" model (similar to Technorati). Rather than PULL, they will rely on us to PUSH. Tied into this new technology is a format (similar to the Sitemaps format) that will allow Webmasters to create and store metadata about pages and about the content of pages. The files you create will be freely available on the Web (so that Google can fetch them), which means anyone can take advantage of the new metadata being created. GRDDL could be used to convert the information in Google's format into RDF.
The patent application also talks about a global knowledge base that Google will be maintaining. This means that when you search for a Nikon 900d (I made that model up, but it's just for the example) the Google search results page will be able to ask you if you're looking for the 5 megapixel version or the 7 megapixel version (assuming the metadata about that camera model has been created and indexed).
Essentially Google is going to help jump-start the Semantic Web by encouraging everyone to create metadata.
To specify the context for your metadata you must apply it to a class (roughly equivalent to ontology classes). You're able to create anything as a new class. Therefore I can, as a Webmaster selling Nikon digital cameras, create a class called DigitalCamera and then create a subclass of DigitalCamera called Nikon900d. Now I can start listing attributes about the camera (number of megapixels, amount of storage space, MSRP, etc).
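As a rough picture of the class and subclass idea, here it is expressed with ordinary Python classes. This is not Google's format, which isn't shown here; the field names are my own, and the camera model and its values are as made up as the Nikon 900d itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DigitalCamera:
    """Hypothetical class holding attributes shared by all digital cameras."""
    megapixels: Optional[float] = None
    storage_mb: Optional[int] = None
    msrp_usd: Optional[float] = None


@dataclass
class Nikon900d(DigitalCamera):
    """Hypothetical subclass for one (made-up) camera model."""
    model: str = "Nikon 900d"


# Describe the 7 megapixel version; unknown attributes stay empty.
camera = Nikon900d(megapixels=7)
print(camera)
print(isinstance(camera, DigitalCamera))
```

Because Nikon900d is a subclass, anything that understands DigitalCamera can read its attributes without knowing the specific model.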
I really have been meaning to write a lengthy article about this since about March... or even earlier. I just haven't gotten around to it. Maybe now is a good time, as people are starting to get familiar with some of the core concepts behind PSE.
Posted by Yihong Ding on November 17, 2007 at 8:52am
Manuel,
A few comments on your idea.
(1) In general, you have a great thought and visionary picture about the future web, which is fantastic. I agree with many of the fundamentals you want to deliver in this article, such as the reversed relationship between searcher and information provider and migration of web structure.
(2) At the center of moving to the next step is a characteristic called "proactivity". I have also mentioned the importance of this term in several of my web evolution articles. When you say that information providers should actively submit updated information and have it properly indexed in search engines, you are calling for this characteristic. But you know, it is not easy to really implement it. It is much harder than what you presented in your post, which just says to let users actively submit. You may have underestimated the technical difficulty of this problem.
The difference between user-submit and search-engine-crawl is that search engines have full control over the content when executing the latter strategy but less control when executing the former. In fact, if you look back at the history of web search a little, you may be surprised to find that the former strategy was actually used earlier than the latter. In the early days of the World Wide Web, search engines such as Yahoo depended very much on users submitting their new sites themselves to keep the index going. Only after the success of the PageRank algorithm did Google finally reverse this trend completely.
Given the current level of technology, Google's strategy is inevitably a winner. Most importantly, Google maintains an objective environment on the Web (at least, this objectivity is nearly the same for everybody except Google itself). Whether Web users are full of wisdom or totally stupid, their publications are handled equally by Google, and the importance of their publications is objectively indexed by page rank. Don't underestimate the importance of this fact. It gives the general public confidence that the World Wide Web is still a fair playground for everybody.
Now this is also the most severe challenge to your proposal. How will your proposed model keep the Web a fair playground, based on current web technologies? A problem in your proposal is that everybody must now start to learn how to "better" submit their work to search engines. You put too heavy a load on the user side, which is not a good thing.
We need some crucial technical breakthroughs before your proposal can be applied.
The Web will gradually move from a centralized model to a decentralized model. I fully agree with this viewpoint. But what I suggest is that you think of things as evolutionary instead of revolutionary. Every new life must first have proper ground to grow in, and the ground is not ready yet. This is my last comment.
Great work and keep on going!
cheers,
-- Yihong