Adam Bosworth reviews the unintuitive lessons of THE WEB in
ACM Queue – Learning from THE WEB
His second point is
2. It is worth making things simple enough that one can harness Moore’s law in parallel. This means that a system can scale more or less linearly both to handle the volume of the requests coming in and, ideally, to handle the complexity. For example, Google is able to handle huge numbers of requests that ask for quite complex work (find all the documents out of billions that are “most relevant” to the following words) largely because it can scale linearly with respect to both the size of the underlying corpus and the volume of queries. DNS is, of course, the best example of this, but so is caching, which brings me to point 3.
3. Have databases enabled people to harness Moore’s law in parallel? This would mean that databases could scale more or less linearly to handle both the volume of the requests coming in and even the complexity. The answer is no. Things like ORDER BY, joins, subqueries, and many others make it almost impossible to push the query logic down to an arbitrary number of leaf nodes and simply sort/merge/aggregate the results. The easier way to limit queries to avoid this would be to limit all predicates to ones that can be computed against a single row at a time, at least where efficiency and scale are paramount.
My thesis adviser Phil Windley poses the following question: Does XQuery suffer from these same limitation?
If you look at XQuery as just a SQL replacement with support for XML data types, they yes it definitely suffers from the same imitation, namely query complexity that cannot be parallelized.
XQuery is not just a SQL replacement. XQuery was designed with several key points.
XQuery can be seen as a superset of XSLT functionality. As results come out of a google search style data store, XQuery can be used to format or transform results much like XSLT. XQuery makes sorting transformed results trivial, while sorting in XSLT is unnecessarily difficult to implement.
XQuery is computationally complete (It’s a Turing complete functional language.) and supports first order, higher order, and user defined functions. Google style queries are executed using a wide range of underlying utility and specialized functions. Theoretically, XQuery could be used as the glue or processing code that calls the low level search functions. XQuery or a derivative dialect could be the standard for describing a Google style search process or query. A new style language is needed to describe the process of executing embarrassingly parallel queries or Google style queries.
Off hand, BPEL is an example of such a new style orchestration language that drives parallel processes.
Because of it’s focus on businesses operations and interdependence’s, BPEL is not a solution of Google style attention or relevancy queries. But in many ways, BPEL is step forward, an attempt to declare the problem and let a specialized engines handle the work.
In some ways I think this is what Adam Bosworth is looking for, a way to describe and execute a huge quantity of unrelated relevance queries against all available knowledge.
I believe that making the connect between computation and data more seamless is a step in the right direction. The right answer will be highly declarative. And in that respect it should be somewhat like SQL, minus non-parallel or barrier operations such as joins, sub selects, huge ordering operations, etc. SQL lacks computational power, user defined functions, etc to even be a starting point down this road. Adam himself has said that he doesn’t believe that the relational model applies, at least at the front end.
This is where XQuery brings power. XQuery is functional in nature and can used in a purely functional manner. What does this mean? First no side effects. Every operation is repeatable. Second functions can be easily composed without worrying unexplainable behavior. XQuery is also declarative. XQuery allows the programmer to describe complex problems as higher order processes without getting bogged down in the ity gritty details of implementation.
It is all about abstraction. Assembly language has become largely declarative as we abstract away the fine details of microprocessor computation. Assembly language tell the computer what to do, not how to do it. Programmers do generally specify which processor their code is run on, when or how data navigates through the memory/caching pipeline, which execution unit does the work or in which order execution will occur. Don’t get me wrong, there are experts who do spend all day every day worrying about the microprocessor execution. But the powerful abstractions of programming languages allows the majority of software to developers to focus attention up the stack.
I like Linux Torvalds latest comment has some application
“I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home.”
Adam Boswoth want the same story for data query.
We should stop playing games with complex object-relational mappings and look towards more generalizable and scalable solutions.
XQuery is not the solution to Adam’s problem, nor is it a silver bullet answer to Phil’s questions, but I think It pushes us in the right direction. It forces us to start to reconcile and solve the perceived differences between object orientation, relational data and XML.
XQuery shows that data query and computation can be tightly coupled to enhance programmer productivity without giving up power or optimization in either realm.
Bonus Links:
The Gillmor Gang with guest Adam Bosworth
