The public appearance of Google’s BigQuery makes the decision of where to go for Big Data analytics even more complicated. How big is “big”, really? Can one hundred thousand records carrying large data such as images, or even one million records, be called big? And as many have asked: how big can a table be? How many fields can it hold?
When Oracle first appeared with Version 2.0 as its first database release (there never was a version 1), it was primarily meant to store unstructured data in a meaningful form. The CIA’s interest in software that could collect global information made it compelling to release Oracle and make the software more rigid. It was not clear to anyone then, it seems, that there would one day be millions upon millions of records that needed to be analyzed and, perhaps, made sense of. Who could have imagined the amount of data that would accumulate over the course of time: weather-related information, for instance, or medical data, product purchases, credit card usage, travel and so on. First it was a storage problem: how much can we store? Then things changed. More storage devices evolved, and they evolved cheap. Now there is a much bigger problem: how can we make sense of what has been stored?
Demystifying the noise in data is a good thought. But at this point the concept itself has created plenty of “noise” in the equation. (Hahah, a small laugh.) The truth is out there!
BigQuery is, simply put, an analysis tool for Big Data. A huge amount of content resides within Google’s walls: 60 hours of video uploaded every minute, 100 million gigabytes in the search index and, importantly, 425 million Gmail accounts, all producing and injecting data into Google’s thousands of servers across the world. This repository can be tapped with result-oriented queries in the discovery process. This is BigQuery.
While discovery is important, the time taken to retrieve results and aid that discovery is just as important. This is different from the batch-oriented queries run by Hadoop; and “different” doesn’t mean “bad”. It is simply different.
BigQuery seems to have evolved from an internal system called Dremel. Dremel has now been externalized to run queries on big data sets, and in that externalized form it is called BigQuery.
What is interesting is that Dremel, the internal BigQuery of Google, relies on what they call a full table scan. This is supposedly done to span the search across hundreds and hundreds of tables residing on servers everywhere. According to what BigQuery experts generally say, this scan is far better than running a search on an indexed RDBMS. Hmmmm!
An example of such a query would be: what are the different applications running on Google’s servers for which a type is set to “Something”, or something similar to that question.
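A query of that kind might be sketched in BigQuery’s SQL-like dialect as follows. The table and field names here are purely hypothetical, invented for illustration; Google’s internal schemas are not public:

```sql
-- Purely hypothetical table and field names, for illustration only.
-- Counts how many instances of each application have type = "Something".
SELECT application_name, COUNT(*) AS instances
FROM [ops.server_inventory]
WHERE type = 'Something'
GROUP BY application_name;
```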
Google would have used this tool extensively across its own servers. Given the huge content within its walls, the externalized usage will be interesting to watch. The question still remains: why wouldn’t an RDBMS suffice? Don’t indexing, statistics updates and optimization techniques already help databases retrieve results fast? Note, too, that BigQuery data sets have a limitation of 64KB for total field lengths per query.
BigQuery, internally Dremel, was using SQL-“like” statements to pull data for analysis, and it IS FASTER. But then, this is due to the underlying format of the data that has to be pulled for analytical purposes.
So what is BigQuery? BigQuery is all about Big Data available within Google’s Cloud Storage, which is loaded into BigQuery and queried for results.
With the API, one can embed BigQuery within applications to let users fire queries: SQL-like statements run against a process that does a full table scan, as opposed to an indexed database. Supported scripts/languages include Java, Python, .NET, JavaScript and also Ruby. BigQuery comes with a web-based UI, and the pricing, as is understood, would be charged on a “per query” basis. Examples of a BigQuery run would be to look at things such as the number of page views on Wikipedia in a month, and perhaps on which subjects; or to take a look at all the known works of Shakespeare. Interestingly, the total page views for a month on Wikipedia amount to about 6 terabytes of uncompressed data, and BigQuery runs against that number, as heard from the horse’s mouth.
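The Wikipedia example might look roughly like the sketch below. The table name, field names and date value here are assumptions for illustration; the actual public dataset on BigQuery may use a different name and schema:

```sql
-- Sketch only: table and field names are assumed, not the
-- actual public dataset's schema.
-- Top ten Wikipedia pages by views in a given month.
SELECT title, SUM(views) AS total_views
FROM [mydataset.pageviews]
WHERE month = '2012-05'
GROUP BY title
ORDER BY total_views DESC
LIMIT 10;
```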
A small “BigQ” I ran from the examples was to find the biggest work of Shakespeare, biggest in terms of the maximum number of words. The query returned “Hamlet”, with 32,446 words “topping” the 42 published works of Shakespeare.
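A query along these lines can be sketched against BigQuery’s public Shakespeare sample table, which keeps one row per distinct word per work, with `word`, `word_count` and `corpus` fields (the bracketed legacy table syntax reflects the BigQuery dialect of the time; treat the exact names as a sketch, not gospel):

```sql
-- Sum the per-word counts for each work (corpus) and
-- keep the single largest one.
SELECT corpus, SUM(word_count) AS total_words
FROM [publicdata:samples.shakespeare]
GROUP BY corpus
ORDER BY total_words DESC
LIMIT 1;
```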
What more should I see..
Many are the things that man seeing must understand. Not seeing, how shall he know what lies in the hand of time to come? – Sophocles