What to do when you have a #bigdata project?
- Analytics and analysis have been around for many years. Enterprises have been pulling data out of large monolithic systems called data warehouses for just as long. Once retrieved, the data went through phases of analysis for different purposes (sales forecasting, weather prediction, medical analytics, and so forth), and along the way the term "analytics" itself went through a paradigm shift: it is now viewed more from a predictive-analytics standpoint. Shifts like this take time. So predictive analytics is not unique to big data; rather, with big data a new kind of analytics is evolving.
- A proliferation of big data tools and services has begun. "Data warehouse as a service" was being talked about, but with data security still not embracing "the openness" for obvious reasons, we have not yet seen much success except from giants such as Amazon, Google, EMC, Rackspace, and the like. When talking about big data, there is a secondary term that needs to be considered alongside it: the existing analytical data. Please review point number 1 on predictive analytics.
- Various tools that have appeared since the inception of Hadoop provide some real value adds. Pig, Hive, Google's NoSQL offerings, Cassandra, MongoDB, Sqoop, Gora, HBase, and Avro, combined with machine-learning systems, provide the pass-throughs to connect, search, filter, retrieve, and manipulate data for further analytics. Also check out the big data offerings from large vendors such as Oracle.
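To make the connect, search, filter, and retrieve pattern concrete, here is a minimal sketch of it in plain Python. This is not how Pig or Hive are actually invoked; it only mimics, with generators, the kind of declarative filter-and-project pipeline those tools express, and the record fields (`region`, `sales`) are made-up example data.

```python
# Hypothetical records standing in for rows pulled from a data store.
records = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 340},
    {"region": "east", "sales": 75},
]

def search(rows, field, value):
    """Keep only rows whose field matches the value (like a WHERE clause)."""
    return (r for r in rows if r[field] == value)

def project(rows, field):
    """Pull out a single column from each row (like a SELECT clause)."""
    return (r[field] for r in rows)

# Equivalent in spirit to: SELECT sales FROM records WHERE region = 'east'
east_sales = list(project(search(records, "region", "east"), "sales"))
print(east_sales)  # [120, 75]
```

Generators keep the pipeline lazy, which is the same instinct behind the tools above: transformations are described first and data flows through only when results are demanded.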
- Some have tried to explain the differences between business intelligence and big data and to replace the former with the latter. While the attempt is useful, the two are complementary. Many fail to realize that a huge amount of analytics work has already been done, so big data work need not necessarily be heavy lifting from scratch.
- While choosing tools, the existing infrastructure must be studied. The study should focus on the reports obtained from business intelligence, which can provide insights into data patterns.
- Reports are built on existing data, but they provide insights. Peaks and patterns at certain intervals can be detected by running complex, equation-based queries over that data. Use these perceptions for trials; a perception here can turn out to be reality.
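As a hedged illustration of detecting peak periods in report data, the sketch below flags intervals whose counts exceed the mean by one standard deviation. The threshold choice and the hourly counts are both arbitrary assumptions for the example, not a rule from this post.

```python
# Illustrative only: made-up hourly event counts from a report.
import statistics

hourly_counts = [100, 98, 110, 105, 500, 102, 99, 480, 101, 97]

# Flag intervals more than one standard deviation above the mean.
# (An illustrative threshold; real peak detection would be tuned to the data.)
mean = statistics.mean(hourly_counts)
threshold = mean + statistics.stdev(hourly_counts)

peaks = [(hour, count) for hour, count in enumerate(hourly_counts)
         if count > threshold]
print(peaks)  # [(4, 500), (7, 480)]
```

Even a simple rule like this, run over existing BI report data, can surface the peak intervals worth investigating in a trial.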
- The first challenge today, given the variety of tools within the enterprise and in the open-source world, is choosing the tools. While that part is relatively low-stress, once the choice is made, deployment and infrastructure setup become the first major hurdle, because getting a connection to the valuable data, and getting it in massive volumes this time, is genuinely difficult. This is where, for data warehouse as a service, enterprises will lean toward the giants. Please see point number 2.
- A crucial part of big data analytics is creating test data during development. When creating test data, try NOT to replicate data points, and avoid depleting them. Doing otherwise can create blocks later, especially when using PFP/Mahout for further analysis of the data. Who knows what can help.
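One simple way to generate test data without replicated points is sampling without replacement. The sketch below is a minimal, assumed example: the ID space, the sample size, and the seed are all arbitrary choices for illustration.

```python
import random

random.seed(42)  # fixed seed so test runs are reproducible

# Draw 1,000 distinct user IDs from a space of 1,000,000.
# random.sample draws WITHOUT replacement, so no data point is duplicated.
user_ids = random.sample(range(1_000_000), k=1000)

assert len(user_ids) == len(set(user_ids))  # every point is unique
print(len(user_ids))  # 1000
```

Sampling from a large ID space also avoids "depleting" the pool: with a million candidates, a thousand-point sample leaves plenty of headroom for further rounds of test data.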
- Furthermore, the test system must be thought of as more than a single machine. Parallel processing most often involves multiple nodes, especially when processing terabytes of data. In either case, uploading from data repositories involves multiple systems anyway, and that is part of what makes it big data.
- Before choosing a tool, understand what it actually does with respect to big data. There are many tools out there that primarily do something else; with the big data buzzword in action, many vendors bolt on big data capabilities, which is good, but such a tool might not be the one you are looking for. Once chosen, do a proof of concept. Engage your engineers shoulder to shoulder with big data people. It is always good to collaborate. This is not a one person's meal, right?
Thoughts beyond usual thinking.
Many are the things that man seeing must understand. Not seeing, how shall he know what lies in the hand of time to come? – Sophocles