We are wrapping up a big data POC implementation project for a major retail organization, where we were also tasked with reviewing security and compliance issues for the same client. To the question "where is the money residing and buried within enterprises?", we ask a follow-up question: what do you have to offer in #bigdata? Our approach to the problem has always been from the architecture standpoint. Architecture matters.

There is a lot of data within enterprises. Especially with sensory data moving constantly and absorbing as much information as it can, enterprises are challenged with storing information that may make sense eventually, if not today. This by itself presents a great challenge. Data can be completely useless to the enterprise, or, at the same time, it can be very useful. Storing unwanted data creates overhead, especially when we talk about thousands of employees pinging their devices daily, on a minute-by-minute basis. On top of that, imagine the movement of these subjects transmitting coordinates to base stations. All of this opens doors and paves the way for tremendous possibilities.

We recently delivered a major big data project for a retail company. Although the company had completed phase I of its big data project, what was being sought was the second phase, which called for fraud detection. Big data plays a BIG role here. We began with a proof of concept. For a proof of concept with big data, unlike traditional application development, one huge factor that comes into play is the data itself: the main focus is manipulating big data. When we think of data, we now think of containers beyond ordinary database systems; NoSQL comes into play. We also think about data transfers, because the data is not only residing as unstructured or semi-structured data, it is residing in huge chunks in outside systems, in different security zones or different domain boundaries. For the data to be manipulated, it needs to be transferred closer to the application that uses it.

The application that uses the data is not a simple linear application anymore. A non-linear application spawns several threads across processes, takes over memory segments, and executes over the data in parallel. For this reason, the design and flow of these systems need to be carefully thought out, and they surpass traditional design methodologies. There will be different paths of flow for each process, and decision trees need to be carefully designed to deal with the outcomes. Responsiveness is another factor; although reactive programming can be kept aside, it is a good idea to think about it during the design phase.
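To make that parallel, non-linear execution style concrete, here is a minimal sketch, assuming Python and its standard multiprocessing module; the function and data names are illustrative, not from the client's system.

```python
# A minimal sketch of parallel, chunk-wise execution using Python's
# standard multiprocessing module. All names here (score_chunk,
# load_chunks) are illustrative, not from the client's project.
from multiprocessing import Pool

def score_chunk(chunk):
    # Placeholder for real per-record work (e.g., a fraud score).
    return [record * 2 for record in chunk]

def load_chunks(n=1_000_000, step=100_000):
    # Stand-in for moving data closer to the application in chunks.
    data = range(n)
    return [list(data[i:i + step]) for i in range(0, n, step)]

if __name__ == "__main__":
    chunks = load_chunks()
    with Pool() as pool:                         # one worker per CPU core
        results = pool.map(score_chunk, chunks)  # parallel, non-linear flow
    print(sum(len(r) for r in results), "records processed")
```

Here are some excerpts of the big data implementation for the retailer. Here is where the money is.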
- Number of downstream systems touching the main application: 28
- Repositories used in total: 4
- NoSQL – MongoDB (follow me on Twitter, @sunnymenon, to learn how to work with MongoDB and how to scale it); a minimal sketch follows below
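Since MongoDB carried the NoSQL side, here is a minimal sketch of working with it from Python, assuming the pymongo driver; the connection string, database, and collection names are assumptions for illustration, not the client's schema.

```python
# A minimal MongoDB sketch using the pymongo driver. The connection
# string, database, and collection names are assumptions for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["retail_poc"]               # hypothetical database
transactions = db["transactions"]       # hypothetical collection

# Insert a semi-structured document; no schema migration needed.
transactions.insert_one({"store": 12, "amount": 89.99, "flags": ["new_device"]})

# A secondary index keeps reads fast as the collection scales out.
transactions.create_index("store")
for doc in transactions.find({"store": 12}).limit(5):
    print(doc)
```

Scaling beyond a single node then becomes a question of sharding on a well-chosen key, but that is a topic for another post.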
Big data stack – Cloudera, Hortonworks, DataStax, MongoDB, and a set of other tools, some being evaluated and others being discussed with vendors, hence the floating of clones. The client will be deciding soon, and part of our work is to help them choose. The POC utilizes some of these vendor platforms, but we have stated the need to drill down. It is important that one of the vendors be alongside on the big data realization path. We will talk more.
OUR STACK: Apache Hadoop on Windows. Big data made easy. Talk to us to learn more about OUR STACK, the Apache Hadoop stack on Windows built by us developers. Easy to roll out, difficult to neglect. No IT challenges. No questions asked. Nothing to sign up for.
- Team size: 5 people
- Began with discussions with stakeholders: 12 hours across different days. A total of 35 people were interviewed, including CxOs, engineering managers, directors, developers, IT operations staff, network sysadmins, database sysadmins, data analysts, architects, and consultants.
- The primary layout of the composite was defined, with an 8-hour presentation of the layout for the POC. The target was a known application where the major part of the data resides. The strategy was to utilize that data, look at the analytical data, and at the same time pull in "some big data" from unstructured sources outside, thereby "proving" accessibility to voluminous data, retrieval functionality, streaming, etc.
- The messaging layer played "big" time. Asynchronous messaging of big data has a different meaning. Kafka can be considered, but we found it challenging to get information: community strength is still growing, and Stack Overflow is NOT overflowing. Same with Storm. "One little pig" (Pig) can be genuinely useful, and with Hive is where the honey might be. With all the tools combined, the assembly can begin. Call it the ORACLE way: engineered systems. (A Kafka sketch appears after this list.)
- Visualization is the key. Go beyond clustering and graphs to new MEANS?
- Interactivity is another.
- Equations come to the front end.
- Probability theory is the favorite tool of a data scientist. (A toy scoring sketch appears after this list.)
- Enterprises should go beyond probabilities.
- Security and compliance audit. We didn't touch on single sign-on. Look, Oracle engineered systems are a way to go, but will Oracle survive the greediness of innovation? Or will it survive the bigdata-base in the cloud and NoSQL in the cloud?
- Clustering algorithms may not make much sense where the randomness of the data is too disparate, and K-means suitability may not be assessable. (See the K-means sketch after this list.)
- Pure solution approach: credit from the CEO in a direct email, and a quote about us in the brown-bag meeting.
- TOTAL TIME TAKEN: 8 weeks.
- Evaluated result: POC, inventory list, and portfolio. Data visualization modules for actual usage; code and training provided to business analysts. Excel and R.
- Total charged to client: <<ASK ME>>. This is the total cost of the big data project. I wouldn't recommend that enterprises or companies negotiate lower rates for a big data project. Look for the gains in technology that vendors bring in, for the knowledge of your enterprise infrastructure they bring in, for an appropriate deployment model, and for how the solution serves a specific need. While cost savings are important, let them be a secondary aspect in the beginning stages. Invest in the foundation; reducing recurring costs should be the principle as far as big data is concerned.
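On the messaging bullet above: a minimal sketch of asynchronous publishing to Kafka, assuming the kafka-python client; the broker address and the pos-events topic are assumptions for illustration.

```python
# A minimal asynchronous-publish sketch using the kafka-python client.
# The broker address and topic name are assumptions for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# send() is asynchronous: it returns a future and batches behind the
# scenes, which is what asynchronous messaging of big data looks like here.
future = producer.send("pos-events", {"store": 12, "amount": 89.99})
meta = future.get(timeout=10)   # block only when delivery must be confirmed
print(meta.topic, meta.partition, meta.offset)
producer.flush()
```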
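On the clustering bullet: a minimal K-means sketch, assuming scikit-learn and synthetic data; watching inertia across values of k is one rough way to judge whether K-means fits the data at all.

```python
# A minimal K-means sketch with scikit-learn on synthetic features.
# If inertia barely improves as k grows, the data may be too disparate
# and random for clustering to make sense, which is the concern above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.random((500, 2))    # synthetic (amount, hour-of-day) features

for k in (2, 4, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.2f}")
```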
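And on probability as the data scientist's tool: a toy, naive Bayes style fraud score in plain Python; every rate below is invented for illustration, not taken from the engagement.

```python
# A toy probabilistic fraud score: combine per-feature likelihood ratios,
# naive Bayes style. Every rate below is invented for illustration.
from math import exp, log

# Assumed P(feature | fraud) / P(feature | legit) ratios.
likelihood_ratios = {"new_device": 4.0, "foreign_ip": 6.0, "odd_hour": 2.5}
prior_odds = 0.01 / 0.99    # assumed 1% base rate of fraud

def fraud_probability(features):
    # Posterior log-odds under a (naive) independence assumption.
    log_odds = log(prior_odds)
    for f in features:
        log_odds += log(likelihood_ratios.get(f, 1.0))
    return 1 / (1 + exp(-log_odds))

txn = ["new_device", "odd_hour"]
print(f"P(fraud | {txn}) = {fraud_probability(txn):.3f}")
```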
Should Cloudera or Hortonworks be your enterprise big data architecture? What about DataStax? Where has Splunk gone? Ask that question again: should Cloudera or Hortonworks be your enterprise big data architecture, or should it be MapR, or should you engage with Amazon?
Many users ask if any single product can provide all the features an enterprise needs to deal with big data. According to a whitepaper released by Computer Sciences Corporation (CSC), 2014 will leave us with two major players in the market; furthermore, they say the others will either be acquired or make an exit. While the big data vendor wars appear to be moving in the direction CSC depicts, we still have to wait and see how the need will take shape within enterprises. Meaning: what will enterprises try to achieve? Will they rely only on existing analytics and business intelligence and use big data as just another component? Or will they move beyond ordinary predictive analytics and elevate themselves to real-time predictions, converting those analytics into actionable items? Forget business intelligence and reporting... think predictive.
~ "We do not learn; and what we call learning is only a process of recollection." (Plato)
Hey, y'all have a great Friday, Saturday & Sunday, all right?