Wednesday, September 12, 2012

Mupd8 – The @WalmartLabs Real-time Platform



In recent years, the world has seen an explosive growth in the volume of real-time data streams. Once the preserve of stock markets and day traders, real-time data is now ubiquitously available to consumers through popular services such as Twitter and Facebook. With the availability and growth of real-time data comes the inevitable problem of real-time data overload, and the need for systems that can separate the signal from the noise.

As we began working with firehoses from various social media sites, we recognized the need for a general-purpose real-time stream processing platform that could address the issues of scale and performance -- and enable our stream processing applications to focus on the quality of their generated content. “Mupd8” came into existence to fulfill that need. At the highest level, Mupd8 does for Fast Data, what Hadoop and the MapReduce computation model do for Big Data. Mupd8 was formerly known as Muppet within @WalmartLabs and in our peer-reviewed publications.


A MapUpdate application is a workflow of map and update operators.

With Mupd8, you can easily write applications to process these (or your own) firehoses using the MapUpdate framework, a simple way to express streaming computation. By writing your application as a collection of customized operators map and update, you can focus your programming on your application logic and let Mupd8 handle the load-and data-distribution across multiple CPU cores and machines for you. For example, an application can be written to subscribe to the Twitter firehose of every tweet written; such an application can analyze the tweets to determine Twitter's most influential users, or identify suddenly prominent events as they occur. Alternatively, an application can be written to subscribe to a log of all user activity on a Web site; such an application can detect service problems users face as they occur, or compute suggestions for users' next steps based on up-to-the-moment activity.

Mupd8 enables @WalmartLabs to leverage Walmart’s business data and product taxonomy to extract valuable content in real-time from the largest streams of social media content. As an example, we extract and collect videos, images, location information and status updates from these massive flowing streams. We then categorize them and track such information as key influencers, relevant videos, images and web pages for each product.

To be effective at separating the signal from the noise in real-time, Mupd8 not only has to provide a simple computation model for stream processing applications, but also has to be highly available since any downtime causes loss of real-time information. This, in turn, requires persistence of the rapidly changing content generated by the applications. Low latency -- the time between receiving an update and updating the affected nodes -- is a key requirement, since the half-life of social media content is short. Lastly, Mupd8 has to be scalable across three dimensions -- increasing stream rates, the size of the taxonomy and the complexity of supported applications.

To address these requirements for high availability, low latency and scalability, the current Mupd8 architecture leverages open-source solutions, including Cassandra, which has proven to be a high-performance and scalable key-value store (though Mupd8 adds a caching layer to minimize the impact of its higher network read latency).

The @WalmartLabs Mupd8 platform has already supported more than a dozen sophisticated stream applications processing over 300 million status updates per day, gathering real-time information on our product taxonomy. Learn more about our experience (see our VLDB 2012 paper) and check out the newest version, available under the Apache License 2.0, for yourself at http://github.com/walmartlabs/mupd8. Get started on your application with "Starting a New Application in MUPD8," available in the Mupd8 source tree and http://walmartlabs.github.com/mupd8/quickstart.html.