The USPTO has awarded Google a software method patent covering distributed MapReduce, the parallel processing strategy that underpins the search giant’s own infrastructure. If Google chooses to enforce the patent aggressively, it could have significant implications for open source software projects that use the technique, including the Apache Foundation’s popular Hadoop framework.
“Map” and “reduce” are functional programming primitives that have been used in software development for decades. A “map” operation allows you to apply a function to every item in a sequence, returning a sequence of equal size with the processed values. A “reduce” operation, also called “fold,” accumulates the contents of a sequence into a single return value by performing a function that combines each item in the sequence with the return value of the previous iteration.
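The two primitives are easiest to see in a few lines of code. A minimal sketch in Python, using the standard library’s `map` and `functools.reduce`:

```python
from functools import reduce

nums = [1, 2, 3, 4]

# "map": apply a function to every item, yielding a sequence of equal size.
squares = list(map(lambda x: x * x, nums))

# "reduce" (a.k.a. "fold"): combine each item with the accumulated result
# of the previous iteration, collapsing the sequence to a single value.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```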
Google’s MapReduce framework is roughly based on those concepts. A series of data elements is processed in a map operation, then combined at the end with a reduce operation to produce the finished output. The advantage of partitioning a workload this way is that it’s extremely conducive to parallelization. Each discrete unit of data in the series can be processed individually and combined at the end, making it possible to spread the workload across multiple processors or computers. It’s a fairly elegant approach to scalable concurrency, one that offers efficiency regardless of whether your environment is a single multicore processor or a massive grid in a data center.
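The parallel structure described above can be sketched with the canonical MapReduce demo, a word count. This is an illustrative toy, not Google’s implementation: each input chunk is mapped to a partial count on a worker, and the partials are folded together at the end. (A thread pool stands in for the multi-machine case; the helper names `count_words` and `merge` are made up for this example.)

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(chunk):
    """Map step: count the words in one slice of the input."""
    return Counter(chunk.split())

def merge(a, b):
    """Reduce step: combine two partial counts into one."""
    a.update(b)
    return a

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each chunk is processed independently, so the work can be
# spread across threads, processes, or machines.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_words, lines))

# Reduce phase: fold the intermediate results into the final output.
totals = reduce(merge, partials, Counter())

print(totals["the"])  # 3
```

Because no map task depends on any other, the same code scales from a single multicore machine to a cluster simply by changing where the map tasks run.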
Google published a paper in 2004 describing how it uses MapReduce. The paper attracted considerable interest and paved the way for the MapReduce pattern to become a common parallelization technique. One of the best-known third-party implementations of MapReduce for distributed computing is Hadoop, an open source Apache project now used by Yahoo, Amazon, IBM, Facebook, Rackspace, Hulu, the New York Times, and a growing number of other companies.
Google’s patent on MapReduce could potentially pose a problem for those using third-party open source implementations. Patent #7,650,331, which was granted to Google on Tuesday, defines a system and method for efficient large-scale data processing:
A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
I suspect Google applied for the patent defensively, as many software firms do, and I don’t expect them to go after anyone. If they hadn’t filed, someone else likely would have. It shows us another business distortion and resource waste brought about by intellectual property. Nothing in this patent is all that impressive or new, and even if it were, that wouldn’t justify an artificial monopoly privilege enforced through aggression, and the threat thereof, by another monopoly.
This patent is important both because distributed MapReduce has become a popular way of processing data and because it personally affects the work I’m currently doing. I hope that in this regard Google sticks with its “don’t be evil” slogan and simply sits on the patent.