Parallel and sequential
One interesting feature of the new Stream API is that it doesn’t require to operations to be either parallel or sequential from beginning till the end. It is possible to start consuming the data concurrently, then switch to sequential processing and back at any point in the flow:
The cool part here is that the concurrent part of data processing flow will manage itself automatically, without requiring us to deal with the concurrency issues.
Stream<T> stream = collection.stream();
Here's an example from the package Javadocs.
int sumOfWeights = blocks.stream().filter(b -> b.getColor() == RED)
.mapToInt(b -> b.getWeight())
.sum();
Here we use a blocks Collection as a source for a stream, and then
perform a filter-map-reduce on the stream to obtain the sum of the weights of
the red blocks.
Stream can be infinite and stateful. They can be sequential or
parallel. When working with streams, you first obtain a stream from some
source, perform one or more intermediate operations, and then perform one final
terminal operation. Intermediate operations include filter, map, flatMap, peel,
distinct, sorted, limit, and substream. Terminal operations include forEach,
toArray, reduce, collect, min, max, count, anyMatch, allMatch, noneMatch,
findFirst, and findAny. One very useful utility class is java.util.stream.Collectors. It implements various reduction
operations, such as converting streams into collections, and aggregating
elements.
- Intermediate - An intermediate operation keeps the stream open and allows further operations to follow. The
filter
andmap
methods in the example above are intermediate operations. The return type of these methods is Stream; they return the current stream to allow chaining of more operations. - Terminal - A terminal operation must be the final operation invoked on a stream. Once a terminal operation is invoked, the stream is "consumed" and is no longer usable. The
sum
method in the example above is a terminal operation.
Usually, dealing with a stream will involve these steps:
- Obtain a stream from some source.
- Perform one or more intermediate operations.
- Perform one terminal operation.
It's likely that you'll want to perform all those steps within one method. That way, you know the properties of the source and the stream and can ensure that it's used properly. You probably don't want to accept arbitrary
Stream<T>
instances as input to your method because they may have properties you're ill-equipped to deal with, such as being parallel or infinite.
There are a couple more general properties of stream operations to consider:
- Stateful - A stateful operation imposes some new property on the stream, such as uniqueness of elements, or a maximum number of elements, or ensuring that the elements are consumed in sorted fashion. These are typically more expensive than stateless intermediate operations.
- Short-circuiting - A short-circuiting operation potentially allows processing of a stream to stop early without examining all the elements. This is an especially desirable property when dealing with infinite streams; if none of the operations being invoked on a stream are short-circuiting, then the code may never terminate.
Here are short, general descriptions for each Stream method. See the javadocs for more thorough explanations. Links are provided below for each overloaded form of the operation.
Intermediate operations:
- filter 1 - Exclude all elements that don't match a Predicate.
- map 1 2 3 4 - Perform a one-to-one transformation of elements using a Function.
- flatMap 1 2 3 4 - Transform each element into zero or more elements by way of another Stream.
- peek 1 - Perform some action on each element as it is encountered. Primarily useful for debugging.
- distinct 1 - Exclude all duplicate elements according to their .equals behavior. This is a stateful operation.
- sorted 1 2 - Ensure that stream elements in subsequent operations are encountered according to the order imposed by a Comparator. This is a stateful operation.
- limit 1 - Ensure that subsequent operations only see up to a maximum number of elements. This is a stateful, short-circuiting operation.
- substream 1 2 - Ensure that subsequent operations only see a range (by index) of elements. Like String.substring except for streams. There are two forms, one with a begin index and one with an end index as well. Both are stateful operations, and the form with an end index is also a short-circuiting operation.
Terminal operations:
- forEach 1 - Perform some action for each element in the stream.
- toArray 1 2 - Dump the elements in the stream to an array.
- reduce 1 2 3 - Combine the stream elements into one using a BinaryOperator.
- collect 1 2 - Dump the elements in the stream into some container, such as a Collection or Map.
- min 1 - Find the minimum element of the stream according to a Comparator.
- max 1 - Find the maximum element of the stream according to a Comparator.
- count 1 - Find the number of elements in the stream.
- anyMatch 1 - Find out whether at least one of the elements in the stream matches a Predicate. This is a short-circuiting operation.
- allMatch 1 - Find out whether every element in the stream matches a Predicate. This is a short-circuiting operation.
- noneMatch 1 - Find out whether zero elements in the stream match a Predicate. This is a short-circuiting operation.
- findFirst 1 - Find the first element in the stream. This is a short-circuiting operation.
- findAny 1 - Find any element in the stream, which may be cheaper than findFirst for some streams. This is a short-circuiting operation.
As noted in the javadocs, intermediate operations are lazy. Only a terminal operation will start the processing of stream elements. At that point, no matter how many intermediate operations were included, the elements are then consumed in (usually, but not quite always) a single pass. (Stateful operations such as sorted() and distinct() may require a second pass over the elements.)
Streams try their best to do as little work as possible. There are micro-optimizations such as eliding a sorted() operation when it can determine the elements are already in order. In operations that include limit(x) or substream(x,y), a stream can sometimes avoid performing intermediate map operations on the elements it knows aren't necessary to determine the result. I'm not going to be able to do the implementation justice here; it's clever in lots of small but significant ways, and it's still improving.
Returning to the concept of parallel streams, it's important to note that parallelism is not free. It's not free from a performance standpoint, and you can't simply swap out a sequential stream for a parallel one and expect the results to be identical without further thought. There are properties to consider about your stream, its operations, and the destination for its data before you can (or should) parallelize a stream. For instance: Does encounter order matter to me? Are my functions stateless? Is my stream large enough and are my operations complex enough to make parallelism worthwhile?
There are primitive-specialized versions of Stream for ints, longs, and doubles:
One can convert back and forth between an object stream and a primitive stream using the primitive-specialized map and flatMap functions, among others. To give a few contrived examples:
List<String> strings = Arrays.asList("a", "b", "c"); strings.stream() // Stream<String> .mapToInt(String::length) // IntStream .longs() // LongStream .mapToDouble(x -> x / 10.0) // DoubleStream .boxed() // Stream<Double> .mapToLong(x -> 1L) // LongStream .mapToObj(x -> "") // Stream<String> ...
The primitive streams also provide methods for obtaining basic numeric statistics about the stream as a data structure. You can find the count, sum, min, max, and mean of the elements all from one terminal operation.
There are not primitive versions for the rest of the primitive types because it would have required an unacceptable amount of bloat in the JDK. IntStream, LongStream, and DoubleStream were deemed useful enough to include, and streams of other numeric primitives can represented using these three via widening primitive conversion.
One of the most confusing, intricate, and useful terminal stream operations is
collect
. It introduces a new interface called Collector. This interface is somewhat difficult to understand, but fortunately there is a Collectors utility class for generating all sorts of useful Collectors. For example:List<String> strings = values.stream() .filter(...) .map(...) .collect(Collectors.toList());
If you want to put your stream elements into a Collection, Map, or String, then Collectors probably has what you need. It's definitely worthwhile to browse through the javadoc of that class.
No comments:
Post a Comment