F# Immutable data structures for high frequency real-time streaming data

问题

We are at the beginning of an f# project involving real-time and historical analysis of streaming data. The data is contained in a c# object (see below) and is sent as part of a standard .net event. In real-time, the number of events we typically receive can vary greatly from less than 1/sec to upwards of around 800 events per second per instrument and thus can be very bursty. A typical day might accumulate 5 million rows/elements per insturment

A generic version of the C# event's data structure looks like this:

public enum MyType { type0 = 0, type1 = 1}

public class dataObj
{
    public int myInt= 0;
    public double myDouble;
    public string myString;
    public DateTime myDataTime;
    public MyType type;
    public object myObj = null;

}

We plan to use this data structure in f# in two ways:

Historical analysis using supervised && unsupervised machine learning (CRFs, clustering models, etc)
Real-time classification of data streams using the above models

The data structure needs to be able to grow as we add more events. This rules out array<t> because it does not allow for resizing, though it could be used for the historical analysis. The data structure also needs to be able to quickly access recent data and ideally needs to be able to jump to data x points back. This rules out Lists<T> because of the linear lookup time and because there is no random access to elements, just "forward-only" traversal.

According to this post, Set<T> may be a good choice...

> " ...Vanilla Set<'a> does a more than adequate job. I'd prefer a 'Set' over a 'List' so you always have O(lg n) access to the largest and smallest items, allowing you to ordered your set by insert date/time for efficient access to the newest and oldest items..."

EDIT: Yin Zhu response gave me some additional clarity into exactly what I was asking. I have edited the remainder of the post to reflect this. Also, the previous version of this question was muddied by the introduction of requirements for historical analysis. I have omitted them.

Here is a breakdown of the steps of the real-time process:

A realtime event is received
This event is placed in a data structure. This is the data structure that we are trying to determine. Should it be a Set<T>, or some other structure?
A subset of the elements are either extracted or somehow iterated over for the purpose of feature generation. This would either be the last n rows/elements of the data structure (ie. last 1000 events or 10,000 events) or all the elements in the last x secs/mins (i.e all the events in the last 10 min). Ideally, we want a structure that allows us to do this efficiently. In particular, a data structure that allows for random access of the nth element without iteration through all the others elements is of value.
Features for the model are generated and sent to a model for evaluation.
We may prune the data structure of older data to improve performance.

So the question is what is the best data structure to use for storing the real-time streaming events that we will use to generated features.

回答1:

You should consider FSharpx.Collections.Vector. Vector<T> will give you Array-like features, including indexed O(log32(n)) look-up and update, which is within spitting distance of O(1), as well as adding new elements to the end of your sequence. There is another implementation of Vector which can be used from F# at Solid Vector. Very well documented and some functions perform up to 4X faster at large scale (element count > 10K). Both implementations perform very well up to and possibly beyond 1M elements.

回答2:

In his answer, Jack Fox suggests using either the FSharpx.Collections Vector<'T> or the Solid Vector<'t> by Greg Rosenbaum (https://github.com/GregRos/Solid). I thought I might give back a bit to the community by providing instructions on how to get up and running with each of them.

Using the FSharpx.Collections.Vector<'T>

The process is pretty straight forward:

Download the FSharpx.Core nuget package using either the Project Manager Console or Manager Nuget Packages for Solution. Both are found in Visual Studio -> tools -> Library Manager.
If you're using it in F# script file add #r "FSharpx.Core.dll". You may need to use a full path.

Usage:

open FSharpx.Collections

let ListOfTuples = [(1,true,3.0);(2,false,1.5)] 
let vector = ListOfTuples |> Vector.ofSeq

printfn "Last %A" vector.Last
printfn "Unconj %A" vector.Unconj
printfn "Item(0) %A" (vector.[0])
printfn "Item(1) %A" (vector.[1])
printfn "TryInitial %A" dataAsVector.TryInitial
printfn "TryUnconj %A" dataAsVector.Last

Using the Solid.Vector<'T>

Getting setup to use the Solid Vector<'t> is a bit more involved. But the Solid version has a lot more handy functionality and as Jack pointed out, has a number of performance benefits. It also has a lot of useful documentation.

You will need to download the visual studio solution from https://github.com/GregRos/Solid
Once you have downloaded it you will need to build it as there is no ready to use pre-built dll.
If you're like me, you may run into a number of missing dependencies that prevent the solution from being built. In my case, they were all related to the nuit testing frameworks (I use a different one). Just work through downloading/adding each of the dependencies until the solutions builds.
Once that is done and the solution is built, you will have a shiny new Solid.dll in the Solid/Solid/bin folder. This is where I went wrong. That is the core dll and is only enough for C# usage. If you only include a reference to the Solid.dll you will be able to create a vector<'T> in f#, but funky things will happen from then on.
To use this data structure in F# you will need to reference both the Solid.dll and the Solid.FSharp.dll which is found in \Solid\SolidFS\obj\Debug\ folder. You will only need one open statement -> open Solid

Here is some code showing usage in a F# script file:

#r "Solid.dll"
#r "Solid.FSharp.dll" // don't forget this reference

open Solid

let ListOfTuples2 = [(1,true,3.0);(2,false,1.5)] 
let SolidVector = ListOfTuples2 |> Vector.ofSeq

printfn "%A" SolidVector.Last
printfn "%A" SolidVector.First
printfn "%A" (SolidVector.[0])
printfn "%A" (SolidVector.[1])
printfn "Count %A" SolidVector.Count

let test2 = vector { for i in {0 .. 100} -> i }

回答3:

Suppose your dataObj contains a unique ID field, then any set data structure would be fine for your job. The immutable data structures are primarily used for functional style code or persistency. If you don't need these two, you can use HashSet<T> or SortedSet<T> in the .Net collection library.

Some stream specific optimization may be useful, e.g., keeping a fixed-size Queue<T>for the most recent data objects in the stream and store older objects in the more heavy weight set. I would suggest a benchmarking before switching to such hybrid data structure solutions.

Edit:

After reading your requirements more carefully, I found that what you want is a queue with user-accessible indexing or backward enumerator. Under this data structure, your feature extraction operations (e.g. average/sum, etc) cost O(n). If you want to do some of the operations in O(log n), you can use more advanced data structures, e.g. interval trees or skip lists. However, you will have to implement these data structures yourself as you need to store meta information in the tree nodes which are behind collection API.

回答4:

This event is placed in a data structure. This is the data structure that we are trying to determine. Should it be a Set, a Queue, or some other structure?

Difficult to say without more information.

If your data are coming in with timestamps in ascending order (i.e. they are never out of order) then you can just use some kind of queue or extensible array.

If your data can come in out of order and you need them reordered then you want a priority queue or indexed collection instead.

to upwards of around 800 events per second

Those are extremely tame performance requirements for insertion rate.

A subset of the elements are either extracted or somehow iterated over for the purpose of feature generation. This would either be the last n rows/elements of the data structure (ie. last 1000 events or 10,000 events) or all the elements in the last x secs/mins (i.e all the events in the last 10 min). Ideally, we want a structure that allows us to do this efficiently. In particular, a data structure that allows for random access of the nth element without iteration through all the others elements is of value.

If you only ever want elements near the beginning why do you want random access? Do you really want random access by index or do you actually want random access by some other key like time?

From what you've said I would suggest using an ordinary F# Map keyed on index maintained by a MailboxProcessor that can append a new event and retrieve an object that allows all events to be indexed, i.e. wrap the Map in an object that provides its own Item property and implementation of IEnumerable<_>. On my machine that simple solution takes 50 lines of code and can handle around 500,000 events per second.

来源：https://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data

标签

data-structures

stream

f#-3.0

f#-data