Data Lakes and Data Rivers: Two Technologies
Updated: Jan 26
There's a lot of buzz around artificial intelligence, but there is an important middle step, data analytics, that is having a significant impact on businesses. When we think of data analytics, we often think of data lakes. This is a misleading way of thinking about big data in general, because it suggests that the data is just sitting there, and that is only one type of data. Data is usually in motion; this type of data is what we would call a data river.
It's essential to make the distinction between Data Lakes and Data Rivers, because how we think about them dictates how the technology analyzes the data and, ultimately, how we use the data. When you look at why we use data lakes, it's normally for one of three reasons: 1) searching, 2) model generation, and 3) detection. Data Rivers are better at detection and watchlists.
Use of Data Lakes
Data Lakes are hot after the Sunburst breach. This is a key capability most CISOs underestimated. In a matter of weeks, hundreds of indicators of compromise (IoCs) were released that CISOs needed to search for. It was not just 30 or 90 days of data to search; the breach covered a period of ten months and possibly longer. Suddenly the bar for search is raised to a level nobody but Fluency can deliver. Most existing SIEMs can barely handle 3 months of data. Anything longer than 6 months would need a dedicated data analyst and a lot of time. Fluency's data lake searches in minutes over an "unlimited" retention period. Fluency can stand alone or easily complement and augment an existing SIEM deployment.
This first reason is why we have data lakes: we want to search really big datasets for matches. In computer security, we have large audit logs stretching over a year, and we need to figure out what elements in those logs are associated with a breach. In this case, you are performing a historical search.
Another excellent way of using a data lake is to create models; when used properly, models are incredibly powerful. At a global level, we can use models to look for consistent trends regardless of the instance. Note, however, that where models really shine is in profiling a particular customer.
When we create models for a particular customer, it's called profiling. Profiling allows us to create rules that are unique not only to the applications and commands being used, but also to which users use which applications and commands. In other words, we can determine from their commands which users are privileged users. By doing so, we can detect an escalation of privileges: a user starts issuing administrative commands they have never used before.
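To make the idea concrete, here is a minimal sketch of command profiling. It is not Fluency's implementation; the event fields (`user`, `command`) and the set of administrative commands are assumptions for illustration.

```python
# Hypothetical sketch: learn which commands each user runs, and flag
# the first time a user issues an administrative command.
from collections import defaultdict

ADMIN_COMMANDS = {"sudo", "useradd", "chmod", "systemctl"}  # assumed set

class CommandProfiler:
    def __init__(self):
        self.seen = defaultdict(set)  # user -> commands observed so far

    def observe(self, user, command):
        """Return an alert string if this looks like privilege escalation."""
        alert = None
        # Admin command from a user with no admin history in the profile.
        if command in ADMIN_COMMANDS and not (self.seen[user] & ADMIN_COMMANDS):
            alert = f"possible privilege escalation: {user} ran {command}"
        self.seen[user].add(command)
        return alert

profiler = CommandProfiler()
profiler.observe("alice", "ls")       # baseline activity, no alert
profiler.observe("alice", "useradd")  # first admin command -> alert
```

The profile is built from the same history stored in the data lake; the resulting rule is what later gets applied to live data.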
Another name for a model is a behavioral rule. We can create a rule to determine when somebody is logging in from a new geographical area. We can create a rule that lets us know when a person is logged in from two different systems at the same time, or within a specified window of time. We can create a rule to let us know when the system has detected a virus but has not responded in a manner that prevents it. There are many variations on the rules that can be created.
Now the twist is that we will not use these rules to search the data lake (database), but instead to search the incoming data (data stream).
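The new-geography rule above can be sketched as a function applied to each login event as it arrives, rather than as a query run later against a database. The field names (`user`, `country`) are assumptions, not an actual log schema.

```python
# Hypothetical stream-side behavioral rule: alert when a user logs in
# from a country we have never seen for them before.
from collections import defaultdict

seen_countries = defaultdict(set)  # user -> countries observed so far

def new_geo_rule(event):
    user, country = event["user"], event["country"]
    # Only alert once a baseline exists and the country is new to it.
    is_new = bool(seen_countries[user]) and country not in seen_countries[user]
    seen_countries[user].add(country)
    return f"{user} logged in from new country {country}" if is_new else None

stream = [
    {"user": "alice", "country": "US"},
    {"user": "alice", "country": "US"},
    {"user": "alice", "country": "RO"},  # new geography -> alert
]
alerts = [a for a in map(new_geo_rule, stream) if a]
```

The same rule could also be run as a database query, but then it would fire only when the query happens to run; applied to the stream, it fires on the event itself.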
Use of Data Rivers
The primary use of data rivers is real-time detection. This is where most SIEMs have it all wrong. If you review the way detection runs today, you will find that SIEM companies use database queries to make detections. This means the data is collected, parsed, and inserted into a database before it is reviewed to determine whether it is an issue. The system cannot evaluate all the signatures at once, so it runs its searches every five minutes, half hour, or maybe every hour, guaranteeing that the alert will be late.
We believe the Fluency way of doing this is the proper approach: review the data live as it comes in from the data stream. In this manner, instead of having a behavioral rule search thousands of events, an audit event is reviewed by thousands of signatures. The result is detection that is immediate and scales, while providing real-time alerting.
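The inversion described above can be sketched as a dispatcher: each incoming event is pushed past every registered signature, rather than each signature periodically querying a store of events. The signature names and predicates below are made up for illustration.

```python
# Hypothetical sketch of the inverted model: one event, many signatures.
signatures = []  # list of (name, predicate) pairs

def register(name, predicate):
    signatures.append((name, predicate))

def process_event(event):
    """Run one event past every signature; return matching names at once."""
    return [name for name, predicate in signatures if predicate(event)]

register("failed-login",
         lambda e: e.get("action") == "login" and e.get("status") == "fail")
register("admin-command",
         lambda e: e.get("command") in {"sudo", "useradd"})

alerts = process_event({"action": "login", "status": "fail", "user": "bob"})
# alerts -> ["failed-login"], produced the moment the event arrives
```

Because the work per event is bounded by the signature list, adding more signatures does not add more scheduled database queries; this is the scaling argument in a nutshell.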
There is more to it. Evaluating a data stream with a behavioral rule is more accurate, because the rule can maintain state and time. This is something a database query cannot do. Both state and time are critical aspects of behavioral analytics.
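As an example of a rule that needs both state and time, here is a sketch of the concurrent-login rule mentioned earlier: alert when the same user logs in from two different hosts within a short window. Timestamps are epoch seconds, and the field names and 60-second window are assumptions.

```python
# Hypothetical stateful, time-aware rule: same user, two hosts, one window.
WINDOW = 60            # seconds; assumed threshold
last_login = {}        # user -> (host, timestamp) of most recent login

def concurrent_login_rule(event):
    user, host, ts = event["user"], event["host"], event["ts"]
    alert = None
    prev = last_login.get(user)
    # Different host from the previous login, inside the time window.
    if prev and prev[0] != host and ts - prev[1] <= WINDOW:
        alert = f"{user} on {prev[0]} and {host} within {WINDOW}s"
    last_login[user] = (host, ts)
    return alert

concurrent_login_rule({"user": "eve", "host": "web-01", "ts": 100})
concurrent_login_rule({"user": "eve", "host": "vpn-02", "ts": 130})  # -> alert
```

A plain database query would have to re-derive `last_login` on every run; a stream rule simply carries it forward from event to event.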
To conclude, not only is detection using a data river more immediate, it is also more accurate, because it can consider more than just key-value matching.
If you have a watchlist, data rivers are better. Watchlists are simple key-value matches that can be made immediately as the data streams into the system. Indicators of compromise (IoCs) are a perfect example of what a watchlist searches for. When we first get a list of IoCs, we search the historical data. That is a use of the data lake. Yet we would also like to know whether any of the new data coming in matches an IoC. This second request is a stream search, or watchlist.
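A stream-side watchlist is simple enough to sketch in a few lines: a set lookup applied to each event's values as it arrives. The IoC values below are invented placeholders (203.0.113.0/24 is a documentation address range), not real indicators.

```python
# Hypothetical watchlist: exact key-value matching against a set of IoCs.
ioc_watchlist = {"203.0.113.7", "bad-domain.example"}  # placeholder IoCs

def watchlist_match(event):
    """Return the fields of this event whose values hit the watchlist."""
    return {k: v for k, v in event.items() if v in ioc_watchlist}

hit = watchlist_match({"src_ip": "203.0.113.7", "dst_ip": "10.0.0.5"})
# hit -> {"src_ip": "203.0.113.7"}
```

The same set of IoCs drives both uses: one historical search over the data lake when the list arrives, and this per-event check on the data river from then on.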
Why do data rivers matter?
Why does this matter? The obvious answer is timeliness: data stream analysis is immediate, while data lake analysis has inherent delays. Another answer is accuracy: signatures that consider state and time can be much more accurate.
But there is also a hidden reason: data density prevents data lakes from supporting many signatures. Data density, the amount of data created per user, gets extremely high with technologies like EDRs. To scale up to that volume of data, the system limits the number of searches it performs. We see this in the number of concurrent searches and in the total number of searches that can be sustained over a period of time. Fluency is capable of handling thousands of rules, while data lake systems can handle only a couple of hundred rules at best. Data rivers matter immensely when it comes to cybersecurity and supporting the needs of the business.