Podcast

We Get Privacy for Work — Episode 13: Demystifying Data Mining


December 23, 2025

When an organization’s data systems are compromised, it faces the daunting challenge of sorting through mountains of sensitive data under intense regulatory pressure. In this episode of We Get Privacy for Work, hosts Joe Lazzarotti and Damon Silver are joined by Matt Morocco, Director and Cyber Practice Lead at Consilio, to break down the real-world complexities of data mining after a breach. Together, they discuss why data mining is essential and how organizations can streamline the process for faster, smarter incident response.

Transcript

Joe Lazzarotti
Principal, Tampa

Welcome to the We Get Privacy podcast. I'm Joe Lazzarotti, and I'm joined by my co-host, Damon Silver. Damon and I co-lead the Privacy, AI, and Cybersecurity Group here at Jackson Lewis. In that role, we receive a variety of questions every day from our clients, all of which boil down to one core question: how do we handle our data security? In other words, how do we leverage all of the great things data can do for our organization without running headfirst into a wall of legal risk? How can we manage that risk without unnecessarily hindering our business operations?

Damon Silver
Principal, New York City

On each episode of the podcast, Joe and I talk through a common question that we're getting from our clients. We talk through it in the same way that we would with our clients, meaning with a focus on the practical. What are the legal risks? What options are available to manage those risks? What should we be mindful of from an execution perspective?

Our multi-part question for today is: why is data mining necessary, why does it cost so much, and how can we do it more efficiently? To tackle these questions, we've invited a special guest, Matt Morocco. Matt is the Director and Cyber Practice Lead at Consilio and has regularly partnered with our team on a variety of data mining projects.

Matt, before we jump into data mining, would you mind just telling our audience a little bit about yourself and your role at Consilio? 

Matt Morocco
Director, Cyber Practice Lead, Consilio

Thank you, Joe and Damon, for having me on. I'm Matt Morocco. I'm based in Pittsburgh, where I've spent most of my life and most of my career. I'm a director at Consilio, a global e-discovery, litigation services, and consulting company. Over the past five years or so, I have, along with a partner of mine, founded our Insurance Practice, which focuses on our insurance carrier clients and their business needs.

Cyber has obviously become a very large and specialized part of the e-discovery business. We'll be talking about data mining and document review, which are core practices in e-discovery, and about how we tailor tools, processes, and workflows to the specialized demands of cyber incident response.

Silver

Great, thanks, Matt. Joe, to tee things up for the discussion here, there are probably some members of the audience who are very familiar with data mining and others who have never even heard the term. Just at a very high level, Joe, what is data mining, and why might it be necessary?

Lazzarotti 

For our purposes in the cybersecurity world and in incident response, it basically amounts to getting a handle on what data, if any, constitutes a certain type of data we're looking for or are concerned about having been compromised by an unauthorized third party who was in the company's systems. Say you have a ransomware attack or a business email compromise, and a forensics firm says, it looks like this data was exfiltrated from your systems, or here's the file list that the threat actor who deployed ransomware in your environment says they took. The question then becomes, one, did they really take it? Two, if we feel there's a reasonable likelihood that they did, how do we dig into that data? Oftentimes, as I'm sure Matt can tell us, and as we know from our own experience, there can be a substantial amount of data to go through, person by person. Doing a manual review without the help of folks like Consilio can be a very arduous process.

Silver

Joe, just to piggyback on something you said there, there are definitely incidents we handle where we have a pretty clear sense of the total universe of data that might need to be mined. You gave the example of an attack where the bad actor actually takes data and exfiltrates it from the environment. They provide a file tree with the names of all the files they took, so there's a defined set of data, which we'll talk about in a moment. That doesn't necessarily mean all of it needs to be mined, but it's the starting point. There are other types of incidents, though, where, through the forensic investigation, we might see that a bad actor had access to email accounts or to certain systems, but we don't have a specific set of documents or files that we know for sure they took or accessed.

Joe, this is something we can talk about a little bit more: before we even get to the point of working with Matt's team and bringing them in to help with the data mining, what are some of the things we try to work through with clients so that we're properly defining the scope of the potential data mining project in a way that puts us in a position to tackle it as efficiently as we can?

Lazzarotti

One thing is just understanding the client and the client's business, and tapping into the institutional knowledge they have to say: this is what is potentially at risk. In the case of a business email compromise, for example, what's the role of the person who owns that account? Are they someone who is dealing with X, Y, or Z? If you are in a particular type of business and that person fits a certain role, you might have a good sense of the kind of data they will likely have in email exchanges while doing their job. Take a CPA firm, for example: if the email belonged to a tax preparer, there's a good chance that person has tax return information of some kind in their inbox, sent items, or deleted items. In that case, you would say, we probably need to really look at that. Whereas if you're talking about someone who's just putting together content for promotional activities, you might say personal data is not likely to be there. That may shape things like what we search for and how much effort we put into the search.

That’s just one way to look at that, Damon, in terms of approaching the next step of bringing someone like Matt in.

Silver

That's a great example. Just to throw one more out there, we were recently working with a university that had a ransomware attack, and the initial data set that appeared to have been impacted was 10 terabytes, which is a very significant volume of data. It was not a situation where we were clear on what was exfiltrated; the files potentially in scope were identified just based on where the ransomware was deployed and on what we were able to see through forensics, which was not a whole lot. The initial inclination on the client's part was just to data mine the whole 10 terabytes, because, understandably, they didn't want to have to really dig into this data set and try to figure things out themselves. Then, they saw the statement of work for data mining 10 terabytes, and it was a lot.

At that point, we started working with them in coordination with their cyber carrier, and we were able to get that 10 terabytes of data down to about 50 gigabytes, which is dramatically, dramatically less. It was a much more reasonable scope of work. It did take probably a month of very intensive back and forth between the client's team and ours, but it had a dramatic impact on the overall cost of the project and, frankly, on how long the project took. These projects tend to take a long time no matter what, but if you're doing 10 terabytes of data mining, it's going to take a hell of a lot longer than if you're able to make that a more manageable amount of data at the front end.

Matt, let's assume that Joe and I have done everything we can from the standpoint of working directly with the client to identify the data that does need to be mined. We reach out to you, as we have many times before, to say, we have this project. Can you take the audience through an insider's look at what happens next? What does the process look like on your end?

Morocco

Absolutely. You already touched on a key aspect of the scoping process that we look at before we get involved in these matters: what is the client's industry, and what is the data source? Who were the data owners involved? That gives us a sense going in and helps set and manage expectations. Really, it's a big part of my role, and a key part of the collaboration between Consilio, counsel, and the cyber carrier, to set and manage the proper expectations. We always want to be sensitive to what these clients are going through, having had this experience, and to how we are going to make the process as efficient and cost-effective as possible. A lot of times, that comes down to leveraging as much technology as we can.

Moving data through a funnel is really the best visual we can provide, whether it's 10 terabytes, one terabyte, or 100 gigabytes. Keep in perspective that to some companies, even 100 gigabytes is a massive amount of data to think about wading through, and a lot for a company like us to get through to really get down to what's sensitive. That can be different things in different matters. It could be sensitive contractual information, it could be IP, or it could be what we'll call your more garden-variety PII or PHI that's at issue. The goal is to reduce the data down to the smallest set of documents that need to be looked at, whether by manual review or by what we call programmatic extraction. Programmatic extraction is really where we have spreadsheets, or what we refer to as structured data, where we can automate how we get that information out and then provide it to you for analysis to determine whether there's a notification obligation. Anywhere that we can avoid manual, tedious processes and data entry, which allow more room for error or require more quality control, we're going to leverage technology to get through those processes.
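
As a rough illustration of the programmatic extraction Matt describes for structured data, the sketch below pulls likely-sensitive columns out of a spreadsheet instead of sending it to manual review. The column names and file handling are assumptions for illustration, not a description of Consilio's actual tooling.

```python
import pandas as pd

# Hypothetical column names that suggest sensitive fields; a real project
# would map these per client after inspecting the structured data.
SENSITIVE_COLUMNS = {"ssn", "social_security_number", "dob", "date_of_birth",
                     "credit_card", "account_number", "diagnosis"}
IDENTIFIER_COLUMNS = {"name", "full_name"}

def extract_structured(path: str) -> pd.DataFrame:
    """Read one spreadsheet and keep only the columns flagged as sensitive,
    plus an identifier column so records can later be consolidated."""
    df = pd.read_excel(path)  # or pd.read_csv for delimited exports
    hit_cols = [c for c in df.columns if str(c).strip().lower() in SENSITIVE_COLUMNS]
    if not hit_cols:
        return pd.DataFrame()  # nothing programmatically extractable here
    id_cols = [c for c in df.columns if str(c).strip().lower() in IDENTIFIER_COLUMNS]
    # Drop rows where every sensitive column is empty.
    return df[id_cols + hit_cols].dropna(how="all", subset=hit_cols)

# Example: build one combined extract across many spreadsheets in the review set.
# extracts = [extract_structured(p) for p in spreadsheet_paths]
# combined = pd.concat([e for e in extracts if not e.empty], ignore_index=True)
```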

We start out with data mining, which is the fancier term for e-discovery. This means we're going to create an index of the data, throw some analytics tools at it, and try to identify sensitive information. That, again, can be things like dates of birth, Social Security numbers, and credit card numbers. There are ways to search for number strings to identify sensitive information, taking into account whether we're dealing with just U.S. data or with GDPR and other regional jurisdictions, as well as any non-English languages involved. We are going to put that data through an iterative rinse-and-repeat process where we run some searches and provide some results back.
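
As a rough illustration of the kind of number-string searching Matt describes, here is a minimal sketch that scans a document's extracted text for patterns resembling Social Security numbers, credit card numbers, and dates of birth. The regular expressions and the Luhn checksum filter are illustrative assumptions, not Consilio's actual tooling; real detectors are far more robust and locale-aware.

```python
import re

# Illustrative patterns only; production tools handle many more formats.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "date_of_birth": re.compile(r"\b(?:0[1-9]|1[0-2])/(?:0[1-9]|[12]\d|3[01])/(?:19|20)\d{2}\b"),
}

def luhn_valid(candidate: str) -> bool:
    """Reduce credit card false positives with the standard Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", candidate)][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(d * 2, 10)) for d in digits[1::2])
    return len(digits) >= 13 and total % 10 == 0

def find_sensitive(text: str) -> dict[str, list[str]]:
    """Return candidate hits by category for one document's extracted text."""
    hits: dict[str, list[str]] = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if label == "credit_card":
            matches = [m for m in matches if luhn_valid(m)]
        if matches:
            hits[label] = matches
    return hits
```

Each hit is only a candidate; the iterative rinse-and-repeat Matt mentions is what turns raw hits into a reviewable, validated set.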

We really want to get the clients involved wherever we can. Ideally, they know their data, and they can identify someone in the company who can help us quickly determine whether there are folders or directory trees within the data that they know may contain test data. We encounter this a lot with insurance companies or health systems, where they have anonymized or test data that appears as though it's sensitive but can really be put aside. You might have contracts or other information where, if the client can assist by saying, we believe the majority of the sensitive data is going to lie within these folders, or there's not going to be any in those folders, we can look at prioritizing and triaging those data sets to get to the information faster and reduce the amount of manual review. Sometimes it just takes some statistical sampling for us to validate where the sensitive information is and how we're going to prioritize that review.
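
To make the sampling idea concrete, here is a hedged sketch of one way it could work: pull a small random sample of documents from each folder, have a reviewer mark which sampled documents contain sensitive data, and rank folders by the observed hit rate. The folder structure, sample size, and review results are assumptions for illustration only.

```python
import random

def sample_folder(doc_ids: list[str], sample_size: int = 50, seed: int = 7) -> list[str]:
    """Pull a simple random sample of documents from one folder for a quick look."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))

def prioritize(folders: dict[str, list[str]], reviewed_sensitive: set[str]) -> list[tuple[str, float]]:
    """Rank folders by the share of sampled documents a reviewer marked sensitive."""
    ranked = []
    for folder, doc_ids in folders.items():
        sample = sample_folder(doc_ids)
        hit_rate = sum(doc in reviewed_sensitive for doc in sample) / len(sample) if sample else 0.0
        ranked.append((folder, hit_rate))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Folders with a near-zero sampled hit rate become candidates to deprioritize or set aside, subject to whatever validation threshold counsel and the client are comfortable with.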

The thing to remember is that there's always a ticking clock. Once the process has started and there's a regulatory requirement to satisfy, we all want to collaborate as much as we can on not wasting time reviewing documents that are unlikely to contain sensitive information, and to focus on how quickly and accurately we can get information to your team so you can start the analysis and consult with the client about what their obligations might be.

Lazzarotti

Matt, one thing that's developed over the last few years is that we often think about breaches as involving personal data, like a Social Security number or a credit card number. Then, with HIPAA, it gets a little more complicated in terms of determining whether a document has health information about an individual. More recently, we've seen critical infrastructure rules, federal contractor requirements, and a whole range of SEC requirements. How are you seeing “breach notification” evolve in response to an increased regulatory environment that reaches disclosures of information that may not fall into the typical Social Security number or strings-of-numbers category? How do you try to deal with that issue in these projects?

Morocco

That's a great question, Joe. It really comes down to that consultative, upfront approach. We are working with your team, and we want to set the right process in place. PII and PHI can be commonplace these days, but it doesn't always mean that the puzzle pieces fit together easily. You sometimes have to dig around, since you have pieces of people's information in one place or one document, and other pieces scattered throughout a data set. Some of the challenges that we face are finding those pieces of information, putting them together, and making sure that we're not combining two John Smiths where they don't belong. 
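
As a rough illustration of that consolidation step, one common approach is to merge record fragments only when they share a strong identifier, not just a name. The field names and matching rule below are assumptions for illustration, not a description of Consilio's workflow.

```python
from collections import defaultdict

def consolidate(fragments: list[dict]) -> dict[tuple, dict]:
    """Merge record fragments that share both a name and an SSN, so two
    different John Smiths are never combined on name alone."""
    people: dict[tuple, dict] = defaultdict(dict)
    for frag in fragments:
        key = (frag.get("name", "").strip().lower(), frag.get("ssn"))
        if not key[0] or not key[1]:
            continue  # park incomplete fragments for manual review instead
        people[key].update({field: value for field, value in frag.items() if value})
    return dict(people)

# The first two fragments merge because name and SSN both match;
# a John Smith with a different SSN stays a separate record.
# consolidate([
#     {"name": "John Smith", "ssn": "123-45-6789", "dob": "01/02/1980"},
#     {"name": "John Smith", "ssn": "123-45-6789", "address": "1 Main St"},
#     {"name": "John Smith", "ssn": "987-65-4321", "dob": "03/04/1975"},
# ])
```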

Then you validate that information so you have an accurate notification process. When it comes to some of the more complex challenges, with data sets that don't present just personal or health information, we have sensitive information that might be contractual, intellectual property, or, as you mentioned, government contractor information. That requires more specialized review. You don't necessarily want the same folks who do manual data-entry review for Social Security numbers analyzing a contract to determine what someone's notification obligations are and how sensitive that information might be. There are certainly occasions where we're bifurcating those review workflows and making sure that the right folks are staffed on them.

Then, we leverage other types of technology that get more into the e-discovery space, where we might be using analytics to further investigate a data set and help us identify sensitive information beyond just the known values, so to speak, of credit card numbers and dates of birth.

Silver

That’s a really important point you've made, both about the front end and about the iterative process as you go through the data set. It really does pay to invest in being clear about what your objectives are, what you're looking for, and what your risk tolerance is. Do you want to find every single last document, or is there some amount of buffer that the client is comfortable with, given the other circumstances? There is a tendency with this, as with other things, when there's time pressure and resource pressure, to say, let's just use your standard set of terms for PII and move this forward as quickly as we can. A cookie-cutter set of terms may not be what's really going to best serve the client if, to your point, they're dealing with data that might be contractual or subject to confidentiality obligations, or, to Joe's point, that might be PHI under HIPAA or subject to other laws that have a more nuanced definition.

Then, as you're going through the process, and we've done this many times with your team, maybe they're coming across documents that suggest a pattern of false hits: something that seems to fit one of our PII categories but consistently doesn't. Are there subsequent programmatic searches that we can run to try to identify those other false hits? In some instances, we've been able to eliminate or significantly reduce the population of documents that actually need to be manually reviewed by the team because we were able to identify these patterns.

Could you speak a little bit to that last point: once you have your initial set of documents that your team is manually reviewing, what are they looking for, and what do you do if you see opportunities to streamline that review further?

Morocco

Absolutely. One of my favorite lines, which I learned at my first job as a reinsurance paralegal many moons ago, is that you get what you inspect. It's important to have a QC process and a validation process and, as you mentioned, not to just take the standard set of search terms and throw things straight over to review because it sounds super cheap on a per-document basis or because you feel you're under time pressure. The more data you put through that process, the more data might come out of it, and the greater the potential for notifications you might think there is. Spending a little bit of extra time upfront on what we generically call advanced data mining, by contrast, can have a very large impact on the amount of work that needs to go into the process, the amount of time it takes, and the overall cost.

Advanced data mining can take many flavors, one of the most common being pattern recognition, as you mentioned. We have ways to identify what we call unique hits, which are documents that hit on only a single term out of all of the terms we run; that's usually an indicator of an outlier. One of my favorite examples is the word passport and knowing that there happens to be a software product out there called Passport. You want to take time to collaborate and understand the client's business, because some of these terms of art can be confused and throw false positives into the process. If you can spend a little time to identify patterns, work with the client, and be consultative about the kinds of results you're seeing and whether it makes sense that they would have data containing this type of information, you can often have a really significant impact on the overall timeline and the overall cost.
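
To illustrate the unique-hits idea, here is a minimal sketch that flags documents hitting on only one search term as likely outliers and sets aside documents matching a known false-positive pattern, such as a hypothetical reference to the Passport software. The term data and exclusion phrase are assumptions for illustration.

```python
def triage_hits(doc_terms: dict[str, set[str]], doc_text: dict[str, str]) -> dict[str, list[str]]:
    """Split documents into likely review candidates versus likely outliers."""
    # Hypothetical phrase known from sampling to be a false hit on "passport".
    false_positive_phrases = ["passport software"]

    buckets: dict[str, list[str]] = {
        "review": [], "unique_hit_outlier": [], "known_false_positive": []
    }
    for doc_id, terms in doc_terms.items():
        text = doc_text.get(doc_id, "").lower()
        if any(phrase in text for phrase in false_positive_phrases):
            buckets["known_false_positive"].append(doc_id)
        elif len(terms) == 1:
            # Hit on only one of many search terms: sample and validate
            # before committing it to full manual review.
            buckets["unique_hit_outlier"].append(doc_id)
        else:
            buckets["review"].append(doc_id)
    return buckets
```

The outlier and false-positive buckets would then be sampled and validated with counsel and the client before being removed from the manual review population.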

Silver

Matt, thank you. This is actually a great place for us to start winding things down. Thanks, as always, to everyone for joining us. We have a designated email address, Privacy@JacksonLewis.com, where we're always happy to receive your feedback on this episode. If you have questions for Matt, Joe, or me, or ideas for other topics that you'd like to see us cover on the podcast in the future, please share those as well.

Matt, thanks again so much for taking the time to join us. This has been extremely helpful for Joe and me, and hopefully for our audience as well. We look forward to seeing you soon.

© Jackson Lewis P.C. This material is provided for informational purposes only. It is not intended to constitute legal advice nor does it create a client-lawyer relationship between Jackson Lewis and any recipient. Recipients should consult with counsel before taking any actions based on the information contained within this material. This material may be considered attorney advertising in some jurisdictions. Prior results do not guarantee a similar outcome. 

Focused on employment and labor law since 1958, Jackson Lewis P.C.’s 1,100+ attorneys located in major cities nationwide consistently identify and respond to new ways workplace law intersects business. We help employers develop proactive strategies, strong policies and business-oriented solutions to cultivate high-functioning workforces that are engaged and stable, and share our clients’ goals to emphasize belonging and respect for the contributions of every employee. For more information, visit https://www.jacksonlewis.com.