2014 |
|
Breitinger, Frank; Rathgeb, Christian; Baier, Harald An Efficient Similarity Digests Database Lookup - A Logarithmic Divide & Conquer Approach (Journal Article) In: Journal of Digital Forensics, Security and Law (JDFSL), vol. 9, no. 2, pp. 155–166, 2014. @article{BRB14, Investigating seized devices within digital forensics represents a challenging task due to the increasing amount of data. Common procedures utilize automated file identification, which reduces the amount of data an investigator has to examine manually. In the past years the research field of approximate matching arises to detect similar data. However, if n denotes the number of similarity digests in a database, then the lookup for a single similarity digest is of complexity of O(n). This paper presents a concept to extend existing approximate matching algorithms, which reduces the lookup complexity from O(n) to O(log(n)). Our proposed approach is based on the well-known divide and conquer paradigm and builds a Bloom filter-based tree data structure in order to enable an efficient lookup of similarity digests. Further, it is demonstrated that the presented technique is highly scalable operating a trade-off between storage requirements and computational efficiency. We perform a theoretical assessment based on recently published results and reasonable magnitudes of input data, and show that the complexity reduction achieved by the proposed technique yields a 220-fold acceleration of look-up costs. | |
Breitinger, Frank; Stivaktakis, Georgios; Baier, Harald FRASH: A Framework to Test Algorithms of Similarity Hashing (Journal Article) In: Digit. Investig., vol. 10, pp. S50–S58, 2014, ISSN: 1742-2876. @article{BSB13, Automated input identification is a very challenging, but also important task. Within computer forensics this reduces the amount of data an investigator has to look at by hand. Besides identifying exact duplicates, which is mostly solved using cryptographic hash functions, it is necessary to cope with similar inputs (e.g., different versions of a file), embedded objects (e.g., a JPG within a Word document), and fragments (e.g., network packets), too. Over the recent years a couple of different similarity hashing algorithms were published. However, due to the absence of a definition and a test framework, it is hardly possible to evaluate and compare these approaches to establish them in the community. The paper at hand aims at providing an assessment methodology and a sample implementation called FRASH: a framework to test algorithms of similarity hashing. First, we describe common use cases of a similarity hashing algorithm to motivate our two test classes efficiency and sensitivity & robustness. Next, our open and freely available framework is briefly described. Finally, we apply FRASH to the well-known similarity hashing approaches ssdeep and sdhash to show their strengths and weaknesses. | |
Breitinger, Frank; Stivaktakis, Georgios; Roussev, Vassil Evaluating Detection Error Trade-offs for Bytewise Approximate Matching Algorithms (Journal Article) In: Digital Investigation, vol. 11, no. 2, pp. 81–89, 2014, ISSN: 1742-2876, (bf Best Paper Award @ ICDF2C'13). @article{BSR14, Bytewise approximate matching is a relatively new area within digital forensics, but its importance is growing quickly as practitioners are looking for fast methods to analyze the increasing amounts of data in forensic investigations. The essential idea is to complement the use of cryptographic hash functions to detect data objects with bytewise identical representation with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have been studied and tested for a long time, approximate matching ones are still in their early development stages, and have been evaluated in a somewhat ad-hoc manner. Recently, the FRASH testing framework has been proposed as a vehicle for developing a set of standardized tests for approximate matching algorithms; the aim is to provide a useful guide for understanding and comparing the absolute and relative performance of different algorithms. The contribution of this work is twofold: a) expand FRASH with automated tests for quantifying approximate matching algorithm behavior with respect to precision and recall; and b) present a case study of two algorithms already in use-sdhash and ssdeep. | |
Breitinger, Frank; Baggili, Ibrahim File Detection On Network Traffic Using Approximate Matching (Journal Article) In: Journal of Digital Forensics, Security and Law (JDFSL), vol. 9, no. 2, pp. 23–36, 2014, (bf Best Paper Award). @article{BB14, In recent years, Internet technologies changed enormously and allow faster Internet connections, higher data rates and mobile usage. Hence, it is possible to send huge amounts of data / files easily which is often used by insiders or attackers to steal intellectual property. As a consequence, data leakage prevention systems (DLPS) have been developed which analyze network traffic and alert in case of a data leak. Although the overall concepts of the detection techniques are known, the systems are mostly closed and commercial. Within this paper we present a new technique for network traffic analysis based on approximate matching (a.k.a fuzzy hashing) which is very common in digital forensics to correlate similar files. This paper demonstrates how to optimize and apply them on single network packets. Our contribution is a straightforward concept which does not need a comprehensive configuration: hash the file and store the digest in the database. Within our experiments we obtained false positive rates between 10−4 and 10−5 and an algorithm throughput of over 650 Mbit/s. | |
Breitinger, Frank; Roussev, Vassil Automated evaluation of approximate matching algorithms on real data (Journal Article) In: Digital Investigation, vol. 11, Supplement 1, no. 0, pp. S10 - S17, 2014, ISSN: 1742-2876, (Proceedings of the First Annual DFRWS Europe). @article{BR14, Abstract Bytewise approximate matching is a relatively new area within digital forensics, but its importance is growing quickly as practitioners are looking for fast methods to screen and analyze the increasing amounts of data in forensic investigations. The essential idea is to complement the use of cryptographic hash functions to detect data objects with bytewise identical representation with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have been studied and tested for a long time, approximate matching ones are still in their early development stages and evaluation methodology is still evolving. Broadly, prior approaches have used either a human in the loop to manually evaluate the goodness of similarity matches on real world data, or controlled (pseudo-random) data to perform automated evaluation. This work's contribution is to introduce automated approximate matching evaluation on real data by relating approximate matching results to the longest common substring (LCS). Specifically, we introduce a computationally efficient LCS approximation and use it to obtain ground truth on the t5 set. Using the results, we evaluate three existing approximate matching schemes relative to LCS and analyze their performance. | |
Breitinger, Frank; Baier, Harald; White, Douglas On the database lookup problem of approximate matching (Journal Article) In: Digital Investigation, vol. 11, Supplement 1, no. 0, pp. S1–S9, 2014, ISSN: 1742-2876, (Proceedings of the First Annual DFRWS Europe). @article{BBW14, Abstract Investigating seized devices within digital forensics gets more and more difficult due to the increasing amount of data. Hence, a common procedure uses automated file identification which reduces the amount of data an investigator has to look at by hand. Besides identifying exact duplicates, which is mostly solved using cryptographic hash functions, it is also helpful to detect similar data by applying approximate matching. Let x denote the number of digests in a database, then the lookup for a single similarity digest has the complexity of O(x). In other words, the digest has to be compared against all digests in the database. In contrast, cryptographic hash values are stored within binary trees or hash tables and hence the lookup complexity of a single digest is O(log2(x)) or O(1), respectively. In this paper we present and evaluate a concept to extend existing approximate matching algorithms, which reduces the lookup complexity from O(x) to O(1). Therefore, instead of using multiple small Bloom filters (which is the common procedure), we demonstrate that a single, huge Bloom filter has a far better performance. Our evaluation demonstrates that current approximate matching algorithms are too slow (e.g., over 21 min to compare 4457 digests of a common file corpus against each other) while the improved version solves this challenge within seconds. Studying the precision and recall rates shows that our approach works as reliably as the original implementations. We obtain this benefit by accuracy–the comparison is now a file-against-set comparison and thus it is not possible to see which file in the database is matched. |
Book Chapters & Journal Articles
2014 |
|
An Efficient Similarity Digests Database Lookup - A Logarithmic Divide & Conquer Approach (Journal Article) In: Journal of Digital Forensics, Security and Law (JDFSL), vol. 9, no. 2, pp. 155–166, 2014. | |
FRASH: A Framework to Test Algorithms of Similarity Hashing (Journal Article) In: Digit. Investig., vol. 10, pp. S50–S58, 2014, ISSN: 1742-2876. | |
Evaluating Detection Error Trade-offs for Bytewise Approximate Matching Algorithms (Journal Article) In: Digital Investigation, vol. 11, no. 2, pp. 81–89, 2014, ISSN: 1742-2876, (bf Best Paper Award @ ICDF2C'13). | |
File Detection On Network Traffic Using Approximate Matching (Journal Article) In: Journal of Digital Forensics, Security and Law (JDFSL), vol. 9, no. 2, pp. 23–36, 2014, (bf Best Paper Award). | |
Automated evaluation of approximate matching algorithms on real data (Journal Article) In: Digital Investigation, vol. 11, Supplement 1, no. 0, pp. S10 - S17, 2014, ISSN: 1742-2876, (Proceedings of the First Annual DFRWS Europe). | |
On the database lookup problem of approximate matching (Journal Article) In: Digital Investigation, vol. 11, Supplement 1, no. 0, pp. S1–S9, 2014, ISSN: 1742-2876, (Proceedings of the First Annual DFRWS Europe). |