51. Breitinger, Frank; Stivaktakis, Georgios; Roussev, Vassil: Evaluating Detection Error Trade-offs for Bytewise Approximate Matching Algorithms. In: Digital Investigation, 11 (2), pp. 81–89, 2014. ISSN: 1742-2876. DOI: 10.1016/j.diin.2014.05.002. Best Paper Award.
Abstract: Bytewise approximate matching is a relatively new area within digital forensics, but its importance is growing quickly as practitioners are looking for fast methods to analyze the increasing amounts of data in forensic investigations. The essential idea is to complement the use of cryptographic hash functions, which detect data objects with bytewise identical representations, with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have been studied and tested for a long time, approximate matching algorithms are still in their early development stages and have been evaluated in a somewhat ad hoc manner. Recently, the FRASH testing framework has been proposed as a vehicle for developing a set of standardized tests for approximate matching algorithms; the aim is to provide a useful guide for understanding and comparing the absolute and relative performance of different algorithms. The contribution of this work is twofold: a) expand FRASH with automated tests for quantifying approximate matching algorithm behavior with respect to precision and recall; and b) present a case study of two algorithms already in use: sdhash and ssdeep.
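Since the tests above quantify behavior in terms of precision and recall, here is a minimal sketch of the two measures in Python (the variable names are mine, not FRASH's):

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Precision: fraction of reported matches that are genuine.
    Recall: fraction of genuine matches that were reported."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall
```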
52. Breitinger, Frank; Baggili, Ibrahim: File Detection On Network Traffic Using Approximate Matching. In: Journal of Digital Forensics, Security and Law (JDFSL), 9 (2), pp. 23–36, 2014. DOI: 10.15394/jdfsl.2014.1168. Best Paper Award.
Abstract: In recent years, Internet technologies have changed enormously, allowing faster Internet connections, higher data rates and mobile usage. Hence, it is possible to send huge amounts of data/files easily, which is often exploited by insiders or attackers to steal intellectual property. As a consequence, data leakage prevention systems (DLPS) have been developed, which analyze network traffic and alert in case of a data leak. Although the overall concepts of the detection techniques are known, the systems are mostly closed and commercial. Within this paper we present a new technique for network traffic analysis based on approximate matching (a.k.a. fuzzy hashing), which is very common in digital forensics to correlate similar files. This paper demonstrates how to optimize and apply it to single network packets. Our contribution is a straightforward concept which does not need a comprehensive configuration: hash the file and store the digest in the database. Within our experiments we obtained false positive rates between 10⁻⁴ and 10⁻⁵ and an algorithm throughput of over 650 Mbit/s.
53. Breitinger, Frank; Guttman, Barbara; McCarrin, Michael; Roussev, Vassil; White, Douglas: Approximate Matching: Definition and Terminology. Technical Report, NIST Special Publication 800-168, National Institute of Standards and Technology, 2014. DOI: 10.6028/NIST.SP.800-168.
54. Breitinger, Frank; Roussev, Vassil: Automated evaluation of approximate matching algorithms on real data. In: Digital Investigation, 11, Supplement 1, pp. S10–S17, 2014. ISSN: 1742-2876. DOI: 10.1016/j.diin.2014.03.002. Proceedings of the First Annual DFRWS Europe.
Abstract: Bytewise approximate matching is a relatively new area within digital forensics, but its importance is growing quickly as practitioners are looking for fast methods to screen and analyze the increasing amounts of data in forensic investigations. The essential idea is to complement the use of cryptographic hash functions, which detect data objects with bytewise identical representations, with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have been studied and tested for a long time, approximate matching algorithms are still in their early development stages and their evaluation methodology is still evolving. Broadly, prior approaches have used either a human in the loop to manually evaluate the goodness of similarity matches on real world data, or controlled (pseudo-random) data to perform automated evaluation. This work's contribution is to introduce automated approximate matching evaluation on real data by relating approximate matching results to the longest common substring (LCS). Specifically, we introduce a computationally efficient LCS approximation and use it to obtain ground truth on the t5 set. Using the results, we evaluate three existing approximate matching schemes relative to LCS and analyze their performance.
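For reference, the ground-truth notion the paper approximates is the longest common substring. Below is a sketch of the exact (quadratic) dynamic-programming computation; it is not the paper's efficient approximation, which is the point of the contribution:

```python
def longest_common_substring(a: bytes, b: bytes) -> int:
    """Length of the longest contiguous byte sequence shared by a and b,
    via the classic O(len(a) * len(b)) dynamic program."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the common run ending here
                best = max(best, cur[j])
        prev = cur
    return best
```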
55. Breitinger, Frank; Baier, Harald; White, Douglas: On the database lookup problem of approximate matching. In: Digital Investigation, 11, Supplement 1, pp. S1–S9, 2014. ISSN: 1742-2876. DOI: 10.1016/j.diin.2014.03.001. Proceedings of the First Annual DFRWS Europe.
Abstract: Investigating seized devices within digital forensics gets more and more difficult due to the increasing amount of data. Hence, a common procedure uses automated file identification, which reduces the amount of data an investigator has to look at by hand. Besides identifying exact duplicates, which is mostly solved using cryptographic hash functions, it is also helpful to detect similar data by applying approximate matching. Let x denote the number of digests in a database; then the lookup for a single similarity digest has a complexity of O(x). In other words, the digest has to be compared against all digests in the database. In contrast, cryptographic hash values are stored within binary trees or hash tables, and hence the lookup complexity of a single digest is O(log₂(x)) or O(1), respectively. In this paper we present and evaluate a concept to extend existing approximate matching algorithms, which reduces the lookup complexity from O(x) to O(1). Instead of using multiple small Bloom filters (the common procedure), we demonstrate that a single, huge Bloom filter has far better performance. Our evaluation demonstrates that current approximate matching algorithms are too slow (e.g., over 21 min to compare 4457 digests of a common file corpus against each other) while the improved version solves this challenge within seconds. Studying the precision and recall rates shows that our approach works as reliably as the original implementations. We obtain this benefit at the cost of accuracy: the comparison is now a file-against-set comparison, and thus it is not possible to see which file in the database is matched.
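A minimal sketch of the underlying idea: a single large Bloom filter answers "is a similar chunk in the reference set?" in O(1) per lookup, at the cost of not identifying which file matched. The sizes and hash count below are hypothetical placeholders, not the parameters derived in the paper:

```python
import hashlib

class BloomFilter:
    """Single large Bloom filter for file-against-set membership tests."""

    def __init__(self, size_bits: int = 2**24, num_hashes: int = 5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, chunk: bytes):
        # Derive k independent bit positions from salted cryptographic digests.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + chunk).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, chunk: bytes) -> None:
        for p in self._positions(chunk):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, chunk: bytes) -> bool:
        # May yield false positives, never false negatives.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(chunk))
```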
56. Breitinger, Frank: Reducing data for forensic investigations using approximate matching. Presentation at the University of New Haven, 2014.
57. Breitinger, Frank; Ziroff, Georg; Lange, Steffen; Baier, Harald: Similarity Hashing Based on Levenshtein Distances. In: Peterson, Gilbert; Shenoi, Sujeet (Eds.): Advances in Digital Forensics X, IFIP Advances in Information and Communication Technology, vol. 433, pp. 133–147, Springer Berlin Heidelberg, 2014. ISBN: 978-3-662-44951-6. DOI: 10.1007/978-3-662-44952-3_10.
Abstract: It is increasingly common in forensic investigations to use automated pre-processing techniques to reduce the massive volumes of data that are encountered. This is typically accomplished by comparing fingerprints (typically cryptographic hashes) of files against existing databases. In addition to finding exact matches of cryptographic hashes, it is necessary to find approximate matches corresponding to similar files, such as different versions of a given file. This paper presents a new stand-alone similarity hashing approach called saHash, which has a modular design and operates in linear time. saHash is almost as fast as SHA-1 and more efficient than other approaches for approximate matching. The similarity hashing algorithm uses four sub-hash functions, each producing its own hash value. The four sub-hashes are concatenated to produce the final hash value. This modularity enables sub-hash functions to be added or removed, e.g., if an exploit for a sub-hash function is discovered. Given the hash values of two byte sequences, saHash returns a lower bound on the number of Levenshtein operations between the two byte sequences as their similarity score. The robustness of saHash is verified by comparing it with other approximate matching approaches such as sdhash.
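saHash bounds the Levenshtein distance from compact sub-hashes rather than computing it. For orientation, this is the exact quantity being bounded, as the standard dynamic program (a reference sketch, not part of saHash):

```python
def levenshtein(a: bytes, b: bytes) -> int:
    """Minimum number of insertions, deletions and substitutions
    turning a into b; O(len(a) * len(b)) time, O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]
```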
58. Breitinger, Frank; Winter, Christian; Yannikos, York; Fink, Tobias; Seefried, Michael: Using Approximate Matching to Reduce the Volume of Digital Data. In: Peterson, Gilbert; Shenoi, Sujeet (Eds.): Advances in Digital Forensics X, IFIP Advances in Information and Communication Technology, vol. 433, pp. 149–163, Springer Berlin Heidelberg, 2014. ISBN: 978-3-662-44951-6. DOI: 10.1007/978-3-662-44952-3_11.
Abstract: Digital forensic investigators frequently have to search for relevant files in massive digital corpora -- a task often compared to finding a needle in a haystack. To address this challenge, investigators typically apply cryptographic hash functions to identify known files. However, cryptographic hashing only allows the detection of files that exactly match the known file hash values or fingerprints. This paper demonstrates the benefits of using approximate matching to locate relevant files. The experiments described in this paper used three test images of Windows XP, Windows 7 and Ubuntu 12.04 systems to evaluate fingerprint-based comparisons. The results reveal that approximate matching can improve file identification -- in one case, increasing the identification rate from 1.82% to 23.76%.
59. Breitinger, Frank; Baier, Harald: Similarity Preserving Hashing: Eligible Properties and a New Algorithm MRSH-v2. In: Rogers, Marcus; Seigfried-Spellar, Kathryn C. (Eds.): Digital Forensics and Cyber Crime, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 114, pp. 167–182, Springer Berlin Heidelberg, 2013. ISBN: 978-3-642-39890-2. DOI: 10.1007/978-3-642-39891-9_11.
Abstract: Hash functions are a widespread class of functions in computer science, used in several applications, e.g., in computer forensics to identify known files. One basic property of cryptographic hash functions is the avalanche effect, which causes a significantly different output if an input is changed slightly. As some applications also need to identify similar files (e.g., spam/virus detection), this raised the need for similarity preserving hashing. In recent years, several approaches have emerged, all with different names, properties, strengths and weaknesses, which is due to a missing definition. Based on the properties and use cases of traditional hash functions, this paper discusses a uniform naming and set of properties, a first step towards a suitable definition of similarity preserving hashing. Additionally, we extend the algorithm MRSH for similarity preserving hashing to its successor MRSH-v2, which has three specialties. First, it fulfills all our proposed defining properties; second, it outperforms existing approaches, especially with respect to run time performance; and third, it has two detection modes. The regular mode of MRSH-v2 is used to identify similar files, whereas the f-mode is optimal for fragment detection, i.e., to identify similar parts of a file.
60. Rathgeb, Christian; Breitinger, Frank; Busch, Christoph: Alignment-free cancelable iris biometric templates based on adaptive bloom filters. In: Biometrics (ICB), 2013 International Conference on, pp. 1–8, 2013. DOI: 10.1109/ICB.2013.6612976.
Abstract: Biometric characteristics are largely immutable, i.e., unprotected storage of biometric data provokes serious privacy threats, e.g., identity theft, limited renewability, or cross-matching. In accordance with the ISO/IEC 24745 standard, technologies of cancelable biometrics offer solutions to biometric information protection by obscuring the biometric signal in a non-invertible manner, while biometric comparisons are still feasible in the transformed domain. In the presented work, alignment-free cancelable iris biometrics based on adaptive Bloom filters are proposed. Bloom filter-based representations of binary biometric templates (iris-codes) enable an efficient alignment-invariant biometric comparison, while a successive mapping of parts of a binary biometric template to a Bloom filter represents an irreversible transform. In experiments, which are carried out on the CASIA-v3 iris database, it is demonstrated that the proposed system maintains biometric performance for diverse iris recognition algorithms, protecting biometric templates at high security levels.
61. Guttman, Barbara; Breitinger, Frank; Garfinkel, Simson; Kornblum, Jesse; Shields, Clay: Approximate Matching of Digital Artifacts. Panel discussion at the 13th Digital Forensics Research Conference (DFRWS13), Monterey, California, 2013.
62. Breitinger, Frank; Astebøl, Knut Petter; Baier, Harald; Busch, Christoph: mvHash-B - A New Approach for Similarity Preserving Hashing. In: IT Security Incident Management and IT Forensics (IMF), 2013 Seventh International Conference on, pp. 33–44, 2013. DOI: 10.1109/IMF.2013.18.
Abstract: The handling of hundreds of thousands of files is a major challenge in today's IT forensic investigations. In order to cope with this information overload, investigators use fingerprints (hash values) to identify known files automatically using blacklists or whitelists. Besides detecting exact duplicates, it is helpful to locate similar files by using similarity preserving hashing (SPH), too. We present a new algorithm for similarity preserving hashing. It is based on the idea of majority voting in conjunction with run length encoding to compress the input data and uses Bloom filters to represent the fingerprint. It is therefore called mvHash-B. Our assessment shows that mvHash-B is superior to other SPHs with respect to run time efficiency: it is almost as fast as SHA-1 and thus faster than any other SPH algorithm. Additionally, the hash value length is approximately 0.5% of the input length and hence outperforms most existing algorithms. Finally, we show that the robustness of mvHash-B against active manipulation is sufficient for practical purposes.
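A minimal sketch of the two compression stages the abstract names, majority voting followed by run length encoding; the window width and threshold are hypothetical, and the final Bloom filter stage is omitted:

```python
def majority_vote(data: bytes, w: int = 2) -> bytes:
    """Smooth each byte to 0x00 or 0xFF by majority of set bits
    in its +/- w byte neighborhood."""
    half = (2 * w + 1) * 8 // 2  # half the bits in a full window
    out = bytearray()
    for i in range(len(data)):
        lo, hi = max(0, i - w), min(len(data), i + w + 1)
        ones = sum(bin(b).count("1") for b in data[lo:hi])
        out.append(0xFF if ones > half else 0x00)
    return bytes(out)

def run_length_encode(data: bytes) -> list[int]:
    """Lengths of the alternating 0x00/0xFF runs produced above."""
    if not data:
        return []
    runs, count = [], 1
    for prev, cur in zip(data, data[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs
```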
63. Breitinger, Frank: Similarity Preserving Hashing. Presentation at the 8th GI SIG SIDAR Graduate Workshop on Reactive Security (SPRING), 2013.
64. Breitinger, Frank; Petrov, Kaloyan: Reducing the Time Required for Hashing Operations. In: Peterson, Gilbert; Shenoi, Sujeet (Eds.): Advances in Digital Forensics IX, IFIP Advances in Information and Communication Technology, vol. 410, pp. 101–117, Springer Berlin Heidelberg, 2013. ISBN: 978-3-642-41147-2. DOI: 10.1007/978-3-642-41148-9_7.
Abstract: Due to the increasingly massive amounts of data that need to be analyzed in digital forensic investigations, it is necessary to automatically recognize suspect files and filter out non-relevant files. To achieve this goal, digital forensic practitioners employ hashing algorithms to classify files into known-good, known-bad and unknown files. However, a typical personal computer may store hundreds of thousands of files and the task becomes extremely time-consuming. This paper attempts to address the problem using a framework that speeds up processing by using multiple threads. Unlike a typical multithreading approach, where the hashing algorithm is performed by multiple threads, the proposed framework incorporates a dedicated prefetcher thread that reads files from a device. Experimental results demonstrate a runtime efficiency gain of nearly 40% over single threading.
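A minimal sketch of the prefetcher pattern described above, using only Python's standard library; the worker count, buffer size and use of SHA-1 are placeholder assumptions, not the paper's configuration:

```python
import hashlib
import queue
import threading
from pathlib import Path

def hash_tree(root: str, num_workers: int = 4) -> dict[str, str]:
    """One dedicated prefetcher reads files from disk while worker
    threads hash the buffered contents (I/O and CPU overlap)."""
    buf: queue.Queue = queue.Queue(maxsize=32)  # bounded buffer between I/O and CPU
    results: dict[str, str] = {}
    lock = threading.Lock()

    def prefetcher():
        for path in Path(root).rglob("*"):
            if path.is_file():
                buf.put((str(path), path.read_bytes()))
        for _ in range(num_workers):            # poison pills stop the workers
            buf.put(None)

    def worker():
        # hashlib releases the GIL on large buffers, so workers run concurrently
        while (item := buf.get()) is not None:
            name, data = item
            digest = hashlib.sha1(data).hexdigest()
            with lock:
                results[name] = digest

    threads = [threading.Thread(target=prefetcher)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```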
65. Breitinger, Frank; Baier, Harald: Performance Issues About Context-Triggered Piecewise Hashing. In: Gladyshev, Pavel; Rogers, Marcus K. (Eds.): Digital Forensics and Cyber Crime, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 88, pp. 141–155, Springer Berlin Heidelberg, 2012. ISBN: 978-3-642-35514-1. DOI: 10.1007/978-3-642-35515-8_12.
Abstract: A hash function is a well-known method in computer science to map arbitrarily large data to bit strings of a fixed short length. This property is used in computer forensics to identify known files on the basis of their hash values. As of today, hash values of files are generated in a preparatory step and stored in a database; typically a cryptographic hash function like MD5 or SHA-1 is used. Later, the investigator computes hash values of files which he finds on a storage medium and performs lookups in his database. Due to the security properties of cryptographic hash functions, they cannot be used to identify similar files. Therefore Jesse Kornblum proposed a similarity preserving hash function to identify similar files. This paper discusses the efficiency of Kornblum's approach. We present some enhancements that increase the performance of his algorithm by 55% if applied to a real life scenario. Furthermore, we discuss some characteristics of a sample Windows XP system which are relevant for the performance of Kornblum's approach.
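For context, Kornblum's context-triggered piecewise hashing splits the input at content-defined trigger points found by a rolling hash, so a local edit only perturbs nearby chunks. A minimal sketch with a simplified additive rolling hash (not ssdeep's actual rolling function or block-size schedule):

```python
def rolling_chunks(data: bytes, block_size: int = 64, window: int = 7) -> list[bytes]:
    """Split data wherever a rolling hash over the last `window` bytes
    hits the trigger value block_size - 1."""
    chunks, start, rh = [], 0, 0
    for i, byte in enumerate(data):
        rh += byte
        if i >= window:
            rh -= data[i - window]            # slide the window forward
        if rh % block_size == block_size - 1:
            chunks.append(data[start:i + 1])  # trigger: close the current chunk
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])           # trailing chunk
    return chunks
```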
66. Breitinger, Frank: Similarity Preserving Hashing. Presentation at the CAST Workshop "Forensik und Internetkriminalität", Darmstadt, Germany, 2012.
67. Breitinger, Frank; Baier, Harald: Properties of a similarity preserving hash function and their realization in sdhash. In: Information Security for South Africa (ISSA), pp. 1–8, 2012. DOI: 10.1109/ISSA.2012.6320445.
Abstract: Finding similarities between byte sequences is a complex task and necessary in many areas of computer science, e.g., to identify malicious files or spam. Instead of comparing files against each other, one may first apply a similarity preserving compression function (hash function) and do the comparison on the hashes. Although different approaches exist, there is no clear definition or specification of the required properties of such algorithms. This paper presents four basic properties for similarity preserving hash functions that are partly related to the properties of cryptographic hash functions. Compression and ease of computation are borrowed from traditional hash functions and define the hash value length and the performance. As every byte is expected to influence the hash value, we introduce coverage. Similarity score describes the need for a comparison function for hash values. We briefly discuss these properties with respect to three existing approaches and finally take a detailed look at the promising approach sdhash. However, we uncovered some bugs and other peculiarities of the implementation of sdhash. We conclude that sdhash has the potential to be a robust similarity preserving digest algorithm, but some points need to be improved.
68. Breitinger, Frank; Baier, Harald; Beckingham, Jesse: Security and implementation analysis of the similarity digest sdhash. In: First International Baltic Conference on Network Security & Forensics (NeSeFo), 2012.
69. Breitinger, Frank; Baier, Harald: A fuzzy hashing approach based on random sequences and hamming distance. In: Proceedings of the Conference on Digital Forensics, Security and Law, pp. 89–100, 2012. URL: https://commons.erau.edu/cgi/viewcontent.cgi?article=1193&context=adfsl.
Abstract: Hash functions are well-known methods in computer science to map arbitrarily large input to bit strings of a fixed length that serve as unique input identifiers/fingerprints. A key property of cryptographic hash functions is that even if only one bit of the input is changed, the output behaves pseudo-randomly; therefore similar files cannot be identified. However, in the area of computer forensics it is also necessary to find similar files (e.g., different versions of a file), wherefore we need a similarity preserving hash function, also called a fuzzy hash function. In this paper we present a new approach for fuzzy hashing called bbHash. It is based on the idea of 'rebuilding' an input as well as possible using a fixed set of randomly chosen byte sequences called building blocks of byte length l (e.g., l = 128). The procedure is as follows: slide through the input byte-by-byte, read out the current input byte sequence of length l, and compute the Hamming distances of all building blocks against the current input byte sequence. Each building block with a Hamming distance smaller than a certain threshold contributes to the file's bbHash. We discuss (dis-)advantages of our bbHash compared to other fuzzy hashing approaches. A key property of bbHash is that it is the first fuzzy hashing approach based on a comparison to external data structures.
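A minimal sketch of the sliding-window procedure the abstract spells out; the building blocks would be fixed random byte sequences shared by all parties, and the threshold here is a hypothetical placeholder:

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def bbhash_sketch(data: bytes, blocks: list[bytes], threshold: int = 200) -> list[int]:
    """Slide a window of the building-block length over the input and
    record the index of every block closer than the threshold."""
    l = len(blocks[0])  # all building blocks share byte length l
    hits = []
    for i in range(len(data) - l + 1):
        window = data[i:i + l]
        for idx, block in enumerate(blocks):
            if hamming(window, block) < threshold:
                hits.append(idx)
    return hits
```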
70. Baier, Harald; Breitinger, Frank: Security Aspects of Piecewise Hashing in Computer Forensics. In: IT Security Incident Management and IT Forensics (IMF), 2011 Sixth International Conference on, pp. 21–36, 2011. DOI: 10.1109/IMF.2011.16.
Abstract: Although hash functions are a well-known method in computer science to map arbitrarily large data to bit strings of a fixed length, their use in computer forensics is currently very limited. As of today, hash values of files are generated in a preparatory step and stored in a database; typically a cryptographic hash function like MD5 or SHA-1 is used. Later, the investigator computes hash values of files which he finds on a storage medium and performs lookups in his database. This approach has several drawbacks, which have been sketched in the community, and some alternative approaches have been proposed. The most popular one is due to Jesse Kornblum, who transferred ideas from spam detection to computer forensics in order to identify similar files. However, his proposal lacks a thorough security analysis. It is therefore one aim of the paper at hand to present some possible attack vectors of an active adversary to bypass Kornblum's approach. Furthermore, we present a pseudo-random number generator that is both more efficient and more random compared to Kornblum's pseudo-random number generator.
71. Breitinger, Frank: Security Aspects of fuzzy hashing. Master's thesis, University of Applied Sciences Darmstadt, 2011.
72. Breitinger, Frank; Baier, Harald: Security Aspects of Piecewise Hashing in Computer Forensics. Abstract and presentation at the 6th GI SIG SIDAR Graduate Workshop on Reactive Security (SPRING), SR-2011-01, p. 11, 2011.
73. Breitinger, Frank; Nickel, Claudia: User Survey on Phone Security and Usage. In: Brömme, Arslan; Busch, Christoph (Eds.): BIOSIG, LNI, vol. 164, pp. 139–144, GI, 2010. ISBN: 978-3-88579-258-1.
Abstract: Mobile phones are widely used nowadays and over the last years have developed from simple phones into small computers with an increasing number of features. These result in a wide variety of data stored on the devices, which could pose a high security risk in case of unauthorized access. A comprehensive user survey was conducted to get information about what data is really stored on the mobile devices, how it is currently protected, and whether biometric authentication methods could improve the current state. This paper presents the results from about 550 users of mobile devices. The analysis revealed a very low security level of the devices. This is partly due to a low security awareness of their owners and partly due to the low acceptance of the offered PIN-based authentication method. Further results, such as experiences with mobile thefts and the willingness to use biometric authentication methods as an alternative to PIN authentication, are also stated.