File Content-based Malware Classification

Mahendra Deore; Chhaya S. Gosavi

Authors: Mahendra Deore¹, Chhaya S. Gosavi²

¹ Department of Computer Engineering, Cummins College of Engineering for Women, Pune, Maharashtra, India
² Department of Computer Engineering, Cummins College of Engineering for Women, Pune, Maharashtra, India
Source: Artificial Intelligence, Machine Learning and User Interface Design, pp 223-240
Publication Date: May 2024
Language: English

Malicious software (malware) becomes a serious threat to system security the moment any electronic device or "Thing" is connected to the World Wide Web (WWW). Malware is stealthy software that collects sensitive information, gains access to private systems, and can disrupt device operation. It thus acts against the user's interests and threatens all operating systems (OS), especially Windows and Android, because they are the most widely used. Malware developers try to invade systems by means of viruses, adware, spyware, ransomware, botware, Trojans, etc., and they apply various anti-forensic techniques so that the malware cannot be detected or investigated. In effect, malware developers play peekaboo with malware investigators. As a result, investigating such attacks becomes more complex and often fails because of immature forensic methodology or a lack of appropriate tools. This chapter is a first step towards analysing malware. The process started with collecting a malware dataset and understanding it. Machine learning (ML) has two basic blocks: feature extraction and classification. In supervised learning, the features play a significant role, so understanding the features and their effect on classification was a major task. Two separate experimental processes were explored. The first extracted n-grams from the binary files using the kfNgram tool, and the second used a shell script to parse the assembly files for method calls to external API libraries. Several supervised machine learning classifiers, such as Decision Trees, SVM, and Naive Bayes, were used to classify the malware family based on the extracted features. We propose a method to classify malware into the nine families defined in the Kaggle dataset. It analyses the n-grams of a malware file to generate the feature vector. The value of n is selectable; at present, it is four.
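To make the n-gram feature vector concrete, the following is a minimal Python sketch of counting overlapping 4-byte n-grams in a file's raw bytes and turning the counts into a fixed-length vector. It is only an illustration: the chapter used the kfNgram tool, and the function names `byte_ngrams` and `feature_vector`, the sample bytes, and the two-entry vocabulary are all assumptions introduced here.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every overlapping n-byte window in `data` (hypothetical helper)."""
    # A file shorter than n bytes yields an empty range, hence an empty Counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def feature_vector(data: bytes, vocabulary: list[bytes], n: int = 4) -> list[int]:
    """Map a file to one count per vocabulary n-gram, in vocabulary order."""
    counts = byte_ngrams(data, n)
    return [counts[gram] for gram in vocabulary]


# Illustrative input only: ten made-up bytes, not taken from any dataset.
sample = bytes.fromhex("558bec83ec08558bec83")
counts = byte_ngrams(sample)

# Assumed vocabulary: one n-gram that occurs in the sample, one that does not.
vocab = [bytes.fromhex("558bec83"), bytes.fromhex("deadbeef")]
vec = feature_vector(sample, vocab)
print(vec)
```

Using a fixed vocabulary keeps every file's vector the same length, which is what the supervised classifiers named above would need as input.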
The objective was to extract the most informative n-grams from the binary files after pre-processing, i.e., after calculating the information gain (IG) of each n-gram. At present, the top five hundred n-grams from the ranked list are selected. SVM and Decision Trees were observed to reach an accuracy of about 98%. Nevertheless, there is room for improvement, because the sequential selection of n-grams may include irrelevant ones. This method is considered a starting point for malware classification.
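The IG-based selection step can be sketched as follows, assuming that IG is computed on the presence or absence of each n-gram in a file with respect to the family label. The abstract does not give the exact formulation, so this is one common reading and not necessarily the chapter's; the helper names, the toy samples, and the labels `family_1` and `family_2` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(present, labels):
    """IG of a binary feature (n-gram present or absent) for the labels."""
    with_gram = [y for p, y in zip(present, labels) if p]
    without_gram = [y for p, y in zip(present, labels) if not p]
    total = len(labels)
    conditional = (len(with_gram) / total) * entropy(with_gram) \
        + (len(without_gram) / total) * entropy(without_gram)
    return entropy(labels) - conditional


def top_k_ngrams(samples, labels, k=500):
    """Rank all n-grams by IG and keep the k best (ties broken by byte order)."""
    vocabulary = set().union(*samples)
    scored = {g: information_gain([g in s for s in samples], labels)
              for g in vocabulary}
    return sorted(scored, key=lambda g: (-scored[g], g))[:k]


# Toy data: each sample is the set of n-grams seen in one file.
samples = [{b"aaaa", b"cccc"}, {b"aaaa", b"cccc"},
           {b"bbbb", b"cccc"}, {b"bbbb", b"cccc"}]
labels = ["family_1", "family_1", "family_2", "family_2"]
selected = top_k_ngrams(samples, labels, k=2)
print(selected)
```

In this toy data, `b"cccc"` appears in every file and therefore has zero IG, while the other two n-grams separate the families perfectly. The default `k=500` mirrors the threshold stated in the abstract.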

Hardbound ISBN: 9789815179613

Ebook ISBN: 9789815179606

Book DOI: https://doi.org/10.2174/97898151796061240101
