{"id":85612,"date":"2018-02-02T17:11:12","date_gmt":"2018-02-02T11:41:12","guid":{"rendered":"https:\/\/blogs_admin.quickheal.com\/?p=85612"},"modified":"2018-02-05T12:58:30","modified_gmt":"2018-02-05T07:28:30","slug":"machine-learning-approach-advanced-threat-hunting","status":"publish","type":"post","link":"https:\/\/www.quickheal.com\/blogs\/machine-learning-approach-advanced-threat-hunting\/","title":{"rendered":"Machine learning approach for advanced threat hunting"},"content":{"rendered":"<p><span style=\"font-weight: 400\">In today\u2019s fast-changing world, the cyber threat landscape is getting increasingly complex and signature-based systems are falling behind to protect endpoints. All major security solutions are built with layered security models to protect endpoints from today\u2019s advanced threats. Machine learning-based detection is also becoming an inevitable component of these layered security models. In this post, we are going to discuss how Quick Heal is incorporating machine learning (ML) in its products to protect endpoints.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Use of ML or AI (artificial intelligence) has already been embraced by the security industry. We, at Quick Heal, are using ML for different use cases. One of them is an initial layer of our multi-level defense. The main objective of this layer is to determine the suspiciousness of a given file or sample. If it finds the file as suspicious, then the file will be scrutinized with the next layers. 
This filtering reduces the load on the other security layers.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Here, we are going to discuss how we use ML-based static analysis of PE files to filter the traffic sent to the next security layers.<\/span><\/p>\n<h4><b>Machine learning overview<\/b><\/h4>\n<p><span style=\"font-weight: 400\">In <\/span><strong>machine learning<\/strong><span style=\"font-weight: 400\">, we try to give computers the ability to learn from data instead of being explicitly programmed. In our problem statement, we can use the power of ML to build a model that is trained on known clean and malicious files and that can then predict the nature of unknown sample files. The key requirement for building ML models is to have data for training, and it must be accurately labeled; otherwise, results and predictions can be misleading.<\/span><\/p>\n<p><span style=\"font-weight: 400\">The process starts with organizing a training set by collecting a large number of clean and malicious samples and labeling them. We extract features from these samples (for our models, we extracted features from the PE headers of all the collected samples). 
Then the model is trained by providing these feature matrices to it <\/span>&#8211; as shown in the block diagram below.<\/p>\n<figure id=\"attachment_85618\" aria-describedby=\"caption-attachment-85618\" style=\"width: 589px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-85618\" src=\"https:\/\/blogs_admin.quickheal.com\/wp-content\/uploads\/2018\/02\/ML1-1-589x390.png\" alt=\"Machine Learning Training Cycle\" width=\"589\" height=\"390\" srcset=\"https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML1-1-589x390.png 589w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML1-1-300x199.png 300w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML1-1-768x509.png 768w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML1-1-789x523.png 789w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML1-1.png 800w\" sizes=\"(max-width: 589px) 100vw, 589px\" \/><figcaption id=\"caption-attachment-85618\" class=\"wp-caption-text\">Fig 1. Machine Learning Training Cycle<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400\">A feature array of an unknown sample is provided to this trained model. 
It returns a confidence score for that sample, indicating whether the file is similar to the clean or the malicious set.<\/span><\/p>\n<figure id=\"attachment_85619\" aria-describedby=\"caption-attachment-85619\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-85619\" src=\"https:\/\/blogs_admin.quickheal.com\/wp-content\/uploads\/2018\/02\/ML2-1-650x324.png\" alt=\"ML Prediction stage\" width=\"650\" height=\"324\" srcset=\"https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML2-1-650x324.png 650w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML2-1-300x150.png 300w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML2-1-768x383.png 768w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML2-1-789x394.png 789w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML2-1.png 800w\" sizes=\"(max-width: 650px) 100vw, 650px\" \/><figcaption id=\"caption-attachment-85619\" class=\"wp-caption-text\">Fig 2. ML prediction stage<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<h4><b>Feature selection<\/b><\/h4>\n<p><span style=\"font-weight: 400\">A feature is an individual measurable property or characteristic of the things being observed. For example, if we want to classify cats and dogs, we can take features like \u201ctype of fur\u201d, \u201cloves the company of others or loves to stay alone\u201d, \u201cbarks or not\u201d, \u201cnumber of legs\u201d, \u201cvery sharp claws\u201d, etc.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Selecting good features is one of the most important steps in training any ML model. <\/span>As shown in the above example, if we select \u201cnumber of legs\u201d as a feature for training, it won\u2019t help us classify, as both dogs and cats have the same number of legs. But if we take \u201cbarks or not\u201d, it would be a good classifier, as dogs bark and cats don\u2019t. 
Also, \u201csharpness of claws\u201d would be a good feature.<\/p>\n<p><span style=\"font-weight: 400\">Along similar lines, to separate malicious samples from clean ones, we need features. So, we start with deriving <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Portable_Executable\"><span style=\"font-weight: 400\">PE file header<\/span><\/a><span style=\"font-weight: 400\"> attributes, which need little computation to extract. We have some Boolean-type features like \u201cEntryPoint present in First Section\u201d, \u201cResource present\u201d or \u201cSection has a special character in its name\u201d, and some count- or value-based ones like \u201cNumberOfSections\u201d, \u201cFile TimeDateStamp\u201d, etc. Some of these features are so strong that they often act as good separators for malicious and benign samples. We will explain this with the following two examples. As shown in Graph 1 below, in most cases, for clean files, the entry point section is the 0th section, but for malicious files, it may vary.<\/span><\/p>\n<figure id=\"attachment_85620\" aria-describedby=\"caption-attachment-85620\" style=\"width: 600px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-85620\" src=\"https:\/\/blogs_admin.quickheal.com\/wp-content\/uploads\/2018\/02\/ML3.png\" alt=\"%(percentage) of Entry point section index for clean v\/s malicious\" width=\"600\" height=\"371\" srcset=\"https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML3.png 600w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML3-300x186.png 300w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-85620\" class=\"wp-caption-text\">Graph 1. 
%(percentage) of entry point section index for clean v\/s malicious<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">Similarly, for the index of the \u201cfirst empty section\u201d of a file, we get the following plot.<\/span><\/p>\n<figure id=\"attachment_85621\" aria-describedby=\"caption-attachment-85621\" style=\"width: 600px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-85621\" src=\"https:\/\/blogs_admin.quickheal.com\/wp-content\/uploads\/2018\/02\/ML4.png\" alt=\"%(percentage) of index of first empty section for clean v\/s malicious\" width=\"600\" height=\"371\" srcset=\"https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML4.png 600w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML4-300x186.png 300w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-85621\" class=\"wp-caption-text\">Graph 2. %(percentage) of index of the first empty section for clean v\/s malicious<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<h4><b>Feature scaling<\/b><\/h4>\n<p><span style=\"font-weight: 400\">The data we feed to our model must be scaled properly. For example, an attribute like <\/span><i><span style=\"font-weight: 400\">Address_OF_EntryPoint<\/span><\/i><span style=\"font-weight: 400\"> = 307584 has a very large value compared with an attribute like <\/span><i><span style=\"font-weight: 400\">No_Of_Sections<\/span><\/i><span style=\"font-weight: 400\"> = 5. Such a huge difference can cause our model to be poorly trained. We use <\/span><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.MinMaxScaler.html\"><i><span style=\"font-weight: 400\">min_max scaling<\/span><\/i><\/a><span style=\"font-weight: 400\"> to scale the data. 
Here, we bring all the features to the same range, in this case 0 to 1.<\/span><\/p>\n<h4><b>Dimensionality reduction<\/b><\/h4>\n<p><span style=\"font-weight: 400\">The next step is <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Dimensionality_reduction\"><span style=\"font-weight: 400\">dimensionality reduction<\/span><\/a><span style=\"font-weight: 400\">. Data often contains groups of features that are correlated with each other, i.e., if one feature changes, the others also change at some rate. Removing or combining such features saves space and time and improves the performance of machine learning models. This process of reducing the number of unwanted or redundant features is known as dimensionality reduction. It ensures that models are trained and tested faster without compromising accuracy. We tried PCA (Principal Component Analysis) and <\/span><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_selection.SelectKBest.html\"><i><span style=\"font-weight: 400\">Select K-Best<\/span><\/i><\/a><span style=\"font-weight: 400\"> with various scoring algorithms.<\/span><\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis\"><span style=\"font-weight: 400\">PCA<\/span><\/a>\u00a0<span style=\"font-weight: 400\">is used to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called <\/span><b>principal components<\/b><span style=\"font-weight: 400\">. PCA tries to find the principal components such that the distance between our samples and the principal components is minimized. 
The smaller the distance between samples and principal components, the more variance is retained.<\/span><\/p>\n<p><span style=\"font-weight: 400\">We can make a plot of cumulative variance versus the number of principal components to find the number of Principal Components (PCs) required to cover almost all the data.\u00a0<\/span>Applying PCA on our data and tuning <span style=\"font-weight: 400\">it gives us <\/span><strong>163 PCs<\/strong> <span style=\"font-weight: 400\">for<\/span> <span style=\"font-weight: 400\">nearly <\/span><strong>1000<\/strong><span style=\"font-weight: 400\"> different features. That means we can now train the model on 163 features<\/span> <span style=\"font-weight: 400\">that cover almost the complete data set.<\/span><\/p>\n<figure id=\"attachment_85622\" aria-describedby=\"caption-attachment-85622\" style=\"width: 392px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-85622\" src=\"https:\/\/blogs_admin.quickheal.com\/wp-content\/uploads\/2018\/02\/ML5.png\" alt=\"cumulative variance v\/s No. of Principal components\" width=\"392\" height=\"266\" srcset=\"https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML5.png 392w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML5-300x204.png 300w\" sizes=\"(max-width: 392px) 100vw, 392px\" \/><figcaption id=\"caption-attachment-85622\" class=\"wp-caption-text\">Graph 3. Cumulative variance v\/s no. of principal components<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">We can <\/span>transform our test data into <span style=\"font-weight: 400\">these 163 Principal Components and apply classification models. Though PCA works fine, we need to calculate PCs at a client&#8217;s end before prediction. 
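The cumulative-variance analysis read off Graph 3 above can be sketched as follows; random placeholder data stands in for our real feature matrix here, so the printed component count is illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 50))   # placeholder: 500 samples x 50 scaled features

pca = PCA().fit(X)          # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of PCs whose cumulative explained variance
# reaches 95% -- the same elbow we look for in the plot.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components)
```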
So, we settled on Select K-Best using Chi-squared statistics, which selects the k best features by scoring their importance across the whole dataset.<\/span><\/p>\n<h4><strong>Training<\/strong><\/h4>\n<p><span style=\"font-weight: 400\">The next step is training the model. While training, we need to focus on the following parameters:<\/span><\/p>\n<blockquote><p><b>Size of Model<\/b><span style=\"font-weight: 400\">: The model generated by any algorithm should have an optimal size, and updating it should take minimal effort.<\/span><\/p>\n<p><b>Training Time<\/b><span style=\"font-weight: 400\">: Time and resources utilized by an algorithm while training should be minimal (though this is not a primary requirement).<\/span><\/p>\n<p><b>Prediction Time<\/b><span style=\"font-weight: 400\">: The prediction function must be as fast as possible.<\/span><\/p>\n<p><b>Overfitting<\/b><span style=\"font-weight: 400\">: Some algorithms work great on small datasets, but as we grow the dataset, the results tend to go down; this is called overfitting.<\/span><\/p><\/blockquote>\n<p><span style=\"font-weight: 400\">We tried a few algorithms; <\/span><b>Logistic Regression<\/b><span style=\"font-weight: 400\"> worked well on small datasets, but it tends to overfit and its accuracy reduces as we increase the training set.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Then we tried the <\/span><b>Random Forest Classifier<\/b><span style=\"font-weight: 400\">. Accuracy here is good, but the generated model is large, as it is a forest of multiple decision trees. The prediction function also takes time, as the trees need to be traversed. 
So we moved to <\/span><b>SVM (Support Vector Machine)<\/b><span style=\"font-weight: 400\">.<\/span><\/p>\n<h4><b>Support vector machines<\/b><\/h4>\n<p><span style=\"font-weight: 400\">A support vector machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that differentiates the two classes well.<\/span><\/p>\n<p><span style=\"font-weight: 400\">There can be many hyperplanes classifying the data, but SVM aims to find the hyperplane that has the largest distance from the nearest samples on both sides of the plane.<\/span><\/p>\n<p><span style=\"font-weight: 400\">For example, here SVM will choose plane C over A and B.<\/span><\/p>\n<figure id=\"attachment_85643\" aria-describedby=\"caption-attachment-85643\" style=\"width: 454px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-85643\" src=\"https:\/\/blogs_admin.quickheal.com\/wp-content\/uploads\/2018\/02\/ML6-1.png\" alt=\"Linear SVM classifier\" width=\"454\" height=\"337\" srcset=\"https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML6-1.png 454w, https:\/\/www.quickheal.com\/blogs\/wp-content\/uploads\/2018\/02\/ML6-1-300x223.png 300w\" sizes=\"(max-width: 454px) 100vw, 454px\" \/><figcaption id=\"caption-attachment-85643\" class=\"wp-caption-text\">Graph 4: Linear SVM classifier<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400\">The example above is of a linear SVM. By using kernel functions, SVM can classify non-linearly distributed data as well. 
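As a minimal sketch of this maximum-margin idea (toy data with two hypothetical scaled features, label 1 standing for "malicious"), scikit-learn's SVC finds the separating hyperplane and scores new samples by their signed distance from it:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable samples: two hypothetical scaled features each.
X = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],   # clean (0)
              [0.80, 0.90], [0.90, 0.80], [0.85, 0.75]])  # malicious (1)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# decision_function evaluates w*x + b; its sign gives the side of the
# maximum-margin hyperplane, i.e. the predicted class.
sample = np.array([[0.90, 0.85]])
score = clf.decision_function(sample)
print(clf.predict(sample), score)
```

For the linear kernel, `clf.coef_` and `clf.intercept_` expose w and b directly, which is what makes the prediction step cheap enough to run on an endpoint.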
A kernel function takes two inputs and outputs a number representing how similar they are. Commonly used kernels in SVM are the <\/span><b>Linear kernel, Polynomial kernel, and Radial basis function kernel (RBF)<\/b>.<span style=\"font-weight: 400\"> We tried all of these kernels and found that RBF and Linear work well for our data. As the linear kernel has an extremely fast prediction function, we finalized it<\/span>. <span style=\"font-weight: 400\">The prediction function is <\/span><span style=\"font-weight: 400\">(<\/span><i><span style=\"font-weight: 400\">w*x + b<\/span><\/i><span style=\"font-weight: 400\">)<\/span><span style=\"font-weight: 400\">, where w is the weight vector learned from the support vectors, x is the feature vector of the test sample and b is the intercept or bias.<\/span><\/p>\n<p><span style=\"font-weight: 400\">The SVM model trained using the above steps has such good accuracy that it reduced the traffic to the next levels by nearly 1\/3rd. It is helping us greatly in improving the performance and accuracy of detection.<\/span><\/p>\n<h4><b>Conclusion<\/b><\/h4>\n<p>Machine learning can be used effectively by the antivirus industry, and its use cases are vast. Quick Heal is committed to bringing the latest research in ML to reform endpoint security. At Quick Heal, we apply ML at various stages, and the above case study is one of them. In the future, we will share write-ups on other case studies.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In today\u2019s fast-changing world, the cyber threat landscape is getting increasingly complex and signature-based systems are falling behind in protecting endpoints. All major security solutions are built with layered security models to protect endpoints from today\u2019s advanced threats. Machine learning-based detection is also becoming an inevitable component of these layered security models. 
In this post, [&hellip;]<\/p>\n","protected":false},"author":44,"featured_media":85613,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1214,5],"tags":[1573,266,1577,201,1578],"class_list":["post-85612","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-security-terms-2","category-security","tag-svm","tag-endpoint-security","tag-feature-selection","tag-machine-learning","tag-threat-hunting"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/posts\/85612"}],"collection":[{"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/users\/44"}],"replies":[{"embeddable":true,"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/comments?post=85612"}],"version-history":[{"count":19,"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/posts\/85612\/revisions"}],"predecessor-version":[{"id":85684,"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/posts\/85612\/revisions\/85684"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/media\/85613"}],"wp:attachment":[{"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/media?parent=85612"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/categories?post=85612"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.quickheal.com\/blogs\/wp-json\/wp\/v2\/tags?post=85612"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}