The identification and detection of Domain Generation Algorithms

Malware needs to keep a communication channel open with its Command and Control (C2) servers to receive instructions for the infected host. But once IT security detects and blocks the domain of such a server, is the problem solved?

Bad news: no!

Because the malware generates domain names dynamically to stay in contact with its C2 servers, when one domain is blocked it can quickly move to a newly generated one.

In this post we take a closer look at the major player here, DGAs, whose detection is critical for DNS security.

The flow will be like this:

What is DGA?
- DGA and NonDGA
DGA Families and Examples
DGA Detection
- Datasets
- Articles
Our work as CRYPTTECH

What is DGA?

DGAs are a large family of algorithms that help malware periodically generate large numbers of domain names that can be used as points of contact with Command and Control servers.

Before DGAs stepped in, most malware used hard-coded IP addresses or lists of domains. By embedding the DGA in the malware’s obfuscated binary rather than shipping a pre-built list of domains, malware has become much more difficult for network security software or network administrators to block. Attackers only need to register a tiny number of the generated domains to keep the communication channel open, whereas the network security team has to block practically all of them to stop communication.

DGA and NonDGA Domains

First, let’s examine what a domain is. Domains consist of more than one level. Let’s write a few lines using the Python tldextract module:

In [1]: import tldextract
In [2]: extract = tldextract.extract('https://www.crypttech.com/tr/')
In [3]: extract.subdomain
Out[3]: 'www'
In [4]: extract.domain
Out[4]: 'crypttech'
In [5]: extract.suffix
Out[5]: 'com'

A domain name can have up to 253 characters according to the DNS protocol, and queries for “subdomain.domain.suffix” are sent to the DNS servers for “domain.suffix.”

DGA-generated domains are often long and meaningless, to avoid colliding with a previously registered domain name. For example:
cvyh1po636avyrsxebwbkn7.ddns.net
bqlqscpqwsh.dynserv.com
txumyqrubwutbb.cc

However, DGA domains can also be formed by combining meaningful words and names, and these are more difficult to detect than the others:
brothernerveplacebringconsult.com
capitalhuntdealsmokeboxclue.com
jeannetteabrahamson.net

Every day, these algorithms can create hundreds of different domains from inputs such as the time, encryption keys, and seed values.

Some metrics that can be used to distinguish between DGA and NonDGA are:

Average of the . . .
- number of used hyphens
- maximum number of contiguous consonants
Standard deviation of the . . .
- string length of the label
- consonant and vowel ratio
- entropy
Redundancy of substrings or words in case of a wordlist DGA
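
As a rough illustration of the metrics above, here is a minimal sketch that computes a few of them for a single label. The function names and the choice to treat "y" as a consonant are assumptions made for this example; averaging or taking the standard deviation over a set of domains is left out.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: "y" is counted as a consonant


def shannon_entropy(label: str) -> float:
    """Shannon entropy of the label, in bits per character."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def max_consonant_run(label: str) -> int:
    """Length of the longest run of contiguous consonants."""
    longest = current = 0
    for ch in label.lower():
        if ch.isalpha() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def lexical_features(label: str) -> dict:
    """Collect per-label features that a classifier could consume."""
    letters = [c for c in label.lower() if c.isalpha()]
    vowels = sum(c in VOWELS for c in letters)
    return {
        "length": len(label),
        "hyphens": label.count("-"),
        "max_consonant_run": max_consonant_run(label),
        "vowel_ratio": vowels / len(letters) if letters else 0.0,
        "entropy": shannon_entropy(label),
    }


for label in ("crypttech", "txumyqrubwutbb"):
    print(label, lexical_features(label))
```

No single feature separates the two classes on its own; they are normally combined as inputs to a model.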

A DGA family refers to algorithms that produce domains following a certain pattern, which shows up in any one or a mix of these metrics. Detecting a family does not necessarily mean that every domain produced by that family will be detected, because seed values are as important as the pattern of the algorithm.
Seeds can be chosen in a static or dynamic manner. A static seed can be anything the malware can alter at any time, such as a word dictionary, random numbers, or letter combinations. Dynamic seeds, on the other hand, are time dependent. A daily trending Twitter hashtag, exchange rate figures, or the air temperature can all serve as dynamic seed values. Using a date as the seed, for example, results in completely separate domains for each day. As a result, today’s domain produced by the same family using the same settings will be different from tomorrow’s domain.
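
To make the static versus dynamic distinction concrete, here is a toy sketch. It is not the algorithm of any real malware family: the function toy_dga, the SHA-256 hashing step, and the letter mapping are all assumptions chosen only to show that the same seed always gives the same domains while a date seed changes them daily.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, count: int = 3, length: int = 12, tld: str = ".com") -> list:
    """Hypothetical toy generator: hash (seed, index) and map hex digits to letters."""
    domains = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}:{i}".encode()).hexdigest()
        # Map each hex digit (0-15) to one of the letters a-p.
        label = "".join(chr(ord("a") + int(h, 16)) for h in digest[:length])
        domains.append(label + tld)
    return domains


# Static seed: the output never changes between runs.
static = toy_dga("fixed-seed")

# Dynamic seed: the date is the seed, so each day yields a different list.
today = toy_dga(date(2022, 6, 8).isoformat())
tomorrow = toy_dga(date(2022, 6, 9).isoformat())

print(static)
print(today)
print(tomorrow)
```

A defender who recovers both the algorithm and the seed source can pre-compute the domains; without the seed, knowing the algorithm alone is not enough.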

DGA Families and Examples

Some DGA Families;

bamital      banjori  blackhole ccleaner chinad  conficker   
cryptolocker dircrypt dyre      emotet   feodo   fobber      
gameover     gspy     locky     madmax   matsnu  mirai       
murofet      mydoom   necurs    nymaim   omexo   padcrypt    
proslikefan  pykspa   qadars    ramnit   ranbyus rovnix      
shifu        shiotob  simda     suppobox symmi   tempedreve 
tinba        tinynuke tofsee    vawtrak  vidro   xshellghost

Let’s look at a few samples of DGA families’ algorithms:

Cryptolocker

In [1]: generate_domains(2)
Out[1]: ['xqlyfkrxxqorwkpdlsqyfdiukpfsrtwm', 'pbrpehrxxqorwkpdlsqyfdiukpfsrtwm']

Unless the seed value is modified, this code always yields the same result.
If we change the seed value:

In [2]: generate_domains(2, 50)
Out[2]: ['wyeupyaihearwrfbdmsjrhbtmxmusgof', 'ojklovaihearwrfbdmsjrhbtmxmusgof']

Sisron

On the date the code was run, the following was the result:

In [1]: dga_gen(1)
Out[1]: mdgwnjiwmjia.com

When a date one year before the run date is passed as the seed, the following is the result:

In [2]: dga_gen(1, "2021-06-08")
Out[2]: mdgwnjiwmjea.com

Did you see the little difference?
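
The two outputs above are consistent with a scheme that formats the date as ddmmyyyy, base64-encodes it, lowercases the result, and replaces the "=" padding with "a". The sketch below is inferred from those outputs under the assumption that the run date was 2022-06-08; the function name date_b64_domain is hypothetical and this is not the dga_gen code used above.

```python
import base64
from datetime import date


def date_b64_domain(d: date, tld: str = ".com") -> str:
    """Build a domain label from the base64 encoding of a ddmmyyyy date string."""
    stamp = d.strftime("%d%m%Y").encode("ascii")
    label = base64.b64encode(stamp).decode("ascii").lower().replace("=", "a")
    return label + tld


print(date_b64_domain(date(2022, 6, 8)))  # mdgwnjiwmjia.com
print(date_b64_domain(date(2021, 6, 8)))  # mdgwnjiwmjea.com
```

Only the last digit of the year differs between the two inputs, which is why only one character of the label changes.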

If you want to examine many more algorithms like these:

baderj/domain_generation_algorithms (GitHub): Domain Generation Algorithms (DGAs) of malware reimplemented in Python

andrewaeva/DGA, dga_algorithms directory (GitHub): algorithms for generating domain names and dictionaries of malicious domain names

Note: Botnets are one of the most dangerous threats to devices connected to the internet, and their operators keep developing methods to evade security measures. To protect themselves from takedown and blacklisting attempts, most current botnets use Domain Generation Algorithms (DGAs). This link will take you to a nice article on how they do it.

DGA Detection

How do we detect domains created by DGA?
There are two basic ways to do this: reactive and real-time. The first technique relies on statistical data such as DNS responses, IP address location, WHOIS and TLS certificate information to determine whether a domain name is legitimate. The other technique concentrates on analyzing the domain as a plain string of characters. Within the latter technique, there is a wide variety of approaches to detecting these unnatural or strange domains. The most common are:

Dividing the domain into N-grams and then performing frequency analysis: The N-gram approach is particularly effective when performance is critical, but its implementation is quite complex and its accuracy is rather average.
Calculating the domain’s entropy (this is problematic with non-ASCII domains and dictionary-based DGAs): The entropy approach is the most efficient in terms of memory and CPU usage. Its accuracy is the lowest of the three, but because of its ease of use, speed, and simplicity, it can be used if you only need a rough estimate and don’t mind false alarms.
Applying machine learning to domain analysis: Well-trained neural networks can produce impressive results with almost no false alarms. However, in exchange for great accuracy, this approach sacrifices performance and resource utilization. If accuracy is crucial, machine learning is the way to go.
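
The N-gram idea can be sketched in a few lines. This is a minimal illustration, not a production detector: the tiny benign list, the function names, the add-one smoothing, and the assumed alphabet size of 38 symbols are all choices made for the example. A real model would be trained on a large benign list.

```python
import math
from collections import Counter


def bigrams(label: str) -> list:
    """Split a label into character bigrams, with ^ and $ as boundary markers."""
    padded = f"^{label.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train_bigram_model(benign_labels):
    """Count bigram frequencies over a list of benign labels."""
    counts = Counter(bg for label in benign_labels for bg in bigrams(label))
    return counts, sum(counts.values())


def avg_log_prob(label: str, model, alphabet_size: int = 38) -> float:
    """Average smoothed log-probability per bigram; lower means less natural."""
    counts, total = model
    vocab = alphabet_size ** 2  # assumed number of possible bigrams
    grams = bigrams(label)
    score = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return score / len(grams)


# Illustrative training data only; far too small for real use.
benign = ["google", "facebook", "youtube", "wikipedia",
          "amazon", "twitter", "github", "linkedin"]
model = train_bigram_model(benign)

for label in ("google", "txumyqrubwutbb"):
    print(label, round(avg_log_prob(label, model), 3))
```

Labels whose bigrams rarely occur in benign names get a lower score, and a cut-off on that score gives a fast first-pass filter.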

Datasets

Benign/NonDGA: There are lots of sources, but the most common are

DGA: You can create DGA domains using the following algorithms

baderj/domain_generation_algorithms
endgameinc/dga_predict

Mix: Ready-to-use mix dataset

chrmor/DGA_domains_dataset

Articles

A Novel Approach for Detecting DGA-Based Botnets in DNS Queries Using Machine Learning Techniques

This study, carried out using Splunk, examines how the metrics mentioned above for distinguishing DGA from NonDGA domains affect the results of different ML models, along with many other findings. Follow the link to view it.

If you want to take a look at the technical studies, this article on DGA binary classification may be of interest to you:

Assessment of Machine Learning Models in Detecting DGA Botnet in Characteristics by TF-IDF

They use the UMUDGA dataset for the evaluation. The dataset includes 1,000,000 benign domains and 50 DGA botnet families, with 20,000 sample domain names for each family. The algorithms are Support Vector Machines (SVM), Logistic Regression (LR), Naïve Bayes (NB), Neural Networks (NN), Decision Trees (DT), Random Forests (RF), k-Nearest Neighbor (kNN), AdaBoost (AB), a Voting Ensemble Algorithm (VEA, soft voting), and a Hard Ensemble Algorithm (HEA, hard voting).

Knowing whether a domain is DGA or not may not be enough. If you also want to know which DGA family it belongs to, here are a few articles that might interest you:

A DGA domain names detection modeling method based on integrating an attention mechanism and deep neural network

They use the deep learning framework ATT-CNN-BiLSTM to identify and detect DGA domains. The dataset contains 308,230 DGA domain names from 24 different malware families, and the benign domains are the top 1,000,000 domain names in Alexa.

Real-Time Detection of Dictionary DGA Network Traffic Using Deep Learning

This article presents a parallel hybrid architecture named Bilbo, composed of an LSTM, a CNN, and an ANN, for dictionary DGA detection. The families selected were suppobox, gozi, and matsnu, with domains collected over 2 years (2016–17) by DGArchive. The benign domains in the training set come from the Alexa Top 1 Million domains, measured in 2016.

Applied machine learning in recognition of DGA domain names

Results of standard and blind evaluations for 14 ML and 9 DL models, along with 2 comparative models, for the recognition of DGA domain names.

Our work as CRYPTTECH

We observed that, while these approaches scored highly on the training data, they did not achieve the required success when tested with new domain names that were not included in the training set.

We have developed many models including different preprocessing methods, ML, DL, and NLP approaches and datasets to achieve high success in DGA detection which will provide additional visibility in our SIEM and SOAR products.

The datasets we use:

1- A dataset was created from the 360 Netlab database, which contains domain names from 59 different DGA algorithms, and the Tranco list of the 1 million most visited domain names worldwide. The dataset contains a total of 1,970,159 records.

2- A dataset containing 25 different DGA algorithms, 13 time-dependent and 12 time-independent, in equal proportion (13,500 domains for each algorithm, 337,500 records in total), combined with the most visited domain names worldwide. Taking another 337,500 records from Alexa as the benign base gives a dataset of 675,000 records.

The preprocessing methods we use:

The ML, DL and NLP methods we use:

Random Forest
Support Vector Machines
Multinomial Naive Bayes
Stochastic Gradient Descent (SGD)
CatBoost
LSTM
CNN
Fasttext

The top 3 results we got are as follows:

Within this project’s scope, we have developed an API service that uses the fastText model, which gave the best results.
A domain seen in a DNS query can be sent to the API service we have set up, which returns the detection result along with a confidence score. Alternatively, a file exported from the log source can be uploaded, or its path given, to make bulk queries. We produce a report in Excel format containing the confidence score and the detection result.
There is no need to worry about meaningless subdomains that trusted addresses will contain (ex: k0 — an-adyl4n7z.googlevideo.com), as the service will subject the entered input to separate reviews as subdomain, domain, and suffix.
The threshold used for detection is determined from the model’s ROC curve, so classification is done at the optimal value.
In addition, a whitelist can be created, or the existing list updated, with a request sent to the service.
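
As an illustration of picking a threshold from a ROC curve, here is a minimal sketch. The post does not say which criterion is used, so maximising Youden's J (TPR minus FPR) is an assumption, and the scores and labels below are made-up illustration data, not our results. The code also assumes both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for every distinct score used as a cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A sample is flagged as DGA when its score is at or above the cut-off.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_threshold(scores, labels):
    """Pick the cut-off that maximises Youden's J = TPR - FPR (assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]


# Made-up confidence scores (1 = DGA, 0 = benign), for illustration only.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 0, 1, 0, 0]

print(best_threshold(scores, labels))  # 0.8
```

Other criteria, such as fixing a maximum acceptable false positive rate, would give a different cut-off from the same curve.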

With this model, which will work in integration with our SOAR and SIEM products, possible botnet attacks can be prevented.
