Do you know how to validate hyperlinks for phishing detection?

A technique called Hyperlink Validation can be used to determine whether a webpage is legitimate or phishing.

This scheme can identify new phishing pages that blacklist-based anti-phishing tools and URL verification would otherwise miss.

The approach detects phishing websites by intercepting all the hyperlinks of the current page through the Google API and then constructing a parse tree from the intercepted hyperlinks.

Phishing hyperlinks that URL verification fails to catch are identified correctly by this validation.

Moreover, the same parse tree of intercepted hyperlinks is used to discover the phishing target of a suspicious webpage, that is, the legitimate site the page imitates.

Domain Name Retrieval for Hyperlink Validation

The hyperlink validation system primarily involves four functional areas: the search engine, the parse tree constructor, the parser, and the phishing pages collection repository.

Instead of going directly to the World Wide Web (WWW) or a proxy server, the requested URL is first forwarded to an interface.

The fetched URL is then tokenized to extract its domain name. The domain name serves as the keyword for checking whether the requested site is phishing.
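As a rough illustration of the tokenize step, the sketch below pulls the host name out of a URL with Python's standard urllib.parse module. The function name extract_domain and the choice to drop a leading "www." are assumptions made for this example; the original system's tokenizer is not specified. Reducing a host to its registered domain would additionally need a public suffix list, which is left out here.

```python
from urllib.parse import urlsplit


def extract_domain(url: str) -> str:
    """Return the lowercased host name of a URL, without a leading 'www.'.

    Hypothetical helper for illustration only.
    """
    # urlsplit only recognizes the host when '//' is present, so add a
    # default scheme for bare inputs such as 'example.org/login'.
    if "//" not in url:
        url = "http://" + url
    # .hostname is already lowercased and excludes any user info and port.
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


if __name__ == "__main__":
    print(extract_domain("https://www.Example.com/login?next=/home"))
```

Note that a deceptive URL such as http://paypal.com.evil.test/ yields the full host paypal.com.evil.test, so the keyword passed to the search step reflects the real host and not the brand name embedded in it.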

Role of Google API for Hyperlink Validation


Developers can use the Google Web APIs service, which is a beta program, to find and manipulate information on the web from their own software.

Through this service, software developers can query more than two billion web documents directly from their own computer programs.

The SOAP and WSDL standards act as the interface between the user's program and the Google API.

The domain name from the URL is submitted to the Google search engine through the Google API. If the keyword appears only once within the body of a page, that page receives a low score.

After that, the following techniques are applied to find the phishing site.

(i) Top Results Retrieval

The domain name of the captured URL is taken as the root node, and the Google search through the Google API is based on this root node.

The search engine returns the results, and the top ten are selected according to the importance of the pages, as determined by the page ranking algorithm.

(ii) Page Ranking 

The importance of a web page can be estimated from the number of pages that link to it.

If web page j contains a hyperlink to page k, then k is taken to be important and relevant to j. If many pages link to k, k is assumed to be important.

The web can be viewed as a directed graph of nodes and edges, where the nodes represent web pages and the edges represent the links between them.
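The link-counting idea above can be sketched as a PageRank-style power iteration over a small directed graph. The article does not give the exact formula, so the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the function name page_rank are all assumptions for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Score pages of a directed link graph.

    links maps each page to the list of pages it links to.
    Illustrative sketch; damping and iterations are assumed values.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["k"], "b": ["k"], "c": ["k"], "k": []}
    scores = page_rank(graph)
    print(max(scores, key=scores.get))
```

In this toy graph three pages link to k and none link to the others, so k ends up with the highest score, matching the intuition that many incoming links signal importance.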

From the search list, the top ten results, which are assumed to be the most relevant to the domain name, are identified.

These results are collected, and each one is then browsed to collect all the hyperlinks placed within the site.
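Collecting the hyperlinks of a result page could look roughly like the following, using only Python's standard html.parser and urllib.parse modules. The class name LinkCollector and the function collect_links are hypothetical, and the sketch works on HTML that has already been downloaded; how the original system fetches pages is not described.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    """Return all hyperlinks found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = '<p><a href="/login">Sign in</a> <a href="http://other.test/a">Other</a></p>'
    print(collect_links(page, "http://example.com/index.html"))
```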

All the hyperlinks are checked with the hyperlink validator described earlier. If any suspicious hyperlink is identified as phishing, the given website is treated as a phishing website.

But the system does not stop at hyperlink checking. A tree is constructed to confirm whether the hyperlink is phishing and to identify the phishing target.

(iii) Tree Construction

The internal hyperlinks found inside the top ten results are collected to construct the parse tree. Ten levels of the tree are considered, as shown in the figure below.

In the parse tree construction, the search keyword, i.e. the domain name, is the root node of the parse tree. The first ten results, along with the internal hyperlinks of each result, are used to build the tree from the root node.

The first level below the root node consists of the top ten search results. The same procedure is repeated for all the leaves at the second level, the third level, and so on up to the nth level.
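The level-by-level construction described above can be sketched as follows. The search and link-fetching steps are passed in as plain functions (search and fetch_links are hypothetical placeholders, not real Google API calls), and skipping links that were already seen is an added assumption to keep the tree finite when pages link to each other.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_tree(domain, search, fetch_links, max_depth=3, top_n=10):
    """Build a tree rooted at the domain name.

    search(domain) returns ranked result URLs; fetch_links(url) returns the
    hyperlinks found on that page. Both are supplied by the caller.
    """
    root = Node(domain)
    # Level 1: the top search results for the domain name.
    root.children = [Node(url) for url in search(domain)[:top_n]]
    seen = {child.label for child in root.children}
    level = root.children
    # Levels 2..max_depth: expand every leaf with the links found on its page.
    for _ in range(max_depth - 1):
        next_level = []
        for node in level:
            for link in fetch_links(node.label):
                if link in seen:  # assumption: each link appears once
                    continue
                seen.add(link)
                child = Node(link)
                node.children.append(child)
                next_level.append(child)
        level = next_level
    return root


if __name__ == "__main__":
    fake_pages = {"r1": ["a", "r2"], "r2": ["a", "b"]}
    tree = build_tree(
        "example.com",
        search=lambda domain: ["r1", "r2"],
        fetch_links=lambda url: fake_pages.get(url, []),
        max_depth=2,
    )
    for result in tree.children:
        print(result.label, [child.label for child in result.children])
```

Passing the search and fetch steps in as functions keeps the sketch independent of any particular search service and makes it easy to try with canned data, as in the usage example.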

Figure: Format of the Tree Construction

In the next article, titled "What is the target of Hackers?", you can read about tree traversal and finding the phishing site along with the phishing target.

Final Words

Here we have discussed the importance of hyperlink validation in phishing detection. Even a well-structured phishing URL (malicious link) can be detected through hyperlink validation of the website.