Phishing Classification
Large-scale lexical classification of phishing URLs
  • 2017-05-05
  • Python
  • MySQL
  • Machine Learning
  • Data Science
  • scikit-learn
  • MQTT
  • Phishing Websites

    Phishing is a widespread crime which costs the economy hundreds of millions of pounds every year. This form of social engineering aims to steal credentials from individuals through the use of a legitimate-looking site. Fraudsters impersonate the websites of a variety of services in order to gain access to bank accounts, online payment services and more.

    The Project

    With phishing at an all-time high, efficient detection mechanisms are required to kill campaigns as quickly as possible and minimise the impact of each attack. For my BSc thesis, I researched and built a phishing classifier that uses machine learning to classify large volumes of URLs (at rates of up to 1000 URLs/s), using features extracted from the URL only.

    This system is currently deployed at Netcraft where it is finding thousands of phishing URLs per month, which are subsequently forwarded to organisations ranging from major browsers to ISPs and domain registrars.

    Existing research in lexical-only classification is limited, with little evidence of investigations undertaken in real-life scenarios. This study improved upon this by evaluating the classification system against live phishing attacks, in a real-time environment that exhibits the true conditions and constraints under which a system of this nature would be used.


    The research I undertook aimed to identify the best features to extract from the URL, since fetching the content of the website and any third-party data introduces latency and therefore makes processing large volumes of URLs difficult. I also identified a large number of novel features, some of which proved to be significantly more effective than the features presented in existing literature.
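
    To make "features extracted from the URL only" concrete, here is a minimal sketch of lexical feature extraction. The features shown (lengths, character counts, an IP-address check, character entropy) are generic examples chosen for illustration; they are assumptions, not the feature set or the novel features from the thesis. The function name `lexical_features` is hypothetical.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def _is_ip(host: str) -> bool:
    """Return True if the host is a literal IP address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def _entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def lexical_features(url: str) -> dict:
    """Compute illustrative features from the URL string alone (no network access)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "host_is_ip": _is_ip(host),
        "char_entropy": _entropy(url),
    }


features = lexical_features(
    "http://paypal.com.secure-login.example.net/account/update.php?id=123"
)
print(features)
```

    Because nothing here touches the network, the cost per URL is a handful of string operations, which is what makes this approach suitable for high-volume feeds.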

    A graph showing feature effectiveness of Novel (coloured) and Non-novel (grey) features, for two data sets (DP and SBSP)

    Classification Model

    An investigation was also performed to find the most effective classification algorithm, using hyperparameter optimisation to ensure that the best instances of each algorithm were used. A Random Forest model with 150 trees proved to be the most successful.
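
    As a rough illustration of hyperparameter optimisation for a Random Forest in scikit-learn, the sketch below runs a small cross-validated grid search. The synthetic data, the grid values and the use of `GridSearchCV` are assumptions made for the example; the only figure taken from this project is the 150-tree setting, and the search method actually used in the thesis may have differed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a matrix of lexical URL features and phishing labels.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)

# Hypothetical search space; 150 trees is included because it is the
# value reported as most successful in this project.
param_grid = {
    "n_estimators": [50, 150],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

    Running the same search for each candidate algorithm gives a like-for-like comparison, since every model is evaluated at its best-found settings rather than at its defaults.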

    Results of Random Forest hyperparameter optimisation

    A comparison against results from existing literature showed that the system I developed outperformed existing lexical classifiers, and often came close to, or even outperformed, classifiers that used full data from the web page.

    A comparison of accuracy against existing literature, using data sets that are as similar as possible

    Implementation & Deployment

    The results of the research were used to develop a large-scale system that can support large volumes of URLs. The system was built as a network of workers that communicate using the MQTT protocol. Scikit-learn and other scientific-computing libraries for Python were used.
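
    The worker pattern can be sketched as follows. To keep the example self-contained, an in-process class stands in for the MQTT broker, so this mimics only topic-addressed message passing and not MQTT itself (no QoS, no fan-out to multiple subscribers). The topic names, the `InMemoryBroker` class and the `score_url` placeholder are all hypothetical and not taken from the deployed system.

```python
import json
import queue
import threading

# Hypothetical topic names; the real system's topics are not described here.
URLS_TOPIC = "urls/incoming"
VERDICTS_TOPIC = "urls/verdicts"


class InMemoryBroker:
    """Stand-in for an MQTT broker: one FIFO queue per topic."""

    def __init__(self):
        self._topics = {}
        self._lock = threading.Lock()

    def _queue(self, topic):
        with self._lock:
            return self._topics.setdefault(topic, queue.Queue())

    def publish(self, topic, payload):
        self._queue(topic).put(payload)

    def receive(self, topic, timeout=1.0):
        return self._queue(topic).get(timeout=timeout)


def score_url(url):
    """Placeholder scorer; a real worker would call the trained model here."""
    return 1.0 if "login" in url else 0.0


def worker(broker, stop):
    """Consume URLs from one topic and publish a verdict for each to another."""
    while not stop.is_set():
        try:
            payload = broker.receive(URLS_TOPIC, timeout=0.1)
        except queue.Empty:
            continue
        url = json.loads(payload)["url"]
        verdict = {"url": url, "score": score_url(url)}
        broker.publish(VERDICTS_TOPIC, json.dumps(verdict))


broker = InMemoryBroker()
stop = threading.Event()
thread = threading.Thread(target=worker, args=(broker, stop), daemon=True)
thread.start()

broker.publish(URLS_TOPIC, json.dumps({"url": "http://example.net/login"}))
result = json.loads(broker.receive(VERDICTS_TOPIC, timeout=2.0))

stop.set()
thread.join()
print(result)
```

    Decoupling workers through topics means each stage (feature extraction, classification, reporting) can be scaled independently by starting more workers on the same topic.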

    On a 4-core machine, the system can process up to 43 million URLs per day (500/s). This costs less than $100 a month to run when using popular cloud computing services such as DigitalOcean and Amazon EC2. The system currently processes a live feed of URLs in real time, reporting suspected attacks so that take-down efforts can begin.
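
    The daily figure follows directly from the per-second rate, as the short calculation below shows. The cost-per-million estimate is only an upper bound, and it assumes a 30-day month with the system running at its full rate throughout; both are assumptions made for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

urls_per_second = 500
urls_per_day = urls_per_second * SECONDS_PER_DAY
print(urls_per_day)  # 43200000, i.e. roughly 43 million

# Upper bound on cost, assuming a 30-day month at sustained peak throughput.
monthly_cost_usd = 100
urls_per_month = urls_per_day * 30
cost_per_million_usd = monthly_cost_usd / (urls_per_month / 1_000_000)
print(round(cost_per_million_usd, 4))
```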