Using Machine Learning to Detect Malicious URLs



With the growth of machine learning over the past few years, many tasks are now done with the help of machine learning algorithms. Fortunately or unfortunately, little work has been done on applying machine learning to cyber security, so I thought I would present some at Fsecurify.

A few days ago, I wondered whether a machine learning algorithm could tell a malicious URL from a non-malicious one. There has been some research on the topic, so I decided to give it a go and implement something from scratch. So let’s start.


Gathering Data

The first task was gathering data. I did some searching and found some websites that publish lists of malicious links. I set up a little crawler and collected a large number of malicious links from them. The next task was finding clean URLs. Fortunately, I did not have to crawl any, because a data set was already available. Don’t worry that I am not naming the data sources here; you’ll get the data at the end of this post.

So, I gathered around 400,000 URLs, of which around 80,000 were malicious and the rest were clean. There we have it, our data set. Let’s move on.

Analysis

We’ll be using logistic regression since it is fast. The first step was tokenizing the URLs. I wrote my own tokenizer function for this, since URLs are not like ordinary document text. Some of the tokens we get are ‘virus’, ‘exe’, ‘php’, ‘wp’, ‘dat’, etc.
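The original tokenizer listing is not reproduced here, so the following is only a minimal sketch of the idea. The function name get_tokens, the set of separator characters and the list of ignored tokens are all assumptions made for illustration.

```python
import re

def get_tokens(url):
    """Split a URL into a list of unique lowercase tokens (illustrative sketch)."""
    # Break on separators that commonly appear in URLs (assumed set).
    parts = re.split(r"[/\-.?=&_:]+", url.lower())
    # Drop empty strings and very common tokens that carry little signal (assumed list).
    ignore = {"", "com", "www", "http", "https"}
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [t for t in dict.fromkeys(parts) if t not in ignore]

print(get_tokens("www.radsport-voggel.de/wp-admin/includes/log.exe"))
# ['radsport', 'voggel', 'de', 'wp', 'admin', 'includes', 'log', 'exe']
```

Tokens such as ‘wp’ and ‘exe’ fall out of the split directly, which is why a plain word tokenizer written for prose is a poor fit for URLs.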

The next step is to load the data and store it into a list.
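As a rough sketch of this step, assuming the data is a CSV file with a "url" column and a "label" column (the actual file layout is not described in this post), loading could look like the following. The in-memory sample stands in for the real file, and load_urls is a hypothetical helper name.

```python
import csv
import io
import random

# Stand-in for the real file: a few rows in an assumed "url,label" layout.
SAMPLE_CSV = """url,label
wikipedia.com,good
google.com/search=faizanahad,good
ahrenhei.without-transfer.ru/nethost.exe,bad
www.itidea.it/centroesteticosothys/img/_notes/gum.exe,bad
"""

def load_urls(handle, seed=0):
    """Read (url, label) rows from an open CSV handle into two parallel lists."""
    rows = [(row["url"], row["label"]) for row in csv.DictReader(handle)]
    # Shuffle so that good and bad rows are not grouped together.
    random.Random(seed).shuffle(rows)
    urls = [url for url, _ in rows]
    labels = [label for _, label in rows]
    return urls, labels

urls, labels = load_urls(io.StringIO(SAMPLE_CSV))
print(len(urls), labels.count("bad"))  # 4 2
```

For a file on disk, the same function can be given a handle from open(path, newline="", encoding="utf-8"), where path is wherever the data set was saved.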

Now that we have the data in our list, we have to vectorize our URLs. I used tf-idf scores instead of plain bag-of-words counts, since some words in URLs are more important than others, e.g. ‘virus’, ‘.exe’, ‘.dat’. Let’s convert the URLs into vector form.

We have the vectors. Let’s now split them into training and test data and perform logistic regression on them.
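A minimal sketch of the vectorizing step with scikit-learn’s TfidfVectorizer, assuming a simplified stand-in tokenizer (get_tokens here is hypothetical, not the original function) and four URLs taken from the results further down:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tokens(url):
    # Hypothetical tokenizer: split on common URL separators, drop empty strings.
    return [t for t in re.split(r"[/\-.]+", url.lower()) if t]

urls = [
    "wikipedia.com",
    "google.com/search=faizanahad",
    "ahrenhei.without-transfer.ru/nethost.exe",
    "www.itidea.it/centroesteticosothys/img/_notes/gum.exe",
]

# token_pattern=None tells scikit-learn that the default pattern is unused
# because a custom tokenizer is supplied.
vectorizer = TfidfVectorizer(tokenizer=get_tokens, token_pattern=None)
X = vectorizer.fit_transform(urls)  # sparse matrix: one row per URL, one column per token

vocab = vectorizer.vocabulary_
print(X.shape)
# Tokens seen in fewer URLs get a higher idf weight than tokens seen in many.
print(vectorizer.idf_[vocab["com"]] < vectorizer.idf_[vocab["nethost"]])  # True
```

The idf part of the score is what separates this from raw counts: a token that appears in almost every URL contributes little, while a rarer token contributes more.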
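The split and the fit can be sketched as follows. The URLs below are invented placeholders so that the example runs on its own, and the 80/20 split ratio and random_state value are assumptions rather than the settings used for the real data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented placeholder URLs, only so that the sketch runs end to end.
good = [f"site{i}.example.org/index.html" for i in range(20)]
bad = [f"host{i}.example.net/files/setup{i}.exe" for i in range(20)]
urls = good + bad
labels = ["good"] * len(good) + ["bad"] * len(bad)

# The default word tokenizer stands in for the custom URL tokenizer here.
X = TfidfVectorizer().fit_transform(urls)

# Hold out 20% of the rows for testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

model = LogisticRegression()
model.fit(X_train, y_train)

# score() returns the fraction of held-out URLs that were classified correctly.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

In current scikit-learn releases train_test_split is imported from sklearn.model_selection; older releases exposed it in sklearn.cross_validation, which has since been removed.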

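That’s it. See, it’s that simple yet so effective. We get an accuracy of 98%. For context, around 80% of the data set is clean, so a model that always answered “clean” would already score around 80%; 98% is well above that baseline and a very high value for a machine detecting malicious URLs.

Want to test some links to see if the model gives good predictions? Sure. Let’s do it.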
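A sketch of how new links can be checked, assuming a vectorizer and model fitted as in the earlier steps. The training URLs here are invented placeholders, so the labels this toy model prints are not meaningful; the point is only the call sequence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented placeholder training data, standing in for the real data set.
train_urls = [f"site{i}.example.org/index.html" for i in range(20)]
train_urls += [f"host{i}.example.net/files/setup{i}.exe" for i in range(20)]
train_labels = ["good"] * 20 + ["bad"] * 20

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_urls), train_labels)

to_check = [
    "wikipedia.com",
    "www.radsport-voggel.de/wp-admin/includes/log.exe",
]

# transform (not fit_transform) reuses the vocabulary learned during training,
# so the new rows have the same columns the model was fitted on.
X_new = vectorizer.transform(to_check)
predictions = model.predict(X_new)

for url, label in zip(to_check, predictions):
    print(url, "->", label)
```

Calling fit_transform on the new links instead would build a fresh vocabulary with different columns, and the model’s learned weights would no longer line up with the features.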

The results come out to be amazing.

  • wikipedia.com (Good Url)
  • google.com/search=faizanahad (Good Url)
  • pakistanifacebookforever.com/getpassword.php/ (Bad Url)
  • www.radsport-voggel.de/wp-admin/includes/log.exe (Bad Url)
  • ahrenhei.without-transfer.ru/nethost.exe (Bad Url)
  • www.itidea.it/centroesteticosothys/img/_notes/gum.exe (Bad Url)

This is what a human would have predicted. No?

The data and code are available on GitHub.

That is it. I hope you enjoyed reading.

Your comments are most welcome.


16 Comments

    • 3
      Faizan Ahmad

      The dataset was compiled by scraping various websites that list malicious links, e.g. vxvault.net. The good URLs were obtained from a link given in a research paper. That dataset was public. There are no usage restrictions.

      Best Regards

  1. 4
    halloween Wishes

    I don’t know if it’s just me or if perhaps everyone else is experiencing problems with your website.

    It appears that some of the text in your posts is running off the screen.
    Can someone else please comment and let me know
    if this is happening to them as well? This may be an issue with
    my internet browser because I’ve had this happen before.
    Thanks

  2. 5
    Yuri

    I think method getTokens could be simplified using re.split which supports regular expressions.
    It could be as simple as:
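    import re
    def getTokens(input):
        # '-' goes last in the class so it is a literal hyphen, not a range
        tokens = set(re.split(r'[./-]', input))
        tokens.discard('com')  # set.pop() takes no argument; discard() is safe if 'com' is absent
        return tokens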

  3. 7
    SeongKyu, Park

    Hi.
    It is a very interesting tool.
    How do I use it?
    When I run the command, I get an error message:
    ImportError: No module named sklearn.feature_extraction.text
    Please check it.
    Thanks.

  4. 9
    Nisha

    Hi Faizan,
    Nice work. I tried to run your script, but it looks like ‘train_test_split’ is not defined.
    Am I missing any file?
    -Thanks

  5. 14
    Dee

    Hi Faizan, the explanation is really good, but I am not able to run the code. I am relatively new to Python and am running the code in Python 3.5. I am getting this error: ValueError: empty vocabulary; perhaps the documents only contain stop words. It is raised at this line: X = vectorizer.fit_transform(corpus). Can you please check it?
    Thanks
