Fwaf – Machine Learning Driven Web Application Firewall

  • It seems like what it actually "learned" is no better than banning some keywords, for example:

      >>> p=lambda x:lgs.predict(vectorizer.transform([x]))
      >>> p("/product.php?name=etc")
      array([1])
      >>> p("/login.php?name=rfoo&pass=hehe")
      array([1])
      >>> p("/download.php?file=/root/.bashrc")
      array([0])
      >>> p("/example/test/q=" + lorem + "<script>alert(1)</script>") # len(lorem) = 4488
      array([0])
    
    (FYI 1 means malicious and 0 means clean)

  • It doesn't look like he did any cross-validation, hence the suspiciously high accuracy. Always keep a hold-out set to test against, for example:
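
    A minimal sketch, assuming a scikit-learn TF-IDF + logistic-regression setup like the one probed above (the queries/labels arrays and the n-gram range here are hypothetical, not the author's actual code):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score, train_test_split
      from sklearn.pipeline import make_pipeline

      # queries: list of URL strings, labels: 1 = malicious, 0 = clean (hypothetical data)
      X_train, X_test, y_train, y_test = train_test_split(
          queries, labels, test_size=0.2, stratify=labels, random_state=42)

      model = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
                            LogisticRegression())

      # cross-validate on the training split only ...
      print(cross_val_score(model, X_train, y_train, cv=5))

      # ... then report the score on the untouched hold-out set
      model.fit(X_train, y_train)
      print(model.score(X_test, y_test))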

  • In case someone is having difficulty with the link, here is an alternative:

    https://web.archive.org/web/20170514081124/http://fsecurify....

    Apologies for the inconvenience.

    Thanks

  • Like others have said, you might be overfitting your training data here: your model is just memorising the examples you give it and would fail if somebody slightly varied a payload (e.g. by inserting some whitespace).

    Another thing to keep in mind is that an accuracy of 99% doesn't mean much in an imbalanced problem like yours (many more clean queries than malicious ones).

    What you should show instead is precision (of the queries labeled malicious, how many are actually malicious?) and recall (of the malicious queries in the dataset, how many did your model label as malicious?), for example:
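
    A sketch with scikit-learn's metrics (y_true / y_pred are hypothetical ground-truth labels and model predictions; note that on a 99:1 clean-to-malicious split, a model that flags nothing still scores 99% accuracy with 0% recall):

      from sklearn.metrics import classification_report, precision_score, recall_score

      # y_true: ground-truth labels, y_pred: model predictions (1 = malicious, 0 = clean)
      print(precision_score(y_true, y_pred))  # of the queries flagged malicious, how many really are
      print(recall_score(y_true, y_pred))     # of the truly malicious queries, how many got flagged
      print(classification_report(y_true, y_pred, target_names=["clean", "malicious"]))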

  • The similarities in name _and_ logo to F-Secure are a little bothersome.

  • Fun, will look into this! Can anyone point to other datasets?

  • The website is a bit slow, but the page does load. If you run into any problems, please wait a minute and the page will load.

  • Why use trigrams (n = 3)?
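
    For what it's worth, a sketch of how one could check that choice empirically, assuming a scikit-learn pipeline like the hypothetical one above (again, not the author's actual code): grid-search over the n-gram range and compare.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import GridSearchCV
      from sklearn.pipeline import Pipeline

      pipe = Pipeline([("tfidf", TfidfVectorizer(analyzer="char")),
                       ("clf", LogisticRegression())])

      # compare unigrams, bigrams, and trigrams instead of fixing n = 3 up front
      grid = GridSearchCV(pipe,
                          {"tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (3, 3)]},
                          scoring="f1", cv=5)
      grid.fit(queries, labels)  # hypothetical lists of query strings and 0/1 labels
      print(grid.best_params_, grid.best_score_)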

  • Woo, saving this. I'm planning a similar project once I'm done with my studies.