AI-Driven Phishing Detection in Choreo
- Lahiru Ganegoda
- Senior Technical Lead - WSO2
Introduction
A major obstacle to detecting phishing is limited visibility at the Internet Service Provider (ISP) level. Traditional tools rely mostly on network-level information, so they often cannot recognize phishing sites in real time. Phishing schemes frequently use complex, changing tactics, such as rotating domains or imitating legitimate sites, which can go unnoticed without a detailed analysis of the site’s content. As a result, it is difficult to identify these threats accurately without manually reviewing the site’s structure, behavior, and intent. Choreo, WSO2’s internal developer platform as a service, has introduced a solution that uses custom-developed AI technologies to detect potentially malicious sites. This document outlines the methodology Choreo uses to identify such malicious activity, providing a safer platform for its clients and other internet users.
The Problem: Hosting Phishing Sites
Phishing is a prevalent challenge across many online hosting platforms. It involves deceiving individuals into sharing sensitive information by mimicking legitimate entities, often through websites crafted to appear trustworthy. With the accessibility of free-tier hosting options, there is potential for misuse, as low entry barriers can sometimes be exploited for phishing activities.
Why Do We Need to Address This?
- Protecting Users: Ensuring that everyone interacting with sites hosted on our platform can do so safely, without the risk of falling victim to phishing attempts, is our top priority.
- Safeguarding Our Platform's Integrity: Maintaining a high standard of security is essential for building trust, ensuring our platform remains a reliable space for both developers and users.
- Supporting Legitimate Development: By tackling these security challenges, we help foster a positive experience for genuine developers, ensuring their work isn’t compromised by malicious users.
To continue fostering a safe and secure environment, we are committed to strengthening our measures for identifying and preventing phishing activities across our platform.
The Role of AI in Phishing Detection
Traditional security tools often struggle to keep up with the evolving tactics used by phishing sites. Signature-based detection can be bypassed by slight modifications to a phishing site, and manual analysis does not scale to the volume of content hosted on most hosting platforms.
This is where AI comes into play. By utilizing OpenAI's advanced models, we can automatically analyze the content and structure of websites hosted on our platform, identifying patterns and anomalies indicative of phishing.
Advantages of AI in Phishing Detection
- Proactive Identification: AI can detect phishing sites based on content analysis, even before they are reported or flagged by other users.
- Adaptability: AI models learn from vast datasets, enabling them to recognize and adapt to new phishing techniques that may not yet be cataloged by traditional security systems.
- Efficiency: The AI framework operates in real time, scanning and analyzing websites as they are deployed, ensuring prompt detection and action.
Our AI-driven solution is designed to minimize false positives while maximizing the detection of genuinely malicious sites, ensuring that legitimate developers are not unduly impacted.
AI-Based Phishing Detection Methodology
Solution Overview
The AI-based phishing detection framework we've developed is designed to be highly efficient and scalable, ensuring it can handle the diverse and dynamic environments of modern hosting platforms.
Data Collection Layer
- Data Feeding Engine: The solution will extract information from the logs and database and feed it into the analysis engine. Since the analysis engine operates at a consistent frequency, the data feeding process will run at short intervals, with some overlap, to ensure that all created, modified, and requested web apps are thoroughly analyzed.
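The overlapping-interval idea described above can be sketched as follows. This is a minimal illustration, not Choreo's actual implementation: the function name `overlapping_windows` and the interval and overlap values are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def overlapping_windows(start, end, interval, overlap):
    """Yield (window_start, window_end) pairs that cover [start, end].

    Each window starts `overlap` before the previous window ended, so an
    event logged right at a boundary is seen by two consecutive runs.
    All parameter names here are hypothetical.
    """
    if overlap >= interval:
        raise ValueError("overlap must be shorter than interval")
    window_start = start
    while window_start < end:
        window_end = min(window_start + interval, end)
        yield window_start, window_end
        if window_end >= end:
            break
        # Step back by the overlap so boundary events are not missed.
        window_start = window_end - overlap


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 0, 0)
    for ws, we in overlapping_windows(
        t0, t0 + timedelta(minutes=25), timedelta(minutes=10), timedelta(minutes=2)
    ):
        print(ws.time(), "->", we.time())
```

Because consecutive windows overlap, the same web app can appear in two runs, so a consumer of these windows would need to deduplicate, for example by app identifier.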
AI Analysis Engine
- Information Gathering: This layer crawls and scrapes every website identified by the data feeding layer, collecting OCR (Optical Character Recognition) text, image elements, and HTML content for further analysis. Relying solely on OpenAI to extract all the required information may not be sufficient, so additional technologies are needed to enhance data extraction. For instance, a custom web crawler that collects image elements, OCR text, and HTML content from the target sites can help uncover more detailed insights. This approach gathers information at a granular level, tailored to specific use cases. The enriched data is then fed into the model, enabling it to generate more accurate and relevant results.
- Model Training: The AI engine is trained using OpenAI's language models, with a focus on distinguishing between legitimate sites and phishing attempts. The training data includes a wide range of phishing and non-phishing sites to ensure robustness.
- Pattern Recognition: The engine analyzes the collected data, looking for phishing indicators such as suspicious URLs, misleading content, brand abuse and anomalous patterns.
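As a rough illustration of the information-gathering and pattern-recognition steps above, the sketch below pulls a few signals out of a page's HTML using only the Python standard library. It is not the Choreo crawler: the class `PageSignals`, the functions `extract_signals` and `brand_mismatches`, the `KNOWN_BRANDS` table, and all hostnames are hypothetical, and OCR text is assumed to arrive as a plain string from a separate step.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageSignals(HTMLParser):
    """Collects password fields, form targets, and visible text."""

    def __init__(self):
        super().__init__()
        self.password_fields = 0
        self.form_actions = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "input" and (attr.get("type") or "").lower() == "password":
            self.password_fields += 1
        elif tag == "form" and attr.get("action"):
            self.form_actions.append(attr["action"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract_signals(html, page_url, ocr_text=""):
    """Return a dict of simple signals for one page."""
    parser = PageSignals()
    parser.feed(html)
    parser.close()
    page_host = urlparse(page_url).hostname or ""
    # A form that posts to a different host than the page is a common warning sign.
    external = [
        action for action in parser.form_actions
        if (urlparse(action).hostname or page_host) != page_host
    ]
    text = " ".join(parser.text_parts + [ocr_text]).strip().lower()
    return {
        "host": page_host,
        "password_fields": parser.password_fields,
        "external_form_actions": external,
        "text": text,
    }


# Hypothetical brand table: brand keyword -> domains allowed to use it.
KNOWN_BRANDS = {"examplebank": {"examplebank.com"}}


def brand_mismatches(signals, brands=KNOWN_BRANDS):
    """List brands mentioned on the page whose official domains do not match the host."""
    host = signals["host"]
    return sorted(
        brand for brand, domains in brands.items()
        if brand in signals["text"]
        and not any(host == d or host.endswith("." + d) for d in domains)
    )
```

Signals like these would be passed to the model as enriched input rather than used as a verdict on their own. A login form and a brand mention are both common on legitimate sites.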
Decision-Making Layer
- Risk Scoring: Each site is assigned a risk score based on the AI's analysis. Sites with high-risk scores are flagged for further inspection or immediate action.
- Automated Notification: Depending on the nature of the detected threat, the system can automatically notify the WSO2 Security Operations team for further analysis, and it can suspend or take down the site and block access to it.
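The decision step above can be pictured as a mapping from a risk score to an action. The sketch below is illustrative only: the 0.0 to 1.0 score scale, the threshold values, and the names `Action`, `Thresholds`, and `decide` are assumptions, since the source does not specify how scores are scaled or where the cut-offs sit.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    REVIEW = "notify_security_team"
    SUSPEND = "suspend_and_notify"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs on an assumed 0.0-1.0 risk scale.
    review: float = 0.5
    suspend: float = 0.9


def decide(risk_score, thresholds=Thresholds()):
    """Map a risk score to an action; higher scores trigger stronger actions."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0.0 and 1.0")
    if risk_score >= thresholds.suspend:
        return Action.SUSPEND
    if risk_score >= thresholds.review:
        return Action.REVIEW
    return Action.ALLOW
```

Keeping a separate review band between "allow" and "suspend" is one way to limit the impact of false positives, since borderline sites go to a human before any takedown.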
Feedback and Learning Loop
- Continuous Improvement: The system is designed to learn from its decisions, incorporating feedback to refine its detection algorithms continuously. This ensures that the AI adapts to new phishing strategies over time.
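One simple way to picture such a feedback loop is to let reviewer verdicts on flagged sites nudge the flagging threshold. This is a hypothetical sketch and not the mechanism the source describes in detail: `cleared_share`, `adjust_threshold`, the target rate, the step size, and the bounds are all assumptions.

```python
def cleared_share(verdicts):
    """Share of flagged sites that reviewers cleared as legitimate.

    `verdicts` is a list of booleans, one per flagged site:
    True if the reviewer confirmed phishing, False if the site was cleared.
    """
    if not verdicts:
        return 0.0
    return sum(1 for confirmed in verdicts if not confirmed) / len(verdicts)


def adjust_threshold(current, verdicts, target=0.1, step=0.05, low=0.3, high=0.95):
    """Nudge the flagging threshold based on recent reviewer verdicts.

    Too many cleared sites: raise the threshold to flag less.
    Very few cleared sites: lower it to catch more.
    All defaults are illustrative values.
    """
    if not verdicts:
        return current  # no feedback yet, leave the threshold alone
    share = cleared_share(verdicts)
    if share > target:
        current += step
    elif share < target / 2:
        current -= step
    # Keep the threshold inside sane bounds.
    return round(min(max(current, low), high), 2)
```

Reviewer verdicts could equally be fed back as labeled examples for the model. The threshold adjustment here is only the simplest form such a loop can take.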
Results and Impact
With the implementation of our AI-driven phishing detection framework, we have achieved a substantial reduction in phishing-related activities across the Choreo platform, significantly lowering the noise generated by these malicious sites. This proactive measure not only helps protect Choreo users from phishing scams but also maintains the platform’s integrity as a trusted environment for developers.
By reducing the prevalence of phishing, we create a cleaner, more reliable space where developers can focus on innovation without unnecessary distractions. This framework leverages advanced AI technology to ensure a safer, more streamlined experience for the entire development community.