Web page Classification using Network Analysis Approach, By Sadegh Sulaimany

Research

Title	Web page Classification using Network Analysis Approach
Type	Thesis
Keywords	Web Page Classification, Graph-based Features, Machine Learning, Gradient Boosting, Information Retrieval
Year	2024
Researchers	Heshw Abas Mohammed (Student), Sadegh Sulaimany (Primary Advisor)

Abstract

Web page classification is a fundamental task in web mining and plays a crucial role in organizing and managing the vast amount of information available on the internet. As the web continues to grow exponentially, accurate and efficient classification methods become increasingly important: proper categorization of web pages enables more effective information retrieval, enhances search engine performance, and facilitates content management across various domains. However, the dynamic nature of web content, diverse page structures, and the sheer volume of data pose significant challenges to traditional classification approaches.
This thesis addresses these challenges by proposing a novel method that combines network analysis with conventional content-based techniques, aiming to improve the accuracy and robustness of web page classification systems. Our methodology involves constructing network graphs from web page datasets, extracting centrality measures, and incorporating these as additional features for machine learning algorithms. We use the Dmoz dataset, a comprehensive web directory, to train and evaluate various classification algorithms. Our approach employs both Pearson and Spearman correlation methods to capture linear and monotonic relationships, respectively, between web pages. We compare the performance of multiple machine learning algorithms, including Naive Bayes, Decision Trees, Support Vector Machines, and ensemble methods such as Random Forests and Gradient Boosting. The results demonstrate significant improvements in classification accuracy compared to existing methods.
Our best-performing model, the Histogram-Based Gradient Boosting Classifier, achieves an accuracy of 77.17% using the Spearman method, outperforming previous benchmarks. We provide a comprehensive analysis of classifier performance using multiple metrics, including precision, recall, F1-score, and Area Under the Curve (AUC). This research contributes to the field of web mining by offering a more adaptable and efficient approach to web page classification. The integration of graph-based features enhances the model's ability to capture complex relationships between web pages, leading to improved classification accuracy. Our findings have important implications for various applications, including search engine optimization, content management, and information retrieval systems. The thesis concludes by discussing the limitations of the current approach and proposing future research directions, including the integration of deep learning techniques, exploration of dynamic graph analysis, and investigation of multi-modal classification methods.
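The abstract does not say exactly how the network graph is built, so the following is only a minimal sketch of the general idea, under stated assumptions: each page is represented by a numeric feature vector, two pages are linked when their correlation exceeds a threshold, and normalized degree is used as the centrality measure. The function names, the 0.5 threshold, the choice of degree centrality, and the toy vectors are all hypothetical and are not taken from the thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def degree_centrality(vectors, corr=spearman, threshold=0.5):
    """Link pages whose correlation exceeds `threshold` (an assumed rule)
    and return each page's degree divided by (n - 1)."""
    n = len(vectors)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if corr(vectors[i], vectors[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    return [d / (n - 1) for d in degree]


def add_graph_features(vectors, corr=spearman, threshold=0.5):
    """Append the centrality score to each page's content feature vector."""
    centrality = degree_centrality(vectors, corr, threshold)
    return [list(v) + [c] for v, c in zip(vectors, centrality)]


# Toy feature vectors for four hypothetical pages (not real Dmoz data).
pages = [
    [1, 2, 3, 4],
    [2, 4, 6, 9],
    [4, 3, 2, 1],
    [1, 3, 2, 4],
]
augmented = add_graph_features(pages)
```

Each row of `augmented` holds the original features plus one graph-based feature, and could then be passed to any classifier, for example scikit-learn's `HistGradientBoostingClassifier`. Passing `corr=pearson` instead of the default builds the graph from linear rather than monotonic relationships.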