|
Abstract
|
This research introduces a comprehensive dataset of academic publications and professorial metrics from Iranian universities, systematically collected from Google Scholar using Python-based tools such as Selenium and BeautifulSoup and validated through expert review. The dataset includes over 1.5 million article records scraped from various categories, with each article's title, citations, authorship details, and institutional affiliations. It spans multiple interlinked files whose attributes cover article metadata, professor profiles, and institutional details. The raw file articles.csv was kept as scraped except for exact duplicate removal, while a four-step Data Refinement Process (governmental affiliation, ≥ 100 citations, author-article verification, 2020–22 window) produced final_articles.csv for analysis. Specifically, we applied a temporal filter (2020–2022) in conjunction with institution- and author-level criteria, restricting the data to governmental universities and professors who meet our citation threshold, and excluded records missing essential metadata (entries without titles or with removed or invalid Google Scholar links), yielding a focused cohort suited to downstream analytical pipelines. Together, the dataset's attributes enable in-depth exploration of academic productivity, collaboration networks, and institutional performance across disciplines.
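The refinement steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Article` record, all of its field names, and the `dedupe_exact` and `refine` helpers are hypothetical, and the sketch assumes the ≥ 100 citation threshold applies at the professor level and that author-article verification has already been recorded as a boolean flag.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


# Hypothetical record layout; the dataset's real column names may differ.
@dataclass(frozen=True)
class Article:
    title: Optional[str]
    scholar_url: Optional[str]   # None stands for a removed/invalid link
    year: Optional[int]
    author_id: str
    author_citations: int        # assumed: professor-level citation count
    university_type: str         # assumed values: "governmental", "private"
    author_verified: bool        # assumed: result of author-article verification


def dedupe_exact(rows: Iterable[Article]) -> List[Article]:
    """Drop exact duplicate records, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def refine(rows: Iterable[Article], min_citations: int = 100,
           first_year: int = 2020, last_year: int = 2022) -> List[Article]:
    """Apply the metadata check and the four refinement criteria."""
    kept = []
    for r in rows:
        if not r.title or not r.scholar_url:
            continue  # missing essential metadata
        if r.university_type != "governmental":
            continue  # step 1: governmental affiliation
        if r.author_citations < min_citations:
            continue  # step 2: citation threshold
        if not r.author_verified:
            continue  # step 3: author-article verification
        if r.year is None or not (first_year <= r.year <= last_year):
            continue  # step 4: 2020-2022 window
        kept.append(r)
    return kept
```

In this sketch, `dedupe_exact` corresponds to the only cleaning applied to articles.csv, and `refine` corresponds to the filtering that yields final_articles.csv.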
|