GC-28 Modern Web Scraping

Kenny Randolph, Kennesaw State University
Joselyn Giron, Kennesaw State University
Denise Tucker, Kennesaw State University
Justin B Bridges, Kennesaw State University
Sandhya Bantu

Description

This project was developed for the IT 7993 Capstone class in the May semester of 2021. The goal of the project is to scrape the names of all key professionals of organizations listed on the open990.org website and insert that information into a structured database for query and analysis. The Key Professionals dataset aims to provide global coverage of the key investor and consultant professionals involved in making investment decisions, beginning with US-based companies. The overarching aim is to create a one-stop center for institutional asset management distribution intelligence: a single place to find mandates, documentation, and profiles of consultants, investors, and managers, with key technical contact information, by including coverage within the eVestment network for US investors and consultants.
From end to end, the Key Professionals database project consists of creating a web crawler to retrieve information from the open990 website, wrangling the data into the desired structure, and inserting it into a database for comprehensive data analysis. The primary data source is the open990.org website. The team was given a list of organization names as scraping targets. Each organization has a page on the open990 website containing the organization's information, including the names of its key professionals, which are the target data.
Scraping data from the open990 website presented several challenges. First, the website is built entirely with JavaScript, which requires specific techniques to render the pages before they can be scraped. Second, the organization pages have differing data structures, which complicates parsing. Third, most of the data is in tables that are delivered through a backend API. Fourth, because the tables are delivered from a backend API, the HTML tags used for the data are not unique, so identifying and parsing specific data by HTML tag was not possible.
Lastly, by observing the network traffic with the Chrome browser tools and examining the HAR data returned from Splash, we discovered that the website is delivered through Cloudflare servers, which we believe blocked some of our attempts to scrape the data. Cloudflare is a content delivery network featuring robust security services. The complexity of the webpage is an example of how modern, secure web development will change the landscape and require web scrapers to develop more advanced methods of automation.
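The HAR inspection described above can be illustrated with a minimal sketch. It is not the team's actual code: the function name `cloudflare_entries`, the hand-written sample data, and the example.org URLs are all hypothetical. The sketch assumes the HAR 1.2 layout (`log.entries[].response.headers` as name/value pairs) and relies on the fact that responses served through Cloudflare typically carry a `Server: cloudflare` header or a `CF-RAY` header.

```python
def cloudflare_entries(har):
    """Return (url, status) for each HAR entry whose response headers look like Cloudflare."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        # HAR stores headers as a list of {"name": ..., "value": ...} objects.
        headers = {h["name"].lower(): h["value"] for h in response.get("headers", [])}
        served_by_cloudflare = (
            headers.get("server", "").lower() == "cloudflare" or "cf-ray" in headers
        )
        if served_by_cloudflare:
            hits.append((entry["request"]["url"], response.get("status")))
    return hits


# Hypothetical, hand-written sample in HAR 1.2 shape (not real captured traffic).
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.org/org/1"},
                "response": {
                    "status": 403,
                    "headers": [
                        {"name": "Server", "value": "cloudflare"},
                        {"name": "CF-RAY", "value": "placeholder-ray-id"},
                    ],
                },
            },
            {
                "request": {"url": "https://example.org/static/app.js"},
                "response": {
                    "status": 200,
                    "headers": [{"name": "Server", "value": "nginx"}],
                },
            },
        ]
    }
}

for url, status in cloudflare_entries(sample_har):
    print(status, url)
```

These headers only show that a response passed through Cloudflare. Whether a request was actually blocked has to be judged from other evidence, such as the status code or the returned page body.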
Advisor(s): Dr. Meng Han
Topic(s): Data/Data Analytics
IT 7993

 
May 26th, 5:00 PM

https://ccse.kennesaw.edu/computing-showcase/cday-programs/spring2021program.php

https://digitalcommons.kennesaw.edu/cday/Fall2021/Graduate_Capstone/1