CS121 Python Web Crawler Project

xzna963
Asked: Jan 22nd, 2021

Question Description

Need help with creating a web crawler in Python. As you work, please let me know if you have questions.

web cache image: https://imgur.com/a/peKrf6H

github: https://github.com/Mondego/spacetime-crawler4py

different HTTP status codes: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10...

Unformatted Attachment Preview

Specifications

To get started, fork or get the crawler code from https://github.com/Mondego/spacetime-crawler4py. Read the instructions in the README.md file up to, and including, the section "Execution". This is enough to implement the simple crawler for this project. In short, this is the minimum amount of work that you need to do:

1. Install the dependencies.

2. Set the USERAGENT variable in Config.ini so that it contains the student IDs of all group members separated by commas (the numbers! e.g. IR UW21 123123213,12312312,123123), and also modify the quarter information (i.e. UW21 for Undergraduate Winter 2021, US21 for Undergraduate Spring 2021, etc.). If you fail to do this properly, your crawler will not appear in the server's log, which will put your grade for this project at risk.

3. (This is the meat of the crawler.) Implement the scraper function in scraper.py. The scraper function receives a URL and the corresponding Web response (for example, the first one will be "http://www.ics.uci.edu" and the Web response will contain the page itself). Your task is to parse the Web response, extract enough information from the page (if it is a valid page) to be able to answer the questions for the report, and finally return the list of URLs scraped from that page. Some important notes (a minimal sketch of such a scraper follows this list):
   1. Make sure to return only URLs that are within the domains and paths listed below! (See the is_valid function in scraper.py -- you need to change it.)
   2. Make sure to defragment the URLs, i.e. remove the fragment part.
   3. You can use whatever libraries make your life easier to parse things. Optional dependencies you might want to look at: BeautifulSoup, lxml (nudge, nudge, wink, wink!).
   4. Optionally, in the scraper function, you can also save the URL and the web page on your local disk.

4. Run the crawler from your laptop/desktop or from an ICS openlab machine (you can use either the classical ssh & scp to openlab.ics.uci.edu, or the web interface hub.ics.uci.edu from your browser; I would recommend ssh, so that you learn a skill that will probably be important for the rest of your professional life). Note that to install software on machines that you do not own or are not authorized to sudo on, you need to install it into your user folder; with pip/pip3, use the --user option to do so. Note that the crawl will take several hours, possibly a day! It may even never end if you are not careful with your implementation! You also need to be inside the campus network, or you won't be able to crawl; if your computer is outside UCI, use the VPN.

5. Monitor what your crawler is doing. If you see it trapped in a Web trap, or malfunctioning in any way, stop it, fix the problem in the code, and restart it. Sometimes you may need to restart from scratch. In that case, delete the frontier file (frontier.shelve), or move it to a backup location, before restarting the crawler.
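To make item 3 concrete, here is a minimal sketch of what scraper.py could look like. It assumes BeautifulSoup for parsing, assumes the response object exposes resp.status and resp.raw_response.content the way the starter code's Response does (verify against your copy of the repository), and the extract_next_links helper and ALLOWED_DOMAINS constant are just one way to organize the code, not something the assignment prescribes:

```python
import re
from urllib.parse import urlparse, urljoin, urldefrag

from bs4 import BeautifulSoup  # optional dependency suggested in the spec

# Allowed domains from the spec; the exact matching logic is up to you.
ALLOWED_DOMAINS = (".ics.uci.edu", ".cs.uci.edu", ".informatics.uci.edu", ".stat.uci.edu")

def scraper(url, resp):
    # Keep only links that pass the domain/path filter.
    return [link for link in extract_next_links(url, resp) if is_valid(link)]

def extract_next_links(url, resp):
    links = []
    # Skip error responses and empty bodies (dead URLs that return 200 but no data).
    # Attribute names follow the starter code's Response object; check your copy.
    if resp.status != 200 or resp.raw_response is None or not resp.raw_response.content:
        return links
    soup = BeautifulSoup(resp.raw_response.content, "html.parser")  # or "lxml" if installed
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(url, anchor["href"])
        defragmented, _ = urldefrag(absolute)  # requirement: drop the fragment part
        links.append(defragmented)
    return links

def is_valid(url):
    try:
        parsed = urlparse(url)
        if parsed.scheme not in {"http", "https"}:
            return False
        host = parsed.netloc.lower()
        in_domains = any(host == d.lstrip(".") or host.endswith(d) for d in ALLOWED_DOMAINS)
        in_today = host == "today.uci.edu" and parsed.path.startswith(
            "/department/information_computer_sciences/")
        if not (in_domains or in_today):
            return False
        # Filter common non-HTML extensions; extend this as you monitor the crawl.
        return not re.match(
            r".*\.(css|js|bmp|gif|jpe?g|ico|png|pdf|zip|rar|gz|mp3|mp4|avi|pptx?|docx?|xlsx?)$",
            parsed.path.lower())
    except TypeError:
        return False
```

The extension filter in is_valid is only a starting point; you will almost certainly add trap and low-information checks as you watch where the crawler actually goes.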
Crawler Behavior Requirements

In this project, we are looking for text in Web pages so that we can search it later on. The following is a list of what a "correct crawl" entails in this context:

• Honor the politeness delay for each site
• Crawl all pages with high textual information content
• Detect and avoid infinite traps
• Detect and avoid sets of similar pages with no information
• Detect and avoid dead URLs that return a 200 status but no data (see the RFC 2616 link above for what the different HTTP status codes mean)
• Detect and avoid crawling very large files, especially if they have low information value

For most of these requirements, the only way you can detect these problems is by first monitoring where your crawler is going, and then adjusting its behavior in order to stay away from problematic pages.

In this project, you are going to implement the core of a Web crawler, and then you are going to crawl the following URLs (to be considered as domains for the purposes of this assignment) and paths:

• *.ics.uci.edu/*
• *.cs.uci.edu/*
• *.informatics.uci.edu/*
• *.stat.uci.edu/*
• today.uci.edu/department/information_computer_sciences/*

As a concrete deliverable of this project, besides the code itself, you must submit a report containing answers to the following questions (a small analytics sketch follows this list):

1. How many unique pages did you find? Uniqueness for the purposes of this assignment is ONLY established by the URL, discarding the fragment part. So, for example, http://www.ics.uci.edu#aaa and http://www.ics.uci.edu#bbb are the same URL. Even if you implement additional methods for textual similarity detection, please keep using the above definition of unique pages when counting them for this assignment.

2. What is the longest page in terms of the number of words? (HTML markup doesn't count as words.)

3. What are the 50 most common words in the entire set of pages crawled under these domains? (Ignore English stop words, which can be found, for example, in the stop-word list linked from the assignment page.) Submit the list of common words ordered by frequency.

4. How many subdomains did you find in the ics.uci.edu domain? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. The content of this list should be lines containing URL, number, for example: http://vision.ics.uci.edu, 10 (not the actual number here).
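The report questions above map naturally onto a small bookkeeping module. The following is a rough sketch under the assumption that you call record_page(url, text) once per successfully downloaded page, where text is the visible page text already stripped of HTML markup (e.g. via BeautifulSoup's get_text()); record_page and the tiny STOP_WORDS set are placeholders of my own, not names from the starter code, and you should substitute the full stop-word list the assignment expects:

```python
import re
from collections import Counter
from urllib.parse import urlparse, urldefrag

# Placeholder stop-word list; replace with the full list referenced in the assignment.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}

unique_pages = set()       # question 1: unique defragmented URLs
longest_page = ("", 0)     # question 2: (url, word count)
word_counts = Counter()    # question 3: corpus-wide word frequencies
subdomain_pages = {}       # question 4: ics.uci.edu subdomain -> set of unique pages

def tokenize(text):
    # Alphanumeric sequences; HTML markup should already have been removed upstream.
    return re.findall(r"[a-zA-Z0-9]+", text.lower())

def record_page(url, text):
    global longest_page
    defragged, _ = urldefrag(url)
    if defragged in unique_pages:
        return
    unique_pages.add(defragged)

    words = tokenize(text)
    if len(words) > longest_page[1]:
        longest_page = (defragged, len(words))
    word_counts.update(w for w in words if w not in STOP_WORDS)

    host = urlparse(defragged).netloc.lower()
    if host.endswith(".ics.uci.edu") or host == "ics.uci.edu":
        subdomain_pages.setdefault(host, set()).add(defragged)

def report():
    print("Unique pages:", len(unique_pages))
    print("Longest page:", longest_page)
    print("Top 50 words:", word_counts.most_common(50))
    for host in sorted(subdomain_pages):
        print(f"http://{host}, {len(subdomain_pages[host])}")
```

Persisting these structures (for example with shelve or periodic dumps to disk) is worth considering, since a multi-hour crawl may need to be restarted.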
What to submit: a zip file containing your modified crawler code and the report.

Test Period and Deployment Period

Due to the nature of this project, the time allocated to it is divided into two parts:

Test: until January 31st, 22h00. During this time, your crawler can make all sorts of mistakes -- try to crawl outside the allowed domains, be impolite, etc. There are no penalties if you are figuring things out, but there are penalties if you knock down the server on purpose (i.e. by removing politeness delays). Do not wait until the last minute to start working on your assignment; this assignment requires you to experiment significantly, and you will probably not have time to finish it if you start late.

Deployment: from January 31st, 22h30, until February 4th, 21h00 (9:00 pm). This is the real crawl. During this time, your crawler is expected to behave correctly. Even if you finish your project earlier, you must operate your crawler during this time period, and you must not restart the crawl more than three times during this period (unless there is a server issue; note that restarts are all recorded). You must submit your assignment on Canvas by February 4th, 21h00 (9:00 pm).

Late assignments: For late assignments, you cannot disrespect the crawling policy during the normal deployment period (for Winter 2021: you can only crawl three times during deployment, no more! MAKE SURE TO DEVELOP AND TEST YOUR CRAWLER DURING THE TEST PERIOD). You then have from February 5th, 9h00 am, until February 7th, 22h00, as an additional Test period, and from February 8th, 8h00, until February 9th, 8h00 am, as an additional Deployment period. You need to submit your late assignment on Canvas, with an automatic penalty of 25% of the grade, by the deadline; no submission will be accepted after February 9th, 8h00 am.

Note: The cache server may die for a few hours during these periods due to loads created by impolite crawlers. We will be monitoring the server closely, and it will be back online after at most ~8 hours during the Test period, and after at most ~4 hours during the Deployment period (unless it happens to die during the night, ~23h00 until ~7h00 am, in Irvine). Make sure to abide by the politeness rules and respect your colleagues, especially if you are trying to implement a multithreaded crawler.

Extra credit:

(+1 point) Implement checks and usage of the robots and sitemap files.

(+2 points) Implement exact and near webpage similarity detection using the methods discussed in the lecture. Your implementation must be made from scratch; no libraries are allowed.

(+5 points) Make the crawler multithreaded. However, your multithreaded crawler MUST obey the politeness rule: two or more requests to the same domain, possibly from separate threads, must have a delay of 500 ms between them (this is trickier than it seems!). In order to do this part of the extra credit, you should read the "Architecture" section of the README.md file. Basically, to make a multithreaded crawler you will need to (a sketch of the per-domain delay appears after the grading criteria):

1. Reimplement the Frontier so that it is thread-safe and makes per-domain politeness easy to manage
2. Reimplement the Worker thread so that it is politeness-safe
3. Set the THREADCOUNT variable in Config.ini to whatever number of threads you want
4. Keep it polite: if your multithreaded crawler knocks down the server you may be penalized (and note that it makes no sense to use too many threads, given the politeness rule that you MUST obey)

Grading criteria

1. Are the analytics in your report within the expected range? (10%)

2. Did your crawler operate correctly? -- we'll check our logs (~45%)
   1. Does it exist in Prof. Lopes' Web cache server logs? (If it's not in the ICS logs, it didn't happen: you will get 0.)
   2. Was it polite? (penalties for impolite crawlers)
   3. Did you crawl ALL domains and paths mentioned in the spec? (penalties for missing domains and paths)
   4. Did it crawl ONLY the domains and paths mentioned in the spec? (penalties for attempts to crawl outside)
   5. Did it avoid traps? (penalties for falling into traps)
   6. Did it avoid sets of pages with low information value? (penalties for crawling useless families of pages -- you must decide and discuss within your group a reasonable definition of a low-information-value page and be able to defend it during the interview with the TAs)

3. Are you able to answer and justify the questions about your code and the operation of your crawler? (~45%)
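For the multithreaded extra credit, the hard part is enforcing the 500 ms gap per domain across threads. One possible shape for that throttle is sketched below; the DomainThrottle class and the way it would be wired into the Frontier/Worker are assumptions of this sketch, not part of the starter code:

```python
import threading
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Blocks callers so that requests to the same domain start at least `delay` seconds apart."""

    def __init__(self, delay=0.5):
        self.delay = delay
        self.lock = threading.Lock()
        self.next_allowed = {}  # domain -> earliest time the next request may start

    def wait(self, url):
        domain = urlparse(url).netloc.lower()
        while True:
            with self.lock:
                now = time.time()
                allowed_at = self.next_allowed.get(domain, 0.0)
                if now >= allowed_at:
                    # Reserve the slot before releasing the lock, so other threads
                    # targeting the same domain keep waiting.
                    self.next_allowed[domain] = now + self.delay
                    return
                sleep_for = allowed_at - now
            time.sleep(sleep_for)

# Usage inside a worker thread (illustrative; the download call below stands in for
# however your copy of the starter code fetches a URL):
# throttle = DomainThrottle(delay=0.5)   # one instance shared by all workers
# throttle.wait(url)                     # blocks until this domain is polite to hit
# resp = download(url, config, logger)   # then fetch as usual
```

Keeping a single shared throttle (rather than one per thread) is what makes the delay hold across threads; it also means adding many threads gives little speedup when the frontier is dominated by a few domains, which is exactly the point the spec makes about not using too many threads.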