Assignment about Python


anffre707

Computer Science

Description

Please open the attachment.

I need the work to be 100% correct.

Python Information.docx 

Unformatted Attachment Preview

Python Information

You can download a copy on your system by going to python.org. We will be using Python 2.7, so be sure to use that version only, or you will have problems. Make sure it references 2.7. You will need to browse for Python each time to launch that version in the lab. Be sure to use the Python GUI (IDLE) program to use the interactive window. You can create a Python script using File -> New Script... and run it with Run -> Run Module.

Be sure to check out a few of these videos if you are new to Python:

https://www.youtube.com/watch?v=kY216CRafSI
https://www.youtube.com/watch?v=H9O6LzH32iE
https://www.youtube.com/watch?v=NleUKfRXQWI
https://www.youtube.com/watch?v=Xsp8lwj67Zk
https://www.youtube.com/watch?v=q48u_SVcZNo

1. Simple Web / Screen Scraper

Submit your code from our screen scraper lab once you have modified it to work for another site or for other data on the USZIP website. If you need examples, check out the following two tutorials on using a Python script to create a small scraper program. Part 1 is more of an introduction, while Part 2 shows you how to extract information from Yahoo! Finance. I did notice that the Yahoo! Finance URL has changed slightly (and seems much longer!) from the one used in the video, but it will still work. Go through both videos and practice the tutorial. Note: there may be issues, so if you run into an error or problem, see if you can work around it the best you can as part of the process.

https://www.youtube.com/watch?v=kPhZDsJUXic
https://www.youtube.com/watch?v=f2h41uEi0xU

2. Word Frequency Program

This assignment was the development of the word frequency counter program. For this one we modified the SearchEngine2 file.
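The crawler code below uses BeautifulSoup to pull the href out of every <a> tag on a page, which is the core of the screen-scraper lab. As a minimal, hedged sketch of that same link-extraction idea, written in Python 3 with only the standard library so it runs anywhere (the sample HTML and the LinkCollector class name are hypothetical, standing in for a page fetched from the USZIP site):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical sample page standing in for a fetched response.
sample_html = '<html><body><a href="http://example.com/a">A</a> <a href="/b">B</a></body></html>'
parser = LinkCollector()
parser.feed(sample_html)
print(parser.links)
```

In the Python 2.7 code below, BeautifulSoup's `soup('a')` and `link['href']` do this same job, with `urljoin` turning relative links like `/b` into absolute URLs.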
SearchEngine2:

import re
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

# Create a list of words to ignore
ignorewords = set(['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'])

class crawler:
    def __init__(self, dbname):
        pass

    def __del__(self):
        pass

    def dbcommit(self):
        pass

    # Index an individual page
    def addtoindex(self, url, soup):
        print "Found link %s" % url

    # Extract the text from an HTML page (no tags)
    def gettextonly(self, soup):
        v = soup.string
        if v is None:
            c = soup.contents
            resulttext = ''
            for t in c:
                subtext = self.gettextonly(t)
                resulttext += subtext + '\n'
            return resulttext
        else:
            return v.strip()

    # Separate the words on any non-word character
    def separatewords(self, text):
        splitter = re.compile('\\W*')
        return [s.lower() for s in splitter.split(text) if s != '']

    def crawl(self, pages, depth=2):
        allurl = open('urls.txt', 'w')
        for i in range(depth):
            newpages = {}
            for page in pages:
                try:
                    c = urllib2.urlopen(page)
                    if page.find('.jpg') == -1:  # if the page is not a jpg
                        allurl.write(page + '\n')
                except:
                    print "Could not open %s" % page
                    continue
                try:
                    soup = BeautifulSoup(c.read())
                    self.addtoindex(page, soup)
                    links = soup('a')
                    for link in links:
                        if ('href' in dict(link.attrs)):
                            url = urljoin(page, link['href'])
                            if url.find("'") != -1:
                                continue
                            url = url.split('#')[0]  # remove location portion
                            if url[0:4] == 'http':
                                newpages[url] = 1
                    self.dbcommit()
                except:
                    print "Could not parse page %s" % page
            pages = newpages
        allurl.close()

    def somecalc(self, p):
        listpages = open('urls.txt', 'r')
        for link in listpages:
            link = link.strip()  # drop the trailing newline before opening the URL
            #user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
            #headers = {'User-Agent' : user_agent}
            #try:
            #req = urllib2.Request(link, None, headers)
            p = urllib2.urlopen(link)
            soup2 = BeautifulSoup(p.read())
            text = self.gettextonly(soup2)
            words = self.separatewords(text)
            print words
            print "______________________________________________"
            #except urllib2.HTTPError, e:
            #print "g"
            #print e.fp.read()
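The gettextonly / separatewords pipeline above feeds the word frequency counter the assignment asks for. As a hedged sketch of that counting step in modern Python 3 (the sample sentence is hypothetical, and collections.Counter stands in for the counting loop you are expected to write yourself):

```python
import re
from collections import Counter

# Same ignore list as SearchEngine2 above
ignorewords = {'the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'}

def word_frequencies(text):
    """Split on runs of non-word characters, lowercase, drop ignored words, count the rest."""
    words = [w.lower() for w in re.split(r'\W+', text) if w]
    return Counter(w for w in words if w not in ignorewords)

# Hypothetical sample text; 'the' and 'and' are filtered out by the ignore list.
freqs = word_frequencies("The cat and the dog chased the cat")
print(freqs.most_common())
```

`re.split(r'\W+', ...)` mirrors the `separatewords` regex in the Python 2.7 code, and `Counter.most_common()` gives the words sorted by frequency.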
