Monday, 17 March 2014

Python's beautifulsoup is beautiful

If you're looking for a way to scrape data from an HTML page, Python is the way to go.
"Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping."

I found myself in a situation where I needed to download a lot of different images. Since I was too lazy to do it manually, I found BeautifulSoup to be very helpful. In my case I needed to download only the images with a certain alt and src description. I would then use that src to access the files on site/y and save them to a local folder.
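To make the filtering step concrete, here is a minimal sketch of it in Python 3. It uses the standard library's html.parser instead of BeautifulSoup, purely so that it runs without installing anything. The sample HTML, the FlagImageParser class, the find_flag_images helper and the "Flag of " prefix check are all assumptions made for illustration, not the markup of the real site.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page will look different.
SAMPLE_HTML = """
<img alt="Flag of France" src="/thumb/fr.png">
<img alt="Site logo" src="/static/logo.png">
<img alt="Flag of New Zealand" src="/thumb/nz.png">
"""


class FlagImageParser(HTMLParser):
    """Collect (alt, src) pairs for <img> tags whose alt starts with a prefix."""

    def __init__(self, prefix="Flag of "):
        super().__init__()
        self.prefix = prefix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        alt = attr_map.get("alt") or ""
        src = attr_map.get("src") or ""
        # Keep only images that have the wanted alt prefix and a usable src.
        if alt.startswith(self.prefix) and src:
            self.matches.append((alt, src))


def find_flag_images(html):
    """Return the (alt, src) pairs of all matching images in an HTML string."""
    parser = FlagImageParser()
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    for alt, src in find_flag_images(SAMPLE_HTML):
        print(alt, "->", src)
```

The important part is that the alt and src are read from the same filtered tag, so an unrelated image such as the logo never ends up in the download list.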

If you are using Linux, installing BeautifulSoup should be straightforward. However, if you are on Windows, I recommend installing Cygwin. You can rerun its setup.exe at any time to install new packages.

For more info on soup and how to install it go here.


import urllib
import re
import errno
import os, os.path
from BeautifulSoup import BeautifulSoup


url = "http://site.com/x"

# "page" rather than "file", so the Python 2 builtin is not shadowed
page = urllib.urlopen(url)

soup = BeautifulSoup(page.read())


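# keep only the <img> tags whose alt text contains "Flag of"
flags = soup('img', {'alt': re.compile("Flag of")})
j = len(flags)

nameList = [None]*j
tagList = [None]*j


for i in range(j):
   # read alt and src from the filtered list, not from every image on the page
   flag_alt = flags[i]['alt'].encode('ascii', 'ignore')
   flag_src = flags[i]['src'].encode('ascii', 'ignore')
   nameList[i] = flag_alt.split("of ", 1)[1]   # country name after "Flag of "
   tagList[i] = flag_src.split("/")[-1]        # file name at the end of the src path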
  
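# folder that holds this script; the images are saved next to it
out_dir = os.path.dirname(os.path.abspath(__file__))

for i in range(j):
   imgName = nameList[i].lower().replace(" ","_")+".png"
   filename = os.path.join(out_dir, imgName)
   if not os.path.exists(filename):
       # save to the same path that was just checked
       urllib.urlretrieve("http://site/y/"+tagList[i], filename)

# count the files in that folder; the script itself is one of them
num_of_files = len([name for name in os.listdir(out_dir) if os.path.isfile(os.path.join(out_dir, name))])
if j == num_of_files-1:
   print "All the files downloaded: "+str(j)
else:
   print str(j-(num_of_files-1))+" not downloaded!"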
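
The naming and bookkeeping in the script can be checked without touching the network. The sketch below (Python 3, standard library only) pulls those steps into small functions. The names local_name, remote_name and missing_files are hypothetical helpers introduced here for illustration; they are not part of the script above.

```python
import os
import tempfile


def local_name(alt_text):
    """Turn an alt text such as 'Flag of New Zealand' into 'new_zealand.png'."""
    # Split only once so names that contain "of " themselves stay whole.
    country = alt_text.split("of ", 1)[1]
    return country.strip().lower().replace(" ", "_") + ".png"


def remote_name(src):
    """Return the last path component of an image src."""
    return src.split("/")[-1]


def missing_files(expected_names, directory):
    """List the expected file names that are not present in directory."""
    return [name for name in expected_names
            if not os.path.isfile(os.path.join(directory, name))]


if __name__ == "__main__":
    names = [local_name("Flag of France"), local_name("Flag of New Zealand")]
    with tempfile.TemporaryDirectory() as tmp:
        # Pretend only the first download succeeded.
        open(os.path.join(tmp, names[0]), "w").close()
        print(missing_files(names, tmp))
```

Checking for each expected name, rather than counting every file in the folder, means that unrelated files sitting in the same directory cannot skew the result.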
