If you're looking for a way to scrape data from HTML, Python is the way to go.
"Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping."
I found myself in a situation where I needed to download a lot of different images. Since I was too lazy to do it manually, BeautifulSoup turned out to be very helpful. In my case I only needed the images whose alt and src attributes matched a certain description; I would then use the src to access the files on site/y and save them to a local folder.
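To give an idea of what that filtering looks like, here is a minimal sketch, assuming the old BS3-style import and the placeholder page http://site.com/x:

import re
import urllib
from BeautifulSoup import BeautifulSoup

# Placeholder URL - substitute the page you actually want to scrape.
page = urllib.urlopen("http://site.com/x")
soup = BeautifulSoup(page.read())

# findAll (called here through the soup(...) shortcut) accepts a dict of
# attribute filters, so a compiled regex on "alt" keeps only the images we want.
for img in soup('img', {'alt': re.compile("Flag of")}):
    print img['alt'], "->", img['src']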
If you are using Linux, installing BeautifulSoup should be a straightforward operation. If you are on Windows, however, I recommend installing Cygwin; you can rerun its setup.exe at any time to install new packages.
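If you are not sure whether the install worked, a quick check from Python is enough (a minimal sketch; BeautifulSoup below is the old BS3 package name):

# Succeeds only if the BS3 package can be imported.
try:
    from BeautifulSoup import BeautifulSoup
    print "BeautifulSoup is installed"
except ImportError:
    print "BeautifulSoup is missing - install it first"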
For more info on Beautiful Soup and how to install it, go here.
"Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping."
I've found myself in a situation where I needed to download a lots of different images. Since I was too lazy to do it manually I found beautifulsoup to be very helpful. In my case i needed to download only images with certain alt and src description. I would use that src then to access files on site/y and save them to a local folder.
If you are using linux installing beuatifulsoup should be a straightforward operation. However, if you are on windows I recommend installing cygwin. You can use your setup.exe any time to install new packages.
For more info on soup and how to install it go here.
Here is the script I ended up with:

import urllib
import re
import os
import os.path
from BeautifulSoup import BeautifulSoup

url = "http://site.com/x"
page = urllib.urlopen(url)
soup = BeautifulSoup(page.read())

# Keep only the <img> tags whose alt text matches "Flag of ...".
flags = soup('img', {'alt': re.compile("Flag of")})
j = len(flags)

nameList = [None] * j
tagList = [None] * j
for i in range(j):
    flag_alt = flags[i]['alt'].encode('ascii', 'ignore')
    flag_src = flags[i]['src'].encode('ascii', 'ignore')
    nameList[i] = flag_alt.split("of ")[1]   # "Flag of France" -> "France"
    tagList[i] = flag_src.split("/")[-1]     # last part of the src is the file name

# Download every flag that is not already in the script's folder.
script_dir = os.path.dirname(os.path.abspath(__file__))
for i in range(j):
    imgName = nameList[i].lower().replace(" ", "_") + ".png"
    filename = os.path.join(script_dir, imgName)
    if not os.path.exists(filename):
        urllib.urlretrieve("http://site.com/y/" + tagList[i], filename)

# The script itself lives in the same folder, hence the "- 1".
num_of_files = len([name for name in os.listdir(script_dir)
                    if os.path.isfile(os.path.join(script_dir, name))])
if j == num_of_files - 1:
    print "All the files downloaded: " + str(j)
else:
    print str(j - (num_of_files - 1)) + " not downloaded!"
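To try it, drop the script into an otherwise empty folder, adjust the two placeholder URLs, and run it with Python 2; the images should end up next to the script, and the final count tells you whether anything failed to download.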