Build Your Own Search Engine (Part 1): Implementing a Web Crawler in Python
HTML
What is HTML?
HTML is a markup language for describing web documents (web pages).
- HTML stands for Hyper Text Markup Language
- A markup language is a set of markup tags
- HTML documents are described by HTML tags
- Each HTML tag describes different document content
HTML tags
HTML tags are keywords (tag names) surrounded by angle brackets:
<tagname>content goes here...</tagname>
- HTML tags normally come in pairs
- The first tag in a pair is the start tag, the second tag is the end tag
- The end tag is written like the start tag, but with a forward slash inserted before the tag name
Here are some examples of HTML tags; a minimal document putting them together is shown after the list.
- The <!DOCTYPE html> declaration defines this document to be HTML5
- The text between <html> and </html> describes an HTML document
- The text between <head> and </head> provides information about the document
- The text between <title> and </title> provides a title for the document
- The text between <body> and </body> describes the visible page content
- The text between <h1> and </h1> describes a heading
- The text between <p> and </p> describes a paragraph
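Putting these tags together, a minimal HTML5 document looks like this:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>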
Web Browser
The purpose of a web browser (Chrome, IE, Firefox, Safari) is to read HTML documents and display them.
The browser does not display the HTML tags, but uses them to determine how to display the document.
For example, when we type a URL in the address bar, the web browser sends a request to the web server. The web server processes the request and responds with the HTML document. The web browser then parses the HTML document and renders it, and we see the final website.
BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Use Python to Fetch HTML document
We can use urllib2 to fetch an HTML document.
import urllib2
response = urllib2.urlopen('http://www.baidu.com')
content = response.read()
print content
However, I'd prefer to use another library called requests, which is simpler and more flexible.
import requests
r = requests.get("http://www.baidu.com")
content = r.text
print content
Both snippets do the same thing.
Use BeautifulSoup to Parse a Website
1. Making the Soup
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('index.html'))     # make the soup from a file object
soup = BeautifulSoup("<html>data</html>")    # or from a string
2. Using Tags as members
print soup.head
print soup.head.title
print str(soup.head.title).decode('utf8')
print str(soup.head.title.string).decode('utf8')
soup.head.meta['content']
soup.head.meta.get('content','')
3. Navigating the tree
Here’s an example:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
p = soup.head
# <head><title>The Dormouse's story</title></head>
p = soup.title
# <title>The Dormouse's story</title>
links = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Parent:
p.parent
# <head><title>The Dormouse's story</title></head>
p.parent.name
# u'head'
head = p.parent
Contents:
print head.contents
# [<title>The Dormouse's story</title>]
print head.contents[0]
# <title>The Dormouse's story</title>
nextSibling and previousSibling:
p = soup.p                 # the first <p> tag in <body>
p.nextSibling              # the node right after it (often a whitespace text node)
p.previousSibling          # the node right before it
4. Regex
import re

p = re.compile(r'<a.*?href="(.*?)".*?</a>')
links = p.findall(content)    # all href values found in the raw HTML
Crawler
Here I will show how to write a crawler step by step.
First step: Establish connections
I use urllib2 to establish the connection. There are other alternatives as well, but urllib2 is enough for this exercise.
import urllib2

def get_page(page):
    content = ''
    try:
        req = urllib2.urlopen(page)
    except:
        pass
    else:
        content = req.read()
    return content
Considering that some links can't be reached, I wrap the urlopen call in a try block. I don't handle the exception in any special way; I simply ignore it.
Second step: Store content and fetch URLs
In the first step I fetched the content of a page, so now I need to store it.
import os

def add_page_to_folder(page, content):
    index_filename = 'index.txt'
    folder = 'html'
    filename = valid_filename(page)   # valid_filename (defined in the full code) turns the URL into a safe file name
    index = open(index_filename, 'a')
    index.write(page.encode('ascii', 'ignore') + '\t' + filename + '\n')
    index.close()
    if not os.path.exists(folder):
        os.mkdir(folder)
    f = open(os.path.join(folder, filename), 'w')
    f.write(content)
    f.close()
To get all the links, I use BeautifulSoup plus a regex to match URLs. If you rely on regex alone, badly-formed pages may make you match the wrong thing. BeautifulSoup can be a little slow, though, and lxml may be a better choice.
import urlparse

def get_all_links(content, page):
    links = []
    soup = BeautifulSoup(content)
    for a in soup.findAll('a', {'href': re.compile('^http|^/')}):
        url = a.get('href')
        if url[0] == '/':
            url = urlparse.urljoin(page, url)
        links.append(url)
    return links
Here I consider that the URL may be a relative path, so I join it with the page URL.
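For example, urlparse.urljoin resolves a root-relative path against the current page like this (example.com is just a placeholder):
import urlparse

print urlparse.urljoin('http://www.example.com/a/b.html', '/c.html')
# http://www.example.com/c.html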
Two methods: BFS and DFS
BFS searches along the breadth of the tree. Here is the helper, which merges one list into another by inserting new elements at the front, so that, combined with pop() from the end, the to-crawl list behaves like a queue.
def union_bfs(a, b):
    for ele in b:
        if ele not in a:
            a.insert(0, ele)
In contrast, DFS searches as deep as possible. Here new elements are appended at the end, so the list behaves like a stack.
def union_dfs(a, b):
    for e in b:
        if e not in a:
            a.append(e)
Both helpers merge one list into another; only the insertion position differs, as the sketch below shows.
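As a rough sketch, the crawl loop might tie these pieces together like this (reusing get_page, get_all_links and add_page_to_folder from the steps above; the seed argument and the max_pages limit are assumptions of mine):
def crawl(seed, max_pages):
    tocrawl = [seed]
    crawled = []
    while tocrawl and len(crawled) < max_pages:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_folder(page, content)
            union_bfs(tocrawl, get_all_links(content, page))   # swap in union_dfs for DFS
            crawled.append(page)
    return crawled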
You can get the full code here.
Hash
Time Complexity
An interesting question is how long the program takes to run. A crawler often needs to deal with a large amount of data, so it can take a lot of time. Here's a simple way to measure the runtime.
import time
start_time = time.time()
main()
print "%s seconds" % (time.time() - start_time)
Make it Faster!
So far we have used a list to store the crawled pages:
tocrawl = [seed]
crawled = []                    # pages already crawled, stored in a list
while tocrawl:
    page = tocrawl.pop()
    if page not in crawled:     # check whether page is already in the crawled list
        …                       # fetch and process page
        crawled.append(page)    # add page to the crawled list
The membership test on a list is O(n), which is inefficient. To improve it greatly, we use a hash table instead of a list.
First we put strings into b buckets according to a hash function, and then look up a string only in its own bucket. Thus the search range is reduced to roughly 1/b of the original.
Here's a simple hash function example.
def simple_hash_string(keyword, b):
    if keyword != '':
        return ord(keyword[0]) % b
    else:
        return 0
This function maps a string to a number in [0, b-1]. But it has a big problem: the frequency of each first letter is different, so the buckets are unevenly filled and there is plenty of room for improvement.
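One simple improvement, for example, is to fold every character into the hash instead of only the first one (the function below is just an illustration, not part of the original exercise):
def better_hash_string(keyword, b):
    # spread strings more evenly over the b buckets
    h = 0
    for c in keyword:
        h = (h * 31 + ord(c)) % b
    return h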
Then we need to initialize a hash table.
def make_hashtable(b):
    table = []
    for i in range(b):
        table.append([])
    return table
To find an element in the hash table, we first get its bucket according to the hash function.
def hashtable_get_bucket(table, keyword):
    index = simple_hash_string(keyword, len(table))
    return table[index]
Thus hashtable_get_bucket(table, keyword) returns the bucket that the keyword belongs to. Then we can finish the function that checks whether a keyword is in the hash table.
def hashtable_lookup(table, keyword):
    if keyword in hashtable_get_bucket(table, keyword):
        return True
    else:
        return False
Finally, if the keyword is not in the table, we should add it.
def hashtable_add(table, keyword):
    hashtable_get_bucket(table, keyword).append(keyword)
The lookup is kept outside hashtable_add rather than inside it, because hashtable_lookup is widely used on its own to check whether a keyword is already in the hash table.
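Putting the pieces together, a quick usage sketch:
table = make_hashtable(256)
for word in ['apple', 'banana', 'apple']:
    if not hashtable_lookup(table, word):
        hashtable_add(table, word)
print hashtable_lookup(table, 'banana')   # True
print hashtable_lookup(table, 'cherry')   # False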
BloomFilter
However, considering that our crawler will deal with a large amount of web page content and fetch tens of thousands of URLs, a hash table would take up too much space. So ultimately we use a Bloom filter instead.
Principle
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a bit vector: each cell represents one bit, indexed from 0. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the indices of those hashes to 1.
To reduce the probability of collisions, we use k hash functions rather than one.
Implementation
Let's define the bit array first:
class Bitarray:
    def __init__(self, size):
        """ Create a bit array of a specific size """
        self.size = size
        self.bitarray = bytearray(size / 8)

    def set(self, n):
        """ Sets the nth element of the bitarray """
        index = n / 8
        position = n % 8
        self.bitarray[index] = self.bitarray[index] | 1 << (7 - position)

    def get(self, n):
        """ Gets the nth element of the bitarray """
        index = n / 8
        position = n % 8
        return (self.bitarray[index] & (1 << (7 - position))) > 0
Here we just use bit operations to set a bit to 1. The expression 1 << (7 - position) selects the bit at that position within the byte. We use a Bitarray of the chosen size as the base data structure:
bit_map = Bitarray(size)
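For instance, with a small 16-bit array:
small = Bitarray(16)
small.set(3)
print small.get(3)   # True
print small.get(4)   # False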
The hash functions used in a Bloom filter should be independent and uniformly distributed, and they should also be as fast as possible. We use k hash functions, so we compute each hash value for the keyword. If all of the corresponding bits are 1, the keyword may already exist; if any of them is 0, it is definitely not stored.
def BloomFilter(s):
    h1 = hash1(s)
    h2 = hash2(s)
    h3 = hash3(s)
    h4 = hash4(s)
    if not (bit_map.get(h1) and bit_map.get(h2) and bit_map.get(h3) and bit_map.get(h4)):
        bit_map.set(h1)
        bit_map.set(h2)
        bit_map.set(h3)
        bit_map.set(h4)
        return True
    else:
        return False
The code above is a simple implementation that uses 4 hash functions (hash1 to hash4 each map a string to an index in the bit array). It returns True when the string has not been seen before, recording it as a side effect, and False when it has probably been seen already.
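The four hash functions are not shown here; as a purely illustrative sketch (using md5 with different seeds, an assumption of mine) plus a usage example:
import hashlib

size = 320000
bit_map = Bitarray(size)

def make_hash(seed):
    # hypothetical helper: derive a bit index from md5(seed + string)
    def h(s):
        return int(hashlib.md5(seed + s).hexdigest(), 16) % size
    return h

hash1, hash2, hash3, hash4 = [make_hash(str(i)) for i in range(4)]

print BloomFilter('http://www.baidu.com')   # True:  not seen before, its bits get set
print BloomFilter('http://www.baidu.com')   # False: probably seen already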
Tuning
We can change the hash functions to improve the result. Also, k, the number of hash functions, makes a difference. Generally, the optimal choice is k = ln(2) * m/n, where m is the number of bits in the filter and n is the number of elements stored.
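For example, with the filter size used in the experiment below (m = 320000 bits) and roughly n = 14312 distinct words, this gives k ≈ 0.69 × 320000 / 14312 ≈ 15 hash functions, noticeably more than the 4 used in the sketch above.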
The complete code for exercise 1 is in the attachment.
Check the Accuracy
When you have finished everything, you should design an experiment to check the accuracy of your Bloom filter.
My idea is to compare the number of distinct words in a text counted with an exact structure (here a plain list) against the count given by the Bloom filter. Here's my code.
def merge(s, l):
    for i in l:
        if i not in s:
            s.append(i)

def calc1():
    fo = open('pg1661.txt', 'r')
    reads = []
    for i in range(10000):
        s = fo.readline()
        l = s.split(' ')
        merge(reads, l)
    return len(reads)

def calc2():
    fo = open('pg1661.txt', 'r')
    count = 0
    reads = BloomFilter(32000)     # the BloomFilter class from the full code, built on a 32000-bit array
    for i in range(10000):
        s = fo.readline()
        l = s.split(' ')
        for w in l:
            count += reads.BloomFilter(w)
    return count
To my delight, the two results are 14312 and 11816, about 80% agreement, which I think is a little low. When I increase the Bloom filter's size to 3200000, the two results become equal, but that wastes too much space. In the end, a Bloom filter of size 320000 is a good compromise: it is fast and doesn't waste space.
Concurrent Programming
We can use concurrent programming to reduce the time cost, using the threading and Queue libraries. An example looks like this.
from threading import Thread
from Queue import Queue

q = Queue()
threads = []
for i in range(NUM):                 # start NUM worker threads
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()
    threads.append(t)
for i in range(JOBS):                # put JOBS pieces of work into the queue
    q.put(i)
for i in range(NUM):
    threads[i].join()
To use threading this way, we can also define a thread class:
import threading

class MyThread(threading.Thread):
    def __init__(self, func, args, name=''):
        threading.Thread.__init__(self)
        self.name = name
        self.func = func
        self.args = args

    def getResult(self):
        return self.res

    def run(self):
        print 'starting', self.name
        self.res = apply(self.func, self.args)
        print self.name, 'finished'
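A quick usage sketch of this class (reusing get_page from the crawler section):
t = MyThread(get_page, ('http://www.baidu.com',), 'fetcher')
t.start()
t.join()
content = t.getResult()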
Some commonly used Queue methods are listed below:
Queue.qsize()      # return the size of the queue
Queue.empty()      # return True if the queue is empty, else False
Queue.put(item)    # put item into the queue
Queue.get()        # take an item out of the queue
You should remember Queue is FIFO.
One important problem that bothered me for a long time is when to kill the threads. The easiest method is to count the pages the crawler has fetched. But what if all the pages in the queue have already been crawled? So if the queue is empty and we are past the first page (i.e. the seed URL has already been processed), the thread should also be killed.
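A minimal sketch of such a worker, reusing get_page, add_page_to_folder and get_all_links from earlier (the shared queue q, the crawled set and the 5-second timeout are assumptions of mine):
from Queue import Queue, Empty
from threading import Thread

q = Queue()          # URLs waiting to be crawled
crawled = set()      # pages already fetched

def working():
    while True:
        try:
            page = q.get(timeout=5)    # wait a little in case other threads add new URLs
        except Empty:
            break                      # queue stayed empty, so this thread exits
        if page not in crawled:
            content = get_page(page)
            add_page_to_folder(page, content)
            crawled.add(page)
            for link in get_all_links(content, page):
                if link not in crawled:
                    q.put(link)
        q.task_done()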
When I ran my crawler, a UnicodeEncodeError came up; I think it was caused by irregular pages.
BeautifulSoup is easy to use but rather slow. I recommend lxml for parsing, which is 100 times faster. Even more crudely, you can use regex directly, which is 1000 times faster than BeautifulSoup.