Build Your Own Search Engine (Part 1): Implementing a Web Crawler in Python
HTML
What is HTML?
HTML is a markup language for describing web documents (web pages).
- HTML stands for Hyper Text Markup Language
- A markup language is a set of markup tags
- HTML documents are described by HTML tags
- Each HTML tag describes different document content
HTML tags
HTML tags are keywords (tag names) surrounded by angle brackets:
<tagname>content goes here...</tagname>
- HTML tags normally come in pairs
- The first tag in a pair is the start tag, the second tag is the end tag
- The end tag is written like the start tag, but with a forward slash inserted before the tag name
Here are some examples of HTML tags; a minimal document putting them together is shown after the list.
- The <!DOCTYPE html> declaration defines this document to be HTML5
- The text between <html> and </html> describes an HTML document
- The text between <head> and </head> provides information about the document
- The text between <title> and </title> provides a title for the document
- The text between <body> and </body> describes the visible page content
- The text between <h1> and </h1> describes a heading
- The text between <p> and </p> describes a paragraph
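Putting these tags together, a minimal HTML5 document looks like this:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>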
Web Browser
The purpose of a web browser (Chrome, IE, Firefox, Safari) is to read HTML documents and display them.
The browser does not display the HTML tags, but uses them to determine how to display the document.
For example, when we type a URL in the address bar, the web browser sends a request to the web server. The web server processes the request and responds with the HTML document. The web browser then parses the HTML document and renders it, and we see the final website.
BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Use Python to Fetch HTML document
We can use urllib2 to fetch an HTML document.
import urllib2
response = urllib2.urlopen('http://www.baidu.com')
content = response.read()
print content
However, I'd prefer to use another library called requests, which is simpler and more flexible.
import requests
r = requests.get("http://www.baidu.com")
content = r.text
print content
Both snippets do the same thing.
Use BeautifulSoup to Parse a Website
1. Making the Soup
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('index.html'))     # make the soup from a file object
soup = BeautifulSoup("<html>data</html>")    # or from a string
2. Using Tags as members
print soup.head
print soup.head.title
print str(soup.head.title).decode('utf8')
print str(soup.head.title.string).decode('utf8')
soup.head.meta['content']
soup.head.meta.get('content','')
3. Navigating the tree
Here’s an example:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
p = soup.head
# <head><title>The Dormouse's story</title></head>
p = soup.title
# <title>The Dormouse's story</title>
links = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Parent:
p.parent
# <head><title>The Dormouse's story</title></head>
p.parent.name
# u'head'
head = p.parent
Contents:
print head.contents
# [<title>The Dormouse's story</title>]
print head.contents[0]
# <title>The Dormouse's story</title>
nextSibling and previousSibling:
p = soup.p                 # the first <p> tag in <body>
p.nextSibling              # the node right after it (often a whitespace text node)
p.previousSibling          # the node right before it
4. Regex
import re

p = re.compile(r'<a.*?href="(.*?)".*?</a>')
links = p.findall(content)    # all href values found in the raw HTML
Crawler
Here I will show how to write a crawler step by step.
First step: Establish connections
I use urllib2 to establish the connection. There are other alternatives as well, but urllib2 is enough for this exercise.
import urllib2

def get_page(page):
    content = ''
    try:
        req = urllib2.urlopen(page)
    except:
        pass
    else:
        content = req.read()
    return content
Considering that some links can't be reached, I wrap the urlopen call in a try block. I don't handle the exception in any special way; I simply ignore it.
Second step: Store content and fetch URLs
In the first step I fetched the content of a page, so now I need to store it.
import os

def add_page_to_folder(page, content):
    index_filename = 'index.txt'
    folder = 'html'
    filename = valid_filename(page)   # valid_filename (defined in the full code) turns the URL into a safe file name
    index = open(index_filename, 'a')
    index.write(page.encode('ascii', 'ignore') + '\t' + filename + '\n')
    index.close()
    if not os.path.exists(folder):
        os.mkdir(folder)
    f = open(os.path.join(folder, filename), 'w')
    f.write(content)
    f.close()
To get all the links, I use BeautifulSoup plus a regex to match URLs. If you rely on regex alone, badly-formed pages may make you match the wrong thing. BeautifulSoup can be a little slow, though, and lxml may be a better choice.
import urlparse

def get_all_links(content, page):
    links = []
    soup = BeautifulSoup(content)
    for a in soup.findAll('a', {'href': re.compile('^http|^/')}):
        url = a.get('href')
        if url[0] == '/':
            url = urlparse.urljoin(page, url)
        links.append(url)
    return links
Here I consider that the URL may be a relative path, so I join it with the page URL.
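For example, urlparse.urljoin resolves a root-relative path against the current page like this (example.com is just a placeholder):
import urlparse

print urlparse.urljoin('http://www.example.com/a/b.html', '/c.html')
# http://www.example.com/c.html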
Two methods: BFS and DFS
BFS searches along the breadth of the tree. Here is the helper, which merges one list into another by inserting new elements at the front, so that, combined with pop() from the end, the to-crawl list behaves like a queue.
def union_bfs(a, b):
    for ele in b:
        if ele not in a:
            a.insert(0, ele)
In contrast, DFS searches as deep as possible. Here new elements are appended at the end, so the list behaves like a stack.
def union_dfs(a, b):
    for e in b:
        if e not in a:
            a.append(e)
Both helpers merge one list into another; only the insertion position differs, as the sketch below shows.
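As a rough sketch, the crawl loop might tie these pieces together like this (reusing get_page, get_all_links and add_page_to_folder from the steps above; the seed argument and the max_pages limit are assumptions of mine):
def crawl(seed, max_pages):
    tocrawl = [seed]
    crawled = []
    while tocrawl and len(crawled) < max_pages:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_folder(page, content)
            union_bfs(tocrawl, get_all_links(content, page))   # swap in union_dfs for DFS
            crawled.append(page)
    return crawled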
You can get the full code here.
Hash
Time Complexity
An interesting question is how long the program takes to run. A crawler often needs to deal with a large amount of data, so it can take a lot of time. Here's a simple way to measure the runtime.
import time
start_time = time.time()
main()
print "%s seconds" % (time.time() - start_time)
Make it Faster!
So far we have used a list to store the crawled pages:
tocrawl = [seed]
crawled = []                    # pages already crawled, stored in a list
while tocrawl:
    page = tocrawl.pop()
    if page not in crawled:     # check whether page is already in the crawled list
        …                       # fetch and process page
        crawled.append(page)    # add page to the crawled list
The membership test on a list is O(n), which is inefficient. To improve it greatly, we use a hash table instead of a list.
First we put strings into b buckets according to a hash function, and then look up a string only in its own bucket. Thus the search range is reduced to roughly 1/b of the original.
Here's a simple hash function example.
def simple_hash_string(keyword, b):
    if keyword != '':
        return ord(keyword[0]) % b
    else:
        return 0
This function maps a string to a number in [0, b-1]. But it has a big problem: the frequency of each first letter is different, so the buckets are unevenly filled and there is plenty of room for improvement.
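One simple improvement, for example, is to fold every character into the hash instead of only the first one (the function below is just an illustration, not part of the original exercise):
def better_hash_string(keyword, b):
    # spread strings more evenly over the b buckets
    h = 0
    for c in keyword:
        h = (h * 31 + ord(c)) % b
    return h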
Then we need to initialize a hash table.
def make_hashtable(b):
    table = []
    for i in range(b):
        table.append([])
    return table
To find an element in the hash table, we first get its bucket according to the hash function.
def hashtable_get_bucket(table, keyword):
    index = simple_hash_string(keyword, len(table))
    return table[index]
Thus hashtable_get_bucket(table, keyword) returns the bucket that the keyword belongs to. Then we can finish the function that checks whether a keyword is in the hash table.
def hashtable_lookup(table, keyword):
    if keyword in hashtable_get_bucket(table, keyword):
        return True
    else:
        return False
Finally, if the keyword is not in the table, we should add it.
def hashtable_add(table, keyword):
    hashtable_get_bucket(table, keyword).append(keyword)
The lookup is kept outside hashtable_add rather than inside it, because hashtable_lookup is widely used on its own to check whether a keyword is already in the hash table.
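Putting the pieces together, a quick usage sketch:
table = make_hashtable(256)
for word in ['apple', 'banana', 'apple']:
    if not hashtable_lookup(table, word):
        hashtable_add(table, word)
print hashtable_lookup(table, 'banana')   # True
print hashtable_lookup(table, 'cherry')   # False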
BloomFilter
However, considering that our crawler will deal with a large amount of web page content and fetch tens of thousands of URLs, a hash table would take up too much space. So ultimately we use a Bloom filter instead.
Principle
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a bit vector: each cell represents one bit, indexed from 0. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the indices of those hashes to 1.
To reduce the probability of collisions, we use k hash functions rather than one.
Implementation
Let's define the bit array first:
class Bitarray:
    def __init__(self, size):
        """ Create a bit array of a specific size """
        self.size = size
        self.bitarray = bytearray(size / 8)

    def set(self, n):
        """ Sets the nth element of the bitarray """
        index = n / 8
        position = n % 8
        self.bitarray[index] = self.bitarray[index] | 1 << (7 - position)

    def get(self, n):
        """ Gets the nth element of the bitarray """
        index = n / 8
        position = n % 8
        return (self.bitarray[index] & (1 << (7 - position))) > 0
Here we just use bit operations to set a bit to 1. The expression 1 << (7 - position) selects the bit at that position within the byte. We use a Bitarray of the chosen size as the base data structure:
bit_map = Bitarray(size)
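For instance, with a small 16-bit array:
small = Bitarray(16)
small.set(3)
print small.get(3)   # True
print small.get(4)   # False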
The hash functions used in a Bloom filter should be independent and uniformly distributed, and they should also be as fast as possible. We use k hash functions, so we compute each hash value for the keyword. If all of the corresponding bits are 1, the keyword may already exist; if any of them is 0, it is definitely not stored.
def BloomFilter(s):
    h1 = hash1(s)
    h2 = hash2(s)
    h3 = hash3(s)
    h4 = hash4(s)
    if not (bit_map.get(h1) and bit_map.get(h2) and bit_map.get(h3) and bit_map.get(h4)):
        bit_map.set(h1)
        bit_map.set(h2)
        bit_map.set(h3)
        bit_map.set(h4)
        return True
    else:
        return False
The code above is a simple implementation that uses 4 hash functions (hash1 to hash4 each map a string to an index in the bit array). It returns True when the string has not been seen before, recording it as a side effect, and False when it has probably been seen already.
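The four hash functions are not shown here; as a purely illustrative sketch (using md5 with different seeds, an assumption of mine) plus a usage example:
import hashlib

size = 320000
bit_map = Bitarray(size)

def make_hash(seed):
    # hypothetical helper: derive a bit index from md5(seed + string)
    def h(s):
        return int(hashlib.md5(seed + s).hexdigest(), 16) % size
    return h

hash1, hash2, hash3, hash4 = [make_hash(str(i)) for i in range(4)]

print BloomFilter('http://www.baidu.com')   # True:  not seen before, its bits get set
print BloomFilter('http://www.baidu.com')   # False: probably seen already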
Tuning
We can change the hash functions to improve the result. Also, k, the number of hash functions, makes a difference. Generally, the optimal choice is k = ln(2) * m/n, where m is the number of bits in the filter and n is the number of elements stored.
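For example, with the filter size used in the experiment below (m = 320000 bits) and roughly n = 14312 distinct words, this gives k ≈ 0.69 × 320000 / 14312 ≈ 15 hash functions, noticeably more than the 4 used in the sketch above.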
The complete code for exercise 1 is in the attachment.
Check the Accuracy
When you have finished everything, you should design an experiment to check the accuracy of your Bloom filter.
My idea is to compare the number of distinct words in a text counted with an exact structure (here a plain list) against the count given by the Bloom filter. Here's my code.
def merge(s, l):
    for i in l:
        if i not in s:
            s.append(i)

def calc1():
    fo = open('pg1661.txt', 'r')
    reads = []
    for i in range(10000):
        s = fo.readline()
        l = s.split(' ')
        merge(reads, l)
    return len(reads)

def calc2():
    fo = open('pg1661.txt', 'r')
    count = 0
    reads = BloomFilter(32000)     # the BloomFilter class from the full code, built on a 32000-bit array
    for i in range(10000):
        s = fo.readline()
        l = s.split(' ')
        for w in l:
            count += reads.BloomFilter(w)
    return count
To my delight, the two results are 14312 and 11816, about 80% agreement, which I think is a little low. When I increase the Bloom filter's size to 3200000, the two results become equal, but that wastes too much space. In the end, a Bloom filter of size 320000 is a good compromise: it is fast and doesn't waste space.
Concurrent Programming
We can use concurrent programming to reduce the time cost, using the threading and Queue libraries. An example looks like this.
from threading import Thread
from Queue import Queue

q = Queue()
threads = []
for i in range(NUM):                 # start NUM worker threads
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()
    threads.append(t)
for i in range(JOBS):                # put JOBS pieces of work into the queue
    q.put(i)
for i in range(NUM):
    threads[i].join()
To use threading this way, we can also define a thread class:
import threading

class MyThread(threading.Thread):
    def __init__(self, func, args, name=''):
        threading.Thread.__init__(self)
        self.name = name
        self.func = func
        self.args = args

    def getResult(self):
        return self.res

    def run(self):
        print 'starting', self.name
        self.res = apply(self.func, self.args)
        print self.name, 'finished'
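A quick usage sketch of this class (reusing get_page from the crawler section):
t = MyThread(get_page, ('http://www.baidu.com',), 'fetcher')
t.start()
t.join()
content = t.getResult()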
Some commonly used Queue methods are listed below:
Queue.qsize()      # return the size of the queue
Queue.empty()      # return True if the queue is empty, else False
Queue.put(item)    # put item into the queue
Queue.get()        # take an item out of the queue
You should remember Queue is FIFO.
One important problem that bothered me for a long time is when to kill the threads. The easiest method is to count the pages the crawler has fetched. But what if all the pages in the queue have already been crawled? So if the queue is empty and we are past the first page (i.e. the seed URL has already been processed), the thread should also be killed.
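A minimal sketch of such a worker, reusing get_page, add_page_to_folder and get_all_links from earlier (the shared queue q, the crawled set and the 5-second timeout are assumptions of mine):
from Queue import Queue, Empty
from threading import Thread

q = Queue()          # URLs waiting to be crawled
crawled = set()      # pages already fetched

def working():
    while True:
        try:
            page = q.get(timeout=5)    # wait a little in case other threads add new URLs
        except Empty:
            break                      # queue stayed empty, so this thread exits
        if page not in crawled:
            content = get_page(page)
            add_page_to_folder(page, content)
            crawled.add(page)
            for link in get_all_links(content, page):
                if link not in crawled:
                    q.put(link)
        q.task_done()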
When I ran my crawler, a UnicodeEncodeError came up; I think it was caused by irregular pages.
BeautifulSoup is easy to use but rather slow. I recommend lxml for parsing, which is 100 times faster. Even more crudely, you can use regex directly, which is 1000 times faster than BeautifulSoup.