Web Scraping With Golang




I have previously written a post on scraping Google with Python. As I am starting to write more Golang, I thought I should write the same tutorial using Golang to scrape Google. Why not scrape Google search results using Google’s home-grown programming language?

Imports & Setup

```go
package main

import (
	"fmt"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

type GoogleResult struct {
	ResultRank  int
	ResultURL   string
	ResultTitle string
	ResultDesc  string
}
```

We can build a simple struct to hold an individual search result. When writing our final function, we can then set the return value to be a slice of our GoogleResult struct. This will make it very easy to manipulate our search results once we have scraped them from Google.
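As a minimal, self-contained sketch of that pattern, the snippet below builds up a slice of results by hand (the ResultURL and ResultDesc field names are assumed names for the link and description fields; the example values are made up):

```go
package main

import "fmt"

// GoogleResult holds one organic search result, as described above.
// ResultURL and ResultDesc are assumed field names.
type GoogleResult struct {
	ResultRank  int
	ResultURL   string
	ResultTitle string
	ResultDesc  string
}

func main() {
	// Start with an empty slice and append results as they are parsed.
	results := []GoogleResult{}
	results = append(results, GoogleResult{
		ResultRank:  1,
		ResultURL:   "https://golang.org/",
		ResultTitle: "The Go Programming Language",
		ResultDesc:  "An example description",
	})
	fmt.Println(len(results), results[0].ResultTitle)
}
```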

Making A Request To Google

To scrape Google results we have to make a request to Google using a URL containing our search parameters. Google allows you to pass a number of different parameters to a search query. In this example we are going to write a function that generates a search URL with our desired parameters.

But first we are going to define a “map” of supported Google geo locations. In this post we are only going to support a few major geographical locations, but Google operates in over 100 different geographical locations.

```go
// A few example geo-specific Google domains; Google operates many more.
var googleDomains = map[string]string{
	"com": "https://www.google.com/search?q=",
	"uk":  "https://www.google.co.uk/search?q=",
	"ca":  "https://www.google.ca/search?q=",
	"de":  "https://www.google.de/search?q=",
}

func buildGoogleUrl(searchTerm string, countryCode string, languageCode string) string {
	searchTerm = strings.TrimSpace(searchTerm)
	searchTerm = strings.Replace(searchTerm, " ", "+", -1)
	if googleBase, found := googleDomains[countryCode]; found {
		return fmt.Sprintf("%s%s&num=100&hl=%s", googleBase, searchTerm, languageCode)
	}
	return fmt.Sprintf("%s%s&num=100&hl=%s", googleDomains["com"], searchTerm, languageCode)
}
```

We then write a function that builds a Google search URL. The function takes three arguments, all of type string, and returns a URL, also as a string. We first trim the search term to remove any leading or trailing white-space. We then replace any of the remaining spaces with ‘+’; the -1 in this line of code means that we replace every remaining instance of white-space with a plus.

We then look up the country code passed as an argument against the map we defined earlier. If the countryCode is found in our map, we use the respective URL from the map; otherwise we fall back to the default ‘.com’ Google site. We then use the fmt package’s Sprintf function to format a string made up of our base URL, our search term and our language code. We don’t check the validity of the language code, which is something we might want to do if we were writing a more fully featured scraper.

```go
func googleResultParser(response *http.Response) ([]GoogleResult, error) {
	doc, err := goquery.NewDocumentFromResponse(response)
	if err != nil {
		return nil, err
	}
	results := []GoogleResult{}
	sel := doc.Find("div.g")
	rank := 1
	for i := range sel.Nodes {
		item := sel.Eq(i)
		linkTag := item.Find("a")
		link, _ := linkTag.Attr("href")
		// Organic result titles and descriptions, using the classes
		// Google served at the time of writing.
		titleTag := item.Find("h3.r")
		descTag := item.Find("span.st")
		title := titleTag.Text()
		desc := descTag.Text()
		if link != "" && link != "#" {
			result := GoogleResult{
				rank,
				link,
				title,
				desc,
			}
			results = append(results, result)
			rank += 1
		}
	}
	return results, nil
}
```

We generate a goquery document from our response, and if we encounter any errors we simply return the error and a nil value. We then create an empty slice of Google results which we will eventually append results to. On a Google results page, each organic result can be found in a ‘div’ block with the class ‘g’. So we can simply use the jQuery-style selector “div.g” to pick out all of the organic links.

We then loop through each of these found ‘div’ tags, finding the link and its href attribute, as well as extracting the title and meta-description information. Provided the link isn’t an empty string or a navigational reference, we then create a GoogleResult struct holding our information. This can then be appended to the slice of structs which we defined earlier. Finally, we increment the rank so we can tell the order in which the results appeared on the page.
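The parser above expects an *http.Response, which we still need to obtain. A hedged sketch of how the request might be prepared (the buildRequest helper and the User-Agent value are my own, not from the original post) is to build a GET request with a browser-like User-Agent, since Google is more likely to serve the normal HTML results page to something that looks like a browser:

```go
package main

import (
	"fmt"
	"net/http"
)

// buildRequest is a hypothetical helper: it prepares a GET request for a
// search URL and sets a browser-like User-Agent header on it.
func buildRequest(searchUrl string) (*http.Request, error) {
	req, err := http.NewRequest("GET", searchUrl, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent",
		"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
	return req, nil
}

func main() {
	req, err := buildRequest("https://www.google.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Host)
}
```

The response from `(&http.Client{}).Do(req)` could then be handed straight to googleResultParser.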

Wrapping It All Up