Web Scraping Golang

Lightning Fast and Elegant Scraping Framework for Gophers. Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving. To scrape Google results we have to make a request to Google using a URL containing our search parameters. For instance Google allows you to pass a number of different parameters to a search query. In this particular example we are going to write a function that will generate us a search URL with our desired parameters.

I have previously written a post on scraping Google with Python. As I am starting to write more Golang, I thought I should write the same tutorial using Golang to scrape Google. Why not scrape Google search results using Google’s home grown programming language.

Imports & Setup

</div><table><tbody><tr><td><div><div>2</div><div>4</div><div>6</div><div>8</div></div></td><td><div><div><span>'fmt'</span></div><div><span>'net/http'</span></div><div><span>)</span></div></div></td></tr></tbody></table><p>This example will only being using one external dependency. While it is possible to parse HTML using Go’s standard library, this involves writing a lot of code. So instead we are going to be using the very popular Golang library, Goquery which supports JQuery style selection of HTML elements.</p><h2>Defining What To Return</h2><p>We can get a variety of different information from Google, but we typically want to return a result’s position, URL, title and description. In Golang it makes sense to create a struct representing the data we want to be gathered by our scraper.</p><div><textarea wrap='soft' readonly='>type GoogleResult struct{ ResultRank int ResultURL string ResultTitle string ResultDesc string}

ResultRank int

ResultTitle string

}

We can build a simple struct which will hold an individual search result, when writing our final function we can then set the return value to be a slice of our GoogleResult struct. This will make it very easy for us to manipulate our search results once we have scraped them from Google.

Making A Request To Google

To scrape Google results we have to make a request to Google using a URL containing our search parameters. For instance Google allows you to pass a number of different parameters to a search query. In this particular example we are going to write a function that will generate us a search URL with our desired parameters.

But first we are going to define a “map” of supported Google geo locations. In this post we are only going to support a few major geographical locations, but Google operates in over 100 different geographical locations.

<table><tbody><tr><td><div><div>2</div><div>4</div><div>6</div></div></td><td><div><div><span>'com'</span><span>:</span><span>'https://www.google.com/search?q='</span><span>,</span></div><div><span>'ru'</span><span>:</span><span>'https://www.google.ru/search?q='</span><span>,</span></div><div><span>}</span></div></div></td></tr></tbody></table><p>This will allow pass a two letter country code to our scraping function and scrape results from that particular version of Google. Using the different base domains in combination with a language code allows us to scrape results as they appear in the country in question.</p><h3>What Is Web Scraping</h3><div><span>Building a Google Search URL</span></div><div><textarea wrap='soft' readonly='>func buildGoogleUrl(searchTerm string, countryCode string, languageCode string) string { searchTerm = strings.Trim(searchTerm, ' ') searchTerm = strings.Replace(searchTerm, ' ', '+', -1) if googleBase, found := googleDomains[countryCode]; found { return fmt.Sprintf('%s%s&num=100&hl=%s', googleBase, searchTerm, languageCode) } else { return fmt.Sprintf('%s%s&num=100&hl=%s', googleDomains['com'], searchTerm, languageCode) }}

func buildGoogleUrl(searchTerm string,countryCode string,languageCode string)string{

searchTerm=strings.Replace(searchTerm,' ','+',-1)

ifgoogleBase,found:=googleDomains[countryCode];found{

returnfmt.Sprintf('%s%s&num=100&hl=%s',googleBase,searchTerm,languageCode)

returnfmt.Sprintf('%s%s&num=100&hl=%s',googleDomains['com'],searchTerm,languageCode)

}

We then write a function that allows us to build a Google search URL. The function takes in three arguments, all of the string type and returns a URL also a string. We first trim the search term to remove any trailing or proceeding white-space. We then replace any of the remaining spaces with ‘+’, the -1 in this line of code means that we replace every-single remaining instance of white-space with a plus.

We then look up the country code passed as an argument against the map we defined earlier. If the countryCode is found in our map, we use the respective URL from the map, otherwise we use the default ‘.com’ Google site. We then use the format packages “Sprintf” function to format a string made up of our base URL, our search term and language code. We don’t check the validity of the language code, which is something we might want to do if we were writing a more fully featured scraper.

</div><table><tbody><tr><td><div><div>2</div><div>4</div><div>6</div><div>8</div><div>10</div><div>12</div><div>14</div></div></td><td><div><div><span>func</span><span>googleRequest</span><span>(</span><span>searchURL </span><span>string</span><span>)</span><span>(</span><span>*</span><span>http</span><span>.</span><span>Response</span><span>,</span><span>error</span><span>)</span><span>{</span></div><div><span>baseClient</span><span>:</span><span>=</span><span>&</span><span>http</span><span>.</span><span>Client</span><span>{</span><span>}</span></div><div><span>req</span><span>,</span><span>_</span><span>:</span><span>=</span><span>http</span><span>.</span><span>NewRequest</span><span>(</span><span>'GET'</span><span>,</span><span>searchURL</span><span>,</span><span>nil</span><span>)</span></div><div><span>req</span><span>.</span><span>Header</span><span>.</span><span>Set</span><span>(</span><span>'User-Agent'</span><span>,</span><span>'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'</span><span>)</span></div><div><span>res</span><span>,</span><span>err</span><span>:</span><span>=</span><span>baseClient</span><span>.</span><span>Do</span><span>(</span><span>req</span><span>)</span></div><div><span>if</span><span>err</span><span>!=</span><span>nil</span><span>{</span></div><div><span>}</span><span>else</span><span>{</span></div><div><span>}</span></div></div></td></tr></tbody></table><p>We can now write a function to make a request. Go has a very easy to use and power “net/http” library which makes it relatively easy to make HTTP requests. We first get a client to make our request with. We then start building a new HTTP request which will be eventually executed using our client. This allows us to set custom headers to be sent with our request. In this instance we our replicating the User-Agent header of a real browser.</p><p>We then execute this request, with the client’s Do method returning us a response and error. If something went wrong with the request we return a nil value and the error. Otherwise we simply return the response object and a nil value to show that we did not encounter an error.</p><h2>Parsing the Result</h2><p>Now we move onto parsing the result of request. Compared with Python the options when it comes to HTML parsing libraries is not as robust, with nothing coming close to the ease of use of BeautifulSoup. In this example, we are going to use the very popular Goquery package which uses JQuery style selectors to allow users to extract data from HTML documents.</p><div><textarea wrap='soft' readonly='>func googleResultParser(response *http.Response) ([]GoogleResult, error){ doc, err := goquery.NewDocumentFromResponse(response) if err != nil { return nil, err } results := []GoogleResult{} sel := doc.Find('div.g') rank := 1 for i := range sel.Nodes { item := sel.Eq(i) linkTag := item.Find('a') link, _ := linkTag.Attr('href') titleTag := item.Find('h3.r') descTag := item.Find('span.st') desc := descTag.Text() title := titleTag.Text() link = strings.Trim(link, ' ') if link != ' && link != '#'{ result := GoogleResult{ rank, link, title, desc, } results = append(results, result) rank += 1 } } return results, err}

func googleResultParser(response *http.Response)([]GoogleResult,error){

doc,err:=goquery.NewDocumentFromResponse(response)

returnnil,err

results:=[]GoogleResult{}

rank:=1

item:=sel.Eq(i)

link,_:=linkTag.Attr('href')

descTag:=item.Find('span.st')

title:=titleTag.Text()

iflink!='&&link!='#'{

rank,

title,

}

rank+=1

}

We generate a goquery document from our response, and if we encounter any errors we simply return the error and a nil value object. We then create an empty slice of Google results which we will eventually append results to. On a Google results page, each organic result can be found in ‘div’ block with the class of ‘g’. So we can simply use the JQuery selector “div.g” to pick out all of the organic links.

We then loop through each of these found ‘div’ tags finding the link and it’s href attribute, as well as extracting the title and meta description information. Providing the link isn’t an empty string or a navigational reference, we then create an GoogleResult struct holding our information. This can then be appended to the slice of structs which we defined earlier. Finally, we increment the rank so we can tell the order in which the results appeared on the page.

Wrapping It All Up

</div><table><tbody><tr><td><div><div>2</div><div>4</div><div>6</div><div>8</div><div>10</div><div>12</div><div>14</div><div>16</div><div>18</div><div>20</div></div></td><td><div><div><span>'./googlescraper'</span></div><div><span>'time'</span></div><div><span>var</span><span>keywords</span><span>=</span><span>[</span><span>]</span><span>string</span><span>{</span><span>'edmund martin'</span><span>,</span><span>'python programming'</span><span>,</span><span>'web scraping'</span><span>}</span></div><div><span>res</span><span>,</span><span>_</span><span>:</span><span>=</span><span>googlescraper</span><span>.</span><span>GoogleScrape</span><span>(</span><span>keyword</span><span>,</span><span>'uk'</span><span>,</span><span>'en'</span><span>)</span></div><div><span>fmt</span><span>.</span><span>Println</span><span>(</span><span>keyword</span><span>)</span></div><div><span>fmt</span><span>.</span><span>Println</span><span>(</span><span>item</span><span>)</span></div><div><span>time</span><span>.</span><span>Sleep</span><span>(</span><span>time</span><span>.</span><span>Second *</span><span>30</span><span>)</span></div><div><span>}</span></div></div></td></tr></tbody></table><p>The above program makes use of our GoogleScraper function by working through a list of keywords and scraping search results. After each scrape we are waiting a total of 30 seconds, this should help us avoid being banned. Should we want to scrape a larger set of keywords, we would want to randomise our User-Agent and change up the proxy we were using in each request. Otherwise we are very likely to run into a Google captcha which would prevent us from gathering any results.</p><p>The full Google scraping script can be found here. Feel free to play with it and think about some of the additional functionality that could be added. You might for instance want to scrape the first few pages of Google, or pass in a custom number of results to be returned by the script.</p><p>If you’re here, you probably already know what web scraping is. But on the off chance that you just happened to stumble upon this article, let’s start with a quick refresher on web scraping, and then we’ll move on to goquery.</p><h2>Web Scraping – a quick introduction</h2><p>Web Scraping is the automated method of extracting human-readable data output from a website. The specific data is gathered and copied into a central local database for later retrieval or analysis. There is a built-in library in the Go language for scraping HTML web pages, but often there are some methods that are used by websites to prevent web scraping – because it could potentially cause a denial-of-service, incur bandwidth costs to yourself or the website provider, overload log files, or otherwise stress computing resources.</p><p>However, there are web scraping techniques like DOM parsing, computer vision and NLP to simulate human browsing on web page content.</p><p>GoQuery is a library created by Martin Angers and brings a syntax and a set of features similar to jQuery to the Go language.</p><p>jQuery is a fast, small, and feature-rich JavaScript library. It makes things like HTML document traversal and manipulation, event handling, animation, and Ajax much simpler with an easy-to-use API that works across a multitude of browsers.</p><cite> – jquery</cite><p>GoQuery makes it easier to parse HTML websites than the default <strong>net/html</strong> package, using DOM (Document Object Model) parsing.</p><h2>Installing goquery</h2><p>Let’s download the package using “<code>go get</code>“.</p><p>A concise manual can be brought up by using the “<code>go doc goquery</code>” command.</p><h3>Web Scraping Golang Yang</h3><h2>GoLang Web Scraping using goquery</h2><h3>Web Scraping With Python</h3><p>Create a new .go document in your preferred IDE or text editor. Mine’s titled “goquery_program.go”, and you may choose to do the same:</p><p>We’ll begin by importing json and goquery, along with ‘log‘ to log any errors. We create a <strong>struct</strong> called Article with the <em>Title, URL, and Category </em>as metadata of the article.</p><p>Within the function <code>main()</code>, dispatch a GET client request to the URL journaldev.com for scraping the html.</p><p>We have already fetched our full html source code from the website. We can dump it to our terminal using the “os” package.</p><p>This will output the whole html file along with all tags in the terminal. I’m working on Linux Ubuntu 20.04, so the output display may vary with system.</p><h3>Golang Web Frameworks</h3><p>It also gave a secondary print statement along with a notification that the page was optimized by LiteSpeed Cache: </p><p>Number of bytes copied to STDOUT: 151402</p><p>Now, let’s store this response in a <strong>reader</strong> file using goquery:</p><p>Now we need to use the Find() function, which takes in a tag, and inputs that as an argument into Each(). The Each function is typically used with an argument i <strong>int</strong>, and the selection for the specified tag. On clicking “inspect” in the JournalDev website, I saw that my content was in <p> tags. So I defined my Find with only the name of the tag:</p><ul><li>The “fmt” library has been used to print the text.</li><li>The “next” was just to check if the output was being received(like, for debugging) but I think it looks good with the final output.</li><li>The “%d” and “%s” are string format specifiers for Printf.</li></ul><h2>Web Scraping Example Output</h2><p>The best thing about coding is the satisfaction when your code outputs exactly what you need, and I think this was to my utmost satisfaction:</p><h3>Golang Web Scraping Xpath</h3><p>I tried to keep this article as generalised as possible when dealing with websites. This method should work for you no matter what website you’re trying to parse !</p><p>With that, I will leave you…until next time.</p><h2>References</h2><br><br><a href='https://netlify.mix-goapp.com/my-pantone-for-mac.html#Evvda=VARFFwcEXFdWFwdTB1EAUV8JB10YE1BVHEcASgNGCl9XHlQKCANbVRhTAlcbURlGQQ5XF1UJHlpWV0gLThpWHwIFAVRIVxkAGAddS2IyGQcDVwRLWgFBW15SGhYDRhMdHEFWHRAREAFVRwAHZA==' target='_blank'><img src='https://cdn-ak.f.st-hatena.com/images/fotolife/r/ruriatunifoefec/20200910/20200910011333.png' style='cursor:pointer;display:block;margin-left:auto;margin-right:auto;'></a><br><br></p>
			</div>
	<footer class="entry-meta">
		
			</footer>
</article>
					<nav role="navigation" id="nav-below" class="site-navigation post-navigation">
		<h1 class="assistive-text">Post navigation</h1>
	
		<div class="nav-previous"><a href='/deezer-ariana-grande'>Deezer Ariana Grande</a></div>		<div class="nav-next"><a href='/cleverbridge-mackeeper'>Cleverbridge Mackeeper</a></div>
	
	</nav>
	
				
			
			</div>
		</div>
		<div id="secondary" class="widget-area" role="complementary">
						<aside id="search-3" class="widget widget_search">	<form method="get" id="searchform" action="#" role="search">
		<label for="s" class="assistive-text">Search</label>
		<input type="text" class="field" name="s" value="" id="s" placeholder="Search …" />
		<input type="submit" class="submit" name="submit" id="searchsubmit" value="Search" />
	</form>
</aside>		<aside id="recent-posts-5" class="widget widget_recent_entries">		<h1 class="widget-title">MOST POPULAR ARTICLES</h1>		<ul>
					<li><a href='/twitter-to-mp4'>: Twitter To Mp4</a></li>
<li><a href='/mac-os-x-100'>: Mac Os X 10.0</a></li>
<li><a href='/the-memoir-project'>: The Memoir Project</a></li>
<li><a href='/evernote-program'>: Evernote Program</a></li>
<li><a href='/use-of-ms-word'>: Use Of Ms Word</a></li>
<li><a href='/ben-shapiro-twitter'>: Ben Shapiro Twitter</a></li>
<li><a href='/the-markdown-guide'>: The Markdown Guide</a></li>
				</ul>
		</aside>				</div>

	</div>
	<div id="colophon-wrap">
	<footer id="colophon" class="site-footer" role="contentinfo">
		<div class="site-info">
						 
<a href="/">Woodload417 </a> | 2021
	
		</div>
	</footer>
	</div>
</div>
</body>
</html>