How to use Google Sheets to scrape online data

Finding the data you need can be a challenge, and when it comes to visualizing large quantities of it, you’ll need the help of additional tools. Google Sheets has everything you need to turn large amounts of data into a suitable format.

We show you how to scrape data from the internet using three methods. We break down how each works and when you should use it. The best thing about them? Because they all run in Google Sheets, you can scrape data from anywhere with just a budget Chromebook.

The source code for Android Police’s web page

What is data scraping?

Data scraping, in this sense, is the process of extracting data from a website and displaying it in a human-readable output.

A successful data scrape saves hours of work by collating information scattered across one or multiple web pages and displaying it in a format that a human can quickly read. While the term in its most general sense can refer to any program-to-program scrape, we cover the process of scraping data from a website into Google Sheets.

A Wikipedia list of best-selling books

When should I scrape data?

Data scraping is used when an established data viewing method is unavailable. As the process relies on HTML and XML tags, most data from websites can be scraped with the correct formula.

For example, data scraping is the easiest method for exporting a table on Wikipedia for easy searching and ordering (as we’ll do later in this guide).

scraped data in Google Sheets

How does data scraping work?

There are three methods to scrape data, and you should choose one based on the complexity and type of the data being scraped. These are HTML, XML, and RSS (with no Python needed).

Each method involves a different formula but follows the same fundamental rules. Point the formula toward the data you want to scrape with the appropriate tags, and it scrapes the data and places it into your table. The skill is identifying the tags you need and compensating for each website’s source code.

XML scraped data

What are tags?

If you use Google Chrome or most desktop browsers, you can view a webpage’s source code by right-clicking on the page and selecting View page source from the drop-down menu. This opens a separate tab showing the website’s HTML source code. Don’t be alarmed if this seems overwhelming. All you need to scrape data successfully is to identify a few tags.

The HTML source code for Android Police’s homepage

An XML example

Tags come as pairs and look like this in the source code:

Anything placed between the tags is displayed as specified by the chosen tags. So in the example above, the text between these tags is formatted as a list. Tags can be placed within tags to specify further details about how the text is displayed.
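To make the idea of nested tag pairs concrete, here is a minimal Python sketch. Python isn’t needed for anything in Google Sheets; it only illustrates the concept. The markup in SAMPLE and the TagTracer class are hypothetical, invented for this illustration, and the sketch assumes tidy markup in which every tag is closed in order.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration; it is not taken from any real page.
SAMPLE = "<ul><li>First book</li><li>Second <i>book</i></li></ul>"


class TagTracer(HTMLParser):
    """Record each piece of text together with the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tags opened but not yet closed
        self.found = []      # (enclosing tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed markup where tags close in reverse order.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if data.strip():
            self.found.append(("/".join(self.open_tags), data.strip()))


tracer = TagTracer()
tracer.feed(SAMPLE)
for path, text in tracer.found:
    print(path, "->", text)
```

Each printed line pairs a piece of text with the chain of tags wrapped around it, which is the same information you read off by eye when you look through a page’s source code.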

Depending on the method you use, you’ll look out for different tags.

What data can I scrape?

The short answer is pretty much anything. Scraping from tables and lists is the easiest, but you can scrape anything corresponding to a particular tag with the right know-how. It’s best to pick a method after you identify your data. There’s no point messing with a complicated XML formula for a simple HTML list.

What data can I scrape with the HTML method?

The HTML method can scrape lists and tables. Check the page’s source code, and search for the data you want to scrape. If it’s between

What data can I scrape with the XML method?

Instead of clicking View page source, click Inspect from the drop-down menu. This opens the browser’s developer tools, which display the page’s source as a tree of nested elements.

Scraping data with the XML method involves finding the XPath. This is more precise than the HTML method, as you can search for a specific spot in the source code. Use the XML method if you’re scraping data that isn’t in a list or table format or want to scrape a part of a table.

What data can I scrape with the RSS method?

This method is used for scraping RSS feeds. It’s a great way to create your own tool for scraping news, job listings, or regularly updated data.

How to scrape data using Google Sheets

Now that you have a basic understanding of scraping data, you can try it in action.

How to scrape data using the HTML method

The HTML method requires a straightforward formula:

=IMPORTHTML("URL", "element", location)

We show you how to scrape data from this Wikipedia page of best-selling books. As you can see from the page, there are multiple tables here. We’ll scrape data from the second table, which includes books that have sold between 50 million and 100 million copies.

We used the Inspect tool rather than View Source. For finding HTML tags, both methods work, but Inspect has the benefit of highlighting corresponding sections on the page.

By inspecting the source code, we see that this is a table, not a list. So we use “table” for the element component. It is the second table on the page, so we use “2” as the location. The resultant formula is:

=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_best-selling_books", "table", 2)

That’s it! Now you can organize the data as you wish within Google Sheets. However, you may run into problems. Here are some common problems and their solutions:
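For readers curious about what "table" and the location number select, the following Python sketch mimics the idea on a small, made-up page. It is not how Google Sheets implements IMPORTHTML; the PAGE markup, the TableCollector class, and the import_table helper are all hypothetical, and the sketch ignores complications such as nested tables.

```python
from html.parser import HTMLParser

# Hypothetical page with two tables; the figures are made up for illustration.
PAGE = """
<table><tr><th>Book</th></tr><tr><td>Alpha</td></tr></table>
<table>
  <tr><th>Book</th><th>Sales</th></tr>
  <tr><td>Beta</td><td>80 million</td></tr>
  <tr><td>Gamma</td><td>65 million</td></tr>
</table>
"""


class TableCollector(HTMLParser):
    """Collect every table on the page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            # Tidy up the finished cell's text.
            self.tables[-1][-1][-1] = self.tables[-1][-1][-1].strip()
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.tables[-1][-1][-1] += data


def import_table(html, index):
    """Return the table at a 1-based index, like the formula's location."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]


rows = import_table(PAGE, 2)
for row in rows:
    print(row)
```

Asking for location 2 skips the first table entirely and returns the header row plus two data rows of the second one, which mirrors why the formula above uses 2 for the second table on the Wikipedia page.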

How to scrape data using the XML method

If the HTML method doesn’t work or the scraped data isn’t precise enough, the XML method should be your next port of call. This method requires the following formula:

=IMPORTXML("URL", "XPath")

The URL component is self-explanatory, but the XPath component can be complicated. This tutorial from w3schools does a great job explaining the structure of an XPath query, but we break down the basics here.

For this example, we’ll scrape all the book titles from the same Wikipedia page in the HTML example. In this scenario, the correct formula would be:

=IMPORTXML("https://en.wikipedia.org/wiki/List_of_best-selling_books", "//tbody/tr/td/i")

Above, you can see the result. So how did we arrive at “//tbody/tr/td/i” for the XPath query?

The first step involved finding an example of the data we wanted. In this case, we had to burrow into the tags before finding the element containing the book title within the table.

As you may see, it’s nested in multiple tags. It’s nested within, then

Checking the Wikipedia page, we see that the formula has returned all data in italics, which is what the <i> tag represents. However, we only wanted the text within the table. Therefore, we use "//tbody/tr/td/i" to narrow the search. The resultant formula only returns text found in this specific place, which is the book titles.
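The effect of narrowing the query can be sketched in Python with the standard library’s ElementTree module. The fragment in PAGE is hypothetical and deliberately well-formed, since real pages are messier. ElementTree supports only a subset of XPath and expects a leading "." on these paths, so the queries below are close relatives of the ones used in Google Sheets rather than exact copies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the titles and authors are placeholders.
PAGE = """
<body>
  <p>See also <i>An Unrelated Note</i>.</p>
  <table>
    <tbody>
      <tr><td><i>Book One</i></td><td>Author A</td></tr>
      <tr><td><i>Book Two</i></td><td>Author B</td></tr>
    </tbody>
  </table>
</body>
"""

root = ET.fromstring(PAGE.strip())

# Broad query: every <i> element anywhere under the root.
all_italics = [el.text for el in root.findall(".//i")]

# Narrow query: only <i> elements that sit inside a table cell.
titles = [el.text for el in root.findall(".//tbody/tr/td/i")]

print(all_italics)
print(titles)
```

The broad query also picks up the italic text in the paragraph, while the narrow one returns only the two titles inside the table.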

XPath commands aren’t an exact science, as every web page is different. In this example, someone could pull the table they wanted because it had a class that no other table on the page had. Figuring out what XPath you need depends on the web page.

This isn’t a foolproof method. In this example, a separate XML script had to be written to scrape the data, and this was due to bad HTML practices on the source site. So if everything you do fails, blame the source code.

How to scrape data using the RSS method

Scraping RSS data is more akin to the HTML method than the XML method. It’s just extremely limited in its scope. The formula is as follows:

=IMPORTFEED("URL")

If we use Android Police as an example URL (so, =IMPORTFEED("https://www.androidpolice.com/feed/")), we get this result, precisely what we wanted.

But you can customize it further by using the following parameters in your formula:

=IMPORTFEED(url, [query], [headers], [num_items])

A full breakdown of these parameters can be found on Google’s support page for the formula. Using these parameters, you can create a tidier feed, such as the example above, which returns just the title and URL.
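As a rough illustration of what trimming a feed down to titles and URLs involves, here is a Python sketch that reads a small, made-up RSS document. The FEED text and the feed_items helper are hypothetical, and the sketch assumes a plain RSS 2.0 layout with item entries directly under channel. It is not a description of how IMPORTFEED works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; titles and URLs are placeholders, not real articles.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
    <item><title>Third post</title><link>https://example.com/3</link></item>
  </channel>
</rss>
"""


def feed_items(xml_text, num_items):
    """Return (title, link) pairs for the first num_items entries."""
    channel = ET.fromstring(xml_text).find("channel")
    items = channel.findall("item")[:num_items]
    return [(item.findtext("title"), item.findtext("link")) for item in items]


rows = feed_items(FEED, 2)
for title, link in rows:
    print(title, link)
```

Here num_items plays the same role as the optional num_items parameter in the formula: it caps how many entries come back, starting from the top of the feed.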

Scrape data in seconds, not hours

Scraping data in Google Sheets is a challenging concept to wrap your head around, but after some practice, you can scrape massive amounts of data in seconds. You’ll still need an understanding of Google Sheets, but these tips and tricks can help you sort your data without a headache.
