Tuesday, 9 August 2016

How to Scrape a Website into Excel without programming

How to Scrape a Website into Excel without programming

This web scraping tutorial will teach you visually step by step how to scrape or extract or pull data from websites using import.io(Free Tool) without programming skills into Excel.

Personally, I use web scraping for analysing my competitors’ best-performing blog posts or content such as what blog posts or content received most comments or social media shares.

In this tutorial,We will scrape the following data from a blog:

    All blog posts URLs.
    Authors names for each post.
    Blog posts titles.
    The number of social media shares each post received.

Then we will use the extracted data to determine what are the popular blog posts and their authors,which posts received much engagement from users through social media shares and on page comments.

Let’s get started.

Step 1:Install import.io app

The first step is to install import.io app.A free web scraping tool and one of the best web scraping software.It is available for Windows,Mac and Linux platforms.Import.io offers advanced data extraction features without coding by allowing you to create custom APIs or crawl entire websites.

After installation, you will need to sign up for an account.It is completely free so don’t worry.I will not cover the installation process.Once everything is set correctly you will see something similar to the window below after your first login.

Step 2:Choose how to scrape data using import.io extractor

With import.io you can do data extraction by creating custom APIs or crawling the entire websites.It comes equipped with different tools for data extraction such as magic,extractor,crawler and connector.

In this tutorial,I will use a tool called “extractor” to create a custom API for our data extraction process.

To get started click the “new” red button on the right top of the page and then click “Start Extractor” button on the pop-up window.

After clicking  “Start Extractor” the Import.io app internal browser window will open as shown below.

Step 3:Data scraping process

Now after the import.io browser is open navigate to the blog URL you want to scrape data from. Then once you already navigated to the target blog URL turn on extraction.In this tutorial,I will use this blog URL bongo5.com  for data extraction.

You can see from the window below I already navigated to www.bongo5.com but extraction switch is still off.

Turn extraction switch “ON” as shown in the window below and move to the next step.

Step 4:Training the “columns” or specifying the data we want to scrape

In this step,I will specify exactly what kind of data I want to scrape from the blog.On import.io app specifying the data you want to scrape is referred to as “training the columns”.Columns represent the data set I want to scrape(post titles,authors’ names and posts URLs).

In order to understand this step, you need to know the difference between a blog page and a blog post.A page might have a single post or multiple posts depending on the blog configuration.

A blog might have several blog posts,even hundreds or thousands of posts.But I will take only one session to train the “extractor” about the data I want to extract.I will do so by using an import.io visual highlighter.Once the data extraction is turned on the-the highlighter will appear by default.

I will do the training session for a single post in a single blog page with multiple posts then the extractor will extract data automatically for the remaining posts on the “same” blog page.
Step 4a:Creating “post_title” column

I will start by renaming “my_column” into the name of the data I want to scrape.Our goal in this tutorial is to scrape the blog posts titles,posts URLs,authors names and get social statistics later so I will create columns for posts titles,posts URLs,authors names.Later on, I will teach you how to get social statistics for the post URLs.

After editing “my_column” into “post_title” then point the mouse cursor over to any of the Posts title on the same blog page and the visual highlighter will automatically appear.Using the highlighter I can select the data I want to extract.

You can see below I selected one of the blog post titles on the page.The rectangular box with orange border is the visual highlighter.

The app will ask you how is the data arranged on the page.Since I have more than one post in a single page then you have rows of repeating data.This blog is having 25 posts per page.So you will select “many rows”.Sometimes you might have a single post on a page for that case you need to select “Just one row”.

Source: http://nocodewebscraping.com/web-scraping-for-dummies-tutorial-with-import-io-without-coding/

No comments:

Post a Comment