
Building a Simple Scraping Website With PHP Laravel, Part 1: Beginning

In this tutorial we will introduce the concept of website scraping, then clarify that concept with a real-world example using PHP and Laravel.

 

Requirements:

  • PHP and Laravel framework
  • Bootstrap
  • jQuery

 

Overview:

Content scraping means reading pieces of content from HTML or XML pages so they can be displayed on another website or saved to a database. For example, we might read the list of articles on a news website, or read the products of e-commerce websites to build a price comparison. The process pulls out pieces of information such as item titles, descriptions, images and more.

 

Methods of Scraping:

  • Regex (Regular expressions).
  • CSS selectors.
  • XPath (a query language for selecting nodes in XML/HTML documents, similar in spirit to querying with CSS selectors).
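
To make the difference between these methods concrete, here is a minimal sketch that extracts the same title once with a regular expression and once with XPath. It uses only PHP's built-in DOMDocument and DOMXPath classes; the sample markup and variable names are invented for illustration.

```php
<?php
// Sample markup invented for this illustration.
$html = '<div class="post"><h2 class="title">Hello World</h2><p>First paragraph.</p></div>';

// 1) Regex: quick to write, but brittle when attributes or nesting change.
preg_match('/<h2[^>]*>(.*?)<\/h2>/s', $html, $matches);
$titleByRegex = $matches[1] ?? null;

// 2) XPath: parse the markup into a DOM tree first, then query the tree.
//    The equivalent CSS selector would be "div.post > h2".
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($dom);
$titleByXPath = $xpath->evaluate('string(//div[@class="post"]/h2)');

echo $titleByRegex, PHP_EOL;   // Hello World
echo $titleByXPath, PHP_EOL;   // Hello World
```

Both approaches return the same text here, but the XPath (or CSS selector) version keeps working if the h2 gains extra attributes or nested tags, which is why the rest of this tutorial relies on selectors.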

 

Scraping Process:

As shown in the figure below, suppose we have a list of articles, where each article has an image, a title and an excerpt, and we need to retrieve each of those pieces.

Using a technique like CSS selectors, we can do something like the following pseudo code:

// this is just pseudo code to clarify the process, not actual code
foreach (<article> as node) {

    title = node.children('h2').text()
    excerpt = node.children('p').html()
    img = node.children('img').attr('src')

    .....
}
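
The pseudo code above can be turned into a runnable sketch with PHP's built-in DOM classes. This is not the Goutte API used later in the tutorial, only an illustration of the same loop; the sample markup and the innerHtml() helper are assumptions made for this example.

```php
<?php
// Sample markup invented for this illustration.
$html = <<<'HTML'
<html><body>
<article><h2>First title</h2><p>First <b>excerpt</b></p><img src="/img/1.jpg"></article>
<article><h2>Second title</h2><p>Second excerpt</p><img src="/img/2.jpg"></article>
</body></html>
HTML;

// Hypothetical helper: the inner HTML of a node (like .html() in the pseudo code).
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

libxml_use_internal_errors(true);   // HTML5 tags such as <article> trigger parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$items = [];
foreach ($xpath->query('//article') as $node) {
    // Look up the direct children of the current <article> only.
    $p   = $xpath->query('./p', $node)->item(0);
    $img = $xpath->query('./img', $node)->item(0);

    $items[] = [
        'title'   => trim($xpath->evaluate('string(./h2)', $node)),
        'excerpt' => $p ? innerHtml($p) : null,
        'img'     => $img ? $img->getAttribute('src') : null,
    ];
}

print_r($items);
```

Each entry in $items holds the three pieces from the pseudo code: the title text, the excerpt markup and the image source.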

 

Figure: content scraping process

PHP Goutte Package:

For this tutorial we will use a PHP package called Goutte, which is built for web scraping and is available on GitHub. Internally the package is based on the Symfony DomCrawler component.

use Goutte\Client;

$client = new Client();

// create a crawler object from this link
$crawler = $client->request('GET', 'https://www.nytimes.com/section/politics');

// filter all li elements that have the class "css-ye6x8s" and loop over them
$crawler->filter('li.css-ye6x8s')->each(function ($node) {

    // print the text of the first h2.css-1dq8tca element inside each li
    print $node->filter('h2.css-1dq8tca')->text();
});

The filter() method is one of the methods of the Symfony DomCrawler; it filters elements by CSS selector. There is also filterXPath(), which filters elements with XPath expressions. filter(selector) always returns a new Crawler object containing the matched nodes (an empty one if nothing matched, not null), so you can chain further calls on it to retrieve nested elements. Calling text() with no arguments on an empty Crawler throws an exception, so check count() first when a match is not guaranteed.

Consider this sample HTML:

<ul>
     <li><p>First</p></li>
     <li><p class="second">Second</p></li>
     <li><p>Third</p></li>
</ul>

To get the first element:

$crawler->filter('ul li')->first();

To get an element by a specific (zero-based) index:

$crawler->filter('ul li')->eq(1);

To get the children of an element:

$crawler->filter('ul')->children();
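
The calls above can be tried without any HTTP request by building a Crawler directly from the sample HTML. The sketch below assumes it is saved as a script in the project root, next to Composer's vendor directory, and that the symfony/dom-crawler and symfony/css-selector packages are installed (Goutte depends on both). The "li.missing" selector is a made-up example of a selector that matches nothing.

```php
<?php
require __DIR__ . '/vendor/autoload.php';   // assumes Composer's autoloader is in this directory

use Symfony\Component\DomCrawler\Crawler;

$html = '<ul><li><p>First</p></li><li><p class="second">Second</p></li><li><p>Third</p></li></ul>';
$crawler = new Crawler($html);

$first  = $crawler->filter('ul li')->first()->text();        // "First"
$second = $crawler->filter('ul li')->eq(1)->text();          // "Second" (index is zero-based)
$count  = $crawler->filter('ul')->children()->count();       // 3

// The same paragraph selected with an XPath expression instead of a CSS selector.
$secondByXPath = $crawler->filterXPath('//p[@class="second"]')->text();

// filter() returns an empty Crawler when nothing matches, so guard with count().
$missing  = $crawler->filter('ul li.missing');
$fallback = $missing->count() > 0 ? $missing->text() : null;

var_dump($first, $second, $count, $secondByXPath, $fallback);
```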

Refer to the Symfony DomCrawler documentation to learn more about the other available methods.

 

 

Preparing The Project:

We will build a news website where an administrator can add categories, add links to news websites, and fetch articles by scraping.

 

Create a new Laravel project:

composer create-project laravel/laravel web-scraper "5.5.*" --prefer-dist

 

Install PHP Goutte:

composer require fabpot/goutte

 

Next modify your database settings in .env:

DB_CONNECTION=mysql
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=<db name>
DB_USERNAME=<db user>
DB_PASSWORD=<db password>

 

Database tables

  • Categories
  • Articles
  • Websites: Holds the websites we will pull data from.
  • Links: Holds the individual links, each tied to a website, that we will pull data from.
  • Item Schema: Defines the schema for a single item in a list of items. It contains the CSS expression used to fetch individual pieces of data such as the title, excerpt, image, etc.

Let’s create some migrations for our database

php artisan make:migration create_category_table
php artisan make:migration create_website_table
php artisan make:migration create_article_table
php artisan make:migration create_item_schema_table
php artisan make:migration create_links_table

 

Next modify the migration files as follows:

database/migrations/XXXX_XX_XX_create_category_table.php

<?php

use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;

class CreateCategoryTable extends Migration
{
    /**
     * Run the migrations.
     *
     * @return void
     */
    public function up()
    {
        Schema::create('category', function (Blueprint $table) {
            $table->increments('id');
            $table->string("title");
            $table->timestamps();
        });
    }

    /**
     * Reverse the migrations.
     *
     * @return void
     */
    public function down()
    {
        Schema::dropIfExists('category');
    }
}

database/migrations/XXXX_XX_XX_create_website_table.php

<?php

use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;

class CreateWebsiteTable extends Migration
{
    /**
     * Run the migrations.
     *
     * @return void
     */
    public function up()
    {
        Schema::create('website', function (Blueprint $table) {
            $table->increments('id');
            $table->string('title');
            $table->string('logo');
            $table->string('url');
            $table->timestamps();
        });
    }

    /**
     * Reverse the migrations.
     *
     * @return void
     */
    public function down()
    {
        Schema::dropIfExists('website');
    }
}

database/migrations/XXXX_XX_XX_create_article_table.php

<?php

use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;

class CreateArticleTable extends Migration
{
    /**
     * Run the migrations.
     *
     * @return void
     */
    public function up()
    {
        Schema::create('article', function (Blueprint $table) {
            $table->increments('id');
            $table->string('title', 355);
            $table->text('excerpt')->nullable();
            $table->longText('content')->nullable();
            $table->string('image')->nullable();
            $table->string('source_link', 355)->nullable();
            $table->unsignedInteger('category_id')->nullable();
            $table->unsignedInteger('website_id')->nullable();
            $table->foreign('category_id')
                ->references('id')
                ->on('category')
                ->onUpdate('cascade')
                ->onDelete('set null');
            $table->foreign('website_id')
                ->references('id')
                ->on('website')
                ->onUpdate('cascade')
                ->onDelete('set null');
            $table->timestamps();
        });
    }

    /**
     * Reverse the migrations.
     *
     * @return void
     */
    public function down()
    {
        Schema::dropIfExists('article');
    }
}

database/migrations/XXXX_XX_XX_create_links_table.php

<?php

use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;

class CreateLinksTable extends Migration
{
    /**
     * Run the migrations.
     *
     * @return void
     */
    public function up()
    {
        Schema::create('links', function (Blueprint $table) {
            $table->increments('id');
            $table->string('url');
            $table->string('main_filter_selector');   // this is the main filter selector used in the main filter() function
            $table->unsignedInteger('website_id')->nullable();
            $table->unsignedInteger('category_id')->nullable();
            $table->unsignedInteger('item_schema_id')->nullable();
            $table->foreign('website_id')
                ->references('id')
                ->on('website')
                ->onUpdate('cascade')
                ->onDelete('set null');

            $table->foreign('category_id')
                ->references('id')
                ->on('category')
                ->onUpdate('cascade')
                ->onDelete('set null');

            $table->foreign('item_schema_id')
                ->references('id')
                ->on('item_schema')
                ->onUpdate('cascade')
                ->onDelete('set null');
            $table->timestamps();
        });
    }

    /**
     * Reverse the migrations.
     *
     * @return void
     */
    public function down()
    {
        Schema::dropIfExists('links');
    }
}

database/migrations/XXXX_XX_XX_create_item_schema_table.php

<?php

use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;

class CreateItemSchemaTable extends Migration
{
    /**
     * Run the migrations.
     *
     * @return void
     */
    public function up()
    {
        Schema::create('item_schema', function (Blueprint $table) {
            $table->increments('id');
            $table->string('title');
            $table->boolean('is_full_url')->default(1);   // whether this is a full link to the article or a partial link
            $table->text('css_expression');    // expression that defines the selector structure for this item, e.g. (a > p) matches p tags that are direct children of a
            $table->string('full_content_selector');

            $table->timestamps();
        });
    }

    /**
     * Reverse the migrations.
     *
     * @return void
     */
    public function down()
    {
        Schema::dropIfExists('item_schema');
    }
}

The category table holds the categories, and the article table holds the articles, each linked to a category by category_id.

The website table represents the websites that we will scrape data from.

 

The links table represents the links for a specific website. Each link also has a category_id, so the articles fetched from that link are assigned to the selected category. The main_filter_selector column defines the main CSS selector that is passed to the filter() function, i.e. $crawler->filter(main_filter_selector).

 

The item_schema table represents the structure of a single item in an article list page; for example, an article contains a title, URL, excerpt and image.

The is_full_url column defines whether the article links to its details page with a full URL or a partial URL.
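
To illustrate why such a flag is useful, here is a minimal sketch of how a scraped link could be turned into an absolute URL. The resolveArticleUrl() function is hypothetical and not part of the tutorial's code; it also assumes that a partial link is relative to the website's root URL, which will not hold for every site.

```php
<?php
/**
 * Hypothetical helper: build the URL of an article's details page.
 *
 * @param string $websiteUrl base URL of the website, e.g. the "url" column of the website table
 * @param string $href       link scraped from the article list page
 * @param bool   $isFullUrl  value of the is_full_url flag for this item schema
 */
function resolveArticleUrl(string $websiteUrl, string $href, bool $isFullUrl): string
{
    if ($isFullUrl) {
        return $href;   // the scraped link is already absolute
    }

    // Partial link: join it to the website's base URL with exactly one slash.
    return rtrim($websiteUrl, '/') . '/' . ltrim($href, '/');
}

echo resolveArticleUrl('https://example.com/', '/news/item-1', false), PHP_EOL;
// https://example.com/news/item-1

echo resolveArticleUrl('https://example.com', 'https://example.com/news/item-2', true), PHP_EOL;
// https://example.com/news/item-2
```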

The css_expression column holds a special expression covering all the CSS selectors that represent those elements; we will discuss it in the next sections.

The full_content_selector column defines the CSS selector of the item content on the details page.

 

Generating Models

Let’s create the required models for the database tables:

php artisan make:model Website
php artisan make:model Category
php artisan make:model Article
php artisan make:model Link
php artisan make:model ItemSchema

Open app/Website.php and modify it as follows:

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;

class Website extends Model
{
    protected $table = "website";
}

Modify app/Category.php as follows:

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;

class Category extends Model
{
    protected $table = "category";
}

app/Article.php

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;

class Article extends Model
{
    protected $table = "article";

    public function category()
    {
        return $this->belongsTo('App\Category', 'category_id');
    }

    public function website()
    {
        return $this->belongsTo('App\Website', 'website_id');
    }
}

app/Link.php

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;

class Link extends Model
{
    protected $table = "links";

    public function category()
    {
        return $this->belongsTo('App\Category', 'category_id');
    }

    public function website()
    {
        return $this->belongsTo('App\Website', 'website_id');
    }

    public function itemSchema()
    {
        return $this->belongsTo('App\ItemSchema', 'item_schema_id');
    }
}

app/ItemSchema.php

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;

class ItemSchema extends Model
{
    protected $table = "item_schema";
}

As shown in the code above, we added some relations to the models. The Article model has two relations: the Category and the Website it belongs to. The Link model has relations to Category, Website and ItemSchema.

Preparing main layout

Now we need to create the main layout template in resources/views/layout.blade.php and add the below contents:

<!doctype html>
<html lang="{{ app()->getLocale() }}">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <title>Web Scraper</title>

    <!-- Fonts -->
    <link href="https://fonts.googleapis.com/css?family=Raleway:100,600" rel="stylesheet" type="text/css">

    <link href="https://stackpath.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css" rel="stylesheet" type="text/css" />

    <style>
        .glyphicon.fast-right-spinner {
            -webkit-animation: glyphicon-spin-r 1s infinite linear;
            animation: glyphicon-spin-r 1s infinite linear;
        }

        @-webkit-keyframes glyphicon-spin-r {
            0% {
                -webkit-transform: rotate(0deg);
                transform: rotate(0deg);
            }

            100% {
                -webkit-transform: rotate(359deg);
                transform: rotate(359deg);
            }
        }

        @keyframes glyphicon-spin-r {
            0% {
                -webkit-transform: rotate(0deg);
                transform: rotate(0deg);
            }

            100% {
                -webkit-transform: rotate(359deg);
                transform: rotate(359deg);
            }
        }
    </style>

    <script src="https://code.jquery.com/jquery-1.12.4.min.js"></script>

    <script src="https://stackpath.bootstrapcdn.com/bootstrap/3.3.1/js/bootstrap.min.js"></script>
</head>
<body>
    <div class="container">
        <div class="row">
            <nav class="navbar navbar-default">
                <div class="container-fluid">
                    <div class="navbar-header">
                        <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1" aria-expanded="false">
                            <span class="sr-only">Toggle navigation</span>
                            <span class="icon-bar"></span>
                            <span class="icon-bar"></span>
                            <span class="icon-bar"></span>
                        </button>
                        <a class="navbar-brand" href="{{url('/')}}">Web Scraper</a>
                    </div>

                    <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
                        <ul class="nav navbar-nav navbar-right">
                            <li class="dropdown">
                                <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Dashboard <span class="caret"></span></a>
                                <ul class="dropdown-menu">
                                    <li><a href="#">Websites</a></li>
                                    <li><a href="#">Categories</a></li>
                                    <li><a href="#">Links</a></li>
                                    <li><a href="#">Item Schema</a></li>
                                    <li role="separator" class="divider"></li>
                                    <li><a href="#">Articles</a></li>
                                </ul>
                            </li>
                        </ul>
                    </div>
                </div>
            </nav>


            @yield('content')
        </div>
    </div>

    @yield('script')
</body>
</html>

This is just a simple layout with a header. We added links for the dashboard items: websites, categories, links, item schema and articles.

 

In the next part of the tutorial we will implement the dashboard and CRUD operations.

 

Continue to part 2 >>> Implementing Scraper Dashboard
