
Building A Simple Scraping Website With PHP Laravel Part 2: Dashboard and CRUD


In this part of the tutorial (Building a simple scraping website) we will create a simple dashboard and the CRUD operations for our modules.

 

 


First we need modules for the database tables we created in the previous part. I will not explain these modules in detail because they are ordinary CRUD operations, but I will explain the parts that the scraping depends on.

 

Creating Controllers

Let’s first create the required controllers. I will use resource controllers here, so run these commands in the terminal:

php artisan make:controller CategoriesController --resource
php artisan make:controller WebsitesController --resource
php artisan make:controller LinksController --resource
php artisan make:controller ItemSchemaController --resource
php artisan make:controller ArticlesController --resource

 

Next let’s add the routes for those controllers

Open routes/web.php and add the below code:

Route::group(['prefix' => 'dashboard'], function() {
    Route::resource('/websites', 'WebsitesController');
    Route::resource('/categories', 'CategoriesController');
    Route::resource('/links', 'LinksController');
    Route::resource('/item-schema', 'ItemSchemaController');
    Route::resource('/articles', 'ArticlesController');
});

As you can see above, all module URLs live under the dashboard group, so to open the article list, for example, go to /dashboard/articles.
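To make the route group concrete, here is a small standalone sketch of the seven routes that a single Route::resource() call registers. The resourceRoutes() function is a hypothetical helper written for this illustration, not a Laravel API, and {id} is a stand-in: Laravel names the actual route parameter after the resource.

```php
<?php

/**
 * List the seven routes that Route::resource() registers for one resource.
 * Hypothetical helper for illustration; {id} stands in for the route
 * parameter, which Laravel actually names after the resource.
 */
function resourceRoutes(string $prefix, string $resource): array
{
    $base = '/' . trim($prefix, '/') . '/' . trim($resource, '/');

    return [
        "$resource.index"   => ['GET',       $base],
        "$resource.create"  => ['GET',       "$base/create"],
        "$resource.store"   => ['POST',      $base],
        "$resource.show"    => ['GET',       "$base/{id}"],
        "$resource.edit"    => ['GET',       "$base/{id}/edit"],
        "$resource.update"  => ['PUT/PATCH', "$base/{id}"],
        "$resource.destroy" => ['DELETE',    "$base/{id}"],
    ];
}

// Print route name, HTTP verb and URI for the articles module.
foreach (resourceRoutes('dashboard', 'articles') as $name => [$verb, $uri]) {
    printf("%-17s %-10s %s\n", $name, $verb, $uri);
}
```

The keys are the route names used with the route() helper in the views and controllers of this part, for example route('categories.index'). In the real application, php artisan route:list shows the same information.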

 

First, modify app/Http/Controllers/Controller.php to add an uploadFile helper that the other controllers will use for file uploads:

<?php

namespace App\Http\Controllers;

use Illuminate\Foundation\Bus\DispatchesJobs;
use Illuminate\Routing\Controller as BaseController;
use Illuminate\Foundation\Validation\ValidatesRequests;
use Illuminate\Foundation\Auth\Access\AuthorizesRequests;

class Controller extends BaseController
{
    use AuthorizesRequests, DispatchesJobs, ValidatesRequests;


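    /**
     * Move an uploaded file into $destination under a timestamp-based name.
     *
     * @return array ["state" => 1, "filename" => ...] on success, state 0 on failure
     */
    public function uploadFile($name, $destination, $request = null)
    {
        try {
            // Fall back to the current request when none is passed in.
            $request = $request ?: request();

            $image = $request->file($name);

            $fileName = time() . '.' . $image->getClientOriginalExtension();

            $image->move($destination, $fileName);

            return ["state" => 1, "filename" => $fileName];
        } catch (\Throwable $ex) {
            // \Throwable also catches the Error thrown when no file was uploaded.
            return ["state" => 0, "filename" => ""];
        }
    }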
}
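One caveat with the helper above: time() has one-second resolution, so two uploads that arrive in the same second get the same file name and the second one overwrites the first. Below is a minimal sketch of one way around that, assuming PHP 7 or later for random_bytes(). The uniqueFileName() function is a hypothetical helper, not part of Laravel or of this series.

```php
<?php

/**
 * Build a file name that is very unlikely to collide.
 * Hypothetical helper: timestamp + 8 random bytes (16 hex chars) + extension.
 */
function uniqueFileName(string $extension): string
{
    // Keep only letters and digits so the extension cannot carry path parts.
    $extension = strtolower(preg_replace('/[^A-Za-z0-9]/', '', $extension));

    $name = time() . '_' . bin2hex(random_bytes(8));

    return $extension === '' ? $name : $name . '.' . $extension;
}

echo uniqueFileName('PNG'), "\n";
```

Inside uploadFile this could replace the line that builds $fileName, passing in $image->getClientOriginalExtension().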

 

 

 

Adding Categories

Let’s start with the category module. Open app/Http/Controllers/CategoriesController.php and modify it like this:

<?php

namespace App\Http\Controllers;

use App\Category;
use Illuminate\Http\Request;

class CategoriesController extends Controller
{
    /**
     * Display a listing of the resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function index()
    {
        $cats = Category::orderBy('id', 'DESC')->paginate(10);

        return view('dashboard.category.index')->withCategories($cats);
    }

    /**
     * Show the form for creating a new resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function create()
    {
        return view('dashboard.category.create');
    }

    /**
     * Store a newly created resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @return \Illuminate\Http\Response
     */
    public function store(Request $request)
    {
        $this->validate($request, [
            'title' => 'required'
        ]);

        $cat = new Category;

        $cat->title = $request->input('title');

        $cat->save();

        return redirect()->route('categories.index');
    }

    /**
     * Display the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function show($id)
    {
        //
    }

    /**
     * Show the form for editing the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function edit($id)
    {
        // findOrFail() returns a 404 instead of passing null to the view.
        return view('dashboard.category.edit')->withCategory(Category::findOrFail($id));
    }

    /**
     * Update the specified resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function update(Request $request, $id)
    {
        $this->validate($request, [
            'title' => 'required'
        ]);

        // Respond with a 404 if the category does not exist.
        $cat = Category::findOrFail($id);

        $cat->title = $request->input('title');

        $cat->save();

        return redirect()->route('categories.index');
    }

    /**
     * Remove the specified resource from storage.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function destroy($id)
    {
        //
    }
}

Next create a new view resources/views/dashboard/category/index.blade.php and add the below code:

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Categories</h2>

            <a href="{{ route('categories.create') }}" class="btn btn-warning pull-right">Add new</a>

            @if(count($categories) > 0)

                <table class="table table-bordered">
                    <tr>
                        <td>Title</td>
                        <td>Actions</td>
                    </tr>
                    @foreach($categories as $cat)
                        <tr>
                            <td>{{ $cat->title }}</td>
                            <td>
                                <a href="{{ url('dashboard/categories/' . $cat->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i> </a>
                            </td>
                        </tr>
                    @endforeach
                </table>

                <div class="pagination">
                    {!! $categories->render() !!}
                </div>

            @else
                <i>No categories found</i>

            @endif
        </div>
    </div>

@endsection

resources/views/dashboard/category/create.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Add Category</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('categories.store') }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Title:</strong>

                            <input type="text" name="title" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Create</button>

                </div>

            </form>
        </div>
    </div>

@endsection

resources/views/dashboard/category/edit.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Update Category #{{$category->id}}</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('categories.update', $category->id) }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                {{ method_field("PUT") }}

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Title:</strong>

                            <input type="text" name="title" value="{{ $category->title }}" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Update</button>

                </div>

            </form>
        </div>
    </div>

@endsection

The above code is self-explanatory: a simple CRUD for the categories module. First we updated the categories controller, then we added three views to create, edit, and list categories.

 

Adding Websites

Now open up app/Http/Controllers/WebsitesController.php and modify it like this:

<?php

namespace App\Http\Controllers;

use App\Website;
use Illuminate\Http\Request;

class WebsitesController extends Controller
{
    /**
     * Display a listing of the resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function index()
    {
        $websites = Website::orderBy('id', 'DESC')->paginate(10);

        return view('dashboard.website.index')->withWebsites($websites);
    }

    /**
     * Show the form for creating a new resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function create()
    {
        return view('dashboard.website.create');
    }

    /**
     * Store a newly created resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @return \Illuminate\Http\Response
     */
    public function store(Request $request)
    {
        $this->validate($request, [
            'title' => 'required',
            'url' => 'required',
            'logo' => 'required'
        ]);

        $website = new Website;

        $website->title = $request->input('title');

        $website->url = $request->input('url');

        $website->logo = $this->uploadFile('logo', public_path('uploads/'), $request)["filename"];

        $website->save();

        return redirect()->route('websites.index');
    }

    /**
     * Display the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function show($id)
    {
        //
    }

    /**
     * Show the form for editing the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function edit($id)
    {
        // findOrFail() returns a 404 instead of passing null to the view.
        return view('dashboard.website.edit')->withWebsite(Website::findOrFail($id));
    }

    /**
     * Update the specified resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function update(Request $request, $id)
    {
        $this->validate($request, [
            'title' => 'required',
            'url' => 'required'
        ]);

        // Respond with a 404 if the website does not exist.
        $website = Website::findOrFail($id);

        $website->title = $request->input('title');

        $website->url = $request->input('url');

        if($request->file('logo') != null) {

            $website->logo = $this->uploadFile('logo', public_path('uploads/'), $request)["filename"];
        }

        $website->save();

        return redirect()->route('websites.index');
    }

    /**
     * Remove the specified resource from storage.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function destroy($id)
    {
        //
    }
}

resources/views/dashboard/website/index.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Websites</h2>

            <a href="{{ route('websites.create') }}" class="btn btn-warning pull-right">Add new</a>

            @if(count($websites) > 0)

                <table class="table table-bordered">
                    <tr>
                        <td>Title</td>
                        <td>Logo</td>
                        <td>Url</td>
                        <td>Actions</td>
                    </tr>
                    @foreach($websites as $website)
                        <tr>
                            <td>{{ $website->title }}</td>
                            <td><img width="150" src="{{ url('uploads/' . $website->logo) }}" /></td>
                            <td><a href="{{ $website->url }}">{{ $website->url }}</a> </td>
                            <td>
                                <a href="{{ url('dashboard/websites/' . $website->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i> </a>
                            </td>
                        </tr>
                    @endforeach
                </table>

                <div class="pagination">
                    {!! $websites->render() !!}
                </div>

            @else
                <i>No websites found</i>

            @endif
        </div>
    </div>

@endsection

resources/views/dashboard/website/create.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Add Website</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('websites.store') }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Title:</strong>

                            <input type="text" name="title" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Url:</strong>

                            <input type="text" name="url" class="form-control" />

                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Logo:</strong>

                            <input type="file" name="logo" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Create</button>

                </div>

            </form>
        </div>
    </div>

@endsection

resources/views/dashboard/website/edit.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Update Website #{{$website->id}}</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('websites.update', $website->id) }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                {{ method_field("PUT") }}

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Title:</strong>

                            <input type="text" name="title" value="{{ $website->title }}" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Url:</strong>

                            <input type="text" name="url" value="{{ $website->url }}" class="form-control" />

                        </div>
                    </div>
                </div>

                <div class="row">
                    @if($website->logo != "")
                        <img src="{{ url('uploads/' . $website->logo) }}" width="150" />
                    @endif
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Logo:</strong>

                            <input type="file" name="logo" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Update</button>

                </div>

            </form>
        </div>
    </div>

@endsection

Make sure to create an uploads/ directory inside the public/ folder and make it writable. This directory stores the uploaded website logos, as shown in the code above.
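The directory can also be created and checked from PHP. This is a small sketch for a plain PHP script: ensureWritableDir() is a hypothetical helper and 0775 is only an example mode, so adjust both to your server setup. In the Laravel application the path would be public_path('uploads').

```php
<?php

/**
 * Create $dir if it does not exist and report whether PHP can write to it.
 * Hypothetical helper for illustration.
 */
function ensureWritableDir(string $dir, int $mode = 0775): bool
{
    // The second is_dir() check covers another process creating it meanwhile.
    if (!is_dir($dir) && !mkdir($dir, $mode, true) && !is_dir($dir)) {
        return false; // could not create the directory
    }

    return is_writable($dir);
}

// Example with a throwaway path; in the app use public_path('uploads').
$dir = sys_get_temp_dir() . '/uploads_demo_' . getmypid();
var_dump(ensureWritableDir($dir));
```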

Adding Item Schema

As mentioned in the previous article, an item schema represents the schema of a single item in a list of items, so we need to construct an expression that represents that schema.

 

Open app/Http/Controllers/ItemSchemaController.php and modify it like this:

<?php

namespace App\Http\Controllers;

use App\ItemSchema;
use Illuminate\Http\Request;

class ItemSchemaController extends Controller
{
    /**
     * Display a listing of the resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function index()
    {
        $itemSchema = ItemSchema::orderBy('id', 'DESC')->paginate(10);

        return view('dashboard.item_schema.index')->withItemSchemas($itemSchema);
    }

    /**
     * Show the form for creating a new resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function create()
    {
        return view('dashboard.item_schema.create');
    }

    /**
     * Store a newly created resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @return \Illuminate\Http\Response
     */
    public function store(Request $request)
    {
        $this->validate($request, [
            'title' => 'required',
            'css_expression' => 'required',
            'full_content_selector' => 'required'
        ]);

        $itemSchema = new ItemSchema;

        $itemSchema->title = $request->input('title');

        if($request->input('is_full_url') != null) {

            $itemSchema->is_full_url = 1;
        } else {
            $itemSchema->is_full_url = 0;
        }

        $itemSchema->css_expression = $request->input('css_expression');

        $itemSchema->full_content_selector = $request->input('full_content_selector');

        $itemSchema->save();

        return redirect()->route('item-schema.index');
    }

    /**
     * Display the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function show($id)
    {
        //
    }

    /**
     * Show the form for editing the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function edit($id)
    {
        // findOrFail() returns a 404 instead of passing null to the view.
        return view('dashboard.item_schema.edit')->withItemSchema(ItemSchema::findOrFail($id));
    }

    /**
     * Update the specified resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function update(Request $request, $id)
    {
        $this->validate($request, [
            'title' => 'required',
            'css_expression' => 'required',
            'full_content_selector' => 'required'
        ]);

        // Respond with a 404 if the item schema does not exist.
        $itemSchema = ItemSchema::findOrFail($id);

        $itemSchema->title = $request->input('title');

        if($request->input('is_full_url') != null) {

            $itemSchema->is_full_url = 1;
        } else {
            $itemSchema->is_full_url = 0;
        }

        $itemSchema->css_expression = $request->input('css_expression');

        $itemSchema->full_content_selector = $request->input('full_content_selector');

        $itemSchema->save();

        return redirect()->route('item-schema.index');
    }

    /**
     * Remove the specified resource from storage.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function destroy($id)
    {

    }
}

resources/views/dashboard/item_schema/index.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Item Schema</h2>

            <a href="{{ route('item-schema.create') }}" class="btn btn-warning pull-right">Add new</a>

            @if(count($itemSchemas) > 0)

                <table class="table table-bordered">
                    <tr>
                        <td>Title</td>
                        <td>CSS Expression</td>
                        <td>Is Full Url To Article</td>
                        <td>Full content selector</td>
                        <td>Actions</td>
                    </tr>
                    @foreach($itemSchemas as $item)
                        <tr>
                            <td>{{ $item->title }}</td>
                            <td>{{ $item->css_expression }}</td>
                            <td>{{ $item->is_full_url==1?"Yes":"No" }}</td>
                            <td>{{ $item->full_content_selector }}</td>
                            <td>
                                <a href="{{ url('dashboard/item-schema/' . $item->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i> </a>
                            </td>
                        </tr>
                    @endforeach
                </table>

                <div class="pagination">
                    {!! $itemSchemas->render() !!}
                </div>

            @else
                <i>No items found</i>

            @endif
        </div>
    </div>

@endsection

resources/views/dashboard/item_schema/create.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Add Item Schema</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('item-schema.store') }}" enctype="multipart/form-data">
                {{ csrf_field() }}

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Title:</strong>

                            <input type="text" name="title" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>CSS Expression:</strong>

                            <input type="text" name="css_expression" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Is Full Url To Article/Partial Url:</strong>

                            <input type="checkbox" name="is_full_url" value="1" checked />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Full content selector:</strong>

                            <input type="text" name="full_content_selector" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Create</button>

                </div>

            </form>
        </div>
    </div>

@endsection

resources/views/dashboard/item_schema/edit.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Update Item Schema #{{$itemSchema->id}}</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('item-schema.update', $itemSchema->id) }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                {{ method_field("PUT") }}

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Title:</strong>

                            <input type="text" name="title" value="{{ $itemSchema->title }}" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>CSS Expression:</strong>

                            <input type="text" name="css_expression" value="{{ $itemSchema->css_expression }}" class="form-control" />
                            <div class="help-block">The CSS expression lists one field[css selector] pair per field to extract from a single article, separated by ||, e.g. title[h2.post_title] for the title.</div>
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Is Full Url To Article/Partial Url:</strong>

                            <input type="checkbox" name="is_full_url" value="1" {{ $itemSchema->is_full_url?"checked":"" }} />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Full content selector:</strong>

                            <input type="text" name="full_content_selector" value="{{ $itemSchema->full_content_selector }}" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Update</button>

                </div>

            </form>
        </div>
    </div>

@endsection

A typical CSS expression takes this structure:

field1[css selector1]||field2[css selector2]||field3[css selector3[attribute]]

For example, the item schema expression to pull articles from the New York Times website looks like this:

title[h2.css-1dq8tca]||excerpt[p.css-1echdzn]||image[img.css-11cwn6f[src]]||source_link[.css-4jyr1y a[href]]

As shown, the expression lists every field of data to fetch, separated by “||”. Every field has two parts: the field name, followed by the CSS selector between brackets “[]”. The field name must match the field name in the database. To read an attribute such as an image src, add the attribute name in its own “[]” after the CSS selector.
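To show how such an expression breaks down, here is a minimal parsing sketch in plain PHP. It only illustrates the format described above. parseItemSchemaExpression() is a hypothetical function, and the scraper built in this series may parse the expression differently.

```php
<?php

/**
 * Parse an item schema expression into field definitions.
 * Hypothetical helper for illustration only.
 *
 * @return array field name => ['selector' => string, 'attribute' => string|null]
 */
function parseItemSchemaExpression(string $expression): array
{
    $fields = [];

    foreach (explode('||', $expression) as $part) {
        $part = trim($part);

        // field[...]: the name comes before the first "[".
        if (!preg_match('/^(\w+)\[(.+)\]$/', $part, $m)) {
            continue; // skip malformed parts
        }

        $selector = trim($m[2]);
        $attribute = null;

        // A trailing [attr] inside the brackets names the attribute to read.
        if (preg_match('/^(.+)\[([\w-]+)\]$/', $selector, $a)) {
            $selector = trim($a[1]);
            $attribute = $a[2];
        }

        $fields[$m[1]] = ['selector' => $selector, 'attribute' => $attribute];
    }

    return $fields;
}

print_r(parseItemSchemaExpression(
    'title[h2.css-1dq8tca]||image[img.css-11cwn6f[src]]||source_link[.css-4jyr1y a[href]]'
));
```

With this reading, a field without a trailing attribute (such as title) means "take the element's text", while image and source_link read the src and href attributes. One consequence of the format is that a selector cannot itself end in a CSS attribute selector, because the trailing brackets are always taken as the attribute to read.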

Adding Links

The links module is the most important one: it stores the links we fetch data from, and it is where the actual scraping happens.

app/Http/Controllers/LinksController.php

<?php

namespace App\Http\Controllers;

use App\Category;
use App\ItemSchema;
use App\Lib\Scraper;
use App\Link;
use App\Website;
use Illuminate\Http\Request;
use Goutte\Client;

class LinksController extends Controller
{
    /**
     * Display a listing of the resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function index()
    {
        $links = Link::orderBy('id', 'DESC')->paginate(10);

        $itemSchemas = ItemSchema::all();

        return view('dashboard.link.index')->withLinks($links)->withItemSchemas($itemSchemas);
    }

    /**
     * Show the form for creating a new resource.
     *
     * @return \Illuminate\Http\Response
     */
    public function create()
    {
        $categories = Category::all();
        $websites = Website::all();

        return view('dashboard.link.create')->withCategories($categories)->withWebsites($websites);
    }

    /**
     * Store a newly created resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @return \Illuminate\Http\Response
     */
    public function store(Request $request)
    {
        $this->validate($request, [
            'url' => 'required',
            'main_filter_selector' => 'required',
            'website_id' => 'required',
            'category_id' => 'required'
        ]);

        $link = new Link;

        $link->url = $request->input('url');

        $link->main_filter_selector = $request->input('main_filter_selector');

        $link->website_id = $request->input('website_id');

        $link->category_id = $request->input('category_id');

        $link->save();

        return redirect()->route('links.index');
    }

    /**
     * Display the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function show($id)
    {
        //
    }

    /**
     * Show the form for editing the specified resource.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function edit($id)
    {
        $categories = Category::all();
        $websites = Website::all();

        return view('dashboard.link.edit')->withLink(Link::find($id))->withCategories($categories)->withWebsites($websites);
    }

    /**
     * Update the specified resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function update(Request $request, $id)
    {
        $this->validate($request, [
            'url' => 'required',
            'main_filter_selector' => 'required',
            'website_id' => 'required',
            'category_id' => 'required'
        ]);

        $link = Link::find($id);

        $link->url = $request->input('url');

        $link->main_filter_selector = $request->input('main_filter_selector');

        $link->website_id = $request->input('website_id');

        $link->category_id = $request->input('category_id');

        $link->save();

        return redirect()->route('links.index');
    }

    /**
     * Remove the specified resource from storage.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function destroy($id)
    {
        //
    }


    /**
     * @param Request $request
     */
    public function setItemSchema(Request $request)
    {
        // both ids are required to update the link
        if(!$request->item_schema_id || !$request->link_id)
            return;

        $link = Link::find($request->link_id);

        $link->item_schema_id = $request->item_schema_id;

        $link->save();

        return response()->json(['msg' => 'Link updated!']);
    }


    /**
     * scrape specific link
     *
     * @param Request $request
     */
    public function scrape(Request $request)
    {
        if(!$request->link_id)
            return;

        $link = Link::find($request->link_id);

        // scraping needs both a main filter selector and an item schema
        if(empty($link->main_filter_selector) || empty($link->item_schema_id)) {
            return;
        }

        $scraper = new Scraper(new Client());

        $scraper->handle($link);

        if($scraper->status == 1) {
            return response()->json(['status' => 1, 'msg' => 'Scraping done']);
        } else {
            return response()->json(['status' => 2, 'msg' => $scraper->status]);
        }
    }
}

In the above code we need to focus on the scrape() method. It does the actual job of scraping by calling another class, Scraper, which we will implement shortly, to fetch and scrape a given link like so:

$scraper = new Scraper(new Client());

        $scraper->handle($link);

        if($scraper->status == 1) {
            return response()->json(['status' => 1, 'msg' => 'Scraping done']);
        } else {
            return response()->json(['status' => 2, 'msg' => $scraper->status]);
        }

Create a new file app/Lib/Scraper.php and add the code below:

<?php

namespace App\Lib;

use App\Article;
use Goutte\Client as GoutteClient;

/**
 * Class Scraper
 *
 * handles and processes scraping for a specific link
 * first we work on the main filter expression which is
 * the container of the items, then using an anonymous callback
 * on the filter function we iterate and save the results
 * into the article table
 *
 * @package App\Lib
 */
class Scraper
{
    protected $client;

    public $results = [];

    public $savedItems = 0;

    public $status = 1;

    public function __construct(GoutteClient $client)
    {
        $this->client = $client;
    }

    public function handle($linkObj)
    {
        try {
            $crawler = $this->client->request('GET', $linkObj->url);

            $translateExpre = $this->translateCSSExpression($linkObj->itemSchema->css_expression);

            if (isset($translateExpre['title'])) {

                $data = [];

                // filter
                $crawler->filter($linkObj->main_filter_selector)->each(function ($node) use ($translateExpre, &$data, $linkObj) {

                    // using the $node var we can access sub elements deep the tree

                    foreach ($translateExpre as $key => $val) {

                        if($node->filter($val['selector'])->count() > 0) {

                            if ($val['is_attribute'] == false) {

                                $data[$key][] = preg_replace("#\n|'|\"#",'', $node->filter($val['selector'])->text());
                            } else {
                                if ($key == 'source_link') {

                                    $item_link = $node->filter($val['selector'])->attr($val['attr']);

                                    // append website url in case the article is not full url
                                    if ($linkObj->itemSchema->is_full_url == 0) {
                                        $item_link = $linkObj->website->url . $node->filter($val['selector'])->attr($val['attr']);
                                    }

                                    $data[$key][] = $item_link;
                                    $data['content'][] = $this->fetchFullContent($item_link, $linkObj->itemSchema->full_content_selector);
                                } else {
                                    $data[$key][] = $node->filter($val['selector'])->attr($val['attr']);
                                }
                            }
                        }
                    }

                    $data['category_id'][] = $linkObj->category->id;

                    $data['website_id'][] = $linkObj->website->id;

                });
                //dd($data);
                $this->save($data);

                $this->results = $data;
            }
        } catch (\Exception $ex) {
            $this->status = $ex->getMessage();
        }
    }


    /**
     * fetchFullContent
     *
     * this method pulls the full content of a single item using the
     * item url and selector
     *
     * @param $item_url
     * @param $selector
     * @return string
     */
    protected function fetchFullContent($item_url, $selector)
    {
        try {
            $crawler = $this->client->request('GET', $item_url);

            return $crawler->filter($selector)->html();
        } catch (\Exception $ex) {
            return "";
        }
    }

    protected function save($data)
    {
        foreach ($data['title'] as $k => $val) {

            $checkExist = Article::where('source_link', $data['source_link'][$k])->first();

            if(!isset($checkExist->id)) {

                $article = new Article();

                $article->title = $val;

                $article->excerpt = isset($data['excerpt'][$k]) ? $data['excerpt'][$k] : "";

                $article->content = isset($data['content'][$k]) ? $data['content'][$k] : "";

                $article->image = isset($data['image'][$k]) ? $data['image'][$k] : "";

                $article->source_link = $data['source_link'][$k];

                $article->category_id = $data['category_id'][$k];

                $article->website_id = $data['website_id'][$k];

                $article->save();

                $this->savedItems++;
            }
        }
    }


    /**
     * translateCSSExpression
     *
     * translate the css expression into corresponding fields and sub selectors
     *
     * @param $expression
     * @return array
     */
    protected function translateCSSExpression($expression)
    {
        $exprArray = explode("||", $expression);

        // try to match split that expression into pieces
        $regex = '/(.*?)\[(.*)\]/m';

        $fields = [];

        foreach ($exprArray as $subExpr) {

            preg_match($regex, $subExpr, $matches);

            if(isset($matches[1]) && isset($matches[2])) {

                $is_attribute = false;

                $selector = $matches[2];

                $attr = "";

                // if this condition meets then this is attribute like img[src] or a[href]
                if (strpos($selector, "[") !== false && strpos($selector, "]") !== false) {

                    $is_attribute = true;

                    preg_match($regex, $matches[2], $matches_attr);

                    $selector = $matches_attr[1];

                    $attr = $matches_attr[2];
                }

                $fields[$matches[1]] = ['field' => $matches[1], 'is_attribute' => $is_attribute, 'selector' => $selector, 'attr' => $attr];
            }
        }

        return $fields;
    }
}

The main method in the above code is handle(). It relies on the Goutte client: it takes a link object and creates a crawler object from the given url.

Then it translates the CSS expression of the item schema attached to that link into an array of fields and their selectors using the translateCSSExpression() method:

protected function translateCSSExpression($expression)
    {
        $exprArray = explode("||", $expression);

        // try to match split that expression into pieces
        $regex = '/(.*?)\[(.*)\]/m';

        $fields = [];

        foreach ($exprArray as $subExpr) {

            preg_match($regex, $subExpr, $matches);

            if(isset($matches[1]) && isset($matches[2])) {

                $is_attribute = false;

                $selector = $matches[2];

                $attr = "";

                // if this condition meets then this is attribute like img[src] or a[href]
                if (strpos($selector, "[") !== false && strpos($selector, "]") !== false) {

                    $is_attribute = true;

                    preg_match($regex, $matches[2], $matches_attr);

                    $selector = $matches_attr[1];

                    $attr = $matches_attr[2];
                }

                $fields[$matches[1]] = ['field' => $matches[1], 'is_attribute' => $is_attribute, 'selector' => $selector, 'attr' => $attr];
            }
        }

        return $fields;
    }
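To make the translation step concrete, here is a minimal standalone sketch that mirrors the parsing logic above and runs it on a sample expression. The function name parseCssExpression() and the selectors in the sample expression are made up for illustration; in the project the logic lives in translateCSSExpression().

```php
<?php

// Standalone sketch of the parsing logic, for experimenting outside Laravel.
// parseCssExpression() is a hypothetical name used only in this example.
function parseCssExpression(string $expression): array
{
    $regex = '/(.*?)\[(.*)\]/';
    $fields = [];

    foreach (explode('||', $expression) as $subExpr) {
        // $m[1] is the field name, $m[2] is everything inside the outer brackets
        if (!preg_match($regex, $subExpr, $m)) {
            continue;
        }

        $selector = $m[2];
        $attr = '';
        $isAttribute = false;

        // A nested [..] means "read this attribute", e.g. img[src]
        if (preg_match($regex, $selector, $inner)) {
            $isAttribute = true;
            $selector = $inner[1];
            $attr = $inner[2];
        }

        $fields[$m[1]] = [
            'field' => $m[1],
            'is_attribute' => $isAttribute,
            'selector' => $selector,
            'attr' => $attr,
        ];
    }

    return $fields;
}

// Sample expression with made-up selectors
$fields = parseCssExpression(
    'title[h2.title a]||image[img.thumb[src]]||source_link[h2.title a[href]]'
);

print_r($fields['image']);
```

For the image field this yields the selector img.thumb with the attribute src, while title stays a plain text selector. One consequence of this format is that any nested brackets are read as "attribute to fetch", so a CSS attribute selector such as a[rel=nofollow] cannot be used as an ordinary selector inside an expression.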

After converting the expression into an array, we move on to filtering: we pass the main filter selector to the filter() method, which gives us a collection of nodes that we iterate over with the each() method. Inside the callback we extract the different pieces of data like this:

$crawler->filter($linkObj->main_filter_selector)->each(function ($node) use ($translateExpre, &$data, $linkObj) {

                    // using the $node var we can access sub elements deep the tree

                    foreach ($translateExpre as $key => $val) {

                        if($node->filter($val['selector'])->count() > 0) {

                            if ($val['is_attribute'] == false) {

                                $data[$key][] = preg_replace("#\n|'|\"#",'', $node->filter($val['selector'])->text());
                            } else {
                                if ($key == 'source_link') {

                                    $item_link = $node->filter($val['selector'])->attr($val['attr']);

                                    // append website url in case the article is not full url
                                    if ($linkObj->itemSchema->is_full_url == 0) {
                                        $item_link = $linkObj->website->url . $node->filter($val['selector'])->attr($val['attr']);
                                    }

                                    $data[$key][] = $item_link;
                                    $data['content'][] = $this->fetchFullContent($item_link, $linkObj->itemSchema->full_content_selector);
                                } else {
                                    $data[$key][] = $node->filter($val['selector'])->attr($val['attr']);
                                }
                            }
                        }
                    }

                    $data['category_id'][] = $linkObj->category->id;

                    $data['website_id'][] = $linkObj->website->id;

                });

Using the $node variable passed to the callback, we can filter the sub-elements we need, such as titles and images. To do so we loop over $translateExpre, the translated CSS expression, and collect the values into the $data array, which is then saved to the database.
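One thing to watch: each field is appended to its own list only when its selector matches, so an item that lacks, say, an excerpt shifts all later excerpts up by one position, and save() then pairs them with the wrong titles. A possible alternative, sketched below in plain PHP with made-up data, is to build one row per item and fill missing fields with an empty string. The helper collectRows() and the sample arrays are hypothetical and not part of the tutorial's code.

```php
<?php

// Hypothetical helper: build one complete row per scraped item so that
// fields stay paired even when an item is missing some of them.
function collectRows(array $items, array $fieldNames): array
{
    $rows = [];

    foreach ($items as $found) {
        $row = [];

        foreach ($fieldNames as $name) {
            // Missing fields become empty strings instead of being skipped
            $row[$name] = isset($found[$name]) ? $found[$name] : '';
        }

        $rows[] = $row;
    }

    return $rows;
}

// Made-up data standing in for what the crawler callback finds per node;
// the second item has no excerpt.
$items = [
    ['title' => 'First', 'excerpt' => 'Intro one', 'source_link' => '/a'],
    ['title' => 'Second', 'source_link' => '/b'],
    ['title' => 'Third', 'excerpt' => 'Intro three', 'source_link' => '/c'],
];

$rows = collectRows($items, ['title', 'excerpt', 'source_link']);

echo $rows[2]['title'] . ' => ' . $rows[2]['excerpt'] . PHP_EOL;
```

With per-field lists the third title would end up without its excerpt, because the excerpt list only has two entries. With one row per item the pairing stays intact.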

It’s better to wrap the code in a try/catch block, as the fetching process may fail at any time for many reasons, such as a network failure or no nodes matching the given expressions.
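As a side note, catching \Exception does not cover engine errors (the \Error classes) on PHP 7 or later. The sketch below, assuming PHP 7+, shows the same status convention as the Scraper class (1 on success, the error message otherwise) but catches \Throwable instead. The helper name runWithStatus() is hypothetical.

```php
<?php

// Hypothetical helper mirroring the Scraper's status convention:
// returns 1 on success, or the error message on failure.
function runWithStatus(callable $task)
{
    try {
        $task();

        return 1;
    } catch (\Throwable $ex) {
        // \Throwable covers both \Exception and \Error (PHP 7+)
        return $ex->getMessage();
    }
}

// An ordinary exception, such as a failed request
$a = runWithStatus(function () {
    throw new \RuntimeException('Connection timed out');
});

// An engine error that "catch (\Exception $ex)" would not intercept
$b = runWithStatus(function () {
    return intdiv(1, 0); // throws DivisionByZeroError
});

// A task that completes normally
$c = runWithStatus(function () {
    return 'ok';
});

var_dump($a, $b, $c);
```

Whether to widen the catch like this is a design choice: it keeps the AJAX response well-formed for more failure types, at the cost of hiding programming errors behind a status message.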

/resources/views/dashboard/link/index.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Links</h2>

            <div class="alert alert-success" style="display: none"></div>

            <a href="{{ route('links.create') }}" class="btn btn-warning pull-right">Add new</a>

            @if(count($links) > 0)

                <table class="table table-bordered">
                    <tr>
                        <td>Url</td>
                        <td>Main Filter Selector</td>
                        <td>Website</td>
                        <td>Assigned To Category</td>
                        <td><strong>Item Schema</strong></td>
                        <td><strong>Scrape Link</strong></td>
                        <td>Actions</td>
                    </tr>
                    @foreach($links as $link)
                        <tr data-id="{{ $link->id }}">
                            <td>{{ $link->url }}</td>
                            <td>{{ $link->main_filter_selector }}</td>
                            <td>{{ $link->website->title }} </td>
                            <td><strong><span class="label label-info">{{ $link->category->title }}</span></strong> </td>
                            <td>
                                <select class="item_schema" data-id="{{ $link->id }}" data-original-schema="{{$link->item_schema_id}}">
                                    <option value="" disabled selected>Select</option>
                                    @foreach($itemSchemas as $item)
                                        <option value="{{$item->id}}" {{ $item->id==$link->item_schema_id?"selected":"" }}>{{$item->title}}</option>
                                    @endforeach
                                </select>
                                <button type="button" class="btn btn-info btn-sm btn-apply" style="display: none">Apply</button>
                            </td>
                            <td>
                                @if($link->item_schema_id != "" && $link->main_filter_selector != "")
                                    <button type="button" class="btn btn-primary btn-scrape" title="pull the latest items">Scrape <i class="glyphicon glyphicon-repeat fast-right-spinner" style="display: none"></i></button>
                                @else
                                    <span style="color: red">fill main filter selector and item schema first</span>
                                @endif
                            </td>
                            <td>
                                <a href="{{ url('dashboard/links/' . $link->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i> </a>
                            </td>
                        </tr>
                    @endforeach
                </table>

                @if(count($links) > 0)
                    <div class="pagination">
                        <?php echo $links->render();  ?>
                    </div>
                @endif

            @else
                <i>No links found</i>

            @endif
        </div>
    </div>

@endsection

@section('script')
    <script>
        $(function () {
           $("select.item_schema").change(function () {
              if($(this).val() != $(this).attr("data-original-schema")) {
                  $(this).siblings('.btn-apply').show();
              }
           });
           
           $('.btn-apply').click(function () {

               var btn = $(this);

               var tRowId = $(this).parents("tr").attr("data-id");
               var schema_id = $(this).siblings('select').val();

               $.ajaxSetup({
                   headers: {
                       'X-CSRF-TOKEN': "{{ csrf_token() }}"
                   }
               });

               $.ajax({
                  url: "{{ url('dashboard/links/set-item-schema') }}",
                  data: {link_id: tRowId, item_schema_id: schema_id, _token: "{{ csrf_token() }}", _method: "patch"},
                  method: "post",
                  dataType: "json",
                  success: function (response) {
                      alert(response.msg);

                      btn.hide();
                  }
               });
           });
           
           $(".btn-scrape").click(function () {
               var btn = $(this);

               btn.find(".fast-right-spinner").show();

               btn.prop("disabled", true);

               var tRowId = $(this).parents("tr").attr("data-id");

               $.ajaxSetup({
                   headers: {
                       'X-CSRF-TOKEN': "{{ csrf_token() }}"
                   }
               });

               $.ajax({
                   url: "{{ url('dashboard/links/scrape') }}",
                   data: {link_id: tRowId, _token: "{{ csrf_token() }}"},
                   method: "post",
                   dataType: "json",
                   success: function (response) {

                       if(response.status == 1) {
                           $(".alert").removeClass("alert-danger").addClass("alert-success").text(response.msg).show();
                       } else {
                           $(".alert").removeClass("alert-success").addClass("alert-danger").text(response.msg).show();
                       }

                       btn.find(".fast-right-spinner").hide();
                   }
               });
           });
        });
    </script>
@endsection

/resources/views/dashboard/link/create.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Add Link</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('links.store') }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Url:</strong>

                            <input type="text" name="url" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Main Filter Selector:</strong>

                            <input type="text" name="main_filter_selector" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Website:</strong>

                            <select name="website_id" class="form-control">
                                <option value="">select</option>

                                @foreach($websites as $website)
                                    <option value="{{ $website->id }}">{{ $website->title }}</option>
                                @endforeach
                            </select>

                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Category:</strong>

                            <select name="category_id" class="form-control">
                                <option value="">select</option>

                                @foreach($categories as $category)
                                    <option value="{{ $category->id }}">{{ $category->title }}</option>
                                @endforeach
                            </select>

                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Create</button>

                </div>

            </form>
        </div>
    </div>

@endsection

/resources/views/dashboard/link/edit.blade.php

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Update Link #{{$link->id}}</h2>

            @if(session('error')!='')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)

                <div class="alert alert-danger">

                    <ul>

                        @foreach ($errors->all() as $error)

                            <li>{{ $error }}</li>

                        @endforeach

                    </ul>

                </div>

            @endif

            <form method="post" action="{{ route('links.update', $link->id) }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                {{ method_field("PUT") }}

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Url:</strong>

                            <input type="text" name="url" value="{{ $link->url }}" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Main Filter Selector:</strong>

                            <input type="text" name="main_filter_selector" value="{{ $link->main_filter_selector }}" class="form-control" />
                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Website:</strong>

                            <select name="website_id" class="form-control">
                                <option value="">select</option>

                                @foreach($websites as $website)
                                    <option value="{{ $website->id }}" {{ $website->id==$link->website_id?"selected":"" }}>{{ $website->title }}</option>
                                @endforeach
                            </select>

                        </div>
                    </div>
                </div>

                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">

                            <strong>Category:</strong>

                            <select name="category_id" class="form-control">
                                <option value="">select</option>

                                @foreach($categories as $category)
                                    <option value="{{ $category->id }}" {{ $category->id==$link->category_id?"selected":"" }}>{{ $category->title }}</option>
                                @endforeach
                            </select>

                        </div>
                    </div>
                </div>

                <div class="col-xs-12 col-sm-12 col-md-12 text-center">

                    <button type="submit" class="btn btn-primary" id="btn-save">Update</button>

                </div>

            </form>
        </div>
    </div>

@endsection

 

Create app/Http/Controllers/HomeController.php and add the code below. This will be our home controller:

<?php

namespace App\Http\Controllers;

use App\Article;
use App\Category;
use Illuminate\Http\Request;

class HomeController extends Controller
{
    public function index()
    {
        
    }

    public function getArticleDetails($id)
    {
        
    }

    public function getCategory($id)
    {
        
    }
}

 

Create resources/views/home.blade.php and add this code:

@extends('layout')

@section('content')

    <div class="row">
        <div class="col-md-12">
            <h2>Articles</h2>
            ....
        </div>
    </div>

@endsection

 

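Now modify routes/web.php to look like this. Note that the set-item-schema route is declared before the links resource, so that the resource's links/{link} pattern does not capture it: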

Route::get('/', 'HomeController@index');
Route::get('/article-details/{id}', 'HomeController@getArticleDetails');
Route::get('/category/{id}', 'HomeController@getCategory');

Route::group(['prefix' => 'dashboard'], function() {
    Route::resource('/websites', 'WebsitesController');
    Route::resource('/categories', 'CategoriesController');
    Route::patch('/links/set-item-schema', 'LinksController@setItemSchema');
    Route::post('/links/scrape', 'LinksController@scrape');
    Route::resource('/links', 'LinksController');
    Route::resource('/item-schema', 'ItemSchemaController');
    Route::resource('/articles', 'ArticlesController');
});

Modify resources/views/layout.blade.php and point the menu items to the actual links:

.....
<li><a href="{{ url('dashboard/websites') }}">Websites</a></li>
<li><a href="{{ url('dashboard/categories') }}">Categories</a></li>
<li><a href="{{ url('dashboard/links') }}">Links</a></li>
<li><a href="{{ url('dashboard/item-schema') }}">Item Schema</a></li>
<li role="separator" class="divider"></li>
<li><a href="{{ url('dashboard/articles') }}">Articles</a></li>
......

Now navigate to http://localhost/web_scraper/public/ and try adding websites and categories.

 

This video uses an example to demonstrate the process

In the final part of the tutorial we will implement the frontend and the article display.

 

Continue to part 3 >>> Article Display In Home Page

30 Comments
Fernando
2 years ago

Hi there , Thanks for this wonderful tutorial I just have a question , Im trying to scrap data from this website (diario.mx) and this is my schema

title[.contenido]||excerpt[.sumario]||image[img.img-fluid [src]]||source_link[.nota a[href]]

But I dont know what Im doing wrong . Do you have a chance to take a look ?

Thank you again
Regards

JayChou
1 year ago

Hi there, Thanks for this tutorial. I just have a question, I want save all img Avatar post and All img detail post into the folder upload on server. Can you help me?
Thanks.

JayChou
1 year ago
Reply to  Wael Salah

Ok, Thanks 🙂

Kaushal Raj Kandel
1 year ago

Method App\Http\Controllers\WebsitesController::uploadFile does not exist.
Can you help?

Evgeniy
1 year ago

Hi! Thnx 4 tutorial. I just have a question: Im trying to use proxy with scraping, can Your tell me, right way to use proxy.
Thank you!
Regards

Evgeniy
1 year ago
Reply to  Wael Salah

in Scraper.php Ill try:

public function __construct(GoutteClient $client)
{
$client = new GoutteClient;
$client->setClient(new GoutteClient(['proxy' => 'http://174.138.33.167:8080']));
$this->client = $client;
}
But on the links page, when I click the scrape button, the preloader on the button spins and nothing happens. What am I doing wrong?

Thank you!
Regards

Evgeniy
1 year ago
Reply to  Wael Salah

This code works for me:
First I installed Guzzle: php composer.phar require guzzlehttp/guzzle:~6.0

Then:
add use GuzzleHttp\Client as GuzzleClient;

and modify the code like this:
public function __construct(GoutteClient $client)
{
$proxy = new GuzzleClient(['proxy' => '87.251.238.156:51627']);
$client = new \Goutte\Client();
$client->setClient($proxy);
$this->client = $client;
}

Racheal Chapman
1 year ago

To keep you on track and ensure that your process is conducted smoothly, you can conduct web scraping with proxies so that you always secured. With proxy servers, you always receive content which does not cause any harm to your systems.

Ahmad
1 year ago

good
please give me file this program 🙂

ARIF
1 year ago

Where is ArticleController?

mahdy m
1 year ago

hi i am trying to scrap data from an ecommerce website, but no data is retrieved.
this is link and schema
https://shopee.co.id/shop/127192295/search
title[div._1JAmkB]||excerp[span._341bF0]||image[div.customized-overlay-image img[src]]||source_link[.col-xs-2-4 div a[href]]

Thank you again
Regards.

mahdy m
1 year ago
Reply to  Wael Salah

then how to scrape data from javascript website? what’s with this isn’t possible? thanks

soyae
1 year ago

my edit.blade.php is eror

soyae
1 year ago
Reply to  soyae

Missing required parameters for [Route: item-schema.update] [URI: dashboard/item-schema/{item_schema}]. (View: D:\Projects\Web\translator\resources\views\dashboard\item_schema\edit.blade.php)

sorry this is my first time using laravel

Igor
1 year ago
Reply to  Wael Salah

Dear Wael,
I join to all thnx for your tutorial.

Yep, I did follow it step by step))

But I have the same error. And have solved it somehow.

Please tell us why this doesn’t work:

action="{{ route('websites.update', ['id' => $website->id]) }}"

But this works:

action="{{ route('websites.update', $website->id) }}".

Laravel 6.12

soyae
1 year ago

after scraping is done, it’s not showing in the home

masjack
11 months ago
Reply to  soyae

check your HomeController,
public function index()
{
return view('home');
}

i hope this help…