In this part of the tutorial series (Building a Simple Scraping Website) we will create a simple dashboard and the CRUD operations for our modules.
Series Topics:
- Building A Simple Scraping Website With PHP Laravel Part1: Beginning
- Building A Simple Scraping Website With PHP Laravel Part2: Dashboard and Crud
- Building A Simple Scraping Website With PHP Laravel Part3: Article Display
First we need CRUD modules for the database tables we created in the previous part. I will not explain these modules in detail because they are ordinary CRUD operations, but I will explain the parts required for scraping.
Creating Controllers
Let’s first create the required controllers. I will use resource controllers here, so run these commands in the terminal:
php artisan make:controller CategoriesController --resource
php artisan make:controller WebsitesController --resource
php artisan make:controller LinksController --resource
php artisan make:controller ItemSchemaController --resource
php artisan make:controller ArticlesController --resource
Next, let’s add the routes for those controllers.
Open routes/web.php and add the code below:
Route::group(['prefix' => 'dashboard'], function() {
    Route::resource('/websites', 'WebsitesController');
    Route::resource('/categories', 'CategoriesController');
    Route::resource('/links', 'LinksController');
    Route::resource('/item-schema', 'ItemSchemaController');
    Route::resource('/articles', 'ArticlesController');
});
As you can see, all the module URLs live under the dashboard group, so to reach the article list, for example, go to /dashboard/articles.
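First, modify app/Http/Controllers/Controller.php to add an uploadFile() helper that the other controllers will use for file uploads:
<?php

namespace App\Http\Controllers;

use Illuminate\Foundation\Bus\DispatchesJobs;
use Illuminate\Routing\Controller as BaseController;
use Illuminate\Foundation\Validation\ValidatesRequests;
use Illuminate\Foundation\Auth\Access\AuthorizesRequests;

class Controller extends BaseController
{
    use AuthorizesRequests, DispatchesJobs, ValidatesRequests;

    /**
     * Move an uploaded file into $destination under a timestamp-based name.
     * Returns state 1 and the file name on success, state 0 and "" on failure.
     */
    function uploadFile($name, $destination, $request = null)
    {
        try {
            // fall back to the current request when none is passed in
            $request = $request ?: request();
            $image = $request->file($name);
            $fileName = time() . '.' . $image->getClientOriginalExtension();
            $image->move($destination, $fileName);

            return ["state" => 1, "filename" => $fileName];
        } catch (\Throwable $ex) {
            // \Throwable also covers the Error raised when no file was uploaded
            return ["state" => 0, "filename" => ""];
        }
    }
}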
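One thing to watch in uploadFile(): a name built only from time() can collide when two files are uploaded in the same second. Below is a minimal sketch of a more collision-resistant name. The uniqueFileName() helper is hypothetical and not part of the original code:

```php
<?php

/**
 * Hypothetical helper (not in the original tutorial): builds a file name
 * from the current timestamp plus a random suffix.
 */
function uniqueFileName(string $extension): string
{
    // 8 random bytes rendered as 16 hex characters
    $suffix = bin2hex(random_bytes(8));

    // accept the extension with or without a leading dot
    return time() . '_' . $suffix . '.' . ltrim($extension, '.');
}

$first = uniqueFileName('png');
$second = uniqueFileName('.png');

echo $first, PHP_EOL;
echo $second, PHP_EOL;
```

Inside uploadFile() this could stand in for the time() . '.' . $extension expression, if you decide collisions are a concern for your uploads.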
Adding Categories
Let’s start with the category module. Open app/Http/Controllers/CategoriesController.php and modify it like this:
<?php namespace App\Http\Controllers; use App\Category; use Illuminate\Http\Request; class CategoriesController extends Controller { /** * Display a listing of the resource. * * @return \Illuminate\Http\Response */ public function index() { $cats = Category::orderBy('id', 'DESC')->paginate(10); return view('dashboard.category.index')->withCategories($cats); } /** * Show the form for creating a new resource. * * @return \Illuminate\Http\Response */ public function create() { return view('dashboard.category.create'); } /** * Store a newly created resource in storage. * * @param \Illuminate\Http\Request $request * @return \Illuminate\Http\Response */ public function store(Request $request) { $this->validate($request, [ 'title' => 'required' ]); $cat = new Category; $cat->title = $request->input('title'); $cat->save(); return redirect()->route('categories.index'); } /** * Display the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function show($id) { // } /** * Show the form for editing the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function edit($id) { return view('dashboard.category.edit')->withCategory(Category::find($id)); } /** * Update the specified resource in storage. * * @param \Illuminate\Http\Request $request * @param int $id * @return \Illuminate\Http\Response */ public function update(Request $request, $id) { $this->validate($request, [ 'title' => 'required' ]); $cat = Category::find($id); $cat->title = $request->input('title'); $cat->save(); return redirect()->route('categories.index'); } /** * Remove the specified resource from storage. * * @param int $id * @return \Illuminate\Http\Response */ public function destroy($id) { // } }
Next create a new view resources/views/dashboard/category/index.blade.php and add the below code:
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Categories</h2> <a href="{{ route('categories.create') }}" class="btn btn-warning pull-right">Add new</a> @if(count($categories) > 0) <table class="table table-bordered"> <tr> <td>Title</td> <td>Actions</td> </tr> @foreach($categories as $cat) <tr> <td>{{ $cat->title }}</td> <td> <a href="{{ url('dashboard/categories/' . $cat->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i> </a> </td> </tr> @endforeach </table> @if(count($categories) > 0) <div class="pagination"> <?php echo $categories->render(); ?> </div> @endif @else <i>No categories found</i> @endif </div> </div> @endsection
resources/views/dashboard/category/create.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Add Category</h2> @if(session('error')!='') <div class="alert alert-danger"> {{ session('error') }} </div> @endif @if (count($errors) > 0) <div class="alert alert-danger"> <ul> @foreach ($errors->all() as $error) <li>{{ $error }}</li> @endforeach </ul> </div> @endif <form method="post" action="{{ route('categories.store') }}" enctype="multipart/form-data"> {{ csrf_field() }} <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Title:</strong> <input type="text" name="title" class="form-control" /> </div> </div> </div> <div class="col-xs-12 col-sm-12 col-md-12 text-center"> <button type="submit" class="btn btn-primary" id="btn-save">Create</button> </div> </form> </div> </div> @endsection
resources/views/dashboard/category/edit.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Update Category #{{$category->id}}</h2> @if(session('error')!='') <div class="alert alert-danger"> {{ session('error') }} </div> @endif @if (count($errors) > 0) <div class="alert alert-danger"> <ul> @foreach ($errors->all() as $error) <li>{{ $error }}</li> @endforeach </ul> </div> @endif <form method="post" action="{{ route('categories.update', ['id' => $category->id]) }}" enctype="multipart/form-data"> {{ csrf_field() }} {{ method_field("PUT") }} <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Title:</strong> <input type="text" name="title" value="{{ $category->title }}" class="form-control" /> </div> </div> </div> <div class="col-xs-12 col-sm-12 col-md-12 text-center"> <button type="submit" class="btn btn-primary" id="btn-save">Update</button> </div> </form> </div> </div> @endsection
The code above is self-explanatory: it is a simple CRUD for the categories module. First we updated the categories controller, then we added three views to create, edit, and list categories.
Adding Websites
Now open up app/Http/Controllers/WebsitesController.php and modify it like this:
<?php namespace App\Http\Controllers; use App\Website; use Illuminate\Http\Request; class WebsitesController extends Controller { /** * Display a listing of the resource. * * @return \Illuminate\Http\Response */ public function index() { $websites = Website::orderBy('id', 'DESC')->paginate(10); return view('dashboard.website.index')->withWebsites($websites); } /** * Show the form for creating a new resource. * * @return \Illuminate\Http\Response */ public function create() { return view('dashboard.website.create'); } /** * Store a newly created resource in storage. * * @param \Illuminate\Http\Request $request * @return \Illuminate\Http\Response */ public function store(Request $request) { $this->validate($request, [ 'title' => 'required', 'url' => 'required', 'logo' => 'required' ]); $website = new Website; $website->title = $request->input('title'); $website->url = $request->input('url'); $website->logo = $this->uploadFile('logo', public_path('uploads/'), $request)["filename"]; $website->save(); return redirect()->route('websites.index'); } /** * Display the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function show($id) { // } /** * Show the form for editing the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function edit($id) { return view('dashboard.website.edit')->withWebsite(Website::find($id)); } /** * Update the specified resource in storage. 
* * @param \Illuminate\Http\Request $request * @param int $id * @return \Illuminate\Http\Response */ public function update(Request $request, $id) { $this->validate($request, [ 'title' => 'required', 'url' => 'required' ]); $website = Website::find($id); $website->title = $request->input('title'); $website->url = $request->input('url'); if($request->file('logo') != null) { $website->logo = $this->uploadFile('logo', public_path('uploads/'), $request)["filename"]; } $website->save(); return redirect()->route('websites.index'); } /** * Remove the specified resource from storage. * * @param int $id * @return \Illuminate\Http\Response */ public function destroy($id) { // } }
resources/views/dashboard/website/index.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Websites</h2> <a href="{{ route('websites.create') }}" class="btn btn-warning pull-right">Add new</a> @if(count($websites) > 0) <table class="table table-bordered"> <tr> <td>Title</td> <td>Logo</td> <td>Url</td> <td>Actions</td> </tr> @foreach($websites as $website) <tr> <td>{{ $website->title }}</td> <td><img width="150" src="{{ url('uploads/' . $website->logo) }}" /></td> <td><a href="{{ $website->url }}">{{ $website->url }}</a> </td> <td> <a href="{{ url('dashboard/websites/' . $website->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i> </a> </td> </tr> @endforeach </table> @if(count($websites) > 0) <div class="pagination"> <?php echo $websites->render(); ?> </div> @endif @else <i>No websites found</i> @endif </div> </div> @endsection
resources/views/dashboard/website/create.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Add Website</h2> @if(session('error')!='') <div class="alert alert-danger"> {{ session('error') }} </div> @endif @if (count($errors) > 0) <div class="alert alert-danger"> <ul> @foreach ($errors->all() as $error) <li>{{ $error }}</li> @endforeach </ul> </div> @endif <form method="post" action="{{ route('websites.store') }}" enctype="multipart/form-data"> {{ csrf_field() }} <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Title:</strong> <input type="text" name="title" class="form-control" /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Url:</strong> <input type="text" name="url" class="form-control" /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Logo:</strong> <input type="file" name="logo" class="form-control" /> </div> </div> </div> <div class="col-xs-12 col-sm-12 col-md-12 text-center"> <button type="submit" class="btn btn-primary" id="btn-save">Create</button> </div> </form> </div> </div> @endsection
resources/views/dashboard/website/edit.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Update Website #{{$website->id}}</h2> @if(session('error')!='') <div class="alert alert-danger"> {{ session('error') }} </div> @endif @if (count($errors) > 0) <div class="alert alert-danger"> <ul> @foreach ($errors->all() as $error) <li>{{ $error }}</li> @endforeach </ul> </div> @endif <form method="post" action="{{ route('websites.update', ['id' => $website->id]) }}" enctype="multipart/form-data"> {{ csrf_field() }} {{ method_field("PUT") }} <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Title:</strong> <input type="text" name="title" value="{{ $website->title }}" class="form-control" /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Url:</strong> <input type="text" name="url" value="{{ $website->url }}" class="form-control" /> </div> </div> </div> <div class="row"> @if($website->logo != "") <img src="{{ url('uploads/' . $website->logo) }}" width="150" /> @endif <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Logo:</strong> <input type="file" name="logo" class="form-control" /> </div> </div> </div> <div class="col-xs-12 col-sm-12 col-md-12 text-center"> <button type="submit" class="btn btn-primary" id="btn-save">Update</button> </div> </form> </div> </div> @endsection
Make sure to create an uploads/ directory inside the public/ folder and make it writable. This directory stores the uploaded website logos, as shown in the code above.
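If you prefer to create the directory from code rather than by hand, the following is a minimal sketch. The ensureWritableDir() helper is hypothetical, and the demo uses a temporary directory instead of public/uploads so that it can run anywhere:

```php
<?php

/**
 * Hypothetical helper: creates $path if it does not exist yet and
 * reports whether the directory is writable afterwards.
 */
function ensureWritableDir(string $path, int $mode = 0775): bool
{
    // the second is_dir() check covers a directory created concurrently
    if (!is_dir($path) && !mkdir($path, $mode, true) && !is_dir($path)) {
        return false;
    }

    return is_writable($path);
}

// demo on a throwaway directory under the system temp folder
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'uploads_demo_' . uniqid();
var_dump(ensureWritableDir($dir));

// clean up the demo directory
rmdir($dir);
```

In the Laravel application the call would presumably be ensureWritableDir(public_path('uploads')), matching the path passed to uploadFile() in the controller.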
Adding Item Schema
As mentioned in the previous article, an item schema represents the structure of a single item in a list of items, so we need to construct an expression that describes that structure.
Open app/Http/Controllers/ItemSchemaController.php and modify it like this:
<?php namespace App\Http\Controllers; use App\ItemSchema; use Illuminate\Http\Request; class ItemSchemaController extends Controller { /** * Display a listing of the resource. * * @return \Illuminate\Http\Response */ public function index() { $itemSchema = ItemSchema::orderBy('id', 'DESC')->paginate(10); return view('dashboard.item_schema.index')->withItemSchemas($itemSchema); } /** * Show the form for creating a new resource. * * @return \Illuminate\Http\Response */ public function create() { return view('dashboard.item_schema.create'); } /** * Store a newly created resource in storage. * * @param \Illuminate\Http\Request $request * @return \Illuminate\Http\Response */ public function store(Request $request) { $this->validate($request, [ 'title' => 'required', 'css_expression' => 'required', 'full_content_selector' => 'required' ]); $itemSchema = new ItemSchema; $itemSchema->title = $request->input('title'); if($request->input('is_full_url') != null) { $itemSchema->is_full_url = 1; } else { $itemSchema->is_full_url = 0; } $itemSchema->css_expression = $request->input('css_expression'); $itemSchema->full_content_selector = $request->input('full_content_selector'); $itemSchema->save(); return redirect()->route('item-schema.index'); } /** * Display the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function show($id) { // } /** * Show the form for editing the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function edit($id) { return view('dashboard.item_schema.edit')->withItemSchema(ItemSchema::find($id)); } /** * Update the specified resource in storage. 
* * @param \Illuminate\Http\Request $request * @param int $id * @return \Illuminate\Http\Response */ public function update(Request $request, $id) { $this->validate($request, [ 'title' => 'required', 'css_expression' => 'required', 'full_content_selector' => 'required' ]); $itemSchema = ItemSchema::find($id); $itemSchema->title = $request->input('title'); if($request->input('is_full_url') != null) { $itemSchema->is_full_url = 1; } else { $itemSchema->is_full_url = 0; } $itemSchema->css_expression = $request->input('css_expression'); $itemSchema->full_content_selector = $request->input('full_content_selector'); $itemSchema->save(); return redirect()->route('item-schema.index'); } /** * Remove the specified resource from storage. * * @param int $id * @return \Illuminate\Http\Response */ public function destroy($id) { } }
resources/views/dashboard/item_schema/index.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Item Schema</h2> <a href="{{ route('item-schema.create') }}" class="btn btn-warning pull-right">Add new</a> @if(count($itemSchemas) > 0) <table class="table table-bordered"> <tr> <td>Title</td> <td>CSS Expression</td> <td>Is Full Url To Article</td> <td>Full content selector</td> <td>Actions</td> </tr> @foreach($itemSchemas as $item) <tr> <td>{{ $item->title }}</td> <td>{{ $item->css_expression }}</td> <td>{{ $item->is_full_url==1?"Yes":"No" }}</td> <td>{{ $item->full_content_selector }}</td> <td> <a href="{{ url('dashboard/item-schema/' . $item->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i> </a> </td> </tr> @endforeach </table> @if(count($itemSchemas) > 0) <div class="pagination"> <?php echo $itemSchemas->render(); ?> </div> @endif @else <i>No items found</i> @endif </div> </div> @endsection
resources/views/dashboard/item_schema/create.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Add Item Schema</h2> @if(session('error')!='') <div class="alert alert-danger"> {{ session('error') }} </div> @endif @if (count($errors) > 0) <div class="alert alert-danger"> <ul> @foreach ($errors->all() as $error) <li>{{ $error }}</li> @endforeach </ul> </div> @endif <form method="post" action="{{ route('item-schema.store') }}" enctype="multipart/form-data"> {{ csrf_field() }} <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Title:</strong> <input type="text" name="title" class="form-control" /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>CSS Expression:</strong> <input type="text" name="css_expression" class="form-control" /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Is Full Url To Article/Partial Url:</strong> <input type="checkbox" name="is_full_url" value="1" checked /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Full content selector:</strong> <input type="text" name="full_content_selector" class="form-control" /> </div> </div> </div> <div class="col-xs-12 col-sm-12 col-md-12 text-center"> <button type="submit" class="btn btn-primary" id="btn-save">Create</button> </div> </form> </div> </div> @endsection
resources/views/dashboard/item_schema/edit.blade.php
@extends('layout') @section('content') <div class="row"> <div class="col-md-12"> <h2>Update Item Schema #{{$itemSchema->id}}</h2> @if(session('error')!='') <div class="alert alert-danger"> {{ session('error') }} </div> @endif @if (count($errors) > 0) <div class="alert alert-danger"> <ul> @foreach ($errors->all() as $error) <li>{{ $error }}</li> @endforeach </ul> </div> @endif <form method="post" action="{{ route('item-schema.update', ['id' => $itemSchema->id]) }}" enctype="multipart/form-data"> {{ csrf_field() }} {{ method_field("PUT") }} <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Title:</strong> <input type="text" name="title" value="{{ $itemSchema->title }}" class="form-control" /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>CSS Expression:</strong> <input type="text" name="css_expression" value="{{ $itemSchema->css_expression }}" class="form-control" /> <div class="help-block">CSS expression identifies css selector for specific item in a single article separated by ||. i.e h2.post_title for title</div> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Is Full Url To Article/Partial Url:</strong> <input type="checkbox" name="is_full_url" value="1" {{ $itemSchema->is_full_url?"checked":"" }} /> </div> </div> </div> <div class="row"> <div class="col-xs-12 col-sm-12 col-md-6"> <div class="form-group"> <strong>Full content selector:</strong> <input type="text" name="full_content_selector" value="{{ $itemSchema->full_content_selector }}" class="form-control" /> </div> </div> </div> <div class="col-xs-12 col-sm-12 col-md-12 text-center"> <button type="submit" class="btn btn-primary" id="btn-save">Update</button> </div> </form> </div> </div> @endsection
A typical CSS expression has this structure:
field1[css selector1]||field2[css selector2]||field3[css selector3[attribute]]
For example, the item schema expression to pull articles from the New York Times website looks like this:
title[h2.css-1dq8tca]||excerpt[p.css-1echdzn]||image[img.css-11cwn6f[src]]||source_link[.css-4jyr1y a[href]]
As shown, the expression lists every field of data that needs to be fetched, separated by “||”. Every field has two parts: the field name, followed by the CSS selector between brackets “[]”. The field name must match the column name in the database. For attributes such as an image src, we add the attribute name inside “[]” after the CSS selector.
Adding Links
The links module is the most important module: it stores the links we will fetch data from and triggers the actual scraping process.
app/Http/Controllers/LinksController.php
<?php namespace App\Http\Controllers; use App\Category; use App\ItemSchema; use App\Lib\Scraper; use App\Link; use App\Website; use Illuminate\Http\Request; use Goutte\Client; class LinksController extends Controller { /** * Display a listing of the resource. * * @return \Illuminate\Http\Response */ public function index() { $links = Link::orderBy('id', 'DESC')->paginate(10); $itemSchemas = ItemSchema::all(); return view('dashboard.link.index')->withLinks($links)->withItemSchemas($itemSchemas); } /** * Show the form for creating a new resource. * * @return \Illuminate\Http\Response */ public function create() { $categories = Category::all(); $websites = Website::all(); return view('dashboard.link.create')->withCategories($categories)->withWebsites($websites); } /** * Store a newly created resource in storage. * * @param \Illuminate\Http\Request $request * @return \Illuminate\Http\Response */ public function store(Request $request) { $this->validate($request, [ 'url' => 'required', 'main_filter_selector' => 'required', 'website_id' => 'required', 'category_id' => 'required' ]); $link = new Link; $link->url = $request->input('url'); $link->main_filter_selector = $request->input('main_filter_selector'); $link->website_id = $request->input('website_id'); $link->category_id = $request->input('category_id'); $link->save(); return redirect()->route('links.index'); } /** * Display the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function show($id) { // } /** * Show the form for editing the specified resource. * * @param int $id * @return \Illuminate\Http\Response */ public function edit($id) { $categories = Category::all(); $websites = Website::all(); return view('dashboard.link.edit')->withLink(Link::find($id))->withCategories($categories)->withWebsites($websites); } /** * Update the specified resource in storage. 
     *
     * @param  \Illuminate\Http\Request  $request
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function update(Request $request, $id)
    {
        $this->validate($request, [
            'url' => 'required',
            'main_filter_selector' => 'required',
            'website_id' => 'required',
            'category_id' => 'required'
        ]);

        $link = Link::find($id);
        $link->url = $request->input('url');
        $link->main_filter_selector = $request->input('main_filter_selector');
        $link->website_id = $request->input('website_id');
        $link->category_id = $request->input('category_id');
        $link->save();

        return redirect()->route('links.index');
    }

    /**
     * Remove the specified resource from storage.
     *
     * @param  int  $id
     * @return \Illuminate\Http\Response
     */
    public function destroy($id)
    {
        //
    }

    /**
     * assign an item schema to a link
     *
     * @param Request $request
     */
    public function setItemSchema(Request $request)
    {
        // both ids are required, so bail out if either one is missing
        if (!$request->item_schema_id || !$request->link_id) {
            return;
        }

        $link = Link::find($request->link_id);
        $link->item_schema_id = $request->item_schema_id;
        $link->save();

        return response()->json(['msg' => 'Link updated!']);
    }

    /**
     * scrape specific link
     *
     * @param Request $request
     */
    public function scrape(Request $request)
    {
        if (!$request->link_id) {
            return;
        }

        $link = Link::find($request->link_id);

        // scraping needs both the main filter selector and an item schema
        if (empty($link->main_filter_selector) || empty($link->item_schema_id) || $link->item_schema_id == 0) {
            return;
        }

        $scraper = new Scraper(new Client());
        $scraper->handle($link);

        if ($scraper->status == 1) {
            return response()->json(['status' => 1, 'msg' => 'Scraping done']);
        } else {
            return response()->json(['status' => 2, 'msg' => $scraper->status]);
        }
    }
}
In the code above, focus on the scrape() method. It does the actual scraping by delegating to another class, Scraper, which we will implement shortly, to fetch and scrape a given link:
$scraper = new Scraper(new Client());
$scraper->handle($link);

if ($scraper->status == 1) {
    return response()->json(['status' => 1, 'msg' => 'Scraping done']);
} else {
    return response()->json(['status' => 2, 'msg' => $scraper->status]);
}
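The links index view shown later calls two extra endpoints, dashboard/links/set-item-schema (sent as PATCH) and dashboard/links/scrape (sent as POST), which the resource routes do not cover. They have not been registered at this point in the tutorial, so the following is only a sketch of what the routes could look like, assuming they go inside the same dashboard group in routes/web.php and use the same string controller syntax:

```php
Route::group(['prefix' => 'dashboard'], function() {
    // Assumed routes, not taken from the original code. They are declared
    // before the resource route so that PATCH links/set-item-schema is not
    // captured by the resource's links/{link} update route.
    Route::patch('/links/set-item-schema', 'LinksController@setItemSchema');
    Route::post('/links/scrape', 'LinksController@scrape');

    Route::resource('/links', 'LinksController');
    // ... the remaining resource routes stay unchanged
});
```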
Create a new file app/Lib/Scraper.php and add the code below:
<?php namespace App\Lib; use App\Article; use Goutte\Client as GoutteClient; /** * Class Scraper * * handles and process scraping using specific link * first we work on the main filter expression which is the * the container of the items, then using annonymous callback * on the filter function we iterate and save the results * into the article table * * @package App\Lib */ class Scraper { protected $client; public $results = []; public $savedItems = 0; public $status = 1; public function __construct(GoutteClient $client) { $this->client = $client; } public function handle($linkObj) { try { $crawler = $this->client->request('GET', $linkObj->url); $translateExpre = $this->translateCSSExpression($linkObj->itemSchema->css_expression); if (isset($translateExpre['title'])) { $data = []; // filter $crawler->filter($linkObj->main_filter_selector)->each(function ($node) use ($translateExpre, &$data, $linkObj) { // using the $node var we can access sub elements deep the tree foreach ($translateExpre as $key => $val) { if($node->filter($val['selector'])->count() > 0) { if ($val['is_attribute'] == false) { $data[$key][] = preg_replace("#\n|'|\"#",'', $node->filter($val['selector'])->text()); } else { if ($key == 'source_link') { $item_link = $node->filter($val['selector'])->attr($val['attr']); // append website url in case the article is not full url if ($linkObj->itemSchema->is_full_url == 0) { $item_link = $linkObj->website->url . 
$node->filter($val['selector'])->attr($val['attr']); } $data[$key][] = $item_link; $data['content'][] = $this->fetchFullContent($item_link, $linkObj->itemSchema->full_content_selector); } else { $data[$key][] = $node->filter($val['selector'])->attr($val['attr']); } } } } $data['category_id'][] = $linkObj->category->id; $data['website_id'][] = $linkObj->website->id; }); //dd($data); $this->save($data); $this->results = $data; } } catch (\Exception $ex) { $this->status = $ex->getMessage(); } } /** * fetchFullContent * * this method pulls the full content of a single item using the * item url and selector * * @param $item_url * @param $selector * @return string */ protected function fetchFullContent($item_url, $selector) { try { $crawler = $this->client->request('GET', $item_url); return $crawler->filter($selector)->html(); } catch (\Exception $ex) { return ""; } } protected function save($data) { foreach ($data['title'] as $k => $val) { $checkExist = Article::where('source_link', $data['source_link'][$k])->first(); if(!isset($checkExist->id)) { $article = new Article(); $article->title = $val; $article->excerpt = isset($data['excerpt'][$k]) ? $data['excerpt'][$k] : ""; $article->content = isset($data['content'][$k]) ? $data['content'][$k] : ""; $article->image = isset($data['image'][$k]) ? 
$data['image'][$k] : ""; $article->source_link = $data['source_link'][$k]; $article->category_id = $data['category_id'][$k]; $article->website_id = $data['website_id'][$k]; $article->save(); $this->savedItems++; } } } /** * translateCSSExpression * * translate the css expression into corresponding fields and sub selectors * * @param $expression * @return array */ protected function translateCSSExpression($expression) { $exprArray = explode("||", $expression); // try to match split that expression into pieces $regex = '/(.*?)\[(.*)\]/m'; $fields = []; foreach ($exprArray as $subExpr) { preg_match($regex, $subExpr, $matches); if(isset($matches[1]) && isset($matches[2])) { $is_attribute = false; $selector = $matches[2]; $attr = ""; // if this condition meets then this is attribute like img[src] or a[href] if (strpos($selector, "[") !== false && strpos($selector, "]") !== false) { $is_attribute = true; preg_match($regex, $matches[2], $matches_attr); $selector = $matches_attr[1]; $attr = $matches_attr[2]; } $fields[$matches[1]] = ['field' => $matches[1], 'is_attribute' => $is_attribute, 'selector' => $selector, 'attr' => $attr]; } } return $fields; } }
The main method in the above code is handle(). It relies on the Goutte client package: it takes a link object and creates a crawler object from the given URL.
Then it translates the CSS expression of the item schema attached to that link into an array of fields and their selectors, using the translateCSSExpression() method:
protected function translateCSSExpression($expression)
{
    $exprArray = explode("||", $expression);

    // try to split each sub expression into its field name and selector
    $regex = '/(.*?)\[(.*)\]/m';

    $fields = [];

    foreach ($exprArray as $subExpr) {
        preg_match($regex, $subExpr, $matches);

        if (isset($matches[1]) && isset($matches[2])) {
            $is_attribute = false;
            $selector = $matches[2];
            $attr = "";

            // if this condition is met then this is an attribute like img[src] or a[href]
            if (strpos($selector, "[") !== false && strpos($selector, "]") !== false) {
                $is_attribute = true;
                preg_match($regex, $matches[2], $matches_attr);
                $selector = $matches_attr[1];
                $attr = $matches_attr[2];
            }

            $fields[$matches[1]] = ['field' => $matches[1], 'is_attribute' => $is_attribute, 'selector' => $selector, 'attr' => $attr];
        }
    }

    return $fields;
}
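To see what this method produces, here is a standalone sketch that copies the same parsing logic into a plain function and runs it on the New York Times expression from earlier. The name parseCssExpression() is made up for this demo:

```php
<?php

/**
 * Demo copy of the translateCSSExpression() logic as a plain function.
 */
function parseCssExpression(string $expression): array
{
    $regex = '/(.*?)\[(.*)\]/m';
    $fields = [];

    foreach (explode('||', $expression) as $subExpr) {
        // skip pieces that do not look like name[selector]
        if (!preg_match($regex, $subExpr, $matches)) {
            continue;
        }

        $selector = $matches[2];
        $attr = '';
        $isAttribute = false;

        // a nested [..] is treated as an attribute, e.g. img[src] or a[href]
        if (strpos($selector, '[') !== false && strpos($selector, ']') !== false) {
            $isAttribute = true;
            preg_match($regex, $selector, $matchesAttr);
            $selector = $matchesAttr[1];
            $attr = $matchesAttr[2];
        }

        $fields[$matches[1]] = [
            'field' => $matches[1],
            'is_attribute' => $isAttribute,
            'selector' => $selector,
            'attr' => $attr,
        ];
    }

    return $fields;
}

$fields = parseCssExpression(
    'title[h2.css-1dq8tca]||excerpt[p.css-1echdzn]||image[img.css-11cwn6f[src]]||source_link[.css-4jyr1y a[href]]'
);

print_r($fields['title']);
print_r($fields['source_link']);
```

Here title maps to the selector h2.css-1dq8tca with no attribute, while source_link maps to the selector .css-4jyr1y a with the attribute href. Note that with this parsing approach a selector that itself contains a CSS attribute selector, such as a[target=_blank], would also be read as a selector plus an attribute.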
After converting the expression into an array, we move on to filtering: we pass the main filter selector to the filter() method, which gives us a collection of matching nodes, and iterate over them with each(). Inside the callback we extract the different pieces of data like this:
$crawler->filter($linkObj->main_filter_selector)->each(function ($node) use ($translateExpre, &$data, $linkObj) {
    // using the $node var we can access sub elements deep in the tree
    foreach ($translateExpre as $key => $val) {
        if ($node->filter($val['selector'])->count() > 0) {
            if ($val['is_attribute'] == false) {
                $data[$key][] = preg_replace("#\n|'|\"#", '', $node->filter($val['selector'])->text());
            } else {
                if ($key == 'source_link') {
                    $item_link = $node->filter($val['selector'])->attr($val['attr']);

                    // prepend the website url in case the article link is not a full url
                    if ($linkObj->itemSchema->is_full_url == 0) {
                        $item_link = $linkObj->website->url . $node->filter($val['selector'])->attr($val['attr']);
                    }

                    $data[$key][] = $item_link;
                    $data['content'][] = $this->fetchFullContent($item_link, $linkObj->itemSchema->full_content_selector);
                } else {
                    $data[$key][] = $node->filter($val['selector'])->attr($val['attr']);
                }
            }
        }
    }

    $data['category_id'][] = $linkObj->category->id;
    $data['website_id'][] = $linkObj->website->id;
});
Using the $node variable passed to the callback, we can filter the sub-elements we need, such as titles and images. To do that we loop over $translateExpre, the translated CSS expression, and collect the results in the $data array, which is then saved to the database.
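When is_full_url is 0, the callback builds the article URL by concatenating the website URL and the scraped href. Plain concatenation only works when exactly one of the two parts carries the slash. Below is a small sketch of a more forgiving join. The joinUrl() helper is hypothetical and not part of the original Scraper class:

```php
<?php

/**
 * Hypothetical helper: joins a base URL and a path with exactly one slash.
 */
function joinUrl(string $base, string $path): string
{
    // the link is already absolute, so return it untouched
    if (preg_match('#^https?://#i', $path)) {
        return $path;
    }

    return rtrim($base, '/') . '/' . ltrim($path, '/');
}

echo joinUrl('https://example.com/', '/news/item-1'), PHP_EOL;
echo joinUrl('https://example.com', 'news/item-1'), PHP_EOL;
echo joinUrl('https://example.com', 'https://other.example/a'), PHP_EOL;
```

The first two calls both print https://example.com/news/item-1, and the third returns the absolute link unchanged. This sketch does not resolve ../ segments or protocol-relative links.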
It’s better to wrap this code in a try/catch block, because fetching can fail at any time for many reasons, such as a network failure or no nodes matching the given expressions.
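One subtlety in the callback above: values are appended to a separate array per field, so if one item has no match for a field (for example, a missing excerpt), the later values of that field shift up by one position and no longer line up with their titles. A possible alternative is to build one complete row per item. The sketch below uses plain arrays in place of crawler nodes, and the collectRows() function and the sample data are made up for illustration:

```php
<?php

/**
 * Hypothetical helper: turns a list of scraped items into complete rows,
 * one per item, so the fields of an item always stay together.
 */
function collectRows(array $items, array $fieldNames): array
{
    $rows = [];

    foreach ($items as $item) {
        $row = [];

        foreach ($fieldNames as $name) {
            // missing fields become empty strings instead of being skipped
            $row[$name] = isset($item[$name]) ? $item[$name] : '';
        }

        $rows[] = $row;
    }

    return $rows;
}

// sample data: the second item has no excerpt
$items = [
    ['title' => 'First', 'excerpt' => 'About the first'],
    ['title' => 'Second'],
    ['title' => 'Third', 'excerpt' => 'About the third'],
];

$rows = collectRows($items, ['title', 'excerpt']);

echo $rows[2]['title'], ' => ', $rows[2]['excerpt'], PHP_EOL;
```

With rows built this way, the third title still pairs with its own excerpt even though the second item had none.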
/resources/views/dashboard/link/index.blade.php
@extends('layout')

@section('content')
    <div class="row">
        <div class="col-md-12">
            <h2>Links</h2>
            <div class="alert alert-success" style="display: none"></div>
            <a href="{{ route('links.create') }}" class="btn btn-warning pull-right">Add new</a>

            @if(count($links) > 0)
                <table class="table table-bordered">
                    <tr>
                        <td>Url</td>
                        <td>Main Filter Selector</td>
                        <td>Website</td>
                        <td>Assigned To Category</td>
                        <td><strong>Item Schema</strong></td>
                        <td><strong>Scrape Link</strong></td>
                        <td>Actions</td>
                    </tr>
                    @foreach($links as $link)
                        <tr data-id="{{ $link->id }}">
                            <td>{{ $link->url }}</td>
                            <td>{{ $link->main_filter_selector }}</td>
                            <td>{{ $link->website->title }}</td>
                            <td><strong><span class="label label-info">{{ $link->category->title }}</span></strong></td>
                            <td>
                                <select class="item_schema" data-id="{{ $link->id }}" data-original-schema="{{ $link->item_schema_id }}">
                                    <option value="" disabled selected>Select</option>
                                    @foreach($itemSchemas as $item)
                                        <option value="{{ $item->id }}" {{ $item->id == $link->item_schema_id ? "selected" : "" }}>{{ $item->title }}</option>
                                    @endforeach
                                </select>
                                <button type="button" class="btn btn-info btn-sm btn-apply" style="display: none">Apply</button>
                            </td>
                            <td>
                                @if($link->item_schema_id != "" && $link->main_filter_selector != "")
                                    <button type="button" class="btn btn-primary btn-scrape" title="pull the latest items">Scrape <i class="glyphicon glyphicon-repeat fast-right-spinner" style="display: none"></i></button>
                                @else
                                    <span style="color: red">fill main filter selector and item schema first</span>
                                @endif
                            </td>
                            <td>
                                <a href="{{ url('dashboard/links/' . $link->id . '/edit') }}"><i class="glyphicon glyphicon-edit"></i></a>
                            </td>
                        </tr>
                    @endforeach
                </table>

                <div class="pagination">
                    <?php echo $links->render(); ?>
                </div>
            @else
                <i>No links found</i>
            @endif
        </div>
    </div>
@endsection

@section('script')
    <script>
        $(function () {
            // Send the CSRF token with every AJAX request on this page.
            $.ajaxSetup({
                headers: {'X-CSRF-TOKEN': "{{ csrf_token() }}"}
            });

            // Show the Apply button only when the selected schema differs from the saved one.
            $("select.item_schema").change(function () {
                if ($(this).val() != $(this).attr("data-original-schema")) {
                    $(this).siblings('.btn-apply').show();
                }
            });

            // Save the selected item schema for this link.
            $('.btn-apply').click(function () {
                var btn = $(this);
                var tRowId = $(this).parents("tr").attr("data-id");
                var schema_id = $(this).siblings('select').val();

                $.ajax({
                    url: "{{ url('dashboard/links/set-item-schema') }}",
                    data: {link_id: tRowId, item_schema_id: schema_id, _token: "{{ csrf_token() }}", _method: "patch"},
                    method: "post",
                    dataType: "json",
                    success: function (response) {
                        alert(response.msg);
                        btn.hide();
                    }
                });
            });

            // Trigger scraping for this link and report the result in the alert box.
            $(".btn-scrape").click(function () {
                var btn = $(this);
                var tRowId = $(this).parents("tr").attr("data-id");

                btn.find(".fast-right-spinner").show();
                btn.prop("disabled", true);

                $.ajax({
                    url: "{{ url('dashboard/links/scrape') }}",
                    data: {link_id: tRowId, _token: "{{ csrf_token() }}"},
                    method: "post",
                    dataType: "json",
                    success: function (response) {
                        if (response.status == 1) {
                            $(".alert").removeClass("alert-danger").addClass("alert-success").text(response.msg).show();
                        } else {
                            $(".alert").removeClass("alert-success").addClass("alert-danger").text(response.msg).show();
                        }
                    },
                    error: function () {
                        // Without this handler a failed request leaves the spinner running forever.
                        $(".alert").removeClass("alert-success").addClass("alert-danger").text("Scraping request failed, check the log file").show();
                    },
                    complete: function () {
                        btn.find(".fast-right-spinner").hide();
                        btn.prop("disabled", false);
                    }
                });
            });
        });
    </script>
@endsection
/resources/views/dashboard/link/create.blade.php
@extends('layout')

@section('content')
    <div class="row">
        <div class="col-md-12">
            <h2>Add Link</h2>

            @if(session('error') != '')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)
                <div class="alert alert-danger">
                    <ul>
                        @foreach ($errors->all() as $error)
                            <li>{{ $error }}</li>
                        @endforeach
                    </ul>
                </div>
            @endif

            <form method="post" action="{{ route('links.store') }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Url:</strong>
                            <input type="text" name="url" class="form-control" />
                        </div>
                    </div>
                </div>
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Main Filter Selector:</strong>
                            <input type="text" name="main_filter_selector" class="form-control" />
                        </div>
                    </div>
                </div>
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Website:</strong>
                            <select name="website_id" class="form-control">
                                <option value="">select</option>
                                @foreach($websites as $website)
                                    <option value="{{ $website->id }}">{{ $website->title }}</option>
                                @endforeach
                            </select>
                        </div>
                    </div>
                </div>
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Category:</strong>
                            <select name="category_id" class="form-control">
                                <option value="">select</option>
                                @foreach($categories as $category)
                                    <option value="{{ $category->id }}">{{ $category->title }}</option>
                                @endforeach
                            </select>
                        </div>
                    </div>
                </div>
                <div class="col-xs-12 col-sm-12 col-md-12 text-center">
                    <button type="submit" class="btn btn-primary" id="btn-save">Create</button>
                </div>
            </form>
        </div>
    </div>
@endsection
/resources/views/dashboard/link/edit.blade.php
@extends('layout')

@section('content')
    <div class="row">
        <div class="col-md-12">
            <h2>Update Link #{{ $link->id }}</h2>

            @if(session('error') != '')
                <div class="alert alert-danger">
                    {{ session('error') }}
                </div>
            @endif

            @if (count($errors) > 0)
                <div class="alert alert-danger">
                    <ul>
                        @foreach ($errors->all() as $error)
                            <li>{{ $error }}</li>
                        @endforeach
                    </ul>
                </div>
            @endif

            {{-- The id is passed positionally so it fills the route parameter whatever its name is --}}
            <form method="post" action="{{ route('links.update', $link->id) }}" enctype="multipart/form-data">
                {{ csrf_field() }}
                {{ method_field("PUT") }}
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Url:</strong>
                            <input type="text" name="url" value="{{ $link->url }}" class="form-control" />
                        </div>
                    </div>
                </div>
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Main Filter Selector:</strong>
                            <input type="text" name="main_filter_selector" value="{{ $link->main_filter_selector }}" class="form-control" />
                        </div>
                    </div>
                </div>
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Website:</strong>
                            <select name="website_id" class="form-control">
                                <option value="">select</option>
                                @foreach($websites as $website)
                                    <option value="{{ $website->id }}" {{ $website->id == $link->website_id ? "selected" : "" }}>{{ $website->title }}</option>
                                @endforeach
                            </select>
                        </div>
                    </div>
                </div>
                <div class="row">
                    <div class="col-xs-12 col-sm-12 col-md-6">
                        <div class="form-group">
                            <strong>Category:</strong>
                            <select name="category_id" class="form-control">
                                <option value="">select</option>
                                @foreach($categories as $category)
                                    <option value="{{ $category->id }}" {{ $category->id == $link->category_id ? "selected" : "" }}>{{ $category->title }}</option>
                                @endforeach
                            </select>
                        </div>
                    </div>
                </div>
                <div class="col-xs-12 col-sm-12 col-md-12 text-center">
                    <button type="submit" class="btn btn-primary" id="btn-save">Update</button>
                </div>
            </form>
        </div>
    </div>
@endsection
Create app/Http/Controllers/HomeController.php and add the code below. This will be our home controller; the methods are left empty for now:
<?php

namespace App\Http\Controllers;

use App\Article;
use App\Category;
use Illuminate\Http\Request;

class HomeController extends Controller
{
    public function index()
    {
    }

    public function getArticleDetails($id)
    {
    }

    public function getCategory($id)
    {
    }
}
Create resources/views/home.blade.php and add this code:
@extends('layout')

@section('content')
    <div class="row">
        <div class="col-md-12">
            <h2>Articles</h2>
            ....
        </div>
    </div>
@endsection
Now modify routes/web.php to look like this. Note that the custom /links/set-item-schema route is registered before Route::resource('/links', ...); otherwise the PATCH request would be matched by the resource's /links/{link} update route instead:
Route::get('/', 'HomeController@index');
Route::get('/article-details/{id}', 'HomeController@getArticleDetails');
Route::get('/category/{id}', 'HomeController@getCategory');

Route::group(['prefix' => 'dashboard'], function() {
    Route::resource('/websites', 'WebsitesController');
    Route::resource('/categories', 'CategoriesController');

    // Custom link actions must come before the links resource routes
    Route::patch('/links/set-item-schema', 'LinksController@setItemSchema');
    Route::post('/links/scrape', 'LinksController@scrape');
    Route::resource('/links', 'LinksController');

    Route::resource('/item-schema', 'ItemSchemaController');
    Route::resource('/articles', 'ArticlesController');
});
Modify resources/views/layout.blade.php and replace the placeholder menu items with the actual links:
.....
<li><a href="{{ url('dashboard/websites') }}">Websites</a></li>
<li><a href="{{ url('dashboard/categories') }}">Categories</a></li>
<li><a href="{{ url('dashboard/links') }}">Links</a></li>
<li><a href="{{ url('dashboard/item-schema') }}">Item Schema</a></li>
<li role="separator" class="divider"></li>
<li><a href="{{ url('dashboard/articles') }}">Articles</a></li>
......
Now navigate to http://localhost/web_scraper/public/ and try adding websites and categories from the dashboard menu.
This video uses an example to demonstrate the process
In the final part of the tutorial we will implement the Frontend and article display.
Continue to part 3 >>> Article Display In Home Page
Hi there, thanks for this wonderful tutorial. I just have a question: I'm trying to scrape data from this website (diario.mx) and this is my schema:
title[.contenido]||excerpt[.sumario]||image[img.img-fluid [src]]||source_link[.nota a[href]]
But I don't know what I'm doing wrong. Do you have a chance to take a look?
Thank you again
Regards
Ok, sorry for the delay. I checked this website, and the title and image parts of your item schema are not valid. This is a valid item schema for https://diario.mx/seccion/Nacional/:
title[h4.titulo-normal]||excerpt[div.sumario]||image[amp-img.img-fluid[src]]||source_link[a.nota[href]]
Full content selector: .parrafo_notas
Main Filter Selector for the link: .container .mt-4 article
If you inspect any element on the diario.mx website you will see that the title has the class "titulo-normal". For the image, keep in mind that some websites use a special tag for images instead of the standard "img" tag; if you view the source of any page on diario.mx you will see that the image…
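To make the item schema format in the reply above easier to reason about, here is a minimal sketch of how such a string could be split into field names, CSS selectors and optional attributes. The function name parseItemSchema and the returned array shape are assumptions made for illustration; this is not the parsing code from app/Lib/Scraper.php, and it treats any trailing [name] as the attribute to extract, so a selector that ends with a real CSS attribute selector would be misread.

```php
<?php

/**
 * Split "field[selector[attr]]||field[selector]" into an array keyed by field.
 * Hypothetical helper, written only to illustrate the schema format.
 */
function parseItemSchema(string $schema): array
{
    $fields = [];

    foreach (explode('||', $schema) as $part) {
        // Each part looks like name[ ... ]; skip anything malformed.
        if (!preg_match('/^(\w+)\[(.+)\]$/s', trim($part), $m)) {
            continue;
        }

        $selector = trim($m[2]);
        $attribute = null;

        // A trailing [src] or [href] names the attribute to read instead of the text.
        if (preg_match('/^(.*?)\s*\[([\w-]+)\]$/s', $selector, $a)) {
            $selector = trim($a[1]);
            $attribute = $a[2];
        }

        $fields[$m[1]] = ['selector' => $selector, 'attribute' => $attribute];
    }

    return $fields;
}

$fields = parseItemSchema(
    'title[h4.titulo-normal]||excerpt[div.sumario]||image[amp-img.img-fluid[src]]||source_link[a.nota[href]]'
);

print_r($fields['image']);
```

With the schema from the reply, the image entry comes out as the selector amp-img.img-fluid with the attribute src, which is why the tag name in the selector has to match the tag the site really uses.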
Hi there, thanks for this tutorial. I just have a question: I want to save each post's avatar image and all of its detail images into the upload folder on the server. Can you help me?
Thanks.
– Just add two new fields in the database for Avatar image and details image
– Modify the item schema to accept avatar image and details image
– Modify the code to capture and scrape those images in app/Lib/Scraper.php
Ok, Thanks 🙂
Method App\Http\Controllers\WebsitesController::uploadFile does not exist.
Can you help?
Oh sorry, I added it to the parent controller class:
app/Http/Controllers/Controller.php
function uploadFile($name, $destination, $request = null)
{
    try {
        $image = $request->file($name);
        $fileName = time() . '.' . $image->getClientOriginalExtension();
        $image->move($destination, $fileName);
        return ["state" => 1, "filename" => $fileName];
    } catch (\Exception $ex) {
        return ["state" => 0, "filename" => ""];
    }
}
Hi! Thanks for the tutorial. I just have a question: I'm trying to use a proxy with scraping. Can you tell me the right way to use a proxy?
Thank you!
Regards
Maybe this article can help you:
https://stackoverflow.com/questions/5211887/how-to-use-curl-via-a-proxy
In Scraper.php I tried:
public function __construct(GoutteClient $client)
{
    $client = new GoutteClient;
    $client->setClient(new GoutteClient(['proxy' => 'http://174.138.33.167:8080']));
    $this->client = $client;
}
But on the links page, when I click the Scrape button, the preloader on the button spins and nothing happens. What am I doing wrong?
Thank you!
Regards
According to this link
https://github.com/FriendsOfPHP/Goutte/issues/220
you can add the proxy in Scraper.php, line 37:
$crawler = $this->client->request('GET', $linkObj->url, ['proxy' => 'http://174.138.33.167:8080']);
This code works for me.
First I installed Guzzle: php composer.phar require guzzlehttp/guzzle:~6.0
Then add:
use GuzzleHttp\Client as GuzzleClient;
and modify the code like this:
public function __construct(GoutteClient $client)
{
    $proxy = new GuzzleClient(['proxy' => '87.251.238.156:51627']);
    $client = new \Goutte\Client();
    $client->setClient($proxy);
    $this->client = $client;
}
To keep you on track and ensure that your process runs smoothly, you can conduct web scraping with proxies so that you are always secured. With proxy servers, you always receive content which does not cause any harm to your systems.
Maybe. Most people usually scrape textual content from news and sports websites, and I don't think this content will harm the system.
good
Please give me the files for this program 🙂
Follow the tutorial, it contains all the source code.
Where is ArticleController?
It's ArticlesController, not ArticleController.
Hi, I am trying to scrape data from an ecommerce website, but no data is retrieved.
This is the link and schema:
https://shopee.co.id/shop/127192295/search
title[div._1JAmkB]||excerp[span._341bF0]||image[div.customized-overlay-image img[src]]||source_link[.col-xs-2-4 div a[href]]
Thank you again
Regards.
Ohh, sorry to hear that. I checked the website and found that it does not render the items on the server: if you view the page source you will see no products in it, because the site uses React to render the data in the browser. Unfortunately these types of websites can't be scraped with this tutorial's approach, because the content only exists after the JavaScript runs.
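A quick way to check this before writing a schema is to look at the raw HTML (what "view source" shows) and test whether the target element is already there. The sketch below uses PHP's built-in DOM extension on two made-up HTML strings; the function name hasElementWithClass and the sample markup are assumptions for illustration, and it only handles simple class names.

```php
<?php

/**
 * Return true when the raw HTML already contains an element with the given class.
 * Hypothetical helper; it does not execute any JavaScript.
 */
function hasElementWithClass(string $html, string $class): bool
{
    // Real-world HTML is rarely valid, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole word inside the class attribute.
    $xpath = new DOMXPath($doc);
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    return $xpath->query($query)->length > 0;
}

// Server-rendered page: the product markup is part of the response.
$serverRendered = '<html><body><div class="product">Shoes</div></body></html>';

// Client-rendered page: only an empty mount point and a script tag.
$clientRendered = '<html><body><div id="main"></div><script src="app.js"></script></body></html>';

var_dump(hasElementWithClass($serverRendered, 'product'));
var_dump(hasElementWithClass($clientRendered, 'product'));
```

If the check fails on the HTML returned by the server, no selector in the item schema can match, however carefully it is written.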
Then how can I scrape data from a JavaScript website? Why isn't this possible? Thanks.
– JavaScript-based websites are considered poor in the sense that they don't provide server-side rendering for the data; the rendering is done via JavaScript and AJAX.
– Those websites can also have SEO problems, because search engines may struggle to read or crawl JavaScript-rendered content.
– I think that to scrape data from JavaScript websites you would have to write a JavaScript API to scrape those websites.
My edit.blade.php gives an error:
Missing required parameters for [Route: item-schema.update] [URI: dashboard/item-schema/{item_schema}]. (View: D:\Projects\Web\translator\resources\views\dashboard\item_schema\edit.blade.php)
Sorry, this is my first time using Laravel.
Ohh, try to follow the tutorial step by step.
Dear Wael,
I join everyone in thanking you for your tutorial.
Yep, I did follow it step by step))
But I have the same error, and I have solved it somehow.
Please tell us why this doesn't work:
action="{{ route('websites.update', ['id' => $website->id]) }}"
But this works:
action="{{ route('websites.update', $website->id) }}"
Laravel 6.12
I am not sure about Laravel 6.12, but this syntax is still supported in Laravel 6:
action="{{ route('websites.update', ['id' => $website->id]) }}"
After scraping is done, the articles are not showing on the home page.
See what's in the log file.
Check your HomeController:
public function index()
{
    return view('home');
}
I hope this helps…
Hey there. Useful tutorial. I had a question: where are the "withCategories" and "withArticles" functions declared?
I can't remember, but you can download the full source code from the last part.