
Building A Simple Scraping Website With PHP Laravel Part 1: Beginning

In this tutorial we will introduce the concept of website scraping, then we will clarify that concept with a real-world example using PHP and Laravel.

 

Series Topics:

 

Requirements:

  • PHP and the Laravel framework
  • Bootstrap
  • jQuery

 

Overview:

Content scraping means reading pieces of content from HTML or XML pages to display them on a website or save them into a database. For example, we might want to read a listing of articles from a news website, or read the products of e-commerce websites to build something like a price comparison tool. So the process is to pull pieces of information like item titles, descriptions, images, and more.

 

Methods of Scraping:

  • Regex (Regular expressions).
  • CSS selectors.
  • XPath (a query technique similar to CSS selectors).

 

Scraping Process:

As shown in the figure below, consider we have a list of articles; each article has an image, a title, and an excerpt, and we need to retrieve each of those pieces.

Using a technique like CSS selectors, we can do something like the following pseudo code:
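The original snippet isn't preserved here, but a rough sketch of that pseudo code, using made-up selectors purely for illustration, could look like this:

// grab every article node in the list, then pull each piece from it
// (the selectors "div.articles article", "h2.title", etc. are illustrative only)
$articles = $crawler->filter('div.articles article');

$articles->each(function ($article) {
    $title   = $article->filter('h2.title')->text();
    $excerpt = $article->filter('p.excerpt')->text();
    $image   = $article->filter('img')->attr('src');

    // ... save the extracted pieces to the database or display them
});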

 

[Figure: content scraping process]

PHP Goutte Package:

For the purpose of this tutorial we will be using a PHP package called Goutte for web scraping, and this is the GitHub repo for the package. This package is internally based on the Symfony DomCrawler component.

The filter() function is one of the methods of the Symfony DomCrawler; it filters elements by CSS selector. There is also filterXPath(), which filters elements using XPath expressions. filter(selector) returns a Crawler object holding the matched elements (an empty Crawler if nothing matches), so you can use that object to retrieve other nested elements.

Consider this sample HTML:
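(The original sample isn't preserved here, so assume a small list like the following; the class names are only for illustration.)

<ul class="items">
    <li class="item">First item</li>
    <li class="item">Second item</li>
    <li class="item">Third item</li>
</ul>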

To get the first element:
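Against the sample markup above, something like this should work using DomCrawler's first() method:

$text = $crawler->filter('ul.items li.item')->first()->text(); // "First item"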

To get an element by a specific index:
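For example, using eq() with a zero-based index:

$text = $crawler->filter('ul.items li.item')->eq(1)->text(); // "Second item"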

To get the children of an element:
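For example, using children():

$items = $crawler->filter('ul.items')->children(); // a Crawler holding the three <li> nodes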

Refer to the Symfony DomCrawler documentation to learn more about the other available methods.

 

 

Preparing The Project:

We will build a news website where the administrator can add categories, add links to news websites, and fetch articles by scraping.

 

So create a new Laravel project:
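The exact command isn't shown here; the usual composer command looks like this (the project name "scraper" is just a placeholder):

composer create-project --prefer-dist laravel/laravel scraper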

 

Install PHP Goutte:
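Assuming the package is pulled in through composer, the Packagist name of the Goutte package is fabpot/goutte:

composer require fabpot/goutte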

 

Next modify your database settings in .env:
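The values below are placeholders; adjust them to your local database:

DB_CONNECTION=mysql
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=scraper
DB_USERNAME=root
DB_PASSWORD=secret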

 

Database tables

  • Categories
  • Articles
  • Websites: Holds the websites we will need to pull data from.
  • Links: Holds the individual links, each belonging to a website, that we will pull data from.
  • Item Schema: Defines the schema for a single item in the list of items; it will contain the CSS expressions used to fetch individual pieces of data like the title, excerpt, image, etc.

Let’s create some migrations for our database
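Matching the migration file names listed below, the artisan commands would be something like:

php artisan make:migration create_category_table
php artisan make:migration create_website_table
php artisan make:migration create_article_table
php artisan make:migration create_links_table
php artisan make:migration create_item_schema_table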

 

Next modify the migration files as follows:

database/migrations/XXXX_XX_XX_create_category_table.php

database/migrations/XXXX_XX_XX_create_website_table.php

database/migrations/XXXX_XX_XX_create_article_table.php

database/migrations/XXXX_XX_XX_create_links_table.php

database/migrations/XXXX_XX_XX_create_item_schema_table.php

The category table will hold the categories, and the article table will hold the articles, each linked to a category by category_id.
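The migration contents aren't reproduced here; as a rough sketch (the table and column names are my assumptions based on the pieces described in this tutorial: title, url, excerpt, image), the article migration could look like this:

<?php

use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;

class CreateArticleTable extends Migration
{
    public function up()
    {
        Schema::create('articles', function (Blueprint $table) {
            $table->increments('id');
            $table->string('title');
            $table->string('url');
            $table->text('excerpt')->nullable();
            $table->string('image')->nullable();
            $table->integer('category_id')->unsigned(); // the article belongs to a category
            $table->integer('website_id')->unsigned();  // and to the website it was scraped from
            $table->timestamps();
        });
    }

    public function down()
    {
        Schema::dropIfExists('articles');
    }
}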

The website table represents the websites that we will scrape data from.

 

The links table represents the links for a specific website. We added a category_id so that the articles scraped from a link will be assigned to the selected category. The main_filter_selector column defines the main CSS selector that will be passed to the filter() function, i.e. $crawler->filter(main_filter_selector).
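A sketch of the corresponding Schema::create() call (this goes inside the up() method of the generated migration; apart from the columns named above, the rest are assumptions):

Schema::create('links', function (Blueprint $table) {
    $table->increments('id');
    $table->string('url');
    $table->integer('website_id')->unsigned();
    $table->integer('category_id')->unsigned(); // scraped articles get this category
    $table->string('main_filter_selector');     // passed to $crawler->filter(...)
    $table->timestamps();
});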

 

The item_schema table represents the schema structure for a single item in an article list page; for example, an article item contains a title, URL, excerpt, and image.

The is_full_url column defines whether the article URL to the details page is a full URL or a partial one.

The css_expression attribute holds a special expression combining the CSS selectors that represent those elements, as we will discuss in the next sections.

The full_content_selector defines the CSS selector for the item's full content on the detail page.
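Again as a sketch, based only on the columns mentioned above (inside up() of the item_schema migration):

Schema::create('item_schema', function (Blueprint $table) {
    $table->increments('id');
    $table->text('css_expression');             // selectors for title, url, excerpt, image, ...
    $table->boolean('is_full_url')->default(0); // full or partial article url
    $table->string('full_content_selector');    // selector for the content on the detail page
    $table->timestamps();
});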

 

Generating Models

Let’s create the required models for the database tables:
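The commands aren't shown here, but they would be along these lines:

php artisan make:model Category
php artisan make:model Website
php artisan make:model Article
php artisan make:model Link
php artisan make:model ItemSchema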

Open app/Website.php and modify it as follows:

Modify app/Category.php as follows:

app/Article.php

app/Link.php

app/ItemSchema.php

As shown in the above code, we added some relations to the models. For example, in the Article model there are two relations: the Category and the Website it belongs to. In the Link model there are relations to the Category, the Website, and the Item Schema.
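Since the model code isn't reproduced above, here is a minimal sketch of what the Article relations could look like (the Link model would follow the same pattern for its category(), website(), and item schema relations):

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;

class Article extends Model
{
    // the category this article was assigned to
    public function category()
    {
        return $this->belongsTo(Category::class);
    }

    // the website the article was scraped from
    public function website()
    {
        return $this->belongsTo(Website::class);
    }
}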

Preparing main layout

Now we need to create the main layout template in resources/views/layout.blade.php and add the below contents:
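The layout markup isn't reproduced here; a minimal sketch using the Bootstrap and jQuery CDNs (the versions and the nav hrefs are placeholders) could be:

<!DOCTYPE html>
<html>
<head>
    <title>News Scraper</title>
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
</head>
<body>
    <nav class="navbar navbar-default">
        <div class="container">
            <ul class="nav navbar-nav">
                <li><a href="#">Categories</a></li>
                <li><a href="#">Websites</a></li>
                <li><a href="#">Articles</a></li>
                <li><a href="#">Item Schema</a></li>
                <li><a href="#">Links</a></li>
            </ul>
        </div>
    </nav>

    <div class="container">
        @yield('content')
    </div>

    <script src="https://code.jquery.com/jquery-3.3.1.min.js"></script>
</body>
</html>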

This is just a simple layout with a header; we added links that represent the dashboard sections: categories, websites, articles, item schema, and links.

 

In the next part of the tutorial we will implement the dashboard and CRUD operations.

 

Continue to part 2 >>> Implementing Scraper Dashboard
