Asked  1 Year ago    Answers:  5   Viewed   16 times

I am new to HTML DOM parsing with PHP. There is a page with several divs that have different content but the same class. When I try to fetch the content, I only get the content of the last div. Is it possible to get the content of all the divs that share the same class? Please have a look at my code:

<?php
    include(__DIR__."/simple_html_dom.php");
    $html = file_get_html('http://campaignstudio.in/');
    echo $x = $html->find('h2[class="section-heading"]',1)->outertext; 
?>

 Answers

1

in your example code, you have

echo $x = $html->find('h2[class="section-heading"]',1)->outertext; 

Because you are calling find() with a second parameter of 1, it returns only a single element (the match at index 1). If you instead find all of them, you can do whatever you need with them:

$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
    echo $item->outertext . PHP_EOL;
}

the full code i've just tested is...

include(__DIR__."/simple_html_dom.php");
$html = file_get_html('http://campaignstudio.in/');

// Without an index, find() returns an array of every matching element
$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
    echo $item->outertext . PHP_EOL;
}

which gives the output...

<h2 class="section-heading text-white">we've got what you need!</h2>
<h2 class="section-heading">at your service</h2>
<h2 class="section-heading">let's get in touch!</h2>
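
For comparison, the same "collect every match, then loop" idea can be sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. This is only an illustration: the markup below is a made-up stand-in for the fetched page, and the XPath expression is one common way to match a class token, not something taken from the answer.

```php
<?php
// Sample markup standing in for the fetched page (hypothetical, for illustration only).
$markup = <<<HTML
<html><body>
<h2 class="section-heading text-white">We've got what you need!</h2>
<h2 class="section-heading">At your service</h2>
<p class="section-heading">Not a heading</p>
<h2 class="other">Unrelated</h2>
<h2 class="section-heading">Let's get in touch!</h2>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match every <h2> whose class list contains the token "section-heading".
$query = '//h2[contains(concat(" ", normalize-space(@class), " "), " section-heading ")]';

$headings = [];
foreach ($xpath->query($query) as $node) {
    $headings[] = trim($node->textContent);
}

foreach ($headings as $text) {
    echo $text . PHP_EOL;
}
```

The query returns a list of all matching nodes, so nothing is lost the way it is when a single index is requested.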
Saturday, May 29, 2021
 
3

proceed step by step...

Start by getting all the wanted info from one page (the first, for example). The idea is to:

  • get all phone blocks: $phones = $html->find('a[data-id]');
  • in a loop, get the wanted info (name, price) from each block
  • insert this info into the DB (I can't help with the DB part since I haven't used one for a while, but you can do this on your own; it's not that hard)

Now that you have the code working for one page, let's make it work for all pages, knowing that:

  • all pages have the same structure, so we can extract data with the same method/code as above
  • the link to the next page to scrape is included in the "next" button, so we'll stop when this link cannot be found

So here's the code summarizing everything said above:

$url = "https://www.varle.lt/mobilieji-telefonai/";

// start from the main page
$nextlink = $url;

// loop on each next link as long as it exists
while ($nextlink) {
    echo "<hr>nextlink: $nextlink<br>";
    //create a dom object
    $html = new simple_html_dom();
    // load html from a url
    $html->load_file($nextlink);

    /////////////////////////////////////////////////////////////
    /// get phone blocks and extract info (also insert to db) ///
    /////////////////////////////////////////////////////////////
    $phones = $html->find('a[data-id]');

    foreach($phones as $phone) {
        // get the link
        $linkas = $phone->href;

        // get the name
        $pavadinimas = $phone->find('span[class=inner]', 0)->plaintext;

        // get the price and extract the useful part using regex
        $kaina = $phone->find('span[class=price]', 0)->plaintext;
        // this captures the integer part of decimal numbers: in "123,45" it will capture "123"... use @([\d,]+),?@ to capture the decimal part too
        preg_match('@(\d+),?@', $kaina, $matches);
        $kaina = $matches[1];

        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

        // insert into db here
        // code
        // ...
    }
    /////////////////////////////////////////////////////////////
    /////////////////////////////////////////////////////////////

    // extract the next link, if not found return null
    $nextlink = ( ($temp = $html->find('div.pagination a[class="next"]', 0)) ? "https://www.varle.lt".$temp->href : null );

    // clear dom object
    $html->clear();
    unset($html);
}

output

nextlink: https://www.varle.lt/mobilieji-telefonai/
samsung phone i9300 galaxy siii juodas #----# 1099 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-i9300-galaxy-siii-juodas.html
samsung galaxy s2 plus i9105 pilkai mėlynas #----# 739 #----# https://www.varle.lt/mobilieji-telefonai/samsung-galaxy-s2-plus-i9105-pilkai-melynas.html
samsung phone s7562 galaxy s duos baltas #----# 555 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-s7562-galaxy-s-duos-baltas--457135.html
...

nextlink: https://www.varle.lt/mobilieji-telefonai/?p=2
lg t375 mobile phone black #----# 218 #----# https://www.varle.lt/mobilieji-telefonai/lg-t375-mobile-phone-black.html
samsung s6802 galaxy ace duos black #----# 579 #----# https://www.varle.lt/mobilieji-telefonai/samsung-s6802-galaxy-ace-duos-black.html
mobilus telefonas samsung galaxy ace onyx black | s5830 #----# 559 #----# https://www.varle.lt/mobilieji-telefonai/mobilus-telefonas-samsung-galaxy-ace-onyx-black.html
...

...
...

working demo

Note that the code may take a while to parse all the pages, so PHP may return the error "Fatal error: Maximum execution time of 30 seconds exceeded ...". If that happens, simply extend the maximum execution time like this:

ini_set('max_execution_time', 300); //300 seconds = 5 minutes
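
The two price patterns mentioned in the loop's comments can be checked in isolation. The sketch below runs both against a few made-up price strings (the sample values are assumptions, not data from the site) to show the difference between capturing only the integer part and keeping the decimal part.

```php
<?php
// Hypothetical price strings in the "integer,decimals" style the comment describes.
$samples = ['1099,00 Lt', '739 Lt', '218,99'];

$integerParts = [];
$fullNumbers = [];
foreach ($samples as $raw) {
    // (\d+) captures the leading run of digits; the optional comma is matched but not captured.
    if (preg_match('@(\d+),?@', $raw, $m)) {
        $integerParts[] = $m[1];
    }
    // ([\d,]+) also accepts commas inside the group, so the decimal part is kept.
    if (preg_match('@([\d,]+),?@', $raw, $m)) {
        $fullNumbers[] = $m[1];
    }
}

print_r($integerParts);
print_r($fullNumbers);
```

Checking that preg_match() actually matched before reading $m[1] avoids an undefined-index notice on blocks where the price is missing.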
Saturday, May 29, 2021
 
HamidR
 
1

Turns out I was a little too early asking this question. I found this page (API reference), and it tells us we can use the following W3C-standard method too:

$e->setAttribute($name, $value)

so instead of

$elem->attr['class'] = "classname";

you can do

$elem->setAttribute("class", "classname");
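
The same W3C-style method also exists on PHP's built-in DOMElement, so the call can be tried without Simple HTML DOM. A minimal sketch, assuming a made-up one-element document:

```php
<?php
$doc = new DOMDocument();
// Hypothetical fragment; loadHTML() wraps it in <html><body> automatically.
$doc->loadHTML('<div id="box" class="old">content</div>');

$xpath = new DOMXPath($doc);
$elem = $xpath->query('//div[@id="box"]')->item(0);

// Overwrite the class attribute with the standard DOM method.
$elem->setAttribute('class', 'classname');

// Serialize only the modified element.
$result = $doc->saveHTML($elem);
echo $result . PHP_EOL;
```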

I'll keep the question and answer up in case other people come across this and miss the API reference page.

Saturday, May 29, 2021
 
Eugenie
 
3

Edit 2: As this is a bug in the DOM parser (tested on version 1.5), there is no simple way of doing this. The solution I could think of:

$find = $html->find(".class1");
$ret = array();
foreach ($find as $element) {
    if (strpos($element->class, 'class3') !== false) {
        $ret[] = $element;
    }
}
$find = $ret;

Basically, you find all the elements with the first class, then iterate through those elements to keep the ones that also have the second class (class3 in this example).
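
One caveat with the strpos() check is that it matches substrings, so an element with class "class30" would also pass a test for "class3". A possible refinement, sketched below with a hypothetical helper name (hasAllClasses is not part of Simple HTML DOM), splits the class attribute into whole tokens first:

```php
<?php
/**
 * Return true when every name in $wanted appears as a whole token
 * in the space-separated class attribute value.
 * (Hypothetical helper, for illustration only.)
 */
function hasAllClasses(string $classAttr, array $wanted): bool
{
    // Split on any whitespace and drop empty pieces.
    $tokens = preg_split('/\s+/', trim($classAttr), -1, PREG_SPLIT_NO_EMPTY);
    // No wanted name may be missing from the token list.
    return count(array_diff($wanted, $tokens)) === 0;
}

var_dump(hasAllClasses('class1 class3', ['class1', 'class3']));   // both tokens present
var_dump(hasAllClasses('class1 class30', ['class1', 'class3']));  // "class30" is not "class3"
```

Inside the loop above, the condition could then become hasAllClasses($element->class, ['class1', 'class3']), assuming $element->class holds the raw attribute string.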


Previous answer:

Simple answer (should work according to the HTML spec):

find(".class1.class2")

This will look for any type of element (div, img, a, etc.) that has both class1 and class2. If you want to restrict the type of element to match, add it to the beginning, without a leading dot:

find("div.class1.class2")

If you put a space between the two specified classes, it will match elements with both classes, or elements with the second class nested inside an element with the first class:

find(".class1 .class2")

will match

<div class="class1">
  <div class="class2">this will be returned</div>
</div>

or

<div class="class1 class2">this will be returned</div>

Edit: I tried your code and found that the solutions above do not work. The solution that does work, however, is as follows:

$html->find("div[class=class1 class2]")
Wednesday, August 11, 2021
 
Null
 
5

No, Simple HTML DOM doesn't do DOM manipulation. With phpQuery, though, you can do:

$doc->find('head')->append('<script src="foo"></script>');
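
If adding a library is not an option, a similar append can be sketched with PHP's built-in DOMDocument. This is an alternative to the phpQuery call above, not the same API, and the page markup is a made-up example:

```php
<?php
$doc = new DOMDocument();
// Hypothetical page used only to demonstrate the append.
$doc->loadHTML('<html><head><title>Demo</title></head><body><p>Hi</p></body></html>');

// Locate the <head> element and append a new <script src="foo"> to it.
$head = $doc->getElementsByTagName('head')->item(0);
$script = $doc->createElement('script');
$script->setAttribute('src', 'foo');
$head->appendChild($script);

$output = $doc->saveHTML();
echo $output;
```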
Friday, November 12, 2021
 