Asked  1 Year ago    Answers:  5   Viewed   19 times

I am trying to parse HTML data from a local weather channel's site to get closing information for schools, businesses, and churches in my area.

The problem I have run into is that the information is contained in tables that do not have an id I can use to identify them. Below I have included an example of what one of their HTML tables looks like. Is it possible to parse multiple HTML tables like this and pull out the data they contain using the HTML DOM parser with PHP? I have read through this documentation, but can't seem to find an applicable solution.

Thanks!

Edit: I should probably also specify that I want to take this data and convert it to JSON to load in an application. So basically I want the organization's name and its status, which I can then fetch from a JSON page.

Link to weather channel's site

<table class="table table-condensed table-striped">
  <tbody>
    <tr>
      <th class="span5">organization</th>
      <th class="span9">status</th>
    </tr>
    <tr>
      <td><b>beacon hope church - grand island</b></td>
      <td>activity canceled sunday<small>: no evening classes</small></td>
    </tr>
    <tr>
      <td><b>prince of peace catholic (kearney)</b></td>
      <td>closed monday<small>: sunday evening activities canceled, no mon. morning mass, offices closed mon.</small></td>
    </tr>
  </tbody>
</table>

 Answers

3

Found the answer to my question with help from user sms, who commented above. This PHP pulls the data from the first table and encodes it as JSON.

<?php

include('simple_html_dom.php');
header('Content-Type: application/json');

$html = file_get_html('http://www.1011now.com/weather/closings');
$json = array();

// Take the first table on the page and walk its rows.
$table = $html->find('table', 0);
foreach ($table->find('tr') as $row) {
    $name   = $row->find('td', 0);
    $status = $row->find('td', 1);

    // The header row uses <th> cells, so find('td', ...) returns null there; skip it.
    if ($name === null || $status === null) {
        continue;
    }

    $json[] = [
        'name'   => strip_tags($name->innertext),
        'status' => strip_tags($status->innertext),
    ];
}

// Optional: POST the JSON to another endpoint. $url is never defined in the
// original answer, so this block only runs if you set it yourself.
if (isset($url)) {
    $options = array(
        'http' => array(
            'method'  => 'POST',
            'content' => json_encode(array('closings' => $json)),
            'header'  => "Content-Type: application/json\r\n" .
                         "Accept: application/json\r\n"
        )
    );

    $context  = stream_context_create($options);
    $result   = file_get_contents($url, false, $context);
    $response = json_decode($result);
}

echo json_encode(array('closings' => $json), JSON_PRETTY_PRINT);

?>
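
The question asks about several tables, not only the first one. As a rough sketch of that case, the example below swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of simple_html_dom, so it runs without extra files. The helper name parseClosings() and the inline sample markup (copied from the question) are assumptions made for illustration; fetching the live page is left out.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension, not simple_html_dom.

/**
 * Collect name/status pairs from every <table> in an HTML string.
 * parseClosings() is a hypothetical helper name.
 */
function parseClosings(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; keep libxml warnings out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $closings = [];

    // //table//tr visits the rows of every table, with or without an id.
    foreach ($xpath->query('//table//tr') as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 2) {
            continue; // header rows use <th>, so they have no <td> cells
        }
        $closings[] = [
            'name'   => trim($cells->item(0)->textContent),
            'status' => trim($cells->item(1)->textContent),
        ];
    }
    return $closings;
}

// Sample markup taken from the question.
$sample = '<table class="table table-condensed table-striped"><tbody>'
    . '<tr><th class="span5">organization</th><th class="span9">status</th></tr>'
    . '<tr><td><b>beacon hope church - grand island</b></td>'
    . '<td>activity canceled sunday<small>: no evening classes</small></td></tr>'
    . '</tbody></table>';

$closings = parseClosings($sample);
echo json_encode(['closings' => $closings], JSON_PRETTY_PRINT), "\n";
```

Reading textContent drops the inner <b> and <small> tags, which plays the same role as strip_tags() in the answer above.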
Saturday, May 29, 2021
 
1

You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can use the following features, which XML parsers don’t understand:

  • Elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
  • Elements that don’t require end tags; e.g., <p>, <dt>, <li> (their end tags can be implied)
  • Elements that can contain unescaped "<" characters; e.g., style, textarea, title, script. For example: <script> if (a < b) … </script>, <title>using the "<" operator</title>
  • Attributes with unquoted values; for example, <meta charset=utf-8>
  • Attributes that are empty, with no separate value given at all; e.g., <input disabled>

An XML parser will fail to parse any HTML document that uses any of those features.

An HTML parser, on the other hand, will basically never fail no matter what a document contains.
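
A small sketch of that difference, assuming PHP's bundled DOM extension: the same markup is handed to DOMDocument::loadXML() and DOMDocument::loadHTML(). Note that loadHTML() is backed by libxml2's HTML parser, which is lenient but is not a parser conforming to the HTML standard, so treat this only as an illustration of strict versus forgiving parsing.

```php
<?php
// Valid HTML, but not well-formed XML: <br> is a void element with no end tag,
// and the "disabled" attribute has no value.
$markup = '<p>first line<br>second line <input disabled></p>';

libxml_use_internal_errors(true); // collect parse errors instead of printing warnings

$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($markup);    // strict XML parsing rejects the input
$xmlErrors = count(libxml_get_errors());
libxml_clear_errors();

$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($markup); // the HTML parser accepts it
libxml_clear_errors();

// The lenient parser still produced a usable tree.
$brCount = $asHtml->getElementsByTagName('br')->length;

var_dump($xmlOk, $htmlOk, $brCount);
```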


All that said, there has also been work done toward developing a new type of XML parsing, so-called XML5 parsing, capable of handling things like empty or unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.


"The intended use is to make an HTML parser that is part of a web crawler application."

If you’re going to create a web-crawler application, you should absolutely use an HTML parser, ideally one that conforms to the parsing requirements in the HTML standard.

These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:

  • parse5 (Node.js/JavaScript)
  • html5lib (Python)
  • html5ever (Rust)
  • Validator.nu HTML5 parser (Java)
  • Gumbo (C, with bindings for Ruby, Objective C, C++, PHP, C#, Perl, Lua, D, Julia…)

Thursday, August 5, 2021
 
3

Edit 2: As this is a bug in the DOM parser (tested on version 1.5), there is no simple way of doing this. The solution I could think of:

$find = $html->find(".class1");
$ret = array();
foreach ($find as $element) {
    if (strpos($element->class, 'class3') !== false) {
        $ret[] = $element;
    }
}
$find = $ret;

Basically, you find all the elements with the first class, then iterate through those elements to keep the ones that also have the second class (in this case, class3).
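
One caveat with the strpos() check is that it matches substrings, so 'class3' is also found inside 'class33'. A minimal sketch of a whole-name check follows; hasClass() is a hypothetical helper, not part of simple_html_dom, and inside the loop it would be called as hasClass((string) $element->class, 'class3').

```php
<?php
/**
 * True when $classAttr (the raw value of a class attribute) contains $wanted
 * as a whole class name. hasClass() is a hypothetical helper name.
 */
function hasClass(string $classAttr, string $wanted): bool
{
    // Class names are separated by whitespace; drop empty pieces.
    $classes = preg_split('/\s+/', trim($classAttr), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($wanted, $classes, true);
}

var_dump(hasClass('class1 class3', 'class3'));           // whole-name match
var_dump(hasClass('class1 class33', 'class3'));          // no whole-name match
var_dump(strpos('class1 class33', 'class3') !== false);  // substring check still matches
```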


Previous answer:

Simple answer (should work according to CSS selector syntax):

find(".class1.class2")

This will look for any type of element (div, img, a, etc.) that has both class1 and class2. If you want to restrict the match to one element type, add the tag name at the beginning, without a leading dot:

find("div.class1.class2")

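If you put a space between the two classes, the selector becomes a descendant selector: under CSS selector rules it matches elements with the second class that are nested inside an element with the first class:

find(".class1 .class2")

will match

<div class="class1">
  <div class="class2">this will be returned</div>
</div>

but, by those rules, not an element that carries both classes itself (that case needs the form without a space shown earlier):

<div class="class1 class2">this will not be returned</div>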

Edit: I tried your code and found that the solutions above do not work. The solution that does work, however, is as follows:

$html->find("div[class=class1 class2]")
Wednesday, August 11, 2021
 
Null
 
5

No, Simple HTML DOM doesn't do DOM manipulation. With phpQuery, though, you can do:

$doc->find('head')->append('<script src="foo"></script>');
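
If neither library is available, roughly the same append can be sketched with PHP's bundled DOM extension. This is an alternative technique, not the phpQuery API; the sample document is made up for illustration.

```php
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML('<html><head><title>demo</title></head><body><p>hi</p></body></html>');
libxml_clear_errors();

// Locate <head> and append a new <script src="foo"> element to it.
$head = $doc->getElementsByTagName('head')->item(0);

$script = $doc->createElement('script');
$script->setAttribute('src', 'foo');
$head->appendChild($script);

// Serialize the modified document back to HTML.
$output = $doc->saveHTML();
echo $output;
```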
Friday, November 12, 2021
 
3

In the same way you're using each to iterate over the posts, you can use each to iterate over the tracks. I've included the full code below.

<html>
<body>
    <div id="content">

    </div>
    <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.1/jquery.min.js"></script>

     <script type="text/javascript">
        $(document).ready(function(){
          var url="posts.php";

          $.getJSON(url,function(json){
            $.each(json.posts,function(i,post){

              // here we generate a fragment for the tracks.
              var tracks = ''; 
              $.each(post.tracks, function(j, track) {
                tracks += '<a href="' + track.url + '">' + track.name + '</a>';
              });

              $("#content").append(
                '<div class="post">'+
                '<h1>'+post.album+'</h1>'+
                '<p><img src="'+post.artwork+'" width="250"></p>'+
                '<p><strong>'+post.church+'</strong></p>'+
                '<p>description: <strong>'+post.des+'</strong></p>'+
                '<p>base url:  <em>'+post.baseurl+'</em></p>'+
                tracks +
                '<hr>'+
                '</div>'
              );
            });
          });
        });
      </script>
</body>
</html>
Saturday, November 13, 2021
 
wael
 