
I am trying to parse HTML data from a local weather channel's site to get closing information for schools, businesses, and churches in my local area.

However, I have run into a problem: the information is contained in tables that have no id I can use to identify them. Below is an example of one of their HTML tables. Is it possible to parse multiple HTML tables like this and pull out the data they contain using HTML DOM Parser with PHP? I have read through the documentation but can't find an applicable solution.

Thanks!

EDIT: I should probably also specify that I want to convert this data to JSON so I can load it into an application. So basically I need each organization's name and its status, which I can then fetch from a JSON page.

Link to weather channels site

<table class="table table-condensed table-striped">
  <tbody>
    <tr>
      <th class="span5">Organization</th>
      <th class="span9">Status</th>
    </tr>
    <tr>
      <td><b>BEACON HOPE CHURCH - GRAND ISLAND</b></td>
      <td>Activity Canceled Sunday<small>: No Evening Classes</small></td>
    </tr>
    <tr>
      <td><b>PRINCE OF PEACE CATHOLIC (KEARNEY)</b></td>
      <td>Closed Monday<small>: SUNDAY EVENING ACTIVITIES CANCELED, NO MON. MORNING MASS, OFFICES CLOSED MON.</small></td>
    </tr>
  </tbody>
</table>

 Answers

3

Found the answer to my question with help from user sms, who commented above. This PHP script pulls the data from the first table and encodes it as JSON.

<?php

include('simple_html_dom.php');
header('Content-Type: application/json');

$html = file_get_html('http://www.1011now.com/weather/closings');
$json = array();

// Take the first table on the page and walk its rows
$table = $html->find('table', 0);
foreach ($table->find('tr') as $row) {
    $name = $row->find('td', 0);
    $status = $row->find('td', 1);

    // The header row uses <th> cells, so it has no <td> to read; skip it
    if (!$name || !$status) {
        continue;
    }

    $json[] = [
        'name' => strip_tags($name->innertext),
        'status' => strip_tags($status->innertext),
    ];
}

// Optional: POST the same JSON to another endpoint.
// $url is not defined in this script, so this step only runs once it is set.
if (!empty($url)) {
    $options = array(
        'http' => array(
            'method'  => 'POST',
            'content' => json_encode(array('Closings' => $json)),
            'header'  => "Content-Type: application/json\r\n" .
                         "Accept: application/json\r\n"
        )
    );

    $context  = stream_context_create($options);
    $result = file_get_contents($url, false, $context);
    $response = json_decode($result);
}

echo json_encode(array('Closings' => $json), JSON_PRETTY_PRINT);

?>
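The question also asks about reading several id-less tables. A minimal sketch of one way to do that with PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom follows. The helper name closingsFromHtml is hypothetical, the sample markup is adapted from the question and split into two tables for illustration, and matching on the table-striped class assumes the live page uses that class consistently.

```php
<?php
// Hypothetical helper: collects name/status pairs from every matching table in an HTML string.
function closingsFromHtml(string $html): array
{
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $closings = [];

    // The tables have no id, so match on a class token instead.
    // "tr[td]" keeps only rows with <td> cells, which skips the <th> header row.
    $query = "//table[contains(concat(' ', normalize-space(@class), ' '), ' table-striped ')]//tr[td]";
    foreach ($xpath->query($query) as $row) {
        $cells = $xpath->query('td', $row);
        if ($cells->length < 2) {
            continue;
        }
        $closings[] = [
            'name'   => trim($cells->item(0)->textContent),
            'status' => trim($cells->item(1)->textContent),
        ];
    }
    return $closings;
}

// Sample markup adapted from the question (assumption: split into two tables for illustration).
$sample = <<<HTML
<table class="table table-condensed table-striped">
  <tr><th class="span5">Organization</th><th class="span9">Status</th></tr>
  <tr><td><b>BEACON HOPE CHURCH - GRAND ISLAND</b></td><td>Activity Canceled Sunday<small>: No Evening Classes</small></td></tr>
</table>
<table class="table table-condensed table-striped">
  <tr><th>Organization</th><th>Status</th></tr>
  <tr><td><b>PRINCE OF PEACE CATHOLIC (KEARNEY)</b></td><td>Closed Monday</td></tr>
</table>
HTML;

echo json_encode(['Closings' => closingsFromHtml($sample)], JSON_PRETTY_PRINT), "\n";
```

Because textContent already drops nested tags such as <b> and <small>, no strip_tags() call is needed in this variant.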
Saturday, May 29, 2021
 
3

You're not creating the DOM correctly. You must do it like this:

// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load(curl_exec($ch));

print_r( $dom );

Check the Manual for more details...

Edit

It seems to be a cURL settings problem. Please refer to the documentation to configure it correctly...

This is a function I usually use to download pages. Feel free to adjust it to your needs:

function dlPage($href) {

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $str = curl_exec($curl);
    curl_close($curl);

    // Create a DOM object
    $dom = new simple_html_dom();
    // Load HTML from a string
    $dom->load($str);

    return $dom;
}

$url = 'http://www.example.com/';
$data = dlPage($url);
print_r($data);
Thursday, April 1, 2021
 
Xavio
 
3

Just saw your message and realised I hadn't gotten back to you about this. Maybe this will lead you in the right direction:

<?php

$opts = array('http'=>array('header' => "User-Agent:MyAgent/1.0\r\n"));
$context = stream_context_create($opts);
$html = file_get_contents('http://www.studentenwerk-karlsruhe.de/de/essen/?view=ok&STYLE=popup_plain&c=erzberger&p=1&kw=3',false,$context);

libxml_use_internal_errors(true);
$dom = new DomDocument;
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$nodes = $xpath->query("//table[@class='easy-tab-dot']");
//header("Content-type: text/plain");

foreach ($nodes as $i => $node) {
    $arr = array();

    $children = $node->childNodes;
    foreach ($children as $child) {
        $tmp_doc = new DOMDocument();
        $tmp_doc->appendChild($tmp_doc->importNode($child,true));       
        #echo $tmp_doc->saveHTML();
        print_r( $child );
    }
    echo "#######################################################################################";
}
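The loop above builds a temporary document for each child node. As a hedged alternative, here is a small sketch that serializes each matched table directly, assuming PHP 5.3.6 or later, where DOMDocument::saveHTML() accepts a node argument. The sample markup is made up; only the easy-tab-dot class name comes from the code above.

```php
<?php
// Made-up sample markup; only the class name comes from the answer above.
$html = '<div><table class="easy-tab-dot"><tr><td>Soup</td><td>1,50</td></tr></table>'
      . '<table class="other"><tr><td>ignored</td></tr></table></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$tables = [];
foreach ($xpath->query("//table[@class='easy-tab-dot']") as $node) {
    // saveHTML() with a node argument serializes just that node and its children
    $tables[] = $dom->saveHTML($node);
}

print_r($tables);
```

Each entry in $tables is the markup of one matched table, which can then be inspected or parsed further.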
Thursday, April 1, 2021
 
4

Isn't it easy? Try things first, then ask. (:

<?php
include 'simple_html_dom.php';
$html = file_get_html('http://www.weather.gov.sg/lws/zoneInfo.do');

$n = 0;
$table = $html->find('table',3)->find('table',0)->find('table',0)->find('table',0)->find('table',3)->find('table',0);

$i = -3;
$rows = $table->find('tr');
$holder = array();

foreach($rows as $element){
    $i++;
    if($i < 0) continue;

    $holder[$i]['name'] = $element->find('td',0)->plaintext;
    $holder[$i]['zone_or_school'] = $element->find('td',1)->plaintext;
    $holder[$i]['risk'] = $element->find('td',2)->plaintext;
    $holder[$i]['from'] = $element->find('td',3)->plaintext;
    $holder[$i]['till'] = $element->find('td',4)->plaintext;
}

var_dump($holder);
?>

If you want one particular entry, you can filter for it:

foreach ($holder as $key => $val) {
    if ($val['name'] == 'Bedoc') {
        $my_data = $val;
    }
}

This code isn't debugged because I am on mobile right now, but hopefully you get the idea even if it doesn't work as is. Thanks.
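The filtering step can also be wrapped in a small function. The following is only a sketch: findByName is a hypothetical helper, and the sample rows and their values are made up to mimic the shape of $holder.

```php
<?php
// Hypothetical rows in the same shape as $holder from the answer above (values are made up).
$holder = [
    ['name' => 'Bedoc', 'risk' => 'Low'],
    ['name' => 'Jurong', 'risk' => 'High'],
];

// Hypothetical helper: returns the first row whose 'name' matches, or null if none does.
function findByName(array $rows, string $name): ?array
{
    $matches = array_filter($rows, function (array $row) use ($name) {
        return $row['name'] === $name;
    });
    // array_filter() preserves keys, so reset() is used to fetch the first match
    return $matches ? reset($matches) : null;
}

$my_data = findByName($holder, 'Bedoc');
var_dump($my_data);
```

Returning null when nothing matches avoids the undefined $my_data that the plain loop leaves behind in that case.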

Saturday, May 29, 2021
 
Claudio
 
3

The idea is to use a regular expression pattern with a capturing group. Use this pattern first to locate the script element by its text, and then to extract the substring from the script itself. After that, you can use json.loads() to load the JSON string into a Python object:

import json
import re

from bs4 import BeautifulSoup

data = """
your HTML here"""

soup = BeautifulSoup(data, "html.parser")

pattern = re.compile(r"window\.Rent\.data\s+=\s+(\{.*?\});\n")
script = soup.find("script", text=pattern)

data = pattern.search(script.text).group(1)
data = json.loads(data)
print(data)

There is also another way: a JavaScript parser. I've experimented with slimit on StackOverflow a couple of times, check it out.
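The capturing-group approach described in this answer can also be sketched in PHP, to match the rest of this thread. This is only an illustration: the page fragment and its values are made up, and window.Rent.data is simply the variable name used in the Python pattern above. It assumes the object literal sits on a single line and is valid JSON.

```php
<?php
// Made-up page fragment; window.Rent.data is the variable name from the answer above.
$html = <<<'HTML'
<script>
window.Rent.data = {"id": 42, "city": "Kearney"};
</script>
HTML;

$data = null;
// Group 1 captures the object literal between "=" and the ";" that ends the line
if (preg_match('/window\.Rent\.data\s+=\s+(\{.*?\});\s*$/m', $html, $m)) {
    // Decode the captured JSON text into an associative array
    $data = json_decode($m[1], true);
}

print_r($data);
```

If the embedded object is not strict JSON (for example, unquoted keys), json_decode() returns null and a JavaScript parser would be the safer route.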

Tuesday, August 3, 2021
 
Sean
 