Charset detection routine in PHP with Snoopy

Sometimes you can't leave everything to be done automatically, for instance charset detection when you retrieve something from the web. With HTML and XML, the charset given in the HTTP headers can be overridden inside the file itself, which gets tricky (especially since you must give the XML parser the correct encoding before handing it the file). So here is roughly what I use in Radio.Blog to PLS. The dependencies are Snoopy, the HTTP client library, and an HTML parser written by Jose Solorzano (unfortunately we don't have TagSoup in PHP). It's crude, but it works... here we go:

function url_get_contents($url) {
  $snoopy = new Snoopy;

  $c_data = read_cache($url);

  // [...]
  
  if($snoopy->fetch($url)) {
    $c_data['status'] = $snoopy->status;
    $c_data['charset_embedded'] = false;
    if($snoopy->status == 200) {
      $c_data['charset'] = 'iso-8859-1';
      foreach($snoopy->headers as $header) {
        @list($key,$val) = preg_split('/[: ]+/', $header, 2);
        $val = trim($val);
        $key = strtolower($key);
        switch($key) {
        // [...]
          case 'content-type':
          // e.g. "text/html; charset=utf-8" gives "text/html", "charset", "utf-8"
          $parts = preg_split('/[ ;=]+/', $val, 3);
          $c_data['type'] = $parts[0];
          if(isset($parts[1], $parts[2])) {
            $c_data[$parts[1]] = $parts[2];
          }
          break;
        }
      }
      
      $c_data['content'] = $snoopy->results;
      $charset_found = false;

      // is it raw XML? don't trust the Content-Type header
      $off = strpos($c_data['content'], '<?xml');
      // [...]
    } else if($snoopy->status != 304) {
      if($snoopy->status >= 400 && $snoopy->status < 500) {
	$c_data['error'] = true;
      } else if($snoopy->status >= 500 && $snoopy->status < 600) {
	$c_data['error'] = true;
      }
      // other error... what can we do?
    }
  } else {
    // dooh
  }
  
  return $c_data;
}
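
For readers who want the Content-Type handling in isolation, here is a minimal sketch of the same idea. The function name parse_content_type is hypothetical (it is not part of Snoopy or of the routine above), and it assumes the header value has already been separated from the header name:

```php
<?php
// Hypothetical helper, not part of the original routine: splits a
// Content-Type header value into its media type and its charset parameter.
function parse_content_type($value) {
  $result = array('type' => null, 'charset' => null);
  $pieces = explode(';', $value);
  // the media type is everything before the first semicolon
  $result['type'] = strtolower(trim(array_shift($pieces)));
  foreach ($pieces as $piece) {
    $pair = explode('=', $piece, 2);
    if (count($pair) == 2 && strtolower(trim($pair[0])) == 'charset') {
      // strip whitespace and optional quotes around the value
      $result['charset'] = strtolower(trim($pair[1], " \t\"'"));
    }
  }
  return $result;
}

$info = parse_content_type('text/html; charset=UTF-8');
echo $info['type'], ' ', $info['charset'], "\n"; // prints "text/html utf-8"
```

Unlike the preg_split version, this one ignores parameters other than charset and leaves charset at null when the header does not name one, so the caller decides on the fallback.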

function charsetFromHTML(&$c_data) {
  // is it really HTML? We don't even know...
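  // note: the tag literals below are assumed, they were stripped from the listing
  $subdoc = stristr($c_data['content'], '<html');
  if($subdoc !== false) {
    // look for the end of the head, or failing that the start of the body
    $off2 = strpos($subdoc, '</head>');
    if($off2 === false) {
      $off2 = strpos($subdoc, '<body');
    }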
    if($off2 !== false) {
      $parser = new HtmlParser(substr($subdoc, 0, $off2 + 7));
	    
      while ($parser->parse()) {
	if($parser->iNodeType == NODE_TYPE_ELEMENT && strtolower($parser->iNodeName) == 'meta') {
	  if(isset($parser->iNodeAttributes['http-equiv'])) {
	    if(!strcasecmp($parser->iNodeAttributes['http-equiv'], 'content-type')) {
	      $parts = preg_split('/[ ;=]+/', $parser->iNodeAttributes['content'], 3);
	      $c_data['type'] = $parts[0];
	      if(isset($parts[1])) {
		$c_data[$parts[1]] = $parts[2];
	      }
	      break;
	    }
	  }
	} else if($parser->iNodeType == NODE_TYPE_ELEMENT && strtolower($parser->iNodeName) == 'body') {
	  break;
	} else if($parser->iNodeType == NODE_TYPE_ENDELEMENT && strtolower($parser->iNodeName) == 'head') {
	  break;
	}
      }
    }
  }
}
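
For the XML side, the encoding can be read from the XML declaration before the document goes to the parser. What follows is only a sketch, not the code used in Radio.Blog to PLS: the name charset_from_xml_decl is hypothetical, and the approach assumes an ASCII-compatible encoding (it will not see a declaration in a UTF-16 document).

```php
<?php
// Hypothetical helper: returns the encoding named in the XML declaration,
// lowercased, or null when the document does not start with one.
function charset_from_xml_decl($content) {
  // The declaration must sit at the very start of the document;
  // a UTF-8 byte order mark may precede it.
  $pattern = '/^(?:\xEF\xBB\xBF)?<\?xml[^>]*?\sencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';
  if (preg_match($pattern, $content, $m)) {
    return strtolower($m[2]);
  }
  return null;
}

$xml = '<?xml version="1.0" encoding="ISO-8859-1"?><rss/>';
echo charset_from_xml_decl($xml), "\n"; // prints "iso-8859-1"
```

A null result means the declaration gave no answer, in which case the charset from the Content-Type header (or the default) still applies.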

La Tortue Cynique / The Cynical Turtle
