

Charset detection routine in PHP with Snoopy

Sometimes you can't leave everything to be done automatically, charset detection for something you retrieve from the web being one example. With HTML and XML, the charset announced in the HTTP headers can be overridden inside the file itself, and this can be tricky (especially since you must give the correct encoding to the XML parser before handing it the file). So here is basically what I use in Radio.Blog to PLS. The dependencies are Snoopy, the HTTP client library, and an HTML parser written by Jose Solorzano (unfortunately we don't have TagSoup in PHP). It's crude, but it works... here we go:

function url_get_contents($url) {
  $snoopy = new Snoopy;

  $c_data = read_cache($url);

  // [...]
  
  if($snoopy->fetch($url)) {
    $c_data['status'] = $snoopy->status;
    $c_data['charset_embedded'] = false;
    if($snoopy->status == 200) {
      $c_data['charset'] = 'iso-8859-1';
      foreach($snoopy->headers as $header) {
	@list($key,$val) = preg_split('/[: ]+/', $header, 2);
	$val = trim($val);
	$key = strtolower($key);
	switch($key) {
	// [...]
	  case 'content-type':
	  // split the header value (not the key): type, parameter name, parameter value
	  $parts = preg_split('/[ ;=]+/', $val, 3);
	  $c_data['type'] = $parts[0];
	  if(isset($parts[1])) {
	    $c_data[$parts[1]] = $parts[2];
	  }
	  break;
	}
      }
      
      $c_data['content'] = $snoopy->results;
      $charset_found = false;

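      // is it raw XML? don't trust the Content-Type header
      // ('<?xml' is the presumed needle; the rest of this branch is missing)
      $off = strpos($c_data['content'], '<?xml');
      // [...]
    } else if($snoopy->status != 304) {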
      if($snoopy->status >= 400 && $snoopy->status < 500) {
	$c_data['error'] = true;
      } else if($snoopy->status >= 500 && $snoopy->status < 600) {
	$c_data['error'] = true;
      }
      // other error... what can we do?
    }
  } else {
    // dooh
  }
  
  return $c_data;
}

function charsetFromHTML(&$c_data) {
  // is it really HTML? We don't even know...
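  // needles presumed: '<head' to find the start, '</head>' (7 chars, hence the + 7 below) for the end
  $subdoc = stristr($c_data['content'], '<head');
  if($subdoc !== false) {
    $off2 = strpos($subdoc, '</head>');
    if($off2 === false) {
      $off2 = strpos($subdoc, '</HEAD>');
    }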
    if($off2 !== false) {
      $parser = new HtmlParser(substr($subdoc, 0, $off2 + 7));
	    
      while ($parser->parse()) {
	if($parser->iNodeType == NODE_TYPE_ELEMENT && strtolower($parser->iNodeName) == 'meta') {
	  if(isset($parser->iNodeAttributes['http-equiv'])) {
	    if(!strcasecmp($parser->iNodeAttributes['http-equiv'], 'content-type')) {
	      $parts = preg_split('/[ ;=]+/', $parser->iNodeAttributes['content'], 3);
	      $c_data['type'] = $parts[0];
	      if(isset($parts[1])) {
		$c_data[$parts[1]] = $parts[2];
	      }
	      break;
	    }
	  }
	} else if($parser->iNodeType == NODE_TYPE_ELEMENT && strtolower($parser->iNodeName) == 'body') {
	  break;
	} else if($parser->iNodeType == NODE_TYPE_ENDELEMENT && strtolower($parser->iNodeName) == 'head') {
	  break;
	}
      }
    }
  }
}
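
The listing leans on two pieces of string handling: pulling the charset parameter out of a Content-Type value, and reading the encoding announced by an XML declaration before the parser sees the file. As a rough sketch only (these helpers are hypothetical and are not the code used in Radio.Blog to PLS), both could be done with `preg_match`. The function names, the 512-byte window and the priority order are assumptions; the `iso-8859-1` fallback mirrors the default in the listing above.

```php
<?php
// Hypothetical helpers, not part of the original listing.

// Return the lowercased charset parameter of a Content-Type value,
// or null when the value carries none.
function charset_from_content_type($value) {
  // Accepts charset=utf-8, charset="utf-8" and charset='utf-8'.
  if (preg_match('/;\s*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $value, $m)) {
    return strtolower($m[1]);
  }
  return null;
}

// Return the lowercased encoding named in a leading XML declaration,
// or null when the content does not start with one.
function charset_from_xml_declaration($content) {
  // Only look at the first bytes (512 is an arbitrary window);
  // an optional UTF-8 byte order mark and whitespace are tolerated.
  $head = substr($content, 0, 512);
  $re = '/^(?:\xEF\xBB\xBF)?\s*<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i';
  if (preg_match($re, $head, $m)) {
    return strtolower($m[1]);
  }
  return null;
}

// Assumed priority, following the post's "don't trust the Content-Type
// header" stance: embedded declaration first, then header, then default.
function resolve_charset($content_type, $content, $default = 'iso-8859-1') {
  $embedded = charset_from_xml_declaration($content);
  if ($embedded !== null) {
    return $embedded;
  }
  $from_header = charset_from_content_type($content_type);
  return $from_header !== null ? $from_header : $default;
}
```

For instance, `resolve_charset('text/xml; charset=iso-8859-1', '<?xml version="1.0" encoding="UTF-8"?><rss/>')` returns `'utf-8'`, because the embedded declaration wins over the header in this sketch. Unlike the `preg_split` approach, this ignores any parameter other than `charset`, so a value such as `text/html; boundary=x` simply falls back to the default.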

(Cyberpunk, 2006/07/03 22:07) permalink



about

name: Damien Bonvillain
email: kame at cinemasie.com