This guest post is written by Jigish Thakar.
As the name says, this is a really simple email crawler: all you need to provide is one URL, and it will start crawling the web from there and extract the email addresses from all the pages it crawls. Of course, after extracting the addresses we need to store them somewhere, so we use a MySQL database that consists of two tables, urls and emails.
The whole process of extracting emails is divided into 5 steps, as given below.
- fetch a page that has not been fetched yet
- extract emails from it and store them in the database
- for each new email, store a Google search URL (with the email as the search query) in the urls table
- extract URLs from the fetched page and store them in the database
- mark the page's entry as fetched
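To make the five steps concrete before looking at the real class, here is a minimal in-memory sketch. It is not the code from the project: crawlPass, the $fetch callable, and the two arrays standing in for the urls and emails tables are all assumptions made for illustration, and the link pattern is a simplified one.

```php
<?php
// Sketch only: $urls maps url => fetched flag, $emails maps email => source url.
// $fetch is any callable that returns the page body for a URL (hypothetical).
function crawlPass(array &$urls, array &$emails, callable $fetch, $limit = 10)
{
    $done = 0;
    // array_keys() takes a snapshot, so URLs queued now wait for the next pass.
    foreach (array_keys($urls) as $url) {
        if ($urls[$url] || $done >= $limit) {
            continue;
        }
        $page = (string) $fetch($url);                   // step 1: fetch the page

        preg_match_all('/[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+/', $page, $m);
        foreach (array_unique($m[0]) as $email) {
            if (!isset($emails[$email])) {
                $emails[$email] = $url;                  // step 2: store the email
                $search = 'http://www.google.co.in/search?hl=en&q=' . urlencode($email);
                $urls += array($search => false);        // step 3: queue its search URL
            }
        }

        preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $page, $m);
        foreach (array_unique($m[1]) as $link) {
            $urls += array($link => false);              // step 4: queue new links
        }

        $urls[$url] = true;                              // step 5: mark as fetched
        $done++;
    }
}

// Usage with a canned page instead of a real HTTP request.
$urls = array('http://example.com/' => false);
$emails = array();
$fetch = function ($url) {
    return '<a href="http://example.com/about">About</a> mail info@example.com';
};
crawlPass($urls, $emails, $fetch);
```

Each call handles at most $limit unfetched URLs, which mirrors the idea of crawling in small batches.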
Database:
As mentioned earlier, we need only two tables: one to store the URLs and one for the emails.
CREATE TABLE IF NOT EXISTS `emails` (
  `id` bigint(100) NOT NULL AUTO_INCREMENT,
  `url_id` bigint(100) NOT NULL DEFAULT '0',
  `email` varchar(255) NOT NULL,
  `lastupdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);
The emails table has very few fields: id is the primary key, url_id maintains the relation between an email address and the URL it was fetched from, and email stores the extracted address. As of now we don't use lastupdate in our business logic, but it can be useful for further development.
CREATE TABLE IF NOT EXISTS `urls` (
  `id` bigint(100) NOT NULL AUTO_INCREMENT,
  `parent_id` bigint(100) NOT NULL DEFAULT '0',
  `email_id` int(100) NOT NULL DEFAULT '0',
  `url` text NOT NULL,
  `is_sync` int(1) NOT NULL DEFAULT '0',
  `lastupdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);
The urls table again has id as its primary key. parent_id is set when the URL was extracted from another URL, email_id is set when the URL is the Google search URL for an email address, url stores the extracted URL, and is_sync is set to 1 once the URL has been crawled.
Class (code):
First of all, call set_time_limit(0); because this script may take a long time to execute.
function crawl(){
    // pick up a batch of URLs that have not been crawled yet
    $qr = "SELECT id, url FROM urls WHERE is_sync = 0 LIMIT 0, ".$this->urlCrawlLimit;
    $rs = mysql_query($qr, $this->conn);
    if(mysql_num_rows($rs) > 0){
        while($row = mysql_fetch_assoc($rs)){
            $this->crawlUrl($row);
            // mark the URL as crawled
            $qr = "UPDATE urls SET is_sync = 1 WHERE id = '".$row['id']."'";
            mysql_query($qr, $this->conn) or die(mysql_error());
        }
    }
}
This is the base function where it all starts. Here we simply fetch uncrawled URLs from the table, up to a predefined limit; the limit is configurable because every server has different processing power. For each row found we call the private method crawlUrl and then mark the row as crawled.
private function crawlUrl($row){
    $url = $row['url'];
    $this->urlId = $row['id'];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, 0);         // do not include response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    $resp = curl_exec($ch);
    curl_close($ch);
    if(trim($resp) != ''){
        $this->extractEmails($resp);
        $this->extractUrl($resp);
    }
}
This method mainly focuses on fetching the web page using curl. We have turned off the option to receive headers, as we are not interested in the deeper details of the remote system. If the URL is reachable and curl gets a response, we directly call the methods that extract emails and URLs.
private function extractUrl($resp){
    // capture the href value of every anchor tag
    preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $resp, $arrUrl);
    $arrUrl = $arrUrl[1];
    $arrUrl = array_unique($arrUrl);
    if(count($arrUrl) > 0){
        foreach($arrUrl as $v){
            // escape the value so a quote inside a URL cannot break the query
            $v = mysql_real_escape_string($v, $this->conn);
            $qr = "SELECT * FROM `urls` WHERE `url` = '".$v."'";
            $rs = mysql_query($qr, $this->conn);
            if(mysql_num_rows($rs) == 0){
                $qr = "INSERT INTO urls SET `url` = '".$v."', `parent_id` = '".$this->urlId."'";
                mysql_query($qr, $this->conn);
            }
        }
    }
}
extractUrl does what its name suggests. Once we have an array of the URLs found on the page, we check one by one whether each is already in the table; if it is not, we insert it with its parent_id.
Note: you might think this could be done in a better way, for example by first checking with a single SELECT query how many of the URLs are already present and then inserting only the missing ones. But that puts sudden heavy load on the server when the tables hold hundreds of thousands of records or when the script extracts thousands of URLs from a single page.
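To see what the pattern in extractUrl actually captures, the sketch below runs the same regular expression on a small HTML string without touching the database. extractLinks is a hypothetical helper, not part of the class, and the filter that keeps only absolute http(s) links is an added assumption, since curl cannot fetch values such as "#top" or "mailto:" links.

```php
<?php
// Sketch only: extractLinks() is a hypothetical helper for trying out the pattern.
function extractLinks($html)
{
    // Same pattern as in extractUrl(); group 1 holds the href value.
    $pattern = '/a[\s]+[^>]*?href[\s]?=[\s"\']+(.*?)["\']+.*?>([^<]+|.*?)?<\/a>/';
    preg_match_all($pattern, $html, $matches);

    $links = array();
    foreach (array_unique($matches[1]) as $href) {
        $href = trim($href);
        // Assumption: keep only absolute http(s) links.
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

$html = '<a href="http://example.com/a">A</a> <a href="#top">Top</a> '
      . "<a href='http://example.com/a'>Again</a> "
      . '<a href="mailto:x@example.com">Mail</a>';
$links = extractLinks($html);
```

The duplicate link is dropped by array_unique, the same way the class does it before querying the table.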
private function extractEmails($resp){
    $pattern = '/([a-zA-Z0-9-_\.]+@[a-zA-Z0-9-_\.]+)/';
    preg_match_all($pattern,$resp,$arrEmails);
    $arrEmails = $arrEmails[0];
    $arrEmails = array_unique($arrEmails);
    if(count($arrEmails) > 0){
        foreach($arrEmails as $v){
            if($this->isValidEmail($v)){
                // escape the value before using it in a query
                $safe = mysql_real_escape_string($v, $this->conn);
                $qr = "SELECT * FROM `emails` WHERE `email` = '".$safe."'";
                $rs = mysql_query($qr, $this->conn);
                if(mysql_num_rows($rs) == 0){
                    $qr = "INSERT INTO emails SET `email` = '".$safe."', `url_id` = '".$this->urlId."'";
                    mysql_query($qr, $this->conn);
                    $email_id = mysql_insert_id();
                    // queue a Google search for the new address as the next URL to crawl
                    $email_url = "http://www.google.co.in/search?hl=en&q=".urlencode($v);
                    $qr = "INSERT INTO urls SET `url` = '".$email_url."', `email_id` = '".$email_id."'";
                    mysql_query($qr, $this->conn);
                }
            }
        }
    }
}
extractEmails works the same way. The only difference is that once we find a new email address, we generate its Google search URL and store it in the urls table. This acts as fuel for the system.
Also, at the very beginning of the class we have declared an array as given below.
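The email side can be tried on plain strings in the same way. In the sketch below, findEmails and googleSearchUrl are hypothetical helpers; the body of isValidEmail is not shown in this post, so PHP's built-in filter_var with FILTER_VALIDATE_EMAIL is used as a stand-in, which is an assumption and may be stricter or looser than the real method.

```php
<?php
// Sketch only: hypothetical helpers that work without a database.
function findEmails($text)
{
    $pattern = '/[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+/';
    preg_match_all($pattern, $text, $matches);

    $emails = array();
    foreach (array_unique($matches[0]) as $candidate) {
        // A trailing dot is picked up when an address ends a sentence.
        $candidate = rtrim($candidate, '.');
        // Assumption: filter_var() stands in for isValidEmail().
        if (filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false) {
            $emails[] = $candidate;
        }
    }
    return array_values(array_unique($emails));
}

function googleSearchUrl($email)
{
    // Same URL format that extractEmails() stores in the urls table.
    return 'http://www.google.co.in/search?hl=en&q=' . urlencode($email);
}

$found = findEmails('Write to info@example.com or sales@example.org. Again: info@example.com');
$next  = googleSearchUrl($found[0]);
```

Every new address produces one more URL to crawl, which is why the queue keeps growing from a single starting URL.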
var $arrEmailMaintainace = array(
    '%noreply%',
    '%@blogger%',
    '%@googlegroup%',
);
These are the rules: if any stored email address satisfies one of these rules (conditions), we just delete it. After all, there is no need to store an address like noreply@technoreaders.com in your lead list.
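The cleanup code itself is not shown here, so the following is only a sketch of how such rules could be applied, assuming they are SQL LIKE patterns as the % wildcards suggest. buildCleanupQueries and matchesRule are hypothetical names: the first builds one DELETE statement per rule, and the second imitates the LIKE match in PHP so a rule can be checked without a database.

```php
<?php
// Sketch only: both helpers are hypothetical and not taken from the project.
function buildCleanupQueries(array $rules)
{
    $queries = array();
    foreach ($rules as $rule) {
        // The rules are hard-coded constants, so they are interpolated directly.
        $queries[] = "DELETE FROM emails WHERE email LIKE '" . $rule . "'";
    }
    return $queries;
}

function matchesRule($email, $rule)
{
    // Translate the LIKE wildcard % into ".*" (the _ wildcard is not handled).
    $regex = '/^' . str_replace('%', '.*', preg_quote($rule, '/')) . '$/i';
    return preg_match($regex, $email) === 1;
}

$rules   = array('%noreply%', '%@blogger%');
$queries = buildCleanupQueries($rules);
$isJunk  = matchesRule('noreply@technoreaders.com', '%noreply%');
```

Keeping the rules in one array means a new unwanted pattern can be added without touching the crawling code.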
Conclusion:
We have tried to cover all the important parts of the code, but not everything. You can check out the full code from code.google.com using the SVN details given below.
svn checkout http://simple-email-crawler.googlecode.com/svn/trunk/ simple-email-crawler-read-only
and the project URL is
http://code.google.com/p/simple-email-crawler/
[This guest post is written by Jigish Thakar, a web developer and an entrepreneur at heart. He writes at his blog http://technoreaders.com/, and you can also follow him on Twitter at @jigishthakar.]