There are two ways to crawl web pages in Java.
The most primitive, but original, way is to open a socket on port 80 and then send a GET request to obtain the content. It works almost like telnet:
- telnet google.com 80
- GET / HTTP/1.0 (followed by two newlines)
This returns the content of the home page. The sample method below shows how to do the whole procedure in Java.
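// Requires: import java.io.*; import java.net.*;
public String urlCrawle(String url) {
    this.insertCrawledUrl(url);
    StringBuilder objBuffer = new StringBuilder();
    try {
        URL objURL = new URL(url);
        String host = objURL.getHost();
        String path = objURL.getPath();
        if (path.length() == 0) {
            path = "/";
        }
        // Append the query string only when the URL actually has one;
        // otherwise the request would contain the literal text "?null".
        if (objURL.getQuery() != null) {
            path += "?" + objURL.getQuery();
        }
        // HTTP lines end with CRLF and an empty line terminates the request.
        // Many servers also expect a Host header, even for HTTP/1.0.
        String outQuery = "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n\r\n";
        //System.out.println(outQuery);
        // try-with-resources closes the socket even if reading fails.
        try (Socket s = new Socket(InetAddress.getByName(host), 80)) {
            PrintWriter out = new PrintWriter(new OutputStreamWriter(s.getOutputStream()));
            out.print(outQuery);
            out.flush();
            BufferedReader instream = new BufferedReader(new InputStreamReader(s.getInputStream()));
            String line = instream.readLine();
            // Keep the response (headers and body) only when the status is 200.
            if (line != null && (line.contains("HTTP/1.0 200") || line.contains("HTTP/1.1 200"))) {
                while (line != null) {
                    objBuffer.append(line).append("\n");
                    line = instream.readLine();
                }
            }
        }
    } catch (Exception ex) {
        // Report the failure instead of silently swallowing it.
        ex.printStackTrace();
    }
    //return this.stripTagFromHtml(objBuffer.toString());
    System.out.println(objBuffer.toString());
    return objBuffer.toString();
}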
The problem with this procedure is that you have to separate the hostname, path and query string yourself and then work with each of them individually. That is a rather low-level way to work in Java, which is a high-level language with rich libraries. For educational purposes, however, it is the best way to understand the underlying working of the system.
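The splitting step described above can be tried in isolation, without opening a socket. The following is only a minimal sketch: the class name `RawRequest` and the method `build` are hypothetical names chosen for this illustration, `java.net.URI` is used for parsing in place of `URL`, and the request is assumed to be plain HTTP/1.0 with a `Host` header.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the crawler method above.
class RawRequest {

    // Builds the text of a minimal HTTP/1.0 GET request for the given URL.
    static String build(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        // The raw (still percent-encoded) forms are what goes on the wire.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        // Append "?query" only when the URL actually carries a query string.
        String query = uri.getRawQuery();
        String target = (query == null) ? path : path + "?" + query;
        // Each line ends with CRLF; the final empty line ends the request.
        return "GET " + target + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n";
    }

    public static void main(String[] args) {
        System.out.print(build("http://example.com/search?q=java"));
    }
}
```

Keeping the request construction in its own method makes the null-query and empty-path cases easy to check before any network code is involved.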
Because Java has a wide range of network programming libraries, you can use the URL class for web crawling instead; it is the easiest and also an effective way. An example method is shown below.
// Requires: import java.io.*; import java.net.*;
public String urlCrawle(String url) {
    this.insertCrawledUrl(url);
    StringBuilder objBuffer = new StringBuilder();
    try {
        URL hp = new URL(url);
        URLConnection hpCon = hp.openConnection();
        // getContentLength() returns -1 when the server does not report a length
        // (for example with chunked responses), so it is not used as a guard here.
        try (BufferedReader instream = new BufferedReader(new InputStreamReader(hpCon.getInputStream()))) {
            String line = instream.readLine();
            while (line != null) {
                objBuffer.append(line).append("\n");
                line = instream.readLine();
            }
        }
    } catch (Exception ex) {
        // Report the failure instead of silently swallowing it.
        ex.printStackTrace();
    }
    //return this.stripTagFromHtml(objBuffer.toString());
    return objBuffer.toString();
}
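A slightly more defensive variant of the same idea is sketched below. It is an illustration under assumptions, not the method above: the class name `PageFetcher`, the method `fetch`, the 5-second timeouts and the choice of UTF-8 are all hypothetical. Because `URLConnection` is protocol-independent, the same code also reads a `file:` URL, which makes it easy to try without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; names and timeout values are assumptions.
class PageFetcher {

    // Reads the whole resource behind the URL and returns it as text.
    static String fetch(String url) throws IOException {
        URLConnection con = URI.create(url).toURL().openConnection();
        // Avoid hanging forever on an unresponsive server.
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);
        StringBuilder page = new StringBuilder();
        // Decode as UTF-8 (an assumption) and close the stream when done.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PageFetcher <url>
        if (args.length == 1) {
            System.out.print(fetch(args[0]));
        }
    }
}
```

Letting `fetch` throw `IOException` instead of catching it leaves the decision about retries or logging to the caller, which a crawler usually needs.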