Tuesday, 31 August 2010

Improving your Google Crawl Rate

I've been very busy lately applying some SEO optimizations to a site that has been around for some years now (namely http://shop.vlan.be).
So I decided to write an article on how you can improve your ranking with search engines like Google.

I will be posting multiple articles on this subject, and in this first one I would like to talk about the "Google Crawl Rate".

The Google Crawl Rate is a reasonable indication of how important your site is to Google. If the Google Crawler visits your site regularly, then this is a good sign of your site's importance. You can't force Googlebot to visit your site more often, but you can take steps to invite it to come over.

I'll be discussing five tips to get Google's attention:

1. Make sure your content is updated regularly. Try to add as much unique content to your site as possible.
What kind of unique content you should add depends on the type of site you have. If you have a blog of some sort, you should add new posts regularly. A lot of sources say you should aim for three new posts per week. I don't know if that's the optimal number, but if you can write that many useful posts per week, then I do advise you to do it! The more new content you offer to Google, the more often Google will come back to check for new content, so new content gets crawled as quickly as possible.

2. Avoid duplicate content. Duplicate content occurs when the same page can be accessed through two or more different URLs. Google then has to pick one of those URLs to show in its results, and the links pointing to that page may end up split across the duplicates instead of all counting towards one URL. Google doesn't publish any kind of penalty score, so you can't look up how your site is being treated. But if Google concludes that you are duplicating content to mislead it into showing your pages more often, it may rank your site lower, and in the worst case your site might stop appearing in the search results altogether. That's why it's very important to avoid duplicate content!

3. Get other sites to link back to you. According to some specialists, this is the most important advice to follow, because the number of backlinks your site has determines how important your site is to Google. While backlinks are certainly a big part of how Google scores your site, they are not the only thing that is considered. Nevertheless, you should definitely try to increase the number of backlinks to your site.
If Google determines that your site is important, it will crawl your site more often. And if there is enough unique content on your site, it will appear in the search results more often, which in turn brings more traffic to your website.

4. Create a sitemap. I've recently followed this advice for the first time and I was pleasantly surprised by the result. A sitemap is an XML file that lists all of your site's unique URLs. Google doesn't promise it will crawl all those URLs, but it will use the sitemap to try and figure out your website's structure.
Even without that guarantee, offering a sitemap containing all URLs on your website can help Google discover and crawl those pages.
By keeping your sitemap up to date, you can present new content to Google on a silver platter.
A sitemap needs to be submitted to Google via Google Webmaster Tools (which will be discussed in a future post).
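
To give an idea of what such a file looks like, here is a minimal sketch in Java that builds a sitemap from a list of URLs. It only uses the urlset, url and loc elements of the sitemaps.org format. The class name SitemapSketch and the www.example.com URLs are made up for the example; a real site would probably generate the list from its database or CMS.

```java
import java.util.Arrays;
import java.util.List;

class SitemapSketch {

    // Escapes the five characters that must be entity-escaped inside XML text.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    // Builds a sitemap document with one <url><loc>...</loc></url> entry per URL.
    static String build(List<String> urls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            xml.append("  <url><loc>").append(escapeXml(url)).append("</loc></url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, for illustration only.
        List<String> urls = Arrays.asList(
                "http://www.example.com/",
                "http://www.example.com/catalog?item=12&lang=en");
        System.out.print(build(urls));
    }
}
```

Note that the ampersand in the second URL has to be written as an entity inside the XML, which is why the sketch escapes every URL before writing it.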

5. Create a blog. If you're trying to promote a blog through Google, this advice is useless, but if you're trying to promote a different type of site through Google, then this advice is definitely worth following.
You see, popular blogs and popular blog sites (like Blogger.com or WordPress.com) are constantly being crawled by Google because they are teeming with new content all the time. Having your own blog among those, with articles that link to content on your site, should increase the importance of your site and can increase your crawl rate significantly.
Make sure you don't only post a link to the content, but describe the content as well or write an article about it. The content will lure more visitors to your blog, who might click the link to your site and become new regular visitors.

That's about all of the tips I have for this article. Does anyone have a tip I forgot to mention? Feel free to send it to me, or leave a reply to this article!

Monday, 22 December 2008

Optimizing Your Java Code (2)

So as I mentioned earlier, I recently attended the Devoxx Java conference (formerly named Javapolis). At this conference I attended the 'Java Performance' lecture given by Dr. Holly Cummins (I should really use her title, not just Holly but Dr. Holly) and Kirk Pepperdine (not a doctor but an independent consultant).

This is part two of my Java Performance blogs. See the first part here

Today, I will be talking about a subject called Latency.
When you're working on a program or a piece of code you sometimes notice that it isn't running as it should. It seems to be slower than you expect. Or sometimes a customer can come to you and say that the program is doing five transactions per second when it should be doing ten.
Either way, you never quite know what the problem is precisely. All we really know is that the user is experiencing a poor response time. The user tries to work with your program but gets annoyed because the system doesn't do everything as quickly as they would like.

Right now, you know next to nothing about what could be causing the problem. It could be that the user's computer just isn't powerful enough to run this application in a responsive manner, or it could be that your application just uses too much memory, making the user's operating system swap like crazy.
Generally speaking, there are three possible causes of poor response times:
1) Insufficient capacity (hardware-related)
2) Inefficient use of capacity (a dreadful architecture, slow algorithms)
3) A combination of the two above

Even if you know that inefficient use of capacity is the bottleneck in this particular case, you will still be hard pressed to find the exact source of the problem. As you probably know, your hardware has a finite amount of resources available. When the processor, for example, receives more work than it can handle at once, it processes what it can and queues the rest, then works through the queue as capacity frees up. Not only the processor but every piece of hardware works with a queue. The amount of time data has to sit and wait before it is processed is called Latency. Latency usually shows up as an inability of the application to consume the CPU. This means that if you run the part of the program where the client reports the bottleneck and you notice that CPU utilization goes up but stays well short of 100%, then this is a reasonable indicator that you're experiencing inflated latency.

It works the other way around as well. If the CPU is fully utilized and you are still getting inflated response times, then one piece of code is dominating the CPU without doing what it should. In this case, you may have an infinite loop (even though that shouldn't have made it through the final testing phase), or maybe you're even doing the String concatenation I warned you about earlier.
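
One rough way to see which of the two situations you are in is to compare the CPU time a piece of code consumed with the wall-clock time it took. The sketch below does that for the current thread using the standard ThreadMXBean. The class CpuShareSketch and its helper methods are my own names for this illustration, and I'm assuming your JVM supports per-thread CPU time measurement (the common desktop and server JVMs do).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class CpuShareSketch {

    // Runs the task on the current thread and returns CPU time divided by
    // wall-clock time: close to 0 means mostly waiting, close to 1 means CPU-bound.
    static double cpuShare(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long cpuStart = threads.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        task.run();
        long wallNanos = Math.max(1, System.nanoTime() - wallStart);
        long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
        return (double) cpuNanos / wallNanos;
    }

    // Stands in for code that is parked, waiting on something else.
    static void waitFor(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stands in for code that keeps the CPU busy the whole time.
    static void spinFor(long millis) {
        long end = System.nanoTime() + millis * 1_000_000L;
        while (System.nanoTime() < end) {
            // busy loop on purpose
        }
    }

    public static void main(String[] args) {
        System.out.printf("waiting:  %.2f%n", cpuShare(() -> waitFor(200)));
        System.out.printf("spinning: %.2f%n", cpuShare(() -> spinFor(200)));
    }
}
```

A low share together with a poor response time points towards latency; a share near 1 points towards code that is hogging the CPU.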

There are four kinds of latency that are important to the performance of your Java application. These are:
1) Hardware Induced Latency
2) OS Induced Latency
3) VM Induced Latency
4) Application Induced Latency

Hardware Induced Latency is the form you'll probably understand best. Your computer has a finite capacity that cannot be exceeded. You cannot ask more of your hardware than the absolute maximum. Everything you ask of it beyond this maximum is queued until the hardware has time to execute those particular commands. The time that these instructions are kept waiting is what we refer to as Latency.
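
The queuing effect is easy to put into numbers. The sketch below is a toy model I made up for this post, not a measurement of real hardware: a single resource handles requests in arrival order, each request takes the same fixed service time, and we calculate how long each request sits in the queue before it is served.

```java
import java.util.Arrays;

class QueueLatencySketch {

    // For a single first-come-first-served resource with a fixed service time,
    // returns how long each request waits before its processing starts.
    static long[] waitTimes(long[] arrivalMillis, long serviceMillis) {
        long[] waits = new long[arrivalMillis.length];
        long resourceFreeAt = 0;
        for (int i = 0; i < arrivalMillis.length; i++) {
            // A request starts when it has arrived AND the resource is free.
            long start = Math.max(arrivalMillis[i], resourceFreeAt);
            waits[i] = start - arrivalMillis[i];
            resourceFreeAt = start + serviceMillis;
        }
        return waits;
    }

    public static void main(String[] args) {
        // Requests arrive every 1 ms, but each one takes 3 ms to process.
        long[] waits = waitTimes(new long[] {0, 1, 2, 3}, 3);
        System.out.println(Arrays.toString(waits)); // prints [0, 2, 4, 6]
    }
}
```

Because requests arrive faster than they can be processed, every request waits longer than the one before it. That growing wait is the latency the user ends up feeling.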

Operating System (OS) Induced Latency is only a little harder to understand. The OS provides an interface for hardware management and takes that work out of the programmer's hands. It provides memory management, I/O and other services. It's much harder to tell that you're dealing with OS induced latency, but symptoms include a relatively high CPU utilization or a large strain on your hard drive.
You can't really change your Operating System (unless you're running an open source one, of course). I have never found the problem to lie with the Operating System itself. I just want you to understand that the Operating System is another layer that queues your requests before it can handle them, thus adding to the total latency.

The VM Induced Latency, on the other hand, is something you can do something about. The Java Virtual Machine (JVM) provides Garbage Collection (GC), as you know. It removes objects that are no longer needed from memory. A lot of people don't really realize what the Garbage Collector does precisely. Does it work alongside your application, or does it interrupt your application? I believe, if I remember Dr. Cummins correctly, that it depends on the strategy your JVM follows. The default strategy for the JVM is to interrupt your program, remove those objects from memory and then allow your program to proceed. Think back to my String concatenation example. You create a String object and then concatenate some characters to this String. The result is a new String object, and the first one is no longer needed. Once enough of these discarded objects have piled up, the JVM interrupts your application, removes them from memory and then allows your application to proceed again. The more garbage you create, the more often this happens, and this vicious circle makes your application run so much slower. The symptoms that point to the JVM as the source of the problem include a high object creation rate and a high CPU utilization.
I think the only real thing you can do about this problem is check your object creation and disposal rates. There are easy, lightweight ways to monitor this, like starting the JVM with the -verbose:gc switch. This will log the activity of your Garbage Collector. There are other tools out there (from IBM, for instance) that can present the information in this log in a visual way.
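
To make the String concatenation case concrete, here is a small sketch that builds the same text in two ways. The class and method names are just for illustration. The first method creates a brand new String on every pass through the loop, while the second one appends to a single StringBuilder and only creates the final String at the end.

```java
class ConcatSketch {

    // Each += creates a new String and leaves the previous one as garbage.
    static String withPlus(int n) {
        String text = "";
        for (int i = 0; i < n; i++) {
            text += i % 10;
        }
        return text;
    }

    // One StringBuilder is reused for all appends; far fewer objects are discarded.
    static String withBuilder(int n) {
        StringBuilder text = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            text.append(i % 10);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int n = 50_000;
        System.out.println("plus:    " + withPlus(n).length() + " characters");
        System.out.println("builder: " + withBuilder(n).length() + " characters");
    }
}
```

If you run this with java -verbose:gc ConcatSketch, you can compare the garbage collection activity that gets logged during each of the two calls. How big the difference is will depend on your JVM version and heap settings.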

The last form of latency is Application Induced Latency. This is your program itself queuing data before processing it. Think of a deadlock happening. Thread A is waiting to gain access to resource R1 while still holding R2, and Thread B is waiting for resource R2 to be released by Thread A while still holding R1. Neither thread can continue, which prevents your application from making forward progress. The symptom that points to your application itself as the source of the problem is an inability to fully utilize the CPU. You notice the response times aren't too good, but the CPU isn't being used all that intensively anyway. That means that some problem is occurring somewhere in your application, causing threads to be parked. For instance, one thread is waiting for an external system (connected e.g. across the Internet) to respond to a request and no threads can proceed without the answer from the external system. Your CPU doesn't have any work, but your application (or a part of it) is halted anyway.
When this problem occurs, I think it's safe to say you may need to do some redesigning of your application to make it less dependent on external systems. If, on the other hand, it is one of those programs that just needs that one web service to perform its task, this is not an option.
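
If you suspect a deadlock like the one described above, you can ask the JVM itself. The sketch below deliberately deadlocks two daemon threads on two lock objects (standing in for R1 and R2) and then uses the standard ThreadMXBean to find them. Everything except the java.lang.management and java.util.concurrent classes is made up for this example; in a real application you would only need the detection part.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockSketch {

    // Starts a daemon thread that locks 'first', waits until the other worker
    // holds its own first lock too, and then blocks forever on 'second'.
    static Thread startWorker(String name, Object first, Object second, CountDownLatch bothLocked) {
        Thread worker = new Thread(() -> {
            synchronized (first) {
                bothLocked.countDown();
                try {
                    bothLocked.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // Never reached: the other worker holds 'second'.
                }
            }
        }, name);
        worker.setDaemon(true); // lets the JVM exit despite the deadlock
        worker.start();
        return worker;
    }

    // Creates the deadlock and returns how many of our two workers the JVM reports.
    static int countDeadlockedWorkers() {
        Object r1 = new Object();
        Object r2 = new Object();
        CountDownLatch bothLocked = new CountDownLatch(2);
        Thread a = startWorker("Thread-A", r2, r1, bothLocked); // holds R2, wants R1
        Thread b = startWorker("Thread-B", r1, r2, bothLocked); // holds R1, wants R2

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 50; attempt++) {
            long[] ids = threads.findMonitorDeadlockedThreads(); // null if none found
            int found = 0;
            if (ids != null) {
                for (long id : ids) {
                    if (id == a.getId() || id == b.getId()) {
                        found++;
                    }
                }
            }
            if (found == 2) {
                return found;
            }
            try {
                Thread.sleep(100); // give the workers time to block
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return found;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked workers found: " + countDeadlockedWorkers());
    }
}
```

Note that while those two threads are stuck they use no CPU at all, which matches the symptom described above: poor response times on a machine that looks idle.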

So those are the four forms of latency that are important to understand. It's important that you realize that sometimes the problem does not lie with your code itself. Your JVM may be intrusive by halting your application all the time as well. You should check out all four forms of latency.
Hopefully you'll have a better understanding of latency now and realize how the different forms of Latency can affect your application.