[olug] awk script to separate apache log files
Brian Roberson
roberson at olug.org
Mon Mar 26 04:47:20 UTC 2007
Oh, if you only knew...
When you host (literally) thousands of domains in one Apache instance, small
things that normally never matter rear their ugly heads. For example, the
finite default number of file descriptors one process is allowed to have
open at one time. While this limit is "tweakable", the tweak is merely a
band-aid for the problem. You can simply assume 3 FDs open for every
"virtual" host when you use the traditional Apache vhost configuration, so
you run out of FDs eventually. The fix is these two (AWESOME) Apache
directives:
VirtualDocumentRoot
VirtualScriptAlias
There are several advantages to this type of setup. For example, how is
this for adding a virtual domain to Apache:
Step one: Create a directory
Step two: You're done
No restart, no config file mojo. Just make a directory and voilà... you're
done.
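
A minimal sketch of what that looks like in practice. The paths, the
example.com domain, and the three httpd.conf lines in the comment are my own
assumptions for illustration (they use mod_vhost_alias, where %0 expands to
the full requested host name), so adjust them to your layout. A throwaway
temp directory stands in for the real vhost root so this is safe to run
anywhere.

```shell
#!/bin/sh
# Sketch: adding a mass-hosted domain is just making a directory.
# Assumed httpd.conf lines (not from the post), using mod_vhost_alias:
#
#   UseCanonicalName Off
#   VirtualDocumentRoot /var/www/vhosts/%0/htdocs
#   VirtualScriptAlias  /var/www/vhosts/%0/cgi-bin
#
# A temp directory stands in for /var/www/vhosts in this sketch.
vhost_root=$(mktemp -d)
new_domain="example.com"    # hypothetical domain name

# Step one: create the directories the two directives will map to.
mkdir -p "$vhost_root/$new_domain/htdocs" "$vhost_root/$new_domain/cgi-bin"
echo "hello from $new_domain" > "$vhost_root/$new_domain/htdocs/index.html"

# Step two: done. Show what was created.
ls "$vhost_root/$new_domain"
```

With the directives pointed at the real root, a request for that host name
should be served from its htdocs directory without touching the config.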
> Adam,
>
> Just out of curiosity, these 3rd level domains you're hosting, aren't
> they individual virtual hosts? If so, why not have Apache run separate
> logs for each domain?
>
> Overly curious,
> Travis
>
> On 3/23/07, Adam Haeder <adamh at aiminstitute.org> wrote:
>> Spent some time on this and thought it would be useful to share with the
>> group.
>>
>> If you've got an apache logfile that contains logs for each virtualhost,
>> with the name of the virtual host as the first field on the line, and
>> you want to create separate web logs for each virtual host (without the
>> virtual host name) so you can run webalizer (or whatever) on it, try
>> this awk script:
>>
>> awk -F" " '{ domain = $1; sub(/^www\./, "", domain); $1 = "";
>> sub(/^[ \t]+/, ""); print >> "/tmp/logs/"domain }' $WEBLOGFILE
>>
>> where $WEBLOGFILE is your apache logfile
>>
>> This will create files in /tmp/logs (assuming the directory exists).
>> Each file will be the name of the virtual host (minus the www part) and
>> will contain all the lines from the log that correspond to that virtual
>> host. The first 'sub' removes the 'www.' and the second sub removes any
>> leading white space (left over from assigning "" to $1).
>>
>> For example, here is a snippet from one of my logs:
>>
>> careerlink.com 205.188.117.65 - - [01/Jan/2007:00:00:00 -0600] "GET
>> /cgi-bin/redirect.pl?redirecttype=apply&domain=40.adg&key=9/9/9/2&po=014666&doco=409992&redirect=http://up.aihres.com/application/index.htm?po=014666&domain=67.adg&where=outside
>> HTTP/1.1" 302 351
>> "http://careerlink.com/9/9/9/2/po/014666.htm?doco=&po=014666&career=&industry=0&use=consolidated&employer=&firm="
>> "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; SV1; .NET
>> CLR 1.1.4322)"
>> careerlink.com 71.223.153.112 - - [01/Jan/2007:00:00:01 -0600] "GET
>> /cgi-bin/redirect.pl?redirecttype=apply&domain=40.adg&key=9/9/9/2&po=014676&doco=409992&redirect=http://up.aihres.com/application/index.htm?po=014676&domain=67.adg&where=outside
>> HTTP/1.1" 302 351
>> "http://careerlink.com/9/9/9/2/po/014676.htm?doco=&po=014676&career=&industry=0&use=consolidated&employer=&firm="
>> "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
>> 1.1.4322)"
>> www.firstdatajobs.com 70.10.175.188 - - [01/Jan/2007:00:00:01 -0600]
>> "GET /longapp.php?req=026DE10600246 HTTP/1.1" 302 324
>> "http://www.careerbuilder.com/JobSeeker/ApplyOnline/ExternalApply.aspx?useframes=True&aourl=http%3a%2f%2fwww.firstdatajobs.com%2flongapp.php%3freq%3d026DE10600246&sc_cmp1=JS_JobDetails_ExtApply&Job_DID=J8C6R76Q8JR2P2NLGDC"
>> "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5)
>> Gecko/20060912 Netscape/8.1.2"
>> firstdatajobs.com 70.10.175.188 - - [01/Jan/2007:00:00:02 -0600] "GET
>> /longapp.php?req=026DE10600246 HTTP/1.1" 302 5
>> "http://www.careerbuilder.com/JobSeeker/ApplyOnline/ExternalApply.aspx?useframes=True&aourl=http%3a%2f%2fwww.firstdatajobs.com%2flongapp.php%3freq%3d026DE10600246&sc_cmp1=JS_JobDetails_ExtApply&Job_DID=J8C6R76Q8JR2P2NLGDC"
>> "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5)
>> Gecko/20060912 Netscape/8.1.2"
>> nebraskapanhandle.careerlink.com 66.249.66.227 - - [01/Jan/2007:00:00:02
>> -0600] "GET /state/ne/city/beatrice/page41.htm HTTP/1.1" 200 8816 "-"
>> "Mozilla/5.0 (compatible; Googlebot/2.1;
>> +http://www.google.com/bot.html)"
>> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET
>> /0/5/5/1/employer.htm HTTP/1.1" 200 1848
>> "http://careerlink.com/state/c2/logo3.htm" "Mozilla/4.0 (compatible;
>> MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
>> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET
>> /0/5/5/1/mast.htm HTTP/1.1" 200 5750
>> "http://careerlink.com/0/5/5/1/index_m.htm" "Mozilla/4.0 (compatible;
>> MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
>> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET
>> /0/5/5/1/avantas5.jpg HTTP/1.1" 200 84444
>> "http://careerlink.com/0/5/5/1/index.htm" "Mozilla/4.0 (compatible; MSIE
>> 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
>> siouxfalls.careerlink.com 66.249.66.227 - - [01/Jan/2007:00:00:04 -0600]
>> "GET /1/2/1/3/po/000271f.htm HTTP/1.1" 200 19749 "-" "Mozilla/5.0
>> (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET
>> /0/5/5/1/index_m.htm HTTP/1.1" 200 1398
>> "http://careerlink.com/0/5/5/1/employer.htm" "Mozilla/4.0 (compatible;
>> MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
>>
>> If I run that log through this awk script, I get the following in
>> /tmp/logs:
>>
>> logs:~ # ls -al /tmp/logs/
>> total 24
>> drwxr-xr-x 2 root root 4096 2007-03-23 13:10 .
>> drwxrwxrwt 8 root root 4096 2007-03-23 13:09 ..
>> -rw-r--r-- 1 root root 1819 2007-03-23 13:10 careerlink.com
>> -rw-r--r-- 1 root root 826 2007-03-23 13:10 firstdatajobs.com
>> -rw-r--r-- 1 root root 185 2007-03-23 13:10 nebraskapanhandle.careerlink.com
>> -rw-r--r-- 1 root root 175 2007-03-23 13:10 siouxfalls.careerlink.com
>> logs:~ #
>>
>> Now I can call webalizer on each of these files to get unique metrics
>> for that domain.
>>
>> I was doing this with a bash shell script, looping through the apache
>> log and using 'cut' to pull off the domain. On a 6G log file, that
>> script was taking almost 24 hours to run. This awk script does the same
>> thing in 26 minutes.
>>
>> --
>> Adam Haeder
>> Vice President of Information Technology
>> AIM Institute
>> adamh at aiminstitute.org
>> (402) 345-5025 x115
>> PGP Public key: http://www.haederfamily.org/pgp.html
>> _______________________________________________
>> OLUG mailing list
>> OLUG at olug.org
>> http://lists.olug.org/mailman/listinfo/olug
>>
>
>
> --
> Travis Owens
>
> VISTA is just a secret codeword that Microsoft thought up which
> actually stands for: Viruses, Intruders, Spy-ware, Trojans & Ad-ware
>
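
For anyone who wants to try Adam's quoted awk one-liner without touching a
real log, here is a small self-contained sketch of the same technique. The
three sample log lines, the example.com/example.org names, and the temp
directory are made up for illustration. The output path is parenthesized
and passed in with -v, which is my own variation and not part of the
original script.

```shell
#!/bin/sh
# Sketch: split a combined Apache log into one file per virtual host.
# Assumes the vhost name is the first space-separated field of each line.
# Sample data and paths are invented for this example.
workdir=$(mktemp -d)
mkdir -p "$workdir/logs"

cat > "$workdir/access.log" <<'EOF'
www.example.com 192.0.2.1 - - [01/Jan/2007:00:00:00 -0600] "GET / HTTP/1.1" 200 512
example.com 192.0.2.2 - - [01/Jan/2007:00:00:01 -0600] "GET /a.htm HTTP/1.1" 200 100
jobs.example.org 192.0.2.3 - - [01/Jan/2007:00:00:02 -0600] "GET /b.htm HTTP/1.1" 404 0
EOF

awk -v outdir="$workdir/logs" '{
    domain = $1
    sub(/^www\./, "", domain)     # fold www.example.com into example.com
    $1 = ""                       # blank the vhost field; awk rebuilds $0
    sub(/^[ \t]+/, "")            # drop the leading space left behind
    print >> (outdir "/" domain)  # append the line to the per-vhost file
}' "$workdir/access.log"

# One file per vhost; example.com should hold both of its lines.
ls "$workdir/logs"
wc -l < "$workdir/logs/example.com"
```

One thing to be aware of: assigning to $1 makes awk rebuild the line with
single spaces between fields, so any runs of multiple spaces in the
original log line are collapsed in the output.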