Rafa XU's technical blog: February 2014

Friday, 28 February 2014

File operations in Python

1. File Open

The basic function for opening a file is:

f = open('file', 'mode')

f is the file object (often called a file handle); it is the object we use for all further operations on the file.

Common modes in Python:

a: append mode – append the content at the end of the file
w: write mode – write the content into the file; this erases the original content of the file
r: read mode – read the content from the file. This is the default mode.
r+ : read and write mode

2. File Read

There are a few methods for reading the content of a file:

f.readline() – read just one line from the file and return it as a string

f.readlines() – read all lines and return a list containing them

f.read([size]) – read at most size characters. Without a parameter, it reads all the remaining content. Each read moves the file pointer forward; f.tell() returns the current pointer position.
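The three read methods can be tried out with a small sketch like the one below. It assumes Python 3 and works on a throwaway file in a temporary directory; the file name demo.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small sample file to read back (hypothetical name and contents).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

with open(path, "r") as f:
    first = f.readline()      # one line, including the trailing newline
    rest = f.readlines()      # the remaining lines as a list of strings

with open(path, "r") as f:
    chunk = f.read(5)         # the first 5 characters
    position = f.tell()       # pointer position after the read
    remainder = f.read()      # everything from the pointer to the end

print(repr(first), rest, repr(chunk), position)
```

The with statement closes the file automatically when the block ends, so no explicit f.close() is needed.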

3. File Write

To write content to a file, open it in write or append mode.

f.write("string") – write a string to the file

f.writelines(list_of_strings) – write multiple strings to the file; no newline is added, so each string must end with its own '\n' if it should be a separate line
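Here is a minimal sketch of the write and append modes, assuming Python 3. The file name out.txt is hypothetical and the file lives in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")  # hypothetical file name

with open(path, "w") as f:            # "w" truncates any existing content
    f.write("header\n")
    # writelines() does not add newlines, so each item carries its own
    f.writelines(["line 1\n", "line 2\n"])

with open(path, "a") as f:            # "a" appends at the end of the file
    f.write("footer\n")

with open(path) as f:
    content = f.read()
print(content)
```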

4. File delete

We use the os module to delete a file from the operating system. To avoid an error when deleting a file that does not exist, check that the file is there first.

os.remove('file') or os.unlink('file') #delete the file from the OS

Sample code:
import os

if os.path.exists('/var/tmp/file') : os.remove('/var/tmp/file')

5. File copy/move

We use the shutil module to copy or move a file. To avoid an error when the source file does not exist, check the source file first.

Sample code:

import os
import shutil

if os.path.exists('/var/tmp/file') : shutil.copyfile('/var/tmp/file', '/var/tmp/file1') #copy file

if os.path.exists('/var/tmp/file') : shutil.move('/var/tmp/file', '/var/tmp/file1') #move file
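The delete, copy and move calls from sections 4 and 5 can be combined into one self-contained sketch. It runs inside a temporary directory instead of /var/tmp, and the file names are made up for illustration.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()                 # scratch directory for the demo
src = os.path.join(workdir, "file")          # hypothetical file names
copy = os.path.join(workdir, "file1")
moved = os.path.join(workdir, "file2")

with open(src, "w") as f:
    f.write("data\n")

if os.path.exists(src):
    shutil.copyfile(src, copy)               # copy the contents; src stays in place
if os.path.exists(src):
    shutil.move(src, moved)                  # move (rename) src; src no longer exists
if os.path.exists(copy):
    os.remove(copy)                          # delete the copy

remaining = sorted(os.listdir(workdir))
print(remaining)
```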

6. Directory related operations

os.mkdir('dirpath', mode) #create a directory; mode is optional

os.makedirs('dirpath', mode) #create a directory together with any missing parent directories

os.rmdir('directory') #remove an empty directory

os.removedirs('directorytree') #remove an empty directory, then each parent directory that becomes empty

os.listdir('path') #list the files and directories in the directory (the name is a little bit confusing)

os.walk() or os.path.walk() #traverse the directory tree. os.path.walk() exists only in Python 2.

os.walk() returns a generator that yields one tuple per directory; each tuple contains the path, its subdirectories and its files.
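A short sketch of a traversal with os.walk(), assuming Python 3. The directory tree is built in a temporary directory and all names in it are made up.

```python
import os
import tempfile

# Build a small throwaway tree: root/a/b plus two files (names are made up).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
for name in (os.path.join(root, "top.txt"),
             os.path.join(root, "a", "b", "deep.txt")):
    with open(name, "w") as f:
        f.write("x")

found = []
# os.walk() yields one (dirpath, dirnames, filenames) tuple per directory.
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        full_path = os.path.join(dirpath, filename)
        found.append(os.path.relpath(full_path, root))

found.sort()
print(found)
```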

7. stdin, stdout, stderr

We can redirect stdin, stdout and stderr in Python by assigning another file object to sys.stdin, sys.stdout or sys.stderr.

For example, to change stdout:

>>> import sys

>>> sys.stdout = open(r"./hello.txt", "a") #change the stdout value

>>> print("good bye") # you won't see 'good bye' printed on the screen

>>> sys.stdout.close() # the text is now in the hello.txt file

>>> sys.stdout = sys.__stdout__ # restore the original stdout
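In a script it is safer to restore stdout even when something goes wrong in between. The sketch below is one way to do that, assuming Python 3; the file is written to a temporary directory rather than the current one.

```python
import os
import sys
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hello.txt")

original_stdout = sys.stdout          # keep a reference so it can be restored
log = open(path, "a")
sys.stdout = log
try:
    print("good bye")                 # goes to hello.txt, not the screen
finally:
    sys.stdout = original_stdout      # always restore the real stdout
    log.close()

with open(path) as f:
    captured = f.read()
print("captured:", repr(captured))
```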

Tuesday, 25 February 2014

BASH command line interpretation and processing

The command line is the interface a sysadmin uses to control the system through bash, so it is important to understand how BASH interprets a command. Here is a brief introduction to how it works.

1. Split the command into tokens using delimiters. The delimiters include SPACE, TAB, NEWLINE, ;, (, ), <, >, | and &
2. Build the command stack (a complicated process, not discussed here)
3. Check whether the first token of the command is an alias; if it is, replace the alias with its value
4. Expand braces {}, e.g. a{a,b} expands to aa and ab
5. If a token starts with ~, replace the ~ with the home directory
6. Replace any expression starting with $ with its value
7. Execute the command between `` and substitute its output
8. Calculate $((expression)) and replace it with the result
9. Wildcard expansion, such as *, ? and [ ]
10. Find the exact command (builtin, $PATH)
11. IO redirection

Here is an example.

echo ~/i* $PWD `echo Yahoo Hadop` $((21*20)) > output

Step 1. Split the command into tokens

token[1] = echo

token[2] = ~/i*

token[3] = $PWD

token[4] = `echo Yahoo Hadop`

token[5] = $((21*20))

"> output" is not a token here; it is handled in the IO redirection step.

Steps 2, 3 and 4 are skipped.

Step 5. Replace ~ with /root. The command now looks like

echo /root/i* $PWD `echo Yahoo Hadop` $((21*20))

Step 6. Replace $PWD with the current path, for example /root:

echo /root/i* /root `echo Yahoo Hadop` $((21*20))

Step 7. Execute the command in `` (the inner command goes through the same process itself). The command now looks like

echo /root/i* /root Yahoo Hadop $((21*20))

Step 8. Calculate the value in $(()). The command now looks like

echo /root/i* /root Yahoo Hadop 420

Step 9. Expand the wildcard (for example):

echo /root/indirect.sh /root/install.log /root/install.log.syslog /root Yahoo Hadop 420

Step 10. echo is a builtin command, so BASH is now ready to execute it.

Step 11. The output is redirected to the file named output.
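Python's standard library has rough analogues of a few of these steps, which makes it easy to watch them one at a time. The sketch below is only an illustration of the idea, not how BASH is implemented. It assumes a POSIX system where ~ follows $HOME, and the directory, the file names and the DEMO_DIR variable are all made up.

```python
import glob
import os
import shlex
import tempfile

# A scratch "home" directory holding two files that match i* (made-up names).
home = tempfile.mkdtemp()
for name in ("indirect.sh", "install.log", "other.txt"):
    open(os.path.join(home, name), "w").close()

saved_home = os.environ.get("HOME")
os.environ["HOME"] = home                 # assumption: POSIX, where ~ follows $HOME
os.environ["DEMO_DIR"] = "/some/dir"      # hypothetical variable

try:
    tokens = shlex.split("echo ~/i* $DEMO_DIR")   # like step 1: split into tokens
    expanded = []
    for token in tokens:
        token = os.path.expanduser(token)         # like step 5: ~ -> home directory
        token = os.path.expandvars(token)         # like step 6: $NAME -> value
        matches = sorted(glob.glob(token))        # like step 9: wildcard expansion
        expanded.extend(matches or [token])       # no match: keep the token as is
finally:
    if saved_home is None:
        del os.environ["HOME"]
    else:
        os.environ["HOME"] = saved_home

print(expanded)
```

Command substitution and arithmetic expansion (steps 7 and 8) have no direct equivalent in these modules, so they are left out of the sketch.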

Monday, 24 February 2014

Linux trace introduction - 1: the strace command

Linux provides system admins with quite a few useful troubleshooting tools. strace is one of them: it shows the details of the syscalls a process makes, including their parameters, return values and the time consumed.

strace is a complicated command with many options; these are the common ones for daily usage:

-c -- count time, calls, and errors for each syscall and report summary

-f -- follow forks, -ff -- with output into separate files

-r -- print relative timestamp, -t -- absolute timestamp, -tt -- with usecs

-e expr -- a qualifying expression: option=[!]all or option=[!]val1[,val2]...

options: trace, abbrev, verbose, raw, signal, read, or write

-o file -- send trace output to FILE instead of stderr

-p pid -- trace process with process id PID, may be repeated

Some examples

Run ls on a file that does not exist

[root@X001 tmp]# strace ls notexisting

execve("/bin/ls", ["ls", "notexisting"], [/* 29 vars */]) = 0

brk(0) = 0x1b7b000

mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5b87f51000

access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)

open("/etc/ld.so.cache", O_RDONLY) = 3

fstat(3, {st_mode=S_IFREG|0644, st_size=38923, ...}) = 0

mmap(NULL, 38923, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f5b87f47000

close(3) = 0

-----omitted-----

ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0

ioctl(1, TIOCGWINSZ, {ws_row=63, ws_col=237, ws_xpixel=0, ws_ypixel=0}) = 0

stat("notexisting", 0x1b7c0e0) = -1 ENOENT (No such file or directory)

lstat("notexisting", 0x1b7c0e0) = -1 ENOENT (No such file or directory)

open("/usr/share/locale/locale.alias", O_RDONLY) = 3

fstat(3, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0

mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5b87f50000

read(3, "# Locale name alias data base.\n#"..., 4096) = 2512

read(3, "", 4096) = 0

close(3) = 0

exit_group(2) = ?

Connect to a port that nothing is listening on, tracing only the network syscalls

[root@X001 tmp]# strace -e trace=network telnet localhost 9999

socket(PF_NETLINK, SOCK_RAW, 0) = 3

bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0

getsockname(3, {sa_family=AF_NETLINK, pid=2395, groups=00000000}, [12]) = 0

sendto(3, "\24\0\0\0\26\0\1\3\342\346\vS\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20

recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\342\346\vS[\t\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 108

recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"@\0\0\0\24\0\2\0\342\346\vS[\t\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 128

recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\342\346\vS[\t\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20

socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3

connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)

socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3

connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3

connect(3, {sa_family=AF_INET, sin_port=htons(9999), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

getsockname(3, {sa_family=AF_INET, sin_port=htons(33896), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0

socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3

connect(3, {sa_family=AF_INET6, sin6_port=htons(9999), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0

getsockname(3, {sa_family=AF_INET6, sin6_port=htons(57576), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0

Trying ::1...

socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 3

connect(3, {sa_family=AF_INET6, sin6_port=htons(9999), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 ECONNREFUSED (Connection refused)

telnet: connect to address ::1: Connection refused

Trying 127.0.0.1...

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3

setsockopt(3, SOL_IP, IP_TOS, [16], 4) = 0

connect(3, {sa_family=AF_INET, sin_port=htons(9999), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)

telnet: connect to address 127.0.0.1: Connection refused

[root@X001 tmp]#

Get a summary of the syscalls
[root@X001 tmp]# strace -c -e trace=network telnet localhost 9999
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.022996        3285         7           socket
  0.00    0.000000           0         6         4 connect
  0.00    0.000000           0         1           sendto
  0.00    0.000000           0         3           recvmsg
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         3           getsockname
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    0.022996                    22         4 total
[root@X001 tmp]#

To understand the output of strace, we need a basic idea of Linux internals and syscalls.

BASH IO redirection

IO redirection:

IO redirection captures the output of a file, command, program or script and sends it to another file, command or script.

Background: every process has three standard file descriptors:

stdin: standard input (file descriptor 0)

stdout: standard output (file descriptor 1)

stderr: standard error output (file descriptor 2)

IO redirection example:

Output redirection

> file: send the output to the target file, overwriting the file content.

>> file: send the output to the target file, appending it at the end of the file.

>| file: overwrite the file even when the noclobber option is set.

stderr redirection

2>newfile: redirect stderr to newfile, for example: ls -al zz 2>newfile.

If no file named zz exists, the error message is written to newfile.

stdin redirection

script < file: use file as stdin instead of the default stdin

<<delimiter: a here document; the following lines are used as stdin until a line containing only the delimiter is reached

For example

#cat > mytest<<GO

>this is my test

>it is ok

>let’s GO

>GO

#cat mytest

this is my test

it is ok

let’s GO

Block redirection

A whole code block, such as a loop or a { } group, can be followed by > outputfile or < inputfile. This changes stdin or stdout temporarily, for that block only.
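The same three descriptors can be redirected when a child process is started from Python, which is a handy way to see what the shell operators do. This is a sketch assuming Python 3; the child is a one-line Python program and the file names are made up.

```python
import os
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, "out.txt")     # hypothetical file names
err_path = os.path.join(workdir, "err.txt")

# A tiny child program that writes one line to stdout and one to stderr.
child = [sys.executable, "-c",
         "import sys; print('to stdout'); sys.stderr.write('to stderr\\n')"]

# Equivalent in spirit to: child > out.txt 2> err.txt
with open(out_path, "w") as out, open(err_path, "w") as err:
    subprocess.call(child, stdout=out, stderr=err)

# Equivalent in spirit to: child >> out.txt 2> /dev/null
with open(out_path, "a") as out, open(os.devnull, "w") as devnull:
    subprocess.call(child, stdout=out, stderr=devnull)

with open(out_path) as f:
    out_text = f.read()
with open(err_path) as f:
    err_text = f.read()
print(repr(out_text), repr(err_text))
```

Opening the target with "w" corresponds to >, and opening it with "a" corresponds to >>.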

Saturday, 22 February 2014

BASH variables

Bash variables store information that scripts refer to.

local variables: declared and used only in the local shell process.
environment variables: used by the login process and its subprocesses. Environment variables can be used by all editors and scripts.
parameter variables: used to pass parameters to shell scripts. They are read only.

some examples:
variable=value or ${variable=value} #set the value of a variable (the second form assigns only if the variable is unset)
echo $variable #get the value
unset variable #clear the variable (note: no $ before the name)
readonly variable #make the variable read only (immutable); set the value first, then mark it readonly
let a=a+1 # integer operations

By default a BASH variable is a string, and a variable that was never declared is null. A variable that holds a non-numeric string can still be used in integer operations; it is normally evaluated as 0.

environment variables:

eg: export variable # declare the variable as an environment variable
You can use the env command to show the defined environment variables. Setting and unsetting them works the same way as for local variables.

There are some important pre-defined environment variables:
PWD, OLDPWD: the current directory and the previous directory
PATH: the locations the shell searches for external commands, scripts and executable programs
HOME: user's home directory
SHELL: user's default shell (/bin/bash)
USER: user login name such as root
UID: user UID.
PPID: parent PID.
Some tips: a child process inherits the environment variables of its parent process, but if a variable is changed in the child process, the change is not passed back to the parent.
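The inheritance rule is not specific to BASH; it applies to any parent and child process. A small sketch in Python 3 shows it, where DEMO_VAR is a hypothetical variable name and the child is a one-line Python program:

```python
import os
import subprocess
import sys

os.environ["DEMO_VAR"] = "parent"     # hypothetical variable name

# The child sees the inherited value, then changes its own copy.
child_code = (
    "import os; print(os.environ['DEMO_VAR']); "
    "os.environ['DEMO_VAR'] = 'child'"
)
seen_by_child = subprocess.check_output(
    [sys.executable, "-c", child_code]
).decode().strip()

after_child = os.environ["DEMO_VAR"]  # the parent's value is unchanged
print(seen_by_child, after_child)
```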

~/.bash_profile: usually where you define your BASH environment variables.
A login shell reads /etc/profile first, then the first file it finds among ~/.bash_profile, ~/.bash_login and ~/.profile.
~/.login is used by the C shell and ~/.profile is used by the Korn shell. The variables defined there can be referred to by BASH, but this is strongly not recommended.

parameter variables:

$0: the script itself; $1, $2, $3: the first, second and third parameters; ${10}: the 10th parameter
$# number of parameters
$* all parameters
$? exit code of the last command: 0 for success, non-zero for failure
$$ current PID

Quotes in Bash

"" (partial quoting): all characters are treated as normal characters except $, ` and \. Double quotes also preserve the spaces in a variable's value.
'' (full quoting): all characters are treated as normal characters.
`command` (command substitution): run the command and replace it with its output.

Friday, 21 February 2014

The MOOC courses I am learning / have studied (kept updated)

MOOCs are really a revolution for IT engineers. They give us a way to keep learning and make our knowledge base wider and more up to date.

I took some courses in 2013, but I only began to record them from Feb 2014.

Here is the list of MOOCs I am currently taking.

Here is the list I have attended and finished.

Introduction to Google Tools (Udemy)- very small free course
TCP, HTTP and SPDY Deep Dive(Udemy)- very small free course

TCP, HTTP and web performance

These are my study notes for the Udemy class

https://www.udemy.com/tcp-http-spdy-deep-dive/

Web loading performance affects how users feel about a website. Research shows that 100ms is the ideal web loading time.

In general, we can improve the web loading time in the four areas below:

Markup/content:

Make fewer HTTP requests

Optimize CSS and scripts

Minimize cookies

Browser:

Use progressive enhancement

Load scripts without blocking

Use AJAX and deferred scripts

Network:

Use caching and compression

Use CDN

Reduce DNS lookups

Avoid redirects

Prefetch commonly used resources

Server:

Load balancing

Backend server scripts

Optimize database

Besides the web server and backend processing time, network overhead has a great impact on the web loading time.

TCP was designed and developed around 1980 for the poorer network conditions of the time, and it handles low-bandwidth networks very well. It is stream oriented, with features such as slow start, sliding windows, congestion windows and Nagle's algorithm.

RTT (round-trip time) is very important for web response time. Its lower bound is the time light takes to travel between you and the server; on top of that come many other factors such as network device hops and bandwidth.

How the web load time is influenced by TCP/HTTP:

1. 1 RTT to establish the TCP connection

2. 1 RTT to send the HTTP request and get the response

3. 1 more RTT to get the rest of the data beyond the first 3 packets

4. an extreme slowdown when a packet is lost and retransmission happens

What we can do to improve the response time

1. parallel TCP sessions

2. reuse TCP sessions (persistent HTTP connections)

3. pre-establish TCP sessions

4. increase initial congestion window

5. use CDN to reduce the RTT

6. TCP fast open (HTTP GET request with TCP SYN)

Persistent HTTP Sessions

The TCP session is not closed after the HTTP response is sent. The feature is supported by all major web sites and browsers. It saves TCP session setup overhead, but the web server has to keep the sessions open (more threads or worker processes). Apache applies a timeout to idle persistent connections.

Initial congestion window: a Google experiment shows that 10 is a suitable value for current Internet congestion conditions. With it the server can send about 15k of data to the browser in the first round trip, so the content can be shown if the page is well designed.
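To see where the "about 15k" comes from, and how much the initial window matters for a larger page, here is a rough calculation in Python 3. It is an idealized model, not a measurement: it assumes a segment size of 1460 bytes (a typical value on Ethernet, not stated in the course notes), a window that doubles every round trip, no packet loss, and a 60 KB page chosen only as an example.

```python
import math

def round_trips(size_bytes, initcwnd, mss=1460):
    """Round trips needed to deliver size_bytes when the congestion
    window starts at initcwnd segments and doubles every RTT
    (idealized slow start: no loss, no receiver window limit)."""
    segments_left = math.ceil(size_bytes / mss)
    cwnd = initcwnd
    trips = 0
    while segments_left > 0:
        segments_left -= cwnd   # send one full window in this round trip
        cwnd *= 2               # slow start doubles the window each RTT
        trips += 1
    return trips

first_flight = 10 * 1460               # bytes sent in the first RTT with a window of 10
trips_icw3 = round_trips(60000, 3)     # example 60 KB page, initial window of 3
trips_icw10 = round_trips(60000, 10)   # same page, initial window of 10
print(first_flight, trips_icw3, trips_icw10)
```

With these assumptions the first flight is 14600 bytes, and the example page needs one round trip fewer with the larger initial window.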

TCP Fast Open: the HTTP request is sent in the SYN packet. Only experimental.

My next 6 months study path - the way to a Full Stack Engineer

There is a very popular idea called the full stack engineer (FSE). An FSE is an engineer who understands the whole development stack in the web/Internet environment. Becoming an FSE is difficult, as it takes a very long time, great effort and a good DevOps environment.

I won't be able to become an FSE in the next few years, but I believe DevOps will be the future of system administration, which means you have to understand many IT areas and know how to develop systems.

I will work on the areas below in the next 6 months (20 hours per week). Then let's see how it goes.

1) Linux admin/internals

2) BASH programming

3) Python programming

4) Network infrastructure

5) TCP/IP stack and application protocol

6) Web System infrastructure

7) Web Framework (SSH/Django)

I will write about 100 blog posts to cover the above sections. By the end of July, I hope to have solid knowledge of the above areas.

The 6 months after July will focus on web development, from backend to frontend and mobile, but first I will check the results of the first 6 months.

Sunday, 16 February 2014

BASH text file processing

In this post, we show some common bash commands for text file processing.

cut: cut is a powerful tool to extract specific columns or fields from a text-based file.

The common options are:

-c <list>: the character positions to output.
-d <delimiter>: the delimiter that separates the fields; the default is TAB
-f <fields>: the fields to output.

For example,

The command to print the 1st and 7th fields of the /etc/passwd file using : as delimiter

cut -d : -f 1,7 /etc/passwd

The command to print the 1st to 10th characters of each line of /etc/passwd

cut -c 1-10 /etc/passwd

sort: display the lines of a file sorted by a field

Some important parameters

-b: ignore leading blanks
-d: dictionary order
-g: sort by general numeric value (floats)
-f: ignore case
-k: define the sort key
-n: sort numerically
-o: send the output to an output file
-t: field delimiter
-u: unique

Example: sort a file by the second field as a float

sort -k2 -g <file>

Sort the /etc/passwd file by UID in descending order.

sort -t : -k3 -n -r /etc/passwd
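For comparison, the same field extraction and sorting can be written in a few lines of Python 3. This is only a rough equivalent for illustration; the sample lines use the /etc/passwd format but the accounts are made up.

```python
# Sample lines in /etc/passwd format (made-up accounts for illustration).
passwd_lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:2:2:daemon:/sbin:/sbin/nologin",
    "alice:x:500:500::/home/alice:/bin/bash",
]

# Like: cut -d: -f1,7  (fields are 1-based in cut, 0-based in Python)
name_and_shell = []
for line in passwd_lines:
    fields = line.split(":")
    name_and_shell.append(fields[0] + ":" + fields[6])

# Like: cut -c1-10  (the first 10 characters of each line)
first_ten = [line[:10] for line in passwd_lines]

# Roughly like: sort -t : -k3 -n -r  (numeric sort on the UID field, descending)
by_uid_desc = sorted(passwd_lines,
                     key=lambda line: int(line.split(":")[2]),
                     reverse=True)

print(name_and_shell)
print(first_ten)
print([line.split(":")[0] for line in by_uid_desc])
```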

uniq: remove duplicated adjacent lines (so the input is usually sorted first)

-c: prefix each line with the number of times it occurred

-i: ignore case

-u: only show the lines that are not repeated

-d: only show the lines that are repeated
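uniq only compares neighbouring lines, which is easy to overlook. The sketch below mimics that behaviour with itertools.groupby in Python 3; the sample lines are made up, and the second "apple" is deliberately not next to the first two.

```python
from itertools import groupby

lines = ["apple", "apple", "Banana", "apple", "cherry", "cherry"]

# Like uniq: collapse runs of adjacent identical lines.
collapsed = [key for key, _ in groupby(lines)]

# Like uniq -c: prefix each run with how many times it repeated.
counted = [(len(list(group)), key) for key, group in groupby(lines)]

# Like uniq -u and uniq -d: runs that occur once / more than once.
only_unique = [key for count, key in counted if count == 1]
only_repeated = [key for count, key in counted if count > 1]

print(collapsed, counted, only_unique, only_repeated)
```

Because the input is not sorted, "apple" shows up in two separate runs, which is why uniq is normally used after sort.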

wc command:

show the line count, word count and byte (character) count of each file

words are delimited by whitespace (spaces, tabs and newlines)

head and tail: show the first and last lines of a file (10 by default)

head -n number <file>: the first number lines

head -n -number <file>: all lines except the last number lines

tail -n number <file>: the last number lines

tail -n +number <file>: all lines starting from line number
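The four head and tail forms map directly onto list slicing, which may help to remember them. A small sketch in Python 3 with seven made-up lines and number = 3:

```python
lines = ["line %d" % i for i in range(1, 8)]   # seven sample lines
n = 3

head_n = lines[:n]            # head -n 3   : the first 3 lines
head_minus_n = lines[:-n]     # head -n -3  : all but the last 3 lines
tail_n = lines[-n:]           # tail -n 3   : the last 3 lines
tail_plus_n = lines[n - 1:]   # tail -n +3  : from line 3 to the end

print(head_n, head_minus_n, tail_n, tail_plus_n)
```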

WHY INTERNET

I have been working in the traditional IT environment for about 8 years and now (from 2004) I have decided to turn to the Internet area, as it is more exciting and challenging.

The areas I am now interested in and working on:

High volume traffic web infrastructure. This area has a very wide scope. It includes the web deployment architecture, the web framework, and the support procedures for extremely high availability.

Web mining. This is more related to artificial intelligence and data mining. It is the technology used to find useful information in semi-structured web pages.

Other areas I am interested in include cloud platforms (OpenStack), NoSQL, Hadoop and stream computing. These are leading technologies widely used in Internet companies, but I only have a basic idea of how they work.

I can be contacted by rafa.xu.au@gmail.com

My Next 6 months study Areas:

OpenStack

Linux DevOps

Python/C

puppet(certified)
