Network Computing, powered by InformationWeek Business Technology Network

The Data Shuffle: Listings

Listing 1: The shuffle script processes line-oriented data: it catenates the input files, then extracts selected lines into the files named by the rules, optionally sorting them.

A. Listing of the shuffle Korn shell script:

#!/usr/bin/ksh
# @(#) shuffle Version 5 A rule-based list processor
# Author: Thomas Baker <tbaker@unix.amherst.edu>
# Modified by: Becca Thomas, February 1994
$DBG_SH # Dormant debugging directive

trap 'rm -f $Tmpfile $Targetfilenames >|$Devnull 2>&1; \
  exit $Stat' 0
trap 'print -u2 "$(basename $0): Interrupted!"; exit' 1 2 3 15

# CONFIGURATION
Allfiles=combined.dat # File for all catenated input files
Bkupdir=.backup # Unix input-files backup directory
#Bkupdir=backup # MKS input-files backup directory
Devnull="/dev/null" # Unix bit-bucket file
Rulefile=.rules # Unix rule file
#Rulefile=rules # MKS rule file
Usage="Usage: $(basename $0) datafile [datafile ...]" # Correct usage
# Temporary directory-dependent variables:
Tmpdir=/tmp # MKS/Unix temporary directory
#Devnull=$Tmpdir/null # MKS bit-bucket file
Targetfilenames=$Tmpdir/sht$$.tmp # MKS/Unix target-names file
Tmpfile=$Tmpdir/shf$$.tmp # MKS/Unix temporary work file

# FUNCTION DEFINITIONS:
function usage_exit {
  print -u2 "$Usage"; Stat=1 ; exit
}
function movelines { # Args: $Searchkey $Source $Target $Sortcmd
  print -n "Lines with [$1] moved from \""$2"\" to \""$3"\""
  egrep "$1" $2 >>$3; egrep -v "$1" $2 >|$Tmpfile; mv $Tmpfile $2
  [ "$4" ] && print ", ${4}." || print "." # Print sort command
  [ "$4" ] && { eval $4 -o $3 $3 ||
    { print "\aBad rule-file sort command: $4"; Stat=2; exit;};}
}

# PROCESS COMMAND-LINE ARGUMENTS:
case $# in # User must specify at least one file-name argument
  0) usage_exit ;;
esac

# SANITY CHECK: Rule file:
[ -r $Rulefile ] ||
  { print -u2 "\aCannot read \"$Rulefile\" file!"; Stat=4; exit;}
sed 's/#.*$//' $Rulefile | # Remove comments.
egrep -v '^$' | # Remove blank lines.
nawk -F\| ' # Rules separated by vertical bar
  NR == 1 && ($1 != "." || $2 != "$Allfiles") { # Check first rule
    print $0, ": rule 1 is illegal!" }
  NF != 3 && NF != 4 { # All rules have 3 or 4 fields.
    print $0, ": must have 3 or 4 fields!" }
  $2 == $3 { # Source different from target.
    print $0, ": source cannot equal target!" }
  $4 != "" && $4 !~ /^sort/ { # Field 4 is for sort commands.
    print $0, ": field 4 is only for sort!" }
  $1 == "" || $2 == "" || $3 == "" { # First three fields are non-empty.
    print $0, ": 1 of first 3 fields is empty!" }
  { target[$3] = 1 } # Note names of target files
  NR > 1 { # For all lines after the first
    if ($2 in target) # If source file is also a target
      next; # No problem, fetch next input line
    else print $0, ": ", $2, "has no precedent!"
  }' >| $Tmpfile # Save unique lines and display
[ -s $Tmpfile ] &&
  { print -u2 "Bad rule format:\n$(cat $Tmpfile)"; Stat=5; exit;}

# SANITY CHECKS: Current directory, combined data, backup directory:
[ -w "." ] || # Current (data) directory
  { print -u2 "\aCannot write to current directory!"; Stat=6; exit;}
[ -f $Allfiles ] && # Combined data file
  { print -u2 "\a\"$Allfiles\" should not yet exist!"; Stat=7; exit;}
[ -d $Bkupdir ] || mkdir $Bkupdir 2>|$Devnull ||
  { print -u2 "\aCannot make directory \"$Bkupdir\"!"; Stat=8; exit;}
[ "$(ls $Bkupdir)" ] && { # if there are files in backup dir
  print -n "Okay to erase files in $Bkupdir (y*|Y*/n)? "; read ans
  case $ans in
    y*|Y*) rm -f $Bkupdir/* >|$Devnull 2>&1 ;; # Remove old backups
    *) print "Exiting, check $Bkupdir directory."; Stat=0; exit ;;
  esac;}

# CHECK DATA FILES, BACK UP, THEN COMBINE INTO A COMMON FILE:
for File in "$@"; do
  [ -d $File ] && continue # Ignore directories.
  [ "$File" = "$Rulefile" ] && continue # Ignore rules (just data).
  [ "$(dirname $File)" = "." ] || [ "$(dirname $File)" = "$PWD" ] ||
    { print -u2 "\aData files must be in current directory!"
      Stat=9; exit;}
  [ -r $File ] ||
    { print -u2 "\a\"$File\" file not readable."; Stat=10; exit;}
  { file $File | egrep 'text|empty' >|$Devnull 2>&1;} ||
    { print -u2 "\a\"$File\" not text nor empty."; Stat=11; exit;}
  egrep '^[ ]*$' $File >|$Devnull 2>&1 &&
    { print -u2 "\a\"$File\" has blank lines!"; Stat=12; exit;}
  cp $File $Bkupdir || # Copy to backup directory.
    { print -u2 "\aCannot back up $File!"; Stat=13; exit;}
  cat $File >> $Allfiles; rm $File # Combine into common file.
done

# CHECK COMBINED DATA FILE:
[ -s $Allfiles ] || { print -u2 "\aNo data to process!"; Stat=14; exit;}
Beforesize=$(wc -c <$Allfiles | awk '{ print $1 }') # Data size before
print "Data backed up to \"$Bkupdir\", concatenated in \"$Allfiles\"."

# PROCESS DATA FILES under direction of rule file:
OldIFS="$IFS" # Save old internal field separator char(s)
IFS="|" # Rule-file field separator for "read"
sed 's/#.*$//' $Rulefile | # Remove rule-file comments
egrep -v '^$' | # Remove blank lines
while read Searchkey From To Sortcmd ; do # put fields into variables
  eval Source=$From; eval Target=$To # interpolate these var.
  movelines $Searchkey $Source $Target $Sortcmd # Do the shuffle
  print -u3 "$Target" # Output goes to fd 3.
done 3>| $Targetfilenames # Store fd3 output in a file.
IFS="$OldIFS" # Restore original IFS values.
Targetnames=$(sort -u $Targetfilenames) # Place unique list in variable.

# CONCLUSION: Cleanup and exit message:
for File in $Targetnames $Allfiles; do
  [ -s $File ] || rm $File # Erase data files if empty
done
if [ $Beforesize -ne $(cat $Targetnames 2>|$Devnull | wc -c) ]; then
  print -u2 "Warning: data may have been lost--use backup!\a\a\a"
else
  print -u2 "Done: data shuffled and intact!"
fi

B. A sample data file:

- 1994 Feb 23 Smith 01 John Lunch at Panda East.
- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.
- 1994 Feb 23 Smith 03 FOLLOWUP Read Sep 1993 SCILS article on SGML
Smith John 432 E43rd St, New York NY 01002 212-555-5555, fax 666-6666
Feb 10 BDAY Sarah (1956)
LATER Read SCILS article on SGML.
NOW Renew passport!
Beans stock and info 800-221-4221, customer service 800-341-4341
Clothes Shoes Timberland "Blucher" size W12
Convert US Ounces to Grams: 1 oz = 28.35 gm
Wallet [07 Sep 93] NY Drivers' # A01234 56789 123456 78, exp 7/96
- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.
Wallet [07 Sep 93] Visa 1234-5678-1234-5678, lost: 1-800-423-3823
Fastback differential backup of C: c:/fastback/fb ')c)b)d)s))'
Clothes Shoes Adidas Marath.Train.II 1CA, size 12.5(D) 48(F) 13(USA)

C. A sample rule file:

# Rule file for "Shuffle: a rule-based list processor"
# 1. Rules contain: searchkey|source|target|optional_sort_command
# 2. First rule must have "." in first field, "$Allfiles" in second.
# 3. Common sort types:
# sort Straight alphabetic.
# sort +0M -1 +1n -2 Data format: Jun 25
# sort +1n -2 +2M -3 +3n -4 Data format: - 1992 Jun 25
.|$Allfiles|phone|sort
^- |phone|1993|sort +1n -2 +2M -3 +3n -4
^- 1994 |1993|1994|sort +1n -2 +2M -3 +3n -4
^Jan |phone|calendar
^Feb |phone|calendar
^Dec |phone|calendar|sort +0M -1 +1n -2
BDAY|calendar|bday|sort +0M -1 +1n -2
^NOW |phone|now|sort
^LATER |phone|later|sort
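
The sort commands in the rule file use the old zero-based "+pos -pos" key syntax, which many current sort implementations no longer accept. The sketch below shows what should be the key-based equivalent of "sort +1n -2 +2M -3 +3n -4", assuming a sort that supports the M (month) modifier, as GNU and BSD sort do. The function name sort_by_date is hypothetical and is not part of shuffle.

```shell
#!/bin/sh
# sort_by_date is a hypothetical helper name used only for this sketch.
# Old form in the rule file:  sort +1n -2 +2M -3 +3n -4
# Key-based equivalent:       sort -k2,2n -k3,3M -k4,4n
# (+1 -2 is zero-based and means "field 2 only", hence -k2,2.)
sort_by_date() {
    # LC_ALL=C keeps the month abbreviations English (Jan, Feb, ...).
    # The M modifier is an extension, assumed to be available here.
    LC_ALL=C sort -k2,2n -k3,3M -k4,4n
}

# Three dated lines taken from the sample data file in Part B.
sort_by_date <<'EOF'
- 1994 Feb 23 Smith 01 John Lunch at Panda East.
- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.
- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.
EOF
```

The lines are ordered by year, then month, then day, so the 1993 entry comes first and the January 1994 entry precedes the February one. By the same translation, the calendar rule "sort +0M -1 +1n -2" would become "sort -k1,1M -k2,2n".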

D. Another example of a data-file line:

Jan 23 Smith John Lunch at Panda East.

E. Some transformations of the data-file line shown above in Part D:

- 1994 Jan 23 Smith 01 John Lunch at Panda East.
- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.
- 1994 Jan 23 Smith 03 FOLLOWUP Sep 1993 SCILS article on SGML
- 1994 Jan 23 Smith 04 FOLLOWUP Call Joachim Mann 321-4567
Mann Joachim, tel 321-4567
LATER Read Sep 1993 SCILS article on SGML.

F. Another example of a rule-file line:

FOLLOWUP|1994|followup|sort +1n -2 +2M -3 +3n -4
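
To make the four rule fields concrete, here is a minimal sketch of how a rule line splits on the vertical bar, in the same spirit as the "IFS=|" and "read" lines of the listing. The helper name parse_rule and its output format are assumptions made for illustration; shuffle itself does not contain this function.

```shell
#!/bin/sh
# parse_rule is a hypothetical helper, not part of shuffle itself.
# It splits one rule line on "|" and reports the four fields.
parse_rule() {
    printf '%s\n' "$1" | {
        # Setting IFS only for this read leaves the shell's IFS untouched.
        IFS='|' read -r key source target sortcmd
        printf 'key=[%s] source=[%s] target=[%s] sort=[%s]\n' \
            "$key" "$source" "$target" "${sortcmd:-none}"
    }
}

parse_rule 'FOLLOWUP|1994|followup|sort +1n -2 +2M -3 +3n -4'
parse_rule '^Jan |phone|calendar'
```

Because only "|" is a separator, the trailing blank in a search key such as "^Jan " survives, and the whole sort command arrives as a single fourth field. A rule with three fields simply leaves the sort field empty, shown here as "none".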

G. A version of shuffle written for Coherent that runs under the Bourne shell with the "old" awk:

#!/usr/bin/sh
# @(#) shuffle Version 5 A rule-based list processor
# Author: Thomas Baker <tbaker@unix.amherst.edu>
# Modified by: Becca Thomas, February 1994
# Modified by: Ga'bor Zahemszky, March 1994 to use sh and "old" awk
$DBG_SH # Dormant debugging directive

trap 'rm -f $Tmpfile $Targetfilenames >$Devnull 2>&1; exit $Stat' 0
trap 'echo "`basename $0`: Interrupted!" >&2 ; exit' 1 2 3 15

# CONFIGURATION
Allfiles=combined.dat # File for all catenated input files
Bkupdir=.backup # Unix input-files backup directory
#Bkupdir=backup # MKS input-files backup directory
Devnull="/dev/null" # Unix bit-bucket file
Rulefile=.rules # Unix rule file
#Rulefile=rules # MKS rule file
Usage="Usage: `basename $0` datafile [datafile ...]" # Correct usage
# Temporary directory-dependent variables:
Tmpdir=/tmp # MKS/Unix temporary directory
#Devnull=$Tmpdir/null # MKS bit-bucket file
Targetfilenames=$Tmpdir/sht$$.tmp # MKS/Unix target-names file
Tmpfile=$Tmpdir/shf$$.tmp # MKS/Unix temporary work file

# FUNCTION DEFINITIONS:
usage_exit() {
  echo "$Usage" >&2 ; Stat=1 ; exit
}
movelines() { # Args: $Searchkey $Source $Target $Sortcmd
  echo "Lines with [$1] moved from \""$2"\" to \""$3"\""
  egrep "$1" $2 >>$3; egrep -v "$1" $2 >$Tmpfile; mv $Tmpfile $2
  [ "$4" ] && echo ", ${4}." || echo "." # Print sort command
  [ "$4" ] && { eval $4 -o $3 $3 ||
    { echo "\007Bad rule-file sort command: $4"; Stat=2; exit;};}
}

# PROCESS COMMAND-LINE ARGUMENTS:
case $# in # User must specify at least one file-name argument
  0) usage_exit ;;
esac

# SANITY CHECK: Rule file:
[ -r $Rulefile ] ||
  { echo "\007Cannot read \"$Rulefile\" file!" >&2 ; Stat=4; exit;}
sed 's/#.*$//' $Rulefile | # Remove comments.
egrep -v '^$' | # Remove blank lines.
oawk -F\| ' # Rules separated by vertical bar
  NR == 1 && ($1 != "." || $2 != "$Allfiles") { # Check first rule
    print $0, ": rule 1 is illegal!" }
  NF != 3 && NF != 4 { # All rules have 3 or 4 fields.
    print $0, ": must have 3 or 4 fields!" }
  $2 == $3 { # Source different from target.
    print $0, ": source cannot equal target!" }
  $4 != "" && $4 !~ /^sort/ { # Field 4 is for sort commands.
    print $0, ": field 4 is only for sort!" }
  $1 == "" || $2 == "" || $3 == "" { # First three fields are non-empty.
    print $0, ": 1 of first 3 fields is empty!" }
  { target[$3] = 1 } # Note names of target files
  NR > 1 { # For all lines after the first
    ZGvar2 = 0
    for (ZGvar1 in target) {
      if (ZGvar1 == $2) {
        next
      } else {
        ZGvar2 = 1
      }
    }
    if (ZGvar2 == 1) {
      print $0, ": ", $2, "has no precedent!"
    }
  }' > $Tmpfile # Save unique lines and display
[ -s $Tmpfile ] &&
  { echo "Bad rule format:\n`cat $Tmpfile`" >&2 ; Stat=5; exit;}

# SANITY CHECKS: Current directory, combined data, backup directory:
[ -w "." ] || # Current (data) directory
  { echo "\007Cannot write to current directory!" >&2 ; Stat=6; exit;}
[ -f $Allfiles ] && # Combined data file
  { echo "\007\"$Allfiles\" shouldn't exist!" >&2 ; Stat=7; exit;}
[ -d $Bkupdir ] || mkdir $Bkupdir 2>$Devnull ||
  { echo "\007Can't make directory \"$Bkupdir\"!" >&2 ; Stat=8; exit;}
[ "`ls $Bkupdir`" ] && { # if there are files in backup dir
  echo "Okay to erase files in $Bkupdir (y*|Y*/n)? \c"; read ans
  case $ans in
    y*|Y*) rm -f $Bkupdir/* >$Devnull 2>&1 ;; # Remove old backups
    *) echo "Exiting, check $Bkupdir directory."; Stat=0; exit ;;
  esac;}

# CHECK DATA FILES, BACK UP, THEN COMBINE INTO A COMMON FILE:
for File in $*; do
  [ -d $File ] && continue # Ignore directories.
  [ "$File" = "$Rulefile" ] && continue # Ignore rules (just data).
  [ "`dirname $File`" = "." ] || [ "`dirname $File`" = "`pwd`" ] ||
    { echo "\007Data files must be in current directory!" >&2
      Stat=9; exit;}
  [ -r $File ] ||
    { echo "\007\"$File\" file not readable." >&2 ; Stat=10; exit;}
  { file $File | egrep 'text|empty' >$Devnull 2>&1;} ||
    { echo "\007\"$File\" not text nor empty." >&2 ; Stat=11; exit;}
  egrep '^[ ]*$' $File >$Devnull 2>&1 &&
    { echo "\007\"$File\" has blank lines!" >&2 ; Stat=12; exit;}
  cp $File $Bkupdir || # Copy to backup directory.
    { echo "\007Cannot back up $File!" >&2 ; Stat=13; exit;}
  cat $File >> $Allfiles; rm $File # Combine into common file.
done

# CHECK COMBINED DATA FILE:
[ -s $Allfiles ] || { echo "\007No data to process!">&2; Stat=14; exit;}
Beforesize=`wc -c <$Allfiles | oawk '{ print $1 }'` # Data size before
echo "Data backed up to \"$Bkupdir\", concatenated in \"$Allfiles\"."

# PROCESS DATA FILES under direction of rule file:
OldIFS="$IFS" # Save old internal field separator char(s)
IFS="|" # Rule-file field separator for "read"
sed 's/#.*$//' $Rulefile | # Remove rule-file comments
egrep -v '^$' | # Remove blank lines
while read Searchkey From To Sortcmd ; do # put fields into variables
  eval Source=$From; eval Target=$To # interpolate these var.
  movelines $Searchkey $Source $Target $Sortcmd # Do the shuffle
  echo "$Target" >&3 # Output goes to fd 3.
done 3> $Targetfilenames # Store fd3 output in a file.
IFS="$OldIFS" # Restore original IFS values.
Targetnames=`sort -u $Targetfilenames` # Place unique list in variable.

# CONCLUSION: Cleanup and exit message:
for File in $Targetnames $Allfiles; do
  [ -s $File ] || rm $File # Erase data files if empty
done
if [ $Beforesize -ne `cat $Targetnames 2>$Devnull | wc -c` ]; then
  echo "Warning: data may have been lost--use backup!\007" >&2
else
  echo "Done: data shuffled and intact!" >&2
fi

Figure 1: A data-flow diagram for the example discussed in Tom Baker's introductory letter.

$Allfiles
|
V Sorted by year:
phone ---> 1993 [^- ] -----\-----------> 1994 [^- 1994 ]
| \----------> 1993 (everything else) 
|
V Sorted by month:
phone ---> calendar [^Jan,^Feb..] \----> bday [BDAY]
| \---> calendar (everything else)
|
V Sorted alphabetically:
phone ---> now [^NOW ]
|
phone ---> later [^LATER ]
|
\------> phone (everything else)
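
The first few arrows of the diagram can be traced with a small sketch. It uses a simplified stand-in for the listing's movelines function, here called move_lines (a hypothetical name), with no sort step, no messages, and no error handling. It assumes a grep that accepts -E and a mktemp that accepts -d, and it works in a scratch directory so that no real data is touched.

```shell
#!/bin/sh
# move_lines is a simplified stand-in for the listing's movelines function:
# it appends lines matching a pattern to a target file and leaves the
# non-matching lines in the source file.
move_lines() {   # Args: pattern source target
    tmp="$2.tmp.$$"
    # grep exits 1 when nothing matches; that is not an error here.
    grep -E -- "$1" "$2" >>"$3" || :
    grep -E -v -- "$1" "$2" >"$tmp" || :
    mv "$tmp" "$2"
}

# Work in a scratch directory (mktemp -d is assumed to be available).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' \
    '- 1994 Feb 23 Smith 01 John Lunch at Panda East.' \
    'NOW Renew passport!' \
    '- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.' \
    'Beans stock and info 800-221-4221' >combined.dat

move_lines '.'        combined.dat phone   # everything goes to phone first
move_lines '^- '      phone        1993    # dated entries leave phone
move_lines '^- 1994 ' 1993         1994    # 1994 entries leave 1993
move_lines '^NOW '    phone        now     # urgent items leave phone

# Show where each line ended up.
for f in 1993 1994 now phone; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```

Each rule only ever moves lines out of a file that an earlier rule has filled, which is why the rule-file check in the listing insists that every source after the first rule has already appeared as a target.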
Copyright © 2008  United Business Media LLC  |  Privacy Statement  |  Terms of Service  |  Your California Privacy Rights