
Need to organize your data? Here's a personal productivity tool
for managing lists of information
By Dr. Rebecca Thomas
If there is one thing that the information age has created,
it's gobs of data, often in unwieldy chunks. The key to keeping
organized is being able to extract the information you need in a
format that you can use.
Tom Baker provides a Korn shell script implementation of a
tool that helps manage line-oriented textual information. The
user constructs a rule file that tells the script how to
process the data files in the working directory.
Deal Me In
Dear Dr. Thomas:
My shuffle Korn shell program [Part A of Listing 1] is a rule-directed
list processor designed to organize files containing lists. It
is especially good for lists that undergo continual growth and
revision, such as calendars, phone directories, event logs, and
lists of things to do.
The name ``shuffle'' is based on a playing card metaphor.
When cards are shuffled, they are swept together, mixed, and
dealt back out into random hands. This script sweeps a set of
list files together into one big file, then--under direction of
regular expressions contained in rules defined by the user--deals
the data back out into a new set of list files and, when
directed, sorts them.
My lists contain one-line items of information: in principle,
anything that can be expressed within a line or sortable sequence
of lines [see Part B]. Some of the
data are structured by an organizing principle, such as date,
name, or priority.
These organizing principles are expressed in an editable set
of rules [see Part C]. Minimally,
each rule contains a search key, which is used with
egrep to extract a line from a source file into a
target file. Optionally, the rule also specifies a sort command
for the target file.
When shuffle is run, it first concatenates all of
the files specified as arguments into one big file, named by the
Allfiles variable. After making a safety backup, it
erases the originals, thus wiping the slate clean for their
reconstitution. From this one aggregate file,
shuffle extracts an entirely new set of lists.
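The sweep step described above might be sketched roughly as follows. This is not the code from Listing 1: the function name sweep, the backup directory name, and the sample lines are assumptions made up for illustration, and only the order of operations (concatenate, back up, erase) comes from the description.

```shell
# Hypothetical sketch of the "sweep" step; names are illustrative only.
# usage: sweep ALLFILES list1 list2 ...
sweep() {
    allfiles=$1; shift
    cat "$@" > "$allfiles" || return 1   # one big aggregate file
    mkdir -p backup || return 1          # safety copies live here
    cp "$@" backup/ || return 1
    rm -f "$@"                           # wipe the slate clean
}

# Demonstration in a scratch directory with invented sample data.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'Smith, J. 555-0100' > phone
printf '%s\n' '- 1994 03 01 met Smith' > 1994
sweep combined.dat phone 1994
```

After the call, only combined.dat and the backup directory remain; the original list files exist solely as backup copies until the rules deal them out again.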
Figure 1 shows a typical flow
diagram. The first rule extracts every line (specified by the
``.'' pattern) from the source file named by the
Allfiles variable (here combined.dat) into the target file,
here called phone. This effectively renames
combined.dat to phone, which is then
sorted.
The second rule moves all lines that begin with a hyphen
followed by a space character (``- '') from
phone into 1993 and sorts it by year.
The third rule moves all lines that start with the
``- 1994 '' pattern from 1993 into
1994. After all nine rules in our example
have been applied, any lines that remain are left in the file
named phone.
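The first three rules of this walk-through can be approximated with a small helper. It is a sketch under stated assumptions, not the logic of Listing 1: apply_rule and its argument order are invented here, the real rule-file syntax is the one shown in Part C, grep -E stands in for egrep, and the data lines are made up.

```shell
# Hypothetical helper: move lines matching KEY from SRC to TGT, then
# optionally sort TGT.  usage: apply_rule KEY SRC TGT [SORTCMD]
apply_rule() {
    key=$1; src=$2; tgt=$3; sortcmd=$4
    [ -f "$src" ] || return 0
    # grep exits 1 when nothing matches; that is not an error here.
    grep -E -e "$key" "$src" >> "$tgt" || :
    grep -E -v -e "$key" "$src" > "$src.tmp" || :
    mv "$src.tmp" "$src"
    if [ -n "$sortcmd" ]; then
        $sortcmd "$tgt" > "$tgt.tmp" && mv "$tgt.tmp" "$tgt"
    fi
}

# Demonstration with invented data; LC_ALL=C keeps sort order predictable.
LC_ALL=C; export LC_ALL
cd "$(mktemp -d)" || exit 1
cat > combined.dat <<'EOF'
- 1994 02 11 Called Smith about budget
Mann, Joachim 555-0101
- 1993 12 01 Ordered SGML handbook
Clothes Shoes size 9
EOF
apply_rule '.'        combined.dat phone sort   # rule 1: everything to phone
apply_rule '^- '      phone        1993  sort   # rule 2: log lines to 1993
apply_rule '^- 1994 ' 1993         1994  sort   # rule 3: 1994 entries onward
```

Because each rule removes what it extracts, the order of the rules matters: a later rule sees only what earlier rules left behind in its source file.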
If you edit a data line to match a different rule, you mark
that line for export to a different list. For example, I might
expand the information from the line shown [in Part D] into the lines shown [in Part E]. Then when I run
shuffle, the event lines will be moved into the 1994
log, Joachim Mann will go to the phone directory, and the article
on SGML will end up in a list of things to do later.
When you edit the rules, older lists are merged or new ones
created to meet new needs. For instance, the rule shown in Part F creates a separate list of
things I need to follow up on, such as the two items from the
Smith meeting.
Use of line-oriented data files means that I can use simple
grep searching commands to locate items that meet
certain criteria, for instance, ``show me everything I have on
Smith'' (grep Smith) or ``what is my shoe size?''
(grep -i shoe) or ``when is the music library
open?'' (grep musikbuecherei).
Furthermore, I often organize the elements of my data lines
from general to specific, reading from left to right. This
approach means that related items will be grouped together when
sorted: lines referring to ``Clothes Shoes'' will remain near the
``Clothes Pants'' and ``Clothes Shirts'' in the residual phone
file. This general-to-specific arrangement means that if a
search doesn't tell me what I want to know because it was too
specific (``Gap'' or ``Bean''), I can search for a more general
category (``pants'').
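A small illustration of the general-to-specific layout, using invented lines rather than anything from the listings: once sorted, the ``Clothes'' entries sit together, and a search that is too specific can be widened to the category.

```shell
# Invented sample data; LC_ALL=C keeps the sort order predictable.
LC_ALL=C; export LC_ALL
cd "$(mktemp -d)" || exit 1
cat > phone <<'EOF'
Clothes Shoes size 9 Bean
Mann, Joachim 555-0101
Clothes Pants 32x34 Gap
Clothes Shirts 16/34
EOF
sort -o phone phone     # related "Clothes ..." lines end up adjacent

# A search on a brand that is not in the list finds nothing...
grep -i 'levi' phone || echo 'no match for the specific term'
# ...but the more general category still turns up the line.
grep -i 'pants' phone
```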
I find that the rule file evolves as I edit the data. And
because the rule file is just another list--albeit a special
one--stored along with the files to which it refers, the set of
lists is largely self-documenting.
Tom Baker / Bonn, Germany
Configuration Notes: The shuffle program
was developed under the MKS Toolkit Korn shell running under DOS
3.3 and ported to Korn shell Version 11/16/88d running under
System V Release 4.0.3. It has been tested under the
environments mentioned in the ``acknowledgments'' paragraph near
the end of this column.
The configuration section (lines 12-23) was written to support
both DOS-based MKS Toolkit Korn shell and Unix-based Korn shell
versions, as indicated by the comments. For instance, DOS-based MKS
Toolkit has no equivalent of the Unix ``bit-bucket''
file /dev/null, so a temporary file is used instead
(line 21 instead of line 15).
Under MKS Toolkit, the rule file is named ``rules'' whereas
Unix users can use ``.rules''. The latter usage lets one invoke
the script using the asterisk wild card, as in shuffle
*, without fear of shuffling the rule file. Also, MKS
Toolkit does not have a command named nawk, but one
can either copy awk.exe to nawk.exe or
edit the script to invoke awk. By now, many
implementations use awk as the name of the ``new''
awk program, instead of nawk, a name
that was used when the new version was first introduced.
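For readers who would rather have the script work out these settings than edit them by hand, a configuration section along the following lines is one possibility. It is only a sketch: the variable names Null, Rules, and Awk are assumptions, the scratch-file name is invented, and command -v may not be available in every Korn shell port.

```shell
# Bit bucket: /dev/null where it exists, else a scratch file
# (the scratch-file name is hypothetical).
if [ -c /dev/null ]; then
    Null=/dev/null
else
    Null=${TMPDIR:-/tmp}/shuffle.$$
fi

# Rule file: hidden on Unix so that "shuffle *" skips it; plain name
# otherwise.  Choosing by "does .rules exist" is an assumption.
if [ -f .rules ]; then Rules=.rules; else Rules=rules; fi

# awk: prefer nawk if installed, otherwise assume awk is the "new" awk.
if command -v nawk > "$Null" 2>&1; then Awk=nawk; else Awk=awk; fi
```

The rest of a script would then refer to "$Null", "$Rules", and "$Awk" instead of hard-coded names.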
Usage Note: The shuffle program is
designed to process data files under direction of a rule file all
in the same directory. A backup subdirectory is created when
shuffle runs.
Testers' Comments: It's a nice and useful script, and
I was able to change it to handle multiline text to shuffle mail
files or Usenet news articles. By employing the public-domain
agrep program--which is record, not just line, oriented--and
using ``^From '' as the field delimiter, I could
extract data from our electronic mail support database. The same
idea holds for our news archives, although I had to modify
shuffle so it wouldn't combine all input into a
single file, which could be many megabytes in size.
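Readers without agrep can get a similar record-oriented extraction from awk. The sketch below is not the tester's modification; it swaps in awk for agrep, treats every line starting with ``From '' as the start of a new record, and uses an invented function name and invented mail data.

```shell
# Hypothetical helper: print every "From "-delimited record in FILE
# that matches PATTERN.  usage: extract_records PATTERN FILE
extract_records() {
    awk -v pat="$1" '
        /^From / { if (rec != "" && rec ~ pat) printf "%s", rec; rec = "" }
        { rec = rec $0 "\n" }
        END { if (rec != "" && rec ~ pat) printf "%s", rec }
    ' "$2"
}

# Demonstration with two made-up messages.
cd "$(mktemp -d)" || exit 1
cat > mbox <<'EOF'
From alice Mon Jan  3 10:00:00 1994
Subject: printer jam

Tray 2 is stuck.
From bob Tue Jan  4 11:00:00 1994
Subject: lunch

Noon?
EOF
extract_records 'printer' mbox > printer.mail
```

Only the first message ends up in printer.mail, headers and body together, because the match is tested against the whole record rather than against single lines.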
Additionally, I would like to see shuffle allow
read-only data files and allow sharing of files with my coworkers.
The latter means I would need to remove the restriction to use
files in a single directory. Also, there is no lock mechanism to
prevent two instances of the program from running at the same
time in the same directory.--Kees Hendrikse
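One common way to add such a lock, which shuffle itself does not provide, is to rely on mkdir failing when the directory already exists. The following is a minimal sketch; the lock directory name and the function names are assumptions.

```shell
# Hypothetical lock helpers built on mkdir, which either creates the
# directory or fails, so two instances cannot both succeed.
Lockdir=.shuffle.lock
acquire_lock() { mkdir "$Lockdir" 2>/dev/null; }
release_lock() { rmdir "$Lockdir"; }

# Demonstration: a second attempt is refused while the lock is held.
cd "$(mktemp -d)" || exit 1
if acquire_lock; then
    first=acquired
    if acquire_lock; then second=acquired; else second=refused; fi
    release_lock
fi
```

A real script would also want to release the lock on abnormal exit, for example from a trap handler, so that a crash does not leave the directory locked.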
The script runs unmodified under Unixware. It doesn't run on
my BSD 386 system--which uses Bash instead of the Korn
shell--unless you replace the print statements by
equivalent echo statements. I also had to replace
the nawk script by one written in Perl, which I
obtained by translating the script using the a2p
conversion utility provided by the Perl distribution. My guess
would be that shuffle could be improved further by
translating it completely to Perl.--Endre Bálint Nagy
On AIX 3.2 I had to rename awk to
nawk, and on both AIX and Ultrix 4.3 I had to
avoid the unsupported -M sort option in the
``rule'' file.--Steve Wright
This script worked fine under ISC 3.2.2, but had to be changed
significantly to run with Coherent (version 4.2.05). [See Part G
for the Coherent port of shuffle, which, by the way, should also
work with System V Release 2 and later Bourne shells and the old
awk.]--Gábor Zahemszky
Wanted: Rewrite Shuffle in Perl
I'm looking for a Perl version of the shuffle
program discussed here. We'll pay you US$100 for your trouble.
You're welcome to enhance or improve, as long as you coordinate
with me.
Acknowledgments
I wish to thank the following readers for their help with
testing this month's contributions: Gábor Zahemszky, CoDe
Ltd., Budapest, Hungary (ISC 3.2.2 Unix and Coherent 4.2); Kees
Hendrikse, Echelon Consultancy, Enschede, The Netherlands
(current SCO Unix and Xenix versions); Endre Bálint Nagy,
Walton Networking Ltd., Budapest, Hungary (Unixware Application
Server 1.0); and Steve Wright, Computer Science Dept., University
of South Carolina, Columbia, S.C. (AIX 3.2).