file: README.ProgNotes.txt	started 2008.08.19

Just some program notes ... some items are marked *TBD*, which means
I hope to revisit these at some future stage, when I have a viable,
functional tar ...

NOTE: found -
HANDLE h = (HANDLE) _get_osfhandle (fd);
to get the HANDLE associated with a low level file descriptor ...
this might work, instead of converting all the low level
calls (open, read, write, seek, close, etc) to WIN32 calls ...
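
A minimal sketch of that idea, assuming only that the descriptor came from
the C runtime. The helper name fd_size64 is made up for this note and is not
part of tar. On WIN32 it takes the HANDLE from _get_osfhandle() and asks
GetFileSizeEx() for the 64-bit size; elsewhere it falls back to fstat(), so
the same call can be tried on unix too -

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#include <io.h>          /* _get_osfhandle */
#else
#include <sys/types.h>
#include <sys/stat.h>    /* fstat */
#endif

/* fd_size64 is a hypothetical helper, not part of tar.
   Returns 0 and stores the file size in *size, or -1 on failure. */
int fd_size64 (int fd, long long *size)
{
#ifdef _WIN32
  /* _get_osfhandle returns an intptr_t, so a cast to HANDLE is needed */
  HANDLE h = (HANDLE) _get_osfhandle (fd);
  LARGE_INTEGER li;

  if (h == INVALID_HANDLE_VALUE || !GetFileSizeEx (h, &li))
    return -1;
  *size = li.QuadPart;
#else
  struct stat sb;

  if (fstat (fd, &sb) != 0)
    return -1;
  *size = (long long) sb.st_size;
#endif
  return 0;
}
```

The low level open/read/write calls stay as they are; only the one call that
needs a WIN32 feature goes through the HANDLE.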

Working on --create and --sparse

First I modified genfile to create a WIN32 sparse file by using the native CreateFile()
so the file IO could be marked as 'sparse' ...

But then the tar --create would incorrectly find an 8GB + 512 byte sparse file as
having only 512 bytes of data ...

But in misc.c, it does a -
int deref_stat (bool deref, char const *name, struct stat *buf)
{
  return deref ? stat (name, buf) : lstat (name, buf);
}
which only gets the 32-bit file size ... so this must be changed ...
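
The 512 bytes is what is left when only the low 32 bits of the size
survive: 8GB is 2 to the power 33, an exact multiple of 2 to the power 32,
so it vanishes and just the extra 512 remains. A small sketch of that
arithmetic (the function name is made up for illustration, and it assumes
the size is simply truncated, which matches what was seen) -

```c
/* What a 32-bit size member keeps of a 64-bit length: the low 32 bits.
   Hypothetical helper, for illustration only. */
unsigned long long size_seen_in_32_bits (unsigned long long real_size)
{
  return real_size & 0xFFFFFFFFULL;
}
```

For the 4,732,825,600 byte archive mentioned further down, the same
truncation leaves 437,858,304.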

In this case it is called from -
static void
dump_file0 (struct tar_stat_info *st, const char *p,
	    int top_level, dev_t parent_device)
which is passing a struct tar_stat_info * st, which only contains a 32-bit stat
structure ...
and this does -
  st->archive_file_size = original_size = st->stat.st_size;
so each of these needs 64-bit sizes ... called from dump_file() ...

Read below how I TRIED to define off_t to 64-bits, but this is used in
too many places in windows structures, and it caused a BIG problem,
so I defined a new off64_t as
typedef __int64 off64_t;
and began using this ...

But this may also apply to size_t ... it is 64 bits only if _WIN64 is defined, which can NOT
be used to create a 32-bit application -
#ifndef _SIZE_T_DEFINED
#ifdef _WIN64
typedef unsigned __int64    size_t;
#else
typedef __w64 unsigned int  size_t;
#endif
#define _SIZE_T_DEFINED
#endif


Working On --delete

I have a very large tar file -
11/04/2008  20:24     4,732,825,600 DELL02-Friday.tar

This is OVER the maximum value of a DWORD (32 bits), 0xFFFFFFFF = 4,294,967,295,
and thus I had some problems.

My first thought was to define off_t to 64-bits, like -
typedef __int64 off_t;
but when compiled with this, which I have left in config.h,
under a big #ifdef   USE_64BIT_OFFSET_TYPE
the program would not run ... ???

Perhaps because off_t is used too frequently in windows structures.
A search for "off_t" finds it in wchar.h & sys\types.h, as -
typedef long off_t; with #define _OFF_T_DEFINED
so that forcing a define to 64 bits makes a problem ...


So I set about modifying some specific services to use 64-bits, and added
typedef __int64 off64_t;

Specifically in the module delete.c ... there is a function -
static void move_archive (off_t count)
which uses _lseek(), and some integer arithmetic to move the file pointer,
either forward or backwards.

I now have this 'math' working correctly, and use _lseeki64(...) so that
such large files can be handled correctly.
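
A rough sketch of that kind of 64-bit record arithmetic. This is NOT the
code in delete.c; the names move_by_records, pos64_t and seek64 are made up
here, and pos64_t stands in for off64_t. The unix branch assumes off_t is
already 64 bits (a 64-bit build, or -D_FILE_OFFSET_BITS=64) -

```c
#include <limits.h>
#include <stdio.h>         /* SEEK_SET, SEEK_CUR */
#ifdef _WIN32
#include <io.h>            /* _lseeki64 */
#define seek64(fd, off, whence) _lseeki64 ((fd), (off), (whence))
#else
#include <sys/types.h>
#include <unistd.h>        /* lseek; assumes a 64-bit off_t */
#define seek64(fd, off, whence) \
  ((long long) lseek ((fd), (off_t) (off), (whence)))
#endif

typedef long long pos64_t;   /* stand-in for the off64_t of these notes */

/* Move the file pointer by 'count' records (negative = backwards),
   doing all the arithmetic in 64 bits.
   Returns the new position, or -1 on overflow or seek failure. */
pos64_t move_by_records (int fd, pos64_t count, pos64_t record_size)
{
  pos64_t here, increment, target;

  here = seek64 (fd, 0, SEEK_CUR);
  if (here < 0 || record_size <= 0)
    return -1;
  /* refuse a count whose byte offset would not fit in 64 bits */
  if (count > LLONG_MAX / record_size || count < LLONG_MIN / record_size)
    return -1;
  increment = count * record_size;
  if (increment > 0 && here > LLONG_MAX - increment)
    return -1;
  target = here + increment;
  if (target < 0)
    target = 0;              /* never seek before the start of the file */
  return seek64 (fd, target, SEEK_SET);
}
```

With the default record size of 10,240, a call like
move_by_records (fd, 500000, 10240) moves more than 5 GB forward, which a
32-bit _lseek() can not express.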

Inside the above large tar is a single file -
-rw-rw-rw- geoff/geoff 2130378752 2008-04-11 19:28 home/geoff/bin/v

I wanted to use --delete to REMOVE this file from the archive. Well
eventually I got the code working correctly, but to my amazement, the file
size remained unchanged.

The command was :
> tar --delete -vf D:\SAVES\docs\backup\temp\tempd2.tar home/geoff/bin/v

It seems that the --delete function only MODIFIES the 'blocks' belonging to
that file, and does NOT actually remove them ... all this is a bit difficult,
since there are some of my usual utilities that will not handle such a file!
And everything takes so long to complete ...

But as stated, it worked, and after the delete, the command
> tar -tvf D:\SAVES\docs\backup\temp\tempd2.tar home/geoff/bin/v
could not FIND such a file ...

But the whole purpose of deleting this from the file was to REDUCE the
file size, but at present the delete does not do that ... also a minor point,
but there was no output saying it had found the file, and was doing the
delete, even though I had a -v ...

I decided to ADD some diagnostic code, invoked by using -vvvvv, that elevates
the variable verbose_option to at least 5 ... so I can 'see' what is happening. I
will try to output each message starting with 'v5: ', to differentiate them
from other outputs ... this will be under a switch -
#ifdef DBGV5
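
One possible shape for such a diagnostic, as a sketch only. The function
v5_message is a made-up name, not something in the tar source, and the
verbose level is passed in as a parameter here rather than read from the
global verbose_option -

```c
#include <stdarg.h>
#include <stdio.h>

#ifndef DBGV5
#define DBGV5 1   /* normally set on the compiler command line */
#endif

/* Hypothetical helper: print a 'v5: ' diagnostic only at -vvvvv or above.
   Returns the number of characters written, 0 if suppressed, -1 on error. */
int v5_message (FILE *out, int verbose_option, const char *fmt, ...)
{
#ifdef DBGV5
  va_list ap;
  int body;

  if (verbose_option < 5)
    return 0;
  if (fputs ("v5: ", out) == EOF)
    return -1;
  va_start (ap, fmt);
  body = vfprintf (out, fmt, ap);
  va_end (ap);
  return body < 0 ? -1 : 4 + body;
#else
  /* compiled out: no diagnostics at all */
  (void) out; (void) verbose_option; (void) fmt;
  return 0;
#endif
}
```

Used like v5_message (stderr, verbose_option, "deleting %s\n", name);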

Whether an archive is seekable, that is _lseek() can be used, determines
whether void skip_file (off_t size) uses seek_archive (size);, or
actual file reads are performed - x = find_next_block ();

When the --delete option is given, the seekable_archive member is set
false! That is, a read is performed ... it defaults to false anyway, unless
a -n is given ... as a small aside, I do not like the fact that
this, and many others, are simply declared in a header, common.h -
GLOBAL bool seekable_archive;
where it seems GLOBAL is defined as 'extern' ... but anyway ...

The skip_file(size) is called for each entry, except for a DIRECTORY (save_typeflag != DIRTYPE),
using the size value from current_stat_info.stat.st_size ... this in itself has
current limits, in that the present compile uses stat 32-bits, where the 'st_size'
member is only 32-bit - max value 4,294,967,295 bytes!!! *TBD*

Ok, the idea for this --delete is NOT to 'seek' backwards and 'destroy'
the entry header, leaving it and its data in place as a blank, but,
when --delete is used, to start a NEW 'temporary' archive, and write all data,
except the entries to be deleted, to this new archive.

If fully successful in finding things to delete, delete the original, and copy
that temporary to the archive's name, thus creating a SMALLER archive ... simple
idea, but to implement it ...

But first to understand the --delete current actions ...
void delete_archive_members (void){ ... }

It reads a header, and verifies the header, then compares the name to those
requested ... static struct name * namelist_match (char const *file_name, size_t length)

In the case HEADER_SUCCESS:
if ((name = name_scan (current_stat_info.file_name)) == NULL) {
  skip_member ();
  break;
}
name->found_count++;

The first thing seems to be to change 'open_archive', in
delete_archive_members()
from open_archive (ACCESS_UPDATE);
to   open_archive (ACCESS_READ);
like in list.c, since we do NOT want to WRITE anything to the archive.

This, and other code changes, will ALL be under a 
#ifdef NEW_DELETE_FUNCTION switch, to allow full reversion to the
'normal' action ...

Ok, the process looks relatively straightforward. There is a 'union'
defined in tar.h -
union block {
  char buffer[BLOCKSIZE];
  struct posix_header header;
  struct star_header star_header;
  ... etc ...
  };

This is an allocated buffer, aligned, kept in record_start. The size of the
buffer defaults to -
record_size = DEFAULT_BLOCKING(20) * BLOCKSIZE(512); = 10,240, 0x2800,
but can be set to any multiple of 512 (BLOCKSIZE)... the variable
blocking_factor = record_size / BLOCKSIZE;
These defaults can be set by -
 -b, --blocking-factor=BLOCKS   BLOCKS x 512 bytes per record
     --record-size=NUMBER   NUMBER of bytes per record, multiple of 512

From init_buffer (), in buffer.c ...
record_start = record_buffer_aligned[record_index];
current_block = record_start;
record_end = record_start + blocking_factor;
  
So, blocks of the file are read into record_start, then current_block
is incremented, until it reaches record_end, when another record will be
read in ...
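
The pointer walk can be sketched like this. It is only an illustration of
the record_start / current_block / record_end relationship, using a cut-down
union block and a made-up function name walk_record, not the buffer.c code -

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCKSIZE 512

/* cut-down version of the union in tar.h, buffer member only */
union block { char buffer[BLOCKSIZE]; };

/* Hypothetical sketch: walk one record block by block, the way
   current_block moves from record_start up to record_end.
   Returns the number of blocks visited, or 0 if allocation fails. */
size_t walk_record (size_t blocking_factor)
{
  size_t record_size = blocking_factor * BLOCKSIZE;
  union block *record_start = malloc (record_size);
  union block *record_end, *current_block;
  size_t visited = 0;

  if (record_start == NULL)
    return 0;
  record_end = record_start + blocking_factor;
  for (current_block = record_start; current_block != record_end;
       current_block++)
    visited++;   /* a real reader would use current_block->buffer here */
  free (record_start);
  return visited;
}
```

With the default blocking factor of 20 this visits 20 blocks, that is
10,240 bytes, before the next record would have to be read.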

Since I am opening a 'temporary' archive, my block processing can be quite
simple ... if it does not match a file to delete, write contents to the
NEW tar file ... if it does match the requested delete files, then simply
skip the data ...
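
A much simplified sketch of that copy-and-skip loop. This is NOT the code
in windelete.c; copy_without is a made-up name. It assumes a plain archive
where each header holds the name at offset 0 (100 bytes) and the size in
octal at offset 124 (12 bytes), matches one exact name only, and ignores
checksums, long names and sparse members -

```c
#include <stdio.h>
#include <string.h>

#define BLOCKSIZE 512

/* Hypothetical sketch: copy archive 'in' to 'out', leaving out every
   member whose name equals 'victim'.
   Returns the number of members dropped, or -1 on a read/write error. */
int copy_without (FILE *in, FILE *out, const char *victim)
{
  char block[BLOCKSIZE];
  int dropped = 0;

  while (fread (block, 1, BLOCKSIZE, in) == BLOCKSIZE)
    {
      char name[101];
      unsigned long long size = 0, data_blocks, n;
      int i, skip;

      if (block[0] == '\0')
        {
          /* zero block: end-of-archive marker or padding, keep it */
          if (fwrite (block, 1, BLOCKSIZE, out) != BLOCKSIZE)
            return -1;
          continue;
        }
      memcpy (name, block, 100);
      name[100] = '\0';

      /* parse the octal size field, allowing leading spaces */
      i = 124;
      while (i < 136 && block[i] == ' ')
        i++;
      for (; i < 136 && block[i] >= '0' && block[i] <= '7'; i++)
        size = size * 8 + (unsigned) (block[i] - '0');
      data_blocks = (size + BLOCKSIZE - 1) / BLOCKSIZE;

      skip = strcmp (name, victim) == 0;
      if (skip)
        dropped++;
      else if (fwrite (block, 1, BLOCKSIZE, out) != BLOCKSIZE)
        return -1;

      /* the data blocks follow the header: copy them or pass over them */
      for (n = 0; n < data_blocks; n++)
        {
          if (fread (block, 1, BLOCKSIZE, in) != BLOCKSIZE)
            return -1;
          if (!skip && fwrite (block, 1, BLOCKSIZE, out) != BLOCKSIZE)
            return -1;
        }
    }
  return ferror (in) ? -1 : dropped;
}
```

The original archive is only ever read, which fits with opening it
ACCESS_READ; the swap of temporary for original would happen after this
returns without error.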


At the end decide what to do ... most seems to be working 2008-08-21 ;=))

Ok, offer BOTH - the original record changing method, which keeps the archive
at the same size, and a new windows delete archive members, which shortens
the file (in some cases).

The 'in some cases' refers to the general rounding of the archive to a multiple
of the record length, which defaults to 0x2800 (10,240) bytes. This rounding does
not seem necessary, but I have coded it such that if the original archive,
by default or design, was an exact multiple, then I create a new one that is an
exact multiple. This should be a command line option.
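
The rounding itself is simple arithmetic. A sketch, with a made-up helper
name round_up_to_record that is not in the source -

```c
/* Hypothetical helper: round an archive length up to a whole number of
   records.  A record_size of 0 leaves the length unchanged. */
unsigned long long round_up_to_record (unsigned long long length,
                                       unsigned long long record_size)
{
  unsigned long long remainder;

  if (record_size == 0)
    return length;
  remainder = length % record_size;
  return remainder == 0 ? length : length + (record_size - remainder);
}
```

For example 10,241 bytes rounds up to 20,480, while the 4,732,825,600 byte
archive above is already exactly 462,190 records of 10,240 bytes.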

All the changes have been placed in a new module, in Win32 folder,
windelete.c ... there is a compile option to NOT use this module, but at present
there is no command line option to switch. 

LINK TYPES
==========

Like, header->header.typeflag == LNKTYPE, these are NOT presently
handled very correctly ... *TBD*

MAKE DIRECTORY
==============

It appears 'mkdir' in unix, takes two parameters (char * path, int mode)
while the windows mkdir, in <direct.h> only takes 1 param (char * path)
This produces a number of warnings of the form, like -
tempname.c(277) : warning C4020: 'mkdir' : too many actual parameters *TBD*
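
One way to quiet these would be a small wrapper that accepts the unix style
two parameters and drops the mode on windows. A sketch, where portable_mkdir
and portable_rmdir are made-up names, not anything in the tar source -

```c
#ifdef _WIN32
#include <direct.h>      /* _mkdir, _rmdir: path only */
#else
#include <sys/types.h>
#include <sys/stat.h>    /* mkdir (path, mode) */
#include <unistd.h>      /* rmdir */
#endif

/* Hypothetical wrapper: unix style signature on both systems.
   On windows the mode is simply ignored. */
int portable_mkdir (const char *path, int mode)
{
#ifdef _WIN32
  (void) mode;
  return _mkdir (path);
#else
  return mkdir (path, (mode_t) mode);
#endif
}

/* Matching remove, so callers need no #ifdef of their own. */
int portable_rmdir (const char *path)
{
#ifdef _WIN32
  return _rmdir (path);
#else
  return rmdir (path);
#endif
}
```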

UNQUOTING STRINGS
=================

NOTE: int unquote_string (char *string), called from -
struct name_elt * name_next_elt (int change_dirs) if
unquote_option is ON (default) ... called from
void name_gather (void) ... from
void read_and (void (*do_something) (void)) ... like -t, listing
WILL CHANGE THINGS LIKE '\n' !!!
It is thus important to give a file name path using the unix
path separator, '/', like home/name/bin/file_name ... *TBD*
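
A possible guard, as a sketch only: convert any '\' to '/' before the name
gets anywhere near unquote_string(). The helper to_unix_separators is a
made-up name, and where it would be called from is left open -

```c
#include <string.h>

/* Hypothetical helper: turn every '\' into '/' in place, so there are
   no backslash sequences left for the unquoting to rewrite. */
char *to_unix_separators (char *path)
{
  char *p = path;

  while ((p = strchr (p, '\\')) != NULL)
    *p++ = '/';
  return path;
}
```

So home\name\bin\file_name becomes home/name/bin/file_name, and a name
like sub\new is no longer at risk of having its \n changed.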


OBSERVATIONS USING tar 1.19 in ubuntu
=====================================

$ tar -cvf temp1.tar .
produces a tar with all the files, and directories, starting with this ./!
./
./test1.c
./sub1/
...
./asub/
etc, in order found

$ tar -cvf temp1.tar *
produces a tar with all the files, and directories,
asub/
...
test1.c
...
zsub/
in alphabetic sequence.

In both cases, it avoids trying to 'add' the archive being created, with
a message like -
tar: ./temp1.tar: file is the archive: not dumped

$ tar -cvf temp3.tar sub1
$ tar -cvf temp4.tar sub1/*
both produce a tar with all the files in directory sub1, but temp3.tar
also contains a sub1/ entry.
$ tar -cvf temp5.tar sub1/.
produces files listed as
sub1/./
sub1/./ssub1/
... etc

$ tar -cvf temp3.tar --exclude '*.tar' sub1
will include all, in the order found. Note the '*.tar' has to have
the single quotes to stop the shell expanding it ... thus my
'expand_wild' function 'emulates' this ... that is, if it finds
a command item encased in single quotes, it adds it without
expansion, and without the single quotes ...

That is
$ ./test1 *.tar
will pass an alphabetic list to the application, including any folders
that match the pattern ...
$ ./test1 '*.tar'
will pass *.tar to the application, without expansion or quotes.
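
The quote handling can be sketched as below. This is not the actual
'expand_wild' code; strip_single_quotes is a made-up helper showing only
the decision "quoted, so pass on as is, minus the quotes" -

```c
#include <string.h>

/* Hypothetical helper: if 'arg' is wrapped in single quotes, copy it to
   'out' without the quotes and return 1, meaning "do not expand".
   Otherwise copy it unchanged and return 0, meaning "may be expanded".
   Returns -1 if 'out' (of 'outsize' bytes) is too small. */
int strip_single_quotes (const char *arg, char *out, size_t outsize)
{
  size_t len = strlen (arg);
  int quoted = len >= 2 && arg[0] == '\'' && arg[len - 1] == '\'';
  const char *src = quoted ? arg + 1 : arg;
  size_t n = quoted ? len - 2 : len;

  if (n + 1 > outsize)
    return -1;
  memcpy (out, src, n);
  out[n] = '\0';
  return quoted;
}
```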

$ tar -cvf temp4.tar --exclude '*.tar' sub1/*
will include all, with those in the sub1/ folder in alphabetic order,
but those in the sub-folder, in the order found.

$ tar -cvf temp5.tar --exclude '*.tar' sub1/.
will include all, in the order found, all preceded with
sub1/. !


SOME TESTING AND PROGRAM NOTES
==============================

tests 1:
--totals -wxvf c:\DTEMP\temp1.tar	= OK
--totals -xvf c:\DTEMP\temp2.tar	= OK
--totals -tvf c:\DTEMP\temp2.tar
-cf temp3.tar Data-Section-0.005	= ok, fixed
-cf temp3.tar Data-Section-0.005\*
-cf temp5.tar Sub-Install-0.924\*
--totals -xzvf temp7.tar.gz
--help

tests 2:
--totals -dvf c:\DTEMP\temp2.tar Data-Section-0.005\*
Hmmm, shows 'changed' files, even if just date, but does NOT politely show
'deleted' files - that is file found in archive, no longer exists
in reality ... it shows an error type message ... hmmmm, I suppose that is ok ;=))
But it does NOT do a directory search, and find new files ...
AND IT WORKS, LIKE
--totals -tvf c:\DTEMP\temp2.tar --wildcards '*.pm'

Hmmm, how much work to add an -@input.fil? Much easier to do repeated
tests, just changing the contents of the inputfile ... ok, did that ...
now to some more testing ...
using -@testinp.txt

test 3:
--totals
-dvf
c:\DTEMP\temp2.tar
Data-Section-0.005

test 4:
--totals
--keep-newer-files	; keep if newer (or same age)
-xvvf
c:\DTEMP\temp1.tar
There is a warning when the file is deleted, but it does
write a NEW copy ...


Report bugs to <bug-tar@gnu.org>
Report bugs to <ubuntu@geoffair.info>

tar online help: http://www.gnu.org/software/tar/manual/tar.html

Notes on device number
======================
Options:
      --check-device         check device numbers when creating incremental
                             archives (default)
Sets check_device_option, case CHECK_DEVICE_OPTION:
      --no-check-device      do not check device numbers when creating
                             incremental archives
Unset check_device_option, case NO_CHECK_DEVICE_OPTION:
Use: In incremen.c, in function -
static struct directory *
procdir (char *name_buffer, struct stat *stat_data,
	 dev_t device,
	 enum children children,
	 bool verbose,
	 char *entry)
{
  struct directory *directory;
  bool nfs = NFS_FILE_STAT (*stat_data);

  ...
      if (! ((!check_device_option
	      || (DIR_IS_NFS (directory) && nfs)
	      || directory->device_number == stat_data->st_dev)
	     && directory->inode_number == stat_data->st_ino))
	{
	  /* FIXME: find_directory_meta ignores nfs */
	  struct directory *d = find_directory_meta (stat_data->st_dev,
						     stat_data->st_ino);
	  if (d)
  ...
}   
Called from function -
/* Recursively scan the given directory. */
static const char *
scan_directory (char *dir, dev_t device)
{
  ...
}  
called from function -
const char *
get_directory_contents (char *dir, dev_t device)
{
  return scan_directory (dir, device);
}
called from function -
static void
diff_dumpdir (void)
{
  ...
}
and
static void
add_hierarchy_to_namelist (struct name *name, dev_t device)
{
 ...
}

Notes on LINKS
==============
Need to fix this, so that instead of reporting that it can not create a
hardlink to another file, in another directory, tar will
write the file in 'this' directory ...

Notes on MS compress/expand
===========================

Unfortunately Microsoft has its own pair of utilities, called
compress.exe and expand.exe,
versus the single unix compress.exe, which uses -d to decompress ...

The unix (GNU WIN32 port of) compress.exe, uses the extension .Z,
hence in tar -Z will signal to use 'compress' as the compressor/
decompressor, and like gzip, given say test1.tar will produce
test1.tar.Z, and remove test1.tar ... and again like 'gzip',
will take the command -d test1.tar, find test1.tar.Z,
decompress it to test1.tar, and remove test1.tar.Z.

The MS compress.exe does not have a 'default' extension, but the USUAL
thing is to remove the last letter of the file extension, and replace
it with an underscore - like the command
C:\folder1> Compress text.txt c:\folder2 r
will produce a file in the folder2 directory, and
the compressed file will be named: text.tx_

The underscore identifies the file as a compressed file.
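
The two naming conventions side by side, as a sketch. Both function names
are made up for this note. What MS compress does with a name that has no
extension is not covered above, so this sketch just leaves such names
unchanged, which is an assumption -

```c
#include <string.h>

/* Hypothetical helper: MS style, replace the last character of the
   extension with '_'.  Names without an extension are copied unchanged
   (an assumption of this sketch).  Returns 0, or -1 if 'out' is too small. */
int ms_compressed_name (const char *name, char *out, size_t outsize)
{
  size_t len = strlen (name);
  const char *dot = strrchr (name, '.');

  if (len + 1 > outsize)
    return -1;
  memcpy (out, name, len + 1);
  /* only count a dot that sits in the last path component */
  if (dot != NULL && dot[1] != '\0' && strpbrk (dot, "/\\") == NULL)
    out[len - 1] = '_';
  return 0;
}

/* Hypothetical helper: unix compress style, append ".Z" to the name.
   Returns 0, or -1 if 'out' is too small. */
int unix_compressed_name (const char *name, char *out, size_t outsize)
{
  size_t len = strlen (name);

  if (len + 3 > outsize)
    return -1;
  memcpy (out, name, len);
  memcpy (out + len, ".Z", 3);   /* copies the terminating NUL too */
  return 0;
}
```

So text.txt gives text.tx_ one way, and test1.tar gives test1.tar.Z the
other.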

Geoff.
http://geoffair.net/unix/tar-01.htm
2008-08-19

EOF - README.ProgNotes.txt

