Sunday, July 11, 2021

git_find_big Contrib Back

I am contributing back some modifications as a way of thanking Antony Stubbs for his git_find_big.sh script.

As per Antony Stubbs's recommendations, I have added documentation for the changes I made, along with a few more changes, so this should be the latest version.

#!/bin/bash
#set -x

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# Did some modifications on the script - 08-July-2021 @author Khamis Siksek
# [KS] changed the size to kilobytes
# [KS] added KB=1024 constant to be used later in size calculations
# [KS] changed " to ' where applicable
# [KS] added a check for the pack file if it exists or not
# [KS] made the number of returned big files become a passable parameter
# [KS] used topBigFilesNo=10 as default value if not passed
# [KS] changed `command` to $(command) where applicable
# [KS] put the output in formattedText and echo that variable
# [KS] added exit 0 in case of success and exit -1 in case of an error
# [KS] packFile might hold multiple idx files; that's why I used $(echo ${packFile}) in verify-pack
# [KS] added a check on the size and compressedSize since if they are too small they will show wrong output
# [KS] changed the variable "y" to "object" to make it more readable
# [KS] enclosed all variables with {} wherever applicable
# [KS] changed sort to regular sort instead of reverse and used tail instead of head
# [KS] added more types to grep -v in objects (was only chain now it contains commit and tree)
# [KS] added an informative message for the user that this may take a few minutes

# make the number of returned big files configurable and can be passed as a parameter
topBigFilesNo=${1};
[[ -z "${1}" ]] && topBigFilesNo=10;

# check if the pack file exists or not
packFile=$(ls -1S .git/objects/pack/pack-*.idx 2> /dev/null);
[[ $? != 0 ]] && echo "index pack file(s) in .git do not exist" && exit -1;

# informative message for the user
echo 'This may take a few seconds (or minutes) depending on the size of the repository, please wait ...';

objects=$(git verify-pack -v $(echo "${packFile}") | grep -v 'chain\|commit\|tree' | sort -k3n | tail -"${topBigFilesNo}");

# as these are big files, it is more reasonable to show the size in KiB
echo 'All sizes are in KiBs. The pack column is the size of the object, compressed, inside the pack file.';

# constant
KB=1024;

# set the internal field separator to line break, to iterate easily over the verify-pack output
IFS=$'\n';

# preparing the header of the output
output='Size,Pack,SHA,Location';

# loop goes through the objects to check their sizes
for object in $objects
do
    # extract the size in kilobytes
    size=$(echo ${object} | cut -f 5 -d ' ');
    [[ ! -z ${size} ]] && size=$((${size}/${KB})) || size=0;

    # extract the compressed size in kilobytes
    compressedSize=$(echo ${object} | cut -f 6 -d ' ');
    [[ ! -z ${compressedSize} ]] && compressedSize=$((${compressedSize}/${KB})) || compressedSize=0;

    # extract the SHA
    sha=$(echo ${object} | cut -f 1 -d ' ');

    # find the objects location in the repository tree
    other=$(git rev-list --all --objects | grep ${sha});
    
    #lineBreak=$(echo -e "\n")
    output="${output}\n${size},${compressedSize},${other}";
done

formattedOutput=$(echo -e "${output}" | column -t -s ', ');
echo "${formattedOutput}";

exit 0;
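The core of the script is the verify-pack pipeline: drop the non-blob lines, sort numerically on the size column, and keep the tail. Here is a throwaway sketch of that pipeline against a freshly packed repository (it assumes git and GNU coreutils are available; the repo location, file name, and identity settings below are made up for the demo):

```shell
#!/bin/bash
set -e;

# build a disposable repo with one noticeably large blob
repo=$(mktemp -d);
cd "${repo}";
git init -q;
git config user.email 'demo@example.com';
git config user.name 'demo';
head -c 200000 /dev/urandom > big.bin;
git add big.bin;
git commit -qm 'add big blob';

# git gc writes the pack and its index under .git/objects/pack/
git gc -q;
packFile=$(ls -1S .git/objects/pack/pack-*.idx);

# same core pipeline as the script: filter, sort by size (column 3), keep the biggest
biggest=$(git verify-pack -v ${packFile} | grep -v 'chain\|commit\|tree' | sort -k3n | tail -5);
echo "${biggest}";
```

The large blob shows up in the last lines of the output, since `sort -k3n` puts the biggest sizes at the bottom and `tail` keeps them.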

Wednesday, July 24, 2019

xml2csv.sh

Thanks to David, who kept irritating me because my old xml2csv.sh script took 7-8 minutes to generate around 13k+ lines of CSV, I went back and rewrote the script in a way that now takes around 2 seconds :-D

The script contains validation and comments, but putting that aside, the conversion is actually a one-liner, as follows:

tail +n filename.xml | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/'g | tr -d '\n' | sed s'/<\/\?ITEM>/\n/'g | grep -v '^$'

The only parts that need to be changed in the previous command are the following:
  • +n: Lines to skip
  • filename.xml: The xml file to convert to csv
  • ^: The delimiter to be used in the output file
  • ITEM: The XML object item header such as in the following:
<ITEM>
<CAR-LICENSE>83838</CAR-LICENSE>
<CAR-MODEL>Ferrari</CAR-MODEL>
<CAR-YEAR>2015</CAR-YEAR>
</ITEM>
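The one-liner can be tried directly on the sample item above (GNU sed is assumed for the `\+` and `\?` operators; `tail -n +1` is used here because the sample has no meta lines to skip):

```shell
#!/bin/bash
# write the sample <ITEM> block to a temp file
xml=$(mktemp);
printf '%s\n' '<ITEM>' \
  '<CAR-LICENSE>83838</CAR-LICENSE>' \
  '<CAR-MODEL>Ferrari</CAR-MODEL>' \
  '<CAR-YEAR>2015</CAR-YEAR>' \
  '</ITEM>' > "${xml}";

# the one-liner: strip tag pairs, collapse to one line, split on ITEM tags, drop empties
csv=$(tail -n +1 "${xml}" | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/'g | tr -d '\n' | sed s'/<\/\?ITEM>/\n/'g | grep -v '^$');
echo "${csv}";   # prints 83838^Ferrari^2015^
```

Each `<TAG>value</TAG>` line becomes `value^`, the lone `<ITEM>`/`</ITEM>` lines survive the first sed (no closing tag on the same line), and the second sed turns them into the record separators.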

And this is the long script:

#!/bin/bash
###############################################################################
#    Written by Khamis Siksek (Saksoook)
#    khamis dot siksek at gmail dot com
#    14 July 2019
# Description:
#    A bash script that aims to convert XML input to CSV file
# License:
#    This program/script is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <https://www.gnu.org/licenses/>.
###############################################################################

# Passed variables
XML_FILENAME="${1}"; #'Filename.xml';
LINES_TO_SKIP="${2}"; # If there are XML meta tags that do not contain data
CSV_FIELDS_FILENAME="${3}"; #'csv_fields.txt';
INITIAL_ITEM_TAG="${4}"; #'ITEMS';
OUTPUT_DELIMITER="${5:-^}"; # Default value ^

# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${XML_FILENAME}")_$$.csv";
OUTPUT_FILENAME="${6:-/tmp/${GENERATED_NAME}}"; # Can be empty

# Validation for mandatory fields (using && operator)
[[ -z "${XML_FILENAME}" ]] && echo 'Error: XML_FILENAME cannot be empty' && exit -1;
[[ -z "${CSV_FIELDS_FILENAME}" ]] && echo 'Error: CSV_FIELDS_FILENAME cannot be empty' && exit -2;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' && exit -3;

# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(echo `cat "${CSV_FIELDS_FILENAME}"` | sed s'/ /'"${OUTPUT_DELIMITER}"'/'g);

echo "${HEADER_LINE}" > "${OUTPUT_FILENAME}"; # newline-terminated, otherwise the first record would append to the header line

#######################################################################################################################
# This is much faster than looping through the file and processing it line-by-line. I kept the loop version for
# historical purposes only. I knew that running text-processing commands over a whole file is much faster than
# reading each line and spawning command(s) per line, since such commands are optimized to process big and huge
# files, but I thought the impact would be a fraction of a second; it turns out the impact is really big
# (2-3 seconds the new way vs. up to 7-8 minutes the old way).
#
# Now to explain the command (the order of the commands is important):
#
#tail +"${LINES_TO_SKIP}" "${XML_FILENAME}": skip the first few lines if not related (such as xml meta-tags)
#sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/'g: remove the xml tags and add the delimiter
#tr -d '\n': remove the new lines from the result making it one line
#sed s'/<\/\?'"${INITIAL_ITEM_TAG}"'>/\n/'g: replace the INITIAL_ITEM_TAG with a newline
#grep -v '^$' : because of the previous sed command there will be extra empty lines so this command removes the empty lines
#######################################################################################################################

tail +"${LINES_TO_SKIP}" "${XML_FILENAME}" | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/'g | tr -d '\n' | sed s'/<\/\?'"${INITIAL_ITEM_TAG}"'>/\n/'g | grep -v '^$' >> "${OUTPUT_FILENAME}";
exit 0
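The HEADER_LINE step is worth a small sketch of its own: the unquoted command substitution collapses the newlines in the fields file to single spaces, and sed then swaps each space for the delimiter. The fields file name and column names below are made up for illustration:

```shell
#!/bin/bash
# hypothetical fields file: one column name per line
printf 'CAR-LICENSE\nCAR-MODEL\nCAR-YEAR\n' > /tmp/csv_fields_demo.txt;
OUTPUT_DELIMITER='^';

# unquoted $(cat ...) turns the three lines into one space-separated line,
# then sed replaces every space with the delimiter
HEADER_LINE=$(echo $(cat /tmp/csv_fields_demo.txt) | sed s'/ /'"${OUTPUT_DELIMITER}"'/'g);
echo "${HEADER_LINE}";   # prints CAR-LICENSE^CAR-MODEL^CAR-YEAR
```

Note this relies on the column names themselves containing no spaces; with space-containing names you would need a different join, e.g. `paste -sd "${OUTPUT_DELIMITER}"`.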