Wednesday, July 24, 2019

xml2csv.sh

Thanks to David, who irritated me because my old xml2csv.sh file used to take 7-8 minutes to generate around 13k+ lines of CSV. I went back and rewrote the script so that it now takes around 2 seconds :-D

The script contains validation and comments, but putting those aside, the conversion is actually a one-liner, as follows:

tail -n +n filename.xml | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/'g | tr -d '\n' | sed s'/<\/\?ITEM>/\n/'g | grep -v '^$'

The only things that need to be changed in the previous command are the following items:
  • +n: The line number to start from (the lines before it, such as XML meta tags, are skipped)
  • filename.xml: The xml file to convert to csv
  • ^: The delimiter to be used in the output file
  • ITEM: The XML object item header such as in the following:
<ITEM>
<CAR-LICENSE>83838</CAR-LICENSE>
<CAR-MODEL>Ferrari</CAR-MODEL>
<CAR-YEAR>2015</CAR-YEAR>
</ITEM>
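
To see what each stage of the one-liner contributes, here is a minimal sketch that runs it step by step on the sample above. The temporary file, the variable names and the second record are made up for illustration, `tail -n +1` is used because the sample has no meta lines to skip, and GNU sed is assumed (for `\+`, `\?` and `\n` in the replacement).

```shell
#!/bin/bash
# Build a small throwaway sample (the second record is invented for the demo).
sample=$(mktemp)
cat > "${sample}" <<'EOF'
<ITEM>
<CAR-LICENSE>83838</CAR-LICENSE>
<CAR-MODEL>Ferrari</CAR-MODEL>
<CAR-YEAR>2015</CAR-YEAR>
</ITEM>
<ITEM>
<CAR-LICENSE>12345</CAR-LICENSE>
<CAR-MODEL>Fiat</CAR-MODEL>
<CAR-YEAR>2012</CAR-YEAR>
</ITEM>
EOF

# Stage 1: strip the tags around each value and append the delimiter.
# Lines holding only <ITEM> or </ITEM> do not match and pass through unchanged.
stage1=$(tail -n +1 "${sample}" | sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/g')

# Stage 2: join everything into one long line.
stage2=$(printf '%s\n' "${stage1}" | tr -d '\n')

# Stage 3: turn every <ITEM> or </ITEM> into a newline, then drop empty lines.
result=$(printf '%s' "${stage2}" | sed 's/<\/\?ITEM>/\n/g' | grep -v '^$')

printf 'After joining: %s\n' "${stage2}"
printf 'Final rows:\n%s\n' "${result}"

rm -f "${sample}"
```

The joined line is `<ITEM>83838^Ferrari^2015^</ITEM><ITEM>12345^Fiat^2012^</ITEM>`, and the final rows are `83838^Ferrari^2015^` and `12345^Fiat^2012^`. Note that every row ends with a trailing delimiter.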

And this is the long script:

#!/bin/bash
###############################################################################
#    Written by Khamis Siksek (Saksoook)
#    khamis dot siksek at gmail dot com
#    14 July 2019
# Description:
#    A bash script that aims to convert XML input to CSV file
# License:
#    This program/script is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <https://www.gnu.org/licenses/>.
###############################################################################

# Passed variables
XML_FILENAME="${1}"; #'Filename.xml';
LINES_TO_SKIP="${2:-1}"; # First line to keep; earlier lines (XML meta tags without data) are skipped. Default 1 keeps all
CSV_FIELDS_FILENAME="${3}"; #'csv_fields.txt';
INITIAL_ITEM_TAG="${4}"; #'ITEMS';
OUTPUT_DELIMITER="${5:-^}"; # Default value ^

# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${XML_FILENAME}")_$$.csv";
OUTPUT_FILENAME="${6:-/tmp/${GENERATED_NAME}}"; # Can be empty

# Validation for mandatory fields (using && operator)
[[ -z "${XML_FILENAME}" ]] && echo 'Error: XML_FILENAME cannot be empty' >&2 && exit 1;
[[ -z "${CSV_FIELDS_FILENAME}" ]] && echo 'Error: CSV_FIELDS_FILENAME cannot be empty' >&2 && exit 2;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' >&2 && exit 3;

# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(echo `cat "${CSV_FIELDS_FILENAME}"` | sed s'/ /'"${OUTPUT_DELIMITER}"'/'g);

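# The header needs its own trailing newline, otherwise the first data row is appended to the same line
echo "${HEADER_LINE}" > "${OUTPUT_FILENAME}";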

#######################################################################################################################
# This is much faster than going through a loop and processing the file line by line. I kept the loop version for
# history purposes only. I knew that running the text processing commands on a whole file is much faster than doing it
# line by line, because such commands are optimized for big files and can process them much faster than reading each
# line from a file and spawning command(s) for it. I thought the impact would be a fraction of a second, but it turned
# out to be really big (2-3 seconds the new way versus up to 7-8 minutes the old way).
#
# Now to explain the command (the order of the commands is important):
#
# tail -n +"${LINES_TO_SKIP}" "${XML_FILENAME}": start at the given line, skipping unrelated lines (such as XML meta tags)
# sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/'g: remove the XML tags and add the delimiter
# tr -d '\n': remove the newlines from the result, making it one line
# sed s'/<\/\?'"${INITIAL_ITEM_TAG}"'>/\n/'g: replace each INITIAL_ITEM_TAG (opening or closing) with a newline
# grep -v '^$': the previous sed command leaves extra empty lines, so this command removes them
#######################################################################################################################

tail -n +"${LINES_TO_SKIP}" "${XML_FILENAME}" |
    sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/'g |
    tr -d '\n' |
    sed s'/<\/\?'"${INITIAL_ITEM_TAG}"'>/\n/'g |
    grep -v '^$' >> "${OUTPUT_FILENAME}";
exit 0
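
The comment block in the script attributes the speed-up to running each text tool once over the whole file instead of spawning commands for every line. The old loop version is not shown in this post, so the following is only a hypothetical sketch of that contrast: `per_line` and `whole_file` are made-up names, the generated sample data is arbitrary, and GNU sed is assumed. Both functions perform only the tag-stripping step.

```shell
#!/bin/bash
# Generate a throwaway sample of 500 value lines (arbitrary data).
sample=$(mktemp)
i=1
while [ "${i}" -le 500 ]; do
    printf '<CAR-YEAR>%s</CAR-YEAR>\n' "${i}"
    i=$((i + 1))
done > "${sample}"

# Hypothetical slow variant: one sed process is spawned for every input line.
per_line() {
    while IFS= read -r line; do
        printf '%s\n' "${line}" | sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/g'
    done < "${1}"
}

# Fast variant: a single sed process reads the whole file.
whole_file() {
    sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/g' "${1}"
}

# Both produce identical output; only the number of spawned processes differs.
slow_out=$(per_line "${sample}")
fast_out=$(whole_file "${sample}")
if [ "${slow_out}" = "${fast_out}" ]; then
    echo "outputs match"
fi

rm -f "${sample}"
```

To measure the gap on your own machine, prefix each call with bash's `time` keyword (for example `time per_line "${sample}" > /dev/null`) before the sample file is removed. The difference grows with the number of lines, since the loop pays the process start-up cost once per line.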
