Wednesday, July 24, 2019

xml2csv.sh

Thanks to David who irritated me since my old xml2csv.sh file used to take 7-8 minutes to generate around 13k+ lines of CSV so I went back and rewritten the script in a way that now it takes around 2 seconds :-D

The script contains validation and comments but putting that aside the coversion is actually one liner command as the following:

tail +n filename.xml | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/'g | tr -d '\n' | sed s'/<\/\?ITEM>/\n/'g | grep -v '^$'

The only things that needs to be changed in the previous command are the items highlighted in blue:
  • +n: Lines to skip
  • filename.xml: The xml file to convert to csv
  • ^: The delimiter to be used in the output file
  • ITEM: The XML object item header such as in the following:
<ITEM>
<CAR-LICENSE>83838</CAR-LICENSE>
<CAR-MODEL>Ferrari</CAR-MODEL>
<CAR-YEAR>2015</CAR-YEAR>
</ITEM>

And this is the long script:

#!/bin/bash
###############################################################################
#    Written by Khamis Siksek (Saksoook)
#    khamis dot siksek at gmail dot com
#    14 July 2019
# Description:
#    A bash script that aims to convert XML input to CSV file
# License:
#    This program/script is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see .
###############################################################################

# Passed variables
XML_FILENAME="${1}"; #'Filename.xml';
LINES_TO_SKIP="${2}"; # If there are XML meta tags that do not contain data
CSV_FIELDS_FILENAME="${3}"; #'csv_fields.txt';
INITIAL_ITEM_TAG="${4}"; #'ITEMS';
OUTPUT_DELIMITER="${5:-^}"; # Default value ^

# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${XML_FILENAME}")_$$.csv";
OUTPUT_FILENAME="${6:-/tmp/${GENERATED_NAME}}"; # Can be empty

# Validation for mandatory fields (using && operator)
[[ -z "${XML_FILENAME}" ]] && echo 'Error: XML_FILENAME cannot be empty' && exit -1;
[[ -z "${CSV_FIELDS_FILENAME}" ]] && echo 'Error: CSV_FIELDS_FILENAME cannot be empty' && exit -2;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' && exit -3;

# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(echo `cat "${CSV_FIELDS_FILENAME}"` | sed s'/ /'"${OUTPUT_DELIMITER}"'/'g);

echo -n "${HEADER_LINE}" > "${OUTPUT_FILENAME}";

#######################################################################################################################
# This is much faster than going through a loop and processing the file line-by-line. I kept the loop version for history
# purposes only. I know that using the text processing commands on a whole file is much more faster than doing it
# line-by-line due to the fact that such commands are optimized to process big and huge files and can process files much
# more faster than read each line from a file and spawning command(s) for each line, but I though the impact will be in
# the fraction of seconds but it appears that the impact is really big (2-3 seconds new way : up to 7-8 minutes old way).
#
#Now to explain the command here it is (arrangement of commands is important):
#
#tail +"${LINES_TO_SKIP}" "${XML_FILENAME}": skip the first few lines if not related (such as xml meta-tags)
#sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/'g: remove the xml tags and add the delimiter
#tr -d '\n': remove the new lines from the result making it one line
#sed s'/<\/\?'"${INITIAL_ITEM_TAG}"'>/\n/'g: replace the INITIAL_ITEM_TAG with a newline
#grep -v '^$' : because of the previous sed command there will be extra empty lines so this command removes the empty lines
#######################################################################################################################

tail +"${LINES_TO_SKIP}" "${XML_FILENAME}" | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/'g | tr -d '\n' | sed s'/<\/\?'"$
{INITIAL_ITEM_TAG}"'>/\n/'g | grep -v '^$' >> "${OUTPUT_FILENAME}";
exit 0

Sunday, July 14, 2019

xml2csv and csv2xml

I was working on a request to convert an XML file to CSV file, I tried to find a tool or a script to do that but I didn't find something that is straightforward.

There might be one, but to be honest I didn't search for a long time and when I found this https://github.com/dfellis/2csv and installed xml2 on ubuntu that is written by Dan Egnor, I decided to ignore finding a ready tool and thought of writing my own.

This is not the first time I write my own script but this is the first time I thought of sharing it and get some feedback and ideas to improve it and I hope this will not be the last (depends on the feedback). So here you go:

xml2csv.sh

#!/bin/bash
###############################################################################
#    Written by Khamis Siksek (Saksoook)
#    khamis dot siksek at gmail dot com
#    14 July 2019
# Description:
#    A bash script that aims to convert XML input to CSV file
# License:
#    This program/script is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see .
###############################################################################

# Passed variables
XML_FILENAME="${1}"; #'Filename.xml';
LINES_TO_SKIP="${2}"; # If there are XML meta tags that do not contain data
CSV_FIELDS_FILENAME="${3}"; #'csv_fields.txt';
INITIAL_ITEM_TAG="${4}"; #'ITEMS';
OUTPUT_DELIMITER="${5:-^}"; # Default value ^

# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${XML_FILENAME}")_$$.csv";
OUTPUT_FILENAME="${6:-/tmp/${GENERATED_NAME}}"; # Can be empty

# Validation for mandatory fields (using && operator)
[[ -z "${XML_FILENAME}" ]] && echo 'Error: XML_FILENAME cannot be empty' && exit -1;
[[ -z "${CSV_FIELDS_FILENAME}" ]] && echo 'Error: CSV_FIELDS_FILENAME cannot be empty' && exit -2;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' && exit -3;

# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(echo `cat "${CSV_FIELDS_FILENAME}"` | sed s'/ /'"${OUTPUT_DELIMITER}"'/'g);
echo "${HEADER_LINE}" > "${OUTPUT_FILENAME}";

# +2 for the INITIAL_ITEM_TAG OPENING AND CLOSING and
XML_FIELDS_COUNT="$(($(cat "${CSV_FIELDS_FILENAME}" | wc -l)+2))";

# Counting how many items are in the file
ITEMS_COUNT=$(cat "${XML_FILENAME}" | grep '<'"${INITIAL_ITEM_TAG}"'>' | wc -l);

LINES_OFFSET=0;
declare -a XML_ITEMS;

for ((i=1;i<=${ITEMS_COUNT};i++)); do
{
    # tail +LINES_TO_SKIP skip the first line (Header Line)
    # The sed command is used to add _ to the values so they do not get separated when
    # building the array as the read -a XML_ITEMS uses the SPACE as the separator
    # +${LINES_OFFSET} is needed to skip the read ITEM and go to the next item
    read -a XML_ITEMS <<< $(echo `cat "${XML_FILENAME}" | tail +$((${LINES_TO_SKIP}+${LINES_OFFSET})) | head -"${XML_FIELDS_COUNT}" | sed
 s'/[[:space:]]\+/_/'g`);

    for ((j=1;j<$((${XML_FIELDS_COUNT}-1));j++)); do
    {
        # The first sed command isolates the value from the tags (\2)
        # The second sed command replaces the _ with space (this _ was added in a previous sed command)
        echo -n "${XML_ITEMS[$j]}${OUTPUT_DELIMITER}" | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2/'g | sed s'/_/ /'g >> "${OUTPUT_FILEN
AME}";
    }
    done

    # Adding a new line at the end of each csv record
    echo >> "${OUTPUT_FILENAME}";

    # As stated before this calculates the next starting point for the next item
    LINES_OFFSET=$((LINES_OFFSET+XML_FIELDS_COUNT));
    #echo "${LINES_OFFSET}";
}
done

exit 0;


-------------------------------------------------------------------


csv2xml.sh

#!/bin/bash
###############################################################################
#    Written by Khamis Siksek (Saksoook)
#    khamis dot siksek at gmail dot com
#    14 July 2019
# Description:
#    A bash script that aims to convert XML input to CSV file
# License:
#    This program/script is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see .
###############################################################################

# Passed variables
CSV_FILENAME="${1}"; #'Filename.csv';
INITIAL_ITEM_TAG="${2}"; #'ITEMS';
INPUT_DELIMITER="${3:-^}"; # Default value ^

# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${CSV_FILENAME}")_$$.xml";
OUTPUT_FILENAME="${4:-/tmp/${GENERATED_NAME}}";

# Validation for mandatory fields (using && operator)
[[ -z "${CSV_FILENAME}" ]] && echo 'Error: CSV_FILENAME cannot be empty' && exit -1;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' && exit -3;

# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(head -1 "${CSV_FILENAME}");

# Array to contain the XML items tags
declare -a XML_ITEMS;
read -a XML_ITEMS <<< $(echo "${HEADER_LINE//${INPUT_DELIMITER}/ }");

#echo "${#XML_ITEMS[@]}"; # Debugging

> "${OUTPUT_FILENAME}";

# tail +2 to skip the header line
cat "${CSV_FILENAME}" | tail +2 | while read CSV_LINE; do
{
    # Start the item xml tag
    echo '<'"${INITIAL_ITEM_TAG}"'>' >> "${OUTPUT_FILENAME}";

    # For the read command to use the INPUT_DELIMITER as the delimiter
    OIFS="${IFS}";
    IFS="${INPUT_DELIMITER}";
    read -a CSV_LINE_ARR <<< $(echo "${CSV_LINE}");
    IFS="${OIFS}";

    for ((i=0;i<"${#XML_ITEMS[@]}";i++)); do
    {
        # Building the XML tags
        echo -n '    <'"${XML_ITEMS[$i]}"'>' >> "${OUTPUT_FILENAME}";
        echo -n "${CSV_LINE_ARR[$i]}" >> "${OUTPUT_FILENAME}";
        echo -n '</'"${XML_ITEMS[$i]}"'>' >> "${OUTPUT_FILENAME}";
        echo >> "${OUTPUT_FILENAME}";
    }
    done

    # End the item xml tag
    echo '</'"${INITIAL_ITEM_TAG}"'>' >> "${OUTPUT_FILENAME}";
}
done

exit 0;






-------------------------------------------------------------------

To run them (of course they have to be executable):

csv2xml.sh CSV_FILENAME ITEM_START_TAG DELIMITER OUTPUT_FILENAME

xml2csv.sh XML_FILENAME LINES_TO_SKIP CSV_FIELDS_FILE ITEM_START_TAG DELIMITER OUTPUT_FILENAME

Let us assume that you have this example.xml file:

<ITEM>
<ITEM-CODE>1</ITEM-CODE>
<TRANSACTION-DATE>20171207121400</TRANSACTION-DATE>
<ITEM-TYPE>Bus</ITEM-TYPE>
<ORDERED-BY>Customer1</ORDERED-BY>
<ORDER-DATE>20181207111600</ORDER-DATE>
<ORDER-ID>4190126</ORDER-ID>
<CUSTOMER-LOCATION>Egypt</CUSTOMER-LOCATION>
</ITEM>
<ITEM>
<ITEM-CODE>5</ITEM-CODE>
<TRANSACTION-DATE>20161207111400</TRANSACTION-DATE>
<ITEM-TYPE>Car</ITEM-TYPE>
<ORDERED-BY>Customer2</ORDERED-BY>
<ORDER-DATE>20161207111400</ORDER-DATE>
<ORDER-ID>3190996</ORDER-ID>
<CUSTOMER-LOCATION>Jordan</CUSTOMER-LOCATION>
</ITEM>



You will need to generate the example_fields.txt (containing the XML start tags without the <>) to be as the following:

more example_fields.txt

ITEM-CODE
TRANSACTION-DATE
ITEM-TYPE
ORDERED-BY
ORDER-DATE
ORDER-ID
CUSTOMER-LOCATION


If you run xml2csv.sh example.xml 1 example_fields.txt 'ITEM' '^'

You will get

more /tmp/example.xml_14799.csv
 

ITEM-CODE^TRANSACTION-DATE^ITEM-TYPE^ORDERED-BY^ORDER-DATE^ORDER-ID^CUSTOMER-LOCATION
1^20171207121400^Bus^Customer1^20181207111600^4190126^Egypt^
5^20161207111400^Car^Customer2^20161207111400^3190996^Jordan^


when you run csv2xml.sh on the generated file as the following:
bash csv2xml.sh /tmp/example.xml_14799.csv 'ITEM' '^'

You will get

more /tmp/example.xml_14799.csv_14884.xml

<ITEM>
    <ITEM-CODE>1</ITEM-CODE>
    <TRANSACTION-DATE>20171207121400</TRANSACTION-DATE>
    <ITEM-TYPE>Bus</ITEM-TYPE>
    <ORDERED-BY>Customer1</ORDERED-BY>
    <ORDER-DATE>20181207111600</ORDER-DATE>
    <ORDER-ID>4190126</ORDER-ID>
    <CUSTOMER-LOCATION>Egypt</CUSTOMER-LOCATION>
</ITEM>
<ITEM>
    <ITEM-CODE>5</ITEM-CODE>
    <TRANSACTION-DATE>20161207111400</TRANSACTION-DATE>
    <ITEM-TYPE>Car</ITEM-TYPE>
    <ORDERED-BY>Customer2</ORDERED-BY>
    <ORDER-DATE>20161207111400</ORDER-DATE>
    <ORDER-ID>3190996</ORDER-ID>
    <CUSTOMER-LOCATION>Jordan</CUSTOMER-LOCATION>
</ITEM>