Last week I had a big headache trying to import a 165MB XML file. I couldn’t even open it in any browser, and reading it with ColdFusion gave me 500 errors, heap size exceeded, JRun closed connection, etc.

The solution I found was to read the file line by line using Java and build small blocks of XML that I could parse into objects and process individually. Keep in mind that we are still on CF7.

So I wrote this piece of code where I set my start tag, and it grabs everything between the start and end tag and assembles an XML object from there. I know the CF gurus would suggest a much better solution, but this is what I could accomplish with my humble knowledge to get a quick fix for the problem.

Imagine you received a file from a book editor with thousands of books. The format is basically the following:
{code type=xml}<?xml version="1.0" encoding="UTF-8"?>
<books>
<book>
<bookid>1234</bookid>
<title>My Best Book</title>
<author>John Doe</author>
<price>49.99</price>
</book>
…. …. ….
</books>{/code}

Let’s set our start tag to “book”, because we want to process the content of each book. Here is the code:

{code type=coldfusion}<cfscript>
filePath = "c:\upload\books.xml";
xmlHeader = '<?xml version="1.0" encoding="UTF-8"?>';
rootStart = "books";
startTagText = "book";
myXmlText = "";
// instantiate the java objects
hFile = createObject("java", "java.io.FileReader").init(filePath);
hFile = createObject("java", "java.io.BufferedReader").init(hFile);
startTagFound = false;
endOfFile = false;
while (not endOfFile) {
    // read a line
    line = hFile.readLine();
    // readLine() returns a Java null at end of file, which leaves "line" undefined in CF
    if (not isDefined("line")) {
        endOfFile = true;
    } else if (len(line)) {
        // if the line exists, call function to mount it, passing the start tag
        result = mountMyLine(line, startTagText);
        // lines before the start tag are discarded; from the start tag on, save the text
        if (startTagFound or result.hasStartTag) {
            startTagFound = true;
            myXmlText = myXmlText & result.text;
            // if the result has the end tag, parse the saved text as an xml object
            if (result.hasEndTag) {
                myDoc = xmlParse(processMyText(myXmlText, xmlHeader, rootStart, startTagText));
                book = myDoc.xmlRoot.book;
                startTagFound = false;
                myXmlText = "";
                // do something here with the object, insert into a table, etc…
            }
        }
    }
}
// release the file handle
hFile.close();
</cfscript>{/code}

Here is the function mountMyLine():

{code type=coldfusion}<cfscript>
function mountMyLine(lineIn, tagStart) {
    var start = 1;
    var end = 0;
    var pos = 0;
    var line = "";
    var startTag = "<" & tagStart & ">";
    var endTag = "</" & tagStart & ">";
    var result = structNew();
    result.hasStartTag = false;
    result.hasEndTag = false;
    line = replace(lineIn, chr(13), "", "all");
    line = replace(line, chr(10), "", "all");
    line = trim(line);
    end = len(line);
    // find start tag
    pos = findNoCase(startTag, line);
    if (pos gt 0) {
        start = pos + len(startTag);
        result.hasStartTag = true;
    }
    // find end tag
    pos = findNoCase(endTag, line);
    if (pos gt 0) {
        end = pos - start;
        result.hasEndTag = true;
    }
    result.text = mid(line, start, end);
    return result;
}
</cfscript>{/code}

Here is the function processMyText():

{code type=coldfusion}<cfscript>
function processMyText(text, xmlHeader, rootStartText, startTagText) {
    var rootStart = "<" & arguments.rootStartText & ">";
    var rootEnd = "</" & arguments.rootStartText & ">";
    var startTag = "<" & arguments.startTagText & ">";
    var endTag = "</" & arguments.startTagText & ">";
    var mytext = arguments.xmlHeader & rootStart & startTag & arguments.text & endTag & rootEnd;
    // return the assembled document so xmlParse() can consume it
    return mytext;
}
</cfscript>{/code}
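
For anyone who wants to try the same flow outside ColdFusion, here is a minimal plain-Java sketch of the idea: read line by line, collect everything between the start and end tag, wrap it in the root element, and parse only that small block. The names `LineChunker` and `forEachRecord` are made up for this illustration. It assumes the tag has no attributes and that at most one record starts on any given line. Unlike the functions above, it keeps the record tags in place instead of stripping and re-adding them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original post.
class LineChunker {
    // Reads line by line and hands one small DOM document per record to the callback.
    static void forEachRecord(Reader source, String root, String tag, Consumer<Document> handler) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        StringBuilder record = null; // non-null only while inside a record
        try (BufferedReader reader = new BufferedReader(source)) {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                int from = 0;
                if (record == null) {
                    from = line.indexOf(open);
                    if (from < 0) continue;   // still outside a record, skip the line
                    record = new StringBuilder();
                }
                int end = line.indexOf(close, from);
                if (end < 0) {                // the record continues on the next line
                    record.append(line, from, line.length()).append('\n');
                    continue;
                }
                record.append(line, from, end + close.length());
                // wrap the single record in the root element and parse only that
                String xml = "<" + root + ">" + record + "</" + root + ">";
                handler.accept(builder.parse(new InputSource(new StringReader(xml))));
                record = null;
            }
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not read or parse a record", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<books>\n<book>\n<title>My Best Book</title>\n</book>\n</books>";
        forEachRecord(new StringReader(sample), "books", "book", doc ->
            System.out.println(doc.getElementsByTagName("title").item(0).getTextContent()));
    }
}
```

Passing a callback instead of returning a list means only one record is held in memory at a time, which is the whole point with a file this size.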

Well, it solved my problem, although I know someone could write better code using regular expressions to retrieve the content. I’d be glad if someone could share better solutions or point out my mistakes in this one.
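
On the regular-expression idea: one way it might look in plain Java is to let `java.util.Scanner` search the stream for one `<book>…</book>` block at a time, so the file never has to be loaded whole. This is only a sketch under assumptions. `RegexBlocks` and `forEachBlock` are hypothetical names, and the pattern assumes the tag has no attributes and is never nested inside itself.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class RegexBlocks {
    // Hands each <tag>...</tag> block to the callback, one at a time.
    static void forEachBlock(Reader in, String tag, Consumer<String> handler) {
        String name = Pattern.quote(tag);
        // (?s) lets "." cross line breaks; ".*?" stops at the first closing tag
        Pattern block = Pattern.compile("(?s)<" + name + ">.*?</" + name + ">");
        try (Scanner scanner = new Scanner(in)) {
            String match;
            // horizon 0 means "no limit": the scanner buffers input until it finds a match
            while ((match = scanner.findWithinHorizon(block, 0)) != null) {
                handler.accept(match);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<books>\n<book>\n<title>My Best Book</title>\n</book>\n</books>";
        forEachBlock(new StringReader(xml), "book", block -> System.out.println(block));
    }
}
```

Each block could then be wrapped in the root element and parsed on its own, the same way the ColdFusion code does it.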

7 thoughts on “Parsing Large XML File Into ColdFusion XML Object”

  1. I’ll give you a hint regarding the posts about NUX/XOM above. I used both. They will still likely choke on the file size you’re describing, although they are better than the DOM parser that ColdFusion relies upon.

    You’ve chosen a good path. Some other options to consider:

    1. Don’t use XML. This might be heresy, but consider that if you changed the format you might end up with a much smaller and more manageable file. Interesting blog post on the topic (not my blog): http://dataspora.com/blog/xml-and-big-data/

    2. Try a Java SAX or StAX XML parser. Google it and you’ll find it. You would lose the nice APIs you get with ColdFusion, but these types of parsers don’t require reading the whole document into memory, which is what is causing the errors you’re experiencing.

    In short, if you’ve got a solution that works I’d stick with it. That said, if you expect the document format to change frequently I’d investigate one of the two options I’ve outlined above. NUX/XOM offer some richer features, but they will not address the problem of a large XML document that you’ve described.

  2. @Edward,@Jatin
    Thanks for the hints, I’ll definitely check XOM and NUX out.
    @Mathew
    Thanks for your comments. Unfortunately I can’t move away from XML. It is a standard import at our company and we receive data from several countries, each one with its own XML structure! I will try those suggested tools.
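
For reference, the StAX option suggested in the first comment could look roughly like the Java sketch below. It pulls events from the stream one by one and only ever holds a single book in memory. `StaxBooks` and `forEachBook` are hypothetical names, and the code assumes the flat `<book>` layout from the sample file, where every child element contains text only.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the original post or comments.
class StaxBooks {
    // Streams the document and passes each book to the callback as a field-name to text map.
    static void forEachBook(Reader in, Consumer<Map<String, String>> handler) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs needed for this import
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            Map<String, String> book = null; // non-null only while inside a <book>
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("book")) {
                        book = new LinkedHashMap<>();
                    } else if (book != null) {
                        // getElementText() reads the text and moves to the element's end tag
                        book.put(name, xml.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("book")) {
                    handler.accept(book);
                    book = null;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the XML stream", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><book><bookid>1234</bookid><title>My Best Book</title></book></books>";
        forEachBook(new StringReader(xml), book -> System.out.println(book.get("title")));
    }
}
```

From ColdFusion, a class like this would have to be compiled and placed on the server’s classpath before `createObject("java", …)` could load it.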
