How to replace text in a file with huge lines
Friday, January 19th, 2007
This is my problem: I have a huge XML file (150MB) in which I want to rename some of the node names. Consider the example:
<root><prefix_name1>1</prefix_name1><prefix_name2>text</prefix_name2></root>
The creators of this document were not aware of namespaces, so they put a different prefix into the element names of each kind of document they created. My goal is to rename all the prefixed tags by removing the "prefix_" part of the name. The solution looked simple, but it wasn’t.
To begin with, I tried sed with the argument "s/prefix_//g". This didn’t work, because on AIX sed only accepts up to 4096 bytes per line, and the whole file is a single line (I read somewhere that some sed versions don’t have this limit, but I couldn’t spend the whole day trying versions of sed). Using perl also resulted in "Out of memory" errors (I have only very basic perl knowledge, so there may well be a way of handling such big lines).
So, I had to come up with a set of steps to make this change and keep the file in its original format (all contents in one line). The steps were:
- used the tr utility to translate every ">" character into a newline, then put the ">" back at the end of each resulting line, using tr '>' '\012' < "$1" | while IFS= read -r line; do printf '%s>\n' "$line"; done > "$1.new".
- renamed the tags by removing the prefix, using either sed or perl with the replacement pattern "s/prefix_//g". Each line now holds at most one tag, so the line length limit is no longer a problem.
- deleted all the "\n" (newline chars) from the resulting file, to get back to the one-line format. I used perl to read from one file and write to another, applying "s/\n//g" to the content.
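As a rough sketch, the three steps above can also be chained into a single pipeline, so that no intermediate file is needed. The function name strip_prefix is made up for this illustration. It assumes the input is the kind of one-line file described here: it ends with ">", and any newline in it is safe to throw away. Like the plain sed pattern, it also removes "prefix_" from text content, not only from tag names.

```shell
# strip_prefix (hypothetical helper): reads XML on stdin and writes it to
# stdout with every "prefix_" removed, all on one line.
strip_prefix() {
    # 1. drop existing newlines so they cannot be mistaken for ">" later
    # 2. turn every ">" into a newline, so sed only sees short lines
    # 3. remove the prefix on each short line
    # 4. turn every newline back into ">" to restore the one-line format
    tr -d '\012' |
        tr '>' '\012' |
        sed 's/prefix_//g' |
        tr '\012' '>'
}
```

It would be called as strip_prefix < big.xml > big.new.xml, where both file names are placeholders. Because the ">" characters are restored by the last tr rather than by a shell loop, the data never passes through read or echo.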
I’m sure that this is not the best solution, and there must be some utility out there that I could use instead. If you have any ideas, just leave a comment.










