I just wrapped up some data imports for a few projects we’ve been working on and wanted to share some tips I gleaned from the process.
- Use find and replace
When working with large sets of data, an editor with find and replace is invaluable. I prefer to use the same text editor I use for coding since I’m most familiar with it. Depending on your data, there’s a lot you might want to do with find and replace, including removing extraneous characters or replacing UTF-8 characters with their HTML entity codes, which brings me to the next item…
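If the file is too big to comfortably fix by hand, the entity replacement can also be scripted. Here is a minimal Python sketch, not what I actually used on these imports: it relies on Python's built-in `xmlcharrefreplace` error handler, and the function name `to_html_entities` is made up for illustration. It produces numeric entities (like `&#233;`) rather than named ones (like `&eacute;`).

```python
def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric HTML entity.

    Plain ASCII text passes through unchanged. Hypothetical helper name.
    """
    # Encoding to ASCII fails on characters like "é"; the xmlcharrefreplace
    # handler substitutes "&#NNN;" for each one instead of raising an error.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_html_entities("café"))
```

Note that this converts every non-ASCII character it finds, so run it on a copy of your data and spot check the result before importing.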
- UTF-8 characters will mess up your day
…unless you respect them. I was reviewing samples of the data I was importing (step 7) and using find and replace to swap UTF-8 characters for their respective HTML entities. As I found out after I finished my node imports, you can instead change the CSV file itself to UTF-8 encoding, which is a lot less painful than find and replace. If you attempt to import a file that contains UTF-8 characters but isn’t UTF-8 encoded, Node Import will still create the nodes, but it will truncate the text of any affected field at the point where it encounters those characters.
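Most text editors can re-save a file as UTF-8, but here is a rough Python sketch of the same conversion in case you have a lot of files. The big assumption is the source encoding: I've guessed `cp1252` (Windows-1252), which is common for CSVs exported from Excel on Windows, but yours may differ. The function name `convert_to_utf8` is just for illustration.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Re-encode a text file as UTF-8.

    source_encoding is an assumption; check what your export tool really
    produces. A wrong guess either raises UnicodeDecodeError or silently
    produces wrong characters, so spot check the output.
    """
    # Work on raw bytes so line endings inside the CSV are left untouched.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

You would call it with something like `convert_to_utf8("blog_export.csv", "blog_export_utf8.csv")` (both file names are placeholders) and then point Node Import at the new file.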
- Use Taxonomy to help keep your data straight
If you’re doing large data imports, taxonomy can be a great help. I like to create a taxonomy vocabulary specifically for data imports, with terms that describe the type of import and even identify multiple attempts. For example, I might have terms like “Blog import – 1st pass”, “Blog import – 2nd pass”, etc. By attaching these terms to all the nodes imported on each data import, I’m able to easily search for and find the data. Did something go wrong on my third pass of blog imports? No problem; I just filter my content by the respective taxonomy term and I have all the nodes from that data import ready for editing/deleting.
- Node references have to be exact
If you’re importing data with node references, be sure the title you’re referencing is exactly the same. For example, if you’re trying to import blog entries and associate the posts back to a user’s content profile, the name of the profile needs to match exactly. I had a case similar to this, and it turned out that an extra space that wasn’t even visible to the end user was enough to mess up the node import. I had to go delete the single extra space after every author’s middle initial. Even an extra space at the end of the node title is enough to make the node reference fail.
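Stray spaces like these are hard to spot by eye, so cleaning the reference column with a script before importing can save a round trip. A small Python sketch follows, assuming the CSV has a header row; the helper names `normalize_title` and `clean_column` and the column names in the usage note are invented for the example. It collapses runs of whitespace to a single space and trims both ends.

```python
import csv
import io
import re


def normalize_title(title: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", title).strip()


def clean_column(csv_text: str, column: str) -> str:
    """Return the CSV text with one column normalized (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = normalize_title(row[column])
        writer.writerow(row)
    return out.getvalue()
```

For example, `clean_column(text, "author")` would turn `Jane A.  Doe ` into `Jane A. Doe`. This only helps if the titles of the nodes being referenced are clean too, since both sides have to match.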
- Download and use the error file
After you import data, Node Import reports back how many rows were successful and how many had errors. You can also download a CSV of all the rows that had errors – do this! You can push this error file back into Node Import after you’ve resolved the issues that prevented the nodes from being imported before. On some imports, I have to import 3 or 4 times to get all the errors resolved.
- Dealing with missing data that’s required
Sometimes you want to create nodes but don’t have all the required data for the content type. In these cases, I’ll usually mark the field I’m missing data for as not required, attach some sort of taxonomy term to indicate these fields need to be reviewed, and then do the import. That way you can resolve the issue at another time. Another option is to replace the NULL values with something else. When I imported stories that didn’t have authors, I replaced the NULL author value with “John Doe” so all of those stories could easily be found after the import.
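The placeholder swap is easy to script as well. This is only a sketch under a few assumptions: the CSV has a header row, missing values show up as an empty cell or the literal text `NULL`, and the helper name `fill_missing` is made up. It also returns how many rows it changed so you know how many nodes to review later.

```python
import csv
import io


def fill_missing(csv_text, column, placeholder="John Doe", null_markers=("", "NULL")):
    """Replace empty or NULL cells in one column with a placeholder.

    Returns (new_csv_text, number_of_rows_changed). The null_markers
    default is an assumption about how missing values appear in the export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    changed = 0
    for row in reader:
        value = (row.get(column) or "").strip()
        if value in null_markers:
            row[column] = placeholder
            changed += 1
        writer.writerow(row)
    return out.getvalue(), changed
```

Pick a placeholder that can't collide with a real value, otherwise you won't be able to tell the filled-in rows apart after the import.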