Now that the file is uploaded and you have its filename, it's time to
manipulate the file. We'll use the file system object here to do the
grunt work.
If you converted the document from a word processor's format to HTML,
chances are it contains a lot of extraneous markup that you don't
want. For instance, Microsoft Word adds XML and CSS markup that is
used to convert the HTML document back into a Word document, should
you ever choose to do so. That extra markup adds a lot of overhead to
the HTML document, so if you never plan to convert the document back
to the Word format, you should get rid of it. This can be done with a
series of Replace calls:
    Const fsoForReading = 1
    Const fsoForWriting = 2
    Dim objFSO, objTextStream, txtFileContents

    ' Create the file system object
    Set objFSO = Server.CreateObject("Scripting.FileSystemObject")

    ' Open the uploaded file and read its entire contents
    Set objTextStream = objFSO.OpenTextFile("C:\SomeFile.html", fsoForReading)
    txtFileContents = objTextStream.ReadAll
    objTextStream.Close   ' close the file

    ' If <xml> is found in the text, remove it
    If InStr(1, txtFileContents, "<xml>", vbTextCompare) > 0 Then
        txtFileContents = Replace(txtFileContents, "<xml>", "", 1, -1, vbTextCompare)
    End If

    ' ... remove any other tags that we don't want ...

    ' Now write the changes we just made
    Set objTextStream = objFSO.OpenTextFile("C:\SomeFile.html", fsoForWriting)
    objTextStream.Write txtFileContents
    objTextStream.Close   ' close the file
    Set objTextStream = Nothing
    Set objFSO = Nothing
Note that it may take many of these replacements, and some fine
tuning, to get rid of all the junk HTML in the files. Though this step
is fairly easy, it takes a while to get it right. You can also perform
any other file manipulation here: for instance, changing the filename,
doing a global replace, or adding a style sheet.
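Removing only the opening <xml> tag leaves the rest of the block
behind. One way to remove whole blocks is the VBScript RegExp object.
The following is a minimal sketch, assuming VBScript 5.5 or later (for
the non-greedy *? quantifier). StripWordMarkup is a hypothetical helper
name, and the three patterns are only examples of markup Word tends to
emit; adjust the list to what you actually find in your files.

```vbscript
' Hypothetical helper: strips some Word-specific markup from strHTML.
' The patterns are illustrative assumptions, not a complete list.
Function StripWordMarkup(strHTML)
    Dim objRegExp, arrPatterns, strResult, i

    Set objRegExp = New RegExp
    objRegExp.IgnoreCase = True   ' match <XML> as well as <xml>
    objRegExp.Global = True       ' replace every occurrence

    arrPatterns = Array( _
        "<xml>[\s\S]*?</xml>", _
        "</?o:p[^>]*>", _
        "\s+class=""?Mso[A-Za-z]*""?")

    ' Work on a copy so the caller's variable is left untouched
    strResult = strHTML
    For i = 0 To UBound(arrPatterns)
        objRegExp.Pattern = arrPatterns(i)
        strResult = objRegExp.Replace(strResult, "")
    Next

    Set objRegExp = Nothing
    StripWordMarkup = strResult
End Function
```

With the code above, this would be called as
txtFileContents = StripWordMarkup(txtFileContents) before the file is
written back.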
NOTE:
If you use the file system object (FSO) to change the filename, make
sure that you also update the filename field in the database.
Otherwise you'll have a database entry that points nowhere, and a
file that doesn't belong to anything.
If you'd like to split the file into multiple pages, you could do that
here too. Though I won't go into detail, I will list a few guidelines
on how to do so:
1. Create a 'pages table,' tblPages: a table that contains information
about the pages in the document. This table would contain data such
as: Document ID, which tells you which document in tblFiles this page
belongs to; PageTitle, a title for the individual page, e.g. Page 2;
and PageNumber, so you can see the order of the pages.
2. Add a field in tblFiles called NumberOfPages and increment that
number every time you add another page. This way, you'll know how
many pages are in every document without having to count the actual
files.
3. Name the new files something based on the original file. For
instance, if the original document was test.html, name pages 2 and 3
test2.html and test3.html.
4. Parse the file for a logical separator, e.g. a paragraph break
(<p>). You could then split the page according to the number of
paragraphs, or allow the user to select which paragraph to place the
page break in. If you do the latter, you'll have to mark each spot the
user selected somehow. A good way to do this would be to use a form
with numbered checkboxes for each paragraph; i.e. if the user selects
checkbox 2, there should be a page break at paragraph 2. For each page
break, write the contents to a new file. This is the hardest step.
Here is some logic and pseudo-code on how to do so:
    Dim CursorFirst, CursorLast
    strNextText = file contents
    For each paragraph in strNextText
        Set CursorFirst to beginning of paragraph (i.e. at position of <p>)
        If there is more than one paragraph in strNextText Then
            If this paragraph is not marked with a page break Then
                Append all the text up to the next paragraph to strText
                Point CursorLast to CursorFirst
            Else
                Update tblFiles and tblPages with new page info
                Write new file with strText
                Clear strText
                Put all the text up to the next paragraph in strText
                Put all the text after current paragraph in strNextText
                Point CursorLast to first <p> in strNextText
            End If
        Else (only one paragraph)
            Write strText to a file
        End If
    Next
You'll probably want to put this functionality on a separate page.
This should get you started on splitting the original document into
multiple pages.
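The pseudo-code above can be turned into VBScript along the following
lines. This is only a sketch under a few assumptions: SplitIntoPages
and PageFileName are hypothetical helper names, paragraphs are assumed
to start with a <p> tag (with or without attributes), and the page
breaks are assumed to arrive as a comma-separated list of paragraph
numbers, which is how ASP delivers several checked checkboxes that
share one name. It uses the RegExp object to locate the paragraphs;
writing the files and updating tblFiles and tblPages is left out.

```vbscript
' Hypothetical helper: returns an array with one string per page.
' strBreaks lists the paragraph numbers (1-based) at which a new page
' should start, e.g. "3, 5".
Function SplitIntoPages(strContent, strBreaks)
    Dim objRegExp, objMatches, arrPages(), intPages
    Dim intPara, lngStart, lngPos, strList

    ' Find every opening <p> tag, with or without attributes
    Set objRegExp = New RegExp
    objRegExp.Pattern = "<p[\s>]"
    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    Set objMatches = objRegExp.Execute(strContent)

    ' Normalise "3, 5" to ",3,5," so whole numbers can be looked up
    strList = "," & Replace(strBreaks, " ", "") & ","

    intPages = 0
    lngStart = 1
    ' A break at paragraph 1 would give an empty page, so start at 2
    For intPara = 2 To objMatches.Count
        If InStr(strList, "," & intPara & ",") > 0 Then
            lngPos = objMatches(intPara - 1).FirstIndex + 1   ' FirstIndex is zero-based
            ReDim Preserve arrPages(intPages)
            arrPages(intPages) = Mid(strContent, lngStart, lngPos - lngStart)
            intPages = intPages + 1
            lngStart = lngPos
        End If
    Next

    ' Whatever remains is the last page
    ReDim Preserve arrPages(intPages)
    arrPages(intPages) = Mid(strContent, lngStart)

    Set objRegExp = Nothing
    SplitIntoPages = arrPages
End Function

' Hypothetical helper: test.html, page 2 -> test2.html
Function PageFileName(strOriginal, intPage)
    Dim lngDot
    lngDot = InStrRev(strOriginal, ".")
    If intPage <= 1 Then
        PageFileName = strOriginal
    ElseIf lngDot = 0 Then
        PageFileName = strOriginal & intPage
    Else
        PageFileName = Left(strOriginal, lngDot - 1) & intPage & Mid(strOriginal, lngDot)
    End If
End Function
```

For example, SplitIntoPages(txtFileContents, Request.Form("chkBreak")),
where chkBreak is a hypothetical checkbox name, returns one string per
page; each can then be written with the file system object to the name
that PageFileName gives it.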
NOTE:
If you split the original document into pieces, be sure to keep track
of those pieces as well. They should each receive an entry in
tblPages. If you move/rename/delete one, make sure you do the same
with all the others.
By now you should have one or more pages with properly formatted HTML
(with all extraneous tags removed). All pages should have an entry in
tblPages, with a many-to-one relationship to tblFiles (i.e. there will
be many entries in tblPages that correspond to one entry in tblFiles).
Now we must apply the format and layout of the existing site to the
new content.
Choose or create a template that will be the basis of the new content.
Separate out the content that will remain static (things that won't
change from article to article) and the content that will change every
time.
Read here for a good short tutorial on templates. You can store the
static portions either in a database or in a file. Just make sure you
know where the dynamic content goes: the title goes in the title
section, the content goes in the body section, etc.
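One simple way to merge the content into the template is to put
placeholder tokens in the static part and swap them out with Replace.
A minimal sketch follows; the ApplyTemplate name and the [[TITLE]] and
[[CONTENT]] tokens are assumptions made for illustration, not part of
any standard.

```vbscript
' Hypothetical helper: fills the placeholder tokens in strTemplate.
' [[TITLE]] and [[CONTENT]] are made-up token names for this sketch.
Function ApplyTemplate(strTemplate, strTitle, strBody)
    Dim strPage
    strPage = Replace(strTemplate, "[[TITLE]]", strTitle)
    strPage = Replace(strPage, "[[CONTENT]]", strBody)
    ApplyTemplate = strPage
End Function
```

The template string itself can come from wherever you chose to store
the static portions, whether a database field or a file read with the
file system object.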
Next, extract the portion of the new content that you need. After
all, you don't want another set of <HTML> and <HEAD> tags, since that
part should already be in the static template. You can use the
following function to extract the necessary HTML from the document:
    Function GetHTML(strContent, strStartTag, strEndTag)
        ' Returns the portion of the HTML in strContent beginning with the
        ' HTML tag in strStartTag and ending with the HTML tag in
        ' strEndTag, not including the start and end tags themselves.
        Dim strText, intStart, intEnd

        ' First get all of the HTML in the document.
        strText = strContent
        intEnd = 0

        ' Look for the start tag without its closing ">", so that a tag
        ' with attributes (e.g. <body bgcolor="white">) is still found.
        intStart = InStr(1, strText, Left(strStartTag, Len(strStartTag) - 1), vbTextCompare)
        If intStart <> 0 Then
            ' Move to the ">" that closes the start tag.
            intStart = InStr(intStart + 1, strText, Right(strStartTag, 1), vbTextCompare)
        End If
        If intStart <> 0 Then
            intEnd = InStr(intStart, strText, strEndTag, vbTextCompare)
        End If

        If intEnd > intStart Then
            GetHTML = Mid(strText, intStart + 1, intEnd - intStart - 1)
        Else
            ' Start or end tag not found
            GetHTML = " "
        End If
    End Function