
Content management in a box 15 September 2009

Posted by lopataru in ECM, Research.

“Can I have two of those to go, please?”

A recent announcement from Oracle talks about an OLTP database machine. I’ll let you read the details and other comments in the official announcement and blogosphere.

When I received this pre-announcement over the weekend, I appreciated the synergy between the two product lines: RDBMS and server. The RDBMS runs on a server, so why not build a specially tuned RDBMS for specific hardware, and tune the hardware to deliver whopping performance for that specific software? While I’m not sure the new Oracle product does all this, I can imagine it.

Now, back to our nice little ECM world. CM software is captive to the RDBMS. Its performance depends on it. The licensing goes hand in hand… You can rarely (if ever) use a major ECM suite without a properly set up RDBMS. Why is that? Well, I can think of several reasons, like ease of deployment, portability, reasonable performance, time-to-market… but the question still remains: “Why not have a CM server?” One box to deliver it all. A CM “appliance”. An “Apple CM”… all in one box, no replaceable battery.

As I know EMC products quite well, it’s obvious to me that this would be a very nice use case for xDB. Let’s see if R&D can pull it off – if I were EMC, I would build it by the end of 2010 and release it in 2011. I could really use a Documentum package which does not need a DB license/product and performs at least acceptably, if not better.

Back to the “box” idea (I really like the Apple analogy): I’m not necessarily talking here about the “no database CMs” (like the list here). I’m talking about a full-fledged, powerful, high-performance CM which is “in tune” with its metadata storage (based on an RDBMS or not…).

I’m pretty sure somebody already has this in their lab or even in their shop. My PhD thesis is almost on this topic, and I’m probably not the most innovative guy in the world. I would love to learn about any such initiatives, but I’m too lazy to search for them today… that’s another to-do post-it.

It is said that times of crisis are the best drivers for innovation. Really?

Document management with SQL Server 20 March 2009

Posted by lopataru in ECM, Research.

This is a placeholder post, I’ll update it as time goes by.

Currently I’m building a presentation to show the IT community how SQL Server can be used to build Document Management systems.

My team and I have built many applications on SQL Server, several of them for DM. So I need to structure my experience a bit and give back to the community, while researching what others have done along similar lines and what the new version, SQL Server 2008, brings to the table.

If you wish to share your thoughts, feel free.

later edit:

Of course I could not update the post as I researched…. but here are the outcomes:

Main topics of interest when trying to build a DMS solution on top of SQL 2008:

  • Integrated Fulltext Search
  • FILESTREAM data
  • Remote Blob Store (RBS)

Other significant SQL Server 2008 functionalities:

  • Backup compression
  • Data compression
  • Data encryption
  • New DATE/TIME field (UTC)
  • Improved XML processing (with Lax validation)
  • Improved reporting services (who doesn’t need reports ? :) )
  • last, but not least: Sparse Columns
  • more here

Full Text search

Now integrated (and rewritten), the FTS engine provides more functions to the user and developer. Performance is roughly on par with 2005, but some areas show significant improvements.

For those brave enough to have used FTS in 2005 and previous versions, the migration options need to be considered (3 in total: rebuild, import, reset). Rebuild is needed especially if you want to take advantage of the new stemming and word-breaking rules and languages.

Nice things: stop words are now in the database, so they are accessible, programmable and transportable. They are no longer only language dependent: you can also define other “set building” rules.

The thesaurus is still in XML, but it is now lazily cached and can be updated without restarting the server (yay!). Note that it behaves a little differently than in 2005, so you need to take care when migrating your XML files.

Cool stuff: troubleshooting functions! Something that was always needed to look into the FT “magic”. Being able to see what keywords were indexed for a particular document / collection is very nice. To see it from SQL is even nicer. To be able to see how a query is parsed and transformed is great. I’m also happy since I can see how the stemmer and thesaurus work for a particular case.

Some advice: take care if you have many keywords (tens of millions). Use fast disks, IO is very important. Use 64-bit: 3 GB of RAM is usually not enough. Don’t confuse FREETEXT and CONTAINS, use them wisely.

BLOB related news

First of all, please don’t use IMAGE and TEXT/NTEXT fields anymore. Microsoft has deprecated them and plans to drop support in a future version.

You can use VARBINARY(MAX), but you hit the 2 GB limit with it. Use the FILESTREAM modifier (new in 2008) to kill that limit.

FILESTREAM stores the content on the NTFS file system. Nice. And tricky at the same time. Good for streaming, not so good for frequent updates. Good for big files, not so good for many files (especially when you have short backup windows).

Nice: works from T-SQL as well as Win32. Not so nice: behaves a little differently in T-SQL vs. Win32 (transaction isolation level, performance – not necessarily better in Win32).

So, you really have to understand it before using it. You can fall into some not so obvious pitfalls. But it is a good thing.

Remote Blob Store – RBS

Anyone who does not know what CAS (Content Addressable Storage) is probably does not need it.

It is not another column type; it’s an API, to be implemented mainly by CAS vendors and used by applications.

It is somewhat similar to EBS on SharePoint. In fact, there is a competition between the two (some nice coverage is here), and I also feel that RBS is the way to go (regardless of the current limitation on accessing the context of the blob).

EMC already has an RBS connector for Centera. Nice.

So, 2008 brings a lot of nice things to the table. Let me know when you use them.

PhD paper done. Phew! 19 January 2009

Posted by lopataru in ECM, Research.
3 comments

Finally, after a long time, I now have a complete version of my PhD thesis.

I would like to elaborate more on this, but after spending 4 days secluded in a mountain cottage in front of my laptop… I simply can’t.

Now I just need to publish some articles and present my creation to the public. Behold :)

PDF/A in Amsterdam 13 April 2008

Posted by lopataru in ECM, Research.

Over the last few days I took part in the first PDF/A International Conference in Amsterdam, trying to get a better understanding of the facts around the topic.

Simply put, PDF/A is PDF 1.4 with some more rules. And it is an ISO standard (ISO 19005-1).

For those of you wondering why we have yet another file format (which seems to be a branch of good old PDF): PDF/A aims to be the format in which documents are stored for long-term archiving.

The idea is excellent for various reasons, and the PDF/A originators (who are not necessarily Adobe) are not the only ones who thought of this. Microsoft is also trying to jump on the bandwagon with XPS – which was not designed as an archiving format, but they seem to think it is useful for this as well.

The need is there, as organizations are tired of dealing with old file formats every time they dig deep into their electronic archives. And we need to take into consideration that electronic archives are not that old yet. As a fun fact, in the opening keynote Thomas Zellman showed a 5 inch floppy disk to the audience. I think that was an excellent way of reminding everyone that many things (think content here) we create today will need to be used a long time from now. And 5 inch floppies are not too old. Think 8 inch floppies and punch cards.

Therefore, archivists all over the globe are trying to work out how to reinvent their job of storing and managing paper and bring electronic content along (yes, “revelation” – paper will not disappear). If you have worked with archivists, you will have found that this job is highly conservatory (couldn’t help the wording ;) ). It’s in their nature not to change things, and most of them would rather not tackle anything but paper at all.

How do you address this? First, make it a standard! “It’s ISO so it’s good”. At least it is easier for the archive world to swallow. Second, by deriving it from the ubiquitous PDF you get a file format which can be read by a lot of software and easily generated by others.

Of course, there are rules to follow if you want to be compliant. Read all about it on the www.pdfa.org website; I’m not going into details here.

How is this relevant to the Content Management area?

First of all, it’s relevant to my PhD thesis, since the objective of PDF/A is to be self-contained (content and metadata). Which is how I store my objects in my great repository (wink).

Idea coming through: how about defining a storage area inside an ECM system so that everything you put there is transparently stored/converted by the CM system as PDF/A, including all its metadata?

Of course, there are some issues to ponder, but I think this sounds good. The file format needs to evolve a bit to allow more content types to be included (think 3D, multimedia) and also to go beyond a primitive implementation of digital signatures and metadata. But the scene is set.

Related to evolution, sadly (?) enough, PDF/A needs to undergo ISO certification, so I guess we can expect the 2.0 version in 2010 (and some speakers at the conference felt the same way).

I’ll stop for now. There were a lot of interesting things discussed at the conference, a lot of case studies, and very interesting people to meet or rejoin for a beer.

Cannot help but add one more thought: Is IT Fashion? Rory Staunton thinks so.

CM Architecture – How to index 4 April 2008

Posted by lopataru in Research.

While building my CM engine, I took a deep breath and plunged into the still unimplemented area of “a new object is created, what to do with it?”.

The reason is that my CM is built like this: when a client application creates a persistent object, it is quickly stored to disk (well… “storage”) in a portable and self-consistent manner. After making sure it’s there for keeps, a task is added to a background queue for “indexing” – aka inserting the new information into the indexing system so that the object can be found in searches.
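To make the store-first, index-later flow concrete, here is a minimal Python sketch. It is not the engine described in this post: the `Repository` class, the JSON-file-per-object layout and the in-memory dictionary index are all assumptions chosen for brevity.

```python
import json
import queue
import tempfile
import threading
from pathlib import Path


class Repository:
    """Hypothetical store: persist each object first, index it in the background."""

    def __init__(self, root):
        self.root = Path(root)
        self.index = {}                      # (field, value) -> set of object ids
        self._lock = threading.Lock()
        self._tasks = queue.Queue()          # pending indexing tasks
        threading.Thread(target=self._index_worker, daemon=True).start()

    def create(self, object_id, metadata):
        # Step 1: write the object to storage before anything else happens.
        path = self.root / f"{object_id}.json"
        path.write_text(json.dumps(metadata))
        # Step 2: only then queue the indexing task and return to the caller.
        self._tasks.put(object_id)
        return path

    def _index_worker(self):
        while True:
            object_id = self._tasks.get()
            # Re-read from storage: the stored file is the source of truth.
            metadata = json.loads((self.root / f"{object_id}.json").read_text())
            with self._lock:
                for field, value in metadata.items():
                    self.index.setdefault((field, value), set()).add(object_id)
            self._tasks.task_done()

    def wait_until_indexed(self):
        self._tasks.join()

    def find(self, field, value):
        with self._lock:
            return set(self.index.get((field, value), set()))


if __name__ == "__main__":
    repo = Repository(tempfile.mkdtemp())
    repo.create("doc-1", {"type": "contract", "owner": "alice"})
    repo.wait_until_indexed()
    print(repo.find("type", "contract"))
```

The point of the split is that `create` returns as soon as the object is safely stored; until the worker catches up, the object can only be fetched by ID, not found by search.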

The architecture allows for a virtually unlimited number of index provider types (e.g. hashes, B-trees, blingy-blingy, whateva’). So I was now faced with implementing at least some default index providers, otherwise my content was only nicely stored and retrievable by ID.

Sleeves up… I found some nice B-tree variants discussed on the web, added some spice of my own for multi-threading optimization… and there I was, diving into design (and, I admit, also some coding – let’s call it an “agile” approach). With index persistence implemented and a disk cache being considered, I had my hands full. It worked, and had reasonable performance. Not as stable as I would have liked, but… come on… nothing is bug-free on first release.

What to compare it with? I feel it is not fair to dive right into a head-to-head comparison with Documentum/SharePoint/CM/FileNet. Soo…

My approach is to use as much memory as I can get my hands on – which sounds like TimesTen. Also, I address each piece of metadata individually, so it is something like a column-oriented database.
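As a rough illustration of what “addressing each piece of metadata individually” can look like in memory, here is a small Python sketch of a column-oriented layout. The `ColumnStore` class and its method names are hypothetical, and it has nothing to do with how TimesTen works internally.

```python
from collections import defaultdict


class ColumnStore:
    """In-memory metadata store: one dictionary per attribute, not one record per object."""

    def __init__(self):
        # column name -> {object id -> value}
        self.columns = defaultdict(dict)

    def put(self, object_id, **metadata):
        # Each attribute lands in its own column; objects may have different attributes.
        for name, value in metadata.items():
            self.columns[name][object_id] = value

    def get(self, object_id, name):
        return self.columns[name].get(object_id)

    def select(self, name, predicate):
        """Scan a single column; the other attributes are never touched."""
        return sorted(oid for oid, value in self.columns[name].items() if predicate(value))
```

With this layout an object with 3 attributes costs 3 small inserts, and a query on one attribute only walks that attribute’s column.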

Assuming TimesTen is not a poorly written DBMS (this is a highly non-scientific approach, but I know Oracle usually acquires good tech)… I would like to give it a spin.

That being said, I’ll probably try to put TimesTen to the task of acting as a column-oriented storage for my metadata.

Let’s see what happens. I’ll start with several million objects. And on my laptop.

Anybody want to bet how fast it will ingest 1 million new objects with an average of 3 metadata fields each (yes, I know that is small)?

Hw config: 2 GB RAM, Core2Duo 2GHz, lame hdd

Small disclaimer: These test results (which I’ll probably publish in part) are not to be considered an objective comparison of two systems, but an attempt to see how they perform in very particular situations which may not even be close to real-world situations.

CM Architecture – yet another search engine? 5 February 2008

Posted by lopataru in Research.
3 comments

Is a Content Management System basically yet another search engine?

From what I have learned so far… it might seem so. I consider that most of the requests a CM system needs to serve are requests to find and retrieve. Updates and deletions usually address individual items (I tend to generalize here, please forgive me).

What was extremely funny is that last year I was at a technical event of a major ECM provider, and a quite highly ranked person said out in the open: “We never understood until now how important search is” (the quote is approximate). Despite the tragic situation of having to say this after building “top” ECM systems for many, many years… it really showed me that there are others who see it the same way I do.

Actually this post was also triggered by a comment ldallas made on my previous one. Although I had said next to nothing about my view that a CM system’s primary function is to answer search requests, he probably saw from my approach that I might be trying to reinvent the wheel.

This is not far from the truth. I’m indeed thinking of a CM architecture in which search is almost the most important function of all. What makes it different from common search engines is that in a CM environment you need to take care of complex security rules.

It is not enough to build a perfect “Google”-like engine. One needs to quickly filter the results based on user permissions. And when user permissions are based on multiple hierarchies of groups and roles this becomes tricky.

This is why I believe that the search engine (including fulltext) needs to be a core part of the CM architecture. This is the only way it can provide quick and adequate responses.

In a system I work with (many of you reading this will recognize it), the search request is forwarded to an external search engine, which returns the result list in chunks (e.g. 200 at a time). These are then stored in a temporary table in the RDBMS and joined with the security information to find out if the user actually has rights to any of them! Plain ugly! Imagine if the search matches 1 million records but I only have the rights to see one of them.

What I’m building (I bet I’m not the only one) is a system with an embedded search function that natively knows how to handle security. The security model is the common one: item level, based on users/groups with hierarchical permissions (read<…<delete). If any of you know of a similar system and can provide some more technical details, I’d appreciate it.
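Here is a minimal Python sketch of a search index that applies item-level security while it walks the matches, instead of joining results against permissions afterwards. Everything in it is an assumption made for illustration: the `SecureIndex` class, the single-word query and the numeric permission scale, which names only the two levels mentioned above and leaves room for the ones in between.

```python
from enum import IntEnum


class Permission(IntEnum):
    # Only the two ends of the scale are named; any intermediate
    # levels would take values between READ and DELETE.
    NONE = 0
    READ = 10
    DELETE = 100


class SecureIndex:
    def __init__(self):
        self.postings = {}   # word -> set of item ids
        self.acl = {}        # item id -> {user or group -> Permission}
        self.parents = {}    # user or group -> set of groups it belongs to

    def add_member(self, member, group):
        self.parents.setdefault(member, set()).add(group)

    def add_item(self, item_id, text, acl):
        self.acl[item_id] = dict(acl)
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(item_id)

    def principals(self, user):
        """The user plus every group reachable through nested membership."""
        seen, stack = set(), [user]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def search(self, user, word, needed=Permission.READ):
        who = self.principals(user)          # resolved once per query
        hits = []
        for item_id in self.postings.get(word.lower(), ()):
            granted = max(
                (level for principal, level in self.acl[item_id].items() if principal in who),
                default=Permission.NONE,
            )
            if granted >= needed:            # higher levels imply the lower ones
                hits.append(item_id)
        return sorted(hits)
```

Because the group hierarchy is flattened once per query, the permission check for each candidate is a small dictionary lookup, and items the user cannot see never leave the index.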

Last but not least, the search functionality should know the business purpose of its content. I’m not sure right now whether I should make this a core function or whether it belongs closer to the system front-end, or is even application specific. What I know is that it would be a real pleasure to have a CM system which ranks / groups my results based on their business role (e.g. group contracts and related documents together in a result, then log files, then SOPs…), not only on word-matching rankings. This looks a little like dynamic taxonomies and result clustering… But not really. I think this topic needs a dedicated later post, anyway.

As a conclusion: a Content Management system is like a search engine in the same way it is like a Database Management System: it can be built as one, but it deserves a specific implementation in order to do it right.

CM Architecture – content storage 2 February 2008

Posted by lopataru in Research.
2 comments

One of the ‘strange’ ideas I have in my research is to try to completely remove the RDBMS from the equation.

Sure, having an RDBMS back-end brings along a lot of advantages and shortens the “time-to-market”. Since I’m not building a CM system to go on sale by Christmas, I have plenty of time to experiment. So, why not take a different road?

My approach is also based on an old idea that a CM system is a data management system in its own right, with its own specific requirements. Sure, it is similar to the existing DBMSs (notice I removed the R from the acronym), but that’s normal, and it’s also an extra argument NOT to use a prebuilt system but to be one instead.

So, I thought long about how to model and implement such a core system. My work started in the late 90’s with testing and benchmarking some RDBMSs. I used the newly created (back then) TPC tests (oohh, memories…). Also, Wisconsin and similar older benchmarks. This gave me a glimpse of what performance means and what can be expected when you try to analyze the impact design has on it.

To end the digression, my conclusion was that DBMSs (mainly the mainstream ones) are simply not built to handle “content”. They are good at handling “data”, as “pure” as possible. Throw some high transaction rates and concurrency into the soup, and here is your Oracle / MSSQL / DB2 / PostgreSQL / MySQL… whatever.

Content, on the other hand… is special. It is small (think .ini files ;) )… It is big (think imaging stuff)… It is huge (think movies). All at the same time. Also, it has versions… renditions… annotations…

Of course, this can be modeled using a normal database, but it just doesn’t seem right. I would like to see all of those implemented natively as core functions. Imagine having versions and renditions for a data row in an RDBMS table.

My idea is to give it another shot and rethink the storage concept. And build on that thought.

So, let’s have content (which for me also means metadata) stored as a unitary piece of data. Let’s say in a compound file on the filesystem. Self-contained, self-sufficient. Maybe the versions / renditions can be stored in parallel as actual files, since they may need to reside on other filesystems than the original.

This has a nice advantage I am quite fond of: if the compound file structure is openly described, then a tool to process it can be easily built at any time, in any technology. So, if I archive that piece of content on a tape and put it away for 20 years… when I go back, I don’t care if my original software is lost / can’t work. I simply build another one. (Sidenote: frankly, how many of you really believe that a records management software will not change from the ground up before such records are due for disposal? Content won’t change.)
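To show how little it takes to read an openly described compound file, here is a minimal Python sketch. The layout is purely an assumption for illustration, not the format used in the thesis: a hypothetical 4-byte signature `CMOB`, a 4-byte big-endian metadata length, the metadata as UTF-8 JSON, then the raw content bytes.

```python
import json
import struct

MAGIC = b"CMOB"  # hypothetical 4-byte signature identifying the format


def pack(metadata, content):
    """Layout: magic | metadata length (4 bytes, big-endian) | JSON metadata | raw content."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return MAGIC + struct.pack(">I", len(header)) + header + content


def unpack(blob):
    """Recover (metadata, content) using nothing but the layout described above."""
    if blob[:4] != MAGIC:
        raise ValueError("not a compound object")
    (length,) = struct.unpack(">I", blob[4:8])
    metadata = json.loads(blob[8:8 + length].decode("utf-8"))
    return metadata, blob[8 + length:]
```

A reader like `unpack` is about ten lines in any language, which is the whole argument: as long as the layout description survives, the content and its metadata can be recovered without the original software.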

OK, what about processing this stuff? I’m thinking of building a system which works on top of this storage and builds up an index of all things. How does it do it? Well… that’s the secret recipe. Until either I publish my work or I get so bored that I discuss it here. Or somebody else comes up with a smarter idea.

So, that’s one of my PhD thesis thoughts. Feel free to trash it… I’ve only been thinking about it for the last 7 years. Seriously, any comment is highly appreciated, and I’ll share my thoughts and results openly.

What is performance for content management 13 January 2008

Posted by lopataru in ECM, Research.
8 comments

In my PhD thesis I follow the theme of designing a high-performance Content Management system.

Obviously, in this process I try to write down the metrics by which one can decide whether a CM system performs well or not (or performs better than another).

Some work has been done on this for the RDBMS world, but I found none for Content Management. This may be because the majority of (E)CM products out there always use an RDBMS to store the data. Therefore the performance of the RDBMS drives the performance of the CM itself.

I really think this is not the best approach and that there is a need to design a content management system which truly addresses the specific requirements and does not try to bend other systems to fit the purpose.

That being said, I should write down the 4 topics I think should be used to measure the performance of a content management system:

  1. It should natively implement all basic content management functions. These include: CRUD operations, security, version control, concurrency control, format variants control, streaming support, search, observation. Maybe transactions.
  2. Its implementation must not be tied to one technology (e.g. a specific OS, programming language etc.). CM systems need to be universal, since they tend to store content for a long time and cannot afford to suffer from changes in IT strategies.
  3. It must find content. This is the primary thing end users look for. No matter how complex the underlying architecture is, no matter how many data elements it handles, search must be accurate, fast and adequate.
  4. It should be able to use various storage environments. This does not mean working at a low level with disks/tape libraries, but being able to store and manage content on all types of storage without limiting functionality and while staying fully transparent to the user. For example, if a document is stored on a tape (slow access, read-only medium), the user should only find out that the item is read-only and takes a while to access.
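The fourth point can be sketched in a few lines of Python. This is only an illustration under assumed names (`Store`, `ContentManager`, `latency_hint`): the caller always uses the same interface, and the only things that leak through from the storage medium are the read-only flag and an access-speed hint.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """A storage environment described only by what the user needs to know."""
    name: str
    read_only: bool = False
    latency_hint: str = "fast"
    blobs: dict = field(default_factory=dict)   # stand-in for the real medium


class ContentManager:
    def __init__(self):
        self.location = {}   # item id -> Store holding it

    def ingest(self, store, item_id, data):
        # Initial placement of an item on a store (e.g. archiving to tape).
        store.blobs[item_id] = data
        self.location[item_id] = store

    def describe(self, item_id):
        store = self.location[item_id]
        return {"read_only": store.read_only, "access": store.latency_hint}

    def read(self, item_id):
        # Same call whatever the medium; only the speed differs.
        return self.location[item_id].blobs[item_id]

    def update(self, item_id, data):
        store = self.location[item_id]
        if store.read_only:
            raise PermissionError(f"{item_id} is read-only")
        store.blobs[item_id] = data
```

A client never asks whether an item lives on disk or on tape; it calls `read` or `update` and, at most, learns from `describe` that the item is read-only and slow to fetch.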

Any thoughts? Should anything else be considered a major performance indicator of a CM system?

PhD in content management solutions 27 December 2007

Posted by lopataru in Research.

I’ve been building my PhD research for a while (7 years almost) and now I’m close to the finish line.

It seems like a good idea to start presenting some topics here in order to possibly get some feedback. It might not be the best place to discuss the content of a PhD thesis, but… why not?

First of all, the topic is, roughly, designing a new content management system which will bring some fresh air to this area.